linux-kernel.vger.kernel.org archive mirror
* [PATCHv3 0/3] rdmacg: IB/core: rdma controller support
@ 2016-01-30 15:23 Parav Pandit
  2016-01-30 15:23 ` [PATCHv3 1/3] rdmacg: Added rdma cgroup controller Parav Pandit
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Parav Pandit @ 2016-01-30 15:23 UTC (permalink / raw)
  To: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan,
	hannes, dledford, liranl, sean.hefty, jgunthorpe, haggaie
  Cc: corbet, james.l.morris, serge, ogerlitz, matanb, raindel, akpm,
	linux-security-module, pandit.parav

This patchset adds support for the RDMA cgroup by addressing review
comments on [1] and [2] and by implementing the published RFC [3].

Overview:
Currently, user space applications can easily exhaust all the rdma
device specific resources such as AHs, CQs, QPs and MRs. As a result,
applications in other cgroups, or kernel space ULPs, may not even get a
chance to allocate any rdma resources, leading to service unavailability.

The RDMA cgroup addresses this issue by allowing resource accounting and
limit enforcement on a per cgroup, per rdma device basis.

Resources are not defined by the RDMA cgroup. Resources are defined
by the RDMA/IB stack and, optionally, by HCA vendor device drivers.
This allows the rdma cgroup to remain unchanged while the RDMA/IB
stack evolves, without needing rdma cgroup updates. A new resource
can easily be added by the RDMA/IB stack without touching the
rdma cgroup.

The RDMA uverbs layer will enforce limits on well defined RDMA verb
resources without any HCA vendor device driver involvement.

The RDMA uverbs layer will not do accounting of hw vendor specific
resources. Instead, the rdma cgroup provides a set of APIs through which
vendor specific drivers can define their own resources (up to 64) that
are then accounted by the rdma cgroup.

Resource limit enforcement is hierarchical.

When a process is migrated while holding active RDMA resources, the rdma
cgroup continues to charge the original cgroup for those resources; they
are uncharged from the original cgroup when freed. New resources are
charged to the process's current cgroup. So after migration with active
resources, new allocations are charged to the new cgroup while the old
resources are correctly uncharged from the old cgroup.

Changes from v2:
 * Fixed compilation error reported by the 0-DAY kernel test
   infrastructure for the m68k architecture, where CONFIG_CGROUP is also
   not defined.
 * Fixed a comment in the patch that referred to the legacy mode of
   cgroup; changed it to refer to the v1 and v2 versions.
 * Added more information in commit log for rdma controller patch.

Changes from v1:
 * (To address comments from Tejun)
   a. reduced 3 patches to a single patch
   b. removed resource word from the cgroup configuration files
   c. changed cgroup configuration file names to match other cgroups
   d. removed .list file and merged functionality with .max file
 * Based on comment to merge to single patch for rdma controller;
   IB/core patches are reduced to single patch.
 * Removed the pid cgroup map and simplified the design -
   the charge/uncharge caller stack keeps track of the rdmacg for a
   given resource. This removes the need to maintain and perform a
   hash lookup. It also allows slightly more accurate resource
   charging/uncharging when a process is moved from one cgroup to
   another with active resources and continues to allocate more.
 * Critical fix: removed the rdma cgroup's dependency on kernel module
   header files to avoid crashes when modules are upgraded without a
   kernel upgrade, which is very common given the high rate of change in
   the IB stack and the fact that it is also shipped as individual
   kernel modules.
 * uobject extended to keep track of the owning rdma cgroup, so that the
   same rdmacg can be used while uncharging.
 * Added support functions to hide details of the rdmacg device in the
   uverbs modules for the cases where the cgroup is enabled/disabled at
   compile time. This avoids multiple ifdefs for every API in the uverbs
   layer.
 * Removed failure counters in first patch, which will be added once
   initial feature is merged.
 * Fixed access to a stale rpool that was being freed while configuring
   the rdma.verb.max file.
 * Fixed rpool resource leak while querying max, current values.

Changes from v0:
(To address comments from Haggai, Doug, Liran, Tejun, Sean, Jason)
 * Redesigned to support per device per cgroup limit settings by bringing
   concept of resource pool.
 * Redesigned to let IB stack define the resources instead of rdma
   controller using resource template.
 * Redesigned to support hw vendor specific limits setting
   (optional to drivers).
 * Created new rdma controller instead of piggyback on device cgroup.
 * Fixed race conditions for multiple tasks sharing rdma resources.
 * Removed dependency on the task_struct.

[1] https://lkml.org/lkml/2016/1/5/632
[2] https://lkml.org/lkml/2015/9/7/476
[3] https://lkml.org/lkml/2015/10/28/144

This patchset is for Tejun's for-4.5 branch.
It has not been attempted on Doug's rdma tree yet; I will do that once I
receive comments on this patchset.

Parav Pandit (3):
  rdmacg: Added rdma cgroup controller.
  IB/core: added support to use rdma cgroup controller
  rdmacg: Added documentation for rdma controller

 Documentation/cgroup-v1/rdma.txt      |  122 ++++
 Documentation/cgroup-v2.txt           |   43 ++
 drivers/infiniband/core/Makefile      |    1 +
 drivers/infiniband/core/cgroup.c      |  108 ++++
 drivers/infiniband/core/core_priv.h   |   45 ++
 drivers/infiniband/core/device.c      |    8 +
 drivers/infiniband/core/uverbs_cmd.c  |  209 ++++++-
 drivers/infiniband/core/uverbs_main.c |   28 +
 include/linux/cgroup_rdma.h           |   80 +++
 include/linux/cgroup_subsys.h         |    4 +
 include/rdma/ib_verbs.h               |   27 +-
 init/Kconfig                          |   12 +
 kernel/Makefile                       |    1 +
 kernel/cgroup_rdma.c                  | 1021 +++++++++++++++++++++++++++++++++
 14 files changed, 1693 insertions(+), 16 deletions(-)
 create mode 100644 Documentation/cgroup-v1/rdma.txt
 create mode 100644 drivers/infiniband/core/cgroup.c
 create mode 100644 include/linux/cgroup_rdma.h
 create mode 100644 kernel/cgroup_rdma.c

-- 
1.8.3.1


* [PATCHv3 1/3] rdmacg: Added rdma cgroup controller.
  2016-01-30 15:23 [PATCHv3 0/3] rdmacg: IB/core: rdma controller support Parav Pandit
@ 2016-01-30 15:23 ` Parav Pandit
  2016-01-30 18:30   ` Tejun Heo
  2016-01-30 15:23 ` [PATCHv3 2/3] IB/core: added support to use " Parav Pandit
  2016-01-30 15:23 ` [PATCHv3 3/3] rdmacg: Added documentation for rdma controller Parav Pandit
  2 siblings, 1 reply; 16+ messages in thread
From: Parav Pandit @ 2016-01-30 15:23 UTC (permalink / raw)
  To: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan,
	hannes, dledford, liranl, sean.hefty, jgunthorpe, haggaie
  Cc: corbet, james.l.morris, serge, ogerlitz, matanb, raindel, akpm,
	linux-security-module, pandit.parav

Added an rdma cgroup controller that does accounting and limit
enforcement of rdma/IB verb and hw resources.

Added an rdma cgroup header file which defines the APIs to perform the
charging/uncharging functionality and the device registration that
participates in the controller's accounting and limit enforcement
functions. It also defines the rdmacg_device structure that binds the IB
stack to the RDMA cgroup controller.

RDMA resources are tracked using resource pools. A resource pool is a
per device, per cgroup, per resource pool_type entity, which allows
setting up accounting limits on a per device basis.

The RDMA cgroup returns an error when user space applications try to
allocate more resources than the configured limit.

The rdma cgroup implements resource accounting for two types of resource
pools:
(a) RDMA IB specification level verb resources defined by IB stack
(b) HCA vendor device specific resources defined by vendor device driver

Resources are not defined by the RDMA cgroup; instead they are defined
by an external module, typically the IB stack, and optionally by HCA
drivers for those RDMA devices which don't have a one to one mapping of
IB verb resources to hardware resources. This allows extending the IB
stack without changing the kernel, which matters because the IB stack is
going through frequent changes and enhancements.

Resource pools are created/destroyed dynamically whenever
charging/uncharging occurs and whenever user configuration is done. It
is a tradeoff of memory versus a little more code: resource pools are
created whenever necessary, instead of at cgroup creation and device
registration time.

Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
---
 include/linux/cgroup_rdma.h   |   80 ++++
 include/linux/cgroup_subsys.h |    4 +
 init/Kconfig                  |   12 +
 kernel/Makefile               |    1 +
 kernel/cgroup_rdma.c          | 1021 +++++++++++++++++++++++++++++++++++++++++
 5 files changed, 1118 insertions(+)
 create mode 100644 include/linux/cgroup_rdma.h
 create mode 100644 kernel/cgroup_rdma.c

diff --git a/include/linux/cgroup_rdma.h b/include/linux/cgroup_rdma.h
new file mode 100644
index 0000000..5e6d9d8
--- /dev/null
+++ b/include/linux/cgroup_rdma.h
@@ -0,0 +1,80 @@
+#ifndef _CGROUP_RDMA_H
+#define _CGROUP_RDMA_H
+
+#include <linux/cgroup.h>
+
+/*
+ * This file is subject to the terms and conditions of version 2 of the GNU
+ * General Public License.  See the file COPYING in the main directory of the
+ * Linux distribution for more details.
+ */
+
+enum rdmacg_resource_pool_type {
+	RDMACG_RESOURCE_POOL_VERB,
+	RDMACG_RESOURCE_POOL_HW,
+	RDMACG_RESOURCE_POOL_TYPE_MAX,
+};
+
+struct rdma_cgroup {
+#ifdef CONFIG_CGROUP_RDMA
+	struct cgroup_subsys_state	css;
+
+	spinlock_t	cg_list_lock;	/* protects cgroup resource pool list */
+	struct list_head rpool_head;	/* head to keep track of all resource
+					 * pools that belongs to this cgroup.
+					 */
+#endif
+};
+
+#ifdef CONFIG_CGROUP_RDMA
+#define RDMACG_MAX_RESOURCE_INDEX (64)
+
+struct match_token;
+struct rdmacg_device;
+
+struct rdmacg_pool_info {
+	struct match_token *resource_table;
+	int resource_count;
+};
+
+struct rdmacg_resource_pool_ops {
+	struct rdmacg_pool_info*
+		(*get_resource_pool_tokens)(struct rdmacg_device *);
+};
+
+struct rdmacg_device {
+	struct rdmacg_resource_pool_ops
+				*rpool_ops[RDMACG_RESOURCE_POOL_TYPE_MAX];
+	struct list_head        rdmacg_list;
+	char                    *name;
+};
+
+/* APIs for RDMA/IB stack to publish when a device wants to
+ * participate in resource accounting
+ */
+void rdmacg_register_device(struct rdmacg_device *device, char *dev_name);
+void rdmacg_unregister_device(struct rdmacg_device *device);
+
+/* APIs for RDMA/IB stack to charge/uncharge pool specific resources */
+int rdmacg_try_charge(struct rdma_cgroup **rdmacg,
+		      struct rdmacg_device *device,
+		      enum rdmacg_resource_pool_type type,
+		      int resource_index,
+		      int num);
+void rdmacg_uncharge(struct rdma_cgroup *cg,
+		     struct rdmacg_device *device,
+		     enum rdmacg_resource_pool_type type,
+		     int resource_index,
+		     int num);
+
+void rdmacg_set_rpool_ops(struct rdmacg_device *device,
+			  enum rdmacg_resource_pool_type pool_type,
+			  struct rdmacg_resource_pool_ops *ops);
+void rdmacg_clear_rpool_ops(struct rdmacg_device *device,
+			    enum rdmacg_resource_pool_type pool_type);
+int rdmacg_query_limit(struct rdmacg_device *device,
+		       enum rdmacg_resource_pool_type type,
+		       int *limits, int max_count);
+
+#endif	/* CONFIG_CGROUP_RDMA */
+#endif	/* _CGROUP_RDMA_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 0df0336a..d0e597c 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -56,6 +56,10 @@ SUBSYS(hugetlb)
 SUBSYS(pids)
 #endif
 
+#if IS_ENABLED(CONFIG_CGROUP_RDMA)
+SUBSYS(rdma)
+#endif
+
 /*
  * The following subsystems are not supported on the default hierarchy.
  */
diff --git a/init/Kconfig b/init/Kconfig
index f8754f5..f8055f5 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1070,6 +1070,18 @@ config CGROUP_PIDS
 	  since the PIDs limit only affects a process's ability to fork, not to
 	  attach to a cgroup.
 
+config CGROUP_RDMA
+	bool "RDMA controller"
+	help
+	  Provides enforcement of RDMA resources at the RDMA/IB verb level
+	  and enforcement of any resources advertised by RDMA/IB capable
+	  hardware.
+	  It is fairly easy for applications to exhaust RDMA resources,
+	  which can leave kernel consumers or other application consumers
+	  of RDMA resources with no resources at all. The RDMA controller
+	  is designed to stop this from happening.
+	  Attaching existing processes with active RDMA resources to the
+	  cgroup hierarchy is allowed even if doing so crosses the
+	  hierarchy's limit.
+
 config CGROUP_FREEZER
 	bool "Freezer controller"
 	help
diff --git a/kernel/Makefile b/kernel/Makefile
index 53abf00..26e413c 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -57,6 +57,7 @@ obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup.o
 obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
 obj-$(CONFIG_CGROUP_PIDS) += cgroup_pids.o
+obj-$(CONFIG_CGROUP_RDMA) += cgroup_rdma.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_UTS_NS) += utsname.o
 obj-$(CONFIG_USER_NS) += user_namespace.o
diff --git a/kernel/cgroup_rdma.c b/kernel/cgroup_rdma.c
new file mode 100644
index 0000000..305d791
--- /dev/null
+++ b/kernel/cgroup_rdma.c
@@ -0,0 +1,1021 @@
+/*
+ * This file is subject to the terms and conditions of version 2 of the GNU
+ * General Public License.  See the file COPYING in the main directory of the
+ * Linux distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/threads.h>
+#include <linux/pid.h>
+#include <linux/spinlock.h>
+#include <linux/parser.h>
+#include <linux/atomic.h>
+#include <linux/seq_file.h>
+#include <linux/hashtable.h>
+#include <linux/cgroup.h>
+#include <linux/cgroup_rdma.h>
+
+static DEFINE_MUTEX(dev_mutex);
+static LIST_HEAD(dev_list);
+
+enum rdmacg_file_type {
+	RDMACG_VERB_RESOURCE_MAX,
+	RDMACG_VERB_RESOURCE_STAT,
+	RDMACG_HW_RESOURCE_MAX,
+	RDMACG_HW_RESOURCE_STAT,
+};
+
+#define RDMACG_USR_CMD_REMOVE "remove"
+
+/* resource tracker per resource for rdma cgroup */
+struct cg_resource {
+	int max;
+	atomic_t usage;
+};
+
+/*
+ * Pool type indicating whether the pool was created as part of a
+ * default operation or because the user configured the group.
+ * Depending on the creator of the pool, it is decided whether to free
+ * it up later or not.
+ */
+enum rpool_creator {
+	RDMACG_RPOOL_CREATOR_DEFAULT,
+	RDMACG_RPOOL_CREATOR_USR,
+};
+
+/*
+ * Resource pool object which represents, per cgroup, per device,
+ * per resource pool_type resources. There are multiple instances
+ * of this object per cgroup, therefore it cannot be embedded within
+ * the rdma_cgroup structure. It is maintained as a list.
+ */
+struct cg_resource_pool {
+	struct list_head cg_list;
+	struct rdmacg_device *device;
+	enum rdmacg_resource_pool_type type;
+
+	struct cg_resource *resources;
+
+	atomic_t refcnt;	/* count active user tasks of this pool */
+	atomic_t creator;	/* user created or default type */
+};
+
+static struct rdma_cgroup *css_rdmacg(struct cgroup_subsys_state *css)
+{
+	return container_of(css, struct rdma_cgroup, css);
+}
+
+static struct rdma_cgroup *parent_rdmacg(struct rdma_cgroup *cg)
+{
+	return css_rdmacg(cg->css.parent);
+}
+
+static inline struct rdma_cgroup *task_rdmacg(struct task_struct *task)
+{
+	return css_rdmacg(task_css(task, rdma_cgrp_id));
+}
+
+static struct rdmacg_resource_pool_ops*
+	get_pool_ops(struct rdmacg_device *device,
+		     enum rdmacg_resource_pool_type pool_type)
+{
+	return device->rpool_ops[pool_type];
+}
+
+static inline void set_resource_limit(struct cg_resource_pool *rpool,
+				      int index, int new_max)
+{
+	rpool->resources[index].max = new_max;
+}
+
+static void _free_cg_rpool(struct cg_resource_pool *rpool)
+{
+	kfree(rpool->resources);
+	kfree(rpool);
+}
+
+static void _dealloc_cg_rpool(struct rdma_cgroup *cg,
+			      struct cg_resource_pool *rpool)
+{
+	spin_lock(&cg->cg_list_lock);
+
+	/* if another task started using it before we took the
+	 * spin lock, skip freeing it.
+	 */
+	if (atomic_read(&rpool->refcnt) == 0) {
+		list_del_init(&rpool->cg_list);
+		spin_unlock(&cg->cg_list_lock);
+
+		_free_cg_rpool(rpool);
+		return;
+	}
+	spin_unlock(&cg->cg_list_lock);
+}
+
+static void dealloc_cg_rpool(struct rdma_cgroup *cg,
+			     struct cg_resource_pool *rpool)
+{
+	/* Don't free the resource pool which is created by the
+	 * user, otherwise we lose the configured limits. We don't
+	 * gain much either by splitting storage of limit and usage.
+	 * So keep it around until user deletes the limits.
+	 */
+	if (atomic_read(&rpool->creator) == RDMACG_RPOOL_CREATOR_DEFAULT)
+		_dealloc_cg_rpool(cg, rpool);
+}
+
+static void put_cg_rpool(struct rdma_cgroup *cg,
+			 struct cg_resource_pool *rpool)
+{
+	if (atomic_dec_and_test(&rpool->refcnt))
+		dealloc_cg_rpool(cg, rpool);
+}
+
+static struct cg_resource_pool*
+	alloc_cg_rpool(struct rdma_cgroup *cg,
+		       struct rdmacg_device *device,
+		       int count,
+		       enum rdmacg_resource_pool_type type)
+{
+	struct cg_resource_pool *rpool;
+	int i, ret;
+
+	rpool = kzalloc(sizeof(*rpool), GFP_KERNEL);
+	if (!rpool) {
+		ret = -ENOMEM;
+		goto err;
+	}
+	rpool->resources = kcalloc(count, sizeof(*rpool->resources),
+				   GFP_KERNEL);
+	if (!rpool->resources) {
+		ret = -ENOMEM;
+		goto alloc_err;
+	}
+
+	/* set pool ownership and type, so that it can be freed correctly */
+	rpool->device = device;
+	rpool->type = type;
+	INIT_LIST_HEAD(&rpool->cg_list);
+	atomic_set(&rpool->creator, RDMACG_RPOOL_CREATOR_DEFAULT);
+
+	for (i = 0; i < count; i++)
+		set_resource_limit(rpool, i, S32_MAX);
+
+	return rpool;
+
+alloc_err:
+	kfree(rpool);
+err:
+	return ERR_PTR(ret);
+}
+
+static struct cg_resource_pool*
+	find_cg_rpool(struct rdma_cgroup *cg,
+		      struct rdmacg_device *device,
+		      enum rdmacg_resource_pool_type type)
+
+{
+	struct cg_resource_pool *pool;
+
+	list_for_each_entry(pool, &cg->rpool_head, cg_list)
+		if (pool->device == device && pool->type == type)
+			return pool;
+
+	return NULL;
+}
+
+static struct cg_resource_pool*
+	_get_cg_rpool(struct rdma_cgroup *cg,
+		      struct rdmacg_device *device,
+		      enum rdmacg_resource_pool_type type)
+{
+	struct cg_resource_pool *rpool;
+
+	spin_lock(&cg->cg_list_lock);
+	rpool = find_cg_rpool(cg, device, type);
+	spin_unlock(&cg->cg_list_lock);
+	return rpool;
+}
+
+/**
+ * get_cg_rpool - get a reference on a cgroup's resource pool
+ * @cg: cgroup for which the resource pool is to be found or allocated.
+ * @device: device for which to allocate the resource pool.
+ * @type: type of the resource pool.
+ *
+ * Searches for a cgroup resource pool object for the given device and
+ * resource type and returns it if found. If none is present, allocates
+ * a new resource pool entry of the default type.
+ * Returns the resource pool on success, otherwise an error pointer or
+ * NULL.
+ */
+static struct cg_resource_pool*
+	get_cg_rpool(struct rdma_cgroup *cg,
+		     struct rdmacg_device *device,
+		     enum rdmacg_resource_pool_type type)
+{
+	struct cg_resource_pool *rpool, *other_rpool;
+	struct rdmacg_pool_info *pool_info;
+	struct rdmacg_resource_pool_ops *ops;
+	int ret = 0;
+
+	spin_lock(&cg->cg_list_lock);
+	rpool = find_cg_rpool(cg, device, type);
+	if (rpool) {
+		atomic_inc(&rpool->refcnt);
+		spin_unlock(&cg->cg_list_lock);
+		return rpool;
+	}
+	spin_unlock(&cg->cg_list_lock);
+
+	/* ops cannot be NULL at this stage; the caller attempts to
+	 * charge/get the resource pool only because it has already set
+	 * up the resource pool ops.
+	 */
+	ops = get_pool_ops(device, type);
+	pool_info = ops->get_resource_pool_tokens(device);
+	if (!pool_info) {
+		ret = -EINVAL;
+		goto err;
+	}
+	if (pool_info->resource_count == 0 ||
+	    pool_info->resource_count > RDMACG_MAX_RESOURCE_INDEX) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	/* allocate resource pool */
+	rpool = alloc_cg_rpool(cg, device, pool_info->resource_count, type);
+	if (IS_ERR_OR_NULL(rpool))
+		return rpool;
+
+	/* cgroup lock is held to synchronize with multiple
+	 * resource pool creation in parallel.
+	 */
+	spin_lock(&cg->cg_list_lock);
+	other_rpool = find_cg_rpool(cg, device, type);
+	/* if another task added a resource pool for this device and
+	 * cgroup in the meantime, free the one we just created and use
+	 * the one we found.
+	 */
+	if (other_rpool) {
+		atomic_inc(&other_rpool->refcnt);
+		spin_unlock(&cg->cg_list_lock);
+		_free_cg_rpool(rpool);
+		return other_rpool;
+	}
+
+	atomic_inc(&rpool->refcnt);
+	list_add_tail(&rpool->cg_list, &cg->rpool_head);
+
+	spin_unlock(&cg->cg_list_lock);
+	return rpool;
+
+err:
+	spin_unlock(&cg->cg_list_lock);
+	return ERR_PTR(ret);
+}
+
+/**
+ * uncharge_cg_resource - hierarchically uncharge resource for rdma cgroup
+ * @cg: pointer to cg to uncharge and all parents in hierarchy
+ * @device: pointer to ib device
+ * @type: the type of resource pool to uncharge
+ * @index: index of the resource to uncharge in cg (resource pool)
+ * @num: the number of rdma resource to uncharge
+ *
+ * It also frees the resource pool in the hierarchy for a resource pool
+ * of the default type which was created as part of the charging
+ * operation.
+ */
+static void uncharge_cg_resource(struct rdma_cgroup *cg,
+				 struct rdmacg_device *device,
+				 enum rdmacg_resource_pool_type type,
+				 int index, int num)
+{
+	struct cg_resource_pool *rpool;
+
+	rpool = _get_cg_rpool(cg, device, type);
+	/*
+	 * A negative count (or overflow) is invalid,
+	 * it indicates a bug in the rdma controller.
+	 */
+	WARN_ON_ONCE(atomic_add_negative(-num,
+					 &rpool->resources[index].usage));
+	put_cg_rpool(cg, rpool);
+}
+
+/**
+ * rdmacg_uncharge - hierarchically uncharge rdma resource count
+ * @cg: pointer to the cgroup to uncharge, along with all its parents
+ * @device: pointer to rdmacg device
+ * @type: the type of resource pool to uncharge
+ * @index: index of the resource to uncharge in the given resource pool
+ *         type
+ * @num: the number of rdma resources to uncharge
+ */
+void rdmacg_uncharge(struct rdma_cgroup *cg,
+		     struct rdmacg_device *device,
+		     enum rdmacg_resource_pool_type type,
+		     int index, int num)
+{
+	struct rdma_cgroup *p;
+
+	for (p = cg; p; p = parent_rdmacg(p))
+		uncharge_cg_resource(p, device, type, index, num);
+
+	css_put(&cg->css);
+}
+EXPORT_SYMBOL(rdmacg_uncharge);
+
+/**
+ * try_charge_resource - hierarchically try to charge given resource
+ * @cg: pointer to cg to charge and all parents in hierarchy
+ * @device: pointer to rdmacg device
+ * @type: the type of resource pool to charge
+ * @index: index of the resource to charge in cg (resource pool)
+ * @num: the number of rdma resource to charge
+ *
+ * This function enforces the configured limit. It will fail if the
+ * charge would cause the new value to exceed the hierarchical limit.
+ * Returns 0 if the charge succeeded, otherwise -EAGAIN.
+ *
+ * On the first charge it allocates a resource pool in the hierarchy
+ * for each parent it comes across. Later charges/uncharges find the
+ * pools already present and are therefore much faster.
+ */
+static int try_charge_resource(struct rdma_cgroup **rdmacg,
+			       struct rdmacg_device *device,
+			       enum rdmacg_resource_pool_type type,
+			       int index, int num)
+{
+	struct cg_resource_pool *rpool;
+	struct rdma_cgroup *cg, *p, *q;
+	int ret;
+
+	cg = task_rdmacg(current);
+
+	for (p = cg; p; p = parent_rdmacg(p)) {
+		s64 new;
+
+		rpool = get_cg_rpool(p, device, type);
+		if (IS_ERR_OR_NULL(rpool)) {
+			ret = PTR_ERR(rpool);
+			goto err;
+		}
+
+		new = atomic_add_return(num, &rpool->resources[index].usage);
+		if (new > rpool->resources[index].max) {
+			ret = -EAGAIN;
+			goto revert;
+		}
+	}
+	/* hold on to the css, as cgroup can be removed but resource
+	 * accounting happens on css.
+	 */
+	css_get(&cg->css);
+	*rdmacg = cg;
+	return 0;
+
+revert:
+	uncharge_cg_resource(p, device, type, index, num);
+err:
+	for (q = cg; q != p; q = parent_rdmacg(q))
+		uncharge_cg_resource(q, device, type, index, num);
+	return ret;
+}
+
+/**
+ * rdmacg_try_charge - hierarchically try to charge the rdma resource
+ * count
+ * @rdmacg: set to the rdma cgroup which owns this resource on success.
+ * @device: pointer to rdmacg device
+ * @type: the type of resource pool to charge
+ * @index: index of the resource to charge in the cg (resource pool)
+ * @num: the number of rdma resources to charge
+ *
+ * This function charges the resource hierarchically.
+ * It will fail if the charge would cause the new value to exceed the
+ * hierarchical limit.
+ * Returns 0 if the charge succeeded, otherwise -EAGAIN, -ENOMEM or
+ * -EINVAL, and returns the owning rdmacg through @rdmacg.
+ */
+int rdmacg_try_charge(struct rdma_cgroup **rdmacg,
+		      struct rdmacg_device *device,
+		      enum rdmacg_resource_pool_type type,
+		      int index, int num)
+{
+	/* The charger needs to account resources on three criteria:
+	 * (a) per cgroup, (b) per device and (c) per resource type usage.
+	 * Per cgroup resource usage ensures that the tasks of a cgroup
+	 * don't cross the configured limits.
+	 * Per device accounting provides granular configuration in multi
+	 * device usage.
+	 * Per resource type accounting allows charging for multiple
+	 * categories of resources - currently (a) verb level and (b) hw
+	 * driver defined.
+	 */
+
+	/* charge the cgroup resource pool */
+	return try_charge_resource(rdmacg, device, type, index, num);
+}
+EXPORT_SYMBOL(rdmacg_try_charge);
+
+/**
+ * rdmacg_register_device - register an rdmacg device with the rdma
+ * controller.
+ * @device: pointer to the rdmacg device whose resources need to be
+ *          accounted.
+ * @dev_name: name under which the device is registered.
+ *
+ * If the IB stack wishes a device to participate in rdma cgroup
+ * resource tracking, it must invoke this API to register with the rdma
+ * cgroup before any user space application can start using the RDMA
+ * resources. The IB stack and/or HCA driver must invoke
+ * rdmacg_set_rpool_ops() for the verb type, the hw type or both, as
+ * these operations are mandatory for registering with the rdma cgroup.
+ */
+void rdmacg_register_device(struct rdmacg_device *device, char *dev_name)
+{
+	INIT_LIST_HEAD(&device->rdmacg_list);
+	device->name = dev_name;
+
+	mutex_lock(&dev_mutex);
+	list_add_tail(&device->rdmacg_list, &dev_list);
+	mutex_unlock(&dev_mutex);
+}
+EXPORT_SYMBOL(rdmacg_register_device);
+
+/**
+ * rdmacg_unregister_device - unregister an rdmacg device from the rdma
+ * controller.
+ * @device: pointer to the rdmacg device which was previously registered
+ *          with the rdma controller using rdmacg_register_device().
+ *
+ * The IB stack must invoke this after all the resources of the IB
+ * device are destroyed and after ensuring that no more resources will
+ * be created when this API is invoked.
+ */
+void rdmacg_unregister_device(struct rdmacg_device *device)
+{
+	/* Synchronize with any active resource settings,
+	 * usage query happening via configfs.
+	 * At this stage, there should not be any active resource pools
+	 * for this device, as RDMA/IB stack is expected to shutdown,
+	 * tear down all the applications and free up resources.
+	 */
+	mutex_lock(&dev_mutex);
+	list_del_init(&device->rdmacg_list);
+	mutex_unlock(&dev_mutex);
+}
+EXPORT_SYMBOL(rdmacg_unregister_device);
+
+/**
+ * rdmacg_set_rpool_ops - set the resource pool callbacks that provide
+ * the matching string tokens defining the names of the resources.
+ * @device: pointer to the rdmacg device for which to set the resource
+ *          pool operations.
+ * @pool_type: resource pool type, either the VERB type or the HW type.
+ * @ops: pointer to the callbacks that return (a) the matching token
+ *       table, used to look up string tokens, and (b) the maximum
+ *       resource limits supported by each device of the given resource
+ *       pool type.
+ *
+ * This helper function allows setting resource pool specific operation
+ * callbacks.
+ * It must be called once by the respective subsystem that implements
+ * the rdma resource definition and owns the task of
+ * charging/uncharging the resource.
+ * It must be called before the ib device is registered with the rdma
+ * cgroup using rdmacg_register_device().
+ */
+void rdmacg_set_rpool_ops(struct rdmacg_device *device,
+			  enum rdmacg_resource_pool_type pool_type,
+			  struct rdmacg_resource_pool_ops *ops)
+{
+	device->rpool_ops[pool_type] = ops;
+}
+EXPORT_SYMBOL(rdmacg_set_rpool_ops);
+
+/**
+ * rdmacg_clear_rpool_ops - clear the resource pool ops which were set
+ * up with rdmacg_set_rpool_ops().
+ * @device: pointer to the ib device for which to clear the resource
+ *          pool operations.
+ * @pool_type: resource pool type, either the VERB type or the HW type.
+ *
+ * This must be called after unregistering the ib device from the rdma
+ * cgroup using rdmacg_unregister_device().
+ */
+void rdmacg_clear_rpool_ops(struct rdmacg_device *device,
+			    enum rdmacg_resource_pool_type pool_type)
+{
+	device->rpool_ops[pool_type] = NULL;
+}
+EXPORT_SYMBOL(rdmacg_clear_rpool_ops);
+
+/**
+ * rdmacg_query_limit - query the resource limits that may have been
+ * configured by the user.
+ * @device: pointer to the ib device
+ * @type: the type of resource pool to query the limits of.
+ * @limits: pointer to an array of limits in which the rdma cgroup will
+ *          return the configured limits of the cgroup.
+ * @max_count: number of array elements to be filled in at @limits.
+ *
+ * This function walks the cgroup hierarchy and reports, for each
+ * resource, the smallest limit configured along the way.
+ * Returns 0 on success, otherwise an appropriate error code.
+ */
+int rdmacg_query_limit(struct rdmacg_device *device,
+		       enum rdmacg_resource_pool_type type,
+		       int *limits, int max_count)
+{
+	struct rdma_cgroup *cg, *p, *q;
+	struct cg_resource_pool *rpool;
+	struct rdmacg_pool_info *pool_info;
+	struct rdmacg_resource_pool_ops *ops;
+	int i, status = 0;
+
+	cg = task_rdmacg(current);
+
+	ops = get_pool_ops(device, type);
+	if (!ops) {
+		status = -EINVAL;
+		goto err;
+	}
+	pool_info = ops->get_resource_pool_tokens(device);
+	if (!pool_info) {
+		status = -EINVAL;
+		goto err;
+	}
+	if (pool_info->resource_count == 0 ||
+	    pool_info->resource_count > RDMACG_MAX_RESOURCE_INDEX ||
+	    max_count > pool_info->resource_count) {
+		status = -EINVAL;
+		goto err;
+	}
+
+	/* initialize to max */
+	for (i = 0; i < max_count; i++)
+		limits[i] = S32_MAX;
+
+	/* walk the hierarchy to find which pool has the smallest
+	 * configured limit for each resource.
+	 */
+	for (p = cg; p; p = parent_rdmacg(p)) {
+		/* get a handle to this cgroup's rpool */
+		rpool = get_cg_rpool(p, device, type);
+		if (IS_ERR_OR_NULL(rpool))
+			goto rpool_err;
+
+		for (i = 0; i < max_count; i++)
+			limits[i] = min_t(int, limits[i],
+					rpool->resources[i].max);
+
+		put_cg_rpool(p, rpool);
+	}
+	return 0;
+
+rpool_err:
+	for (q = cg; q != p; q = parent_rdmacg(q))
+		put_cg_rpool(q, _get_cg_rpool(q, device, type));
+err:
+	return status;
+}
+EXPORT_SYMBOL(rdmacg_query_limit);
+
+static int rdmacg_parse_limits(char *options, struct match_token *opt_tbl,
+			       int *new_limits, u64 *enables)
+{
+	substring_t argstr[MAX_OPT_ARGS];
+	const char *c;
+	int err = -ENOMEM;
+
+	/* parse resource options */
+	while ((c = strsep(&options, " ")) != NULL) {
+		int token, intval, ret;
+
+		err = -EINVAL;
+		token = match_token((char *)c, opt_tbl, argstr);
+		if (token < 0)
+			goto err;
+
+		ret = match_int(&argstr[0], &intval);
+		if (ret < 0) {
+			pr_err("bad value (not int) at '%s'\n", c);
+			goto err;
+		}
+		new_limits[token] = intval;
+		*enables |= BIT(token);
+	}
+	return 0;
+
+err:
+	return err;
+}
+
+static enum rdmacg_resource_pool_type of_to_pool_type(int of_type)
+{
+	enum rdmacg_resource_pool_type pool_type;
+
+	switch (of_type) {
+	case RDMACG_VERB_RESOURCE_MAX:
+	case RDMACG_VERB_RESOURCE_STAT:
+		pool_type = RDMACG_RESOURCE_POOL_VERB;
+		break;
+	case RDMACG_HW_RESOURCE_MAX:
+	case RDMACG_HW_RESOURCE_STAT:
+	default:
+		pool_type = RDMACG_RESOURCE_POOL_HW;
+		break;
+	}
+	return pool_type;
+}
+
+static struct rdmacg_device *_rdmacg_get_device(const char *name)
+{
+	struct rdmacg_device *device;
+
+	list_for_each_entry(device, &dev_list, rdmacg_list)
+		if (!strcmp(name, device->name))
+			return device;
+
+	return NULL;
+}
+
+static void remove_unused_cg_rpool(struct rdma_cgroup *cg,
+				   struct rdmacg_device *device,
+				   enum rdmacg_resource_pool_type type,
+				   int count)
+{
+	struct cg_resource_pool *rpool = NULL;
+	int i;
+
+	spin_lock(&cg->cg_list_lock);
+	rpool = find_cg_rpool(cg, device, type);
+	if (!rpool) {
+		spin_unlock(&cg->cg_list_lock);
+		return;
+	}
+	/* found the resource pool, check now what to do
+	 * based on its reference count.
+	 */
+	if (atomic_read(&rpool->refcnt) == 0) {
+		/* if there is no active user of the rpool,
+		 * free the memory, default group will get
+		 * allocated automatically when new resource
+		 * is created.
+		 */
+		list_del_init(&rpool->cg_list);
+
+		spin_unlock(&cg->cg_list_lock);
+
+		_free_cg_rpool(rpool);
+	} else {
+		/* if there are active processes and thereby
+		 * active resources, set the limits to max.
+		 * The resource pool will be freed later,
+		 * when the last resource is deallocated.
+		 */
+		for (i = 0; i < count; i++)
+			set_resource_limit(rpool, i, S32_MAX);
+		atomic_set(&rpool->creator, RDMACG_RPOOL_CREATOR_DEFAULT);
+
+		spin_unlock(&cg->cg_list_lock);
+	}
+}
+
+static ssize_t rdmacg_resource_set_max(struct kernfs_open_file *of,
+				       char *buf, size_t nbytes, loff_t off)
+{
+	struct rdma_cgroup *cg = css_rdmacg(of_css(of));
+	const char *dev_name;
+	struct cg_resource_pool *rpool;
+	struct rdmacg_resource_pool_ops *ops;
+	struct rdmacg_device *device;
+	char *options = strstrip(buf);
+	enum rdmacg_resource_pool_type pool_type;
+	struct rdmacg_pool_info *resource_tokens;
+	u64 enables = 0;
+	int *new_limits;
+	int i = 0, ret = 0;
+	bool remove = false;
+
+	/* extract the device name first */
+	dev_name = strsep(&options, " ");
+	if (!dev_name) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	/* check if the user asked to remove the cgroup limits */
+	if (options && strstr(options, RDMACG_USR_CMD_REMOVE))
+		remove = true;
+
+	new_limits = kcalloc(RDMACG_MAX_RESOURCE_INDEX, sizeof(int),
+			     GFP_KERNEL);
+	if (!new_limits) {
+		ret = -ENOMEM;
+		goto opt_err;
+	}
+
+	/* acquire lock to synchronize with hot plug devices */
+	mutex_lock(&dev_mutex);
+
+	device = _rdmacg_get_device(dev_name);
+	if (!device) {
+		ret = -ENODEV;
+		goto parse_err;
+	}
+	pool_type = of_to_pool_type(of_cft(of)->private);
+	ops = get_pool_ops(device, pool_type);
+	if (!ops) {
+		ret = -EINVAL;
+		goto parse_err;
+	}
+
+	resource_tokens = ops->get_resource_pool_tokens(device);
+	if (IS_ERR_OR_NULL(resource_tokens)) {
+		ret = -EINVAL;
+		goto parse_err;
+	}
+
+	if (remove) {
+		remove_unused_cg_rpool(cg, device, pool_type,
+				       resource_tokens->resource_count);
+		/* user asked to clear the limits; ignore rest of the options */
+		goto parse_err;
+	}
+
+	/* user didn't ask to remove, act on the options */
+	ret = rdmacg_parse_limits(options,
+				  resource_tokens->resource_table,
+				  new_limits, &enables);
+	if (ret)
+		goto parse_err;
+
+	rpool = get_cg_rpool(cg, device, pool_type);
+	if (IS_ERR_OR_NULL(rpool)) {
+		if (IS_ERR(rpool))
+			ret = PTR_ERR(rpool);
+		else
+			ret = -ENOMEM;
+		goto parse_err;
+	}
+	/* mark the rpool as user created regardless of its previous
+	 * creator, since the user is configuring the limit now.
+	 */
+	atomic_set(&rpool->creator, RDMACG_RPOOL_CREATOR_USR);
+
+	/* now set the new limits on the existing or newly created rpool */
+	while (enables) {
+		/* if user set the limit, enables bit is set */
+		if (enables & BIT(i)) {
+			enables &= ~BIT(i);
+			set_resource_limit(rpool, i, new_limits[i]);
+		}
+		i++;
+	}
+	atomic_dec(&rpool->refcnt);
+parse_err:
+	mutex_unlock(&dev_mutex);
+opt_err:
+	kfree(new_limits);
+err:
+	return ret ?: nbytes;
+}
+
+static int get_resource_val(struct cg_resource *resource,
+			    enum rdmacg_file_type type)
+{
+	int val = 0;
+
+	switch (type) {
+	case RDMACG_VERB_RESOURCE_MAX:
+	case RDMACG_HW_RESOURCE_MAX:
+		val = resource->max;
+		break;
+	case RDMACG_VERB_RESOURCE_STAT:
+	case RDMACG_HW_RESOURCE_STAT:
+		val = atomic_read(&resource->usage);
+		break;
+	default:
+		val = 0;
+		break;
+	}
+	return val;
+}
+
+static u32 *get_cg_rpool_values(struct rdma_cgroup *cg,
+				struct rdmacg_device *device,
+				enum rdmacg_resource_pool_type pool_type,
+				enum rdmacg_file_type resource_type,
+				int resource_count)
+{
+	struct cg_resource_pool *rpool;
+	u32 *value_tbl;
+	int i, ret;
+
+	value_tbl = kcalloc(resource_count, sizeof(u32), GFP_KERNEL);
+	if (!value_tbl) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	rpool = get_cg_rpool(cg, device, pool_type);
+	if (IS_ERR_OR_NULL(rpool)) {
+		if (IS_ERR(rpool))
+			ret = PTR_ERR(rpool);
+		else
+			ret = -ENOMEM;
+		goto err;
+	}
+
+	for (i = 0; i < resource_count; i++) {
+		value_tbl[i] = get_resource_val(&rpool->resources[i],
+						resource_type);
+	}
+	put_cg_rpool(cg, rpool);
+	return value_tbl;
+
+err:
+	return ERR_PTR(ret);
+}
+
+static int print_rpool_values(struct seq_file *sf,
+			      struct rdmacg_pool_info *pool_info,
+			      u32 *value_tbl)
+{
+	struct match_token *resource_table;
+	char *name;
+	int i, ret, name_len;
+
+	resource_table = pool_info->resource_table;
+
+	for (i = 0; i < pool_info->resource_count; i++) {
+		if (value_tbl[i] == S32_MAX) {
+			/* the token pattern cannot be modified in place,
+			 * so make a copy to print the "max" string from.
+			 */
+			name_len = strlen(resource_table[i].pattern);
+			name = kzalloc(name_len + 1, GFP_KERNEL);
+			if (!name) {
+				ret = -ENOMEM;
+				goto err;
+			}
+			strcpy(name, resource_table[i].pattern);
+
+			/* change the format specifier from %d to %s */
+			name[name_len - 1] = 's';
+			seq_printf(sf, name, "max");
+			kfree(name);
+		} else {
+			seq_printf(sf, resource_table[i].pattern, value_tbl[i]);
+		}
+		seq_putc(sf, ' ');
+	}
+	return 0;
+
+err:
+	return ret;
+}
+
+static int rdmacg_resource_read(struct seq_file *sf, void *v)
+{
+	struct rdmacg_device *device;
+	struct rdma_cgroup *cg = css_rdmacg(seq_css(sf));
+	struct rdmacg_pool_info *pool_info;
+	struct rdmacg_resource_pool_ops *ops;
+	u32 *value_tbl;
+	enum rdmacg_resource_pool_type pool_type;
+	int ret = 0;
+
+	pool_type = of_to_pool_type(seq_cft(sf)->private);
+
+	mutex_lock(&dev_mutex);
+
+	list_for_each_entry(device, &dev_list, rdmacg_list) {
+		ops = get_pool_ops(device, pool_type);
+		if (!ops)
+			continue;
+
+		pool_info = ops->get_resource_pool_tokens(device);
+		if (IS_ERR_OR_NULL(pool_info)) {
+			ret = -EINVAL;
+			goto err;
+		}
+
+		/* get the value from resource pool */
+		value_tbl = get_cg_rpool_values(cg, device, pool_type,
+						seq_cft(sf)->private,
+						pool_info->resource_count);
+		if (IS_ERR(value_tbl)) {
+			ret = PTR_ERR(value_tbl);
+			goto err;
+		}
+
+		seq_printf(sf, "%s ", device->name);
+		ret = print_rpool_values(sf, pool_info, value_tbl);
+		seq_putc(sf, '\n');
+
+		kfree(value_tbl);
+		if (ret)
+			break;
+	}
+
+err:
+	mutex_unlock(&dev_mutex);
+	return ret;
+}
+
+static struct cftype rdmacg_files[] = {
+	{
+		.name = "verb.max",
+		.write = rdmacg_resource_set_max,
+		.seq_show = rdmacg_resource_read,
+		.private = RDMACG_VERB_RESOURCE_MAX,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+	{
+		.name = "verb.current",
+		.seq_show = rdmacg_resource_read,
+		.private = RDMACG_VERB_RESOURCE_STAT,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+	{
+		.name = "hw.max",
+		.write = rdmacg_resource_set_max,
+		.seq_show = rdmacg_resource_read,
+		.private = RDMACG_HW_RESOURCE_MAX,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+	{
+		.name = "hw.current",
+		.seq_show = rdmacg_resource_read,
+		.private = RDMACG_HW_RESOURCE_STAT,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+	{ }	/* terminate */
+};
+
+static struct cgroup_subsys_state *
+rdmacg_css_alloc(struct cgroup_subsys_state *parent)
+{
+	struct rdma_cgroup *cg;
+
+	cg = kzalloc(sizeof(*cg), GFP_KERNEL);
+	if (!cg)
+		return ERR_PTR(-ENOMEM);
+
+	INIT_LIST_HEAD(&cg->rpool_head);
+	spin_lock_init(&cg->cg_list_lock);
+
+	return &cg->css;
+}
+
+static void rdmacg_css_free(struct cgroup_subsys_state *css)
+{
+	struct rdma_cgroup *cg = css_rdmacg(css);
+
+	kfree(cg);
+}
+
+/**
+ * rdmacg_css_offline - cgroup css_offline callback
+ * @css: css of interest
+ *
+ * This function is called when @css is about to go away and is
+ * responsible for shooting down all rdmacg state associated with @css.
+ * As part of that it marks all the resource pools as default type, so
+ * that when resources are freed, the associated resource pools can be
+ * freed as well.
+ */
+static void rdmacg_css_offline(struct cgroup_subsys_state *css)
+{
+	struct rdma_cgroup *cg = css_rdmacg(css);
+	struct cg_resource_pool *rpool;
+
+	spin_lock(&cg->cg_list_lock);
+
+	/* mark resource pools as default type, so that they can be
+	 * freed later when the actual resources are freed.
+	 */
+	list_for_each_entry(rpool, &cg->rpool_head, cg_list)
+		atomic_set(&rpool->creator, RDMACG_RPOOL_CREATOR_DEFAULT);
+
+	spin_unlock(&cg->cg_list_lock);
+}
+
+struct cgroup_subsys rdma_cgrp_subsys = {
+	.css_alloc	= rdmacg_css_alloc,
+	.css_free	= rdmacg_css_free,
+	.css_offline	= rdmacg_css_offline,
+	.legacy_cftypes	= rdmacg_files,
+	.dfl_cftypes	= rdmacg_files,
+};
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCHv3 2/3] IB/core: added support to use rdma cgroup controller
  2016-01-30 15:23 [PATCHv3 0/3] rdmacg: IB/core: rdma controller support Parav Pandit
  2016-01-30 15:23 ` [PATCHv3 1/3] rdmacg: Added rdma cgroup controller Parav Pandit
@ 2016-01-30 15:23 ` Parav Pandit
  2016-01-30 15:23 ` [PATCHv3 3/3] rdmacg: Added documentation for rdma controller Parav Pandit
  2 siblings, 0 replies; 16+ messages in thread
From: Parav Pandit @ 2016-01-30 15:23 UTC (permalink / raw)
  To: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan,
	hannes, dledford, liranl, sean.hefty, jgunthorpe, haggaie
  Cc: corbet, james.l.morris, serge, ogerlitz, matanb, raindel, akpm,
	linux-security-module, pandit.parav

- Added support APIs for the IB core to register/unregister every RDMA
device with the rdma cgroup for tracking verbs and hw resources.
- The IB core registers with the rdma cgroup controller and also defines
the resources that can be accounted.
- Added support APIs for the uverbs layer to make use of the rdma
controller.
- Modified the uverbs layer to perform resource charge/uncharge
functionality.

Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
---
 drivers/infiniband/core/Makefile      |   1 +
 drivers/infiniband/core/cgroup.c      | 108 ++++++++++++++++++
 drivers/infiniband/core/core_priv.h   |  45 ++++++++
 drivers/infiniband/core/device.c      |   8 ++
 drivers/infiniband/core/uverbs_cmd.c  | 209 +++++++++++++++++++++++++++++++---
 drivers/infiniband/core/uverbs_main.c |  28 +++++
 include/rdma/ib_verbs.h               |  27 ++++-
 7 files changed, 410 insertions(+), 16 deletions(-)
 create mode 100644 drivers/infiniband/core/cgroup.c

diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index d43a899..df40cee 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -13,6 +13,7 @@ ib_core-y :=			packer.o ud_header.o verbs.o sysfs.o \
 				roce_gid_mgmt.o
 ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
 ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += umem_odp.o umem_rbtree.o
+ib_core-$(CONFIG_CGROUP_RDMA) += cgroup.o
 
 ib_mad-y :=			mad.o smi.o agent.o mad_rmpp.o
 
diff --git a/drivers/infiniband/core/cgroup.c b/drivers/infiniband/core/cgroup.c
new file mode 100644
index 0000000..be0a2b8
--- /dev/null
+++ b/drivers/infiniband/core/cgroup.c
@@ -0,0 +1,108 @@
+#include <linux/kernel.h>
+#include <linux/parser.h>
+#include <linux/cgroup_rdma.h>
+
+#include "core_priv.h"
+
+/*
+ * Resource table definition as seen by the user. Add entries
+ * here when more resources are added/defined at the IB
+ * verb/core layer.
+ */
+static match_table_t resource_tokens = {
+	{RDMA_VERB_RESOURCE_UCTX, "uctx=%d"},
+	{RDMA_VERB_RESOURCE_AH, "ah=%d"},
+	{RDMA_VERB_RESOURCE_PD, "pd=%d"},
+	{RDMA_VERB_RESOURCE_CQ, "cq=%d"},
+	{RDMA_VERB_RESOURCE_MR, "mr=%d"},
+	{RDMA_VERB_RESOURCE_MW, "mw=%d"},
+	{RDMA_VERB_RESOURCE_SRQ, "srq=%d"},
+	{RDMA_VERB_RESOURCE_QP, "qp=%d"},
+	{RDMA_VERB_RESOURCE_FLOW, "flow=%d"},
+	{-1, NULL}
+};
+
+/*
+ * Set up table pointers for the RDMA cgroup to access.
+ */
+static struct rdmacg_pool_info verbs_token_info = {
+	.resource_table = resource_tokens,
+	.resource_count =
+		(sizeof(resource_tokens) / sizeof(struct match_token)) - 1,
+};
+
+static struct rdmacg_pool_info *
+rdmacg_get_resource_pool_tokens(struct rdmacg_device *device)
+{
+	return &verbs_token_info;
+}
+
+static struct rdmacg_resource_pool_ops verbs_pool_ops = {
+	.get_resource_pool_tokens = &rdmacg_get_resource_pool_tokens,
+};
+
+/**
+ * ib_device_register_rdmacg - register with rdma cgroup.
+ * @device: device to register to participate in resource
+ *          accounting by rdma cgroup.
+ *
+ * Register with the rdma cgroup. Should be called before
+ * exposing rdma device to user space applications to avoid
+ * resource accounting leak.
+ * HCA drivers that wish to support hw specific resource
+ * accounting should set their resource pool ops before the
+ * IB core registers with the rdma cgroup.
+ */
+void ib_device_register_rdmacg(struct ib_device *device)
+{
+	rdmacg_set_rpool_ops(&device->cg_device,
+			     RDMACG_RESOURCE_POOL_VERB,
+			     &verbs_pool_ops);
+	rdmacg_register_device(&device->cg_device, device->name);
+}
+
+/**
+ * ib_device_unregister_rdmacg - unregister with rdma cgroup.
+ * @device: device to unregister.
+ *
+ * Unregister with the rdma cgroup. Should be called after
+ * all resources are deallocated, and once no further user
+ * space resource allocation can occur for this device, to
+ * avoid any leak in accounting.
+ * HCA drivers should clear their resource pool ops after the
+ * IB stack unregisters with the rdma cgroup.
+ */
+void ib_device_unregister_rdmacg(struct ib_device *device)
+{
+	rdmacg_unregister_device(&device->cg_device);
+	rdmacg_clear_rpool_ops(&device->cg_device,
+			       RDMACG_RESOURCE_POOL_VERB);
+}
+
+int ib_rdmacg_try_charge(struct ib_rdmacg_object *cg_obj,
+			 struct ib_device *device,
+			 enum rdmacg_resource_pool_type type,
+			 int resource_index, int num)
+{
+	return rdmacg_try_charge(&cg_obj->cg, &device->cg_device,
+				 type, resource_index, num);
+}
+EXPORT_SYMBOL(ib_rdmacg_try_charge);
+
+void ib_rdmacg_uncharge(struct ib_rdmacg_object *cg_obj,
+			struct ib_device *device,
+			enum rdmacg_resource_pool_type type,
+			int resource_index, int num)
+{
+	rdmacg_uncharge(cg_obj->cg, &device->cg_device,
+			type, resource_index, num);
+}
+EXPORT_SYMBOL(ib_rdmacg_uncharge);
+
+int ib_rdmacg_query_limit(struct ib_device *device,
+			  enum rdmacg_resource_pool_type type,
+			  int *limits, int max_count)
+{
+	return rdmacg_query_limit(&device->cg_device, type, limits, max_count);
+}
+EXPORT_SYMBOL(ib_rdmacg_query_limit);
diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h
index 5cf6eb7..977988a 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -37,6 +37,7 @@
 #include <linux/spinlock.h>
 
 #include <rdma/ib_verbs.h>
+#include <linux/cgroup_rdma.h>
 
 int  ib_device_register_sysfs(struct ib_device *device,
 			      int (*port_callback)(struct ib_device *,
@@ -92,4 +93,48 @@ int ib_cache_setup_one(struct ib_device *device);
 void ib_cache_cleanup_one(struct ib_device *device);
 void ib_cache_release_one(struct ib_device *device);
 
+#ifdef CONFIG_CGROUP_RDMA
+
+void ib_device_register_rdmacg(struct ib_device *device);
+void ib_device_unregister_rdmacg(struct ib_device *device);
+
+int ib_rdmacg_try_charge(struct ib_rdmacg_object *cg_obj,
+			 struct ib_device *device,
+			 enum rdmacg_resource_pool_type type,
+			 int resource_index, int num);
+
+void ib_rdmacg_uncharge(struct ib_rdmacg_object *cg_obj,
+			struct ib_device *device,
+			enum rdmacg_resource_pool_type type,
+			int resource_index, int num);
+
+int ib_rdmacg_query_limit(struct ib_device *device,
+			  enum rdmacg_resource_pool_type type,
+			  int *limits, int max_count);
+#else
+static inline int ib_rdmacg_try_charge(struct ib_rdmacg_object *cg_obj,
+				       struct ib_device *device,
+				       enum rdmacg_resource_pool_type type,
+				       int resource_index, int num)
+{ return 0; }
+
+static inline void ib_rdmacg_uncharge(struct ib_rdmacg_object *cg_obj,
+				      struct ib_device *device,
+				      enum rdmacg_resource_pool_type type,
+				      int resource_index, int num)
+{ }
+
+static inline int ib_rdmacg_query_limit(struct ib_device *device,
+					enum rdmacg_resource_pool_type type,
+					int *limits, int max_count)
+{
+	int i;
+
+	for (i = 0; i < max_count; i++)
+		limits[i] = S32_MAX;
+
+	return 0;
+}
+#endif
+
 #endif /* _CORE_PRIV_H */
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 179e813..59cab6b 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -352,6 +352,10 @@ int ib_register_device(struct ib_device *device,
 		goto out;
 	}
 
+#ifdef CONFIG_CGROUP_RDMA
+	ib_device_register_rdmacg(device);
+#endif
+
 	ret = ib_device_register_sysfs(device, port_callback);
 	if (ret) {
 		printk(KERN_WARNING "Couldn't register device %s with driver model\n",
@@ -405,6 +409,10 @@ void ib_unregister_device(struct ib_device *device)
 
 	mutex_unlock(&device_mutex);
 
+#ifdef CONFIG_CGROUP_RDMA
+	ib_device_unregister_rdmacg(device);
+#endif
+
 	ib_device_unregister_sysfs(device);
 	ib_cache_cleanup_one(device);
 
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 94816ae..78006d6 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -294,6 +294,7 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
 #endif
 	struct ib_ucontext		 *ucontext;
 	struct file			 *filp;
+	struct ib_rdmacg_object          cg_obj;
 	int ret;
 
 	if (out_len < sizeof resp)
@@ -313,13 +314,21 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
 		   (unsigned long) cmd.response + sizeof resp,
 		   in_len - sizeof cmd, out_len - sizeof resp);
 
+	ret = ib_rdmacg_try_charge(&cg_obj, ib_dev,
+				   RDMACG_RESOURCE_POOL_VERB,
+				   RDMA_VERB_RESOURCE_UCTX, 1);
+	if (ret)
+		goto err;
+
 	ucontext = ib_dev->alloc_ucontext(ib_dev, &udata);
 	if (IS_ERR(ucontext)) {
 		ret = PTR_ERR(ucontext);
-		goto err;
+		goto err_alloc;
 	}
 
 	ucontext->device = ib_dev;
+	ucontext->cg_obj = cg_obj;
+
 	INIT_LIST_HEAD(&ucontext->pd_list);
 	INIT_LIST_HEAD(&ucontext->mr_list);
 	INIT_LIST_HEAD(&ucontext->mw_list);
@@ -386,6 +395,10 @@ err_free:
 	put_pid(ucontext->tgid);
 	ib_dev->dealloc_ucontext(ucontext);
 
+err_alloc:
+	ib_rdmacg_uncharge(&cg_obj, ib_dev, RDMACG_RESOURCE_POOL_VERB,
+			   RDMA_VERB_RESOURCE_UCTX, 1);
+
 err:
 	mutex_unlock(&file->mutex);
 	return ret;
@@ -394,7 +407,8 @@ err:
 static void copy_query_dev_fields(struct ib_uverbs_file *file,
 				  struct ib_device *ib_dev,
 				  struct ib_uverbs_query_device_resp *resp,
-				  struct ib_device_attr *attr)
+				  struct ib_device_attr *attr,
+				  int *limits)
 {
 	resp->fw_ver		= attr->fw_ver;
 	resp->node_guid		= ib_dev->node_guid;
@@ -405,14 +419,19 @@ static void copy_query_dev_fields(struct ib_uverbs_file *file,
 	resp->vendor_part_id	= attr->vendor_part_id;
 	resp->hw_ver		= attr->hw_ver;
 	resp->max_qp		= attr->max_qp;
+	resp->max_qp		= min_t(int, attr->max_qp,
+					limits[RDMA_VERB_RESOURCE_QP]);
 	resp->max_qp_wr		= attr->max_qp_wr;
 	resp->device_cap_flags	= attr->device_cap_flags;
 	resp->max_sge		= attr->max_sge;
 	resp->max_sge_rd	= attr->max_sge_rd;
-	resp->max_cq		= attr->max_cq;
+	resp->max_cq		= min_t(int, attr->max_cq,
+					limits[RDMA_VERB_RESOURCE_CQ]);
 	resp->max_cqe		= attr->max_cqe;
-	resp->max_mr		= attr->max_mr;
-	resp->max_pd		= attr->max_pd;
+	resp->max_mr		= min_t(int, attr->max_mr,
+					limits[RDMA_VERB_RESOURCE_MR]);
+	resp->max_pd		= min_t(int, attr->max_pd,
+					limits[RDMA_VERB_RESOURCE_PD]);
 	resp->max_qp_rd_atom	= attr->max_qp_rd_atom;
 	resp->max_ee_rd_atom	= attr->max_ee_rd_atom;
 	resp->max_res_rd_atom	= attr->max_res_rd_atom;
@@ -421,16 +440,19 @@ static void copy_query_dev_fields(struct ib_uverbs_file *file,
 	resp->atomic_cap		= attr->atomic_cap;
 	resp->max_ee			= attr->max_ee;
 	resp->max_rdd			= attr->max_rdd;
-	resp->max_mw			= attr->max_mw;
+	resp->max_mw			= min_t(int, attr->max_mw,
+						limits[RDMA_VERB_RESOURCE_MW]);
 	resp->max_raw_ipv6_qp		= attr->max_raw_ipv6_qp;
 	resp->max_raw_ethy_qp		= attr->max_raw_ethy_qp;
 	resp->max_mcast_grp		= attr->max_mcast_grp;
 	resp->max_mcast_qp_attach	= attr->max_mcast_qp_attach;
 	resp->max_total_mcast_qp_attach	= attr->max_total_mcast_qp_attach;
-	resp->max_ah			= attr->max_ah;
+	resp->max_ah			= min_t(int, attr->max_ah,
+						limits[RDMA_VERB_RESOURCE_AH]);
 	resp->max_fmr			= attr->max_fmr;
 	resp->max_map_per_fmr		= attr->max_map_per_fmr;
-	resp->max_srq			= attr->max_srq;
+	resp->max_srq			= min_t(int, attr->max_srq,
+						limits[RDMA_VERB_RESOURCE_SRQ]);
 	resp->max_srq_wr		= attr->max_srq_wr;
 	resp->max_srq_sge		= attr->max_srq_sge;
 	resp->max_pkeys			= attr->max_pkeys;
@@ -447,6 +469,7 @@ ssize_t ib_uverbs_query_device(struct ib_uverbs_file *file,
 	struct ib_uverbs_query_device_resp resp;
 	struct ib_device_attr              attr;
 	int                                ret;
+	int                                limits[RDMA_VERB_RESOURCE_MAX];
 
 	if (out_len < sizeof resp)
 		return -ENOSPC;
@@ -458,14 +481,23 @@ ssize_t ib_uverbs_query_device(struct ib_uverbs_file *file,
 	if (ret)
 		return ret;
 
+	ret = ib_rdmacg_query_limit(ib_dev,
+				    RDMACG_RESOURCE_POOL_VERB,
+				    limits, RDMA_VERB_RESOURCE_MAX);
+	if (ret)
+		goto err;
+
 	memset(&resp, 0, sizeof resp);
-	copy_query_dev_fields(file, ib_dev, &resp, &attr);
+	copy_query_dev_fields(file, ib_dev, &resp, &attr, limits);
 
 	if (copy_to_user((void __user *) (unsigned long) cmd.response,
 			 &resp, sizeof resp))
 		return -EFAULT;
 
 	return in_len;
+
+err:
+	return ret;
 }
 
 ssize_t ib_uverbs_query_port(struct ib_uverbs_file *file,
@@ -545,6 +577,14 @@ ssize_t ib_uverbs_alloc_pd(struct ib_uverbs_file *file,
 	if (!uobj)
 		return -ENOMEM;
 
+	ret = ib_rdmacg_try_charge(&uobj->cg_obj, file->device->ib_dev,
+				   RDMACG_RESOURCE_POOL_VERB,
+				   RDMA_VERB_RESOURCE_PD, 1);
+	if (ret) {
+		kfree(uobj);
+		return -EPERM;
+	}
+
 	init_uobj(uobj, 0, file->ucontext, &pd_lock_class);
 	down_write(&uobj->mutex);
 
@@ -590,6 +630,9 @@ err_idr:
 	ib_dealloc_pd(pd);
 
 err:
+	ib_rdmacg_uncharge(&uobj->cg_obj, file->device->ib_dev,
+			   RDMACG_RESOURCE_POOL_VERB,
+			   RDMA_VERB_RESOURCE_PD, 1);
 	put_uobj_write(uobj);
 	return ret;
 }
@@ -602,6 +645,7 @@ ssize_t ib_uverbs_dealloc_pd(struct ib_uverbs_file *file,
 	struct ib_uverbs_dealloc_pd cmd;
 	struct ib_uobject          *uobj;
 	struct ib_pd		   *pd;
+	struct ib_device           *device;
 	int                         ret;
 
 	if (copy_from_user(&cmd, buf, sizeof cmd))
@@ -622,6 +666,12 @@ ssize_t ib_uverbs_dealloc_pd(struct ib_uverbs_file *file,
 	if (ret)
 		goto err_put;
 
+	device = uobj->context->device;
+
+	ib_rdmacg_uncharge(&uobj->cg_obj, device,
+			   RDMACG_RESOURCE_POOL_VERB,
+			   RDMA_VERB_RESOURCE_PD, 1);
+
 	uobj->live = 0;
 	put_uobj_write(uobj);
 
@@ -995,6 +1045,12 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverbs_file *file,
 		}
 	}
 
+	ret = ib_rdmacg_try_charge(&uobj->cg_obj, pd->device,
+				   RDMACG_RESOURCE_POOL_VERB,
+				   RDMA_VERB_RESOURCE_MR, 1);
+	if (ret)
+		goto err_charge;
+
 	mr = pd->device->reg_user_mr(pd, cmd.start, cmd.length, cmd.hca_va,
 				     cmd.access_flags, &udata);
 	if (IS_ERR(mr)) {
@@ -1043,6 +1099,11 @@ err_unreg:
 	ib_dereg_mr(mr);
 
 err_put:
+	ib_rdmacg_uncharge(&uobj->cg_obj, pd->device,
+			   RDMACG_RESOURCE_POOL_VERB,
+			   RDMA_VERB_RESOURCE_MR, 1);
+
+err_charge:
 	put_pd_read(pd);
 
 err_free:
@@ -1152,6 +1213,7 @@ ssize_t ib_uverbs_dereg_mr(struct ib_uverbs_file *file,
 	struct ib_uverbs_dereg_mr cmd;
 	struct ib_mr             *mr;
 	struct ib_uobject	 *uobj;
+	struct ib_pd             *pd;
 	int                       ret = -EINVAL;
 
 	if (copy_from_user(&cmd, buf, sizeof cmd))
@@ -1163,6 +1225,8 @@ ssize_t ib_uverbs_dereg_mr(struct ib_uverbs_file *file,
 
 	mr = uobj->object;
 
+	pd = mr->pd;
+
 	ret = ib_dereg_mr(mr);
 	if (!ret)
 		uobj->live = 0;
@@ -1172,6 +1236,10 @@ ssize_t ib_uverbs_dereg_mr(struct ib_uverbs_file *file,
 	if (ret)
 		return ret;
 
+	ib_rdmacg_uncharge(&uobj->cg_obj, pd->device,
+			   RDMACG_RESOURCE_POOL_VERB,
+			   RDMA_VERB_RESOURCE_MR, 1);
+
 	idr_remove_uobj(&ib_uverbs_mr_idr, uobj);
 
 	mutex_lock(&file->mutex);
@@ -1214,6 +1282,12 @@ ssize_t ib_uverbs_alloc_mw(struct ib_uverbs_file *file,
 		goto err_free;
 	}
 
+	ret = ib_rdmacg_try_charge(&uobj->cg_obj, pd->device,
+				   RDMACG_RESOURCE_POOL_VERB,
+				   RDMA_VERB_RESOURCE_MW, 1);
+	if (ret)
+		goto err_charge;
+
 	mw = pd->device->alloc_mw(pd, cmd.mw_type);
 	if (IS_ERR(mw)) {
 		ret = PTR_ERR(mw);
@@ -1259,6 +1333,11 @@ err_unalloc:
 	ib_dealloc_mw(mw);
 
 err_put:
+	ib_rdmacg_uncharge(&uobj->cg_obj, pd->device,
+			   RDMACG_RESOURCE_POOL_VERB,
+			   RDMA_VERB_RESOURCE_MW, 1);
+
+err_charge:
 	put_pd_read(pd);
 
 err_free:
@@ -1273,6 +1352,7 @@ ssize_t ib_uverbs_dealloc_mw(struct ib_uverbs_file *file,
 {
 	struct ib_uverbs_dealloc_mw cmd;
 	struct ib_mw               *mw;
+	struct ib_pd               *pd;
 	struct ib_uobject	   *uobj;
 	int                         ret = -EINVAL;
 
@@ -1284,6 +1364,7 @@ ssize_t ib_uverbs_dealloc_mw(struct ib_uverbs_file *file,
 		return -EINVAL;
 
 	mw = uobj->object;
+	pd = mw->pd;
 
 	ret = ib_dealloc_mw(mw);
 	if (!ret)
@@ -1294,6 +1375,10 @@ ssize_t ib_uverbs_dealloc_mw(struct ib_uverbs_file *file,
 	if (ret)
 		return ret;
 
+	ib_rdmacg_uncharge(&uobj->cg_obj, pd->device,
+			   RDMACG_RESOURCE_POOL_VERB,
+			   RDMA_VERB_RESOURCE_MW, 1);
+
 	idr_remove_uobj(&ib_uverbs_mw_idr, uobj);
 
 	mutex_lock(&file->mutex);
@@ -1393,6 +1478,12 @@ static struct ib_ucq_object *create_cq(struct ib_uverbs_file *file,
 	if (cmd_sz > offsetof(typeof(*cmd), flags) + sizeof(cmd->flags))
 		attr.flags = cmd->flags;
 
+	ret = ib_rdmacg_try_charge(&obj->uobject.cg_obj, file->device->ib_dev,
+				   RDMACG_RESOURCE_POOL_VERB,
+				   RDMA_VERB_RESOURCE_CQ, 1);
+	if (ret)
+		goto err_charge;
+
 	cq = ib_dev->create_cq(ib_dev, &attr,
 					     file->ucontext, uhw);
 	if (IS_ERR(cq)) {
@@ -1440,6 +1531,11 @@ err_free:
 	ib_destroy_cq(cq);
 
 err_file:
+	ib_rdmacg_uncharge(&obj->uobject.cg_obj, file->device->ib_dev,
+			   RDMACG_RESOURCE_POOL_VERB,
+			   RDMA_VERB_RESOURCE_CQ, 1);
+
+err_charge:
 	if (ev_file)
 		ib_uverbs_release_ucq(file, ev_file, obj);
 
@@ -1720,6 +1816,10 @@ ssize_t ib_uverbs_destroy_cq(struct ib_uverbs_file *file,
 	if (ret)
 		return ret;
 
+	ib_rdmacg_uncharge(&uobj->cg_obj, uobj->context->device,
+			   RDMACG_RESOURCE_POOL_VERB,
+			   RDMA_VERB_RESOURCE_CQ, 1);
+
 	idr_remove_uobj(&ib_uverbs_cq_idr, uobj);
 
 	mutex_lock(&file->mutex);
@@ -1775,6 +1875,12 @@ static int create_qp(struct ib_uverbs_file *file,
 		  &qp_lock_class);
 	down_write(&obj->uevent.uobject.mutex);
 
+	pd  = idr_read_pd(cmd->pd_handle, file->ucontext);
+	if (!pd) {
+		ret = -EINVAL;
+		goto err_put;
+	}
+
 	if (cmd->qp_type == IB_QPT_XRC_TGT) {
 		xrcd = idr_read_xrcd(cmd->pd_handle, file->ucontext,
 				     &xrcd_uobj);
@@ -1809,8 +1915,7 @@ static int create_qp(struct ib_uverbs_file *file,
 
 		scq = idr_read_cq(cmd->send_cq_handle, file->ucontext, !!rcq);
 		rcq = rcq ?: scq;
-		pd  = idr_read_pd(cmd->pd_handle, file->ucontext);
-		if (!pd || !scq) {
+		if (!scq) {
 			ret = -EINVAL;
 			goto err_put;
 		}
@@ -1856,6 +1961,12 @@ static int create_qp(struct ib_uverbs_file *file,
 			goto err_put;
 		}
 
+	ret = ib_rdmacg_try_charge(&obj->uevent.uobject.cg_obj, pd->device,
+				   RDMACG_RESOURCE_POOL_VERB,
+				   RDMA_VERB_RESOURCE_QP, 1);
+	if (ret)
+		goto err_put;
+
 	if (cmd->qp_type == IB_QPT_XRC_TGT)
 		qp = ib_create_qp(pd, &attr);
 	else
@@ -1863,7 +1974,7 @@ static int create_qp(struct ib_uverbs_file *file,
 
 	if (IS_ERR(qp)) {
 		ret = PTR_ERR(qp);
-		goto err_put;
+		goto err_create;
 	}
 
 	if (cmd->qp_type != IB_QPT_XRC_TGT) {
@@ -1938,6 +2049,11 @@ err_cb:
 err_destroy:
 	ib_destroy_qp(qp);
 
+err_create:
+	ib_rdmacg_uncharge(&obj->uevent.uobject.cg_obj, device,
+			   RDMACG_RESOURCE_POOL_VERB,
+			   RDMA_VERB_RESOURCE_QP, 1);
+
 err_put:
 	if (xrcd)
 		put_xrcd_read(xrcd_uobj);
@@ -2377,6 +2493,7 @@ ssize_t ib_uverbs_destroy_qp(struct ib_uverbs_file *file,
 	struct ib_uverbs_destroy_qp_resp resp;
 	struct ib_uobject		*uobj;
 	struct ib_qp               	*qp;
+	struct ib_pd                    *pd;
 	struct ib_uqp_object        	*obj;
 	int                        	 ret = -EINVAL;
 
@@ -2389,6 +2506,7 @@ ssize_t ib_uverbs_destroy_qp(struct ib_uverbs_file *file,
 	if (!uobj)
 		return -EINVAL;
 	qp  = uobj->object;
+	pd  = qp->pd;
 	obj = container_of(uobj, struct ib_uqp_object, uevent.uobject);
 
 	if (!list_empty(&obj->mcast_list)) {
@@ -2405,6 +2523,10 @@ ssize_t ib_uverbs_destroy_qp(struct ib_uverbs_file *file,
 	if (ret)
 		return ret;
 
+	ib_rdmacg_uncharge(&uobj->cg_obj, pd->device,
+			   RDMACG_RESOURCE_POOL_VERB,
+			   RDMA_VERB_RESOURCE_QP, 1);
+
 	if (obj->uxrcd)
 		atomic_dec(&obj->uxrcd->refcnt);
 
@@ -2846,10 +2968,16 @@ ssize_t ib_uverbs_create_ah(struct ib_uverbs_file *file,
 	memset(&attr.dmac, 0, sizeof(attr.dmac));
 	memcpy(attr.grh.dgid.raw, cmd.attr.grh.dgid, 16);
 
+	ret = ib_rdmacg_try_charge(&uobj->cg_obj, pd->device,
+				   RDMACG_RESOURCE_POOL_VERB,
+				   RDMA_VERB_RESOURCE_AH, 1);
+	if (ret)
+		goto err_put;
+
 	ah = ib_create_ah(pd, &attr);
 	if (IS_ERR(ah)) {
 		ret = PTR_ERR(ah);
-		goto err_put;
+		goto err_create;
 	}
 
 	ah->uobject  = uobj;
@@ -2885,6 +3013,11 @@ err_copy:
 err_destroy:
 	ib_destroy_ah(ah);
 
+err_create:
+	ib_rdmacg_uncharge(&uobj->cg_obj, pd->device,
+			   RDMACG_RESOURCE_POOL_VERB,
+			   RDMA_VERB_RESOURCE_AH, 1);
+
 err_put:
 	put_pd_read(pd);
 
@@ -2899,6 +3032,7 @@ ssize_t ib_uverbs_destroy_ah(struct ib_uverbs_file *file,
 {
 	struct ib_uverbs_destroy_ah cmd;
 	struct ib_ah		   *ah;
+	struct ib_pd		   *pd;
 	struct ib_uobject	   *uobj;
 	int			    ret;
 
@@ -2909,6 +3043,7 @@ ssize_t ib_uverbs_destroy_ah(struct ib_uverbs_file *file,
 	if (!uobj)
 		return -EINVAL;
 	ah = uobj->object;
+	pd = ah->pd;
 
 	ret = ib_destroy_ah(ah);
 	if (!ret)
@@ -2919,6 +3054,10 @@ ssize_t ib_uverbs_destroy_ah(struct ib_uverbs_file *file,
 	if (ret)
 		return ret;
 
+	ib_rdmacg_uncharge(&uobj->cg_obj, pd->device,
+			   RDMACG_RESOURCE_POOL_VERB,
+			   RDMA_VERB_RESOURCE_AH, 1);
+
 	idr_remove_uobj(&ib_uverbs_ah_idr, uobj);
 
 	mutex_lock(&file->mutex);
@@ -3171,10 +3310,17 @@ int ib_uverbs_ex_create_flow(struct ib_uverbs_file *file,
 		err = -EINVAL;
 		goto err_free;
 	}
+
+	err = ib_rdmacg_try_charge(&uobj->cg_obj, qp->pd->device,
+				   RDMACG_RESOURCE_POOL_VERB,
+				   RDMA_VERB_RESOURCE_FLOW, 1);
+	if (err)
+		goto err_free;
+
 	flow_id = ib_create_flow(qp, flow_attr, IB_FLOW_DOMAIN_USER);
 	if (IS_ERR(flow_id)) {
 		err = PTR_ERR(flow_id);
-		goto err_free;
+		goto err_create;
 	}
 	flow_id->qp = qp;
 	flow_id->uobject = uobj;
@@ -3208,6 +3354,10 @@ err_copy:
 	idr_remove_uobj(&ib_uverbs_rule_idr, uobj);
 destroy_flow:
 	ib_destroy_flow(flow_id);
+err_create:
+	ib_rdmacg_uncharge(&uobj->cg_obj, qp->pd->device,
+			   RDMACG_RESOURCE_POOL_VERB,
+			   RDMA_VERB_RESOURCE_FLOW, 1);
 err_free:
 	kfree(flow_attr);
 err_put:
@@ -3228,6 +3378,7 @@ int ib_uverbs_ex_destroy_flow(struct ib_uverbs_file *file,
 	struct ib_uverbs_destroy_flow	cmd;
 	struct ib_flow			*flow_id;
 	struct ib_uobject		*uobj;
+	struct ib_pd			*pd;
 	int				ret;
 
 	if (ucore->inlen < sizeof(cmd))
@@ -3245,11 +3396,16 @@ int ib_uverbs_ex_destroy_flow(struct ib_uverbs_file *file,
 	if (!uobj)
 		return -EINVAL;
 	flow_id = uobj->object;
+	pd = flow_id->qp->pd;
 
 	ret = ib_destroy_flow(flow_id);
 	if (!ret)
 		uobj->live = 0;
 
+	ib_rdmacg_uncharge(&uobj->cg_obj, pd->device,
+			   RDMACG_RESOURCE_POOL_VERB,
+			   RDMA_VERB_RESOURCE_FLOW, 1);
+
 	put_uobj_write(uobj);
 
 	idr_remove_uobj(&ib_uverbs_rule_idr, uobj);
@@ -3316,6 +3472,12 @@ static int __uverbs_create_xsrq(struct ib_uverbs_file *file,
 	obj->uevent.events_reported = 0;
 	INIT_LIST_HEAD(&obj->uevent.event_list);
 
+	ret = ib_rdmacg_try_charge(&obj->uevent.uobject.cg_obj, pd->device,
+				   RDMACG_RESOURCE_POOL_VERB,
+				   RDMA_VERB_RESOURCE_SRQ, 1);
+	if (ret)
+		goto err_put_cq;
+
 	srq = pd->device->create_srq(pd, &attr, udata);
 	if (IS_ERR(srq)) {
 		ret = PTR_ERR(srq);
@@ -3380,6 +3542,9 @@ err_destroy:
 	ib_destroy_srq(srq);
 
 err_put:
+	ib_rdmacg_uncharge(&obj->uevent.uobject.cg_obj, pd->device,
+			   RDMACG_RESOURCE_POOL_VERB,
+			   RDMA_VERB_RESOURCE_SRQ, 1);
 	put_pd_read(pd);
 
 err_put_cq:
@@ -3540,6 +3705,7 @@ ssize_t ib_uverbs_destroy_srq(struct ib_uverbs_file *file,
 	struct ib_uverbs_destroy_srq_resp resp;
 	struct ib_uobject		 *uobj;
 	struct ib_srq               	 *srq;
+	struct ib_pd                     *pd;
 	struct ib_uevent_object        	 *obj;
 	int                         	  ret = -EINVAL;
 	struct ib_usrq_object		 *us;
@@ -3554,6 +3720,7 @@ ssize_t ib_uverbs_destroy_srq(struct ib_uverbs_file *file,
 	srq = uobj->object;
 	obj = container_of(uobj, struct ib_uevent_object, uobject);
 	srq_type = srq->srq_type;
+	pd = srq->pd;
 
 	ret = ib_destroy_srq(srq);
 	if (!ret)
@@ -3564,6 +3731,10 @@ ssize_t ib_uverbs_destroy_srq(struct ib_uverbs_file *file,
 	if (ret)
 		return ret;
 
+	ib_rdmacg_uncharge(&uobj->cg_obj, pd->device,
+			   RDMACG_RESOURCE_POOL_VERB,
+			   RDMA_VERB_RESOURCE_SRQ, 1);
+
 	if (srq_type == IB_SRQT_XRC) {
 		us = container_of(obj, struct ib_usrq_object, uevent);
 		atomic_dec(&us->uxrcd->refcnt);
@@ -3597,6 +3768,7 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
 	struct ib_uverbs_ex_query_device_resp resp;
 	struct ib_uverbs_ex_query_device  cmd;
 	struct ib_device_attr attr;
+	int    limits[RDMA_VERB_RESOURCE_MAX];
 	int err;
 
 	if (ucore->inlen < sizeof(cmd))
@@ -3623,7 +3795,14 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
 	if (err)
 		return err;
 
-	copy_query_dev_fields(file, ib_dev, &resp.base, &attr);
+	err = ib_rdmacg_query_limit(ib_dev,
+				    RDMACG_RESOURCE_POOL_VERB,
+				    limits, RDMA_VERB_RESOURCE_MAX);
+	if (err)
+		goto end;
+
+	copy_query_dev_fields(file, ib_dev, &resp.base, &attr, limits);
+
 	resp.comp_mask = 0;
 
 	if (ucore->outlen < resp.response_length + sizeof(resp.odp_caps))
diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index e3ef288..1d8292c 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -49,6 +49,7 @@
 #include <asm/uaccess.h>
 
 #include "uverbs.h"
+#include "core_priv.h"
 
 MODULE_AUTHOR("Roland Dreier");
 MODULE_DESCRIPTION("InfiniBand userspace verbs access");
@@ -214,6 +215,9 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
 	list_for_each_entry_safe(uobj, tmp, &context->ah_list, list) {
 		struct ib_ah *ah = uobj->object;
 
+		ib_rdmacg_uncharge(&uobj->cg_obj, ah->pd->device,
+				   RDMACG_RESOURCE_POOL_VERB,
+				   RDMA_VERB_RESOURCE_AH, 1);
 		idr_remove_uobj(&ib_uverbs_ah_idr, uobj);
 		ib_destroy_ah(ah);
 		kfree(uobj);
@@ -223,6 +227,9 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
 	list_for_each_entry_safe(uobj, tmp, &context->mw_list, list) {
 		struct ib_mw *mw = uobj->object;
 
+		ib_rdmacg_uncharge(&uobj->cg_obj, mw->pd->device,
+				   RDMACG_RESOURCE_POOL_VERB,
+				   RDMA_VERB_RESOURCE_MW, 1);
 		idr_remove_uobj(&ib_uverbs_mw_idr, uobj);
 		ib_dealloc_mw(mw);
 		kfree(uobj);
@@ -231,6 +238,9 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
 	list_for_each_entry_safe(uobj, tmp, &context->rule_list, list) {
 		struct ib_flow *flow_id = uobj->object;
 
+		ib_rdmacg_uncharge(&uobj->cg_obj, flow_id->qp->pd->device,
+				   RDMACG_RESOURCE_POOL_VERB,
+				   RDMA_VERB_RESOURCE_FLOW, 1);
 		idr_remove_uobj(&ib_uverbs_rule_idr, uobj);
 		ib_destroy_flow(flow_id);
 		kfree(uobj);
@@ -245,6 +255,9 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
 		if (qp != qp->real_qp) {
 			ib_close_qp(qp);
 		} else {
+			ib_rdmacg_uncharge(&uobj->cg_obj, qp->pd->device,
+					   RDMACG_RESOURCE_POOL_VERB,
+					   RDMA_VERB_RESOURCE_QP, 1);
 			ib_uverbs_detach_umcast(qp, uqp);
 			ib_destroy_qp(qp);
 		}
@@ -257,6 +270,9 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
 		struct ib_uevent_object *uevent =
 			container_of(uobj, struct ib_uevent_object, uobject);
 
+		ib_rdmacg_uncharge(&uobj->cg_obj, srq->pd->device,
+				   RDMACG_RESOURCE_POOL_VERB,
+				   RDMA_VERB_RESOURCE_SRQ, 1);
 		idr_remove_uobj(&ib_uverbs_srq_idr, uobj);
 		ib_destroy_srq(srq);
 		ib_uverbs_release_uevent(file, uevent);
@@ -269,6 +285,9 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
 		struct ib_ucq_object *ucq =
 			container_of(uobj, struct ib_ucq_object, uobject);
 
+		ib_rdmacg_uncharge(&uobj->cg_obj, cq->device,
+				   RDMACG_RESOURCE_POOL_VERB,
+				   RDMA_VERB_RESOURCE_CQ, 1);
 		idr_remove_uobj(&ib_uverbs_cq_idr, uobj);
 		ib_destroy_cq(cq);
 		ib_uverbs_release_ucq(file, ev_file, ucq);
@@ -278,6 +297,9 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
 	list_for_each_entry_safe(uobj, tmp, &context->mr_list, list) {
 		struct ib_mr *mr = uobj->object;
 
+		ib_rdmacg_uncharge(&uobj->cg_obj, mr->pd->device,
+				   RDMACG_RESOURCE_POOL_VERB,
+				   RDMA_VERB_RESOURCE_MR, 1);
 		idr_remove_uobj(&ib_uverbs_mr_idr, uobj);
 		ib_dereg_mr(mr);
 		kfree(uobj);
@@ -298,11 +320,17 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
 	list_for_each_entry_safe(uobj, tmp, &context->pd_list, list) {
 		struct ib_pd *pd = uobj->object;
 
+		ib_rdmacg_uncharge(&uobj->cg_obj, pd->device,
+				   RDMACG_RESOURCE_POOL_VERB,
+				   RDMA_VERB_RESOURCE_PD, 1);
 		idr_remove_uobj(&ib_uverbs_pd_idr, uobj);
 		ib_dealloc_pd(pd);
 		kfree(uobj);
 	}
 
+	ib_rdmacg_uncharge(&context->cg_obj, context->device,
+			   RDMACG_RESOURCE_POOL_VERB,
+			   RDMA_VERB_RESOURCE_UCTX, 1);
 	put_pid(context->tgid);
 
 	return context->device->dealloc_ucontext(context);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 9a68a19..e109752 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -55,6 +55,8 @@
 #include <linux/mmu_notifier.h>
 #include <asm/uaccess.h>
 
+#include <linux/cgroup_rdma.h>
+
 extern struct workqueue_struct *ib_wq;
 
 union ib_gid {
@@ -95,6 +97,19 @@ enum rdma_protocol_type {
 	RDMA_PROTOCOL_USNIC_UDP
 };
 
+enum rdma_resource_type {
+	RDMA_VERB_RESOURCE_UCTX,
+	RDMA_VERB_RESOURCE_AH,
+	RDMA_VERB_RESOURCE_PD,
+	RDMA_VERB_RESOURCE_CQ,
+	RDMA_VERB_RESOURCE_MR,
+	RDMA_VERB_RESOURCE_MW,
+	RDMA_VERB_RESOURCE_SRQ,
+	RDMA_VERB_RESOURCE_QP,
+	RDMA_VERB_RESOURCE_FLOW,
+	RDMA_VERB_RESOURCE_MAX,
+};
+
 __attribute_const__ enum rdma_transport_type
 rdma_node_get_transport(enum rdma_node_type node_type);
 
@@ -1231,6 +1246,12 @@ struct ib_fmr_attr {
 
 struct ib_umem;
 
+struct ib_rdmacg_object {
+#ifdef CONFIG_CGROUP_RDMA
+	struct rdma_cgroup	*cg;		/* owner rdma cgroup */
+#endif
+};
+
 struct ib_ucontext {
 	struct ib_device       *device;
 	struct list_head	pd_list;
@@ -1261,12 +1282,14 @@ struct ib_ucontext {
 	struct list_head	no_private_counters;
 	int                     odp_mrs_count;
 #endif
+	struct ib_rdmacg_object cg_obj;
 };
 
 struct ib_uobject {
 	u64			user_handle;	/* handle given to us by userspace */
 	struct ib_ucontext     *context;	/* associated user context */
 	void		       *object;		/* containing object */
+	struct ib_rdmacg_object cg_obj;
 	struct list_head	list;		/* link to context's list */
 	int			id;		/* index into kernel idr */
 	struct kref		ref;
@@ -1822,7 +1845,9 @@ struct ib_device {
 	u16                          is_switch:1;
 	u8                           node_type;
 	u8                           phys_port_cnt;
-
+#ifdef CONFIG_CGROUP_RDMA
+	struct rdmacg_device	     cg_device;
+#endif
 	/**
 	 * The following mandatory functions are used only at device
 	 * registration.  Keep functions such as these at the end of this
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCHv3 3/3] rdmacg: Added documentation for rdma controller
  2016-01-30 15:23 [PATCHv3 0/3] rdmacg: IB/core: rdma controller support Parav Pandit
  2016-01-30 15:23 ` [PATCHv3 1/3] rdmacg: Added rdma cgroup controller Parav Pandit
  2016-01-30 15:23 ` [PATCHv3 2/3] IB/core: added support to use " Parav Pandit
@ 2016-01-30 15:23 ` Parav Pandit
  2 siblings, 0 replies; 16+ messages in thread
From: Parav Pandit @ 2016-01-30 15:23 UTC (permalink / raw)
  To: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan,
	hannes, dledford, liranl, sean.hefty, jgunthorpe, haggaie
  Cc: corbet, james.l.morris, serge, ogerlitz, matanb, raindel, akpm,
	linux-security-module, pandit.parav

Added documentation for the rdma controller, covering use in v1 mode
and in the new unified hierarchy (v2) mode.

Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
---
 Documentation/cgroup-v1/rdma.txt | 122 +++++++++++++++++++++++++++++++++++++++
 Documentation/cgroup-v2.txt      |  43 ++++++++++++++
 2 files changed, 165 insertions(+)
 create mode 100644 Documentation/cgroup-v1/rdma.txt

diff --git a/Documentation/cgroup-v1/rdma.txt b/Documentation/cgroup-v1/rdma.txt
new file mode 100644
index 0000000..9519911
--- /dev/null
+++ b/Documentation/cgroup-v1/rdma.txt
@@ -0,0 +1,122 @@
+				RDMA Controller
+				----------------
+
+Contents
+--------
+
+1. Overview
+  1-1. What is RDMA controller?
+  1-2. Why is RDMA controller needed?
+  1-3. How is RDMA controller implemented?
+2. Usage Examples
+
+1. Overview
+
+1-1. What is RDMA controller?
+-----------------------------
+
+The RDMA controller allows the user to limit the RDMA/IB specific
+resources that a given set of processes can use. These processes are
+grouped using the RDMA controller.
+
+The RDMA controller currently allows two different types of resource
+pools:
+(a) RDMA IB specification level verb resources defined by the IB stack
+(b) HCA vendor device specific resources
+
+The RDMA controller allows a maximum of up to 64 resources in
+a resource pool, which is an internal construct of the rdma cgroup
+explained later in this document.
+
+1-2. Why is RDMA controller needed?
+-----------------------------------
+
+Currently user space applications can easily exhaust all the rdma device
+specific resources such as AH, CQ, QP, MR etc. As a result, applications
+in other cgroups or kernel space ULPs may not even get a chance to
+allocate any rdma resources. This leads to service unavailability.
+
+Therefore the RDMA controller is needed, through which the resource
+consumption of processes can be limited. Through this controller the
+various rdma resources described by the IB uverbs layer and any HCA
+vendor driver can be accounted.
+
+1-3. How is RDMA controller implemented?
+----------------------------------------
+
+The RDMA cgroup allows limit configuration of resources. These resources are
+not defined by the rdma controller. Instead they are defined by the IB stack
+and, optionally, by the HCA device drivers.
+This provides great flexibility, allowing the IB stack to define new
+resources without any changes to the rdma cgroup.
+The rdma cgroup maintains resource accounting per cgroup, per device, per
+resource type using a resource pool structure. Each such resource pool is
+limited to 64 resources by the rdma cgroup, which can be extended later if
+required.
+
+This resource pool object is linked to the cgroup css. Typically there
+are 0 to 4 resource pool instances per cgroup, per device in most use cases.
+But nothing prevents having more. At present hundreds of RDMA devices per
+single cgroup may not be handled optimally, however there is no known use
+case for such a configuration either.
+
+Since RDMA resources can be allocated from any process and can be freed by any
+of the child processes which share the address space, rdma resources are
+always owned by the creator cgroup css. This allows process migration from
+one cgroup to another without the major complexity of transferring resource
+ownership; such ownership is not really present anyway due to the shared
+nature of rdma resources. Linking resources to the css also ensures that
+cgroups can be deleted after processes have migrated. This allows process
+migration even with active resources, though that is not the primary use case.
+
+Whenever RDMA resource charging occurs, the owner rdma cgroup is returned to
+the caller. The same rdma cgroup should be passed while uncharging the
+resource. This allows a process migrated with active RDMA resources to charge
+new resources to the new owner cgroup, and to uncharge resources of such a
+migrated process from the previously charged cgroup, even though that is not
+a primary use case.
+
+A resource pool object is created in the following situations.
+(a) The user sets a limit and no previous resource pool exists for the device
+of interest for the cgroup.
+(b) No resource limits were configured, but the IB/RDMA stack tries to
+charge a resource. A default resource pool is used here so that resources
+charged while no limits were configured can be correctly uncharged later,
+even after limits are enforced; otherwise the usage count would drop below
+zero. Instead of implementing any sort of time markers, the default pool
+simplifies the design.
+
+A resource pool is destroyed if it is of default type (not created
+by an administrative operation) and it holds the last resource getting
+deallocated. A resource pool created by an administrative operation is not
+deleted, as it is expected to be used in the near future.
+
+If the user tries to delete all the resource limits for a device that
+still has active resources, the RDMA cgroup just marks the pool as a
+default pool with maximum limits for each resource; otherwise it deletes
+the default resource pool.
+
+2. Usage Examples
+-----------------
+
+(a) Configure resource limit:
+echo mlx4_0 mr=100 qp=10 ah=2 > /sys/fs/cgroup/rdma/1/rdma.verb.max
+echo ocrdma1 mr=120 qp=20 cq=10 > /sys/fs/cgroup/rdma/2/rdma.verb.max
+
+(b) Query resource limit:
+cat /sys/fs/cgroup/rdma/2/rdma.verb.max
+#Output:
+mlx4_0 mr=100 qp=10 ah=2
+ocrdma1 mr=120 qp=20 cq=10
+
+(c) Query current usage:
+cat /sys/fs/cgroup/rdma/2/rdma.verb.current
+#Output:
+mlx4_0 mr=95 qp=8 ah=2
+ocrdma1 mr=0 qp=20 cq=10
+
+(d) Delete resource limit:
+echo mlx4_0 remove > /sys/fs/cgroup/rdma/1/rdma.verb.max
+
+(e) Configure hw specific resource limit: (optional)
+echo vendor1 hw_qp=56 > /sys/fs/cgroup/rdma/2/rdma.hw.max
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 31d1f7b..6741529 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -47,6 +47,8 @@ CONTENTS
   5-3. IO
     5-3-1. IO Interface Files
     5-3-2. Writeback
+  5-4. RDMA
+    5-4-1. RDMA Interface Files
 P. Information on Kernel Programming
   P-1. Filesystem Support for Writeback
 D. Deprecated v1 Core Features
@@ -1012,6 +1014,47 @@ writeback as follows.
 	total available memory and applied the same way as
 	vm.dirty[_background]_ratio.
 
+5-4. RDMA
+
+The "rdma" controller regulates the distribution of RDMA resources.
+This controller implements both RDMA/IB verb level and RDMA HCA
+driver level resource distribution.
+
+5-4-1. RDMA Interface Files
+
+  rdma.verb.max
+	A read-write file that exists for all the cgroups except the root,
+	describing the currently configured verb resource limits for an RDMA/IB device.
+
+	Lines are keyed by device name and are not ordered.
+	Each line contains a space separated resource name and its configured
+	limit that can be distributed.
+
+	An example for mlx4 and ocrdma devices follows.
+
+	  mlx4_0 mr=1000 qp=104 ah=2
+	  ocrdma1 mr=900 qp=89 cq=10
+
+  rdma.verb.current
+	A read-only file that describes the current resource usage.
+	It exists for all the cgroups including the root.
+
+	An example for mlx4 and ocrdma devices follows.
+
+	  mlx4_0 mr=1000 qp=102 ah=2 flow=10 srq=0
+	  ocrdma1 mr=900 qp=79 cq=10 flow=0 srq=0
+
+  rdma.hw.max
+	A read-write file that exists for all the cgroups except the root,
+	describing the currently configured HCA hardware resource limits
+	for an RDMA/IB device.
+
+	Lines are keyed by device name and are not ordered.
+	Each line contains space separated resource name and its configured
+	limit that can be distributed.
+
+  rdma.hw.current
+	A read-only file that describes current resource usage.
 
 P. Information on Kernel Programming
 
-- 
1.8.3.1



* Re: [PATCHv3 1/3] rdmacg: Added rdma cgroup controller.
  2016-01-30 15:23 ` [PATCHv3 1/3] rdmacg: Added rdma cgroup controller Parav Pandit
@ 2016-01-30 18:30   ` Tejun Heo
  2016-01-30 20:44     ` Parav Pandit
  0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2016-01-30 18:30 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan, hannes,
	dledford, liranl, sean.hefty, jgunthorpe, haggaie, corbet,
	james.l.morris, serge, ogerlitz, matanb, raindel, akpm,
	linux-security-module

Hello,

On Sat, Jan 30, 2016 at 08:53:05PM +0530, Parav Pandit wrote:
> +#ifdef CONFIG_CGROUP_RDMA
> +#define RDMACG_MAX_RESOURCE_INDEX (64)

The list of resources are known at compile time.  Why is this separate
fixed number necessary?

> +struct match_token;

There's no circular dependency issue here.  Include the appropriate
header.

> +struct rdmacg_device;
> +
> +struct rdmacg_pool_info {
> +	struct match_token *resource_table;
> +	int resource_count;
> +};
> +
> +struct rdmacg_resource_pool_ops {
> +	struct rdmacg_pool_info*
> +		(*get_resource_pool_tokens)(struct rdmacg_device *);
> +};

Why does it need external op table?  The types of resources are gonna
be fixed at compile time.  The only thing necessary is each device
declaring which resources are to be used.

> +struct rdmacg_device {
> +	struct rdmacg_resource_pool_ops
> +				*rpool_ops[RDMACG_RESOURCE_POOL_TYPE_MAX];
> +	struct list_head        rdmacg_list;
> +	char                    *name;
> +};
> +
> +/* APIs for RDMA/IB stack to publish when a device wants to
> + * participate in resource accounting
> + */

Please read CodingStyle.

> +config CGROUP_RDMA
> +	bool "RDMA controller"
> +	help
> +	  Provides enforcement of RDMA resources at RDMA/IB verb level and
> +	  enforcement of any RDMA/IB capable hardware advertized resources.
                                                      ^^^^^^^^^?
> +	  Its fairly easy for applications to exhaust RDMA resources, which
          ^^^
> +	  can result into kernel consumers or other application consumers of
                     ^ in ^ just say "consumers"?
> +	  RDMA resources left with no resources. RDMA controller is designed
          ^ The sentence doesn't read well.
> +	  to stop this from happening.
> +	  Attaching existing processes with active RDMA resources to the cgroup
> +	  hierarchy will be allowed even if can cross the hierarchy's limit.
                                         ^^^^^?

> +#define RDMACG_USR_CMD_REMOVE "remove"

Why is this necessary?

> +/* resource tracker per resource for rdma cgroup */
> +struct cg_resource {
> +	int max;
> +	atomic_t usage;
> +};

rdmacg_resource?  Also, wouldn't it be better to use u64?

> +/**
> + * pool type indicating either it got created as part of default
> + * operation or user has configured the group.
> + * Depends on the creator of the pool, its decided to free up
> + * later or not.
> + */
> +enum rpool_creator {
> +	RDMACG_RPOOL_CREATOR_DEFAULT,
> +	RDMACG_RPOOL_CREATOR_USR,
> +};

Why does this matter?

> +/**
> + * resource pool object which represents, per cgroup, per device,
> + * per resource pool_type resources. There are multiple instance
> + * of this object per cgroup, therefore it cannot be embedded within
> + * rdma_cgroup structure. Its maintained as list.
> + */
> +struct cg_resource_pool {
> +	struct list_head cg_list;
> +	struct rdmacg_device *device;
> +	enum rdmacg_resource_pool_type type;
> +
> +	struct cg_resource *resources;
> +
> +	atomic_t refcnt;	/* count active user tasks of this pool */
> +	atomic_t creator;	/* user created or default type */

Why the hell would creator need to be an atomic?  You're just using
set and read on the field.  What's going on?

> +static struct rdmacg_resource_pool_ops*
> +	get_pool_ops(struct rdmacg_device *device,
> +		     enum rdmacg_resource_pool_type pool_type)

static struct rdmacg_resource_pool_ops *
get_pool_ops(...)

> +{
> +	return device->rpool_ops[pool_type];
> +}
...
> +static void _dealloc_cg_rpool(struct rdma_cgroup *cg,
> +			      struct cg_resource_pool *rpool)
> +{

Ugh... please refrain from single underscore prefixed names.  It's
best to give it a proper name.  e.g. if the function assumes lock is
grabbed by the user use _locked postfix and so on.

> +	spin_lock(&cg->cg_list_lock);
> +
> +	/* if its started getting used by other task, before we take the
> +	 * spin lock, then skip freeing it.
> +	 */

Again, CodingStyle.

> +static void dealloc_cg_rpool(struct rdma_cgroup *cg,
> +			     struct cg_resource_pool *rpool)
> +{
> +	/* Don't free the resource pool which is created by the
> +	 * user, otherwise we lose the configured limits. We don't
> +	 * gain much either by splitting storage of limit and usage.
> +	 * So keep it around until user deletes the limits.
> +	 */
> +	if (atomic_read(&rpool->creator) == RDMACG_RPOOL_CREATOR_DEFAULT)
> +		_dealloc_cg_rpool(cg, rpool);
> +}

The config is tied to the device.  If the device goes away, all its
pools go away.  If the device stays, all configs stay.

> +static struct cg_resource_pool*
> +	find_cg_rpool(struct rdma_cgroup *cg,
> +		      struct rdmacg_device *device,
> +		      enum rdmacg_resource_pool_type type)
> +
> +{
> +	struct cg_resource_pool *pool;
> +

lockdep_assert_held(...)

> +	list_for_each_entry(pool, &cg->rpool_head, cg_list)
> +		if (pool->device == device && pool->type == type)
> +			return pool;
> +
> +	return NULL;
> +}
> +
> +static struct cg_resource_pool*
> +	_get_cg_rpool(struct rdma_cgroup *cg,
> +		      struct rdmacg_device *device,
> +		      enum rdmacg_resource_pool_type type)
> +{
> +	struct cg_resource_pool *rpool;
> +
> +	spin_lock(&cg->cg_list_lock);
> +	rpool = find_cg_rpool(cg, device, type);
> +	if (rpool)
> +		goto found;
> +found:

That's one extremely silly way to write noop.

> +	spin_unlock(&cg->cg_list_lock);
> +	return rpool;
> +}

This function doesn't make any sense.  Push locking to the caller and
use find_cg_rpool().

> +static struct cg_resource_pool*
> +	get_cg_rpool(struct rdma_cgroup *cg,
> +		     struct rdmacg_device *device,
> +		     enum rdmacg_resource_pool_type type)
> +{
> +	struct cg_resource_pool *rpool, *other_rpool;
> +	struct rdmacg_pool_info *pool_info;
> +	struct rdmacg_resource_pool_ops *ops;
> +	int ret = 0;
> +
> +	spin_lock(&cg->cg_list_lock);
> +	rpool = find_cg_rpool(cg, device, type);
> +	if (rpool) {
> +		atomic_inc(&rpool->refcnt);
> +		spin_unlock(&cg->cg_list_lock);
> +		return rpool;
> +	}

Why does it need refcnting?  Things stay if the device is there.
Things go away if the device goes away.  No?  Can device go away while
resources are allocated?

> +	spin_unlock(&cg->cg_list_lock);
> +
> +	/* ops cannot be NULL at this stage, as caller made to charge/get
> +	 * the resource pool being aware of such need and invoking with
> +	 * because it has setup resource pool ops.
> +	 */
> +	ops = get_pool_ops(device, type);
> +	pool_info = ops->get_resource_pool_tokens(device);
> +	if (!pool_info) {
> +		ret = -EINVAL;
> +		goto err;
> +	}

Please just do

enum {
	RDMA_RES_VERB_A,
	RDMA_RES_VERB_B,
	...
	RDMA_RES_VERB_MAX,

	RDMA_RES_HW_BASE = RDMA_RES_VERB_MAX,
	RDMA_RES_HW_A = RDMA_RES_HW_BASE,
	RDMA_RES_HW_B,
	...
	RDMA_RES_HW_MAX,
};

static const char rdma_res_name[] = {
	[RDMA_RES_VERB_A]	= "...",
	...
};

And then let each device specify bitmap of resources that it supports
on registration.

> +	if (pool_info->resource_count == 0 ||
> +	    pool_info->resource_count > RDMACG_MAX_RESOURCE_INDEX) {
> +		ret = -EINVAL;
> +		goto err;
> +	}
> +
> +	/* allocate resource pool */
> +	rpool = alloc_cg_rpool(cg, device, pool_info->resource_count, type);
> +	if (IS_ERR_OR_NULL(rpool))
> +		return rpool;
> +
> +	/* cgroup lock is held to synchronize with multiple
> +	 * resource pool creation in parallel.
> +	 */
> +	spin_lock(&cg->cg_list_lock);
> +	other_rpool = find_cg_rpool(cg, device, type);
> +	/* if other task added resource pool for this device for this cgroup
> +	 * free up which was recently created and use the one we found.
> +	 */
> +	if (other_rpool) {
> +		atomic_inc(&other_rpool->refcnt);
> +		spin_unlock(&cg->cg_list_lock);
> +		_free_cg_rpool(rpool);
> +		return other_rpool;
> +	}
> +
> +	atomic_inc(&rpool->refcnt);
> +	list_add_tail(&rpool->cg_list, &cg->rpool_head);
> +
> +	spin_unlock(&cg->cg_list_lock);
> +	return rpool;
> +
> +err:
> +	spin_unlock(&cg->cg_list_lock);
> +	return ERR_PTR(ret);
> +}

You're grabbing the lock anyway.  Why bother with atomics at all?
Just grab the lock on entry, look up the entry and then do normal
integer ops.

The whole thing is still way more complex than it needs to be.  Please
slim it down.  It really shouldn't take half as much code.  Keep it
simple, please.

Thanks.

-- 
tejun


* Re: [PATCHv3 1/3] rdmacg: Added rdma cgroup controller.
  2016-01-30 18:30   ` Tejun Heo
@ 2016-01-30 20:44     ` Parav Pandit
  2016-01-31 10:02       ` Tejun Heo
  0 siblings, 1 reply; 16+ messages in thread
From: Parav Pandit @ 2016-01-30 20:44 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Haggai Eran, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

Hi Tejun,

On Sun, Jan 31, 2016 at 12:00 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Sat, Jan 30, 2016 at 08:53:05PM +0530, Parav Pandit wrote:
>> +#ifdef CONFIG_CGROUP_RDMA
>> +#define RDMACG_MAX_RESOURCE_INDEX (64)
>
> The list of resources are known at compile time.  Why is this separate
> fixed number necessary?
>
It's not necessary. It was carried forward from the older design. I will
remove it in v4.

>> +struct match_token;
>
> There's no circular dependency issue here.  Include the appropriate
> header.

o.k. I will fix it.

>
>> +struct rdmacg_device;
>> +
>> +struct rdmacg_pool_info {
>> +     struct match_token *resource_table;
>> +     int resource_count;
>> +};
>> +
>> +struct rdmacg_resource_pool_ops {
>> +     struct rdmacg_pool_info*
>> +             (*get_resource_pool_tokens)(struct rdmacg_device *);
>> +};
>
> Why does it need external op table?  The types of resources are gonna
> be fixed at compile time.
IB modules are loadable kernel modules which define the resources, so
the array size and resource names are not known at compile time of the
rdma cgroup code.
The kernel code also cannot depend on a loadable module header via
inclusion as in the v2 patch; that leads to a crash when loadable
modules are upgraded.
The v1 patch had the IB resources defined in the header file of rdmacg,
which I believe is a very restrictive model with an evolving rdma stack
and features.
Therefore they will be kept as defined in the v3 patch in the IB headers
(not compile time for the rdma cgroup), so the supporting infrastructure
APIs will continue.

> The only thing necessary is each device
> declaring which resources are to be used.
>
That's what the rpool_ops structure does: it allows querying the name
strings and their types by utilizing the match tokens.


>> +struct rdmacg_device {
>> +     struct rdmacg_resource_pool_ops
>> +                             *rpool_ops[RDMACG_RESOURCE_POOL_TYPE_MAX];
>> +     struct list_head        rdmacg_list;
>> +     char                    *name;
>> +};
>> +
>> +/* APIs for RDMA/IB stack to publish when a device wants to
>> + * participate in resource accounting
>> + */
>
> Please read CodingStyle.
>
Sorry about that. I will add the initial blank line. Coming from a
driver background, I was avoiding it.

>> +config CGROUP_RDMA
>> +     bool "RDMA controller"
>> +     help
>> +       Provides enforcement of RDMA resources at RDMA/IB verb level and
>> +       enforcement of any RDMA/IB capable hardware advertized resources.
>                                                       ^^^^^^^^^?
>> +       Its fairly easy for applications to exhaust RDMA resources, which
>           ^^^
>> +       can result into kernel consumers or other application consumers of
>                      ^ in ^ just say "consumers"?
>> +       RDMA resources left with no resources. RDMA controller is designed
>           ^ The sentence doesn't read well.
>> +       to stop this from happening.
>> +       Attaching existing processes with active RDMA resources to the cgroup
>> +       hierarchy will be allowed even if can cross the hierarchy's limit.
>                                          ^^^^^?
>
Fixed them. Please review them in next patch.

>> +#define RDMACG_USR_CMD_REMOVE "remove"
>
> Why is this necessary?
>
The user can unset the configured limit by writing "remove" for a given
device, instead of writing max values for all the resources.
As I explained in the cover note and another comment, when it is marked
for removal, the resource pool is marked as of default type so that it
can be freed. Until then, even when all resources are freed, we don't
free the rpool because it holds the configured limit.

>> +/* resource tracker per resource for rdma cgroup */
>> +struct cg_resource {
>> +     int max;
>> +     atomic_t usage;
>> +};
>
> rdmacg_resource?  Also, wouldn't it be better to use u64?
>
I will change it to rdmacg_resource. At present we have 24-bit wide
resources. 64-bit might not be needed in the near future. If input on
this comes from Intel/Mellanox I will bump it to 64-bit.
Otherwise I prefer to keep it 32-bit.

>> +/**
>> + * pool type indicating either it got created as part of default
>> + * operation or user has configured the group.
>> + * Depends on the creator of the pool, its decided to free up
>> + * later or not.
>> + */
>> +enum rpool_creator {
>> +     RDMACG_RPOOL_CREATOR_DEFAULT,
>> +     RDMACG_RPOOL_CREATOR_USR,
>> +};
>
> Why does this matter?
As you noted in a later comment and as I explained above, a resource
pool marked as user-created is not freed until either the device goes
away or the user clears the configuration.

>
>> +/**
>> + * resource pool object which represents, per cgroup, per device,
>> + * per resource pool_type resources. There are multiple instance
>> + * of this object per cgroup, therefore it cannot be embedded within
>> + * rdma_cgroup structure. Its maintained as list.
>> + */
>> +struct cg_resource_pool {
>> +     struct list_head cg_list;
>> +     struct rdmacg_device *device;
>> +     enum rdmacg_resource_pool_type type;
>> +
>> +     struct cg_resource *resources;
>> +
>> +     atomic_t refcnt;        /* count active user tasks of this pool */
>> +     atomic_t creator;       /* user created or default type */
>
> Why the hell would creator need to be an atomic?  You're just using
> set and read on the field.  What's going on?

Yes, I can make creator non-atomic.

> static struct rdmacg_resource_pool_ops *
> get_pool_ops(...)
ok. Will fix it.

>
>> +{
>> +     return device->rpool_ops[pool_type];
>> +}
> ...
>> +static void _dealloc_cg_rpool(struct rdma_cgroup *cg,
>> +                           struct cg_resource_pool *rpool)
>> +{
>
> Ugh... please refrain from single underscore prefixed names.  It's
> best to give it a proper name.  e.g. if the function assumes lock is
> grabbed by the user use _locked postfix and so on.
>
o.k. For this particular one, I merged them to single function.

>> +     spin_lock(&cg->cg_list_lock);
>> +
>> +     /* if its started getting used by other task, before we take the
>> +      * spin lock, then skip freeing it.
>> +      */
>
> Again, CodingStyle.

I will fix it.

>
>> +static void dealloc_cg_rpool(struct rdma_cgroup *cg,
>> +                          struct cg_resource_pool *rpool)
>> +{
>> +     /* Don't free the resource pool which is created by the
>> +      * user, otherwise we lose the configured limits. We don't
>> +      * gain much either by splitting storage of limit and usage.
>> +      * So keep it around until user deletes the limits.
>> +      */
>> +     if (atomic_read(&rpool->creator) == RDMACG_RPOOL_CREATOR_DEFAULT)
>> +             _dealloc_cg_rpool(cg, rpool);
>> +}
>
> The config is tied to the device.  If the device goes away, all its
> pools go away.  If the device stays, all configs stay.
>
The config stays. But even when the device is there, we don't create
resource pools for all the created cgroups; this little extra code
saves that memory.

>> +static struct cg_resource_pool*
>> +     find_cg_rpool(struct rdma_cgroup *cg,
>> +                   struct rdmacg_device *device,
>> +                   enum rdmacg_resource_pool_type type)
>> +
>> +{
>> +     struct cg_resource_pool *pool;
>> +
>
> lockdep_assert_held(...)
>
>> +     list_for_each_entry(pool, &cg->rpool_head, cg_list)
>> +             if (pool->device == device && pool->type == type)
>> +                     return pool;
>> +
>> +     return NULL;
>> +}
>> +
>> +static struct cg_resource_pool*
>> +     _get_cg_rpool(struct rdma_cgroup *cg,
>> +                   struct rdmacg_device *device,
>> +                   enum rdmacg_resource_pool_type type)
>> +{
>> +     struct cg_resource_pool *rpool;
>> +
>> +     spin_lock(&cg->cg_list_lock);
>> +     rpool = find_cg_rpool(cg, device, type);
>> +     if (rpool)
>> +             goto found;
>> +found:
>
> That's one extremely silly way to write noop.
>
>> +     spin_unlock(&cg->cg_list_lock);
>> +     return rpool;
>> +}
>
> This function doesn't make any sense.  Push locking to the caller and
> use find_cg_rpool().

This is a nice wrapper around find_cg_rpool() that keeps the code
clean. I would like to keep it for readability.
The if (rpool) check can be removed, because for this function it is
always going to be true.
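For contrast, the pattern Tejun is asking for can be sketched in userspace as follows; a pthread mutex stands in for the kernel spinlock and every name here is hypothetical. The bare find helper documents its locking requirement, and the caller holds the lock across both lookup and use, so the object cannot disappear in between.

```c
#include <pthread.h>
#include <stddef.h>
#include <string.h>

struct pool {
	const char *name;
	struct pool *next;
	int usage;
};

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct pool *pool_head;

/* Caller must hold list_lock (the kernel version would verify this
 * with lockdep_assert_held()). */
static struct pool *find_pool(const char *name)
{
	struct pool *p;

	for (p = pool_head; p; p = p->next)
		if (strcmp(p->name, name) == 0)
			return p;
	return NULL;
}

/* Lock, look up, operate on the object, then unlock: the lookup result
 * is only used while the lock is still held. */
static int charge_pool(const char *name)
{
	struct pool *p;
	int ret = -1;

	pthread_mutex_lock(&list_lock);
	p = find_pool(name);
	if (p) {
		p->usage++;
		ret = 0;
	}
	pthread_mutex_unlock(&list_lock);
	return ret;
}
```

The point of the shape is that no lock-only-around-lookup wrapper exists at all.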

>
>> +static struct cg_resource_pool*
>> +     get_cg_rpool(struct rdma_cgroup *cg,
>> +                  struct rdmacg_device *device,
>> +                  enum rdmacg_resource_pool_type type)
>> +{
>> +     struct cg_resource_pool *rpool, *other_rpool;
>> +     struct rdmacg_pool_info *pool_info;
>> +     struct rdmacg_resource_pool_ops *ops;
>> +     int ret = 0;
>> +
>> +     spin_lock(&cg->cg_list_lock);
>> +     rpool = find_cg_rpool(cg, device, type);
>> +     if (rpool) {
>> +             atomic_inc(&rpool->refcnt);
>> +             spin_unlock(&cg->cg_list_lock);
>> +             return rpool;
>> +     }
>
> Why does it need refcnting?  Things stay if the device is there.
> Things go away if the device goes away.  No?

No. If there is one device and 100 cgroups, we create a resource pool
only when a process actually wants to perform charging (instead of
creating 100 resource pools just because the cgroups exist).
The reference count of the rpool ensures that when the last resource is
freed, the resource pool is freed too, if it was allocated as a default
pool. If the user has configured the pool, then it stays (until the
device goes away).

> Can device go away while resources are allocated?
No. That's ensured by the IB stack, which follows the correct destruction sequence.

>
>> +     spin_unlock(&cg->cg_list_lock);
>> +
>> +     /* ops cannot be NULL at this stage, as caller made to charge/get
>> +      * the resource pool being aware of such need and invoking with
>> +      * because it has setup resource pool ops.
>> +      */
>> +     ops = get_pool_ops(device, type);
>> +     pool_info = ops->get_resource_pool_tokens(device);
>> +     if (!pool_info) {
>> +             ret = -EINVAL;
>> +             goto err;
>> +     }
>
> Please just do
>
> enum {
>         RDMA_RES_VERB_A,
>         RDMA_RES_VERB_B,
>         ...
>         RDMA_RES_VERB_MAX,
>
>         RDMA_RES_HW_BASE = RDMA_RES_VERB_MAX,
>         RDMA_RES_HW_A = RDMA_RES_HW_BASE,
>         RDMA_RES_HW_B,
>         ...
>         RDMA_RES_HW_MAX,
> };
>
> static const char *rdma_res_name[] = {
>         [RDMA_RES_VERB_A]       = "...",
>         ...
> };

What you have described is done in a slightly different way in the
loadable kernel module, as explained earlier, to let it be defined by
the IB stack.
Otherwise this needs to be defined in the rdma cgroup header file as in
my v0 patch, which I certainly want to avoid.
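For illustration only, the enum-plus-name-table scheme quoted above could look like the following in runnable userspace form, with each device declaring the resources it supports via a bitmap on registration. The specific resource names and the rdmacg_device layout are assumptions, not anything from the patch.

```c
#include <stdbool.h>

enum {
	RDMA_RES_VERB_QP,
	RDMA_RES_VERB_CQ,
	RDMA_RES_VERB_MR,
	RDMA_RES_VERB_MAX,

	RDMA_RES_HW_BASE = RDMA_RES_VERB_MAX,
	RDMA_RES_HW_CTR = RDMA_RES_HW_BASE,	/* example hw resource */
	RDMA_RES_HW_MAX,
};

/* Core-side name table: an array of string pointers indexed by the
 * enum, so no per-driver token parsing is needed. */
static const char *const rdma_res_name[] = {
	[RDMA_RES_VERB_QP] = "qp",
	[RDMA_RES_VERB_CQ] = "cq",
	[RDMA_RES_VERB_MR] = "mr",
	[RDMA_RES_HW_CTR] = "hw_ctr",
};

struct rdmacg_device {
	unsigned long supported;	/* bitmap over the enum above */
};

static bool device_supports(const struct rdmacg_device *dev, int res)
{
	return res >= 0 && res < RDMA_RES_HW_MAX &&
	       (dev->supported & (1UL << res)) != 0;
}
```

With this shape the resource definitions live in core code, and a driver only contributes one bitmap at registration time.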
>
> And then let each device specify bitmap of resources that it supports
> on registration.
>
>> +     if (pool_info->resource_count == 0 ||
>> +         pool_info->resource_count > RDMACG_MAX_RESOURCE_INDEX) {
>> +             ret = -EINVAL;
>> +             goto err;
>> +     }
>> +
>> +     /* allocate resource pool */
>> +     rpool = alloc_cg_rpool(cg, device, pool_info->resource_count, type);
>> +     if (IS_ERR_OR_NULL(rpool))
>> +             return rpool;
>> +
>> +     /* cgroup lock is held to synchronize with multiple
>> +      * resource pool creation in parallel.
>> +      */
>> +     spin_lock(&cg->cg_list_lock);
>> +     other_rpool = find_cg_rpool(cg, device, type);
>> +     /* if other task added resource pool for this device for this cgroup
>> +      * free up which was recently created and use the one we found.
>> +      */
>> +     if (other_rpool) {
>> +             atomic_inc(&other_rpool->refcnt);
>> +             spin_unlock(&cg->cg_list_lock);
>> +             _free_cg_rpool(rpool);
>> +             return other_rpool;
>> +     }
>> +
>> +     atomic_inc(&rpool->refcnt);
>> +     list_add_tail(&rpool->cg_list, &cg->rpool_head);
>> +
>> +     spin_unlock(&cg->cg_list_lock);
>> +     return rpool;
>> +
>> +err:
>> +     spin_unlock(&cg->cg_list_lock);
>> +     return ERR_PTR(ret);
>> +}
>
> You're grabbing the lock anyway.  Why bother with atomics at all?
> Just grab the lock on entry, look up the entry and then do normal
> integer ops.
When freeing the entry, we don't grab the lock if the refcnt has not
reached zero.
Please refer to put_cg_rpool, which does that.
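The allocate-then-recheck race handling in get_cg_rpool() can be modeled in userspace as below. Locking is elided to comments and all names are illustrative: the pool is allocated outside the lock, the list is re-checked under the lock, and the fresh allocation is dropped if another task won the race.

```c
#include <stdlib.h>

struct rpool {
	int id;			/* stands in for (device, type) */
	int refcnt;
	struct rpool *next;
};

static struct rpool *head;

/* Caller holds the list lock in the real code. */
static struct rpool *find_rpool(int id)
{
	struct rpool *p;

	for (p = head; p; p = p->next)
		if (p->id == id)
			return p;
	return NULL;
}

static struct rpool *get_rpool(int id)
{
	struct rpool *p, *other;

	/* lock */
	p = find_rpool(id);
	if (p) {
		p->refcnt++;
		/* unlock */
		return p;
	}
	/* unlock: allocate without holding the lock */
	p = calloc(1, sizeof(*p));
	if (!p)
		return NULL;
	p->id = id;

	/* lock again and re-check: another task may have added one */
	other = find_rpool(id);
	if (other) {
		other->refcnt++;
		/* unlock */
		free(p);	/* lost the race: drop our copy */
		return other;
	}
	p->refcnt = 1;
	p->next = head;
	head = p;
	/* unlock */
	return p;
}
```

This only models the control flow being discussed; Tejun's objection is about whether the refcount and atomics are needed at all, not about the recheck itself.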

>
> The whole thing is still way more complex than it needs to be.  Please
> slim it down.  It really shouldn't take half as much code.  Keep it
> simple, please.

I have tried to keep it as simple as possible.
If you have specific comments like the above, I can certainly address them.

>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCHv3 1/3] rdmacg: Added rdma cgroup controller.
  2016-01-30 20:44     ` Parav Pandit
@ 2016-01-31 10:02       ` Tejun Heo
  2016-01-31 10:41         ` Parav Pandit
  0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2016-01-31 10:02 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Haggai Eran, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

Hello, Parav.

On Sun, Jan 31, 2016 at 02:14:13AM +0530, Parav Pandit wrote:
...
> V1 patch had IB resources defined in the header file of rdmacg, which
> I believe is very restrictive model with evolving rdma stack and
> features.

Wasn't this the model that we agreed upon?  Besides, even if the
resources are to be supplied by the driver, a better way would be
letting it specify the tables of resources.  There's no reason for
indirection through a callback.

> Therefore it will be kept as defined in v3 patch in IB headers (non
> compile time for rdma cgroup). So support infrastructure APIs will
> continue.

So, what we discussed before just went out the window?

> > The only thing necessary is each device
> > declaring which resources are to be used.
> >
> Thats what rpool_ops structure does, allowing to query name strings
> and type of it by utilizing the match tokens.

Keep the resource types in an array in minimal way and match with the
information from core side.  It doesn't make any sense to use match
tokens in defining resources when the resource type is always fixed.

> > Why is this necessary?
> >
> User can unset the configured limit by writing "remove" for a given
> device, instead of writing max values for all the resources.
> As I explained in cover note and other comment, when its marked for
> remove, the resource pool is marked as of default type so that it can
> be freed. When all resources are freed, we don't free the rpool
> because it holds the configured limit.

I still don't get it.  Why isn't this tied to the lifetime of the
device?

> >> +enum rpool_creator {
> >> +     RDMACG_RPOOL_CREATOR_DEFAULT,
> >> +     RDMACG_RPOOL_CREATOR_USR,
> >> +};
> >
> > Why does this matter?
> As you got in later comment and as I explained above, basically
> resource marked as of user type is not freed, until either device goes
> away or either user wants to clear the configuration.

You're re-stating the same thing without explaining the reasoning
behind it.  Why is this different from other controllers?  What's the
justification?

> >> +static void dealloc_cg_rpool(struct rdma_cgroup *cg,
> >> +                          struct cg_resource_pool *rpool)
> >> +{
> >> +     /* Don't free the resource pool which is created by the
> >> +      * user, otherwise we lose the configured limits. We don't
> >> +      * gain much either by splitting storage of limit and usage.
> >> +      * So keep it around until user deletes the limits.
> >> +      */
> >> +     if (atomic_read(&rpool->creator) == RDMACG_RPOOL_CREATOR_DEFAULT)
> >> +             _dealloc_cg_rpool(cg, rpool);
> >> +}
> >
> > The config is tied to the device.  If the device goes away, all its
> > pools go away.  If the device stays, all configs stay.
> >
> config stays, until the resources are allocated. If device is there,
> we don't create resource pool for all the created cgroups to save
> memory with this little extra code.

Yeah, create on demand all you want but why is the end of lifetime
tied to who created?

> >> +static struct cg_resource_pool*
> >> +     _get_cg_rpool(struct rdma_cgroup *cg,
> >> +                   struct rdmacg_device *device,
> >> +                   enum rdmacg_resource_pool_type type)
> >> +{
> >> +     struct cg_resource_pool *rpool;
> >> +
> >> +     spin_lock(&cg->cg_list_lock);
> >> +     rpool = find_cg_rpool(cg, device, type);
> >> +     if (rpool)
> >> +             goto found;
> >> +found:
> >
> > That's one extremely silly way to write noop.
> >
> >> +     spin_unlock(&cg->cg_list_lock);
> >> +     return rpool;
> >> +}
> >
> > This function doesn't make any sense.  Push locking to the caller and
> > use find_cg_rpool().
> 
> This is nice wrapper around find_cg_rpool to write clean code. I would
> like to keep it for code readability.
> if(rpool) check can be removed, because for this function its going to
> be true always.

No, the locking scheme doesn't make any sense.  Except for some
special cases, sequence like the following indicates that the code is
buggy or at least silly.

	lock;
	obj = find_refcnted_obj();
	unlock;
	return obj;

In this particular case, just push out locking to the users.

> >> +static struct cg_resource_pool*
> >> +     get_cg_rpool(struct rdma_cgroup *cg,
> >> +                  struct rdmacg_device *device,
> >> +                  enum rdmacg_resource_pool_type type)
> >> +{
> >> +     struct cg_resource_pool *rpool, *other_rpool;
> >> +     struct rdmacg_pool_info *pool_info;
> >> +     struct rdmacg_resource_pool_ops *ops;
> >> +     int ret = 0;
> >> +
> >> +     spin_lock(&cg->cg_list_lock);
> >> +     rpool = find_cg_rpool(cg, device, type);
> >> +     if (rpool) {
> >> +             atomic_inc(&rpool->refcnt);
> >> +             spin_unlock(&cg->cg_list_lock);
> >> +             return rpool;
> >> +     }
> >
> > Why does it need refcnting?  Things stay if the device is there.
> > Things go away if the device goes away.  No?
> 
> No. If there is one device and 100 cgroups, we create resource pools
> when there is actually any of the process wants to perform charging.
> (Instead of creating 100 resource pools just because cgroup exists).
> So reference count of the rpool checks that when last resource is
> freed, it frees the resource pool, if its allocated as default pool.
> If user has configured the pool, than it stays (until device goes away).

Just create on demand and keep it around till the device is
unregistered.

> What you have described is done in little different way in the
> loadable kernel module as explained earlier to let it defined by the
> IB stack.
> Otherwise this needs to be defined in rdma cgroup header file like my
> v0 patch, which I certainly want to avoid.

IIRC, I clearly objected to delegating resource definition to
individual drivers.

As it currently stands,

 Nacked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun


* Re: [PATCHv3 1/3] rdmacg: Added rdma cgroup controller.
  2016-01-31 10:02       ` Tejun Heo
@ 2016-01-31 10:41         ` Parav Pandit
  2016-01-31 11:04           ` Tejun Heo
  0 siblings, 1 reply; 16+ messages in thread
From: Parav Pandit @ 2016-01-31 10:41 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Haggai Eran, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

On Sun, Jan 31, 2016 at 3:32 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Parav.
>
> On Sun, Jan 31, 2016 at 02:14:13AM +0530, Parav Pandit wrote:
> ...
>> V1 patch had IB resources defined in the header file of rdmacg, which
>> I believe is very restrictive model with evolving rdma stack and
>> features.
>
> Wasn't this the model that we agreed upon?  Besides, even if the
> resources are to be supplied by the driver, a better way would be
> letting it specify the tables of resources.  There's no reason for
> indirection through a callback.

No. We agreed to let the IB stack define them in a header file that
rdmacg can include.
However, when I started with that I realized that such a design has a
basic flaw: the IB stack is compiled as loadable modules.
cgroup code, which is compiled along with the kernel, cannot rely on
the header file of a loadable module, as that will lead to
incompatibility and possible crashes.
Therefore it's defined as an indirect table. So instead of a callback,
it can be further simplified to a pointer to a data structure stored in
rdmacg_device (similar to the name field).
Would that be reasonable?

>
>> Therefore it will be kept as defined in v3 patch in IB headers (non
>> compile time for rdma cgroup). So support infrastructure APIs will
>> continue.
>
> So, what we discussed before just went out the window?

No. As explained above, the structure size and number of fields are
constant, so let's do it the above way, without a callback function.

>> > The only thing necessary is each device
>> > declaring which resources are to be used.
>> >
>> Thats what rpool_ops structure does, allowing to query name strings
>> and type of it by utilizing the match tokens.
>
> Keep the resource types in an array in minimal way and match with the
> information from core side.  It doesn't make any sense to use match
> tokens in defining resources when the resource type is always fixed.
>
Match tokens are used in other places too, typically where user
configuration is involved,
so the match token infrastructure APIs help avoid rewriting a parser.
What exactly is the issue in using match tokens?
Sure, the resource type is fixed with a given version of the loadable
module. But if a new feature/resource is added in the IB stack, at that
point a new token will be added to the array.
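A rough userspace analogue of the token-table parsing under discussion, loosely mirroring what the kernel's match_token()/match_int() helpers do for "name=value" words. The resource names here are illustrative.

```c
#include <stdio.h>
#include <string.h>

enum { RES_MR, RES_QP, RES_MAX, RES_ERR = -1 };

/* One table entry per configurable resource; adding a new resource
 * only means adding a new token here. */
static const char *const res_tok[RES_MAX] = {
	[RES_MR] = "mr",
	[RES_QP] = "qp",
};

/* Parse one "name=value" word; returns the resource index and stores
 * the value, or RES_ERR on a malformed or unknown word. */
static int parse_limit(const char *word, int *value)
{
	const char *eq = strchr(word, '=');
	int i;

	if (!eq || sscanf(eq + 1, "%d", value) != 1)
		return RES_ERR;
	for (i = 0; i < RES_MAX; i++)
		if (strlen(res_tok[i]) == (size_t)(eq - word) &&
		    strncmp(word, res_tok[i], eq - word) == 0)
			return i;
	return RES_ERR;
}
```

Tejun's counter-suggestion amounts to keeping only the bare names in the table and doing the "=%d" handling once in shared code, as this sketch does.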

>> > Why is this necessary?
>> >
>> User can unset the configured limit by writing "remove" for a given
>> device, instead of writing max values for all the resources.
>> As I explained in cover note and other comment, when its marked for
>> remove, the resource pool is marked as of default type so that it can
>> be freed. When all resources are freed, we don't free the rpool
>> because it holds the configured limit.
>
> I still don't get it.  Why isn't this tied to the lifetime of the
> device?
Let me take an example.
Say there is one device and 100 cgroups in a single hierarchy.
Of these, 15 cgroups are actively allocating rdma resources; the other
85 are not active in terms of rdma resource usage.
So rpool objects are created for those 15 cgroups (regardless of
whether the user has configured limits or not).

In this design it's not tied to the lifetime of the device. At the
same time, if the device goes away, those 15 will be freed anyway
because the device doesn't exist.

Now coming to the remove command's need.
Say the user has previously configured a limit of mr=15, and now wants
to remove that configuration and not bother with the limit.
The way it's done is by issuing the "remove" command.
Should I name it "reset"?
When the user issues the "remove" command, there could still be active
rdma resources, so we cannot really free the rpool object.
It is freed when the last resource is uncharged.
Make sense?

>
>> >> +enum rpool_creator {
>> >> +     RDMACG_RPOOL_CREATOR_DEFAULT,
>> >> +     RDMACG_RPOOL_CREATOR_USR,
>> >> +};
>> >
>> > Why does this matter?
>> As you got in later comment and as I explained above, basically
>> resource marked as of user type is not freed, until either device goes
>> away or either user wants to clear the configuration.
>
> You're re-stating the same thing without explaining the reasoning
> behind it.  Why is this different from other controllers?  What's the
> justification?

As in the 15-85 example above, if we keep them around (or create them
up front), we will end up allocating rpool objects for all 100 cgroups,
which might not be necessary. If it happens to be a multi-level
hierarchy, we will end up having them at each level.
That is why they are allocated and freed dynamically.

>
>> >> +static void dealloc_cg_rpool(struct rdma_cgroup *cg,
>> >> +                          struct cg_resource_pool *rpool)
>> >> +{
>> >> +     /* Don't free the resource pool which is created by the
>> >> +      * user, otherwise we lose the configured limits. We don't
>> >> +      * gain much either by splitting storage of limit and usage.
>> >> +      * So keep it around until user deletes the limits.
>> >> +      */
>> >> +     if (atomic_read(&rpool->creator) == RDMACG_RPOOL_CREATOR_DEFAULT)
>> >> +             _dealloc_cg_rpool(cg, rpool);
>> >> +}
>> >
>> > The config is tied to the device.  If the device goes away, all its
>> > pools go away.  If the device stays, all configs stay.
>> >
>> config stays, until the resources are allocated. If device is there,
>> we don't create resource pool for all the created cgroups to save
>> memory with this little extra code.
>
> Yeah, create on demand all you want but why is the end of lifetime
> tied to who created?
>
If it's created by user configuration, and we free the rpool object,
which holds that configuration, when the last resource is freed, the
user configuration is lost.
Also it doesn't make much sense to have two different allocations for
the limit and usage configuration. The pointer-storing overhead is more
than the actual content.

>> >> +static struct cg_resource_pool*
>> >> +     _get_cg_rpool(struct rdma_cgroup *cg,
>> >> +                   struct rdmacg_device *device,
>> >> +                   enum rdmacg_resource_pool_type type)
>> >> +{
>> >> +     struct cg_resource_pool *rpool;
>> >> +
>> >> +     spin_lock(&cg->cg_list_lock);
>> >> +     rpool = find_cg_rpool(cg, device, type);
>> >> +     if (rpool)
>> >> +             goto found;
>> >> +found:
>> >
>> > That's one extremely silly way to write noop.
>> >
>> >> +     spin_unlock(&cg->cg_list_lock);
>> >> +     return rpool;
>> >> +}
>> >
>> > This function doesn't make any sense.  Push locking to the caller and
>> > use find_cg_rpool().
>>
>> This is nice wrapper around find_cg_rpool to write clean code. I would
>> like to keep it for code readability.
>> if(rpool) check can be removed, because for this function its going to
>> be true always.
>
> No, the locking scheme doesn't make any sense.  Except for some
> special cases, sequence like the following indicates that the code is
> buggy or at least silly.
>
Help me understand: does silly mean unacceptable?
I am still trying to understand why a wrapper function makes the code buggy.

>         lock;
>         obj = find_refcnted_obj();
>         unlock;
>         return obj;
>
> In this particular case, just push out locking to the users.
>

>> >> +static struct cg_resource_pool*
>> >> +     get_cg_rpool(struct rdma_cgroup *cg,
>> >> +                  struct rdmacg_device *device,
>> >> +                  enum rdmacg_resource_pool_type type)
>> >> +{
>> >> +     struct cg_resource_pool *rpool, *other_rpool;
>> >> +     struct rdmacg_pool_info *pool_info;
>> >> +     struct rdmacg_resource_pool_ops *ops;
>> >> +     int ret = 0;
>> >> +
>> >> +     spin_lock(&cg->cg_list_lock);
>> >> +     rpool = find_cg_rpool(cg, device, type);
>> >> +     if (rpool) {
>> >> +             atomic_inc(&rpool->refcnt);
>> >> +             spin_unlock(&cg->cg_list_lock);
>> >> +             return rpool;
>> >> +     }
>> >
>> > Why does it need refcnting?  Things stay if the device is there.
>> > Things go away if the device goes away.  No?
>>
>> No. If there is one device and 100 cgroups, we create resource pools
>> when there is actually any of the process wants to perform charging.
>> (Instead of creating 100 resource pools just because cgroup exists).
>> So reference count of the rpool checks that when last resource is
>> freed, it frees the resource pool, if its allocated as default pool.
>> If user has configured the pool, than it stays (until device goes away).
>
> Just create on demand and keep it around till the device is
> unregistered.
>
Why don't you want them to be freed when there is no requester
allocating the resource?
A device usually stays around for a long time, but applications go away
and come back more often, so freeing the pools when not in use makes
more sense. What exactly is the problem with freeing when uncharging
occurs, such that I should defer it to the device unregistration stage?

>> What you have described is done in little different way in the
>> loadable kernel module as explained earlier to let it defined by the
>> IB stack.
>> Otherwise this needs to be defined in rdma cgroup header file like my
>> v0 patch, which I certainly want to avoid.
>
> IIRC, I clearly objected to delegating resource definition to
> individual drivers.
>
I understand that. Instead of individual drivers, it can be abstracted
out at the IB stack level for drivers to use.
So it will be in cgroup.c or another file, but I don't anticipate a
design change with that.
I have not received review comments from Sean Hefty or any of the
Intel folks who requested this feature.
So I will keep this feature around for a while. I will ping him as
well to finish his reviews and see if there are any resource
definitions they can spell out.
In the absence of those inputs, the other possibility is to just define
verb-level resources. I am keeping the doors open for more review
comments from IB folks on how they see this.

> As it currently stands,
>
>  Nacked-by: Tejun Heo <tj@kernel.org>
>
No problem. Let's resolve these review comments for v6.

> Thanks.
>
> --
> tejun


* Re: [PATCHv3 1/3] rdmacg: Added rdma cgroup controller.
  2016-01-31 10:41         ` Parav Pandit
@ 2016-01-31 11:04           ` Tejun Heo
  2016-01-31 17:50             ` Parav Pandit
  0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2016-01-31 11:04 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Haggai Eran, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

Hello, Parav.

On Sun, Jan 31, 2016 at 04:11:54PM +0530, Parav Pandit wrote:
> No. We agreed that let IB stack define in the header file that rdmacg
> can include.
> However when I started with that I realized that, such design has
> basic flaw that IB stack is compiled as loadable modules.
> cgroup which is compiled along with kernel, cannot rely on the header
> file of the loadable module, as it will lead to incompatibly and
> possible crash.

Yes, it can.  It just becomes a part of kernel ABI that the IB stack
module depends on.

> > Keep the resource types in an array in minimal way and match with the
> > information from core side.  It doesn't make any sense to use match
> > tokens in defining resources when the resource type is always fixed.
> >
> Match token is being used in other places also typically where user
> configuration is involved.
> so Match token infrastructure APIs help to avoid rewriting parser.
> What exactly is the issue in using match token?
> Resource type is fixed - sure it is with given version of loadable
> module. But if new feature/resource is added in IB stack, at that
> point new token will be added in the array.

Sure, use parser for parsing but you end up stripping =%d in other
places anyway.  Do it the other way.  Let the definitions contain
what's essential and do the extra stuff in code handling it, not the
other way around.

> Now coming to remove command's need.
> If user has previously configured limit of say mr=15. Now if wants to
> remove that configuration and don't want to bother for the limit.
> So the way, its done is by issuing "remove" command.
> Should I name is "reset".

How is this different from setting the limits to max?

> When user issues "remove" command it could still happen that there are
> active rdma resources. So we cannot really free the rpool object.
> That is freed when last resource is uncharged.
> Make sense?

Not really.

> > You're re-stating the same thing without explaining the reasoning
> > behind it.  Why is this different from other controllers?  What's the
> > justification?
> 
> As in above example of 15-85, if we keep it or create is we will end
> up allocating rpool objects for all the 100 cgroups, which might not
> be necessary. If it happens to be multi level hierarchy, it will end
> up having in each level.
> So therefore its allocated and freed dynamically.

Just keeping them around till device destruction would work just fine.
If you wanna insist on freeing them dynamically, free them when both
their usages and configs are nil.  How the object is created shouldn't
be a factor.

> If its created by user configuration, and if we free the rpool object
> when last resource is freed which is holding the configuration,
> user configuration is lost.
> Also it doesn't make much sense to have two different allocation for
> limit and usage configuration. Pointer storing overhead is more than
> the actual content.

See above.  If you want to free dynamically, free when the thing
doesn't contain any information.  Better, just don't worry about it.
It's unlikely to matter.

> > No, the locking scheme doesn't make any sense.  Except for some
> > special cases, sequence like the following indicates that the code is
> > buggy or at least silly.
> >
> Help me to understand, silly means - unacceptable?

Yes, it's a clear anti-pattern for a refcnted object.  Again, just
push the locking to the users and drop the atomics.

> Why you don't want them to be freed when there are no requester
> allocating the resource?
> Device usually stays for longer time, but applications go way and come
> back more often, so freeing them makes more sense when not in use.
> What exactly is the problem in freeing when uncharing occurs, due to
> which I should defer it to device unregistration stage?

Two things.  Freeing dynamically doesn't require creator type or
refcnting here, so the code makes no sense to me.  It's complicated
without any purpose.  Second, it's not that big a struct.  Just leave
them around till it's clear that this is problematic.  For the Nth
time in this whole thing, keep things simple.  Stop overengineering.

-- 
tejun


* Re: [PATCHv3 1/3] rdmacg: Added rdma cgroup controller.
  2016-01-31 11:04           ` Tejun Heo
@ 2016-01-31 17:50             ` Parav Pandit
  2016-02-01 18:40               ` Tejun Heo
  0 siblings, 1 reply; 16+ messages in thread
From: Parav Pandit @ 2016-01-31 17:50 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Haggai Eran, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

Hi Doug, Liran, Jason,

How would you like to see RDMA verb resources being defined: in the
RDMA cgroup, or in the IB stack?
In the current patch v5, they are defined by the IB stack, which is
often shipped as a separate package due to its high rate of changes,
bug fixes, and features.
In the v0 patch they were defined by the RDMA cgroup, which means any
new resource addition/definition requires a kernel upgrade. That is
hard to do often.
If resources are defined by the RDMA cgroup in the kernel, then
adding/removing a resource means someone has to do a lot of engineering
across different kernel versions, with loadable module support using
compat.h etc. at the driver level. In my mind that is considerably more
engineering than what v5 has to offer, and v5 is already available.
With one round of cleanup in the resource definitions, it should be
usable.

It's important to provide this feedback to Tejun and me, so that we
can make an informed design decision.

Hi Tejun,

On Sun, Jan 31, 2016 at 4:34 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Parav.
>
> On Sun, Jan 31, 2016 at 04:11:54PM +0530, Parav Pandit wrote:
>> No. We agreed that let IB stack define in the header file that rdmacg
>> can include.
>> However when I started with that I realized that, such design has
>> basic flaw that IB stack is compiled as loadable modules.
>> cgroup which is compiled along with kernel, cannot rely on the header
>> file of the loadable module, as it will lead to incompatibly and
>> possible crash.
>
> Yes, it can.  It just becomes a part of kernel ABI that the IB stack
> module depends on.
>
OK. It's doable, but I believe that while it's simple, it's restrictive.
If the rest of the RDMA experts are OK with it, I will change it to
work that way. Let's hear their feedback before we set this in stone as ABI.

>> > Keep the resource types in an array in minimal way and match with the
>> > information from core side.  It doesn't make any sense to use match
>> > tokens in defining resources when the resource type is always fixed.
>> >
>> Match token is being used in other places also typically where user
>> configuration is involved.
>> so Match token infrastructure APIs help to avoid rewriting parser.
>> What exactly is the issue in using match token?
>> Resource type is fixed - sure it is with given version of loadable
>> module. But if new feature/resource is added in IB stack, at that
>> point new token will be added in the array.
>
> Sure, use parser for parsing but you end up stripping =%d in other
> places anyway.  Do it the other way.  Let the definitions contain
> what's essential and do the extra stuff in code handling it, not the
> other way around.
>
>> Now coming to remove command's need.
>> If user has previously configured limit of say mr=15. Now if wants to
>> remove that configuration and don't want to bother for the limit.
>> So the way, its done is by issuing "remove" command.
>> Should I name is "reset".
>
> How is this different from setting the limits to max?
>
We can set it to max. But at freeing time, scanning an array of 10+
elements to check whether each one is at max did not seem efficient either.

>> When user issues "remove" command it could still happen that there are
>> active rdma resources. So we cannot really free the rpool object.
>> That is freed when last resource is uncharged.
>> Make sense?
>
> Not really.
>
>> > You're re-stating the same thing without explaining the reasoning
>> > behind it.  Why is this different from other controllers?  What's the
>> > justification?
>>
>> As in above example of 15-85, if we keep it or create is we will end
>> up allocating rpool objects for all the 100 cgroups, which might not
>> be necessary. If it happens to be multi level hierarchy, it will end
>> up having in each level.
>> So therefore its allocated and freed dynamically.
>
> Just keeping them around till device destruction would work just fine.

Most likely not. I tried to explain that in my reply to Haggai's email
for the RFC, and also in the rdma.txt documentation in patch 3/3.
Let me try again.
A device stays in a system for a long time - mostly for months, until
drivers are unloaded, the firmware crashes, or some other event occurs.
The cgroup life cycle, on the other hand, is measured in hours or so.
So an rpool that we allocated dynamically needs to be freed, at least
when its limits are set back to max or when the cgroup is deleted.
But there is a catch: it cannot be freed immediately, because there may
be processes that allocated resources from this cgroup but have since
been migrated to another cgroup and will uncharge some time later.
(This is not a use case, but it is an artifact of the cgroup interface.)
So the rpool needs to be freed later, when the uncharge occurs. At that
point we drop the reference to the css, so that the css can be freed
and the same css->id can be recycled for a new cgroup.
If we don't do this and instead wait until device destruction, the set
of rpools just keeps growing!
I think the above is a more important design aspect to me than just
saving memory.

Let me know if you have a different design thought here.
(For example, engineering a can_attach() callback in the rdma cgroup,
which I don't see as necessary either.)

> If you wanna insist on freeing them dynamically, free them when both
> their usages and configs are nil.
Agree. That is what the code is doing by marking the object type as
default. If I remove the object type, as an alternative it requires
that when we uncharge a resource, we run through the max array to see
whether all entries are set to max, and that we keep one aggregate
usage counter (apart from the individual counter for each resource).

> How the object is created shouldn't be a factor.

A loop over the max values plus an aggregate counter can avoid the
creator variable.

>
>> If its created by user configuration, and if we free the rpool object
>> when last resource is freed which is holding the configuration,
>> user configuration is lost.
>> Also it doesn't make much sense to have two different allocation for
>> limit and usage configuration. Pointer storing overhead is more than
>> the actual content.
>
> See above.  If you want to free dynamically, free when the thing
> doesn't contain any information.
o.k.

> Better, just don't worry about it. It's unlikely to matter.
I tried to explain above the need to free dynamically, instead of at
device destruction time.

>
>> > No, the locking scheme doesn't make any sense.  Except for some
>> > special cases, sequence like the following indicates that the code is
>> > buggy or at least silly.
>> >
>> Help me to understand, silly means - unacceptable?
>
> Yes, it's a clear anti-pattern for a refcnted object.  Again, just
> push the locking to the users and drop the atomics.
>
Ok. Let me see how I can refactor this part of the code. I will get
back on this if I hit a block.


>> Why you don't want them to be freed when there are no requester
>> allocating the resource?
>> Device usually stays for longer time, but applications go way and come
>> back more often, so freeing them makes more sense when not in use.
>> What exactly is the problem in freeing when uncharing occurs, due to
>> which I should defer it to device unregistration stage?
>
> Two things.  Freeing dynamically doesn't require creator type or
> refcnting here, so the code makes no sense to me.  It's complicated
> without any purpose.  Second, it's not that big a struct.  Just leave
> them around till it's clear that this is problematic.
The explanation above should justify that it is problematic to keep
them around until device destruction.

> For the Nth time in this whole thing, keep things simple.  Stop overengineering.

I have tried to keep it simple. I respect your comments and have
incorporated most of the valuable feedback you gave into the
subsequent patches from v0 to v5.
The few exceptions, which I will work through now, are:
1. resource definition place (cgroup vs IB stack) - avoid the callback.
2. lock functions
3. dynamic freeing (which I think is justified above).

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCHv3 1/3] rdmacg: Added rdma cgroup controller.
  2016-01-31 17:50             ` Parav Pandit
@ 2016-02-01 18:40               ` Tejun Heo
  2016-02-01 18:59                 ` Parav Pandit
  0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2016-02-01 18:40 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Haggai Eran, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

Hello, Parav.

On Sun, Jan 31, 2016 at 11:20:45PM +0530, Parav Pandit wrote:
> > Yes, it can.  It just becomes a part of kernel ABI that the IB stack
> > module depends on.
> >
> o.k. Its doable, but I believe its simple but its restrictive.

The whole point is to make it somewhat restrictive so that low level
drivers don't go crazy with it and the defined resources are clearly
identified and documented.

> If rest of the RDMA experts are ok with it, I will change it to do it that way.
> We get to hear their feedback before we put this as ABI.

So, I'm really not gonna go for individual drivers defining resources
on their own.  That's a trainwreck waiting to happen.  There needs to
be a lot more scrutiny than that.

> >> Now coming to remove command's need.
> >> If user has previously configured limit of say mr=15. Now if wants to
> >> remove that configuration and don't want to bother for the limit.
> >> So the way, its done is by issuing "remove" command.
> >> Should I name is "reset".
> >
> > How is this different from setting the limits to max?
>
> We can set it to max. But during freeing time, checking array of 10+
> elements whether its max or not, didn't find efficient either.

So, in terms of user interface, it doesn't buy anything, right?  It's
generally a bad idea to mix implementation details like "checking 10+
elems on free" with interface decisions.  If the overhead of scanning
the array matters, it can easily be resolved by keeping the count of
non-max values.

It could be that setting all attributes to "max" is inconvenient from
interface POV (I don't think this would be a common use case tho).  If
so, please justify that, update the interface conventions along with
other similar knobs.  Don't introduce one-off custom thing.

...
> If we don't do this, and wait until device destruction, rpool just
> keeps growing!
>
> I think above is more important design aspect than just saving memory to me.
> 
> Let me know if you have different design thought here.
> (Like engineering can_attach() callback in rdma_cgroup, which I don't
> see necessary either).

Pin the css while things are in flight and free from css_free?

> > If you wanna insist on freeing them dynamically, free them when both
> > their usages and configs are nil.
> Agree. Thats what the code is doing using marking object type being default.
> If I remove the object type, as alternative,

You don't need object type for that.  Think about it again.  All you
need is a refcnt and cgroup already has a facility for that.

Thanks.

-- 
tejun


* Re: [PATCHv3 1/3] rdmacg: Added rdma cgroup controller.
  2016-02-01 18:40               ` Tejun Heo
@ 2016-02-01 18:59                 ` Parav Pandit
  2016-02-10 16:27                   ` Haggai Eran
  0 siblings, 1 reply; 16+ messages in thread
From: Parav Pandit @ 2016-02-01 18:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Haggai Eran, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

On Tue, Feb 2, 2016 at 12:10 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Parav.
>
> On Sun, Jan 31, 2016 at 11:20:45PM +0530, Parav Pandit wrote:
>> > Yes, it can.  It just becomes a part of kernel ABI that the IB stack
>> > module depends on.
>> >
>> o.k. Its doable, but I believe its simple but its restrictive.
>
> The whole point is to make it somewhat restrictive so that low level
> drivers don't go crazy with it and the defined resources are clearly
> identified and documented.
>
>> If rest of the RDMA experts are ok with it, I will change it to do it that way.
>> We get to hear their feedback before we put this as ABI.
>
> So, I'm really not gonna go for individual drivers defining resources
> on their own.  That's a trainwreck waiting to happen.  There needs to
> be a lot more scrutiny than that.
>
Not every low level driver. I started with that infrastructure in
v2/v3, but I took your input and I am aligned with it. A single enum
list in one header file in the IB stack would be sufficient.
You have already given that example.
With that, the two resource types I currently have can also mostly
shrink to just a single type.
I will wait to hear from the RDMA folks, in case they have any
different thoughts.


>> >> Now coming to remove command's need.
>> >> If user has previously configured limit of say mr=15. Now if wants to
>> >> remove that configuration and don't want to bother for the limit.
>> >> So the way, its done is by issuing "remove" command.
>> >> Should I name is "reset".
>> >
>> > How is this different from setting the limits to max?
>>
>> We can set it to max. But during freeing time, checking array of 10+
>> elements whether its max or not, didn't find efficient either.
>
> So, in terms of user interface, it doesn't buy anything, right?  It's
> generally a bad idea to mix implementation details like "checking 10+
> elems on free" with interface decisions.  If the overhead of scanning
> the array matters, it can easily be resolved by keeping the count of
> non-max values.
>
> It could be that setting all attributes to "max" is inconvenient from
> interface POV (I don't think this would be a common use case tho).  If
> so, please justify that, update the interface conventions along with
> other similar knobs.  Don't introduce one-off custom thing.
>
Ok. Setting all values to max is not really a common use case, so I
believe it is ok to require that, and therefore to drop the "remove"
option.

> ...
>> If we don't do this, and wait until device destruction, rpool just
>> keeps growing!
>>
>> I think above is more important design aspect than just saving memory to me.
>>
>> Let me know if you have different design thought here.
>> (Like engineering can_attach() callback in rdma_cgroup, which I don't
>> see necessary either).
>
> Pin the css while things are in flight and free from css_free?

Yes. That is what is done in the current patch: css_get on charge and
css_put on uncharge, which takes care of freeing the rpool nicely.

>
>> > If you wanna insist on freeing them dynamically, free them when both
>> > their usages and configs are nil.
>> Agree. Thats what the code is doing using marking object type being default.
>> If I remove the object type, as alternative,
>
> You don't need object type for that.  Think about it again.  All you
> need is a refcnt and cgroup already has a facility for that.
>
ok. Let me attempt that.

I will wait for a day or two before I roll out v6, to pick up any
comments Liran, Haggai or Doug may have on the resource type
definition part.


> Thanks.
>
> --
> tejun


* Re: [PATCHv3 1/3] rdmacg: Added rdma cgroup controller.
  2016-02-01 18:59                 ` Parav Pandit
@ 2016-02-10 16:27                   ` Haggai Eran
  2016-02-11 21:17                     ` Parav Pandit
  2016-02-11 21:19                     ` Parav Pandit
  0 siblings, 2 replies; 16+ messages in thread
From: Haggai Eran @ 2016-02-10 16:27 UTC (permalink / raw)
  To: Parav Pandit, Tejun Heo
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Jonathan Corbet, james.l.morris, serge,
	Or Gerlitz, Matan Barak, raindel, akpm, linux-security-module

On 01/02/2016 20:59, Parav Pandit wrote:
> On Tue, Feb 2, 2016 at 12:10 AM, Tejun Heo <tj@kernel.org> wrote:
>> So, I'm really not gonna go for individual drivers defining resources
>> on their own.  That's a trainwreck waiting to happen.  There needs to
>> be a lot more scrutiny than that.
>>
> Not every low level driver. I started with that infrastructure in
> v2,v3 but I got your inputs and
> I align with that. It could be just single IB stack in one header file
> in one enum list would be sufficient.
> You have already given that example.
> With that mostly two resource type that I have can also shrink to just
> single type.
> Will wait to hear from them, in case if they have any different thought.

Hi,

Sorry for the late reply.

I think that starting with the standard set of resources that uverbs
provide is good, and if we need new types of resources in the future,
we can add them later.

On 31/01/2016 19:50, Parav Pandit wrote:
> How would you like to see RDMA verb resources being defined - in RDMA
> cgroup or in IB stack?
> In current patch v5, its defined by the IB stack which is often
> shipped as different package due to high amount of changes, bug fixes,
> features.
> In v0 patch it was defined by the RDMA cgroup, which means any new
> resource addition/definition requires kernel upgrade. Which is hard to
> change often.

There is indeed an effort to backport the latest RDMA subsystem modules to 
older kernels, and it would be preferable to be able to introduce new
resources through these modules. However, I understand that there are no
other cgroups that are modules or depend on modules this way, so I would
understand if you decide against it.

> If resources are defined by RDMA cgroup kernel than adding/removing
> resource means, someone have to do lot of engineering with different
> versions of kernel support and loadable module support using compat.h
> etc at driver level, which in my mind is some amount of engineering
> compare to what v5 has to offer and its already available. With one
> round of cleanup in resource definition, it should be usable.
If I understand correctly, if the resources are defined in the cgroup,
you simply won't be able to add new resources with a module update,
no matter how hard you work.

I agree that if the cgroup code is changed for cleanup or whatever 
reason, the backporting may become difficult, but that's just life.

> Its important to provide this feedback to Tejun and me, so that we
> take informed design decision.

Sure. I hope this patchset gets accepted eventually, as I believe it
solves a real problem. Today an RDMA application can easily hog these
resources, and the rdma cgroup allows users to prevent that.

Regards,
Haggai


* Re: [PATCHv3 1/3] rdmacg: Added rdma cgroup controller.
  2016-02-10 16:27                   ` Haggai Eran
@ 2016-02-11 21:17                     ` Parav Pandit
  2016-02-11 21:19                     ` Parav Pandit
  1 sibling, 0 replies; 16+ messages in thread
From: Parav Pandit @ 2016-02-11 21:17 UTC (permalink / raw)
  To: Haggai Eran
  Cc: Tejun Heo, cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Jonathan Corbet, james.l.morris, serge,
	Or Gerlitz, Matan Barak, raindel, akpm, linux-security-module

On Wed, Feb 10, 2016 at 9:57 PM, Haggai Eran <haggaie@mellanox.com> wrote:
>
> There is indeed an effort to backport the latest RDMA subsystem modules to
> older kernels, and it would be preferable to be able to introduce new
> resources through these modules.
Right. It is hardly 10 to 20 lines of code to support this
functionality, so I think it is a good thing to have.


> > If resources are defined by RDMA cgroup kernel than adding/removing
> > resource means, someone have to do lot of engineering with different
> > versions of kernel support and loadable module support using compat.h
> > etc at driver level, which in my mind is some amount of engineering
> > compare to what v5 has to offer and its already available. With one
> > round of cleanup in resource definition, it should be usable.
> If I understand correctly, if the resources are defined in the cgroup,
> you simply won't be able to add new resources with a module update,
> no matter how hard you work.
>

Yes.


* Re: [PATCHv3 1/3] rdmacg: Added rdma cgroup controller.
  2016-02-10 16:27                   ` Haggai Eran
  2016-02-11 21:17                     ` Parav Pandit
@ 2016-02-11 21:19                     ` Parav Pandit
  2016-02-12 16:45                       ` Tejun Heo
  1 sibling, 1 reply; 16+ messages in thread
From: Parav Pandit @ 2016-02-11 21:19 UTC (permalink / raw)
  To: Haggai Eran
  Cc: Tejun Heo, cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Jonathan Corbet, james.l.morris, serge,
	Or Gerlitz, Matan Barak, raindel, akpm, linux-security-module

Hi Tejun,

(Sending again, as by mistake the previous mail was not plain text;
sorry about that.)

So based on your comments, Haggai's comments below, and past
experience, I will spin v6.
I have made the changes and am testing them. I am likely to post
during the coming long weekend.

I have simplified many pieces.
Can you please confirm that the design/implementation below looks ok
to you? If you would rather review v6 directly, I am fine with that
too, instead of the summary below.

1. Removed the two types of resource pool; made it a single type (as
you described in a past comment).
2. Removed match tokens; resources now use an array definition like
"qp", "mr", "cq", etc.
3. Wrote a small parser and avoided the match_token API, as that won't
work with the different array definition.
4. Removed the one-off "remove" API to unconfigure a cgroup; instead,
all resources should be set to max.
5. Removed the resource pool type (user/default); instead the pool
keeps a max_num_cnt, and when ref_cnt drops to zero and max_num_cnt
equals the total resource count, the pool is freed.
6. Resource definition ownership now lies only with the IB stack, in a
single header file, and no longer in each low level driver.
This goes through the IB maintainer's and other reviewers' eyes.
It also continues to give the flexibility of not forcing a kernel
upgrade for a few enum additions for new resource types.
7. Wherever possible the pool lock is pushed out to the callers,
except at the hierarchical charging/uncharging points, where that is
not possible because the iterative process involves blocking
allocations of rpools. Moving more levels up to release locks doesn't
make sense either.
This is anyway a slow path taken when the rpool is not yet allocated;
except for the typical first resource allocation, it is a rarely
traveled path.
8. Other minor cleanups.
9. Avoided %d manipulation (due to the removal of match_token) and
replaced it with seq_putc() and friend functions.

Parav

On Wed, Feb 10, 2016 at 9:57 PM, Haggai Eran <haggaie@mellanox.com> wrote:
> On 01/02/2016 20:59, Parav Pandit wrote:
>> On Tue, Feb 2, 2016 at 12:10 AM, Tejun Heo <tj@kernel.org> wrote:
>>> So, I'm really not gonna go for individual drivers defining resources
>>> on their own.  That's a trainwreck waiting to happen.  There needs to
>>> be a lot more scrutiny than that.
>>>
>> Not every low level driver. I started with that infrastructure in
>> v2,v3 but I got your inputs and
>> I align with that. It could be just single IB stack in one header file
>> in one enum list would be sufficient.
>> You have already given that example.
>> With that mostly two resource type that I have can also shrink to just
>> single type.
>> Will wait to hear from them, in case if they have any different thought.
>
> Hi,
>
> Sorry for the late reply.
>
> I think that starting with the standard set of resources that uverbs
> provide is good, and if we need in the future new types of resources
> we can add them later.
>
> On 31/01/2016 19:50, Parav Pandit wrote:
>> How would you like to see RDMA verb resources being defined - in RDMA
>> cgroup or in IB stack?
>> In current patch v5, its defined by the IB stack which is often
>> shipped as different package due to high amount of changes, bug fixes,
>> features.
>> In v0 patch it was defined by the RDMA cgroup, which means any new
>> resource addition/definition requires kernel upgrade. Which is hard to
>> change often.
>
> There is indeed an effort to backport the latest RDMA subsystem modules to
> older kernels, and it would be preferable to be able to introduce new
> resources through these modules. However, I understand that there are no
> other cgroups that are modules or depend on modules this way, so I would
> understand if you decide against it.
>
>> If resources are defined by RDMA cgroup kernel than adding/removing
>> resource means, someone have to do lot of engineering with different
>> versions of kernel support and loadable module support using compat.h
>> etc at driver level, which in my mind is some amount of engineering
>> compare to what v5 has to offer and its already available. With one
>> round of cleanup in resource definition, it should be usable.
> If I understand correctly, if the resources are defined in the cgroup,
> you simply won't be able to add new resources with a module update,
> no matter how hard you work.
>
> I agree that if the cgroup code is changed for cleanup or whatever
> reason, the backporting may become difficult, but that's just life.
>
>> Its important to provide this feedback to Tejun and me, so that we
>> take informed design decision.
>
> Sure. I hope this patchset gets accepted eventually, as I believe it
> solves a real problem. Today RDMA application can easily hog these
> resources and the rdma cgroup allows users to prevent that.
>
> Regards,
> Haggai


* Re: [PATCHv3 1/3] rdmacg: Added rdma cgroup controller.
  2016-02-11 21:19                     ` Parav Pandit
@ 2016-02-12 16:45                       ` Tejun Heo
  0 siblings, 0 replies; 16+ messages in thread
From: Tejun Heo @ 2016-02-12 16:45 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Haggai Eran, cgroups, linux-doc, linux-kernel, linux-rdma,
	lizefan, Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Jonathan Corbet, james.l.morris, serge,
	Or Gerlitz, Matan Barak, raindel, akpm, linux-security-module

Hello, Parav.

On Fri, Feb 12, 2016 at 02:49:38AM +0530, Parav Pandit wrote:
> 1. Removed two type of resource pool, made is single type (as you
> described in past comment)
> 2. Removed match tokens and have array definition like "qp", "mr", "cq" etc.
> 3. Wrote small parser and avoided match_token API as that won't work
> due to different array definition
> 4. Removed one-off remove API to unconfigure cgroup, instead all
> resource should be set to max.
> 5. Removed resource pool type (user/default), instead having max_num_cnt,
> when ref_cnt drops to zero and max_num_cnt = total_rescource_cnt, pool is freed.
> 6. Resource definition ownership is now only with IB stack at single
> header file, no longer in each low level driver.
> This goes through IB maintainer and other reviewers eyes.
> This continue to give flexibility to not force kernel upgrade for few
> enums additions for new resource type.
> 7. Wherever possible pool lock is pushed out, except for hierarchical
> charging/unchanging points, as it not possible to do so, due to
> iterative process involves blocking allocations of rpool. Coming up
> more levels up to release locks doesn't make any sense either.
> This is anyway slow path where rpool is not allocated. Except for
> typical first resource allocation, this is less traveled path.
> 8.Other minor cleanups.
> 9. Avoided %d manipulation due to removal of match_token and replaced
> with seq_putc etc friend functions.

Sounds great.  Can't tell too much without looking at the code tho.

Thanks.

-- 
tejun


end of thread, other threads:[~2016-02-12 16:45 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-30 15:23 [PATCHv3 0/3] rdmacg: IB/core: rdma controller support Parav Pandit
2016-01-30 15:23 ` [PATCHv3 1/3] rdmacg: Added rdma cgroup controller Parav Pandit
2016-01-30 18:30   ` Tejun Heo
2016-01-30 20:44     ` Parav Pandit
2016-01-31 10:02       ` Tejun Heo
2016-01-31 10:41         ` Parav Pandit
2016-01-31 11:04           ` Tejun Heo
2016-01-31 17:50             ` Parav Pandit
2016-02-01 18:40               ` Tejun Heo
2016-02-01 18:59                 ` Parav Pandit
2016-02-10 16:27                   ` Haggai Eran
2016-02-11 21:17                     ` Parav Pandit
2016-02-11 21:19                     ` Parav Pandit
2016-02-12 16:45                       ` Tejun Heo
2016-01-30 15:23 ` [PATCHv3 2/3] IB/core: added support to use " Parav Pandit
2016-01-30 15:23 ` [PATCHv3 3/3] rdmacg: Added documentation for rdma controller Parav Pandit
