linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
@ 2016-08-31  8:37 Parav Pandit
  2016-08-31  8:37 ` [PATCHv12 1/3] rdmacg: Added rdma cgroup controller Parav Pandit
                   ` (4 more replies)
  0 siblings, 5 replies; 44+ messages in thread
From: Parav Pandit @ 2016-08-31  8:37 UTC (permalink / raw)
  To: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan,
	hannes, dledford, hch, liranl, sean.hefty, jgunthorpe, haggaie
  Cc: corbet, james.l.morris, serge, ogerlitz, matanb, akpm,
	linux-security-module, pandit.parav

rdmacg: IB/core: rdma controller support

Patch is generated and tested against below Doug's linux-rdma
git tree.

URL: git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma.git
Branch: master

Patchset is also compiled and tested against below Tejun's cgroup tree
using cgroup v2 mode.
URL: git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git
Branch: master

Overview:
Currently user space applications can easily take away all the rdma
device specific resources such as AH, CQ, QP, MR etc. Due to which other
applications in other cgroup or kernel space ULPs may not even get chance
to allocate any rdma resources. This results into service unavailibility.

RDMA cgroup addresses this issue by allowing resource accounting,
limit enforcement on per cgroup, per rdma device basis.

RDMA uverbs layer will enforce limits on well defined RDMA verb
resources without any HCA vendor device driver involvement.

RDMA uverbs layer will not do limit enforcement of HCA hw vendor
specific resources. Instead rdma cgroup provides set of APIs
through which vendor specific drivers can do resource accounting
by making use of rdma cgroup.

Resource limit enforcement is hierarchical.

When process is migrated with active RDMA resources, rdma cgroup
continues to uncharge original cgroup for allocated resource. New resource
is charged to current process's cgroup, which means if the process is
migrated with active resources, for new resources it will be charged to
new cgroup and old resources will be correctly uncharged from old cgroup.

Changes from v11:
  * (To address comments from Tejun)
   1. Added information in Documentation about nested-keyed file
  * (To address comments from Rami Rosen)
   1. Corrected typo errors in Documentation
  * (To address comments from Leon Romanovsky)
   1. Changed cgroup.c copyright to match with other files of the IB stack
      which is dual license GPLv2 + BSD

Changes from v10:
  * (To address comments from Tejun, Christoph)
   1. Removed unused rpool_list_lock from rdma_cgroup structure.
   2. Moved rdma resource definition to rdma cgroup instead of IB stack
   3. Added prefix rdmacg to static instances
   4. Simplified locking with single mutex for all operations
   5. Following approach of atomically allocating object and
      charging resource in hirerchy
   6. Code simplification due to single lock
   7. Using for_each_set_bit API for bit operation
   8. Renamed list heads as Objects instead of _head
   9. Renamed list entries as _node instead of _list.
  10. Made usage_num to 64 bit to avoid overflow and to avoid 
      additional code to track non zero number of usage counts.
  * (To address comments from Doug)
   1. Added copyright and GPLv2 license

Changes from v9:
  * (To address comments from Tejun)
   1. Included clear documentation of resources.
   2. Fixed issue of race condition of process migration during
      charging stage.
   3. Fixed comments and code to adhere to CodingStyle.
   4. Simplified and removed support to charge/uncharge multiple
      resource.
   5. Fixed replaced refcnt with usage_num that tracks how many
      resources are unused to trigger freeing the object.
   6. Simplified locking scheme to use single spin lock for whole
      subsystem.

Changes from v8:
 * Fixed compilation error.
 * Fixed warning reported by checkpatch script.

Changes from v7:
 * (To address comments from Haggai)
   1. Removed max_limit from query_limit function as it is
      unnecessary.
   2. Kept existing printk as it is to instead of replacing all
      with pr_warn except newly added printk.

Changes from v6:
 * (To address comments from Haggai)
   1. Made functions as void wherever necessary.
   2. Code cleanup related to correting few spelling mistakes
      in comments, correcting comments to reflect the code.
   3. Removed max_count parameter from query_limit as its not
      necessary.
   4. Fixed printk to pr_warn.
   5. Removed dependency on pd, instead relying on ib_dev.
   6. Added more documentation to reflect that IB stack honors
      configured limit during query_device operation.
   7. Added pr_warn and avoided system crash in case of
      IB stack or rdma cgroup bug.
 * (To address comments from Leon)
   1. Removed #ifdef CONFIG_CGROUP_RDMA from .c files and added
      necessary dummy functions in header file.
   2. Removed unwanted forward declaration.
 * Fixed uncharing to rdma controller after resource is released
   from verb layer, instead of uncharing first. This ensures that
   uncharging doesn't complete while resource is still allocated.
 
Changes from v5:
 * (To address comments from Tejun)
   1. Removed two type of resource pool, made is single type (as Tejun
      described in past comment)
   2. Removed match tokens and have array definition like "qp", "mr",
      "cq" etc.
   3. Wrote small parser and avoided match_token API as that won't work
      due to different array definitions
   4. Removed one-off remove API to unconfigure cgroup, instead all
      resource should be set to max.
   5. Removed resource pool type (user/default), instead having
      max_num_cnt, when ref_cnt drops to zero and
      max_num_cnt = total_rescource_cnt, pool is freed.
   6. Resource definition ownership is now only with IB stack at single
      header file, no longer in each low level driver.
      This goes through IB maintainer and other reviewers eyes.
      This continue to give flexibility to not force kernel upgrade for
      few enums additions for new resource type.
   7. Wherever possible pool lock is pushed out, except for hierarchical
      charging/unchanging points, as it not possible to do so, due to
      iterative process involves blocking allocations of rpool. Coming up
      more levels up to release locks doesn't make any sense either.
      This is anyway slow path where rpool is not allocated. Except for
      typical first resource allocation, this is less travelled path.
   8. Avoided %d manipulation due to removal of match_token and replaced
      with seq_putc etc friend functions.
 * Other minor cleanups.
 * Fixed rdmacg_register_device to return error in case of IB stack
   tries to register for than 64 resources.
 * Fixed not allowing negative value on resource setting.
 * Fixed cleaning up resource pools during device removal.
 * Simplfied and rename table length field to use ARRAY_SIZE macro.
 * Updated documentation to reflect single resource pool and shorter
   file names.

Changes from v4:
 * Fixed compilation errors for lockdep_assert_held reported by kbuild
   test robot
 * Fixed compilation warning reported by coccinelle for extra
   semicolon.
 * Fixed compilation error for inclusion of linux/parser.h which
   cannot be included in any header file, as that triggers multiple
   inclusion error. parser.h is included in C files which intent to
   use it.
 * Removed unused header file inclusion in cgroup_rdma.c

Changes from v3:
 * (To address comments from Tejun)
   1. Renamed cg_resource to rdmacg_resource
   2. Merged dealloc_cg_rpool and _dealloc_cg_rpool to single function
   3. Renamed _find_cg_rpool to find_cg_rpool_locked()
   5. Removed RDMACG_MAX_RESOURCE_INDEX limitation
   6. Fixed few alignments.
   7. Improved description for RDMA cgroup configuration menu
   8. Renamed cg_list_lock to rpool_list_lock to reflect the lock
      is for rpools.
   9. Renamed _get_cg_rpool to find_cg_rpool.
   10. Made creator as int variable, instead of atomic as its not 
      required to be atomic.
 * Fixed freeing right rpool during query_limit error path
 * Added copywrite for cgroup.c
 * Removed including parser.h from cgroup.c as its included by
   cgroup_rdma.h
 * Reduced try_charge functions to single function and removed duplicate
   comments.

Changes from v2:
 * Fixed compilation error reported by 0-DAY kernel test infrastructure
   for m68k architecture where CONFIG_CGROUP is also not defined.
 * Fixed comment in patch to refer to legacy mode of cgroup, changed to 
   refer to v1 and v2 version.
 * Added more information in commit log for rdma controller patch.

Changes from v1:
 * (To address comments from Tejun)
   a. reduces 3 patches to single patch
   b. removed resource word from the cgroup configuration files
   c. changed cgroup configuration file names to match other cgroups
   d. removed .list file and merged functionality with .max file
 * Based on comment to merge to single patch for rdma controller;
   IB/core patches are reduced to single patch.
 * Removed pid cgroup map and simplified design -
   Charge/Uncharge caller stack keeps track of the rdmacg for
   given resource. This removes the need to maintain and perform
   hash lookup. This also allows little more accurate resource
   charging/uncharging when process moved from one to other cgroup
   with active resources and continue to allocate more.
 * Critical fix: Removed rdma cgroup's dependency on the kernel module
   header files to avoid crashes when modules are upgraded without kernel
   upgrade, which is very common due to high amount of changes in IB stack
   and it is also shipped as individual kernel modules.
 * uboject extended to keep track of the owner rdma cgroup, so that same
   rdmacg can be used while uncharging.
 * Added support functions to hide details of rdmacg device in uverbs
   modules for cases of cgroup enabled/disabled at compile time. This
   avoids multiple ifdefs for every API in uverbs layer.
 * Removed failure counters in first patch, which will be added once
   initial feature is merged.
 * Fixed stale rpool access which is getting freed, while doing
   configuration to rdma.verb.max file.
 * Fixed rpool resource leak while querying max, current values.

Changes from v0:
(To address comments from Haggai, Doug, Liran, Tejun, Sean, Jason)
 * Redesigned to support per device per cgroup limit settings by bringing
   concept of resource pool.
 * Redesigned to let IB stack define the resources instead of rdma
   controller using resource template.
 * Redesigned to support hw vendor specific limits setting
   (optional to drivers).
 * Created new rdma controller instead of piggyback on device cgroup.
 * Fixed race conditions for multiple tasks sharing rdma resources.
 * Removed dependency on the task_struct.


Parav Pandit (3):
  rdmacg: Added rdma cgroup controller
  IB/core: added support to use rdma cgroup controller
  rdmacg: Added documentation for rdmacg

 Documentation/cgroup-v1/rdma.txt    | 117 +++++++
 Documentation/cgroup-v2.txt         |  45 +++
 drivers/infiniband/core/Makefile    |   1 +
 drivers/infiniband/core/cgroup.c    |  93 +++++
 drivers/infiniband/core/core_priv.h |  41 +++
 drivers/infiniband/core/device.c    |  10 +
 include/linux/cgroup_rdma.h         |  66 ++++
 include/linux/cgroup_subsys.h       |   4 +
 init/Kconfig                        |  10 +
 kernel/Makefile                     |   1 +
 kernel/cgroup_rdma.c                | 664 ++++++++++++++++++++++++++++++++++++
 11 files changed, 1052 insertions(+)
 create mode 100644 Documentation/cgroup-v1/rdma.txt
 create mode 100644 drivers/infiniband/core/cgroup.c
 create mode 100644 include/linux/cgroup_rdma.h
 create mode 100644 kernel/cgroup_rdma.c

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-08-31  8:37 [PATCHv12 0/3] rdmacg: IB/core: rdma controller support Parav Pandit
@ 2016-08-31  8:37 ` Parav Pandit
  2016-08-31  9:38   ` Leon Romanovsky
  2016-08-31 15:07   ` Matan Barak
  2016-08-31  8:37 ` [PATCHv12 2/3] IB/core: added support to use " Parav Pandit
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 44+ messages in thread
From: Parav Pandit @ 2016-08-31  8:37 UTC (permalink / raw)
  To: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan,
	hannes, dledford, hch, liranl, sean.hefty, jgunthorpe, haggaie
  Cc: corbet, james.l.morris, serge, ogerlitz, matanb, akpm,
	linux-security-module, pandit.parav

Added rdma cgroup controller that does accounting, limit enforcement
on rdma/IB verbs and hw resources.

Added rdma cgroup header file which defines its APIs to perform
charing/uncharing functionality. It also defined APIs for RDMA/IB
stack for device registration. Devices which are registered will
participate in controller functions of accounting and limit
enforcements. It define rdmacg_device structure to bind IB stack
and RDMA cgroup controller.

RDMA resources are tracked using resource pool. Resource pool is per
device, per cgroup entity which allows setting up accounting limits
on per device basis.

Currently resources are defined by the RDMA cgroup.

Resource pool is created/destroyed dynamically whenever
charging/uncharging occurs respectively and whenever user
configuration is done. Its a tradeoff of memory vs little more code
space that creates resource pool object whenever necessary, instead of
creating them during cgroup creation and device registration time.

Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
---
 include/linux/cgroup_rdma.h   |  66 +++++
 include/linux/cgroup_subsys.h |   4 +
 init/Kconfig                  |  10 +
 kernel/Makefile               |   1 +
 kernel/cgroup_rdma.c          | 664 ++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 745 insertions(+)
 create mode 100644 include/linux/cgroup_rdma.h
 create mode 100644 kernel/cgroup_rdma.c

diff --git a/include/linux/cgroup_rdma.h b/include/linux/cgroup_rdma.h
new file mode 100644
index 0000000..6710e28
--- /dev/null
+++ b/include/linux/cgroup_rdma.h
@@ -0,0 +1,66 @@
+/*
+ * Copyright (C) 2016 Parav Pandit <pandit.parav@gmail.com>
+ *
+ * This file is subject to the terms and conditions of version 2 of the GNU
+ * General Public License. See the file COPYING in the main directory of the
+ * Linux distribution for more details.
+ */
+
+#ifndef _CGROUP_RDMA_H
+#define _CGROUP_RDMA_H
+
+#include <linux/cgroup.h>
+
+enum rdmacg_resource_type {
+	RDMACG_VERB_RESOURCE_UCTX,
+	RDMACG_VERB_RESOURCE_AH,
+	RDMACG_VERB_RESOURCE_PD,
+	RDMACG_VERB_RESOURCE_CQ,
+	RDMACG_VERB_RESOURCE_MR,
+	RDMACG_VERB_RESOURCE_MW,
+	RDMACG_VERB_RESOURCE_SRQ,
+	RDMACG_VERB_RESOURCE_QP,
+	RDMACG_VERB_RESOURCE_FLOW,
+	/*
+	 * add any hw specific resource here as RDMA_HW_RESOURCE_NAME
+	 */
+	RDMACG_RESOURCE_MAX,
+};
+
+#ifdef CONFIG_CGROUP_RDMA
+
+struct rdma_cgroup {
+	struct cgroup_subsys_state	css;
+
+	/*
+	 * head to keep track of all resource pools
+	 * that belongs to this cgroup.
+	 */
+	struct list_head		rpools;
+};
+
+struct rdmacg_device {
+	struct list_head	dev_node;
+	struct list_head	rpools;
+	char			*name;
+};
+
+/*
+ * APIs for RDMA/IB stack to publish when a device wants to
+ * participate in resource accounting
+ */
+int rdmacg_register_device(struct rdmacg_device *device);
+void rdmacg_unregister_device(struct rdmacg_device *device);
+
+/* APIs for RDMA/IB stack to charge/uncharge pool specific resources */
+int rdmacg_try_charge(struct rdma_cgroup **rdmacg,
+		      struct rdmacg_device *device,
+		      enum rdmacg_resource_type index);
+void rdmacg_uncharge(struct rdma_cgroup *cg,
+		     struct rdmacg_device *device,
+		     enum rdmacg_resource_type index);
+void rdmacg_query_limit(struct rdmacg_device *device,
+			int *limits);
+
+#endif	/* CONFIG_CGROUP_RDMA */
+#endif	/* _CGROUP_RDMA_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 0df0336a..d0e597c 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -56,6 +56,10 @@ SUBSYS(hugetlb)
 SUBSYS(pids)
 #endif
 
+#if IS_ENABLED(CONFIG_CGROUP_RDMA)
+SUBSYS(rdma)
+#endif
+
 /*
  * The following subsystems are not supported on the default hierarchy.
  */
diff --git a/init/Kconfig b/init/Kconfig
index cac3f09..c7dc64b 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1080,6 +1080,16 @@ config CGROUP_PIDS
 	  since the PIDs limit only affects a process's ability to fork, not to
 	  attach to a cgroup.
 
+config CGROUP_RDMA
+	bool "RDMA controller"
+	help
+	  Provides enforcement of RDMA resources defined by IB stack.
+	  It is fairly easy for consumers to exhaust RDMA resources, which
+	  can result into resource unavailability to other consumers.
+	  RDMA controller is designed to stop this from happening.
+	  Attaching processes with active RDMA resources to the cgroup
+	  hierarchy is allowed even if can cross the hierarchy's limit.
+
 config CGROUP_FREEZER
 	bool "Freezer controller"
 	help
diff --git a/kernel/Makefile b/kernel/Makefile
index e2ec54e..d2b76d0 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -67,6 +67,7 @@ obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup.o
 obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
 obj-$(CONFIG_CGROUP_PIDS) += cgroup_pids.o
+obj-$(CONFIG_CGROUP_RDMA) += cgroup_rdma.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_UTS_NS) += utsname.o
 obj-$(CONFIG_USER_NS) += user_namespace.o
diff --git a/kernel/cgroup_rdma.c b/kernel/cgroup_rdma.c
new file mode 100644
index 0000000..6ab9bd9
--- /dev/null
+++ b/kernel/cgroup_rdma.c
@@ -0,0 +1,664 @@
+/*
+ * RDMA resource limiting controller for cgroups.
+ *
+ * Used to allow a cgroup hierarchy to stop processes from consuming
+ * additional RDMA resources after a certain limit is reached.
+ *
+ * Copyright (C) 2016 Parav Pandit <pandit.parav@gmail.com>
+ *
+ * This file is subject to the terms and conditions of version 2 of the GNU
+ * General Public License. See the file COPYING in the main directory of the
+ * Linux distribution for more details.
+ */
+
+#include <linux/bitops.h>
+#include <linux/slab.h>
+#include <linux/seq_file.h>
+#include <linux/cgroup.h>
+#include <linux/parser.h>
+#include <linux/cgroup_rdma.h>
+
+#define RDMACG_MAX_STR "max"
+
+/*
+ * Protects list of resource pools maintained on per cgroup basis
+ * and rdma device list.
+ */
+static DEFINE_MUTEX(rdmacg_mutex);
+static LIST_HEAD(rdmacg_devices);
+
+enum rdmacg_file_type {
+	RDMACG_RESOURCE_TYPE_MAX,
+	RDMACG_RESOURCE_TYPE_STAT,
+};
+
+/*
+ * resource table definition as to be seen by the user.
+ * Need to add entries to it when more resources are
+ * added/defined at IB verb/core layer.
+ */
+static char const *rdmacg_resource_names[] = {
+	[RDMACG_VERB_RESOURCE_UCTX]	= "uctx",
+	[RDMACG_VERB_RESOURCE_AH]	= "ah",
+	[RDMACG_VERB_RESOURCE_PD]	= "pd",
+	[RDMACG_VERB_RESOURCE_CQ]	= "cq",
+	[RDMACG_VERB_RESOURCE_MR]	= "mr",
+	[RDMACG_VERB_RESOURCE_MW]	= "mw",
+	[RDMACG_VERB_RESOURCE_SRQ]	= "srq",
+	[RDMACG_VERB_RESOURCE_QP]	= "qp",
+	[RDMACG_VERB_RESOURCE_FLOW]	= "flow",
+};
+
+/* resource tracker for each resource of rdma cgroup */
+struct rdmacg_resource {
+	int max;
+	int usage;
+};
+
+/*
+ * resource pool object which represents per cgroup, per device
+ * resources. There are multiple instances of this object per cgroup,
+ * therefore it cannot be embedded within rdma_cgroup structure. It
+ * is maintained as list.
+ */
+struct rdmacg_resource_pool {
+	struct rdmacg_device	*device;
+	struct rdmacg_resource	resources[RDMACG_RESOURCE_MAX];
+
+	struct list_head	cg_node;
+	struct list_head	dev_node;
+
+	/* count active user tasks of this pool */
+	u64			usage_sum;
+	/* total number counts which are set to max */
+	int			num_max_cnt;
+};
+
+static struct rdma_cgroup *css_rdmacg(struct cgroup_subsys_state *css)
+{
+	return container_of(css, struct rdma_cgroup, css);
+}
+
+static struct rdma_cgroup *parent_rdmacg(struct rdma_cgroup *cg)
+{
+	return css_rdmacg(cg->css.parent);
+}
+
+static inline struct rdma_cgroup *get_current_rdmacg(void)
+{
+	return css_rdmacg(task_get_css(current, rdma_cgrp_id));
+}
+
+static void set_resource_limit(struct rdmacg_resource_pool *rpool,
+			       int index, int new_max)
+{
+	if (new_max == S32_MAX) {
+		if (rpool->resources[index].max != S32_MAX)
+			rpool->num_max_cnt++;
+	} else {
+		if (rpool->resources[index].max == S32_MAX)
+			rpool->num_max_cnt--;
+	}
+	rpool->resources[index].max = new_max;
+}
+
+static void set_all_resource_max_limit(struct rdmacg_resource_pool *rpool)
+{
+	int i;
+
+	for (i = 0; i < RDMACG_RESOURCE_MAX; i++)
+		set_resource_limit(rpool, i, S32_MAX);
+}
+
+static void free_cg_rpool_locked(struct rdmacg_resource_pool *rpool)
+{
+	lockdep_assert_held(&rdmacg_mutex);
+
+	list_del(&rpool->cg_node);
+	list_del(&rpool->dev_node);
+	kfree(rpool);
+}
+
+static struct rdmacg_resource_pool *
+find_cg_rpool_locked(struct rdma_cgroup *cg,
+		     struct rdmacg_device *device)
+
+{
+	struct rdmacg_resource_pool *pool;
+
+	lockdep_assert_held(&rdmacg_mutex);
+
+	list_for_each_entry(pool, &cg->rpools, cg_node)
+		if (pool->device == device)
+			return pool;
+
+	return NULL;
+}
+
+static struct rdmacg_resource_pool *
+get_cg_rpool_locked(struct rdma_cgroup *cg, struct rdmacg_device *device)
+{
+	struct rdmacg_resource_pool *rpool;
+
+	rpool = find_cg_rpool_locked(cg, device);
+	if (rpool)
+		return rpool;
+
+	rpool = kzalloc(sizeof(*rpool), GFP_KERNEL);
+	if (!rpool)
+		return ERR_PTR(-ENOMEM);
+
+	rpool->device = device;
+	set_all_resource_max_limit(rpool);
+
+	INIT_LIST_HEAD(&rpool->cg_node);
+	INIT_LIST_HEAD(&rpool->dev_node);
+	list_add_tail(&rpool->cg_node, &cg->rpools);
+	list_add_tail(&rpool->dev_node, &device->rpools);
+	return rpool;
+}
+
+/**
+ * uncharge_cg_locked - uncharge resource for rdma cgroup
+ * @cg: pointer to cg to uncharge and all parents in hierarchy
+ * @device: pointer to rdmacg device
+ * @index: index of the resource to uncharge in cg (resource pool)
+ *
+ * It also frees the resource pool which was created as part of
+ * charging operation when there are no resources attached to
+ * resource pool.
+ */
+static void
+uncharge_cg_locked(struct rdma_cgroup *cg,
+		   struct rdmacg_device *device,
+		   enum rdmacg_resource_type index)
+{
+	struct rdmacg_resource_pool *rpool;
+
+	rpool = find_cg_rpool_locked(cg, device);
+
+	/*
+	 * rpool cannot be null at this stage. Let kernel operate in case
+	 * if there a bug in IB stack or rdma controller, instead of crashing
+	 * the system.
+	 */
+	if (unlikely(!rpool)) {
+		pr_warn("Invalid device %p or rdma cgroup %p\n", cg, device);
+		return;
+	}
+
+	rpool->resources[index].usage--;
+
+	/*
+	 * A negative count (or overflow) is invalid,
+	 * it indicates a bug in the rdma controller.
+	 */
+	WARN_ON_ONCE(rpool->resources[index].usage < 0);
+	rpool->usage_sum--;
+	if (rpool->usage_sum == 0 &&
+	    rpool->num_max_cnt == RDMACG_RESOURCE_MAX) {
+		/*
+		 * No user of the rpool and all entries are set to max, so
+		 * safe to delete this rpool.
+		 */
+		free_cg_rpool_locked(rpool);
+	}
+}
+
+/**
+ * rdmacg_uncharge_hirerchy - hierarchically uncharge rdma resource count
+ * @device: pointer to rdmacg device
+ * @stop_cg: while traversing hirerchy, when meet with stop_cg cgroup
+ *           stop uncharging
+ * @index: index of the resource to uncharge in cg in given resource pool
+ */
+static void rdmacg_uncharge_hirerchy(struct rdma_cgroup *cg,
+				     struct rdmacg_device *device,
+				     struct rdma_cgroup *stop_cg,
+				     enum rdmacg_resource_type index)
+{
+	struct rdma_cgroup *p;
+
+	mutex_lock(&rdmacg_mutex);
+
+	for (p = cg; p != stop_cg; p = parent_rdmacg(p))
+		uncharge_cg_locked(p, device, index);
+
+	mutex_unlock(&rdmacg_mutex);
+
+	css_put(&cg->css);
+}
+
+/**
+ * rdmacg_uncharge - hierarchically uncharge rdma resource count
+ * @device: pointer to rdmacg device
+ * @index: index of the resource to uncharge in cgroup in given resource pool
+ */
+void rdmacg_uncharge(struct rdma_cgroup *cg,
+		     struct rdmacg_device *device,
+		     enum rdmacg_resource_type index)
+{
+	if (index >= RDMACG_RESOURCE_MAX)
+		return;
+
+	rdmacg_uncharge_hirerchy(cg, device, NULL, index);
+}
+EXPORT_SYMBOL(rdmacg_uncharge);
+
+/**
+ * rdmacg_try_charge - hierarchically try to charge the rdma resource
+ * @rdmacg: pointer to rdma cgroup which will own this resource
+ * @device: pointer to rdmacg device
+ * @index: index of the resource to charge in cgroup (resource pool)
+ *
+ * This function follows charging resource in hierarchical way.
+ * It will fail if the charge would cause the new value to exceed the
+ * hierarchical limit.
+ * Returns 0 if the charge succeded, otherwise -EAGAIN, -ENOMEM or -EINVAL.
+ * Returns pointer to rdmacg for this resource when charging is successful.
+ *
+ * Charger needs to account resources on two criteria.
+ * (a) per cgroup & (b) per device resource usage.
+ * Per cgroup resource usage ensures that tasks of cgroup doesn't cross
+ * the configured limits. Per device provides granular configuration
+ * in multi device usage. It allocates resource pool in the hierarchy
+ * for each parent it come across for first resource. Later on resource
+ * pool will be available. Therefore it will be much faster thereon
+ * to charge/uncharge.
+ */
+int rdmacg_try_charge(struct rdma_cgroup **rdmacg,
+		      struct rdmacg_device *device,
+		      enum rdmacg_resource_type index)
+{
+	struct rdma_cgroup *cg, *p;
+	struct rdmacg_resource_pool *rpool;
+	s64 new;
+	int ret = 0;
+
+	if (index >= RDMACG_RESOURCE_MAX)
+		return -EINVAL;
+
+	/*
+	 * hold on to css, as cgroup can be removed but resource
+	 * accounting happens on css.
+	 */
+	cg = get_current_rdmacg();
+
+	mutex_lock(&rdmacg_mutex);
+	for (p = cg; p; p = parent_rdmacg(p)) {
+		rpool = get_cg_rpool_locked(p, device);
+		if (IS_ERR_OR_NULL(rpool)) {
+			ret = PTR_ERR(rpool);
+			goto err;
+		} else {
+			new = rpool->resources[index].usage + 1;
+			if (new > rpool->resources[index].max) {
+				ret = -EAGAIN;
+				goto err;
+			} else {
+				rpool->resources[index].usage = new;
+				rpool->usage_sum++;
+			}
+		}
+	}
+	mutex_unlock(&rdmacg_mutex);
+
+	*rdmacg = cg;
+	return 0;
+
+err:
+	mutex_unlock(&rdmacg_mutex);
+	rdmacg_uncharge_hirerchy(cg, device, p, index);
+	return ret;
+}
+EXPORT_SYMBOL(rdmacg_try_charge);
+
+/**
+ * rdmacg_register_device - register rdmacg device to rdma controller.
+ * @device: pointer to rdmacg device whose resources need to be accounted.
+ *
+ * If IB stack wish a device to participate in rdma cgroup resource
+ * tracking, it must invoke this API to register with rdma cgroup before
+ * any user space application can start using the RDMA resources.
+ * Returns 0 on success or EINVAL when table length given is beyond
+ * supported size.
+ */
+int rdmacg_register_device(struct rdmacg_device *device)
+{
+	INIT_LIST_HEAD(&device->dev_node);
+	INIT_LIST_HEAD(&device->rpools);
+
+	mutex_lock(&rdmacg_mutex);
+	list_add_tail(&device->dev_node, &rdmacg_devices);
+	mutex_unlock(&rdmacg_mutex);
+	return 0;
+}
+EXPORT_SYMBOL(rdmacg_register_device);
+
+/**
+ * rdmacg_unregister_device - unregister rdmacg device from rdma controller.
+ * @device: pointer to rdmacg device which was previously registered with rdma
+ *          controller using rdmacg_register_device().
+ *
+ * IB stack must invoke this after all the resources of the IB device
+ * are destroyed and after ensuring that no more resources will be created
+ * when this API is invoked.
+ */
+void rdmacg_unregister_device(struct rdmacg_device *device)
+{
+	struct rdmacg_resource_pool *rpool, *tmp;
+
+	/*
+	 * Synchronize with any active resource settings,
+	 * usage query happening via configfs.
+	 */
+	mutex_lock(&rdmacg_mutex);
+	list_del_init(&device->dev_node);
+
+	/*
+	 * Now that this device is off the cgroup list, its safe to free
+	 * all the rpool resources.
+	 */
+	list_for_each_entry_safe(rpool, tmp, &device->rpools, dev_node)
+		free_cg_rpool_locked(rpool);
+
+	mutex_unlock(&rdmacg_mutex);
+}
+EXPORT_SYMBOL(rdmacg_unregister_device);
+
+/**
+ * rdmacg_query_limit - query the resource limits that
+ * might have been configured by the user.
+ * @device: pointer to rdmacg device
+ * @limits: pointer to an array of limits where rdma cg will provide
+ *          the configured limits of the cgroup.
+ *
+ * This function honors resource limit in hierarchical way.
+ * In a cgroup hirarchy, it considers the limit of a controller which has
+ * smallest limit configured.
+ */
+void rdmacg_query_limit(struct rdmacg_device *device, int *limits)
+{
+	struct rdma_cgroup *cg, *p;
+	struct rdmacg_resource_pool *rpool;
+	int i;
+
+	for (i = 0; i < RDMACG_RESOURCE_MAX; i++)
+		limits[i] = S32_MAX;
+
+	cg = get_current_rdmacg();
+	/*
+	 * Check in hirerchy which pool get the least amount of
+	 * resource limits.
+	 */
+	mutex_lock(&rdmacg_mutex);
+	for (p = cg; p; p = parent_rdmacg(p)) {
+		rpool = find_cg_rpool_locked(cg, device);
+		if (!rpool)
+			continue;
+
+		for (i = 0; i < RDMACG_RESOURCE_MAX; i++)
+			limits[i] = min_t(int, limits[i],
+					rpool->resources[i].max);
+	}
+	mutex_unlock(&rdmacg_mutex);
+	css_put(&cg->css);
+}
+EXPORT_SYMBOL(rdmacg_query_limit);
+
+static int parse_resource(char *c, int *intval)
+{
+	substring_t argstr;
+	const char **table = &rdmacg_resource_names[0];
+	char *name, *value = c;
+	size_t len;
+	int ret, i = 0;
+
+	name = strsep(&value, "=");
+	if (!name || !value)
+		return -EINVAL;
+
+	len = strlen(value);
+
+	for (i = 0; i < RDMACG_RESOURCE_MAX; i++) {
+		if (strcmp(table[i], name))
+			continue;
+
+		argstr.from = value;
+		argstr.to = value + len;
+
+		ret = match_int(&argstr, intval);
+		if (ret >= 0) {
+			if (*intval < 0)
+				break;
+			return i;
+		}
+		if (strncmp(value, RDMACG_MAX_STR, len) == 0) {
+			*intval = S32_MAX;
+			return i;
+		}
+		break;
+	}
+	return -EINVAL;
+}
+
+static int rdmacg_parse_limits(char *options,
+			       int *new_limits, unsigned long *enables)
+{
+	char *c;
+	int err = -EINVAL;
+
+	/* parse resource options */
+	while ((c = strsep(&options, " ")) != NULL) {
+		int index, intval;
+
+		index = parse_resource(c, &intval);
+		if (index < 0)
+			goto err;
+
+		new_limits[index] = intval;
+		*enables |= BIT(index);
+	}
+	return 0;
+
+err:
+	return err;
+}
+
+static struct rdmacg_device *rdmacg_get_device_locked(const char *name)
+{
+	struct rdmacg_device *device;
+
+	lockdep_assert_held(&rdmacg_mutex);
+
+	list_for_each_entry(device, &rdmacg_devices, dev_node)
+		if (!strcmp(name, device->name))
+			return device;
+
+	return NULL;
+}
+
+static ssize_t rdmacg_resource_set_max(struct kernfs_open_file *of,
+				       char *buf, size_t nbytes, loff_t off)
+{
+	struct rdma_cgroup *cg = css_rdmacg(of_css(of));
+	const char *dev_name;
+	struct rdmacg_resource_pool *rpool;
+	struct rdmacg_device *device;
+	char *options = strstrip(buf);
+	int *new_limits;
+	unsigned long enables = 0;
+	int i = 0, ret = 0;
+
+	/* extract the device name first */
+	dev_name = strsep(&options, " ");
+	if (!dev_name) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	new_limits = kcalloc(RDMACG_RESOURCE_MAX, sizeof(int), GFP_KERNEL);
+	if (!new_limits) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	ret = rdmacg_parse_limits(options, new_limits, &enables);
+	if (ret)
+		goto parse_err;
+
+	/* acquire lock to synchronize with hot plug devices */
+	mutex_lock(&rdmacg_mutex);
+
+	device = rdmacg_get_device_locked(dev_name);
+	if (!device) {
+		ret = -ENODEV;
+		goto dev_err;
+	}
+
+	rpool = get_cg_rpool_locked(cg, device);
+	if (IS_ERR_OR_NULL(rpool)) {
+		ret = PTR_ERR(rpool);
+		goto dev_err;
+	}
+
+	/* now set the new limits of the rpool */
+	for_each_set_bit(i, &enables, RDMACG_RESOURCE_MAX)
+		set_resource_limit(rpool, i, new_limits[i]);
+
+	if (rpool->usage_sum == 0 &&
+	    rpool->num_max_cnt == RDMACG_RESOURCE_MAX) {
+		/*
+		 * No user of the rpool and all entries are set to max, so
+		 * safe to delete this rpool.
+		 */
+		free_cg_rpool_locked(rpool);
+	}
+
+dev_err:
+	mutex_unlock(&rdmacg_mutex);
+
+parse_err:
+	kfree(new_limits);
+
+err:
+	return ret ?: nbytes;
+}
+
+static void print_rpool_values(struct seq_file *sf,
+			       struct rdmacg_resource_pool *rpool)
+{
+	enum rdmacg_file_type sf_type;
+	int i;
+	u32 value;
+
+	sf_type = seq_cft(sf)->private;
+
+	for (i = 0; i < RDMACG_RESOURCE_MAX; i++) {
+		seq_puts(sf, rdmacg_resource_names[i]);
+		seq_putc(sf, '=');
+		if (sf_type == RDMACG_RESOURCE_TYPE_MAX) {
+			if (rpool)
+				value = rpool->resources[i].max;
+			else
+				value = S32_MAX;
+		} else {
+			if (rpool)
+				value = rpool->resources[i].usage;
+		}
+
+		if (value == S32_MAX)
+			seq_puts(sf, RDMACG_MAX_STR);
+		else
+			seq_printf(sf, "%d", value);
+		seq_putc(sf, ' ');
+	}
+}
+
+static int rdmacg_resource_read(struct seq_file *sf, void *v)
+{
+	struct rdmacg_device *device;
+	struct rdmacg_resource_pool *rpool;
+	struct rdma_cgroup *cg = css_rdmacg(seq_css(sf));
+
+	mutex_lock(&rdmacg_mutex);
+
+	list_for_each_entry(device, &rdmacg_devices, dev_node) {
+		seq_printf(sf, "%s ", device->name);
+
+		rpool = find_cg_rpool_locked(cg, device);
+		print_rpool_values(sf, rpool);
+
+		seq_putc(sf, '\n');
+	}
+
+	mutex_unlock(&rdmacg_mutex);
+	return 0;
+}
+
+static struct cftype rdmacg_files[] = {
+	{
+		.name = "max",
+		.write = rdmacg_resource_set_max,
+		.seq_show = rdmacg_resource_read,
+		.private = RDMACG_RESOURCE_TYPE_MAX,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+	{
+		.name = "current",
+		.seq_show = rdmacg_resource_read,
+		.private = RDMACG_RESOURCE_TYPE_STAT,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+	{ }	/* terminate */
+};
+
+static struct cgroup_subsys_state *
+rdmacg_css_alloc(struct cgroup_subsys_state *parent)
+{
+	struct rdma_cgroup *cg;
+
+	cg = kzalloc(sizeof(*cg), GFP_KERNEL);
+	if (!cg)
+		return ERR_PTR(-ENOMEM);
+
+	INIT_LIST_HEAD(&cg->rpools);
+	return &cg->css;
+}
+
+static void rdmacg_css_free(struct cgroup_subsys_state *css)
+{
+	struct rdma_cgroup *cg = css_rdmacg(css);
+
+	kfree(cg);
+}
+
+/**
+ * rdmacg_css_offline - cgroup css_offline callback
+ * @css: css of interest
+ *
+ * This function is called when @css is about to go away and responsible
+ * for shooting down all rdmacg associated with @css. As part of that it
+ * marks all the resource pool entries to max value, so that when resources are
+ * uncharged, associated resource pool can be freed as well.
+ */
+static void rdmacg_css_offline(struct cgroup_subsys_state *css)
+{
+	struct rdma_cgroup *cg = css_rdmacg(css);
+	struct rdmacg_resource_pool *rpool;
+
+	mutex_lock(&rdmacg_mutex);
+
+	list_for_each_entry(rpool, &cg->rpools, cg_node)
+		set_all_resource_max_limit(rpool);
+
+	mutex_unlock(&rdmacg_mutex);
+}
+
+struct cgroup_subsys rdma_cgrp_subsys = {
+	.css_alloc	= rdmacg_css_alloc,
+	.css_free	= rdmacg_css_free,
+	.css_offline	= rdmacg_css_offline,
+	.legacy_cftypes	= rdmacg_files,
+	.dfl_cftypes	= rdmacg_files,
+};
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCHv12 2/3] IB/core: added support to use rdma cgroup controller
  2016-08-31  8:37 [PATCHv12 0/3] rdmacg: IB/core: rdma controller support Parav Pandit
  2016-08-31  8:37 ` [PATCHv12 1/3] rdmacg: Added rdma cgroup controller Parav Pandit
@ 2016-08-31  8:37 ` Parav Pandit
  2016-08-31  8:37 ` [PATCHv12 3/3] rdmacg: Added documentation for rdmacg Parav Pandit
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 44+ messages in thread
From: Parav Pandit @ 2016-08-31  8:37 UTC (permalink / raw)
  To: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan,
	hannes, dledford, hch, liranl, sean.hefty, jgunthorpe, haggaie
  Cc: corbet, james.l.morris, serge, ogerlitz, matanb, akpm,
	linux-security-module, pandit.parav

Added support APIs for IB core to register/unregister every IB/RDMA
device with rdma cgroup for tracking verbs and hw resources.
IB core registers with rdma cgroup controller.
Added support APIs for uverbs layer to make use of rdma controller.
Added uverbs layer to perform resource charge/uncharge functionality.
Added support during query_device uverb operation to ensure it
returns resource limits by honoring rdma cgroup configured limits.

Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
---
 drivers/infiniband/core/Makefile    |  1 +
 drivers/infiniband/core/cgroup.c    | 93 +++++++++++++++++++++++++++++++++++++
 drivers/infiniband/core/core_priv.h | 41 ++++++++++++++++
 drivers/infiniband/core/device.c    | 10 ++++
 4 files changed, 145 insertions(+)
 create mode 100644 drivers/infiniband/core/cgroup.c

diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index edaae9f..e426ac8 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -13,6 +13,7 @@ ib_core-y :=			packer.o ud_header.o verbs.o cq.o rw.o sysfs.o \
 				multicast.o mad.o smi.o agent.o mad_rmpp.o
 ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
 ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += umem_odp.o umem_rbtree.o
+ib_core-$(CONFIG_CGROUP_RDMA) += cgroup.o
 
 ib_cm-y :=			cm.o
 
diff --git a/drivers/infiniband/core/cgroup.c b/drivers/infiniband/core/cgroup.c
new file mode 100644
index 0000000..ffe7234
--- /dev/null
+++ b/drivers/infiniband/core/cgroup.c
@@ -0,0 +1,93 @@
+/*
+ * Copyright (C) 2016 Parav Pandit <pandit.parav@gmail.com>
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include "core_priv.h"
+
+/*
+ * resource table definition as to be seen by the user.
+ * Need to add entries to it when more resources are
+ * added/defined at IB verb/core layer.
+ */
+
+/**
+ * ib_device_register_rdmacg - register with rdma cgroup.
+ * @device: device to register to participate in resource
+ *          accounting by rdma cgroup.
+ *
+ * Register with the rdma cgroup. Should be called before
+ * exposing rdma device to user space applications to avoid
+ * resource accounting leak.
+ * Returns 0 on success or otherwise failure code.
+ */
+int ib_device_register_rdmacg(struct ib_device *device)
+{
+	device->cg_device.name = device->name;
+	return rdmacg_register_device(&device->cg_device);
+}
+
+/**
+ * ib_device_unregister_rdmacg - unregister with rdma cgroup.
+ * @device: device to unregister.
+ *
+ * Unregister with the rdma cgroup. Should be called after
+ * all the resources are deallocated, and after a stage when any
+ * other resource allocation by user application cannot be done
+ * for this device to avoid any leak in accounting.
+ */
+void ib_device_unregister_rdmacg(struct ib_device *device)
+{
+	rdmacg_unregister_device(&device->cg_device);
+}
+
+int ib_rdmacg_try_charge(struct ib_rdmacg_object *cg_obj,
+			 struct ib_device *device,
+			 enum rdmacg_resource_type resource_index)
+{
+	return rdmacg_try_charge(&cg_obj->cg, &device->cg_device,
+				 resource_index);
+}
+EXPORT_SYMBOL(ib_rdmacg_try_charge);
+
+void ib_rdmacg_uncharge(struct ib_rdmacg_object *cg_obj,
+			struct ib_device *device,
+			enum rdmacg_resource_type resource_index)
+{
+	rdmacg_uncharge(cg_obj->cg, &device->cg_device,
+			resource_index);
+}
+EXPORT_SYMBOL(ib_rdmacg_uncharge);
+
+void ib_rdmacg_query_limit(struct ib_device *device, int *limits)
+{
+	rdmacg_query_limit(&device->cg_device, limits);
+}
+EXPORT_SYMBOL(ib_rdmacg_query_limit);
diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h
index 19d499d..d1e432e 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -35,6 +35,7 @@
 
 #include <linux/list.h>
 #include <linux/spinlock.h>
+#include <linux/cgroup_rdma.h>
 
 #include <rdma/ib_verbs.h>
 
@@ -124,6 +125,46 @@ int ib_cache_setup_one(struct ib_device *device);
 void ib_cache_cleanup_one(struct ib_device *device);
 void ib_cache_release_one(struct ib_device *device);
 
+#ifdef CONFIG_CGROUP_RDMA
+int ib_device_register_rdmacg(struct ib_device *device);
+void ib_device_unregister_rdmacg(struct ib_device *device);
+
+int ib_rdmacg_try_charge(struct ib_rdmacg_object *cg_obj,
+			 struct ib_device *device,
+			 enum rdmacg_resource_type resource_index);
+
+void ib_rdmacg_uncharge(struct ib_rdmacg_object *cg_obj,
+			struct ib_device *device,
+			enum rdmacg_resource_type resource_index);
+
+void ib_rdmacg_query_limit(struct ib_device *device, int *limits);
+#else
+static inline int ib_device_register_rdmacg(struct ib_device *device)
+{ return 0; }
+
+static inline void ib_device_unregister_rdmacg(struct ib_device *device)
+{ }
+
+static inline int ib_rdmacg_try_charge(struct ib_rdmacg_object *cg_obj,
+				       struct ib_device *device,
+				       enum rdmacg_resource_type resource_index)
+{ return 0; }
+
+static inline void ib_rdmacg_uncharge(struct ib_rdmacg_object *cg_obj,
+				      struct ib_device *device,
+				      enum rdmacg_resource_type resource_index)
+{ }
+
+static inline void ib_rdmacg_query_limit(struct ib_device *device,
+					 int *limits)
+{
+	int i;
+
+	for (i = 0; i < RDMACG_RESOURCE_MAX; i++)
+		limits[i] = S32_MAX;
+}
+#endif
+
 static inline bool rdma_is_upper_dev_rcu(struct net_device *dev,
 					 struct net_device *upper)
 {
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 760ef60..08e3259 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -363,10 +363,18 @@ int ib_register_device(struct ib_device *device,
 		goto out;
 	}
 
+	ret = ib_device_register_rdmacg(device);
+	if (ret) {
+		pr_warn("Couldn't register device with rdma cgroup\n");
+		ib_cache_cleanup_one(device);
+		goto out;
+	}
+
 	memset(&device->attrs, 0, sizeof(device->attrs));
 	ret = device->query_device(device, &device->attrs, &uhw);
 	if (ret) {
 		pr_warn("Couldn't query the device attributes\n");
+		ib_device_unregister_rdmacg(device);
 		ib_cache_cleanup_one(device);
 		goto out;
 	}
@@ -375,6 +383,7 @@ int ib_register_device(struct ib_device *device,
 	if (ret) {
 		pr_warn("Couldn't register device %s with driver model\n",
 			device->name);
+		ib_device_unregister_rdmacg(device);
 		ib_cache_cleanup_one(device);
 		goto out;
 	}
@@ -424,6 +433,7 @@ void ib_unregister_device(struct ib_device *device)
 
 	mutex_unlock(&device_mutex);
 
+	ib_device_unregister_rdmacg(device);
 	ib_device_unregister_sysfs(device);
 	ib_cache_cleanup_one(device);
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCHv12 3/3] rdmacg: Added documentation for rdmacg
  2016-08-31  8:37 [PATCHv12 0/3] rdmacg: IB/core: rdma controller support Parav Pandit
  2016-08-31  8:37 ` [PATCHv12 1/3] rdmacg: Added rdma cgroup controller Parav Pandit
  2016-08-31  8:37 ` [PATCHv12 2/3] IB/core: added support to use " Parav Pandit
@ 2016-08-31  8:37 ` Parav Pandit
  2016-08-31 13:56 ` [PATCHv12 0/3] rdmacg: IB/core: rdma controller support Tejun Heo
  2016-10-05 11:22 ` Leon Romanovsky
  4 siblings, 0 replies; 44+ messages in thread
From: Parav Pandit @ 2016-08-31  8:37 UTC (permalink / raw)
  To: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan,
	hannes, dledford, hch, liranl, sean.hefty, jgunthorpe, haggaie
  Cc: corbet, james.l.morris, serge, ogerlitz, matanb, akpm,
	linux-security-module, pandit.parav

Added documentation for v1 and v2 version describing high
level design and usage examples on using rdma controller.

Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
---
 Documentation/cgroup-v1/rdma.txt | 117 +++++++++++++++++++++++++++++++++++++++
 Documentation/cgroup-v2.txt      |  45 +++++++++++++++
 2 files changed, 162 insertions(+)
 create mode 100644 Documentation/cgroup-v1/rdma.txt

diff --git a/Documentation/cgroup-v1/rdma.txt b/Documentation/cgroup-v1/rdma.txt
new file mode 100644
index 0000000..28cb59e
--- /dev/null
+++ b/Documentation/cgroup-v1/rdma.txt
@@ -0,0 +1,117 @@
+				RDMA Controller
+				----------------
+
+Contents
+--------
+
+1. Overview
+  1-1. What is RDMA controller?
+  1-2. Why RDMA controller needed?
+  1-3. How is RDMA controller implemented?
+2. Usage Examples
+
+1. Overview
+
+1-1. What is RDMA controller?
+-----------------------------
+
+RDMA controller allows user to limit RDMA/IB specific resources that a given
+set of processes can use. These processes are grouped using RDMA controller.
+
+RDMA controller defines well defined verb resources which can be limited for
+processes of a cgroup.
+
+1-2. Why RDMA controller needed?
+--------------------------------
+
+Currently user space applications can easily take away all the rdma device
+specific resources such as AH, CQ, QP, MR etc. Due to which other applications
+in other cgroup or kernel space ULPs may not even get chance to allocate any
+rdma resources. This can leads to service unavailability.
+
+Therefore RDMA controller is needed through which resource consumption
+of processes can be limited. Through this controller various different rdma
+resources can be accounted.
+
+1-3. How is RDMA controller implemented?
+----------------------------------------
+
+RDMA cgroup allows limit configuration of resources. Rdma cgroup maintains
+resource accounting per cgroup, per device using resource pool structure.
+Each such resource pool is limited up to 64 resources in given resource pool
+by rdma cgroup, which can be extended later if required.
+
+This resource pool object is linked to the cgroup css. Typically there
+are 0 to 4 resource pool instances per cgroup, per device in most use cases.
+But nothing limits to have it more. At present hundreds of RDMA devices per
+single cgroup may not be handled optimally, however there is no
+known use case or requirement for such configuration either.
+
+Since RDMA resources can be allocated from any process and can be freed by any
+of the child processes which shares the address space, rdma resources are
+always owned by the creator cgroup css. This allows process migration from one
+to other cgroup without major complexity of transferring resource ownership;
+because such ownership is not really present due to shared nature of
+rdma resources. Linking resources around css also ensures that cgroups can be
+deleted after processes migrated. This allow progress migration as well with
+active resources, even though that is not a primary use case.
+
+Whenever RDMA resource charging occurs, owner rdma cgroup is returned to
+the caller. Same rdma cgroup should be passed while uncharging the resource.
+This also allows process migrated with active RDMA resource to charge
+to new owner cgroup for new resource. It also allows to uncharge resource of
+a process from previously charged cgroup which is migrated to new cgroup,
+even though that is not a primary use case.
+
+Resource pool object is created in following situations.
+(a) User sets the limit and no previous resource pool exist for the device
+of interest for the cgroup.
+(b) No resource limits were configured, but IB/RDMA stack tries to
+charge the resource. So that it correctly uncharge them when applications are
+running without limits and later on when limits are enforced during uncharging,
+otherwise usage count will drop to negative.
+
+Resource pool is destroyed if all the resource limits are set to max and
+it is the last resource getting deallocated.
+
+User should set all the limit to max value if it intents to remove/unconfigure
+the resource pool for a particular device.
+
+IB stack honors limits enforced by the rdma controller. When application
+query about maximum resource limits of IB device, it returns minimum of
+what is configured by user for a given cgroup and what is supported by
+IB device.
+
+Following resources can be accounted by rdma controller.
+  uctx		Maximum number of User Contexts
+  pd		Maximum number of Protection domains
+  ah		Maximum number of Address handles
+  mr		Maximum number of Memory Regions
+  mw		Maximum number of Memory Windows
+  cq		Maximum number of Completion Queues
+  srq		Maximum number of Shared Receive Queues
+  qp		Maximum number of Queue Pairs
+  flow		Maximum number of Flows
+
+
+2. Usage Examples
+-----------------
+
+(a) Configure resource limit:
+echo mlx4_0 mr=100 qp=10 ah=2 > /sys/fs/cgroup/rdma/1/rdma.max
+echo ocrdma1 mr=120 qp=20 cq=10 > /sys/fs/cgroup/rdma/2/rdma.max
+
+(b) Query resource limit:
+cat /sys/fs/cgroup/rdma/2/rdma.max
+#Output:
+mlx4_0 uctx=max pd=max ah=2 mr=100 mw=max cq=max srq=max qp=10 flow=max
+ocrdma1 uctx=max pd=max ah=max mr=120 mw=max cq=10 srq=max qp=20 flow=max
+
+(c) Query current usage:
+cat /sys/fs/cgroup/rdma/2/rdma.current
+#Output:
+mlx4_0 uctx=1 pd=2 ah=2 mr=95 mw=0 cq=2 srq=0 qp=8 flow=0
+ocrdma1 uctx=1 pd=6 ah=9 mr=20 mw=0 cq=1 srq=0 qp=2 flow=0
+
+(d) Delete resource limit:
+echo mlx4_0 mr=max qp=max ah=max > /sys/fs/cgroup/rdma/1/rdma.max
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 4cc07ce..ca63a31 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -47,6 +47,8 @@ CONTENTS
   5-3. IO
     5-3-1. IO Interface Files
     5-3-2. Writeback
+  5-4. RDMA
+    5-4-1. RDMA Interface Files
 6. Namespace
   6-1. Basics
   6-2. The Root and Views
@@ -1119,6 +1121,49 @@ writeback as follows.
 	vm.dirty[_background]_ratio.
 
 
+5-4. RDMA
+
+The "rdma" controller regulates the distribution and accounting of
+of RDMA resources.
+
+5-4-1. RDMA Interface Files
+
+  rdma.max
+	A readwrite nested-keyed file that exists for all the cgroups
+	except root that describes current configured resource limit
+	for a RDMA/IB device.
+
+	Lines are keyed by device name and are not ordered.
+	Each line contains space separated resource name and its configured
+	limit that can be distributed.
+
+	The following nested keys are defined.
+
+	  uctx		Maximum number of User Contexts
+	  pd		Maximum number of Protection domains
+	  ah		Maximum number of Address handles
+	  mr		Maximum number of Memory Regions
+	  mw		Maximum number of Memory Windows
+	  cq		Maximum number of Completion Queues
+	  srq		Maximum number of Shared Receive Queues
+	  qp		Maximum number of Queue Pairs
+	  flow		Maximum number of Flows
+
+	An example for mlx4 and ocrdma device follows.
+
+	  mlx4_0 uctx=max pd=4 ah=2 mr=10 mw=max cq=1 srq=1 qp=10 flow=10
+	  ocrdma1 uctx=2 pd=2 ah=2 mr=20 mw=max cq=1 srq=1 qp=10 flow=10
+
+  rdma.current
+	A read-only file that describes current resource usage.
+	It exists for all the cgroup except root.
+
+	An example for mlx4 and ocrdma device follows.
+
+	  mlx4_1 uctx=1 ah=0 pd=1 cq=4 qp=4 mr=100 srq=0 flow=10
+	  ocrdma1 uctx=2 pd=2 ah=2 mr=20 mw=max cq=1 srq=1 qp=10 flow=10
+
+
 6. Namespace
 
 6-1. Basics
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-08-31  8:37 ` [PATCHv12 1/3] rdmacg: Added rdma cgroup controller Parav Pandit
@ 2016-08-31  9:38   ` Leon Romanovsky
  2016-09-07 15:07     ` Parav Pandit
  2016-08-31 15:07   ` Matan Barak
  1 sibling, 1 reply; 44+ messages in thread
From: Leon Romanovsky @ 2016-08-31  9:38 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan,
	hannes, dledford, hch, liranl, sean.hefty, jgunthorpe, haggaie,
	corbet, james.l.morris, serge, ogerlitz, matanb, akpm,
	linux-security-module

[-- Attachment #1: Type: text/plain, Size: 2059 bytes --]

On Wed, Aug 31, 2016 at 02:07:25PM +0530, Parav Pandit wrote:
> Added rdma cgroup controller that does accounting, limit enforcement
> on rdma/IB verbs and hw resources.
>
> Added rdma cgroup header file which defines its APIs to perform
> charing/uncharing functionality. It also defined APIs for RDMA/IB
> stack for device registration. Devices which are registered will
> participate in controller functions of accounting and limit
> enforcements. It define rdmacg_device structure to bind IB stack
> and RDMA cgroup controller.
>
> RDMA resources are tracked using resource pool. Resource pool is per
> device, per cgroup entity which allows setting up accounting limits
> on per device basis.
>
> Currently resources are defined by the RDMA cgroup.
>
> Resource pool is created/destroyed dynamically whenever
> charging/uncharging occurs respectively and whenever user
> configuration is done. Its a tradeoff of memory vs little more code
> space that creates resource pool object whenever necessary, instead of
> creating them during cgroup creation and device registration time.
>
> Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
> ---

<...>

> +
> +static struct rdmacg_resource_pool *
> +get_cg_rpool_locked(struct rdma_cgroup *cg, struct rdmacg_device *device)
> +{
> +	struct rdmacg_resource_pool *rpool;
> +
> +	rpool = find_cg_rpool_locked(cg, device);
> +	if (rpool)
> +		return rpool;
> +
> +	rpool = kzalloc(sizeof(*rpool), GFP_KERNEL);
> +	if (!rpool)
> +		return ERR_PTR(-ENOMEM);
> +
> +	rpool->device = device;
> +	set_all_resource_max_limit(rpool);
> +
> +	INIT_LIST_HEAD(&rpool->cg_node);
> +	INIT_LIST_HEAD(&rpool->dev_node);
> +	list_add_tail(&rpool->cg_node, &cg->rpools);
> +	list_add_tail(&rpool->dev_node, &device->rpools);
> +	return rpool;
> +}

<...>

> +	for (p = cg; p; p = parent_rdmacg(p)) {
> +		rpool = get_cg_rpool_locked(p, device);
> +		if (IS_ERR_OR_NULL(rpool)) {

get_cg_rpool_locked always returns !NULL (error, or pointer)

> +			ret = PTR_ERR(rpool);
> +			goto err;

I didn't review the whole series yet.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
  2016-08-31  8:37 [PATCHv12 0/3] rdmacg: IB/core: rdma controller support Parav Pandit
                   ` (2 preceding siblings ...)
  2016-08-31  8:37 ` [PATCHv12 3/3] rdmacg: Added documentation for rdmacg Parav Pandit
@ 2016-08-31 13:56 ` Tejun Heo
  2016-10-05 11:22 ` Leon Romanovsky
  4 siblings, 0 replies; 44+ messages in thread
From: Tejun Heo @ 2016-08-31 13:56 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan, hannes,
	dledford, hch, liranl, sean.hefty, jgunthorpe, haggaie, corbet,
	james.l.morris, serge, ogerlitz, matanb, akpm,
	linux-security-module

Hello,

On Wed, Aug 31, 2016 at 02:07:24PM +0530, Parav Pandit wrote:
> rdmacg: IB/core: rdma controller support
> 
> Patch is generated and tested against below Doug's linux-rdma
> git tree.
> 
> URL: git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma.git
> Branch: master
> 
> Patchset is also compiled and tested against below Tejun's cgroup tree
> using cgroup v2 mode.
> URL: git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git
> Branch: master
> 
> Overview:
> Currently user space applications can easily take away all the rdma
> device specific resources such as AH, CQ, QP, MR etc. Due to which other
> applications in other cgroup or kernel space ULPs may not even get chance
> to allocate any rdma resources. This results into service unavailibility.

Generally looks good to me.  Please feel free to add

 Acked-by: Tejun Heo <tj@kernel.org>

for the whole series.  Also, once reviews from rdma side are done,
please let me know what's the preferable way to route the patchset.  I
can route it through the cgroup tree or it can go through rdma one.

Thanks a lot for the persisence!

-- 
tejun

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-08-31  8:37 ` [PATCHv12 1/3] rdmacg: Added rdma cgroup controller Parav Pandit
  2016-08-31  9:38   ` Leon Romanovsky
@ 2016-08-31 15:07   ` Matan Barak
  2016-08-31 21:16     ` Tejun Heo
  1 sibling, 1 reply; 44+ messages in thread
From: Matan Barak @ 2016-08-31 15:07 UTC (permalink / raw)
  To: Parav Pandit, cgroups, linux-doc, linux-kernel, linux-rdma, tj,
	lizefan, hannes, dledford, hch, liranl, sean.hefty, jgunthorpe,
	haggaie
  Cc: corbet, james.l.morris, serge, ogerlitz, akpm, linux-security-module

On 31/08/2016 11:37, Parav Pandit wrote:
> Added rdma cgroup controller that does accounting, limit enforcement
> on rdma/IB verbs and hw resources.
>
> Added rdma cgroup header file which defines its APIs to perform
> charing/uncharing functionality. It also defined APIs for RDMA/IB
> stack for device registration. Devices which are registered will
> participate in controller functions of accounting and limit
> enforcements. It define rdmacg_device structure to bind IB stack
> and RDMA cgroup controller.
>
> RDMA resources are tracked using resource pool. Resource pool is per
> device, per cgroup entity which allows setting up accounting limits
> on per device basis.
>
> Currently resources are defined by the RDMA cgroup.
>
> Resource pool is created/destroyed dynamically whenever
> charging/uncharging occurs respectively and whenever user
> configuration is done. Its a tradeoff of memory vs little more code
> space that creates resource pool object whenever necessary, instead of
> creating them during cgroup creation and device registration time.
>
> Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
> ---
>  include/linux/cgroup_rdma.h   |  66 +++++
>  include/linux/cgroup_subsys.h |   4 +
>  init/Kconfig                  |  10 +
>  kernel/Makefile               |   1 +
>  kernel/cgroup_rdma.c          | 664 ++++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 745 insertions(+)
>  create mode 100644 include/linux/cgroup_rdma.h
>  create mode 100644 kernel/cgroup_rdma.c
>
> diff --git a/include/linux/cgroup_rdma.h b/include/linux/cgroup_rdma.h
> new file mode 100644
> index 0000000..6710e28
> --- /dev/null
> +++ b/include/linux/cgroup_rdma.h
> @@ -0,0 +1,66 @@
> +/*
> + * Copyright (C) 2016 Parav Pandit <pandit.parav@gmail.com>
> + *
> + * This file is subject to the terms and conditions of version 2 of the GNU
> + * General Public License. See the file COPYING in the main directory of the
> + * Linux distribution for more details.
> + */
> +
> +#ifndef _CGROUP_RDMA_H
> +#define _CGROUP_RDMA_H
> +
> +#include <linux/cgroup.h>
> +
> +enum rdmacg_resource_type {
> +	RDMACG_VERB_RESOURCE_UCTX,
> +	RDMACG_VERB_RESOURCE_AH,
> +	RDMACG_VERB_RESOURCE_PD,
> +	RDMACG_VERB_RESOURCE_CQ,
> +	RDMACG_VERB_RESOURCE_MR,
> +	RDMACG_VERB_RESOURCE_MW,
> +	RDMACG_VERB_RESOURCE_SRQ,
> +	RDMACG_VERB_RESOURCE_QP,
> +	RDMACG_VERB_RESOURCE_FLOW,
> +	/*
> +	 * add any hw specific resource here as RDMA_HW_RESOURCE_NAME
> +	 */
> +	RDMACG_RESOURCE_MAX,
> +};
> +
> +#ifdef CONFIG_CGROUP_RDMA
> +

Currently, there are some discussions regarding the RDMA ABI. The 
current proposed approach (after a lot of discussions in the OFVWG) is 
to have driver dependent object types rather than the fixed set of IB 
object types we have today.
AFAIK, some vendors might want to use the RDMA subsystem for a different 
fabrics which has a different set of objects.
You could see RFCs for such concepts both from Mellanox and Intel on the 
linux-rdma mailing list.

Saying that, maybe we need to make the resource types a bit more 
flexible and dynamic.

Regards,
Matan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-08-31 15:07   ` Matan Barak
@ 2016-08-31 21:16     ` Tejun Heo
  2016-09-01  7:25       ` Matan Barak
  0 siblings, 1 reply; 44+ messages in thread
From: Tejun Heo @ 2016-08-31 21:16 UTC (permalink / raw)
  To: Matan Barak
  Cc: Parav Pandit, cgroups, linux-doc, linux-kernel, linux-rdma,
	lizefan, hannes, dledford, hch, liranl, sean.hefty, jgunthorpe,
	haggaie, corbet, james.l.morris, serge, ogerlitz, akpm,
	linux-security-module

Hello,

On Wed, Aug 31, 2016 at 06:07:30PM +0300, Matan Barak wrote:
> Currently, there are some discussions regarding the RDMA ABI. The current
> proposed approach (after a lot of discussions in the OFVWG) is to have
> driver dependent object types rather than the fixed set of IB object types
> we have today.
> AFAIK, some vendors might want to use the RDMA subsystem for a different
> fabrics which has a different set of objects.
> You could see RFCs for such concepts both from Mellanox and Intel on the
> linux-rdma mailing list.
> 
> Saying that, maybe we need to make the resource types a bit more flexible
> and dynamic.

That'd be back to square one and Christoph was dead against it too,
so...

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-08-31 21:16     ` Tejun Heo
@ 2016-09-01  7:25       ` Matan Barak
  2016-09-01  8:44         ` Christoph Hellwig
  0 siblings, 1 reply; 44+ messages in thread
From: Matan Barak @ 2016-09-01  7:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Parav Pandit, cgroups, linux-doc, linux-kernel, linux-rdma,
	lizefan, hannes, dledford, hch, liranl, sean.hefty, jgunthorpe,
	haggaie, corbet, james.l.morris, serge, ogerlitz, akpm,
	linux-security-module

On 01/09/2016 00:16, Tejun Heo wrote:
> Hello,
>
> On Wed, Aug 31, 2016 at 06:07:30PM +0300, Matan Barak wrote:
>> Currently, there are some discussions regarding the RDMA ABI. The current
>> proposed approach (after a lot of discussions in the OFVWG) is to have
>> driver dependent object types rather than the fixed set of IB object types
>> we have today.
>> AFAIK, some vendors might want to use the RDMA subsystem for a different
>> fabrics which has a different set of objects.
>> You could see RFCs for such concepts both from Mellanox and Intel on the
>> linux-rdma mailing list.
>>
>> Saying that, maybe we need to make the resource types a bit more flexible
>> and dynamic.
>
> That'd be back to square one and Christoph was dead against it too,
> so...
>

Well, if I recall, the reason doing so last time was in order to allow 
flexible updating of ib_core independently, which is obviously not a 
good reason (to say the least).

Since the new ABI will probably define new object types (all recent 
proposals go this way), the current approach could lead to either trying 
to map new objects to existing cgroup resource types, which could lead 
to some weird non 1:1 mapping, or having a split definitions - such that 
each driver will declare its objects both in the cgroups mechanism and 
in its driver dispatch table.
Even worse than that, drivers could simply ignore the cgroups support 
while implementing their own resource types and get a very broken 
containers support.


> Thanks.
>

Matan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-01  7:25       ` Matan Barak
@ 2016-09-01  8:44         ` Christoph Hellwig
  2016-09-07  7:55           ` Parav Pandit
  0 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2016-09-01  8:44 UTC (permalink / raw)
  To: Matan Barak
  Cc: Tejun Heo, Parav Pandit, cgroups, linux-doc, linux-kernel,
	linux-rdma, lizefan, hannes, dledford, hch, liranl, sean.hefty,
	jgunthorpe, haggaie, corbet, james.l.morris, serge, ogerlitz,
	akpm, linux-security-module

On Thu, Sep 01, 2016 at 10:25:40AM +0300, Matan Barak wrote:
> Well, if I recall, the reason doing so last time was in order to allow 
> flexible updating of ib_core independently, which is obviously not a good 
> reason (to say the least).
>
> Since the new ABI will probably define new object types (all recent 
> proposals go this way), the current approach could lead to either trying to 
> map new objects to existing cgroup resource types, which could lead to some 
> weird non 1:1 mapping, or having a split definitions - such that each 
> driver will declare its objects both in the cgroups mechanism and in its 
> driver dispatch table.
> Even worse than that, drivers could simply ignore the cgroups support while 
> implementing their own resource types and get a very broken containers 
> support.

Sorry guys, but arbitrary extensibility for something not finished is the
worst idea ever.  We have two options here:

 a) delay cgroups support until the grand rewrite is done
 b) add it now and deal with the consequences later

That being said, adding random non-Verbs hardwasre to the RDMA core is
the second worst idea ever.  Guess I need to catch up with the
discussion and start using the flame thrower.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-01  8:44         ` Christoph Hellwig
@ 2016-09-07  7:55           ` Parav Pandit
  2016-09-07  8:51             ` Matan Barak
  2016-09-10 16:12             ` Christoph Hellwig
  0 siblings, 2 replies; 44+ messages in thread
From: Parav Pandit @ 2016-09-07  7:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matan Barak, Tejun Heo, cgroups, linux-doc,
	Linux Kernel Mailing List, linux-rdma, Li Zefan, Johannes Weiner,
	Doug Ledford, Liran Liss, Hefty, Sean, Jason Gunthorpe,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

Hi Matan,

On Thu, Sep 1, 2016 at 2:14 PM, Christoph Hellwig <hch@lst.de> wrote:
> On Thu, Sep 01, 2016 at 10:25:40AM +0300, Matan Barak wrote:
>> Well, if I recall, the reason doing so last time was in order to allow
>> flexible updating of ib_core independently, which is obviously not a good
>> reason (to say the least).
>>
>> Since the new ABI will probably define new object types (all recent
>> proposals go this way), the current approach could lead to either trying to
>> map new objects to existing cgroup resource types, which could lead to some
>> weird non 1:1 mapping, or having a split definitions - such that each
>> driver will declare its objects both in the cgroups mechanism and in its
>> driver dispatch table.

>> Even worse than that, drivers could simply ignore the cgroups support while
>> implementing their own resource types and get a very broken containers
>> support.
If drivers are broken due to ignorance of not-calling cgroup APIs,
that should be ok.
That particular driver should fix it.
If the resource creation using uverbs is using well defined rdma level
resource, uverbs level will make sure to honor cgroup limits without
involving hw drivers anyway.

RDMA Verb level resource is charged/uncharged by RDMA core.
RDMA HW level resource is charged/uncharged by RDMA HW driver using
well defined API and resource by cgroup core.
This scheme ensures that there is 1:1 mapping.

I don't think current definition of resource type is carved out on stone.
They can be extended as we forward.
As many of us agree that, they should be well defined and it should be
agreed by cgroup and rdma community.

For example, today we have RDMA_VERB_xxx resources.
New well defined RDMA HW resources can be defined in rdma_cgroup.h
file as RDMA_HW_xx in same enum table.

>
> Sorry guys, but arbitrary extensibility for something not finished is the
> worst idea ever.  We have two options here:
>
>  a) delay cgroups support until the grand rewrite is done
>  b) add it now and deal with the consequences later
>
Can we do (b) now and differ adding any HW resources to cgroup until
they are clearly called out.
Architecture and APIs are already in place to support this.

> That being said, adding random non-Verbs hardwasre to the RDMA core is
> the second worst idea ever.

We can differ adding HW resource to core and cgroup until they are
clearly defined.
In that case current architecture still holds good.

> Guess I need to catch up with the
> discussion and start using the flame thrower.

Matan,
Can you please point us to the new RFC/ABI email thread which
describes the design, partitioning of code between core vs hw drivers
etc.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-07  7:55           ` Parav Pandit
@ 2016-09-07  8:51             ` Matan Barak
  2016-09-07 14:54               ` Parav Pandit
  2016-09-10 16:14               ` Christoph Hellwig
  2016-09-10 16:12             ` Christoph Hellwig
  1 sibling, 2 replies; 44+ messages in thread
From: Matan Barak @ 2016-09-07  8:51 UTC (permalink / raw)
  To: Parav Pandit, Christoph Hellwig
  Cc: Tejun Heo, cgroups, linux-doc, Linux Kernel Mailing List,
	linux-rdma, Li Zefan, Johannes Weiner, Doug Ledford, Liran Liss,
	Hefty, Sean, Jason Gunthorpe, Haggai Eran, Jonathan Corbet,
	james.l.morris, serge, Or Gerlitz, Andrew Morton,
	linux-security-module

On 07/09/2016 10:55, Parav Pandit wrote:
> Hi Matan,
>
> On Thu, Sep 1, 2016 at 2:14 PM, Christoph Hellwig <hch@lst.de> wrote:
>> On Thu, Sep 01, 2016 at 10:25:40AM +0300, Matan Barak wrote:
>>> Well, if I recall, the reason doing so last time was in order to allow
>>> flexible updating of ib_core independently, which is obviously not a good
>>> reason (to say the least).
>>>
>>> Since the new ABI will probably define new object types (all recent
>>> proposals go this way), the current approach could lead to either trying to
>>> map new objects to existing cgroup resource types, which could lead to some
>>> weird non 1:1 mapping, or having a split definitions - such that each
>>> driver will declare its objects both in the cgroups mechanism and in its
>>> driver dispatch table.
>
>>> Even worse than that, drivers could simply ignore the cgroups support while
>>> implementing their own resource types and get a very broken containers
>>> support.
> If drivers are broken due to ignorance of not-calling cgroup APIs,
> that should be ok.
> That particular driver should fix it.
> If the resource creation using uverbs is using well defined rdma level
> resource, uverbs level will make sure to honor cgroup limits without
> involving hw drivers anyway.
>

All recent proposals of the new ABI schema deals with extending the 
flexibility of the current schema by letting drivers define their 
specific types, actions, attributes, etc. Even more than that, the 
dispatching starts from the driver and it chooses if it wants to use the 
common RDMA core layer or have it's own wise implementation instead.
Some drivers might even prefer not to implement the current verbs types.
These decisions were made in the OFVWG meetings.

Anyway, maybe we should consider using a more higher-level logic objects 
that could fit multiple drivers requirements.

> RDMA Verb level resource is charged/uncharged by RDMA core.
> RDMA HW level resource is charged/uncharged by RDMA HW driver using
> well defined API and resource by cgroup core.
> This scheme ensures that there is 1:1 mapping.
>

Sounds reasonable, but what about drivers which ignore the common code 
and implement it in their own way? What about drivers which don't 
support the standard RDMA types at all?
I guess we should find a balance between something abstract and common 
enough that will ease administrator configuration but be specific enough 
for the various models we have (or will have) in the RDMA stack.

> I don't think current definition of resource type is carved out on stone.
> They can be extended as we forward.
> As many of us agree that, they should be well defined and it should be
> agreed by cgroup and rdma community.
>

Of course, but since the ABI and cgroups model are somehow related, they 
should be dealt with together and approved by Doug who participated in 
some of the OFVWG meetings.

> For example, today we have RDMA_VERB_xxx resources.
> New well defined RDMA HW resources can be defined in rdma_cgroup.h
> file as RDMA_HW_xx in same enum table.
>

So a driver will change the cgroups file for every new type it adds?
Will we just have a super set enum of all crazy types vendors added?

>>
>> Sorry guys, but arbitrary extensibility for something not finished is the
>> worst idea ever.  We have two options here:
>>
>>  a) delay cgroups support until the grand rewrite is done
>>  b) add it now and deal with the consequences later
>>
> Can we do (b) now and differ adding any HW resources to cgroup until
> they are clearly called out.
> Architecture and APIs are already in place to support this.
>

Since this affect the user, it's better to think how it fits our 
"optional standard"/"vendor types" model first. Maybe we could force all 
standard types even if the driver we use doesn't support any of them.

>> That being said, adding random non-Verbs hardwasre to the RDMA core is
>> the second worst idea ever.
>
> We can differ adding HW resource to core and cgroup until they are
> clearly defined.
> In that case current architecture still holds good.
>

Clearly we should differ adding the actual code until a driver could 
declare such objects, but we need to decide how to expose the standard 
optional RDMA types (basically, the types you've added) and how future 
driver specific types fit in.

>> Guess I need to catch up with the
>> discussion and start using the flame thrower.
>
> Matan,
> Can you please point us to the new RFC/ABI email thread which
> describes the design, partitioning of code between core vs hw drivers
> etc.
>

One proposal is [1]. There's another one from Sean which aims for 
similar targets regards the driver specific types.

[1] https://www.spinics.net/lists/linux-rdma/msg38997.html

Matan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-07  8:51             ` Matan Barak
@ 2016-09-07 14:54               ` Parav Pandit
  2016-09-10 16:14               ` Christoph Hellwig
  1 sibling, 0 replies; 44+ messages in thread
From: Parav Pandit @ 2016-09-07 14:54 UTC (permalink / raw)
  To: Matan Barak
  Cc: Christoph Hellwig, Tejun Heo, cgroups, linux-doc,
	Linux Kernel Mailing List, linux-rdma, Li Zefan, Johannes Weiner,
	Doug Ledford, Liran Liss, Hefty, Sean, Jason Gunthorpe,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

On Wed, Sep 7, 2016 at 2:21 PM, Matan Barak <matanb@mellanox.com> wrote:
> On 07/09/2016 10:55, Parav Pandit wrote:
>>
>> Hi Matan,
>>
>> On Thu, Sep 1, 2016 at 2:14 PM, Christoph Hellwig <hch@lst.de> wrote:
>>>
>>> On Thu, Sep 01, 2016 at 10:25:40AM +0300, Matan Barak wrote:
>>>>
>>>> Well, if I recall, the reason doing so last time was in order to allow
>>>> flexible updating of ib_core independently, which is obviously not a
>>>> good
>>>> reason (to say the least).
>>>>
>>>> Since the new ABI will probably define new object types (all recent
>>>> proposals go this way), the current approach could lead to either trying
>>>> to
>>>> map new objects to existing cgroup resource types, which could lead to
>>>> some
>>>> weird non 1:1 mapping, or having a split definitions - such that each
>>>> driver will declare its objects both in the cgroups mechanism and in its
>>>> driver dispatch table.
>>
>>
>>>> Even worse than that, drivers could simply ignore the cgroups support
>>>> while
>>>> implementing their own resource types and get a very broken containers
>>>> support.
>>
>> If drivers are broken due to ignorance of not-calling cgroup APIs,
>> that should be ok.
>> That particular driver should fix it.
>> If the resource creation using uverbs is using well defined rdma level
>> resource, uverbs level will make sure to honor cgroup limits without
>> involving hw drivers anyway.
>>
>
> All recent proposals of the new ABI schema deals with extending the
> flexibility of the current schema by letting drivers define their specific
> types, actions, attributes, etc. Even more than that, the dispatching starts
> from the driver and it chooses if it wants to use the common RDMA core layer
> or have it's own wise implementation instead.
> Some drivers might even prefer not to implement the current verbs types.
> These decisions were made in the OFVWG meetings.

o.k. If some drivers doesn't implement current verbs type, there is no
point in controlling it either.
In such case application space library won't even invoke resource
allocation/free for unsupported resources.
For resources (type in your word) which are not defined in cgroup, but
exist in hw driver, cannot be controlled by cgroup.
As you highlighted in your [1], "driver's specific attributes could
someday become core's standard attributes", we should be able to add
new resource type to existing rdma cgroup.

>
> Anyway, maybe we should consider using a more higher-level logic objects
> that could fit multiple drivers requirements.
I do not have any other objects other than QP, MR etc in mind currently.
I can think of a RDMA specific socket that encompass one PD, AH,
couple of MRs, and QP.
But we don't have such resource abstraction and its data transport
APIs in place.
There is rsocket, various versions of MPI, libfabric etc in place.
So I am not sure which high level objects to defined at this point.
If you have such objects nailed down, we should be able to add them in
incremental development.

>
>> RDMA Verb level resource is charged/uncharged by RDMA core.
>> RDMA HW level resource is charged/uncharged by RDMA HW driver using
>> well defined API and resource by cgroup core.
>> This scheme ensures that there is 1:1 mapping.
>>
>
> Sounds reasonable, but what about drivers which ignore the common code and
> implement it in their own way?
Shouldn't Doug ask them to use cgroup infrastructure instead of
implementing resource charging/uncharing in their own way.
It still unlikely or difficult for drivers to group processes them
selves like cgroup to implement things in their own way.
I agree, they can completely ignore RDMA verbs resources and implement
their own HW resources.

As you pointed below, we need to find balance between which hw
resource to be defined and which not.
For example, newly added XRQ object, which could be a pure buffer_tag
with receive queue for other vendor. I am not sure how to abstract
them into single object.



> What about drivers which don't support the
> standard RDMA types at all?
> I guess we should find a balance between something abstract and common
> enough that will ease administrator configuration but be specific enough for
> the various models we have (or will have) in the RDMA stack.
>
>> I don't think current definition of resource type is carved out on stone.
>> They can be extended as we forward.
>> As many of us agree that, they should be well defined and it should be
>> agreed by cgroup and rdma community.
>>
>
> Of course, but since the ABI and cgroups model are somehow related, they
> should be dealt with together and approved by Doug who participated in some
> of the OFVWG meetings.
Sure.
>
>> For example, today we have RDMA_VERB_xxx resources.
>> New well defined RDMA HW resources can be defined in rdma_cgroup.h
>> file as RDMA_HW_xx in same enum table.
>>
>
> So a driver will change the cgroups file for every new type it adds?
Well, we wanted to avoid that such churn in cgroup file, thats why v11
was defining resources in IB core. But I guess that was not
acceptable. I had NAK from Christoph and Tejun on that idea.

> Will we just have a super set enum of all crazy types vendors added?
As you said, we need to find balance. I frankly don't know how to do
so. There has to be some reasonable number of types. As we go along
Doug, Tejun and others should approve adding such.
If I am not wrong in past one year, may be two more resource types got
added? XRQ, state-less Queues?

>
>>>
>>> Sorry guys, but arbitrary extensibility for something not finished is the
>>> worst idea ever.  We have two options here:
>>>
>>>  a) delay cgroups support until the grand rewrite is done
>>>  b) add it now and deal with the consequences later
>>>
>> Can we do (b) now and differ adding any HW resources to cgroup until
>> they are clearly called out.
>> Architecture and APIs are already in place to support this.
>>
>
> Since this affect the user, it's better to think how it fits our "optional
> standard"/"vendor types" model first. Maybe we could force all standard
> types even if the driver we use doesn't support any of them.

If vendor doesn't support a given type, user won't allocate it. So its
just don't care condition.
I dont see a need to force standard types either.

>
>>> That being said, adding random non-Verbs hardwasre to the RDMA core is
>>> the second worst idea ever.
>>
>>
>> We can differ adding HW resource to core and cgroup until they are
>> clearly defined.
>> In that case current architecture still holds good.
>>
>
> Clearly we should differ adding the actual code until a driver could declare
> such objects, but we need to decide how to expose the standard optional RDMA
> types (basically, the types you've added) and how future driver specific
> types fit in.
>

o.k. Few more handful of driver specific types should be ok to add in cgroup.
I will let others speak up if thats not acceptable. Current code
already documents and provide infrastructure for that.

>>
>>
>> Matan,
>> Can you please point us to the new RFC/ABI email thread which
>> describes the design, partitioning of code between core vs hw drivers
>> etc.
>>
>
> One proposal is [1]. There's another one from Sean which aims for similar
> targets regards the driver specific types.
>
> [1] https://www.spinics.net/lists/linux-rdma/msg38997.html
>
I looked at the RFC briefly. I can see that old infrastructure (a) and
(b) is not going away.
So It should be ok. to charge/uncharge those standard resources from
those hooks.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-08-31  9:38   ` Leon Romanovsky
@ 2016-09-07 15:07     ` Parav Pandit
  2016-09-08  6:12       ` Leon Romanovsky
  0 siblings, 1 reply; 44+ messages in thread
From: Parav Pandit @ 2016-09-07 15:07 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: cgroups, linux-doc, Linux Kernel Mailing List, linux-rdma,
	Tejun Heo, Li Zefan, Johannes Weiner, Doug Ledford,
	Christoph Hellwig, Liran Liss, Hefty, Sean, Jason Gunthorpe,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Matan Barak, Andrew Morton, linux-security-module

Hi Leon,

>> Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
>> +static struct rdmacg_resource_pool *
>> +get_cg_rpool_locked(struct rdma_cgroup *cg, struct rdmacg_device *device)
>> +{
>> +     struct rdmacg_resource_pool *rpool;
>> +
>> +     rpool = find_cg_rpool_locked(cg, device);
>> +     if (rpool)
>> +             return rpool;
>> +
>> +     rpool = kzalloc(sizeof(*rpool), GFP_KERNEL);
>> +     if (!rpool)
>> +             return ERR_PTR(-ENOMEM);
>> +
>> +     rpool->device = device;
>> +     set_all_resource_max_limit(rpool);
>> +
>> +     INIT_LIST_HEAD(&rpool->cg_node);
>> +     INIT_LIST_HEAD(&rpool->dev_node);
>> +     list_add_tail(&rpool->cg_node, &cg->rpools);
>> +     list_add_tail(&rpool->dev_node, &device->rpools);
>> +     return rpool;
>> +}
>
> <...>
>
>> +     for (p = cg; p; p = parent_rdmacg(p)) {
>> +             rpool = get_cg_rpool_locked(p, device);
>> +             if (IS_ERR_OR_NULL(rpool)) {
>
> get_cg_rpool_locked always returns !NULL (error, or pointer)

Can this change go as incremental change after this patch, since this
patch is close to completion?
Or I need to revise v13?

>
>> +                     ret = PTR_ERR(rpool);
>> +                     goto err;
>
> I didn't review the whole series yet.

Did you get a chance to review the series?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-07 15:07     ` Parav Pandit
@ 2016-09-08  6:12       ` Leon Romanovsky
  2016-09-08 10:20         ` Parav Pandit
  0 siblings, 1 reply; 44+ messages in thread
From: Leon Romanovsky @ 2016-09-08  6:12 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups, linux-doc, Linux Kernel Mailing List, linux-rdma,
	Tejun Heo, Li Zefan, Johannes Weiner, Doug Ledford,
	Christoph Hellwig, Liran Liss, Hefty, Sean, Jason Gunthorpe,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Matan Barak, Andrew Morton, linux-security-module

[-- Attachment #1: Type: text/plain, Size: 1626 bytes --]

On Wed, Sep 07, 2016 at 08:37:23PM +0530, Parav Pandit wrote:
> Hi Leon,
>
> >> Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
> >> +static struct rdmacg_resource_pool *
> >> +get_cg_rpool_locked(struct rdma_cgroup *cg, struct rdmacg_device *device)
> >> +{
> >> +     struct rdmacg_resource_pool *rpool;
> >> +
> >> +     rpool = find_cg_rpool_locked(cg, device);
> >> +     if (rpool)
> >> +             return rpool;
> >> +
> >> +     rpool = kzalloc(sizeof(*rpool), GFP_KERNEL);
> >> +     if (!rpool)
> >> +             return ERR_PTR(-ENOMEM);
> >> +
> >> +     rpool->device = device;
> >> +     set_all_resource_max_limit(rpool);
> >> +
> >> +     INIT_LIST_HEAD(&rpool->cg_node);
> >> +     INIT_LIST_HEAD(&rpool->dev_node);
> >> +     list_add_tail(&rpool->cg_node, &cg->rpools);
> >> +     list_add_tail(&rpool->dev_node, &device->rpools);
> >> +     return rpool;
> >> +}
> >
> > <...>
> >
> >> +     for (p = cg; p; p = parent_rdmacg(p)) {
> >> +             rpool = get_cg_rpool_locked(p, device);
> >> +             if (IS_ERR_OR_NULL(rpool)) {
> >
> > get_cg_rpool_locked always returns !NULL (error, or pointer)
>
> Can this change go as incremental change after this patch, since this
> patch is close to completion?
> Or I need to revise v13?

Sure, it is cleanup. It is not worth of respinning.

>
> >
> >> +                     ret = PTR_ERR(rpool);
> >> +                     goto err;
> >
> > I didn't review the whole series yet.
>
> Did you get a chance to review the series?

We need to decide on fundamental question before reviewing it, which is
"how to fit rdmacg to new ABI model".

Thanks

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-08  6:12       ` Leon Romanovsky
@ 2016-09-08 10:20         ` Parav Pandit
  0 siblings, 0 replies; 44+ messages in thread
From: Parav Pandit @ 2016-09-08 10:20 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: cgroups, linux-doc, Linux Kernel Mailing List, linux-rdma,
	Tejun Heo, Li Zefan, Johannes Weiner, Doug Ledford,
	Christoph Hellwig, Liran Liss, Hefty, Sean, Jason Gunthorpe,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Matan Barak, Andrew Morton, linux-security-module

On Thu, Sep 8, 2016 at 11:42 AM, Leon Romanovsky <leon@kernel.org> wrote:
> On Wed, Sep 07, 2016 at 08:37:23PM +0530, Parav Pandit wrote:
>> Did you get a chance to review the series?
>
> We need to decide on fundamental question before reviewing it, which is
> "how to fit rdmacg to new ABI model".

>From last discussion with Matan in this email thread, it appears that -
only broken case are:
(a) HW vendor driver specific resources (if they have crazy big list),
which cannot be abstracted out well enough, won't be controlled by
rdma cgroup.
(b) Such resource objects are not well defined today with new ABI model.
If such objects are well defined today, lets call them out and discuss
with Doug, Tejun, Christoph and larger group, whether they qualify for
inclusion or not.

rdma cgroup currently supports including handful of HW resource that
can be abstracted (at least at functionality level).

Please include any other option issue, if any.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-07  7:55           ` Parav Pandit
  2016-09-07  8:51             ` Matan Barak
@ 2016-09-10 16:12             ` Christoph Hellwig
  2016-09-11  7:40               ` Matan Barak
  1 sibling, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2016-09-10 16:12 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Christoph Hellwig, Matan Barak, Tejun Heo, cgroups, linux-doc,
	Linux Kernel Mailing List, linux-rdma, Li Zefan, Johannes Weiner,
	Doug Ledford, Liran Liss, Hefty, Sean, Jason Gunthorpe,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

On Wed, Sep 07, 2016 at 01:25:13PM +0530, Parav Pandit wrote:
> >  a) delay cgroups support until the grand rewrite is done
> >  b) add it now and deal with the consequences later
> >
> Can we do (b) now and differ adding any HW resources to cgroup until
> they are clearly called out.
> Architecture and APIs are already in place to support this.

Sounds fine to me.  The only thing I want to avoid is pie in the
sky "future proofing" that leads to horrible architectures.  And I assume
that's what Matan proposed.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-07  8:51             ` Matan Barak
  2016-09-07 14:54               ` Parav Pandit
@ 2016-09-10 16:14               ` Christoph Hellwig
  2016-09-10 17:01                 ` Jason Gunthorpe
  1 sibling, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2016-09-10 16:14 UTC (permalink / raw)
  To: Matan Barak
  Cc: Parav Pandit, Christoph Hellwig, Tejun Heo, cgroups, linux-doc,
	Linux Kernel Mailing List, linux-rdma, Li Zefan, Johannes Weiner,
	Doug Ledford, Liran Liss, Hefty, Sean, Jason Gunthorpe,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

On Wed, Sep 07, 2016 at 11:51:42AM +0300, Matan Barak wrote:
> All recent proposals of the new ABI schema deals with extending the 
> flexibility of the current schema by letting drivers define their specific 
> types, actions, attributes, etc. Even more than that, the dispatching 
> starts from the driver and it chooses if it wants to use the common RDMA 
> core layer or have it's own wise implementation instead.
> Some drivers might even prefer not to implement the current verbs types.
> These decisions were made in the OFVWG meetings.

OFVWG meetings have absolutely zero relevance for Linux development.
More "flexibility" for drivers just means giving up on designing a
coherent API and leaving it to drivers authors to add crap to their
own drivers.  That's a major step backwards.

> Sounds reasonable, but what about drivers which ignore the common code and 
> implement it in their own way? What about drivers which don't support the 
> standard RDMA types at all?

They should not be using the code in drivers/infiniband.  usnic is such
an example of a driver that should never have been added in it's current
form.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-10 16:14               ` Christoph Hellwig
@ 2016-09-10 17:01                 ` Jason Gunthorpe
  2016-09-11  8:07                   ` Matan Barak
  2016-09-11 13:34                   ` Christoph Hellwig
  0 siblings, 2 replies; 44+ messages in thread
From: Jason Gunthorpe @ 2016-09-10 17:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matan Barak, Parav Pandit, Tejun Heo, cgroups, linux-doc,
	Linux Kernel Mailing List, linux-rdma, Li Zefan, Johannes Weiner,
	Doug Ledford, Liran Liss, Hefty, Sean, Haggai Eran,
	Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

On Sat, Sep 10, 2016 at 06:14:42PM +0200, Christoph Hellwig wrote:
> OFVWG meetings have absolutely zero relevance for Linux development.

Well, to be fair there are a fair number of kernel developers on that
particular call..

> More "flexibility" for drivers just means giving up on designing a
> coherent API and leaving it to drivers authors to add crap to their
> own drivers.  That's a major step backwards.

Sadly, it isn't a step backwards, it is status quo - at least as far
as the uapi is concerned.

Every single user space driver has its own private abi file, carefully
hidden in their driver, and dutifully copied over to user space:

providers/cxgb3/iwch-abi.h
providers/cxgb4/cxgb4-abi.h
providers/hfi1verbs/hfi-abi.h
providers/i40iw/i40iw-abi.h
providers/ipathverbs/ipath-abi.h
providers/mlx4/mlx4-abi.h
providers/mlx5/mlx5-abi.h
providers/mthca/mthca-abi.h
providers/nes/nes-abi.h
providers/ocrdma/ocrdma_abi.h
providers/rxe/rxe-abi.h

Just to pick two random examples:

struct mlx5_create_cq {
        struct ibv_create_cq            ibv_cmd;
        __u64                           buf_addr;
        __u64                           db_addr;
        __u32				cqe_size;
};

struct iwch_create_cq {
        struct ibv_create_cq ibv_cmd;
        uint64_t user_rptr_addr;
};

Love to hear ideas on a way forward that doesn't involve rewriting
everything :(

> They should not be using the code in drivers/infiniband.  usnic is such
> an example of a driver that should never have been added in it's current
> form.

+1

Jason

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-10 16:12             ` Christoph Hellwig
@ 2016-09-11  7:40               ` Matan Barak
  0 siblings, 0 replies; 44+ messages in thread
From: Matan Barak @ 2016-09-11  7:40 UTC (permalink / raw)
  To: Christoph Hellwig, Parav Pandit
  Cc: Tejun Heo, cgroups, linux-doc, Linux Kernel Mailing List,
	linux-rdma, Li Zefan, Johannes Weiner, Doug Ledford, Liran Liss,
	Hefty, Sean, Jason Gunthorpe, Haggai Eran, Jonathan Corbet,
	james.l.morris, serge, Or Gerlitz, Andrew Morton,
	linux-security-module

On 10/09/2016 19:12, Christoph Hellwig wrote:
> On Wed, Sep 07, 2016 at 01:25:13PM +0530, Parav Pandit wrote:
>>>  a) delay cgroups support until the grand rewrite is done
>>>  b) add it now and deal with the consequences later
>>>
>> Can we do (b) now and differ adding any HW resources to cgroup until
>> they are clearly called out.
>> Architecture and APIs are already in place to support this.
>
> Sounds fine to me.  The only thing I want to avoid is pie in the
> sky "future proofing" that leads to horrible architectures.  And I assume
> that's what Matan proposed.
>

NO, that not what I proposed. User-kernel API/ABI should be designed 
with drivers differences in mind. The internal design or internals APIs 
could ignore such things as they can be changed later.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-10 17:01                 ` Jason Gunthorpe
@ 2016-09-11  8:07                   ` Matan Barak
  2016-09-11 13:34                   ` Christoph Hellwig
  1 sibling, 0 replies; 44+ messages in thread
From: Matan Barak @ 2016-09-11  8:07 UTC (permalink / raw)
  To: Jason Gunthorpe, Christoph Hellwig
  Cc: Parav Pandit, Tejun Heo, cgroups, linux-doc,
	Linux Kernel Mailing List, linux-rdma, Li Zefan, Johannes Weiner,
	Doug Ledford, Liran Liss, Hefty, Sean, Haggai Eran,
	Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

On 10/09/2016 20:01, Jason Gunthorpe wrote:
> On Sat, Sep 10, 2016 at 06:14:42PM +0200, Christoph Hellwig wrote:
>> OFVWG meetings have absolutely zero relevance for Linux development.
>
> Well, to be fair there are a fair number of kernel developers on that
> particular call..
>
>> More "flexibility" for drivers just means giving up on designing a
>> coherent API and leaving it to drivers authors to add crap to their
>> own drivers.  That's a major step backwards.
>
> Sadly, it isn't a step backwards, it is status quo - at least as far
> as the uapi is concerned.
>
> Every single user space driver has its own private abi file, carefully
> hidden in their driver, and dutifully copied over to user space:
>
> providers/cxgb3/iwch-abi.h
> providers/cxgb4/cxgb4-abi.h
> providers/hfi1verbs/hfi-abi.h
> providers/i40iw/i40iw-abi.h
> providers/ipathverbs/ipath-abi.h
> providers/mlx4/mlx4-abi.h
> providers/mlx5/mlx5-abi.h
> providers/mthca/mthca-abi.h
> providers/nes/nes-abi.h
> providers/ocrdma/ocrdma_abi.h
> providers/rxe/rxe-abi.h
>
> Just to pick two random examples:
>
> struct mlx5_create_cq {
>         struct ibv_create_cq            ibv_cmd;
>         __u64                           buf_addr;
>         __u64                           db_addr;
>         __u32				cqe_size;
> };
>
> struct iwch_create_cq {
>         struct ibv_create_cq ibv_cmd;
>         uint64_t user_rptr_addr;
> };
>
> Love to hear ideas on a way forward that doesn't involve rewriting
> everything :(
>

Yeah, unfortunately, the RDMA ABI is more driver specific ABI than a 
common user-kernel ABI. I guess this will become even worse, as the RDMA 
subsystem is evolving to serve more drivers with different object types. 
For example, I would like to hear how hfi1 are going to define their 
user-kernel ABI (once they leave the custom ioctls).

>> They should not be using the code in drivers/infiniband.  usnic is such
>> an example of a driver that should never have been added in it's current
>> form.
>
> +1
>
> Jason
>

Matan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-10 17:01                 ` Jason Gunthorpe
  2016-09-11  8:07                   ` Matan Barak
@ 2016-09-11 13:34                   ` Christoph Hellwig
  2016-09-11 14:35                     ` Leon Romanovsky
  1 sibling, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2016-09-11 13:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Matan Barak, Parav Pandit, Tejun Heo, cgroups,
	linux-doc, Linux Kernel Mailing List, linux-rdma, Li Zefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

On Sat, Sep 10, 2016 at 11:01:51AM -0600, Jason Gunthorpe wrote:
> Sadly, it isn't a step backwards, it is status quo - at least as far
> as the uapi is concerned.

Sort of, see below:

> struct mlx5_create_cq {
>         struct ibv_create_cq            ibv_cmd;
>         __u64                           buf_addr;
>         __u64                           db_addr;
>         __u32				cqe_size;
> };
> 
> struct iwch_create_cq {
>         struct ibv_create_cq ibv_cmd;
>         uint64_t user_rptr_addr;
> };
> 
> Love to hear ideas on a way forward that doesn't involve rewriting
> everything :(

We stil always have the common structure first.  And at least for
cgroups supports that's what matters.

Re the actual structures - we'll really need to make sure we

 a) expose proper userspace abi headers in the kernel for all code
    in the RDMA subsystem
 b) actually use that in the userspace components

I've posted some initial work toward a) a while ago, and once we
agree on adopting your common repo I'd really like to start through
with that work.  I think it's a pre-requisite for any major new
userspace ABI work.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-11 13:34                   ` Christoph Hellwig
@ 2016-09-11 14:35                     ` Leon Romanovsky
  2016-09-11 17:14                       ` Jason Gunthorpe
  0 siblings, 1 reply; 44+ messages in thread
From: Leon Romanovsky @ 2016-09-11 14:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jason Gunthorpe, Matan Barak, Parav Pandit, Tejun Heo, cgroups,
	linux-doc, Linux Kernel Mailing List, linux-rdma, Li Zefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

[-- Attachment #1: Type: text/plain, Size: 1537 bytes --]

On Sun, Sep 11, 2016 at 03:34:21PM +0200, Christoph Hellwig wrote:
> On Sat, Sep 10, 2016 at 11:01:51AM -0600, Jason Gunthorpe wrote:
> > Sadly, it isn't a step backwards, it is status quo - at least as far
> > as the uapi is concerned.
>
> Sort of, see below:
>
> > struct mlx5_create_cq {
> >         struct ibv_create_cq            ibv_cmd;
> >         __u64                           buf_addr;
> >         __u64                           db_addr;
> >         __u32				cqe_size;
> > };
> >
> > struct iwch_create_cq {
> >         struct ibv_create_cq ibv_cmd;
> >         uint64_t user_rptr_addr;
> > };
> >
> > Love to hear ideas on a way forward that doesn't involve rewriting
> > everything :(
>
> We stil always have the common structure first.  And at least for
> cgroups supports that's what matters.
>
> Re the actual structures - we'll really need to make sure we
>
>  a) expose proper userspace abi headers in the kernel for all code
>     in the RDMA subsystem
>  b) actually use that in the userspace components
>
> I've posted some initial work toward a) a while ago, and once we
> agree on adopting your common repo I'd really like to start through
> with that work.  I think it's a pre-requisite for any major new
> userspace ABI work.

I started to work on it over weekend and it is worth do not do same work twice.

> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-11 14:35                     ` Leon Romanovsky
@ 2016-09-11 17:14                       ` Jason Gunthorpe
  2016-09-11 17:24                         ` Christoph Hellwig
  0 siblings, 1 reply; 44+ messages in thread
From: Jason Gunthorpe @ 2016-09-11 17:14 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Christoph Hellwig, Matan Barak, Parav Pandit, Tejun Heo, cgroups,
	linux-doc, Linux Kernel Mailing List, linux-rdma, Li Zefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

On Sun, Sep 11, 2016 at 05:35:22PM +0300, Leon Romanovsky wrote:

> > We stil always have the common structure first.  And at least for
> > cgroups supports that's what matters.
> >
> > Re the actual structures - we'll really need to make sure we
> >
> >  a) expose proper userspace abi headers in the kernel for all code
> >     in the RDMA subsystem
> >  b) actually use that in the userspace components
> >
> > I've posted some initial work toward a) a while ago, and once we

Did it get merged? Do you have a pointer?

> > agree on adopting your common repo I'd really like to start through
> > with that work.  I think it's a pre-requisite for any major new
> > userspace ABI work.
> 
> I started to work on it over weekend and it is worth do not do same work twice.

Yes, I also agree that it is important before we tackle the uapi
conversion to get this fully sorted.

I've already done several cases working with the existing uapi headers:

https://github.com/jgunthorpe/rdma-plumbing/commit/f4f40689440dbc9c57b55548b04b15fe808a1767
https://github.com/jgunthorpe/rdma-plumbing/commit/0cf1893dce4791dafa035bcb6ee045a6ce0ff3c3
https://github.com/jgunthorpe/rdma-plumbing/commit/0522fc42aac4a5e8fc888dcca4341c9bc1dc58ca

[.. and this is a strong argument why we need the common repo, doing
this without it would be very hard, as everything is cross-linked, I
couldnn't unwind libibcm until I fixed a bit of verbs, and rdmacm can't
even include its uapi header until the duplicate definitions in the
verbs copy are delt with .. and I've also learned we are making
changing to the kernel uapi header and since nothing uses them we never even
compile test :( :( eg
https://github.com/torvalds/linux/commit/b493d91d333e867a043f7ff1397bcba6e2d0dda2]

However, everything under verbs is not straightforward. The files in
userspace are not copies...

user:

struct ibv_query_device {
       __u32 command;
       __u16 in_words;
       __u16 out_words;
       __u64 response;
       __u64 driver_data[0];
};

kernel:

struct ib_uverbs_query_device {
        __u64 response;
        __u64 driver_data[0];
};

eg the userspace version stuffs the header into the struct and the
kernel version does not. Presumably this is for efficiency so that no
copies are required when marshaling. This impacts everything :(

I'm thinking the best way forward might be to use a script and
transform userspace into:

struct ibv_query_device {
	struct ib_uverbs_cmd_hdr hdr;
	struct ib_uverbs_query_device cmd;
};

Jason

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-11 17:14                       ` Jason Gunthorpe
@ 2016-09-11 17:24                         ` Christoph Hellwig
  2016-09-11 17:52                           ` Jason Gunthorpe
  0 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2016-09-11 17:24 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Christoph Hellwig, Matan Barak, Parav Pandit,
	Tejun Heo, cgroups, linux-doc, Linux Kernel Mailing List,
	linux-rdma, Li Zefan, Johannes Weiner, Doug Ledford, Liran Liss,
	Hefty, Sean, Haggai Eran, Jonathan Corbet, james.l.morris, serge,
	Or Gerlitz, Andrew Morton, linux-security-module

On Sun, Sep 11, 2016 at 11:14:09AM -0600, Jason Gunthorpe wrote:
> > > We stil always have the common structure first.  And at least for
> > > cgroups supports that's what matters.
> > >
> > > Re the actual structures - we'll really need to make sure we
> > >
> > >  a) expose proper userspace abi headers in the kernel for all code
> > >     in the RDMA subsystem
> > >  b) actually use that in the userspace components
> > >
> > > I've posted some initial work toward a) a while ago, and once we
> 
> Did it get merged? Do you have a pointer?

http://www.spinics.net/lists/linux-rdma/msg31958.html

> this without it would be very hard, as everything is cross-linked, I
> couldnn't unwind libibcm until I fixed a bit of verbs, and rdmacm can't
> even include its uapi header until the duplicate definitions in the
> verbs copy are delt with .. and I've also learned we are making
> changing to the kernel uapi header and since nothing uses them we never even
> compile test :( :( eg
> https://github.com/torvalds/linux/commit/b493d91d333e867a043f7ff1397bcba6e2d0dda2]

> However, everything under verbs is not straightforward. The files in
> userspace are not copies...
> 
> user:
> 
> struct ibv_query_device {
>        __u32 command;
>        __u16 in_words;
>        __u16 out_words;
>        __u64 response;
>        __u64 driver_data[0];
> };
> 
> kernel:
> 
> struct ib_uverbs_query_device {
>         __u64 response;
>         __u64 driver_data[0];
> };

We'll obviously need different strutures for the libibvers API
and the kernel interface in this case, and we'll need to figure out
how to properly translate them.  I think a cast, plus compile time
type checking ala BUILD_BUG_ON is the way to go.

> eg the userspace version stuffs the header into the struct and the
> kernel version does not. Presumably this is for efficiency so that no
> copies are required when marshaling. This impacts everything :(
> 
> I'm thinking the best way forward might be to use a script and
> transform userspace into:
> 
> struct ibv_query_device {
> 	struct ib_uverbs_cmd_hdr hdr;
> 	struct ib_uverbs_query_device cmd;
> };

That would break the users of the interface.  However automatically
generating the user ABI from the kernel one might still be a good idea
in the long run.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-11 17:24                         ` Christoph Hellwig
@ 2016-09-11 17:52                           ` Jason Gunthorpe
  2016-09-12  5:07                             ` Leon Romanovsky
  0 siblings, 1 reply; 44+ messages in thread
From: Jason Gunthorpe @ 2016-09-11 17:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Leon Romanovsky, Matan Barak, Parav Pandit, Tejun Heo, cgroups,
	linux-doc, Linux Kernel Mailing List, linux-rdma, Li Zefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

On Sun, Sep 11, 2016 at 07:24:45PM +0200, Christoph Hellwig wrote:
> > > > I've posted some initial work toward a) a while ago, and once we
> > 
> > Did it get merged? Do you have a pointer?
> 
> http://www.spinics.net/lists/linux-rdma/msg31958.html

Right, I remember that. Certainly the right direction

> > However, everything under verbs is not straightforward. The files in
> > userspace are not copies...
> > 
> > user:
> > 
> > struct ibv_query_device {
> >        __u32 command;
> >        __u16 in_words;
> >        __u16 out_words;
> >        __u64 response;
> >        __u64 driver_data[0];
> > };
> > 
> > kernel:
> > 
> > struct ib_uverbs_query_device {
> >         __u64 response;
> >         __u64 driver_data[0];
> > };
> 
> We'll obviously need different strutures for the libibvers API
> and the kernel interface in this case, and we'll need to figure out
> how to properly translate them.  I think a cast, plus compile time
> type checking ala BUILD_BUG_ON is the way to go.

I'm not sure I follow, which would I cast?

BUILD_BUG_ON(sizeof(ibv_query_device) == sizeof(ib_uverbs_cmd_hdr) +
             sizeof(ib_uverbs_query_device))

?

> > I'm thinking the best way forward might be to use a script and
> > transform userspace into:
> > 
> > struct ibv_query_device {
> > 	struct ib_uverbs_cmd_hdr hdr;
> > 	struct ib_uverbs_query_device cmd;
> > };
> 
> That would break the users of the interface.

Sorry, I mean doing this inside rdma-plumbing. Since the change is ABI
identical the modified libibverbs would still be binary compatible
with all providers but not source compatible. Since all kernel
supported providers are in rdma-plumbing we can add the '.cmd.' at the
same time.

The kernel uapi header would stay the same.

> However automatically generating the user ABI from the kernel one
> might still be a good idea in the long run.

My preference would be to try and use the kernel headers directly.

Jason

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-11 17:52                           ` Jason Gunthorpe
@ 2016-09-12  5:07                             ` Leon Romanovsky
  2016-09-14  7:06                               ` Parav Pandit
  0 siblings, 1 reply; 44+ messages in thread
From: Leon Romanovsky @ 2016-09-12  5:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Matan Barak, Parav Pandit, Tejun Heo, cgroups,
	linux-doc, Linux Kernel Mailing List, linux-rdma, Li Zefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

[-- Attachment #1: Type: text/plain, Size: 2227 bytes --]

On Sun, Sep 11, 2016 at 11:52:35AM -0600, Jason Gunthorpe wrote:
> On Sun, Sep 11, 2016 at 07:24:45PM +0200, Christoph Hellwig wrote:
> > > > > I've posted some initial work toward a) a while ago, and once we
> > >
> > > Did it get merged? Do you have a pointer?
> >
> > http://www.spinics.net/lists/linux-rdma/msg31958.html
>
> Right, I remember that. Certainly the right direction
>
> > > However, everything under verbs is not straightforward. The files in
> > > userspace are not copies...
> > >
> > > user:
> > >
> > > struct ibv_query_device {
> > >        __u32 command;
> > >        __u16 in_words;
> > >        __u16 out_words;
> > >        __u64 response;
> > >        __u64 driver_data[0];
> > > };
> > >
> > > kernel:
> > >
> > > struct ib_uverbs_query_device {
> > >         __u64 response;
> > >         __u64 driver_data[0];
> > > };
> >
> > We'll obviously need different strutures for the libibvers API
> > and the kernel interface in this case, and we'll need to figure out
> > how to properly translate them.  I think a cast, plus compile time
> > type checking ala BUILD_BUG_ON is the way to go.
>
> I'm not sure I follow, which would I cast?
>
> BUILD_BUG_ON(sizeof(ibv_query_device) == sizeof(ib_uverbs_cmd_hdr) +
>              sizeof(ib_uverbs_query_device))
>
> ?
>
> > > I'm thinking the best way forward might be to use a script and
> > > transform userspace into:
> > >
> > > struct ibv_query_device {
> > > 	struct ib_uverbs_cmd_hdr hdr;
> > > 	struct ib_uverbs_query_device cmd;
> > > };
> >
> > That would break the users of the interface.
>
> Sorry, I mean doing this inside rdma-plumbing. Since the change is ABI
> identical the modified libibverbs would still be binary compatible
> with all providers but not source compatible. Since all kernel
> supported providers are in rdma-plumbing we can add the '.cmd.' at the
> same time.
>
> The kernel uapi header would stay the same.
>
> > However automatically generating the user ABI from the kernel one
> > might still be a good idea in the long run.
>
> My preference would be to try and use the kernel headers directly.

I thought the same, especially after realizing that they are almost
copy/paste from the vendor *-abi.h files.

>
> Jason

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-12  5:07                             ` Leon Romanovsky
@ 2016-09-14  7:06                               ` Parav Pandit
  2016-09-14  8:14                                 ` Matan Barak
                                                   ` (2 more replies)
  0 siblings, 3 replies; 44+ messages in thread
From: Parav Pandit @ 2016-09-14  7:06 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, Christoph Hellwig, Matan Barak, Tejun Heo,
	cgroups, linux-doc, Linux Kernel Mailing List, linux-rdma,
	Li Zefan, Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

Hi Dennis,

Do you know how would HFI1 driver would work along with rdma cgroup?

Hi Matan, Leon, Jason,
Apart from HFI1, is there any other concern?
Or Patch is good to go?

4.8 dates are close by (2 weeks) and there are two git trees involved
(that might cause merge error to Linus) so if there are no issues, I
would like to make request to Doug to consider it for 4.8 early on.

Parav

On Mon, Sep 12, 2016 at 10:37 AM, Leon Romanovsky <leon@kernel.org> wrote:
> On Sun, Sep 11, 2016 at 11:52:35AM -0600, Jason Gunthorpe wrote:
>> On Sun, Sep 11, 2016 at 07:24:45PM +0200, Christoph Hellwig wrote:
>> > > > > I've posted some initial work toward a) a while ago, and once we
>> > >
>> > > Did it get merged? Do you have a pointer?
>> >
>> > http://www.spinics.net/lists/linux-rdma/msg31958.html
>>
>> Right, I remember that. Certainly the right direction
>>
>> > > However, everything under verbs is not straightforward. The files in
>> > > userspace are not copies...
>> > >
>> > > user:
>> > >
>> > > struct ibv_query_device {
>> > >        __u32 command;
>> > >        __u16 in_words;
>> > >        __u16 out_words;
>> > >        __u64 response;
>> > >        __u64 driver_data[0];
>> > > };
>> > >
>> > > kernel:
>> > >
>> > > struct ib_uverbs_query_device {
>> > >         __u64 response;
>> > >         __u64 driver_data[0];
>> > > };
>> >
>> > We'll obviously need different strutures for the libibvers API
>> > and the kernel interface in this case, and we'll need to figure out
>> > how to properly translate them.  I think a cast, plus compile time
>> > type checking ala BUILD_BUG_ON is the way to go.
>>
>> I'm not sure I follow, which would I cast?
>>
>> BUILD_BUG_ON(sizeof(ibv_query_device) == sizeof(ib_uverbs_cmd_hdr) +
>>              sizeof(ib_uverbs_query_device))
>>
>> ?
>>
>> > > I'm thinking the best way forward might be to use a script and
>> > > transform userspace into:
>> > >
>> > > struct ibv_query_device {
>> > >   struct ib_uverbs_cmd_hdr hdr;
>> > >   struct ib_uverbs_query_device cmd;
>> > > };
>> >
>> > That would break the users of the interface.
>>
>> Sorry, I mean doing this inside rdma-plumbing. Since the change is ABI
>> identical the modified libibverbs would still be binary compatible
>> with all providers but not source compatible. Since all kernel
>> supported providers are in rdma-plumbing we can add the '.cmd.' at the
>> same time.
>>
>> The kernel uapi header would stay the same.
>>
>> > However automatically generating the user ABI from the kernel one
>> > might still be a good idea in the long run.
>>
>> My preference would be to try and use the kernel headers directly.
>
> I thought the same, especially after realizing that they are almost
> copy/paste from the vendor *-abi.h files.
>
>>
>> Jason

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-14  7:06                               ` Parav Pandit
@ 2016-09-14  8:14                                 ` Matan Barak
  2016-09-14  9:19                                   ` Parav Pandit
  2016-09-15 18:56                                 ` Leon Romanovsky
  2016-09-19 13:10                                 ` Dalessandro, Dennis
  2 siblings, 1 reply; 44+ messages in thread
From: Matan Barak @ 2016-09-14  8:14 UTC (permalink / raw)
  To: Parav Pandit, Leon Romanovsky
  Cc: Jason Gunthorpe, Christoph Hellwig, Tejun Heo, cgroups,
	linux-doc, Linux Kernel Mailing List, linux-rdma, Li Zefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

On 14/09/2016 10:06, Parav Pandit wrote:
> Hi Dennis,
>
> Do you know how would HFI1 driver would work along with rdma cgroup?
>
> Hi Matan, Leon, Jason,
> Apart from HFI1, is there any other concern?

I just wonder how things like RSS will work. For example, a RSS QP 
doesn't really have a queue (if I recall, it's connected to work queues 
via an indirection table). So, when a user creates such a QP, do you 
want to account it as a regular QP?
How are work queues accounted?


> Or Patch is good to go?
>
> 4.8 dates are close by (2 weeks) and there are two git trees involved
> (that might cause merge error to Linus) so if there are no issues, I
> would like to make request to Doug to consider it for 4.8 early on.
>
> Parav
>
> On Mon, Sep 12, 2016 at 10:37 AM, Leon Romanovsky <leon@kernel.org> wrote:
>> On Sun, Sep 11, 2016 at 11:52:35AM -0600, Jason Gunthorpe wrote:
>>> On Sun, Sep 11, 2016 at 07:24:45PM +0200, Christoph Hellwig wrote:
>>>>>>> I've posted some initial work toward a) a while ago, and once we
>>>>>
>>>>> Did it get merged? Do you have a pointer?
>>>>
>>>> http://www.spinics.net/lists/linux-rdma/msg31958.html
>>>
>>> Right, I remember that. Certainly the right direction
>>>
>>>>> However, everything under verbs is not straightforward. The files in
>>>>> userspace are not copies...
>>>>>
>>>>> user:
>>>>>
>>>>> struct ibv_query_device {
>>>>>        __u32 command;
>>>>>        __u16 in_words;
>>>>>        __u16 out_words;
>>>>>        __u64 response;
>>>>>        __u64 driver_data[0];
>>>>> };
>>>>>
>>>>> kernel:
>>>>>
>>>>> struct ib_uverbs_query_device {
>>>>>         __u64 response;
>>>>>         __u64 driver_data[0];
>>>>> };
>>>>
>>>> We'll obviously need different strutures for the libibvers API
>>>> and the kernel interface in this case, and we'll need to figure out
>>>> how to properly translate them.  I think a cast, plus compile time
>>>> type checking ala BUILD_BUG_ON is the way to go.
>>>
>>> I'm not sure I follow, which would I cast?
>>>
>>> BUILD_BUG_ON(sizeof(ibv_query_device) == sizeof(ib_uverbs_cmd_hdr) +
>>>              sizeof(ib_uverbs_query_device))
>>>
>>> ?
>>>
>>>>> I'm thinking the best way forward might be to use a script and
>>>>> transform userspace into:
>>>>>
>>>>> struct ibv_query_device {
>>>>>   struct ib_uverbs_cmd_hdr hdr;
>>>>>   struct ib_uverbs_query_device cmd;
>>>>> };
>>>>
>>>> That would break the users of the interface.
>>>
>>> Sorry, I mean doing this inside rdma-plumbing. Since the change is ABI
>>> identical the modified libibverbs would still be binary compatible
>>> with all providers but not source compatible. Since all kernel
>>> supported providers are in rdma-plumbing we can add the '.cmd.' at the
>>> same time.
>>>
>>> The kernel uapi header would stay the same.
>>>
>>>> However automatically generating the user ABI from the kernel one
>>>> might still be a good idea in the long run.
>>>
>>> My preference would be to try and use the kernel headers directly.
>>
>> I thought the same, especially after realizing that they are almost
>> copy/paste from the vendor *-abi.h files.
>>
>>>
>>> Jason

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-14  8:14                                 ` Matan Barak
@ 2016-09-14  9:19                                   ` Parav Pandit
  0 siblings, 0 replies; 44+ messages in thread
From: Parav Pandit @ 2016-09-14  9:19 UTC (permalink / raw)
  To: Matan Barak
  Cc: Leon Romanovsky, Jason Gunthorpe, Christoph Hellwig, Tejun Heo,
	cgroups, linux-doc, Linux Kernel Mailing List, linux-rdma,
	Li Zefan, Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

Hi Matan,

On Wed, Sep 14, 2016 at 1:44 PM, Matan Barak <matanb@mellanox.com> wrote:
> On 14/09/2016 10:06, Parav Pandit wrote:
>>
>> Hi Dennis,
>>
>> Do you know how would HFI1 driver would work along with rdma cgroup?
>>
>> Hi Matan, Leon, Jason,
>> Apart from HFI1, is there any other concern?
>
>
> I just wonder how things like RSS will work. For example, a RSS QP doesn't
> really have a queue (if I recall, it's connected to work queues via an
> indirection table). So, when a user creates such a QP, do you want to
> account it as a regular QP?
> How are work queues accounted?

ib_create_rwq_ind_table verb allows creating indirection table.
I assume it allows creating multiple such tables.
If it is so, than number of tables should be a cgroup resource that we
can add in follow on patch.
By doing so, one container doesn't takeaway all the tables.

>
>
>> Or Patch is good to go?
>>
>> 4.8 dates are close by (2 weeks) and there are two git trees involved
>> (that might cause merge error to Linus) so if there are no issues, I
>> would like to make request to Doug to consider it for 4.8 early on.
>>
>> Parav
>>
>> On Mon, Sep 12, 2016 at 10:37 AM, Leon Romanovsky <leon@kernel.org> wrote:
>>>
>>> On Sun, Sep 11, 2016 at 11:52:35AM -0600, Jason Gunthorpe wrote:
>>>>
>>>> On Sun, Sep 11, 2016 at 07:24:45PM +0200, Christoph Hellwig wrote:
>>>>>>>>
>>>>>>>> I've posted some initial work toward a) a while ago, and once we
>>>>>>
>>>>>>
>>>>>> Did it get merged? Do you have a pointer?
>>>>>
>>>>>
>>>>> http://www.spinics.net/lists/linux-rdma/msg31958.html
>>>>
>>>>
>>>> Right, I remember that. Certainly the right direction
>>>>
>>>>>> However, everything under verbs is not straightforward. The files in
>>>>>> userspace are not copies...
>>>>>>
>>>>>> user:
>>>>>>
>>>>>> struct ibv_query_device {
>>>>>>        __u32 command;
>>>>>>        __u16 in_words;
>>>>>>        __u16 out_words;
>>>>>>        __u64 response;
>>>>>>        __u64 driver_data[0];
>>>>>> };
>>>>>>
>>>>>> kernel:
>>>>>>
>>>>>> struct ib_uverbs_query_device {
>>>>>>         __u64 response;
>>>>>>         __u64 driver_data[0];
>>>>>> };
>>>>>
>>>>>
>>>>> We'll obviously need different strutures for the libibvers API
>>>>> and the kernel interface in this case, and we'll need to figure out
>>>>> how to properly translate them.  I think a cast, plus compile time
>>>>> type checking ala BUILD_BUG_ON is the way to go.
>>>>
>>>>
>>>> I'm not sure I follow, which would I cast?
>>>>
>>>> BUILD_BUG_ON(sizeof(ibv_query_device) == sizeof(ib_uverbs_cmd_hdr) +
>>>>              sizeof(ib_uverbs_query_device))
>>>>
>>>> ?
>>>>
>>>>>> I'm thinking the best way forward might be to use a script and
>>>>>> transform userspace into:
>>>>>>
>>>>>> struct ibv_query_device {
>>>>>>   struct ib_uverbs_cmd_hdr hdr;
>>>>>>   struct ib_uverbs_query_device cmd;
>>>>>> };
>>>>>
>>>>>
>>>>> That would break the users of the interface.
>>>>
>>>>
>>>> Sorry, I mean doing this inside rdma-plumbing. Since the change is ABI
>>>> identical the modified libibverbs would still be binary compatible
>>>> with all providers but not source compatible. Since all kernel
>>>> supported providers are in rdma-plumbing we can add the '.cmd.' at the
>>>> same time.
>>>>
>>>> The kernel uapi header would stay the same.
>>>>
>>>>> However automatically generating the user ABI from the kernel one
>>>>> might still be a good idea in the long run.
>>>>
>>>>
>>>> My preference would be to try and use the kernel headers directly.
>>>
>>>
>>> I thought the same, especially after realizing that they are almost
>>> copy/paste from the vendor *-abi.h files.
>>>
>>>>
>>>> Jason
>
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-14  7:06                               ` Parav Pandit
  2016-09-14  8:14                                 ` Matan Barak
@ 2016-09-15 18:56                                 ` Leon Romanovsky
  2016-09-21  4:43                                   ` Parav Pandit
  2016-09-19 13:10                                 ` Dalessandro, Dennis
  2 siblings, 1 reply; 44+ messages in thread
From: Leon Romanovsky @ 2016-09-15 18:56 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Gunthorpe, Christoph Hellwig, Matan Barak, Tejun Heo,
	cgroups, linux-doc, Linux Kernel Mailing List, linux-rdma,
	Li Zefan, Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

[-- Attachment #1: Type: text/plain, Size: 3029 bytes --]

On Wed, Sep 14, 2016 at 12:36:19PM +0530, Parav Pandit wrote:
> Hi Dennis,
>
> Do you know how would HFI1 driver would work along with rdma cgroup?
>
> Hi Matan, Leon, Jason,
> Apart from HFI1, is there any other concern?
> Or Patch is good to go?

I didn't review it yet :(.
Sorry

>
> 4.8 dates are close by (2 weeks) and there are two git trees involved
> (that might cause merge error to Linus) so if there are no issues, I
> would like to make request to Doug to consider it for 4.8 early on.
>
> Parav
>
> On Mon, Sep 12, 2016 at 10:37 AM, Leon Romanovsky <leon@kernel.org> wrote:
> > On Sun, Sep 11, 2016 at 11:52:35AM -0600, Jason Gunthorpe wrote:
> >> On Sun, Sep 11, 2016 at 07:24:45PM +0200, Christoph Hellwig wrote:
> >> > > > > I've posted some initial work toward a) a while ago, and once we
> >> > >
> >> > > Did it get merged? Do you have a pointer?
> >> >
> >> > http://www.spinics.net/lists/linux-rdma/msg31958.html
> >>
> >> Right, I remember that. Certainly the right direction
> >>
> >> > > However, everything under verbs is not straightforward. The files in
> >> > > userspace are not copies...
> >> > >
> >> > > user:
> >> > >
> >> > > struct ibv_query_device {
> >> > >        __u32 command;
> >> > >        __u16 in_words;
> >> > >        __u16 out_words;
> >> > >        __u64 response;
> >> > >        __u64 driver_data[0];
> >> > > };
> >> > >
> >> > > kernel:
> >> > >
> >> > > struct ib_uverbs_query_device {
> >> > >         __u64 response;
> >> > >         __u64 driver_data[0];
> >> > > };
> >> >
> >> > We'll obviously need different strutures for the libibvers API
> >> > and the kernel interface in this case, and we'll need to figure out
> >> > how to properly translate them.  I think a cast, plus compile time
> >> > type checking ala BUILD_BUG_ON is the way to go.
> >>
> >> I'm not sure I follow, which would I cast?
> >>
> >> BUILD_BUG_ON(sizeof(ibv_query_device) == sizeof(ib_uverbs_cmd_hdr) +
> >>              sizeof(ib_uverbs_query_device))
> >>
> >> ?
> >>
> >> > > I'm thinking the best way forward might be to use a script and
> >> > > transform userspace into:
> >> > >
> >> > > struct ibv_query_device {
> >> > >   struct ib_uverbs_cmd_hdr hdr;
> >> > >   struct ib_uverbs_query_device cmd;
> >> > > };
> >> >
> >> > That would break the users of the interface.
> >>
> >> Sorry, I mean doing this inside rdma-plumbing. Since the change is ABI
> >> identical the modified libibverbs would still be binary compatible
> >> with all providers but not source compatible. Since all kernel
> >> supported providers are in rdma-plumbing we can add the '.cmd.' at the
> >> same time.
> >>
> >> The kernel uapi header would stay the same.
> >>
> >> > However automatically generating the user ABI from the kernel one
> >> > might still be a good idea in the long run.
> >>
> >> My preference would be to try and use the kernel headers directly.
> >
> > I thought the same, especially after realizing that they are almost
> > copy/paste from the vendor *-abi.h files.
> >
> >>
> >> Jason

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-14  7:06                               ` Parav Pandit
  2016-09-14  8:14                                 ` Matan Barak
  2016-09-15 18:56                                 ` Leon Romanovsky
@ 2016-09-19 13:10                                 ` Dalessandro, Dennis
  2016-09-19 17:00                                   ` Parav Pandit
  2 siblings, 1 reply; 44+ messages in thread
From: Dalessandro, Dennis @ 2016-09-19 13:10 UTC (permalink / raw)
  To: pandit.parav, leon
  Cc: lizefan, linux-kernel, corbet, linux-rdma, cgroups, ogerlitz,
	hch, linux-security-module, haggaie, hannes, Hefty, Sean, akpm,
	james.l.morris, tj, liranl, jgunthorpe, linux-doc, dledford,
	matanb, serge

On Wed, 2016-09-14 at 12:36 +0530, Parav Pandit wrote:
> Hi Dennis,
> 
> Do you know how would HFI1 driver would work along with rdma cgroup?

Keep in mind HFI1 driver has two "modes" of operation. We support
verbs, and would surely fall in line with whatever cgroups do for IB
core. For our psm interface, not sure how cgroups would come into play.
Psm is designed to expose the hw to user and avoid the kernel when
possible adding more kernel control is sort of contrary to that.

Now that being said, Christoph recently made mention of maybe having a
drivers/psm [1]. I really haven't had a chance to think about the
implications of that, but maybe it's worth considering, after all we
have two implementations, qib and hfi1. So anyway I'm not sure we need
to be too concerned about cgroups right now as far as psm side of
things goes.

Depending how things shake out for the uAPI rewrite, or verbs 2.0 or
whatever we are calling it today things may change.

[1] http://marc.info/?l=linux-rdma&m=147401714313831&w=2

-Denny

> Hi Matan, Leon, Jason,
> Apart from HFI1, is there any other concern?
> Or Patch is good to go?
> 
> 4.8 dates are close by (2 weeks) and there are two git trees involved
> (that might cause merge error to Linus) so if there are no issues, I
> would like to make request to Doug to consider it for 4.8 early on.
> 
> Parav
> 
> On Mon, Sep 12, 2016 at 10:37 AM, Leon Romanovsky <leon@kernel.org>
> wrote:
> > On Sun, Sep 11, 2016 at 11:52:35AM -0600, Jason Gunthorpe wrote:
> > > On Sun, Sep 11, 2016 at 07:24:45PM +0200, Christoph Hellwig
> > > wrote:
> > > > > > > I've posted some initial work toward a) a while ago, and
> > > > > > > once we
> > > > > 
> > > > > Did it get merged? Do you have a pointer?
> > > > 
> > > > http://www.spinics.net/lists/linux-rdma/msg31958.html
> > > 
> > > Right, I remember that. Certainly the right direction
> > > 
> > > > > However, everything under verbs is not straightforward. The
> > > > > files in
> > > > > userspace are not copies...
> > > > > 
> > > > > user:
> > > > > 
> > > > > struct ibv_query_device {
> > > > >        __u32 command;
> > > > >        __u16 in_words;
> > > > >        __u16 out_words;
> > > > >        __u64 response;
> > > > >        __u64 driver_data[0];
> > > > > };
> > > > > 
> > > > > kernel:
> > > > > 
> > > > > struct ib_uverbs_query_device {
> > > > >         __u64 response;
> > > > >         __u64 driver_data[0];
> > > > > };
> > > > 
> > > > We'll obviously need different strutures for the libibvers API
> > > > and the kernel interface in this case, and we'll need to figure
> > > > out
> > > > how to properly translate them.  I think a cast, plus compile
> > > > time
> > > > type checking ala BUILD_BUG_ON is the way to go.
> > > 
> > > I'm not sure I follow, which would I cast?
> > > 
> > > BUILD_BUG_ON(sizeof(ibv_query_device) ==
> > > sizeof(ib_uverbs_cmd_hdr) +
> > >              sizeof(ib_uverbs_query_device))
> > > 
> > > ?
> > > 
> > > > > I'm thinking the best way forward might be to use a script
> > > > > and
> > > > > transform userspace into:
> > > > > 
> > > > > struct ibv_query_device {
> > > > >   struct ib_uverbs_cmd_hdr hdr;
> > > > >   struct ib_uverbs_query_device cmd;
> > > > > };
> > > > 
> > > > That would break the users of the interface.
> > > 
> > > Sorry, I mean doing this inside rdma-plumbing. Since the change
> > > is ABI
> > > identical the modified libibverbs would still be binary
> > > compatible
> > > with all providers but not source compatible. Since all kernel
> > > supported providers are in rdma-plumbing we can add the '.cmd.'
> > > at the
> > > same time.
> > > 
> > > The kernel uapi header would stay the same.
> > > 
> > > > However automatically generating the user ABI from the kernel
> > > > one
> > > > might still be a good idea in the long run.
> > > 
> > > My preference would be to try and use the kernel headers
> > > directly.
> > 
> > I thought the same, especially after realizing that they are almost
> > copy/paste from the vendor *-abi.h files.
> > 
> > > 
> > > Jason
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" 
> in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-19 13:10                                 ` Dalessandro, Dennis
@ 2016-09-19 17:00                                   ` Parav Pandit
  0 siblings, 0 replies; 44+ messages in thread
From: Parav Pandit @ 2016-09-19 17:00 UTC (permalink / raw)
  To: Dalessandro, Dennis
  Cc: leon, lizefan, linux-kernel, corbet, linux-rdma, cgroups,
	ogerlitz, hch, linux-security-module, haggaie, hannes, Hefty,
	Sean, akpm, james.l.morris, tj, liranl, jgunthorpe, linux-doc,
	dledford, matanb, serge

Hi Denny,

On Mon, Sep 19, 2016 at 6:40 PM, Dalessandro, Dennis
<dennis.dalessandro@intel.com> wrote:
> On Wed, 2016-09-14 at 12:36 +0530, Parav Pandit wrote:
>> Hi Dennis,
>>
>> Do you know how would HFI1 driver would work along with rdma cgroup?
>
> Keep in mind HFI1 driver has two "modes" of operation. We support
> verbs, and would surely fall in line with whatever cgroups do for IB
> core.
Thanks for the feedback.

> For our psm interface, not sure how cgroups would come into play.
> Psm is designed to expose the hw to user and avoid the kernel when
> possible adding more kernel control is sort of contrary to that.
>
Yes, PSM is currently out of RDMA cgroup and in future we can take a
look on how things shape as subsystem if it does.

> Now that being said, Christoph recently made mention of maybe having a
> drivers/psm [1]. I really haven't had a chance to think about the
> implications of that, but maybe it's worth considering, after all we
> have two implementations, qib and hfi1. So anyway I'm not sure we need
> to be too concerned about cgroups right now as far as psm side of
> things goes.
>
o.k.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-15 18:56                                 ` Leon Romanovsky
@ 2016-09-21  4:43                                   ` Parav Pandit
  2016-09-21 14:26                                     ` Tejun Heo
  0 siblings, 1 reply; 44+ messages in thread
From: Parav Pandit @ 2016-09-21  4:43 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, Christoph Hellwig, Matan Barak, Tejun Heo,
	cgroups, linux-doc, Linux Kernel Mailing List, linux-rdma,
	Li Zefan, Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

Hi Leon,

On Fri, Sep 16, 2016 at 12:26 AM, Leon Romanovsky <leon@kernel.org> wrote:
> On Wed, Sep 14, 2016 at 12:36:19PM +0530, Parav Pandit wrote:
>> Hi Dennis,
>>
>> Do you know how would HFI1 driver would work along with rdma cgroup?
>>
>> Hi Matan, Leon, Jason,
>> Apart from HFI1, is there any other concern?
>> Or Patch is good to go?
>
> I didn't review it yet :(.
> Sorry
>

We have completed review from Tejun, Christoph.
HFI driver folks also provided feedback for Intel drivers.
Matan's also doesn't have any more comments.

If possible, if you can also review, it will be helpful.

I have some more changes unrelated to cgroup in same files in both the git tree.
Pushing them now either results into merge conflict later on for
Doug/Tejun, or requires rebase and resending patch.
If you can review, we can avoid such rework.

>>
>> 4.8 dates are close by (2 weeks) and there are two git trees involved
>> (that might cause merge error to Linus) so if there are no issues, I
>> would like to make request to Doug to consider it for 4.8 early on.
>>
>> Parav
>>
>> On Mon, Sep 12, 2016 at 10:37 AM, Leon Romanovsky <leon@kernel.org> wrote:
>> > On Sun, Sep 11, 2016 at 11:52:35AM -0600, Jason Gunthorpe wrote:
>> >> On Sun, Sep 11, 2016 at 07:24:45PM +0200, Christoph Hellwig wrote:
>> >> > > > > I've posted some initial work toward a) a while ago, and once we
>> >> > >
>> >> > > Did it get merged? Do you have a pointer?
>> >> >
>> >> > http://www.spinics.net/lists/linux-rdma/msg31958.html
>> >>
>> >> Right, I remember that. Certainly the right direction
>> >>
>> >> > > However, everything under verbs is not straightforward. The files in
>> >> > > userspace are not copies...
>> >> > >
>> >> > > user:
>> >> > >
>> >> > > struct ibv_query_device {
>> >> > >        __u32 command;
>> >> > >        __u16 in_words;
>> >> > >        __u16 out_words;
>> >> > >        __u64 response;
>> >> > >        __u64 driver_data[0];
>> >> > > };
>> >> > >
>> >> > > kernel:
>> >> > >
>> >> > > struct ib_uverbs_query_device {
>> >> > >         __u64 response;
>> >> > >         __u64 driver_data[0];
>> >> > > };
>> >> >
>> >> > We'll obviously need different strutures for the libibvers API
>> >> > and the kernel interface in this case, and we'll need to figure out
>> >> > how to properly translate them.  I think a cast, plus compile time
>> >> > type checking ala BUILD_BUG_ON is the way to go.
>> >>
>> >> I'm not sure I follow, which would I cast?
>> >>
>> >> BUILD_BUG_ON(sizeof(ibv_query_device) == sizeof(ib_uverbs_cmd_hdr) +
>> >>              sizeof(ib_uverbs_query_device))
>> >>
>> >> ?
>> >>
>> >> > > I'm thinking the best way forward might be to use a script and
>> >> > > transform userspace into:
>> >> > >
>> >> > > struct ibv_query_device {
>> >> > >   struct ib_uverbs_cmd_hdr hdr;
>> >> > >   struct ib_uverbs_query_device cmd;
>> >> > > };
>> >> >
>> >> > That would break the users of the interface.
>> >>
>> >> Sorry, I mean doing this inside rdma-plumbing. Since the change is ABI
>> >> identical the modified libibverbs would still be binary compatible
>> >> with all providers but not source compatible. Since all kernel
>> >> supported providers are in rdma-plumbing we can add the '.cmd.' at the
>> >> same time.
>> >>
>> >> The kernel uapi header would stay the same.
>> >>
>> >> > However automatically generating the user ABI from the kernel one
>> >> > might still be a good idea in the long run.
>> >>
>> >> My preference would be to try and use the kernel headers directly.
>> >
>> > I thought the same, especially after realizing that they are almost
>> > copy/paste from the vendor *-abi.h files.
>> >
>> >>
>> >> Jason

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-21  4:43                                   ` Parav Pandit
@ 2016-09-21 14:26                                     ` Tejun Heo
  2016-09-21 16:02                                       ` Parav Pandit
  0 siblings, 1 reply; 44+ messages in thread
From: Tejun Heo @ 2016-09-21 14:26 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Leon Romanovsky, Jason Gunthorpe, Christoph Hellwig, Matan Barak,
	cgroups, linux-doc, Linux Kernel Mailing List, linux-rdma,
	Li Zefan, Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

Hello, Parav.

On Wed, Sep 21, 2016 at 10:13:38AM +0530, Parav Pandit wrote:
> We have completed review from Tejun, Christoph.
> HFI driver folks also provided feedback for Intel drivers.
> Matan's also doesn't have any more comments.
> 
> If possible, if you can also review, it will be helpful.
> 
> I have some more changes unrelated to cgroup in same files in both the git tree.
> Pushing them now either results into merge conflict later on for
> Doug/Tejun, or requires rebase and resending patch.
> If you can review, we can avoid such rework.

My impression of the thread was that there doesn't seem to be enough
of consensus around how rdma resources should be defined.  Is that
part agreed upon now?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-21 14:26                                     ` Tejun Heo
@ 2016-09-21 16:02                                       ` Parav Pandit
  2016-10-04 18:19                                         ` Parav Pandit
  0 siblings, 1 reply; 44+ messages in thread
From: Parav Pandit @ 2016-09-21 16:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Leon Romanovsky, Jason Gunthorpe, Christoph Hellwig, Matan Barak,
	cgroups, linux-doc, Linux Kernel Mailing List, linux-rdma,
	Li Zefan, Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

Hi Tejun,

On Wed, Sep 21, 2016 at 7:56 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Parav.
>
> On Wed, Sep 21, 2016 at 10:13:38AM +0530, Parav Pandit wrote:
>> We have completed review from Tejun, Christoph.
>> HFI driver folks also provided feedback for Intel drivers.
>> Matan's also doesn't have any more comments.
>>
>> If possible, if you can also review, it will be helpful.
>>
>> I have some more changes unrelated to cgroup in same files in both the git tree.
>> Pushing them now either results into merge conflict later on for
>> Doug/Tejun, or requires rebase and resending patch.
>> If you can review, we can avoid such rework.
>
> My impression of the thread was that there doesn't seem to be enough
> of consensus around how rdma resources should be defined.  Is that
> part agreed upon now?
>

We ended up discussing few points on different thread [1].

There was confusion on how some non-rdma/non-IB drivers would work
with rdma cgroup from Matan.
Christoph explained how they don't fit in the rdma subsystem and
therefore its not prime target to addess.
Intel driver maintainer Denny also acknowledged same on [2].
IB compliant drivers of Intel support rdma cgroup as explained in [2].
With that usnic and Intel psm drivers falls out of rdma cgroup support
as they don't fit very well in the verbs definition.

[1] https://www.spinics.net/lists/linux-rdma/msg40340.html
[2] http://www.spinics.net/lists/linux-rdma/msg40717.html

I will wait for Leon's review comments if he has different view on architecture.
Back in April when I met face-to-face to Leon and Haggai, Leon was in
support to have kernel defined the rdma resources as suggested by
Christoph and Tejun instead of IB/RDMA subsystem.
I will wait for his comments if his views have changed with new uAPI
taking shape.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-21 16:02                                       ` Parav Pandit
@ 2016-10-04 18:19                                         ` Parav Pandit
  2016-10-05  6:37                                           ` Christoph Hellwig
  2016-10-18 20:15                                           ` Parav Pandit
  0 siblings, 2 replies; 44+ messages in thread
From: Parav Pandit @ 2016-10-04 18:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Leon Romanovsky, Jason Gunthorpe, Christoph Hellwig, Matan Barak,
	cgroups, linux-doc, Linux Kernel Mailing List, linux-rdma,
	Li Zefan, Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

Hi Doug,

I am still waiting for Leon to provide his comments if any on rdma cgroup.
>From other email context, he was on vacation last week.
While we wait for his comments, I wanted to know your view of this
patchset in 4.9 merge window.

To summarize the discussion that happened in two threads.

[1] Ack by Tejun, asking for review from rdma list
[2] quick review by Christoph on patch-v11 (patch 12 has only typo corrections)
[3] Christoph's ack on architecture of rdma cgroup and fitting it with ABI
[4] My response on Matan's query on RSS indirection table
[5] Response from Intel on their driver support for Matan's query
[6] Christoph's point on architecture, which we are following in new
ABI and current ABI

I have reviewed recent patch [7] from Matan where I see IB verbs
objects are still handled through common path as suggested by
Christoph.

I do not see any issues with rdma cgroup patchset other than it requires rebase.
Am I missing something?
Can you please help me - What would be required to merge it to 4.9?

[1] https://lkml.org/lkml/2016/8/31/494
[2] https://lkml.org/lkml/2016/8/25/146
[3] https://lkml.org/lkml/2016/9/10/175
[4] https://lkml.org/lkml/2016/9/14/221
[5] https://lkml.org/lkml/2016/9/19/571
[6] http://www.spinics.net/lists/linux-rdma/msg40337.html
[7] email subject: [RFC ABI V4 0/7] SG-based RDMA ABI Proposal

Regards,
Parav Pandit

On Wed, Sep 21, 2016 at 9:32 PM, Parav Pandit <pandit.parav@gmail.com> wrote:
> Hi Tejun,
>
> On Wed, Sep 21, 2016 at 7:56 PM, Tejun Heo <tj@kernel.org> wrote:
>> Hello, Parav.
>>
>> On Wed, Sep 21, 2016 at 10:13:38AM +0530, Parav Pandit wrote:
>>> We have completed review from Tejun, Christoph.
>>> HFI driver folks also provided feedback for Intel drivers.
>>> Matan's also doesn't have any more comments.
>>>
>>> If possible, if you can also review, it will be helpful.
>>>
>>> I have some more changes unrelated to cgroup in same files in both the git tree.
>>> Pushing them now either results into merge conflict later on for
>>> Doug/Tejun, or requires rebase and resending patch.
>>> If you can review, we can avoid such rework.
>>
>> My impression of the thread was that there doesn't seem to be enough
>> of consensus around how rdma resources should be defined.  Is that
>> part agreed upon now?
>>
>
> We ended up discussing few points on different thread [1].
>
> There was confusion on how some non-rdma/non-IB drivers would work
> with rdma cgroup from Matan.
> Christoph explained how they don't fit in the rdma subsystem and
> therefore its not prime target to addess.
> Intel driver maintainer Denny also acknowledged same on [2].
> IB compliant drivers of Intel support rdma cgroup as explained in [2].
> With that usnic and Intel psm drivers falls out of rdma cgroup support
> as they don't fit very well in the verbs definition.
>
> [1] https://www.spinics.net/lists/linux-rdma/msg40340.html
> [2] http://www.spinics.net/lists/linux-rdma/msg40717.html
>
> I will wait for Leon's review comments if he has different view on architecture.
> Back in April when I met face-to-face to Leon and Haggai, Leon was in
> support to have kernel defined the rdma resources as suggested by
> Christoph and Tejun instead of IB/RDMA subsystem.
> I will wait for his comments if his views have changed with new uAPI
> taking shape.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-10-04 18:19                                         ` Parav Pandit
@ 2016-10-05  6:37                                           ` Christoph Hellwig
  2016-10-05 11:22                                             ` Leon Romanovsky
  2016-10-06 12:55                                             ` Parav Pandit
  2016-10-18 20:15                                           ` Parav Pandit
  1 sibling, 2 replies; 44+ messages in thread
From: Christoph Hellwig @ 2016-10-05  6:37 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Tejun Heo, Leon Romanovsky, Jason Gunthorpe, Christoph Hellwig,
	Matan Barak, cgroups, linux-doc, Linux Kernel Mailing List,
	linux-rdma, Li Zefan, Johannes Weiner, Doug Ledford, Liran Liss,
	Hefty, Sean, Haggai Eran, Jonathan Corbet, james.l.morris, serge,
	Or Gerlitz, Andrew Morton, linux-security-module

FYI, the patches look fine to me:

Acked-by: Christoph Hellwig <hch@lst.de>

but we're past the merge window for 4.9 now unfortunately.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
  2016-08-31  8:37 [PATCHv12 0/3] rdmacg: IB/core: rdma controller support Parav Pandit
                   ` (3 preceding siblings ...)
  2016-08-31 13:56 ` [PATCHv12 0/3] rdmacg: IB/core: rdma controller support Tejun Heo
@ 2016-10-05 11:22 ` Leon Romanovsky
  2016-10-06 12:59   ` Parav Pandit
  4 siblings, 1 reply; 44+ messages in thread
From: Leon Romanovsky @ 2016-10-05 11:22 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan,
	hannes, dledford, hch, liranl, sean.hefty, jgunthorpe, haggaie,
	corbet, james.l.morris, serge, ogerlitz, matanb, akpm,
	linux-security-module

[-- Attachment #1: Type: text/plain, Size: 1721 bytes --]

On Wed, Aug 31, 2016 at 02:07:24PM +0530, Parav Pandit wrote:
> rdmacg: IB/core: rdma controller support
>
> Patch is generated and tested against below Doug's linux-rdma
> git tree.
>
> URL: git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma.git
> Branch: master
>
> Patchset is also compiled and tested against below Tejun's cgroup tree
> using cgroup v2 mode.
> URL: git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git
> Branch: master
>
> Overview:
> Currently user space applications can easily take away all the rdma
> device specific resources such as AH, CQ, QP, MR etc. Due to which other
> applications in other cgroup or kernel space ULPs may not even get chance
> to allocate any rdma resources. This results into service unavailibility.
>
> RDMA cgroup addresses this issue by allowing resource accounting,
> limit enforcement on per cgroup, per rdma device basis.
>
> RDMA uverbs layer will enforce limits on well defined RDMA verb
> resources without any HCA vendor device driver involvement.
>
> RDMA uverbs layer will not do limit enforcement of HCA hw vendor
> specific resources. Instead rdma cgroup provides set of APIs
> through which vendor specific drivers can do resource accounting
> by making use of rdma cgroup.

Hi Parav,
I want to propose an extension to the RDMA cgroup which can be done as
follow-up patches.

Let's add new global type, which will control whole HCA (for example in percentages). It will
allow natively define new objects without need to introduce them to the user.

This HCA share will be overwritten by specific UVERBS types which you
already defined.

What do you think?

Except this proposal,
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>

Thanks.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-10-05  6:37                                           ` Christoph Hellwig
@ 2016-10-05 11:22                                             ` Leon Romanovsky
  2016-10-05 15:36                                               ` Tejun Heo
  2016-10-06 12:55                                             ` Parav Pandit
  1 sibling, 1 reply; 44+ messages in thread
From: Leon Romanovsky @ 2016-10-05 11:22 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Parav Pandit, Tejun Heo, Jason Gunthorpe, Matan Barak, cgroups,
	linux-doc, Linux Kernel Mailing List, linux-rdma, Li Zefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

[-- Attachment #1: Type: text/plain, Size: 248 bytes --]

On Wed, Oct 05, 2016 at 08:37:35AM +0200, Christoph Hellwig wrote:
> FYI, the patches look fine to me:
>
> Acked-by: Christoph Hellwig <hch@lst.de>
>
> but we're past the merge window for 4.9 now unfortunately.

IMHO, it still can make it.

Thanks

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-10-05 11:22                                             ` Leon Romanovsky
@ 2016-10-05 15:36                                               ` Tejun Heo
  0 siblings, 0 replies; 44+ messages in thread
From: Tejun Heo @ 2016-10-05 15:36 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Christoph Hellwig, Parav Pandit, Jason Gunthorpe, Matan Barak,
	cgroups, linux-doc, Linux Kernel Mailing List, linux-rdma,
	Li Zefan, Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

Hello,

On Wed, Oct 05, 2016 at 02:22:57PM +0300, Leon Romanovsky wrote:
> On Wed, Oct 05, 2016 at 08:37:35AM +0200, Christoph Hellwig wrote:
> > FYI, the patches look fine to me:
> >
> > Acked-by: Christoph Hellwig <hch@lst.de>
> >
> > but we're past the merge window for 4.9 now unfortunately.
> 
> IMHO, it still can make it.

Most likely, we only have three / four days till rc1 opens, I think
it's too late.  Let's target the next one.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-10-05  6:37                                           ` Christoph Hellwig
  2016-10-05 11:22                                             ` Leon Romanovsky
@ 2016-10-06 12:55                                             ` Parav Pandit
  1 sibling, 0 replies; 44+ messages in thread
From: Parav Pandit @ 2016-10-06 12:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Tejun Heo, Leon Romanovsky, Jason Gunthorpe, Matan Barak,
	cgroups, linux-doc, Linux Kernel Mailing List, linux-rdma,
	Li Zefan, Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

Hi Christoph,

On Wed, Oct 5, 2016 at 12:07 PM, Christoph Hellwig <hch@lst.de> wrote:
> FYI, the patches look fine to me:
>
> Acked-by: Christoph Hellwig <hch@lst.de>
>
Thanks a lot for review.

Parav

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
  2016-10-05 11:22 ` Leon Romanovsky
@ 2016-10-06 12:59   ` Parav Pandit
  0 siblings, 0 replies; 44+ messages in thread
From: Parav Pandit @ 2016-10-06 12:59 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: cgroups, linux-doc, Linux Kernel Mailing List, linux-rdma,
	Tejun Heo, Li Zefan, Johannes Weiner, Doug Ledford,
	Christoph Hellwig, Liran Liss, Hefty, Sean, Jason Gunthorpe,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Matan Barak, Andrew Morton, linux-security-module

Hi Leon,

On Wed, Oct 5, 2016 at 4:52 PM, Leon Romanovsky <leon@kernel.org> wrote:
> On Wed, Aug 31, 2016 at 02:07:24PM +0530, Parav Pandit wrote:
>> rdmacg: IB/core: rdma controller support
>>
>> Patch is generated and tested against below Doug's linux-rdma
>> git tree.
>>
>> URL: git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma.git
>> Branch: master
>>
>> Patchset is also compiled and tested against below Tejun's cgroup tree
>> using cgroup v2 mode.
>> URL: git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git
>> Branch: master
>>
>> Overview:
>> Currently user space applications can easily take away all the rdma
>> device specific resources such as AH, CQ, QP, MR etc. Due to which other
>> applications in other cgroup or kernel space ULPs may not even get chance
>> to allocate any rdma resources. This results into service unavailibility.
>>
>> RDMA cgroup addresses this issue by allowing resource accounting,
>> limit enforcement on per cgroup, per rdma device basis.
>>
>> RDMA uverbs layer will enforce limits on well defined RDMA verb
>> resources without any HCA vendor device driver involvement.
>>
>> RDMA uverbs layer will not do limit enforcement of HCA hw vendor
>> specific resources. Instead rdma cgroup provides set of APIs
>> through which vendor specific drivers can do resource accounting
>> by making use of rdma cgroup.
>
> Hi Parav,
> I want to propose an extension to the RDMA cgroup which can be done as
> follow-up patches.
>

To bring logical end to this feature/patch discussion and to progress
towards merging it, Lets discuss this new feature in follow-on email
right after this email between these two mailing list and I will drop
linux kernel and docs mailing list.

> Let's add new global type, which will control whole HCA (for example in percentages). It will
> allow natively define new objects without need to introduce them to the user.
>
> This HCA share will be overwritten by specific UVERBS types which you
> already defined.
>
> What do you think?
>
> Except this proposal,
> Reviewed-by: Leon Romanovsky <leonro@mellanox.com>

Thanks a lot for review.

Parav

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-10-04 18:19                                         ` Parav Pandit
  2016-10-05  6:37                                           ` Christoph Hellwig
@ 2016-10-18 20:15                                           ` Parav Pandit
  1 sibling, 0 replies; 44+ messages in thread
From: Parav Pandit @ 2016-10-18 20:15 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Leon Romanovsky, Jason Gunthorpe, Christoph Hellwig, Matan Barak,
	cgroups, linux-doc, Linux Kernel Mailing List, linux-rdma,
	Li Zefan, Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

Hi Doug,

Leon has finished review as well in [7].
Christoph Acked too in [8].

Can you please advise whether
(1) I should rebase and resend PatchV12?
(2) If so for which branch - master/4.9 or?

Tejun and Christoph mentioned that it might be late for 4.9.
Can we atleast merge to linux-rdma tree, so that more features/changes
can be done on top of it?

How can we avoid merge conflict to Linus since this patchset is
applicable to two git trees. (cgroup and linux-rdma).
I was thinking to push through linux-rdma as it is currently going
through more changes, so resolving merge conflict would be simpler if
that happens.

Please provide the direction.

[7] https://lkml.org/lkml/2016/10/5/134
[8] https://lkml.org/lkml/2016/10/5/30

Regards,
Parav Pandit

On Tue, Oct 4, 2016 at 11:49 PM, Parav Pandit <pandit.parav@gmail.com> wrote:
> Hi Doug,
>
> I am still waiting for Leon to provide his comments if any on rdma cgroup.
> From other email context, he was on vacation last week.
> While we wait for his comments, I wanted to know your view of this
> patchset in 4.9 merge window.
>
> To summarize the discussion that happened in two threads.
>
> [1] Ack by Tejun, asking for review from rdma list
> [2] quick review by Christoph on patch-v11 (patch 12 has only typo corrections)
> [3] Christoph's ack on architecture of rdma cgroup and fitting it with ABI
> [4] My response on Matan's query on RSS indirection table
> [5] Response from Intel on their driver support for Matan's query
> [6] Christoph's point on architecture, which we are following in new
> ABI and current ABI
>
> I have reviewed recent patch [7] from Matan where I see IB verbs
> objects are still handled through common path as suggested by
> Christoph.
>
> I do not see any issues with rdma cgroup patchset other than it requires rebase.
> Am I missing something?
> Can you please help me - What would be required to merge it to 4.9?
>
> [1] https://lkml.org/lkml/2016/8/31/494
> [2] https://lkml.org/lkml/2016/8/25/146
> [3] https://lkml.org/lkml/2016/9/10/175
> [4] https://lkml.org/lkml/2016/9/14/221
> [5] https://lkml.org/lkml/2016/9/19/571
> [6] http://www.spinics.net/lists/linux-rdma/msg40337.html
> [7] email subject: [RFC ABI V4 0/7] SG-based RDMA ABI Proposal
>
> Regards,
> Parav Pandit
>
> On Wed, Sep 21, 2016 at 9:32 PM, Parav Pandit <pandit.parav@gmail.com> wrote:
>> Hi Tejun,
>>
>> On Wed, Sep 21, 2016 at 7:56 PM, Tejun Heo <tj@kernel.org> wrote:
>>> Hello, Parav.
>>>
>>> On Wed, Sep 21, 2016 at 10:13:38AM +0530, Parav Pandit wrote:
>>>> We have completed review from Tejun, Christoph.
>>>> HFI driver folks also provided feedback for Intel drivers.
>>>> Matan's also doesn't have any more comments.
>>>>
>>>> If possible, if you can also review, it will be helpful.
>>>>
>>>> I have some more changes unrelated to cgroup in same files in both the git tree.
>>>> Pushing them now either results into merge conflict later on for
>>>> Doug/Tejun, or requires rebase and resending patch.
>>>> If you can review, we can avoid such rework.
>>>
>>> My impression of the thread was that there doesn't seem to be enough
>>> of consensus around how rdma resources should be defined.  Is that
>>> part agreed upon now?
>>>
>>
>> We ended up discussing few points on different thread [1].
>>
>> There was confusion on how some non-rdma/non-IB drivers would work
>> with rdma cgroup from Matan.
>> Christoph explained how they don't fit in the rdma subsystem and
>> therefore its not prime target to addess.
>> Intel driver maintainer Denny also acknowledged same on [2].
>> IB compliant drivers of Intel support rdma cgroup as explained in [2].
>> With that usnic and Intel psm drivers falls out of rdma cgroup support
>> as they don't fit very well in the verbs definition.
>>
>> [1] https://www.spinics.net/lists/linux-rdma/msg40340.html
>> [2] http://www.spinics.net/lists/linux-rdma/msg40717.html
>>
>> I will wait for Leon's review comments if he has different view on architecture.
>> Back in April when I met face-to-face to Leon and Haggai, Leon was in
>> support to have kernel defined the rdma resources as suggested by
>> Christoph and Tejun instead of IB/RDMA subsystem.
>> I will wait for his comments if his views have changed with new uAPI
>> taking shape.

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2016-10-18 20:15 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-31  8:37 [PATCHv12 0/3] rdmacg: IB/core: rdma controller support Parav Pandit
2016-08-31  8:37 ` [PATCHv12 1/3] rdmacg: Added rdma cgroup controller Parav Pandit
2016-08-31  9:38   ` Leon Romanovsky
2016-09-07 15:07     ` Parav Pandit
2016-09-08  6:12       ` Leon Romanovsky
2016-09-08 10:20         ` Parav Pandit
2016-08-31 15:07   ` Matan Barak
2016-08-31 21:16     ` Tejun Heo
2016-09-01  7:25       ` Matan Barak
2016-09-01  8:44         ` Christoph Hellwig
2016-09-07  7:55           ` Parav Pandit
2016-09-07  8:51             ` Matan Barak
2016-09-07 14:54               ` Parav Pandit
2016-09-10 16:14               ` Christoph Hellwig
2016-09-10 17:01                 ` Jason Gunthorpe
2016-09-11  8:07                   ` Matan Barak
2016-09-11 13:34                   ` Christoph Hellwig
2016-09-11 14:35                     ` Leon Romanovsky
2016-09-11 17:14                       ` Jason Gunthorpe
2016-09-11 17:24                         ` Christoph Hellwig
2016-09-11 17:52                           ` Jason Gunthorpe
2016-09-12  5:07                             ` Leon Romanovsky
2016-09-14  7:06                               ` Parav Pandit
2016-09-14  8:14                                 ` Matan Barak
2016-09-14  9:19                                   ` Parav Pandit
2016-09-15 18:56                                 ` Leon Romanovsky
2016-09-21  4:43                                   ` Parav Pandit
2016-09-21 14:26                                     ` Tejun Heo
2016-09-21 16:02                                       ` Parav Pandit
2016-10-04 18:19                                         ` Parav Pandit
2016-10-05  6:37                                           ` Christoph Hellwig
2016-10-05 11:22                                             ` Leon Romanovsky
2016-10-05 15:36                                               ` Tejun Heo
2016-10-06 12:55                                             ` Parav Pandit
2016-10-18 20:15                                           ` Parav Pandit
2016-09-19 13:10                                 ` Dalessandro, Dennis
2016-09-19 17:00                                   ` Parav Pandit
2016-09-10 16:12             ` Christoph Hellwig
2016-09-11  7:40               ` Matan Barak
2016-08-31  8:37 ` [PATCHv12 2/3] IB/core: added support to use " Parav Pandit
2016-08-31  8:37 ` [PATCHv12 3/3] rdmacg: Added documentation for rdmacg Parav Pandit
2016-08-31 13:56 ` [PATCHv12 0/3] rdmacg: IB/core: rdma controller support Tejun Heo
2016-10-05 11:22 ` Leon Romanovsky
2016-10-06 12:59   ` Parav Pandit

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).