* [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
@ 2016-08-31  8:37 Parav Pandit
From: Parav Pandit @ 2016-08-31  8:37 UTC (permalink / raw)
  To: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan,
	hannes, dledford, hch, liranl, sean.hefty, jgunthorpe, haggaie
  Cc: corbet, james.l.morris, serge, ogerlitz, matanb, akpm,
	linux-security-module, pandit.parav

rdmacg: IB/core: rdma controller support

This patchset is generated and tested against Doug's linux-rdma
git tree below.

URL: git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma.git
Branch: master

The patchset is also compiled and tested against Tejun's cgroup tree
below, using cgroup v2 mode.
URL: git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git
Branch: master

Overview:
Currently, user space applications can easily consume all of the RDMA
device-specific resources such as AH, CQ, QP, MR, etc. As a result, other
applications in other cgroups or kernel space ULPs may not even get a chance
to allocate any RDMA resources. This results in service unavailability.

The RDMA cgroup addresses this issue by providing resource accounting and
limit enforcement on a per-cgroup, per-RDMA-device basis.

The RDMA uverbs layer will enforce limits on well-defined RDMA verb
resources without any HCA vendor device driver involvement.

The RDMA uverbs layer will not enforce limits on HCA hw vendor-specific
resources. Instead, the rdma cgroup provides a set of APIs
through which vendor-specific drivers can do resource accounting
by making use of the rdma cgroup.

Resource limit enforcement is hierarchical.

When a process is migrated with active RDMA resources, each resource it
already holds remains charged to its original cgroup and is uncharged from
that cgroup when the resource is released. Resources allocated after the
migration are charged to the process's new cgroup.
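
As an illustration only, the hierarchical charging and the ownership rule
described above can be sketched in plain userspace C. This is not the code
from this patchset: struct demo_cgroup, demo_try_charge() and
demo_uncharge() are hypothetical names, a single resource type is assumed,
and locking is left out.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a cgroup; not the kernel structure. */
struct demo_cgroup {
	struct demo_cgroup *parent;	/* NULL for the root */
	int max;			/* INT_MAX means "max" (no limit) */
	int usage;			/* resources currently charged */
};

/* Undo one charge at every level from cg up to, but not including, stop. */
static void demo_uncharge(struct demo_cgroup *cg, struct demo_cgroup *stop)
{
	struct demo_cgroup *p;

	for (p = cg; p != stop; p = p->parent)
		p->usage--;
}

/*
 * Charge one resource to cg and to every ancestor. On success, *owner
 * records the cgroup that was charged, so that the caller can uncharge
 * that same cgroup later even if the task has moved elsewhere since.
 * Returns 0 on success, -1 if any level is already at its limit.
 */
static int demo_try_charge(struct demo_cgroup *cg, struct demo_cgroup **owner)
{
	struct demo_cgroup *p;

	for (p = cg; p; p = p->parent) {
		if (p->usage >= p->max) {
			/* roll back the levels charged so far */
			demo_uncharge(cg, p);
			return -1;
		}
		p->usage++;
	}

	*owner = cg;
	return 0;
}
```

In this sketch a caller would store the returned owner next to the resource
and call demo_uncharge(owner, NULL) on release, which is why a later
migration of the task does not change which cgroup gets uncharged.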

Changes from v11:
  * (To address comments from Tejun)
   1. Added information in Documentation about nested-keyed file
  * (To address comments from Rami Rosen)
   1. Corrected typo errors in Documentation
  * (To address comments from Leon Romanovsky)
   1. Changed cgroup.c copyright to match with other files of the IB stack
      which is dual license GPLv2 + BSD

Changes from v10:
  * (To address comments from Tejun, Christoph)
   1. Removed unused rpool_list_lock from rdma_cgroup structure.
   2. Moved rdma resource definition to rdma cgroup instead of IB stack
   3. Added prefix rdmacg to static instances
   4. Simplified locking with single mutex for all operations
   5. Followed the approach of atomically allocating the object and
      charging the resource in the hierarchy
   6. Code simplification due to single lock
   7. Using for_each_set_bit API for bit operation
   8. Renamed list heads as Objects instead of _head
   9. Renamed list entries as _node instead of _list.
  10. Made usage_num 64-bit to avoid overflow and to avoid
      additional code to track the nonzero number of usage counts.
  * (To address comments from Doug)
   1. Added copyright and GPLv2 license

Changes from v9:
  * (To address comments from Tejun)
   1. Included clear documentation of resources.
   2. Fixed issue of race condition of process migration during
      charging stage.
   3. Fixed comments and code to adhere to CodingStyle.
   4. Simplified and removed support to charge/uncharge multiple
      resource.
   5. Replaced refcnt with usage_num, which tracks how many
      resources are in use, to trigger freeing the object when none remain.
   6. Simplified locking scheme to use single spin lock for whole
      subsystem.

Changes from v8:
 * Fixed compilation error.
 * Fixed warning reported by checkpatch script.

Changes from v7:
 * (To address comments from Haggai)
   1. Removed max_limit from query_limit function as it is
      unnecessary.
   2. Kept existing printk calls as they are instead of replacing them all
      with pr_warn; only the newly added printk uses pr_warn.

Changes from v6:
 * (To address comments from Haggai)
   1. Made functions void wherever necessary.
   2. Code cleanup: corrected a few spelling mistakes
      in comments and corrected comments to reflect the code.
   3. Removed max_count parameter from query_limit as it's not
      necessary.
   4. Fixed printk to pr_warn.
   5. Removed dependency on pd, instead relying on ib_dev.
   6. Added more documentation to reflect that IB stack honors
      configured limit during query_device operation.
   7. Added pr_warn and avoided system crash in case of
      IB stack or rdma cgroup bug.
 * (To address comments from Leon)
   1. Removed #ifdef CONFIG_CGROUP_RDMA from .c files and added
      necessary dummy functions in header file.
   2. Removed unwanted forward declaration.
 * Fixed uncharging from the rdma controller to happen after the resource
   is released by the verb layer, instead of uncharging first. This ensures
   that uncharging doesn't complete while the resource is still allocated.
 
Changes from v5:
 * (To address comments from Tejun)
   1. Removed two types of resource pool, made it a single type (as Tejun
      described in past comment)
   2. Removed match tokens and have array definition like "qp", "mr",
      "cq" etc.
   3. Wrote small parser and avoided match_token API as that won't work
      due to different array definitions
   4. Removed one-off remove API to unconfigure cgroup, instead all
      resource should be set to max.
   5. Removed resource pool type (user/default), instead having
      max_num_cnt, when ref_cnt drops to zero and
      max_num_cnt = total_rescource_cnt, pool is freed.
   6. Resource definition ownership is now only with the IB stack in a
      single header file, no longer in each low level driver.
      This goes through the IB maintainer's and other reviewers' eyes.
      This continues to give flexibility to not force a kernel upgrade for
      a few enum additions for a new resource type.
   7. Wherever possible pool lock is pushed out, except for hierarchical
      charging/uncharging points, as it is not possible to do so because the
      iterative process involves blocking allocations of rpool. Going
      more levels up to release locks doesn't make any sense either.
      This is anyway the slow path where rpool is not allocated. Except for
      the typical first resource allocation, this is a less travelled path.
   8. Avoided %d manipulation due to removal of match_token and replaced
      it with seq_putc and friends.
 * Other minor cleanups.
 * Fixed rdmacg_register_device to return an error in case the IB stack
   tries to register more than 64 resources.
 * Fixed not allowing negative value on resource setting.
 * Fixed cleaning up resource pools during device removal.
 * Simplified and renamed table length field to use ARRAY_SIZE macro.
 * Updated documentation to reflect single resource pool and shorter
   file names.

Changes from v4:
 * Fixed compilation errors for lockdep_assert_held reported by kbuild
   test robot
 * Fixed compilation warning reported by coccinelle for extra
   semicolon.
 * Fixed compilation error for inclusion of linux/parser.h which
   cannot be included in any header file, as that triggers multiple
   inclusion error. parser.h is included in C files which intend to
   use it.
 * Removed unused header file inclusion in cgroup_rdma.c

Changes from v3:
 * (To address comments from Tejun)
   1. Renamed cg_resource to rdmacg_resource
   2. Merged dealloc_cg_rpool and _dealloc_cg_rpool to single function
   3. Renamed _find_cg_rpool to find_cg_rpool_locked()
   5. Removed RDMACG_MAX_RESOURCE_INDEX limitation
   6. Fixed few alignments.
   7. Improved description for RDMA cgroup configuration menu
   8. Renamed cg_list_lock to rpool_list_lock to reflect the lock
      is for rpools.
   9. Renamed _get_cg_rpool to find_cg_rpool.
   10. Made creator an int variable instead of atomic, as it's not
      required to be atomic.
 * Fixed freeing right rpool during query_limit error path
 * Added copyright for cgroup.c
 * Removed including parser.h from cgroup.c as it's included by
   cgroup_rdma.h
 * Reduced try_charge functions to single function and removed duplicate
   comments.

Changes from v2:
 * Fixed compilation error reported by 0-DAY kernel test infrastructure
   for m68k architecture where CONFIG_CGROUP is also not defined.
 * Fixed comment in patch to refer to legacy mode of cgroup, changed to 
   refer to v1 and v2 version.
 * Added more information in commit log for rdma controller patch.

Changes from v1:
 * (To address comments from Tejun)
   a. reduced 3 patches to single patch
   b. removed resource word from the cgroup configuration files
   c. changed cgroup configuration file names to match other cgroups
   d. removed .list file and merged functionality with .max file
 * Based on comment to merge to single patch for rdma controller;
   IB/core patches are reduced to single patch.
 * Removed pid cgroup map and simplified design -
   Charge/Uncharge caller stack keeps track of the rdmacg for
   given resource. This removes the need to maintain and perform
   hash lookup. This also allows slightly more accurate resource
   charging/uncharging when a process is moved from one cgroup to another
   with active resources and continues to allocate more.
 * Critical fix: Removed rdma cgroup's dependency on the kernel module
   header files to avoid crashes when modules are upgraded without kernel
   upgrade, which is very common due to high amount of changes in IB stack
   and it is also shipped as individual kernel modules.
 * uobject extended to keep track of the owner rdma cgroup, so that same
   rdmacg can be used while uncharging.
 * Added support functions to hide details of rdmacg device in uverbs
   modules for cases of cgroup enabled/disabled at compile time. This
   avoids multiple ifdefs for every API in uverbs layer.
 * Removed failure counters in first patch, which will be added once
   initial feature is merged.
 * Fixed stale rpool access which is getting freed, while doing
   configuration to rdma.verb.max file.
 * Fixed rpool resource leak while querying max, current values.

Changes from v0:
(To address comments from Haggai, Doug, Liran, Tejun, Sean, Jason)
 * Redesigned to support per device per cgroup limit settings by bringing
   concept of resource pool.
 * Redesigned to let IB stack define the resources instead of rdma
   controller using resource template.
 * Redesigned to support hw vendor specific limits setting
   (optional to drivers).
 * Created new rdma controller instead of piggyback on device cgroup.
 * Fixed race conditions for multiple tasks sharing rdma resources.
 * Removed dependency on the task_struct.


Parav Pandit (3):
  rdmacg: Added rdma cgroup controller
  IB/core: added support to use rdma cgroup controller
  rdmacg: Added documentation for rdmacg

 Documentation/cgroup-v1/rdma.txt    | 117 +++++++
 Documentation/cgroup-v2.txt         |  45 +++
 drivers/infiniband/core/Makefile    |   1 +
 drivers/infiniband/core/cgroup.c    |  93 +++++
 drivers/infiniband/core/core_priv.h |  41 +++
 drivers/infiniband/core/device.c    |  10 +
 include/linux/cgroup_rdma.h         |  66 ++++
 include/linux/cgroup_subsys.h       |   4 +
 init/Kconfig                        |  10 +
 kernel/Makefile                     |   1 +
 kernel/cgroup_rdma.c                | 664 ++++++++++++++++++++++++++++++++++++
 11 files changed, 1052 insertions(+)
 create mode 100644 Documentation/cgroup-v1/rdma.txt
 create mode 100644 drivers/infiniband/core/cgroup.c
 create mode 100644 include/linux/cgroup_rdma.h
 create mode 100644 kernel/cgroup_rdma.c

-- 
1.8.3.1


* [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
@ 2016-08-31  8:37 ` Parav Pandit
From: Parav Pandit @ 2016-08-31  8:37 UTC (permalink / raw)
  To: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan,
	hannes, dledford, hch, liranl, sean.hefty, jgunthorpe, haggaie
  Cc: corbet, james.l.morris, serge, ogerlitz, matanb, akpm,
	linux-security-module, pandit.parav

Added rdma cgroup controller that does accounting and limit enforcement
on rdma/IB verbs and hw resources.

Added rdma cgroup header file which defines its APIs to perform
charging/uncharging functionality. It also defines APIs for the RDMA/IB
stack for device registration. Devices which are registered will
participate in controller functions of accounting and limit
enforcement. It defines the rdmacg_device structure to bind the IB stack
and the RDMA cgroup controller.

RDMA resources are tracked using resource pool. Resource pool is per
device, per cgroup entity which allows setting up accounting limits
on per device basis.

Currently resources are defined by the RDMA cgroup.

Resource pool is created/destroyed dynamically whenever
charging/uncharging occurs respectively and whenever user
configuration is done. It's a tradeoff of memory vs. a little more code
space that creates the resource pool object whenever necessary, instead of
creating them during cgroup creation and device registration time.
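
The pool lifetime described above can be sketched in userspace C as follows.
This is only a rough model under stated assumptions, not the patch code:
the demo_* names are hypothetical, a single slot stands in for the
per-cgroup, per-device list lookup, only two resource types exist, and
locking is omitted.

```c
#include <limits.h>
#include <stdlib.h>

#define DEMO_RESOURCE_MAX 2	/* hypothetical: two resource types only */

struct demo_pool {
	int max[DEMO_RESOURCE_MAX];
	int usage[DEMO_RESOURCE_MAX];
	unsigned long long usage_sum;	/* charges currently held by the pool */
	int num_max_cnt;		/* limits that are still set to "max" */
};

/* One slot standing in for the (cgroup, device) pool lookup. */
struct demo_slot {
	struct demo_pool *pool;
};

/* Return the existing pool, or create one with every limit set to "max". */
static struct demo_pool *demo_get_pool(struct demo_slot *slot)
{
	int i;

	if (slot->pool)
		return slot->pool;

	slot->pool = calloc(1, sizeof(*slot->pool));
	if (!slot->pool)
		return NULL;

	for (i = 0; i < DEMO_RESOURCE_MAX; i++)
		slot->pool->max[i] = INT_MAX;
	slot->pool->num_max_cnt = DEMO_RESOURCE_MAX;
	return slot->pool;
}

/* Free the pool once nothing is charged and no limit is configured. */
static void demo_put_pool(struct demo_slot *slot)
{
	struct demo_pool *pool = slot->pool;

	if (pool && pool->usage_sum == 0 &&
	    pool->num_max_cnt == DEMO_RESOURCE_MAX) {
		free(pool);
		slot->pool = NULL;
	}
}

/* User configuration: creates the pool on demand, may also release it. */
static void demo_set_limit(struct demo_slot *slot, int index, int new_max)
{
	struct demo_pool *pool = demo_get_pool(slot);

	if (!pool)
		return;

	if (new_max == INT_MAX && pool->max[index] != INT_MAX)
		pool->num_max_cnt++;
	else if (new_max != INT_MAX && pool->max[index] == INT_MAX)
		pool->num_max_cnt--;
	pool->max[index] = new_max;

	demo_put_pool(slot);
}

/* Charge one resource; creates the pool on first use. */
static int demo_pool_charge(struct demo_slot *slot, int index)
{
	struct demo_pool *pool = demo_get_pool(slot);

	if (!pool)
		return -1;
	if (pool->usage[index] >= pool->max[index]) {
		demo_put_pool(slot);
		return -1;
	}
	pool->usage[index]++;
	pool->usage_sum++;
	return 0;
}

/* Uncharge one resource; the last uncharge may free the pool. */
static void demo_pool_uncharge(struct demo_slot *slot, int index)
{
	struct demo_pool *pool = slot->pool;

	if (!pool)
		return;
	pool->usage[index]--;
	pool->usage_sum--;
	demo_put_pool(slot);
}
```

The point of the sketch is the release condition: a pool survives as long as
it either holds a charge or carries at least one configured (non-max) limit.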

Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
---
 include/linux/cgroup_rdma.h   |  66 +++++
 include/linux/cgroup_subsys.h |   4 +
 init/Kconfig                  |  10 +
 kernel/Makefile               |   1 +
 kernel/cgroup_rdma.c          | 664 ++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 745 insertions(+)
 create mode 100644 include/linux/cgroup_rdma.h
 create mode 100644 kernel/cgroup_rdma.c

diff --git a/include/linux/cgroup_rdma.h b/include/linux/cgroup_rdma.h
new file mode 100644
index 0000000..6710e28
--- /dev/null
+++ b/include/linux/cgroup_rdma.h
@@ -0,0 +1,66 @@
+/*
+ * Copyright (C) 2016 Parav Pandit <pandit.parav@gmail.com>
+ *
+ * This file is subject to the terms and conditions of version 2 of the GNU
+ * General Public License. See the file COPYING in the main directory of the
+ * Linux distribution for more details.
+ */
+
+#ifndef _CGROUP_RDMA_H
+#define _CGROUP_RDMA_H
+
+#include <linux/cgroup.h>
+
+enum rdmacg_resource_type {
+	RDMACG_VERB_RESOURCE_UCTX,
+	RDMACG_VERB_RESOURCE_AH,
+	RDMACG_VERB_RESOURCE_PD,
+	RDMACG_VERB_RESOURCE_CQ,
+	RDMACG_VERB_RESOURCE_MR,
+	RDMACG_VERB_RESOURCE_MW,
+	RDMACG_VERB_RESOURCE_SRQ,
+	RDMACG_VERB_RESOURCE_QP,
+	RDMACG_VERB_RESOURCE_FLOW,
+	/*
+	 * add any hw specific resource here as RDMA_HW_RESOURCE_NAME
+	 */
+	RDMACG_RESOURCE_MAX,
+};
+
+#ifdef CONFIG_CGROUP_RDMA
+
+struct rdma_cgroup {
+	struct cgroup_subsys_state	css;
+
+	/*
+	 * head to keep track of all resource pools
+	 * that belong to this cgroup.
+	 */
+	struct list_head		rpools;
+};
+
+struct rdmacg_device {
+	struct list_head	dev_node;
+	struct list_head	rpools;
+	char			*name;
+};
+
+/*
+ * APIs for RDMA/IB stack to publish when a device wants to
+ * participate in resource accounting
+ */
+int rdmacg_register_device(struct rdmacg_device *device);
+void rdmacg_unregister_device(struct rdmacg_device *device);
+
+/* APIs for RDMA/IB stack to charge/uncharge pool specific resources */
+int rdmacg_try_charge(struct rdma_cgroup **rdmacg,
+		      struct rdmacg_device *device,
+		      enum rdmacg_resource_type index);
+void rdmacg_uncharge(struct rdma_cgroup *cg,
+		     struct rdmacg_device *device,
+		     enum rdmacg_resource_type index);
+void rdmacg_query_limit(struct rdmacg_device *device,
+			int *limits);
+
+#endif	/* CONFIG_CGROUP_RDMA */
+#endif	/* _CGROUP_RDMA_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 0df0336a..d0e597c 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -56,6 +56,10 @@ SUBSYS(hugetlb)
 SUBSYS(pids)
 #endif
 
+#if IS_ENABLED(CONFIG_CGROUP_RDMA)
+SUBSYS(rdma)
+#endif
+
 /*
  * The following subsystems are not supported on the default hierarchy.
  */
diff --git a/init/Kconfig b/init/Kconfig
index cac3f09..c7dc64b 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1080,6 +1080,16 @@ config CGROUP_PIDS
 	  since the PIDs limit only affects a process's ability to fork, not to
 	  attach to a cgroup.
 
+config CGROUP_RDMA
+	bool "RDMA controller"
+	help
+	  Provides enforcement of RDMA resources defined by IB stack.
+	  It is fairly easy for consumers to exhaust RDMA resources, which
+	  can result in resource unavailability to other consumers.
+	  RDMA controller is designed to stop this from happening.
+	  Attaching processes with active RDMA resources to the cgroup
+	  hierarchy is allowed even if it can cross the hierarchy's limit.
+
 config CGROUP_FREEZER
 	bool "Freezer controller"
 	help
diff --git a/kernel/Makefile b/kernel/Makefile
index e2ec54e..d2b76d0 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -67,6 +67,7 @@ obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup.o
 obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
 obj-$(CONFIG_CGROUP_PIDS) += cgroup_pids.o
+obj-$(CONFIG_CGROUP_RDMA) += cgroup_rdma.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_UTS_NS) += utsname.o
 obj-$(CONFIG_USER_NS) += user_namespace.o
diff --git a/kernel/cgroup_rdma.c b/kernel/cgroup_rdma.c
new file mode 100644
index 0000000..6ab9bd9
--- /dev/null
+++ b/kernel/cgroup_rdma.c
@@ -0,0 +1,664 @@
+/*
+ * RDMA resource limiting controller for cgroups.
+ *
+ * Used to allow a cgroup hierarchy to stop processes from consuming
+ * additional RDMA resources after a certain limit is reached.
+ *
+ * Copyright (C) 2016 Parav Pandit <pandit.parav@gmail.com>
+ *
+ * This file is subject to the terms and conditions of version 2 of the GNU
+ * General Public License. See the file COPYING in the main directory of the
+ * Linux distribution for more details.
+ */
+
+#include <linux/bitops.h>
+#include <linux/slab.h>
+#include <linux/seq_file.h>
+#include <linux/cgroup.h>
+#include <linux/parser.h>
+#include <linux/cgroup_rdma.h>
+
+#define RDMACG_MAX_STR "max"
+
+/*
+ * Protects list of resource pools maintained on per cgroup basis
+ * and rdma device list.
+ */
+static DEFINE_MUTEX(rdmacg_mutex);
+static LIST_HEAD(rdmacg_devices);
+
+enum rdmacg_file_type {
+	RDMACG_RESOURCE_TYPE_MAX,
+	RDMACG_RESOURCE_TYPE_STAT,
+};
+
+/*
+ * resource table definition as to be seen by the user.
+ * Need to add entries to it when more resources are
+ * added/defined at IB verb/core layer.
+ */
+static char const *rdmacg_resource_names[] = {
+	[RDMACG_VERB_RESOURCE_UCTX]	= "uctx",
+	[RDMACG_VERB_RESOURCE_AH]	= "ah",
+	[RDMACG_VERB_RESOURCE_PD]	= "pd",
+	[RDMACG_VERB_RESOURCE_CQ]	= "cq",
+	[RDMACG_VERB_RESOURCE_MR]	= "mr",
+	[RDMACG_VERB_RESOURCE_MW]	= "mw",
+	[RDMACG_VERB_RESOURCE_SRQ]	= "srq",
+	[RDMACG_VERB_RESOURCE_QP]	= "qp",
+	[RDMACG_VERB_RESOURCE_FLOW]	= "flow",
+};
+
+/* resource tracker for each resource of rdma cgroup */
+struct rdmacg_resource {
+	int max;
+	int usage;
+};
+
+/*
+ * resource pool object which represents per cgroup, per device
+ * resources. There are multiple instances of this object per cgroup,
+ * therefore it cannot be embedded within rdma_cgroup structure. It
+ * is maintained as list.
+ */
+struct rdmacg_resource_pool {
+	struct rdmacg_device	*device;
+	struct rdmacg_resource	resources[RDMACG_RESOURCE_MAX];
+
+	struct list_head	cg_node;
+	struct list_head	dev_node;
+
+	/* count of resources currently charged to this pool */
+	u64			usage_sum;
+	/* number of resources whose limit is set to max */
+	int			num_max_cnt;
+};
+
+static struct rdma_cgroup *css_rdmacg(struct cgroup_subsys_state *css)
+{
+	return container_of(css, struct rdma_cgroup, css);
+}
+
+static struct rdma_cgroup *parent_rdmacg(struct rdma_cgroup *cg)
+{
+	return css_rdmacg(cg->css.parent);
+}
+
+static inline struct rdma_cgroup *get_current_rdmacg(void)
+{
+	return css_rdmacg(task_get_css(current, rdma_cgrp_id));
+}
+
+static void set_resource_limit(struct rdmacg_resource_pool *rpool,
+			       int index, int new_max)
+{
+	if (new_max == S32_MAX) {
+		if (rpool->resources[index].max != S32_MAX)
+			rpool->num_max_cnt++;
+	} else {
+		if (rpool->resources[index].max == S32_MAX)
+			rpool->num_max_cnt--;
+	}
+	rpool->resources[index].max = new_max;
+}
+
+static void set_all_resource_max_limit(struct rdmacg_resource_pool *rpool)
+{
+	int i;
+
+	for (i = 0; i < RDMACG_RESOURCE_MAX; i++)
+		set_resource_limit(rpool, i, S32_MAX);
+}
+
+static void free_cg_rpool_locked(struct rdmacg_resource_pool *rpool)
+{
+	lockdep_assert_held(&rdmacg_mutex);
+
+	list_del(&rpool->cg_node);
+	list_del(&rpool->dev_node);
+	kfree(rpool);
+}
+
+static struct rdmacg_resource_pool *
+find_cg_rpool_locked(struct rdma_cgroup *cg,
+		     struct rdmacg_device *device)
+
+{
+	struct rdmacg_resource_pool *pool;
+
+	lockdep_assert_held(&rdmacg_mutex);
+
+	list_for_each_entry(pool, &cg->rpools, cg_node)
+		if (pool->device == device)
+			return pool;
+
+	return NULL;
+}
+
+static struct rdmacg_resource_pool *
+get_cg_rpool_locked(struct rdma_cgroup *cg, struct rdmacg_device *device)
+{
+	struct rdmacg_resource_pool *rpool;
+
+	rpool = find_cg_rpool_locked(cg, device);
+	if (rpool)
+		return rpool;
+
+	rpool = kzalloc(sizeof(*rpool), GFP_KERNEL);
+	if (!rpool)
+		return ERR_PTR(-ENOMEM);
+
+	rpool->device = device;
+	set_all_resource_max_limit(rpool);
+
+	INIT_LIST_HEAD(&rpool->cg_node);
+	INIT_LIST_HEAD(&rpool->dev_node);
+	list_add_tail(&rpool->cg_node, &cg->rpools);
+	list_add_tail(&rpool->dev_node, &device->rpools);
+	return rpool;
+}
+
+/**
+ * uncharge_cg_locked - uncharge resource for rdma cgroup
+ * @cg: pointer to cg to uncharge and all parents in hierarchy
+ * @device: pointer to rdmacg device
+ * @index: index of the resource to uncharge in cg (resource pool)
+ *
+ * It also frees the resource pool which was created as part of
+ * charging operation when there are no resources attached to
+ * resource pool.
+ */
+static void
+uncharge_cg_locked(struct rdma_cgroup *cg,
+		   struct rdmacg_device *device,
+		   enum rdmacg_resource_type index)
+{
+	struct rdmacg_resource_pool *rpool;
+
+	rpool = find_cg_rpool_locked(cg, device);
+
+	/*
+	 * rpool cannot be null at this stage. Let the kernel keep operating if
+	 * there is a bug in the IB stack or rdma controller, instead of
+	 * crashing the system.
+	 */
+	if (unlikely(!rpool)) {
+		pr_warn("Invalid device %p or rdma cgroup %p\n", device, cg);
+		return;
+	}
+
+	rpool->resources[index].usage--;
+
+	/*
+	 * A negative count (or overflow) is invalid,
+	 * it indicates a bug in the rdma controller.
+	 */
+	WARN_ON_ONCE(rpool->resources[index].usage < 0);
+	rpool->usage_sum--;
+	if (rpool->usage_sum == 0 &&
+	    rpool->num_max_cnt == RDMACG_RESOURCE_MAX) {
+		/*
+		 * No user of the rpool and all entries are set to max, so
+		 * safe to delete this rpool.
+		 */
+		free_cg_rpool_locked(rpool);
+	}
+}
+
+/**
+ * rdmacg_uncharge_hirerchy - hierarchically uncharge rdma resource count
+ * @device: pointer to rdmacg device
+ * @stop_cg: while traversing the hierarchy, stop uncharging when the
+ *           stop_cg cgroup is reached
+ * @index: index of the resource to uncharge in cg in given resource pool
+ */
+static void rdmacg_uncharge_hirerchy(struct rdma_cgroup *cg,
+				     struct rdmacg_device *device,
+				     struct rdma_cgroup *stop_cg,
+				     enum rdmacg_resource_type index)
+{
+	struct rdma_cgroup *p;
+
+	mutex_lock(&rdmacg_mutex);
+
+	for (p = cg; p != stop_cg; p = parent_rdmacg(p))
+		uncharge_cg_locked(p, device, index);
+
+	mutex_unlock(&rdmacg_mutex);
+
+	css_put(&cg->css);
+}
+
+/**
+ * rdmacg_uncharge - hierarchically uncharge rdma resource count
+ * @device: pointer to rdmacg device
+ * @index: index of the resource to uncharge in cgroup in given resource pool
+ */
+void rdmacg_uncharge(struct rdma_cgroup *cg,
+		     struct rdmacg_device *device,
+		     enum rdmacg_resource_type index)
+{
+	if (index >= RDMACG_RESOURCE_MAX)
+		return;
+
+	rdmacg_uncharge_hirerchy(cg, device, NULL, index);
+}
+EXPORT_SYMBOL(rdmacg_uncharge);
+
+/**
+ * rdmacg_try_charge - hierarchically try to charge the rdma resource
+ * @rdmacg: pointer to rdma cgroup which will own this resource
+ * @device: pointer to rdmacg device
+ * @index: index of the resource to charge in cgroup (resource pool)
+ *
+ * This function follows charging resource in hierarchical way.
+ * It will fail if the charge would cause the new value to exceed the
+ * hierarchical limit.
+ * Returns 0 if the charge succeeded, otherwise -EAGAIN, -ENOMEM or -EINVAL.
+ * On success, *rdmacg points to the rdma cgroup that owns this resource.
+ *
+ * Charger needs to account resources on two criteria.
+ * (a) per cgroup & (b) per device resource usage.
+ * Per cgroup resource usage ensures that tasks of cgroup don't cross
+ * the configured limits. Per device provides granular configuration
+ * in multi device usage. It allocates resource pool in the hierarchy
+ * for each parent it comes across for first resource. Later on resource
+ * pool will be available. Therefore it will be much faster thereon
+ * to charge/uncharge.
+ */
+int rdmacg_try_charge(struct rdma_cgroup **rdmacg,
+		      struct rdmacg_device *device,
+		      enum rdmacg_resource_type index)
+{
+	struct rdma_cgroup *cg, *p;
+	struct rdmacg_resource_pool *rpool;
+	s64 new;
+	int ret = 0;
+
+	if (index >= RDMACG_RESOURCE_MAX)
+		return -EINVAL;
+
+	/*
+	 * hold on to css, as cgroup can be removed but resource
+	 * accounting happens on css.
+	 */
+	cg = get_current_rdmacg();
+
+	mutex_lock(&rdmacg_mutex);
+	for (p = cg; p; p = parent_rdmacg(p)) {
+		rpool = get_cg_rpool_locked(p, device);
+		if (IS_ERR(rpool)) {
+			ret = PTR_ERR(rpool);
+			goto err;
+		} else {
+			new = rpool->resources[index].usage + 1;
+			if (new > rpool->resources[index].max) {
+				ret = -EAGAIN;
+				goto err;
+			} else {
+				rpool->resources[index].usage = new;
+				rpool->usage_sum++;
+			}
+		}
+	}
+	mutex_unlock(&rdmacg_mutex);
+
+	*rdmacg = cg;
+	return 0;
+
+err:
+	mutex_unlock(&rdmacg_mutex);
+	rdmacg_uncharge_hirerchy(cg, device, p, index);
+	return ret;
+}
+EXPORT_SYMBOL(rdmacg_try_charge);
+
+/**
+ * rdmacg_register_device - register rdmacg device to rdma controller.
+ * @device: pointer to rdmacg device whose resources need to be accounted.
+ *
+ * If IB stack wishes a device to participate in rdma cgroup resource
+ * tracking, it must invoke this API to register with rdma cgroup before
+ * any user space application can start using the RDMA resources.
+ * Returns 0 on success. As written, registration cannot fail: it only
+ * adds the device to the rdmacg device list.
+ */
+int rdmacg_register_device(struct rdmacg_device *device)
+{
+	INIT_LIST_HEAD(&device->dev_node);
+	INIT_LIST_HEAD(&device->rpools);
+
+	mutex_lock(&rdmacg_mutex);
+	list_add_tail(&device->dev_node, &rdmacg_devices);
+	mutex_unlock(&rdmacg_mutex);
+	return 0;
+}
+EXPORT_SYMBOL(rdmacg_register_device);
+
+/**
+ * rdmacg_unregister_device - unregister rdmacg device from rdma controller.
+ * @device: pointer to rdmacg device which was previously registered with rdma
+ *          controller using rdmacg_register_device().
+ *
+ * IB stack must invoke this after all the resources of the IB device
+ * are destroyed and after ensuring that no more resources will be created
+ * when this API is invoked.
+ */
+void rdmacg_unregister_device(struct rdmacg_device *device)
+{
+	struct rdmacg_resource_pool *rpool, *tmp;
+
+	/*
+	 * Synchronize with any active resource settings,
+	 * usage query happening via cgroup files.
+	 */
+	mutex_lock(&rdmacg_mutex);
+	list_del_init(&device->dev_node);
+
+	/*
+	 * Now that this device is off the cgroup list, it's safe to free
+	 * all the rpool resources.
+	 */
+	list_for_each_entry_safe(rpool, tmp, &device->rpools, dev_node)
+		free_cg_rpool_locked(rpool);
+
+	mutex_unlock(&rdmacg_mutex);
+}
+EXPORT_SYMBOL(rdmacg_unregister_device);
+
+/**
+ * rdmacg_query_limit - query the resource limits that
+ * might have been configured by the user.
+ * @device: pointer to rdmacg device
+ * @limits: pointer to an array of limits where rdma cg will provide
+ *          the configured limits of the cgroup.
+ *
+ * This function honors resource limit in hierarchical way.
+ * In a cgroup hierarchy, it reports the smallest limit configured
+ * for each resource among the cgroup and its ancestors.
+ */
+void rdmacg_query_limit(struct rdmacg_device *device, int *limits)
+{
+	struct rdma_cgroup *cg, *p;
+	struct rdmacg_resource_pool *rpool;
+	int i;
+
+	for (i = 0; i < RDMACG_RESOURCE_MAX; i++)
+		limits[i] = S32_MAX;
+
+	cg = get_current_rdmacg();
+	/*
+	 * Check in hierarchy which pool gets the least amount of
+	 * resource limits.
+	 */
+	mutex_lock(&rdmacg_mutex);
+	for (p = cg; p; p = parent_rdmacg(p)) {
+		rpool = find_cg_rpool_locked(p, device);
+		if (!rpool)
+			continue;
+
+		for (i = 0; i < RDMACG_RESOURCE_MAX; i++)
+			limits[i] = min_t(int, limits[i],
+					rpool->resources[i].max);
+	}
+	mutex_unlock(&rdmacg_mutex);
+	css_put(&cg->css);
+}
+EXPORT_SYMBOL(rdmacg_query_limit);
+
+static int parse_resource(char *c, int *intval)
+{
+	substring_t argstr;
+	const char **table = &rdmacg_resource_names[0];
+	char *name, *value = c;
+	size_t len;
+	int ret, i = 0;
+
+	name = strsep(&value, "=");
+	if (!name || !value)
+		return -EINVAL;
+
+	len = strlen(value);
+
+	for (i = 0; i < RDMACG_RESOURCE_MAX; i++) {
+		if (strcmp(table[i], name))
+			continue;
+
+		argstr.from = value;
+		argstr.to = value + len;
+
+		ret = match_int(&argstr, intval);
+		if (ret >= 0) {
+			if (*intval < 0)
+				break;
+			return i;
+		}
+		if (strcmp(value, RDMACG_MAX_STR) == 0) {
+			*intval = S32_MAX;
+			return i;
+		}
+		break;
+	}
+	return -EINVAL;
+}
+
+static int rdmacg_parse_limits(char *options,
+			       int *new_limits, unsigned long *enables)
+{
+	char *c;
+	int err = -EINVAL;
+
+	/* parse resource options */
+	while ((c = strsep(&options, " ")) != NULL) {
+		int index, intval;
+
+		index = parse_resource(c, &intval);
+		if (index < 0)
+			goto err;
+
+		new_limits[index] = intval;
+		*enables |= BIT(index);
+	}
+	return 0;
+
+err:
+	return err;
+}
+
+static struct rdmacg_device *rdmacg_get_device_locked(const char *name)
+{
+	struct rdmacg_device *device;
+
+	lockdep_assert_held(&rdmacg_mutex);
+
+	list_for_each_entry(device, &rdmacg_devices, dev_node)
+		if (!strcmp(name, device->name))
+			return device;
+
+	return NULL;
+}
+
+static ssize_t rdmacg_resource_set_max(struct kernfs_open_file *of,
+				       char *buf, size_t nbytes, loff_t off)
+{
+	struct rdma_cgroup *cg = css_rdmacg(of_css(of));
+	const char *dev_name;
+	struct rdmacg_resource_pool *rpool;
+	struct rdmacg_device *device;
+	char *options = strstrip(buf);
+	int *new_limits;
+	unsigned long enables = 0;
+	int i = 0, ret = 0;
+
+	/* extract the device name first */
+	dev_name = strsep(&options, " ");
+	if (!dev_name) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	new_limits = kcalloc(RDMACG_RESOURCE_MAX, sizeof(int), GFP_KERNEL);
+	if (!new_limits) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	ret = rdmacg_parse_limits(options, new_limits, &enables);
+	if (ret)
+		goto parse_err;
+
+	/* acquire lock to synchronize with hot plug devices */
+	mutex_lock(&rdmacg_mutex);
+
+	device = rdmacg_get_device_locked(dev_name);
+	if (!device) {
+		ret = -ENODEV;
+		goto dev_err;
+	}
+
+	rpool = get_cg_rpool_locked(cg, device);
+	if (IS_ERR(rpool)) {
+		ret = PTR_ERR(rpool);
+		goto dev_err;
+	}
+
+	/* now set the new limits of the rpool */
+	for_each_set_bit(i, &enables, RDMACG_RESOURCE_MAX)
+		set_resource_limit(rpool, i, new_limits[i]);
+
+	if (rpool->usage_sum == 0 &&
+	    rpool->num_max_cnt == RDMACG_RESOURCE_MAX) {
+		/*
+		 * No user of the rpool and all entries are set to max, so
+		 * safe to delete this rpool.
+		 */
+		free_cg_rpool_locked(rpool);
+	}
+
+dev_err:
+	mutex_unlock(&rdmacg_mutex);
+
+parse_err:
+	kfree(new_limits);
+
+err:
+	return ret ?: nbytes;
+}
+
+static void print_rpool_values(struct seq_file *sf,
+			       struct rdmacg_resource_pool *rpool)
+{
+	enum rdmacg_file_type sf_type;
+	int i;
+	u32 value;
+
+	sf_type = seq_cft(sf)->private;
+
+	for (i = 0; i < RDMACG_RESOURCE_MAX; i++) {
+		seq_puts(sf, rdmacg_resource_names[i]);
+		seq_putc(sf, '=');
+		if (sf_type == RDMACG_RESOURCE_TYPE_MAX) {
+			if (rpool)
+				value = rpool->resources[i].max;
+			else
+				value = S32_MAX;
+		} else {
+			/* no rpool yet: nothing has been charged */
+			value = rpool ? rpool->resources[i].usage : 0;
+		}
+
+		if (value == S32_MAX)
+			seq_puts(sf, RDMACG_MAX_STR);
+		else
+			seq_printf(sf, "%d", value);
+		seq_putc(sf, ' ');
+	}
+}
+
+static int rdmacg_resource_read(struct seq_file *sf, void *v)
+{
+	struct rdmacg_device *device;
+	struct rdmacg_resource_pool *rpool;
+	struct rdma_cgroup *cg = css_rdmacg(seq_css(sf));
+
+	mutex_lock(&rdmacg_mutex);
+
+	list_for_each_entry(device, &rdmacg_devices, dev_node) {
+		seq_printf(sf, "%s ", device->name);
+
+		rpool = find_cg_rpool_locked(cg, device);
+		print_rpool_values(sf, rpool);
+
+		seq_putc(sf, '\n');
+	}
+
+	mutex_unlock(&rdmacg_mutex);
+	return 0;
+}
+
+static struct cftype rdmacg_files[] = {
+	{
+		.name = "max",
+		.write = rdmacg_resource_set_max,
+		.seq_show = rdmacg_resource_read,
+		.private = RDMACG_RESOURCE_TYPE_MAX,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+	{
+		.name = "current",
+		.seq_show = rdmacg_resource_read,
+		.private = RDMACG_RESOURCE_TYPE_STAT,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+	{ }	/* terminate */
+};
+
+static struct cgroup_subsys_state *
+rdmacg_css_alloc(struct cgroup_subsys_state *parent)
+{
+	struct rdma_cgroup *cg;
+
+	cg = kzalloc(sizeof(*cg), GFP_KERNEL);
+	if (!cg)
+		return ERR_PTR(-ENOMEM);
+
+	INIT_LIST_HEAD(&cg->rpools);
+	return &cg->css;
+}
+
+static void rdmacg_css_free(struct cgroup_subsys_state *css)
+{
+	struct rdma_cgroup *cg = css_rdmacg(css);
+
+	kfree(cg);
+}
+
+/**
+ * rdmacg_css_offline - cgroup css_offline callback
+ * @css: css of interest
+ *
+ * This function is called when @css is about to go away and responsible
+ * for shooting down all rdmacg associated with @css. As part of that it
+ * marks all the resource pool entries to max value, so that when resources are
+ * uncharged, associated resource pool can be freed as well.
+ */
+static void rdmacg_css_offline(struct cgroup_subsys_state *css)
+{
+	struct rdma_cgroup *cg = css_rdmacg(css);
+	struct rdmacg_resource_pool *rpool;
+
+	mutex_lock(&rdmacg_mutex);
+
+	list_for_each_entry(rpool, &cg->rpools, cg_node)
+		set_all_resource_max_limit(rpool);
+
+	mutex_unlock(&rdmacg_mutex);
+}
+
+struct cgroup_subsys rdma_cgrp_subsys = {
+	.css_alloc	= rdmacg_css_alloc,
+	.css_free	= rdmacg_css_free,
+	.css_offline	= rdmacg_css_offline,
+	.legacy_cftypes	= rdmacg_files,
+	.dfl_cftypes	= rdmacg_files,
+};
-- 
1.8.3.1
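
The rdma.max write path in the patch above (rdmacg_resource_set_max() and rdmacg_parse_limits()) splits a line into a device name followed by "name=value" tokens and records which resources were mentioned in a bitmask. A minimal userspace sketch of that idea follows. It is not the kernel code: parse_resource() is not shown in this part of the thread, so parse_one(), parse_max_line(), the RES_* indices and the name table below are hypothetical stand-ins, with resource names taken from the documentation in patch 3/3.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical resource indices; order follows the documented key list. */
enum { RES_UCTX, RES_PD, RES_AH, RES_MR, RES_MW, RES_CQ, RES_SRQ, RES_QP,
       RES_FLOW, RES_MAX };

static const char *const res_names[RES_MAX] = {
	"uctx", "pd", "ah", "mr", "mw", "cq", "srq", "qp", "flow",
};

/* Parse one "name=value" token in place; returns the index or -1. */
static int parse_one(char *tok, int *val)
{
	char *eq = strchr(tok, '=');
	char *end;
	long v;
	int i;

	if (!eq || eq == tok || eq[1] == '\0')
		return -1;
	*eq++ = '\0';

	for (i = 0; i < RES_MAX; i++)
		if (!strcmp(tok, res_names[i]))
			break;
	if (i == RES_MAX)
		return -1;

	/* "max" stands for "no limit" and is stored as INT_MAX here. */
	if (!strcmp(eq, "max")) {
		*val = INT_MAX;
		return i;
	}
	v = strtol(eq, &end, 10);
	if (*end != '\0' || v < 0 || v > INT_MAX)
		return -1;
	*val = (int)v;
	return i;
}

/*
 * Parse "<device> name=value ..." in place. Limits go to limits[] and a
 * bit is set in *enables for every resource named on the line.
 * Returns the device name, or NULL on malformed input.
 * Note: strtok() skips runs of spaces, unlike the kernel's strsep().
 */
static const char *parse_max_line(char *line, int *limits,
				  unsigned long *enables)
{
	char *dev = strtok(line, " ");
	char *tok;

	if (!dev)
		return NULL;
	*enables = 0;
	while ((tok = strtok(NULL, " ")) != NULL) {
		int val;
		int idx = parse_one(tok, &val);

		if (idx < 0)
			return NULL;
		limits[idx] = val;
		*enables |= 1UL << idx;
	}
	return dev;
}
```

Only the resources whose bit is set in the mask would then be applied, which is how a partial write such as "mlx4_0 mr=100" can leave the other limits untouched.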

^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCHv12 2/3] IB/core: added support to use rdma cgroup controller
  2016-08-31  8:37 [PATCHv12 0/3] rdmacg: IB/core: rdma controller support Parav Pandit
  2016-08-31  8:37 ` [PATCHv12 1/3] rdmacg: Added rdma cgroup controller Parav Pandit
@ 2016-08-31  8:37 ` Parav Pandit
       [not found] ` <1472632647-1525-1-git-send-email-pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2016-10-05 11:22 ` Leon Romanovsky
  3 siblings, 0 replies; 112+ messages in thread
From: Parav Pandit @ 2016-08-31  8:37 UTC (permalink / raw)
  To: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan,
	hannes, dledford, hch, liranl, sean.hefty, jgunthorpe, haggaie
  Cc: corbet, james.l.morris, serge, ogerlitz, matanb, akpm,
	linux-security-module, pandit.parav

Added support APIs for IB core to register/unregister every IB/RDMA
device with rdma cgroup for tracking verbs and hw resources.
IB core registers with rdma cgroup controller.
Added support APIs for uverbs layer to make use of rdma controller.
Added uverbs layer to perform resource charge/uncharge functionality.
Added support during query_device uverb operation to ensure it
returns resource limits by honoring rdma cgroup configured limits.

Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
---
 drivers/infiniband/core/Makefile    |  1 +
 drivers/infiniband/core/cgroup.c    | 93 +++++++++++++++++++++++++++++++++++++
 drivers/infiniband/core/core_priv.h | 41 ++++++++++++++++
 drivers/infiniband/core/device.c    | 10 ++++
 4 files changed, 145 insertions(+)
 create mode 100644 drivers/infiniband/core/cgroup.c

diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index edaae9f..e426ac8 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -13,6 +13,7 @@ ib_core-y :=			packer.o ud_header.o verbs.o cq.o rw.o sysfs.o \
 				multicast.o mad.o smi.o agent.o mad_rmpp.o
 ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
 ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += umem_odp.o umem_rbtree.o
+ib_core-$(CONFIG_CGROUP_RDMA) += cgroup.o
 
 ib_cm-y :=			cm.o
 
diff --git a/drivers/infiniband/core/cgroup.c b/drivers/infiniband/core/cgroup.c
new file mode 100644
index 0000000..ffe7234
--- /dev/null
+++ b/drivers/infiniband/core/cgroup.c
@@ -0,0 +1,93 @@
+/*
+ * Copyright (C) 2016 Parav Pandit <pandit.parav@gmail.com>
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include "core_priv.h"
+
+/*
+ * resource table definition as to be seen by the user.
+ * Need to add entries to it when more resources are
+ * added/defined at IB verb/core layer.
+ */
+
+/**
+ * ib_device_register_rdmacg - register with rdma cgroup.
+ * @device: device to register to participate in resource
+ *          accounting by rdma cgroup.
+ *
+ * Register with the rdma cgroup. Should be called before
+ * exposing rdma device to user space applications to avoid
+ * resource accounting leak.
+ * Returns 0 on success or otherwise failure code.
+ */
+int ib_device_register_rdmacg(struct ib_device *device)
+{
+	device->cg_device.name = device->name;
+	return rdmacg_register_device(&device->cg_device);
+}
+
+/**
+ * ib_device_unregister_rdmacg - unregister with rdma cgroup.
+ * @device: device to unregister.
+ *
+ * Unregister with the rdma cgroup. Should be called after
+ * all the resources are deallocated, and after a stage when any
+ * other resource allocation by user application cannot be done
+ * for this device to avoid any leak in accounting.
+ */
+void ib_device_unregister_rdmacg(struct ib_device *device)
+{
+	rdmacg_unregister_device(&device->cg_device);
+}
+
+int ib_rdmacg_try_charge(struct ib_rdmacg_object *cg_obj,
+			 struct ib_device *device,
+			 enum rdmacg_resource_type resource_index)
+{
+	return rdmacg_try_charge(&cg_obj->cg, &device->cg_device,
+				 resource_index);
+}
+EXPORT_SYMBOL(ib_rdmacg_try_charge);
+
+void ib_rdmacg_uncharge(struct ib_rdmacg_object *cg_obj,
+			struct ib_device *device,
+			enum rdmacg_resource_type resource_index)
+{
+	rdmacg_uncharge(cg_obj->cg, &device->cg_device,
+			resource_index);
+}
+EXPORT_SYMBOL(ib_rdmacg_uncharge);
+
+void ib_rdmacg_query_limit(struct ib_device *device, int *limits)
+{
+	rdmacg_query_limit(&device->cg_device, limits);
+}
+EXPORT_SYMBOL(ib_rdmacg_query_limit);
diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h
index 19d499d..d1e432e 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -35,6 +35,7 @@
 
 #include <linux/list.h>
 #include <linux/spinlock.h>
+#include <linux/cgroup_rdma.h>
 
 #include <rdma/ib_verbs.h>
 
@@ -124,6 +125,46 @@ int ib_cache_setup_one(struct ib_device *device);
 void ib_cache_cleanup_one(struct ib_device *device);
 void ib_cache_release_one(struct ib_device *device);
 
+#ifdef CONFIG_CGROUP_RDMA
+int ib_device_register_rdmacg(struct ib_device *device);
+void ib_device_unregister_rdmacg(struct ib_device *device);
+
+int ib_rdmacg_try_charge(struct ib_rdmacg_object *cg_obj,
+			 struct ib_device *device,
+			 enum rdmacg_resource_type resource_index);
+
+void ib_rdmacg_uncharge(struct ib_rdmacg_object *cg_obj,
+			struct ib_device *device,
+			enum rdmacg_resource_type resource_index);
+
+void ib_rdmacg_query_limit(struct ib_device *device, int *limits);
+#else
+static inline int ib_device_register_rdmacg(struct ib_device *device)
+{ return 0; }
+
+static inline void ib_device_unregister_rdmacg(struct ib_device *device)
+{ }
+
+static inline int ib_rdmacg_try_charge(struct ib_rdmacg_object *cg_obj,
+				       struct ib_device *device,
+				       enum rdmacg_resource_type resource_index)
+{ return 0; }
+
+static inline void ib_rdmacg_uncharge(struct ib_rdmacg_object *cg_obj,
+				      struct ib_device *device,
+				      enum rdmacg_resource_type resource_index)
+{ }
+
+static inline void ib_rdmacg_query_limit(struct ib_device *device,
+					 int *limits)
+{
+	int i;
+
+	for (i = 0; i < RDMACG_RESOURCE_MAX; i++)
+		limits[i] = S32_MAX;
+}
+#endif
+
 static inline bool rdma_is_upper_dev_rcu(struct net_device *dev,
 					 struct net_device *upper)
 {
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 760ef60..08e3259 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -363,10 +363,18 @@ int ib_register_device(struct ib_device *device,
 		goto out;
 	}
 
+	ret = ib_device_register_rdmacg(device);
+	if (ret) {
+		pr_warn("Couldn't register device with rdma cgroup\n");
+		ib_cache_cleanup_one(device);
+		goto out;
+	}
+
 	memset(&device->attrs, 0, sizeof(device->attrs));
 	ret = device->query_device(device, &device->attrs, &uhw);
 	if (ret) {
 		pr_warn("Couldn't query the device attributes\n");
+		ib_device_unregister_rdmacg(device);
 		ib_cache_cleanup_one(device);
 		goto out;
 	}
@@ -375,6 +383,7 @@ int ib_register_device(struct ib_device *device,
 	if (ret) {
 		pr_warn("Couldn't register device %s with driver model\n",
 			device->name);
+		ib_device_unregister_rdmacg(device);
 		ib_cache_cleanup_one(device);
 		goto out;
 	}
@@ -424,6 +433,7 @@ void ib_unregister_device(struct ib_device *device)
 
 	mutex_unlock(&device_mutex);
 
+	ib_device_unregister_rdmacg(device);
 	ib_device_unregister_sysfs(device);
 	ib_cache_cleanup_one(device);
 
-- 
1.8.3.1
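
The commit message above says query_device should report limits that honor the cgroup configuration, and ib_rdmacg_query_limit() fills an array that defaults to S32_MAX when the controller is compiled out. One way to picture the resulting clamping is the userspace sketch below. The struct, the LIM_* indices and clamp_caps() are hypothetical and only illustrate "minimum of device capability and cgroup limit"; they are not taken from the uverbs code, which is not part of this message.

```c
#include <limits.h>

/* Hypothetical subset of the maxima a device reports. */
struct dev_caps {
	int max_qp;
	int max_cq;
	int max_mr;
};

/* Hypothetical indices into the limits[] array filled by the cgroup. */
enum { LIM_QP, LIM_CQ, LIM_MR, LIM_NR };

static int min_int(int a, int b)
{
	return a < b ? a : b;
}

/*
 * Report min(device capability, cgroup limit) for each resource.
 * A limit of INT_MAX means "max", so the device value passes through.
 */
static struct dev_caps clamp_caps(struct dev_caps hw, const int *limits)
{
	struct dev_caps out;

	out.max_qp = min_int(hw.max_qp, limits[LIM_QP]);
	out.max_cq = min_int(hw.max_cq, limits[LIM_CQ]);
	out.max_mr = min_int(hw.max_mr, limits[LIM_MR]);
	return out;
}
```

With this shape, the !CONFIG_CGROUP_RDMA stub that sets every limit to the maximum leaves the device-reported values unchanged.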

^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCHv12 3/3] rdmacg: Added documentation for rdmacg
  2016-08-31  8:37 [PATCHv12 0/3] rdmacg: IB/core: rdma controller support Parav Pandit
@ 2016-08-31  8:37     ` Parav Pandit
  2016-08-31  8:37 ` [PATCHv12 2/3] IB/core: added support to use " Parav Pandit
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 112+ messages in thread
From: Parav Pandit @ 2016-08-31  8:37 UTC (permalink / raw)
  To: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan,
	hannes, dledford, hch, liranl, sean.hefty, jgunthorpe, haggaie
  Cc: corbet, james.l.morris, serge, ogerlitz, matanb, akpm,
	linux-security-module, pandit.parav

Added documentation for v1 and v2 version describing high
level design and usage examples on using rdma controller.

Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
---
 Documentation/cgroup-v1/rdma.txt | 117 +++++++++++++++++++++++++++++++++++++++
 Documentation/cgroup-v2.txt      |  45 +++++++++++++++
 2 files changed, 162 insertions(+)
 create mode 100644 Documentation/cgroup-v1/rdma.txt

diff --git a/Documentation/cgroup-v1/rdma.txt b/Documentation/cgroup-v1/rdma.txt
new file mode 100644
index 0000000..28cb59e
--- /dev/null
+++ b/Documentation/cgroup-v1/rdma.txt
@@ -0,0 +1,117 @@
+				RDMA Controller
+				----------------
+
+Contents
+--------
+
+1. Overview
+  1-1. What is RDMA controller?
+  1-2. Why RDMA controller needed?
+  1-3. How is RDMA controller implemented?
+2. Usage Examples
+
+1. Overview
+
+1-1. What is RDMA controller?
+-----------------------------
+
+The RDMA controller allows a user to limit the RDMA/IB specific resources that
+a given set of processes can use. These processes are grouped using the RDMA
+controller.
+The RDMA controller defines a fixed set of verb resources which can be limited
+for the processes of a cgroup.
+
+1-2. Why RDMA controller needed?
+--------------------------------
+
+Currently user space applications can easily take away all the rdma device
+specific resources such as AH, CQ, QP, MR etc. As a result, other applications
+in other cgroups or kernel space ULPs may not even get a chance to allocate any
+rdma resources. This can lead to service unavailability.
+
+Therefore an RDMA controller is needed through which the resource consumption
+of processes can be limited. Through this controller the different rdma
+resources can be accounted.
+
+1-3. How is RDMA controller implemented?
+----------------------------------------
+
+The RDMA cgroup allows resource limits to be configured. It maintains
+resource accounting per cgroup, per device using a resource pool structure.
+Each resource pool can track up to 64 resources; this bound is imposed
+by the rdma cgroup and can be extended later if required.
+
+The resource pool object is linked to the cgroup css. In most use cases there
+are 0 to 4 resource pool instances per cgroup, per device, but nothing
+prevents having more. At present hundreds of RDMA devices per
+single cgroup may not be handled optimally; however, there is no
+known use case or requirement for such a configuration either.
+
+Since RDMA resources can be allocated from any process and can be freed by any
+of the child processes which share the address space, rdma resources are
+always owned by the creator cgroup css. This allows process migration from one
+cgroup to another without the major complexity of transferring resource
+ownership, because such ownership is not really present due to the shared
+nature of rdma resources. Linking resources around the css also ensures that
+cgroups can be deleted after processes have migrated. This allows process
+migration with active resources, even though that is not a primary use case.
+
+Whenever RDMA resource charging occurs, the owner rdma cgroup is returned to
+the caller. The same rdma cgroup should be passed when uncharging the resource.
+This also allows a process migrated with active RDMA resources to charge
+new resources to its new owner cgroup. It also allows a resource of a migrated
+process to be uncharged from the cgroup it was originally charged to,
+even though that is not a primary use case.
+
+A resource pool object is created in the following situations.
+(a) The user sets a limit and no resource pool exists yet for the device
+of interest for the cgroup.
+(b) No resource limits were configured, but the IB/RDMA stack tries to
+charge a resource. The pool is needed so that resources charged while no
+limits were set are uncharged correctly even if limits are configured later;
+otherwise the usage count would drop below zero.
+
+A resource pool is destroyed when all of its resource limits are set to max
+and the last charged resource is deallocated.
+
+The user should set all the limits to max if they intend to remove/unconfigure
+the resource pool for a particular device.
+
+The IB stack honors limits enforced by the rdma controller. When an application
+queries the maximum resource limits of an IB device, it gets the minimum of
+what is configured by the user for the given cgroup and what is supported by
+the IB device.
+
+The following resources can be accounted by the rdma controller.
+  uctx		Maximum number of User Contexts
+  pd		Maximum number of Protection domains
+  ah		Maximum number of Address handles
+  mr		Maximum number of Memory Regions
+  mw		Maximum number of Memory Windows
+  cq		Maximum number of Completion Queues
+  srq		Maximum number of Shared Receive Queues
+  qp		Maximum number of Queue Pairs
+  flow		Maximum number of Flows
+
+
+2. Usage Examples
+-----------------
+
+(a) Configure resource limit:
+echo mlx4_0 mr=100 qp=10 ah=2 > /sys/fs/cgroup/rdma/1/rdma.max
+echo ocrdma1 mr=120 qp=20 cq=10 > /sys/fs/cgroup/rdma/2/rdma.max
+
+(b) Query resource limit:
+cat /sys/fs/cgroup/rdma/2/rdma.max
+#Output:
+mlx4_0 uctx=max pd=max ah=2 mr=100 mw=max cq=max srq=max qp=10 flow=max
+ocrdma1 uctx=max pd=max ah=max mr=120 mw=max cq=10 srq=max qp=20 flow=max
+
+(c) Query current usage:
+cat /sys/fs/cgroup/rdma/2/rdma.current
+#Output:
+mlx4_0 uctx=1 pd=2 ah=2 mr=95 mw=0 cq=2 srq=0 qp=8 flow=0
+ocrdma1 uctx=1 pd=6 ah=9 mr=20 mw=0 cq=1 srq=0 qp=2 flow=0
+
+(d) Delete resource limit:
+echo mlx4_0 mr=max qp=max ah=max > /sys/fs/cgroup/rdma/1/rdma.max
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 4cc07ce..ca63a31 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -47,6 +47,8 @@ CONTENTS
   5-3. IO
     5-3-1. IO Interface Files
     5-3-2. Writeback
+  5-4. RDMA
+    5-4-1. RDMA Interface Files
 6. Namespace
   6-1. Basics
   6-2. The Root and Views
@@ -1119,6 +1121,49 @@ writeback as follows.
 	vm.dirty[_background]_ratio.
 
 
+5-4. RDMA
+
+The "rdma" controller regulates the distribution and accounting of
+RDMA resources.
+
+5-4-1. RDMA Interface Files
+
+  rdma.max
+	A read-write nested-keyed file that exists for all cgroups
+	except root and describes the currently configured resource limits
+	for an RDMA/IB device.
+
+	Lines are keyed by device name and are not ordered.
+	Each line contains space-separated resource names, each with its
+	configured limit that can be distributed.
+
+	The following nested keys are defined.
+
+	  uctx		Maximum number of User Contexts
+	  pd		Maximum number of Protection domains
+	  ah		Maximum number of Address handles
+	  mr		Maximum number of Memory Regions
+	  mw		Maximum number of Memory Windows
+	  cq		Maximum number of Completion Queues
+	  srq		Maximum number of Shared Receive Queues
+	  qp		Maximum number of Queue Pairs
+	  flow		Maximum number of Flows
+
+	An example for mlx4 and ocrdma devices follows.
+
+	  mlx4_0 uctx=max pd=4 ah=2 mr=10 mw=max cq=1 srq=1 qp=10 flow=10
+	  ocrdma1 uctx=2 pd=2 ah=2 mr=20 mw=max cq=1 srq=1 qp=10 flow=10
+
+  rdma.current
+	A read-only file that describes current resource usage.
+	It exists for all cgroups except root.
+
+	An example for mlx4 and ocrdma devices follows.
+
+	  mlx4_1 uctx=1 pd=1 ah=0 mr=100 mw=0 cq=4 srq=0 qp=4 flow=10
+	  ocrdma1 uctx=2 pd=2 ah=2 mr=20 mw=0 cq=1 srq=1 qp=10 flow=10
+
+
 6. Namespace
 
 6-1. Basics
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-08-31  8:37 ` [PATCHv12 1/3] rdmacg: Added rdma cgroup controller Parav Pandit
@ 2016-08-31  9:38   ` Leon Romanovsky
  2016-09-07 15:07     ` Parav Pandit
  2016-08-31 15:07     ` Matan Barak
  1 sibling, 1 reply; 112+ messages in thread
From: Leon Romanovsky @ 2016-08-31  9:38 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan,
	hannes, dledford, hch, liranl, sean.hefty, jgunthorpe, haggaie,
	corbet, james.l.morris, serge, ogerlitz, matanb, akpm,
	linux-security-module

[-- Attachment #1: Type: text/plain, Size: 2059 bytes --]

On Wed, Aug 31, 2016 at 02:07:25PM +0530, Parav Pandit wrote:
> Added rdma cgroup controller that does accounting, limit enforcement
> on rdma/IB verbs and hw resources.
>
> Added rdma cgroup header file which defines its APIs to perform
> charing/uncharing functionality. It also defined APIs for RDMA/IB
> stack for device registration. Devices which are registered will
> participate in controller functions of accounting and limit
> enforcements. It define rdmacg_device structure to bind IB stack
> and RDMA cgroup controller.
>
> RDMA resources are tracked using resource pool. Resource pool is per
> device, per cgroup entity which allows setting up accounting limits
> on per device basis.
>
> Currently resources are defined by the RDMA cgroup.
>
> Resource pool is created/destroyed dynamically whenever
> charging/uncharging occurs respectively and whenever user
> configuration is done. Its a tradeoff of memory vs little more code
> space that creates resource pool object whenever necessary, instead of
> creating them during cgroup creation and device registration time.
>
> Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
> ---

<...>

> +
> +static struct rdmacg_resource_pool *
> +get_cg_rpool_locked(struct rdma_cgroup *cg, struct rdmacg_device *device)
> +{
> +	struct rdmacg_resource_pool *rpool;
> +
> +	rpool = find_cg_rpool_locked(cg, device);
> +	if (rpool)
> +		return rpool;
> +
> +	rpool = kzalloc(sizeof(*rpool), GFP_KERNEL);
> +	if (!rpool)
> +		return ERR_PTR(-ENOMEM);
> +
> +	rpool->device = device;
> +	set_all_resource_max_limit(rpool);
> +
> +	INIT_LIST_HEAD(&rpool->cg_node);
> +	INIT_LIST_HEAD(&rpool->dev_node);
> +	list_add_tail(&rpool->cg_node, &cg->rpools);
> +	list_add_tail(&rpool->dev_node, &device->rpools);
> +	return rpool;
> +}

<...>

> +	for (p = cg; p; p = parent_rdmacg(p)) {
> +		rpool = get_cg_rpool_locked(p, device);
> +		if (IS_ERR_OR_NULL(rpool)) {

get_cg_rpool_locked() never returns NULL (it returns either an error
pointer or a valid pointer), so IS_ERR() is sufficient here.

> +			ret = PTR_ERR(rpool);
> +			goto err;

I didn't review the whole series yet.
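
For readers unfamiliar with the error-pointer convention behind the comment above, here is a small userspace sketch. The helpers are simplified stand-ins modeled on the kernel's ERR_PTR()/PTR_ERR()/IS_ERR() family, written here only for illustration. It shows why IS_ERR_OR_NULL() paired with PTR_ERR() is fragile: for a NULL pointer PTR_ERR() yields 0, which the caller would read as success.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins modeled on the kernel helpers (illustration only). */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* Error pointers occupy the top MAX_ERRNO values of the address space. */
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static inline int IS_ERR_OR_NULL(const void *ptr)
{
	return !ptr || IS_ERR(ptr);
}

/*
 * Mirrors the questioned pattern: the branch is taken for NULL too,
 * but PTR_ERR(NULL) is 0, so the "error" returned would be success.
 */
static long check_or_null(const void *p)
{
	if (IS_ERR_OR_NULL(p))
		return PTR_ERR(p);
	return 1;	/* valid pointer */
}
```

When the callee can only return a valid pointer or an error pointer, a plain IS_ERR() check avoids the ambiguity.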

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
  2016-08-31  8:37 [PATCHv12 0/3] rdmacg: IB/core: rdma controller support Parav Pandit
@ 2016-08-31 13:56     ` Tejun Heo
  2016-08-31  8:37 ` [PATCHv12 2/3] IB/core: added support to use " Parav Pandit
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 112+ messages in thread
From: Tejun Heo @ 2016-08-31 13:56 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan, hannes,
	dledford, hch, liranl, sean.hefty, jgunthorpe, haggaie, corbet,
	james.l.morris, serge, ogerlitz, matanb, akpm,
	linux-security-module

Hello,

On Wed, Aug 31, 2016 at 02:07:24PM +0530, Parav Pandit wrote:
> rdmacg: IB/core: rdma controller support
> 
> Patch is generated and tested against below Doug's linux-rdma
> git tree.
> 
> URL: git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma.git
> Branch: master
> 
> Patchset is also compiled and tested against below Tejun's cgroup tree
> using cgroup v2 mode.
> URL: git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git
> Branch: master
> 
> Overview:
> Currently user space applications can easily take away all the rdma
> device specific resources such as AH, CQ, QP, MR etc. Due to which other
> applications in other cgroup or kernel space ULPs may not even get chance
> to allocate any rdma resources. This results into service unavailibility.

Generally looks good to me.  Please feel free to add

 Acked-by: Tejun Heo <tj@kernel.org>

for the whole series.  Also, once reviews from rdma side are done,
please let me know what's the preferable way to route the patchset.  I
can route it through the cgroup tree or it can go through rdma one.

Thanks a lot for the persistence!

-- 
tejun

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-08-31  8:37 ` [PATCHv12 1/3] rdmacg: Added rdma cgroup controller Parav Pandit
@ 2016-08-31 15:07     ` Matan Barak
  2016-08-31 15:07     ` Matan Barak
  1 sibling, 0 replies; 112+ messages in thread
From: Matan Barak @ 2016-08-31 15:07 UTC (permalink / raw)
  To: Parav Pandit, cgroups, linux-doc, linux-kernel, linux-rdma, tj,
	lizefan, hannes, dledford, hch, liranl, sean.hefty, jgunthorpe,
	haggaie
  Cc: corbet, james.l.morris, serge, ogerlitz, akpm, linux-security-module

On 31/08/2016 11:37, Parav Pandit wrote:
> Added rdma cgroup controller that does accounting, limit enforcement
> on rdma/IB verbs and hw resources.
>
> Added rdma cgroup header file which defines its APIs to perform
> charing/uncharing functionality. It also defined APIs for RDMA/IB
> stack for device registration. Devices which are registered will
> participate in controller functions of accounting and limit
> enforcements. It define rdmacg_device structure to bind IB stack
> and RDMA cgroup controller.
>
> RDMA resources are tracked using resource pool. Resource pool is per
> device, per cgroup entity which allows setting up accounting limits
> on per device basis.
>
> Currently resources are defined by the RDMA cgroup.
>
> Resource pool is created/destroyed dynamically whenever
> charging/uncharging occurs respectively and whenever user
> configuration is done. Its a tradeoff of memory vs little more code
> space that creates resource pool object whenever necessary, instead of
> creating them during cgroup creation and device registration time.
>
> Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
> ---
>  include/linux/cgroup_rdma.h   |  66 +++++
>  include/linux/cgroup_subsys.h |   4 +
>  init/Kconfig                  |  10 +
>  kernel/Makefile               |   1 +
>  kernel/cgroup_rdma.c          | 664 ++++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 745 insertions(+)
>  create mode 100644 include/linux/cgroup_rdma.h
>  create mode 100644 kernel/cgroup_rdma.c
>
> diff --git a/include/linux/cgroup_rdma.h b/include/linux/cgroup_rdma.h
> new file mode 100644
> index 0000000..6710e28
> --- /dev/null
> +++ b/include/linux/cgroup_rdma.h
> @@ -0,0 +1,66 @@
> +/*
> + * Copyright (C) 2016 Parav Pandit <pandit.parav@gmail.com>
> + *
> + * This file is subject to the terms and conditions of version 2 of the GNU
> + * General Public License. See the file COPYING in the main directory of the
> + * Linux distribution for more details.
> + */
> +
> +#ifndef _CGROUP_RDMA_H
> +#define _CGROUP_RDMA_H
> +
> +#include <linux/cgroup.h>
> +
> +enum rdmacg_resource_type {
> +	RDMACG_VERB_RESOURCE_UCTX,
> +	RDMACG_VERB_RESOURCE_AH,
> +	RDMACG_VERB_RESOURCE_PD,
> +	RDMACG_VERB_RESOURCE_CQ,
> +	RDMACG_VERB_RESOURCE_MR,
> +	RDMACG_VERB_RESOURCE_MW,
> +	RDMACG_VERB_RESOURCE_SRQ,
> +	RDMACG_VERB_RESOURCE_QP,
> +	RDMACG_VERB_RESOURCE_FLOW,
> +	/*
> +	 * add any hw specific resource here as RDMA_HW_RESOURCE_NAME
> +	 */
> +	RDMACG_RESOURCE_MAX,
> +};
> +
> +#ifdef CONFIG_CGROUP_RDMA
> +

There are ongoing discussions regarding the RDMA ABI. The currently
proposed approach (after a lot of discussion in the OFVWG) is to have
driver-dependent object types rather than the fixed set of IB object
types we have today.
AFAIK, some vendors might want to use the RDMA subsystem for different
fabrics, which have a different set of objects.
You can see RFCs for such concepts from both Mellanox and Intel on the
linux-rdma mailing list.

That said, maybe we need to make the resource types a bit more
flexible and dynamic.

Regards,
Matan

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-08-31 15:07     ` Matan Barak
  (?)
@ 2016-08-31 21:16     ` Tejun Heo
       [not found]       ` <20160831211618.GA12660-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
  -1 siblings, 1 reply; 112+ messages in thread
From: Tejun Heo @ 2016-08-31 21:16 UTC (permalink / raw)
  To: Matan Barak
  Cc: Parav Pandit, cgroups, linux-doc, linux-kernel, linux-rdma,
	lizefan, hannes, dledford, hch, liranl, sean.hefty, jgunthorpe,
	haggaie, corbet, james.l.morris, serge, ogerlitz, akpm,
	linux-security-module

Hello,

On Wed, Aug 31, 2016 at 06:07:30PM +0300, Matan Barak wrote:
> Currently, there are some discussions regarding the RDMA ABI. The current
> proposed approach (after a lot of discussions in the OFVWG) is to have
> driver dependent object types rather than the fixed set of IB object types
> we have today.
> AFAIK, some vendors might want to use the RDMA subsystem for a different
> fabrics which has a different set of objects.
> You could see RFCs for such concepts both from Mellanox and Intel on the
> linux-rdma mailing list.
> 
> Saying that, maybe we need to make the resource types a bit more flexible
> and dynamic.

That'd be back to square one and Christoph was dead against it too,
so...

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-08-31 21:16     ` Tejun Heo
@ 2016-09-01  7:25           ` Matan Barak
  0 siblings, 0 replies; 112+ messages in thread
From: Matan Barak @ 2016-09-01  7:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Parav Pandit, cgroups, linux-doc, linux-kernel, linux-rdma,
	lizefan, hannes, dledford, hch, liranl, sean.hefty, jgunthorpe,
	haggaie, corbet, james.l.morris, serge, ogerlitz, akpm,
	linux-security-module

On 01/09/2016 00:16, Tejun Heo wrote:
> Hello,
>
> On Wed, Aug 31, 2016 at 06:07:30PM +0300, Matan Barak wrote:
>> Currently, there are some discussions regarding the RDMA ABI. The current
>> proposed approach (after a lot of discussions in the OFVWG) is to have
>> driver dependent object types rather than the fixed set of IB object types
>> we have today.
>> AFAIK, some vendors might want to use the RDMA subsystem for a different
>> fabrics which has a different set of objects.
>> You could see RFCs for such concepts both from Mellanox and Intel on the
>> linux-rdma mailing list.
>>
>> Saying that, maybe we need to make the resource types a bit more flexible
>> and dynamic.
>
> That'd be back to square one and Christoph was dead against it too,
> so...
>

Well, if I recall, the reason for doing so last time was to allow
updating ib_core independently, which is obviously not a good reason
(to say the least).

Since the new ABI will probably define new object types (all recent
proposals go this way), the current approach could lead either to
mapping new objects onto existing cgroup resource types, which could
give some weird non-1:1 mapping, or to split definitions, where each
driver declares its objects both in the cgroups mechanism and in its
driver dispatch table.
Even worse, drivers could simply ignore the cgroups support while
implementing their own resource types and end up with very broken
container support.


> Thanks.
>

Matan

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-01  7:25           ` Matan Barak
  (?)
@ 2016-09-01  8:44           ` Christoph Hellwig
       [not found]             ` <20160901084406.GA4115-jcswGhMUV9g@public.gmane.org>
  -1 siblings, 1 reply; 112+ messages in thread
From: Christoph Hellwig @ 2016-09-01  8:44 UTC (permalink / raw)
  To: Matan Barak
  Cc: Tejun Heo, Parav Pandit, cgroups, linux-doc, linux-kernel,
	linux-rdma, lizefan, hannes, dledford, hch, liranl, sean.hefty,
	jgunthorpe, haggaie, corbet, james.l.morris, serge, ogerlitz,
	akpm, linux-security-module

On Thu, Sep 01, 2016 at 10:25:40AM +0300, Matan Barak wrote:
> Well, if I recall, the reason doing so last time was in order to allow 
> flexible updating of ib_core independently, which is obviously not a good 
> reason (to say the least).
>
> Since the new ABI will probably define new object types (all recent 
> proposals go this way), the current approach could lead to either trying to 
> map new objects to existing cgroup resource types, which could lead to some 
> weird non 1:1 mapping, or having a split definitions - such that each 
> driver will declare its objects both in the cgroups mechanism and in its 
> driver dispatch table.
> Even worse than that, drivers could simply ignore the cgroups support while 
> implementing their own resource types and get a very broken containers 
> support.

Sorry guys, but arbitrary extensibility for something not finished is the
worst idea ever.  We have two options here:

 a) delay cgroups support until the grand rewrite is done
 b) add it now and deal with the consequences later

That being said, adding random non-Verbs hardware to the RDMA core is
the second worst idea ever.  Guess I need to catch up with the
discussion and start using the flame thrower.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-01  8:44           ` Christoph Hellwig
@ 2016-09-07  7:55                 ` Parav Pandit
  0 siblings, 0 replies; 112+ messages in thread
From: Parav Pandit @ 2016-09-07  7:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matan Barak, Tejun Heo, cgroups, linux-doc,
	Linux Kernel Mailing List, linux-rdma, Li Zefan, Johannes Weiner,
	Doug Ledford, Liran Liss, Hefty, Sean, Jason Gunthorpe,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

Hi Matan,

On Thu, Sep 1, 2016 at 2:14 PM, Christoph Hellwig <hch@lst.de> wrote:
> On Thu, Sep 01, 2016 at 10:25:40AM +0300, Matan Barak wrote:
>> Well, if I recall, the reason doing so last time was in order to allow
>> flexible updating of ib_core independently, which is obviously not a good
>> reason (to say the least).
>>
>> Since the new ABI will probably define new object types (all recent
>> proposals go this way), the current approach could lead to either trying to
>> map new objects to existing cgroup resource types, which could lead to some
>> weird non 1:1 mapping, or having a split definitions - such that each
>> driver will declare its objects both in the cgroups mechanism and in its
>> driver dispatch table.

>> Even worse than that, drivers could simply ignore the cgroups support while
>> implementing their own resource types and get a very broken containers
>> support.
If a driver is broken because it neglects to call the cgroup APIs,
that should be OK.
That particular driver should be fixed.
If resource creation through uverbs uses a well-defined RDMA-level
resource, the uverbs layer will honor cgroup limits without
involving hw drivers anyway.

RDMA verb-level resources are charged/uncharged by the RDMA core.
RDMA HW-level resources are charged/uncharged by the RDMA HW driver,
using the well-defined API and resources defined by the cgroup core.
This scheme ensures that there is a 1:1 mapping.

I don't think the current definition of resource types is carved in
stone. They can be extended as we go forward.
As many of us agree, they should be well defined and agreed upon by
the cgroup and rdma communities.

For example, today we have the RDMACG_VERB_RESOURCE_xxx resources.
New well-defined RDMA HW resources can be added to the cgroup_rdma.h
file as RDMA_HW_xx entries in the same enum table.
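
As a rough sketch of that idea (not part of the patch): the
RDMACG_VERB_RESOURCE_* entries below are copied from the patch's enum,
while RDMACG_HW_RESOURCE_EXAMPLE, the name strings and
rdmacg_resource_lookup() are hypothetical. It only shows how a
well-defined HW resource could be appended before RDMACG_RESOURCE_MAX
without disturbing the existing entries:

```c
#include <string.h>

enum rdmacg_resource_type {
	RDMACG_VERB_RESOURCE_UCTX,
	RDMACG_VERB_RESOURCE_AH,
	RDMACG_VERB_RESOURCE_PD,
	RDMACG_VERB_RESOURCE_CQ,
	RDMACG_VERB_RESOURCE_MR,
	RDMACG_VERB_RESOURCE_MW,
	RDMACG_VERB_RESOURCE_SRQ,
	RDMACG_VERB_RESOURCE_QP,
	RDMACG_VERB_RESOURCE_FLOW,
	/* hypothetical HW resource, appended after the verb resources */
	RDMACG_HW_RESOURCE_EXAMPLE,
	RDMACG_RESOURCE_MAX,
};

/* Illustrative names only (assumed, not taken from the patch). */
static const char *const rdmacg_resource_names[RDMACG_RESOURCE_MAX] = {
	[RDMACG_VERB_RESOURCE_UCTX]	= "uctx",
	[RDMACG_VERB_RESOURCE_AH]	= "ah",
	[RDMACG_VERB_RESOURCE_PD]	= "pd",
	[RDMACG_VERB_RESOURCE_CQ]	= "cq",
	[RDMACG_VERB_RESOURCE_MR]	= "mr",
	[RDMACG_VERB_RESOURCE_MW]	= "mw",
	[RDMACG_VERB_RESOURCE_SRQ]	= "srq",
	[RDMACG_VERB_RESOURCE_QP]	= "qp",
	[RDMACG_VERB_RESOURCE_FLOW]	= "flow",
	[RDMACG_HW_RESOURCE_EXAMPLE]	= "hw_example",
};

/* Return the enum index for a resource name, or -1 if it is unknown. */
static int rdmacg_resource_lookup(const char *name)
{
	int i;

	for (i = 0; i < RDMACG_RESOURCE_MAX; i++)
		if (rdmacg_resource_names[i] &&
		    strcmp(rdmacg_resource_names[i], name) == 0)
			return i;
	return -1;
}
```

Because the new entry sits after the existing ones, the values of the
verb resources stay unchanged and any per-resource array sized by
RDMACG_RESOURCE_MAX grows by exactly one slot.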

>
> Sorry guys, but arbitrary extensibility for something not finished is the
> worst idea ever.  We have two options here:
>
>  a) delay cgroups support until the grand rewrite is done
>  b) add it now and deal with the consequences later
>
Can we do (b) now and defer adding any HW resources to cgroup until
they are clearly called out?
The architecture and APIs are already in place to support this.

> That being said, adding random non-Verbs hardware to the RDMA core is
> the second worst idea ever.

We can defer adding HW resources to the core and cgroup until they are
clearly defined.
In that case the current architecture still holds.

> Guess I need to catch up with the
> discussion and start using the flame thrower.

Matan,
Can you please point us to the new RFC/ABI email thread which
describes the design, partitioning of code between core vs hw drivers,
etc.?

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-07  7:55                 ` Parav Pandit
@ 2016-09-07  8:51                   ` Matan Barak
  -1 siblings, 0 replies; 112+ messages in thread
From: Matan Barak @ 2016-09-07  8:51 UTC (permalink / raw)
  To: Parav Pandit, Christoph Hellwig
  Cc: Tejun Heo, cgroups, linux-doc, Linux Kernel Mailing List,
	linux-rdma, Li Zefan, Johannes Weiner, Doug Ledford, Liran Liss,
	Hefty, Sean, Jason Gunthorpe, Haggai Eran, Jonathan Corbet,
	james.l.morris, serge, Or Gerlitz, Andrew Morton,
	linux-security-module

On 07/09/2016 10:55, Parav Pandit wrote:
> Hi Matan,
>
> On Thu, Sep 1, 2016 at 2:14 PM, Christoph Hellwig <hch@lst.de> wrote:
>> On Thu, Sep 01, 2016 at 10:25:40AM +0300, Matan Barak wrote:
>>> Well, if I recall, the reason doing so last time was in order to allow
>>> flexible updating of ib_core independently, which is obviously not a good
>>> reason (to say the least).
>>>
>>> Since the new ABI will probably define new object types (all recent
>>> proposals go this way), the current approach could lead to either trying to
>>> map new objects to existing cgroup resource types, which could lead to some
>>> weird non 1:1 mapping, or having a split definitions - such that each
>>> driver will declare its objects both in the cgroups mechanism and in its
>>> driver dispatch table.
>
>>> Even worse than that, drivers could simply ignore the cgroups support while
>>> implementing their own resource types and get a very broken containers
>>> support.
> If drivers are broken due to ignorance of not-calling cgroup APIs,
> that should be ok.
> That particular driver should fix it.
> If the resource creation using uverbs is using well defined rdma level
> resource, uverbs level will make sure to honor cgroup limits without
> involving hw drivers anyway.
>

All recent proposals for the new ABI schema extend the flexibility of
the current schema by letting drivers define their own types, actions,
attributes, etc. Beyond that, dispatching starts from the driver, which
chooses whether to use the common RDMA core layer or its own
implementation instead.
Some drivers might even prefer not to implement the current verbs types.
These decisions were made in the OFVWG meetings.

Anyway, maybe we should consider using higher-level logical objects
that could fit the requirements of multiple drivers.

> RDMA Verb level resource is charged/uncharged by RDMA core.
> RDMA HW level resource is charged/uncharged by RDMA HW driver using
> well defined API and resource by cgroup core.
> This scheme ensures that there is 1:1 mapping.
>

Sounds reasonable, but what about drivers which ignore the common code
and implement it in their own way? What about drivers which don't
support the standard RDMA types at all?
I guess we should find a balance: something abstract and common enough
to ease administrator configuration, yet specific enough for the
various models we have (or will have) in the RDMA stack.

> I don't think current definition of resource type is carved out on stone.
> They can be extended as we forward.
> As many of us agree that, they should be well defined and it should be
> agreed by cgroup and rdma community.
>

Of course, but since the ABI and the cgroups model are somewhat
related, they should be dealt with together and approved by Doug, who
participated in some of the OFVWG meetings.

> For example, today we have RDMA_VERB_xxx resources.
> New well defined RDMA HW resources can be defined in rdma_cgroup.h
> file as RDMA_HW_xx in same enum table.
>

So a driver will change the cgroups file for every new type it adds?
Will we just end up with a superset enum of all the crazy types vendors
have added?

>>
>> Sorry guys, but arbitrary extensibility for something not finished is the
>> worst idea ever.  We have two options here:
>>
>>  a) delay cgroups support until the grand rewrite is done
>>  b) add it now and deal with the consequences later
>>
> Can we do (b) now and defer adding any HW resources to cgroup until
> they are clearly called out.
> Architecture and APIs are already in place to support this.
>

Since this affects the user, it's better to first think about how it
fits our "optional standard"/"vendor types" model. Maybe we could force
all standard types even if the driver we use doesn't support any of
them.

>> That being said, adding random non-Verbs hardware to the RDMA core is
>> the second worst idea ever.
>
> We can defer adding HW resource to core and cgroup until they are
> clearly defined.
> In that case current architecture still holds good.
>

Clearly we should defer adding the actual code until a driver can
declare such objects, but we need to decide how to expose the standard
optional RDMA types (basically, the types you've added) and how future
driver-specific types fit in.

>> Guess I need to catch up with the
>> discussion and start using the flame thrower.
>
> Matan,
> Can you please point us to the new RFC/ABI email thread which
> describes the design, partitioning of code between core vs hw drivers
> etc.
>

One proposal is [1]. There's another one from Sean with similar goals
regarding the driver-specific types.

[1] https://www.spinics.net/lists/linux-rdma/msg38997.html

Matan

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
@ 2016-09-07  8:51                   ` Matan Barak
  0 siblings, 0 replies; 112+ messages in thread
From: Matan Barak @ 2016-09-07  8:51 UTC (permalink / raw)
  To: Parav Pandit, Christoph Hellwig
  Cc: Tejun Heo, cgroups, linux-doc, Linux Kernel Mailing List,
	linux-rdma, Li Zefan, Johannes Weiner, Doug Ledford, Liran Liss,
	Hefty, Sean, Jason Gunthorpe, Haggai Eran, Jonathan Corbet,
	james.l.morris, serge, Or Gerlitz, Andrew Morton,
	linux-security-module

On 07/09/2016 10:55, Parav Pandit wrote:
> Hi Matan,
>
> On Thu, Sep 1, 2016 at 2:14 PM, Christoph Hellwig <hch@lst.de> wrote:
>> On Thu, Sep 01, 2016 at 10:25:40AM +0300, Matan Barak wrote:
>>> Well, if I recall, the reason doing so last time was in order to allow
>>> flexible updating of ib_core independently, which is obviously not a good
>>> reason (to say the least).
>>>
>>> Since the new ABI will probably define new object types (all recent
>>> proposals go this way), the current approach could lead to either trying to
>>> map new objects to existing cgroup resource types, which could lead to some
>>> weird non 1:1 mapping, or having a split definitions - such that each
>>> driver will declare its objects both in the cgroups mechanism and in its
>>> driver dispatch table.
>
>>> Even worse than that, drivers could simply ignore the cgroups support while
>>> implementing their own resource types and get a very broken containers
>>> support.
> If drivers are broken due to ignorance of not-calling cgroup APIs,
> that should be ok.
> That particular driver should fix it.
> If the resource creation using uverbs is using well defined rdma level
> resource, uverbs level will make sure to honor cgroup limits without
> involving hw drivers anyway.
>

All recent proposals for the new ABI schema deal with extending the
flexibility of the current schema by letting drivers define their
specific types, actions, attributes, etc. Even more than that, the
dispatching starts from the driver, and it chooses whether it wants to
use the common RDMA core layer or have its own implementation instead.
Some drivers might even prefer not to implement the current verbs types.
These decisions were made in the OFVWG meetings.

Anyway, maybe we should consider using higher-level logical objects
that could fit multiple drivers' requirements.

> RDMA Verb level resource is charged/uncharged by RDMA core.
> RDMA HW level resource is charged/uncharged by RDMA HW driver using
> well defined API and resource by cgroup core.
> This scheme ensures that there is 1:1 mapping.
>

Sounds reasonable, but what about drivers which ignore the common code 
and implement it in their own way? What about drivers which don't 
support the standard RDMA types at all?
I guess we should find a balance between something abstract and common 
enough that will ease administrator configuration but be specific enough 
for the various models we have (or will have) in the RDMA stack.

> I don't think current definition of resource type is carved out on stone.
> They can be extended as we forward.
> As many of us agree that, they should be well defined and it should be
> agreed by cgroup and rdma community.
>

Of course, but since the ABI and cgroups model are somehow related, they 
should be dealt with together and approved by Doug who participated in 
some of the OFVWG meetings.

> For example, today we have RDMA_VERB_xxx resources.
> New well defined RDMA HW resources can be defined in rdma_cgroup.h
> file as RDMA_HW_xx in same enum table.
>

So a driver will change the cgroups file for every new type it adds?
Will we just have a super set enum of all crazy types vendors added?

>>
>> Sorry guys, but arbitrary extensibility for something not finished is the
>> worst idea ever.  We have two options here:
>>
>>  a) delay cgroups support until the grand rewrite is done
>>  b) add it now and deal with the consequences later
>>
> Can we do (b) now and differ adding any HW resources to cgroup until
> they are clearly called out.
> Architecture and APIs are already in place to support this.
>

Since this affects the user, it's better to think about how it fits our
"optional standard"/"vendor types" model first. Maybe we could force all
standard types even if the driver we use doesn't support any of them.

>> That being said, adding random non-Verbs hardwasre to the RDMA core is
>> the second worst idea ever.
>
> We can differ adding HW resource to core and cgroup until they are
> clearly defined.
> In that case current architecture still holds good.
>

Clearly we should defer adding the actual code until a driver can
declare such objects, but we need to decide how to expose the standard
optional RDMA types (basically, the types you've added) and how future
driver-specific types fit in.

>> Guess I need to catch up with the
>> discussion and start using the flame thrower.
>
> Matan,
> Can you please point us to the new RFC/ABI email thread which
> describes the design, partitioning of code between core vs hw drivers
> etc.
>

One proposal is [1]. There's another one from Sean which aims for
similar targets regarding the driver-specific types.

[1] https://www.spinics.net/lists/linux-rdma/msg38997.html

Matan

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-07  8:51                   ` Matan Barak
  (?)
@ 2016-09-07 14:54                   ` Parav Pandit
  -1 siblings, 0 replies; 112+ messages in thread
From: Parav Pandit @ 2016-09-07 14:54 UTC (permalink / raw)
  To: Matan Barak
  Cc: Christoph Hellwig, Tejun Heo, cgroups, linux-doc,
	Linux Kernel Mailing List, linux-rdma, Li Zefan, Johannes Weiner,
	Doug Ledford, Liran Liss, Hefty, Sean, Jason Gunthorpe,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

On Wed, Sep 7, 2016 at 2:21 PM, Matan Barak <matanb@mellanox.com> wrote:
> On 07/09/2016 10:55, Parav Pandit wrote:
>>
>> Hi Matan,
>>
>> On Thu, Sep 1, 2016 at 2:14 PM, Christoph Hellwig <hch@lst.de> wrote:
>>>
>>> On Thu, Sep 01, 2016 at 10:25:40AM +0300, Matan Barak wrote:
>>>>
>>>> Well, if I recall, the reason doing so last time was in order to allow
>>>> flexible updating of ib_core independently, which is obviously not a
>>>> good
>>>> reason (to say the least).
>>>>
>>>> Since the new ABI will probably define new object types (all recent
>>>> proposals go this way), the current approach could lead to either trying
>>>> to
>>>> map new objects to existing cgroup resource types, which could lead to
>>>> some
>>>> weird non 1:1 mapping, or having a split definitions - such that each
>>>> driver will declare its objects both in the cgroups mechanism and in its
>>>> driver dispatch table.
>>
>>
>>>> Even worse than that, drivers could simply ignore the cgroups support
>>>> while
>>>> implementing their own resource types and get a very broken containers
>>>> support.
>>
>> If drivers are broken due to ignorance of not-calling cgroup APIs,
>> that should be ok.
>> That particular driver should fix it.
>> If the resource creation using uverbs is using well defined rdma level
>> resource, uverbs level will make sure to honor cgroup limits without
>> involving hw drivers anyway.
>>
>
> All recent proposals of the new ABI schema deals with extending the
> flexibility of the current schema by letting drivers define their specific
> types, actions, attributes, etc. Even more than that, the dispatching starts
> from the driver and it chooses if it wants to use the common RDMA core layer
> or have it's own wise implementation instead.
> Some drivers might even prefer not to implement the current verbs types.
> These decisions were made in the OFVWG meetings.

OK. If some drivers don't implement the current verbs types, there is
no point in controlling them either.
In such a case the application-space library won't even invoke resource
allocation/free for unsupported resources.
Resources (types, in your words) which are not defined in cgroup, but
exist in the hw driver, cannot be controlled by cgroup.
As you highlighted in your [1], "driver's specific attributes could
someday become core's standard attributes", we should be able to add
new resource types to the existing rdma cgroup.

>
> Anyway, maybe we should consider using a more higher-level logic objects
> that could fit multiple drivers requirements.
I do not have any objects other than QP, MR etc. in mind currently.
I can think of an RDMA-specific socket that encompasses one PD, an AH,
a couple of MRs, and a QP.
But we don't have such a resource abstraction and its data transport
APIs in place.
There are rsocket, various versions of MPI, libfabric etc. in place.
So I am not sure which high-level objects to define at this point.
If you have such objects nailed down, we should be able to add them in
incremental development.

>
>> RDMA Verb level resource is charged/uncharged by RDMA core.
>> RDMA HW level resource is charged/uncharged by RDMA HW driver using
>> well defined API and resource by cgroup core.
>> This scheme ensures that there is 1:1 mapping.
>>
>
> Sounds reasonable, but what about drivers which ignore the common code and
> implement it in their own way?
Shouldn't Doug ask them to use the cgroup infrastructure instead of
implementing resource charging/uncharging in their own way?
It is still unlikely, or at least difficult, for drivers to group
processes themselves the way cgroup does in order to implement things
in their own way.
I agree, they can completely ignore RDMA verbs resources and implement
their own HW resources.

As you point out below, we need to find a balance between which hw
resources are to be defined and which are not.
For example, the newly added XRQ object could be a pure buffer_tag
with a receive queue for another vendor. I am not sure how to abstract
them into a single object.



> What about drivers which don't support the
> standard RDMA types at all?
> I guess we should find a balance between something abstract and common
> enough that will ease administrator configuration but be specific enough for
> the various models we have (or will have) in the RDMA stack.
>
>> I don't think current definition of resource type is carved out on stone.
>> They can be extended as we forward.
>> As many of us agree that, they should be well defined and it should be
>> agreed by cgroup and rdma community.
>>
>
> Of course, but since the ABI and cgroups model are somehow related, they
> should be dealt with together and approved by Doug who participated in some
> of the OFVWG meetings.
Sure.
>
>> For example, today we have RDMA_VERB_xxx resources.
>> New well defined RDMA HW resources can be defined in rdma_cgroup.h
>> file as RDMA_HW_xx in same enum table.
>>
>
> So a driver will change the cgroups file for every new type it adds?
Well, we wanted to avoid such churn in the cgroup file; that's why v11
was defining resources in IB core. But I guess that was not
acceptable. I had a NAK from Christoph and Tejun on that idea.

> Will we just have a super set enum of all crazy types vendors added?
As you said, we need to find a balance. I frankly don't know how to do
so. There has to be some reasonable number of types. As we go along,
Doug, Tejun and others should approve adding such types.
If I am not wrong, in the past year maybe two more resource types got
added? XRQ, state-less queues?
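
To illustrate the "same enum table" idea discussed above, here is a
minimal userspace sketch. It is not the code from the patch: every
identifier (the EX_ prefix, ex_pool, ex_lookup, the "vendor_obj" name)
is hypothetical and chosen only for illustration. The assumption it
shows is that appending one HW-level entry to a single enum, plus its
name, is enough for the per-pool arrays to cover it.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

/* Hypothetical resource table: verb-level entries first, HW-level after. */
enum ex_resource_type {
	EX_VERB_QP,
	EX_VERB_MR,
	EX_HW_VENDOR_OBJ,	/* hypothetical HW-level entry appended later */
	EX_RESOURCE_MAX
};

/* Names as an administrator would see them; indexed by the enum above. */
static const char *const ex_resource_names[EX_RESOURCE_MAX] = {
	[EX_VERB_QP]       = "qp",
	[EX_VERB_MR]       = "mr",
	[EX_HW_VENDOR_OBJ] = "vendor_obj",
};

/* Per-cgroup, per-device pool; array sizes follow the enum automatically. */
struct ex_pool {
	int max[EX_RESOURCE_MAX];
	int usage[EX_RESOURCE_MAX];
};

/* Start with no usage and every limit set to "max" (unlimited). */
static void ex_pool_init(struct ex_pool *pool)
{
	assert(pool != NULL);
	memset(pool, 0, sizeof(*pool));
	for (int i = 0; i < EX_RESOURCE_MAX; i++)
		pool->max[i] = INT_MAX;
}

/* Map a resource name to its enum index, or -1 if it is not defined. */
static int ex_lookup(const char *name)
{
	for (int i = 0; i < EX_RESOURCE_MAX; i++)
		if (strcmp(ex_resource_names[i], name) == 0)
			return i;
	return -1;
}
```

In this sketch a type that a given driver never allocates simply stays
at zero usage, which matches the "don't care" argument made elsewhere
in this thread.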

>
>>>
>>> Sorry guys, but arbitrary extensibility for something not finished is the
>>> worst idea ever.  We have two options here:
>>>
>>>  a) delay cgroups support until the grand rewrite is done
>>>  b) add it now and deal with the consequences later
>>>
>> Can we do (b) now and differ adding any HW resources to cgroup until
>> they are clearly called out.
>> Architecture and APIs are already in place to support this.
>>
>
> Since this affect the user, it's better to think how it fits our "optional
> standard"/"vendor types" model first. Maybe we could force all standard
> types even if the driver we use doesn't support any of them.

If a vendor doesn't support a given type, the user won't allocate it.
So it's just a don't-care condition.
I don't see a need to force standard types either.

>
>>> That being said, adding random non-Verbs hardwasre to the RDMA core is
>>> the second worst idea ever.
>>
>>
>> We can differ adding HW resource to core and cgroup until they are
>> clearly defined.
>> In that case current architecture still holds good.
>>
>
> Clearly we should differ adding the actual code until a driver could declare
> such objects, but we need to decide how to expose the standard optional RDMA
> types (basically, the types you've added) and how future driver specific
> types fit in.
>

OK. A handful more driver-specific types should be OK to add in cgroup.
I will let others speak up if that's not acceptable. The current code
already documents and provides infrastructure for that.

>>
>>
>> Matan,
>> Can you please point us to the new RFC/ABI email thread which
>> describes the design, partitioning of code between core vs hw drivers
>> etc.
>>
>
> One proposal is [1]. There's another one from Sean which aims for similar
> targets regards the driver specific types.
>
> [1] https://www.spinics.net/lists/linux-rdma/msg38997.html
>
I looked at the RFC briefly. I can see that the old infrastructure (a)
and (b) is not going away.
So it should be OK to charge/uncharge those standard resources from
those hooks.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-08-31  9:38   ` Leon Romanovsky
@ 2016-09-07 15:07     ` Parav Pandit
  2016-09-08  6:12       ` Leon Romanovsky
  0 siblings, 1 reply; 112+ messages in thread
From: Parav Pandit @ 2016-09-07 15:07 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: cgroups, linux-doc, Linux Kernel Mailing List, linux-rdma,
	Tejun Heo, Li Zefan, Johannes Weiner, Doug Ledford,
	Christoph Hellwig, Liran Liss, Hefty, Sean, Jason Gunthorpe,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Matan Barak, Andrew Morton, linux-security-module

Hi Leon,

>> Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
>> +static struct rdmacg_resource_pool *
>> +get_cg_rpool_locked(struct rdma_cgroup *cg, struct rdmacg_device *device)
>> +{
>> +     struct rdmacg_resource_pool *rpool;
>> +
>> +     rpool = find_cg_rpool_locked(cg, device);
>> +     if (rpool)
>> +             return rpool;
>> +
>> +     rpool = kzalloc(sizeof(*rpool), GFP_KERNEL);
>> +     if (!rpool)
>> +             return ERR_PTR(-ENOMEM);
>> +
>> +     rpool->device = device;
>> +     set_all_resource_max_limit(rpool);
>> +
>> +     INIT_LIST_HEAD(&rpool->cg_node);
>> +     INIT_LIST_HEAD(&rpool->dev_node);
>> +     list_add_tail(&rpool->cg_node, &cg->rpools);
>> +     list_add_tail(&rpool->dev_node, &device->rpools);
>> +     return rpool;
>> +}
>
> <...>
>
>> +     for (p = cg; p; p = parent_rdmacg(p)) {
>> +             rpool = get_cg_rpool_locked(p, device);
>> +             if (IS_ERR_OR_NULL(rpool)) {
>
> get_cg_rpool_locked always returns !NULL (error, or pointer)

Can this change go in as an incremental change after this patch, since
this patch is close to completion?
Or do I need to send a v13?

>
>> +                     ret = PTR_ERR(rpool);
>> +                     goto err;
>
> I didn't review the whole series yet.

Did you get a chance to review the series?
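
For readers following the quoted loop, which walks from a cgroup up
through its parents, here is a minimal userspace sketch of a
hierarchical try-charge with rollback. It is an illustration under
stated assumptions, not the patch's implementation: struct cg_node,
try_charge(), uncharge_upto() and the -EAGAIN return value are all
hypothetical choices made for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical cgroup node with a single counted resource. */
struct cg_node {
	struct cg_node *parent;	/* NULL for the root */
	int usage;
	int max;
};

/* Undo one charge on every node from cg up to, but not including, stop. */
static void uncharge_upto(struct cg_node *cg, struct cg_node *stop)
{
	for (struct cg_node *p = cg; p != stop; p = p->parent) {
		assert(p->usage > 0);
		p->usage--;
	}
}

/*
 * Charge one unit to cg and to every ancestor. If any level is already
 * at its limit, roll back the levels charged so far and fail.
 */
static int try_charge(struct cg_node *cg)
{
	for (struct cg_node *p = cg; p; p = p->parent) {
		if (p->usage >= p->max) {
			uncharge_upto(cg, p);
			return -EAGAIN;	/* error code is an assumption */
		}
		p->usage++;
	}
	return 0;
}
```

With a root limited to 1 and two children limited to 5 each, the first
child's charge succeeds and the second child's charge fails at the root
and leaves the second child's usage untouched.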

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-07 15:07     ` Parav Pandit
@ 2016-09-08  6:12       ` Leon Romanovsky
  2016-09-08 10:20           ` Parav Pandit
  0 siblings, 1 reply; 112+ messages in thread
From: Leon Romanovsky @ 2016-09-08  6:12 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups, linux-doc, Linux Kernel Mailing List, linux-rdma,
	Tejun Heo, Li Zefan, Johannes Weiner, Doug Ledford,
	Christoph Hellwig, Liran Liss, Hefty, Sean, Jason Gunthorpe,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Matan Barak, Andrew Morton, linux-security-module

[-- Attachment #1: Type: text/plain, Size: 1626 bytes --]

On Wed, Sep 07, 2016 at 08:37:23PM +0530, Parav Pandit wrote:
> Hi Leon,
>
> >> Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
> >> +static struct rdmacg_resource_pool *
> >> +get_cg_rpool_locked(struct rdma_cgroup *cg, struct rdmacg_device *device)
> >> +{
> >> +     struct rdmacg_resource_pool *rpool;
> >> +
> >> +     rpool = find_cg_rpool_locked(cg, device);
> >> +     if (rpool)
> >> +             return rpool;
> >> +
> >> +     rpool = kzalloc(sizeof(*rpool), GFP_KERNEL);
> >> +     if (!rpool)
> >> +             return ERR_PTR(-ENOMEM);
> >> +
> >> +     rpool->device = device;
> >> +     set_all_resource_max_limit(rpool);
> >> +
> >> +     INIT_LIST_HEAD(&rpool->cg_node);
> >> +     INIT_LIST_HEAD(&rpool->dev_node);
> >> +     list_add_tail(&rpool->cg_node, &cg->rpools);
> >> +     list_add_tail(&rpool->dev_node, &device->rpools);
> >> +     return rpool;
> >> +}
> >
> > <...>
> >
> >> +     for (p = cg; p; p = parent_rdmacg(p)) {
> >> +             rpool = get_cg_rpool_locked(p, device);
> >> +             if (IS_ERR_OR_NULL(rpool)) {
> >
> > get_cg_rpool_locked always returns !NULL (error, or pointer)
>
> Can this change go as incremental change after this patch, since this
> patch is close to completion?
> Or I need to revise v13?

Sure, it is a cleanup. It is not worth respinning.
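
As background for the cleanup being discussed, here is a minimal
userspace sketch of the error-pointer convention. The three helpers are
simplified stand-ins for the kernel's <linux/err.h> versions, and
struct pool, get_pool() and use_pool() are hypothetical names. It shows
that a function returning either a valid pointer or ERR_PTR() only
needs an IS_ERR() check, and that PTR_ERR(NULL) evaluates to 0, so a
NULL caught by an IS_ERR_OR_NULL() branch would produce ret == 0.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace stand-ins for the kernel's <linux/err.h> helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	/* Error pointers occupy the top MAX_ERRNO addresses. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct pool {
	int dummy;
};

/* Hypothetical allocator: valid pointer or ERR_PTR(-ENOMEM), never NULL. */
static struct pool *get_pool(int simulate_oom)
{
	struct pool *p = simulate_oom ? NULL : calloc(1, sizeof(*p));

	if (!p)
		return ERR_PTR(-ENOMEM);
	return p;
}

/* Caller: IS_ERR() alone is sufficient for this return contract. */
static int use_pool(int simulate_oom)
{
	struct pool *p = get_pool(simulate_oom);

	if (IS_ERR(p))
		return (int)PTR_ERR(p);
	assert(p != NULL);
	free(p);
	return 0;
}
```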

>
> >
> >> +                     ret = PTR_ERR(rpool);
> >> +                     goto err;
> >
> > I didn't review the whole series yet.
>
> Did you get a chance to review the series?

We need to decide on a fundamental question before reviewing it, which
is "how to fit rdmacg to the new ABI model".

Thanks

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-08  6:12       ` Leon Romanovsky
@ 2016-09-08 10:20           ` Parav Pandit
  0 siblings, 0 replies; 112+ messages in thread
From: Parav Pandit @ 2016-09-08 10:20 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: cgroups, linux-doc, Linux Kernel Mailing List, linux-rdma,
	Tejun Heo, Li Zefan, Johannes Weiner, Doug Ledford,
	Christoph Hellwig, Liran Liss, Hefty, Sean, Jason Gunthorpe,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Matan Barak, Andrew Morton, linux-security-module

On Thu, Sep 8, 2016 at 11:42 AM, Leon Romanovsky <leon@kernel.org> wrote:
> On Wed, Sep 07, 2016 at 08:37:23PM +0530, Parav Pandit wrote:
>> Did you get a chance to review the series?
>
> We need to decide on fundamental question before reviewing it, which is
> "how to fit rdmacg to new ABI model".

From the last discussion with Matan in this email thread, it appears
that the only broken cases are:
(a) HW vendor driver-specific resources (if they have a crazy big list),
which cannot be abstracted out well enough, won't be controlled by
rdma cgroup.
(b) Such resource objects are not well defined today with the new ABI model.
If such objects are well defined today, let's call them out and discuss
with Doug, Tejun, Christoph and the larger group whether they qualify for
inclusion or not.

rdma cgroup currently supports including a handful of HW resources that
can be abstracted (at least at the functionality level).

Please raise any other open issues, if any.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-07  7:55                 ` Parav Pandit
@ 2016-09-10 16:12                     ` Christoph Hellwig
  -1 siblings, 0 replies; 112+ messages in thread
From: Christoph Hellwig @ 2016-09-10 16:12 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Christoph Hellwig, Matan Barak, Tejun Heo, cgroups, linux-doc,
	Linux Kernel Mailing List, linux-rdma, Li Zefan, Johannes Weiner,
	Doug Ledford, Liran Liss, Hefty, Sean, Jason Gunthorpe,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

On Wed, Sep 07, 2016 at 01:25:13PM +0530, Parav Pandit wrote:
> >  a) delay cgroups support until the grand rewrite is done
> >  b) add it now and deal with the consequences later
> >
> Can we do (b) now and differ adding any HW resources to cgroup until
> they are clearly called out.
> Architecture and APIs are already in place to support this.

Sounds fine to me.  The only thing I want to avoid is pie in the
sky "future proofing" that leads to horrible architectures.  And I assume
that's what Matan proposed.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-07  8:51                   ` Matan Barak
@ 2016-09-10 16:14                       ` Christoph Hellwig
  -1 siblings, 0 replies; 112+ messages in thread
From: Christoph Hellwig @ 2016-09-10 16:14 UTC (permalink / raw)
  To: Matan Barak
  Cc: Parav Pandit, Christoph Hellwig, Tejun Heo, cgroups, linux-doc,
	Linux Kernel Mailing List, linux-rdma, Li Zefan, Johannes Weiner,
	Doug Ledford, Liran Liss, Hefty, Sean, Jason Gunthorpe,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

On Wed, Sep 07, 2016 at 11:51:42AM +0300, Matan Barak wrote:
> All recent proposals of the new ABI schema deals with extending the 
> flexibility of the current schema by letting drivers define their specific 
> types, actions, attributes, etc. Even more than that, the dispatching 
> starts from the driver and it chooses if it wants to use the common RDMA 
> core layer or have it's own wise implementation instead.
> Some drivers might even prefer not to implement the current verbs types.
> These decisions were made in the OFVWG meetings.

OFVWG meetings have absolutely zero relevance for Linux development.
More "flexibility" for drivers just means giving up on designing a
coherent API and leaving it to driver authors to add crap to their
own drivers.  That's a major step backwards.

> Sounds reasonable, but what about drivers which ignore the common code and 
> implement it in their own way? What about drivers which don't support the 
> standard RDMA types at all?

They should not be using the code in drivers/infiniband.  usnic is such
an example of a driver that should never have been added in its current
form.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-10 16:14                       ` Christoph Hellwig
  (?)
@ 2016-09-10 17:01                       ` Jason Gunthorpe
       [not found]                         ` <20160910170151.GA5230-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2016-09-11 13:34                         ` Christoph Hellwig
  -1 siblings, 2 replies; 112+ messages in thread
From: Jason Gunthorpe @ 2016-09-10 17:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matan Barak, Parav Pandit, Tejun Heo, cgroups, linux-doc,
	Linux Kernel Mailing List, linux-rdma, Li Zefan, Johannes Weiner,
	Doug Ledford, Liran Liss, Hefty, Sean, Haggai Eran,
	Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

On Sat, Sep 10, 2016 at 06:14:42PM +0200, Christoph Hellwig wrote:
> OFVWG meetings have absolutely zero relevance for Linux development.

Well, to be fair there are a fair number of kernel developers on that
particular call..

> More "flexibility" for drivers just means giving up on designing a
> coherent API and leaving it to drivers authors to add crap to their
> own drivers.  That's a major step backwards.

Sadly, it isn't a step backwards, it is status quo - at least as far
as the uapi is concerned.

Every single user space driver has its own private abi file, carefully
hidden in their driver, and dutifully copied over to user space:

providers/cxgb3/iwch-abi.h
providers/cxgb4/cxgb4-abi.h
providers/hfi1verbs/hfi-abi.h
providers/i40iw/i40iw-abi.h
providers/ipathverbs/ipath-abi.h
providers/mlx4/mlx4-abi.h
providers/mlx5/mlx5-abi.h
providers/mthca/mthca-abi.h
providers/nes/nes-abi.h
providers/ocrdma/ocrdma_abi.h
providers/rxe/rxe-abi.h

Just to pick two random examples:

struct mlx5_create_cq {
        struct ibv_create_cq            ibv_cmd;
        __u64                           buf_addr;
        __u64                           db_addr;
        __u32				cqe_size;
};

struct iwch_create_cq {
        struct ibv_create_cq ibv_cmd;
        uint64_t user_rptr_addr;
};

Love to hear ideas on a way forward that doesn't involve rewriting
everything :(
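
The layout convention visible in the two structs above (a common
command first, driver-private fields after it) can be sketched in
userspace as follows. All type and function names here are hypothetical
stand-ins, not the real uverbs or provider definitions. The point of
the sketch is that a core layer can interpret only the common prefix,
while the trailing bytes are opaque to it and meaningful only to the
driver.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical common command understood by the core layer. */
struct common_create_cq {
	uint64_t user_handle;
	uint32_t cqe;
	uint32_t reserved;
};

/* Two hypothetical drivers, each appending its own private fields. */
struct drv_a_create_cq {
	struct common_create_cq cmd;
	uint64_t buf_addr;
	uint64_t db_addr;
	uint32_t cqe_size;
	uint32_t pad;
};

struct drv_b_create_cq {
	struct common_create_cq cmd;
	uint64_t user_rptr_addr;
};

/*
 * Core-side parse: copy out the common prefix and report where the
 * driver-private payload starts and how long it is. The core cannot
 * interpret that payload; only the owning driver can.
 */
static int core_parse(const void *buf, size_t len,
		      struct common_create_cq *out,
		      const void **drv, size_t *drv_len)
{
	assert(buf && out && drv && drv_len);
	if (len < sizeof(*out))
		return -EINVAL;
	memcpy(out, buf, sizeof(*out));
	*drv = (const char *)buf + sizeof(*out);
	*drv_len = len - sizeof(*out);
	return 0;
}
```

Because the common struct is the first member, the same core_parse()
works for either driver's command, but any accounting of what the
private fields create has to happen in the driver itself.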

> They should not be using the code in drivers/infiniband.  usnic is such
> an example of a driver that should never have been added in it's current
> form.

+1

Jason

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-10 16:12                     ` Christoph Hellwig
@ 2016-09-11  7:40                         ` Matan Barak
  -1 siblings, 0 replies; 112+ messages in thread
From: Matan Barak @ 2016-09-11  7:40 UTC (permalink / raw)
  To: Christoph Hellwig, Parav Pandit
  Cc: Tejun Heo, cgroups, linux-doc, Linux Kernel Mailing List,
	linux-rdma, Li Zefan, Johannes Weiner, Doug Ledford, Liran Liss,
	Hefty, Sean, Jason Gunthorpe, Haggai Eran, Jonathan Corbet,
	james.l.morris, serge, Or Gerlitz, Andrew Morton,
	linux-security-module

On 10/09/2016 19:12, Christoph Hellwig wrote:
> On Wed, Sep 07, 2016 at 01:25:13PM +0530, Parav Pandit wrote:
>>>  a) delay cgroups support until the grand rewrite is done
>>>  b) add it now and deal with the consequences later
>>>
>> Can we do (b) now and differ adding any HW resources to cgroup until
>> they are clearly called out.
>> Architecture and APIs are already in place to support this.
>
> Sounds fine to me.  The only thing I want to avoid is pie in the
> sky "future proofing" that leads to horrible architectures.  And I assume
> that's what Matan proposed.
>

No, that's not what I proposed. The user-kernel API/ABI should be
designed with driver differences in mind. The internal design or
internal APIs could ignore such things, as they can be changed later.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-10 17:01                       ` Jason Gunthorpe
@ 2016-09-11  8:07                             ` Matan Barak
  2016-09-11 13:34                         ` Christoph Hellwig
  1 sibling, 0 replies; 112+ messages in thread
From: Matan Barak @ 2016-09-11  8:07 UTC (permalink / raw)
  To: Jason Gunthorpe, Christoph Hellwig
  Cc: Parav Pandit, Tejun Heo, cgroups, linux-doc,
	Linux Kernel Mailing List, linux-rdma, Li Zefan, Johannes Weiner,
	Doug Ledford, Liran Liss, Hefty, Sean, Haggai Eran,
	Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

On 10/09/2016 20:01, Jason Gunthorpe wrote:
> On Sat, Sep 10, 2016 at 06:14:42PM +0200, Christoph Hellwig wrote:
>> OFVWG meetings have absolutely zero relevance for Linux development.
>
> Well, to be fair there are a fair number of kernel developers on that
> particular call..
>
>> More "flexibility" for drivers just means giving up on designing a
>> coherent API and leaving it to drivers authors to add crap to their
>> own drivers.  That's a major step backwards.
>
> Sadly, it isn't a step backwards, it is status quo - at least as far
> as the uapi is concerned.
>
> Every single user space driver has its own private abi file, carefully
> hidden in their driver, and dutifully copied over to user space:
>
> providers/cxgb3/iwch-abi.h
> providers/cxgb4/cxgb4-abi.h
> providers/hfi1verbs/hfi-abi.h
> providers/i40iw/i40iw-abi.h
> providers/ipathverbs/ipath-abi.h
> providers/mlx4/mlx4-abi.h
> providers/mlx5/mlx5-abi.h
> providers/mthca/mthca-abi.h
> providers/nes/nes-abi.h
> providers/ocrdma/ocrdma_abi.h
> providers/rxe/rxe-abi.h
>
> Just to pick two random examples:
>
> struct mlx5_create_cq {
>         struct ibv_create_cq            ibv_cmd;
>         __u64                           buf_addr;
>         __u64                           db_addr;
>         __u32				cqe_size;
> };
>
> struct iwch_create_cq {
>         struct ibv_create_cq ibv_cmd;
>         uint64_t user_rptr_addr;
> };
>
> Love to hear ideas on a way forward that doesn't involve rewriting
> everything :(
>

Yeah, unfortunately, the RDMA ABI is more a driver-specific ABI than a
common user-kernel ABI. I guess this will become even worse, as the RDMA
subsystem is evolving to serve more drivers with different object types.
For example, I would like to hear how hfi1 is going to define its
user-kernel ABI (once it leaves the custom ioctls).

>> They should not be using the code in drivers/infiniband.  usnic is such
>> an example of a driver that should never have been added in its current
>> form.
>
> +1
>
> Jason
>

Matan

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-10 17:01                       ` Jason Gunthorpe
       [not found]                         ` <20160910170151.GA5230-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2016-09-11 13:34                         ` Christoph Hellwig
  2016-09-11 14:35                           ` Leon Romanovsky
  1 sibling, 1 reply; 112+ messages in thread
From: Christoph Hellwig @ 2016-09-11 13:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Matan Barak, Parav Pandit, Tejun Heo, cgroups,
	linux-doc, Linux Kernel Mailing List, linux-rdma, Li Zefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

On Sat, Sep 10, 2016 at 11:01:51AM -0600, Jason Gunthorpe wrote:
> Sadly, it isn't a step backwards, it is status quo - at least as far
> as the uapi is concerned.

Sort of, see below:

> struct mlx5_create_cq {
>         struct ibv_create_cq            ibv_cmd;
>         __u64                           buf_addr;
>         __u64                           db_addr;
>         __u32				cqe_size;
> };
> 
> struct iwch_create_cq {
>         struct ibv_create_cq ibv_cmd;
>         uint64_t user_rptr_addr;
> };
> 
> Love to hear ideas on a way forward that doesn't involve rewriting
> everything :(

We still always have the common structure first.  And at least for
cgroups support that's what matters.
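
A minimal sketch of why the common-first layout is enough for generic
code. The struct and function names below are stand-ins chosen for
illustration (the real struct ibv_create_cq has a different field list);
only the "common part at offset 0, driver tail after it" shape is taken
from the examples quoted above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the common create-CQ command. */
struct common_create_cq {
	uint64_t user_handle;
	uint32_t cqe;
	uint32_t comp_vector;
};

/* Driver-private commands, shaped like the two quoted examples. */
struct mlx5_like_create_cq {
	struct common_create_cq common;
	uint64_t buf_addr;
	uint64_t db_addr;
	uint32_t cqe_size;
};

struct iwch_like_create_cq {
	struct common_create_cq common;
	uint64_t user_rptr_addr;
};

/* The common part must sit at offset 0 in every driver command. */
static_assert(offsetof(struct mlx5_like_create_cq, common) == 0,
	      "common part must come first");
static_assert(offsetof(struct iwch_like_create_cq, common) == 0,
	      "common part must come first");

/* Generic code reads only the common prefix and never looks at the
 * driver-specific tail, so it works for any driver's command. */
static uint32_t requested_cqe(const void *cmd)
{
	const struct common_create_cq *c = cmd;

	return c->cqe;
}
```

Anything that only needs the common fields, for example a generic
accounting hook, can be written once against the prefix and stays
independent of what each driver appends.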

Re the actual structures - we'll really need to make sure we

 a) expose proper userspace abi headers in the kernel for all code
    in the RDMA subsystem
 b) actually use that in the userspace components

I've posted some initial work toward a) a while ago, and once we
agree on adopting your common repo I'd really like to get started
with that work.  I think it's a prerequisite for any major new
userspace ABI work.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-11 13:34                         ` Christoph Hellwig
@ 2016-09-11 14:35                           ` Leon Romanovsky
  2016-09-11 17:14                             ` Jason Gunthorpe
  0 siblings, 1 reply; 112+ messages in thread
From: Leon Romanovsky @ 2016-09-11 14:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jason Gunthorpe, Matan Barak, Parav Pandit, Tejun Heo, cgroups,
	linux-doc, Linux Kernel Mailing List, linux-rdma, Li Zefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

On Sun, Sep 11, 2016 at 03:34:21PM +0200, Christoph Hellwig wrote:
> On Sat, Sep 10, 2016 at 11:01:51AM -0600, Jason Gunthorpe wrote:
> > Sadly, it isn't a step backwards, it is status quo - at least as far
> > as the uapi is concerned.
>
> Sort of, see below:
>
> > struct mlx5_create_cq {
> >         struct ibv_create_cq            ibv_cmd;
> >         __u64                           buf_addr;
> >         __u64                           db_addr;
> >         __u32				cqe_size;
> > };
> >
> > struct iwch_create_cq {
> >         struct ibv_create_cq ibv_cmd;
> >         uint64_t user_rptr_addr;
> > };
> >
> > Love to hear ideas on a way forward that doesn't involve rewriting
> > everything :(
>
> We still always have the common structure first.  And at least for
> cgroups support that's what matters.
>
> Re the actual structures - we'll really need to make sure we
>
>  a) expose proper userspace abi headers in the kernel for all code
>     in the RDMA subsystem
>  b) actually use that in the userspace components
>
> I've posted some initial work toward a) a while ago, and once we
> agree on adopting your common repo I'd really like to start through
> with that work.  I think it's a pre-requisite for any major new
> userspace ABI work.

I started working on it over the weekend, and it is worth not doing the same work twice.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-11 14:35                           ` Leon Romanovsky
@ 2016-09-11 17:14                             ` Jason Gunthorpe
  2016-09-11 17:24                               ` Christoph Hellwig
  0 siblings, 1 reply; 112+ messages in thread
From: Jason Gunthorpe @ 2016-09-11 17:14 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Christoph Hellwig, Matan Barak, Parav Pandit, Tejun Heo, cgroups,
	linux-doc, Linux Kernel Mailing List, linux-rdma, Li Zefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

On Sun, Sep 11, 2016 at 05:35:22PM +0300, Leon Romanovsky wrote:

> > We still always have the common structure first.  And at least for
> > cgroups support that's what matters.
> >
> > Re the actual structures - we'll really need to make sure we
> >
> >  a) expose proper userspace abi headers in the kernel for all code
> >     in the RDMA subsystem
> >  b) actually use that in the userspace components
> >
> > I've posted some initial work toward a) a while ago, and once we

Did it get merged? Do you have a pointer?

> > agree on adopting your common repo I'd really like to start through
> > with that work.  I think it's a pre-requisite for any major new
> > userspace ABI work.
> 
> I started working on it over the weekend, and it is worth not doing the same work twice.

Yes, I also agree that it is important before we tackle the uapi
conversion to get this fully sorted.

I've already done several cases working with the existing uapi headers:

https://github.com/jgunthorpe/rdma-plumbing/commit/f4f40689440dbc9c57b55548b04b15fe808a1767
https://github.com/jgunthorpe/rdma-plumbing/commit/0cf1893dce4791dafa035bcb6ee045a6ce0ff3c3
https://github.com/jgunthorpe/rdma-plumbing/commit/0522fc42aac4a5e8fc888dcca4341c9bc1dc58ca

[.. and this is a strong argument why we need the common repo, doing
this without it would be very hard, as everything is cross-linked, I
couldn't unwind libibcm until I fixed a bit of verbs, and rdmacm can't
even include its uapi header until the duplicate definitions in the
verbs copy are dealt with .. and I've also learned we are making
changes to the kernel uapi headers and since nothing uses them we never even
compile test :( :( eg
https://github.com/torvalds/linux/commit/b493d91d333e867a043f7ff1397bcba6e2d0dda2]

However, everything under verbs is not straightforward. The files in
userspace are not copies...

user:

struct ibv_query_device {
       __u32 command;
       __u16 in_words;
       __u16 out_words;
       __u64 response;
       __u64 driver_data[0];
};

kernel:

struct ib_uverbs_query_device {
        __u64 response;
        __u64 driver_data[0];
};

i.e. the userspace version embeds the command header in the struct and the
kernel version does not. Presumably this is for efficiency so that no
copies are required when marshaling. This impacts everything :(

I'm thinking the best way forward might be to use a script and
transform userspace into:

struct ibv_query_device {
	struct ib_uverbs_cmd_hdr hdr;
	struct ib_uverbs_query_device cmd;
};
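
To make the transformed layout concrete, here is a minimal sketch of how
a command could be filled in through the nested struct. The struct
definitions are local mirrors written for this example (real code would
take them from the kernel uapi header, and the trailing driver_data[]
array is left out), init_query_device() is a hypothetical helper, and
the assumption that in_words/out_words count 32-bit words should be
checked against the actual uverbs code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the kernel command header (assumed layout). */
struct ib_uverbs_cmd_hdr {
	uint32_t command;
	uint16_t in_words;
	uint16_t out_words;
};

/* Kernel-side payload; driver_data[] omitted to keep the sketch small. */
struct ib_uverbs_query_device {
	uint64_t response;
};

/* Proposed userspace form: header and kernel struct, nothing else. */
struct ibv_query_device {
	struct ib_uverbs_cmd_hdr hdr;
	struct ib_uverbs_query_device cmd;
};

/* Hypothetical helper: fill the header and payload in one place.
 * Sizes are expressed in 4-byte words (an assumption, see above). */
static void init_query_device(struct ibv_query_device *q, uint32_t opcode,
			      void *resp, size_t resp_size)
{
	q->hdr.command = opcode;
	q->hdr.in_words = (uint16_t)(sizeof(*q) / 4);
	q->hdr.out_words = (uint16_t)(resp_size / 4);
	q->cmd.response = (uintptr_t)resp;
}
```

With this shape the header handling is written once against
struct ib_uverbs_cmd_hdr, and each command only adds its kernel-defined
payload.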

Jason

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-11 17:14                             ` Jason Gunthorpe
@ 2016-09-11 17:24                               ` Christoph Hellwig
  2016-09-11 17:52                                 ` Jason Gunthorpe
  0 siblings, 1 reply; 112+ messages in thread
From: Christoph Hellwig @ 2016-09-11 17:24 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Christoph Hellwig, Matan Barak, Parav Pandit,
	Tejun Heo, cgroups, linux-doc, Linux Kernel Mailing List,
	linux-rdma, Li Zefan, Johannes Weiner, Doug Ledford, Liran Liss,
	Hefty, Sean, Haggai Eran, Jonathan Corbet, james.l.morris, serge,
	Or Gerlitz, Andrew Morton, linux-security-module

On Sun, Sep 11, 2016 at 11:14:09AM -0600, Jason Gunthorpe wrote:
> > > We still always have the common structure first.  And at least for
> > > cgroups support that's what matters.
> > >
> > > Re the actual structures - we'll really need to make sure we
> > >
> > >  a) expose proper userspace abi headers in the kernel for all code
> > >     in the RDMA subsystem
> > >  b) actually use that in the userspace components
> > >
> > > I've posted some initial work toward a) a while ago, and once we
> 
> Did it get merged? Do you have a pointer?

http://www.spinics.net/lists/linux-rdma/msg31958.html

> this without it would be very hard, as everything is cross-linked, I
> couldn't unwind libibcm until I fixed a bit of verbs, and rdmacm can't
> even include its uapi header until the duplicate definitions in the
> verbs copy are dealt with .. and I've also learned we are making
> changes to the kernel uapi headers and since nothing uses them we never even
> compile test :( :( eg
> https://github.com/torvalds/linux/commit/b493d91d333e867a043f7ff1397bcba6e2d0dda2]

> However, everything under verbs is not straightforward. The files in
> userspace are not copies...
> 
> user:
> 
> struct ibv_query_device {
>        __u32 command;
>        __u16 in_words;
>        __u16 out_words;
>        __u64 response;
>        __u64 driver_data[0];
> };
> 
> kernel:
> 
> struct ib_uverbs_query_device {
>         __u64 response;
>         __u64 driver_data[0];
> };

We'll obviously need different structures for the libibverbs API
and the kernel interface in this case, and we'll need to figure out
how to properly translate them.  I think a cast, plus compile-time
type checking à la BUILD_BUG_ON, is the way to go.
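
A minimal sketch of what "cast plus compile-time check" could look like
in userspace. The BUILD_BUG_ON() macro here is a stand-in built on C11
_Static_assert (the kernel has its own definition), the structs are
local mirrors without the trailing driver_data[] array, and
kernel_payload() is a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the kernel macro: the build breaks when the
 * condition is TRUE. */
#define BUILD_BUG_ON(cond) _Static_assert(!(cond), "BUILD_BUG_ON: " #cond)

struct ib_uverbs_cmd_hdr {
	uint32_t command;
	uint16_t in_words;
	uint16_t out_words;
};

struct ib_uverbs_query_device {
	uint64_t response;
};

/* Flat libibverbs-style struct with the header fields folded in. */
struct ibv_query_device {
	uint32_t command;
	uint16_t in_words;
	uint16_t out_words;
	uint64_t response;
};

/* The cast below is only valid while these layout facts hold. */
BUILD_BUG_ON(sizeof(struct ibv_query_device) !=
	     sizeof(struct ib_uverbs_cmd_hdr) +
	     sizeof(struct ib_uverbs_query_device));
BUILD_BUG_ON(offsetof(struct ibv_query_device, response) !=
	     sizeof(struct ib_uverbs_cmd_hdr));

/* View the part after the header as the kernel struct. Strict-aliasing
 * caveats apply in userspace; the kernel builds with
 * -fno-strict-aliasing. */
static struct ib_uverbs_query_device *
kernel_payload(struct ibv_query_device *flat)
{
	return (struct ib_uverbs_query_device *)
		((char *)flat + sizeof(struct ib_uverbs_cmd_hdr));
}
```

If either side of the ABI drifts, the build fails at the BUILD_BUG_ON()
lines instead of the cast silently reading the wrong bytes.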

> eg the userspace version stuffs the header into the struct and the
> kernel version does not. Presumably this is for efficiency so that no
> copies are required when marshaling. This impacts everything :(
> 
> I'm thinking the best way forward might be to use a script and
> transform userspace into:
> 
> struct ibv_query_device {
> 	struct ib_uverbs_cmd_hdr hdr;
> 	struct ib_uverbs_query_device cmd;
> };

That would break the users of the interface.  However automatically
generating the user ABI from the kernel one might still be a good idea
in the long run.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-11 17:24                               ` Christoph Hellwig
@ 2016-09-11 17:52                                 ` Jason Gunthorpe
       [not found]                                   ` <20160911175235.GB13442-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 112+ messages in thread
From: Jason Gunthorpe @ 2016-09-11 17:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Leon Romanovsky, Matan Barak, Parav Pandit, Tejun Heo, cgroups,
	linux-doc, Linux Kernel Mailing List, linux-rdma, Li Zefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

On Sun, Sep 11, 2016 at 07:24:45PM +0200, Christoph Hellwig wrote:
> > > > I've posted some initial work toward a) a while ago, and once we
> > 
> > Did it get merged? Do you have a pointer?
> 
> http://www.spinics.net/lists/linux-rdma/msg31958.html

Right, I remember that. Certainly the right direction

> > However, everything under verbs is not straightforward. The files in
> > userspace are not copies...
> > 
> > user:
> > 
> > struct ibv_query_device {
> >        __u32 command;
> >        __u16 in_words;
> >        __u16 out_words;
> >        __u64 response;
> >        __u64 driver_data[0];
> > };
> > 
> > kernel:
> > 
> > struct ib_uverbs_query_device {
> >         __u64 response;
> >         __u64 driver_data[0];
> > };
> 
> > We'll obviously need different structures for the libibverbs API
> > and the kernel interface in this case, and we'll need to figure out
> > how to properly translate them.  I think a cast, plus compile-time
> > type checking à la BUILD_BUG_ON, is the way to go.

I'm not sure I follow, which would I cast?

BUILD_BUG_ON(sizeof(struct ibv_query_device) !=
             sizeof(struct ib_uverbs_cmd_hdr) +
             sizeof(struct ib_uverbs_query_device))

?

> > I'm thinking the best way forward might be to use a script and
> > transform userspace into:
> > 
> > struct ibv_query_device {
> > 	struct ib_uverbs_cmd_hdr hdr;
> > 	struct ib_uverbs_query_device cmd;
> > };
> 
> That would break the users of the interface.

Sorry, I mean doing this inside rdma-plumbing. Since the change is ABI
identical the modified libibverbs would still be binary compatible
with all providers but not source compatible. Since all kernel
supported providers are in rdma-plumbing we can add the '.cmd.' at the
same time.
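
A small sketch of the "binary compatible but not source compatible"
point. Both struct definitions are local mirrors written for this
example (driver_data[] omitted, names suffixed _flat/_nested so they can
coexist), and same_wire_image() is a hypothetical helper:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct ib_uverbs_cmd_hdr {
	uint32_t command;
	uint16_t in_words;
	uint16_t out_words;
};

struct ib_uverbs_query_device {
	uint64_t response;
};

/* Current libibverbs shape. */
struct ibv_query_device_flat {
	uint32_t command;
	uint16_t in_words;
	uint16_t out_words;
	uint64_t response;
};

/* Shape after the scripted transformation. */
struct ibv_query_device_nested {
	struct ib_uverbs_cmd_hdr hdr;
	struct ib_uverbs_query_device cmd;
};

/* Same size and same field offsets: the bytes handed to the kernel do
 * not change. */
static_assert(sizeof(struct ibv_query_device_flat) ==
	      sizeof(struct ibv_query_device_nested), "size changed");
static_assert(offsetof(struct ibv_query_device_flat, response) ==
	      offsetof(struct ibv_query_device_nested, cmd.response),
	      "payload offset changed");

/* Only the spelling of the access changes: .response becomes
 * .cmd.response, .command becomes .hdr.command. */
static int same_wire_image(uint32_t command, uint64_t response)
{
	struct ibv_query_device_flat old_cmd;
	struct ibv_query_device_nested new_cmd;

	memset(&old_cmd, 0, sizeof(old_cmd));
	memset(&new_cmd, 0, sizeof(new_cmd));
	old_cmd.command = command;
	old_cmd.response = response;
	new_cmd.hdr.command = command;
	new_cmd.cmd.response = response;

	return memcmp(&old_cmd, &new_cmd, sizeof(old_cmd)) == 0;
}
```

Providers compiled against the old header keep working unchanged, while
providers rebuilt against the new one need the '.hdr.'/'.cmd.' edits.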

The kernel uapi header would stay the same.

> However automatically generating the user ABI from the kernel one
> might still be a good idea in the long run.

My preference would be to try and use the kernel headers directly.

Jason

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-11 17:52                                 ` Jason Gunthorpe
@ 2016-09-12  5:07                                       ` Leon Romanovsky
  0 siblings, 0 replies; 112+ messages in thread
From: Leon Romanovsky @ 2016-09-12  5:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Matan Barak, Parav Pandit, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	Linux Kernel Mailing List, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Li Zefan, Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Haggai Eran, Jonathan Corbet,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, Or Gerlitz, Andrew Morton,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA

On Sun, Sep 11, 2016 at 11:52:35AM -0600, Jason Gunthorpe wrote:
> On Sun, Sep 11, 2016 at 07:24:45PM +0200, Christoph Hellwig wrote:
> > > > > I've posted some initial work toward a) a while ago, and once we
> > >
> > > Did it get merged? Do you have a pointer?
> >
> > http://www.spinics.net/lists/linux-rdma/msg31958.html
>
> Right, I remember that. Certainly the right direction
>
> > > However, everything under verbs is not straightforward. The files in
> > > userspace are not copies...
> > >
> > > user:
> > >
> > > struct ibv_query_device {
> > >        __u32 command;
> > >        __u16 in_words;
> > >        __u16 out_words;
> > >        __u64 response;
> > >        __u64 driver_data[0];
> > > };
> > >
> > > kernel:
> > >
> > > struct ib_uverbs_query_device {
> > >         __u64 response;
> > >         __u64 driver_data[0];
> > > };
> >
> > > We'll obviously need different structures for the libibverbs API
> > > and the kernel interface in this case, and we'll need to figure out
> > > how to properly translate them.  I think a cast, plus compile-time
> > > type checking à la BUILD_BUG_ON, is the way to go.
>
> I'm not sure I follow, which would I cast?
>
> BUILD_BUG_ON(sizeof(struct ibv_query_device) !=
>              sizeof(struct ib_uverbs_cmd_hdr) +
>              sizeof(struct ib_uverbs_query_device))
>
> ?
>
> > > I'm thinking the best way forward might be to use a script and
> > > transform userspace into:
> > >
> > > struct ibv_query_device {
> > > 	struct ib_uverbs_cmd_hdr hdr;
> > > 	struct ib_uverbs_query_device cmd;
> > > };
> >
> > That would break the users of the interface.
>
> Sorry, I mean doing this inside rdma-plumbing. Since the change is ABI
> identical the modified libibverbs would still be binary compatible
> with all providers but not source compatible. Since all kernel
> supported providers are in rdma-plumbing we can add the '.cmd.' at the
> same time.
>
> The kernel uapi header would stay the same.
>
> > However automatically generating the user ABI from the kernel one
> > might still be a good idea in the long run.
>
> My preference would be to try and use the kernel headers directly.

I thought the same, especially after realizing that they are almost
copy/paste from the vendor *-abi.h files.

>
> Jason

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-12  5:07                                       ` Leon Romanovsky
@ 2016-09-14  7:06                                           ` Parav Pandit
  -1 siblings, 0 replies; 112+ messages in thread
From: Parav Pandit @ 2016-09-14  7:06 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, Christoph Hellwig, Matan Barak, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	Linux Kernel Mailing List, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Li Zefan, Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Haggai Eran, Jonathan Corbet,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, Or Gerlitz, Andrew Morton,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA

Hi Dennis,

Do you know how the HFI1 driver would work along with the rdma cgroup?

Hi Matan, Leon, Jason,
Apart from HFI1, is there any other concern?
Or is the patch good to go?

The 4.8 dates are close (2 weeks) and there are two git trees involved
(which might cause merge errors for Linus), so if there are no issues, I
would like to ask Doug to consider it for 4.8 early on.

Parav

On Mon, Sep 12, 2016 at 10:37 AM, Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> On Sun, Sep 11, 2016 at 11:52:35AM -0600, Jason Gunthorpe wrote:
>> On Sun, Sep 11, 2016 at 07:24:45PM +0200, Christoph Hellwig wrote:
>> > > > > I've posted some initial work toward a) a while ago, and once we
>> > >
>> > > Did it get merged? Do you have a pointer?
>> >
>> > http://www.spinics.net/lists/linux-rdma/msg31958.html
>>
>> Right, I remember that. Certainly the right direction
>>
>> > > However, everything under verbs is not straightforward. The files in
>> > > userspace are not copies...
>> > >
>> > > user:
>> > >
>> > > struct ibv_query_device {
>> > >        __u32 command;
>> > >        __u16 in_words;
>> > >        __u16 out_words;
>> > >        __u64 response;
>> > >        __u64 driver_data[0];
>> > > };
>> > >
>> > > kernel:
>> > >
>> > > struct ib_uverbs_query_device {
>> > >         __u64 response;
>> > >         __u64 driver_data[0];
>> > > };
>> >
>> > We'll obviously need different structures for the libibverbs API
>> > and the kernel interface in this case, and we'll need to figure out
>> > how to properly translate them.  I think a cast, plus compile-time
>> > type checking à la BUILD_BUG_ON, is the way to go.
>>
>> I'm not sure I follow, which would I cast?
>>
>> BUILD_BUG_ON(sizeof(struct ibv_query_device) !=
>>              sizeof(struct ib_uverbs_cmd_hdr) +
>>              sizeof(struct ib_uverbs_query_device))
>>
>> ?
>>
>> > > I'm thinking the best way forward might be to use a script and
>> > > transform userspace into:
>> > >
>> > > struct ibv_query_device {
>> > >   struct ib_uverbs_cmd_hdr hdr;
>> > >   struct ib_uverbs_query_device cmd;
>> > > };
>> >
>> > That would break the users of the interface.
>>
>> Sorry, I mean doing this inside rdma-plumbing. Since the change is ABI
>> identical the modified libibverbs would still be binary compatible
>> with all providers but not source compatible. Since all kernel
>> supported providers are in rdma-plumbing we can add the '.cmd.' at the
>> same time.
>>
>> The kernel uapi header would stay the same.
>>
>> > However automatically generating the user ABI from the kernel one
>> > might still be a good idea in the long run.
>>
>> My preference would be to try and use the kernel headers directly.
>
> I thought the same, especially after realizing that they are almost
> copy/paste from the vendor *-abi.h files.
>
>>
>> Jason

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-14  7:06                                           ` Parav Pandit
@ 2016-09-14  8:14                                             ` Matan Barak
  -1 siblings, 0 replies; 112+ messages in thread
From: Matan Barak @ 2016-09-14  8:14 UTC (permalink / raw)
  To: Parav Pandit, Leon Romanovsky
  Cc: Jason Gunthorpe, Christoph Hellwig, Tejun Heo, cgroups,
	linux-doc, Linux Kernel Mailing List, linux-rdma, Li Zefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

On 14/09/2016 10:06, Parav Pandit wrote:
> Hi Dennis,
>
> Do you know how the HFI1 driver would work along with rdma cgroup?
>
> Hi Matan, Leon, Jason,
> Apart from HFI1, is there any other concern?

I just wonder how things like RSS will work. For example, an RSS QP
doesn't really have a queue (if I recall, it's connected to work queues
via an indirection table). So, when a user creates such a QP, do you
want to account it as a regular QP?
How are work queues accounted?


> Or Patch is good to go?
>
> 4.8 dates are close by (2 weeks) and there are two git trees involved
> (that might cause merge error to Linus) so if there are no issues, I
> would like to make request to Doug to consider it for 4.8 early on.
>
> Parav
>
> On Mon, Sep 12, 2016 at 10:37 AM, Leon Romanovsky <leon@kernel.org> wrote:
>> On Sun, Sep 11, 2016 at 11:52:35AM -0600, Jason Gunthorpe wrote:
>>> On Sun, Sep 11, 2016 at 07:24:45PM +0200, Christoph Hellwig wrote:
>>>>>>> I've posted some initial work toward a) a while ago, and once we
>>>>>
>>>>> Did it get merged? Do you have a pointer?
>>>>
>>>> http://www.spinics.net/lists/linux-rdma/msg31958.html
>>>
>>> Right, I remember that. Certainly the right direction
>>>
>>>>> However, everything under verbs is not straightforward. The files in
>>>>> userspace are not copies...
>>>>>
>>>>> user:
>>>>>
>>>>> struct ibv_query_device {
>>>>>        __u32 command;
>>>>>        __u16 in_words;
>>>>>        __u16 out_words;
>>>>>        __u64 response;
>>>>>        __u64 driver_data[0];
>>>>> };
>>>>>
>>>>> kernel:
>>>>>
>>>>> struct ib_uverbs_query_device {
>>>>>         __u64 response;
>>>>>         __u64 driver_data[0];
>>>>> };
>>>>
>>>> We'll obviously need different structures for the libibverbs API
>>>> and the kernel interface in this case, and we'll need to figure out
>>>> how to properly translate them.  I think a cast, plus compile time
>>>> type checking ala BUILD_BUG_ON is the way to go.
>>>
>>> I'm not sure I follow, which would I cast?
>>>
>>> BUILD_BUG_ON(sizeof(ibv_query_device) == sizeof(ib_uverbs_cmd_hdr) +
>>>              sizeof(ib_uverbs_query_device))
>>>
>>> ?
>>>
>>>>> I'm thinking the best way forward might be to use a script and
>>>>> transform userspace into:
>>>>>
>>>>> struct ibv_query_device {
>>>>>   struct ib_uverbs_cmd_hdr hdr;
>>>>>   struct ib_uverbs_query_device cmd;
>>>>> };
>>>>
>>>> That would break the users of the interface.
>>>
>>> Sorry, I mean doing this inside rdma-plumbing. Since the change is ABI
>>> identical the modified libibverbs would still be binary compatible
>>> with all providers but not source compatible. Since all kernel
>>> supported providers are in rdma-plumbing we can add the '.cmd.' at the
>>> same time.
>>>
>>> The kernel uapi header would stay the same.
>>>
>>>> However automatically generating the user ABI from the kernel one
>>>> might still be a good idea in the long run.
>>>
>>> My preference would be to try and use the kernel headers directly.
>>
>> I thought the same, especially after realizing that they are almost
>> copy/paste from the vendor *-abi.h files.
>>
>>>
>>> Jason


* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-14  8:14                                             ` Matan Barak
@ 2016-09-14  9:19                                                 ` Parav Pandit
  -1 siblings, 0 replies; 112+ messages in thread
From: Parav Pandit @ 2016-09-14  9:19 UTC (permalink / raw)
  To: Matan Barak
  Cc: Leon Romanovsky, Jason Gunthorpe, Christoph Hellwig, Tejun Heo,
	cgroups, linux-doc, Linux Kernel Mailing List, linux-rdma,
	Li Zefan, Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

Hi Matan,

On Wed, Sep 14, 2016 at 1:44 PM, Matan Barak <matanb@mellanox.com> wrote:
> On 14/09/2016 10:06, Parav Pandit wrote:
>>
>> Hi Dennis,
>>
>> Do you know how the HFI1 driver would work along with rdma cgroup?
>>
>> Hi Matan, Leon, Jason,
>> Apart from HFI1, is there any other concern?
>
>
> I just wonder how things like RSS will work. For example, a RSS QP doesn't
> really have a queue (if I recall, it's connected to work queues via an
> indirection table). So, when a user creates such a QP, do you want to
> account it as a regular QP?
> How are work queues accounted?

The ib_create_rwq_ind_table verb allows creating an indirection table.
I assume it allows creating multiple such tables.
If so, then the number of tables should be a cgroup resource that we
can add in a follow-on patch.
That way, one container cannot take away all the tables.
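
To make the idea concrete, here is a rough sketch of how a per-cgroup
count of indirection tables could be charged hierarchically, so that a
limit on any ancestor also caps its children. Everything in it
(sketch_cg, sketch_try_charge, sketch_uncharge, SKETCH_RES_IND_TABLE)
is a made-up name for illustration; it is not the API of the posted
patch, and the real controller also tracks usage per rdma device.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical resource kinds; IND_TABLE is the follow-on idea above. */
enum sketch_res {
	SKETCH_RES_QP,
	SKETCH_RES_IND_TABLE,
	SKETCH_RES_MAX
};

/* Hypothetical cgroup node with per-resource usage and limit counters. */
struct sketch_cg {
	struct sketch_cg *parent;
	int usage[SKETCH_RES_MAX];
	int limit[SKETCH_RES_MAX];
};

/* Charge one unit to cg and to every ancestor. If any level is already
 * at its limit, undo the partial charges and return -1; else return 0. */
int sketch_try_charge(struct sketch_cg *cg, enum sketch_res res)
{
	struct sketch_cg *p, *q;

	for (p = cg; p != NULL; p = p->parent) {
		if (p->usage[res] >= p->limit[res])
			goto undo;
		p->usage[res]++;
	}
	return 0;

undo:
	/* Only the levels below the failing one were incremented. */
	for (q = cg; q != p; q = q->parent)
		q->usage[res]--;
	return -1;
}

/* Release one unit from cg and from every ancestor. */
void sketch_uncharge(struct sketch_cg *cg, enum sketch_res res)
{
	for (; cg != NULL; cg = cg->parent) {
		assert(cg->usage[res] > 0);
		cg->usage[res]--;
	}
}
```

In this sketch a create-table path would call sketch_try_charge()
before allocating and sketch_uncharge() when the table is destroyed, so
two containers under a parent limited to two tables cannot hold more
than two between them.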

>
>
>> Or Patch is good to go?
>>
>> 4.8 dates are close by (2 weeks) and there are two git trees involved
>> (that might cause merge error to Linus) so if there are no issues, I
>> would like to make request to Doug to consider it for 4.8 early on.
>>
>> Parav
>>
>> On Mon, Sep 12, 2016 at 10:37 AM, Leon Romanovsky <leon@kernel.org> wrote:
>>>
>>> On Sun, Sep 11, 2016 at 11:52:35AM -0600, Jason Gunthorpe wrote:
>>>>
>>>> On Sun, Sep 11, 2016 at 07:24:45PM +0200, Christoph Hellwig wrote:
>>>>>>>>
>>>>>>>> I've posted some initial work toward a) a while ago, and once we
>>>>>>
>>>>>>
>>>>>> Did it get merged? Do you have a pointer?
>>>>>
>>>>>
>>>>> http://www.spinics.net/lists/linux-rdma/msg31958.html
>>>>
>>>>
>>>> Right, I remember that. Certainly the right direction
>>>>
>>>>>> However, everything under verbs is not straightforward. The files in
>>>>>> userspace are not copies...
>>>>>>
>>>>>> user:
>>>>>>
>>>>>> struct ibv_query_device {
>>>>>>        __u32 command;
>>>>>>        __u16 in_words;
>>>>>>        __u16 out_words;
>>>>>>        __u64 response;
>>>>>>        __u64 driver_data[0];
>>>>>> };
>>>>>>
>>>>>> kernel:
>>>>>>
>>>>>> struct ib_uverbs_query_device {
>>>>>>         __u64 response;
>>>>>>         __u64 driver_data[0];
>>>>>> };
>>>>>
>>>>>
>>>>> We'll obviously need different structures for the libibverbs API
>>>>> and the kernel interface in this case, and we'll need to figure out
>>>>> how to properly translate them.  I think a cast, plus compile time
>>>>> type checking ala BUILD_BUG_ON is the way to go.
>>>>
>>>>
>>>> I'm not sure I follow, which would I cast?
>>>>
>>>> BUILD_BUG_ON(sizeof(ibv_query_device) == sizeof(ib_uverbs_cmd_hdr) +
>>>>              sizeof(ib_uverbs_query_device))
>>>>
>>>> ?
>>>>
>>>>>> I'm thinking the best way forward might be to use a script and
>>>>>> transform userspace into:
>>>>>>
>>>>>> struct ibv_query_device {
>>>>>>   struct ib_uverbs_cmd_hdr hdr;
>>>>>>   struct ib_uverbs_query_device cmd;
>>>>>> };
>>>>>
>>>>>
>>>>> That would break the users of the interface.
>>>>
>>>>
>>>> Sorry, I mean doing this inside rdma-plumbing. Since the change is ABI
>>>> identical the modified libibverbs would still be binary compatible
>>>> with all providers but not source compatible. Since all kernel
>>>> supported providers are in rdma-plumbing we can add the '.cmd.' at the
>>>> same time.
>>>>
>>>> The kernel uapi header would stay the same.
>>>>
>>>>> However automatically generating the user ABI from the kernel one
>>>>> might still be a good idea in the long run.
>>>>
>>>>
>>>> My preference would be to try and use the kernel headers directly.
>>>
>>>
>>> I thought the same, especially after realizing that they are almost
>>> copy/paste from the vendor *-abi.h files.
>>>
>>>>
>>>> Jason
>
>

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-14  7:06                                           ` Parav Pandit
@ 2016-09-15 18:56                                               ` Leon Romanovsky
  -1 siblings, 0 replies; 112+ messages in thread
From: Leon Romanovsky @ 2016-09-15 18:56 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Gunthorpe, Christoph Hellwig, Matan Barak, Tejun Heo,
	cgroups, linux-doc, Linux Kernel Mailing List, linux-rdma,
	Li Zefan, Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

On Wed, Sep 14, 2016 at 12:36:19PM +0530, Parav Pandit wrote:
> Hi Dennis,
>
> Do you know how the HFI1 driver would work along with rdma cgroup?
>
> Hi Matan, Leon, Jason,
> Apart from HFI1, is there any other concern?
> Or Patch is good to go?

I didn't review it yet :(.
Sorry

>
> 4.8 dates are close by (2 weeks) and there are two git trees involved
> (that might cause merge error to Linus) so if there are no issues, I
> would like to make request to Doug to consider it for 4.8 early on.
>
> Parav
>
> On Mon, Sep 12, 2016 at 10:37 AM, Leon Romanovsky <leon@kernel.org> wrote:
> > On Sun, Sep 11, 2016 at 11:52:35AM -0600, Jason Gunthorpe wrote:
> >> On Sun, Sep 11, 2016 at 07:24:45PM +0200, Christoph Hellwig wrote:
> >> > > > > I've posted some initial work toward a) a while ago, and once we
> >> > >
> >> > > Did it get merged? Do you have a pointer?
> >> >
> >> > http://www.spinics.net/lists/linux-rdma/msg31958.html
> >>
> >> Right, I remember that. Certainly the right direction
> >>
> >> > > However, everything under verbs is not straightforward. The files in
> >> > > userspace are not copies...
> >> > >
> >> > > user:
> >> > >
> >> > > struct ibv_query_device {
> >> > >        __u32 command;
> >> > >        __u16 in_words;
> >> > >        __u16 out_words;
> >> > >        __u64 response;
> >> > >        __u64 driver_data[0];
> >> > > };
> >> > >
> >> > > kernel:
> >> > >
> >> > > struct ib_uverbs_query_device {
> >> > >         __u64 response;
> >> > >         __u64 driver_data[0];
> >> > > };
> >> >
> >> > We'll obviously need different structures for the libibverbs API
> >> > and the kernel interface in this case, and we'll need to figure out
> >> > how to properly translate them.  I think a cast, plus compile time
> >> > type checking ala BUILD_BUG_ON is the way to go.
> >>
> >> I'm not sure I follow, which would I cast?
> >>
> >> BUILD_BUG_ON(sizeof(ibv_query_device) == sizeof(ib_uverbs_cmd_hdr) +
> >>              sizeof(ib_uverbs_query_device))
> >>
> >> ?
> >>
> >> > > I'm thinking the best way forward might be to use a script and
> >> > > transform userspace into:
> >> > >
> >> > > struct ibv_query_device {
> >> > >   struct ib_uverbs_cmd_hdr hdr;
> >> > >   struct ib_uverbs_query_device cmd;
> >> > > };
> >> >
> >> > That would break the users of the interface.
> >>
> >> Sorry, I mean doing this inside rdma-plumbing. Since the change is ABI
> >> identical the modified libibverbs would still be binary compatible
> >> with all providers but not source compatible. Since all kernel
> >> supported providers are in rdma-plumbing we can add the '.cmd.' at the
> >> same time.
> >>
> >> The kernel uapi header would stay the same.
> >>
> >> > However automatically generating the user ABI from the kernel one
> >> > might still be a good idea in the long run.
> >>
> >> My preference would be to try and use the kernel headers directly.
> >
> > I thought the same, especially after realizing that they are almost
> > copy/paste from the vendor *-abi.h files.
> >
> >>
> >> Jason

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-14  7:06                                           ` Parav Pandit
@ 2016-09-19 13:10                                               ` Dalessandro, Dennis
  -1 siblings, 0 replies; 112+ messages in thread
From: Dalessandro, Dennis @ 2016-09-19 13:10 UTC (permalink / raw)
  To: pandit.parav, leon
  Cc: lizefan, linux-kernel, corbet, linux-rdma, cgroups, ogerlitz,
	hch, linux-security-module, haggaie, hannes, Hefty, Sean, akpm,
	james.l.morris, tj, liranl, jgunthorpe

On Wed, 2016-09-14 at 12:36 +0530, Parav Pandit wrote:
> Hi Dennis,
> 
> Do you know how the HFI1 driver would work along with rdma cgroup?

Keep in mind the HFI1 driver has two "modes" of operation. We support
verbs, and would surely fall in line with whatever cgroups do for IB
core. For our psm interface, I'm not sure how cgroups would come into
play. Psm is designed to expose the hw to the user and avoid the kernel
when possible; adding more kernel control is sort of contrary to that.

Now that being said, Christoph recently made mention of maybe having a
drivers/psm [1]. I really haven't had a chance to think about the
implications of that, but maybe it's worth considering; after all, we
have two implementations, qib and hfi1. So anyway, I'm not sure we need
to be too concerned about cgroups right now as far as the psm side of
things goes.

Depending on how things shake out for the uAPI rewrite, or verbs 2.0 or
whatever we are calling it today, things may change.

[1] http://marc.info/?l=linux-rdma&m=147401714313831&w=2

-Denny

> Hi Matan, Leon, Jason,
> Apart from HFI1, is there any other concern?
> Or Patch is good to go?
> 
> 4.8 dates are close by (2 weeks) and there are two git trees involved
> (that might cause merge error to Linus) so if there are no issues, I
> would like to make request to Doug to consider it for 4.8 early on.
> 
> Parav
> 
> On Mon, Sep 12, 2016 at 10:37 AM, Leon Romanovsky <leon@kernel.org>
> wrote:
> > On Sun, Sep 11, 2016 at 11:52:35AM -0600, Jason Gunthorpe wrote:
> > > On Sun, Sep 11, 2016 at 07:24:45PM +0200, Christoph Hellwig
> > > wrote:
> > > > > > > I've posted some initial work toward a) a while ago, and
> > > > > > > once we
> > > > > 
> > > > > Did it get merged? Do you have a pointer?
> > > > 
> > > > http://www.spinics.net/lists/linux-rdma/msg31958.html
> > > 
> > > Right, I remember that. Certainly the right direction
> > > 
> > > > > However, everything under verbs is not straightforward. The
> > > > > files in
> > > > > userspace are not copies...
> > > > > 
> > > > > user:
> > > > > 
> > > > > struct ibv_query_device {
> > > > >        __u32 command;
> > > > >        __u16 in_words;
> > > > >        __u16 out_words;
> > > > >        __u64 response;
> > > > >        __u64 driver_data[0];
> > > > > };
> > > > > 
> > > > > kernel:
> > > > > 
> > > > > struct ib_uverbs_query_device {
> > > > >         __u64 response;
> > > > >         __u64 driver_data[0];
> > > > > };
> > > > 
> > > > > We'll obviously need different structures for the libibverbs API
> > > > and the kernel interface in this case, and we'll need to figure
> > > > out
> > > > how to properly translate them.  I think a cast, plus compile
> > > > time
> > > > type checking ala BUILD_BUG_ON is the way to go.
> > > 
> > > I'm not sure I follow, which would I cast?
> > > 
> > > BUILD_BUG_ON(sizeof(ibv_query_device) ==
> > > sizeof(ib_uverbs_cmd_hdr) +
> > >              sizeof(ib_uverbs_query_device))
> > > 
> > > ?
> > > 
> > > > > I'm thinking the best way forward might be to use a script
> > > > > and
> > > > > transform userspace into:
> > > > > 
> > > > > struct ibv_query_device {
> > > > >   struct ib_uverbs_cmd_hdr hdr;
> > > > >   struct ib_uverbs_query_device cmd;
> > > > > };
> > > > 
> > > > That would break the users of the interface.
> > > 
> > > Sorry, I mean doing this inside rdma-plumbing. Since the change
> > > is ABI
> > > identical the modified libibverbs would still be binary
> > > compatible
> > > with all providers but not source compatible. Since all kernel
> > > supported providers are in rdma-plumbing we can add the '.cmd.'
> > > at the
> > > same time.
> > > 
> > > The kernel uapi header would stay the same.
> > > 
> > > > However automatically generating the user ABI from the kernel
> > > > one
> > > > might still be a good idea in the long run.
> > > 
> > > My preference would be to try and use the kernel headers
> > > directly.
> > 
> > I thought the same, especially after realizing that they are almost
> > copy/paste from the vendor *-abi.h files.
> > 
> > > 
> > > Jason

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
@ 2016-09-19 13:10                                               ` Dalessandro, Dennis
  0 siblings, 0 replies; 112+ messages in thread
From: Dalessandro, Dennis @ 2016-09-19 13:10 UTC (permalink / raw)
  To: pandit.parav, leon
  Cc: lizefan, linux-kernel, corbet, linux-rdma, cgroups, ogerlitz,
	hch, linux-security-module, haggaie, hannes, Hefty, Sean, akpm,
	james.l.morris, tj, liranl, jgunthorpe, linux-doc, dledford,
	matanb, serge

On Wed, 2016-09-14 at 12:36 +0530, Parav Pandit wrote:
> Hi Dennis,
> 
> Do you know how would HFI1 driver would work along with rdma cgroup?

Keep in mind HFI1 driver has two "modes" of operation. We support
verbs, and would surely fall in line with whatever cgroups do for IB
core. For our psm interface, not sure how cgroups would come into play.
Psm is designed to expose the hw to user and avoid the kernel when
possible adding more kernel control is sort of contrary to that.

Now that being said, Christoph recently made mention of maybe having a
drivers/psm [1]. I really haven't had a chance to think about the
implications of that, but maybe it's worth considering, after all we
have two implementations, qib and hfi1. So anyway I'm not sure we need
to be too concerned about cgroups right now as far as psm side of
things goes.

Depending how things shake out for the uAPI rewrite, or verbs 2.0 or
whatever we are calling it today things may change.

[1] http://marc.info/?l=linux-rdma&m=147401714313831&w=2

-Denny

> Hi Matan, Leon, Jason,
> Apart from HFI1, is there any other concern?
> Or Patch is good to go?
> 
> 4.8 dates are close by (2 weeks) and there are two git trees involved
> (that might cause merge error to Linus) so if there are no issues, I
> would like to make request to Doug to consider it for 4.8 early on.
> 
> Parav
> 
> On Mon, Sep 12, 2016 at 10:37 AM, Leon Romanovsky <leon@kernel.org>
> wrote:
> > On Sun, Sep 11, 2016 at 11:52:35AM -0600, Jason Gunthorpe wrote:
> > > On Sun, Sep 11, 2016 at 07:24:45PM +0200, Christoph Hellwig
> > > wrote:
> > > > > > > I've posted some initial work toward a) a while ago, and
> > > > > > > once we
> > > > > 
> > > > > Did it get merged? Do you have a pointer?
> > > > 
> > > > http://www.spinics.net/lists/linux-rdma/msg31958.html
> > > 
> > > Right, I remember that. Certainly the right direction
> > > 
> > > > > However, everything under verbs is not straightforward. The
> > > > > files in
> > > > > userspace are not copies...
> > > > > 
> > > > > user:
> > > > > 
> > > > > struct ibv_query_device {
> > > > >        __u32 command;
> > > > >        __u16 in_words;
> > > > >        __u16 out_words;
> > > > >        __u64 response;
> > > > >        __u64 driver_data[0];
> > > > > };
> > > > > 
> > > > > kernel:
> > > > > 
> > > > > struct ib_uverbs_query_device {
> > > > >         __u64 response;
> > > > >         __u64 driver_data[0];
> > > > > };
> > > > 
> > > > We'll obviously need different strutures for the libibvers API
> > > > and the kernel interface in this case, and we'll need to figure
> > > > out
> > > > how to properly translate them.  I think a cast, plus compile
> > > > time
> > > > type checking ala BUILD_BUG_ON is the way to go.
> > > 
> > > I'm not sure I follow, which would I cast?
> > > 
> > > BUILD_BUG_ON(sizeof(ibv_query_device) ==
> > > sizeof(ib_uverbs_cmd_hdr) +
> > >              sizeof(ib_uverbs_query_device))
> > > 
> > > ?
> > > 
> > > > > I'm thinking the best way forward might be to use a script
> > > > > and
> > > > > transform userspace into:
> > > > > 
> > > > > struct ibv_query_device {
> > > > >   struct ib_uverbs_cmd_hdr hdr;
> > > > >   struct ib_uverbs_query_device cmd;
> > > > > };
> > > > 
> > > > That would break the users of the interface.
> > > 
> > > Sorry, I mean doing this inside rdma-plumbing. Since the change
> > > is ABI
> > > identical the modified libibverbs would still be binary
> > > compatible
> > > with all providers but not source compatible. Since all kernel
> > > supported providers are in rdma-plumbing we can add the '.cmd.'
> > > at the
> > > same time.
> > > 
> > > The kernel uapi header would stay the same.
> > > 
> > > > However automatically generating the user ABI from the kernel
> > > > one
> > > > might still be a good idea in the long run.
> > > 
> > > My preference would be to try and use the kernel headers
> > > directly.
> > 
> > I thought the same, especially after realizing that they are almost
> > copy/paste from the vendor *-abi.h files.
> > 
> > > 
> > > Jason

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-19 13:10                                               ` Dalessandro, Dennis
@ 2016-09-19 17:00                                                 ` Parav Pandit
  -1 siblings, 0 replies; 112+ messages in thread
From: Parav Pandit @ 2016-09-19 17:00 UTC (permalink / raw)
  To: Dalessandro, Dennis
  Cc: leon, lizefan, linux-kernel, corbet, linux-rdma, cgroups,
	ogerlitz, hch, linux-security-module, haggaie, hannes, Hefty,
	Sean, akpm, james.l.morris, tj, liranl

Hi Denny,

On Mon, Sep 19, 2016 at 6:40 PM, Dalessandro, Dennis
<dennis.dalessandro@intel.com> wrote:
> On Wed, 2016-09-14 at 12:36 +0530, Parav Pandit wrote:
>> Hi Dennis,
>>
>> Do you know how would HFI1 driver would work along with rdma cgroup?
>
> Keep in mind HFI1 driver has two "modes" of operation. We support
> verbs, and would surely fall in line with whatever cgroups do for IB
> core.
Thanks for the feedback.

> For our psm interface, not sure how cgroups would come into play.
> Psm is designed to expose the hw to user and avoid the kernel when
> possible adding more kernel control is sort of contrary to that.
>
Yes, PSM is currently outside the scope of RDMA cgroup. In the future we
can take a look at how things shape up if it becomes a subsystem.

> Now that being said, Christoph recently made mention of maybe having a
> drivers/psm [1]. I really haven't had a chance to think about the
> implications of that, but maybe it's worth considering, after all we
> have two implementations, qib and hfi1. So anyway I'm not sure we need
> to be too concerned about cgroups right now as far as psm side of
> things goes.
>
o.k.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-15 18:56                                               ` Leon Romanovsky
  (?)
@ 2016-09-21  4:43                                               ` Parav Pandit
  2016-09-21 14:26                                                 ` Tejun Heo
  -1 siblings, 1 reply; 112+ messages in thread
From: Parav Pandit @ 2016-09-21  4:43 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, Christoph Hellwig, Matan Barak, Tejun Heo,
	cgroups, linux-doc, Linux Kernel Mailing List, linux-rdma,
	Li Zefan, Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

Hi Leon,

On Fri, Sep 16, 2016 at 12:26 AM, Leon Romanovsky <leon@kernel.org> wrote:
> On Wed, Sep 14, 2016 at 12:36:19PM +0530, Parav Pandit wrote:
>> Hi Dennis,
>>
>> Do you know how would HFI1 driver would work along with rdma cgroup?
>>
>> Hi Matan, Leon, Jason,
>> Apart from HFI1, is there any other concern?
>> Or Patch is good to go?
>
> I didn't review it yet :(.
> Sorry
>

Review from Tejun and Christoph is complete.
The HFI driver folks also provided feedback for the Intel drivers.
Matan also doesn't have any more comments.

If you can also review, it would be helpful.

I have some more changes, unrelated to cgroup, in the same files in both git trees.
Pushing them now either results in a merge conflict later on for
Doug/Tejun, or requires a rebase and resending the patch.
If you can review, we can avoid such rework.

>>
>> 4.8 dates are close by (2 weeks) and there are two git trees involved
>> (that might cause merge error to Linus) so if there are no issues, I
>> would like to make request to Doug to consider it for 4.8 early on.
>>
>> Parav
>>
>> On Mon, Sep 12, 2016 at 10:37 AM, Leon Romanovsky <leon@kernel.org> wrote:
>> > On Sun, Sep 11, 2016 at 11:52:35AM -0600, Jason Gunthorpe wrote:
>> >> On Sun, Sep 11, 2016 at 07:24:45PM +0200, Christoph Hellwig wrote:
>> >> > > > > I've posted some initial work toward a) a while ago, and once we
>> >> > >
>> >> > > Did it get merged? Do you have a pointer?
>> >> >
>> >> > http://www.spinics.net/lists/linux-rdma/msg31958.html
>> >>
>> >> Right, I remember that. Certainly the right direction
>> >>
>> >> > > However, everything under verbs is not straightforward. The files in
>> >> > > userspace are not copies...
>> >> > >
>> >> > > user:
>> >> > >
>> >> > > struct ibv_query_device {
>> >> > >        __u32 command;
>> >> > >        __u16 in_words;
>> >> > >        __u16 out_words;
>> >> > >        __u64 response;
>> >> > >        __u64 driver_data[0];
>> >> > > };
>> >> > >
>> >> > > kernel:
>> >> > >
>> >> > > struct ib_uverbs_query_device {
>> >> > >         __u64 response;
>> >> > >         __u64 driver_data[0];
>> >> > > };
>> >> >
>> >> > We'll obviously need different strutures for the libibvers API
>> >> > and the kernel interface in this case, and we'll need to figure out
>> >> > how to properly translate them.  I think a cast, plus compile time
>> >> > type checking ala BUILD_BUG_ON is the way to go.
>> >>
>> >> I'm not sure I follow, which would I cast?
>> >>
>> >> BUILD_BUG_ON(sizeof(ibv_query_device) == sizeof(ib_uverbs_cmd_hdr) +
>> >>              sizeof(ib_uverbs_query_device))
>> >>
>> >> ?
>> >>
>> >> > > I'm thinking the best way forward might be to use a script and
>> >> > > transform userspace into:
>> >> > >
>> >> > > struct ibv_query_device {
>> >> > >   struct ib_uverbs_cmd_hdr hdr;
>> >> > >   struct ib_uverbs_query_device cmd;
>> >> > > };
>> >> >
>> >> > That would break the users of the interface.
>> >>
>> >> Sorry, I mean doing this inside rdma-plumbing. Since the change is ABI
>> >> identical the modified libibverbs would still be binary compatible
>> >> with all providers but not source compatible. Since all kernel
>> >> supported providers are in rdma-plumbing we can add the '.cmd.' at the
>> >> same time.
>> >>
>> >> The kernel uapi header would stay the same.
>> >>
>> >> > However automatically generating the user ABI from the kernel one
>> >> > might still be a good idea in the long run.
>> >>
>> >> My preference would be to try and use the kernel headers directly.
>> >
>> > I thought the same, especially after realizing that they are almost
>> > copy/paste from the vendor *-abi.h files.
>> >
>> >>
>> >> Jason
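
The layout question in the quoted discussion can be checked at compile
time. The following is a minimal userspace sketch, not code from the
patchset or from libibverbs. It assumes fixed-width stdint types as
stand-ins for the kernel's __u16/__u32/__u64, omits the zero-length
driver_data[] tail (it contributes no size), and uses the hypothetical
names ibv_query_device_flat and ibv_query_device_wrapped for the two
userspace variants being compared.

```c
#include <stddef.h>	/* offsetof */
#include <stdint.h>

/* Command header, with the three fields shown in the flat userspace struct. */
struct ib_uverbs_cmd_hdr {
	uint32_t command;
	uint16_t in_words;
	uint16_t out_words;
};

/* Kernel-side command body; the driver_data[0] tail is left out here. */
struct ib_uverbs_query_device {
	uint64_t response;
};

/* Hypothetical name: today's flat userspace layout. */
struct ibv_query_device_flat {
	uint32_t command;
	uint16_t in_words;
	uint16_t out_words;
	uint64_t response;
};

/* Hypothetical name: the proposed header-plus-command layout. */
struct ibv_query_device_wrapped {
	struct ib_uverbs_cmd_hdr hdr;
	struct ib_uverbs_query_device cmd;
};

/* The build fails if the two layouts ever diverge. */
_Static_assert(sizeof(struct ibv_query_device_flat) ==
	       sizeof(struct ib_uverbs_cmd_hdr) +
	       sizeof(struct ib_uverbs_query_device),
	       "flat struct must be exactly header + command");
_Static_assert(offsetof(struct ibv_query_device_flat, response) ==
	       offsetof(struct ibv_query_device_wrapped, cmd) +
	       offsetof(struct ib_uverbs_query_device, response),
	       "response must sit at the same offset in both layouts");

/* Runtime view of the same fact, returning 1 when the sizes agree. */
int ibv_query_device_layouts_match(void)
{
	return sizeof(struct ibv_query_device_flat) ==
	       sizeof(struct ibv_query_device_wrapped);
}
```

In kernel code the same check would typically use BUILD_BUG_ON(), which
breaks the build when its condition is true, so the size comparison
there would be written with != rather than ==.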

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-21  4:43                                               ` Parav Pandit
@ 2016-09-21 14:26                                                 ` Tejun Heo
       [not found]                                                   ` <20160921142645.GB10734-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
  0 siblings, 1 reply; 112+ messages in thread
From: Tejun Heo @ 2016-09-21 14:26 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Leon Romanovsky, Jason Gunthorpe, Christoph Hellwig, Matan Barak,
	cgroups, linux-doc, Linux Kernel Mailing List, linux-rdma,
	Li Zefan, Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

Hello, Parav.

On Wed, Sep 21, 2016 at 10:13:38AM +0530, Parav Pandit wrote:
> We have completed review from Tejun, Christoph.
> HFI driver folks also provided feedback for Intel drivers.
> Matan's also doesn't have any more comments.
> 
> If possible, if you can also review, it will be helpful.
> 
> I have some more changes unrelated to cgroup in same files in both the git tree.
> Pushing them now either results into merge conflict later on for
> Doug/Tejun, or requires rebase and resending patch.
> If you can review, we can avoid such rework.

My impression of the thread was that there doesn't seem to be enough
of consensus around how rdma resources should be defined.  Is that
part agreed upon now?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-21 14:26                                                 ` Tejun Heo
@ 2016-09-21 16:02                                                       ` Parav Pandit
  0 siblings, 0 replies; 112+ messages in thread
From: Parav Pandit @ 2016-09-21 16:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Leon Romanovsky, Jason Gunthorpe, Christoph Hellwig, Matan Barak,
	cgroups, linux-doc, Linux Kernel Mailing List, linux-rdma,
	Li Zefan, Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

Hi Tejun,

On Wed, Sep 21, 2016 at 7:56 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Parav.
>
> On Wed, Sep 21, 2016 at 10:13:38AM +0530, Parav Pandit wrote:
>> We have completed review from Tejun, Christoph.
>> HFI driver folks also provided feedback for Intel drivers.
>> Matan's also doesn't have any more comments.
>>
>> If possible, if you can also review, it will be helpful.
>>
>> I have some more changes unrelated to cgroup in same files in both the git tree.
>> Pushing them now either results into merge conflict later on for
>> Doug/Tejun, or requires rebase and resending patch.
>> If you can review, we can avoid such rework.
>
> My impression of the thread was that there doesn't seem to be enough
> of consensus around how rdma resources should be defined.  Is that
> part agreed upon now?
>

We ended up discussing a few points on a different thread [1].

Matan raised a question about how some non-rdma/non-IB drivers would
work with rdma cgroup.
Christoph explained that they don't fit in the rdma subsystem and
are therefore not a prime target to address.
Intel driver maintainer Denny also acknowledged the same in [2].
The IB-compliant Intel drivers support rdma cgroup, as explained in [2].
With that, the usnic and Intel psm drivers fall outside rdma cgroup
support, as they don't fit well in the verbs definition.

[1] https://www.spinics.net/lists/linux-rdma/msg40340.html
[2] http://www.spinics.net/lists/linux-rdma/msg40717.html

I will wait for Leon's review comments in case he has a different view on
the architecture.
Back in April, when I met Leon and Haggai face-to-face, Leon was in
favor of having the kernel define the rdma resources, as suggested by
Christoph and Tejun, instead of the IB/RDMA subsystem.
I will wait for his comments in case his views have changed with the new
uAPI taking shape.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-09-21 16:02                                                       ` Parav Pandit
  (?)
@ 2016-10-04 18:19                                                           ` Parav Pandit
  -1 siblings, 0 replies; 112+ messages in thread
From: Parav Pandit @ 2016-10-04 18:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Leon Romanovsky, Jason Gunthorpe, Christoph Hellwig, Matan Barak,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	Linux Kernel Mailing List, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Li Zefan, Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Haggai Eran, Jonathan Corbet,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, Or Gerlitz, Andrew Morton,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA

Hi Doug,

I am still waiting for Leon to provide his comments, if any, on rdma cgroup.
From other email context, he was on vacation last week.
While we wait for his comments, I wanted to know your view on taking this
patchset in the 4.9 merge window.

To summarize the discussion that happened in two threads:

[1] Ack by Tejun, asking for review from rdma list
[2] quick review by Christoph on patch-v11 (v12 has only typo corrections)
[3] Christoph's ack on architecture of rdma cgroup and fitting it with ABI
[4] My response on Matan's query on RSS indirection table
[5] Response from Intel on their driver support for Matan's query
[6] Christoph's point on architecture, which we are following in new
ABI and current ABI

I have reviewed the recent patch [7] from Matan, where I see that IB verbs
objects are still handled through a common path, as suggested by
Christoph.

I do not see any issues with the rdma cgroup patchset other than that it
requires a rebase.
Am I missing something?
Can you please help me understand what would be required to merge it for 4.9?

[1] https://lkml.org/lkml/2016/8/31/494
[2] https://lkml.org/lkml/2016/8/25/146
[3] https://lkml.org/lkml/2016/9/10/175
[4] https://lkml.org/lkml/2016/9/14/221
[5] https://lkml.org/lkml/2016/9/19/571
[6] http://www.spinics.net/lists/linux-rdma/msg40337.html
[7] email subject: [RFC ABI V4 0/7] SG-based RDMA ABI Proposal

Regards,
Parav Pandit

On Wed, Sep 21, 2016 at 9:32 PM, Parav Pandit <pandit.parav@gmail.com> wrote:
> Hi Tejun,
>
> On Wed, Sep 21, 2016 at 7:56 PM, Tejun Heo <tj@kernel.org> wrote:
>> Hello, Parav.
>>
>> On Wed, Sep 21, 2016 at 10:13:38AM +0530, Parav Pandit wrote:
>>> We have completed review from Tejun, Christoph.
>>> HFI driver folks also provided feedback for Intel drivers.
>>> Matan's also doesn't have any more comments.
>>>
>>> If possible, if you can also review, it will be helpful.
>>>
>>> I have some more changes unrelated to cgroup in same files in both the git tree.
>>> Pushing them now either results into merge conflict later on for
>>> Doug/Tejun, or requires rebase and resending patch.
>>> If you can review, we can avoid such rework.
>>
>> My impression of the thread was that there doesn't seem to be enough
>> of consensus around how rdma resources should be defined.  Is that
>> part agreed upon now?
>>
>
> We ended up discussing few points on different thread [1].
>
> There was confusion on how some non-rdma/non-IB drivers would work
> with rdma cgroup from Matan.
> Christoph explained how they don't fit in the rdma subsystem and
> therefore its not prime target to addess.
> Intel driver maintainer Denny also acknowledged same on [2].
> IB compliant drivers of Intel support rdma cgroup as explained in [2].
> With that usnic and Intel psm drivers falls out of rdma cgroup support
> as they don't fit very well in the verbs definition.
>
> [1] https://www.spinics.net/lists/linux-rdma/msg40340.html
> [2] http://www.spinics.net/lists/linux-rdma/msg40717.html
>
> I will wait for Leon's review comments if he has different view on architecture.
> Back in April when I met face-to-face to Leon and Haggai, Leon was in
> support to have kernel defined the rdma resources as suggested by
> Christoph and Tejun instead of IB/RDMA subsystem.
> I will wait for his comments if his views have changed with new uAPI
> taking shape.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-10-04 18:19                                                           ` Parav Pandit
  (?)
  (?)
@ 2016-10-05  6:37                                                           ` Christoph Hellwig
  2016-10-05 11:22                                                             ` Leon Romanovsky
       [not found]                                                             ` <20161005063735.GC3086-jcswGhMUV9g@public.gmane.org>
  -1 siblings, 2 replies; 112+ messages in thread
From: Christoph Hellwig @ 2016-10-05  6:37 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Tejun Heo, Leon Romanovsky, Jason Gunthorpe, Christoph Hellwig,
	Matan Barak, cgroups, linux-doc, Linux Kernel Mailing List,
	linux-rdma, Li Zefan, Johannes Weiner, Doug Ledford, Liran Liss,
	Hefty, Sean, Haggai Eran, Jonathan Corbet, james.l.morris, serge,
	Or Gerlitz, Andrew Morton, linux-security-module

FYI, the patches look fine to me:

Acked-by: Christoph Hellwig <hch@lst.de>

but we're past the merge window for 4.9 now unfortunately.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
  2016-08-31  8:37 [PATCHv12 0/3] rdmacg: IB/core: rdma controller support Parav Pandit
                   ` (2 preceding siblings ...)
       [not found] ` <1472632647-1525-1-git-send-email-pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2016-10-05 11:22 ` Leon Romanovsky
       [not found]   ` <20161005112206.GC9282-2ukJVAZIZ/Y@public.gmane.org>
  3 siblings, 1 reply; 112+ messages in thread
From: Leon Romanovsky @ 2016-10-05 11:22 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan,
	hannes, dledford, hch, liranl, sean.hefty, jgunthorpe, haggaie,
	corbet, james.l.morris, serge, ogerlitz, matanb, akpm,
	linux-security-module

[-- Attachment #1: Type: text/plain, Size: 1721 bytes --]

On Wed, Aug 31, 2016 at 02:07:24PM +0530, Parav Pandit wrote:
> rdmacg: IB/core: rdma controller support
>
> Patch is generated and tested against below Doug's linux-rdma
> git tree.
>
> URL: git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma.git
> Branch: master
>
> Patchset is also compiled and tested against below Tejun's cgroup tree
> using cgroup v2 mode.
> URL: git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git
> Branch: master
>
> Overview:
> Currently user space applications can easily take away all the rdma
> device specific resources such as AH, CQ, QP, MR etc. Due to which other
> applications in other cgroup or kernel space ULPs may not even get chance
> to allocate any rdma resources. This results into service unavailibility.
>
> RDMA cgroup addresses this issue by allowing resource accounting,
> limit enforcement on per cgroup, per rdma device basis.
>
> RDMA uverbs layer will enforce limits on well defined RDMA verb
> resources without any HCA vendor device driver involvement.
>
> RDMA uverbs layer will not do limit enforcement of HCA hw vendor
> specific resources. Instead rdma cgroup provides set of APIs
> through which vendor specific drivers can do resource accounting
> by making use of rdma cgroup.

Hi Parav,
I want to propose an extension to the RDMA cgroup which can be done as
follow-up patches.

Let's add a new global type that controls the whole HCA (for example, as a
percentage). It would allow new objects to be defined natively, without the
need to introduce each of them to the user.

This HCA share would be overridden by the specific UVERBS types which you
have already defined.

What do you think?

Apart from this proposal,
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>

Thanks.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-10-05  6:37                                                           ` Christoph Hellwig
@ 2016-10-05 11:22                                                             ` Leon Romanovsky
  2016-10-05 15:36                                                               ` Tejun Heo
       [not found]                                                             ` <20161005063735.GC3086-jcswGhMUV9g@public.gmane.org>
  1 sibling, 1 reply; 112+ messages in thread
From: Leon Romanovsky @ 2016-10-05 11:22 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Parav Pandit, Tejun Heo, Jason Gunthorpe, Matan Barak, cgroups,
	linux-doc, Linux Kernel Mailing List, linux-rdma, Li Zefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

[-- Attachment #1: Type: text/plain, Size: 248 bytes --]

On Wed, Oct 05, 2016 at 08:37:35AM +0200, Christoph Hellwig wrote:
> FYI, the patches look fine to me:
>
> Acked-by: Christoph Hellwig <hch@lst.de>
>
> but we're past the merge window for 4.9 now unfortunately.

IMHO, it still can make it.

Thanks

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-10-05 11:22                                                             ` Leon Romanovsky
@ 2016-10-05 15:36                                                               ` Tejun Heo
  0 siblings, 0 replies; 112+ messages in thread
From: Tejun Heo @ 2016-10-05 15:36 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Christoph Hellwig, Parav Pandit, Jason Gunthorpe, Matan Barak,
	cgroups, linux-doc, Linux Kernel Mailing List, linux-rdma,
	Li Zefan, Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

Hello,

On Wed, Oct 05, 2016 at 02:22:57PM +0300, Leon Romanovsky wrote:
> On Wed, Oct 05, 2016 at 08:37:35AM +0200, Christoph Hellwig wrote:
> > FYI, the patches look fine to me:
> >
> > Acked-by: Christoph Hellwig <hch@lst.de>
> >
> > but we're past the merge window for 4.9 now unfortunately.
> 
> IMHO, it still can make it.

We most likely have only three or four days until rc1 opens, so I think
it's too late.  Let's target the next one.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
@ 2016-10-06 12:55                                                                 ` Parav Pandit
  0 siblings, 0 replies; 112+ messages in thread
From: Parav Pandit @ 2016-10-06 12:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Tejun Heo, Leon Romanovsky, Jason Gunthorpe, Matan Barak,
	cgroups, linux-doc, Linux Kernel Mailing List, linux-rdma,
	Li Zefan, Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

Hi Christoph,

On Wed, Oct 5, 2016 at 12:07 PM, Christoph Hellwig <hch@lst.de> wrote:
> FYI, the patches look fine to me:
>
> Acked-by: Christoph Hellwig <hch@lst.de>
>
Thanks a lot for the review.

Parav

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
@ 2016-10-06 12:59       ` Parav Pandit
  0 siblings, 0 replies; 112+ messages in thread
From: Parav Pandit @ 2016-10-06 12:59 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: cgroups, linux-doc, Linux Kernel Mailing List, linux-rdma,
	Tejun Heo, Li Zefan, Johannes Weiner, Doug Ledford,
	Christoph Hellwig, Liran Liss, Hefty, Sean, Jason Gunthorpe,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Matan Barak, Andrew Morton, linux-security-module

Hi Leon,

On Wed, Oct 5, 2016 at 4:52 PM, Leon Romanovsky <leon@kernel.org> wrote:
> On Wed, Aug 31, 2016 at 02:07:24PM +0530, Parav Pandit wrote:
>> rdmacg: IB/core: rdma controller support
>>
>> Patch is generated and tested against below Doug's linux-rdma
>> git tree.
>>
>> URL: git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma.git
>> Branch: master
>>
>> Patchset is also compiled and tested against below Tejun's cgroup tree
>> using cgroup v2 mode.
>> URL: git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git
>> Branch: master
>>
>> Overview:
>> Currently user space applications can easily take away all the rdma
>> device specific resources such as AH, CQ, QP, MR etc. Due to which other
>> applications in other cgroup or kernel space ULPs may not even get chance
>> to allocate any rdma resources. This results into service unavailibility.
>>
>> RDMA cgroup addresses this issue by allowing resource accounting,
>> limit enforcement on per cgroup, per rdma device basis.
>>
>> RDMA uverbs layer will enforce limits on well defined RDMA verb
>> resources without any HCA vendor device driver involvement.
>>
>> RDMA uverbs layer will not do limit enforcement of HCA hw vendor
>> specific resources. Instead rdma cgroup provides set of APIs
>> through which vendor specific drivers can do resource accounting
>> by making use of rdma cgroup.
>
> Hi Parav,
> I want to propose an extension to the RDMA cgroup which can be done as
> follow-up patches.
>

To bring this feature/patch discussion to a logical end and to make progress
towards merging it, let's discuss this new feature in a follow-on email
right after this one, between the cgroups and linux-rdma mailing lists only;
I will drop the linux-kernel and linux-doc mailing lists.

> Let's add new global type, which will control whole HCA (for example in percentages). It will
> allow natively define new objects without need to introduce them to the user.
>
> This HCA share will be overwritten by specific UVERBS types which you
> already defined.
>
> What do you think?
>
> Except this proposal,
> Reviewed-by: Leon Romanovsky <leonro@mellanox.com>

Thanks a lot for the review.

Parav

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]   ` <20161005112206.GC9282-2ukJVAZIZ/Y@public.gmane.org>
  2016-10-06 12:59       ` Parav Pandit
@ 2016-10-06 13:49     ` Parav Pandit
       [not found]       ` <CAG53R5VNVb=8-LJbDRqjtOZG347ucPuc420bcfnDgBKMoKqU-w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 112+ messages in thread
From: Parav Pandit @ 2016-10-06 13:49 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, Li Zefan,
	Johannes Weiner, Doug Ledford, Christoph Hellwig, Liran Liss,
	Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

Hi Leon,

On Wed, Oct 5, 2016 at 4:52 PM, Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> On Wed, Aug 31, 2016 at 02:07:24PM +0530, Parav Pandit wrote:
>> rdmacg: IB/core: rdma controller support
>>
>> Overview:
>> Currently user space applications can easily take away all the rdma
>> device specific resources such as AH, CQ, QP, MR etc. Due to which other
>> applications in other cgroup or kernel space ULPs may not even get chance
>> to allocate any rdma resources. This results into service unavailibility.
>>
>> RDMA cgroup addresses this issue by allowing resource accounting,
>> limit enforcement on per cgroup, per rdma device basis.
>>
>> RDMA uverbs layer will enforce limits on well defined RDMA verb
>> resources without any HCA vendor device driver involvement.
>>
>> RDMA uverbs layer will not do limit enforcement of HCA hw vendor
>> specific resources. Instead rdma cgroup provides set of APIs
>> through which vendor specific drivers can do resource accounting
>> by making use of rdma cgroup.
>
> Hi Parav,
> I want to propose an extension to the RDMA cgroup which can be done as
> follow-up patches.
>
> Let's add new global type, which will control whole HCA (for example in percentages). It will
> allow natively define new objects without need to introduce them to the user.
>
In other cgroup controllers such as CPU, this is done using the cpu.weight
API, where a percentage or weight is configured by the user.
In this mode, resources are taken away from other cgroups proportionately.
It works for CPU because it is mainly a stateless resource, unlike rdma
resources.
So if we want to simplify user configuration similarly, the
percentage/weight configuration can be extended to rdma.
This way new objects need not be introduced to users.
I hope your definition of "user" is the actual end-user and not the rdma
cgroup. Right?
In other words, a new object should still be added as a new enum value in
cgroup_rdma.h?
Only then can it be overridden by a specific UVERBS type, as you
describe below. I think that's what you meant.

Otherwise charging/uncharging this new percentage resource can get messy.

> This HCA share will be overwritten by specific UVERBS types which you
> already defined.
>
> What do you think?

So to refine your proposal from the cgroup perspective, instead of adding
a new resource type in cgroup_rdma.h for percentage, I would prefer to have:

Existing:
1. rdma.max
2. rdma.current
New:
3. rdma.weight

This ABI will have a similar API, for example:
echo "mlx4_0 50" > rdma.weight
where 50 is the weight of the resources.
For example,
with one cgroup instance, weight = sum, so that cgroup gets 100% of the
resources.
With three cgroup instances, percentage = (weight/sum)% = 50/(50+50+50) = 33%,
so each cgroup gets 33% of the resources.

Weight can be in the range of 1 to 10,000, similar to the cpu cgroup.
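The arithmetic above can be sketched in a few lines of Python. This is only
an illustration of the proposal: rdma.weight is the interface being proposed
in this mail, and the function name weight_shares and the cgroup names are
hypothetical.

```python
def weight_shares(weights, lo=1, hi=10_000):
    """Map each cgroup to its fraction of a device: weight / sum of all weights."""
    for name, w in weights.items():
        # The proposed range is 1 to 10,000, similar to the cpu cgroup.
        if not lo <= w <= hi:
            raise ValueError(f"weight for {name} must be in [{lo}, {hi}], got {w}")
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# One cgroup with weight 50: weight == sum, so it gets the whole device.
one = weight_shares({"cg1": 50})

# Three cgroups with weight 50 each: 50 / (50 + 50 + 50), about 33% apiece.
three = weight_shares({"cg1": 50, "cg2": 50, "cg3": 50})

for name, share in three.items():
    print(f"{name}: {share:.0%}")
```

Note that a cgroup's share changes whenever a sibling is added or removed,
which is why this model fits a stateless resource better than already
allocated rdma objects.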

This might work if the applications running in all cgroups are similar.
But weight doesn't do justice when different types of applications
run in each cgroup, such as a few running libfabric-based apps, a few
running MPI, and others using ibverbs directly.
So, as you said, rdma.max configuration would be required for the management
plane to override the weight (percentage) for certain resources.


>
> Except this proposal,
> Reviewed-by: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>
> Thanks.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]       ` <CAG53R5VNVb=8-LJbDRqjtOZG347ucPuc420bcfnDgBKMoKqU-w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-10-10  4:46         ` Leon Romanovsky
       [not found]           ` <20161010044623.GI9282-2ukJVAZIZ/Y@public.gmane.org>
  0 siblings, 1 reply; 112+ messages in thread
From: Leon Romanovsky @ 2016-10-10  4:46 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, Li Zefan,
	Johannes Weiner, Doug Ledford, Christoph Hellwig, Liran Liss,
	Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

[-- Attachment #1: Type: text/plain, Size: 4430 bytes --]

On Thu, Oct 06, 2016 at 07:19:24PM +0530, Parav Pandit wrote:
> Hi Leon,
>
> On Wed, Oct 5, 2016 at 4:52 PM, Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> > On Wed, Aug 31, 2016 at 02:07:24PM +0530, Parav Pandit wrote:
> >> rdmacg: IB/core: rdma controller support
> >>
> >> Overview:
> >> Currently user space applications can easily take away all the rdma
> >> device specific resources such as AH, CQ, QP, MR etc. Due to which other
> >> applications in other cgroup or kernel space ULPs may not even get chance
> >> to allocate any rdma resources. This results into service unavailibility.
> >>
> >> RDMA cgroup addresses this issue by allowing resource accounting,
> >> limit enforcement on per cgroup, per rdma device basis.
> >>
> >> RDMA uverbs layer will enforce limits on well defined RDMA verb
> >> resources without any HCA vendor device driver involvement.
> >>
> >> RDMA uverbs layer will not do limit enforcement of HCA hw vendor
> >> specific resources. Instead rdma cgroup provides set of APIs
> >> through which vendor specific drivers can do resource accounting
> >> by making use of rdma cgroup.
> >
> > Hi Parav,
> > I want to propose an extension to the RDMA cgroup which can be done as
> > follow-up patches.
> >
> > Let's add new global type, which will control whole HCA (for example in percentages). It will
> > allow natively define new objects without need to introduce them to the user.
> >
> In other cgroup such as CPU, this is done using cpu.weight API. Where
> percentage or weight is configured by the user.
> In this mode, resources taken away from other cgroup proportionately.
> It works for cpu because its mainly stateless resource unlike rdma
> resources.
> So if we want to simplify user configuration similarly,
> percentage/weight configuration can be extended.
> This way they need not be introduced to users.
> I hope your definition of "user" is actual end-user and not rdma cgroup. Right?

Yes, "user" -> "admin".
I think a percentage is more intuitive to them, and it will be much easier to
explain how to use it. I always have in mind the "swappiness" field and the
numerous questions on how to configure it.

> In other words, new object should be still added as new enum value in
> rdma_cgroup.h?

Yes, I had in mind something like IB_CGROUP_HCA; this is why it can be
done as future work after the current patches are accepted.

> Only than it can be overwritten by specific UVERBs type as you
> described below. I think thats what you meant as you described below.

Exactly.

>
> Otherwise charging/uncharging this new percentage resource can get messy.

Agree

>
> > This HCA share will be overwritten by specific UVERBS types which you
> > already defined.
> >
> > What do you think?
>
> So to refine your proposal from cgroup perspective, instead of adding
> new resource type in rdma_cgroup.h for percentage, I prefer to have
>
> Existing
> 1. rdma.max
> 2. rdma.current
> New,
> 3. rdma.weight
> This ABI will have similar API to say
> echo "mlx4_0 50" > rdma.weight.
> Where 50 is weight of the resources.
> For example,
> for one cgroup instance weight=sum=100% resource for a given cgroup.
> for three cgroup instances percentage=(weight/sum)% = 50/(50+50+50) = 33%.
> One cgroup gets 33% resource.
>
> Weight can be in range of 1 to 10,000 similar to cpu cgroup.

This is exactly what I don't like. A percentage removes the need for the
user to translate between a weight and the actual limit.

IMHO CPU used weights because everything there is in weights :).

>
> This might work if applications running in all cgroups are similar.
> But weight doesn't do justice, when there are different type of
> applications running in each cgroup. Such as few running libfabric
> based apps, few running MPI, others directly using ibverbs.
> So as you said rdma.max configuration would be required for management
> plane to override weight (percentage) for certain resources.

Why?
The device exposes its max values during initialization, and if the user
asks for 20% of the HCA, he will get max*0.2.

>
>
> >
> > Except this proposal,
> > Reviewed-by: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >
> > Thanks.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]           ` <20161010044623.GI9282-2ukJVAZIZ/Y@public.gmane.org>
@ 2016-10-10  6:29             ` Parav Pandit
       [not found]               ` <CAG53R5UM6nSTZ7=0S9reKGX45CpNBi8soSDVZyXkN-z0_XXWWQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 112+ messages in thread
From: Parav Pandit @ 2016-10-10  6:29 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-rdma, Tejun Heo, Li Zefan,
	Johannes Weiner, Doug Ledford, Christoph Hellwig, Liran Liss,
	Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

Hi Leon,

On Mon, Oct 10, 2016 at 10:16 AM, Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> On Thu, Oct 06, 2016 at 07:19:24PM +0530, Parav Pandit wrote:
>> Hi Leon,
>>
>> On Wed, Oct 5, 2016 at 4:52 PM, Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>> > On Wed, Aug 31, 2016 at 02:07:24PM +0530, Parav Pandit wrote:
>> >> rdmacg: IB/core: rdma controller support
>> >>
>> >> Overview:
>> >> Currently user space applications can easily take away all the rdma
>> >> device specific resources such as AH, CQ, QP, MR etc. Due to which other
>> >> applications in other cgroup or kernel space ULPs may not even get chance
>> >> to allocate any rdma resources. This results into service unavailibility.
>> >>
>> >> RDMA cgroup addresses this issue by allowing resource accounting,
>> >> limit enforcement on per cgroup, per rdma device basis.
>> >>
>> >> RDMA uverbs layer will enforce limits on well defined RDMA verb
>> >> resources without any HCA vendor device driver involvement.
>> >>
>> >> RDMA uverbs layer will not do limit enforcement of HCA hw vendor
>> >> specific resources. Instead rdma cgroup provides set of APIs
>> >> through which vendor specific drivers can do resource accounting
>> >> by making use of rdma cgroup.
>> >
>> > Hi Parav,
>> > I want to propose an extension to the RDMA cgroup which can be done as
>> > follow-up patches.
>> >
>> > Let's add new global type, which will control whole HCA (for example in percentages). It will
>> > allow natively define new objects without need to introduce them to the user.
>> >
>> In other cgroup such as CPU, this is done using cpu.weight API. Where
>> percentage or weight is configured by the user.
>> In this mode, resources taken away from other cgroup proportionately.
>> It works for cpu because its mainly stateless resource unlike rdma
>> resources.
>> So if we want to simplify user configuration similarly,
>> percentage/weight configuration can be extended.
>> This way they need not be introduced to users.
>> I hope your definition of "user" is actual end-user and not rdma cgroup. Right?
>
> Yes, "user" -> "admin".
> I think that percentage is more intuitive to them and will be much easier to
> explain how to use it. I always have in mind "swappiness" field and the
> numerous questions on how to configure it.
>
>> In other words, new object should be still added as new enum value in
>> rdma_cgroup.h?
>
> Yes, I had in mind something like IB_CGROUP_HCA, this is why it can be
> done as a future work after accepting current patches.
>
What I meant is:
today we have RDMACG_VERB_RESOURCE_QP etc.
We will additionally have RDMACG_VERB_RESOURCE_INDIRECT_TBL etc. in
cgroup_rdma.h,
so that it is available for the admin to override.

>> Only than it can be overwritten by specific UVERBs type as you
>> described below. I think thats what you meant as you described below.
>
> Exactly.
>
>>
>> Otherwise charging/uncharging this new percentage resource can get messy.
>
> Agree
>
>>
>> > This HCA share will be overwritten by specific UVERBS types which you
>> > already defined.
>> >
>> > What do you think?
>>
>> So to refine your proposal from cgroup perspective, instead of adding
>> new resource type in rdma_cgroup.h for percentage, I prefer to have
>>
>> Existing
>> 1. rdma.max
>> 2. rdma.current
>> New,
>> 3. rdma.weight
>> This ABI will have similar API to say
>> echo "mlx4_0 50" > rdma.weight.
>> Where 50 is weight of the resources.
>> For example,
>> for one cgroup instance weight=sum=100% resource for a given cgroup.
>> for three cgroup instances percentage=(weight/sum)% = 50/(50+50+50) = 33%.
>> One cgroup gets 33% resource.
>>
>> Weight can be in range of 1 to 10,000 similar to cpu cgroup.
>
> This is exactly what I don't like, the percentage will remove from the
> user the translation needs between weight and actual limitation.
>
> IMHO CPU used weights because everything there is in weights :).
>
I admit weights are not very intuitive; I was aligning with the other
existing cgroup interfaces which achieve similar functionality.
I will let Tejun approve the new "percentage" or "ratio" file
interface, as it's a little different from weight.

>>
>> This might work if applications running in all cgroups are similar.
>> But weight doesn't do justice, when there are different type of
>> applications running in each cgroup. Such as few running libfabric
>> based apps, few running MPI, others directly using ibverbs.
>> So as you said rdma.max configuration would be required for management
>> plane to override weight (percentage) for certain resources.
>
> Why?
> The device exposes max values during initialization and if user asked
> for 20% percent of HCA, he will get max*0.2.

Because applications are not all equivalent to one another.
For example, some require a one-to-one QP-to-PD mapping.
Some share a single PD across multiple QPs.
Some have a ratio of 100 MRs per QP, as a factor of memory size and operations.
Some servers like to have 1K MRs per QP.
So if we have just weight, it will distribute MRs per QP equally in
all cgroups, and that leads either to unused resources per cgroup or to a
smaller number of cg instances.
So fine tuning is required for individual ones, which we already have.
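A small worked example of the point above, as a sketch only. The device
maximums (1000 QPs, 50000 MRs) and the 20% share are made-up numbers, and
usable_qps is a hypothetical helper, not part of the patchset. Both limits
are scaled by the same factor, so for an MR-heavy application the MR budget,
not the QP budget, decides how many QPs can actually be used:

```python
def usable_qps(qp_limit, mr_limit, mrs_per_qp):
    """QPs an application can really use when every QP needs mrs_per_qp MRs."""
    return min(qp_limit, mr_limit // mrs_per_qp)


# Hypothetical device maximums, both scaled by the same 20% share.
device_max = {"qp": 1000, "mr": 50_000}
qp_limit = device_max["qp"] * 20 // 100   # 200 QPs
mr_limit = device_max["mr"] * 20 // 100   # 10000 MRs

light = usable_qps(qp_limit, mr_limit, 100)     # app with 100 MRs per QP
heavy = usable_qps(qp_limit, mr_limit, 1000)    # server with 1K MRs per QP

print(f"100 MRs/QP: {light} QPs usable, {qp_limit - light} QPs unused")
print(f"1K MRs/QP:  {heavy} QPs usable, {qp_limit - heavy} QPs unused")
```

Under these assumed numbers part of the QP budget stays unused in both
cases, which is the kind of mismatch that per-resource rdma.max tuning is
meant to correct.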

A weight or percentage helps as an abstraction and a starting point, so I
would like to add it too.



>
>>
>>
>> >
>> > Except this proposal,
>> > Reviewed-by: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>> >
>> > Thanks.
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]               ` <CAG53R5UM6nSTZ7=0S9reKGX45CpNBi8soSDVZyXkN-z0_XXWWQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-10-10  7:33                 ` Leon Romanovsky
       [not found]                   ` <20161010073343.GK9282-2ukJVAZIZ/Y@public.gmane.org>
  2016-10-10 12:25                 ` Tejun Heo
  1 sibling, 1 reply; 112+ messages in thread
From: Leon Romanovsky @ 2016-10-10  7:33 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-rdma, Tejun Heo, Li Zefan,
	Johannes Weiner, Doug Ledford, Christoph Hellwig, Liran Liss,
	Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

[-- Attachment #1: Type: text/plain, Size: 6486 bytes --]

On Mon, Oct 10, 2016 at 11:59:45AM +0530, Parav Pandit wrote:
> Hi Leon,
>
> On Mon, Oct 10, 2016 at 10:16 AM, Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> > On Thu, Oct 06, 2016 at 07:19:24PM +0530, Parav Pandit wrote:
> >> Hi Leon,
> >>
> >> On Wed, Oct 5, 2016 at 4:52 PM, Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> >> > On Wed, Aug 31, 2016 at 02:07:24PM +0530, Parav Pandit wrote:
> >> >> rdmacg: IB/core: rdma controller support
> >> >>
> >> >> Overview:
> >> >> Currently user space applications can easily take away all the rdma
> >> >> device specific resources such as AH, CQ, QP, MR etc. Due to which other
> >> >> applications in other cgroup or kernel space ULPs may not even get chance
> >> >> to allocate any rdma resources. This results into service unavailibility.
> >> >>
> >> >> RDMA cgroup addresses this issue by allowing resource accounting,
> >> >> limit enforcement on per cgroup, per rdma device basis.
> >> >>
> >> >> RDMA uverbs layer will enforce limits on well defined RDMA verb
> >> >> resources without any HCA vendor device driver involvement.
> >> >>
> >> >> RDMA uverbs layer will not do limit enforcement of HCA hw vendor
> >> >> specific resources. Instead rdma cgroup provides set of APIs
> >> >> through which vendor specific drivers can do resource accounting
> >> >> by making use of rdma cgroup.
> >> >
> >> > Hi Parav,
> >> > I want to propose an extension to the RDMA cgroup which can be done as
> >> > follow-up patches.
> >> >
> >> > Let's add new global type, which will control whole HCA (for example in percentages). It will
> >> > allow natively define new objects without need to introduce them to the user.
> >> >
> >> In other cgroup such as CPU, this is done using cpu.weight API. Where
> >> percentage or weight is configured by the user.
> >> In this mode, resources taken away from other cgroup proportionately.
> >> It works for cpu because its mainly stateless resource unlike rdma
> >> resources.
> >> So if we want to simplify user configuration similarly,
> >> percentage/weight configuration can be extended.
> >> This way they need not be introduced to users.
> >> I hope your definition of "user" is actual end-user and not rdma cgroup. Right?
> >
> > Yes, "user" -> "admin".
> > I think that percentage is more intuitive to them and will be much easier to
> > explain how to use it. I always have in mind "swappiness" field and the
> > numerous questions on how to configure it.
> >
> >> In other words, new object should be still added as new enum value in
> >> rdma_cgroup.h?
> >
> > Yes, I had in mind something like IB_CGROUP_HCA, this is why it can be
> > done as a future work after accepting current patches.
> >
> What I meant is,
> today we have RDMACG_VERB_RESOURCE_QP etc,
> We will additionally have RDMACG_VERB_RESOURCE_INDIRECT_TBL etc in
> cgroup_rdma.h.
> So that its available for admin to override it.

IMHO, we are talking about the same thing. My global HCA object will be
overridden by more granular VERBS objects in case they exist.

>
> >> Only than it can be overwritten by specific UVERBs type as you
> >> described below. I think thats what you meant as you described below.
> >
> > Exactly.
> >
> >>
> >> Otherwise charging/uncharging this new percentage resource can get messy.
> >
> > Agree
> >
> >>
> >> > This HCA share will be overwritten by specific UVERBS types which you
> >> > already defined.
> >> >
> >> > What do you think?
> >>
> >> So to refine your proposal from cgroup perspective, instead of adding
> >> new resource type in rdma_cgroup.h for percentage, I prefer to have
> >>
> >> Existing
> >> 1. rdma.max
> >> 2. rdma.current
> >> New,
> >> 3. rdma.weight
> >> This ABI will have similar API to say
> >> echo "mlx4_0 50" > rdma.weight.
> >> Where 50 is weight of the resources.
> >> For example,
> >> for one cgroup instance weight=sum=100% resource for a given cgroup.
> >> for three cgroup instances percentage=(weight/sum)% = 50/(50+50+50) = 33%.
> >> One cgroup gets 33% resource.
> >>
> >> Weight can be in range of 1 to 10,000 similar to cpu cgroup.
> >
> > This is exactly what I don't like, the percentage will remove from the
> > user the translation needs between weight and actual limitation.
> >
> > IMHO CPU used weights because everything there is in weights :).
> >
> I admit weight are not very intuitive, I was aligning to the existing
> other cgroup interfaces which achieves similar functionality.
> I will let Tejun approve the "percentage" or "ratio" new file
> interface as its little different than weight.

Sure, let's close the main idea first and see if it makes sense for
other participants.

>
> >>
> >> This might work if applications running in all cgroups are similar.
> >> But weight doesn't do justice, when there are different type of
> >> applications running in each cgroup. Such as few running libfabric
> >> based apps, few running MPI, others directly using ibverbs.
> >> So as you said rdma.max configuration would be required for management
> >> plane to override weight (percentage) for certain resources.
> >
> > Why?
> > The device exposes max values during initialization and if user asked
> > for 20% percent of HCA, he will get max*0.2.
>
> Because every application may not be equivalent of other application.
> For example, some require one to one QP and PD mapping.
> Some share single PD across multiple QPs.
> Some have ratio of 100 MRs per QP, as factor of memory size and operations.
> some servers like to have 1K MRs per QP.
> So if we have just weight, it will equally distributes MRs per QP in
> all cgroup and that either leads to unused resource per cgroup or,
> lesser number of cg instances.
> So fine tuning required for individual one, which we already have.

I'm afraid that this overcomplicates something a curious user can do
in his user-space scripts: limit the global HCA -> read max values ->
overwrite with specific mapping.
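
The three-step flow above can be sketched in Python. This is only an
illustration, not part of the patch set: the function name, the resource
names and the device maxima are hypothetical, and the script is assumed to
already know the real per-device maximum values.

```python
def limits_for_cgroup(device_max, percent, overrides=None):
    """Scale every device maximum by `percent`, then apply explicit overrides."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Steps 1 and 2: limit the global HCA share using the device max values
    # (integer floor of max * percent / 100).
    limits = {name: (value * percent) // 100 for name, value in device_max.items()}
    # Step 3: overwrite specific resource types with an explicit mapping,
    # never exceeding what the device itself reports.
    for name, value in (overrides or {}).items():
        if name not in device_max:
            raise KeyError(f"unknown resource type: {name}")
        limits[name] = min(value, device_max[name])
    return limits


# Hypothetical device maxima; real values would come from the device.
device_max = {"qp": 1000, "mr": 100000, "pd": 500}
print(limits_for_cgroup(device_max, 20, {"mr": 50000}))
```

With these made-up numbers a 20% share gives 200 QPs and 100 PDs, while the
MR limit is overridden to 50000 for an MR-heavy application.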

>
> weight or percentage helps in abstracting as starting point. So I like
> to add it too.

Let's start simple
Thanks.

>
>
>
> >
> >>
> >>
> >> >
> >> > Except this proposal,
> >> > Reviewed-by: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >> >
> >> > Thanks.
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> >> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                   ` <20161010073343.GK9282-2ukJVAZIZ/Y@public.gmane.org>
@ 2016-10-10  8:35                     ` Parav Pandit
       [not found]                       ` <CAG53R5WeWSrJ5-Gtt-cXpUr0r73zh3bqQM_G5zTue27tPtVEXA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 112+ messages in thread
From: Parav Pandit @ 2016-10-10  8:35 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-rdma, Tejun Heo, Li Zefan,
	Johannes Weiner, Doug Ledford, Christoph Hellwig, Liran Liss,
	Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

On Mon, Oct 10, 2016 at 1:03 PM, Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> On Mon, Oct 10, 2016 at 11:59:45AM +0530, Parav Pandit wrote:
>> Hi Leon,
>>
>> On Mon, Oct 10, 2016 at 10:16 AM, Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>> > On Thu, Oct 06, 2016 at 07:19:24PM +0530, Parav Pandit wrote:
>> >> Hi Leon,
>> >>
>> >> On Wed, Oct 5, 2016 at 4:52 PM, Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>> >> > On Wed, Aug 31, 2016 at 02:07:24PM +0530, Parav Pandit wrote:
>> >> >> rdmacg: IB/core: rdma controller support
>> >> >>
>> >> >> Overview:
>> >> >> Currently user space applications can easily take away all the rdma
>> >> >> device specific resources such as AH, CQ, QP, MR etc. Due to which other
>> >> >> applications in other cgroup or kernel space ULPs may not even get chance
>> >> >> to allocate any rdma resources. This results into service unavailibility.
>> >> >>
>> >> >> RDMA cgroup addresses this issue by allowing resource accounting,
>> >> >> limit enforcement on per cgroup, per rdma device basis.
>> >> >>
>> >> >> RDMA uverbs layer will enforce limits on well defined RDMA verb
>> >> >> resources without any HCA vendor device driver involvement.
>> >> >>
>> >> >> RDMA uverbs layer will not do limit enforcement of HCA hw vendor
>> >> >> specific resources. Instead rdma cgroup provides set of APIs
>> >> >> through which vendor specific drivers can do resource accounting
>> >> >> by making use of rdma cgroup.
>> >> >
>> >> > Hi Parav,
>> >> > I want to propose an extension to the RDMA cgroup which can be done as
>> >> > follow-up patches.
>> >> >
>> >> > Let's add new global type, which will control whole HCA (for example in percentages). It will
>> >> > allow natively define new objects without need to introduce them to the user.
>> >> >
>> >> In other cgroup such as CPU, this is done using cpu.weight API. Where
>> >> percentage or weight is configured by the user.
>> >> In this mode, resources taken away from other cgroup proportionately.
>> >> It works for cpu because its mainly stateless resource unlike rdma
>> >> resources.
>> >> So if we want to simplify user configuration similarly,
>> >> percentage/weight configuration can be extended.
>> >> This way they need not be introduced to users.
>> >> I hope your definition of "user" is actual end-user and not rdma cgroup. Right?
>> >
>> > Yes, "user" -> "admin".
>> > I think that percentage is more intuitive to them and will be much easier to
>> > explain how to use it. I always have in mind "swappiness" field and the
>> > numerous questions on how to configure it.
>> >
>> >> In other words, new object should be still added as new enum value in
>> >> rdma_cgroup.h?
>> >
>> > Yes, I had in mind something like IB_CGROUP_HCA, this is why it can be
>> > done as a future work after accepting current patches.
>> >
>> What I meant is,
>> today we have RDMACG_VERB_RESOURCE_QP etc,
>> We will additionally have RDMACG_VERB_RESOURCE_INDIRECT_TBL etc in
>> cgroup_rdma.h.
>> So that its available for admin to override it.
>
> IMHO, we are talking about the same. My global HCA object will be
> overwritten by more granular VERBS objects in case they exists.
>
>>
>> >> Only than it can be overwritten by specific UVERBs type as you
>> >> described below. I think thats what you meant as you described below.
>> >
>> > Exactly.
>> >
>> >>
>> >> Otherwise charging/uncharging this new percentage resource can get messy.
>> >
>> > Agree
>> >
>> >>
>> >> > This HCA share will be overwritten by specific UVERBS types which you
>> >> > already defined.
>> >> >
>> >> > What do you think?
>> >>
>> >> So to refine your proposal from cgroup perspective, instead of adding
>> >> new resource type in rdma_cgroup.h for percentage, I prefer to have
>> >>
>> >> Existing
>> >> 1. rdma.max
>> >> 2. rdma.current
>> >> New,
>> >> 3. rdma.weight
>> >> This ABI will have similar API to say
>> >> echo "mlx4_0 50" > rdma.weight.
>> >> Where 50 is weight of the resources.
>> >> For example,
>> >> for one cgroup instance weight=sum=100% resource for a given cgroup.
>> >> for three cgroup instances percentage=(weight/sum)% = 50/(50+50+50) = 33%.
>> >> One cgroup gets 33% resource.
>> >>
>> >> Weight can be in range of 1 to 10,000 similar to cpu cgroup.
>> >
>> > This is exactly what I don't like, the percentage will remove from the
>> > user the translation needs between weight and actual limitation.
>> >
>> > IMHO CPU used weights because everything there is in weights :).
>> >
>> I admit weight are not very intuitive, I was aligning to the existing
>> other cgroup interfaces which achieves similar functionality.
>> I will let Tejun approve the "percentage" or "ratio" new file
>> interface as its little different than weight.
>
> Sure, let's close the main idea first and see if it makes sense for
> other participants.
>
>>
>> >>
>> >> This might work if applications running in all cgroups are similar.
>> >> But weight doesn't do justice, when there are different type of
>> >> applications running in each cgroup. Such as few running libfabric
>> >> based apps, few running MPI, others directly using ibverbs.
>> >> So as you said rdma.max configuration would be required for management
>> >> plane to override weight (percentage) for certain resources.
>> >
>> > Why?
>> > The device exposes max values during initialization and if user asked
>> > for 20% percent of HCA, he will get max*0.2.
>>
>> Because every application may not be equivalent of other application.
>> For example, some require one to one QP and PD mapping.
>> Some share single PD across multiple QPs.
>> Some have ratio of 100 MRs per QP, as factor of memory size and operations.
>> some servers like to have 1K MRs per QP.
>> So if we have just weight, it will equally distributes MRs per QP in
>> all cgroup and that either leads to unused resource per cgroup or,
>> lesser number of cg instances.
>> So fine tuning required for individual one, which we already have.
>
> I afraid that it is over complicating which can be done by curious user
> in his user-space scripts: limit the global HCA -> read max values ->
> overwrite with specific mapping.
>
>>
>> weight or percentage helps in abstracting as starting point. So I like
>> to add it too.
>
> Let's start simple

Yes. I will rebase and test my patch today and see if it requires resending.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                       ` <CAG53R5WeWSrJ5-Gtt-cXpUr0r73zh3bqQM_G5zTue27tPtVEXA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-10-10  8:52                         ` Leon Romanovsky
       [not found]                           ` <20161010085241.GL9282-2ukJVAZIZ/Y@public.gmane.org>
  0 siblings, 1 reply; 112+ messages in thread
From: Leon Romanovsky @ 2016-10-10  8:52 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-rdma, Tejun Heo, Li Zefan,
	Johannes Weiner, Doug Ledford, Christoph Hellwig, Liran Liss,
	Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

On Mon, Oct 10, 2016 at 02:05:27PM +0530, Parav Pandit wrote:
> On Mon, Oct 10, 2016 at 1:03 PM, Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> > On Mon, Oct 10, 2016 at 11:59:45AM +0530, Parav Pandit wrote:
> >> Hi Leon,
> >>
> >> On Mon, Oct 10, 2016 at 10:16 AM, Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> >> > On Thu, Oct 06, 2016 at 07:19:24PM +0530, Parav Pandit wrote:
> >> >> Hi Leon,
> >> >>
> >> >> On Wed, Oct 5, 2016 at 4:52 PM, Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> >> >> > On Wed, Aug 31, 2016 at 02:07:24PM +0530, Parav Pandit wrote:
> >> >> >> rdmacg: IB/core: rdma controller support
> >> >> >>
> >> >> >> Overview:
> >> >> >> Currently user space applications can easily take away all the rdma
> >> >> >> device specific resources such as AH, CQ, QP, MR etc. Due to which other
> >> >> >> applications in other cgroup or kernel space ULPs may not even get chance
> >> >> >> to allocate any rdma resources. This results into service unavailibility.
> >> >> >>
> >> >> >> RDMA cgroup addresses this issue by allowing resource accounting,
> >> >> >> limit enforcement on per cgroup, per rdma device basis.
> >> >> >>
> >> >> >> RDMA uverbs layer will enforce limits on well defined RDMA verb
> >> >> >> resources without any HCA vendor device driver involvement.
> >> >> >>
> >> >> >> RDMA uverbs layer will not do limit enforcement of HCA hw vendor
> >> >> >> specific resources. Instead rdma cgroup provides set of APIs
> >> >> >> through which vendor specific drivers can do resource accounting
> >> >> >> by making use of rdma cgroup.
> >> >> >
> >> >> > Hi Parav,
> >> >> > I want to propose an extension to the RDMA cgroup which can be done as
> >> >> > follow-up patches.
> >> >> >
> >> >> > Let's add new global type, which will control whole HCA (for example in percentages). It will
> >> >> > allow natively define new objects without need to introduce them to the user.
> >> >> >
> >> >> In other cgroup such as CPU, this is done using cpu.weight API. Where
> >> >> percentage or weight is configured by the user.
> >> >> In this mode, resources taken away from other cgroup proportionately.
> >> >> It works for cpu because its mainly stateless resource unlike rdma
> >> >> resources.
> >> >> So if we want to simplify user configuration similarly,
> >> >> percentage/weight configuration can be extended.
> >> >> This way they need not be introduced to users.
> >> >> I hope your definition of "user" is actual end-user and not rdma cgroup. Right?
> >> >
> >> > Yes, "user" -> "admin".
> >> > I think that percentage is more intuitive to them and will be much easier to
> >> > explain how to use it. I always have in mind "swappiness" field and the
> >> > numerous questions on how to configure it.
> >> >
> >> >> In other words, new object should be still added as new enum value in
> >> >> rdma_cgroup.h?
> >> >
> >> > Yes, I had in mind something like IB_CGROUP_HCA, this is why it can be
> >> > done as a future work after accepting current patches.
> >> >
> >> What I meant is,
> >> today we have RDMACG_VERB_RESOURCE_QP etc,
> >> We will additionally have RDMACG_VERB_RESOURCE_INDIRECT_TBL etc in
> >> cgroup_rdma.h.
> >> So that its available for admin to override it.
> >
> > IMHO, we are talking about the same. My global HCA object will be
> > overwritten by more granular VERBS objects in case they exists.
> >
> >>
> >> >> Only than it can be overwritten by specific UVERBs type as you
> >> >> described below. I think thats what you meant as you described below.
> >> >
> >> > Exactly.
> >> >
> >> >>
> >> >> Otherwise charging/uncharging this new percentage resource can get messy.
> >> >
> >> > Agree
> >> >
> >> >>
> >> >> > This HCA share will be overwritten by specific UVERBS types which you
> >> >> > already defined.
> >> >> >
> >> >> > What do you think?
> >> >>
> >> >> So to refine your proposal from cgroup perspective, instead of adding
> >> >> new resource type in rdma_cgroup.h for percentage, I prefer to have
> >> >>
> >> >> Existing
> >> >> 1. rdma.max
> >> >> 2. rdma.current
> >> >> New,
> >> >> 3. rdma.weight
> >> >> This ABI will have similar API to say
> >> >> echo "mlx4_0 50" > rdma.weight.
> >> >> Where 50 is weight of the resources.
> >> >> For example,
> >> >> for one cgroup instance weight=sum=100% resource for a given cgroup.
> >> >> for three cgroup instances percentage=(weight/sum)% = 50/(50+50+50) = 33%.
> >> >> One cgroup gets 33% resource.
> >> >>
> >> >> Weight can be in range of 1 to 10,000 similar to cpu cgroup.
> >> >
> >> > This is exactly what I don't like, the percentage will remove from the
> >> > user the translation needs between weight and actual limitation.
> >> >
> >> > IMHO CPU used weights because everything there is in weights :).
> >> >
> >> I admit weight are not very intuitive, I was aligning to the existing
> >> other cgroup interfaces which achieves similar functionality.
> >> I will let Tejun approve the "percentage" or "ratio" new file
> >> interface as its little different than weight.
> >
> > Sure, let's close the main idea first and see if it makes sense for
> > other participants.
> >
> >>
> >> >>
> >> >> This might work if applications running in all cgroups are similar.
> >> >> But weight doesn't do justice, when there are different type of
> >> >> applications running in each cgroup. Such as few running libfabric
> >> >> based apps, few running MPI, others directly using ibverbs.
> >> >> So as you said rdma.max configuration would be required for management
> >> >> plane to override weight (percentage) for certain resources.
> >> >
> >> > Why?
> >> > The device exposes max values during initialization and if user asked
> >> > for 20% percent of HCA, he will get max*0.2.
> >>
> >> Because every application may not be equivalent of other application.
> >> For example, some require one to one QP and PD mapping.
> >> Some share single PD across multiple QPs.
> >> Some have ratio of 100 MRs per QP, as factor of memory size and operations.
> >> some servers like to have 1K MRs per QP.
> >> So if we have just weight, it will equally distributes MRs per QP in
> >> all cgroup and that either leads to unused resource per cgroup or,
> >> lesser number of cg instances.
> >> So fine tuning required for individual one, which we already have.
> >
> > I afraid that it is over complicating which can be done by curious user
> > in his user-space scripts: limit the global HCA -> read max values ->
> > overwrite with specific mapping.
> >
> >>
> >> weight or percentage helps in abstracting as starting point. So I like
> >> to add it too.
> >
> > Let's start simple
>
> Yes. I will rebase and test my patch today and see if requires resending.

It is worth waiting until -rc1. Doug hasn't finished his pull requests yet.

Thanks

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                           ` <20161010085241.GL9282-2ukJVAZIZ/Y@public.gmane.org>
@ 2016-10-10  9:22                             ` Parav Pandit
  0 siblings, 0 replies; 112+ messages in thread
From: Parav Pandit @ 2016-10-10  9:22 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-rdma, Tejun Heo, Li Zefan,
	Johannes Weiner, Doug Ledford, Christoph Hellwig, Liran Liss,
	Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

On Mon, Oct 10, 2016 at 2:22 PM, Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> On Mon, Oct 10, 2016 at 02:05:27PM +0530, Parav Pandit wrote:
>> On Mon, Oct 10, 2016 at 1:03 PM, Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>> > On Mon, Oct 10, 2016 at 11:59:45AM +0530, Parav Pandit wrote:
>> >> Hi Leon,
>> >>
>> >> On Mon, Oct 10, 2016 at 10:16 AM, Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>> >> > On Thu, Oct 06, 2016 at 07:19:24PM +0530, Parav Pandit wrote:
>> >> >> Hi Leon,
>> >> >>
>> >> >> On Wed, Oct 5, 2016 at 4:52 PM, Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>> >> >> > On Wed, Aug 31, 2016 at 02:07:24PM +0530, Parav Pandit wrote:
>> >> >> >> rdmacg: IB/core: rdma controller support
>> >> >> >>
>> >> >> >> Overview:
>> >> >> >> Currently user space applications can easily take away all the rdma
>> >> >> >> device specific resources such as AH, CQ, QP, MR etc. Due to which other
>> >> >> >> applications in other cgroup or kernel space ULPs may not even get chance
>> >> >> >> to allocate any rdma resources. This results into service unavailibility.
>> >> >> >>
>> >> >> >> RDMA cgroup addresses this issue by allowing resource accounting,
>> >> >> >> limit enforcement on per cgroup, per rdma device basis.
>> >> >> >>
>> >> >> >> RDMA uverbs layer will enforce limits on well defined RDMA verb
>> >> >> >> resources without any HCA vendor device driver involvement.
>> >> >> >>
>> >> >> >> RDMA uverbs layer will not do limit enforcement of HCA hw vendor
>> >> >> >> specific resources. Instead rdma cgroup provides set of APIs
>> >> >> >> through which vendor specific drivers can do resource accounting
>> >> >> >> by making use of rdma cgroup.
>> >> >> >
>> >> >> > Hi Parav,
>> >> >> > I want to propose an extension to the RDMA cgroup which can be done as
>> >> >> > follow-up patches.
>> >> >> >
>> >> >> > Let's add new global type, which will control whole HCA (for example in percentages). It will
>> >> >> > allow natively define new objects without need to introduce them to the user.
>> >> >> >
>> >> >> In other cgroup such as CPU, this is done using cpu.weight API. Where
>> >> >> percentage or weight is configured by the user.
>> >> >> In this mode, resources taken away from other cgroup proportionately.
>> >> >> It works for cpu because its mainly stateless resource unlike rdma
>> >> >> resources.
>> >> >> So if we want to simplify user configuration similarly,
>> >> >> percentage/weight configuration can be extended.
>> >> >> This way they need not be introduced to users.
>> >> >> I hope your definition of "user" is actual end-user and not rdma cgroup. Right?
>> >> >
>> >> > Yes, "user" -> "admin".
>> >> > I think that percentage is more intuitive to them and will be much easier to
>> >> > explain how to use it. I always have in mind "swappiness" field and the
>> >> > numerous questions on how to configure it.
>> >> >
>> >> >> In other words, new object should be still added as new enum value in
>> >> >> rdma_cgroup.h?
>> >> >
>> >> > Yes, I had in mind something like IB_CGROUP_HCA, this is why it can be
>> >> > done as a future work after accepting current patches.
>> >> >
>> >> What I meant is,
>> >> today we have RDMACG_VERB_RESOURCE_QP etc,
>> >> We will additionally have RDMACG_VERB_RESOURCE_INDIRECT_TBL etc in
>> >> cgroup_rdma.h.
>> >> So that its available for admin to override it.
>> >
>> > IMHO, we are talking about the same. My global HCA object will be
>> > overwritten by more granular VERBS objects in case they exists.
>> >
>> >>
>> >> >> Only than it can be overwritten by specific UVERBs type as you
>> >> >> described below. I think thats what you meant as you described below.
>> >> >
>> >> > Exactly.
>> >> >
>> >> >>
>> >> >> Otherwise charging/uncharging this new percentage resource can get messy.
>> >> >
>> >> > Agree
>> >> >
>> >> >>
>> >> >> > This HCA share will be overwritten by specific UVERBS types which you
>> >> >> > already defined.
>> >> >> >
>> >> >> > What do you think?
>> >> >>
>> >> >> So to refine your proposal from cgroup perspective, instead of adding
>> >> >> new resource type in rdma_cgroup.h for percentage, I prefer to have
>> >> >>
>> >> >> Existing
>> >> >> 1. rdma.max
>> >> >> 2. rdma.current
>> >> >> New,
>> >> >> 3. rdma.weight
>> >> >> This ABI will have similar API to say
>> >> >> echo "mlx4_0 50" > rdma.weight.
>> >> >> Where 50 is weight of the resources.
>> >> >> For example,
>> >> >> for one cgroup instance weight=sum=100% resource for a given cgroup.
>> >> >> for three cgroup instances percentage=(weight/sum)% = 50/(50+50+50) = 33%.
>> >> >> One cgroup gets 33% resource.
>> >> >>
>> >> >> Weight can be in range of 1 to 10,000 similar to cpu cgroup.
>> >> >
>> >> > This is exactly what I don't like, the percentage will remove from the
>> >> > user the translation needs between weight and actual limitation.
>> >> >
>> >> > IMHO CPU used weights because everything there is in weights :).
>> >> >
>> >> I admit weight are not very intuitive, I was aligning to the existing
>> >> other cgroup interfaces which achieves similar functionality.
>> >> I will let Tejun approve the "percentage" or "ratio" new file
>> >> interface as its little different than weight.
>> >
>> > Sure, let's close the main idea first and see if it makes sense for
>> > other participants.
>> >
>> >>
>> >> >>
>> >> >> This might work if applications running in all cgroups are similar.
>> >> >> But weight doesn't do justice, when there are different type of
>> >> >> applications running in each cgroup. Such as few running libfabric
>> >> >> based apps, few running MPI, others directly using ibverbs.
>> >> >> So as you said rdma.max configuration would be required for management
>> >> >> plane to override weight (percentage) for certain resources.
>> >> >
>> >> > Why?
>> >> > The device exposes max values during initialization and if user asked
>> >> > for 20% percent of HCA, he will get max*0.2.
>> >>
>> >> Because every application may not be equivalent of other application.
>> >> For example, some require one to one QP and PD mapping.
>> >> Some share single PD across multiple QPs.
>> >> Some have ratio of 100 MRs per QP, as factor of memory size and operations.
>> >> some servers like to have 1K MRs per QP.
>> >> So if we have just weight, it will equally distributes MRs per QP in
>> >> all cgroup and that either leads to unused resource per cgroup or,
>> >> lesser number of cg instances.
>> >> So fine tuning required for individual one, which we already have.
>> >
>> > I afraid that it is over complicating which can be done by curious user
>> > in his user-space scripts: limit the global HCA -> read max values ->
>> > overwrite with specific mapping.
>> >
>> >>
>> >> weight or percentage helps in abstracting as starting point. So I like
>> >> to add it too.
>> >
>> > Let's start simple
>>
>> Yes. I will rebase and test my patch today and see if requires resending.
>
> It is worth to wait till -rc1. Doug didn't finish his pull requests yet.
>
Alright. I will wait.

> Thanks

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]               ` <CAG53R5UM6nSTZ7=0S9reKGX45CpNBi8soSDVZyXkN-z0_XXWWQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2016-10-10  7:33                 ` Leon Romanovsky
@ 2016-10-10 12:25                 ` Tejun Heo
       [not found]                   ` <20161010122545.GA27360-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
  1 sibling, 1 reply; 112+ messages in thread
From: Tejun Heo @ 2016-10-10 12:25 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Leon Romanovsky, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-rdma,
	Li Zefan, Johannes Weiner, Doug Ledford, Christoph Hellwig,
	Liran Liss, Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

Hello, Parav.

On Mon, Oct 10, 2016 at 11:59:45AM +0530, Parav Pandit wrote:
> >> Weight can be in range of 1 to 10,000 similar to cpu cgroup.
> >
> > This is exactly what I don't like, the percentage will remove from the
> > user the translation needs between weight and actual limitation.
> >
> > IMHO CPU used weights because everything there is in weights :).
>
> I admit weight are not very intuitive, I was aligning to the existing
> other cgroup interfaces which achieves similar functionality.
> I will let Tejun approve the "percentage" or "ratio" new file
> interface as its little different than weight.

So, if there is gonna be a proportional control mechanism, it should
use the same interface convention as other proportional controls.
Also, I don't get what you mean by using percentage and when people
brought up this idea, it always has been stemming from
misunderstanding.  Can you please elaborate how percentage based
proportional control would work?  What would 100% mean when cgroups
can come and go?

If you're suggesting expressing absolute limits in terms of
percentage, that is not proportional control.  That's just using a
different unit for absolute resource limits and it must not be called
weight.

> weight or percentage helps in abstracting as starting point. So I like
> to add it too.

Way back, when rdmacg was proposed first, I asked the same question -
whether there can be a higher level abstraction for rdma resources,
and, IIRC, the collective answer was that there can be no
universal measure of resources in the kernel because a large part of
resource management actually takes place in userspace.

If I misunderstood, please correct me but what's being suggested here
seems to be implementing the knob in rdmacg and letting the specific
drivers worry about the actual resource provisioning and even then
there doesn't seem to be a clear way of semantically defining what
ratios would mean.

Let's please first establish what the resource control would exactly
mean.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                   ` <20161010122545.GA27360-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
@ 2016-10-10 13:13                     ` Parav Pandit
       [not found]                       ` <CAG53R5V5yE4PsDBjP9BieG_=39M0G1kx-AfBEzWK4LUCxNnYBA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 112+ messages in thread
From: Parav Pandit @ 2016-10-10 13:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Leon Romanovsky, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-rdma,
	Li Zefan, Johannes Weiner, Doug Ledford, Christoph Hellwig,
	Liran Liss, Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

Hi Tejun,

On Mon, Oct 10, 2016 at 5:55 PM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> Hello, Parav.
>
> On Mon, Oct 10, 2016 at 11:59:45AM +0530, Parav Pandit wrote:
>> >> Weight can be in range of 1 to 10,000 similar to cpu cgroup.
>> >
>> > This is exactly what I don't like, the percentage will remove from the
>> > user the translation needs between weight and actual limitation.
>> >
>> > IMHO CPU used weights because everything there is in weights :).
>>
>> I admit weight are not very intuitive, I was aligning to the existing
>> other cgroup interfaces which achieves similar functionality.
>> I will let Tejun approve the "percentage" or "ratio" new file
>> interface as its little different than weight.
>
> So, if there is gonna be a proportional control mechanism, it should
> use the same interface convention as other proportional controls.
That's what I was suggesting to Leon.

> Also, I don't get what you mean by using percentage and when people
> brought up this idea, it always has been stemming from
> misunderstanding.  Can you please elaborate how percentage based
> proportional control would work?  What would 100% mean when cgroups
> can come and go?
When 100% is given to one cgroup, all resources of all types can be
charged by processes of that cgroup.
RDMA resources are stateful, so when a cgroup goes away, they go
back to the global pool (or hw).
Giving 100% to two cgroups is a configuration error anyway (or without config).

>
> If you're suggesting expressing absolute limits in terms of
> percentage, that is not proportional control.  That's just using a
> different unit for absolute resource limits and it must not be called
> weight.
>
>> weight or percentage helps in abstracting as starting point. So I like
>> to add it too.
>
> Way back, when rdmacg was proposed first, I asked the same question -
> whether there can be a higher level abstraction for rdma resources,
> and, IIRC, the collective answer was that there can be no
> universal measure of resources in the kernel because a large part of
> resource management actually takes place in userspace.
>
Yes. This still holds true. I don't think anything changed.
A proportional knob in the kernel is a constrained knob that
distributes all types of resources equally.
This works only for one class of application. That's the primary reason
we have a knob for well-defined verb-level resources.

(This is a subset of what patch v12 has to offer, where such
configuration can be done by user space.)

What is not part of patch v12: currently the max limit is shown as
"max" instead of the real max value.
Because of that, a user space tool cannot work out the exact value to
program in order to implement a ratio.
This new functionality can be added after this patch, as it is an
incremental feature for user space tools.

As you know, weight configuration allows automatic increase/decrease of
resources for other cgroups when one of them goes away, as opposed to an
absolute value. How this is going to work in exact terms for stateful
resources, we don't know yet. I haven't thought through the design
from the kernel side either. So I just started exploring whether the
weight interface can be leveraged.

> If I misunderstood, please correct me but what's being suggested here
> seems to be implementing the knob in rdmacg and letting the specific
> drivers worry about the actual resource provisioning and even then
> there doesn't seem to be a clear way of semantically defining what
> ratios would mean.

Nope. That's not true.
(a) Every new resource has to be defined in cgroup_rdma.h.
(b) charge()/uncharge() has to be done by the cgroup for each one.
(c) Letting drivers do it would make things fall apart. There are no APIs
exposed to let drivers know the process's cgroup, and there is no
intention to add any.

(d) Ratio means: if the adapter has
100 resources of type A,
80 resources of type B,

then 10% for cgroup-1 means
10 resources of type A,
8 resources of type B.
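
As an illustration only, the ratio in (d) could be sketched in Python as
below. The function name `limits_from_percent` and the resource names are
hypothetical and not part of the patchset; the sketch assumes whole-number
percentages and integer (floor) division.

```python
def limits_from_percent(capacity, percent):
    """Return per-resource absolute limits for a whole-number percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Integer floor division only; no floating point is involved.
    return {name: total * percent // 100 for name, total in capacity.items()}


# Adapter from the example above: 100 resources of type A, 80 of type B.
adapter = {"A": 100, "B": 80}
cgroup_1 = limits_from_percent(adapter, 10)
print(cgroup_1)  # {'A': 10, 'B': 8}
```

Each resource type is scaled independently, so one percentage yields a
different absolute limit per type.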

>
> Let's please first establish what the resource control would exactly
> mean.
>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                       ` <CAG53R5V5yE4PsDBjP9BieG_=39M0G1kx-AfBEzWK4LUCxNnYBA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-10-10 13:20                         ` Tejun Heo
       [not found]                           ` <20161010132014.GD29742-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
  0 siblings, 1 reply; 112+ messages in thread
From: Tejun Heo @ 2016-10-10 13:20 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Leon Romanovsky, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-rdma,
	Li Zefan, Johannes Weiner, Doug Ledford, Christoph Hellwig,
	Liran Liss, Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

Hello, Parav.

On Mon, Oct 10, 2016 at 06:43:59PM +0530, Parav Pandit wrote:
> > Also, I don't get what you mean by using percentage and when people
> > brought up this idea, it always has been stemming from
> > misunderstanding.  Can you please elaborate how percentage based
> > proportional control would work?  What would 100% mean when cgroups
> > can come and go?
>
> When 100% is given to one cgroup, all resources of all type can be
> charged by processes of that cgroup.
> Resources are stateful resource. So when cgroup goes away, they go
> back to global pool (or hw).
> Giving 100% to two cgroups is configuration error anyway (or without config).

That isn't proportional control.  That's using percentage as the unit
to implement absolute limits.  Proportional control implies work
conservation.

> As you know weight configuration allows automatic increase/decrease of
> resource to other cgroups when one of them go away, as opposed to
> absolute value. How this is going to work in exact terms for stateful

Hmm.... so are you saying that it'd be work-conserving?  But what does
it mean to say "30%" and then have it get all resources when there are no
other users?  Also, is it even possible to take back what has already
been allocated and is in use?

> Nop. Thats not true.
> (a) Every new resource has to be defined in cgroup_rdma.h
> (b) charge()/uncharge() has to happen by the cgroup for each.
> (c) Letting drivers do will make things fall apart. There are no APIs
> exposed either to let drivers know process cgroup either. There is no
> intention either.
> 
> (d) ratio means -if adapter has
> 100 resources of type - A,
> 80 resource of type - B,
> 
> 10% for cgroup-1 means,
> 10 resource of type - A
> 8 resource of type - B

So, this is not work-conserving.  There's too much confusion here.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                           ` <20161010132014.GD29742-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
@ 2016-10-10 13:32                             ` Parav Pandit
       [not found]                               ` <CAG53R5ULKCqtw45E6t4hYdRV+y_OQqVazf=7A7Ax_XAJ2K0_dw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 112+ messages in thread
From: Parav Pandit @ 2016-10-10 13:32 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Leon Romanovsky, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-rdma,
	Li Zefan, Johannes Weiner, Doug Ledford, Christoph Hellwig,
	Liran Liss, Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

Hi Tejun,

On Mon, Oct 10, 2016 at 6:50 PM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> Hello, Parav.
>
> On Mon, Oct 10, 2016 at 06:43:59PM +0530, Parav Pandit wrote:
>> > Also, I don't get what you mean by using percentage and when people
>> > brought up this idea, it always has been stemming from
>> > misunderstanding.  Can you please elaborate how percentage based
>> > proportional control would work?  What would 100% mean when cgroups
>> > can come and go?
>>
>> When 100% is given to one cgroup, all resources of all type can be
>> charged by processes of that cgroup.
>> Resources are stateful resource. So when cgroup goes away, they go
>> back to global pool (or hw).
>> Giving 100% to two cgroups is configuration error anyway (or without config).
>
> That isn't proportional control.  That's using percentage as the unit
> to implement absolute limits.  Proportional control implies work
> conservation.
>
>> As you know weight configuration allows automatic increase/decrease of
>> resource to other cgroups when one of them go away, as opposed to
>> absolute value. How this is going to work in exact terms for stateful
>
> Hmm.... so are you saying that ti'd be work-conserving?
They cannot be work-conserving.

> But what does
> it mean to say "30%" and then have it all resources when there are no
> other users.  Also, is it even possible to take back what have already
> been allocated and are in use?
>
Most resources that I know of, and those described in the current
cgroup_rdma.h, are not work-conserving; therefore they cannot be taken
back.

>> Nop. Thats not true.
>> (a) Every new resource has to be defined in cgroup_rdma.h
>> (b) charge()/uncharge() has to happen by the cgroup for each.
>> (c) Letting drivers do will make things fall apart. There are no APIs
>> exposed either to let drivers know process cgroup either. There is no
>> intention either.
>>
>> (d) ratio means -if adapter has
>> 100 resources of type - A,
>> 80 resource of type - B,
>>
>> 10% for cgroup-1 means,
>> 10 resource of type - A
>> 8 resource of type - B
>
> So, this is not work-conserving.  There's too much confusion here.

Give me some more time. I will think more and take feedback from Leon
and others on
(a) how we can, or want to, implement weight-like
functionality for non-work-conserving resources
(b) what the acceptable limitations of that interface would be
before we propose it to you.

At minimum we would need to expose the actual value in rdma.max in a
subsequent patch, instead of exposing just the "max" string. I don't want
to complicate this discussion, but similar functionality is needed for
the pid controller as well to expose the actual value.

>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                               ` <CAG53R5ULKCqtw45E6t4hYdRV+y_OQqVazf=7A7Ax_XAJ2K0_dw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-10-13 10:34                                 ` Leon Romanovsky
       [not found]                                   ` <20161013103430.GB9282-2ukJVAZIZ/Y@public.gmane.org>
  2016-10-13 23:14                                 ` Tejun Heo
  1 sibling, 1 reply; 112+ messages in thread
From: Leon Romanovsky @ 2016-10-13 10:34 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-rdma, Li Zefan,
	Johannes Weiner, Doug Ledford, Christoph Hellwig, Liran Liss,
	Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

On Mon, Oct 10, 2016 at 07:02:11PM +0530, Parav Pandit wrote:
> Hi Tejun,
>
> On Mon, Oct 10, 2016 at 6:50 PM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> > Hello, Parav.
> >
> > On Mon, Oct 10, 2016 at 06:43:59PM +0530, Parav Pandit wrote:
> >> > Also, I don't get what you mean by using percentage and when people
> >> > brought up this idea, it always has been stemming from
> >> > misunderstanding.  Can you please elaborate how percentage based
> >> > proportional control would work?  What would 100% mean when cgroups
> >> > can come and go?
> >>
> >> When 100% is given to one cgroup, all resources of all type can be
> >> charged by processes of that cgroup.
> >> Resources are stateful resource. So when cgroup goes away, they go
> >> back to global pool (or hw).
> >> Giving 100% to two cgroups is configuration error anyway (or without config).
> >
> > That isn't proportional control.  That's using percentage as the unit
> > to implement absolute limits.  Proportional control implies work
> > conservation.
> >
> >> As you know weight configuration allows automatic increase/decrease of
> >> resource to other cgroups when one of them go away, as opposed to
> >> absolute value. How this is going to work in exact terms for stateful
> >
> > Hmm.... so are you saying that ti'd be work-conserving?
> They cannot be work conversing.
>
> > But what does
> > it mean to say "30%" and then have it all resources when there are no
> > other users.  Also, is it even possible to take back what have already
> > been allocated and are in use?
> >
> Most resources that I know of, and whats described in current
> cgroup_rdma.h are not work conversing, therefore it cannot be taken
> back.
>
> >> Nop. Thats not true.
> >> (a) Every new resource has to be defined in cgroup_rdma.h
> >> (b) charge()/uncharge() has to happen by the cgroup for each.
> >> (c) Letting drivers do will make things fall apart. There are no APIs
> >> exposed either to let drivers know process cgroup either. There is no
> >> intention either.
> >>
> >> (d) ratio means -if adapter has
> >> 100 resources of type - A,
> >> 80 resource of type - B,
> >>
> >> 10% for cgroup-1 means,
> >> 10 resource of type - A
> >> 8 resource of type - B
> >
> > So, this is not work-conserving.  There's too much confusion here.
>
> Give me some more time, I will think more and take feeback from Leon
> and others on
> (a) how can we implement or want to implement weight like
> functionality for non-work-conversing resource

I don't fully understand the question. RDMA resources are static and
won't be recalculated dynamically for running cgroup consumers when a
new cgroup is started. In this situation, weights and percentages are
the same.

> (b) what could be its acceptable limitations of that interface would be
> before we propose you.

It is easier to summarize the requirements:
1. It needs to be convenient for users.
2. It can limit any future objects without changes to user tools.

>
> At minimum we would need to expose actual value in rdma.max in
> subsequent patch, instead of exposing just "max" string. I don't want
> to complicate this discussion but similar functionality is needed for
> pid controller as well to expose actual value.
>
> >
> > Thanks.
> >
> > --
> > tejun

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                                   ` <20161013103430.GB9282-2ukJVAZIZ/Y@public.gmane.org>
@ 2016-10-13 11:04                                     ` Parav Pandit
  0 siblings, 0 replies; 112+ messages in thread
From: Parav Pandit @ 2016-10-13 11:04 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-rdma, Li Zefan,
	Johannes Weiner, Doug Ledford, Christoph Hellwig, Liran Liss,
	Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

Hi Leon,


On Thu, Oct 13, 2016 at 4:04 PM, Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> On Mon, Oct 10, 2016 at 07:02:11PM +0530, Parav Pandit wrote:
>> Hi Tejun,
>>
>> On Mon, Oct 10, 2016 at 6:50 PM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>> > Hello, Parav.
>> >
>> > On Mon, Oct 10, 2016 at 06:43:59PM +0530, Parav Pandit wrote:
>> >> > Also, I don't get what you mean by using percentage and when people
>> >> > brought up this idea, it always has been stemming from
>> >> > misunderstanding.  Can you please elaborate how percentage based
>> >> > proportional control would work?  What would 100% mean when cgroups
>> >> > can come and go?
>> >>
>> >> When 100% is given to one cgroup, all resources of all type can be
>> >> charged by processes of that cgroup.
>> >> Resources are stateful resource. So when cgroup goes away, they go
>> >> back to global pool (or hw).
>> >> Giving 100% to two cgroups is configuration error anyway (or without config).
>> >
>> > That isn't proportional control.  That's using percentage as the unit
>> > to implement absolute limits.  Proportional control implies work
>> > conservation.
>> >
>> >> As you know weight configuration allows automatic increase/decrease of
>> >> resource to other cgroups when one of them go away, as opposed to
>> >> absolute value. How this is going to work in exact terms for stateful
>> >
>> > Hmm.... so are you saying that ti'd be work-conserving?
>> They cannot be work conversing.
>>
>> > But what does
>> > it mean to say "30%" and then have it all resources when there are no
>> > other users.  Also, is it even possible to take back what have already
>> > been allocated and are in use?
>> >
>> Most resources that I know of, and whats described in current
>> cgroup_rdma.h are not work conversing, therefore it cannot be taken
>> back.
>>
>> >> Nop. Thats not true.
>> >> (a) Every new resource has to be defined in cgroup_rdma.h
>> >> (b) charge()/uncharge() has to happen by the cgroup for each.
>> >> (c) Letting drivers do will make things fall apart. There are no APIs
>> >> exposed either to let drivers know process cgroup either. There is no
>> >> intention either.
>> >>
>> >> (d) ratio means -if adapter has
>> >> 100 resources of type - A,
>> >> 80 resource of type - B,
>> >>
>> >> 10% for cgroup-1 means,
>> >> 10 resource of type - A
>> >> 8 resource of type - B
>> >
>> > So, this is not work-conserving.  There's too much confusion here.
>>
>> Give me some more time, I will think more and take feeback from Leon
>> and others on
>> (a) how can we implement or want to implement weight like
>> functionality for non-work-conversing resource
>
> I'm not fully understand the question. RDMA resources are static and
> won't be recalculated dynamically for running cgroups consumers while
> new cgroup is started. In this situation, weights and percentages are
> the same.
>
Let me try again with a weights example.

The total resources distributed to a cgroup are based on this equation:

resource_of_cg = weight_of_cg * max_resource_of_hca / (sum of all weights)

Say there is only one cgroup, cg-A, with a weight of 20, and
max_resource_of_hca = 100.

So resource_of_cg_A = (20 * 100) / 20 = 100 (all resources).

Now a new cgroup, cg-B, is created with a weight of 20.
The sum of weights is now 40, so cg-A and cg-B each get 50 resources.

When cg-C is created with a weight of 20, each cgroup gets 33 resources.

This means that resources are recalculated dynamically as
cgroups are created or removed.
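
The equation above can be sketched in Python as follows. This is only an
illustration of the arithmetic; `shares_by_weight` is a hypothetical helper,
not a kernel or patchset API, and it assumes integer (floor) division.

```python
def shares_by_weight(weights, max_resource_of_hca):
    """Split max_resource_of_hca among cgroups in proportion to weight."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        return {cg: 0 for cg in weights}
    # resource_of_cg = weight_of_cg * max_resource_of_hca / (sum of all weights)
    return {cg: w * max_resource_of_hca // total_weight
            for cg, w in weights.items()}


weights = {"cg-A": 20}
print(shares_by_weight(weights, 100))  # {'cg-A': 100}

weights["cg-B"] = 20
print(shares_by_weight(weights, 100))  # {'cg-A': 50, 'cg-B': 50}

weights["cg-C"] = 20
print(shares_by_weight(weights, 100))  # {'cg-A': 33, 'cg-B': 33, 'cg-C': 33}
```

Every call recomputes all shares from the current set of weights, which is
the dynamic recalculation described above. The sketch does not model what
happens to resources a cgroup has already allocated.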

>> (b) what could be its acceptable limitations of that interface would be
>> before we propose you.
>
> More easy is to summarize requirements:
> 1. It needs to be convenient for users.
> 2. It can limit any future objects without change in user tools.

Why don't we have such requirements on the actual dataplane and
control plane API side (similar to having abstract socket APIs)?
Instead we expect applications to change to make use of new verbs
objects for performance, functionality, etc.

New future objects can be limited if we introduce a weights/percentage
knob, but at the cost of not being able to tune for performance.
Usually end users will use application templates when they deploy
specific applications, such as mongodb, an MPI cluster, a glusterFS
cluster, etc.
Those application-specific plug-ins would program the exact ratio of MR
to QP, or PD to QP, etc. by writing values to rdma.max depending on MPI
rank, cluster size, etc.

The weights API allows the values for existing cgroups to be adjusted
automatically when new cgroups are added or removed, in cases where the
deployed application is not well defined.

>
>>
>> At minimum we would need to expose actual value in rdma.max in
>> subsequent patch, instead of exposing just "max" string. I don't want
>> to complicate this discussion but similar functionality is needed for
>> pid controller as well to expose actual value.
>>
>> >
>> > Thanks.
>> >
>> > --
>> > tejun

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                               ` <CAG53R5ULKCqtw45E6t4hYdRV+y_OQqVazf=7A7Ax_XAJ2K0_dw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2016-10-13 10:34                                 ` Leon Romanovsky
@ 2016-10-13 23:14                                 ` Tejun Heo
       [not found]                                   ` <20161013231413.GA32534-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
  1 sibling, 1 reply; 112+ messages in thread
From: Tejun Heo @ 2016-10-13 23:14 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Leon Romanovsky, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-rdma,
	Li Zefan, Johannes Weiner, Doug Ledford, Christoph Hellwig,
	Liran Liss, Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

Hello, Parav.

On Mon, Oct 10, 2016 at 07:02:11PM +0530, Parav Pandit wrote:
> Give me some more time, I will think more and take feeback from Leon
> and others on
> (a) how can we implement or want to implement weight like
> functionality for non-work-conversing resource

I think what you want is just a way to specify absolute limits using
percentages of what's available at the time of configuration -
e.g. being able to say "allow up to 30% of what's available in the
parent".  If so, the simplest way would be simply updating the
existing knobs to accept % inputs in addition to absolute values on
writes.
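
As a rough illustration of a knob that accepts % inputs in addition to
absolute values, a parser could look like the Python sketch below. The
function `parse_limit`, its return convention (None for "max") and the
restriction to whole-number percentages are assumptions made for this
example; none of it is taken from the patchset.

```python
def parse_limit(text, parent_available):
    """Turn a written value into an absolute limit.

    Accepts "max" (no limit, returned as None), a percentage such as
    "30%" of what the parent has available, or a plain absolute number.
    """
    text = text.strip()
    if text == "max":
        return None
    if text.endswith("%"):
        percent = int(text[:-1])
        if not 0 <= percent <= 100:
            raise ValueError("percentage must be between 0 and 100")
        # Resolved once, at configuration time, into an absolute limit.
        return parent_available * percent // 100
    limit = int(text)
    if limit < 0:
        raise ValueError("limit must not be negative")
    return limit


print(parse_limit("30%", 200))  # 60
print(parse_limit("45", 200))   # 45
print(parse_limit("max", 200))  # None
```

The percentage is converted when the value is written, so the result is an
ordinary absolute limit and does not change when other cgroups come and go.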

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                                   ` <20161013231413.GA32534-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
@ 2016-10-18 20:02                                     ` Parav Pandit
       [not found]                                       ` <CAG53R5UciPpa5d8BWyR-tks3LBrBwRCN2NyBbbm1e3EE-OWSYQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 112+ messages in thread
From: Parav Pandit @ 2016-10-18 20:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Leon Romanovsky, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-rdma,
	Li Zefan, Johannes Weiner, Doug Ledford, Christoph Hellwig,
	Liran Liss, Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

Hi Tejun,

Sorry for the late response. I was traveling during the weekend.

On Fri, Oct 14, 2016 at 4:44 AM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> Hello, Parav.
>
> On Mon, Oct 10, 2016 at 07:02:11PM +0530, Parav Pandit wrote:
>> Give me some more time, I will think more and take feeback from Leon
>> and others on
>> (a) how can we implement or want to implement weight like
>> functionality for non-work-conversing resource
>
> I think what you want is just a way to specify absolute limits using
> percentages of what's available at the time of configuration -
> e.g. being able to say "allow upto 30% of what's available in the
> parent".

Yes. I am concerned about how to configure a value below 1% while
avoiding floating point math in the kernel, as that's discouraged.
Configuring in the range of 1 to 100% for a given resource limits us to
100 or fewer cgroup instances, which I think is not desirable.

> If so, the simplest way would be simply updating the
> existing knobs to accept % inputs in addition to absolute values on
> writes.

I was not sure about overloading the rdma.max file to accept % inputs, as
that's not done in other cgroup controllers. So I was thinking more of a
weights interface, which avoids the floating point problem and also
allows a much wider configuration range.

>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 1/3] rdmacg: Added rdma cgroup controller
  2016-10-04 18:19                                                           ` Parav Pandit
                                                                             ` (2 preceding siblings ...)
  (?)
@ 2016-10-18 20:15                                                           ` Parav Pandit
  -1 siblings, 0 replies; 112+ messages in thread
From: Parav Pandit @ 2016-10-18 20:15 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Leon Romanovsky, Jason Gunthorpe, Christoph Hellwig, Matan Barak,
	cgroups, linux-doc, Linux Kernel Mailing List, linux-rdma,
	Li Zefan, Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Haggai Eran, Jonathan Corbet, james.l.morris, serge, Or Gerlitz,
	Andrew Morton, linux-security-module

Hi Doug,

Leon has finished his review as well in [7].
Christoph acked too in [8].

Can you please advise:
(1) Should I rebase and resend patch v12?
(2) If so, for which branch: master, 4.9, or another?

Tejun and Christoph mentioned that it might be late for 4.9.
Can we at least merge to the linux-rdma tree, so that more features/changes
can be done on top of it?

How can we avoid merge conflicts for Linus, since this patchset
applies to two git trees (cgroup and linux-rdma)?
I was thinking of pushing it through linux-rdma, as that tree is currently
going through more changes, so resolving any merge conflict there would
be simpler.

Please provide direction.

[7] https://lkml.org/lkml/2016/10/5/134
[8] https://lkml.org/lkml/2016/10/5/30

Regards,
Parav Pandit

On Tue, Oct 4, 2016 at 11:49 PM, Parav Pandit <pandit.parav@gmail.com> wrote:
> Hi Doug,
>
> I am still waiting for Leon to provide his comments if any on rdma cgroup.
> From other email context, he was on vacation last week.
> While we wait for his comments, I wanted to know your view of this
> patchset in 4.9 merge window.
>
> To summarize the discussion that happened in two threads.
>
> [1] Ack by Tejun, asking for review from rdma list
> [2] quick review by Christoph on patch-v11 (patch 12 has only typo corrections)
> [3] Christoph's ack on architecture of rdma cgroup and fitting it with ABI
> [4] My response on Matan's query on RSS indirection table
> [5] Response from Intel on their driver support for Matan's query
> [6] Christoph's point on architecture, which we are following in new
> ABI and current ABI
>
> I have reviewed recent patch [7] from Matan where I see IB verbs
> objects are still handled through common path as suggested by
> Christoph.
>
> I do not see any issues with rdma cgroup patchset other than it requires rebase.
> Am I missing something?
> Can you please help me - What would be required to merge it to 4.9?
>
> [1] https://lkml.org/lkml/2016/8/31/494
> [2] https://lkml.org/lkml/2016/8/25/146
> [3] https://lkml.org/lkml/2016/9/10/175
> [4] https://lkml.org/lkml/2016/9/14/221
> [5] https://lkml.org/lkml/2016/9/19/571
> [6] http://www.spinics.net/lists/linux-rdma/msg40337.html
> [7] email subject: [RFC ABI V4 0/7] SG-based RDMA ABI Proposal
>
> Regards,
> Parav Pandit
>
> On Wed, Sep 21, 2016 at 9:32 PM, Parav Pandit <pandit.parav@gmail.com> wrote:
>> Hi Tejun,
>>
>> On Wed, Sep 21, 2016 at 7:56 PM, Tejun Heo <tj@kernel.org> wrote:
>>> Hello, Parav.
>>>
>>> On Wed, Sep 21, 2016 at 10:13:38AM +0530, Parav Pandit wrote:
>>>> We have completed review from Tejun, Christoph.
>>>> HFI driver folks also provided feedback for Intel drivers.
>>>> Matan's also doesn't have any more comments.
>>>>
>>>> If possible, if you can also review, it will be helpful.
>>>>
>>>> I have some more changes unrelated to cgroup in same files in both the git tree.
>>>> Pushing them now either results into merge conflict later on for
>>>> Doug/Tejun, or requires rebase and resending patch.
>>>> If you can review, we can avoid such rework.
>>>
>>> My impression of the thread was that there doesn't seem to be enough
>>> of consensus around how rdma resources should be defined.  Is that
>>> part agreed upon now?
>>>
>>
>> We ended up discussing few points on different thread [1].
>>
>> There was confusion on how some non-rdma/non-IB drivers would work
>> with rdma cgroup from Matan.
>> Christoph explained how they don't fit in the rdma subsystem and
>> therefore its not prime target to addess.
>> Intel driver maintainer Denny also acknowledged same on [2].
>> IB compliant drivers of Intel support rdma cgroup as explained in [2].
>> With that usnic and Intel psm drivers falls out of rdma cgroup support
>> as they don't fit very well in the verbs definition.
>>
>> [1] https://www.spinics.net/lists/linux-rdma/msg40340.html
>> [2] http://www.spinics.net/lists/linux-rdma/msg40717.html
>>
>> I will wait for Leon's review comments if he has different view on architecture.
>> Back in April when I met face-to-face to Leon and Haggai, Leon was in
>> support to have kernel defined the rdma resources as suggested by
>> Christoph and Tejun instead of IB/RDMA subsystem.
>> I will wait for his comments if his views have changed with new uAPI
>> taking shape.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                                       ` <CAG53R5UciPpa5d8BWyR-tks3LBrBwRCN2NyBbbm1e3EE-OWSYQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-10-18 21:51                                         ` Tejun Heo
       [not found]                                           ` <20161018215134.GB2761-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
  0 siblings, 1 reply; 112+ messages in thread
From: Tejun Heo @ 2016-10-18 21:51 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Leon Romanovsky, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-rdma,
	Li Zefan, Johannes Weiner, Doug Ledford, Christoph Hellwig,
	Liran Liss, Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

Hello,

On Wed, Oct 19, 2016 at 01:32:01AM +0530, Parav Pandit wrote:
> On Fri, Oct 14, 2016 at 4:44 AM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> > I think what you want is just a way to specify absolute limits using
> > percentages of what's available at the time of configuration -
> > e.g. being able to say "allow upto 30% of what's available in the
> > parent".
> 
> Yes. I am concerned about how to configure < 1% value to avoid
> floating point math in kernel as thats discouraged.
> Configuring in range of 1 to 100% for a given resource limits to only
> 100 or less cgroup instances which I think is not desired.

Heh, we can go for per-mil and use %0 as the suffix if absolutely
necessary but is this a real issue?
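
To illustrate the per-mil idea with integer-only arithmetic, here is a short
Python sketch. The helper `limit_from_permille` is hypothetical and the
numbers are made up; it only shows that a finer unit gives steps below 1%
without any floating point.

```python
def limit_from_permille(total, permille):
    """Absolute limit for a share given in parts per thousand."""
    if not 0 <= permille <= 1000:
        raise ValueError("permille must be between 0 and 1000")
    # Integer floor division only; 1 per-mil is 0.1%.
    return total * permille // 1000


total = 2000
print(limit_from_permille(total, 5))     # 10   (0.5% of 2000)
print(limit_from_permille(total, 1))     # 2    (0.1% of 2000)
print(limit_from_permille(total, 1000))  # 2000 (everything)
```

With whole percentages the smallest non-zero step for this total would be
20 resources; with per-mil it is 2.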

> > If so, the simplest way would be simply updating the
> > existing knobs to accept % inputs in addition to absolute values on
> > writes.
> 
> I was not sure to overload rdma.max file for accepting % inputs as
> thats not done in other cgroups. So I was thinking more of weights
> interface which avoids floating point problem and also allows much
> wider configuration range.

I think it's a lot more consistent to implement all absolute limits
through max.  weight is for actual proportional control which this
isn't.  It's just a fancy way of specifying absolute limits.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                                           ` <20161018215134.GB2761-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
@ 2016-10-19  9:34                                             ` Parav Pandit
       [not found]                                               ` <CAG53R5UEvkPBM0yFrR=fvEzyCrku2q=rLZyDVrSs9q+3hgbSmQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 112+ messages in thread
From: Parav Pandit @ 2016-10-19  9:34 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Leon Romanovsky, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-rdma,
	Li Zefan, Johannes Weiner, Doug Ledford, Christoph Hellwig,
	Liran Liss, Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

Hi Tejun,

On Wed, Oct 19, 2016 at 3:21 AM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> Hello,
>
> On Wed, Oct 19, 2016 at 01:32:01AM +0530, Parav Pandit wrote:
>> On Fri, Oct 14, 2016 at 4:44 AM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>> > I think what you want is just a way to specify absolute limits using
>> > percentages of what's available at the time of configuration -
>> > e.g. being able to say "allow upto 30% of what's available in the
>> > parent".
>>
>> Yes. I am concerned about how to configure < 1% value to avoid
>> floating point math in kernel as thats discouraged.
>> Configuring in range of 1 to 100% for a given resource limits to only
>> 100 or less cgroup instances which I think is not desired.
>
> Heh, we can go for per-mil and use %0 as the suffix if absolutely
> necessary but is this a real issue?
>
We first need to get the current version out and actually exceed the
100-container limit. :-)
Please let me know if such a user interface already exists as a
reference point.
I would prefer to avoid introducing such a configuration.

>> > If so, the simplest way would be simply updating the
>> > existing knobs to accept % inputs in addition to absolute values on
>> > writes.
>>
>> I was not sure to overload rdma.max file for accepting % inputs as
>> thats not done in other cgroups. So I was thinking more of weights
>> interface which avoids floating point problem and also allows much
>> wider configuration range.
>
> I think it's a lot more consistent to implement all absoulte limits
> through max.
I agree. I will report the actual max values instead of the "max"
string, so that percentage configuration can be done by user space
tools (after v12 is merged).
Waiting for direction from Doug.

> weight is for actual proportional control which this
> isn't.  It's just a fancy way of specifying absolute limits.
>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                                               ` <CAG53R5UEvkPBM0yFrR=fvEzyCrku2q=rLZyDVrSs9q+3hgbSmQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-10-19 14:33                                                 ` Tejun Heo
       [not found]                                                   ` <20161019143345.GA18532-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
  0 siblings, 1 reply; 112+ messages in thread
From: Tejun Heo @ 2016-10-19 14:33 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Leon Romanovsky, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-rdma,
	Li Zefan, Johannes Weiner, Doug Ledford, Christoph Hellwig,
	Liran Liss, Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

Hello,

On Wed, Oct 19, 2016 at 03:04:52PM +0530, Parav Pandit wrote:
> > Heh, we can go for per-mil and use %0 as the suffix if absolutely
> > necessary but is this a real issue?
>
> We need to provide current version to exceed the 100 containers limit first. :-)
> Please let me know if such user interface already exist as reference point.
> I prefer to avoid introducing such configuration.

We don't have such usage yet.  rdmacg would be the first one doing
something like this.

> > I think it's a lot more consistent to implement all absoulte limits
> > through max.
>
> I agree. I will implement reporting actual max values instead of
> reporting "max" string so that percentage configuration can be done by
> the user space tools.
> (post merging of v12).
> Waiting for direction from Doug.

That doesn't sound like a good idea.  What happens when the amount of
available resources changes either through underlying resource changes
or config changes in one of the ancestors?  What "max" means is "no
limit is imposed" which is different from "limit it to 100% of what's
currently available".
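
A small sketch of the distinction drawn above, assuming a limit is written either as the string "max" or as an absolute count. The helper names and the numbers are hypothetical. It shows that "max" keeps admitting charges when the available amount changes, whereas "100% of what is available now" is frozen at configuration time.

```python
from typing import Optional


def parse_limit(token: str) -> Optional[int]:
    """Map "max" to None (no limit imposed); anything else is an absolute count."""
    if token == "max":
        return None
    value = int(token)
    if value < 0:
        raise ValueError("limit must be non-negative")
    return value


def may_charge(usage: int, limit: Optional[int]) -> bool:
    """True if one more object may be charged under `limit`."""
    return limit is None or usage + 1 <= limit


# Hypothetical: 1000 objects are available when the limits are configured.
snapshot = parse_limit("1000")   # "100% of what is currently available"
unlimited = parse_limit("max")   # no limit imposed

# Later more resources become available; the snapshot is still 1000.
usage = 1000
print(may_charge(usage, snapshot))   # the frozen value rejects the charge
print(may_charge(usage, unlimited))  # "max" still admits it
```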

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                                                   ` <20161019143345.GA18532-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
@ 2016-10-19 19:03                                                     ` Parav Pandit
       [not found]                                                       ` <CAG53R5WUyA7JBn=PeivUc5F5k210xf_HccPXFt3r7ZGYHOPaGA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 112+ messages in thread
From: Parav Pandit @ 2016-10-19 19:03 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Leon Romanovsky, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-rdma,
	Li Zefan, Johannes Weiner, Doug Ledford, Christoph Hellwig,
	Liran Liss, Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

Hi Tejun,

On Wed, Oct 19, 2016 at 8:03 PM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> Hello,
>
> On Wed, Oct 19, 2016 at 03:04:52PM +0530, Parav Pandit wrote:
>> > Heh, we can go for per-mil and use %0 as the suffix if absolutely
>> > necessary but is this a real issue?
>>
>> We need to provide current version to exceed the 100 containers limit first. :-)
>> Please let me know if such user interface already exist as reference point.
>> I prefer to avoid introducing such configuration.
>
> We don't have such usage yet.  rdmacg would be the first one doing
> something like this.
>
>> > I think it's a lot more consistent to implement all absoulte limits
>> > through max.
>>
>> I agree. I will implement reporting actual max values instead of
>> reporting "max" string so that percentage configuration can be done by
>> the user space tools.
>> (post merging of v12).
>> Waiting for direction from Doug.
>
> That doesn't sound like a good idea.  What happens when the amount of
> available resources changes either through underlying resource changes

To my knowledge, device configuration cannot change on the fly while
the device is still active.
A PCIe SR-IOV VF might be able to do so. In that case we might need to
introduce an event notifier framework.
This is a possibility regardless of percentage configuration if such
a device comes up.
However, this appears far-fetched.

> or config changes in one of the ancestors?  What "max" means is "no
> limit is imposed" which is different from "limit it to 100% of what's
> currently available".
Charging is hierarchical for rdmacg too.
rdma.max configuration exists at every level, so a change in an ancestor
won't affect its children.
The rdma.max absolute (or future percentage) value is with reference to
the total device resources.
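
The hierarchical charging mentioned above can be sketched as follows. This is only an illustration of the idea under my own assumptions, not the rdmacg implementation: the class and method names are hypothetical. Each level keeps its own configured limit, and a charge succeeds only if every level from the cgroup up to the root has room.

```python
from typing import List, Optional


class Cgroup:
    """Toy cgroup node with its own limit; None means no limit ("max")."""

    def __init__(self, name: str, limit: Optional[int] = None,
                 parent: Optional["Cgroup"] = None) -> None:
        self.name = name
        self.limit = limit
        self.parent = parent
        self.usage = 0

    def try_charge(self) -> bool:
        """Charge one object to this cgroup and all its ancestors, or to none."""
        chain: List["Cgroup"] = []
        node: Optional["Cgroup"] = self
        while node is not None:
            if node.limit is not None and node.usage + 1 > node.limit:
                return False  # some level is full; nothing has been charged yet
            chain.append(node)
            node = node.parent
        for node in chain:
            node.usage += 1
        return True

    def uncharge(self) -> None:
        """Release one object from this cgroup and all its ancestors."""
        node: Optional["Cgroup"] = self
        while node is not None:
            node.usage -= 1
            node = node.parent


root = Cgroup("root")
parent = Cgroup("parent", limit=2, parent=root)
child = Cgroup("child", limit=5, parent=parent)

results = [child.try_charge() for _ in range(3)]
print(results)  # the third charge fails at the parent, not at the child
```

The child's configured value stays at 5 whatever the parent is set to; the parent's limit only shows up when a charge is attempted.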


> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                                                       ` <CAG53R5WUyA7JBn=PeivUc5F5k210xf_HccPXFt3r7ZGYHOPaGA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-10-19 19:20                                                         ` Tejun Heo
       [not found]                                                           ` <20161019192006.GB3044-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
  0 siblings, 1 reply; 112+ messages in thread
From: Tejun Heo @ 2016-10-19 19:20 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Leon Romanovsky, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-rdma,
	Li Zefan, Johannes Weiner, Doug Ledford, Christoph Hellwig,
	Liran Liss, Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

Hello,

On Thu, Oct 20, 2016 at 12:33:53AM +0530, Parav Pandit wrote:
> > or config changes in one of the ancestors?  What "max" means is "no
> > limit is imposed" which is different from "limit it to 100% of what's
> > currently available".
>
> Charging is hierarchical for rdmacg too.
> rdma.max configuration exist at all the levels so ancestors change
> won't affect its children.
> rdma.max absolute (or future percentage) value is with reference to
> the total device resources.

Ah, right, the percentage is out of the total device resources
regardless of the hierarchical restrictions.  I still don't think it's
a good idea for rdmacg to deviate from the common interface
conventions.  If you want to do the percentage calculation in the
userland and the base numbers are system-wide numbers which are
hardware dependent, it'd be best if there's an existing place where
the numbers can be exposed naturally.  That'd be more in line with
others too.  The amount of total resources available for the device
isn't tied to cgroup after all.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                                                           ` <20161019192006.GB3044-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
@ 2016-10-19 19:54                                                             ` Parav Pandit
       [not found]                                                               ` <CAG53R5X5dyo7J-UkeMxi_mSxgv=c54fV=anuCZtmf9kaYwDbPw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 112+ messages in thread
From: Parav Pandit @ 2016-10-19 19:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Leon Romanovsky, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-rdma,
	Li Zefan, Johannes Weiner, Doug Ledford, Christoph Hellwig,
	Liran Liss, Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

Hi Tejun,

On Thu, Oct 20, 2016 at 12:50 AM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> Hello,
>
> On Thu, Oct 20, 2016 at 12:33:53AM +0530, Parav Pandit wrote:
>> > or config changes in one of the ancestors?  What "max" means is "no
>> > limit is imposed" which is different from "limit it to 100% of what's
>> > currently available".
>>
>> Charging is hierarchical for rdmacg too.
>> rdma.max configuration exist at all the levels so ancestors change
>> won't affect its children.
>> rdma.max absolute (or future percentage) value is with reference to
>> the total device resources.
>
> Ah, right, the percentage is out of the total device resources
> regardless of the hierarchical restrictions.  I still don't think it's
> a good idea for rdmacg to deviate from the common interface
> conventions.  If you want to do the percentage calculation in the
> userland and the base numbers are system-wide numbers which are
> hardware dependent, it'd be best if there's an existing place where
> the numbers can be exposed naturally.  That'd be more in line with
> others too.  The amount of total resources available for the device
> isn't tied to cgroup after all.
>
Userland can get the max numbers through another framework, the one used
by the control and data plane, which is available as a C library or as
system tools.
I prefer to get and set through the same interface because it
simplifies userland software. Such software is often not written in C, so
it is likely to rely on system tools, parse their output, iterate
through devices, etc.
Getting this information through rdma.max keeps it simple. Logic to
read/write rdma.max will be built in userland anyway, and it can be
leveraged for the percentage calculation instead of doing it from two
places.


> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                                                               ` <CAG53R5X5dyo7J-UkeMxi_mSxgv=c54fV=anuCZtmf9kaYwDbPw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-10-19 20:05                                                                 ` Tejun Heo
       [not found]                                                                   ` <20161019200536.GC3044-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
  0 siblings, 1 reply; 112+ messages in thread
From: Tejun Heo @ 2016-10-19 20:05 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Leon Romanovsky, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-rdma,
	Li Zefan, Johannes Weiner, Doug Ledford, Christoph Hellwig,
	Liran Liss, Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

Hello, Parav.

On Thu, Oct 20, 2016 at 01:24:42AM +0530, Parav Pandit wrote:
> userland can get the max numbers using other framework which is used
> by control & data plane available in C library form or in form of
> system tools.
> I was preferring to get and set through same interface because,
> It simplifies user land software which is often not written in C so
> its likely that it needs to rely on system tools and parse the
> content, iterate through devices etc.
> Getting these info through rdma.max just makes it simple. There will
> be logic built to read/write rdma.max in userland anyway, which can be
> leveraged for percentage calculation instead of doing it from two
> places.

Yeah, I get that this can be convenient in this case but it isn't a
generic approach.  I'd much prefer keeping it in line with other
resources.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                                                                   ` <20161019200536.GC3044-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
@ 2016-10-19 20:18                                                                     ` Parav Pandit
       [not found]                                                                       ` <CAG53R5XkRKdo-SCaREZvov3AGp5MSd18RpQ+0HEu-htUzqwOOw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 112+ messages in thread
From: Parav Pandit @ 2016-10-19 20:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Leon Romanovsky, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-rdma,
	Li Zefan, Johannes Weiner, Doug Ledford, Christoph Hellwig,
	Liran Liss, Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

On Thu, Oct 20, 2016 at 1:35 AM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> Hello, Parav.
>
> On Thu, Oct 20, 2016 at 01:24:42AM +0530, Parav Pandit wrote:
>> userland can get the max numbers using other framework which is used
>> by control & data plane available in C library form or in form of
>> system tools.
>> I was preferring to get and set through same interface because,
>> It simplifies user land software which is often not written in C so
>> its likely that it needs to rely on system tools and parse the
>> content, iterate through devices etc.
>> Getting these info through rdma.max just makes it simple. There will
>> be logic built to read/write rdma.max in userland anyway, which can be
>> leveraged for percentage calculation instead of doing it from two
>> places.
>
> Yeah, I get that this can be convenient in this case but it isn't a
> generic approach.  I'd much prefer keeping it in line with other
> resources.
>
Hmm. We don't have a simple /proc/sys/kernel/pid_max type of interface
to get the max values for rdma resources.
rdma.max is close to that simplicity.




> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                                                                       ` <CAG53R5XkRKdo-SCaREZvov3AGp5MSd18RpQ+0HEu-htUzqwOOw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-10-31  6:54                                                                         ` Leon Romanovsky
       [not found]                                                                           ` <20161031065441.GY3617-2ukJVAZIZ/Y@public.gmane.org>
  0 siblings, 1 reply; 112+ messages in thread
From: Leon Romanovsky @ 2016-10-31  6:54 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-rdma, Li Zefan,
	Johannes Weiner, Doug Ledford, Christoph Hellwig, Liran Liss,
	Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

[-- Attachment #1: Type: text/plain, Size: 2061 bytes --]

On Thu, Oct 20, 2016 at 01:48:27AM +0530, Parav Pandit wrote:
> On Thu, Oct 20, 2016 at 1:35 AM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> > Hello, Parav.
> >
> > On Thu, Oct 20, 2016 at 01:24:42AM +0530, Parav Pandit wrote:
> >> userland can get the max numbers using other framework which is used
> >> by control & data plane available in C library form or in form of
> >> system tools.
> >> I was preferring to get and set through same interface because,
> >> It simplifies user land software which is often not written in C so
> >> its likely that it needs to rely on system tools and parse the
> >> content, iterate through devices etc.
> >> Getting these info through rdma.max just makes it simple. There will
> >> be logic built to read/write rdma.max in userland anyway, which can be
> >> leveraged for percentage calculation instead of doing it from two
> >> places.
> >
> > Yeah, I get that this can be convenient in this case but it isn't a
> > generic approach.  I'd much prefer keeping it in line with other
> > resources.
> >
> Hmm. we don't have /proc/sys/kernel/pid_max type of simple interface
> to get the max values for rdma resources.
> rdma.max is close to that simplicity.

Sorry for my late response (very long weekends and piles of mail after them) and
for not clarifying our requirements better. They are very simple.

1. We will have vendor-specific objects in the future (the new ABI
supports them and is designed for that).
2. We don't want to fight over every addition of such objects to the cgroup list.
3. We don't want to teach the "average" user, or rewrite scripts, after
the addition of new objects.
4. Cgroup configuration should be as close as possible to the "standard", if
such a thing exists, so that all the countless internet guides will work for RDMA too.

From my understanding of the current status, my naive approach of introducing
a GLOBAL_HCA object is the way to go, and the real question
is how to configure it. Am I right?

>
>
>
>
> > Thanks.
> >
> > --
> > tejun

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                                                                           ` <20161031065441.GY3617-2ukJVAZIZ/Y@public.gmane.org>
@ 2016-11-01 11:03                                                                             ` Parav Pandit
       [not found]                                                                               ` <CAG53R5VKwntDHX101+5aaGoyKMKQuiKQWam575iFAxhmKxhE1g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 112+ messages in thread
From: Parav Pandit @ 2016-11-01 11:03 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-rdma, Li Zefan,
	Johannes Weiner, Doug Ledford, Christoph Hellwig, Liran Liss,
	Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

On Mon, Oct 31, 2016 at 12:24 PM, Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> On Thu, Oct 20, 2016 at 01:48:27AM +0530, Parav Pandit wrote:
>> On Thu, Oct 20, 2016 at 1:35 AM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>> > Hello, Parav.
>> >
>> > On Thu, Oct 20, 2016 at 01:24:42AM +0530, Parav Pandit wrote:
>> >> userland can get the max numbers using other framework which is used
>> >> by control & data plane available in C library form or in form of
>> >> system tools.
>> >> I was preferring to get and set through same interface because,
>> >> It simplifies user land software which is often not written in C so
>> >> its likely that it needs to rely on system tools and parse the
>> >> content, iterate through devices etc.
>> >> Getting these info through rdma.max just makes it simple. There will
>> >> be logic built to read/write rdma.max in userland anyway, which can be
>> >> leveraged for percentage calculation instead of doing it from two
>> >> places.
>> >
>> > Yeah, I get that this can be convenient in this case but it isn't a
>> > generic approach.  I'd much prefer keeping it in line with other
>> > resources.
>> >
>> Hmm. we don't have /proc/sys/kernel/pid_max type of simple interface
>> to get the max values for rdma resources.
>> rdma.max is close to that simplicity.
>
> Sorry for my late response (very long weekends and piles of mails after it) and
> for not clarifying our requirements better, which are very simple.
>
> 1. We will have vendor specific vendors objects in the future (new ABI
> support it and designed for that).
I will let others comment on it. The patch_v11 design allowed
vendor-specific objects and standard objects to be defined in IB core,
with rdma cgroup as the facilitator. We didn't reach consensus on
that approach.

> 2. We don't want to fight for every addition of such objects to cgroup list.
Same comment as above.

> 3. We don't want to teach and/or rewrite scripts for "average" user after
> addition of new objects.
We can possibly do this by adding a new rdma.percentage knob, which gets
configured by default for every new object in rdma cgroup.
This way the average user/administrator doesn't have to know about it.

> 4. Cgroup configuration should be as close as possible to "standard" if
> such exists, so all infinite internet guides will work for RDMA too.
I didn't follow this comment. Can you please explain? Are you saying
rdma cgroup should define all the objects of the IB spec?
>
> From my understanding of current status.
> My naive approach of introducing GLOBAL_HCA object is the way to go and the real question
> is to understand how to configure it, am I right?
>
A global object won't work, for the reason below.
Let's take an example to make this easier.
Let's say two new RDMA objects exist which are not part of the rdma cgroup
standard resource definition,
say, indirection tables and PSM tags.
Both are abstracted using one global_hca resource object.
Say it is given 10%.
Now IB core charges each such object to GLOBAL_HCA
(because at the cgroup level there is only one object, GLOBAL_HCA).
So two or more resources are mapped to a single object.
This means one resource type can be charged far more than the other while the
total stays under 10%, which leads to the same problem as not having a cgroup at all.
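
The example above can be played through with a toy model. The names and numbers are hypothetical: a budget of 10 objects stands in for the 10% share, and a single pooled counter stands in for GLOBAL_HCA.

```python
from typing import Dict


def charge(pool: Dict[str, int], resource: str, limit: int) -> bool:
    """Charge one object of `resource` against a single shared counter."""
    if sum(pool.values()) + 1 > limit:
        return False
    pool[resource] = pool.get(resource, 0) + 1
    return True


limit = 10  # stands in for "10%" of a hypothetical 100-object device
pool: Dict[str, int] = {}

# Indirection tables are requested first and take whatever they can get.
granted = sum(charge(pool, "indirection_table", limit) for _ in range(20))

# A single PSM tag is then refused, although no PSM tag was ever allocated.
psm_ok = charge(pool, "psm_tag", limit)

print(granted, psm_ok, pool)
```

The pooled total never exceeds the limit, yet one resource type has starved the other, which is the concern raised above.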

So my opinion is:
(a) Let cgroup define the current standard objects and a new, reasonable
set of vendor-specific objects in the future.
(b) Add a new rdma.percentage parameter so that any new standard object
or vendor-specific object can be abstracted from the average end user and
from applications which are yet to catch up.
I believe this takes care of your points (1), (3) and (4)?

In another hypothetical design,
rdma cgroup is just a pid-to-cgroup mapping facilitator.
All the charging/uncharging logic moves to IB core in the form of a library
that the standard ABI uverbs and the vendor-specific layer invoke. In this
approach code will be duplicated in every such vendor driver.
More callbacks will also have to be moved down to IB
core and the vendor drivers for cgroup creation/deletion/offline etc.

This also means that the lack of standard object definitions may create
more confusion for end users and orchestration applications. I prefer to
avoid such a design.

Parav

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                                                                               ` <CAG53R5VKwntDHX101+5aaGoyKMKQuiKQWam575iFAxhmKxhE1g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-11-01 14:07                                                                                 ` Leon Romanovsky
       [not found]                                                                                   ` <20161101140732.GC3617-2ukJVAZIZ/Y@public.gmane.org>
  2016-11-03 18:00                                                                                 ` Leon Romanovsky
  1 sibling, 1 reply; 112+ messages in thread
From: Leon Romanovsky @ 2016-11-01 14:07 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-rdma, Li Zefan,
	Johannes Weiner, Doug Ledford, Christoph Hellwig, Liran Liss,
	Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

[-- Attachment #1: Type: text/plain, Size: 582 bytes --]

On Tue, Nov 01, 2016 at 04:33:23PM +0530, Parav Pandit wrote:
>
> > 4. Cgroup configuration should be as close as possible to "standard" if
> > such exists, so all infinite internet guides will work for RDMA too.
> I didnt follow this comment. Can you please explain? Are you saying
> rdma cgroup should have define all the objects of IB spec?

It is not related to the spec at all. There were comments from Tejun and you that
other cgroups (CPU, ...) have different semantics and RDMA has something unique
(I don't remember what it was). I want RDMA cgroups to be as little unique as possible.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                                                                                   ` <20161101140732.GC3617-2ukJVAZIZ/Y@public.gmane.org>
@ 2016-11-02  4:34                                                                                     ` Parav Pandit
  0 siblings, 0 replies; 112+ messages in thread
From: Parav Pandit @ 2016-11-02  4:34 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-rdma, Li Zefan,
	Johannes Weiner, Doug Ledford, Christoph Hellwig, Liran Liss,
	Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

On Tue, Nov 1, 2016 at 7:37 PM, Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> On Tue, Nov 01, 2016 at 04:33:23PM +0530, Parav Pandit wrote:
>>
>> > 4. Cgroup configuration should be as close as possible to "standard" if
>> > such exists, so all infinite internet guides will work for RDMA too.
>> I didnt follow this comment. Can you please explain? Are you saying
>> rdma cgroup should have define all the objects of IB spec?
>
> It is not related to spec at all. There were comments from Tejun and you that
> other cgroups (CPU, ...) have different semantics and RDMA has something unique
> (I don't remember what was it). I want to see minimal uniqueness RDMA cgroups.
OK. Got it.
It's the weights interface, which is not suitable for stateful rdma
resources that cannot be reclaimed by another cgroup once allocated.

So I am proposing your idea in a different way: an rdma.percentage
interface as described in my previous email.
It is applicable to all resources and allows generic
configuration for the average user.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                                                                               ` <CAG53R5VKwntDHX101+5aaGoyKMKQuiKQWam575iFAxhmKxhE1g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2016-11-01 14:07                                                                                 ` Leon Romanovsky
@ 2016-11-03 18:00                                                                                 ` Leon Romanovsky
       [not found]                                                                                   ` <20161103180006.GL3617-2ukJVAZIZ/Y@public.gmane.org>
  1 sibling, 1 reply; 112+ messages in thread
From: Leon Romanovsky @ 2016-11-03 18:00 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-rdma, Li Zefan,
	Johannes Weiner, Doug Ledford, Christoph Hellwig, Liran Liss,
	Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

[-- Attachment #1: Type: text/plain, Size: 851 bytes --]

On Tue, Nov 01, 2016 at 04:33:23PM +0530, Parav Pandit wrote:
> So my opinion is:
> (a) Let cgroup define the current standard objects and new reasonable
> set of vendor specific objects in future.
> (b) Add new rdma.percentage parameter so that any new standard object
> or vendor specific object can be abstracted from average end user and
> applications which are yet to catch up.
> I believe this takes care of your point (1), (3), (4)?

We (Tejun, Christoph, Matan and I) had a face-to-face talk during
KS/LPC and decided that the best way forward is to export only one
object (global HCA-like) to the user and nothing else.

All internal calculations will be based on this percentage.

Once cgroup users come with a reasonable justification for why they
need to configure different unexposed objects, we will expose them.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                                                                                   ` <20161103180006.GL3617-2ukJVAZIZ/Y@public.gmane.org>
@ 2016-11-04  4:20                                                                                     ` Leon Romanovsky
  2016-11-04  4:20                                                                                     ` Liran Liss
  2016-11-04  4:28                                                                                     ` Parav Pandit
  2 siblings, 0 replies; 112+ messages in thread
From: Leon Romanovsky @ 2016-11-04  4:20 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-rdma, Li Zefan,
	Johannes Weiner, Doug Ledford, Christoph Hellwig, Liran Liss,
	Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

[-- Attachment #1: Type: text/plain, Size: 1308 bytes --]

On Thu, Nov 03, 2016 at 08:00:06PM +0200, Leon Romanovsky wrote:
> On Tue, Nov 01, 2016 at 04:33:23PM +0530, Parav Pandit wrote:
> > So my opinion is:
> > (a) Let cgroup define the current standard objects and new reasonable
> > set of vendor specific objects in future.
> > (b) Add new rdma.percentage parameter so that any new standard object
> > or vendor specific object can be abstracted from average end user and
> > applications which are yet to catch up.
> > I believe this takes care of your point (1), (3), (4)?
>
> We (Tejun, Christoph, Matan and me) had a face-to-face talk during
> KS/LPC and decided that the best way to move forward is to export to
> user one object (global HCA like) only and don't export anything else.
>
> All internal calculations will be based on this percentage.

To simplify things further for users and developers, this global cgroup
object should be based not on a percentage but on the actual number of object
units, where an object unit is any object that consumes an IDR.

The IDR consumers can be of any type. This simplification will give
the cgroup excellent scalability without sacrificing user experience.

>
> Once the cgroups users will come with reasonable justification why they
> need to configure different unexposed objects, we will expose them.



[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 112+ messages in thread

* RE: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                                                                                   ` <20161103180006.GL3617-2ukJVAZIZ/Y@public.gmane.org>
  2016-11-04  4:20                                                                                     ` Leon Romanovsky
@ 2016-11-04  4:20                                                                                     ` Liran Liss
       [not found]                                                                                       ` <AM4PR0501MB2802030EE9E359133E04439CB1A20-dp/nxUn679jTOi/YP668sMDSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  2016-11-04  4:28                                                                                     ` Parav Pandit
  2 siblings, 1 reply; 112+ messages in thread
From: Liran Liss @ 2016-11-04  4:20 UTC (permalink / raw)
  To: Leon Romanovsky, Parav Pandit
  Cc: Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-rdma, Li Zefan,
	Johannes Weiner, Doug Ledford, Christoph Hellwig, Hefty, Sean,
	Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

> From: Leon Romanovsky [mailto:leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org]

> We (Tejun, Christoph, Matan and me) had a face-to-face talk during KS/LPC and
> decided that the best way to move forward is to export to user one object
> (global HCA like) only and don't export anything else.
> 
> All internal calculations will be based on this percentage.
> 
> Once the cgroups users will come with reasonable justification why they need to
> configure different unexposed objects, we will expose them.

A global HCA metric is indeed in the right direction.
However, rethinking this, I think that we should specify the metric in terms of RDMA objects rather than percentage.
Basically, any resource that consumes an IDR is charged.

The reasons are:
- Some HCAs can have a huge amount of resources (millions of objects), of which even a small percentage may consume a considerable amount of kernel memory.
- We follow the same notion as FD limits, which accounts for numerous resource types that consume file objects in the kernel (files, pipes, sockets)
- The namespaces for RDMA resources are large (usually 24 bits). So even large resource counts won't come anywhere close to depleting the namespace. (Compare that to the mere 64K socket port space...)
- The metric measures the actual application usage of resources, rather than a proportion of the resources of a given HCA.
- We can continue to use the cgroup mechanism for charging (just as in the original proposal)

I have discussed this matter with Doug and Matan, and it seems like this is the right direction.
--Liran

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                                                                                   ` <20161103180006.GL3617-2ukJVAZIZ/Y@public.gmane.org>
  2016-11-04  4:20                                                                                     ` Leon Romanovsky
  2016-11-04  4:20                                                                                     ` Liran Liss
@ 2016-11-04  4:28                                                                                     ` Parav Pandit
  2 siblings, 0 replies; 112+ messages in thread
From: Parav Pandit @ 2016-11-04  4:28 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-rdma, Li Zefan,
	Johannes Weiner, Doug Ledford, Christoph Hellwig, Liran Liss,
	Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

Hi Leon, Christoph, Matan, Tejun,

Thanks for the update.
I need some more information in order to roll out a new patch.
Inline clarification below.


On Thu, Nov 3, 2016 at 11:30 PM, Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> On Tue, Nov 01, 2016 at 04:33:23PM +0530, Parav Pandit wrote:
>> So my opinion is:
>> (a) Let cgroup define the current standard objects and new reasonable
>> set of vendor specific objects in future.
>> (b) Add new rdma.percentage parameter so that any new standard object
>> or vendor specific object can be abstracted from average end user and
>> applications which are yet to catch up.
>> I believe this takes care of your point (1), (3), (4)?
>
> We (Tejun, Christoph, Matan and me) had a face-to-face talk during
> KS/LPC and decided that the best way to move forward is to export to
> user one object (global HCA like) only and don't export anything else.
>

Can you please confirm the points below to make sure the design fits?

1. So rdma.current and rdma.max will each show one overall percentage
(used and configured, respectively) instead of per-object absolute values?
2. As a starting point, the minimum percentage will be 1%. The default
will be 100%.
3. So, for example, if the user has configured 2% of resources, this 2%
will apply as 2% of MR, 2% of QP and so on.
4. rdma cgroup continues to do accounting and resource definition as done
in patch_v12.
Though there is provision for defining a handful of vendor specific
objects in rdma cgroup, we don't define them currently and therefore
they won't be accounted.
5. In future, when the need arises to account vendor specific objects,
they will be added to rdma cgroup.
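
To make point 3 concrete, here is a minimal sketch of how one configured
percentage could be turned into a per-object limit. The helper name
pct_limit() and the device maximums used with it are assumptions for
illustration only; nothing here is taken from patch_v12.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: apply one configured percentage (1..100) to the
 * maximum a device reports for a single object type. The result is
 * rounded down. The 64-bit intermediate avoids overflow for large pools.
 */
static uint32_t pct_limit(uint32_t device_max, uint32_t pct)
{
	return (uint32_t)((uint64_t)device_max * pct / 100);
}
```

With an assumed device that supports 1,000,000 MRs and 256 ucontexts, 2%
would mean pct_limit(1000000, 2) MRs and pct_limit(256, 2) ucontexts for
that cgroup, so the same setting scales with each pool.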

> All internal calculations will be based on this percentage.
>
> Once the cgroups users will come with reasonable justification why they
> need to configure different unexposed objects, we will expose them.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                                                                                       ` <AM4PR0501MB2802030EE9E359133E04439CB1A20-dp/nxUn679jTOi/YP668sMDSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2016-11-04  4:47                                                                                         ` Parav Pandit
       [not found]                                                                                           ` <CAG53R5Vd58wEBKgAajp9VvJmB5sO2Umii0JE4XaLYKbfrJrxyg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 112+ messages in thread
From: Parav Pandit @ 2016-11-04  4:47 UTC (permalink / raw)
  To: Liran Liss
  Cc: Leon Romanovsky, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma, Li Zefan, Johannes Weiner, Doug Ledford,
	Christoph Hellwig, Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

Hi Liran,

On Fri, Nov 4, 2016 at 9:50 AM, Liran Liss <liranl-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
>> From: Leon Romanovsky [mailto:leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org]
>
>> We (Tejun, Christoph, Matan and me) had a face-to-face talk during KS/LPC and
>> decided that the best way to move forward is to export to user one object
>> (global HCA like) only and don't export anything else.
>>
>> All internal calculations will be based on this percentage.
>>
>> Once the cgroups users will come with reasonable justification why they need to
>> configure different unexposed objects, we will expose them.
>
> A global HCA metric is indeed in the right direction.
> However, rethinking this, I think that we should specify the metric in terms of RDMA objects rather than percentage.
> Basically, any resource that consumes an IDR is charged.
>
If the metric definition is based on RDMA objects (count) and not on
percentage, how would the user specify the metric without really
specifying the object type?
The current patch defines the metric as absolute numbers and objects as well.

The comment from Leon about his discussion with Matan, Tejun, Christoph
says the opposite of this for user level configuration.
Maybe I am missing something.

> The reasons are:
> - Some HCAs can have a huge amount of resources (millions of objects), of which even a small percentage may consume a considerable amount of kernel memory.
> - We follow the same notion as FD limits, which accounts for numerous resource types that consume file objects in the kernel (files, pipes, sockets)
> - The namespaces for RDMA resources are large (usually 24 bits). So even large resource counts won't come nowhere close in depleting the namespace. (Compare that to the mere 64K socket port space...)
> - The metric measures the actual application usage of resources, rather than proportional to the resources of a given HCA adapter.
> - We can continue to use the cgroup mechanism for charging (just as in the original proposal)
>
> I have discussed this matter with Doug and Matan, and it seems like this is the right direction.
> --Liran
>

^ permalink raw reply	[flat|nested] 112+ messages in thread

* RE: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                                                                                           ` <CAG53R5Vd58wEBKgAajp9VvJmB5sO2Umii0JE4XaLYKbfrJrxyg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-11-04  4:52                                                                                             ` Liran Liss
       [not found]                                                                                               ` <AM4PR0501MB2802E87F709F41DDEC20B7C9B1A20-dp/nxUn679jTOi/YP668sMDSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 112+ messages in thread
From: Liran Liss @ 2016-11-04  4:52 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Leon Romanovsky, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma, Li Zefan, Johannes Weiner, Doug Ledford,
	Christoph Hellwig, Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 1963 bytes --]

> From: Parav Pandit [mailto:pandit.parav@gmail.com]

> >
> > A global HCA metric is indeed in the right direction.
> > However, rethinking this, I think that we should specify the metric in terms of
> RDMA objects rather than percentage.
> > Basically, any resource that consumes an IDR is charged.
> >
> If metric definition is based on RDMA objects (count) and not based on
> percentage, how would user specify the metric without really specifying object
> type.
> Current patch defines the metric as absolute numbers and objects as well.
>

That is the requested change. The absolute number would account for any object allocation. We won't distinguish between types.
Only a single counter (per device).
 
> Comment from Leon about his discussion with Matan, Tejun, Christoph says
> opposite of this for user level configuration.
> May be I am missing something.
> 
> > The reasons are:
> > - Some HCAs can have a huge amount of resources (millions of objects), of
> which even a small percentage may consume a considerable amount of kernel
> memory.
> > - We follow the same notion as FD limits, which accounts for numerous
> > resource types that consume file objects in the kernel (files, pipes,
> > sockets)
> > - The namespaces for RDMA resources are large (usually 24 bits). So
> > even large resource counts won't come nowhere close in depleting the
> > namespace. (Compare that to the mere 64K socket port space...)
> > - The metric measures the actual application usage of resources, rather than
> proportional to the resources of a given HCA adapter.
> > - We can continue to use the cgroup mechanism for charging (just as in
> > the original proposal)
> >
> > I have discussed this matter with Doug and Matan, and it seems like this is the
> right direction.
> > --Liran
> >

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                                                                                               ` <AM4PR0501MB2802E87F709F41DDEC20B7C9B1A20-dp/nxUn679jTOi/YP668sMDSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2016-11-04  4:57                                                                                                 ` Parav Pandit
       [not found]                                                                                                   ` <CAG53R5UyZPh9wduPZGRg2P09n2Og8oODqb+QW=7ryAPqJDa6Vw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 112+ messages in thread
From: Parav Pandit @ 2016-11-04  4:57 UTC (permalink / raw)
  To: Liran Liss
  Cc: Leon Romanovsky, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma, Li Zefan, Johannes Weiner, Doug Ledford,
	Christoph Hellwig, Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

On Fri, Nov 4, 2016 at 10:22 AM, Liran Liss <liranl-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
>> From: Parav Pandit [mailto:pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org]
>
>> >
>> > A global HCA metric is indeed in the right direction.
>> > However, rethinking this, I think that we should specify the metric in terms of
>> RDMA objects rather than percentage.
>> > Basically, any resource that consumes an IDR is charged.
>> >
>> If metric definition is based on RDMA objects (count) and not based on
>> percentage, how would user specify the metric without really specifying object
>> type.
>> Current patch defines the metric as absolute numbers and objects as well.
>>
>
> That is the requested change. The absolute number would account for any object allocation. We won't distinguish between types.
> Only a single counter (per device).
>

In that case ucontext deserves an additional count, because there are
only a handful of them, in the range of 256 to 1K.
If we give an absolute consolidated number such as 2000, one container can
allocate all the doorbell uctx and no other container can run.
Percentage works for this particular case.
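
The arithmetic behind this concern can be sketched as follows. The
function name uctx_reachable() is hypothetical; the figures 2000, 256 and
1K are the ones mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: how many ucontexts one cgroup can take when
 * ucontexts are charged to the same consolidated counter as every other
 * object. The cgroup is stopped by whichever runs out first, its own
 * limit or the device's ucontext pool.
 */
static uint32_t uctx_reachable(uint32_t consolidated_limit,
			       uint32_t device_uctx)
{
	return consolidated_limit < device_uctx ?
	       consolidated_limit : device_uctx;
}
```

When the result equals the device pool size, as it does for a limit of
2000 against 256 or 1024 ucontexts, a single container can take every
ucontext and starve the rest.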


>> Comment from Leon about his discussion with Matan, Tejun, Christoph says
>> opposite of this for user level configuration.
>> May be I am missing something.
>>
>> > The reasons are:
>> > - Some HCAs can have a huge amount of resources (millions of objects), of
>> which even a small percentage may consume a considerable amount of kernel
>> memory.
>> > - We follow the same notion as FD limits, which accounts for numerous
>> > resource types that consume file objects in the kernel (files, pipes,
>> > sockets)
>> > - The namespaces for RDMA resources are large (usually 24 bits). So
>> > even large resource counts won't come nowhere close in depleting the
>> > namespace. (Compare that to the mere 64K socket port space...)
>> > - The metric measures the actual application usage of resources, rather than
>> proportional to the resources of a given HCA adapter.
>> > - We can continue to use the cgroup mechanism for charging (just as in
>> > the original proposal)
>> >
>> > I have discussed this matter with Doug and Matan, and it seems like this is the
>> right direction.
>> > --Liran
>> >

^ permalink raw reply	[flat|nested] 112+ messages in thread

* RE: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                                                                                                   ` <CAG53R5UyZPh9wduPZGRg2P09n2Og8oODqb+QW=7ryAPqJDa6Vw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-11-04  5:06                                                                                                     ` Liran Liss
       [not found]                                                                                                       ` <AM4PR0501MB28025BE002CBA9D04675A5A5B1A20-dp/nxUn679jTOi/YP668sMDSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 112+ messages in thread
From: Liran Liss @ 2016-11-04  5:06 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Leon Romanovsky, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma, Li Zefan, Johannes Weiner, Doug Ledford,
	Christoph Hellwig, Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

> From: Parav Pandit [mailto:pandit.parav@gmail.com]

> 
> On Fri, Nov 4, 2016 at 10:22 AM, Liran Liss <liranl@mellanox.com> wrote:
> >> From: Parav Pandit [mailto:pandit.parav@gmail.com]
> >
> >> >
> >> > A global HCA metric is indeed in the right direction.
> >> > However, rethinking this, I think that we should specify the metric
> >> > in terms of
> >> RDMA objects rather than percentage.
> >> > Basically, any resource that consumes an IDR is charged.
> >> >
> >> If metric definition is based on RDMA objects (count) and not based
> >> on percentage, how would user specify the metric without really
> >> specifying object type.
> >> Current patch defines the metric as absolute numbers and objects as well.
> >>
> >
> > That is the requested change. The absolute number would account for any
> object allocation. We won't distinguish between types.
> > Only a single counter (per device).
> >
> 
> In that case ucontext deserve a additional count. Because that is handful in
> range of 256 to 1K.
> If we give absolute consolidated number as 2000, one container will allocate all
> the doorbell uctx and no other container can run.
> Percentage works for this particular case.
> 

Hmm..
I guess that you are right.

So we can add another count for "HCA handles", or alternatively, each provider will restrict the number of handles per device to a reasonably small number (which won't be treated as one of the "HCA resources").
Typically, a process shouldn't need to open more than a single handle...

> 
> >> Comment from Leon about his discussion with Matan, Tejun, Christoph
> >> says opposite of this for user level configuration.
> >> May be I am missing something.
> >>
> >> > The reasons are:
> >> > - Some HCAs can have a huge amount of resources (millions of
> >> > objects), of
> >> which even a small percentage may consume a considerable amount of
> >> kernel memory.
> >> > - We follow the same notion as FD limits, which accounts for
> >> > numerous resource types that consume file objects in the kernel
> >> > (files, pipes,
> >> > sockets)
> >> > - The namespaces for RDMA resources are large (usually 24 bits). So
> >> > even large resource counts won't come nowhere close in depleting
> >> > the namespace. (Compare that to the mere 64K socket port space...)
> >> > - The metric measures the actual application usage of resources,
> >> > rather than
> >> proportional to the resources of a given HCA adapter.
> >> > - We can continue to use the cgroup mechanism for charging (just as
> >> > in the original proposal)
> >> >
> >> > I have discussed this matter with Doug and Matan, and it seems like
> >> > this is the
> >> right direction.
> >> > --Liran
> >> >

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                                                                                                       ` <AM4PR0501MB28025BE002CBA9D04675A5A5B1A20-dp/nxUn679jTOi/YP668sMDSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2016-11-04  5:44                                                                                                         ` Parav Pandit
       [not found]                                                                                                           ` <CAG53R5WdauHpML66g-O6zj+j_DUYWJMPjmL1xDaSxwDmPPYm2A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 112+ messages in thread
From: Parav Pandit @ 2016-11-04  5:44 UTC (permalink / raw)
  To: Liran Liss
  Cc: Leon Romanovsky, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma, Li Zefan, Johannes Weiner, Doug Ledford,
	Christoph Hellwig, Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

Hi Liran,

On Fri, Nov 4, 2016 at 10:36 AM, Liran Liss <liranl-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
>> From: Parav Pandit [mailto:pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org]
>
>>
>> On Fri, Nov 4, 2016 at 10:22 AM, Liran Liss <liranl-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
>> >> From: Parav Pandit [mailto:pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org]
>> >
>> >> >
>> >> > A global HCA metric is indeed in the right direction.
>> >> > However, rethinking this, I think that we should specify the metric
>> >> > in terms of
>> >> RDMA objects rather than percentage.
>> >> > Basically, any resource that consumes an IDR is charged.
>> >> >
>> >> If metric definition is based on RDMA objects (count) and not based
>> >> on percentage, how would user specify the metric without really
>> >> specifying object type.
>> >> Current patch defines the metric as absolute numbers and objects as well.
>> >>
>> >
>> > That is the requested change. The absolute number would account for any
>> object allocation. We won't distinguish between types.
>> > Only a single counter (per device).
>> >
>>
>> In that case ucontext deserve a additional count. Because that is handful in
>> range of 256 to 1K.
>> If we give absolute consolidated number as 2000, one container will allocate all
>> the doorbell uctx and no other container can run.
>> Percentage works for this particular case.
>>
>
> Hmm..
> I guess that you are right.
>
> So we can add another count for "HCA handles",
I prefer this. This keeps it vendor agnostic and clean if we don't go
the percentage route.
Would the indirection table also fall in this category?

> or alternatively, each provider will restrict the number of handles per device to a reasonable small number (which
> won't be treated as one of the "HCA resources").
This would require vendor drivers to understand cgroup objects and
pids, which breaks the modular approach. I would like to avoid
this.

> Typically, a process shouldn't need to open more than a single handle...
Right. A well-behaved application won't open multiple handles.


>

^ permalink raw reply	[flat|nested] 112+ messages in thread

* RE: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                                                                                                           ` <CAG53R5WdauHpML66g-O6zj+j_DUYWJMPjmL1xDaSxwDmPPYm2A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-11-08  8:12                                                                                                             ` Liran Liss
       [not found]                                                                                                               ` <HE1PR0501MB2812298C05431B08B0F408EEB1A60-692Kmc8YnlIVrnpjwTCbp8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 112+ messages in thread
From: Liran Liss @ 2016-11-08  8:12 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Leon Romanovsky, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma, Li Zefan, Johannes Weiner, Doug Ledford,
	Christoph Hellwig, Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

> From: Parav Pandit [mailto:pandit.parav@gmail.com]

> >
> > Hmm..
> > I guess that you are right.
> >
> > So we can add another count for "HCA handles",
> I prefer this. This keeps it vendor agnostic and clean if we don't go percentage
> route.

OK; let's do it.

> Would indirection table also fall in this category?
> 

No. It's just another HCA resource...

> > or alternatively, each provider will restrict the number of handles
> > per device to a reasonable small number (which won't be treated as one of the
> "HCA resources").
> This would require vendor drivers to get the understanding of cgroup object
> and pid and that breaks the modular approach. I like to avoid this.
> 
> > Typically, a process shouldn't need to open more than a single handle...
> Right. well behaved application won't do multiple handles.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                                                                                                               ` <HE1PR0501MB2812298C05431B08B0F408EEB1A60-692Kmc8YnlIVrnpjwTCbp8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
@ 2016-11-10  7:41                                                                                                                 ` Parav Pandit
       [not found]                                                                                                                   ` <CAG53R5XqZwrYsdX=JQ1D4cDB0h65RDQVb=VCiaR5TXuf_uoO0Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 112+ messages in thread
From: Parav Pandit @ 2016-11-10  7:41 UTC (permalink / raw)
  To: Liran Liss
  Cc: Leon Romanovsky, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma, Li Zefan, Johannes Weiner, Doug Ledford,
	Christoph Hellwig, Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

Hi Leon, Tejun, Christoph, Liran,  Doug, Matan,

So are you OK with the proposal below?

1. Define two resources by rdmacg.
(a) hca_handles (covers doorbell pages)
(b) hca_resources (mr, pd, qp, srq, vendor defined, all consolidated count)
Both cannot be combined as explained in [1].

2. User configures absolute count for above two resources (similar to
today's file descriptors, pid cgroup controller max limit)

Leon,
Let us know if you had any further discussions during LPC on the
questions in [2] about using a percentage-based scheme or otherwise.

Parav

[1] https://www.spinics.net/lists/linux-rdma/msg42771.html
[2] https://www.spinics.net/lists/linux-rdma/msg42768.html



On Tue, Nov 8, 2016 at 1:42 PM, Liran Liss <liranl-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
>> From: Parav Pandit [mailto:pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org]
>
>> >
>> > Hmm..
>> > I guess that you are right.
>> >
>> > So we can add another count for "HCA handles",
>> I prefer this. This keeps it vendor agnostic and clean if we don't go percentage
>> route.
>
> OK; let's do it.
>
>> Would indirection table also fall in this category?
>>
>
> No. It's just another HCA resource...
>
>> > or alternatively, each provider will restrict the number of handles
>> > per device to a reasonable small number (which won't be treated as one of the
>> "HCA resources").
>> This would require vendor drivers to get the understanding of cgroup object
>> and pid and that breaks the modular approach. I like to avoid this.
>>
>> > Typically, a process shouldn't need to open more than a single handle...
>> Right. well behaved application won't do multiple handles.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                                                                                                                   ` <CAG53R5XqZwrYsdX=JQ1D4cDB0h65RDQVb=VCiaR5TXuf_uoO0Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-11-10 16:38                                                                                                                     ` Leon Romanovsky
       [not found]                                                                                                                       ` <20161110163837.GE28957-2ukJVAZIZ/Y@public.gmane.org>
  0 siblings, 1 reply; 112+ messages in thread
From: Leon Romanovsky @ 2016-11-10 16:38 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Liran Liss, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma, Li Zefan, Johannes Weiner, Doug Ledford,
	Christoph Hellwig, Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

[-- Attachment #1: Type: text/plain, Size: 1917 bytes --]

On Thu, Nov 10, 2016 at 01:11:18PM +0530, Parav Pandit wrote:
> Hi Leon, Tejun, Christoph, Liran,  Doug, Matan,
>
> So are you ok with below proposal?

I'm fine with it, and it looks like a very clean approach
to solve our multi-object future.

>
> 1. Define two resources by rdmacg.
> (a) hca_handles (covers doorbell pages)
> (b) hca_resources (mr, pd, qp, srq, vendor defined, all consolidated count)
> Both cannot be combined as explained in [1].
>
> 2. User configures absolute count for above two resources (similar to
> today's file descriptors, pid cgroup controller max limit)
>
> Leon,
> Let us know if you have any further discussions during LPC on
> questions of [2] in using percentage based scheme or otherwise.

No, we didn't.

>
> Parav
>
> [1] https://www.spinics.net/lists/linux-rdma/msg42771.html
> [2] https://www.spinics.net/lists/linux-rdma/msg42768.html
>
>
>
> On Tue, Nov 8, 2016 at 1:42 PM, Liran Liss <liranl-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> >> From: Parav Pandit [mailto:pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org]
> >
> >> >
> >> > Hmm..
> >> > I guess that you are right.
> >> >
> >> > So we can add another count for "HCA handles",
> >> I prefer this. This keeps it vendor agnostic and clean if we don't go percentage
> >> route.
> >
> > OK; let's do it.
> >
> >> Would indirection table also fall in this category?
> >>
> >
> > No. It's just another HCA resource...
> >
> >> > or alternatively, each provider will restrict the number of handles
> >> > per device to a reasonable small number (which won't be treated as one of the
> >> "HCA resources").
> >> This would require vendor drivers to get the understanding of cgroup object
> >> and pid and that breaks the modular approach. I like to avoid this.
> >>
> >> > Typically, a process shouldn't need to open more than a single handle...
> >> Right. well behaved application won't do multiple handles.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                                                                                                                       ` <20161110163837.GE28957-2ukJVAZIZ/Y@public.gmane.org>
@ 2016-11-10 16:46                                                                                                                         ` Tejun Heo
       [not found]                                                                                                                           ` <20161110164638.GC26105-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
  0 siblings, 1 reply; 112+ messages in thread
From: Tejun Heo @ 2016-11-10 16:46 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Parav Pandit, Liran Liss, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma, Li Zefan, Johannes Weiner, Doug Ledford,
	Christoph Hellwig, Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

Hello,

On Thu, Nov 10, 2016 at 06:38:37PM +0200, Leon Romanovsky wrote:
> On Thu, Nov 10, 2016 at 01:11:18PM +0530, Parav Pandit wrote:
> > Hi Leon, Tejun, Christoph, Liran,  Doug, Matan,
> >
> > So are you ok with below proposal?
> 
> I'm fine with it and it looks like very clean approach
> to solve our multi-object future.

No objection.  Parav, can you write up what the interface would look
like, with examples?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                                                                                                                           ` <20161110164638.GC26105-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
@ 2016-11-10 17:04                                                                                                                             ` Parav Pandit
       [not found]                                                                                                                               ` <CAG53R5UGfhGHc3-jgUjH5taFzTHg3BOgXi25QjuQfUFc0U7tgw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 112+ messages in thread
From: Parav Pandit @ 2016-11-10 17:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Leon Romanovsky, Liran Liss, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma, Li Zefan, Johannes Weiner, Doug Ledford,
	Christoph Hellwig, Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

Hi Tejun,

On Thu, Nov 10, 2016 at 10:16 PM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> Hello,
>
> On Thu, Nov 10, 2016 at 06:38:37PM +0200, Leon Romanovsky wrote:
>> On Thu, Nov 10, 2016 at 01:11:18PM +0530, Parav Pandit wrote:
>> > Hi Leon, Tejun, Christoph, Liran,  Doug, Matan,
>> >
>> > So are you ok with below proposal?
>>
>> I'm fine with it and it looks like a very clean approach
>> to solve our multi-object future.
>
> No objection.  Parav, can you write up what the interface would look
> like, with examples?

It is a simplified version of v12 with no architectural change.
I describe it below.

user-kernel interface:
---------------------------
(a) rdma.current - will continue to report the resource count.
(b) rdma.max - will continue to accept hca_handles and hca_objects as
absolute numbers.

Instead of the mr, pd, qp, ah and srq entries of patch_v12, it will have
just two entries for each device:
(1) hca_handles, (2) hca_objects.

rdmacg - IB stack interface:
--------------------------------
cgroup_rdma.h will have two enum entries.

RDMACG_RESOURCE_HCA_HANDLE
RDMACG_RESOURCE_OBJECT

The IB stack will charge one of these two resource types.
When HCA handles are allocated/freed, IB core will request a charge/uncharge.
When standard verb resources such as PD, MR, CQ, QP and SRQ are
allocated/freed, IB core will request an XX_OBJECT charge/uncharge.

The currently defined APIs and interfaces remain the same.
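
The charge/uncharge flow described above can be sketched in user-space C.
This is only an illustration under stated assumptions: the two enum names
come from the proposal, while the sketch_rpool structure, the
RDMACG_RESOURCE_MAX sentinel and the sketch_* helpers are hypothetical and
are not the patch's actual code. It assumes hierarchical enforcement, so a
charge succeeds only if every ancestor still has room.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The two resource types named in the proposal above. */
enum rdmacg_resource_type {
	RDMACG_RESOURCE_HCA_HANDLE,
	RDMACG_RESOURCE_OBJECT,
	RDMACG_RESOURCE_MAX,	/* hypothetical array-size sentinel */
};

/* Hypothetical per-cgroup, per-device pool; not the kernel's structure. */
struct sketch_rpool {
	struct sketch_rpool *parent;	/* NULL at the root cgroup */
	int usage[RDMACG_RESOURCE_MAX];	/* what rdma.current would report */
	int max[RDMACG_RESOURCE_MAX];	/* what rdma.max would hold */
};

/* Charge one unit in the pool and every ancestor, or charge nothing. */
static bool sketch_try_charge(struct sketch_rpool *pool,
			      enum rdmacg_resource_type type)
{
	struct sketch_rpool *p;

	/* First pass: refuse if any level of the hierarchy is at its limit. */
	for (p = pool; p != NULL; p = p->parent)
		if (p->usage[type] >= p->max[type])
			return false;

	/* Second pass: account the unit at every level. */
	for (p = pool; p != NULL; p = p->parent)
		p->usage[type]++;
	return true;
}

/* Release one previously charged unit from the pool and every ancestor. */
static void sketch_uncharge(struct sketch_rpool *pool,
			    enum rdmacg_resource_type type)
{
	struct sketch_rpool *p;

	for (p = pool; p != NULL; p = p->parent) {
		assert(p->usage[type] > 0);
		p->usage[type]--;
	}
}
```

In this sketch, allocating an HCA handle would call sketch_try_charge()
with RDMACG_RESOURCE_HCA_HANDLE, and creating any of PD, MR, CQ, QP or SRQ
would call it with RDMACG_RESOURCE_OBJECT; the matching free path calls
sketch_uncharge() with the same type.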

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                                                                                                                               ` <CAG53R5UGfhGHc3-jgUjH5taFzTHg3BOgXi25QjuQfUFc0U7tgw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-11-10 17:32                                                                                                                                 ` Tejun Heo
       [not found]                                                                                                                                   ` <20161110173217.GD26105-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
  0 siblings, 1 reply; 112+ messages in thread
From: Tejun Heo @ 2016-11-10 17:32 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Leon Romanovsky, Liran Liss, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma, Li Zefan, Johannes Weiner, Doug Ledford,
	Christoph Hellwig, Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

Hello, Parav.

On Thu, Nov 10, 2016 at 10:34:44PM +0530, Parav Pandit wrote:
> user-kernel interface:
> ---------------------------
> (a) rdma.current - will continue to report the resource count.
> (b) rdma.max - will continue to accept hca_handles and hca_objects as
> absolute numbers.
> 
> Instead of the mr, pd, qp, ah and srq entries of patch_v12, it will have
> just two entries for each device:
> (1) hca_handles, (2) hca_objects.
> 
> rdmacg - IB stack interface:
> --------------------------------
> cgroup_rdma.h will have two enum entries.
> 
> RDMACG_RESOURCE_HCA_HANDLE
> RDMACG_RESOURCE_OBJECT
> 
> The IB stack will charge one of these two resource types.
> When HCA handles are allocated/freed, IB core will request a charge/uncharge.
> When standard verb resources such as PD, MR, CQ, QP and SRQ are
> allocated/freed, IB core will request an XX_OBJECT charge/uncharge.
> 
> The currently defined APIs and interfaces remain the same.

That looks great to me from cgroup side.  Do you have plans for
exposing the maximum numbers available?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                                                                                                                                   ` <20161110173217.GD26105-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
@ 2016-11-10 17:56                                                                                                                                     ` Parav Pandit
  2016-11-10 19:23                                                                                                                                       ` Tejun Heo
  0 siblings, 1 reply; 112+ messages in thread
From: Parav Pandit @ 2016-11-10 17:56 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Leon Romanovsky, Liran Liss, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma, Li Zefan, Johannes Weiner, Doug Ledford,
	Christoph Hellwig, Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

Hi Tejun,


On Thu, Nov 10, 2016 at 11:02 PM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> Hello, Parav.
>
> On Thu, Nov 10, 2016 at 10:34:44PM +0530, Parav Pandit wrote:
>> user-kernel interface:
>> ---------------------------
>> (a) rdma.current - will continue to report the resource count.
>> (b) rdma.max - will continue to accept hca_handles and hca_objects as
>> absolute numbers.
>>
>> Instead of the mr, pd, qp, ah and srq entries of patch_v12, it will have
>> just two entries for each device:
>> (1) hca_handles, (2) hca_objects.
>>
>> rdmacg - IB stack interface:
>> --------------------------------
>> cgroup_rdma.h will have two enum entries.
>>
>> RDMACG_RESOURCE_HCA_HANDLE
>> RDMACG_RESOURCE_OBJECT
>>
>> The IB stack will charge one of these two resource types.
>> When HCA handles are allocated/freed, IB core will request a charge/uncharge.
>> When standard verb resources such as PD, MR, CQ, QP and SRQ are
>> allocated/freed, IB core will request an XX_OBJECT charge/uncharge.
>>
>> The currently defined APIs and interfaces remain the same.
>
> That looks great to me from cgroup side.  Do you have plans for
> exposing the maximum numbers available?
>
I thought more about this.
If I have to expose the max limits, I need a new file interface, rdma.limit,
because once rdma.max is set, user space cannot get the old value back.
It would need to cache it, and since user space tools might have been
restarted and so on, it would have to store it in another file, etc.
Such user space solutions are just ugly.

Getting and setting values in a device-agnostic way through cgroup
files is desirable, but it is not a must. User space can fall back on
the verb-based API.

So if there is no objection, I prefer to add the rdma.limit file as an
incremental patch once the base version is merged.


> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
  2016-11-10 17:56                                                                                                                                     ` Parav Pandit
@ 2016-11-10 19:23                                                                                                                                       ` Tejun Heo
       [not found]                                                                                                                                         ` <20161110192344.GA4805-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
  0 siblings, 1 reply; 112+ messages in thread
From: Tejun Heo @ 2016-11-10 19:23 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Leon Romanovsky, Liran Liss, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma, Li Zefan, Johannes Weiner, Doug Ledford,
	Christoph Hellwig, Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

Hello, Parav.

On Thu, Nov 10, 2016 at 11:26:28PM +0530, Parav Pandit wrote:
> > That looks great to me from cgroup side.  Do you have plans for
> > exposing the maximum numbers available?
>
> If I have to expose the max limits, I need a new file interface, rdma.limit,
> because once rdma.max is set, user space cannot get the old value back.
> It would need to cache it, and since user space tools might have been
> restarted and so on, it would have to store it in another file, etc.
> Such user space solutions are just ugly.
> 
> Getting and setting values in a device-agnostic way through cgroup
> files is desirable, but it is not a must. User space can fall back on
> the verb-based API.
> 
> So if there is no objection, I prefer to add the rdma.limit file as an
> incremental patch once the base version is merged.

How about something like a RESOURCE.available field in the rdma.stat file?
Its value can be what's maximally available at that level when max is
unlimited and there is no competition.  At top level cgroups, it'd be
the total resources available in the system.  At sub levels, it'd be
the min of what's available to the grandparent and the parent's limit
on the resource.

This would be in line with cgroup conventions and would behave the
same way when nested, making things easier for containers.
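
As a rough illustration of the RESOURCE.available idea, the value can be
computed by walking up the hierarchy. This is a sketch of one reading of
the description above, not an agreed design: it takes "available" to be
the system total clamped by every ancestor's limit while ignoring the
cgroup's own max. The sketch_cg structure, SKETCH_MAX_UNLIMITED and the
helper names are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical marker for a limit that has not been configured. */
#define SKETCH_MAX_UNLIMITED INT_MAX

/* Hypothetical cgroup node tracking one resource on one device. */
struct sketch_cg {
	const struct sketch_cg *parent;	/* NULL for the root cgroup */
	int max;			/* configured limit, or unlimited */
};

static int sketch_min(int a, int b)
{
	return a < b ? a : b;
}

/*
 * What would be maximally available to cg if its own max were unlimited
 * and nothing else competed: the system total, clamped by the limit of
 * each ancestor in turn.
 */
static int sketch_available(const struct sketch_cg *cg, int system_total)
{
	const struct sketch_cg *p;
	int avail = system_total;

	assert(cg != NULL);
	for (p = cg->parent; p != NULL; p = p->parent)
		avail = sketch_min(avail, p->max);
	return avail;
}
```

With an unlimited root, a top-level cgroup sees the whole system total,
and each nested level sees at most what its parent was allowed, which is
the nesting behaviour described above.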

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCHv12 0/3] rdmacg: IB/core: rdma controller support
       [not found]                                                                                                                                         ` <20161110192344.GA4805-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
@ 2016-11-11 13:00                                                                                                                                           ` Parav Pandit
  0 siblings, 0 replies; 112+ messages in thread
From: Parav Pandit @ 2016-11-11 13:00 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Leon Romanovsky, Liran Liss, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma, Li Zefan, Johannes Weiner, Doug Ledford,
	Christoph Hellwig, Hefty, Sean, Jason Gunthorpe, Haggai Eran,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, Or Gerlitz, Matan Barak

Hi Tejun,

On Fri, Nov 11, 2016 at 12:53 AM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> Hello, Parav.
>
> On Thu, Nov 10, 2016 at 11:26:28PM +0530, Parav Pandit wrote:
>> > That looks great to me from cgroup side.  Do you have plans for
>> > exposing the maximum numbers available?
>>
>> If I have to expose the max limits, I need a new file interface, rdma.limit,
>> because once rdma.max is set, user space cannot get the old value back.
>> It would need to cache it, and since user space tools might have been
>> restarted and so on, it would have to store it in another file, etc.
>> Such user space solutions are just ugly.
>>
>> Getting and setting values in a device-agnostic way through cgroup
>> files is desirable, but it is not a must. User space can fall back on
>> the verb-based API.
>>
>> So if there is no objection, I prefer to add the rdma.limit file as an
>> incremental patch once the base version is merged.
>
> How about something like a RESOURCE.available field in the rdma.stat file?
> Its value can be what's maximally available at that level when max is
> unlimited and there is no competition.  At top level cgroups, it'd be
> the total resources available in the system.  At sub levels, it'd be
> the min of what's available to the grandparent and the parent's limit
> on the resource.
>
> This would be in line with cgroup conventions and would behave the
> same way when nested, making things easier for containers.
>
Yes. This is what is needed. Awesome.
I will include it in a follow-on patch after the first round of functionality.
Patch_v13 will have base functionality similar to v12. This new file
will be added once the code is merged.


> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 112+ messages in thread

