Date: Mon, 4 Apr 2016 15:36:40 -0400
From: Tejun Heo
To: Parav Pandit
Cc: cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org,
	lizefan@huawei.com, hannes@cmpxchg.org, dledford@redhat.com,
	liranl@mellanox.com, sean.hefty@intel.com,
	jgunthorpe@obsidianresearch.com, haggaie@mellanox.com,
	corbet@lwn.net, james.l.morris@oracle.com, serge@hallyn.com,
	ogerlitz@mellanox.com, matanb@mellanox.com,
	akpm@linux-foundation.org, linux-security-module@vger.kernel.org
Subject: Re: [PATCHv10 1/3] rdmacg: Added rdma cgroup controller
Message-ID: <20160404193640.GA7822@mtj.duckdns.org>
References: <1458850962-16057-1-git-send-email-pandit.parav@gmail.com>
 <1458850962-16057-2-git-send-email-pandit.parav@gmail.com>
In-Reply-To: <1458850962-16057-2-git-send-email-pandit.parav@gmail.com>

Hello, Parav.

Sorry about the delay.

> +struct rdma_cgroup {
> +	struct cgroup_subsys_state css;
> +
> +	/* protects resource pool list */
> +	spinlock_t rpool_list_lock;

Is this lock still being used?

> +	/*
> +	 * head to keep track of all resource pools
> +	 * that belongs to this cgroup.
> +	 */
> +	struct list_head rpool_head;
> +};
> +
> +struct rdmacg_pool_info {
> +	const char **resource_name_table;
> +	int table_len;

It's a bit bothersome that the resource type definitions live outside
the controller, in a way the controller can't know about, and are all
assumed to have the same type and range.  Architecting kernel code so
that separately built modules can modify behavior usually isn't a
priority, and structuring things like this makes implementing specific
one-off behaviors cumbersome.

It looks okay now, but let's say in the future someone wants to use a
different type and/or range for one of the resources.  Given the above
framework, that person is most likely to blow up pool_info - first
with types and min/max values, and maybe later with callback handlers
and so on - when the only thing which would have been necessary is a
one-off handling of that specific setting in the interface functions.

Making things modular comes at the cost of complexity and loss of
ad-hoc flexibility.  There are cases where such overhead is warranted
but, as I mentioned before, I don't think this is one of them.  Please
look back on how this patchset developed.  It has been a continuous
process of cutting down complexities which didn't need to be there,
which is the opposite of what we usually want to do - building up
sophistication and complexity as necessary from something simple.
Please cut down this part too.

> +};
> +
> +struct rdmacg_device {
> +	struct rdmacg_pool_info pool_info;
> +	struct list_head rdmacg_list;
> +	struct list_head rpool_head;
> +	/* protects resource pool list */
> +	spinlock_t rpool_lock;

Does this still need to be a separate lock?

> +#define RDMACG_MAX_STR "max"
> +
> +static DEFINE_MUTEX(dev_mutex);

Can you please prefix the above with rdmacg?

> +static LIST_HEAD(dev_list_head);

Ditto, how about rdmacg_dev_list?

> +/*
> + * resource pool object which represents per cgroup, per device
> + * resources.
> + * There are multiple instance of this object per cgroup,
                                ^s
> + * therefore it cannot be embedded within rdma_cgroup structure.  It
> + * is maintained as list.
> + */
> +struct rdmacg_resource_pool {
> +	struct list_head cg_list;
> +	struct list_head dev_list;

Isn't it more conventional to use OBJ_list or OBJs for the head and
_node or _link for the members?

> +static int
> +alloc_cg_rpool(struct rdma_cgroup *cg, struct rdmacg_device *device)
> +{
> +	struct rdmacg_resource_pool *rpool, *other_rpool;
> +	struct rdmacg_pool_info *pool_info = &device->pool_info;
> +	int ret;
> +
> +	rpool = kzalloc(sizeof(*rpool), GFP_KERNEL);
> +	if (!rpool) {
> +		ret = -ENOMEM;
> +		goto err;
> +	}
> +	rpool->resources = kcalloc(pool_info->table_len,
> +				   sizeof(*rpool->resources),
> +				   GFP_KERNEL);
> +	if (!rpool->resources) {
> +		ret = -ENOMEM;
> +		goto alloc_err;
> +	}
> +
> +	rpool->device = device;
> +	INIT_LIST_HEAD(&rpool->cg_list);
> +	INIT_LIST_HEAD(&rpool->dev_list);
> +	spin_lock_init(&device->rpool_lock);
> +	set_all_resource_max_limit(rpool);
> +
> +	spin_lock(&rpool_list_lock);
> +
> +	other_rpool = find_cg_rpool_locked(cg, device);
> +
> +	/*
> +	 * if other task added resource pool for this device for this cgroup
> +	 * than free up which was recently created and use the one we found.
> +	 */

* It's usually a good idea to have hierarchical objects created all
  the way up to the root when a leaf node is requested, linking the
  parent at each level.  That way, when a descendant object is looked
  up, the caller can always deref its ->parent pointer till it reaches
  the top.

* Unless it's a very hot path, using a mutex which protects creation
  of new nodes is usually simpler than doing alloc -> lock ->
  try-insert.  The concurrency in the creation path doesn't matter,
  and even if it mattered, it isn't a good idea to bang on the
  allocation path concurrently anyway.
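A userspace sketch of the two points above (names like struct pool and
get_or_create_pool() are hypothetical, not from the patch): ancestor
pools are created first, each new pool keeps a ->parent link, and in
the kernel the whole path would run under a single creation mutex.

```c
#include <stdlib.h>

struct pool {
	struct pool *parent;	/* pool of the parent cgroup, NULL at root */
	long usage;
};

struct cgroup {
	struct cgroup *parent;
	struct pool *pool;	/* this cgroup's pool for one device */
};

/*
 * Create pools from the root down, linking each new pool to its
 * parent's pool.  Because creation is serialized by a mutex in the
 * kernel model, there is no alloc -> lock -> try-insert dance.
 */
static struct pool *get_or_create_pool(struct cgroup *cg)
{
	struct pool *parent_pool = NULL;

	if (cg->pool)
		return cg->pool;

	if (cg->parent) {
		parent_pool = get_or_create_pool(cg->parent);
		if (!parent_pool)
			return NULL;
	}

	cg->pool = calloc(1, sizeof(*cg->pool));
	if (cg->pool)
		cg->pool->parent = parent_pool;
	return cg->pool;
}
```

After this, a descendant lookup never searches lists again; it just
chases ->parent to the top.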
> +static void
> +uncharge_cg_resource_locked(struct rdma_cgroup *cg,
> +			    struct rdmacg_device *device,
> +			    int index)
> +{
> +	struct rdmacg_resource_pool *rpool;
> +	struct rdmacg_pool_info *pool_info = &device->pool_info;
> +
> +	rpool = find_cg_rpool_locked(cg, device);

So, with each pool linking to its parent, instead of looking up at
each level, it can just follow ->parent recursively.

> +	/*
> +	 * A negative count (or overflow) is invalid,
> +	 * it indicates a bug in the rdma controller.
> +	 */
> +	WARN_ON_ONCE(rpool->resources[index].usage < 0);
> +	rpool->usage_sum--;

What guarantees that this count wouldn't overflow?  More on this
later.

> +/**
> + * rdmacg_uncharge_resource - hierarchically uncharge rdma resource count
> + * @device: pointer to rdmacg device
> + * @index: index of the resource to uncharge in cg in given resource pool
> + * @num: the number of rdma resource to uncharge
> + *
> + */
> +void rdmacg_uncharge(struct rdma_cgroup *cg,
> +		     struct rdmacg_device *device,
> +		     int index)
> +{
> +	struct rdma_cgroup *p;
> +
> +	spin_lock(&rpool_list_lock);
> +	for (p = cg; p; p = parent_rdmacg(p))
> +		uncharge_cg_resource_locked(p, device, index);
> +	spin_unlock(&rpool_list_lock);
> +
> +	css_put(&cg->css);

Hmm... what css ref is the above putting?

> +int rdmacg_try_charge(struct rdma_cgroup **rdmacg,
> +		      struct rdmacg_device *device,
> +		      int index)
> +{
...
> +	spin_lock(&rpool_list_lock);
> +	for (p = cg; p; p = parent_rdmacg(p)) {
> +retry:
> +		rpool = find_cg_rpool_locked(p, device);

Here, too, you can look up the leaf node once and walk up the
hierarchy.  Maybe create a helper, say rdmacg_get_pool_and_lock(),
which looks up an existing node or creates all pools from leaf to root
and returns with the lock held?

...
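To illustrate the walk (a sketch with hypothetical names, not the
patch's actual types): once each pool carries a ->parent pointer, a
hierarchical uncharge is one leaf lookup followed by pointer chasing,
with no per-level list search.

```c
struct pool {
	struct pool *parent;	/* NULL at the root */
	long usage;
};

/*
 * Uncharge the whole ancestry of a leaf pool.  The kernel version
 * would additionally WARN_ON a count going negative.
 */
static void uncharge_hierarchy(struct pool *leaf, long num)
{
	struct pool *p;

	for (p = leaf; p; p = p->parent)
		p->usage -= num;
}
```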
> +err:
> +	spin_lock(&rpool_list_lock);
> +	for (q = cg; q != p; q = parent_rdmacg(q))
> +		uncharge_cg_resource_locked(q, device, index);

If some resources are not very plentiful and are contended, dropping
the lock before error handling like the above could lead to spurious
failures.  It looks like the above comes from trying to create a pool
at each level while charging up.  Creating all the necessary pools at
the outset should make the code simpler and more robust.

> +	spin_unlock(&rpool_list_lock);
> +	css_put(&cg->css);

Ditto, where is this from?

> +void rdmacg_query_limit(struct rdmacg_device *device, int *limits)
> +{
> +	struct rdma_cgroup *cg, *p;
> +	struct rdmacg_resource_pool *rpool;
> +	struct rdmacg_pool_info *pool_info;
> +	int i;
> +
> +	pool_info = &device->pool_info;
> +
> +	for (i = 0; i < pool_info->table_len; i++)
> +		limits[i] = S32_MAX;
> +
> +	cg = get_current_rdmacg();
> +	/*
> +	 * Check in hirerchy which pool get the least amount of
> +	 * resource limits.
> +	 */
> +	spin_lock(&rpool_list_lock);
> +	for (p = cg; p; p = parent_rdmacg(p)) {
> +		rpool = find_cg_rpool_locked(cg, device);

Ditto with the lookup here.

> +static struct rdmacg_device *rdmacg_get_device_locked(const char *name)
> +{
> +	struct rdmacg_device *device;

lockdep_assert_held() would be nice.
> +	list_for_each_entry(device, &dev_list_head, rdmacg_list)
> +		if (!strcmp(name, device->name))
> +			return device;
> +
> +	return NULL;
> +}
> +
> +static ssize_t rdmacg_resource_set_max(struct kernfs_open_file *of,
> +				       char *buf, size_t nbytes, loff_t off)
> +{
> +	struct rdma_cgroup *cg = css_rdmacg(of_css(of));
> +	const char *dev_name;
> +	struct rdmacg_resource_pool *rpool;
> +	struct rdmacg_device *device;
> +	char *options = strstrip(buf);
> +	struct rdmacg_pool_info *pool_info;
> +	u64 enables = 0;
> +	int *new_limits;
> +	int i = 0, ret = 0;
> +
> +	/* extract the device name first */
> +	dev_name = strsep(&options, " ");
> +	if (!dev_name) {
> +		ret = -EINVAL;
> +		goto err;
> +	}
> +
> +	/* acquire lock to synchronize with hot plug devices */
> +	mutex_lock(&dev_mutex);
> +
> +	device = rdmacg_get_device_locked(dev_name);
> +	if (!device) {
> +		ret = -ENODEV;
> +		goto parse_err;
> +	}
> +
> +	pool_info = &device->pool_info;
> +
> +	new_limits = kcalloc(pool_info->table_len, sizeof(int), GFP_KERNEL);
> +	if (!new_limits) {
> +		ret = -ENOMEM;
> +		goto parse_err;
> +	}

Hmm... except for the new_limits allocation and the alloc_cg_rpool()
invocation, both of which can be done at the head of the function,
nothing seems to require a mutex here.  If we move them to the head of
the function, can we get rid of dev_mutex entirely?

> +	ret = rdmacg_parse_limits(options, pool_info, new_limits, &enables);
> +	if (ret)
> +		goto opt_err;
> +
> +retry:
> +	spin_lock(&rpool_list_lock);
> +	rpool = find_cg_rpool_locked(cg, device);
> +	if (!rpool) {
> +		spin_unlock(&rpool_list_lock);
> +		ret = alloc_cg_rpool(cg, device);
> +		if (ret)
> +			goto opt_err;
> +		else
> +			goto retry;
> +	}
> +
> +	/* now set the new limits of the rpool */
> +	while (enables) {
> +		/* if user set the limit, enables bit is set */
> +		if (enables & BIT(i)) {
> +			enables &= ~BIT(i);
> +			set_resource_limit(rpool, i, new_limits[i]);
> +		}
> +		i++;
> +	}

Maybe we can use for_each_set_bit()?
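For reference, for_each_set_bit() visits only the set bits of a
bitmap, so the open-coded while/BIT(i) scan collapses into a single
loop body.  A userspace approximation of the same iteration pattern
(apply_enabled_limits() is a hypothetical name; __builtin_ctzll is a
GCC/Clang built-in):

```c
/*
 * For each set bit i in 'enables', copy new_limits[i] into limits[i].
 * Mirrors the for_each_set_bit() idiom: find the lowest set bit, act
 * on it, clear it, repeat.
 */
static void apply_enabled_limits(unsigned long long enables,
				 const int *new_limits, int *limits)
{
	while (enables) {
		int i = __builtin_ctzll(enables); /* index of lowest set bit */

		limits[i] = new_limits[i];
		enables &= enables - 1;		  /* clear that bit */
	}
}
```

The loop runs once per set bit rather than once per bit position,
which is exactly what the kernel macro buys over the manual counter.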
> +	if (rpool->usage_sum == 0 &&
> +	    rpool->num_max_cnt == pool_info->table_len) {
> +		/*
> +		 * No user of the rpool and all entries are set to max, so
> +		 * safe to delete this rpool.
> +		 */
> +		list_del(&rpool->cg_list);
> +		spin_unlock(&rpool_list_lock);
> +
> +		free_cg_rpool(rpool);

Can't we just count the number of resources which either have a !max
setting or !0 usage?

> +static u32 *get_cg_rpool_values(struct rdma_cgroup *cg,
> +				struct rdmacg_device *device,
> +				enum rdmacg_file_type sf_type,
> +				int count)
> +{
> +	struct rdmacg_resource_pool *rpool;
> +	u32 *value_tbl;
> +	int i, ret;
> +
> +	value_tbl = kcalloc(count, sizeof(u32), GFP_KERNEL);
> +	if (!value_tbl) {
> +		ret = -ENOMEM;
> +		goto err;
> +	}

Do we need this type of interface?  Why not let the caller look up the
target pool and call this function with the pool and resource index?

Thanks.

-- 
tejun