linux-kernel.vger.kernel.org archive mirror
* RFC rdma cgroup
@ 2015-10-28  8:29 Parav Pandit
  2015-10-29 14:57 ` Haggai Eran
  2015-11-24 15:47 ` Tejun Heo
  0 siblings, 2 replies; 8+ messages in thread
From: Parav Pandit @ 2015-10-28  8:29 UTC (permalink / raw)
  To: Tejun Heo, Doug Ledford, Hefty, Sean, linux-rdma, cgroups,
	Liran Liss, linux-kernel, lizefan, Johannes Weiner,
	Jonathan Corbet, james.l.morris, serge, Or Gerlitz, Matan Barak,
	raindel, akpm, linux-security-module, Jason Gunthorpe,
	Parav Pandit, Haggai Eran

Hi All,

Based on the review comments, feedback, and discussion from/with Tejun,
Haggai, Doug, Jason, Liran, Sean, and the ORNL team, I have updated the
design as below.

This is a fairly robust and simple design that addresses most of the
points raised to cover current RDMA use cases.
Feel free to skip the design guidelines section and jump to the design
section below if you find it too verbose. I had to describe the
guidelines to set the context and address comments from our past
discussion.

Design guidelines:
-----------------------
1. There will be a new rdma cgroup for accounting rdma resources
(instead of extending the device cgroup).
Rationale: RDMA tracks different types of resources, and it functions
differently from the device cgroup. Though the device cgroup could have
been extended in a more generic way, the community feels it is better
to create an RDMA cgroup, which might have more features than just
resource limit enforcement in the future.

2. The RDMA cgroup will allow resource accounting and limit enforcement
on a per-cgroup, per-rdma-device basis (instead of resource limiting
across all devices).
Rationale: this gives granular control when multiple devices exist in
the system.

3. Resources are not defined by the RDMA cgroup. Resources are defined
by the RDMA/IB subsystem and optionally by HCA vendor device drivers.
Rationale: This allows the rdma cgroup to remain constant while the
RDMA/IB subsystem evolves, without the need for rdma cgroup updates. A
new resource can easily be added by the RDMA/IB subsystem without
touching the rdma cgroup.

4. The RDMA uverbs layer will enforce limits on well-defined RDMA verb
resources without any HCA vendor device driver involvement.
Rationale:
(a) RDMA verbs have been a very well defined set of resource
abstractions in the Linux kernel stack for many years now, and they are
used by many applications working directly with RDMA resources in
varied ways. Instead of replicating code in every vendor driver, the
RDMA uverbs layer will enforce such resource limits (with the help of
the rdma cgroup).
(b) An IB verbs resource is also a vendor-agnostic representation of an
RDMA resource; therefore this is done at the RDMA uverbs level.

5. The RDMA uverbs layer will not do accounting of hw vendor specific
resources.
Rationale: The RDMA uverbs layer is not aware of which hw resource maps
to which verb resource, or by what amount. Therefore hw resource
accounting, charging, and uncharging have to happen in the vendor
driver. This is optional and left to the HCA vendor device driver to
implement. The HCA driver knows best how to maintain the mapping, so
the accounting is left to it.

6. The RDMA cgroup will provide unified APIs through which both RDMA
subsystem resources and vendor-defined RDMA resources can be charged
and uncharged by the verbs layer and the HCA driver respectively (a
prototype sketch appears after this list).

7. The initial version of the RDMA cgroup will support only hard limits
without any kind of reservation of resources or ranges. In the future
it might be extended to be more dynamic.
Rationale: RDMA resources are typically stateful, unlike CPU, and they
do not follow a work-conserving model.

8. Resource limit enforcement is hierarchical.

9. Process migration from one cgroup to another with active RDMA
resources is highly discouraged.

10. When a process is migrated with active RDMA resources, the rdma
cgroup continues to charge the original cgroup.
Rationale:
Unlike other POSIX calls, RDMA resources are not defined at the POSIX
level. These resources sit behind a file descriptor.
Multiple forked processes, belonging to different thread groups, can
possibly be placed in different cgroups while sharing the same rdma
resources.
It could well happen that a resource is allocated by one thread group
and released by another thread group from a different cgroup.
The resource usage hierarchy can easily get complex, even though that
is not the primary use case.
Typically, all processes which want to use RDMA resources will be part
of one leaf cgroup throughout their life cycle.
Therefore it is not worth complicating the design around process
migration.
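
For illustration, a minimal sketch of the unified charge/uncharge API
from guideline 6 could look as below. The names and signatures here are
assumptions for illustration, not the final interface.

/* Hypothetical unified API: the uverbs layer calls it for verb
 * resources and HCA drivers call it for their hw resources. */
int rdmacg_try_charge(struct rdma_cgroup **rdmacg,
                      struct ib_device *device,
                      int resource_index);      /* verb or hw index */

void rdmacg_uncharge(struct rdma_cgroup *rdmacg,
                     struct ib_device *device,
                     int resource_index);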

Design:
---------
1. The new RDMA cgroup defines a resource pool object that connects the
cgroup subsystem to the RDMA subsystem.
2. A resource pool object is a per-cgroup, per-device entity that is
managed, controlled, and configured by the administrator via the cgroup
interface.
3. There can be a maximum of 64 resources per resource pool (such as
MR, QP, AH, PD, etc. and other hardware resources). Managing resources
beyond 64 would require an RDMA cgroup subsystem update. This will be
done in the future if it is needed at all.

4. The RDMA cgroup defines two classes of resources.
(a) verb resources - track RDMA verb layer resources
(b) hw resources - track HCA HW specific resources
5. The verbs resource template is defined by the RDMA uverbs layer.
6. The hw resource template is defined by the HCA vendor driver. This
is optional and should be done by those drivers that do not have a
one-to-one mapping between verb resources and hw resources.

7. Processes in a cgroup without any configured limits (in other words,
without resource pools) have the maximum limits for all resources. If a
limit is configured for a particular resource, that resource is
enforced; the rest can still be used up to their maximum limits.

8. Typically each RDMA cgroup will have 0 to 4 RDMA devices. Therefore
each cgroup will have 0 to 4 verbs resource pools and optionally 0 to 4
hw resource pools, one per such device.
(Nothing prevents having more devices and pools, but the design is
centered on this use case.)

9. A resource pool object is created in the following situations (see
the sketch after this list).
(a) An administrative operation sets a limit and no resource pool
exists yet for the device of interest in the cgroup.
(b) No resource limits were configured, but the IB/RDMA subsystem tries
to charge a resource. This way, when applications run without limits
and limits are enforced later, uncharging still works correctly;
otherwise the usage count would drop below zero.
This is done using a default resource pool.
Instead of implementing any sort of time markers, the default pool
simplifies the design.
(c) When a process migrates from one cgroup to another, the resource
continues to be owned by the creator cgroup (rather, its css).
After process migration, whenever a new resource is created in the new
cgroup, it is owned by the new cgroup.

10. A resource pool is destroyed if it was of the default type (not
created by an administrative operation) and its last resource is
deallocated. A resource pool created by an administrative operation is
not deleted, as it is expected to be used in the near future.

11. If an administrative command tries to delete all the resource
limits of a device that still has active resources, the RDMA cgroup
just marks the pool as a default pool with maximum limits.
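
For illustration, the charge-time pool lookup/creation described in
point 9 could look roughly as below. All names and types here are
illustrative assumptions, not the actual patch.

/* Hypothetical sketch: find the per-device pool, creating a default
 * pool (all limits at maximum) if the admin never configured one, so
 * that a later uncharge always has a pool to uncharge from. */
struct rdmacg_resource_pool {
        struct list_head cg_node;
        struct ib_device *device;
        int type;                       /* default or admin-created */
        /* per-resource limit and usage arrays would live here */
};

static struct rdmacg_resource_pool *
rdmacg_get_pool(struct rdma_cgroup *cg, struct ib_device *device)
{
        struct rdmacg_resource_pool *pool;

        /* linear search; fine for the expected 0 to 4 devices */
        list_for_each_entry(pool, &cg->pool_list, cg_node)
                if (pool->device == device)
                        return pool;

        pool = kzalloc(sizeof(*pool), GFP_KERNEL);
        if (!pool)
                return NULL;
        pool->device = device;
        pool->type = RDMACG_POOL_DEFAULT;
        list_add_tail(&pool->cg_node, &cg->pool_list);
        return pool;
}
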
----------------------------------------------------------------

Examples:
#configure resource limit:
echo mlx4_0 mr=100 qp=10 ah=2 cq=10 >
/sys/fs/cgroup/rdma/1/rdma.resource.verb.limit
echo ocrdma1 mr=120 qp=20 ah=2 cq=10 >
/sys/fs/cgroup/rdma/2/rdma.resource.verb.limit

#query resource limit:
cat /sys/fs/cgroup/rdma/2/rdma.resource.verb.limit
#output:
ocrdma1 mr=120 qp=20 ah=2 cq=10

#delete resource limit:
echo mlx4_0 del > /sys/fs/cgroup/rdma/1/rdma.resource.verb.limit

#query resource list:
cat /sys/fs/cgroup/rdma/1/rdma.resource.verb.list
mlx4_0 mr qp ah pd cq

cat /sys/fs/cgroup/rdma/1/rdma.resource.hw.list
vendor1 hw_qp hw_cq hw_timer

#configure hw specific resource limit
echo vendor1 hw_qp=56 > /sys/fs/cgroup/rdma/2/rdma.resource.hw.limit

-------------------------------------------------------------------------

I have completed the initial development of the above design and am
currently testing it.
I will post the patch soon, once I am done validating it.

Let me know if there are any design comments.

Regards,
Parav Pandit


* Re: RFC rdma cgroup
  2015-10-28  8:29 RFC rdma cgroup Parav Pandit
@ 2015-10-29 14:57 ` Haggai Eran
  2015-10-29 18:46   ` Parav Pandit
  2015-11-24 15:47 ` Tejun Heo
  1 sibling, 1 reply; 8+ messages in thread
From: Haggai Eran @ 2015-10-29 14:57 UTC (permalink / raw)
  To: Parav Pandit, Tejun Heo, Doug Ledford, Hefty, Sean, linux-rdma,
	cgroups, Liran Liss, linux-kernel, lizefan, Johannes Weiner,
	Jonathan Corbet, james.l.morris, serge, Or Gerlitz, Matan Barak,
	raindel, akpm, linux-security-module, Jason Gunthorpe

On 28/10/2015 10:29, Parav Pandit wrote:
> 3. Resources are not defined by the RDMA cgroup. Resources are defined
> by the RDMA/IB subsystem and optionally by HCA vendor device drivers.
> Rationale: This allows the rdma cgroup to remain constant while the
> RDMA/IB subsystem evolves, without the need for rdma cgroup updates. A
> new resource can easily be added by the RDMA/IB subsystem without
> touching the rdma cgroup.
Resources exposed by the cgroup are basically a UAPI, so we have to be
careful to keep it stable as it evolves. I understand the need for
vendor-specific resources, following the discussion on the previous
proposal, but could you write about how you plan to allow this set of
resources to evolve?

> 8. Typically each RDMA cgroup will have 0 to 4 RDMA devices. Therefore
> each cgroup will have 0 to 4 verbs resource pools and optionally 0 to 4
> hw resource pools, one per such device.
> (Nothing prevents having more devices and pools, but the design is
> centered on this use case.)
In what way does the design depend on this assumption?

> 9. A resource pool object is created in the following situations.
> (a) An administrative operation sets a limit and no resource pool
> exists yet for the device of interest in the cgroup.
> (b) No resource limits were configured, but the IB/RDMA subsystem tries
> to charge a resource. This way, when applications run without limits
> and limits are enforced later, uncharging still works correctly;
> otherwise the usage count would drop below zero.
> This is done using a default resource pool.
> Instead of implementing any sort of time markers, the default pool
> simplifies the design.
Having a default resource pool kind of implies there is a non-default
one. Is the only difference between the default and non-default pools
the fact that the second was created with an administrative operation
and has specified limits, or is there some other difference?

> (c) When a process migrates from one cgroup to another, the resource
> continues to be owned by the creator cgroup (rather, its css).
> After process migration, whenever a new resource is created in the new
> cgroup, it is owned by the new cgroup.
It sounds a little different from how other cgroups behave. I agree that
mostly processes will create the resources in their cgroup and won't
migrate, but why not move the charge during migration?

I finally wanted to ask about other limitations an RDMA cgroup could
handle. It would be great to be able to limit a container to be allowed
to use only a subset of the MAC/VLAN pairs programmed to a device, or
only a subset of P_Keys and GIDs it has. Do you see such limitations
also as part of this cgroup?

Thanks,
Haggai


* Re: RFC rdma cgroup
  2015-10-29 14:57 ` Haggai Eran
@ 2015-10-29 18:46   ` Parav Pandit
  2015-11-02 13:43     ` Haggai Eran
  0 siblings, 1 reply; 8+ messages in thread
From: Parav Pandit @ 2015-10-29 18:46 UTC (permalink / raw)
  To: Haggai Eran
  Cc: Tejun Heo, Doug Ledford, Hefty, Sean, linux-rdma, cgroups,
	Liran Liss, linux-kernel, lizefan, Johannes Weiner,
	Jonathan Corbet, james.l.morris, serge, Or Gerlitz, Matan Barak,
	raindel, akpm, linux-security-module, Jason Gunthorpe

Hi Haggai,

On Thu, Oct 29, 2015 at 8:27 PM, Haggai Eran <haggaie@mellanox.com> wrote:
> On 28/10/2015 10:29, Parav Pandit wrote:
>> 3. Resources are not defined by the RDMA cgroup. Resources are defined
>> by RDMA/IB subsystem and optionally by HCA vendor device drivers.
>> Rationale: This allows rdma cgroup to remain constant while RDMA/IB
>> subsystem can evolve without the need of rdma cgroup update. A new
>> resource can be easily added by the RDMA/IB subsystem without touching
>> rdma cgroup.
> Resources exposed by the cgroup are basically a UAPI, so we have to be
> careful to make it stable when it evolves. I understand the need for
> vendor specific resources, following the discussion on the previous
> proposal, but could you write on how you plan to allow these set of
> resources to evolve?

It's fairly simple.
Here is a code snippet showing how resources are defined in my tree.
It doesn't have the RSS work queues yet, but they can be added right
after this patch.

Resources are defined as an index and as a match_table_t.

enum rdma_resource_type {
        RDMA_VERB_RESOURCE_UCTX,
        RDMA_VERB_RESOURCE_AH,
        RDMA_VERB_RESOURCE_PD,
        RDMA_VERB_RESOURCE_CQ,
        RDMA_VERB_RESOURCE_MR,
        RDMA_VERB_RESOURCE_MW,
        RDMA_VERB_RESOURCE_SRQ,
        RDMA_VERB_RESOURCE_QP,
        RDMA_VERB_RESOURCE_FLOW,
        RDMA_VERB_RESOURCE_MAX,
};
So UAPI RDMA resources can evolve by just adding more entries here.
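
For illustration, each index could then be paired with a user-visible
token via match_table_t (from linux/parser.h) roughly as below; the
exact token strings here are an assumption, not the actual patch.

static match_table_t resource_tokens = {
        {RDMA_VERB_RESOURCE_UCTX, "uctx=%d"},
        {RDMA_VERB_RESOURCE_AH,   "ah=%d"},
        {RDMA_VERB_RESOURCE_PD,   "pd=%d"},
        {RDMA_VERB_RESOURCE_CQ,   "cq=%d"},
        {RDMA_VERB_RESOURCE_MR,   "mr=%d"},
        {RDMA_VERB_RESOURCE_MW,   "mw=%d"},
        {RDMA_VERB_RESOURCE_SRQ,  "srq=%d"},
        {RDMA_VERB_RESOURCE_QP,   "qp=%d"},
        {RDMA_VERB_RESOURCE_FLOW, "flow=%d"},
        {-1, NULL}
};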

>
>> 8. Typically each RDMA cgroup will have 0 to 4 RDMA devices. Therefore
>> each cgroup will have 0 to 4 verbs resource pools and optionally 0 to 4
>> hw resource pools, one per such device.
>> (Nothing prevents having more devices and pools, but the design is
>> centered on this use case.)
> In what way does the design depend on this assumption?

When the current code performs resource charging/uncharging, it needs
to identify which resource pool to charge.
The resource pools are maintained in a list_head, so it is a linear
search per device.
If we are thinking of 100s of RDMA devices per container, then linear
search will not be a good approach and a different data structure will
need to be deployed.


>
>> 9. A resource pool object is created in the following situations.
>> (a) An administrative operation sets a limit and no resource pool
>> exists yet for the device of interest in the cgroup.
>> (b) No resource limits were configured, but the IB/RDMA subsystem tries
>> to charge a resource. This way, when applications run without limits
>> and limits are enforced later, uncharging still works correctly;
>> otherwise the usage count would drop below zero.
>> This is done using a default resource pool.
>> Instead of implementing any sort of time markers, the default pool
>> simplifies the design.
> Having a default resource pool kind of implies there is a non-default
> one. Is the only difference between the default and non-default pools
> the fact that the second was created with an administrative operation
> and has specified limits, or is there some other difference?
>
You described it correctly.

>> (c) When a process migrates from one cgroup to another, the resource
>> continues to be owned by the creator cgroup (rather, its css).
>> After process migration, whenever a new resource is created in the new
>> cgroup, it is owned by the new cgroup.
> It sounds a little different from how other cgroups behave. I agree that
> mostly processes will create the resources in their cgroup and won't
> migrate, but why not move the charge during migration?
>
With fork(), a process doesn't really own the resource (unlike other
file and socket descriptors).
The parent process might also have died.
There is possibly no clear way to transfer the resource to the right
child, and the child that the cgroup picks might not even want to own
RDMA resources.
RDMA resources might be allocated by one process and freed by another
(though this might not be the way they are used).
It is pretty similar to other cgroups, with an exception in the
migration area; that exception comes from the different way RDMA
resources are owned, created and used.
Tejun's recent unified hierarchy patch equally highlights that
processes should not be migrated frequently among cgroups.

So in the current implementation (like others):
if a process created an RDMA resource and forked a child,
child and parent can both allocate and free more resources.
The child may move to a different cgroup, but the resource is shared
among them, and the child can also free the resource. All crazy
combinations are possible in theory (without many use cases).
So at best, resources are charged to the first cgroup css in which
parent/child were created, and a reference is held to that css.
The cgroup and process can die, but the css remains until the RDMA
resources are freed.
This is similar to process behavior, where the task struct is released
but the pid is held for a while.


> I finally wanted to ask about other limitations an RDMA cgroup could
> handle. It would be great to be able to limit a container to be allowed
> to use only a subset of the MAC/VLAN pairs programmed to a device,

Truly. I agree. That was one of the prime reasons I originally had it
as part of the device cgroup, where RDMA was just one category.
But Tejun's opinion was that rdma should have its own cgroup.
The current internal data structures and the interface between the rdma
cgroup and uverbs are tied to the ib_device structure,
which I think is easy to overcome by abstracting out a new
resource_device that can be used beyond RDMA as well.

However, my bigger concern is the interface to user land.
We already have two use cases, and I am inclined to make it a
"device resource cgroup" instead of an "rdma cgroup".
I seek Tejun's input here.
An initial implementation can expose rdma resources under a device
resource cgroup; as it evolves we can add other net resources such as
mac and vlan, as you described.

 or
> only a subset of P_Keys and GIDs it has. Do you see such limitations
> also as part of this cgroup?
>
At present, no. GID and P_Key resources are created from the bottom up,
either by the stack or by the network. They are kind of not tied to
user processes, unlike mac, vlan, and qp, which are more application
driven or administratively driven.

For applications that don't use RDMA-CM, query_device and query_port
will filter out the GID entries based on the network namespace in
which the caller process is running.
It was on my TODO list while we were working on the RoCEv2 and GID
movement changes, but I never got the chance to chase that fix.

One of the ideas I was considering is to create a virtual RDMA device
mapped to the physical device,
and configure a GID count limit via configfs for each such device.

> Thanks,
> Haggai


* Re: RFC rdma cgroup
  2015-10-29 18:46   ` Parav Pandit
@ 2015-11-02 13:43     ` Haggai Eran
  2015-11-03 19:11       ` Parav Pandit
  0 siblings, 1 reply; 8+ messages in thread
From: Haggai Eran @ 2015-11-02 13:43 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Tejun Heo, Doug Ledford, Hefty, Sean, linux-rdma, cgroups,
	Liran Liss, linux-kernel, lizefan, Johannes Weiner,
	Jonathan Corbet, james.l.morris, serge, Or Gerlitz, Matan Barak,
	raindel, akpm, linux-security-module, Jason Gunthorpe

On 29/10/2015 20:46, Parav Pandit wrote:
> On Thu, Oct 29, 2015 at 8:27 PM, Haggai Eran <haggaie@mellanox.com> wrote:
>> On 28/10/2015 10:29, Parav Pandit wrote:
>>> 3. Resources are not defined by the RDMA cgroup. Resources are defined
>>> by the RDMA/IB subsystem and optionally by HCA vendor device drivers.
>>> Rationale: This allows the rdma cgroup to remain constant while the
>>> RDMA/IB subsystem evolves, without the need for rdma cgroup updates. A
>>> new resource can easily be added by the RDMA/IB subsystem without
>>> touching the rdma cgroup.
>> Resources exposed by the cgroup are basically a UAPI, so we have to be
>> careful to keep it stable as it evolves. I understand the need for
>> vendor-specific resources, following the discussion on the previous
>> proposal, but could you write about how you plan to allow this set of
>> resources to evolve?
> 
> It's fairly simple.
> Here is a code snippet showing how resources are defined in my tree.
> It doesn't have the RSS work queues yet, but they can be added right
> after this patch.
> 
> Resources are defined as an index and as a match_table_t.
> 
> enum rdma_resource_type {
>         RDMA_VERB_RESOURCE_UCTX,
>         RDMA_VERB_RESOURCE_AH,
>         RDMA_VERB_RESOURCE_PD,
>         RDMA_VERB_RESOURCE_CQ,
>         RDMA_VERB_RESOURCE_MR,
>         RDMA_VERB_RESOURCE_MW,
>         RDMA_VERB_RESOURCE_SRQ,
>         RDMA_VERB_RESOURCE_QP,
>         RDMA_VERB_RESOURCE_FLOW,
>         RDMA_VERB_RESOURCE_MAX,
> };
> So UAPI RDMA resources can evolve by just adding more entries here.
Are the names that appear in userspace also controlled by uverbs? What
about the vendor-specific resources?

>>> 8. Typically each RDMA cgroup will have 0 to 4 RDMA devices. Therefore
>>> each cgroup will have 0 to 4 verbs resource pools and optionally 0 to 4
>>> hw resource pools, one per such device.
>>> (Nothing prevents having more devices and pools, but the design is
>>> centered on this use case.)
>> In what way does the design depend on this assumption?
> 
> When the current code performs resource charging/uncharging, it needs
> to identify which resource pool to charge.
> The resource pools are maintained in a list_head, so it is a linear
> search per device.
> If we are thinking of 100s of RDMA devices per container, then linear
> search will not be a good approach and a different data structure will
> need to be deployed.
Okay, sounds fine to me.

>>> (c) When a process migrates from one cgroup to another, the resource
>>> continues to be owned by the creator cgroup (rather, its css).
>>> After process migration, whenever a new resource is created in the new
>>> cgroup, it is owned by the new cgroup.
>> It sounds a little different from how other cgroups behave. I agree that
>> mostly processes will create the resources in their cgroup and won't
>> migrate, but why not move the charge during migration?
>>
> With fork(), a process doesn't really own the resource (unlike other
> file and socket descriptors).
> The parent process might also have died.
> There is possibly no clear way to transfer the resource to the right
> child, and the child that the cgroup picks might not even want to own
> RDMA resources.
> RDMA resources might be allocated by one process and freed by another
> (though this might not be the way they are used).
> It is pretty similar to other cgroups, with an exception in the
> migration area; that exception comes from the different way RDMA
> resources are owned, created and used.
> Tejun's recent unified hierarchy patch equally highlights that
> processes should not be migrated frequently among cgroups.
> 
> So in the current implementation (like others):
> if a process created an RDMA resource and forked a child,
> child and parent can both allocate and free more resources.
> The child may move to a different cgroup, but the resource is shared
> among them, and the child can also free the resource. All crazy
> combinations are possible in theory (without many use cases).
> So at best, resources are charged to the first cgroup css in which
> parent/child were created, and a reference is held to that css.
> The cgroup and process can die, but the css remains until the RDMA
> resources are freed.
> This is similar to process behavior, where the task struct is released
> but the pid is held for a while.

I guess there aren't a lot of options when the resources can belong to
multiple cgroups. So after migrating, will new resources belong to the
new cgroup or the old one?

>> I finally wanted to ask about other limitations an RDMA cgroup could
>> handle. It would be great to be able to limit a container to be allowed
>> to use only a subset of the MAC/VLAN pairs programmed to a device,
> 
> Truly. I agree. That was one of the prime reasons I originally had it
> as part of the device cgroup, where RDMA was just one category.
> But Tejun's opinion was that rdma should have its own cgroup.
> The current internal data structures and the interface between the rdma
> cgroup and uverbs are tied to the ib_device structure,
> which I think is easy to overcome by abstracting out a new
> resource_device that can be used beyond RDMA as well.
> 
> However, my bigger concern is the interface to user land.
> We already have two use cases, and I am inclined to make it a
> "device resource cgroup" instead of an "rdma cgroup".
> I seek Tejun's input here.
> An initial implementation can expose rdma resources under a device
> resource cgroup; as it evolves we can add other net resources such as
> mac and vlan, as you described.

When I was talking about limiting to MAC/VLAN pairs I only meant
limiting an RDMA device's ability to use that pair (e.g. use a GID that
uses the specific MAC VLAN pair). I don't understand how that makes the
RDMA cgroup any more generic than it is.

>  or
>> only a subset of P_Keys and GIDs it has. Do you see such limitations
>> also as part of this cgroup?
>>
> At present, no. GID and P_Key resources are created from the bottom up,
> either by the stack or by the network. They are kind of not tied to
> user processes, unlike mac, vlan, and qp, which are more application
> driven or administratively driven.
They are created from the network, after the network administrator
configured them this way.

> For applications that don't use RDMA-CM, query_device and query_port
> will filter out the GID entries based on the network namespace in
> which the caller process is running.
This could work well for RoCE, as each entry in the GID table is
associated with a net device and a network namespace. However, in
InfiniBand, the GID table isn't directly related to the network
namespace. As for the P_Keys, you could deduce the set of P_Keys of a
namespace by the set of IPoIB netdevs in the network namespace, but
InfiniBand is designed to also work without IPoIB, so I don't think it's
a good idea.

I think it would be better to allow each cgroup to limit the pkeys and
gids its processes can use.

> It was on my TODO list while we were working on the RoCEv2 and GID
> movement changes, but I never got the chance to chase that fix.
> 
> One of the ideas I was considering is to create a virtual RDMA device
> mapped to the physical device,
> and configure a GID count limit via configfs for each such device.
You could probably achieve what you want by creating a virtual RDMA
device and using the device cgroup to limit access to it, but it sounds
to me like overkill.

Regards,
Haggai


* Re: RFC rdma cgroup
  2015-11-02 13:43     ` Haggai Eran
@ 2015-11-03 19:11       ` Parav Pandit
  2015-11-04 11:58         ` Haggai Eran
  0 siblings, 1 reply; 8+ messages in thread
From: Parav Pandit @ 2015-11-03 19:11 UTC (permalink / raw)
  To: Haggai Eran
  Cc: Tejun Heo, Doug Ledford, Hefty, Sean, linux-rdma, cgroups,
	Liran Liss, linux-kernel, lizefan, Johannes Weiner,
	Jonathan Corbet, james.l.morris, serge, Or Gerlitz, Matan Barak,
	raindel, akpm, linux-security-module, Jason Gunthorpe

>> Resources are defined as an index and as a match_table_t.
>>
>> enum rdma_resource_type {
>>         RDMA_VERB_RESOURCE_UCTX,
>>         RDMA_VERB_RESOURCE_AH,
>>         RDMA_VERB_RESOURCE_PD,
>>         RDMA_VERB_RESOURCE_CQ,
>>         RDMA_VERB_RESOURCE_MR,
>>         RDMA_VERB_RESOURCE_MW,
>>         RDMA_VERB_RESOURCE_SRQ,
>>         RDMA_VERB_RESOURCE_QP,
>>         RDMA_VERB_RESOURCE_FLOW,
>>         RDMA_VERB_RESOURCE_MAX,
>> };
>> So UAPI RDMA resources can evolve by just adding more entries here.
> Are the names that appear in userspace also controlled by uverbs? What
> about the vendor-specific resources?

I am not sure I followed your question.
Basically, any RDMA resource that is allocated through the uverbs API
can be tracked; uverbs makes the call to charge/uncharge.
There is a list file, rdma.resources.verbs.list, which lists the verbs
resource names of all the devices that have registered themselves with
the rdma cgroup.
Similarly, there is rdma.resources.hw.list, which lists all hw-specific
resource names; these are defined at run time and are potentially
different for each vendor.

So it looks like below:
#cat rdma.resources.verbs.list
Output:
mlx4_0 uctx ah pd cq mr mw srq qp flow
mlx4_1 uctx ah pd cq mr mw srq qp flow rss_wq

#cat rdma.resources.hw.list
hfi1 hw_qp hw_mr sw_pd
(This particular one is a hypothetical example; I haven't actually
coded this, unlike uverbs, which is real.)

>>>> (c) When a process migrates from one cgroup to another, the resource
>>>> continues to be owned by the creator cgroup (rather, its css).
>>>> After process migration, whenever a new resource is created in the new
>>>> cgroup, it is owned by the new cgroup.
>>> It sounds a little different from how other cgroups behave. I agree that
>>> mostly processes will create the resources in their cgroup and won't
>>> migrate, but why not move the charge during migration?
>>>
>> With fork(), a process doesn't really own the resource (unlike other
>> file and socket descriptors).
>> The parent process might also have died.
>> There is possibly no clear way to transfer the resource to the right
>> child, and the child that the cgroup picks might not even want to own
>> RDMA resources.
>> RDMA resources might be allocated by one process and freed by another
>> (though this might not be the way they are used).
>> It is pretty similar to other cgroups, with an exception in the
>> migration area; that exception comes from the different way RDMA
>> resources are owned, created and used.
>> Tejun's recent unified hierarchy patch equally highlights that
>> processes should not be migrated frequently among cgroups.
>>
>> So in the current implementation (like others):
>> if a process created an RDMA resource and forked a child,
>> child and parent can both allocate and free more resources.
>> The child may move to a different cgroup, but the resource is shared
>> among them, and the child can also free the resource. All crazy
>> combinations are possible in theory (without many use cases).
>> So at best, resources are charged to the first cgroup css in which
>> parent/child were created, and a reference is held to that css.
>> The cgroup and process can die, but the css remains until the RDMA
>> resources are freed.
>> This is similar to process behavior, where the task struct is released
>> but the pid is held for a while.
>
> I guess there aren't a lot of options when the resources can belong to
> multiple cgroups. So after migrating, will new resources belong to the
> new cgroup or the old one?
A resource always belongs to the cgroup in which it was created,
regardless of process migration.
Again, it is owned at the css level instead of the cgroup level.
Therefore the original cgroup can be deleted, but an internal reference
to its data structure is held, and it is freed when the last rdma
resource is freed.

>
> When I was talking about limiting to MAC/VLAN pairs I only meant
> limiting an RDMA device's ability to use that pair (e.g. use a GID that
> uses the specific MAC VLAN pair). I don't understand how that makes the
> RDMA cgroup any more generic than it is.
>
Oh, ok. That doesn't. I meant that I wanted to limit how many vlans a
given container can create.
We have just the high-level capabilities(7) to enable such creation,
but not the count.

>>  or
>>> only a subset of P_Keys and GIDs it has. Do you see such limitations
>>> also as part of this cgroup?
>>>
>> At present, no. GID and P_Key resources are created from the bottom up,
>> either by the stack or by the network. They are kind of not tied to
>> user processes, unlike mac, vlan, and qp, which are more application
>> driven or administratively driven.
> They are created from the network, after the network administrator
> configured them this way.
>
>> For applications that don't use RDMA-CM, query_device and query_port
>> will filter out the GID entries based on the network namespace in
>> which the caller process is running.
> This could work well for RoCE, as each entry in the GID table is
> associated with a net device and a network namespace. However, in
> InfiniBand, the GID table isn't directly related to the network
> namespace. As for the P_Keys, you could deduce the set of P_Keys of a
> namespace by the set of IPoIB netdevs in the network namespace, but
> InfiniBand is designed to also work without IPoIB, so I don't think it's
> a good idea.
Got it. Yeah, this code can be under an if (device_type == RoCE) check.

>
> I think it would be better to allow each cgroup to limit the pkeys and
> gids its processes can use.

O.k. So the use case is P_Keys? I believe the requirement would be
similar to the device cgroup:
a set of GID table entries is configured as white list entries,
and when they are queried or used during create_ah or modify_qp, they
are compared against the white list (in other words, an ACL).
If they are found in the ACL, they are reported in query_device or in
create_ah and modify_qp; if not, those calls fail with an appropriate
status.
Does this look ok? Can we address this requirement as an additional
feature just after the first patch?
Tejun had some other idea on this kind of requirement, and I need to
discuss it with him.
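
For illustration, such a white list check could look roughly as below;
rdmacg_gid_entry, the gid_acl list, and the function name are all
hypothetical.

/* Hypothetical ACL check: create_ah/modify_qp would fail when this
 * returns false, and the query paths would filter on it. */
struct rdmacg_gid_entry {
        struct list_head list;
        struct ib_device *dev;
        u8 port;
        union ib_gid gid;
};

static bool rdmacg_gid_allowed(struct rdma_cgroup *cg,
                               struct ib_device *dev, u8 port,
                               const union ib_gid *gid)
{
        struct rdmacg_gid_entry *entry;

        /* cg->gid_acl holds the administrator's white list */
        list_for_each_entry(entry, &cg->gid_acl, list)
                if (entry->dev == dev && entry->port == port &&
                    !memcmp(&entry->gid, gid, sizeof(*gid)))
                        return true;
        return false;
}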

>
>> It was on my TODO list while we were working on the RoCEv2 and GID
>> movement changes, but I never got the chance to chase that fix.
>>
>> One of the ideas I was considering is to create a virtual RDMA device
>> mapped to the physical device,
>> and configure a GID count limit via configfs for each such device.
> You could probably achieve what you want by creating a virtual RDMA
> device and using the device cgroup to limit access to it, but it sounds
> to me like overkill.

Actually, not much. Basically this virtual RDMA device points to the
struct device of the physical device itself.
So the only overhead is linking this structure to the native device
structure and passing most of the calls to the native ib_device through
a thin filter layer in the control path.
post_send/recv/poll_cq will go directly to the native device with the
same performance.


>
> Regards,
> Haggai


* Re: RFC rdma cgroup
  2015-11-03 19:11       ` Parav Pandit
@ 2015-11-04 11:58         ` Haggai Eran
  2015-11-04 17:23           ` Parav Pandit
  0 siblings, 1 reply; 8+ messages in thread
From: Haggai Eran @ 2015-11-04 11:58 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Tejun Heo, Doug Ledford, Hefty, Sean, linux-rdma, cgroups,
	Liran Liss, linux-kernel, lizefan, Johannes Weiner,
	Jonathan Corbet, james.l.morris, serge, Or Gerlitz, Matan Barak,
	raindel, akpm, linux-security-module, Jason Gunthorpe

On 03/11/2015 21:11, Parav Pandit wrote:
> So it looks like below:
> #cat rdma.resources.verbs.list
> Output:
> mlx4_0 uctx ah pd cq mr mw srq qp flow
> mlx4_1 uctx ah pd cq mr mw srq qp flow rss_wq
What happens if you set a limit of rss_wq to mlx4_0 in this example?
Would it fail? I think it would be simpler for administrators if they
can configure every resource supported by uverbs. If a resource is not
supported by a specific device, you can never go over the limit anyway.

> #cat rdma.resources.hw.list
> hfi1 hw_qp hw_mr sw_pd
> (This particular one is a hypothetical example; I haven't actually
> coded this, unlike uverbs, which is real.)
Sounds fine to me. We will need to be careful to make sure that driver
maintainers don't break backward compatibility with this interface.

>> I guess there aren't a lot of options when the resources can belong to
>> multiple cgroups. So after migrating, will new resources belong to the
>> new cgroup or the old one?
> A resource always belongs to the cgroup in which it was created,
> regardless of process migration.
> Again, it is owned at the css level instead of the cgroup level.
> Therefore the original cgroup can be deleted, but an internal reference
> to its data structure is held, and it is freed when the last rdma
> resource is freed.
Okay.

>>> For applications that don't use RDMA-CM, query_device and query_port
>>> will filter out the GID entries based on the network namespace in
>>> which the caller process is running.
>> This could work well for RoCE, as each entry in the GID table is
>> associated with a net device and a network namespace. However, in
>> InfiniBand, the GID table isn't directly related to the network
>> namespace. As for the P_Keys, you could deduce the set of P_Keys of a
>> namespace by the set of IPoIB netdevs in the network namespace, but
>> InfiniBand is designed to also work without IPoIB, so I don't think it's
>> a good idea.
> Got it. Yeah, this code can be under an if (device_type == RoCE) check.
IIRC there's a core capability for the new GID table code that contains
namespace, so you can use that.

>> I think it would be better to allow each cgroup to limit the pkeys and
>> gids its processes can use.
> 
> O.k. So the use case is P_Keys? I believe the requirement would be
> similar to the device cgroup:
> a set of GID table entries is configured as white list entries,
> and when they are queried or used during create_ah or modify_qp, they
> are compared against the white list (in other words, an ACL).
> If they are found in the ACL, they are reported in query_device or in
> create_ah and modify_qp; if not, those calls fail with an appropriate
> status.
> Does this look ok?
Yes, that sounds good to me.

> Can we address this requirement as an additional feature just after
> the first patch?
> Tejun had some other idea on this kind of requirement, and I need to
> discuss it with him.
Of course. I think there's use for the RDMA cgroup even without a pkey
or GID ACL, just to make sure one application doesn't hog hardware
resources.

>>> One of the ideas I was considering is to create a virtual RDMA device
>>> mapped to the physical device,
>>> and configure a GID count limit via configfs for each such device.
>> You could probably achieve what you want by creating a virtual RDMA
>> device and using the device cgroup to limit access to it, but it sounds
>> to me like overkill.
> 
> Actually, not much. Basically this virtual RDMA device points to the
> struct device of the physical device itself.
> So the only overhead is linking this structure to the native device
> structure and passing most of the calls to the native ib_device through
> a thin filter layer in the control path.
> post_send/recv/poll_cq will go directly to the native device with the
> same performance.
Still, I think we already have code that wraps ib_device calls for
userspace, which is the ib_uverbs module. There's no need for an extra
layer.

Regards,
Haggai


* Re: RFC rdma cgroup
  2015-11-04 11:58         ` Haggai Eran
@ 2015-11-04 17:23           ` Parav Pandit
  0 siblings, 0 replies; 8+ messages in thread
From: Parav Pandit @ 2015-11-04 17:23 UTC (permalink / raw)
  To: Haggai Eran
  Cc: Tejun Heo, Doug Ledford, Hefty, Sean, linux-rdma, cgroups,
	Liran Liss, linux-kernel, lizefan, Johannes Weiner,
	Jonathan Corbet, james.l.morris, serge, Or Gerlitz, Matan Barak,
	raindel, akpm, linux-security-module, Jason Gunthorpe

On Wed, Nov 4, 2015 at 5:28 PM, Haggai Eran <haggaie@mellanox.com> wrote:
> On 03/11/2015 21:11, Parav Pandit wrote:
>> So it looks like below:
>> #cat rdma.resources.verbs.list
>> Output:
>> mlx4_0 uctx ah pd cq mr mw srq qp flow
>> mlx4_1 uctx ah pd cq mr mw srq qp flow rss_wq
> What happens if you set a limit of rss_wq to mlx4_0 in this example?
> Would it fail?
Yes. In the above example, the mlx4_0 device didn't have support for
rss_wq, so it didn't advertise rss_wq in its list file.

> I think it would be simpler for administrators if they
> can configure every resource supported by uverbs. If a resource is not
> supported by a specific device, you can never go over the limit anyway.
>
Exactly. That's the implementation today.


* Re: RFC rdma cgroup
  2015-10-28  8:29 RFC rdma cgroup Parav Pandit
  2015-10-29 14:57 ` Haggai Eran
@ 2015-11-24 15:47 ` Tejun Heo
  1 sibling, 0 replies; 8+ messages in thread
From: Tejun Heo @ 2015-11-24 15:47 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Doug Ledford, Hefty, Sean, linux-rdma, cgroups, Liran Liss,
	linux-kernel, lizefan, Johannes Weiner, Jonathan Corbet,
	james.l.morris, serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module, Jason Gunthorpe, Haggai Eran

Hello, chiming in late.

On Wed, Oct 28, 2015 at 01:59:15PM +0530, Parav Pandit wrote:
> Design guidelines:
> -----------------------
> 1. There will be a new rdma cgroup for accounting rdma resources
> (instead of extending the device cgroup).
> Rationale: RDMA tracks different types of resources, and it functions
> differently from the device cgroup. Though the device cgroup could have
> been extended in a more generic way, the community feels it is better
> to create an RDMA cgroup, which might have more features than just
> resource limit enforcement in the future.

Yeap, it should definitely be separate from device cgroup.

> 2. The RDMA cgroup will allow resource accounting and limit enforcement
> on a per-cgroup, per-rdma-device basis (instead of resource limiting
> across all devices).
> Rationale: this gives granular control when multiple devices exist in
> the system.
> 
> 3. Resources are not defined by the RDMA cgroup. Resources are defined
> by the RDMA/IB subsystem and optionally by HCA vendor device drivers.
> Rationale: This allows the rdma cgroup to remain constant while the
> RDMA/IB subsystem evolves, without the need for rdma cgroup updates. A
> new resource can easily be added by the RDMA/IB subsystem without
> touching the rdma cgroup.

I'm *extremely* uncomfortable with this.  Drivers for this sort of
higher end devices tend to pull a lot of stunts for better or worse
and my gut feeling is that letting low level drivers run free with
resource definition is highly likely to lead to an unmanageable mess
in the long run.  I'd strongly urge to gather consensus on what the
resources should be across the board.

> Design:
> ---------
> 8. Typically each RDMA cgroup will have 0 to 4 RDMA devices. Therefore
> each cgroup will have 0 to 4 verbs resource pools and optionally 0 to 4
> hw resource pools, one per such device.
> (Nothing prevents having more devices and pools, but the design is
> centered on this use case.)

Heh, 4 seems like an arbitrary number.  idk, it feels weird to bake a
number like 4 into the design.

> 9. A resource pool object is created in the following situations.
> (a) An administrative operation sets a limit and no resource pool
> exists yet for the device of interest in the cgroup.
> (b) No resource limits were configured, but the IB/RDMA subsystem tries
> to charge a resource. This way, when applications run without limits
> and limits are enforced later, uncharging still works correctly;
> otherwise the usage count would drop below zero.
> This is done using a default resource pool.
> Instead of implementing any sort of time markers, the default pool
> simplifies the design.

So, the usual way to deal with this is that the root cgroup is exempt
from accounting and each resource tracks where it was charged to and
frees that cgroup on release.  IOW, associate on charge and
maintain the association till release.
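
As a sketch, that pattern looks roughly like the below; all names here
are illustrative only.

/* Associate on charge, uncharge at release: the resource remembers
 * its cgroup, so task migration never moves the charge. */
struct tracked_resource {
        struct rdma_cgroup *cg;         /* set once, at charge time */
        /* ... */
};

static void res_charge(struct tracked_resource *res, struct rdma_cgroup *cg)
{
        res->cg = cg;                   /* remember who was charged */
        css_get(&cg->css);              /* pin the css until release */
}

static void res_release(struct tracked_resource *res)
{
        uncharge_cgroup(res->cg);       /* uncharge where it was charged */
        css_put(&res->cg->css);
}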

For interface details, please refer to the following documentation.

 https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git/tree/Documentation/cgroup.txt?h=for-4.5

Thanks.

-- 
tejun
