* [RFC bpf-next] Hierarchical Cgroup Stats Collection Using BPF
@ 2022-03-09 20:27 ` Yosry Ahmed
  0 siblings, 0 replies; 18+ messages in thread
From: Yosry Ahmed @ 2022-03-09 20:27 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Johannes Weiner
  Cc: Hao Luo, Shakeel Butt, Stanislav Fomichev, David Rientjes, bpf,
	KP Singh, cgroups, linux-mm

Hey everyone,

I would like to discuss an idea to facilitate collection of
hierarchical cgroup stats using BPF programs. We want to provide a
simple interface for BPF programs to collect hierarchical cgroup stats
and integrate with the existing rstat aggregation mechanism in the
kernel. The most prominent use case is the ability to extend memcg
stats (and histograms) by BPF programs.

This also integrates nicely with Hao's work [1], which enables reading
those stats through files, similar to cgroupfs. This proposal is
concerned more with the stats collection path.

The main idea is to introduce a new map type (let's call it a BPF
cgroup stats map for now). This map will be keyed by cgroup_id (similar
to cgroup storage). The value is an array (or struct, more on this
later) whose size and element type are chosen by the user, and which
holds the stats. The main properties of the map are as follows (a rough
usage sketch follows the list):
1. Creation and deletion of map entries are handled automatically by the kernel.
2. Internally, the map entries contain per-cpu arrays, a total array,
and a pending array.
3. BPF programs & user space see the entry as a single array; updates
are transparently made to the per-cpu arrays, and lookups invoke stats
flushing.
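
To make this more concrete, here is a rough sketch of how a BPF program
could use such a map. The map type and the update helper below are
hypothetical (nothing like them exists today); only
bpf_get_current_cgroup_id() is an existing helper, and the attach point
is just for illustration:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* user-chosen value type: the per-cgroup stats this program maintains */
struct mem_stats {
	__u64 nr_allocs;
	__u64 nr_bytes;
};

struct {
	__uint(type, BPF_MAP_TYPE_CGROUP_STATS);	/* hypothetical map type */
	__type(key, __u64);				/* cgroup_id */
	__type(value, struct mem_stats);
} stats_map SEC(".maps");

/* hypothetical update helper, declared here only for the sketch: returns
 * this CPU's slot for the cgroup and marks the cgroup as updated for rstat
 */
extern struct mem_stats *
bpf_cgroup_stats_update_elem(void *map, __u64 cgroup_id) __ksym;

SEC("fentry/kmem_cache_alloc")	/* illustrative attach point */
int BPF_PROG(count_alloc)
{
	__u64 cg_id = bpf_get_current_cgroup_id();
	struct mem_stats *stats;

	stats = bpf_cgroup_stats_update_elem(&stats_map, cg_id);
	if (stats) {
		stats->nr_allocs++;
		stats->nr_bytes += 1;	/* or the allocation size, if available */
	}
	return 0;
}

char LICENSE[] SEC("license") = "GPL";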

The main difference between this and cgroup storage is that it
naturally integrates with rstat hierarchical aggregation (more on that
later). The reasons why we do not want to do the aggregation in BPF
programs or in user space are:
1. Each program would loop through the cgroup's descendants to do its
own stats aggregation, which is a lot of repeated work.
2. We would loop through all the descendants, even those that have no
updates.

These problems are already addressed by the rstat aggregation
mechanism in the kernel, which is primarily used for memcg stats. We
want to provide a way for BPF programs to make use of this as well.

The lifetime of map entries can be handled as follows:
- When the map is created, it takes an initial cgroup_id as a
parameter, maybe through the map_extra field of struct bpf_attr. The
map is created, and entries for the initial cgroup and all its
descendants are created (a user-space sketch follows this list).
- The update and delete interfaces are disabled. The kernel creates
entries for new cgroups and removes entries for destroyed cgroups (we
can use cgroup_bpf_inherit() and cgroup_bpf_release()).
- When all the entries in the map are deleted (initial cgroup
destroyed), the map is destroyed.
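
For illustration, creating such a map from user space could look
roughly like the following. bpf_map_create() and the map_extra field
are existing libbpf/UAPI pieces; BPF_MAP_TYPE_CGROUP_STATS is
hypothetical:

#include <bpf/bpf.h>
#include <linux/types.h>

/* Create a stats map covering root_cgroup_id and all of its descendants.
 * value_size is the size of the user-chosen stats array/struct.
 */
static int create_cgroup_stats_map(__u64 root_cgroup_id, __u32 value_size)
{
	LIBBPF_OPTS(bpf_map_create_opts, opts,
		    /* proposed use of map_extra: the initial cgroup */
		    .map_extra = root_cgroup_id);

	return bpf_map_create(BPF_MAP_TYPE_CGROUP_STATS,	/* hypothetical */
			      "cgroup_stats",
			      sizeof(__u64),	/* key: cgroup_id */
			      value_size,
			      0,		/* entries are managed by the kernel */
			      &opts);
}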

The map usage by BPF programs and integration with rstat can be as follows:
- Internally, each map entry has per-cpu arrays, a total array, and a
pending array. BPF programs and user space only see one array.
- The update interface is disabled. BPF programs use helpers to modify
elements. Internally, the modifications are made to the per-cpu arrays
and trigger a call to cgroup_bpf_updated() or an equivalent (see the
kernel-side sketch after this list).
- Lookups (from BPF programs or user space) invoke an rstat flush and
read from the total array.
- In cgroup_rstat_flush_locked() flush BPF stats as well.
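
A very rough kernel-side sketch of what the update path could look like
internally (the struct and function below are made up for illustration;
cgroup_rstat_updated() is the existing rstat hook):

/* hypothetical layout of one (map, cgroup) entry */
struct bpf_cgroup_stats_elem {
	u64 __percpu	*percpu;	/* written by BPF programs */
	u64		*total;		/* read by lookups, after a flush */
	u64		*pending;	/* folded into the parent on flush */
};

/* hypothetical backend of the update helper */
static void bpf_cgroup_stats_add(struct cgroup *cgrp,
				 struct bpf_cgroup_stats_elem *elem,
				 u32 idx, u64 delta)
{
	u64 *counters = this_cpu_ptr(elem->percpu);

	/* cheap per-cpu update, no cross-CPU contention */
	counters[idx] += delta;

	/* put the cgroup on this CPU's rstat updated tree */
	cgroup_rstat_updated(cgrp, smp_processor_id());
}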

Flushing of BPF stats can be as follows:
- For every cgroup, we will either use flags to distinguish BPF stats
updates from normal stats updates, or flush both anyway (memcg stats
are periodically flushed anyway).
- We will need to link cgroups to the maps that have entries for them.
One possible implementation is to store the map entries in struct
cgroup_bpf, in a hash table indexed by map fd. The update helpers will
also use this to avoid lookups.
- For each updated cgroup, we go through all of its maps, accumulate
the per-cpu arrays into the total array, then propagate the total to
the parent's pending array (the same mechanism as memcg stats flushing;
a sketch of this step follows the list).
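
The per-cgroup flush step could then look roughly like this (again, a
made-up sketch, modeled on how memcg propagates deltas up the tree):

/* hypothetical: fold one CPU's updates for one cgroup and pass them up */
static void bpf_cgroup_stats_flush_one(struct bpf_cgroup_stats_elem *elem,
				       struct bpf_cgroup_stats_elem *parent,
				       int cpu, u32 nr_counters)
{
	u64 *cpu_counters = per_cpu_ptr(elem->percpu, cpu);
	u32 i;

	for (i = 0; i < nr_counters; i++) {
		u64 delta = cpu_counters[i];

		if (!delta)
			continue;
		cpu_counters[i] = 0;

		/* accumulate into our own total ... */
		elem->total[i] += delta;

		/* ... and queue the delta for the parent; the parent folds
		 * its pending array into its total when it is flushed
		 */
		if (parent)
			parent->pending[i] += delta;
	}
}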

There is room for extensions or generalizations here:
- Provide flags to enable/disable using per-cpu arrays (for stats that
are not updated frequently), and to enable/disable hierarchical
aggregation (for non-hierarchical stats, they can still benefit from
the automatic entry creation & deletion).
- Provide different hierarchical aggregation operations: SUM, MAX, MIN, etc.
- Instead of an array as the map value, use a struct, and let the user
provide an aggregator function in the form of a BPF program (sketched
below).
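
For the last option, a user-supplied aggregator could look something
like the sketch below. The program type / section name is hypothetical;
the point is only that the aggregation step becomes user-defined:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define NR_BUCKETS 8

/* user-chosen map value: e.g. a latency histogram per cgroup */
struct lat_hist {
	__u64 buckets[NR_BUCKETS];
};

/* hypothetical program type: rstat would invoke it to fold a child's
 * value into its parent's during a flush
 */
SEC("cgroup_stats/aggregate")
int aggregate(struct lat_hist *parent, const struct lat_hist *child)
{
	int i;

	/* element-wise SUM here, but MAX/MIN/etc. could be expressed the
	 * same way
	 */
	for (i = 0; i < NR_BUCKETS; i++)
		parent->buckets[i] += child->buckets[i];
	return 0;
}

char LICENSE[] SEC("license") = "GPL";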

I am happy to hear your thoughts about the idea in general and any
comments or concerns.

[1] https://lwn.net/Articles/886292/

* Re: [RFC bpf-next] Hierarchical Cgroup Stats Collection Using BPF
  2022-03-09 20:27 ` Yosry Ahmed
  (?)
@ 2022-03-14  5:35 ` Tejun Heo
  2022-03-16 16:35   ` Yosry Ahmed
  -1 siblings, 1 reply; 18+ messages in thread
From: Tejun Heo @ 2022-03-14  5:35 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Johannes Weiner, Hao Luo, Shakeel Butt, Stanislav Fomichev,
	David Rientjes, bpf, KP Singh, cgroups, linux-mm

Hello,

On Wed, Mar 09, 2022 at 12:27:15PM -0800, Yosry Ahmed wrote:
...
> These problems are already addressed by the rstat aggregation
> mechanism in the kernel, which is primarily used for memcg stats. We

Not that it matters all that much but I don't think the above statement is
true given that sched stats are an integrated part of the rstat
implementation and io was converted before memcg.

> - For every cgroup, we will either use flags to distinguish BPF stats
> updates from normal stats updates, or flush both anyway (memcg stats
> are periodically flushed anyway).

I'd just keep them together. Usually most activities tend to happen
together, so it's cheaper to aggregate all of them in one go in most cases.

> - Provide flags to enable/disable using per-cpu arrays (for stats that
> are not updated frequently), and enable/disable hierarchical
> aggregation (for non-hierarchical stats, they can still benefit from
> the automatic entry creation & deletion).
> - Provide different hierarchical aggregation operations: SUM, MAX, MIN, etc.
> - Instead of an array as the map value, use a struct, and let the user
> provide an aggregator function in the form of a BPF program.

I'm more partial to the last option. It does make the usage a bit more
complicated but hopefully it shouldn't be too bad with good examples.

I don't have strong opinions on the bpf side of things but it'd be great to
be able to use rstat from bpf.

Thanks.

-- 
tejun

* Re: [RFC bpf-next] Hierarchical Cgroup Stats Collection Using BPF
  2022-03-09 20:27 ` Yosry Ahmed
  (?)
  (?)
@ 2022-03-16  6:04 ` Song Liu
  2022-03-16 16:11   ` Yosry Ahmed
                     ` (2 more replies)
  -1 siblings, 3 replies; 18+ messages in thread
From: Song Liu @ 2022-03-16  6:04 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Johannes Weiner, Hao Luo, Shakeel Butt, Stanislav Fomichev,
	David Rientjes, bpf, KP Singh, cgroups, Linux-MM

On Wed, Mar 9, 2022 at 12:27 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
[...]
>
> The map usage by BPF programs and integration with rstat can be as follows:
> - Internally, each map entry has per-cpu arrays, a total array, and a
> pending array. BPF programs and user space only see one array.
> - The update interface is disabled. BPF programs use helpers to modify
> elements. Internally, the modifications are made to per-cpu arrays,
> and invoke a call to cgroup_bpf_updated()  or an equivalent.
> - Lookups (from BPF programs or user space) invoke an rstat flush and
> read from the total array.

Lookups invoke an rstat flush, so we still walk every node of a subtree
for each lookup, no? So the actual cost should be similar to walking the
subtree with some BPF program? Did I miss something?

Thanks,
Song

> - In cgroup_rstat_flush_locked() flush BPF stats as well.
>
[...]

* Re: [RFC bpf-next] Hierarchical Cgroup Stats Collection Using BPF
@ 2022-03-16 16:13     ` Yosry Ahmed
  0 siblings, 0 replies; 18+ messages in thread
From: Yosry Ahmed @ 2022-03-16 16:13 UTC (permalink / raw)
  To: Song Liu
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Johannes Weiner, Hao Luo, Shakeel Butt, Stanislav Fomichev,
	David Rientjes, bpf, KP Singh, cgroups, Linux-MM

On Tue, Mar 15, 2022 at 11:05 PM Song Liu <song@kernel.org> wrote:
>
> On Wed, Mar 9, 2022 at 12:27 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> [...]
> >
> > The map usage by BPF programs and integration with rstat can be as follows:
> > - Internally, each map entry has per-cpu arrays, a total array, and a
> > pending array. BPF programs and user space only see one array.
> > - The update interface is disabled. BPF programs use helpers to modify
> > elements. Internally, the modifications are made to per-cpu arrays,
> > and invoke a call to cgroup_bpf_updated()  or an equivalent.
> > - Lookups (from BPF programs or user space) invoke an rstat flush and
> > read from the total array.
>
> Lookups invoke an rstat flush, so we still walk every node of a subtree
> for each lookup, no? So the actual cost should be similar to walking the
> subtree with some BPF program? Did I miss something?
>

Hi Song,

Thanks for taking the time to read my proposal.

The rstat framework maintains a tree that contains only updated
cgroups. An rstat flush only traverses this tree, not the cgroup
subtree/hierarchy.

This also ensures that consecutive readers do not have to do any
traversals unless new updates happen, because the first reader will
have already flushed the stats.
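
For reference, this is roughly the pattern a controller follows with
rstat today (a simplified sketch, not taken from any particular
controller):

/* update path: called frequently, only marks <cgroup, cpu> as updated */
static void my_controller_account(struct cgroup *cgrp, u64 delta)
{
	/* ... record delta in a per-cpu counter owned by the controller ... */
	cgroup_rstat_updated(cgrp, smp_processor_id());
}

/* read path: walks only the cgroups marked updated since the last flush,
 * calling back into each controller to fold its per-cpu data into totals
 */
static void my_controller_read(struct cgroup *cgrp)
{
	cgroup_rstat_flush(cgrp);
	/* ... totals for cgrp and its descendants are now up to date ... */
}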

>
> Thanks,
> Song
>
> > - In cgroup_rstat_flush_locked() flush BPF stats as well.
> >
> [...]

* Re: [RFC bpf-next] Hierarchical Cgroup Stats Collection Using BPF
@ 2022-03-16 16:31     ` Tejun Heo
  0 siblings, 0 replies; 18+ messages in thread
From: Tejun Heo @ 2022-03-16 16:31 UTC (permalink / raw)
  To: Song Liu
  Cc: Yosry Ahmed, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Johannes Weiner, Hao Luo, Shakeel Butt,
	Stanislav Fomichev, David Rientjes, bpf, KP Singh, cgroups,
	Linux-MM

On Tue, Mar 15, 2022 at 11:04:56PM -0700, Song Liu wrote:
> Lookups invoke an rstat flush, so we still walk every node of a subtree
> for each lookup, no? So the actual cost should be similar to walking the
> subtree with some BPF program? Did I miss something?

rstat only walks cgroups / cpus which have been active since the last flush.

Thanks.

-- 
tejun

* Re: [RFC bpf-next] Hierarchical Cgroup Stats Collection Using BPF
  2022-03-14  5:35 ` Tejun Heo
@ 2022-03-16 16:35   ` Yosry Ahmed
  2022-03-22 18:09     ` Yonghong Song
  0 siblings, 1 reply; 18+ messages in thread
From: Yosry Ahmed @ 2022-03-16 16:35 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Johannes Weiner, Hao Luo, Shakeel Butt, Stanislav Fomichev,
	David Rientjes, bpf, KP Singh, cgroups, Linux-MM

Hi Tejun,

Thanks for taking the time to read my proposal! Sorry for the late
reply. This email skipped my inbox for some reason.

On Sun, Mar 13, 2022 at 10:35 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Wed, Mar 09, 2022 at 12:27:15PM -0800, Yosry Ahmed wrote:
> ...
> > These problems are already addressed by the rstat aggregation
> > mechanism in the kernel, which is primarily used for memcg stats. We
>
> Not that it matters all that much but I don't think the above statement is
> true given that sched stats are an integrated part of the rstat
> implementation and io was converted before memcg.
>

Excuse my ignorance, I am new to kernel development. I only saw calls
to cgroup_rstat_updated() in memcg and io and assumed they were the
only users. Now I found cpu_account_cputime() :)

> > - For every cgroup, we will either use flags to distinguish BPF stats
> > updates from normal stats updates, or flush both anyway (memcg stats
> > are periodically flushed anyway).
>
> I'd just keep them together. Usually most activities tend to happen
> together, so it's cheaper to aggregate all of them in one go in most cases.

This makes sense to me, thanks.

>
> > - Provide flags to enable/disable using per-cpu arrays (for stats that
> > are not updated frequently), and enable/disable hierarchical
> > aggregation (for non-hierarchical stats, they can still benefit from
> > the automatic entry creation & deletion).
> > - Provide different hierarchical aggregation operations: SUM, MAX, MIN, etc.
> > - Instead of an array as the map value, use a struct, and let the user
> > provide an aggregator function in the form of a BPF program.
>
> I'm more partial to the last option. It does make the usage a bit more
> complicated but hopefully it shouldn't be too bad with good examples.
>
> I don't have strong opinions on the bpf side of things but it'd be great to
> be able to use rstat from bpf.

It indeed gives more flexibility but is more complicated. Also, I am
not sure about the overhead of making calls to BPF programs in every
aggregation step. Looking forward to getting feedback on the bpf side
of things.

>
> Thanks.
>
> --
> tejun

* Re: [RFC bpf-next] Hierarchical Cgroup Stats Collection Using BPF
  2022-03-09 20:27 ` Yosry Ahmed
                   ` (2 preceding siblings ...)
  (?)
@ 2022-03-18 19:59 ` Song Liu
  2022-03-28  9:16     ` Yosry Ahmed
  -1 siblings, 1 reply; 18+ messages in thread
From: Song Liu @ 2022-03-18 19:59 UTC (permalink / raw)
  To: Yosry Ahmed, Namhyung Kim
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Johannes Weiner, Hao Luo, Shakeel Butt, Stanislav Fomichev,
	David Rientjes, bpf, KP Singh, cgroups, Linux-MM

On Wed, Mar 9, 2022 at 12:27 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> Hey everyone,
>
> I would like to discuss an idea to facilitate collection of
> hierarchical cgroup stats using BPF programs. We want to provide a
> simple interface for BPF programs to collect hierarchical cgroup stats
> and integrate with the existing rstat aggregation mechanism in the
> kernel. The most prominent use case is the ability to extend memcg
> stats (and histograms) by BPF programs.
>

+ Namhyung,

I forgot to mention this in the office hour. The aggregation efficiency
problem is actually similar to the one addressed by Namhyung's work that
uses BPF programs to aggregate perf counters. Check:
     tools/perf/util/bpf_skel/bperf_cgroup.bpf.c

Namhyung's solution is to walk up the cgroup tree on cgroup switch
events. This may not be as efficient as rstat flush logic, but I think it
is good enough for many use cases (unless the cgroup tree is very
deep). It also demonstrates how we can implement some cgroup
logic in BPF.
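
For illustration, the general shape of that approach is something like
the following. This is heavily simplified and not the actual
bperf_cgroup.bpf.c code (the map layout is made up), but
bpf_get_current_ancestor_cgroup_id() is an existing helper:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

#define MAX_LEVELS 10	/* assumed maximum cgroup nesting depth */

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 10240);
	__type(key, __u64);	/* cgroup_id */
	__type(value, __u64);	/* hierarchical event count for that cgroup */
} cgroup_counts SEC(".maps");

SEC("tp_btf/sched_switch")
int BPF_PROG(on_cgroup_switch)
{
	__u64 one = 1;
	int level;

	/* charge the event to the current cgroup and all of its ancestors */
	for (level = 0; level < MAX_LEVELS; level++) {
		__u64 cg_id = bpf_get_current_ancestor_cgroup_id(level);
		__u64 *cnt;

		if (!cg_id)	/* walked past the current cgroup's level */
			break;

		cnt = bpf_map_lookup_elem(&cgroup_counts, &cg_id);
		if (cnt)
			__sync_fetch_and_add(cnt, 1);
		else
			bpf_map_update_elem(&cgroup_counts, &cg_id, &one, BPF_NOEXIST);
	}
	return 0;
}

char LICENSE[] SEC("license") = "GPL";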

I hope this helps.

Thanks,
Song

[...]

* Re: [RFC bpf-next] Hierarchical Cgroup Stats Collection Using BPF
  2022-03-16 16:35   ` Yosry Ahmed
@ 2022-03-22 18:09     ` Yonghong Song
  2022-03-22 21:37       ` Hao Luo
  2022-03-28  9:22         ` Yosry Ahmed
  0 siblings, 2 replies; 18+ messages in thread
From: Yonghong Song @ 2022-03-22 18:09 UTC (permalink / raw)
  To: Yosry Ahmed, Tejun Heo
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Johannes Weiner, Hao Luo, Shakeel Butt, Stanislav Fomichev,
	David Rientjes, bpf, KP Singh, cgroups, Linux-MM



On 3/16/22 9:35 AM, Yosry Ahmed wrote:
> Hi Tejun,
> 
> Thanks for taking the time to read my proposal! Sorry for the late
> reply. This email skipped my inbox for some reason.
> 
> On Sun, Mar 13, 2022 at 10:35 PM Tejun Heo <tj@kernel.org> wrote:
>>
>> Hello,
>>
>> On Wed, Mar 09, 2022 at 12:27:15PM -0800, Yosry Ahmed wrote:
>> ...
>>> These problems are already addressed by the rstat aggregation
>>> mechanism in the kernel, which is primarily used for memcg stats. We
>>
>> Not that it matters all that much but I don't think the above statement is
>> true given that sched stats are an integrated part of the rstat
>> implementation and io was converted before memcg.
>>
> 
> Excuse my ignorance, I am new to kernel development. I only saw calls
> to cgroup_rstat_updated() in memcg and io and assumed they were the
> only users. Now I found cpu_account_cputime() :)
> 
>>> - For every cgroup, we will either use flags to distinguish BPF stats
>>> updates from normal stats updates, or flush both anyway (memcg stats
>>> are periodically flushed anyway).
>>
>> I'd just keep them together. Usually most activities tend to happen
>> together, so it's cheaper to aggregate all of them in one go in most cases.
> 
> This makes sense to me, thanks.
> 
>>
>>> - Provide flags to enable/disable using per-cpu arrays (for stats that
>>> are not updated frequently), and enable/disable hierarchical
>>> aggregation (for non-hierarchical stats, they can still benefit from
>>> the automatic entry creation & deletion).
>>> - Provide different hierarchical aggregation operations: SUM, MAX, MIN, etc.
>>> - Instead of an array as the map value, use a struct, and let the user
>>> provide an aggregator function in the form of a BPF program.
>>
>> I'm more partial to the last option. It does make the usage a bit more
>> complicated but hopefully it shouldn't be too bad with good examples.
>>
>> I don't have strong opinions on the bpf side of things but it'd be great to
>> be able to use rstat from bpf.
> 
> It indeed gives more flexibility but is more complicated. Also, I am
> not sure about the overhead of making calls to BPF programs in every
> aggregation step. Looking forward to getting feedback on the bpf side
> of things.

Hi, Yosry, I heard this was discussed in bpf office hour which I
didn't attend. Could you summarize the conclusion and what is the
step forward? We also have an internal tool which collects cgroup
stats and this might help us as well. Thanks!

> 
>>
>> Thanks.
>>
>> --
>> tejun

* Re: [RFC bpf-next] Hierarchical Cgroup Stats Collection Using BPF
  2022-03-22 18:09     ` Yonghong Song
@ 2022-03-22 21:37       ` Hao Luo
  2022-03-22 22:06         ` Yonghong Song
  2022-03-28  9:22         ` Yosry Ahmed
  1 sibling, 1 reply; 18+ messages in thread
From: Hao Luo @ 2022-03-22 21:37 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Yosry Ahmed, Tejun Heo, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Johannes Weiner, Shakeel Butt,
	Stanislav Fomichev, David Rientjes, bpf, KP Singh, cgroups,
	Linux-MM

On Tue, Mar 22, 2022 at 11:09 AM Yonghong Song <yhs@fb.com> wrote:
>
> Hi, Yosry, I heard this was discussed in bpf office hour which I
> didn't attend. Could you summarize the conclusion and what is the
> step forward? We also have an internal tool which collects cgroup
> stats and this might help us as well. Thanks!
>

Hi Yonghong,

Yosry is out of office this entire week. I don't know whether he has
access to corp email account, so please allow me to reply on his
behalf.

So instead of using rstat in bpf (by providing a map interface), it's
better to leverage the flexibility provided by bpf to extend rstat.
"We can achieve a similar outcome with a less intrusive design." Yosry
"will look at programs providing their own aggregators and maintaining
their own stats, and only making callbacks to/from rstat." As the next
step, Yosry is going to work on an RFC patchset to drive the
discussion, when he is back.

It's great to see that this is helpful to you as well! I am looking
forward to collaboration if there is a need.

Hao

* Re: [RFC bpf-next] Hierarchical Cgroup Stats Collection Using BPF
  2022-03-22 21:37       ` Hao Luo
@ 2022-03-22 22:06         ` Yonghong Song
  0 siblings, 0 replies; 18+ messages in thread
From: Yonghong Song @ 2022-03-22 22:06 UTC (permalink / raw)
  To: Hao Luo
  Cc: Yosry Ahmed, Tejun Heo, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Johannes Weiner, Shakeel Butt,
	Stanislav Fomichev, David Rientjes, bpf, KP Singh, cgroups,
	Linux-MM



On 3/22/22 2:37 PM, Hao Luo wrote:
> On Tue, Mar 22, 2022 at 11:09 AM Yonghong Song <yhs@fb.com> wrote:
>>
>> Hi, Yosry, I heard this was discussed in bpf office hour which I
>> didn't attend. Could you summarize the conclusion and what is the
>> step forward? We also have an internal tool which collects cgroup
>> stats and this might help us as well. Thanks!
>>
> 
> Hi Yonghong,
> 
> Yosry is out of office this entire week. I don't know whether he has
> access to corp email account, so please allow me to reply on his
> behalf.
> 
> So instead of using rstat in bpf (by providing a map interface), it's
> better to leverage the flexibility provided by bpf to extend rstat.
> "We can achieve a similar outcome with a less intrusive design." Yosry
> "will look at programs providing their own aggregators and maintaining
> their own stats, and only making callbacks to/from rstat." As the next
> step, Yosry is going to work on an RFC patchset to drive the
> discussion, when he is back.
> 
> It's great to see that this is helpful to you as well! I am looking
> forward to collaboration if there is a need.

Thanks, Hao, for the explanation. I am looking forward to the RFC patch!

> 
> Hao

* Re: [RFC bpf-next] Hierarchical Cgroup Stats Collection Using BPF
@ 2022-03-28  9:16     ` Yosry Ahmed
  0 siblings, 0 replies; 18+ messages in thread
From: Yosry Ahmed @ 2022-03-28  9:16 UTC (permalink / raw)
  To: Song Liu
  Cc: Namhyung Kim, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Johannes Weiner, Hao Luo, Shakeel Butt,
	Stanislav Fomichev, David Rientjes, bpf, KP Singh, cgroups,
	Linux-MM

On Fri, Mar 18, 2022 at 12:59 PM Song Liu <song@kernel.org> wrote:
>
> On Wed, Mar 9, 2022 at 12:27 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > Hey everyone,
> >
> > I would like to discuss an idea to facilitate collection of
> > hierarchical cgroup stats using BPF programs. We want to provide a
> > simple interface for BPF programs to collect hierarchical cgroup stats
> > and integrate with the existing rstat aggregation mechanism in the
> > kernel. The most prominent use case is the ability to extend memcg
> > stats (and histograms) by BPF programs.
> >
>
> + Namhyung,
>
> I forgot to mention this in the office hour. The aggregation efficiency
> problem is actually similar to Namhyung's work to use BPF programs
> to aggregate perf counters. Check:
>      tools/perf/util/bpf_skel/bperf_cgroup.bpf.c
>
> Namhyung's solution is to walk up the cgroup tree on cgroup switch
> events. This may not be as efficient as rstat flush logic, but I think it
> is good enough for many use cases (unless the cgroup tree is very
> deep). It also demonstrates how we can implement some cgroup
> logic in BPF.
>
> I hope this helps.
>
> Thanks,
> Song
>
> [...]

Hi Song,

Thanks so much for pointing this out. I have an idea of a less
intrusive and more generic way to directly use rstat flushing in BPF
programs, which basically makes BPF programs maintain their own stats
and flushing logic and only use helpers to make calls to rstat
(instead of a whole new rstat map). I will work on an RFC patch series
for this.

If this doesn't work out, I will probably fall back to handling the
flushing completely in the BPF programs, similar to the example you
provided (which is a lot of help, thanks!).

* Re: [RFC bpf-next] Hierarchical Cgroup Stats Collection Using BPF
@ 2022-03-28  9:22         ` Yosry Ahmed
  0 siblings, 0 replies; 18+ messages in thread
From: Yosry Ahmed @ 2022-03-28  9:22 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Tejun Heo, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Johannes Weiner, Hao Luo, Shakeel Butt, Stanislav Fomichev,
	David Rientjes, bpf, KP Singh, cgroups, Linux-MM

On Tue, Mar 22, 2022 at 11:09 AM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 3/16/22 9:35 AM, Yosry Ahmed wrote:
> > Hi Tejun,
> >
> > Thanks for taking the time to read my proposal! Sorry for the late
> > reply. This email skipped my inbox for some reason.
> >
> > On Sun, Mar 13, 2022 at 10:35 PM Tejun Heo <tj@kernel.org> wrote:
> >>
> >> Hello,
> >>
> >> On Wed, Mar 09, 2022 at 12:27:15PM -0800, Yosry Ahmed wrote:
> >> ...
> >>> These problems are already addressed by the rstat aggregation
> >>> mechanism in the kernel, which is primarily used for memcg stats. We
> >>
> >> Not that it matters all that much but I don't think the above statement is
> >> true given that sched stats are an integrated part of the rstat
> >> implementation and io was converted before memcg.
> >>
> >
> > Excuse my ignorance, I am new to kernel development. I only saw calls
> > to cgroup_rstat_updated() in memcg and io and assumed they were the
> > only users. Now I found cpu_account_cputime() :)
> >
> >>> - For every cgroup, we will either use flags to distinguish BPF stats
> >>> updates from normal stats updates, or flush both anyway (memcg stats
> >>> are periodically flushed anyway).
> >>
> >> I'd just keep them together. Usually most activities tend to happen
> >> together, so it's cheaper to aggregate all of them in one go in most cases.
> >
> > This makes sense to me, thanks.
> >
> >>
> >>> - Provide flags to enable/disable using per-cpu arrays (for stats that
> >>> are not updated frequently), and enable/disable hierarchical
> >>> aggregation (for non-hierarchical stats, they can still benefit from
> >>> the automatic entry creation & deletion).
> >>> - Provide different hierarchical aggregation operations: SUM, MAX, MIN, etc.
> >>> - Instead of an array as the map value, use a struct, and let the user
> >>> provide an aggregator function in the form of a BPF program.
> >>
> >> I'm more partial to the last option. It does make the usage a bit more
> >> complicated but hopefully it shouldn't be too bad with good examples.
> >>
> >> I don't have strong opinions on the bpf side of things but it'd be great to
> >> be able to use rstat from bpf.
> >
> > It indeed gives more flexibility but is more complicated. Also, I am
> > not sure about the overhead of making calls to BPF programs in every
> > aggregation step. Looking forward to getting feedback on the bpf side
> > of things.
>
> Hi, Yosry, I heard this was discussed in bpf office hour which I
> didn't attend. Could you summarize the conclusion and what is the
> step forward? We also have an internal tool which collects cgroup
> stats and this might help us as well. Thanks!
>
> >
> >>
> >> Thanks.
> >>
> >> --
> >> tejun

Hi Yonghong,

Hao has already done an excellent job summarizing the outcome of the meeting.

The idea I have is basically to introduce "rstat flushing" BPF
programs. BPF programs that collect and display stats would use
helpers to call cgroup_rstat_flush() and cgroup_rstat_updated() (or
similar). rstat would then make calls to the "rstat flushing" BPF
programs during flushes, similar to calls to css_rstat_flush().
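
To give a rough idea of the shape of this, a sketch is below.
Everything in it is hypothetical at this point: the attach point, the
program signature, and the helper declarations are just placeholders
for discussion:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* stats fully owned by the BPF program, e.g. per-cpu counters per cgroup */
struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_HASH);
	__uint(max_entries, 10240);
	__type(key, __u64);	/* cgroup_id */
	__type(value, __u64);
} my_percpu_stats SEC(".maps");

/* hypothetical helpers wrapping cgroup_rstat_updated()/cgroup_rstat_flush() */
extern void bpf_cgroup_rstat_updated(struct cgroup *cgrp) __ksym;
extern void bpf_cgroup_rstat_flush(struct cgroup *cgrp) __ksym;

/* hypothetical "rstat flushing" program: invoked by rstat for each
 * (cgroup, cpu) pair with pending updates, similar to css_rstat_flush()
 */
SEC("rstat/flush")
int BPF_PROG(my_rstat_flush, struct cgroup *cgrp, struct cgroup *parent, int cpu)
{
	/* fold my_percpu_stats for (cgrp, cpu) into this program's own
	 * totals and propagate to @parent as needed
	 */
	return 0;
}

char LICENSE[] SEC("license") = "GPL";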

I will work on RFC patches for this soon. Let me know if you have any
comments/suggestions/feedback.

Thanks!
