[PATCH RFC 0/4] Add ability to attach bpf programs to a tracepoint inside a cgroup

From: Kenny Ho @ 2021-11-18 20:28 UTC
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Tejun Heo, Zefan Li, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Steven Rostedt,
	netdev, bpf, linux-kernel, cgroups, linux-perf-users, y2kenny,
	Kenny.Ho, amd-gfx

Per an earlier discussion last year[1], I have been looking for a mechanism to a) collect resource usage for devices (GPUs for now, but there could be other device types in the future) and b) possibly enforce limits on some of those resources.  An obvious mechanism is cgroup, but there is too much diversity in GPU hardware architectures to have a common cgroup interface at this point.  An alternative is to attach a bpf program to a tracepoint inside a cgroup hierarchy for usage collection and enforcement (via writable tracepoints).

This is a prototype of that idea.  It is incomplete, but I would like to solicit some feedback before continuing, to make sure I am going down the right path.  This prototype is built on my understanding of the following:

- tracepoints (and kprobes, uprobes) are associated with perf events
- perf events/tracepoints can be a hook for bpf progs, but those bpf progs are not part of the cgroup hierarchy
- bpf progs can be attached to the cgroup hierarchy, with cgroup local storage and other benefits
- separately, the perf subsystem has a cgroup controller (perf cgroup) that allows perf events to be triggered with a cgroup filter

So the key idea of this RFC is to leverage the hierarchical organization of bpf-cgroup for perf events/tracepoints.
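
To illustrate what "hierarchical organization" buys here, below is a toy user-space model, not code from this series.  All names (toy_cgroup, toy_effective_progs) are hypothetical, and the ordering it uses (progs of the cgroup where the event fires first, then each ancestor up to the root) is an assumption of the sketch rather than a statement about what the patches implement.

```c
/* Toy user-space model, NOT kernel code: every name here is hypothetical. */
#include <assert.h>
#include <stddef.h>

#define MAX_PROGS 4

struct toy_cgroup {
	const struct toy_cgroup *parent;	/* NULL for the root */
	int progs[MAX_PROGS];			/* ids of progs attached here */
	size_t nr_progs;			/* must be <= MAX_PROGS */
};

/*
 * Collect the progs that would run for a tracepoint hit in @cg:
 * progs attached to @cg itself first, then those of each ancestor
 * up to the root.  Returns the number of ids written to @out,
 * which never exceeds @cap.
 */
static size_t toy_effective_progs(const struct toy_cgroup *cg,
				  int *out, size_t cap)
{
	size_t n = 0;

	assert(out != NULL);
	for (; cg != NULL; cg = cg->parent)
		for (size_t i = 0; i < cg->nr_progs && n < cap; i++)
			out[n++] = cg->progs[i];
	return n;
}
```

With progs 1 and 2 attached to a root cgroup and prog 3 attached to a child, an event in the child yields the list 3, 1, 2, while an event in the root yields only 1, 2.  A plain perf event/tracepoint attachment has no such notion of inheritance.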

==Known unresolved topics (feedback very much welcome)==
Storage:
I came across the idea of "preallocated" memory for bpf hash maps/storage to avoid deadlock[2], but I do not have a good understanding of it yet.  If the existing bpf_cgroup_storage_type values are not considered preallocated, I am thinking we can introduce a new type, but I am not sure whether this is needed yet.
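
As a rough illustration of the general idea only (not of the actual kernel implementation in [2]), the sketch below reserves all elements at setup time so that the hot path, which could be a tracepoint handler, only pops from and pushes to a free list and never calls an allocator.  The names (toy_pool, toy_elem) are hypothetical, and the sketch is single-threaded with no locking or per-CPU handling.

```c
/* Toy user-space sketch, NOT kernel code: all names are hypothetical. */
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 4

struct toy_elem {
	struct toy_elem *next_free;
	unsigned long value;
};

struct toy_pool {
	struct toy_elem elems[POOL_SIZE];	/* all storage reserved up front */
	struct toy_elem *free_list;
};

/* Setup path: link every element into the free list. */
static void toy_pool_init(struct toy_pool *p)
{
	assert(p != NULL);
	p->free_list = NULL;
	for (size_t i = 0; i < POOL_SIZE; i++) {
		p->elems[i].next_free = p->free_list;
		p->free_list = &p->elems[i];
	}
}

/* Hot path: pop an element without allocating; NULL when exhausted. */
static struct toy_elem *toy_pool_get(struct toy_pool *p)
{
	struct toy_elem *e = p->free_list;

	if (e != NULL)
		p->free_list = e->next_free;
	return e;
}

/* Hot path: return an element to the pool. */
static void toy_pool_put(struct toy_pool *p, struct toy_elem *e)
{
	assert(e != NULL);
	e->next_free = p->free_list;
	p->free_list = e;
}
```

The trade-off is that memory is committed for POOL_SIZE elements whether or not they are used, and toy_pool_get() fails once the pool is exhausted instead of growing.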

Scalability:
Scalability concerns have been raised about perf cgroup [3], and a solution seems to have emerged recently with bperf [4].  This RFC does not change the status quo on the scalability question, but if I understand the bperf idea correctly, this RFC may share some similarity with it.

[1] https://lore.kernel.org/netdev/YJXRHXIykyEBdnTF@slm.duckdns.org/T/#m52bc26bbbf16131c48e6b34d875c87660943c452
[2] https://lwn.net/Articles/679074/
[3] https://www.linuxplumbersconf.org/event/4/contributions/291/attachments/313/528/Linux_Plumbers_Conference_2019.pdf
[4] https://linuxplumbersconf.org/event/11/contributions/899/

Kenny Ho (4):
  cgroup, perf: Add ability to connect to perf cgroup from other cgroup
    controller
  bpf, perf: add ability to attach complete array of bpf prog to perf
    event
  bpf,cgroup,tracing: add new BPF_PROG_TYPE_CGROUP_TRACEPOINT
  bpf,cgroup,perf: extend bpf-cgroup to support tracepoint attachment

 include/linux/bpf-cgroup.h   | 17 +++++--
 include/linux/bpf_types.h    |  4 ++
 include/linux/cgroup.h       |  2 +
 include/linux/perf_event.h   |  6 +++
 include/linux/trace_events.h |  9 ++++
 include/uapi/linux/bpf.h     |  2 +
 kernel/bpf/cgroup.c          | 96 +++++++++++++++++++++++++++++-------
 kernel/bpf/syscall.c         |  4 ++
 kernel/cgroup/cgroup.c       | 13 ++---
 kernel/events/core.c         | 62 +++++++++++++++++++++++
 kernel/trace/bpf_trace.c     | 36 ++++++++++++++
 11 files changed, 222 insertions(+), 29 deletions(-)

-- 
2.25.1
