Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF

From: Song Liu <songliubraving@fb.com>
To: Namhyung Kim <namhyung@kernel.org>
Cc: linux-kernel <linux-kernel@vger.kernel.org>,
	Kernel Team <Kernel-team@fb.com>,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	"Arnaldo Carvalho de Melo" <acme@redhat.com>,
	Jiri Olsa <jolsa@kernel.org>
Subject: Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF
Date: Sat, 13 Mar 2021 19:37:44 +0000
Message-ID: <E5F52446-8902-4A4B-AF5E-90EBAB7F7A07@fb.com>
In-Reply-To: <CAM9d7cg+HD3-vLXX_rUSg1kWSZ3MGeyrQwdJoa5CgbZjeD2+GA@mail.gmail.com>

> On Mar 12, 2021, at 6:47 PM, Namhyung Kim <namhyung@kernel.org> wrote:
> 
> On Sat, Mar 13, 2021 at 12:38 AM Song Liu <songliubraving@fb.com> wrote:
>> 
>> 
>> 
>>> On Mar 12, 2021, at 12:36 AM, Namhyung Kim <namhyung@kernel.org> wrote:
>>> 
>>> Hi,
>>> 
>>> On Fri, Mar 12, 2021 at 11:03 AM Song Liu <songliubraving@fb.com> wrote:
>>>> 
>>>> perf uses performance monitoring counters (PMCs) to monitor system
>>>> performance. The PMCs are limited hardware resources. For example,
>>>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
>>>> 
>>>> Modern data center systems use these PMCs in many different ways:
>>>> system level monitoring, (maybe nested) container level monitoring, per
>>>> process monitoring, profiling (in sample mode), etc. In some cases,
>>>> there are more active perf_events than available hardware PMCs. To allow
>>>> all perf_events to have a chance to run, it is necessary to do expensive
>>>> time multiplexing of events.
>>>> 
>>>> On the other hand, many monitoring tools count the common metrics (cycles,
>>>> instructions). It is a waste to have multiple tools create multiple
>>>> perf_events of "cycles" and occupy multiple PMCs.
>>>> 
>>>> bperf tries to reduce such waste by allowing multiple perf_events of
>>>> "cycles" or "instructions" (at different scopes) to share PMCs. Instead
>>>> of having each perf-stat session read its own perf_events, bperf uses
>>>> BPF programs to read the perf_events and aggregate readings to BPF maps.
>>>> Then, the perf-stat session(s) read the values from these BPF maps.
>>>> 
>>>> Please refer to the comment before the definition of bperf_ops for the
>>>> description of bperf architecture.
>>> 
>>> Interesting!  Actually I thought about something similar before,
>>> but my BPF knowledge is outdated.  So I need to catch up but
>>> failed to have some time for it so far. ;-)
>>> 
>>>> 
>>>> bperf is off by default. To enable it, pass --use-bpf option to perf-stat.
>>>> bperf uses a BPF hashmap to share information about BPF programs and maps
>>>> used by bperf. This map is pinned to bpffs. The default address is
>>>> /sys/fs/bpf/bperf_attr_map. The user could change the address with option
>>>> --attr-map.
>>>> 
>>>> ---
>>>> Known limitations:
>>>> 1. Do not support per cgroup events;
>>>> 2. Do not support monitoring of BPF program (perf-stat -b);
>>>> 3. Do not support event groups.
>>> 
>>> In my case, per cgroup event counting is very important.
>>> And I'd like to do that with lots of cpus and cgroups.
>> 
>> We can easily extend this approach to support cgroups events. I didn't
>> implement it to keep the first version simple.
> 
> OK.
> 
>> 
>>> So I'm working on an in-kernel solution (without BPF),
>>> I hope to share it soon.
>> 
>> This is interesting! I cannot wait to see what it looks like. I spent
>> quite some time trying to enable in-kernel sharing (not just cgroup
>> events), but finally decided to try the BPF approach.
> 
> Well I found it hard to support generic event sharing that works
> for all use cases.  So I'm focusing on the per cgroup case only.
> 
>> 
>>> 
>>> And for event groups, it seems the current implementation
>>> cannot handle more than one event (not even in a group).
>>> That could be a serious limitation..
>> 
>> It supports multiple events. Multiple events are independent, i.e.,
>> "cycles" and "instructions" would use two independent leader programs.
> 
> OK, then do you need multiple bperf_attr_maps?  Does it work
> for an arbitrary number of events?

The bperf_attr_map (or perf_attr_map) is shared among different events,
so you do not need multiple maps. It is a hash map with perf_event_attr
as the key. Currently, its size is hard-coded to 16. We can introduce
more flexible management of this map.
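
To make the sharing concrete, here is a rough user-space model of the
idea, not the actual bperf code. Everything in it is an assumption made
for illustration: struct attr_key is a trimmed-down stand-in for
perf_event_attr, leader_id is a hypothetical value type, a linear scan
stands in for real hashing, and attr_map_get_or_add() is an invented
helper. Only the capacity of 16 and the keying by event attribute come
from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Capacity of the shared map; 16 mirrors the hard-coded size. */
#define ATTR_MAP_SIZE 16

/* Hypothetical, trimmed-down stand-in for struct perf_event_attr. */
struct attr_key {
	uint32_t type;		/* e.g. 0 for PERF_TYPE_HARDWARE */
	uint64_t config;	/* e.g. 0 for cycles, 1 for instructions */
};

/* One slot: which event, and which (hypothetical) leader serves it. */
struct attr_entry {
	int used;
	struct attr_key key;
	uint32_t leader_id;
};

/* Zero-initialize before use; all slots then start out unused. */
struct attr_map {
	struct attr_entry slots[ATTR_MAP_SIZE];
};

/* Compare field by field so struct padding never affects the result. */
static int key_equal(const struct attr_key *a, const struct attr_key *b)
{
	return a->type == b->type && a->config == b->config;
}

/*
 * Look up key. If an entry exists, report its leader so the caller can
 * share it. Otherwise register new_id as the leader for this event.
 * Returns 0 on success, -1 if the map is full.
 */
static int attr_map_get_or_add(struct attr_map *m, const struct attr_key *key,
			       uint32_t new_id, uint32_t *leader_id)
{
	struct attr_entry *free_slot = NULL;

	assert(m && key && leader_id);

	for (int i = 0; i < ATTR_MAP_SIZE; i++) {
		struct attr_entry *e = &m->slots[i];

		if (e->used && key_equal(&e->key, key)) {
			*leader_id = e->leader_id;	/* share existing leader */
			return 0;
		}
		if (!e->used && !free_slot)
			free_slot = e;			/* remember first free slot */
	}

	if (!free_slot)
		return -1;				/* all 16 slots taken */

	free_slot->used = 1;
	free_slot->key = *key;
	free_slot->leader_id = new_id;
	*leader_id = new_id;
	return 0;
}
```

In this model, a second session asking for "cycles" gets back the
leader that the first session registered, while "instructions" gets
its own entry in the same map. A 17th distinct event is rejected,
which is the limit that more flexible management would lift.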

Thanks,
Song