Re: [PATCH v4 4/9] libperf: Add libperf_evsel__mmap()

From: Rob Herring <robh@kernel.org>
To: Jiri Olsa <jolsa@redhat.com>
Cc: Will Deacon <will@kernel.org>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	linux-arm-kernel <linux-arm-kernel@lists.infradead.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Alexander Shishkin <alexander.shishkin@linux.intel.com>,
	Namhyung Kim <namhyung@kernel.org>,
	Raphael Gault <raphael.gault@arm.com>,
	Mark Rutland <mark.rutland@arm.com>,
	Jonathan Cameron <Jonathan.Cameron@huawei.com>,
	Ian Rogers <irogers@google.com>,
	Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>,
	Itaru Kitayama <itaru.kitayama@gmail.com>
Subject: Re: [PATCH v4 4/9] libperf: Add libperf_evsel__mmap()
Date: Fri, 6 Nov 2020 15:56:11 -0600
Message-ID: <CAL_JsqJzeCebq4VP+xBtfh=fbomvaJoVMp35AQQDGTYD-fRWgw@mail.gmail.com>
In-Reply-To: <20201105224121.GA4112111@krava>

On Thu, Nov 5, 2020 at 4:41 PM Jiri Olsa <jolsa@redhat.com> wrote:
>
> On Thu, Nov 05, 2020 at 10:19:24AM -0600, Rob Herring wrote:
>
> SNIP
>
> > > > >
> > > > > that maps page for each event, then perf_evsel__read
> > > > > could go through the fast code, no?
> > > >
> > > > No, because we're not self-monitoring (pid == 0 and cpu == -1). With
> > > > the following change:
> > > >
> > > > diff --git a/tools/lib/perf/tests/test-evsel.c
> > > > b/tools/lib/perf/tests/test-evsel.c
> > > > index eeca8203d73d..1fca9c121f7c 100644
> > > > --- a/tools/lib/perf/tests/test-evsel.c
> > > > +++ b/tools/lib/perf/tests/test-evsel.c
> > > > @@ -17,6 +17,7 @@ static int test_stat_cpu(void)
> > > >  {
> > > >         struct perf_cpu_map *cpus;
> > > >         struct perf_evsel *evsel;
> > > > +       struct perf_event_mmap_page *pc;
> > > >         struct perf_event_attr attr = {
> > > >                 .type   = PERF_TYPE_SOFTWARE,
> > > >                 .config = PERF_COUNT_SW_CPU_CLOCK,
> > > > @@ -32,6 +33,15 @@ static int test_stat_cpu(void)
> > > >         err = perf_evsel__open(evsel, cpus, NULL);
> > > >         __T("failed to open evsel", err == 0);
> > > >
> > > > +       pc = perf_evsel__mmap(evsel, 0);
> > > > +       __T("failed to mmap evsel", pc);
> > > > +
> > > > +#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__)
> > > > +       __T("userspace counter access not supported", pc->cap_user_rdpmc);
> > > > +       __T("userspace counter access not enabled", pc->index);
> > > > +       __T("userspace counter width not set", pc->pmc_width >= 32);
> > > > +#endif
> > >
> > > I'll need to check, I'm surprised this would depend on the way
> > > you open the event
> >
> > Any more thoughts on this?
>
> sry I got stuck with other stuff.. I tried your change
> and pc->cap_user_rdpmc is 0 because the test creates
> software event, which does not support that

Sigh, yes, of course.

> when I change that to:
>
>         .type   = PERF_TYPE_HARDWARE,
>         .config = PERF_COUNT_HW_CPU_CYCLES,
>
> I don't see any of those warning you added

So I've now implemented the per-fd mmap. It seems to run and return
some data, but for the above case the counts don't look right.

cpu0: count = 0x10883, ena = 0xbf42, run = 0xbf42
cpu1: count = 0x1bc65, ena = 0xa278, run = 0xa278
cpu2: count = 0x1fab2, ena = 0x91ea, run = 0x91ea
cpu3: count = 0x23d61, ena = 0x81ac, run = 0x81ac
cpu4: count = 0x2936a, ena = 0x7149, run = 0x7149
cpu5: count = 0x2cd4e, ena = 0x634f, run = 0x634f
cpu6: count = 0x3139f, ena = 0x53e7, run = 0x53e7
cpu7: count = 0x35350, ena = 0x4690, run = 0x4690

For comparison, this is what I get using the slow path read():
cpu0: count = 0x1c40, ena = 0x188b5, run = 0x188b5
cpu1: count = 0x18e0, ena = 0x1b8f4, run = 0x1b8f4
cpu2: count = 0x745e, ena = 0x1ab9e, run = 0x1ab9e
cpu3: count = 0x2416, ena = 0x1a280, run = 0x1a280
cpu4: count = 0x19c7, ena = 0x19b00, run = 0x19b00
cpu5: count = 0x1737, ena = 0x19262, run = 0x19262
cpu6: count = 0x11d0e, ena = 0x18944, run = 0x18944
cpu7: count = 0x20dbe, ena = 0x181f4, run = 0x181f4

The difference is that we get a sequentially increasing count across
the CPUs rather than 1 random CPU (the one running the test) with a
much higher count. That suggests to me that we're just reading the
count for the calling process, not for each CPU.

For this to work correctly, cap_user_rdpmc would have to be set only
for the mmap of the CPU that the calling process is running on. I'm
not sure whether that can be done. Even if it can, is it really worth
doing? You'd be accelerating the read of an event on 1 out of N CPUs.
And what do we do on every kernel up to now, where this won't work?
Another cap bit?
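
To make that condition concrete, here is a rough user-space sketch of
the check that would be needed. It is illustrative only: struct
mapped_event and can_use_fast_read() are made-up names, not libperf or
kernel API, and the fields merely mirror what the mmap'ed page exposes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical, simplified view of one per-CPU mapping. This is NOT
 * struct perf_event_mmap_page; it only carries the bits discussed here.
 */
struct mapped_event {
	int          cpu;            /* CPU the event was opened on */
	bool         cap_user_rdpmc; /* mirrors pc->cap_user_rdpmc */
	unsigned int index;          /* mirrors pc->index, 0 = no counter */
};

/*
 * A user-space counter read returns the counter of the CPU the caller
 * is running on, so the fast path is only meaningful when that CPU is
 * the one the event belongs to.
 */
static bool can_use_fast_read(const struct mapped_event *ev, int current_cpu)
{
	assert(ev != NULL);
	return ev->cap_user_rdpmc && ev->index != 0 && ev->cpu == current_cpu;
}
```

Even with a check like this, the caller could migrate between the
check and the read, so it would have to be pinned or the result
revalidated. For every other CPU's event the reader would fall back to
read().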

Rob

P.S. I did find one bug with all this. The shifts by pmc_width in the
read sequence need to operate on a signed count. This test happens to
have raw counter values starting at 2^47.
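
A minimal sketch of the sign extension, assuming the read sequence
shifts left and then right by (64 - pmc_width) as I understand the
documented sequence to do. sign_extend_pmc() is a made-up helper name,
not something in libperf:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: treat `raw` as a `width`-bit two's complement
 * value and sign-extend it to 64 bits.
 *
 * The right shift must be done on a signed type. With an unsigned type
 * the top bits are filled with zeros, so a raw value with bit
 * (width - 1) set stays a large positive number instead of becoming
 * negative.
 *
 * Note: converting an out-of-range value to int64_t and right-shifting
 * a negative value are implementation-defined in C; GCC and Clang wrap
 * and shift arithmetically, which is what this sketch relies on.
 */
static int64_t sign_extend_pmc(uint64_t raw, unsigned int width)
{
	unsigned int shift;
	int64_t v;

	assert(width >= 1 && width <= 64);
	shift = 64 - width;

	v = (int64_t)(raw << shift);
	return v >> shift;
}
```

With a 48-bit counter, a raw value of 2^47 has its top bit set, so the
helper returns -2^47, whereas doing the same shifts on a u64 would
leave it at +2^47.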