From: "Liang, Kan" <kan.liang@intel.com>
To: Stephane Eranian <eranian@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	"mingo@redhat.com" <mingo@redhat.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Alexander Shishkin <alexander.shishkin@linux.intel.com>,
	"ak@linux.intel.com" <ak@linux.intel.com>,
	"Odzioba, Lukasz" <lukasz.odzioba@intel.com>
Subject: RE: [PATCH] perf/x86: fix event counter update issue
Date: Mon, 28 Nov 2016 19:59:42 +0000
Message-ID: <37D7C6CF3E00A74B8858931C1DB2F07750CA39C9@SHSMSX103.ccr.corp.intel.com>
In-Reply-To: <CABPqkBRYUJASncQai0dZ=prpFzgNfNXVA_MNA9E=ZmTCs5P0GA@mail.gmail.com>



> >
> > All the possible failure cases are listed here.
> > Terms:
> >     - new: the current PMU counter value, read via rdpmcl.
> >     - prev: the previous counter value, stored in &hwc->prev_count.
> >     - in PMI/not in PMI: whether the event update happens in the PMI
> >       handler or not.
> > The current code calculates delta as:
> >     delta = (new << shift) - (prev << shift);
> >     delta >>= shift;
> >
> > Case A: Not in PMI.  new > prev. But delta is negative.
> >    That's the failure case of Test 2.
> >    delta has type s64; new and prev have type u64. If new is large
> >    enough, the most significant bit of the result may still be set
> >    after the left shifts and the subtraction.
> >    Interpreted as s64, that is the sign bit. Since delta is s64, an
> >    arithmetic right shift is applied, which copies the sign bit into
> >    the vacated upper bit positions of delta, so delta comes out
> >    negative.
> >    It can be fixed by adding the max count value.
> >
> >    Here is the real data for test2 on KNL.
> >    new = aea96e1927
> >    prev = ffffff0000000001
> 
> 
> How can new be so large here?
> I mean, between prev and new the counter wrapped around, and it took
> that many occurrences (of cycles) to handle the overflow?

There is no overflow in this case.
The counter is 40 bits wide. The highest value that can be counted
without overflowing is 0xfffffffffe.
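
For reference, here is a minimal user-space sketch (my own illustration,
not the kernel code) that reproduces the arithmetic with the Test 2
numbers quoted above; cntval_bits mirrors the kernel's x86_pmu field:

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		const int cntval_bits = 40;	/* KNL counter width */
		const int shift = 64 - cntval_bits;
		uint64_t new = 0xaea96e1927ULL;
		uint64_t prev = 0xffffff0000000001ULL;

		/* current calculation: the arithmetic shift sign-extends */
		int64_t delta = (int64_t)((new << shift) - (prev << shift));
		delta >>= shift;
		printf("delta = %llx\n", (unsigned long long)delta);	/* ffffffaea96e1926 */

		/* correction: re-add one full counter period (cntval_mask + 1) */
		if (delta < 0)
			delta += 1ULL << cntval_bits;
		printf("fixed = %llx\n", (unsigned long long)delta);	/* aea96e1926 */
		return 0;
	}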

> 
> >
> >    delta = aea96e1927000000 - 1000000 = aea96e1926000000
> >    aea96e1926000000 >> 24 = ffffffaea96e1926   <<  negative delta
> >
> > Case B: In PMI. new > prev. delta is positive.
> >    That's the failure case of Test 3.
> >    The PMI is triggered by an overflow, but there is latency between
> >    the overflow and the PMI handler, so new holds only a small value.
> >    The current calculation loses the max count value.
> >
> > Case C: In PMI. new < prev. delta is negative.
> >    The PMU counter may start from a large value, e.g. when the fixed
> >    period is small.
> >    It can be fixed by adding the max count value.
> >
> I am not sure I understand why PMI should matter here. What matters is
> prev vs. current and the wrap-around.
> Please explain.
> Thanks.

Right, PMI shouldn't matter here.
We should add the max count value whenever delta is negative, whether or
not we are in the PMI handler.

While working on this fix, I once had a list that included all the
possible cases. "non-PMI, new < prev, delta is negative" was one of
them, but it rarely happens, so I removed it and forgot to update the
description of Case C accordingly.
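
In code, the sign-only variant would look roughly like this (a
user-space model of the delta computation in x86_perf_event_update(),
not a tested kernel patch; counter_delta is a hypothetical helper name):

	static inline int64_t counter_delta(uint64_t new, uint64_t prev,
					    int cntval_bits)
	{
		int shift = 64 - cntval_bits;
		int64_t delta = (int64_t)((new << shift) - (prev << shift));

		delta >>= shift;		/* arithmetic shift sign-extends */
		if (delta < 0)			/* wrapped, or sign bit leaked in */
			delta += 1ULL << cntval_bits;	/* cntval_mask + 1 */
		return delta;
	}

Note this variant still cannot recover the lost period in Case B, where
the counter wrapped but delta already comes out positive; that is what
the pmi flag in the patch below is for.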

Thanks,
Kan

> 
> >
> > There is still one case in which delta could be wrong: when the
> > event update happens in the gap between the overflow and the PMI.
> > It is not in the PMI, new > prev, and delta is positive, so the
> > current calculation may lose the max count value.
> > But this case rarely happens, so it is not handled here.
> >
> > Reported-by: Lukasz Odzioba <lukasz.odzioba@intel.com>
> > Signed-off-by: Kan Liang <kan.liang@intel.com>
> > Tested-by: Lukasz Odzioba <lukasz.odzioba@intel.com>
> > ---
> >  arch/x86/events/core.c       | 23 +++++++++++++++++++----
> >  arch/x86/events/intel/core.c |  4 ++--
> >  arch/x86/events/intel/p4.c   |  2 +-
> >  arch/x86/events/perf_event.h |  2 +-
> >  4 files changed, 23 insertions(+), 8 deletions(-)
> >
> > diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> > index 6c3b0ef..c351ac4 100644
> > --- a/arch/x86/events/core.c
> > +++ b/arch/x86/events/core.c
> > @@ -63,7 +63,7 @@ u64 __read_mostly hw_cache_extra_regs
> >   * Can only be executed on the CPU where the event is active.
> >   * Returns the delta events processed.
> >   */
> > -u64 x86_perf_event_update(struct perf_event *event)
> > +u64 x86_perf_event_update(struct perf_event *event, bool pmi)
> >  {
> >         struct hw_perf_event *hwc = &event->hw;
> >         int shift = 64 - x86_pmu.cntval_bits;
> > @@ -100,6 +100,21 @@ u64 x86_perf_event_update(struct perf_event *event)
> >         delta = (new_raw_count << shift) - (prev_raw_count << shift);
> >         delta >>= shift;
> >
> > +       /*
> > +        * Correct delta for special cases caused by late PMI handling.
> > +        * Case1: The PMU counter may start from a large value.
> > +        *        In the PMI, new < prev, so delta is negative.
> > +        * Case2: new is large enough to affect the sign bit.
> > +        *        delta will be negative after the arithmetic right shift.
> > +        * Case3: In the PMI, new > prev.
> > +        *        new - prev loses the max count value.
> > +        *
> > +        * An event update may also land in the gap between the overflow
> > +        * and the PMI, but that rarely happens and is not handled here.
> > +        */
> > +       if (((delta > 0) && pmi) || (delta < 0))
> > +               delta += x86_pmu.cntval_mask + 1;
> > +
> >         local64_add(delta, &event->count);
> >         local64_sub(delta, &hwc->period_left);
> >
> > @@ -1342,7 +1357,7 @@ void x86_pmu_stop(struct perf_event *event, int flags)
> >                  * Drain the remaining delta count out of a event
> >                  * that we are disabling:
> >                  */
> > -               x86_perf_event_update(event);
> > +               x86_perf_event_update(event, (flags == 0));
> >                 hwc->state |= PERF_HES_UPTODATE;
> >         }
> >  }
> > @@ -1446,7 +1461,7 @@ int x86_pmu_handle_irq(struct pt_regs *regs)
> >
> >                 event = cpuc->events[idx];
> >
> > -               val = x86_perf_event_update(event);
> > +               val = x86_perf_event_update(event, true);
> >                 if (val & (1ULL << (x86_pmu.cntval_bits - 1)))
> >                         continue;
> >
> > @@ -1867,7 +1882,7 @@ early_initcall(init_hw_perf_events);
> >
> >  static inline void x86_pmu_read(struct perf_event *event)
> >  {
> > -       x86_perf_event_update(event);
> > +       x86_perf_event_update(event, false);
> >  }
> >
> >  /*
> > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> > index a74a2db..69d65e6 100644
> > --- a/arch/x86/events/intel/core.c
> > +++ b/arch/x86/events/intel/core.c
> > @@ -1830,7 +1830,7 @@ static void intel_pmu_nhm_workaround(void)
> >         for (i = 0; i < 4; i++) {
> >                 event = cpuc->events[i];
> >                 if (event)
> > -                       x86_perf_event_update(event);
> > +                       x86_perf_event_update(event, false);
> >         }
> >
> >         for (i = 0; i < 4; i++) {
> > @@ -2002,7 +2002,7 @@ static void intel_pmu_add_event(struct perf_event *event)
> >   */
> >  int intel_pmu_save_and_restart(struct perf_event *event)
> >  {
> > -       x86_perf_event_update(event);
> > +       x86_perf_event_update(event, true);
> >         /*
> >          * For a checkpointed counter always reset back to 0.  This
> >          * avoids a situation where the counter overflows, aborts the
> > diff --git a/arch/x86/events/intel/p4.c b/arch/x86/events/intel/p4.c
> > index eb05335..8117de8 100644
> > --- a/arch/x86/events/intel/p4.c
> > +++ b/arch/x86/events/intel/p4.c
> > @@ -1024,7 +1024,7 @@ static int p4_pmu_handle_irq(struct pt_regs *regs)
> >                 /* it might be unflagged overflow */
> >                 overflow = p4_pmu_clear_cccr_ovf(hwc);
> >
> > -               val = x86_perf_event_update(event);
> > +               val = x86_perf_event_update(event, true);
> >                 if (!overflow && (val & (1ULL << (x86_pmu.cntval_bits - 1))))
> >                         continue;
> >
> > diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
> > index c6b25ac..09c9db2 100644
> > --- a/arch/x86/events/perf_event.h
> > +++ b/arch/x86/events/perf_event.h
> > @@ -712,7 +712,7 @@ extern u64 __read_mostly hw_cache_extra_regs
> >                                 [PERF_COUNT_HW_CACHE_OP_MAX]
> >                                 [PERF_COUNT_HW_CACHE_RESULT_MAX];
> >
> > -u64 x86_perf_event_update(struct perf_event *event);
> > +u64 x86_perf_event_update(struct perf_event *event, bool pmi);
> >
> >  static inline unsigned int x86_pmu_config_addr(int index)
> >  {
> > --
> > 2.4.3
> >
