linux-kernel.vger.kernel.org archive mirror
* [PATCH] perf/x86: fix event counter update issue
@ 2016-11-28 19:26 kan.liang
  2016-11-28 19:41 ` Stephane Eranian
  2016-11-29  9:25 ` Peter Zijlstra
  0 siblings, 2 replies; 27+ messages in thread
From: kan.liang @ 2016-11-28 19:26 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel
  Cc: eranian, alexander.shishkin, ak, lukasz.odzioba, Kan Liang

From: Kan Liang <kan.liang@intel.com>

Fixes: ec3232bdf851 ("perf_counter: x86: More accurate counter update")

A bug can be easily reproduced by running "perf stat" on a simple loop
program on KNL/SLM. The loop program is just "for (i = 0; i < loops; i++);",
where "loops" is set to a different value for each test.
Test 1:
perf stat -x, -C1 -e cycles -- taskset 0x2 ./loop 60000000000
        451739576564,,cycles,313694125185,100.00
Test 2:
perf stat -x, -C1 -e cycles -- taskset 0x2 ./loop 100000000000
        18446743727079256064,,cycles,313694125185,100.00
Test 3:
perf stat -x, -C1 -e cycles -- taskset 0x2 ./loop 200000000000
        406240041973,,cycles,313694125185,100.00
Only test 1 produces the correct result. The result of test 2 is far
too big, and the result of test 3 is too small.

The bug shows up more frequently on KNL/SLM due to their narrower
counters, but it is a generic issue that affects all platforms.

The issue is caused by x86_perf_event_update() miscalculating the
counter increment (delta) in some cases.

Here, all the possible failure cases are listed.
Terms:
    - new: the current PMU counter value, read via rdpmcl().
    - prev: the previous counter value, stored in &hwc->prev_count.
    - in PMI/not in PMI: whether the event update happens in the PMI
      handler or not.
Current code to calculate delta:
    delta = (new << shift) - (prev << shift);
    delta >>= shift;

Case A: Not in PMI.  new > prev. But delta is negative.
   That's the failure case of Test 2.
   delta is of type s64; new and prev are u64. If new is big enough,
   the top bit (bit 63) of the left-shift-and-subtract result may still
   be 1.
   Interpreted as s64, that sets the sign bit, and since delta is
   signed the subsequent right shift is arithmetic, which copies the
   sign bit into the vacated upper bit positions of delta.
   It can be fixed by adding the max count value.

   Here is the real data for Test 2 on KNL:
   new = aea96e1927
   prev = ffffff0000000001
   delta = aea96e1927000000 - 1000000 = aea96e1926000000
   aea96e1926000000 >> 24 = ffffffaea96e1926   <<  negative delta
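
   For illustration, the arithmetic above can be reproduced with a small
   stand-alone user-space program (a sketch only, not kernel code; it
   hard-codes the 40-bit counter width and the Test 2 values):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t new  = 0xaea96e1927ULL;        /* value read via rdpmc */
        uint64_t prev = 0xffffff0000000001ULL;  /* hwc->prev_count */
        int shift = 64 - 40;                    /* 40-bit KNL/SLM counter */
        int64_t delta;

        delta = (new << shift) - (prev << shift);
        /* right shift of a negative s64 is arithmetic on gcc/x86 */
        delta >>= shift;

        printf("%llx\n", (unsigned long long)delta);
        /* prints ffffffaea96e1926, i.e. a negative delta */
        return 0;
    }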

Case B: In PMI. new > prev. delta is positive.
   That's the failure case of Test 3.
   The PMI is triggered by an overflow, but there is latency between
   the overflow and the PMI handler, so new holds only a small value.
   The current calculation loses the max count value.

Case C: In PMI. new < prev. delta is negative.
   The PMU counter may start from a big value, e.g. when the fixed
   period is small.
   It can be fixed by adding the max count value.

There is still a case where delta could be wrong: when the event update
happens in the gap between the overflow and the PMI. It is not in the
PMI, new > prev, and delta is positive, yet the current calculation may
still lose the max count value.
This case rarely happens, so it is not handled here.
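
As a cross-check, here is a minimal user-space model of the corrected
calculation (the condition and constant mirror the hunk in the patch
below; the Case B numbers are made up to show a counter that wrapped
shortly before the PMI):

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define CNTVAL_BITS 40                      /* KNL/SLM counter width */
    #define CNTVAL_MASK ((1ULL << CNTVAL_BITS) - 1)

    static int64_t calc_delta(uint64_t prev, uint64_t new, bool pmi)
    {
        int shift = 64 - CNTVAL_BITS;
        int64_t delta = (new << shift) - (prev << shift);

        delta >>= shift;
        /* proposed correction: add back one full counter period */
        if ((delta > 0 && pmi) || delta < 0)
            delta += CNTVAL_MASK + 1;
        return delta;
    }

    int main(void)
    {
        /* Case A (Test 2): not in PMI, raw delta would be negative */
        printf("%llx\n", (unsigned long long)
               calc_delta(0xffffff0000000001ULL, 0xaea96e1927ULL, false));
        /* prints aea96e1926 */

        /* Case B (made-up values): in the PMI, shortly after the wrap */
        printf("%llx\n", (unsigned long long)
               calc_delta(0xffffff0000000001ULL, 0x25ULL, true));
        /* prints 10000000024: the lost 2^40 is added back */
        return 0;
    }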

Reported-by: Lukasz Odzioba <lukasz.odzioba@intel.com>
Signed-off-by: Kan Liang <kan.liang@intel.com>
Tested-by: Lukasz Odzioba <lukasz.odzioba@intel.com>
---
 arch/x86/events/core.c       | 23 +++++++++++++++++++----
 arch/x86/events/intel/core.c |  4 ++--
 arch/x86/events/intel/p4.c   |  2 +-
 arch/x86/events/perf_event.h |  2 +-
 4 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 6c3b0ef..c351ac4 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -63,7 +63,7 @@ u64 __read_mostly hw_cache_extra_regs
  * Can only be executed on the CPU where the event is active.
  * Returns the delta events processed.
  */
-u64 x86_perf_event_update(struct perf_event *event)
+u64 x86_perf_event_update(struct perf_event *event, bool pmi)
 {
 	struct hw_perf_event *hwc = &event->hw;
 	int shift = 64 - x86_pmu.cntval_bits;
@@ -100,6 +100,21 @@ u64 x86_perf_event_update(struct perf_event *event)
 	delta = (new_raw_count << shift) - (prev_raw_count << shift);
 	delta >>= shift;
 
+	/*
+	 * Correct delta for special cases caused by the late PMI handle.
+	 * Case1: PMU counter may be start from a big value.
+	 *	  In PMI, new < prev. delta is negative.
+	 * Case2: new is big enough which impact the sign flag.
+	 *	  The delta will be negative after arithmetic right shift.
+	 * Case3: In PMI, new > prev.
+	 *	  The new - prev lose the max count value.
+	 *
+	 * There may be event update in PMI and overflow gap,
+	 * but it rarely happen. The rare case doesn't handle here.
+	 */
+	if (((delta > 0) && pmi) || (delta < 0))
+		delta += x86_pmu.cntval_mask + 1;
+
 	local64_add(delta, &event->count);
 	local64_sub(delta, &hwc->period_left);
 
@@ -1342,7 +1357,7 @@ void x86_pmu_stop(struct perf_event *event, int flags)
 		 * Drain the remaining delta count out of a event
 		 * that we are disabling:
 		 */
-		x86_perf_event_update(event);
+		x86_perf_event_update(event, (flags == 0));
 		hwc->state |= PERF_HES_UPTODATE;
 	}
 }
@@ -1446,7 +1461,7 @@ int x86_pmu_handle_irq(struct pt_regs *regs)
 
 		event = cpuc->events[idx];
 
-		val = x86_perf_event_update(event);
+		val = x86_perf_event_update(event, true);
 		if (val & (1ULL << (x86_pmu.cntval_bits - 1)))
 			continue;
 
@@ -1867,7 +1882,7 @@ early_initcall(init_hw_perf_events);
 
 static inline void x86_pmu_read(struct perf_event *event)
 {
-	x86_perf_event_update(event);
+	x86_perf_event_update(event, false);
 }
 
 /*
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index a74a2db..69d65e6 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -1830,7 +1830,7 @@ static void intel_pmu_nhm_workaround(void)
 	for (i = 0; i < 4; i++) {
 		event = cpuc->events[i];
 		if (event)
-			x86_perf_event_update(event);
+			x86_perf_event_update(event, false);
 	}
 
 	for (i = 0; i < 4; i++) {
@@ -2002,7 +2002,7 @@ static void intel_pmu_add_event(struct perf_event *event)
  */
 int intel_pmu_save_and_restart(struct perf_event *event)
 {
-	x86_perf_event_update(event);
+	x86_perf_event_update(event, true);
 	/*
 	 * For a checkpointed counter always reset back to 0.  This
 	 * avoids a situation where the counter overflows, aborts the
diff --git a/arch/x86/events/intel/p4.c b/arch/x86/events/intel/p4.c
index eb05335..8117de8 100644
--- a/arch/x86/events/intel/p4.c
+++ b/arch/x86/events/intel/p4.c
@@ -1024,7 +1024,7 @@ static int p4_pmu_handle_irq(struct pt_regs *regs)
 		/* it might be unflagged overflow */
 		overflow = p4_pmu_clear_cccr_ovf(hwc);
 
-		val = x86_perf_event_update(event);
+		val = x86_perf_event_update(event, true);
 		if (!overflow && (val & (1ULL << (x86_pmu.cntval_bits - 1))))
 			continue;
 
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index c6b25ac..09c9db2 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -712,7 +712,7 @@ extern u64 __read_mostly hw_cache_extra_regs
 				[PERF_COUNT_HW_CACHE_OP_MAX]
 				[PERF_COUNT_HW_CACHE_RESULT_MAX];
 
-u64 x86_perf_event_update(struct perf_event *event);
+u64 x86_perf_event_update(struct perf_event *event, bool pmi);
 
 static inline unsigned int x86_pmu_config_addr(int index)
 {
-- 
2.4.3


* Re: [PATCH] perf/x86: fix event counter update issue
  2016-11-28 19:26 [PATCH] perf/x86: fix event counter update issue kan.liang
@ 2016-11-28 19:41 ` Stephane Eranian
  2016-11-28 19:59   ` Liang, Kan
  2016-11-29  9:25 ` Peter Zijlstra
  1 sibling, 1 reply; 27+ messages in thread
From: Stephane Eranian @ 2016-11-28 19:41 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Peter Zijlstra, mingo, LKML, Alexander Shishkin, ak, Odzioba, Lukasz

On Mon, Nov 28, 2016 at 11:26 AM, <kan.liang@intel.com> wrote:
>
> From: Kan Liang <kan.liang@intel.com>
>
> Fixes: ec3232bdf851 ("perf_counter: x86: More accurate counter update")
>
> A bug can be easily reproduced by perf stat loop program on KNL/SLM.
> The loop program is just "for (i=0;i<loops;i++);"
> "loops" is set to different value during the test.
> Test 1:
> perf stat -x, -C1 -e cycles -- taskset 0x2 ./loop 60000000000
>         451739576564,,cycles,313694125185,100.00
> Test 2:
> perf stat -x, -C1 -e cycles -- taskset 0x2 ./loop 100000000000
>         18446743727079256064,,cycles,313694125185,100.00
> Test 3:
> perf stat -x, -C1 -e cycles -- taskset 0x2 ./loop 200000000000
>         406240041973,,cycles,313694125185,100.00
> Only test 1 has correct result.
> The result for test 2 is too big. Test 3 is too small.
>
> The bug happens more frequently on KNL/SLM due to narrower counters,
> but should be generic issue on all platforms.
>
> The issues are caused by mistakenly calculating the counter
> increment (delta) in x86_perf_event_update for some cases.
>
> Here, all the possible failure cases are listed.
> Terms:
>     - new: current PMU counter value which read from rdpmcl.
>     - prev: previous counter value which is stored in &hwc->prev_count.
>     - in PMI/not in PMI: the event update happens in PMI handler or not.
> Current code to calculate delta:
>     delta = (new << shift) - (prev << shift);
>     delta >>= shift;
>
> Case A: Not in PMI.  new > prev. But delta is negative.
>    That's the failure case of Test 2.
>    delta is s64 type. new and prev are u64 type. If the new is big
>    enough, after doing left shift and sub, the bit 64 of the result may
>    still be 1.
>    After converting to s64, the sign flag will be set. Since delta is
>    s64 type, arithmetic right shift is applied, which copy the sign flag
>    into empty bit positions on the upper end of delta.
>    It can be fixed by adding the max count value.
>
>    Here is the real data for test2 on KNL.
>    new = aea96e1927
>    prev = ffffff0000000001


How can new be so large here?
I mean between prev and new, the counter wrapped around. And it took
that many occurrences (of cycles) to handle the overflow?

>
>    delta = aea96e1927000000 - 1000000 = aea96e1926000000
>    aea96e1926000000 >> 24 = ffffffaea96e1926   <<  negative delta
>
> Case B: In PMI. new > prev. delta is positive.
>    That's the failure case of Test 3.
>    The PMI is triggered by overflow. But there is latency between
>    overflow and PMI handler. So new has small amount.
>    Current calculation lose the max count value.
>
> Case C: In PMI. new < prev. delta is negative.
>    The PMU counter may be start from a big value. E.g. the fixed period
>    is small.
>    It can be fixed by adding the max count value.
>
I am not sure I understand why PMI should matter here. What matters is
prev vs. current and the wrap-around.
Please explain.
Thanks.

>
> There is still a case which delta could be wrong.
> The case is that event update just happens in PMI and overflow gap. It's
> not in PMI, new > prev, and delta is positive. Current calculation may
> lose the max count value.
> But this case rarely happens. So the rare case doesn't handle here.
>
> Reported-by: Lukasz Odzioba <lukasz.odzioba@intel.com>
> Signed-off-by: Kan Liang <kan.liang@intel.com>
> Tested-by: Lukasz Odzioba <lukasz.odzioba@intel.com>
> ---
>  arch/x86/events/core.c       | 23 +++++++++++++++++++----
>  arch/x86/events/intel/core.c |  4 ++--
>  arch/x86/events/intel/p4.c   |  2 +-
>  arch/x86/events/perf_event.h |  2 +-
>  4 files changed, 23 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 6c3b0ef..c351ac4 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -63,7 +63,7 @@ u64 __read_mostly hw_cache_extra_regs
>   * Can only be executed on the CPU where the event is active.
>   * Returns the delta events processed.
>   */
> -u64 x86_perf_event_update(struct perf_event *event)
> +u64 x86_perf_event_update(struct perf_event *event, bool pmi)
>  {
>         struct hw_perf_event *hwc = &event->hw;
>         int shift = 64 - x86_pmu.cntval_bits;
> @@ -100,6 +100,21 @@ u64 x86_perf_event_update(struct perf_event *event)
>         delta = (new_raw_count << shift) - (prev_raw_count << shift);
>         delta >>= shift;
>
> +       /*
> +        * Correct delta for special cases caused by the late PMI handle.
> +        * Case1: PMU counter may be start from a big value.
> +        *        In PMI, new < prev. delta is negative.
> +        * Case2: new is big enough which impact the sign flag.
> +        *        The delta will be negative after arithmetic right shift.
> +        * Case3: In PMI, new > prev.
> +        *        The new - prev lose the max count value.
> +        *
> +        * There may be event update in PMI and overflow gap,
> +        * but it rarely happen. The rare case doesn't handle here.
> +        */
> +       if (((delta > 0) && pmi) || (delta < 0))
> +               delta += x86_pmu.cntval_mask + 1;
> +
>         local64_add(delta, &event->count);
>         local64_sub(delta, &hwc->period_left);
>
> @@ -1342,7 +1357,7 @@ void x86_pmu_stop(struct perf_event *event, int flags)
>                  * Drain the remaining delta count out of a event
>                  * that we are disabling:
>                  */
> -               x86_perf_event_update(event);
> +               x86_perf_event_update(event, (flags == 0));
>                 hwc->state |= PERF_HES_UPTODATE;
>         }
>  }
> @@ -1446,7 +1461,7 @@ int x86_pmu_handle_irq(struct pt_regs *regs)
>
>                 event = cpuc->events[idx];
>
> -               val = x86_perf_event_update(event);
> +               val = x86_perf_event_update(event, true);
>                 if (val & (1ULL << (x86_pmu.cntval_bits - 1)))
>                         continue;
>
> @@ -1867,7 +1882,7 @@ early_initcall(init_hw_perf_events);
>
>  static inline void x86_pmu_read(struct perf_event *event)
>  {
> -       x86_perf_event_update(event);
> +       x86_perf_event_update(event, false);
>  }
>
>  /*
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index a74a2db..69d65e6 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -1830,7 +1830,7 @@ static void intel_pmu_nhm_workaround(void)
>         for (i = 0; i < 4; i++) {
>                 event = cpuc->events[i];
>                 if (event)
> -                       x86_perf_event_update(event);
> +                       x86_perf_event_update(event, false);
>         }
>
>         for (i = 0; i < 4; i++) {
> @@ -2002,7 +2002,7 @@ static void intel_pmu_add_event(struct perf_event *event)
>   */
>  int intel_pmu_save_and_restart(struct perf_event *event)
>  {
> -       x86_perf_event_update(event);
> +       x86_perf_event_update(event, true);
>         /*
>          * For a checkpointed counter always reset back to 0.  This
>          * avoids a situation where the counter overflows, aborts the
> diff --git a/arch/x86/events/intel/p4.c b/arch/x86/events/intel/p4.c
> index eb05335..8117de8 100644
> --- a/arch/x86/events/intel/p4.c
> +++ b/arch/x86/events/intel/p4.c
> @@ -1024,7 +1024,7 @@ static int p4_pmu_handle_irq(struct pt_regs *regs)
>                 /* it might be unflagged overflow */
>                 overflow = p4_pmu_clear_cccr_ovf(hwc);
>
> -               val = x86_perf_event_update(event);
> +               val = x86_perf_event_update(event, true);
>                 if (!overflow && (val & (1ULL << (x86_pmu.cntval_bits - 1))))
>                         continue;
>
> diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
> index c6b25ac..09c9db2 100644
> --- a/arch/x86/events/perf_event.h
> +++ b/arch/x86/events/perf_event.h
> @@ -712,7 +712,7 @@ extern u64 __read_mostly hw_cache_extra_regs
>                                 [PERF_COUNT_HW_CACHE_OP_MAX]
>                                 [PERF_COUNT_HW_CACHE_RESULT_MAX];
>
> -u64 x86_perf_event_update(struct perf_event *event);
> +u64 x86_perf_event_update(struct perf_event *event, bool pmi);
>
>  static inline unsigned int x86_pmu_config_addr(int index)
>  {
> --
> 2.4.3
>


* RE: [PATCH] perf/x86: fix event counter update issue
  2016-11-28 19:41 ` Stephane Eranian
@ 2016-11-28 19:59   ` Liang, Kan
  2016-11-28 20:18     ` Stephane Eranian
  0 siblings, 1 reply; 27+ messages in thread
From: Liang, Kan @ 2016-11-28 19:59 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Peter Zijlstra, mingo, LKML, Alexander Shishkin, ak, Odzioba, Lukasz



> >
> > Here, all the possible failure cases are listed.
> > Terms:
> >     - new: current PMU counter value which read from rdpmcl.
> >     - prev: previous counter value which is stored in &hwc->prev_count.
> >     - in PMI/not in PMI: the event update happens in PMI handler or not.
> > Current code to calculate delta:
> >     delta = (new << shift) - (prev << shift);
> >     delta >>= shift;
> >
> > Case A: Not in PMI.  new > prev. But delta is negative.
> >    That's the failure case of Test 2.
> >    delta is s64 type. new and prev are u64 type. If the new is big
> >    enough, after doing left shift and sub, the bit 64 of the result may
> >    still be 1.
> >    After converting to s64, the sign flag will be set. Since delta is
> >    s64 type, arithmetic right shift is applied, which copy the sign flag
> >    into empty bit positions on the upper end of delta.
> >    It can be fixed by adding the max count value.
> >
> >    Here is the real data for test2 on KNL.
> >    new = aea96e1927
> >    prev = ffffff0000000001
> 
> 
> How can new be so large here?
> I mean between prev and new, the counter wrapped around. And it took
> that many occurrences (of cycles) to handle the overflow?

There is no overflow in this case.
The counter is 40 bits wide. The highest value that can be counted
without overflow is 0xfffffffffe.

> 
> >
> >    delta = aea96e1927000000 - 1000000 = aea96e1926000000
> >    aea96e1926000000 >> 24 = ffffffaea96e1926   <<  negative delta
> >
> > Case B: In PMI. new > prev. delta is positive.
> >    That's the failure case of Test 3.
> >    The PMI is triggered by overflow. But there is latency between
> >    overflow and PMI handler. So new has small amount.
> >    Current calculation lose the max count value.
> >
> > Case C: In PMI. new < prev. delta is negative.
> >    The PMU counter may be start from a big value. E.g. the fixed period
> >    is small.
> >    It can be fixed by adding the max count value.
> >
> I am not sure I understand why PMI should matter here. What matters is
> prev vs. current and the wrap-around.
> Please explain.
> Thanks.

Right, the PMI shouldn't matter here.
We should add the max count value whenever delta is negative, whether or not we are in a PMI.

To fix this issue, I once had a list which included all the possible cases.
"non-PMI, new < prev, delta is negative" is one of them, but it rarely happens.
So I removed it, but forgot to update the description of Case C.

Thanks,
Kan

> 
> >
> > There is still a case which delta could be wrong.
> > The case is that event update just happens in PMI and overflow gap.
> > It's not in PMI, new > prev, and delta is positive. Current
> > calculation may lose the max count value.
> > But this case rarely happens. So the rare case doesn't handle here.
> >
> > Reported-by: Lukasz Odzioba <lukasz.odzioba@intel.com>
> > Signed-off-by: Kan Liang <kan.liang@intel.com>
> > Tested-by: Lukasz Odzioba <lukasz.odzioba@intel.com>
> > ---
> >  arch/x86/events/core.c       | 23 +++++++++++++++++++----
> >  arch/x86/events/intel/core.c |  4 ++--
> >  arch/x86/events/intel/p4.c   |  2 +-
> >  arch/x86/events/perf_event.h |  2 +-
> >  4 files changed, 23 insertions(+), 8 deletions(-)
> >
> > diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index
> > 6c3b0ef..c351ac4 100644
> > --- a/arch/x86/events/core.c
> > +++ b/arch/x86/events/core.c
> > @@ -63,7 +63,7 @@ u64 __read_mostly hw_cache_extra_regs
> >   * Can only be executed on the CPU where the event is active.
> >   * Returns the delta events processed.
> >   */
> > -u64 x86_perf_event_update(struct perf_event *event)
> > +u64 x86_perf_event_update(struct perf_event *event, bool pmi)
> >  {
> >         struct hw_perf_event *hwc = &event->hw;
> >         int shift = 64 - x86_pmu.cntval_bits; @@ -100,6 +100,21 @@ u64
> > x86_perf_event_update(struct perf_event *event)
> >         delta = (new_raw_count << shift) - (prev_raw_count << shift);
> >         delta >>= shift;
> >
> > +       /*
> > +        * Correct delta for special cases caused by the late PMI handle.
> > +        * Case1: PMU counter may be start from a big value.
> > +        *        In PMI, new < prev. delta is negative.
> > +        * Case2: new is big enough which impact the sign flag.
> > +        *        The delta will be negative after arithmetic right shift.
> > +        * Case3: In PMI, new > prev.
> > +        *        The new - prev lose the max count value.
> > +        *
> > +        * There may be event update in PMI and overflow gap,
> > +        * but it rarely happen. The rare case doesn't handle here.
> > +        */
> > +       if (((delta > 0) && pmi) || (delta < 0))
> > +               delta += x86_pmu.cntval_mask + 1;
> > +
> >         local64_add(delta, &event->count);
> >         local64_sub(delta, &hwc->period_left);
> >
> > @@ -1342,7 +1357,7 @@ void x86_pmu_stop(struct perf_event *event,
> int flags)
> >                  * Drain the remaining delta count out of a event
> >                  * that we are disabling:
> >                  */
> > -               x86_perf_event_update(event);
> > +               x86_perf_event_update(event, (flags == 0));
> >                 hwc->state |= PERF_HES_UPTODATE;
> >         }
> >  }
> > @@ -1446,7 +1461,7 @@ int x86_pmu_handle_irq(struct pt_regs *regs)
> >
> >                 event = cpuc->events[idx];
> >
> > -               val = x86_perf_event_update(event);
> > +               val = x86_perf_event_update(event, true);
> >                 if (val & (1ULL << (x86_pmu.cntval_bits - 1)))
> >                         continue;
> >
> > @@ -1867,7 +1882,7 @@ early_initcall(init_hw_perf_events);
> >
> >  static inline void x86_pmu_read(struct perf_event *event)  {
> > -       x86_perf_event_update(event);
> > +       x86_perf_event_update(event, false);
> >  }
> >
> >  /*
> > diff --git a/arch/x86/events/intel/core.c
> > b/arch/x86/events/intel/core.c index a74a2db..69d65e6 100644
> > --- a/arch/x86/events/intel/core.c
> > +++ b/arch/x86/events/intel/core.c
> > @@ -1830,7 +1830,7 @@ static void intel_pmu_nhm_workaround(void)
> >         for (i = 0; i < 4; i++) {
> >                 event = cpuc->events[i];
> >                 if (event)
> > -                       x86_perf_event_update(event);
> > +                       x86_perf_event_update(event, false);
> >         }
> >
> >         for (i = 0; i < 4; i++) {
> > @@ -2002,7 +2002,7 @@ static void intel_pmu_add_event(struct
> perf_event *event)
> >   */
> >  int intel_pmu_save_and_restart(struct perf_event *event)  {
> > -       x86_perf_event_update(event);
> > +       x86_perf_event_update(event, true);
> >         /*
> >          * For a checkpointed counter always reset back to 0.  This
> >          * avoids a situation where the counter overflows, aborts the
> > diff --git a/arch/x86/events/intel/p4.c b/arch/x86/events/intel/p4.c
> > index eb05335..8117de8 100644
> > --- a/arch/x86/events/intel/p4.c
> > +++ b/arch/x86/events/intel/p4.c
> > @@ -1024,7 +1024,7 @@ static int p4_pmu_handle_irq(struct pt_regs
> *regs)
> >                 /* it might be unflagged overflow */
> >                 overflow = p4_pmu_clear_cccr_ovf(hwc);
> >
> > -               val = x86_perf_event_update(event);
> > +               val = x86_perf_event_update(event, true);
> >                 if (!overflow && (val & (1ULL << (x86_pmu.cntval_bits - 1))))
> >                         continue;
> >
> > diff --git a/arch/x86/events/perf_event.h
> > b/arch/x86/events/perf_event.h index c6b25ac..09c9db2 100644
> > --- a/arch/x86/events/perf_event.h
> > +++ b/arch/x86/events/perf_event.h
> > @@ -712,7 +712,7 @@ extern u64 __read_mostly hw_cache_extra_regs
> >                                 [PERF_COUNT_HW_CACHE_OP_MAX]
> >                                 [PERF_COUNT_HW_CACHE_RESULT_MAX];
> >
> > -u64 x86_perf_event_update(struct perf_event *event);
> > +u64 x86_perf_event_update(struct perf_event *event, bool pmi);
> >
> >  static inline unsigned int x86_pmu_config_addr(int index)  {
> > --
> > 2.4.3
> >


* Re: [PATCH] perf/x86: fix event counter update issue
  2016-11-28 19:59   ` Liang, Kan
@ 2016-11-28 20:18     ` Stephane Eranian
  2016-11-28 20:23       ` Liang, Kan
  0 siblings, 1 reply; 27+ messages in thread
From: Stephane Eranian @ 2016-11-28 20:18 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Peter Zijlstra, mingo, LKML, Alexander Shishkin, ak, Odzioba, Lukasz

On Mon, Nov 28, 2016 at 11:59 AM, Liang, Kan <kan.liang@intel.com> wrote:
>
>
>> >
>> > Here, all the possible failure cases are listed.
>> > Terms:
>> >     - new: current PMU counter value which read from rdpmcl.
>> >     - prev: previous counter value which is stored in &hwc->prev_count.
>> >     - in PMI/not in PMI: the event update happens in PMI handler or not.
>> > Current code to calculate delta:
>> >     delta = (new << shift) - (prev << shift);
>> >     delta >>= shift;
>> >
>> > Case A: Not in PMI.  new > prev. But delta is negative.
>> >    That's the failure case of Test 2.
>> >    delta is s64 type. new and prev are u64 type. If the new is big
>> >    enough, after doing left shift and sub, the bit 64 of the result may
>> >    still be 1.
>> >    After converting to s64, the sign flag will be set. Since delta is
>> >    s64 type, arithmetic right shift is applied, which copy the sign flag
>> >    into empty bit positions on the upper end of delta.
>> >    It can be fixed by adding the max count value.
>> >
>> >    Here is the real data for test2 on KNL.
>> >    new = aea96e1927
>> >    prev = ffffff0000000001
>>
>>
>> How can new be so large here?
>> I mean between prev and new, the counter wrapped around. And it took
>> that many occurrences (of cycles) to handle the overflow?
>
> There is no overflow in this case.
> The counter is 40bit width. The highest value which can be count without
> overflow is 0xfffffffffe
>
>>
>> >
>> >    delta = aea96e1927000000 - 1000000 = aea96e1926000000
>> >    aea96e1926000000 >> 24 = ffffffaea96e1926   <<  negative delta
>> >
>> > Case B: In PMI. new > prev. delta is positive.
>> >    That's the failure case of Test 3.
>> >    The PMI is triggered by overflow. But there is latency between
>> >    overflow and PMI handler. So new has small amount.
>> >    Current calculation lose the max count value.
>> >
>> > Case C: In PMI. new < prev. delta is negative.
>> >    The PMU counter may be start from a big value. E.g. the fixed period
>> >    is small.
>> >    It can be fixed by adding the max count value.
>> >
>> I am not sure I understand why PMI should matter here. What matters is
>> prev vs. current and the wrap-around.
>> Please explain.
>> Thanks.
>
> Right, PMI shouldn't matter here.
> We should add max count value if delta is negative, no matter if it’s in PMI.
>
That sounds right, but then why do you have that boolean (bool pmi) in
your patch?

> To fix this issue, I once had a list which include all the possible cases.
> "non-PMI, new < prev, delta is negative" is one of them. But it rarely happen.
> So I remove it, but forget to modify the description of Case C.
>
> Thanks,
> Kan
>
>>
>> >
>> > There is still a case which delta could be wrong.
>> > The case is that event update just happens in PMI and overflow gap.
>> > It's not in PMI, new > prev, and delta is positive. Current
>> > calculation may lose the max count value.
>> > But this case rarely happens. So the rare case doesn't handle here.
>> >
>> > Reported-by: Lukasz Odzioba <lukasz.odzioba@intel.com>
>> > Signed-off-by: Kan Liang <kan.liang@intel.com>
>> > Tested-by: Lukasz Odzioba <lukasz.odzioba@intel.com>
>> > ---
>> >  arch/x86/events/core.c       | 23 +++++++++++++++++++----
>> >  arch/x86/events/intel/core.c |  4 ++--
>> >  arch/x86/events/intel/p4.c   |  2 +-
>> >  arch/x86/events/perf_event.h |  2 +-
>> >  4 files changed, 23 insertions(+), 8 deletions(-)
>> >
>> > diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index
>> > 6c3b0ef..c351ac4 100644
>> > --- a/arch/x86/events/core.c
>> > +++ b/arch/x86/events/core.c
>> > @@ -63,7 +63,7 @@ u64 __read_mostly hw_cache_extra_regs
>> >   * Can only be executed on the CPU where the event is active.
>> >   * Returns the delta events processed.
>> >   */
>> > -u64 x86_perf_event_update(struct perf_event *event)
>> > +u64 x86_perf_event_update(struct perf_event *event, bool pmi)
>> >  {
>> >         struct hw_perf_event *hwc = &event->hw;
>> >         int shift = 64 - x86_pmu.cntval_bits; @@ -100,6 +100,21 @@ u64
>> > x86_perf_event_update(struct perf_event *event)
>> >         delta = (new_raw_count << shift) - (prev_raw_count << shift);
>> >         delta >>= shift;
>> >
>> > +       /*
>> > +        * Correct delta for special cases caused by the late PMI handle.
>> > +        * Case1: PMU counter may be start from a big value.
>> > +        *        In PMI, new < prev. delta is negative.
>> > +        * Case2: new is big enough which impact the sign flag.
>> > +        *        The delta will be negative after arithmetic right shift.
>> > +        * Case3: In PMI, new > prev.
>> > +        *        The new - prev lose the max count value.
>> > +        *
>> > +        * There may be event update in PMI and overflow gap,
>> > +        * but it rarely happen. The rare case doesn't handle here.
>> > +        */
>> > +       if (((delta > 0) && pmi) || (delta < 0))
>> > +               delta += x86_pmu.cntval_mask + 1;
>> > +
>> >         local64_add(delta, &event->count);
>> >         local64_sub(delta, &hwc->period_left);
>> >
>> > @@ -1342,7 +1357,7 @@ void x86_pmu_stop(struct perf_event *event,
>> int flags)
>> >                  * Drain the remaining delta count out of a event
>> >                  * that we are disabling:
>> >                  */
>> > -               x86_perf_event_update(event);
>> > +               x86_perf_event_update(event, (flags == 0));
>> >                 hwc->state |= PERF_HES_UPTODATE;
>> >         }
>> >  }
>> > @@ -1446,7 +1461,7 @@ int x86_pmu_handle_irq(struct pt_regs *regs)
>> >
>> >                 event = cpuc->events[idx];
>> >
>> > -               val = x86_perf_event_update(event);
>> > +               val = x86_perf_event_update(event, true);
>> >                 if (val & (1ULL << (x86_pmu.cntval_bits - 1)))
>> >                         continue;
>> >
>> > @@ -1867,7 +1882,7 @@ early_initcall(init_hw_perf_events);
>> >
>> >  static inline void x86_pmu_read(struct perf_event *event)  {
>> > -       x86_perf_event_update(event);
>> > +       x86_perf_event_update(event, false);
>> >  }
>> >
>> >  /*
>> > diff --git a/arch/x86/events/intel/core.c
>> > b/arch/x86/events/intel/core.c index a74a2db..69d65e6 100644
>> > --- a/arch/x86/events/intel/core.c
>> > +++ b/arch/x86/events/intel/core.c
>> > @@ -1830,7 +1830,7 @@ static void intel_pmu_nhm_workaround(void)
>> >         for (i = 0; i < 4; i++) {
>> >                 event = cpuc->events[i];
>> >                 if (event)
>> > -                       x86_perf_event_update(event);
>> > +                       x86_perf_event_update(event, false);
>> >         }
>> >
>> >         for (i = 0; i < 4; i++) {
>> > @@ -2002,7 +2002,7 @@ static void intel_pmu_add_event(struct
>> perf_event *event)
>> >   */
>> >  int intel_pmu_save_and_restart(struct perf_event *event)  {
>> > -       x86_perf_event_update(event);
>> > +       x86_perf_event_update(event, true);
>> >         /*
>> >          * For a checkpointed counter always reset back to 0.  This
>> >          * avoids a situation where the counter overflows, aborts the
>> > diff --git a/arch/x86/events/intel/p4.c b/arch/x86/events/intel/p4.c
>> > index eb05335..8117de8 100644
>> > --- a/arch/x86/events/intel/p4.c
>> > +++ b/arch/x86/events/intel/p4.c
>> > @@ -1024,7 +1024,7 @@ static int p4_pmu_handle_irq(struct pt_regs
>> *regs)
>> >                 /* it might be unflagged overflow */
>> >                 overflow = p4_pmu_clear_cccr_ovf(hwc);
>> >
>> > -               val = x86_perf_event_update(event);
>> > +               val = x86_perf_event_update(event, true);
>> >                 if (!overflow && (val & (1ULL << (x86_pmu.cntval_bits - 1))))
>> >                         continue;
>> >
>> > diff --git a/arch/x86/events/perf_event.h
>> > b/arch/x86/events/perf_event.h index c6b25ac..09c9db2 100644
>> > --- a/arch/x86/events/perf_event.h
>> > +++ b/arch/x86/events/perf_event.h
>> > @@ -712,7 +712,7 @@ extern u64 __read_mostly hw_cache_extra_regs
>> >                                 [PERF_COUNT_HW_CACHE_OP_MAX]
>> >                                 [PERF_COUNT_HW_CACHE_RESULT_MAX];
>> >
>> > -u64 x86_perf_event_update(struct perf_event *event);
>> > +u64 x86_perf_event_update(struct perf_event *event, bool pmi);
>> >
>> >  static inline unsigned int x86_pmu_config_addr(int index)  {
>> > --
>> > 2.4.3
>> >


* RE: [PATCH] perf/x86: fix event counter update issue
  2016-11-28 20:18     ` Stephane Eranian
@ 2016-11-28 20:23       ` Liang, Kan
  0 siblings, 0 replies; 27+ messages in thread
From: Liang, Kan @ 2016-11-28 20:23 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Peter Zijlstra, mingo, LKML, Alexander Shishkin, ak, Odzioba, Lukasz



> >> > Case B: In PMI. new > prev. delta is positive.
> >> >    That's the failure case of Test 3.
> >> >    The PMI is triggered by overflow. But there is latency between
> >> >    overflow and PMI handler. So new has small amount.
> >> >    Current calculation lose the max count value.
> >> >
> >> > Case C: In PMI. new < prev. delta is negative.
> >> >    The PMU counter may be start from a big value. E.g. the fixed period
> >> >    is small.
> >> >    It can be fixed by adding the max count value.
> >> >
> >> I am not sure I understand why PMI should matter here. What matters
> >> is prev vs. current and the wrap-around.
> >> Please explain.
> >> Thanks.
> >
> > Right, PMI shouldn't matter here.
> > We should add max count value if delta is negative, no matter if it’s in
> PMI.
> >
> That sounds right, but then why do you have that boolean (bool pmi) in
> your patch?

For Case B: we still need to add the max count value when new > prev in the PMI.

Thanks,
Kan

> 
> > To fix this issue, I once had a list which include all the possible cases.
> > "non-PMI, new < prev, delta is negative" is one of them. But it rarely
> happen.
> > So I remove it, but forget to modify the description of Case C.
> >
> > Thanks,
> > Kan
> >
> >>
> >> >
> >> > There is still a case which delta could be wrong.
> >> > The case is that event update just happens in PMI and overflow gap.
> >> > It's not in PMI, new > prev, and delta is positive. Current
> >> > calculation may lose the max count value.
> >> > But this case rarely happens. So the rare case doesn't handle here.
> >> >
> >> > Reported-by: Lukasz Odzioba <lukasz.odzioba@intel.com>
> >> > Signed-off-by: Kan Liang <kan.liang@intel.com>
> >> > Tested-by: Lukasz Odzioba <lukasz.odzioba@intel.com>
> >> > ---
> >> >  arch/x86/events/core.c       | 23 +++++++++++++++++++----
> >> >  arch/x86/events/intel/core.c |  4 ++--
> >> >  arch/x86/events/intel/p4.c   |  2 +-
> >> >  arch/x86/events/perf_event.h |  2 +-
> >> >  4 files changed, 23 insertions(+), 8 deletions(-)
> >> >
> >> > diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index
> >> > 6c3b0ef..c351ac4 100644
> >> > --- a/arch/x86/events/core.c
> >> > +++ b/arch/x86/events/core.c
> >> > @@ -63,7 +63,7 @@ u64 __read_mostly hw_cache_extra_regs
> >> >   * Can only be executed on the CPU where the event is active.
> >> >   * Returns the delta events processed.
> >> >   */
> >> > -u64 x86_perf_event_update(struct perf_event *event)
> >> > +u64 x86_perf_event_update(struct perf_event *event, bool pmi)
> >> >  {
> >> >         struct hw_perf_event *hwc = &event->hw;
> >> >         int shift = 64 - x86_pmu.cntval_bits; @@ -100,6 +100,21 @@
> >> > u64 x86_perf_event_update(struct perf_event *event)
> >> >         delta = (new_raw_count << shift) - (prev_raw_count << shift);
> >> >         delta >>= shift;
> >> >
> >> > +       /*
> >> > +        * Correct delta for special cases caused by the late PMI handle.
> >> > +        * Case1: PMU counter may be start from a big value.
> >> > +        *        In PMI, new < prev. delta is negative.
> >> > +        * Case2: new is big enough which impact the sign flag.
> >> > +        *        The delta will be negative after arithmetic right shift.
> >> > +        * Case3: In PMI, new > prev.
> >> > +        *        The new - prev lose the max count value.
> >> > +        *
> >> > +        * There may be event update in PMI and overflow gap,
> >> > +        * but it rarely happen. The rare case doesn't handle here.
> >> > +        */
> >> > +       if (((delta > 0) && pmi) || (delta < 0))
> >> > +               delta += x86_pmu.cntval_mask + 1;
> >> > +
> >> >         local64_add(delta, &event->count);
> >> >         local64_sub(delta, &hwc->period_left);
> >> >
> >> > @@ -1342,7 +1357,7 @@ void x86_pmu_stop(struct perf_event *event,
> >> int flags)
> >> >                  * Drain the remaining delta count out of a event
> >> >                  * that we are disabling:
> >> >                  */
> >> > -               x86_perf_event_update(event);
> >> > +               x86_perf_event_update(event, (flags == 0));
> >> >                 hwc->state |= PERF_HES_UPTODATE;
> >> >         }
> >> >  }
> >> > @@ -1446,7 +1461,7 @@ int x86_pmu_handle_irq(struct pt_regs *regs)
> >> >
> >> >                 event = cpuc->events[idx];
> >> >
> >> > -               val = x86_perf_event_update(event);
> >> > +               val = x86_perf_event_update(event, true);
> >> >                 if (val & (1ULL << (x86_pmu.cntval_bits - 1)))
> >> >                         continue;
> >> >
> >> > @@ -1867,7 +1882,7 @@ early_initcall(init_hw_perf_events);
> >> >
> >> >  static inline void x86_pmu_read(struct perf_event *event)  {
> >> > -       x86_perf_event_update(event);
> >> > +       x86_perf_event_update(event, false);
> >> >  }
> >> >
> >> >  /*
> >> > diff --git a/arch/x86/events/intel/core.c
> >> > b/arch/x86/events/intel/core.c index a74a2db..69d65e6 100644
> >> > --- a/arch/x86/events/intel/core.c
> >> > +++ b/arch/x86/events/intel/core.c
> >> > @@ -1830,7 +1830,7 @@ static void intel_pmu_nhm_workaround(void)
> >> >         for (i = 0; i < 4; i++) {
> >> >                 event = cpuc->events[i];
> >> >                 if (event)
> >> > -                       x86_perf_event_update(event);
> >> > +                       x86_perf_event_update(event, false);
> >> >         }
> >> >
> >> >         for (i = 0; i < 4; i++) {
> >> > @@ -2002,7 +2002,7 @@ static void intel_pmu_add_event(struct
> >> perf_event *event)
> >> >   */
> >> >  int intel_pmu_save_and_restart(struct perf_event *event)  {
> >> > -       x86_perf_event_update(event);
> >> > +       x86_perf_event_update(event, true);
> >> >         /*
> >> >          * For a checkpointed counter always reset back to 0.  This
> >> >          * avoids a situation where the counter overflows, aborts
> >> > the diff --git a/arch/x86/events/intel/p4.c
> >> > b/arch/x86/events/intel/p4.c index eb05335..8117de8 100644
> >> > --- a/arch/x86/events/intel/p4.c
> >> > +++ b/arch/x86/events/intel/p4.c
> >> > @@ -1024,7 +1024,7 @@ static int p4_pmu_handle_irq(struct pt_regs
> >> *regs)
> >> >                 /* it might be unflagged overflow */
> >> >                 overflow = p4_pmu_clear_cccr_ovf(hwc);
> >> >
> >> > -               val = x86_perf_event_update(event);
> >> > +               val = x86_perf_event_update(event, true);
> >> >                 if (!overflow && (val & (1ULL << (x86_pmu.cntval_bits - 1))))
> >> >                         continue;
> >> >
> >> > diff --git a/arch/x86/events/perf_event.h
> >> > b/arch/x86/events/perf_event.h index c6b25ac..09c9db2 100644
> >> > --- a/arch/x86/events/perf_event.h
> >> > +++ b/arch/x86/events/perf_event.h
> >> > @@ -712,7 +712,7 @@ extern u64 __read_mostly
> hw_cache_extra_regs
> >> >                                 [PERF_COUNT_HW_CACHE_OP_MAX]
> >> >                                 [PERF_COUNT_HW_CACHE_RESULT_MAX];
> >> >
> >> > -u64 x86_perf_event_update(struct perf_event *event);
> >> > +u64 x86_perf_event_update(struct perf_event *event, bool pmi);
> >> >
> >> >  static inline unsigned int x86_pmu_config_addr(int index)  {
> >> > --
> >> > 2.4.3
> >> >


* Re: [PATCH] perf/x86: fix event counter update issue
  2016-11-28 19:26 [PATCH] perf/x86: fix event counter update issue kan.liang
  2016-11-28 19:41 ` Stephane Eranian
@ 2016-11-29  9:25 ` Peter Zijlstra
  2016-11-29 14:46   ` Liang, Kan
  2016-11-29 17:20   ` Stephane Eranian
  1 sibling, 2 replies; 27+ messages in thread
From: Peter Zijlstra @ 2016-11-29  9:25 UTC (permalink / raw)
  To: kan.liang
  Cc: mingo, linux-kernel, eranian, alexander.shishkin, ak, lukasz.odzioba


So caveat that I'm ill and cannot think much..

On Mon, Nov 28, 2016 at 11:26:46AM -0800, kan.liang@intel.com wrote:

> Here, all the possible failure cases are listed.
> Terms:
>     - new: current PMU counter value which read from rdpmcl.
>     - prev: previous counter value which is stored in &hwc->prev_count.
>     - in PMI/not in PMI: the event update happens in PMI handler or not.

> Current code to calculate delta:
>     delta = (new << shift) - (prev << shift);
>     delta >>= shift;
> 
> Case A: Not in PMI.  new > prev. But delta is negative.
>    That's the failure case of Test 2.
>    delta is s64 type. new and prev are u64 type. If the new is big
>    enough, after doing left shift and sub, the bit 64 of the result may
>    still be 1.
>    After converting to s64, the sign flag will be set. Since delta is
>    s64 type, arithmetic right shift is applied, which copy the sign flag
>    into empty bit positions on the upper end of delta.
>    It can be fixed by adding the max count value.
> 
>    Here is the real data for test2 on KNL.
>    new = aea96e1927
>    prev = ffffff0000000001
>    delta = aea96e1927000000 - 1000000 = aea96e1926000000
>    aea96e1926000000 >> 24 = ffffffaea96e1926   <<  negative delta

How can this happen? IIRC the thing increments, we program a negative
value, and when it passes 0 we generate a PMI.

And note that we _ALWAYS_ set the IN bits, even for !sampling events.
Also note we set max_period to (1<<31) - 1, so we should never exceed 31
bits.


> Case B: In PMI. new > prev. delta is positive.
>    That's the failure case of Test 3.
>    The PMI is triggered by overflow. But there is latency between
>    overflow and PMI handler. So new has small amount.
>    Current calculation lose the max count value.

That doesn't make sense, per the 31bit limit.


> Case C: In PMI. new < prev. delta is negative.
>    The PMU counter may be start from a big value. E.g. the fixed period
>    is small.
>    It can be fixed by adding the max count value.

Doesn't make sense, how can this happen?


* RE: [PATCH] perf/x86: fix event counter update issue
  2016-11-29  9:25 ` Peter Zijlstra
@ 2016-11-29 14:46   ` Liang, Kan
  2016-11-29 16:58     ` Peter Zijlstra
  2016-11-29 17:20   ` Stephane Eranian
  1 sibling, 1 reply; 27+ messages in thread
From: Liang, Kan @ 2016-11-29 14:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, eranian, alexander.shishkin, ak, Odzioba, Lukasz



> So caveat that I'm ill and cannot think much..
> 
> On Mon, Nov 28, 2016 at 11:26:46AM -0800, kan.liang@intel.com wrote:
> 
> > Here, all the possible failure cases are listed.
> > Terms:
> >     - new: current PMU counter value which read from rdpmcl.
> >     - prev: previous counter value which is stored in &hwc->prev_count.
> >     - in PMI/not in PMI: the event update happens in PMI handler or not.
> 
> > Current code to calculate delta:
> >     delta = (new << shift) - (prev << shift);
> >     delta >>= shift;
> >
> > Case A: Not in PMI.  new > prev. But delta is negative.
> >    That's the failure case of Test 2.
> >    delta is s64 type. new and prev are u64 type. If the new is big
> >    enough, after doing left shift and sub, the bit 64 of the result may
> >    still be 1.
> >    After converting to s64, the sign flag will be set. Since delta is
> >    s64 type, arithmetic right shift is applied, which copy the sign flag
> >    into empty bit positions on the upper end of delta.
> >    It can be fixed by adding the max count value.
> >
> >    Here is the real data for test2 on KNL.
> >    new = aea96e1927
> >    prev = ffffff0000000001
> >    delta = aea96e1927000000 - 1000000 = aea96e1926000000
> >    aea96e1926000000 >> 24 = ffffffaea96e1926   <<  negative delta
> 
> How can this happen? IIRC the thing increments, we program a negative
> value, and when it passes 0 we generate a PMI.
> 
> And note that we _ALWAYS_ set the IN bits, even for !sampling events.
> Also note we set max_period to (1<<31) - 1, so we should never exceed 31
> bits.
> 

The max_period is 0xffffffffff.

That limit was broken by this patch:
069e0c3c4058 ("perf/x86/intel: Support full width counting")
https://patchwork.kernel.org/patch/2784191/

	/* Support full width counters using alternative MSR range */
	if (x86_pmu.intel_cap.full_width_write) {
		x86_pmu.max_period = x86_pmu.cntval_mask;
		x86_pmu.perfctr = MSR_IA32_PMC0;
		pr_cont("full-width counters, ");
	}
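
To connect this to the Test 2 numbers: with full_width_write set,
max_period becomes cntval_mask, i.e. 0xffffffffff on a 40-bit KNL/SLM
counter. Assuming the usual counting-event setup (sample_period
defaults to max_period, and x86_perf_event_set_period() stores
(u64)-left in hwc->prev_count), that is exactly where the
prev = 0xffffff0000000001 seen in Test 2 comes from:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t max_period = (1ULL << 40) - 1; /* cntval_mask, 40-bit counter */
        int64_t left = max_period;              /* counting event: full period */
        uint64_t prev_count = (uint64_t)-left;  /* what set_period() stores */

        printf("%llx\n", (unsigned long long)prev_count);
        /* prints ffffff0000000001 */
        return 0;
    }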

> 
> > Case B: In PMI. new > prev. delta is positive.
> >    That's the failure case of Test 3.
> >    The PMI is triggered by overflow. But there is latency between
> >    overflow and PMI handler. So new has small amount.
> >    Current calculation lose the max count value.
> 
> That doesn't make sense, per the 31bit limit.
>
> 
> > Case C: In PMI. new < prev. delta is negative.
> >    The PMU counter may be start from a big value. E.g. the fixed period
> >    is small.
> >    It can be fixed by adding the max count value.
> 
> Doesn't make sense, how can this happen?


* Re: [PATCH] perf/x86: fix event counter update issue
  2016-11-29 14:46   ` Liang, Kan
@ 2016-11-29 16:58     ` Peter Zijlstra
  2016-11-29 17:06       ` Liang, Kan
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2016-11-29 16:58 UTC (permalink / raw)
  To: Liang, Kan
  Cc: mingo, linux-kernel, eranian, alexander.shishkin, ak, Odzioba, Lukasz

On Tue, Nov 29, 2016 at 02:46:14PM +0000, Liang, Kan wrote:
> > And note that we _ALWAYS_ set the IN bits, even for !sampling events.
> > Also note we set max_period to (1<<31) - 1, so we should never exceed 31
> > bits.
> > 
> 
> The max_period is 0xfffffffff.
> 
> The limit is breaked by this patch.
> 069e0c3c4058 ("perf/x86/intel: Support full width counting")
> https://patchwork.kernel.org/patch/2784191/
> 
> 	/* Support full width counters using alternative MSR range */
> 	if (x86_pmu.intel_cap.full_width_write) {
> 		x86_pmu.max_period = x86_pmu.cntval_mask;
> 		x86_pmu.perfctr = MSR_IA32_PMC0;
> 		pr_cont("full-width counters, ");
> 	}
> 

Wth do KNL/SLM have full_width_write set if they have short counters? I
thought the whole point of the full_width thing was that in that case
the counters were actually 64 bits wide.


* RE: [PATCH] perf/x86: fix event counter update issue
  2016-11-29 16:58     ` Peter Zijlstra
@ 2016-11-29 17:06       ` Liang, Kan
  2016-11-29 17:17         ` Peter Zijlstra
  0 siblings, 1 reply; 27+ messages in thread
From: Liang, Kan @ 2016-11-29 17:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, eranian, alexander.shishkin, ak, Odzioba, Lukasz



> On Tue, Nov 29, 2016 at 02:46:14PM +0000, Liang, Kan wrote:
> > > And note that we _ALWAYS_ set the IN bits, even for !sampling events.
> > > Also note we set max_period to (1<<31) - 1, so we should never
> > > exceed 31 bits.
> > >
> >
> > The max_period is 0xfffffffff.
> >
> > The limit is breaked by this patch.
> > 069e0c3c4058 ("perf/x86/intel: Support full width counting")
> > https://patchwork.kernel.org/patch/2784191/
> >
> > 	/* Support full width counters using alternative MSR range */
> > 	if (x86_pmu.intel_cap.full_width_write) {
> > 		x86_pmu.max_period = x86_pmu.cntval_mask;
> > 		x86_pmu.perfctr = MSR_IA32_PMC0;
> > 		pr_cont("full-width counters, ");
> > 	}
> >
> 
> Wth do KNL/SLM have full_width_write set if they have short counters?

Yes, the full_width_write is set on KNL/SLM.

> I should the whole point of the full_wdith thing was that in that case the
> counters were actually 64bit wide.

Even for the big cores, the counter width is 48 bits.
AFAIK, no Intel platform has a 64-bit wide counter.

Thanks,
Kan


* Re: [PATCH] perf/x86: fix event counter update issue
  2016-11-29 17:06       ` Liang, Kan
@ 2016-11-29 17:17         ` Peter Zijlstra
  0 siblings, 0 replies; 27+ messages in thread
From: Peter Zijlstra @ 2016-11-29 17:17 UTC (permalink / raw)
  To: Liang, Kan
  Cc: mingo, linux-kernel, eranian, alexander.shishkin, ak, Odzioba, Lukasz

On Tue, Nov 29, 2016 at 05:06:36PM +0000, Liang, Kan wrote:
> 
> 
> > On Tue, Nov 29, 2016 at 02:46:14PM +0000, Liang, Kan wrote:
> > > > And note that we _ALWAYS_ set the IN bits, even for !sampling events.
> > > > Also note we set max_period to (1<<31) - 1, so we should never
> > > > exceed 31 bits.
> > > >
> > >
> > > The max_period is 0xfffffffff.
> > >
> > > The limit is breaked by this patch.
> > > 069e0c3c4058 ("perf/x86/intel: Support full width counting")
> > > https://patchwork.kernel.org/patch/2784191/
> > >
> > > 	/* Support full width counters using alternative MSR range */
> > > 	if (x86_pmu.intel_cap.full_width_write) {
> > > 		x86_pmu.max_period = x86_pmu.cntval_mask;
> > > 		x86_pmu.perfctr = MSR_IA32_PMC0;
> > > 		pr_cont("full-width counters, ");
> > > 	}
> > >
> > 
> > Wth do KNL/SLM have full_width_write set if they have short counters?
> 
> Yes, the full_width_write is set on KNL/SLM.
> 
> > I should the whole point of the full_wdith thing was that in that case the
> > counters were actually 64bit wide.
> 
> Even for big core, the counter width is 48 bit.
> AFAIK, no Intel platform has 64bit wide counter.

Then yes, that patch is very broken.


* Re: [PATCH] perf/x86: fix event counter update issue
  2016-11-29  9:25 ` Peter Zijlstra
  2016-11-29 14:46   ` Liang, Kan
@ 2016-11-29 17:20   ` Stephane Eranian
  2016-11-29 17:30     ` Peter Zijlstra
  2016-11-29 19:08     ` Odzioba, Lukasz
  1 sibling, 2 replies; 27+ messages in thread
From: Stephane Eranian @ 2016-11-29 17:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Liang, Kan, mingo, LKML, Alexander Shishkin, ak, Odzioba, Lukasz

On Tue, Nov 29, 2016 at 1:25 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
>
> So caveat that I'm ill and cannot think much..
>
> On Mon, Nov 28, 2016 at 11:26:46AM -0800, kan.liang@intel.com wrote:
>
> > Here, all the possible failure cases are listed.
> > Terms:
> >     - new: current PMU counter value which read from rdpmcl.
> >     - prev: previous counter value which is stored in &hwc->prev_count.
> >     - in PMI/not in PMI: the event update happens in PMI handler or not.
>
> > Current code to calculate delta:
> >     delta = (new << shift) - (prev << shift);
> >     delta >>= shift;
> >
> > Case A: Not in PMI.  new > prev. But delta is negative.
> >    That's the failure case of Test 2.
> >    delta is s64 type. new and prev are u64 type. If the new is big
> >    enough, after doing left shift and sub, the bit 64 of the result may
> >    still be 1.
> >    After converting to s64, the sign flag will be set. Since delta is
> >    s64 type, arithmetic right shift is applied, which copy the sign flag
> >    into empty bit positions on the upper end of delta.
> >    It can be fixed by adding the max count value.
> >
> >    Here is the real data for test2 on KNL.
> >    new = aea96e1927
> >    prev = ffffff0000000001
> >    delta = aea96e1927000000 - 1000000 = aea96e1926000000
> >    aea96e1926000000 >> 24 = ffffffaea96e1926   <<  negative delta
>
> How can this happen? IIRC the thing increments, we program a negative
> value, and when it passes 0 we generate a PMI.
>
Yeah, that's the part I don't quite understand, thus my initial
question. What you describe is what is supposed to happen regardless
of the counter width.

>
> And note that we _ALWAYS_ set the IN bits, even for !sampling events.
> Also note we set max_period to (1<<31) - 1, so we should never exceed 31
> bits.
>
Max period is limited by the number of bits the kernel can write to an MSR.
It used to be 31 bits; now it is 47 for the core PMU, as per the patch
pointed to by Kan.
>
>
> > Case B: In PMI. new > prev. delta is positive.
> >    That's the failure case of Test 3.
> >    The PMI is triggered by overflow. But there is latency between
> >    overflow and PMI handler. So new has small amount.
> >    Current calculation lose the max count value.
>
> That doesn't make sense, per the 31bit limit.
>
>
> > Case C: In PMI. new < prev. delta is negative.
> >    The PMU counter may be start from a big value. E.g. the fixed period
> >    is small.
> >    It can be fixed by adding the max count value.
>
> Doesn't make sense, how can this happen?


The only issue I could think of is the case where the counter is
reprogrammed to a different value, which could be much lower or higher
than the last saved value, for instance in frequency mode. Otherwise
each counter increments monotonically from the period. Counting events
are also programmed that way.


* Re: [PATCH] perf/x86: fix event counter update issue
  2016-11-29 17:20   ` Stephane Eranian
@ 2016-11-29 17:30     ` Peter Zijlstra
  2016-11-29 18:11       ` Stephane Eranian
                         ` (2 more replies)
  2016-11-29 19:08     ` Odzioba, Lukasz
  1 sibling, 3 replies; 27+ messages in thread
From: Peter Zijlstra @ 2016-11-29 17:30 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Liang, Kan, mingo, LKML, Alexander Shishkin, ak, Odzioba, Lukasz

On Tue, Nov 29, 2016 at 09:20:10AM -0800, Stephane Eranian wrote:
> Max period is limited by the number of bits the kernel can write to an MSR.
> Used to be 31, now it is 47 for core PMU as per patch pointed to by Kan.

No, I think it sets it to 48 now, which is the problem. It should be 1
bit less than the total width.

So something like so.

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index a74a2dbc0180..cb8522290e6a 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4034,7 +4034,7 @@ __init int intel_pmu_init(void)
 
 	/* Support full width counters using alternative MSR range */
 	if (x86_pmu.intel_cap.full_width_write) {
-		x86_pmu.max_period = x86_pmu.cntval_mask;
+		x86_pmu.max_period = x86_pmu.cntval_mask >> 1;
 		x86_pmu.perfctr = MSR_IA32_PMC0;
 		pr_cont("full-width counters, ");
 	}


* Re: [PATCH] perf/x86: fix event counter update issue
  2016-11-29 17:30     ` Peter Zijlstra
@ 2016-11-29 18:11       ` Stephane Eranian
  2016-11-29 18:37       ` Andi Kleen
  2016-11-29 19:07       ` Liang, Kan
  2 siblings, 0 replies; 27+ messages in thread
From: Stephane Eranian @ 2016-11-29 18:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Liang, Kan, mingo, LKML, Alexander Shishkin, ak, Odzioba, Lukasz

On Tue, Nov 29, 2016 at 9:30 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Nov 29, 2016 at 09:20:10AM -0800, Stephane Eranian wrote:
>> Max period is limited by the number of bits the kernel can write to an MSR.
>> Used to be 31, now it is 47 for core PMU as per patch pointed to by Kan.
>
> No, I think it sets it to 48 now, which is the problem. It should be 1
> bit less than the total width.
>
> So something like so.
>
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index a74a2dbc0180..cb8522290e6a 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -4034,7 +4034,7 @@ __init int intel_pmu_init(void)
>
>         /* Support full width counters using alternative MSR range */
>         if (x86_pmu.intel_cap.full_width_write) {
> -               x86_pmu.max_period = x86_pmu.cntval_mask;
> +               x86_pmu.max_period = x86_pmu.cntval_mask >> 1;
>                 x86_pmu.perfctr = MSR_IA32_PMC0;
>                 pr_cont("full-width counters, ");
>         }
Ah, yes!
That would make it consistent with the other Intel PMU settings I see
in intel/core.c, such as intel_core_pmu.max_period = (1<<31) - 1,
which is 0x7fffffff, whereas now I see a max_period of
0x0000ffffffffffff instead of 0x7fffffffffff. So I think Peter's
patch is required no matter what.
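
For concreteness, a small user-space sketch of that arithmetic (an
illustration only, assuming the 48-bit full-width counters of the big
cores):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        int cntval_bits = 48;           /* full-width counter width on the big cores */
        uint64_t cntval_mask = (1ULL << cntval_bits) - 1;

        /* what the full_width_write branch sets today vs. with Peter's change */
        printf("cntval_mask      = 0x%016llx\n", (unsigned long long)cntval_mask);
        printf("cntval_mask >> 1 = 0x%016llx\n", (unsigned long long)(cntval_mask >> 1));
        return 0;
}

which prints 0x0000ffffffffffff and 0x00007fffffffffff, i.e. the two
max_period values quoted above.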


* Re: [PATCH] perf/x86: fix event counter update issue
  2016-11-29 17:30     ` Peter Zijlstra
  2016-11-29 18:11       ` Stephane Eranian
@ 2016-11-29 18:37       ` Andi Kleen
  2016-11-29 19:07       ` Liang, Kan
  2 siblings, 0 replies; 27+ messages in thread
From: Andi Kleen @ 2016-11-29 18:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Stephane Eranian, Liang, Kan, mingo, LKML, Alexander Shishkin,
	Odzioba, Lukasz

On Tue, Nov 29, 2016 at 06:30:55PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 29, 2016 at 09:20:10AM -0800, Stephane Eranian wrote:
> > Max period is limited by the number of bits the kernel can write to an MSR.
> > Used to be 31, now it is 47 for core PMU as per patch pointed to by Kan.
> 
> No, I think it sets it to 48 now, which is the problem. It should be 1
> bit less than the total width.
> 
> So something like so.

That looks good.  Kan, can you test it?

-Andi
> 
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index a74a2dbc0180..cb8522290e6a 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -4034,7 +4034,7 @@ __init int intel_pmu_init(void)
>  
>  	/* Support full width counters using alternative MSR range */
>  	if (x86_pmu.intel_cap.full_width_write) {
> -		x86_pmu.max_period = x86_pmu.cntval_mask;
> +		x86_pmu.max_period = x86_pmu.cntval_mask >> 1;
>  		x86_pmu.perfctr = MSR_IA32_PMC0;
>  		pr_cont("full-width counters, ");
>  	}


* RE: [PATCH] perf/x86: fix event counter update issue
  2016-11-29 17:30     ` Peter Zijlstra
  2016-11-29 18:11       ` Stephane Eranian
  2016-11-29 18:37       ` Andi Kleen
@ 2016-11-29 19:07       ` Liang, Kan
  2016-11-29 19:32         ` Peter Zijlstra
  2 siblings, 1 reply; 27+ messages in thread
From: Liang, Kan @ 2016-11-29 19:07 UTC (permalink / raw)
  To: Peter Zijlstra, Stephane Eranian
  Cc: mingo, LKML, Alexander Shishkin, ak, Odzioba, Lukasz



> On Tue, Nov 29, 2016 at 09:20:10AM -0800, Stephane Eranian wrote:
> > Max period is limited by the number of bits the kernel can write to an
> MSR.
> > Used to be 31, now it is 47 for core PMU as per patch pointed to by Kan.
> 
> No, I think it sets it to 48 now, which is the problem. It should be 1 bit less
> than the total width.
> 
> So something like so.
> 
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index a74a2dbc0180..cb8522290e6a 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -4034,7 +4034,7 @@ __init int intel_pmu_init(void)
> 
>  	/* Support full width counters using alternative MSR range */
>  	if (x86_pmu.intel_cap.full_width_write) {
> -		x86_pmu.max_period = x86_pmu.cntval_mask;
> +		x86_pmu.max_period = x86_pmu.cntval_mask >> 1;
>  		x86_pmu.perfctr = MSR_IA32_PMC0;
>  		pr_cont("full-width counters, ");
>  	}

It doesn't work. 
perf stat -x, -C1 -e cycles -- sudo taskset 0x2 ./loop 100000000000
18446743727217821696,,cycles,313837854019,100.00

delta 0xffffff8000001803 new 0x1804 prev 0xffffff8000000001

I guess we need at least x86_pmu.cntval_mask >> 2 to prevent
the sign flag from being set.
I'm testing it now.
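
To make the failure concrete, a minimal user-space sketch of the same
delta arithmetic (an illustration only: it assumes the 40-bit counter
width seen in the traces, so shift = 24, and relies on gcc's
two's-complement conversion and arithmetic right shift for the signed
case):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        /* values from the trace above */
        uint64_t prev_raw_count = 0xffffff8000000001ULL; /* counter programmed to -period */
        uint64_t new_raw_count  = 0x1804ULL;             /* raw read just after the wrap */
        int shift = 64 - 40;                             /* 40-bit counters */

        /* current x86_perf_event_update(): s64 delta, arithmetic shift sign-extends */
        int64_t s_delta = (int64_t)((new_raw_count << shift) - (prev_raw_count << shift));
        s_delta >>= shift;

        /* with a u64 delta the right shift is logical and no sign extension happens */
        uint64_t u_delta = (new_raw_count << shift) - (prev_raw_count << shift);
        u_delta >>= shift;

        printf("signed   delta = 0x%llx\n", (unsigned long long)s_delta); /* 0xffffff8000001803 */
        printf("unsigned delta = 0x%llx\n", (unsigned long long)u_delta); /* 0x8000001803 */
        return 0;
}

The signed variant reproduces the bogus delta above; the unsigned one
yields 0x8000001803, the increment the counter actually saw (assuming a
single wrap).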


Also, no matter how it's fixed, I think we'd better add a WARN_ONCE
for delta.

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 6c3b0ef..2ce8299 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -100,6 +100,9 @@ u64 x86_perf_event_update(struct perf_event *event)
 	delta = (new_raw_count << shift) - (prev_raw_count << shift);
 	delta >>= shift;
 
+	WARN_ONCE((delta < 0), "counter increment must be positive. delta 0x%llx new 0x%llx prev 0x%llx\n",
+		  delta, new_raw_count, prev_raw_count);
+
 	local64_add(delta, &event->count);
 	local64_sub(delta, &hwc->period_left);

Thanks,
Kan


* RE: [PATCH] perf/x86: fix event counter update issue
  2016-11-29 17:20   ` Stephane Eranian
  2016-11-29 17:30     ` Peter Zijlstra
@ 2016-11-29 19:08     ` Odzioba, Lukasz
  1 sibling, 0 replies; 27+ messages in thread
From: Odzioba, Lukasz @ 2016-11-29 19:08 UTC (permalink / raw)
  To: Stephane Eranian, Peter Zijlstra
  Cc: Liang, Kan, mingo, LKML, Alexander Shishkin, ak

On Tuesday, November 29, 2016 6:20 PM Stephane Eranian wrote:
>> On Tue, Nov 29, 2016 at 1:25 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>  How can this happen? IIRC the thing increments, we program a negative
>>  value, and when it passes 0 we generate a PMI.
>>	
> Yeah, that's the part I don't quite understand thus my initial
> question. What you describe
> is what is supposed to happen regardless of the counter width.

I don't see any interrupts flying around:
# cat /proc/interrupts | grep PMI
PMI:         10         14         12         18   Performance monitoring interrupts
(test case here)
# cat /proc/interrupts | grep PMI
PMI:         10         14         12         18   Performance monitoring interrupts

What I also don't understand is why this is only seen in per-cpu or system-wide mode.
PMIs seem to be programmed, because if we wait long enough the counter will eventually overflow,
which is what we saw in one of the test cases from the commit message.
In the failing case (system wide), x86_perf_event_update is called only once, at the end of the workload:

259.154530: x86_perf_event_set_period: x86_perf_event_set_period left = ffffffffff, ret = 0
550.079623: x86_pmu_stop: x86_pmu_stop -> need update
550.079625: x86_perf_event_update: x86_perf_event_update config = 11003c, period = ffff880276212000
550.079627: x86_perf_event_update: x86_perf_event_update sample_period = ffffffffff,  last_period = ffffffffff, period_left = ffffffffff
550.079629: x86_perf_event_update: x86_perf_event_update(1) prev = ffffff0000000001, new = 8bba934b9e, delta = ffffff8bba934b9d, shift=24

In the good case (per-pid), x86_perf_event_update is called every ~3 seconds.
568.448083: x86_perf_event_set_period: x86_perf_event_set_period left = ffffffffff, ret = 0
568.949105: x86_pmu_stop: x86_pmu_stop -> need update
568.949106: x86_perf_event_update: x86_perf_event_update config = 11003c, period = ffff880276212000
568.949108: x86_perf_event_update: x86_perf_event_update sample_period = ffffffffff,  last_period = ffffffffff, period_left = ffffffffff
568.949110: x86_perf_event_update: x86_perf_event_update(1) prev = ffffff0000000001, new = 3d9525cc, delta = 3d9525cb, shift=24
568.949135: x86_perf_event_set_period: x86_perf_event_set_period left = ffc26ada34, ret = 0
570.679697: x86_pmu_stop: x86_pmu_stop -> need update
570.679699: x86_perf_event_update: x86_perf_event_update config = 11003c, period = ffff880276212000
570.679700: x86_perf_event_update: x86_perf_event_update sample_period = ffffffffff,  last_period = ffffffffff, period_left = ffc26ada34
570.679701: x86_perf_event_update: x86_perf_event_update(1) prev = ffffff003d9525cc, new = 112601ddf, delta = d4caf813, shift=24
570.679723: x86_perf_event_set_period: x86_perf_event_set_period left = feed9fe221, ret = 0
(... ~115 similar calls here)
859.431686: x86_pmu_stop: x86_pmu_stop -> need update
859.431687: x86_perf_event_update: x86_perf_event_update config = 11003c, period = ffff880276209000
859.431688: x86_perf_event_update: x86_perf_event_update sample_period = ffffffffff,  last_period = ffffffffff, period_left = 7453fc7d0d
859.431689: x86_perf_event_update: x86_perf_event_update(1) prev = ffffff8bac0382f3, new = 8bbaa4a147, delta = ea11e54, shift=24

Thanks,
Lukas


* Re: [PATCH] perf/x86: fix event counter update issue
  2016-11-29 19:07       ` Liang, Kan
@ 2016-11-29 19:32         ` Peter Zijlstra
  2016-11-29 20:33           ` Liang, Kan
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2016-11-29 19:32 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Stephane Eranian, mingo, LKML, Alexander Shishkin, ak, Odzioba, Lukasz

On Tue, Nov 29, 2016 at 07:07:25PM +0000, Liang, Kan wrote:
> > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> > index a74a2dbc0180..cb8522290e6a 100644
> > --- a/arch/x86/events/intel/core.c
> > +++ b/arch/x86/events/intel/core.c
> > @@ -4034,7 +4034,7 @@ __init int intel_pmu_init(void)
> > 
> >  	/* Support full width counters using alternative MSR range */
> >  	if (x86_pmu.intel_cap.full_width_write) {
> > -		x86_pmu.max_period = x86_pmu.cntval_mask;
> > +		x86_pmu.max_period = x86_pmu.cntval_mask >> 1;
> >  		x86_pmu.perfctr = MSR_IA32_PMC0;
> >  		pr_cont("full-width counters, ");
> >  	}
> 
> It doesn't work. 
> perf stat -x, -C1 -e cycles -- sudo taskset 0x2 ./loop 100000000000
> 18446743727217821696,,cycles,313837854019,100.00
> 
> delta 0xffffff8000001803 new 0x1804 prev 0xffffff8000000001
> 
> I guess we need at least x86_pmu.cntval_mask >> 2 to prevent
> the sign flag set.

Possibly delta should be u64, as we know the counter cannot decrement.


* RE: [PATCH] perf/x86: fix event counter update issue
  2016-11-29 19:32         ` Peter Zijlstra
@ 2016-11-29 20:33           ` Liang, Kan
  2016-11-29 20:37             ` Stephane Eranian
  2016-12-02 12:58             ` Odzioba, Lukasz
  0 siblings, 2 replies; 27+ messages in thread
From: Liang, Kan @ 2016-11-29 20:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Stephane Eranian, mingo, LKML, Alexander Shishkin, ak, Odzioba, Lukasz



> On Tue, Nov 29, 2016 at 07:07:25PM +0000, Liang, Kan wrote:
> > > diff --git a/arch/x86/events/intel/core.c
> > > b/arch/x86/events/intel/core.c index a74a2dbc0180..cb8522290e6a
> > > 100644
> > > --- a/arch/x86/events/intel/core.c
> > > +++ b/arch/x86/events/intel/core.c
> > > @@ -4034,7 +4034,7 @@ __init int intel_pmu_init(void)
> > >
> > >  	/* Support full width counters using alternative MSR range */
> > >  	if (x86_pmu.intel_cap.full_width_write) {
> > > -		x86_pmu.max_period = x86_pmu.cntval_mask;
> > > +		x86_pmu.max_period = x86_pmu.cntval_mask >> 1;
> > >  		x86_pmu.perfctr = MSR_IA32_PMC0;
> > >  		pr_cont("full-width counters, ");
> > >  	}
> >
> > It doesn't work.
> > perf stat -x, -C1 -e cycles -- sudo taskset 0x2 ./loop 100000000000
> > 18446743727217821696,,cycles,313837854019,100.00
> >
> > delta 0xffffff8000001803 new 0x1804 prev 0xffffff8000000001
> >
> > I guess we need at least x86_pmu.cntval_mask >> 2 to prevent the sign
> > flag set.
> 
> Possible delta should be u64, as we know the counter cannot decrement.

Yes, the patch as below fixes the issue on my SLM.

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 6c3b0ef..abd97e8 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -69,7 +69,7 @@ u64 x86_perf_event_update(struct perf_event *event)
 	int shift = 64 - x86_pmu.cntval_bits;
 	u64 prev_raw_count, new_raw_count;
 	int idx = hwc->idx;
-	s64 delta;
+	u64 delta;
 
 	if (idx == INTEL_PMC_IDX_FIXED_BTS)
 		return 0;
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index a74a2db..cb85222 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4034,7 +4034,7 @@ __init int intel_pmu_init(void)
 
 	/* Support full width counters using alternative MSR range */
 	if (x86_pmu.intel_cap.full_width_write) {
-		x86_pmu.max_period = x86_pmu.cntval_mask;
+		x86_pmu.max_period = x86_pmu.cntval_mask >> 1;
 		x86_pmu.perfctr = MSR_IA32_PMC0;
 		pr_cont("full-width counters, ");
 	}


* Re: [PATCH] perf/x86: fix event counter update issue
  2016-11-29 20:33           ` Liang, Kan
@ 2016-11-29 20:37             ` Stephane Eranian
  2016-12-02 12:58             ` Odzioba, Lukasz
  1 sibling, 0 replies; 27+ messages in thread
From: Stephane Eranian @ 2016-11-29 20:37 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Peter Zijlstra, mingo, LKML, Alexander Shishkin, ak, Odzioba, Lukasz

On Tue, Nov 29, 2016 at 12:33 PM, Liang, Kan <kan.liang@intel.com> wrote:
>
>
>> On Tue, Nov 29, 2016 at 07:07:25PM +0000, Liang, Kan wrote:
>> > > diff --git a/arch/x86/events/intel/core.c
>> > > b/arch/x86/events/intel/core.c index a74a2dbc0180..cb8522290e6a
>> > > 100644
>> > > --- a/arch/x86/events/intel/core.c
>> > > +++ b/arch/x86/events/intel/core.c
>> > > @@ -4034,7 +4034,7 @@ __init int intel_pmu_init(void)
>> > >
>> > >   /* Support full width counters using alternative MSR range */
>> > >   if (x86_pmu.intel_cap.full_width_write) {
>> > > -         x86_pmu.max_period = x86_pmu.cntval_mask;
>> > > +         x86_pmu.max_period = x86_pmu.cntval_mask >> 1;
>> > >           x86_pmu.perfctr = MSR_IA32_PMC0;
>> > >           pr_cont("full-width counters, ");
>> > >   }
>> >
>> > It doesn't work.
>> > perf stat -x, -C1 -e cycles -- sudo taskset 0x2 ./loop 100000000000
>> > 18446743727217821696,,cycles,313837854019,100.00
>> >
>> > delta 0xffffff8000001803 new 0x1804 prev 0xffffff8000000001
>> >
>> > I guess we need at least x86_pmu.cntval_mask >> 2 to prevent the sign
>> > flag set.
>>
>> Possible delta should be u64, as we know the counter cannot decrement.
>
> Yes, the patch as below fixes the issue on my SLM.
>
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 6c3b0ef..abd97e8 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -69,7 +69,7 @@ u64 x86_perf_event_update(struct perf_event *event)
>         int shift = 64 - x86_pmu.cntval_bits;
>         u64 prev_raw_count, new_raw_count;
>         int idx = hwc->idx;
> -       s64 delta;
> +       u64 delta;
>
>         if (idx == INTEL_PMC_IDX_FIXED_BTS)
>                 return 0;
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index a74a2db..cb85222 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -4034,7 +4034,7 @@ __init int intel_pmu_init(void)
>
>         /* Support full width counters using alternative MSR range */
>         if (x86_pmu.intel_cap.full_width_write) {
> -               x86_pmu.max_period = x86_pmu.cntval_mask;
> +               x86_pmu.max_period = x86_pmu.cntval_mask >> 1;
>                 x86_pmu.perfctr = MSR_IA32_PMC0;
>                 pr_cont("full-width counters, ");
>         }
>
Yes, now that makes sense! This s64 has been there since the beginning....
Thanks for fixing this.


* RE: [PATCH] perf/x86: fix event counter update issue
  2016-11-29 20:33           ` Liang, Kan
  2016-11-29 20:37             ` Stephane Eranian
@ 2016-12-02 12:58             ` Odzioba, Lukasz
  2016-12-05 10:25               ` Peter Zijlstra
  1 sibling, 1 reply; 27+ messages in thread
From: Odzioba, Lukasz @ 2016-12-02 12:58 UTC (permalink / raw)
  To: Liang, Kan, Peter Zijlstra
  Cc: Stephane Eranian, mingo, LKML, Alexander Shishkin, ak

On Tuesday, November 29, 2016 9:33 PM, Liang, Kan wrote:
> Yes, the patch as below fixes the issue on my SLM.

It works for me as well. 
Can we still have it in 4.9?

Thanks,
Lukas


* Re: [PATCH] perf/x86: fix event counter update issue
  2016-12-02 12:58             ` Odzioba, Lukasz
@ 2016-12-05 10:25               ` Peter Zijlstra
  2016-12-05 11:21                 ` Odzioba, Lukasz
  2017-02-22 14:49                 ` Vince Weaver
  0 siblings, 2 replies; 27+ messages in thread
From: Peter Zijlstra @ 2016-12-05 10:25 UTC (permalink / raw)
  To: Odzioba, Lukasz
  Cc: Liang, Kan, Stephane Eranian, mingo, LKML, Alexander Shishkin, ak

On Fri, Dec 02, 2016 at 12:58:17PM +0000, Odzioba, Lukasz wrote:
> On Tuesday, November 29, 2016 9:33 PM, Liang, Kan wrote:
> > Yes, the patch as below fixes the issue on my SLM.
> 
> It works for me as well. 
> Can we still have it in 4.9?

I'll certainly try. I've queued it as per the below.

---
Subject: perf,x86: Fix full width counter, counter overflow
Date: Tue, 29 Nov 2016 20:33:28 +0000

Lukasz reported that perf stat counters overflow is broken on KNL/SLM.

Both these parts have full_width_write set, and that does indeed have
a problem. In order to deal with counter wrap, we must sample the
counter at at least half the counter period (see also the sampling
theorem) such that we can unambiguously reconstruct the count.

However commit:

  069e0c3c4058 ("perf/x86/intel: Support full width counting")

sets the sampling interval to the full period, not half.

Fixing that exposes another issue, in that we must not sign extend the
delta value when we shift it right; the counter cannot have
decremented after all.

With both these issues fixed, counter overflow functions correctly
again.

Cc: "mingo@redhat.com" <mingo@redhat.com>
Cc: "ak@linux.intel.com" <ak@linux.intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: stable@vger.kernel.org
Reported-by: Lukasz Odzioba <lukasz.odzioba@intel.com>
Tested-by: "Liang, Kan" <kan.liang@intel.com>
Tested-by: "Odzioba, Lukasz" <lukasz.odzioba@intel.com>
Fixes: 069e0c3c4058 ("perf/x86/intel: Support full width counting")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/events/core.c       |    2 +-
 arch/x86/events/intel/core.c |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -69,7 +69,7 @@ u64 x86_perf_event_update(struct perf_ev
 	int shift = 64 - x86_pmu.cntval_bits;
 	u64 prev_raw_count, new_raw_count;
 	int idx = hwc->idx;
-	s64 delta;
+	u64 delta;
 
 	if (idx == INTEL_PMC_IDX_FIXED_BTS)
 		return 0;
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4034,7 +4034,7 @@ __init int intel_pmu_init(void)
 
 	/* Support full width counters using alternative MSR range */
 	if (x86_pmu.intel_cap.full_width_write) {
-		x86_pmu.max_period = x86_pmu.cntval_mask;
+		x86_pmu.max_period = x86_pmu.cntval_mask >> 1;
 		x86_pmu.perfctr = MSR_IA32_PMC0;
 		pr_cont("full-width counters, ");
 	}


* RE: [PATCH] perf/x86: fix event counter update issue
  2016-12-05 10:25               ` Peter Zijlstra
@ 2016-12-05 11:21                 ` Odzioba, Lukasz
  2017-02-22 14:49                 ` Vince Weaver
  1 sibling, 0 replies; 27+ messages in thread
From: Odzioba, Lukasz @ 2016-12-05 11:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Liang, Kan, Stephane Eranian, mingo, LKML, Alexander Shishkin, ak

On Monday, December 5, 2016 11:25 AM, Peter Zijlstra wrote:
> I'll certainly, try. I've queued it as per the below.

Great, thank you!

Lukas


* Re: [PATCH] perf/x86: fix event counter update issue
  2016-12-05 10:25               ` Peter Zijlstra
  2016-12-05 11:21                 ` Odzioba, Lukasz
@ 2017-02-22 14:49                 ` Vince Weaver
  2017-02-22 15:28                   ` Liang, Kan
  1 sibling, 1 reply; 27+ messages in thread
From: Vince Weaver @ 2017-02-22 14:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Odzioba, Lukasz, Liang, Kan, Stephane Eranian, mingo, LKML,
	Alexander Shishkin, ak

On Mon, 5 Dec 2016, Peter Zijlstra wrote:

> ---
> Subject: perf,x86: Fix full width counter, counter overflow
> Date: Tue, 29 Nov 2016 20:33:28 +0000
> 
> Lukasz reported that perf stat counters overflow is broken on KNL/SLM.
> 
> Both these parts have full_width_write set, and that does indeed have
> a problem. In order to deal with counter wrap, we must sample the
> counter at at least half the counter period (see also the sampling
> theorem) such that we can unambiguously reconstruct the count.

I know I'm a bit late to this issue, but I suddenly have PAPI users being 
very worried that their results are going to be wrong and I hadn't heard 
about this until recently.

I'm trying to make a reproducer test and want to make sure I understand 
the issue.  (And I don't have any of the easy trigger hardware either. 
And what is SLM?  Silvermont?  Are we really that short on Changelog space
that we can't spell out the abbreviations to make things clear for 
non-Intel employees?)

So from what I understand, the issue is if we have an architecture with 
full-width counters and we trigger a x86_perf_event_update() when bit
47 is set?

So if I have a test that runs in a loop for 2^48 retired instructions
(which takes ~12 hours on a recent machine) and then reads the results,
they might be wrong?

It sounds like this can also be triggered by a sampling event with a 
really long period, but I couldn't puzzle out from the Changelog exactly 
how to reproduce this (or even how serious the issue is).

Vince


* RE: [PATCH] perf/x86: fix event counter update issue
  2017-02-22 14:49                 ` Vince Weaver
@ 2017-02-22 15:28                   ` Liang, Kan
  2017-02-22 19:18                     ` Andi Kleen
  2017-02-23 15:07                     ` Vince Weaver
  0 siblings, 2 replies; 27+ messages in thread
From: Liang, Kan @ 2017-02-22 15:28 UTC (permalink / raw)
  To: Vince Weaver, Peter Zijlstra
  Cc: Odzioba, Lukasz, Stephane Eranian, mingo, LKML, Alexander Shishkin, ak


> 
> On Mon, 5 Dec 2016, Peter Zijlstra wrote:
> 
> > ---
> > Subject: perf,x86: Fix full width counter, counter overflow
> > Date: Tue, 29 Nov 2016 20:33:28 +0000
> >
> > Lukasz reported that perf stat counters overflow is broken on KNL/SLM.
> >
> > Both these parts have full_width_write set, and that does indeed have
> > a problem. In order to deal with counter wrap, we must sample the
> > counter at at least half the counter period (see also the sampling
> > theorem) such that we can unambiguously reconstruct the count.
> 
> I know I'm a bit late to this issue, but I suddenly have PAPI users being very
> worried that their results are going to be wrong ad I hadn't heard about
> this until recently.
> 
> I'm trying to make a reproducer test and want to make sure I understand
> the issue.  (And I don't have any of the easy trigger hardware either.
> And what is SLM?  Silvermont?  

Yes.

> Are we really that short on Changelog space
> that we can't spell out the abbreviations to make things clear for non-Intel
> employees?)

Here is the original patch I posted for this issue, which includes the test cases.
https://lkml.org/lkml/2016/11/28/540
After the discussion, my patch was discarded. Now, we use Peter's fix.

> 
> So from what I understand, the issue is if we have an architecture with full-
> width counters and we trigger a x86_perf_event_update() when bit
> 47 is set?

No. It's related to the counter width. The number of bits we can use should be
1 bit less than the total width. Otherwise, there will be a problem.
For big cores such as haswell, broadwell, skylake, the counter width is 48 bits,
so we can only use 47 bits.
For Silvermont and KNL, the counter width is only 32 bits I think, so we can only
use 31 bits.
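
As a cross-check, the counter width of a given part can be read from
CPUID leaf 0xA; a minimal sketch (an illustration only, using the
architectural PMU field layout):

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
        unsigned int eax, ebx, ecx, edx;

        /* CPUID leaf 0xA enumerates the architectural PMU */
        if (!__get_cpuid(0xa, &eax, &ebx, &ecx, &edx)) {
                fprintf(stderr, "CPUID leaf 0xA not available\n");
                return 1;
        }

        printf("PMU version          : %u\n", eax & 0xff);
        printf("GP counter bit width : %u\n", (eax >> 16) & 0xff);
        printf("fixed counter width  : %u\n", (edx >> 5) & 0xff);
        return 0;
}

(The kernel prints the same value at boot in the perf init messages, as
"... bit width: NN", if memory serves.)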

> 
> So if I have a test that runs in a loop for 2^48 retired instructions (which
> takes ~12 hours on a recent machine) and then reads the results, they
> might be wrong?

It only takes several minutes to reproduce the issue on SLM/KNL.

Thanks,
Kan 

> 
> It sounds like this can also be triggered by a sampling event with a really
> long period, but I couldn't puzzle out from the Changelog exactly how to
> reproduce this (or even how serious the issue is).
> 
> Vince
> 


* Re: [PATCH] perf/x86: fix event counter update issue
  2017-02-22 15:28                   ` Liang, Kan
@ 2017-02-22 19:18                     ` Andi Kleen
  2017-02-23 15:07                     ` Vince Weaver
  1 sibling, 0 replies; 27+ messages in thread
From: Andi Kleen @ 2017-02-22 19:18 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Vince Weaver, Peter Zijlstra, Odzioba, Lukasz, Stephane Eranian,
	mingo, LKML, Alexander Shishkin

> No. It related to the counter width. The number of bits we can use should be
> 1 bit less than the total width. Otherwise, there will be problem.
> For big cores such as haswell, broadwell, skylake, the counter width is 48 bit.
> So we can only use 47 bits.
> For Silvermont and KNL, the counter width is only 32 bit I think. So we can only
> use 31 bits.

It is 40 bits on these cores, so 39 bits.

-Andi


* RE: [PATCH] perf/x86: fix event counter update issue
  2017-02-22 15:28                   ` Liang, Kan
  2017-02-22 19:18                     ` Andi Kleen
@ 2017-02-23 15:07                     ` Vince Weaver
  2017-02-23 16:14                       ` Liang, Kan
  1 sibling, 1 reply; 27+ messages in thread
From: Vince Weaver @ 2017-02-23 15:07 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Peter Zijlstra, Odzioba, Lukasz, Stephane Eranian, mingo, LKML,
	Alexander Shishkin, ak

On Wed, 22 Feb 2017, Liang, Kan wrote:

> > So from what I understand, the issue is if we have an architecture with full-
> > width counters and we trigger a x86_perf_event_update() when bit
> > 47 is set?
> 
> No. It related to the counter width. The number of bits we can use should be
> 1 bit less than the total width. Otherwise, there will be problem.
> For big cores such as haswell, broadwell, skylake, the counter width is 48 bit.
> So we can only use 47 bits.
> For Silvermont and KNL, the counter width is only 32 bit I think. So we can only
> use 31 bits.

So on a machine with 48-bit counters I should just have a counting event
that counts to somewhere above 0x8000 0000 0001 and it should show 
problems?
Because I am unable to trigger this.

But I guess if anywhere along the line x86_perf_event_update() is run
then you start over?

I noticed your original reproducer bound the event to a core, is that 
needed to trigger this?

Can it happen on a fixed event or only a general purpose event?

> > So if I have a test that runs in a loop for 2^48 retired instructions (which
> > takes ~12 hours on a recent machine) and then reads the results, they
> > might be wrong?
> 
> It only needs several minutes to reproduce the issue on SLM/KNL.

Yes, but I only have machines with 48-bit counters.  So it's going to take 
256 times as long as on a machine with 40-bit counters.

I have an assembly loop that can consistently generate 2 instructions/cycle
(I'd be glad to hear suggestions for events that count faster) and on
a broadwell-ep machine it still takes at least 7 hours or so to get up
to 0x800000000000.
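
For reference, a sketch of such a loop (hypothetical; the instruction
mix and achievable IPC will differ per core):

#include <stdint.h>

/* independent ALU ops so the adds can retire in parallel with the loop overhead */
static void spin(uint64_t iters)
{
        uint64_t i;

        for (i = 0; i < iters; i++)
                asm volatile("add $1, %%r8\n\t"
                             "add $1, %%r9\n\t"
                             "add $1, %%r10\n\t"
                             "add $1, %%r11"
                             ::: "r8", "r9", "r10", "r11");
}

int main(void)
{
        spin(1ULL << 34);       /* scale the iteration count toward 2^47 retired instructions */
        return 0;
}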

Vince


* RE: [PATCH] perf/x86: fix event counter update issue
  2017-02-23 15:07                     ` Vince Weaver
@ 2017-02-23 16:14                       ` Liang, Kan
  0 siblings, 0 replies; 27+ messages in thread
From: Liang, Kan @ 2017-02-23 16:14 UTC (permalink / raw)
  To: Vince Weaver
  Cc: Peter Zijlstra, Odzioba, Lukasz, Stephane Eranian, mingo, LKML,
	Alexander Shishkin, ak


> 
> On Wed, 22 Feb 2017, Liang, Kan wrote:
> 
> > > So from what I understand, the issue is if we have an architecture
> > > with full- width counters and we trigger a x86_perf_event_update()
> > > when bit
> > > 47 is set?
> >
> > No. It related to the counter width. The number of bits we can use
> > should be
> > 1 bit less than the total width. Otherwise, there will be problem.
> > For big cores such as haswell, broadwell, skylake, the counter width is 48
> bit.
> > So we can only use 47 bits.
> > For Silvermont and KNL, the counter width is only 32 bit I think. So
> > we can only use 31 bits.
> 
> So on a machine with 48-bit counters I should just have a counting event
> that counts to somewhere above 0x8000 0000 0001 and it should show
> problems?
Yes

> Because I am unable to trigger this.
> 
> But I guess if anywhere along the line x86_perf_event_update() is run then
> you start over?
> 

Probably. It depends on how much of the period is left.

> I noticed your original reproducer bound the event to a core, is that needed
> to trigger this?

I don't think it's needed, but I didn't try it without binding the event.

> 
> Can it happen on a fixed event or only a genearl purpose event?

I think it can happen on both, because fixed counters and GP counters have the
same counter width and code path.

> 
> > > So if I have a test that runs in a loop for 2^48 retired
> > > instructions (which takes ~12 hours on a recent machine) and then
> > > reads the results, they might be wrong?
> >
> > It only needs several minutes to reproduce the issue on SLM/KNL.
> 
> Yes, but I only have machines with 48-bit counters.  So it's going to take
> 256 times as long as on a machine with 40-bit counters.
> 
> I have an assembly loop that can consistently generate 2 instructions/cycle
> (I'd be glad to hear suggestions for events that count faster) and on a
> broadwell-ep machine it still takes at least 7 hours or so to get up to
> 0x800000000000.

I think you could use the msr tool to write a big number into IA32_PMC0
during your test.
The writable (full-width) IA32_PMC0 alias is 0x4C1.
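
If a ready-made tool is not at hand, a minimal sketch through the msr
driver does the same (an illustration only; it assumes root and the msr
module loaded, and poking a live counter like this is strictly for
experiments):

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        uint64_t val = 0x00007ffffffffff0ULL;           /* just below bit 47 of a 48-bit counter */
        int fd = open("/dev/cpu/1/msr", O_WRONLY);      /* CPU 1, as in the reproducer */

        if (fd < 0) {
                perror("open /dev/cpu/1/msr");
                return 1;
        }
        /* the msr driver takes the MSR number as the file offset;
         * 0x4c1 is the full-width IA32_PMC0 alias */
        if (pwrite(fd, &val, sizeof(val), 0x4c1) != (ssize_t)sizeof(val)) {
                perror("wrmsr IA32_PMC0");
                close(fd);
                return 1;
        }
        close(fd);
        return 0;
}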


Thanks,
Kan


end of thread

Thread overview: 27+ messages
2016-11-28 19:26 [PATCH] perf/x86: fix event counter update issue kan.liang
2016-11-28 19:41 ` Stephane Eranian
2016-11-28 19:59   ` Liang, Kan
2016-11-28 20:18     ` Stephane Eranian
2016-11-28 20:23       ` Liang, Kan
2016-11-29  9:25 ` Peter Zijlstra
2016-11-29 14:46   ` Liang, Kan
2016-11-29 16:58     ` Peter Zijlstra
2016-11-29 17:06       ` Liang, Kan
2016-11-29 17:17         ` Peter Zijlstra
2016-11-29 17:20   ` Stephane Eranian
2016-11-29 17:30     ` Peter Zijlstra
2016-11-29 18:11       ` Stephane Eranian
2016-11-29 18:37       ` Andi Kleen
2016-11-29 19:07       ` Liang, Kan
2016-11-29 19:32         ` Peter Zijlstra
2016-11-29 20:33           ` Liang, Kan
2016-11-29 20:37             ` Stephane Eranian
2016-12-02 12:58             ` Odzioba, Lukasz
2016-12-05 10:25               ` Peter Zijlstra
2016-12-05 11:21                 ` Odzioba, Lukasz
2017-02-22 14:49                 ` Vince Weaver
2017-02-22 15:28                   ` Liang, Kan
2017-02-22 19:18                     ` Andi Kleen
2017-02-23 15:07                     ` Vince Weaver
2017-02-23 16:14                       ` Liang, Kan
2016-11-29 19:08     ` Odzioba, Lukasz
