* Re: oprofile and ARM A9 hardware counter
       [not found] ` <CAMsRxfJH+9MS=Svmh0BvDSb8xqsPe2HPobkeKcn8JO35npjhwQ@mail.gmail.com>
@ 2012-02-08  2:24   ` Ming Lei
  2012-02-15 16:38     ` Peter Zijlstra
  0 siblings, 1 reply; 12+ messages in thread
From: Ming Lei @ 2012-02-08  2:24 UTC (permalink / raw)
  To: eranian
  Cc: Shilimkar, Santosh, David Long, b-cousson, mans, will.deacon,
	linux-arm, Peter Zijlstra, Ingo Molnar, Linux Kernel Mailing List

Hi,

CC lkml and the perf folks, since this looks related to the perf core.

On Tue, Feb 7, 2012 at 7:59 PM, stephane eranian
<eranian@googlemail.com> wrote:
> An easier way to verify we're getting the right number of samples is
> to use perf top:
>
> $ taskset -c 1 noploop 1000 &
> $ sudo perf top
>
> You'll see around 850 irqs/sec, should be closer to 1000.
> But if I drop the rate to 100Hz, then it works:
>
> $ sudo perf top -F 100
>
> That leads me to believe there is too much overhead somewhere.
> Could be in perf_event itself.

The issue does appear to be caused by perf_event itself, but it has
nothing to do with overhead. On OMAP4, HZ is 128, and
perf_rotate_context may set a new sample period (~8ms), which is much
longer than the 1ms expected in 1000Hz frequency mode, so fewer sample
events are observed. x86 is not affected since its HZ is 1000.

With patch[1], about 10000 sample events are generated after running
'perf record -e cycles ./noploop 10' and 'perf report -D | tail -20'
on a Panda board. I am not sure patch[1] is the right fix, but it does
demonstrate the problem.
thanks,
--
Ming Lei

[1], fix adjusting frequency in perf_rotate_context

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 32b48c8..db4faf2 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2300,14 +2300,12 @@ do {					\
 	return div64_u64(dividend, divisor);
 }

-static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count)
+static void perf_adjust_period(struct perf_event *event, u64 period, u64 count)
 {
 	struct hw_perf_event *hwc = &event->hw;
-	s64 period, sample_period;
+	s64 sample_period;
 	s64 delta;

-	period = perf_calculate_period(event, nsec, count);
-
 	delta = (s64)(period - hwc->sample_period);
 	delta = (delta + 7) / 8; /* low pass filter */

@@ -2363,8 +2361,13 @@ static void perf_ctx_adjust_freq(struct perf_event_context *ctx, u64 period)
 		delta = now - hwc->freq_count_stamp;
 		hwc->freq_count_stamp = now;

-		if (delta > 0)
+		if (delta > 0) {
+			period = perf_calculate_period(event, period, delta);
+
+			if (period > 4*hwc->sample_period)
+				period = hwc->sample_period;
 			perf_adjust_period(event, period, delta);
+		}
 	}
 }

@@ -4533,8 +4536,10 @@ static int __perf_event_overflow(struct perf_event *event,

 		hwc->freq_time_stamp = now;

-		if (delta > 0 && delta < 2*TICK_NSEC)
+		if (delta > 0 && delta < 2*TICK_NSEC) {
+			delta = perf_calculate_period(event, delta, hwc->last_period);
 			perf_adjust_period(event, delta, hwc->last_period);
+		}
 	}

 /*

^ permalink raw reply related	[flat|nested] 12+ messages in thread
* Re: oprofile and ARM A9 hardware counter
  2012-02-08  2:24   ` oprofile and ARM A9 hardware counter Ming Lei
@ 2012-02-15 16:38     ` Peter Zijlstra
  2012-02-16 10:25       ` Ming Lei
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2012-02-15 16:38 UTC (permalink / raw)
  To: Ming Lei
  Cc: eranian, Shilimkar, Santosh, David Long, b-cousson, mans,
	will.deacon, linux-arm, Ingo Molnar, Linux Kernel Mailing List

On Wed, 2012-02-08 at 10:24 +0800, Ming Lei wrote:
>
> On OMAP4, HZ is 128, and perf_rotate_context may set a new sample period (~8ms),
> which is much longer than 1ms in 1000HZ freq mode, so less sample events are
> observed. X86 isn't affected since its HZ is 1000.
>
> With patch[1], about 10000 sample events can be generated after running
> 'perf record -e cycles ./noploop 10' and 'perf report -D | tail -20'
> on panda board.

> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 32b48c8..db4faf2 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2300,14 +2300,12 @@ do {					\
>  	return div64_u64(dividend, divisor);
>  }
>
> -static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count)
> +static void perf_adjust_period(struct perf_event *event, u64 period, u64 count)
>  {
>  	struct hw_perf_event *hwc = &event->hw;
> -	s64 period, sample_period;
> +	s64 sample_period;
>  	s64 delta;
>
> -	period = perf_calculate_period(event, nsec, count);
> -
>  	delta = (s64)(period - hwc->sample_period);
>  	delta = (delta + 7) / 8; /* low pass filter */
>
> @@ -2363,8 +2361,13 @@ static void perf_ctx_adjust_freq(struct perf_event_context *ctx, u64 period)
>  		delta = now - hwc->freq_count_stamp;
>  		hwc->freq_count_stamp = now;
>
> -		if (delta > 0)
> +		if (delta > 0) {
> +			period = perf_calculate_period(event, period, delta);
> +
> +			if (period > 4*hwc->sample_period)
> +				period = hwc->sample_period;
>  			perf_adjust_period(event, period, delta);
> +		}
>  	}
>  }
>
> @@ -4533,8 +4536,10 @@ static int __perf_event_overflow(struct perf_event *event,
>
>  		hwc->freq_time_stamp = now;
>
> -		if (delta > 0 && delta < 2*TICK_NSEC)
> +		if (delta > 0 && delta < 2*TICK_NSEC) {
> +			delta = perf_calculate_period(event, delta, hwc->last_period);
>  			perf_adjust_period(event, delta, hwc->last_period);
> +		}
>  	}
>
>  /*

So what this patch seems to do is put that filter on period in
perf_ctx_adjust_freq(). Not making sense.. nor can I see a HZ
dependency, perf_ctx_adjust_freq() uses TICK_NSEC as time base.
* Re: oprofile and ARM A9 hardware counter
  2012-02-15 16:38     ` Peter Zijlstra
@ 2012-02-16 10:25       ` Ming Lei
  2012-02-16 15:00         ` Will Deacon
  0 siblings, 1 reply; 12+ messages in thread
From: Ming Lei @ 2012-02-16 10:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: eranian, Shilimkar, Santosh, David Long, b-cousson, mans,
	will.deacon, linux-arm, Ingo Molnar, Linux Kernel Mailing List

Hi,

On Thu, Feb 16, 2012 at 12:38 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> So what this patch seems to do is put that filter on period in
> perf_ctx_adjust_freq(). Not making sense.. nor can I see a HZ
> dependency, perf_ctx_adjust_freq() uses TICK_NSEC as time base.

Yes, you are right. I observed the original problem on -rc1, before
Stephane's unthrottling patch was merged. Today I investigated the
problem further on -rc3, and it now looks as if the problem is caused
by the ARM PMU code.

The patch below may fix the problem; with it, about 40000 sample events
are generated by the command:

'perf record -e cycles -F 4000 ./noploop 10 && perf report -D | tail -20'

armpmu_event_update may be called in the tick path, so a running counter
can overflow and produce a huge 'delta', and a wrong count is then stored
into event->count and event->hw.freq_count_stamp. The two variables fall
out of sync, so an invalid, very large period is computed and written to
the PMU, and far fewer sample events are generated.

Will, this patch simplifies the 'delta' computation and does not use the
overflow flag, even though it could be read directly from PMOVSR. Could
you comment on the patch?
diff --git a/arch/arm/kernel/perf_event.c b/arch/arm/kernel/perf_event.c
index 5bb91bf..789700a 100644
--- a/arch/arm/kernel/perf_event.c
+++ b/arch/arm/kernel/perf_event.c
@@ -193,13 +193,8 @@ again:
 			     new_raw_count) != prev_raw_count)
 		goto again;

-	new_raw_count &= armpmu->max_period;
-	prev_raw_count &= armpmu->max_period;
-
-	if (overflow)
-		delta = armpmu->max_period - prev_raw_count + new_raw_count + 1;
-	else
-		delta = new_raw_count - prev_raw_count;
+	delta = (armpmu->max_period - prev_raw_count + new_raw_count
+			+ 1) & armpmu->max_period;

 	local64_add(delta, &event->count);
 	local64_sub(delta, &hwc->period_left);

thanks,
--
Ming Lei
* Re: oprofile and ARM A9 hardware counter
  2012-02-16 10:25       ` Ming Lei
@ 2012-02-16 15:00         ` Will Deacon
  2012-02-16 16:12           ` Ming Lei
  0 siblings, 1 reply; 12+ messages in thread
From: Will Deacon @ 2012-02-16 15:00 UTC (permalink / raw)
  To: Ming Lei
  Cc: Peter Zijlstra, eranian, Shilimkar, Santosh, David Long,
	b-cousson, mans, linux-arm, Ingo Molnar, Linux Kernel Mailing List

On Thu, Feb 16, 2012 at 10:25:05AM +0000, Ming Lei wrote:
> On Thu, Feb 16, 2012 at 12:38 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> >
> > So what this patch seems to do is put that filter on period in
> > perf_ctx_adjust_freq(). Not making sense.. nor can I see a HZ
> > dependency, perf_ctx_adjust_freq() uses TICK_NSEC as time base.
>
> Yes, you are right, I remembered it was observed it on -rc1, and
> Stephane's unthrottling patch was not merged at that time. Today I
> investigated the problem further on -rc3 and found that seems the
> problem is caused by arm pmu code.

As I reported previously, Stephane's patch is causing warnings on -rc3:

http://lists.infradead.org/pipermail/linux-arm-kernel/2012-February/084391.html

so I'd like to get to the bottom of that before changing anything else.
I'd also like to know why this has only been reported on OMAP4 and I
can't reproduce it on my boards.

> The patch below may fix the problem, now about 40000 sample events
> can be generated on the command:
>
> 'perf record -e cycles -F 4000 ./noploop 10 && perf report -D | tail -20'
>
> armpmu_event_update may be called in tick path, so the running counter
> will be overflowed and produce a great value of 'delta', then a mistaken
> count is stored into event->count and event->hw.freq_count_stamp. Finally
> the two variables are not synchronous, then a invalid and large period is
> computed and written to pmu, and sample events are decreased much.

Hmm, so are you observing an event overflow during the tick handler? This
should be fine unless the new value has wrapped past the previous one
(i.e. more than 2^32 events have occurred). I find this extremely
unlikely for sample-based profiling unless you have some major IRQ
latency issues...

The only way I can think of improving this (bearing in mind that at some
point we're limited by 32 bits of counter) is to check for overflow in
the tick path and then invoke the PMU irq handler if there is an
overflow, but that's really not very nice.

> diff --git a/arch/arm/kernel/perf_event.c b/arch/arm/kernel/perf_event.c
> index 5bb91bf..789700a 100644
> --- a/arch/arm/kernel/perf_event.c
> +++ b/arch/arm/kernel/perf_event.c
> @@ -193,13 +193,8 @@ again:
>  			     new_raw_count) != prev_raw_count)
>  		goto again;
>
> -	new_raw_count &= armpmu->max_period;
> -	prev_raw_count &= armpmu->max_period;
> -
> -	if (overflow)
> -		delta = armpmu->max_period - prev_raw_count + new_raw_count + 1;
> -	else
> -		delta = new_raw_count - prev_raw_count;
> +	delta = (armpmu->max_period - prev_raw_count + new_raw_count
> +			+ 1) & armpmu->max_period;

This breaks when more than max_period events have passed. See a737823d
("ARM: 6835/1: perf: ensure overflows aren't missed due to IRQ latency").

Will
* Re: oprofile and ARM A9 hardware counter
  2012-02-16 15:00         ` Will Deacon
@ 2012-02-16 16:12           ` Ming Lei
  2012-02-16 16:19             ` Peter Zijlstra
  0 siblings, 1 reply; 12+ messages in thread
From: Ming Lei @ 2012-02-16 16:12 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, eranian, Shilimkar, Santosh, David Long,
	b-cousson, mans, linux-arm, Ingo Molnar, Linux Kernel Mailing List

On Thu, Feb 16, 2012 at 11:00 PM, Will Deacon <will.deacon@arm.com> wrote:
> As I reported previously, Stephane's patch is causing warnings on -rc3:
>
> http://lists.infradead.org/pipermail/linux-arm-kernel/2012-February/084391.html
>
> so I'd like to get to the bottom of that before changing anything else.

It looks like I am lucky enough not to see the warning on OMAP4, on
either -rc3 or -rc3-next-20120210. :-) Maybe we have different config
options.

> I'd also like to know why this has only been reported on OMAP4 and I
> can't reproduce it on my boards.
>
> Hmm, so are you observing an event overflow during the tick handler?

Yes, I can observe it without much difficulty.

> This should be fine unless the new value has wrapped past the previous
> one (i.e. more than 2^32 events have occurred). I find this extremely
> unlikely for sample-based profiling unless you have some major IRQ
> latency issues...

IMO it is not so difficult to hit. Suppose prev_raw_count is 1000000 and
-prev_raw_count was written to the PMU counter; the counter then expires,
and if the PMU IRQ is not handled quickly enough, the counter wraps and
starts counting up from zero. If the tick is scheduled just before the
PMU IRQ is handled, armpmu_event_update() is called and reads the PMU
counter as 'new_raw_count' -- suppose it reads 100 -- and the issue is
triggered:

	u64 delta = 100 - 1000000 = 18446744073708551716.

The higher the sampling frequency, the easier the problem is to reproduce.

> The only way I can think of improving this (bearing in mind that at some
> point we're limited by 32 bits of counter) is to check for overflow in
> the tick path and then invoke the PMU irq handler if there is an
> overflow, but that's really not very nice.

Alternatively, we could remove the 'overflow' parameter from
armpmu_event_update and introduce an armpmu->is_overflow(idx) callback
so that armpmu_event_update can check for overflow itself. IMO the PMU
IRQ cannot be lost, so there is no need to invoke the PMU IRQ handler
from the tick path.

> > -	if (overflow)
> > -		delta = armpmu->max_period - prev_raw_count + new_raw_count + 1;
> > -	else
> > -		delta = new_raw_count - prev_raw_count;
> > +	delta = (armpmu->max_period - prev_raw_count + new_raw_count
> > +			+ 1) & armpmu->max_period;
>
> This breaks when more than max_period events have passed. See a737823d
> ("ARM: 6835/1: perf: ensure overflows aren't missed due to IRQ latency").

Sorry, I had not noticed that commit and the problem it addresses; it
looks like the 'overflow' information really is needed by
armpmu_event_update.

thanks,
--
Ming Lei
* Re: oprofile and ARM A9 hardware counter
  2012-02-16 16:12           ` Ming Lei
@ 2012-02-16 16:19             ` Peter Zijlstra
  2012-02-16 16:37               ` Ming Lei
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2012-02-16 16:19 UTC (permalink / raw)
  To: Ming Lei
  Cc: Will Deacon, eranian, Shilimkar, Santosh, David Long, b-cousson,
	mans, linux-arm, Ingo Molnar, Linux Kernel Mailing List

On Fri, 2012-02-17 at 00:12 +0800, Ming Lei wrote:
> is triggered: u64 delta = 100 - 1000000 = 18446744073708551716.

on x86 we do:

	int shift = 64 - x86_pmu.cntval_bits;
	s64 delta;

	delta = (new_raw_count << shift) - (prev_raw_count << shift);
	delta >>= shift;

This deals with short overflows (on x86 the registers are typically 40
or 48 bits wide). If the arm register is 32 you can of course also get
there with some u32 casts.
* Re: oprofile and ARM A9 hardware counter
  2012-02-16 16:19             ` Peter Zijlstra
@ 2012-02-16 16:37               ` Ming Lei
  2012-02-16 18:08                 ` Will Deacon
  0 siblings, 1 reply; 12+ messages in thread
From: Ming Lei @ 2012-02-16 16:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Will Deacon, eranian, Shilimkar, Santosh, David Long, b-cousson,
	mans, linux-arm, Ingo Molnar, Linux Kernel Mailing List

On Fri, Feb 17, 2012 at 12:19 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Fri, 2012-02-17 at 00:12 +0800, Ming Lei wrote:
>> is triggered: u64 delta = 100 - 1000000 = 18446744073708551716.
>
> on x86 we do:
>
> 	int shift = 64 - x86_pmu.cntval_bits;
> 	s64 delta;
>
> 	delta = (new_raw_count << shift) - (prev_raw_count << shift);
> 	delta >>= shift;
>
> This deals with short overflows (on x86 the registers are typically 40
> or 48 bits wide). If the arm register is 32 you can of course also get
> there with some u32 casts.

Good idea, but it may not work if new_raw_count is bigger than
prev_raw_count.

thanks,
--
Ming Lei
* Re: oprofile and ARM A9 hardware counter
  2012-02-16 16:37               ` Ming Lei
@ 2012-02-16 18:08                 ` Will Deacon
  2012-02-17  5:24                   ` Ming Lei
  0 siblings, 1 reply; 12+ messages in thread
From: Will Deacon @ 2012-02-16 18:08 UTC (permalink / raw)
  To: Ming Lei
  Cc: Peter Zijlstra, eranian, Shilimkar, Santosh, David Long,
	b-cousson, mans, linux-arm, Ingo Molnar, Linux Kernel Mailing List

On Thu, Feb 16, 2012 at 04:37:35PM +0000, Ming Lei wrote:
> On Fri, Feb 17, 2012 at 12:19 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > on x86 we do:
> >
> > 	int shift = 64 - x86_pmu.cntval_bits;
> > 	s64 delta;
> >
> > 	delta = (new_raw_count << shift) - (prev_raw_count << shift);
> > 	delta >>= shift;
> >
> > This deals with short overflows (on x86 the registers are typically 40
> > or 48 bits wide). If the arm register is 32 you can of course also get
> > there with some u32 casts.
>
> Good idea, but it may not work if new_raw_count is bigger than
> prev_raw_count.

The more I think about this, the more I think that the overflow parameter to
armpmu_event_update needs to go. It was introduced to prevent massive event
loss in non-sampling mode, but I think we can get around that by changing
the default sample_period to be half of the max_period, therefore giving
ourselves a much better chance of handling the interrupt before new wraps
around past prev.

Ming Lei - can you try the following please? If it works for you, then I'll
do it properly and kill the overflow parameter altogether.

Thanks,

Will

diff --git a/arch/arm/kernel/perf_event.c b/arch/arm/kernel/perf_event.c
index 5bb91bf..ef597a3 100644
--- a/arch/arm/kernel/perf_event.c
+++ b/arch/arm/kernel/perf_event.c
@@ -193,13 +193,7 @@ again:
 			     new_raw_count) != prev_raw_count)
 		goto again;

-	new_raw_count &= armpmu->max_period;
-	prev_raw_count &= armpmu->max_period;
-
-	if (overflow)
-		delta = armpmu->max_period - prev_raw_count + new_raw_count + 1;
-	else
-		delta = new_raw_count - prev_raw_count;
+	delta = (new_raw_count - prev_raw_count) & armpmu->max_period;

 	local64_add(delta, &event->count);
 	local64_sub(delta, &hwc->period_left);
@@ -518,7 +512,7 @@ __hw_perf_event_init(struct perf_event *event)
 		hwc->config_base |= (unsigned long)mapping;

 	if (!hwc->sample_period) {
-		hwc->sample_period  = armpmu->max_period;
+		hwc->sample_period  = armpmu->max_period >> 1;
 		hwc->last_period    = hwc->sample_period;
 		local64_set(&hwc->period_left, hwc->sample_period);
 	}
* Re: oprofile and ARM A9 hardware counter
  2012-02-16 18:08                 ` Will Deacon
@ 2012-02-17  5:24                   ` Ming Lei
  2012-02-17 10:26                     ` Will Deacon
  0 siblings, 1 reply; 12+ messages in thread
From: Ming Lei @ 2012-02-17  5:24 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, eranian, Shilimkar, Santosh, David Long,
	b-cousson, mans, linux-arm, Ingo Molnar, Linux Kernel Mailing List

On Fri, Feb 17, 2012 at 2:08 AM, Will Deacon <will.deacon@arm.com> wrote:
>
> The more I think about this, the more I think that the overflow parameter to
> armpmu_event_update needs to go. It was introduced to prevent massive event
> loss in non-sampling mode, but I think we can get around that by changing
> the default sample_period to be half of the max_period, therefore giving
> ourselves a much better chance of handling the interrupt before new wraps
> around past prev.
>
> Ming Lei - can you try the following please? If it works for you, then I'll
> do it properly and kill the overflow parameter altogether.

Of course it works for the problem reported by Stephane, since it changes
the delta computation in essentially the same way my patch did. However,
I am afraid it may not be good enough for the issue fixed in a737823d
("ARM: 6835/1: perf: ensure overflows aren't missed due to IRQ latency").

> 	if (!hwc->sample_period) {
> -		hwc->sample_period  = armpmu->max_period;
> +		hwc->sample_period  = armpmu->max_period >> 1;

If you assume that the issue addressed by a737823d can only happen in the
non-sampling case, Peter's idea of a u32 cast is fine and maybe simpler.
But I am afraid the issue can still be triggered in the sampling case,
especially at a very high frequency: suppose the sample freq is 10000,
then a 100us IRQ delay may trigger the issue.

So we could use the overflow information to make perf more robust, IMO.

thanks,
--
Ming Lei
* Re: oprofile and ARM A9 hardware counter
  2012-02-17  5:24                   ` Ming Lei
@ 2012-02-17 10:26                     ` Will Deacon
  2012-02-20  3:19                       ` Ming Lei
  0 siblings, 1 reply; 12+ messages in thread
From: Will Deacon @ 2012-02-17 10:26 UTC (permalink / raw)
  To: Ming Lei
  Cc: Peter Zijlstra, eranian, Shilimkar, Santosh, David Long,
	b-cousson, mans, linux-arm, Ingo Molnar, Linux Kernel Mailing List

On Fri, Feb 17, 2012 at 05:24:02AM +0000, Ming Lei wrote:
> Of course it works for the problem reported by Stephane, since it changes
> the delta computation in essentially the same way my patch did. However,
> I am afraid it may not be good enough for the issue fixed in a737823d
> ("ARM: 6835/1: perf: ensure overflows aren't missed due to IRQ latency").

I think it does solve that problem.

> If you assume that the issue addressed by a737823d can only happen in the
> non-sampling case, Peter's idea of a u32 cast is fine and maybe simpler.

I don't want to assume that the counters are 32-bit in this code as we may
want to plug in some other PMU with 16-bit counters, for example. That's
why we have max_period defined for each PMU. Furthermore, it still doesn't
help us in the stat case, where prev will typically be quite small after
we've had an overflow and new may well overtake it.

> But I am afraid the issue can still be triggered in the sampling case,
> especially at a very high frequency: suppose the sample freq is 10000,
> then a 100us IRQ delay may trigger the issue.

Not sure I follow. If the frequency is 10000, then we write 0xffffd8f0 to
the counter. That means we have a 0xffffd8f0 event window to read the
counter after it overflows before new overtakes prev and we get confused.
If this passed in 100us then either your clock speed is 4.3*10^12Hz or you
have a seriously wide issue :)

> So we could use the overflow information to make perf more robust, IMO.

There's a trade-off between allowing the counter to wrap back around past
its previous value and being able to handle overflow on a non-interrupt
path. Given that the former problem is only a real issue for non-sampling
runs, halving the period in that case should sort it (and it does stop my
simple test case from exploding).

Will
* Re: oprofile and ARM A9 hardware counter
  2012-02-17 10:26                     ` Will Deacon
@ 2012-02-20  3:19                       ` Ming Lei
  2012-02-20  9:44                         ` Will Deacon
  0 siblings, 1 reply; 12+ messages in thread
From: Ming Lei @ 2012-02-20  3:19 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, eranian, Shilimkar, Santosh, David Long,
	b-cousson, mans, linux-arm, Ingo Molnar, Linux Kernel Mailing List

Hi,

On Fri, Feb 17, 2012 at 6:26 PM, Will Deacon <will.deacon@arm.com> wrote:
> I don't want to assume that the counters are 32-bit in this code as we may
> want to plug in some other PMU with 16-bit counters, for example. That's
> why we have max_period defined for each PMU. Furthermore, it still doesn't
> help us in the stat case, where prev will typically be quite small after
> we've had an overflow and new may well overtake it.

In fact, I had suggested keeping the overflow information to handle that
case by reading the hardware counter's overflow flag, but it looks a bit
fragile to keep the overflow flag in sync with the counter value outside
the interrupt path, so I agree with you on removing the 'overflow'
parameter from armpmu_event_update.

> Not sure I follow. If the frequency is 10000, then we write 0xffffd8f0 to
> the counter. That means we have a 0xffffd8f0 event window to read the
> counter after it overflows before new overtakes prev and we get confused.
> If this passed in 100us then either your clock speed is 4.3*10^12Hz or you
> have a seriously wide issue :)

The frequency I described is the 'freq' from '-F freq'. On OMAP4, when
'freq' is 10000 and the interval for one sample is 100us, the observed
count (the 'left' variable in armpmu_event_set_period) is about 90000,
so the value written to the hardware counter is 0xFFFEA06F. The window
looks wide enough, so we probably need not worry about the a737823d
issue for sample-based profiling.

On OMAP4, about 19800 sample events are generated by the command:

'perf record -e cycles -F 10000 ./noploop 2 && perf report -D | tail -20'

so the result does not look bad. :-)

> There's a trade-off between allowing the counter to wrap back around past
> its previous value and being able to handle overflow on a non-interrupt
> path.

I agree; it is not easy to read the overflow flag and the counter value
atomically from the hardware on a non-interrupt path.

So do you need me to prepare a new patch, or will you do it yourself?

thanks,
--
Ming Lei
* Re: oprofile and ARM A9 hardware counter
  2012-02-20  3:19                       ` Ming Lei
@ 2012-02-20  9:44                         ` Will Deacon
  0 siblings, 0 replies; 12+ messages in thread
From: Will Deacon @ 2012-02-20  9:44 UTC (permalink / raw)
  To: Ming Lei
  Cc: Peter Zijlstra, eranian, Shilimkar, Santosh, David Long,
	b-cousson, mans, linux-arm, Ingo Molnar, Linux Kernel Mailing List

On Mon, Feb 20, 2012 at 03:19:02AM +0000, Ming Lei wrote:
> On Fri, Feb 17, 2012 at 6:26 PM, Will Deacon <will.deacon@arm.com> wrote:
> > Not sure I follow. If the frequency is 10000, then we write 0xffffd8f0 to
> > the counter. That means we have a 0xffffd8f0 event window to read the
> > counter after it overflows.
>
> The frequency I described is the 'freq' from '-F freq'. On OMAP4, when
> 'freq' is 10000 and the interval for one sample is 100us, the observed
> count (the 'left' variable in armpmu_event_set_period) is about 90000,
> so the value written to the hardware counter is 0xFFFEA06F. The window
> looks wide enough, so we probably need not worry about the a737823d
> issue for sample-based profiling.

My mistake - I thought you were referring to a period of 10000, but
similar logic follows.

> I agree; it is not easy to read the overflow flag and the counter value
> atomically from the hardware on a non-interrupt path.
>
> So do you need me to prepare a new patch, or will you do it yourself?

If the last one I posted is alright with you (it seems that it is), then
I'll update it with the removal of the overflow flag and post it to the
list with you on CC. I also need to get to the bottom of those warnings.

Thanks,

Will