* [BUG] perf: perf sched warning possibly due to clock granularity on AMD
@ 2012-02-06 13:25 Stephane Eranian
  2012-02-06 14:26 ` Peter Zijlstra
                   ` (3 more replies)
  0 siblings, 4 replies; 31+ messages in thread
From: Stephane Eranian @ 2012-02-06 13:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: acme, peterz, mingo, robert.richter, eric.dumazet

Hi,

I am running 3.3.0-rc2 on an AMD Bulldozer system in 64-bit mode.
I was testing perf sched and I ran into an issue. That issue seems
to exist only on AMD and not on Intel systems. It is not PMU related
because I am doing tracing.

I am running a simple ping pong test to stress the context switch
code. Two processes exchanging a byte through a pipe (program provided
below).

Then, I capture a trace using perf sched, and I run sched lat, but
on certain runs, I get:

   $ perf sched rec pong 2
   $ perf sched lat
     Warning: Timestamp below last timeslice flush
     Warning: Timestamp below last timeslice flush
     Warning: Timestamp below last timeslice flush
     Warning: Timestamp below last timeslice flush
     Warning: Timestamp below last timeslice flush
     Warning: Timestamp below last timeslice flush
     Warning: Timestamp below last timeslice flush
     Warning: Timestamp below last timeslice flush
     Warning: Timestamp below last timeslice flush
     Found 4934 unknown events!
     Is this an older tool processing a perf.data file generated by a more recent tool?

I did some investigation in the kernel to figure out what
could cause this problem. I found out that with this measurement, it is
possible to get multiple consecutive samples with the SAME timestamp.
That is possible on AMD because, apparently, local_clock() ends up calling
a sched_clock() function that is only based on jiffies. That means the
granularity is a tick (1ms in my case). On Intel systems, at least on mine,
local_clock() ends up calling a routine which uses TSC and thus has a much
greater precision. This is all decided by the sched_clock_stable variable. It
is set at init time on Intel but never set on AMD. Yet, recent AMD systems do
have support for NONSTOP_TSC. It is not clear to me why that would not be enough
(unless it is a cross-socket synchronization issue).
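
For reference, this is roughly the path involved in kernel/sched/clock.c
(trimmed and paraphrased from the 3.3-era source, so details may be off):

u64 local_clock(void)
{
	u64 clock;
	unsigned long flags;

	local_irq_save(flags);
	clock = sched_clock_cpu(smp_processor_id());
	local_irq_restore(flags);

	return clock;
}

u64 sched_clock_cpu(int cpu)
{
	/* trimmed: remote-cpu handling and debug checks omitted */
	if (sched_clock_stable)
		return sched_clock();

	if (unlikely(!sched_clock_running))
		return 0ull;

	/* per-cpu path: the result is bounded by the last/next tick */
	return sched_clock_local(cpu_sdc(cpu));
}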

One thing is clear: if I enable sched_clock_stable on my AMD system, the
perf sched warning disappears.

I believe there really is a bug in perf. I am sure there are systems out there
which cannot provide very fine-grained timestamps, so it is possible to get
samples with identical timestamps. I don't quite understand the sample
re-ordering code in perf, but it seems sensitive to timer granularity, and I
think it should not be.

In summary, two issues:
  - Why is sched_clock_stable not set or even tested on recent AMD systems?
  - perf should not rely on fine granularity timestamps in the re-ordering code

Any comments?

#include <sys/types.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdarg.h>
#include <errno.h>
#include <unistd.h>
#include <string.h>
#include <signal.h>
#include <syscall.h>
#include <err.h>

#define MAX_CPUS	64
#define NR_CPU_BITS	(MAX_CPUS>>3)
int
pin_cpu(pid_t pid, unsigned int cpu)
{
	uint64_t my_mask[NR_CPU_BITS];

	if (cpu >= MAX_CPUS)
		errx(1, "this program supports only up to %d CPUs", MAX_CPUS);

	/* only the target CPU's bit must be set in the affinity mask */
	memset(my_mask, 0, sizeof(my_mask));
	my_mask[cpu>>6] = 1ULL << (cpu&63);

	return syscall(__NR_sched_setaffinity, pid, sizeof(my_mask), &my_mask);
}

static volatile int quit;
void sig_handler(int n)
{
	quit = 1;
}

static void
do_child(int fr, int fw)
{
	char c;
	ssize_t ret;

	for(;;) {
		ret = read(fr, &c, 1);
		if (ret <= 0)	/* error or EOF (parent exited) */
			break;
		ret = write(fw, "c", 1);
		if (ret < 0)
			break;
		
	}
	printf("child exited\n");
	exit(0);
}


int
main(int argc, char **argv)
{
	int ret;
	int pr[2], pw[2];
	int which_cpu;
	pid_t pid;
	uint64_t nctx = 0;
	ssize_t nbytes;
	char c = '0';
	int delay;

	delay = argc > 1 ? atoi(argv[1]) : 10;
	
	srandom(getpid());
	which_cpu = random() % sysconf(_SC_NPROCESSORS_ONLN);

	ret = pipe(pr);
	if (ret)
		err(1, "cannot create read pipe");

	ret = pipe(pw);
	if (ret)
		err(1, "cannot create write pipe");

	ret = pin_cpu(getpid(), which_cpu);
	if (ret)
		err(1, "cannot pin to CPU%d", which_cpu);

	printf("Both processes pinned to CPU%d, running for %ds\n", which_cpu, delay);

	switch(pid=fork()) {
		case -1:
			err(1, "cannot create child");
		case 0:
			close(pr[1]); close(pw[0]);
			do_child(pr[0], pw[1]);
			exit(1);
	}

	close(pr[0]);
	close(pw[1]);

	signal(SIGALRM, sig_handler);
	alarm(delay);

	while(!quit) {
		nbytes = write(pr[1], "c", 1);
		nbytes = read(pw[0], &c, 1);	
		nctx++;
	}
	close(pr[1]);
	close(pw[0]);
	return 0;
}

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD
  2012-02-06 13:25 [BUG] perf: perf sched warning possibly due to clock granularity on AMD Stephane Eranian
@ 2012-02-06 14:26 ` Peter Zijlstra
  2012-02-06 14:31   ` Stephane Eranian
  2012-02-06 15:34   ` Borislav Petkov
  2012-02-07 19:43 ` [tip:perf/core] x86/sched/perf/AMD: " tip-bot for Borislav Petkov
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 31+ messages in thread
From: Peter Zijlstra @ 2012-02-06 14:26 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, acme, mingo, robert.richter, eric.dumazet, Borislav Petkov

On Mon, 2012-02-06 at 14:25 +0100, Stephane Eranian wrote:
> In summary, two issues:
>   - Why is sched_clock_stable not set or even tested on recent AMD systems?

I'll leave that for Robert to answer. I also added Boris to the CC, he
knows AMD bits too.

>   - perf should not rely on fine granularity timestamps in the re-ordering code

Uhm, what else is there?


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD
  2012-02-06 14:26 ` Peter Zijlstra
@ 2012-02-06 14:31   ` Stephane Eranian
  2012-02-06 17:17     ` Arnaldo Carvalho de Melo
  2012-02-06 15:34   ` Borislav Petkov
  1 sibling, 1 reply; 31+ messages in thread
From: Stephane Eranian @ 2012-02-06 14:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, acme, mingo, robert.richter, eric.dumazet, Borislav Petkov

On Mon, Feb 6, 2012 at 3:26 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, 2012-02-06 at 14:25 +0100, Stephane Eranian wrote:
>> In summary, two issues:
>>   - Why is sched_clock_stable not set or even tested on recent AMD systems?
>
> I'll leave that for Robert to answer. I also added Boris to the CC, he
> knows AMD bits too.
>
Ok, good.

>>   - perf should not rely on fine granularity timestamps in the re-ordering code
>
> Uhm, what else is there?
I understand your point. I don't quite understand what the code is
supposed to do for ordered samples. I'd like Arnaldo to explain this
warning first.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD
  2012-02-06 14:26 ` Peter Zijlstra
  2012-02-06 14:31   ` Stephane Eranian
@ 2012-02-06 15:34   ` Borislav Petkov
  2012-02-06 16:37     ` Peter Zijlstra
  1 sibling, 1 reply; 31+ messages in thread
From: Borislav Petkov @ 2012-02-06 15:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Stephane Eranian, linux-kernel, acme, mingo, robert.richter,
	eric.dumazet, Borislav Petkov, Andreas Herrmann

+ Andreas.

On Mon, Feb 06, 2012 at 03:26:43PM +0100, Peter Zijlstra wrote:
> On Mon, 2012-02-06 at 14:25 +0100, Stephane Eranian wrote:
> > In summary, two issues:
> >   - Why is sched_clock_stable not set or even tested on recent AMD systems?

AFAICT, sched_clock_stable is set on Intel under the following conditions:

        /*
         * c->x86_power is 8000_0007 edx. Bit 8 is TSC runs at constant rate
         * with P/T states and does not stop in deep C-states.
         *
         * It is also reliable across cores and sockets. (but not across
         * cabinets - we turn it off in that case explicitly.)
         */
        if (c->x86_power & (1 << 8)) {
                set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
                set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
                if (!check_tsc_unstable())
                        sched_clock_stable = 1;
        }

and yes, we can do CONSTANT_TSC and NONSTOP_TSC on the now older F10h
already, so Bulldozer can do that too, implicitly.

I don't see why sched_clock_stable shouldn't be moved to generic x86
code and set based on the two CPUID cap flags above.

HTH.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD
  2012-02-06 15:34   ` Borislav Petkov
@ 2012-02-06 16:37     ` Peter Zijlstra
  2012-02-06 16:46       ` Borislav Petkov
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2012-02-06 16:37 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Stephane Eranian, linux-kernel, acme, mingo, robert.richter,
	eric.dumazet, Andreas Herrmann

On Mon, 2012-02-06 at 16:34 +0100, Borislav Petkov wrote:
> + Andreas.
> 
> On Mon, Feb 06, 2012 at 03:26:43PM +0100, Peter Zijlstra wrote:
> > On Mon, 2012-02-06 at 14:25 +0100, Stephane Eranian wrote:
> > > In summary, two issues:
> > >   - Why is sched_clock_stable not set or even tested on recent AMD systems?
> 
> AFAICT, sched_clock_stable is set on Intel under the following conditions:
> 
>         /*
>          * c->x86_power is 8000_0007 edx. Bit 8 is TSC runs at constant rate
>          * with P/T states and does not stop in deep C-states.
>          *
>          * It is also reliable across cores and sockets. (but not across
>          * cabinets - we turn it off in that case explicitly.)
>          */
>         if (c->x86_power & (1 << 8)) {
>                 set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
>                 set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
>                 if (!check_tsc_unstable())
>                         sched_clock_stable = 1;
>         }
> 
> and yes, we can do CONSTANT_TSC and NONSTOP_TSC on the now older F10h
> already, so Bulldozer can do that too, implicitly.
> 
> I don't see why sched_clock_stable shouldn't be moved to generic x86
> code and set based on the two CPUID cap flags above.

stable means more than both those flags together.. 

 - CONSTANT_TSC means freq independent
 - NONSTOP_TSC means it doesn't get stopped in any C state

Together they make TSC completely C-state independent.
sched_clock_stable actually means that TSC is both _AND_ is consistent
across all CPUs in the entire system.




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD
  2012-02-06 16:37     ` Peter Zijlstra
@ 2012-02-06 16:46       ` Borislav Petkov
  2012-02-06 16:54         ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Borislav Petkov @ 2012-02-06 16:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Stephane Eranian, linux-kernel, acme, mingo, robert.richter,
	eric.dumazet, Andreas Herrmann

On Mon, Feb 06, 2012 at 05:37:26PM +0100, Peter Zijlstra wrote:
> On Mon, 2012-02-06 at 16:34 +0100, Borislav Petkov wrote:
> > + Andreas.
> > 
> > On Mon, Feb 06, 2012 at 03:26:43PM +0100, Peter Zijlstra wrote:
> > > On Mon, 2012-02-06 at 14:25 +0100, Stephane Eranian wrote:
> > > > In summary, two issues:
> > > >   - Why is sched_clock_stable not set or even tested on recent AMD systems?
> > 
> > AFAICT, sched_clock_stable is set on Intel under the following conditions:
> > 
> >         /*
> >          * c->x86_power is 8000_0007 edx. Bit 8 is TSC runs at constant rate
> >          * with P/T states and does not stop in deep C-states.
> >          *
> >          * It is also reliable across cores and sockets. (but not across
> >          * cabinets - we turn it off in that case explicitly.)
> >          */
> >         if (c->x86_power & (1 << 8)) {
> >                 set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
> >                 set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
> >                 if (!check_tsc_unstable())
> >                         sched_clock_stable = 1;
> >         }
> > 
> > and yes, we can do CONSTANT_TSC and NONSTOP_TSC on the now older F10h
> > already, so Bulldozer can do that too, implicitly.
> > 
> > I don't see why sched_clock_stable shouldn't be moved to generic x86
> > code and set based on the two CPUID cap flags above.
> 
> stable means more than both those flags together.. 
> 
>  - CONSTANT_TSC means freq independent
>  - NONSTOP_TSC means it doesn't get stopped in any C state
> 
> Together they make TSC completely C-state independent.
> sched_clock_stable actually means that TSC is both _AND_ is consistent

yes.

> across all CPUs in the entire system.

Right, by the "entire system" you mean consistent across cores and
sockets but not necessarily across cabinets, as in the comment above,
correct?

If so, let me ask around if this holds true too.

Thanks.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD
  2012-02-06 16:46       ` Borislav Petkov
@ 2012-02-06 16:54         ` Peter Zijlstra
  2012-02-06 20:27           ` Borislav Petkov
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2012-02-06 16:54 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Stephane Eranian, linux-kernel, acme, mingo, robert.richter,
	eric.dumazet, Andreas Herrmann

On Mon, 2012-02-06 at 17:46 +0100, Borislav Petkov wrote:
> > across all CPUs in the entire system.
> 
> Right, by the "entire system" you mean consistent across cores and
> sockets but not necessarily across cabinets, as in the comment above,
> correct?
> 
> If so, let me ask around if this holds true too.

Every CPU available to the kernel. So if you run a single system image
across your cabinets, then yes those too.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD
  2012-02-06 14:31   ` Stephane Eranian
@ 2012-02-06 17:17     ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 31+ messages in thread
From: Arnaldo Carvalho de Melo @ 2012-02-06 17:17 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Peter Zijlstra, linux-kernel, mingo, robert.richter,
	eric.dumazet, Borislav Petkov, Frederic Weisbecker,
	Thomas Gleixner

Em Mon, Feb 06, 2012 at 03:31:00PM +0100, Stephane Eranian escreveu:
> On Mon, Feb 6, 2012 at 3:26 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Mon, 2012-02-06 at 14:25 +0100, Stephane Eranian wrote:
> >> In summary, two issues:
> >>   - Why is sched_clock_stable not set or even tested on recent AMD systems?
> >
> > I'll leave that for Robert to answer. I also added Boris to the CC, he
> > knows AMD bits too.
> >
> Ok, good.
> 
> >>   - perf should not rely on fine granularity timestamps in the re-ordering code
> >
> > Uhm, what else is there?

> I understand your point. I don't quite understand what the code is
> supposed to do for ordered samples. I'd like Arnaldo to explain this
> warning first.

That code was written by Frédéric and IIRC also Thomas, its something I
plan to rework somehow in the future, but for now I'll let them chime
in... Frédéric?

- Arnaldo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD
  2012-02-06 16:54         ` Peter Zijlstra
@ 2012-02-06 20:27           ` Borislav Petkov
  2012-02-06 20:31             ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Borislav Petkov @ 2012-02-06 20:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Stephane Eranian, linux-kernel, acme, mingo, robert.richter,
	eric.dumazet, Andreas Herrmann

On Mon, Feb 06, 2012 at 05:54:19PM +0100, Peter Zijlstra wrote:
> On Mon, 2012-02-06 at 17:46 +0100, Borislav Petkov wrote:
> > > across all CPUs in the entire system.
> > 
> > Right, by the "entire system" you mean consistent across cores and
> > sockets but not necessarily across cabinets, as in the comment above,
> > correct?
> > 
> > If so, let me ask around if this holds true too.
> 
> Every CPU available to the kernel. So if you run a single system image
> across your cabinets, then yes those too.

Ok, but what about that sentence "(but not across cabinets - we turn
it off in that case explicitly.)" - I don't see any place where it is
turned off explicitly... Maybe a stale comment?

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD
  2012-02-06 20:27           ` Borislav Petkov
@ 2012-02-06 20:31             ` Peter Zijlstra
  2012-02-06 20:37               ` Borislav Petkov
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2012-02-06 20:31 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Stephane Eranian, linux-kernel, acme, mingo, robert.richter,
	eric.dumazet, Andreas Herrmann, Venkatesh Pallipadi

On Mon, 2012-02-06 at 21:27 +0100, Borislav Petkov wrote:
> On Mon, Feb 06, 2012 at 05:54:19PM +0100, Peter Zijlstra wrote:
> > On Mon, 2012-02-06 at 17:46 +0100, Borislav Petkov wrote:
> > > > across all CPUs in the entire system.
> > > 
> > > Right, by the "entire system" you mean consistent across cores and
> > > sockets but not necessarily across cabinets, as in the comment above,
> > > correct?
> > > 
> > > If so, let me ask around if this holds true too.
> > 
> > Every CPU available to the kernel. So if you run a single system image
> > across your cabinets, then yes those too.
> 
> Ok, but what about that sentence "(but not across cabinets - we turn
> it off in that case explicitly.)" - I don't see any place where it is
> turned off explicitly... Maybe a stale comment?

I suspect it might be the sched_clock_stable = 0 in mark_tsc_unstable(),
but lets ask Venki, IIRC he wrote all that.
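
For reference, mark_tsc_unstable() in arch/x86/kernel/tsc.c does roughly
the following (paraphrased and trimmed, so take the details with a grain
of salt):

void mark_tsc_unstable(char *reason)
{
	if (!tsc_unstable) {
		tsc_unstable = 1;
		sched_clock_stable = 0;
		disable_sched_clock_irqtime();
		printk(KERN_INFO "Marking TSC unstable due to %s\n", reason);
		/* trimmed: the TSC clocksource gets marked unstable as well */
	}
}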


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD
  2012-02-06 20:31             ` Peter Zijlstra
@ 2012-02-06 20:37               ` Borislav Petkov
  2012-02-06 21:19                 ` Venki Pallipadi
  0 siblings, 1 reply; 31+ messages in thread
From: Borislav Petkov @ 2012-02-06 20:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Stephane Eranian, linux-kernel, acme, mingo, robert.richter,
	eric.dumazet, Andreas Herrmann, Venkatesh Pallipadi

On Mon, Feb 06, 2012 at 09:31:33PM +0100, Peter Zijlstra wrote:
> On Mon, 2012-02-06 at 21:27 +0100, Borislav Petkov wrote:
> > On Mon, Feb 06, 2012 at 05:54:19PM +0100, Peter Zijlstra wrote:
> > > On Mon, 2012-02-06 at 17:46 +0100, Borislav Petkov wrote:
> > > > > across all CPUs in the entire system.
> > > > 
> > > > Right, by the "entire system" you mean consistent across cores and
> > > > sockets but not necessarily across cabinets, as in the comment above,
> > > > correct?
> > > > 
> > > > If so, let me ask around if this holds true too.
> > > 
> > > Every CPU available to the kernel. So if you run a single system image
> > > across your cabinets, then yes those too.
> > 
> > Ok, but what about that sentence "(but not across cabinets - we turn
> > it off in that case explicitly.)" - I don't see any place where it is
> > turned off explicitly... Maybe a stale comment?
> 
> I suspect it might be the sched_clock_stable = 0 in mark_tsc_unstable(),
> but lets ask Venki, IIRC he wrote all that.

Yeah, I was looking at the code further and on Intel it does:

        if (c->x86_power & (1 << 8)) {
                set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
                set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
                if (!check_tsc_unstable())
                        sched_clock_stable = 1;
        }

while on AMD, in early_init_amd() we do:

        if (c->x86_power & (1 << 8)) {
                set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
                set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
        }

and having in mind that tsc_unstable is set on generic x86 paths,
nothing stops us from doing the same on AMD too and, as a result,
setting sched_clock_stable too.

But yeah, let's see what Venki has to say first.

Thanks.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD
  2012-02-06 20:37               ` Borislav Petkov
@ 2012-02-06 21:19                 ` Venki Pallipadi
  2012-02-07  7:51                   ` Peter Zijlstra
  2012-02-07  8:32                   ` Ingo Molnar
  0 siblings, 2 replies; 31+ messages in thread
From: Venki Pallipadi @ 2012-02-06 21:19 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Peter Zijlstra, Stephane Eranian, linux-kernel, acme, mingo,
	robert.richter, eric.dumazet, Andreas Herrmann

On Mon, Feb 6, 2012 at 12:37 PM, Borislav Petkov <bp@amd64.org> wrote:
> On Mon, Feb 06, 2012 at 09:31:33PM +0100, Peter Zijlstra wrote:
>> On Mon, 2012-02-06 at 21:27 +0100, Borislav Petkov wrote:
>> > On Mon, Feb 06, 2012 at 05:54:19PM +0100, Peter Zijlstra wrote:
>> > > On Mon, 2012-02-06 at 17:46 +0100, Borislav Petkov wrote:
>> > > > > across all CPUs in the entire system.
>> > > >
>> > > > Right, by the "entire system" you mean consistent across cores and
>> > > > sockets but not necessarily across cabinets, as in the comment above,
>> > > > correct?
>> > > >
>> > > > If so, let me ask around if this holds true too.
>> > >
>> > > Every CPU available to the kernel. So if you run a single system image
>> > > across your cabinets, then yes those too.
>> >
>> > Ok, but what about that sentence "(but not across cabinets - we turn
>> > it off in that case explicitly.)" - I don't see any place where it is
>> > turned off explicitly... Maybe a stale comment?
>>
>> I suspect it might be the sched_clock_stable = 0 in mark_tsc_unstable(),
>> but lets ask Venki, IIRC he wrote all that.
>
> Yeah, I was looking at the code further and on Intel it does:
>
>        if (c->x86_power & (1 << 8)) {
>                set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
>                set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
>                if (!check_tsc_unstable())
>                        sched_clock_stable = 1;
>        }
>
> while on AMD, in early_init_amd() we do:
>
>        if (c->x86_power & (1 << 8)) {
>                set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
>                set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
>        }
>
> and having in mind that tsc_unstable is set on generic x86 paths,
> nothing stops us from doing the same on AMD too and, as a result,
> setting sched_clock_stable too.
>
> But yeah, let's see what Venki has to say first.
>

Looks like the cabinet comment came from Ingo (commit 83ce4009) in reference to
    (We will turn this off in DMI quirks for multi-chassis systems)

Yes. If these two flags are set, TSC should be consistent and sched_clock_stable
could be set and it will be reset if there is a call to mark_tsc_unstable().

Thanks,
Venki

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD
  2012-02-06 21:19                 ` Venki Pallipadi
@ 2012-02-07  7:51                   ` Peter Zijlstra
  2012-02-07  8:32                   ` Ingo Molnar
  1 sibling, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2012-02-07  7:51 UTC (permalink / raw)
  To: Venki Pallipadi
  Cc: Borislav Petkov, Stephane Eranian, linux-kernel, acme, mingo,
	robert.richter, eric.dumazet, Andreas Herrmann

On Mon, 2012-02-06 at 13:19 -0800, Venki Pallipadi wrote:
> 
> Yes. If these two flags are set, TSC should be consistent and sched_clock_stable
> could be set and it will be reset if there is a call to mark_tsc_unstable(). 

Only trouble is that sched_clock_cpu() et al. will jump when you flip
sched_clock_stable to 0 after you started using it.. it's something we
might want to fix.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD
  2012-02-06 21:19                 ` Venki Pallipadi
  2012-02-07  7:51                   ` Peter Zijlstra
@ 2012-02-07  8:32                   ` Ingo Molnar
  2012-02-07  9:06                     ` Borislav Petkov
  1 sibling, 1 reply; 31+ messages in thread
From: Ingo Molnar @ 2012-02-07  8:32 UTC (permalink / raw)
  To: Venki Pallipadi
  Cc: Borislav Petkov, Peter Zijlstra, Stephane Eranian, linux-kernel,
	acme, robert.richter, eric.dumazet, Andreas Herrmann


* Venki Pallipadi <venki@google.com> wrote:

> On Mon, Feb 6, 2012 at 12:37 PM, Borislav Petkov <bp@amd64.org> wrote:
> > On Mon, Feb 06, 2012 at 09:31:33PM +0100, Peter Zijlstra wrote:
> >> On Mon, 2012-02-06 at 21:27 +0100, Borislav Petkov wrote:
> >> > On Mon, Feb 06, 2012 at 05:54:19PM +0100, Peter Zijlstra wrote:
> >> > > On Mon, 2012-02-06 at 17:46 +0100, Borislav Petkov wrote:
> >> > > > > across all CPUs in the entire system.
> >> > > >
> >> > > > Right, by the "entire system" you mean consistent across cores and
> >> > > > sockets but not necessarily across cabinets, as in the comment above,
> >> > > > correct?
> >> > > >
> >> > > > If so, let me ask around if this holds true too.
> >> > >
> >> > > Every CPU available to the kernel. So if you run a single system image
> >> > > across your cabinets, then yes those too.
> >> >
> >> > Ok, but what about that sentence "(but not across cabinets - we turn
> >> > it off in that case explicitly.)" - I don't see any place where it is
> >> > turned off explicitly... Maybe a stale comment?
> >>
> >> I suspect it might be the sched_clock_stable = 0 in mark_tsc_unstable(),
> >> but lets ask Venki, IIRC he wrote all that.
> >
> > Yeah, I was looking at the code further and on Intel it does:
> >
> >        if (c->x86_power & (1 << 8)) {
> >                set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
> >                set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
> >                if (!check_tsc_unstable())
> >                        sched_clock_stable = 1;
> >        }
> >
> > while on AMD, in early_init_amd() we do:
> >
> >        if (c->x86_power & (1 << 8)) {
> >                set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
> >                set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
> >        }
> >
> > and having in mind that tsc_unstable is set on generic x86 paths,
> > nothing stops us from doing the same on AMD too and, as a result,
> > setting sched_clock_stable too.
> >
> > But yeah, let's see what Venki has to say first.
> >
> 
> Looks like cabinet comment came from Ingo (commit 83ce4009) in 
> reference to
>     (We will turn this off in DMI quirks for multi-chassis 
>     systems)
> 
> Yes. If these two flags are set, TSC should be consistent and 
> sched_clock_stable could be set and it will be reset if there 
> is a call to mark_tsc_unstable().

Most of the details swapped out from my brain meanwhile, but I 
have some vague memories of a DMI quirk for some high-end system 
that just did a sched_clock_stable = 0 or such.

So if the common case is that the TSC is entirely synchronized 
across CPUs, then we can default to that and rely on platform 
initialization code or DMI quirks setting the few large-NUMA 
systems to an unstable TSC.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD
  2012-02-07  8:32                   ` Ingo Molnar
@ 2012-02-07  9:06                     ` Borislav Petkov
  2012-02-07  9:50                       ` Ingo Molnar
  0 siblings, 1 reply; 31+ messages in thread
From: Borislav Petkov @ 2012-02-07  9:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Venki Pallipadi, Borislav Petkov, Peter Zijlstra,
	Stephane Eranian, linux-kernel, acme, robert.richter,
	eric.dumazet, Andreas Herrmann

On Tue, Feb 07, 2012 at 09:32:53AM +0100, Ingo Molnar wrote:
> > Yes. If these two flags are set, TSC should be consistent and 
> > sched_clock_stable could be set and it will be reset if there 
> > is a call to mark_tsc_unstable().
> 
> Most of the details swapped out from my brain meanwhile, but I 
> have some vague memories of a DMI quirk for some high-end system 
> that just did a sched_clock_stable = 0 or such.
> 
> So if the common case is that the TSC is entirely synchronized 
> across CPUs, then we can default to that and rely on platform 
> initialization code or DMI quirks setting the few large-NUMA 
> systems to an unstable TSC.

There's also 14be1f7454ea96ee614467a49cf018a1a383b189 which removed
the setting of sched_clock_stable to 1 due to UV systems not being
TSC-synchronized across blades.

I guess, we could tentatively enable it on AMD provided nothing has
marked the TSC as unstable already:

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index f4773f4..ddee619 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -456,6 +456,8 @@ static void __cpuinit early_init_amd(struct cpuinfo_x86 *c)
        if (c->x86_power & (1 << 8)) {
                set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
                set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
+               if (!check_tsc_unstable())
+                       sched_clock_stable = 1;
        }
 
 #ifdef CONFIG_X86_64

Hmm...

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD
  2012-02-07  9:06                     ` Borislav Petkov
@ 2012-02-07  9:50                       ` Ingo Molnar
  2012-02-07 12:08                         ` [PATCH] x86, AMD: Set sched_clock_stable Borislav Petkov
  0 siblings, 1 reply; 31+ messages in thread
From: Ingo Molnar @ 2012-02-07  9:50 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Venki Pallipadi, Peter Zijlstra, Stephane Eranian, linux-kernel,
	acme, robert.richter, eric.dumazet, Andreas Herrmann


* Borislav Petkov <bp@amd64.org> wrote:

> On Tue, Feb 07, 2012 at 09:32:53AM +0100, Ingo Molnar wrote:
> > > Yes. If these two flags are set, TSC should be consistent and 
> > > sched_clock_stable could be set and it will be reset if there 
> > > is a call to mark_tsc_unstable().
> > 
> > Most of the details swapped out from my brain meanwhile, but I 
> > have some vague memories of a DMI quirk for some high-end system 
> > that just did a sched_clock_stable = 0 or such.
> > 
> > So if the common case is that the TSC is entirely synchronized 
> > across CPUs, then we can default to that and rely on platform 
> > initialization code or DMI quirks setting the few large-NUMA 
> > systems to an unstable TSC.
> 
> There's also 14be1f7454ea96ee614467a49cf018a1a383b189 which removed
> the setting of sched_clock_stable to 1 due to UV systems not being
> TSC-synchronized across blades.
> 
> I guess, we could tentatively enable it on AMD provided nothing has
> marked the TSC as unstable already:
> 
> diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> index f4773f4..ddee619 100644
> --- a/arch/x86/kernel/cpu/amd.c
> +++ b/arch/x86/kernel/cpu/amd.c
> @@ -456,6 +456,8 @@ static void __cpuinit early_init_amd(struct cpuinfo_x86 *c)
>         if (c->x86_power & (1 << 8)) {
>                 set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
>                 set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
> +               if (!check_tsc_unstable())
> +                       sched_clock_stable = 1;
>         }

i'd go for that.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH] x86, AMD: Set sched_clock_stable
  2012-02-07  9:50                       ` Ingo Molnar
@ 2012-02-07 12:08                         ` Borislav Petkov
  2012-02-15 15:30                           ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Borislav Petkov @ 2012-02-07 12:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Borislav Petkov, Venki Pallipadi, Peter Zijlstra,
	Stephane Eranian, LKML, Arnaldo Carvalho de Melo, Robert Richter,
	Eric Dumazet, Andreas Herrmann, Borislav Petkov

Stephane Eranian reported that doing a scheduler latency measurements
with perf on AMD doesn't work out as expected due to the fact that the
sched_clock() granularity is too coarse, i.e. done in jiffies due to the
sched_clock_stable not set, which, if set, would mean that we get to use
the TSC as sample source which would give us much higher precision.

However, there's no reason not to set sched_clock_stable on AMD because
all families from F10h and upwards do have an invariant TSC and have the
CPUID flag to prove (CPUID_8000_0007_EDX[8]).

Make it so, #1.

Signed-off-by: Borislav Petkov <bp@alien8.de>
Cc: Stephane Eranian <eranian@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Link: http://lkml.kernel.org/r/20120206132546.GA30854@quad
---
 arch/x86/kernel/cpu/amd.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index f4773f4..0a44b90 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -5,6 +5,7 @@
 #include <linux/mm.h>
 
 #include <linux/io.h>
+#include <linux/sched.h>
 #include <asm/processor.h>
 #include <asm/apic.h>
 #include <asm/cpu.h>
@@ -456,6 +457,8 @@ static void __cpuinit early_init_amd(struct cpuinfo_x86 *c)
 	if (c->x86_power & (1 << 8)) {
 		set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
 		set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
+		if (!check_tsc_unstable())
+			sched_clock_stable = 1;
 	}
 
 #ifdef CONFIG_X86_64
-- 
1.7.9


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [tip:perf/core] x86/sched/perf/AMD: Set sched_clock_stable
  2012-02-06 13:25 [BUG] perf: perf sched warning possibly due to clock granularity on AMD Stephane Eranian
  2012-02-06 14:26 ` Peter Zijlstra
@ 2012-02-07 19:43 ` tip-bot for Borislav Petkov
  2012-02-08 15:07 ` [BUG] perf: perf sched warning possibly due to clock granularity on AMD Frederic Weisbecker
  2012-02-18 16:50 ` [PATCH] perf tools: Fix ordering with unstable tsc Frederic Weisbecker
  3 siblings, 0 replies; 31+ messages in thread
From: tip-bot for Borislav Petkov @ 2012-02-07 19:43 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, eranian, hpa, mingo, eric.dumazet,
	andreas.herrmann3, peterz, bp, bp, robert.richter, acme, tglx,
	mingo, venki

Commit-ID:  c98fdeaa92731308ed80386261fa2589addefa47
Gitweb:     http://git.kernel.org/tip/c98fdeaa92731308ed80386261fa2589addefa47
Author:     Borislav Petkov <bp@alien8.de>
AuthorDate: Tue, 7 Feb 2012 13:08:52 +0100
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Tue, 7 Feb 2012 13:12:08 +0100

x86/sched/perf/AMD: Set sched_clock_stable

Stephane Eranian reported that doing a scheduler latency
measurements with perf on AMD doesn't work out as expected due
to the fact that the sched_clock() granularity is too coarse,
i.e. done in jiffies due to the sched_clock_stable not set,
which, if set, would mean that we get to use the TSC as sample
source which would give us much higher precision.

However, there's no reason not to set sched_clock_stable on AMD
because all families from F10h and upwards do have an invariant
TSC and have the CPUID flag to prove (CPUID_8000_0007_EDX[8]).

Make it so, #1.

Signed-off-by: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@amd64.org>
Cc: Venki Pallipadi <venki@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Link: http://lkml.kernel.org/r/20120206132546.GA30854@quad
[ Should any non-standard system break the TSC, we should
  mark them so explicitly, in their platform init handler, or
  in a DMI quirk. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/kernel/cpu/amd.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index f4773f4..0a44b90 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -5,6 +5,7 @@
 #include <linux/mm.h>
 
 #include <linux/io.h>
+#include <linux/sched.h>
 #include <asm/processor.h>
 #include <asm/apic.h>
 #include <asm/cpu.h>
@@ -456,6 +457,8 @@ static void __cpuinit early_init_amd(struct cpuinfo_x86 *c)
 	if (c->x86_power & (1 << 8)) {
 		set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
 		set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
+		if (!check_tsc_unstable())
+			sched_clock_stable = 1;
 	}
 
 #ifdef CONFIG_X86_64

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD
  2012-02-06 13:25 [BUG] perf: perf sched warning possibly due to clock granularity on AMD Stephane Eranian
  2012-02-06 14:26 ` Peter Zijlstra
  2012-02-07 19:43 ` [tip:perf/core] x86/sched/perf/AMD: " tip-bot for Borislav Petkov
@ 2012-02-08 15:07 ` Frederic Weisbecker
  2012-02-08 15:10   ` Stephane Eranian
  2012-02-18 16:50 ` [PATCH] perf tools: Fix ordering with unstable tsc Frederic Weisbecker
  3 siblings, 1 reply; 31+ messages in thread
From: Frederic Weisbecker @ 2012-02-08 15:07 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, acme, peterz, mingo, robert.richter, eric.dumazet

On Mon, Feb 06, 2012 at 02:25:46PM +0100, Stephane Eranian wrote:
> Hi,
> 
> I am running 3.3.0-rc2 on an AMD Bulldozer system in 64-bit mode.
> I was testing perf sched and I ran into an issue. That issue seems
> to exist only on AMD and not on Intel systems. It is not PMU related
> because I am doing tracing.
> 
> I am running a simple ping pong test to stress the context switch
> code. Two processes exchanging a byte through a pipe (program provided
> below).
> 
> Then, I capture a trace using perf sched, and I run sched lat, but
> on certain runs, I get:
> 
>    $ perf sched rec pong 2
>    $ perf sched lat
>      Warning: Timestamp below last timeslice flush
>      Warning: Timestamp below last timeslice flush
>      Warning: Timestamp below last timeslice flush
>      Warning: Timestamp below last timeslice flush
>      Warning: Timestamp below last timeslice flush
>      Warning: Timestamp below last timeslice flush
>      Warning: Timestamp below last timeslice flush
>      Warning: Timestamp below last timeslice flush
>      Warning: Timestamp below last timeslice flush
>      Found 4934 unknown events!
>      Is this an older tool processing a perf.data file generated by a more recent tool?

I've tried to reproduce this on an AMD phenom but haven't got any such
warning.

Is there any chance you could "perf archive" your data and upload it somewhere
I can fetch it from?

Thanks.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD
  2012-02-08 15:07 ` [BUG] perf: perf sched warning possibly due to clock granularity on AMD Frederic Weisbecker
@ 2012-02-08 15:10   ` Stephane Eranian
  2012-02-08 15:22     ` Frederic Weisbecker
  0 siblings, 1 reply; 31+ messages in thread
From: Stephane Eranian @ 2012-02-08 15:10 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, acme, peterz, mingo, robert.richter, eric.dumazet

On Wed, Feb 8, 2012 at 4:07 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> On Mon, Feb 06, 2012 at 02:25:46PM +0100, Stephane Eranian wrote:
>> Hi,
>>
>> I am running 3.3.0-rc2 on an AMD Bulldozer system in 64-bit mode.
>> I was testing perf sched and I ran into an issue. That issue seems
>> to exist only on AMD and not on Intel systems. It is not PMU related
>> because I am doing tracing.
>>
>> I am running a simple ping pong test to stress the context switch
>> code. Two processes exchanging a byte through a pipe (program provided
>> below).
>>
>> Then, I capture a trace using perf sched, and I run sched lat, but
>> on certain runs, I get:
>>
>>    $ perf sched rec pong 2
>>    $ perf sched lat
>>      Warning: Timestamp below last timeslice flush
>>      Warning: Timestamp below last timeslice flush
>>      Warning: Timestamp below last timeslice flush
>>      Warning: Timestamp below last timeslice flush
>>      Warning: Timestamp below last timeslice flush
>>      Warning: Timestamp below last timeslice flush
>>      Warning: Timestamp below last timeslice flush
>>      Warning: Timestamp below last timeslice flush
>>      Warning: Timestamp below last timeslice flush
>>      Found 4934 unknown events!
>>      Is this an older tool processing a perf.data file generated by a more recent tool?
>
> I've tried to reproduce this on an AMD phenom but haven't got any such
> warning.
>
It does not happen at each run.

> Is there any chance you could "perf archive" your data and upload it somewhere
> I can fetch it from?
>
I can send you the perf.data in a private Email.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD
  2012-02-08 15:10   ` Stephane Eranian
@ 2012-02-08 15:22     ` Frederic Weisbecker
  2012-02-08 15:23       ` Stephane Eranian
  0 siblings, 1 reply; 31+ messages in thread
From: Frederic Weisbecker @ 2012-02-08 15:22 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, acme, peterz, mingo, robert.richter, eric.dumazet

On Wed, Feb 08, 2012 at 04:10:24PM +0100, Stephane Eranian wrote:
> On Wed, Feb 8, 2012 at 4:07 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > On Mon, Feb 06, 2012 at 02:25:46PM +0100, Stephane Eranian wrote:
> >> Hi,
> >>
> >> I am running 3.3.0-rc2 on an AMD Bulldozer system in 64-bit mode.
> >> I was testing perf sched and I ran into an issue. That issue seems
> >> to exist only on AMD and not on Intel systems. It is not PMU related
> >> because I am doing tracing.
> >>
> >> I am running a simple ping pong test to stress the context switch
> >> code. Two processes exchanging a byte through a pipe (program provided
> >> below).
> >>
> >> Then, I capture a trace using perf sched, and I run sched lat, but
> >> on certain runs, I get:
> >>
> >>    $ perf sched rec pong 2
> >>    $ perf sched lat
> >>      Warning: Timestamp below last timeslice flush
> >>      Warning: Timestamp below last timeslice flush
> >>      Warning: Timestamp below last timeslice flush
> >>      Warning: Timestamp below last timeslice flush
> >>      Warning: Timestamp below last timeslice flush
> >>      Warning: Timestamp below last timeslice flush
> >>      Warning: Timestamp below last timeslice flush
> >>      Warning: Timestamp below last timeslice flush
> >>      Warning: Timestamp below last timeslice flush
> >>      Found 4934 unknown events!
> >>      Is this an older tool processing a perf.data file generated by a more recent tool?
> >
> > I've tried to reproduce this on an AMD phenom but haven't got any such
> > warning.
> >
> It does not happen at each run.

Yeah I tried 5 times but wasn't lucky.

> 
> > Is there any chance you could "perf archive" your data and upload it somewhere
> > I can fetch it from?
> >
> I can send you the perf.data in a private Email.

Thanks!

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD
  2012-02-08 15:22     ` Frederic Weisbecker
@ 2012-02-08 15:23       ` Stephane Eranian
  0 siblings, 0 replies; 31+ messages in thread
From: Stephane Eranian @ 2012-02-08 15:23 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, acme, peterz, mingo, robert.richter, eric.dumazet

On Wed, Feb 8, 2012 at 4:22 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> On Wed, Feb 08, 2012 at 04:10:24PM +0100, Stephane Eranian wrote:
>> On Wed, Feb 8, 2012 at 4:07 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
>> > On Mon, Feb 06, 2012 at 02:25:46PM +0100, Stephane Eranian wrote:
>> >> Hi,
>> >>
>> >> I am running 3.3.0-rc2 on an AMD Bulldozer system in 64-bit mode.
>> >> I was testing perf sched and I ran into an issue. That issue seems
>> >> to exist only on AMD and not on Intel systems. It is not PMU related
>> >> because I am doing tracing.
>> >>
>> >> I am running a simple ping pong test to stress the context switch
>> >> code. Two processes exchanging a byte through a pipe (program provided
>> >> below).
>> >>
>> >> Then, I capture a trace using perf sched, and I run sched lat, but
>> >> on certain runs, I get:
>> >>
>> >>    $ perf sched rec pong 2
>> >>    $ perf sched lat
>> >>      Warning: Timestamp below last timeslice flush
>> >>      Warning: Timestamp below last timeslice flush
>> >>      Warning: Timestamp below last timeslice flush
>> >>      Warning: Timestamp below last timeslice flush
>> >>      Warning: Timestamp below last timeslice flush
>> >>      Warning: Timestamp below last timeslice flush
>> >>      Warning: Timestamp below last timeslice flush
>> >>      Warning: Timestamp below last timeslice flush
>> >>      Warning: Timestamp below last timeslice flush
>> >>      Found 4934 unknown events!
>> >>      Is this an older tool processing a perf.data file generated by a more recent tool?
>> >
>> > I've tried to reproduce this on an AMD phenom but haven't got any such
>> > warning.
>> >
>> It does not happen at each run.
>
> Yeah I tried 5 times but wasn't lucky.

That's not enough!

>
>>
>> > Is there any chance you could "perf archive" your data and upload it somewhere
>> > I can fetch it from?
>> >
>> I can send you the perf.data in a private Email.
>
> Thanks!

Will dig it up and send it to you.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] x86, AMD: Set sched_clock_stable
  2012-02-07 12:08                         ` [PATCH] x86, AMD: Set sched_clock_stable Borislav Petkov
@ 2012-02-15 15:30                           ` Peter Zijlstra
  0 siblings, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2012-02-15 15:30 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Ingo Molnar, Borislav Petkov, Venki Pallipadi, Stephane Eranian,
	LKML, Arnaldo Carvalho de Melo, Robert Richter, Eric Dumazet,
	Andreas Herrmann

On Tue, 2012-02-07 at 13:08 +0100, Borislav Petkov wrote:
> Stephane Eranian reported that doing a scheduler latency measurements
> with perf on AMD doesn't work out as expected due to the fact that the
> sched_clock() granularity is too coarse, i.e. done in jiffies due to the
> sched_clock_stable not set, which, if set, would mean that we get to use
> the TSC as sample source which would give us much higher precision.

FWIW that's not entirely accurate: the problem with !sched_clock_stable
is that the time between CPUs can drift a little. So while timestamps on
the same CPU are high-res, when comparing them across CPUs you can
observe funnies.



^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH] perf tools: Fix ordering with unstable tsc
  2012-02-06 13:25 [BUG] perf: perf sched warning possibly due to clock granularity on AMD Stephane Eranian
                   ` (2 preceding siblings ...)
  2012-02-08 15:07 ` [BUG] perf: perf sched warning possibly due to clock granularity on AMD Frederic Weisbecker
@ 2012-02-18 16:50 ` Frederic Weisbecker
  2012-02-22 15:35   ` Stephane Eranian
                     ` (2 more replies)
  3 siblings, 3 replies; 31+ messages in thread
From: Frederic Weisbecker @ 2012-02-18 16:50 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: LKML, Frederic Weisbecker, David Ahern, Peter Zijlstra,
	Stephane Eranian, Ingo Molnar

On a system with a TSC considered as unstable, one can encounter this
kind of warning:

     $ perf sched rec pong 2
     $ perf sched lat
     Warning: Timestamp below last timeslice flush

This happens when trace events trigger with a potentially high period,
such as sched_stat_sleep, sched_stat_runtime, sched_stat_wait, etc...
The perf event core then implement that weight by sending as many events
as the given period. For example as many as the time the task has been
sleeping in sched_stat_sleep event.

If this happens while irqs are disabled with an unstable tsc and this takes
more time than a jiffy, then the timestamps of the events get stuck to
the value of that next jiffy because sched_clock_local() bounds the timestamp
to that maximum. The local timer tick is supposed to update that boundary but
it can't given that irqs are disabled.
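
For reference, the bound in question is the clamp in sched_clock_local()
in kernel/sched/clock.c, roughly (simplified sketch, not the exact source):

static u64 sched_clock_local(struct sched_clock_data *scd)
{
	u64 now, clock, old_clock, min_clock, max_clock;
	s64 delta;

again:
	now = sched_clock();
	delta = now - scd->tick_raw;
	if (unlikely(delta < 0))
		delta = 0;

	old_clock = scd->clock;

	/*
	 * The result is clamped to [tick_gtod, tick_gtod + TICK_NSEC].
	 * If the tick cannot run (irqs disabled), tick_gtod/tick_raw are
	 * not refreshed and every reading saturates at the same maximum.
	 */
	clock = scd->tick_gtod + delta;
	min_clock = wrap_max(scd->tick_gtod, old_clock);
	max_clock = scd->tick_gtod + TICK_NSEC;

	clock = wrap_max(clock, min_clock);
	clock = wrap_min(clock, max_clock);

	if (cmpxchg64(&scd->clock, old_clock, clock) != old_clock)
		goto again;

	return clock;
}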

We can then meet this kind of scenario in perf record:

===== CPU 0 =====      ==== CPU 1 ====

              PASS n
     ...                    ...
      1                      1
      1                      2
      1                      3 <-- max recorded

           finished round event
            PASS n + 1

      1                      4
      1                      5
      1                      6

           finished round event
            PASS n + 2

      1                      7
     ...                    ...

CPU 0 is stuck sending events with irqs disabled and with the stale
timestamp. When we do the events reordering for perf script for example,
we flush all the events before timestamp 3 when we reach PASS n + 2,
considering we can't anymore have timestamps below 3 now.
But we still do have timestamps below 3 on PASS n + 2.

To solve that issue, instead of considering that timestamps are globally
monotonic, we assume they are locally monotonic. Instead of recording
the max timestamp on each pass, we check the max one per CPU on each
pass and keep the smallest over these as the new barrier up to which
we flush the events on the PASS n + 2. This still relies on a bit of
global monotonicity because if some CPU doesn't have events in PASS n,
we expect it not to have events in PASS n + 2 past the barrier recorded
in PASS n. So this is still not a totally robust ordering but it's still
better than what we had before.

The only way to have a deterministic and solid ordering will be to use
per cpu perf.data files.

Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/util/evsel.c   |    5 +-
 tools/perf/util/session.c |  146 +++++++++++++++++++++++++++++++++-----------
 tools/perf/util/session.h |    3 +-
 3 files changed, 115 insertions(+), 39 deletions(-)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 302d49a..1c8eb4b 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -119,9 +119,12 @@ void perf_evsel__config(struct perf_evsel *evsel, struct perf_record_opts *opts)
 	if (opts->raw_samples) {
 		attr->sample_type	|= PERF_SAMPLE_TIME;
 		attr->sample_type	|= PERF_SAMPLE_RAW;
-		attr->sample_type	|= PERF_SAMPLE_CPU;
 	}
 
+	/* Need to know the CPU for tools that need to order events */
+	if (attr->sample_type & PERF_SAMPLE_TIME)
+		attr->sample_type	|= PERF_SAMPLE_CPU;
+
 	if (opts->no_delay) {
 		attr->watermark = 0;
 		attr->wakeup_events = 1;
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 9f833cf..f297342 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -494,6 +494,8 @@ static void perf_session_free_sample_buffers(struct perf_session *session)
 		list_del(&sq->list);
 		free(sq);
 	}
+
+	free(os->last_cpu_timestamp);
 }
 
 static int perf_session_deliver_event(struct perf_session *session,
@@ -549,56 +551,89 @@ static void flush_sample_queue(struct perf_session *s,
 }
 
 /*
- * When perf record finishes a pass on every buffers, it records this pseudo
- * event.
- * We record the max timestamp t found in the pass n.
- * Assuming these timestamps are monotonic across cpus, we know that if
- * a buffer still has events with timestamps below t, they will be all
- * available and then read in the pass n + 1.
- * Hence when we start to read the pass n + 2, we can safely flush every
- * events with timestamps below t.
+ * We make the assumption that timestamps are not globally monotonic but locally
+ * non-strictly monotonic. In practice, this is because if we are dealing with a
+ * machine with unstable TSC, the kernel bounds the result of the tsc between
+ * last_tick_time < tsc < next_tick_time. Thus, if a CPU disables interrupts for more
+ * than one jiffy, all of its timestamps will be equal to next_tick_time after we
+ * cross that jiffy, without any further progress whereas the other CPU continue
+ * with normal timestamps. This can happen if a CPU sends crazillions of events
+ * while interrupts are disabled. But there are potentially other random scenarios
+ * with unstable TSC that drives us to assume the monotonicity of time only per CPU
+ * and not globally.
+ *
+ * To solve this, when perf record finishes a round of write on every buffers, it
+ * records a pseudo event named "finished round". The frame of events that happen
+ * between two finished rounds is called a "pass".
+ * We record the max timestamp T[cpu] per CPU found over the events in the pass n.
+ * Then when we finish a round, we iterate over these T[cpu] and keep the smallest
+ * one: min(T).
+ *
+ * Assuming these timestamps are locally monotonic (non strictly), we can flush all
+ * queued events having a timestamp below min(T) when we start to process PASS n + 1.
+ * But we actually wait until we start PASS n + 2 in case a CPU did not have any
+ * event in PASS n but came in PASS n + 1 with events below min(T). We truly
+ * hope no CPU will come with events below min(T) after pass n + 1. This
+ * heuristically relies on some minimal global consistency. This should work in
+ * most real world cases; the only way to ensure a truly safe ordering with
+ * regular flush will be to switch to per CPU record files.
+ *
  *
- *    ============ PASS n =================
- *       CPU 0         |   CPU 1
- *                     |
- *    cnt1 timestamps  |   cnt2 timestamps
- *          1          |         2
- *          2          |         3
- *          -          |         4  <--- max recorded
+ *    ========================== PASS n ============================
+ *       CPU 0                   |   CPU 1
+ *                               |
+ *    cnt1 timestamps            |   cnt2 timestamps
+ *          1                    |         2
+ *          2 <--- max recorded  |         3
+ *          -                    |         4 <--- max recorded
+ *                          min(T) = 2
  *
- *    ============ PASS n + 1 ==============
- *       CPU 0         |   CPU 1
- *                     |
- *    cnt1 timestamps  |   cnt2 timestamps
- *          3          |         5
- *          4          |         6
- *          5          |         7 <---- max recorded
+ *    ========================== PASS n + 1 ========================
+ *       CPU 0                   |   CPU 1
+ *                               |
+ *    cnt1 timestamps            |   cnt2 timestamps
+ *          3                    |         5
+ *          4                    |         6
+ *          5 <--- max record    |         7 <---- max recorded
+ *                          min(T) = 5
  *
- *      Flush every events below timestamp 4
+ *                Flush every events below timestamp 2
  *
- *    ============ PASS n + 2 ==============
- *       CPU 0         |   CPU 1
- *                     |
- *    cnt1 timestamps  |   cnt2 timestamps
- *          6          |         8
- *          7          |         9
- *          -          |         10
+ *    ========================== PASS n + 2 ========================
+ *       CPU 0                   |   CPU 1
+ *                               |
+ *    cnt1 timestamps            |   cnt2 timestamps
+ *          6                    |         8
+ *          7                    |         9
+ *          -                    |         10
  *
- *      Flush every events below timestamp 7
- *      etc...
+ *                Flush every events below timestamp 5, etc...
  */
 static int process_finished_round(struct perf_tool *tool,
 				  union perf_event *event __used,
 				  struct perf_session *session)
 {
+	unsigned int i;
+	u64 min = ULLONG_MAX;
+	struct ordered_samples *os = &session->ordered_samples;
+
 	flush_sample_queue(session, tool);
-	session->ordered_samples.next_flush = session->ordered_samples.max_timestamp;
+
+	for (i = 0; i < session->nr_cpus; i++) {
+		if (os->last_cpu_timestamp[i] < min)
+			min = os->last_cpu_timestamp[i];
+
+		os->last_cpu_timestamp[i] = ULLONG_MAX;
+	}
+
+	if (min != ULLONG_MAX)
+		os->next_flush = min;
 
 	return 0;
 }
 
 /* The queue is ordered by time */
-static void __queue_event(struct sample_queue *new, struct perf_session *s)
+static void __queue_event(struct sample_queue *new, struct perf_session *s, int cpu)
 {
 	struct ordered_samples *os = &s->ordered_samples;
 	struct sample_queue *sample = os->last_sample;
@@ -607,10 +642,10 @@ static void __queue_event(struct sample_queue *new, struct perf_session *s)
 
 	++os->nr_samples;
 	os->last_sample = new;
+	os->last_cpu_timestamp[cpu] = timestamp;
 
 	if (!sample) {
 		list_add(&new->list, &os->samples);
-		os->max_timestamp = timestamp;
 		return;
 	}
 
@@ -624,7 +659,6 @@ static void __queue_event(struct sample_queue *new, struct perf_session *s)
 			p = sample->list.next;
 			if (p == &os->samples) {
 				list_add_tail(&new->list, &os->samples);
-				os->max_timestamp = timestamp;
 				return;
 			}
 			sample = list_entry(p, struct sample_queue, list);
@@ -643,6 +677,34 @@ static void __queue_event(struct sample_queue *new, struct perf_session *s)
 	}
 }
 
+static int alloc_cpus_timestamp_array(struct perf_session *s,
+				      struct perf_sample *sample,
+				      struct ordered_samples *os)
+{
+	int i;
+	int nr_cpus;
+
+	if (sample->cpu < s->nr_cpus)
+		return 0;
+
+	nr_cpus = sample->cpu + 1;
+
+	if (!os->last_cpu_timestamp)
+		os->last_cpu_timestamp = malloc(sizeof(u64) * nr_cpus);
+	else
+		os->last_cpu_timestamp = realloc(os->last_cpu_timestamp,
+						 sizeof(u64) * nr_cpus);
+	if (!os->last_cpu_timestamp)
+		return -ENOMEM;
+
+	for (i = s->nr_cpus; i < nr_cpus; i++)
+		os->last_cpu_timestamp[i] = ULLONG_MAX;
+
+	s->nr_cpus = nr_cpus;
+
+	return 0;
+}
+
 #define MAX_SAMPLE_BUFFER	(64 * 1024 / sizeof(struct sample_queue))
 
 static int perf_session_queue_event(struct perf_session *s, union perf_event *event,
@@ -652,6 +714,12 @@ static int perf_session_queue_event(struct perf_session *s, union perf_event *ev
 	struct list_head *sc = &os->sample_cache;
 	u64 timestamp = sample->time;
 	struct sample_queue *new;
+	int err;
+
+	if (!(s->sample_type & PERF_SAMPLE_CPU)) {
+		pr_err("Warning: Need to record CPU on samples for ordering\n");
+		return -EINVAL;
+	}
 
 	if (!timestamp || timestamp == ~0ULL)
 		return -ETIME;
@@ -661,6 +729,10 @@ static int perf_session_queue_event(struct perf_session *s, union perf_event *ev
 		return -EINVAL;
 	}
 
+	err = alloc_cpus_timestamp_array(s, sample, os);
+	if (err)
+		return err;
+
 	if (!list_empty(sc)) {
 		new = list_entry(sc->next, struct sample_queue, list);
 		list_del(&new->list);
@@ -681,7 +753,7 @@ static int perf_session_queue_event(struct perf_session *s, union perf_event *ev
 	new->file_offset = file_offset;
 	new->event = event;
 
-	__queue_event(new, s);
+	__queue_event(new, s, sample->cpu);
 
 	return 0;
 }
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index c8d9017..642591b 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -16,7 +16,7 @@ struct thread;
 struct ordered_samples {
 	u64			last_flush;
 	u64			next_flush;
-	u64			max_timestamp;
+	u64			*last_cpu_timestamp;
 	struct list_head	samples;
 	struct list_head	sample_cache;
 	struct list_head	to_free;
@@ -50,6 +50,7 @@ struct perf_session {
 	int			cwdlen;
 	char			*cwd;
 	struct ordered_samples	ordered_samples;
+	unsigned int		nr_cpus;
 	char			filename[1];
 };
 
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH] perf tools: Fix ordering with unstable tsc
  2012-02-18 16:50 ` [PATCH] perf tools: Fix ordering with unstable tsc Frederic Weisbecker
@ 2012-02-22 15:35   ` Stephane Eranian
  2012-02-22 15:39     ` David Ahern
  2012-03-05 18:43   ` Frederic Weisbecker
  2012-03-14 19:55   ` Arnaldo Carvalho de Melo
  2 siblings, 1 reply; 31+ messages in thread
From: Stephane Eranian @ 2012-02-22 15:35 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Arnaldo Carvalho de Melo, LKML, David Ahern, Peter Zijlstra, Ingo Molnar


On Sat, Feb 18, 2012 at 5:50 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> On a system with a TSC considered as unstable, one can encounter this
> kind of warning:
>
>     $ perf sched rec pong 2
>     $ perf sched lat
>     Warning: Timestamp below last timeslice flush
>
> This happens when trace events trigger with a potentially high period,
> such as sched_stat_sleep, sched_stat_runtime, sched_stat_wait, etc...
> The perf event core then implement that weight by sending as many events
> as the given period. For example as many as the time the task has been
> sleeping in sched_stat_sleep event.
>
> If this happens while irqs are disabled with an unstable tsc and this takes
> more time than a jiffy, then the timestamps of the events get stuck to
> the value of that next jiffy because sched_clock_local() bounds the timestamp
> to that maximum. The local timer tick is supposed to update that boundary but
> it can't given that irqs are disabled.
>
> We can then meet this kind of scenario in perf record:
>
> ===== CPU 0 =====      ==== CPU 1 ====
>
>              PASS n
>     ...                    ...
>      1                      1
>      1                      2
>      1                      3 <-- max recorded
>
>           finished round event
>            PASS n + 1
>
>      1                      4
>      1                      5
>      1                      6
>
>           finished round event
>            PASS n + 2
>
>      1                      7
>     ...                    ...
>
> CPU 0 is stuck sending events with irqs disabled and with the stale
> timestamp. When we do the events reordering for perf script for example,
> we flush all the events before timestamp 3 when we reach PASS n + 2,
> considering we can't anymore have timestamps below 3 now.
> But we still do have timestamps below 3 on PASS n + 2.
>
> To solve that issue, instead of considering that timestamps are globally
> monotonic, we assume they are locally monotonic. Instead of recording
> the max timestamp on each pass, we check the max one per CPU on each
> pass and keep the smallest over these as the new barrier up to which
> we flush the events on the PASS n + 2. This still relies on a bit of
> global monotonicity because if some CPU doesn't have events in PASS n,
> we expect it not to have event in PASS n + 2 past the barrier recorded
> in PASS n. So this is still not a totally robust ordering but it's still
> better than what we had before.
>
> The only way to have a deterministic and solid ordering will be to use
> per cpu perf.data files.
>
> Reported-by: Stephane Eranian <eranian@google.com>
> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: David Ahern <dsahern@gmail.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Stephane Eranian <eranian@google.com>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
> ---
>  tools/perf/util/evsel.c   |    5 +-
>  tools/perf/util/session.c |  146 +++++++++++++++++++++++++++++++++-----------
>  tools/perf/util/session.h |    3 +-
>  3 files changed, 115 insertions(+), 39 deletions(-)
>
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index 302d49a..1c8eb4b 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -119,9 +119,12 @@ void perf_evsel__config(struct perf_evsel *evsel, struct perf_record_opts *opts)
>        if (opts->raw_samples) {
>                attr->sample_type       |= PERF_SAMPLE_TIME;
>                attr->sample_type       |= PERF_SAMPLE_RAW;
> -               attr->sample_type       |= PERF_SAMPLE_CPU;
>        }
>
I don't get this bit here. You may want CPU information when capturing
in raw + per-thread mode.


> +       /* Need to know the CPU for tools that need to order events */
> +       if (attr->sample_type & PERF_SAMPLE_TIME)
> +               attr->sample_type       |= PERF_SAMPLE_CPU;
> +
>        if (opts->no_delay) {
>                attr->watermark = 0;
>                attr->wakeup_events = 1;
> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
> index 9f833cf..f297342 100644
> --- a/tools/perf/util/session.c
> +++ b/tools/perf/util/session.c
> @@ -494,6 +494,8 @@ static void perf_session_free_sample_buffers(struct perf_session *session)
>                list_del(&sq->list);
>                free(sq);
>        }
> +
> +       free(os->last_cpu_timestamp);
>  }
>
>  static int perf_session_deliver_event(struct perf_session *session,
> @@ -549,56 +551,89 @@ static void flush_sample_queue(struct perf_session *s,
>  }
>
>  /*
> - * When perf record finishes a pass on every buffers, it records this pseudo
> - * event.
> - * We record the max timestamp t found in the pass n.
> - * Assuming these timestamps are monotonic across cpus, we know that if
> - * a buffer still has events with timestamps below t, they will be all
> - * available and then read in the pass n + 1.
> - * Hence when we start to read the pass n + 2, we can safely flush every
> - * events with timestamps below t.
> + * We make the assumption that timestamps are not globally monotonic but locally
> + * non-strictly monotonic. In practice, this is because if we are dealing with a
> + * machine with unstable TSC, the kernel bounds the result of the tsc between
> + * last_tick_time < tsc < next_tick_time. Thus, if a CPU disables interrupts for more
> + * than one jiffy, all of its timestamps will be equal to next_tick_time after we
> + * cross that jiffy, without any further progress whereas the other CPU continue
> + * with normal timestamps. This can happen if a CPU sends crazillions of events
> + * while interrupts are disabled. But there are potentially other random scenarios
> + * with unstable TSC that drives us to assume the monotonicity of time only per CPU
> + * and not globally.
> + *
> + * To solve this, when perf record finishes a round of write on every buffers, it
> + * records a pseudo event named "finished round". The frame of events that happen
> + * between two finished rounds is called a "pass".
> + * We record the max timestamp T[cpu] per CPU found over the events in the pass n.
> + * Then when we finish a round, we iterate over these T[cpu]and keep the smallest
> + * one: min(T).
> + *
> + * Assuming these timestamps are locally monotonic (non strictly), we can flush all
> + * queued events having a timestamp below min(T) when we start to process PASS n + 1.
> + * But we actually wait until we start PASS n + 2 in case a CPU did not have any
> + * event in PASS n but came in PASS n + 1 with events below min(T). We truly
> + * hope no CPU will come with events below min(T) after pass n + 1. This
> + * heuristicly rely on some minimal global consistancy. This should work in most
> + * real world case, the only way to ensure a truly safe ordering with regular
> + * flush will be to switch to per CPU record files.
> + *
>  *
> - *    ============ PASS n =================
> - *       CPU 0         |   CPU 1
> - *                     |
> - *    cnt1 timestamps  |   cnt2 timestamps
> - *          1          |         2
> - *          2          |         3
> - *          -          |         4  <--- max recorded
> + *    ========================== PASS n ============================
> + *       CPU 0                   |   CPU 1
> + *                               |
> + *    cnt1 timestamps            |   cnt2 timestamps
> + *          1                    |         2
> + *          2 <--- max recorded  |         3
> + *          -                    |         4 <--- max recorded
> + *                          min(T) = 2
>  *
> - *    ============ PASS n + 1 ==============
> - *       CPU 0         |   CPU 1
> - *                     |
> - *    cnt1 timestamps  |   cnt2 timestamps
> - *          3          |         5
> - *          4          |         6
> - *          5          |         7 <---- max recorded
> + *    ========================== PASS n + 1 ========================
> + *       CPU 0                   |   CPU 1
> + *                               |
> + *    cnt1 timestamps            |   cnt2 timestamps
> + *          3                    |         5
> + *          4                    |         6
> + *          5 <--- max record    |         7 <---- max recorded
> + *                          min(T) = 5
>  *
> - *      Flush every events below timestamp 4
> + *                Flush every events below timestamp 2
>  *
> - *    ============ PASS n + 2 ==============
> - *       CPU 0         |   CPU 1
> - *                     |
> - *    cnt1 timestamps  |   cnt2 timestamps
> - *          6          |         8
> - *          7          |         9
> - *          -          |         10
> + *    ========================== PASS n + 2 ========================
> + *       CPU 0                   |   CPU 1
> + *                               |
> + *    cnt1 timestamps            |   cnt2 timestamps
> + *          6                    |         8
> + *          7                    |         9
> + *          -                    |         10
>  *
> - *      Flush every events below timestamp 7
> - *      etc...
> + *                Flush every events below timestamp 5, etc...
>  */
>  static int process_finished_round(struct perf_tool *tool,
>                                  union perf_event *event __used,
>                                  struct perf_session *session)
>  {
> +       unsigned int i;
> +       u64 min = ULLONG_MAX;
> +       struct ordered_samples *os = &session->ordered_samples;
> +
>        flush_sample_queue(session, tool);
> -       session->ordered_samples.next_flush = session->ordered_samples.max_timestamp;
> +
> +       for (i = 0; i < session->nr_cpus; i++) {
> +               if (os->last_cpu_timestamp[i] < min)
> +                       min = os->last_cpu_timestamp[i];
> +
> +               os->last_cpu_timestamp[i] = ULLONG_MAX;
> +       }
> +
> +       if (min != ULLONG_MAX)
> +               os->next_flush = min;
>
>        return 0;
>  }
>
>  /* The queue is ordered by time */
> -static void __queue_event(struct sample_queue *new, struct perf_session *s)
> +static void __queue_event(struct sample_queue *new, struct perf_session *s, int cpu)
>  {
>        struct ordered_samples *os = &s->ordered_samples;
>        struct sample_queue *sample = os->last_sample;
> @@ -607,10 +642,10 @@ static void __queue_event(struct sample_queue *new, struct perf_session *s)
>
>        ++os->nr_samples;
>        os->last_sample = new;
> +       os->last_cpu_timestamp[cpu] = timestamp;
>
>        if (!sample) {
>                list_add(&new->list, &os->samples);
> -               os->max_timestamp = timestamp;
>                return;
>        }
>
> @@ -624,7 +659,6 @@ static void __queue_event(struct sample_queue *new, struct perf_session *s)
>                        p = sample->list.next;
>                        if (p == &os->samples) {
>                                list_add_tail(&new->list, &os->samples);
> -                               os->max_timestamp = timestamp;
>                                return;
>                        }
>                        sample = list_entry(p, struct sample_queue, list);
> @@ -643,6 +677,34 @@ static void __queue_event(struct sample_queue *new, struct perf_session *s)
>        }
>  }
>
> +static int alloc_cpus_timestamp_array(struct perf_session *s,
> +                                     struct perf_sample *sample,
> +                                     struct ordered_samples *os)
> +{
> +       int i;
> +       int nr_cpus;
> +
> +       if (sample->cpu < s->nr_cpus)
> +               return 0;
> +
> +       nr_cpus = sample->cpu + 1;
> +
> +       if (!os->last_cpu_timestamp)
> +               os->last_cpu_timestamp = malloc(sizeof(u64) * nr_cpus);
> +       else
> +               os->last_cpu_timestamp = realloc(os->last_cpu_timestamp,
> +                                                sizeof(u64) * nr_cpus);
> +       if (!os->last_cpu_timestamp)
> +               return -ENOMEM;
> +
> +       for (i = s->nr_cpus; i < nr_cpus; i++)
> +               os->last_cpu_timestamp[i] = ULLONG_MAX;
> +
> +       s->nr_cpus = nr_cpus;
> +
> +       return 0;
> +}
> +
>  #define MAX_SAMPLE_BUFFER      (64 * 1024 / sizeof(struct sample_queue))
>
>  static int perf_session_queue_event(struct perf_session *s, union perf_event *event,
> @@ -652,6 +714,12 @@ static int perf_session_queue_event(struct perf_session *s, union perf_event *ev
>        struct list_head *sc = &os->sample_cache;
>        u64 timestamp = sample->time;
>        struct sample_queue *new;
> +       int err;
> +
> +       if (!(s->sample_type & PERF_SAMPLE_CPU)) {
> +               pr_err("Warning: Need to record CPU on samples for ordering\n");
> +               return -EINVAL;
> +       }
>
>        if (!timestamp || timestamp == ~0ULL)
>                return -ETIME;
> @@ -661,6 +729,10 @@ static int perf_session_queue_event(struct perf_session *s, union perf_event *ev
>                return -EINVAL;
>        }
>
> +       err = alloc_cpus_timestamp_array(s, sample, os);
> +       if (err)
> +               return err;
> +
>        if (!list_empty(sc)) {
>                new = list_entry(sc->next, struct sample_queue, list);
>                list_del(&new->list);
> @@ -681,7 +753,7 @@ static int perf_session_queue_event(struct perf_session *s, union perf_event *ev
>        new->file_offset = file_offset;
>        new->event = event;
>
> -       __queue_event(new, s);
> +       __queue_event(new, s, sample->cpu);
>
>        return 0;
>  }
> diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
> index c8d9017..642591b 100644
> --- a/tools/perf/util/session.h
> +++ b/tools/perf/util/session.h
> @@ -16,7 +16,7 @@ struct thread;
>  struct ordered_samples {
>        u64                     last_flush;
>        u64                     next_flush;
> -       u64                     max_timestamp;
> +       u64                     *last_cpu_timestamp;
>        struct list_head        samples;
>        struct list_head        sample_cache;
>        struct list_head        to_free;
> @@ -50,6 +50,7 @@ struct perf_session {
>        int                     cwdlen;
>        char                    *cwd;
>        struct ordered_samples  ordered_samples;
> +       unsigned int            nr_cpus;
>        char                    filename[1];
>  };
>
> --
> 1.7.5.4
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] perf tools: Fix ordering with unstable tsc
  2012-02-22 15:35   ` Stephane Eranian
@ 2012-02-22 15:39     ` David Ahern
  0 siblings, 0 replies; 31+ messages in thread
From: David Ahern @ 2012-02-22 15:39 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Frederic Weisbecker, Arnaldo Carvalho de Melo, LKML,
	Peter Zijlstra, Ingo Molnar

On 2/22/12 8:35 AM, Stephane Eranian wrote:
>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>> index 302d49a..1c8eb4b 100644
>> --- a/tools/perf/util/evsel.c
>> +++ b/tools/perf/util/evsel.c
>> @@ -119,9 +119,12 @@ void perf_evsel__config(struct perf_evsel *evsel, struct perf_record_opts *opts)
>>         if (opts->raw_samples) {
>>                 attr->sample_type       |= PERF_SAMPLE_TIME;
>>                 attr->sample_type       |= PERF_SAMPLE_RAW;
>> -               attr->sample_type       |= PERF_SAMPLE_CPU;
>>         }
>>
> I don't get this bit here. You may want CPU information when capturing
> in raw + per-thread mode.

TIME is added to sample_type for raw samples and then CPU gets added below.

>
>
>> +       /* Need to know the CPU for tools that need to order events */
>> +       if (attr->sample_type&  PERF_SAMPLE_TIME)
>> +               attr->sample_type       |= PERF_SAMPLE_CPU;
>> +


David

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] perf tools: Fix ordering with unstable tsc
  2012-02-18 16:50 ` [PATCH] perf tools: Fix ordering with unstable tsc Frederic Weisbecker
  2012-02-22 15:35   ` Stephane Eranian
@ 2012-03-05 18:43   ` Frederic Weisbecker
  2012-03-14 19:55   ` Arnaldo Carvalho de Melo
  2 siblings, 0 replies; 31+ messages in thread
From: Frederic Weisbecker @ 2012-03-05 18:43 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: LKML, David Ahern, Peter Zijlstra, Stephane Eranian, Ingo Molnar

ping?

On Sat, Feb 18, 2012 at 05:50:37PM +0100, Frederic Weisbecker wrote:
> On a system with a TSC considered as unstable, one can encounter this
> kind of warning:
> 
>      $ perf sched rec pong 2
>      $ perf sched lat
>      Warning: Timestamp below last timeslice flush
> 
> This happens when trace events trigger with a potentially high period,
> such as sched_stat_sleep, sched_stat_runtime, sched_stat_wait, etc...
> The perf event core then implement that weight by sending as many events
> as the given period. For example as many as the time the task has been
> sleeping in sched_stat_sleep event.
> 
> If this happens while irqs are disabled with an unstable tsc and this takes
> more time than a jiffy, then the timestamps of the events get stuck to
> the value of that next jiffy because sched_clock_local() bounds the timestamp
> to that maximum. The local timer tick is supposed to update that boundary but
> it can't given that irqs are disabled.
> 
> We can then meet this kind of scenario in perf record:
> 
> ===== CPU 0 =====      ==== CPU 1 ====
> 
>               PASS n
>      ...                    ...
>       1                      1
>       1                      2
>       1                      3 <-- max recorded
> 
>            finished round event
>             PASS n + 1
> 
>       1                      4
>       1                      5
>       1                      6
> 
>            finished round event
>             PASS n + 2
> 
>       1                      7
>      ...                    ...
> 
> CPU 0 is stuck sending events with irqs disabled and with the stale
> timestamp. When we do the events reordering for perf script for example,
> we flush all the events before timestamp 3 when we reach PASS n + 2,
> considering we can't anymore have timestamps below 3 now.
> But we still do have timestamps below 3 on PASS n + 2.
> 
> To solve that issue, instead of considering that timestamps are globally
> monotonic, we assume they are locally monotonic. Instead of recording
> the max timestamp on each pass, we check the max one per CPU on each
> pass and keep the smallest over these as the new barrier up to which
> we flush the events on the PASS n + 2. This still relies on a bit of
> global monotonicity because if some CPU doesn't have events in PASS n,
> we expect it not to have event in PASS n + 2 past the barrier recorded
> in PASS n. So this is still not a totally robust ordering but it's still
> better than what we had before.
> 
> The only way to have a deterministic and solid ordering will be to use
> per cpu perf.data files.
> 
> Reported-by: Stephane Eranian <eranian@google.com>
> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: David Ahern <dsahern@gmail.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Stephane Eranian <eranian@google.com>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
> ---
>  tools/perf/util/evsel.c   |    5 +-
>  tools/perf/util/session.c |  146 +++++++++++++++++++++++++++++++++-----------
>  tools/perf/util/session.h |    3 +-
>  3 files changed, 115 insertions(+), 39 deletions(-)
> 
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index 302d49a..1c8eb4b 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -119,9 +119,12 @@ void perf_evsel__config(struct perf_evsel *evsel, struct perf_record_opts *opts)
>  	if (opts->raw_samples) {
>  		attr->sample_type	|= PERF_SAMPLE_TIME;
>  		attr->sample_type	|= PERF_SAMPLE_RAW;
> -		attr->sample_type	|= PERF_SAMPLE_CPU;
>  	}
>  
> +	/* Need to know the CPU for tools that need to order events */
> +	if (attr->sample_type & PERF_SAMPLE_TIME)
> +		attr->sample_type	|= PERF_SAMPLE_CPU;
> +
>  	if (opts->no_delay) {
>  		attr->watermark = 0;
>  		attr->wakeup_events = 1;
> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
> index 9f833cf..f297342 100644
> --- a/tools/perf/util/session.c
> +++ b/tools/perf/util/session.c
> @@ -494,6 +494,8 @@ static void perf_session_free_sample_buffers(struct perf_session *session)
>  		list_del(&sq->list);
>  		free(sq);
>  	}
> +
> +	free(os->last_cpu_timestamp);
>  }
>  
>  static int perf_session_deliver_event(struct perf_session *session,
> @@ -549,56 +551,89 @@ static void flush_sample_queue(struct perf_session *s,
>  }
>  
>  /*
> - * When perf record finishes a pass on every buffers, it records this pseudo
> - * event.
> - * We record the max timestamp t found in the pass n.
> - * Assuming these timestamps are monotonic across cpus, we know that if
> - * a buffer still has events with timestamps below t, they will be all
> - * available and then read in the pass n + 1.
> - * Hence when we start to read the pass n + 2, we can safely flush every
> - * events with timestamps below t.
> + * We make the assumption that timestamps are not globally monotonic but locally
> + * non-strictly monotonic. In practice, this is because if we are dealing with a
> + * machine with unstable TSC, the kernel bounds the result of the tsc between
> + * last_tick_time < tsc < next_tick_time. Thus, if a CPU disables interrupts for more
> + * than one jiffy, all of its timestamps will be equal to next_tick_time after we
> + * cross that jiffy, without any further progress whereas the other CPU continue
> + * with normal timestamps. This can happen if a CPU sends crazillions of events
> + * while interrupts are disabled. But there are potentially other random scenarios
> + * with unstable TSC that drives us to assume the monotonicity of time only per CPU
> + * and not globally.
> + *
> + * To solve this, when perf record finishes a round of write on every buffers, it
> + * records a pseudo event named "finished round". The frame of events that happen
> + * between two finished rounds is called a "pass".
> + * We record the max timestamp T[cpu] per CPU found over the events in the pass n.
> + * Then when we finish a round, we iterate over these T[cpu]and keep the smallest
> + * one: min(T).
> + *
> + * Assuming these timestamps are locally monotonic (non strictly), we can flush all
> + * queued events having a timestamp below min(T) when we start to process PASS n + 1.
> + * But we actually wait until we start PASS n + 2 in case a CPU did not have any
> + * event in PASS n but came in PASS n + 1 with events below min(T). We truly
> + * hope no CPU will come with events below min(T) after pass n + 1. This
> + * heuristicly rely on some minimal global consistancy. This should work in most
> + * real world case, the only way to ensure a truly safe ordering with regular
> + * flush will be to switch to per CPU record files.
> + *
>   *
> - *    ============ PASS n =================
> - *       CPU 0         |   CPU 1
> - *                     |
> - *    cnt1 timestamps  |   cnt2 timestamps
> - *          1          |         2
> - *          2          |         3
> - *          -          |         4  <--- max recorded
> + *    ========================== PASS n ============================
> + *       CPU 0                   |   CPU 1
> + *                               |
> + *    cnt1 timestamps            |   cnt2 timestamps
> + *          1                    |         2
> + *          2 <--- max recorded  |         3
> + *          -                    |         4 <--- max recorded
> + *                          min(T) = 2
>   *
> - *    ============ PASS n + 1 ==============
> - *       CPU 0         |   CPU 1
> - *                     |
> - *    cnt1 timestamps  |   cnt2 timestamps
> - *          3          |         5
> - *          4          |         6
> - *          5          |         7 <---- max recorded
> + *    ========================== PASS n + 1 ========================
> + *       CPU 0                   |   CPU 1
> + *                               |
> + *    cnt1 timestamps            |   cnt2 timestamps
> + *          3                    |         5
> + *          4                    |         6
> + *          5 <--- max record    |         7 <---- max recorded
> + *                          min(T) = 5
>   *
> - *      Flush every events below timestamp 4
> + *                Flush every events below timestamp 2
>   *
> - *    ============ PASS n + 2 ==============
> - *       CPU 0         |   CPU 1
> - *                     |
> - *    cnt1 timestamps  |   cnt2 timestamps
> - *          6          |         8
> - *          7          |         9
> - *          -          |         10
> + *    ========================== PASS n + 2 ========================
> + *       CPU 0                   |   CPU 1
> + *                               |
> + *    cnt1 timestamps            |   cnt2 timestamps
> + *          6                    |         8
> + *          7                    |         9
> + *          -                    |         10
>   *
> - *      Flush every events below timestamp 7
> - *      etc...
> + *                Flush every events below timestamp 5, etc...
>   */
>  static int process_finished_round(struct perf_tool *tool,
>  				  union perf_event *event __used,
>  				  struct perf_session *session)
>  {
> +	unsigned int i;
> +	u64 min = ULLONG_MAX;
> +	struct ordered_samples *os = &session->ordered_samples;
> +
>  	flush_sample_queue(session, tool);
> -	session->ordered_samples.next_flush = session->ordered_samples.max_timestamp;
> +
> +	for (i = 0; i < session->nr_cpus; i++) {
> +		if (os->last_cpu_timestamp[i] < min)
> +			min = os->last_cpu_timestamp[i];
> +
> +		os->last_cpu_timestamp[i] = ULLONG_MAX;
> +	}
> +
> +	if (min != ULLONG_MAX)
> +		os->next_flush = min;
>  
>  	return 0;
>  }
>  
>  /* The queue is ordered by time */
> -static void __queue_event(struct sample_queue *new, struct perf_session *s)
> +static void __queue_event(struct sample_queue *new, struct perf_session *s, int cpu)
>  {
>  	struct ordered_samples *os = &s->ordered_samples;
>  	struct sample_queue *sample = os->last_sample;
> @@ -607,10 +642,10 @@ static void __queue_event(struct sample_queue *new, struct perf_session *s)
>  
>  	++os->nr_samples;
>  	os->last_sample = new;
> +	os->last_cpu_timestamp[cpu] = timestamp;
>  
>  	if (!sample) {
>  		list_add(&new->list, &os->samples);
> -		os->max_timestamp = timestamp;
>  		return;
>  	}
>  
> @@ -624,7 +659,6 @@ static void __queue_event(struct sample_queue *new, struct perf_session *s)
>  			p = sample->list.next;
>  			if (p == &os->samples) {
>  				list_add_tail(&new->list, &os->samples);
> -				os->max_timestamp = timestamp;
>  				return;
>  			}
>  			sample = list_entry(p, struct sample_queue, list);
> @@ -643,6 +677,34 @@ static void __queue_event(struct sample_queue *new, struct perf_session *s)
>  	}
>  }
>  
> +static int alloc_cpus_timestamp_array(struct perf_session *s,
> +				      struct perf_sample *sample,
> +				      struct ordered_samples *os)
> +{
> +	int i;
> +	int nr_cpus;
> +
> +	if (sample->cpu < s->nr_cpus)
> +		return 0;
> +
> +	nr_cpus = sample->cpu + 1;
> +
> +	if (!os->last_cpu_timestamp)
> +		os->last_cpu_timestamp = malloc(sizeof(u64) * nr_cpus);
> +	else
> +		os->last_cpu_timestamp = realloc(os->last_cpu_timestamp,
> +						 sizeof(u64) * nr_cpus);
> +	if (!os->last_cpu_timestamp)
> +		return -ENOMEM;
> +
> +	for (i = s->nr_cpus; i < nr_cpus; i++)
> +		os->last_cpu_timestamp[i] = ULLONG_MAX;
> +
> +	s->nr_cpus = nr_cpus;
> +
> +	return 0;
> +}
> +
>  #define MAX_SAMPLE_BUFFER	(64 * 1024 / sizeof(struct sample_queue))
>  
>  static int perf_session_queue_event(struct perf_session *s, union perf_event *event,
> @@ -652,6 +714,12 @@ static int perf_session_queue_event(struct perf_session *s, union perf_event *ev
>  	struct list_head *sc = &os->sample_cache;
>  	u64 timestamp = sample->time;
>  	struct sample_queue *new;
> +	int err;
> +
> +	if (!(s->sample_type & PERF_SAMPLE_CPU)) {
> +		pr_err("Warning: Need to record CPU on samples for ordering\n");
> +		return -EINVAL;
> +	}
>  
>  	if (!timestamp || timestamp == ~0ULL)
>  		return -ETIME;
> @@ -661,6 +729,10 @@ static int perf_session_queue_event(struct perf_session *s, union perf_event *ev
>  		return -EINVAL;
>  	}
>  
> +	err = alloc_cpus_timestamp_array(s, sample, os);
> +	if (err)
> +		return err;
> +
>  	if (!list_empty(sc)) {
>  		new = list_entry(sc->next, struct sample_queue, list);
>  		list_del(&new->list);
> @@ -681,7 +753,7 @@ static int perf_session_queue_event(struct perf_session *s, union perf_event *ev
>  	new->file_offset = file_offset;
>  	new->event = event;
>  
> -	__queue_event(new, s);
> +	__queue_event(new, s, sample->cpu);
>  
>  	return 0;
>  }
> diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
> index c8d9017..642591b 100644
> --- a/tools/perf/util/session.h
> +++ b/tools/perf/util/session.h
> @@ -16,7 +16,7 @@ struct thread;
>  struct ordered_samples {
>  	u64			last_flush;
>  	u64			next_flush;
> -	u64			max_timestamp;
> +	u64			*last_cpu_timestamp;
>  	struct list_head	samples;
>  	struct list_head	sample_cache;
>  	struct list_head	to_free;
> @@ -50,6 +50,7 @@ struct perf_session {
>  	int			cwdlen;
>  	char			*cwd;
>  	struct ordered_samples	ordered_samples;
> +	unsigned int		nr_cpus;
>  	char			filename[1];
>  };
>  
> -- 
> 1.7.5.4
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] perf tools: Fix ordering with unstable tsc
  2012-02-18 16:50 ` [PATCH] perf tools: Fix ordering with unstable tsc Frederic Weisbecker
  2012-02-22 15:35   ` Stephane Eranian
  2012-03-05 18:43   ` Frederic Weisbecker
@ 2012-03-14 19:55   ` Arnaldo Carvalho de Melo
  2012-03-14 20:07     ` David Ahern
  2012-03-22  0:10     ` Frederic Weisbecker
  2 siblings, 2 replies; 31+ messages in thread
From: Arnaldo Carvalho de Melo @ 2012-03-14 19:55 UTC (permalink / raw)
  To: Frederic Weisbecker, Thomas Gleixner
  Cc: LKML, David Ahern, Peter Zijlstra, Stephane Eranian, Ingo Molnar

Em Sat, Feb 18, 2012 at 05:50:37PM +0100, Frederic Weisbecker escreveu:
> On a system with a TSC considered as unstable, one can encounter this
> kind of warning:
> 
>      $ perf sched rec pong 2
>      $ perf sched lat
>      Warning: Timestamp below last timeslice flush


Some nits below.

Thomas,

        You rewrote the ordering code in a1225dec; could you take a look
at this new patch by Frédéric and provide an Acked-by?

- Arnaldo

 
> This happens when trace events trigger with a potentially high period,
> such as sched_stat_sleep, sched_stat_runtime, sched_stat_wait, etc...
> The perf event core then implement that weight by sending as many events
> as the given period. For example as many as the time the task has been
> sleeping in sched_stat_sleep event.
> 
> If this happens while irqs are disabled with an unstable tsc and this takes
> more time than a jiffy, then the timestamps of the events get stuck to
> the value of that next jiffy because sched_clock_local() bounds the timestamp
> to that maximum. The local timer tick is supposed to update that boundary but
> it can't given that irqs are disabled.
> 
> We can then meet this kind of scenario in perf record:
> 
> ===== CPU 0 =====      ==== CPU 1 ====
> 
>               PASS n
>      ...                    ...
>       1                      1
>       1                      2
>       1                      3 <-- max recorded
> 
>            finished round event
>             PASS n + 1
> 
>       1                      4
>       1                      5
>       1                      6
> 
>            finished round event
>             PASS n + 2
> 
>       1                      7
>      ...                    ...
> 
> CPU 0 is stuck sending events with irqs disabled and with the stale
> timestamp. When we do the events reordering for perf script for example,
> we flush all the events before timestamp 3 when we reach PASS n + 2,
> considering we can't anymore have timestamps below 3 now.
> But we still do have timestamps below 3 on PASS n + 2.
> 
> To solve that issue, instead of considering that timestamps are globally
> monotonic, we assume they are locally monotonic. Instead of recording
> the max timestamp on each pass, we check the max one per CPU on each
> pass and keep the smallest over these as the new barrier up to which
> we flush the events on the PASS n + 2. This still relies on a bit of
> global monotonicity because if some CPU doesn't have events in PASS n,
> we expect it not to have event in PASS n + 2 past the barrier recorded
> in PASS n. So this is still not a totally robust ordering but it's still
> better than what we had before.
> 
> The only way to have a deterministic and solid ordering will be to use
> per cpu perf.data files.
> 
> Reported-by: Stephane Eranian <eranian@google.com>
> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: David Ahern <dsahern@gmail.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Stephane Eranian <eranian@google.com>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
> ---
>  tools/perf/util/evsel.c   |    5 +-
>  tools/perf/util/session.c |  146 +++++++++++++++++++++++++++++++++-----------
>  tools/perf/util/session.h |    3 +-
>  3 files changed, 115 insertions(+), 39 deletions(-)
> 
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index 302d49a..1c8eb4b 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -119,9 +119,12 @@ void perf_evsel__config(struct perf_evsel *evsel, struct perf_record_opts *opts)
>  	if (opts->raw_samples) {
>  		attr->sample_type	|= PERF_SAMPLE_TIME;
>  		attr->sample_type	|= PERF_SAMPLE_RAW;
> -		attr->sample_type	|= PERF_SAMPLE_CPU;
>  	}
>  
> +	/* Need to know the CPU for tools that need to order events */
> +	if (attr->sample_type & PERF_SAMPLE_TIME)
> +		attr->sample_type	|= PERF_SAMPLE_CPU;
> +
>  	if (opts->no_delay) {
>  		attr->watermark = 0;
>  		attr->wakeup_events = 1;
> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
> index 9f833cf..f297342 100644
> --- a/tools/perf/util/session.c
> +++ b/tools/perf/util/session.c
> @@ -494,6 +494,8 @@ static void perf_session_free_sample_buffers(struct perf_session *session)
>  		list_del(&sq->list);
>  		free(sq);
>  	}
> +
> +	free(os->last_cpu_timestamp);
>  }
>  
>  static int perf_session_deliver_event(struct perf_session *session,
> @@ -549,56 +551,89 @@ static void flush_sample_queue(struct perf_session *s,
>  }
>  
>  /*
> - * When perf record finishes a pass on every buffers, it records this pseudo
> - * event.
> - * We record the max timestamp t found in the pass n.
> - * Assuming these timestamps are monotonic across cpus, we know that if
> - * a buffer still has events with timestamps below t, they will be all
> - * available and then read in the pass n + 1.
> - * Hence when we start to read the pass n + 2, we can safely flush every
> - * events with timestamps below t.
> + * We make the assumption that timestamps are not globally monotonic but locally
> + * non-strictly monotonic. In practice, this is because if we are dealing with a
> + * machine with unstable TSC, the kernel bounds the result of the tsc between
> + * last_tick_time < tsc < next_tick_time. Thus, if a CPU disables interrupts for more
> + * than one jiffy, all of its timestamps will be equal to next_tick_time after we
> + * cross that jiffy, without any further progress whereas the other CPU continue
> + * with normal timestamps. This can happen if a CPU sends crazillions of events
> + * while interrupts are disabled. But there are potentially other random scenarios
> + * with unstable TSC that drives us to assume the monotonicity of time only per CPU
> + * and not globally.
> + *
> + * To solve this, when perf record finishes a round of write on every buffers, it
> + * records a pseudo event named "finished round". The frame of events that happen
> + * between two finished rounds is called a "pass".
> + * We record the max timestamp T[cpu] per CPU found over the events in the pass n.
> + * Then when we finish a round, we iterate over these T[cpu]and keep the smallest
> + * one: min(T).
> + *
> + * Assuming these timestamps are locally monotonic (non strictly), we can flush all
> + * queued events having a timestamp below min(T) when we start to process PASS n + 1.
> + * But we actually wait until we start PASS n + 2 in case a CPU did not have any
> + * event in PASS n but came in PASS n + 1 with events below min(T). We truly
> + * hope no CPU will come with events below min(T) after pass n + 1. This
> + * heuristicly rely on some minimal global consistancy. This should work in most
> + * real world case, the only way to ensure a truly safe ordering with regular
> + * flush will be to switch to per CPU record files.
> + *
>   *
> - *    ============ PASS n =================
> - *       CPU 0         |   CPU 1
> - *                     |
> - *    cnt1 timestamps  |   cnt2 timestamps
> - *          1          |         2
> - *          2          |         3
> - *          -          |         4  <--- max recorded
> + *    ========================== PASS n ============================
> + *       CPU 0                   |   CPU 1
> + *                               |
> + *    cnt1 timestamps            |   cnt2 timestamps
> + *          1                    |         2
> + *          2 <--- max recorded  |         3
> + *          -                    |         4 <--- max recorded
> + *                          min(T) = 2
>   *
> - *    ============ PASS n + 1 ==============
> - *       CPU 0         |   CPU 1
> - *                     |
> - *    cnt1 timestamps  |   cnt2 timestamps
> - *          3          |         5
> - *          4          |         6
> - *          5          |         7 <---- max recorded
> + *    ========================== PASS n + 1 ========================
> + *       CPU 0                   |   CPU 1
> + *                               |
> + *    cnt1 timestamps            |   cnt2 timestamps
> + *          3                    |         5
> + *          4                    |         6
> + *          5 <--- max record    |         7 <---- max recorded
> + *                          min(T) = 5
>   *
> - *      Flush every events below timestamp 4
> + *                Flush every events below timestamp 2
>   *
> - *    ============ PASS n + 2 ==============
> - *       CPU 0         |   CPU 1
> - *                     |
> - *    cnt1 timestamps  |   cnt2 timestamps
> - *          6          |         8
> - *          7          |         9
> - *          -          |         10
> + *    ========================== PASS n + 2 ========================
> + *       CPU 0                   |   CPU 1
> + *                               |
> + *    cnt1 timestamps            |   cnt2 timestamps
> + *          6                    |         8
> + *          7                    |         9
> + *          -                    |         10
>   *
> - *      Flush every events below timestamp 7
> - *      etc...
> + *                Flush every events below timestamp 5, etc...
>   */
>  static int process_finished_round(struct perf_tool *tool,
>  				  union perf_event *event __used,
>  				  struct perf_session *session)
>  {
> +	unsigned int i;
> +	u64 min = ULLONG_MAX;
> +	struct ordered_samples *os = &session->ordered_samples;
> +
>  	flush_sample_queue(session, tool);
> -	session->ordered_samples.next_flush = session->ordered_samples.max_timestamp;
> +
> +	for (i = 0; i < session->nr_cpus; i++) {
> +		if (os->last_cpu_timestamp[i] < min)
> +			min = os->last_cpu_timestamp[i];
> +
> +		os->last_cpu_timestamp[i] = ULLONG_MAX;
> +	}
> +
> +	if (min != ULLONG_MAX)
> +		os->next_flush = min;
>  
>  	return 0;
>  }
>  
>  /* The queue is ordered by time */
> -static void __queue_event(struct sample_queue *new, struct perf_session *s)
> +static void __queue_event(struct sample_queue *new, struct perf_session *s, int cpu)
>  {
>  	struct ordered_samples *os = &s->ordered_samples;
>  	struct sample_queue *sample = os->last_sample;
> @@ -607,10 +642,10 @@ static void __queue_event(struct sample_queue *new, struct perf_session *s)
>  
>  	++os->nr_samples;
>  	os->last_sample = new;
> +	os->last_cpu_timestamp[cpu] = timestamp;
>  
>  	if (!sample) {
>  		list_add(&new->list, &os->samples);
> -		os->max_timestamp = timestamp;
>  		return;
>  	}
>  
> @@ -624,7 +659,6 @@ static void __queue_event(struct sample_queue *new, struct perf_session *s)
>  			p = sample->list.next;
>  			if (p == &os->samples) {
>  				list_add_tail(&new->list, &os->samples);
> -				os->max_timestamp = timestamp;
>  				return;
>  			}
>  			sample = list_entry(p, struct sample_queue, list);
> @@ -643,6 +677,34 @@ static void __queue_event(struct sample_queue *new, struct perf_session *s)
>  	}
>  }
>  
> +static int alloc_cpus_timestamp_array(struct perf_session *s,
> +				      struct perf_sample *sample,
> +				      struct ordered_samples *os)
> +{
> +	int i;
> +	int nr_cpus;
> +
> +	if (sample->cpu < s->nr_cpus)
> +		return 0;


Shouldn't we try to robustify this checking against HEADER_NRCPUS (if
present)?

I just tried to see how to find that info, but unfortunately it is only
printed when calling perf_header__fprintf_info; it can be made to work,
though.

I need to work on storing those values in a struct accessible through
perf_session or perf_evlist, so that I can show that info on the TUI;
perhaps at that time I can robustify it this way.
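
For illustration, a bounds check along those lines might look like the sketch
below. This is only a sketch: the perf_session__header_nr_cpus() accessor is
hypothetical, since at this point the HEADER_NRCPUS value is only printed and
not yet exposed through perf_session.

	static int validate_sample_cpu(struct perf_session *s,
				       struct perf_sample *sample)
	{
		/* hypothetical accessor for the HEADER_NRCPUS feature */
		int header_nr_cpus = perf_session__header_nr_cpus(s);

		if (header_nr_cpus > 0 && sample->cpu >= (u32)header_nr_cpus) {
			pr_err("Sample CPU %u out of range, header says %d CPUs\n",
			       sample->cpu, header_nr_cpus);
			return -EINVAL;
		}

		return 0;
	}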

> +
> +	nr_cpus = sample->cpu + 1;
> +
> +	if (!os->last_cpu_timestamp)
> +		os->last_cpu_timestamp = malloc(sizeof(u64) * nr_cpus);
> +	else
> +		os->last_cpu_timestamp = realloc(os->last_cpu_timestamp,
> +						 sizeof(u64) * nr_cpus);

If realloc fails we return -ENOMEM, but leak the old buffer.

At some point we can have in the TUI/GUI a way for the user to ask for
a specific perf.data file to be processed; if it fails to process due
to the above realloc, we'll continue, allowing other perf.data files to
be processed, but we will still have this leak.
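
For reference, the usual leak-safe shape for that allocation looks like the
sketch below (just an illustration, not the actual follow-up patch). Note that
realloc(NULL, size) behaves like malloc(size), so the separate malloc() branch
could be dropped as well.

	u64 *tmp;

	tmp = realloc(os->last_cpu_timestamp, sizeof(u64) * nr_cpus);
	if (!tmp) {
		/* drop the old array instead of leaking it */
		free(os->last_cpu_timestamp);
		os->last_cpu_timestamp = NULL;
		return -ENOMEM;
	}
	os->last_cpu_timestamp = tmp;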

> +	if (!os->last_cpu_timestamp)
> +		return -ENOMEM;
> +
> +	for (i = s->nr_cpus; i < nr_cpus; i++)
> +		os->last_cpu_timestamp[i] = ULLONG_MAX;
> +
> +	s->nr_cpus = nr_cpus;
> +
> +	return 0;
> +}
> +
>  #define MAX_SAMPLE_BUFFER	(64 * 1024 / sizeof(struct sample_queue))
>  
>  static int perf_session_queue_event(struct perf_session *s, union perf_event *event,
> @@ -652,6 +714,12 @@ static int perf_session_queue_event(struct perf_session *s, union perf_event *ev
>  	struct list_head *sc = &os->sample_cache;
>  	u64 timestamp = sample->time;
>  	struct sample_queue *new;
> +	int err;
> +
> +	if (!(s->sample_type & PERF_SAMPLE_CPU)) {
> +		pr_err("Warning: Need to record CPU on samples for ordering\n");
> +		return -EINVAL;
> +	}
>  
>  	if (!timestamp || timestamp == ~0ULL)
>  		return -ETIME;
> @@ -661,6 +729,10 @@ static int perf_session_queue_event(struct perf_session *s, union perf_event *ev
>  		return -EINVAL;
>  	}
>  
> +	err = alloc_cpus_timestamp_array(s, sample, os);
> +	if (err)
> +		return err;
> +
>  	if (!list_empty(sc)) {
>  		new = list_entry(sc->next, struct sample_queue, list);
>  		list_del(&new->list);
> @@ -681,7 +753,7 @@ static int perf_session_queue_event(struct perf_session *s, union perf_event *ev
>  	new->file_offset = file_offset;
>  	new->event = event;
>  
> -	__queue_event(new, s);
> +	__queue_event(new, s, sample->cpu);
>  
>  	return 0;
>  }
> diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
> index c8d9017..642591b 100644
> --- a/tools/perf/util/session.h
> +++ b/tools/perf/util/session.h
> @@ -16,7 +16,7 @@ struct thread;
>  struct ordered_samples {
>  	u64			last_flush;
>  	u64			next_flush;
> -	u64			max_timestamp;
> +	u64			*last_cpu_timestamp;
>  	struct list_head	samples;
>  	struct list_head	sample_cache;
>  	struct list_head	to_free;
> @@ -50,6 +50,7 @@ struct perf_session {
>  	int			cwdlen;
>  	char			*cwd;
>  	struct ordered_samples	ordered_samples;
> +	unsigned int		nr_cpus;
>  	char			filename[1];
>  };
>  
> -- 
> 1.7.5.4

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] perf tools: Fix ordering with unstable tsc
  2012-03-14 19:55   ` Arnaldo Carvalho de Melo
@ 2012-03-14 20:07     ` David Ahern
  2012-03-22  0:10     ` Frederic Weisbecker
  1 sibling, 0 replies; 31+ messages in thread
From: David Ahern @ 2012-03-14 20:07 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Frederic Weisbecker
  Cc: Thomas Gleixner, LKML, Peter Zijlstra, Stephane Eranian, Ingo Molnar

On 3/14/12 1:55 PM, Arnaldo Carvalho de Melo wrote:
> Shouldn't we try to robustify this checking against HEADER_NRCPUS (if
> present)?
>
> I just tried to see how to find that info, but unfortunately it just
> prints it when calling perf_header__fprintf_info, it can be made to
> work.
>
> I need to work on storing those values in a struct accessible thru
> perf_session or perf_evlist, so that I can show that info on the TUI,
> perhaps at that time I can robustify this way.

Xiao added a design for retrieving data for the features in the proposed 
kvm-events tool. See the header.c changes in 
http://thread.gmane.org/gmane.linux.kernel/1262482

David

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] perf tools: Fix ordering with unstable tsc
  2012-03-14 19:55   ` Arnaldo Carvalho de Melo
  2012-03-14 20:07     ` David Ahern
@ 2012-03-22  0:10     ` Frederic Weisbecker
  2012-03-22 15:28       ` Arnaldo Carvalho de Melo
  1 sibling, 1 reply; 31+ messages in thread
From: Frederic Weisbecker @ 2012-03-22  0:10 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Thomas Gleixner, LKML, David Ahern, Peter Zijlstra,
	Stephane Eranian, Ingo Molnar

On Wed, Mar 14, 2012 at 04:55:35PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Sat, Feb 18, 2012 at 05:50:37PM +0100, Frederic Weisbecker escreveu:
> > +static int alloc_cpus_timestamp_array(struct perf_session *s,
> > +				      struct perf_sample *sample,
> > +				      struct ordered_samples *os)
> > +{
> > +	int i;
> > +	int nr_cpus;
> > +
> > +	if (sample->cpu < s->nr_cpus)
> > +		return 0;
> 
> 
> Shouldn't we try to robustify this checking against HEADER_NRCPUS (if
> present)?
> 
> I just tried to see how to find that info, but unfortunately it just
> prints it when calling perf_header__fprintf_info, it can be made to
> work.
> 
> I need to work on storing those values in a struct accessible thru
> perf_session or perf_evlist, so that I can show that info on the TUI,
> perhaps at that time I can robustify this way.

Yeah, I thought about that too. I can try to make that work.
I just thought this was an optimization that I could add later (i.e., first
see whether the core idea of the patch is accepted).

Of course the real long-term optimization is to have one file per
CPU. There, the ordering will be much easier and deterministically
correct.

> 
> > +
> > +	nr_cpus = sample->cpu + 1;
> > +
> > +	if (!os->last_cpu_timestamp)
> > +		os->last_cpu_timestamp = malloc(sizeof(u64) * nr_cpus);
> > +	else
> > +		os->last_cpu_timestamp = realloc(os->last_cpu_timestamp,
> > +						 sizeof(u64) * nr_cpus);
> 
> If realloc fails we return -ENOMEM, but leak the old buffer.

Doh! the common trap with realloc...

> 
> At some point we can have in the TUI/GUI a way for the user to ask for
> an specific perf.data file to be processed, if it fails to process due
> to the above realloc, we'll continue, allowing other perf.data files to
> be processed, but will have this leak.

Ok. Will fix.

Thanks.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] perf tools: Fix ordering with unstable tsc
  2012-03-22  0:10     ` Frederic Weisbecker
@ 2012-03-22 15:28       ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 31+ messages in thread
From: Arnaldo Carvalho de Melo @ 2012-03-22 15:28 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Thomas Gleixner, LKML, David Ahern, Peter Zijlstra,
	Stephane Eranian, Ingo Molnar

Em Thu, Mar 22, 2012 at 01:10:31AM +0100, Frederic Weisbecker escreveu:
> On Wed, Mar 14, 2012 at 04:55:35PM -0300, Arnaldo Carvalho de Melo wrote:
> > Em Sat, Feb 18, 2012 at 05:50:37PM +0100, Frederic Weisbecker escreveu:
> > > +static int alloc_cpus_timestamp_array(struct perf_session *s,
> > > +	if (sample->cpu < s->nr_cpus)

> > Shouldn't we try to robustify this checking against HEADER_NRCPUS (if
> > present)?

> Yeah I thought about that too. I can try to make that working.
> I just thought this was an optimization that I could later add (ie: first
> try to see if the core idea of the patch is accepted).

Right, I deferred that to tglx, but he must be busy as always :-P
 
> Of course the real long term optimization is going to have one file per
> CPU. There, the ordering will be much easier and deterministically
> correct.

Yeah.
 
> > > +		os->last_cpu_timestamp = realloc(os->last_cpu_timestamp,
> > > +						 sizeof(u64) * nr_cpus);
> > If realloc fails we return -ENOMEM, but leak the old buffer.

> Doh! the common trap with realloc...

:-)

> > At some point we can have in the TUI/GUI a way for the user to ask for
> > an specific perf.data file to be processed, if it fails to process due
> > to the above realloc, we'll continue, allowing other perf.data files to
> > be processed, but will have this leak.

> Ok. Will fix.

Thanks!

- Arnaldo

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2012-03-22 15:28 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-02-06 13:25 [BUG] perf: perf sched warning possibly due to clock granularity on AMD Stephane Eranian
2012-02-06 14:26 ` Peter Zijlstra
2012-02-06 14:31   ` Stephane Eranian
2012-02-06 17:17     ` Arnaldo Carvalho de Melo
2012-02-06 15:34   ` Borislav Petkov
2012-02-06 16:37     ` Peter Zijlstra
2012-02-06 16:46       ` Borislav Petkov
2012-02-06 16:54         ` Peter Zijlstra
2012-02-06 20:27           ` Borislav Petkov
2012-02-06 20:31             ` Peter Zijlstra
2012-02-06 20:37               ` Borislav Petkov
2012-02-06 21:19                 ` Venki Pallipadi
2012-02-07  7:51                   ` Peter Zijlstra
2012-02-07  8:32                   ` Ingo Molnar
2012-02-07  9:06                     ` Borislav Petkov
2012-02-07  9:50                       ` Ingo Molnar
2012-02-07 12:08                         ` [PATCH] x86, AMD: Set sched_clock_stable Borislav Petkov
2012-02-15 15:30                           ` Peter Zijlstra
2012-02-07 19:43 ` [tip:perf/core] x86/sched/perf/AMD: " tip-bot for Borislav Petkov
2012-02-08 15:07 ` [BUG] perf: perf sched warning possibly due to clock granularity on AMD Frederic Weisbecker
2012-02-08 15:10   ` Stephane Eranian
2012-02-08 15:22     ` Frederic Weisbecker
2012-02-08 15:23       ` Stephane Eranian
2012-02-18 16:50 ` [PATCH] perf tools: Fix ordering with unstable tsc Frederic Weisbecker
2012-02-22 15:35   ` Stephane Eranian
2012-02-22 15:39     ` David Ahern
2012-03-05 18:43   ` Frederic Weisbecker
2012-03-14 19:55   ` Arnaldo Carvalho de Melo
2012-03-14 20:07     ` David Ahern
2012-03-22  0:10     ` Frederic Weisbecker
2012-03-22 15:28       ` Arnaldo Carvalho de Melo
