* [BUG] perf: perf sched warning possibly due to clock granularity on AMD
@ 2012-02-06 13:25 Stephane Eranian
  2012-02-06 14:26 ` Peter Zijlstra
  ` (3 more replies)
  0 siblings, 4 replies; 31+ messages in thread

From: Stephane Eranian @ 2012-02-06 13:25 UTC (permalink / raw)
To: linux-kernel; +Cc: acme, peterz, mingo, robert.richter, eric.dumazet

Hi,

I am running 3.3.0-rc2 on an AMD Bulldozer system in 64-bit mode. While
testing perf sched I ran into an issue that seems to exist only on AMD and
not on Intel systems. It is not PMU-related because I am only doing tracing.

I am running a simple ping-pong test to stress the context-switch code: two
processes exchange a byte through a pipe (program provided below). I capture
a trace using perf sched record, then run perf sched latency, but on certain
runs I get:

$ perf sched rec pong 2
$ perf sched lat
Warning: Timestamp below last timeslice flush
Warning: Timestamp below last timeslice flush
Warning: Timestamp below last timeslice flush
Warning: Timestamp below last timeslice flush
Warning: Timestamp below last timeslice flush
Warning: Timestamp below last timeslice flush
Warning: Timestamp below last timeslice flush
Warning: Timestamp below last timeslice flush
Warning: Timestamp below last timeslice flush
Found 4934 unknown events!
Is this an older tool processing a perf.data file generated by a more recent tool?

I did some investigation in the kernel to figure out what could cause this
problem. I found that with this measurement it is possible to get multiple
consecutive samples with the SAME timestamp. That is possible on AMD because,
apparently, local_clock() ends up calling a sched_clock() implementation that
is based only on jiffies, which means the granularity is one tick (1ms in my
case). On Intel systems, at least on mine, local_clock() ends up calling a
routine which uses the TSC and thus has much higher precision. This is all
decided by the sched_clock_stable variable. It is set at init time for Intel
and not used on AMD.

Yet, recent AMD systems do have support for NONSTOP_TSC. It is not clear to
me why that would not be enough (unless it is a cross-socket synchronization
issue). One thing is clear: if I enable sched_clock_stable on my AMD system,
the perf sched warning disappears.

I also believe there is a genuine bug in perf. There are certainly systems
out there which cannot provide very fine-grained timestamps, so it must be
possible to get samples with identical timestamps. I don't fully understand
the sample re-ordering code in perf, but it seems sensitive to timer
granularity, and I think it should not be.

In summary, two issues:
 - Why is sched_clock_stable not set or even tested on recent AMD systems?
 - perf should not rely on fine-granularity timestamps in the re-ordering code

Any comments?

#include <sys/types.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdarg.h>
#include <errno.h>
#include <unistd.h>
#include <string.h>
#include <signal.h>
#include <syscall.h>
#include <err.h>

#define MAX_CPUS	64
#define NR_CPU_BITS	(MAX_CPUS>>3)

int
pin_cpu(pid_t pid, unsigned int cpu)
{
	uint64_t my_mask[NR_CPU_BITS];

	if (cpu >= MAX_CPUS)
		errx(1, "this program supports only up to %d CPUs", MAX_CPUS);

	memset(my_mask, 0, sizeof(my_mask));	/* clear the whole mask first */
	my_mask[cpu>>6] = 1ULL << (cpu&63);

	return syscall(__NR_sched_setaffinity, pid, sizeof(my_mask), &my_mask);
}

static volatile int quit;

void sig_handler(int n)
{
	quit = 1;
}

static void
do_child(int fr, int fw)
{
	char c;
	ssize_t ret;

	for(;;) {
		ret = read(fr, &c, 1);
		if (ret < 0)
			break;
		ret = write(fw, "c", 1);
		if (ret < 0)
			break;
	}
	printf("child exited\n");
	exit(0);
}

int
main(int argc, char **argv)
{
	int ret;
	int pr[2], pw[2];
	int which_cpu;
	pid_t pid;
	uint64_t nctx = 0;
	ssize_t nbytes;
	char c = '0';
	int delay;

	delay = argc > 1 ? atoi(argv[1]) : 10;

	srandom(getpid());
	which_cpu = random() % sysconf(_SC_NPROCESSORS_ONLN);

	ret = pipe(pr);
	if (ret)
		err(1, "cannot create read pipe");

	ret = pipe(pw);
	if (ret)
		err(1, "cannot create write pipe");

	ret = pin_cpu(getpid(), which_cpu);
	if (ret)
		err(1, "cannot pin to CPU%d", which_cpu);

	printf("Both processes pinned to CPU%d, running for %ds\n", which_cpu, delay);

	switch(pid=fork()) {
	case -1:
		err(1, "cannot create child");
	case 0:
		close(pr[1]);
		close(pw[0]);
		do_child(pr[0], pw[1]);
		exit(1);
	}

	close(pr[0]);
	close(pw[1]);

	signal(SIGALRM, sig_handler);
	alarm(delay);

	while(!quit) {
		nbytes = write(pr[1], "c", 1);
		nbytes = read(pw[0], &c, 1);
		nctx++;
	}
	close(pr[1]);
	close(pw[0]);
	return 0;
}

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD
  2012-02-06 13:25 [BUG] perf: perf sched warning possibly due to clock granularity on AMD Stephane Eranian
@ 2012-02-06 14:26 ` Peter Zijlstra
  2012-02-06 14:31   ` Stephane Eranian
  2012-02-06 15:34   ` Borislav Petkov
  2012-02-07 19:43 ` [tip:perf/core] x86/sched/perf/AMD: " tip-bot for Borislav Petkov
  ` (2 subsequent siblings)
  3 siblings, 2 replies; 31+ messages in thread

From: Peter Zijlstra @ 2012-02-06 14:26 UTC (permalink / raw)
To: Stephane Eranian
Cc: linux-kernel, acme, mingo, robert.richter, eric.dumazet,
	Borislav Petkov

On Mon, 2012-02-06 at 14:25 +0100, Stephane Eranian wrote:
> In summary, two issues:
> - Why is sched_clock_stable not set or even tested on recent AMD systems?

I'll leave that for Robert to answer; also added Boris to the Cc, he knows
the AMD bits too.

> - perf should not rely on fine granularity timestamps in the re-ordering code

Uhm, what else is there?

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD
  2012-02-06 14:26 ` Peter Zijlstra
@ 2012-02-06 14:31   ` Stephane Eranian
  2012-02-06 17:17     ` Arnaldo Carvalho de Melo
  2012-02-06 15:34   ` Borislav Petkov
  1 sibling, 1 reply; 31+ messages in thread

From: Stephane Eranian @ 2012-02-06 14:31 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, acme, mingo, robert.richter, eric.dumazet,
	Borislav Petkov

On Mon, Feb 6, 2012 at 3:26 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, 2012-02-06 at 14:25 +0100, Stephane Eranian wrote:
>> In summary, two issues:
>> - Why is sched_clock_stable not set or even tested on recent AMD systems?
>
> I'll leave that for Robert to answer; also added Boris to the Cc, he knows
> the AMD bits too.
>
Ok, good.

>> - perf should not rely on fine granularity timestamps in the re-ordering code
>
> Uhm, what else is there?

I understand your point. I don't quite understand what the code is supposed
to do for ordered samples. I'd like Arnaldo to explain this warning first.

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD
  2012-02-06 14:31 ` Stephane Eranian
@ 2012-02-06 17:17   ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 31+ messages in thread

From: Arnaldo Carvalho de Melo @ 2012-02-06 17:17 UTC (permalink / raw)
To: Stephane Eranian
Cc: Peter Zijlstra, linux-kernel, mingo, robert.richter, eric.dumazet,
	Borislav Petkov, Frederic Weisbecker, Thomas Gleixner

Em Mon, Feb 06, 2012 at 03:31:00PM +0100, Stephane Eranian escreveu:
> On Mon, Feb 6, 2012 at 3:26 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Mon, 2012-02-06 at 14:25 +0100, Stephane Eranian wrote:
> >> In summary, two issues:
> >> - Why is sched_clock_stable not set or even tested on recent AMD systems?
> >
> > I'll leave that for Robert to answer; also added Boris to the Cc, he knows
> > the AMD bits too.
> >
> Ok, good.
>
> >> - perf should not rely on fine granularity timestamps in the re-ordering code
> >
> > Uhm, what else is there?
>
> I understand your point. I don't quite understand what the code is
> supposed to do for ordered samples. I'd like Arnaldo to explain this
> warning first.

That code was written by Frédéric and IIRC also Thomas; it's something I
plan to rework somehow in the future, but for now I'll let them chime in...

Frédéric?

- Arnaldo

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD
  2012-02-06 14:26 ` Peter Zijlstra
  2012-02-06 14:31   ` Stephane Eranian
@ 2012-02-06 15:34   ` Borislav Petkov
  2012-02-06 16:37     ` Peter Zijlstra
  1 sibling, 1 reply; 31+ messages in thread

From: Borislav Petkov @ 2012-02-06 15:34 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Stephane Eranian, linux-kernel, acme, mingo, robert.richter,
	eric.dumazet, Borislav Petkov, Andreas Herrmann

+ Andreas.

On Mon, Feb 06, 2012 at 03:26:43PM +0100, Peter Zijlstra wrote:
> On Mon, 2012-02-06 at 14:25 +0100, Stephane Eranian wrote:
> > In summary, two issues:
> > - Why is sched_clock_stable not set or even tested on recent AMD systems?

AFAICT, sched_clock_stable is set on Intel under the following conditions:

	/*
	 * c->x86_power is 8000_0007 edx. Bit 8 is TSC runs at constant rate
	 * with P/T states and does not stop in deep C-states.
	 *
	 * It is also reliable across cores and sockets. (but not across
	 * cabinets - we turn it off in that case explicitly.)
	 */
	if (c->x86_power & (1 << 8)) {
		set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
		set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
		if (!check_tsc_unstable())
			sched_clock_stable = 1;
	}

and yes, we can do CONSTANT_TSC and NONSTOP_TSC on the now older F10h
already, so Bulldozer can do that too, implicitly.

I don't see why sched_clock_stable shouldn't be moved to generic x86 code
and set based on the two CPUID cap flags above.

HTH.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD
  2012-02-06 15:34 ` Borislav Petkov
@ 2012-02-06 16:37   ` Peter Zijlstra
  2012-02-06 16:46     ` Borislav Petkov
  1 sibling, 1 reply; 31+ messages in thread

From: Peter Zijlstra @ 2012-02-06 16:37 UTC (permalink / raw)
To: Borislav Petkov
Cc: Stephane Eranian, linux-kernel, acme, mingo, robert.richter,
	eric.dumazet, Andreas Herrmann

On Mon, 2012-02-06 at 16:34 +0100, Borislav Petkov wrote:
> + Andreas.
>
> On Mon, Feb 06, 2012 at 03:26:43PM +0100, Peter Zijlstra wrote:
> > On Mon, 2012-02-06 at 14:25 +0100, Stephane Eranian wrote:
> > > In summary, two issues:
> > > - Why is sched_clock_stable not set or even tested on recent AMD systems?
>
> AFAICT, sched_clock_stable is set on Intel under the following conditions:
>
> 	/*
> 	 * c->x86_power is 8000_0007 edx. Bit 8 is TSC runs at constant rate
> 	 * with P/T states and does not stop in deep C-states.
> 	 *
> 	 * It is also reliable across cores and sockets. (but not across
> 	 * cabinets - we turn it off in that case explicitly.)
> 	 */
> 	if (c->x86_power & (1 << 8)) {
> 		set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
> 		set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
> 		if (!check_tsc_unstable())
> 			sched_clock_stable = 1;
> 	}
>
> and yes, we can do CONSTANT_TSC and NONSTOP_TSC on the now older F10h
> already, so Bulldozer can do that too, implicitly.
>
> I don't see why sched_clock_stable shouldn't be moved to generic x86
> code and set based on the two CPUID cap flags above.

stable means more than both those flags together:

 - CONSTANT_TSC means freq independent
 - NONSTOP_TSC means it doesn't get stopped in any C state

Together they make TSC completely C-state independent. sched_clock_stable
actually means that TSC is both _AND_ is consistent across all CPUs in the
entire system.

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD
  2012-02-06 16:37 ` Peter Zijlstra
@ 2012-02-06 16:46   ` Borislav Petkov
  2012-02-06 16:54     ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread

From: Borislav Petkov @ 2012-02-06 16:46 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Stephane Eranian, linux-kernel, acme, mingo, robert.richter,
	eric.dumazet, Andreas Herrmann

On Mon, Feb 06, 2012 at 05:37:26PM +0100, Peter Zijlstra wrote:
> On Mon, 2012-02-06 at 16:34 +0100, Borislav Petkov wrote:
> > + Andreas.
> >
> > On Mon, Feb 06, 2012 at 03:26:43PM +0100, Peter Zijlstra wrote:
> > > On Mon, 2012-02-06 at 14:25 +0100, Stephane Eranian wrote:
> > > > In summary, two issues:
> > > > - Why is sched_clock_stable not set or even tested on recent AMD systems?
> >
> > AFAICT, sched_clock_stable is set on Intel under the following conditions:
> >
> > 	/*
> > 	 * c->x86_power is 8000_0007 edx. Bit 8 is TSC runs at constant rate
> > 	 * with P/T states and does not stop in deep C-states.
> > 	 *
> > 	 * It is also reliable across cores and sockets. (but not across
> > 	 * cabinets - we turn it off in that case explicitly.)
> > 	 */
> > 	if (c->x86_power & (1 << 8)) {
> > 		set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
> > 		set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
> > 		if (!check_tsc_unstable())
> > 			sched_clock_stable = 1;
> > 	}
> >
> > and yes, we can do CONSTANT_TSC and NONSTOP_TSC on the now older F10h
> > already, so Bulldozer can do that too, implicitly.
> >
> > I don't see why sched_clock_stable shouldn't be moved to generic x86
> > code and set based on the two CPUID cap flags above.
>
> stable means more than both those flags together:
>
>  - CONSTANT_TSC means freq independent
>  - NONSTOP_TSC means it doesn't get stopped in any C state
>
> Together they make TSC completely C-state independent.
> sched_clock_stable actually means that TSC is both _AND_ is consistent

yes.

> across all CPUs in the entire system.

Right, by the "entire system" you mean consistent across cores and sockets
but not necessarily across cabinets, as in the comment above, correct?

If so, let me ask around whether this holds true too.

Thanks.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD
  2012-02-06 16:46 ` Borislav Petkov
@ 2012-02-06 16:54   ` Peter Zijlstra
  2012-02-06 20:27     ` Borislav Petkov
  0 siblings, 1 reply; 31+ messages in thread

From: Peter Zijlstra @ 2012-02-06 16:54 UTC (permalink / raw)
To: Borislav Petkov
Cc: Stephane Eranian, linux-kernel, acme, mingo, robert.richter,
	eric.dumazet, Andreas Herrmann

On Mon, 2012-02-06 at 17:46 +0100, Borislav Petkov wrote:
> > across all CPUs in the entire system.
>
> Right, by the "entire system" you mean consistent across cores and sockets
> but not necessarily across cabinets, as in the comment above, correct?
>
> If so, let me ask around whether this holds true too.

Every CPU available to the kernel. So if you run a single system image
across your cabinets, then yes those too.

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD
  2012-02-06 16:54 ` Peter Zijlstra
@ 2012-02-06 20:27   ` Borislav Petkov
  2012-02-06 20:31     ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread

From: Borislav Petkov @ 2012-02-06 20:27 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Stephane Eranian, linux-kernel, acme, mingo, robert.richter,
	eric.dumazet, Andreas Herrmann

On Mon, Feb 06, 2012 at 05:54:19PM +0100, Peter Zijlstra wrote:
> On Mon, 2012-02-06 at 17:46 +0100, Borislav Petkov wrote:
> > > across all CPUs in the entire system.
> >
> > Right, by the "entire system" you mean consistent across cores and
> > sockets but not necessarily across cabinets, as in the comment above,
> > correct?
> >
> > If so, let me ask around whether this holds true too.
>
> Every CPU available to the kernel. So if you run a single system image
> across your cabinets, then yes those too.

Ok, but what about that sentence "(but not across cabinets - we turn it off
in that case explicitly.)" - I don't see any place where it is turned off
explicitly... Maybe a stale comment?

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD
  2012-02-06 20:27 ` Borislav Petkov
@ 2012-02-06 20:31   ` Peter Zijlstra
  2012-02-06 20:37     ` Borislav Petkov
  0 siblings, 1 reply; 31+ messages in thread

From: Peter Zijlstra @ 2012-02-06 20:31 UTC (permalink / raw)
To: Borislav Petkov
Cc: Stephane Eranian, linux-kernel, acme, mingo, robert.richter,
	eric.dumazet, Andreas Herrmann, Venkatesh Pallipadi

On Mon, 2012-02-06 at 21:27 +0100, Borislav Petkov wrote:
> On Mon, Feb 06, 2012 at 05:54:19PM +0100, Peter Zijlstra wrote:
> > On Mon, 2012-02-06 at 17:46 +0100, Borislav Petkov wrote:
> > > > across all CPUs in the entire system.
> > >
> > > Right, by the "entire system" you mean consistent across cores and
> > > sockets but not necessarily across cabinets, as in the comment above,
> > > correct?
> > >
> > > If so, let me ask around whether this holds true too.
> >
> > Every CPU available to the kernel. So if you run a single system image
> > across your cabinets, then yes those too.
>
> Ok, but what about that sentence "(but not across cabinets - we turn
> it off in that case explicitly.)" - I don't see any place where it is
> turned off explicitly... Maybe a stale comment?

I suspect it might be the sched_clock_stable = 0 in mark_tsc_unstable(),
but let's ask Venki; IIRC he wrote all that.

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD
  2012-02-06 20:31 ` Peter Zijlstra
@ 2012-02-06 20:37   ` Borislav Petkov
  2012-02-06 21:19     ` Venki Pallipadi
  0 siblings, 1 reply; 31+ messages in thread

From: Borislav Petkov @ 2012-02-06 20:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Stephane Eranian, linux-kernel, acme, mingo, robert.richter,
	eric.dumazet, Andreas Herrmann, Venkatesh Pallipadi

On Mon, Feb 06, 2012 at 09:31:33PM +0100, Peter Zijlstra wrote:
> On Mon, 2012-02-06 at 21:27 +0100, Borislav Petkov wrote:
> > On Mon, Feb 06, 2012 at 05:54:19PM +0100, Peter Zijlstra wrote:
> > > On Mon, 2012-02-06 at 17:46 +0100, Borislav Petkov wrote:
> > > > > across all CPUs in the entire system.
> > > >
> > > > Right, by the "entire system" you mean consistent across cores and
> > > > sockets but not necessarily across cabinets, as in the comment above,
> > > > correct?
> > > >
> > > > If so, let me ask around whether this holds true too.
> > >
> > > Every CPU available to the kernel. So if you run a single system image
> > > across your cabinets, then yes those too.
> >
> > Ok, but what about that sentence "(but not across cabinets - we turn
> > it off in that case explicitly.)" - I don't see any place where it is
> > turned off explicitly... Maybe a stale comment?
>
> I suspect it might be the sched_clock_stable = 0 in mark_tsc_unstable(),
> but let's ask Venki; IIRC he wrote all that.

Yeah, I was looking at the code further and on Intel it does:

	if (c->x86_power & (1 << 8)) {
		set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
		set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
		if (!check_tsc_unstable())
			sched_clock_stable = 1;
	}

while on AMD, in early_init_amd() we do:

	if (c->x86_power & (1 << 8)) {
		set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
		set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
	}

and having in mind that tsc_unstable is set on generic x86 paths, nothing
stops us from doing the same on AMD too and, as a result, setting
sched_clock_stable as well.

But yeah, let's see what Venki has to say first.

Thanks.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD
  2012-02-06 20:37 ` Borislav Petkov
@ 2012-02-06 21:19   ` Venki Pallipadi
  2012-02-07  7:51     ` Peter Zijlstra
  2012-02-07  8:32     ` Ingo Molnar
  0 siblings, 2 replies; 31+ messages in thread

From: Venki Pallipadi @ 2012-02-06 21:19 UTC (permalink / raw)
To: Borislav Petkov
Cc: Peter Zijlstra, Stephane Eranian, linux-kernel, acme, mingo,
	robert.richter, eric.dumazet, Andreas Herrmann

On Mon, Feb 6, 2012 at 12:37 PM, Borislav Petkov <bp@amd64.org> wrote:
> On Mon, Feb 06, 2012 at 09:31:33PM +0100, Peter Zijlstra wrote:
>> On Mon, 2012-02-06 at 21:27 +0100, Borislav Petkov wrote:
>> > On Mon, Feb 06, 2012 at 05:54:19PM +0100, Peter Zijlstra wrote:
>> > > On Mon, 2012-02-06 at 17:46 +0100, Borislav Petkov wrote:
>> > > > > across all CPUs in the entire system.
>> > > >
>> > > > Right, by the "entire system" you mean consistent across cores and
>> > > > sockets but not necessarily across cabinets, as in the comment above,
>> > > > correct?
>> > > >
>> > > > If so, let me ask around whether this holds true too.
>> > >
>> > > Every CPU available to the kernel. So if you run a single system image
>> > > across your cabinets, then yes those too.
>> >
>> > Ok, but what about that sentence "(but not across cabinets - we turn
>> > it off in that case explicitly.)" - I don't see any place where it is
>> > turned off explicitly... Maybe a stale comment?
>>
>> I suspect it might be the sched_clock_stable = 0 in mark_tsc_unstable(),
>> but let's ask Venki; IIRC he wrote all that.
>
> Yeah, I was looking at the code further and on Intel it does:
>
> 	if (c->x86_power & (1 << 8)) {
> 		set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
> 		set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
> 		if (!check_tsc_unstable())
> 			sched_clock_stable = 1;
> 	}
>
> while on AMD, in early_init_amd() we do:
>
> 	if (c->x86_power & (1 << 8)) {
> 		set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
> 		set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
> 	}
>
> and having in mind that tsc_unstable is set on generic x86 paths, nothing
> stops us from doing the same on AMD too and, as a result, setting
> sched_clock_stable as well.
>
> But yeah, let's see what Venki has to say first.
>

Looks like the cabinet comment came from Ingo (commit 83ce4009), in
reference to:
(We will turn this off in DMI quirks for multi-chassis systems)

Yes. If these two flags are set, TSC should be consistent and
sched_clock_stable could be set, and it will be reset if there is a call to
mark_tsc_unstable().

Thanks,
Venki

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD
  2012-02-06 21:19 ` Venki Pallipadi
@ 2012-02-07  7:51   ` Peter Zijlstra
  2012-02-07  8:32   ` Ingo Molnar
  0 siblings, 0 replies; 31+ messages in thread

From: Peter Zijlstra @ 2012-02-07  7:51 UTC (permalink / raw)
To: Venki Pallipadi
Cc: Borislav Petkov, Stephane Eranian, linux-kernel, acme, mingo,
	robert.richter, eric.dumazet, Andreas Herrmann

On Mon, 2012-02-06 at 13:19 -0800, Venki Pallipadi wrote:
>
> Yes. If these two flags are set, TSC should be consistent and
> sched_clock_stable could be set, and it will be reset if there is a call
> to mark_tsc_unstable().

Only trouble is that sched_clock_cpu() et al. will jump when you flip
sched_clock_stable to 0 after you started using it; it's something we might
want to fix.

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD
  2012-02-06 21:19 ` Venki Pallipadi
  2012-02-07  7:51   ` Peter Zijlstra
@ 2012-02-07  8:32   ` Ingo Molnar
  2012-02-07  9:06     ` Borislav Petkov
  1 sibling, 1 reply; 31+ messages in thread

From: Ingo Molnar @ 2012-02-07 8:32 UTC (permalink / raw)
To: Venki Pallipadi
Cc: Borislav Petkov, Peter Zijlstra, Stephane Eranian, linux-kernel,
	acme, robert.richter, eric.dumazet, Andreas Herrmann

* Venki Pallipadi <venki@google.com> wrote:

> On Mon, Feb 6, 2012 at 12:37 PM, Borislav Petkov <bp@amd64.org> wrote:
> > On Mon, Feb 06, 2012 at 09:31:33PM +0100, Peter Zijlstra wrote:
> >> On Mon, 2012-02-06 at 21:27 +0100, Borislav Petkov wrote:
> >> > On Mon, Feb 06, 2012 at 05:54:19PM +0100, Peter Zijlstra wrote:
> >> > > On Mon, 2012-02-06 at 17:46 +0100, Borislav Petkov wrote:
> >> > > > > across all CPUs in the entire system.
> >> > > >
> >> > > > Right, by the "entire system" you mean consistent across cores and
> >> > > > sockets but not necessarily across cabinets, as in the comment above,
> >> > > > correct?
> >> > > >
> >> > > > If so, let me ask around whether this holds true too.
> >> > >
> >> > > Every CPU available to the kernel. So if you run a single system image
> >> > > across your cabinets, then yes those too.
> >> >
> >> > Ok, but what about that sentence "(but not across cabinets - we turn
> >> > it off in that case explicitly.)" - I don't see any place where it is
> >> > turned off explicitly... Maybe a stale comment?
> >>
> >> I suspect it might be the sched_clock_stable = 0 in mark_tsc_unstable(),
> >> but let's ask Venki; IIRC he wrote all that.
> >
> > Yeah, I was looking at the code further and on Intel it does:
> >
> > 	if (c->x86_power & (1 << 8)) {
> > 		set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
> > 		set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
> > 		if (!check_tsc_unstable())
> > 			sched_clock_stable = 1;
> > 	}
> >
> > while on AMD, in early_init_amd() we do:
> >
> > 	if (c->x86_power & (1 << 8)) {
> > 		set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
> > 		set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
> > 	}
> >
> > and having in mind that tsc_unstable is set on generic x86 paths, nothing
> > stops us from doing the same on AMD too and, as a result, setting
> > sched_clock_stable as well.
> >
> > But yeah, let's see what Venki has to say first.
>
> Looks like the cabinet comment came from Ingo (commit 83ce4009), in
> reference to:
> (We will turn this off in DMI quirks for multi-chassis systems)
>
> Yes. If these two flags are set, TSC should be consistent and
> sched_clock_stable could be set, and it will be reset if there
> is a call to mark_tsc_unstable().

Most of the details swapped out from my brain meanwhile, but I have some
vague memories of a DMI quirk for some high-end system that just did a
sched_clock_stable = 0 or such.

So if the common case is that the TSC is entirely synchronized across CPUs,
then we can default to that and rely on platform initialization code or DMI
quirks setting the few large-NUMA systems to an unstable TSC.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD
  2012-02-07  8:32 ` Ingo Molnar
@ 2012-02-07  9:06   ` Borislav Petkov
  2012-02-07  9:50     ` Ingo Molnar
  0 siblings, 1 reply; 31+ messages in thread

From: Borislav Petkov @ 2012-02-07 9:06 UTC (permalink / raw)
To: Ingo Molnar
Cc: Venki Pallipadi, Borislav Petkov, Peter Zijlstra, Stephane Eranian,
	linux-kernel, acme, robert.richter, eric.dumazet, Andreas Herrmann

On Tue, Feb 07, 2012 at 09:32:53AM +0100, Ingo Molnar wrote:
> > Yes. If these two flags are set, TSC should be consistent and
> > sched_clock_stable could be set, and it will be reset if there
> > is a call to mark_tsc_unstable().
>
> Most of the details swapped out from my brain meanwhile, but I
> have some vague memories of a DMI quirk for some high-end system
> that just did a sched_clock_stable = 0 or such.
>
> So if the common case is that the TSC is entirely synchronized
> across CPUs, then we can default to that and rely on platform
> initialization code or DMI quirks setting the few large-NUMA
> systems to an unstable TSC.

There's also 14be1f7454ea96ee614467a49cf018a1a383b189 which removed the
setting of sched_clock_stable to 1 due to UV systems not being
TSC-synchronized across blades.

I guess we could tentatively enable it on AMD provided nothing has marked
the TSC as unstable already:

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index f4773f4..ddee619 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -456,6 +456,8 @@ static void __cpuinit early_init_amd(struct cpuinfo_x86 *c)
 	if (c->x86_power & (1 << 8)) {
 		set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
 		set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
+		if (!check_tsc_unstable())
+			sched_clock_stable = 1;
 	}
 
 #ifdef CONFIG_X86_64

Hmm...

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

^ permalink raw reply related	[flat|nested] 31+ messages in thread
* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD
  2012-02-07  9:06 ` Borislav Petkov
@ 2012-02-07  9:50   ` Ingo Molnar
  2012-02-07 12:08     ` [PATCH] x86, AMD: Set sched_clock_stable Borislav Petkov
  0 siblings, 1 reply; 31+ messages in thread

From: Ingo Molnar @ 2012-02-07 9:50 UTC (permalink / raw)
To: Borislav Petkov
Cc: Venki Pallipadi, Peter Zijlstra, Stephane Eranian, linux-kernel,
	acme, robert.richter, eric.dumazet, Andreas Herrmann

* Borislav Petkov <bp@amd64.org> wrote:

> On Tue, Feb 07, 2012 at 09:32:53AM +0100, Ingo Molnar wrote:
> > > Yes. If these two flags are set, TSC should be consistent and
> > > sched_clock_stable could be set, and it will be reset if there
> > > is a call to mark_tsc_unstable().
> >
> > Most of the details swapped out from my brain meanwhile, but I
> > have some vague memories of a DMI quirk for some high-end system
> > that just did a sched_clock_stable = 0 or such.
> >
> > So if the common case is that the TSC is entirely synchronized
> > across CPUs, then we can default to that and rely on platform
> > initialization code or DMI quirks setting the few large-NUMA
> > systems to an unstable TSC.
>
> There's also 14be1f7454ea96ee614467a49cf018a1a383b189 which removed
> the setting of sched_clock_stable to 1 due to UV systems not being
> TSC-synchronized across blades.
>
> I guess we could tentatively enable it on AMD provided nothing has
> marked the TSC as unstable already:
>
> diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> index f4773f4..ddee619 100644
> --- a/arch/x86/kernel/cpu/amd.c
> +++ b/arch/x86/kernel/cpu/amd.c
> @@ -456,6 +456,8 @@ static void __cpuinit early_init_amd(struct cpuinfo_x86 *c)
>  	if (c->x86_power & (1 << 8)) {
>  		set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
>  		set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
> +		if (!check_tsc_unstable())
> +			sched_clock_stable = 1;
>  	}

i'd go for that.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 31+ messages in thread
* [PATCH] x86, AMD: Set sched_clock_stable
  2012-02-07  9:50 ` Ingo Molnar
@ 2012-02-07 12:08   ` Borislav Petkov
  2012-02-15 15:30     ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread

From: Borislav Petkov @ 2012-02-07 12:08 UTC (permalink / raw)
To: Ingo Molnar
Cc: Borislav Petkov, Venki Pallipadi, Peter Zijlstra, Stephane Eranian,
	LKML, Arnaldo Carvalho de Melo, Robert Richter, Eric Dumazet,
	Andreas Herrmann, Borislav Petkov

Stephane Eranian reported that doing a scheduler latency measurements
with perf on AMD doesn't work out as expected due to the fact that the
sched_clock() granularity is too coarse, i.e. done in jiffies due to the
sched_clock_stable not set, which, if set, would mean that we get to use
the TSC as sample source which would give us much higher precision.

However, there's no reason not to set sched_clock_stable on AMD because
all families from F10h and upwards do have an invariant TSC and have the
CPUID flag to prove (CPUID_8000_0007_EDX[8]).

Make it so, #1.

Signed-off-by: Borislav Petkov <bp@alien8.de>
Cc: Stephane Eranian <eranian@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Link: http://lkml.kernel.org/r/20120206132546.GA30854@quad
---
 arch/x86/kernel/cpu/amd.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index f4773f4..0a44b90 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -5,6 +5,7 @@
 #include <linux/mm.h>
 #include <linux/io.h>
+#include <linux/sched.h>
 #include <asm/processor.h>
 #include <asm/apic.h>
 #include <asm/cpu.h>
@@ -456,6 +457,8 @@ static void __cpuinit early_init_amd(struct cpuinfo_x86 *c)
 	if (c->x86_power & (1 << 8)) {
 		set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
 		set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
+		if (!check_tsc_unstable())
+			sched_clock_stable = 1;
 	}
 
 #ifdef CONFIG_X86_64
-- 
1.7.9

^ permalink raw reply related	[flat|nested] 31+ messages in thread
* Re: [PATCH] x86, AMD: Set sched_clock_stable 2012-02-07 12:08 ` [PATCH] x86, AMD: Set sched_clock_stable Borislav Petkov @ 2012-02-15 15:30 ` Peter Zijlstra 0 siblings, 0 replies; 31+ messages in thread From: Peter Zijlstra @ 2012-02-15 15:30 UTC (permalink / raw) To: Borislav Petkov Cc: Ingo Molnar, Borislav Petkov, Venki Pallipadi, Stephane Eranian, LKML, Arnaldo Carvalho de Melo, Robert Richter, Eric Dumazet, Andreas Herrmann On Tue, 2012-02-07 at 13:08 +0100, Borislav Petkov wrote: > Stephane Eranian reported that doing a scheduler latency measurements > with perf on AMD doesn't work out as expected due to the fact that the > sched_clock() granularity is too coarse, i.e. done in jiffies due to the > sched_clock_stable not set, which, if set, would mean that we get to use > the TSC as sample source which would give us much higher precision. FWIW that's not entirely accurate, the problem with !sched_clock_stable is that the time between CPUs can drift a little. So while timestamps on the same CPU are high-res, when comparing them across CPUs you can observe funnies. ^ permalink raw reply [flat|nested] 31+ messages in thread
* [tip:perf/core] x86/sched/perf/AMD: Set sched_clock_stable 2012-02-06 13:25 [BUG] perf: perf sched warning possibly due to clock granularity on AMD Stephane Eranian 2012-02-06 14:26 ` Peter Zijlstra @ 2012-02-07 19:43 ` tip-bot for Borislav Petkov 2012-02-08 15:07 ` [BUG] perf: perf sched warning possibly due to clock granularity on AMD Frederic Weisbecker 2012-02-18 16:50 ` [PATCH] perf tools: Fix ordering with unstable tsc Frederic Weisbecker 3 siblings, 0 replies; 31+ messages in thread From: tip-bot for Borislav Petkov @ 2012-02-07 19:43 UTC (permalink / raw) To: linux-tip-commits Cc: linux-kernel, eranian, hpa, mingo, eric.dumazet, andreas.herrmann3, peterz, bp, bp, robert.richter, acme, tglx, mingo, venki Commit-ID: c98fdeaa92731308ed80386261fa2589addefa47 Gitweb: http://git.kernel.org/tip/c98fdeaa92731308ed80386261fa2589addefa47 Author: Borislav Petkov <bp@alien8.de> AuthorDate: Tue, 7 Feb 2012 13:08:52 +0100 Committer: Ingo Molnar <mingo@elte.hu> CommitDate: Tue, 7 Feb 2012 13:12:08 +0100 x86/sched/perf/AMD: Set sched_clock_stable Stephane Eranian reported that doing scheduler latency measurements with perf on AMD doesn't work out as expected because the sched_clock() granularity is too coarse, i.e. done in jiffies, due to sched_clock_stable not being set; if it were set, we would get to use the TSC as the sample source, which would give us much higher precision. However, there's no reason not to set sched_clock_stable on AMD because all families from F10h and upwards do have an invariant TSC and have the CPUID flag to prove it (CPUID_8000_0007_EDX[8]). Make it so, #1.
Signed-off-by: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@amd64.org> Cc: Venki Pallipadi <venki@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Robert Richter <robert.richter@amd.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Andreas Herrmann <andreas.herrmann3@amd.com> Link: http://lkml.kernel.org/r/20120206132546.GA30854@quad [ Should any non-standard system break the TSC, we should mark them so explicitly, in their platform init handler, or in a DMI quirk. ] Signed-off-by: Ingo Molnar <mingo@elte.hu> --- arch/x86/kernel/cpu/amd.c | 3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c index f4773f4..0a44b90 100644 --- a/arch/x86/kernel/cpu/amd.c +++ b/arch/x86/kernel/cpu/amd.c @@ -5,6 +5,7 @@ #include <linux/mm.h> #include <linux/io.h> +#include <linux/sched.h> #include <asm/processor.h> #include <asm/apic.h> #include <asm/cpu.h> @@ -456,6 +457,8 @@ static void __cpuinit early_init_amd(struct cpuinfo_x86 *c) if (c->x86_power & (1 << 8)) { set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC); set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC); + if (!check_tsc_unstable()) + sched_clock_stable = 1; } #ifdef CONFIG_X86_64 ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD 2012-02-06 13:25 [BUG] perf: perf sched warning possibly due to clock granularity on AMD Stephane Eranian 2012-02-06 14:26 ` Peter Zijlstra 2012-02-07 19:43 ` [tip:perf/core] x86/sched/perf/AMD: " tip-bot for Borislav Petkov @ 2012-02-08 15:07 ` Frederic Weisbecker 2012-02-08 15:10 ` Stephane Eranian 2012-02-18 16:50 ` [PATCH] perf tools: Fix ordering with unstable tsc Frederic Weisbecker 3 siblings, 1 reply; 31+ messages in thread From: Frederic Weisbecker @ 2012-02-08 15:07 UTC (permalink / raw) To: Stephane Eranian Cc: linux-kernel, acme, peterz, mingo, robert.richter, eric.dumazet On Mon, Feb 06, 2012 at 02:25:46PM +0100, Stephane Eranian wrote: > Hi, > > I am running 3.3.30-rc2 on an AMD Bulldozer system in 64-bit mode. > I was testing perf sched and I ran into an issue. That issue seems > to exist only on AMD and not on Intel systems. It is not PMU related > because I am doing tracing. > > I am running a simple ping pong test to stress the context switch > code. Two processes exchanging a byte through a pipe (program provided > below). > > Then, I capture a trace using perf sched, and I run sched lat, but > on certain runs, I get: > > $ perf sched rec pong 2 > $ perf sched lat > Warning: Timestamp below last timeslice flush > Warning: Timestamp below last timeslice flush > Warning: Timestamp below last timeslice flush > Warning: Timestamp below last timeslice flush > Warning: Timestamp below last timeslice flush > Warning: Timestamp below last timeslice flush > Warning: Timestamp below last timeslice flush > Warning: Timestamp below last timeslice flush > Warning: TimesFound 4934 unknown events! > Is this an older tool processing a perf.data file generated by a more recent tool? I've tried to reproduce this on an AMD phenom but haven't got any such warning. Is there any chance you could "perf archive" your data and upload it somewhere I can fetch it from? Thanks. 
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD 2012-02-08 15:07 ` [BUG] perf: perf sched warning possibly due to clock granularity on AMD Frederic Weisbecker @ 2012-02-08 15:10 ` Stephane Eranian 2012-02-08 15:22 ` Frederic Weisbecker 0 siblings, 1 reply; 31+ messages in thread From: Stephane Eranian @ 2012-02-08 15:10 UTC (permalink / raw) To: Frederic Weisbecker Cc: linux-kernel, acme, peterz, mingo, robert.richter, eric.dumazet On Wed, Feb 8, 2012 at 4:07 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote: > On Mon, Feb 06, 2012 at 02:25:46PM +0100, Stephane Eranian wrote: >> Hi, >> >> I am running 3.3.30-rc2 on an AMD Bulldozer system in 64-bit mode. >> I was testing perf sched and I ran into an issue. That issue seems >> to exist only on AMD and not on Intel systems. It is not PMU related >> because I am doing tracing. >> >> I am running a simple ping pong test to stress the context switch >> code. Two processes exchanging a byte through a pipe (program provided >> below). >> >> Then, I capture a trace using perf sched, and I run sched lat, but >> on certain runs, I get: >> >> $ perf sched rec pong 2 >> $ perf sched lat >> Warning: Timestamp below last timeslice flush >> Warning: Timestamp below last timeslice flush >> Warning: Timestamp below last timeslice flush >> Warning: Timestamp below last timeslice flush >> Warning: Timestamp below last timeslice flush >> Warning: Timestamp below last timeslice flush >> Warning: Timestamp below last timeslice flush >> Warning: Timestamp below last timeslice flush >> Warning: TimesFound 4934 unknown events! >> Is this an older tool processing a perf.data file generated by a more recent tool? > > I've tried to reproduce this on an AMD phenom but haven't got any such > warning. > It does not happen at each run. > Is there any chance you could "perf archive" your data and upload it somewhere > I can fetch it from? > I can send you the perf.data in a private Email. 
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD 2012-02-08 15:10 ` Stephane Eranian @ 2012-02-08 15:22 ` Frederic Weisbecker 2012-02-08 15:23 ` Stephane Eranian 0 siblings, 1 reply; 31+ messages in thread From: Frederic Weisbecker @ 2012-02-08 15:22 UTC (permalink / raw) To: Stephane Eranian Cc: linux-kernel, acme, peterz, mingo, robert.richter, eric.dumazet On Wed, Feb 08, 2012 at 04:10:24PM +0100, Stephane Eranian wrote: > On Wed, Feb 8, 2012 at 4:07 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote: > > On Mon, Feb 06, 2012 at 02:25:46PM +0100, Stephane Eranian wrote: > >> Hi, > >> > >> I am running 3.3.30-rc2 on an AMD Bulldozer system in 64-bit mode. > >> I was testing perf sched and I ran into an issue. That issue seems > >> to exist only on AMD and not on Intel systems. It is not PMU related > >> because I am doing tracing. > >> > >> I am running a simple ping pong test to stress the context switch > >> code. Two processes exchanging a byte through a pipe (program provided > >> below). > >> > >> Then, I capture a trace using perf sched, and I run sched lat, but > >> on certain runs, I get: > >> > >> $ perf sched rec pong 2 > >> $ perf sched lat > >> Warning: Timestamp below last timeslice flush > >> Warning: Timestamp below last timeslice flush > >> Warning: Timestamp below last timeslice flush > >> Warning: Timestamp below last timeslice flush > >> Warning: Timestamp below last timeslice flush > >> Warning: Timestamp below last timeslice flush > >> Warning: Timestamp below last timeslice flush > >> Warning: Timestamp below last timeslice flush > >> Warning: TimesFound 4934 unknown events! > >> Is this an older tool processing a perf.data file generated by a more recent tool? > > > > I've tried to reproduce this on an AMD phenom but haven't got any such > > warning. > > > It does not happen at each run. Yeah I tried 5 times but wasn't lucky. 
> > > Is there any chance you could "perf archive" your data and upload it somewhere > > I can fetch it from? > > > I can send you the perf.data in a private Email. Thanks! ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [BUG] perf: perf sched warning possibly due to clock granularity on AMD 2012-02-08 15:22 ` Frederic Weisbecker @ 2012-02-08 15:23 ` Stephane Eranian 0 siblings, 0 replies; 31+ messages in thread From: Stephane Eranian @ 2012-02-08 15:23 UTC (permalink / raw) To: Frederic Weisbecker Cc: linux-kernel, acme, peterz, mingo, robert.richter, eric.dumazet On Wed, Feb 8, 2012 at 4:22 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote: > On Wed, Feb 08, 2012 at 04:10:24PM +0100, Stephane Eranian wrote: >> On Wed, Feb 8, 2012 at 4:07 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote: >> > On Mon, Feb 06, 2012 at 02:25:46PM +0100, Stephane Eranian wrote: >> >> Hi, >> >> >> >> I am running 3.3.30-rc2 on an AMD Bulldozer system in 64-bit mode. >> >> I was testing perf sched and I ran into an issue. That issue seems >> >> to exist only on AMD and not on Intel systems. It is not PMU related >> >> because I am doing tracing. >> >> >> >> I am running a simple ping pong test to stress the context switch >> >> code. Two processes exchanging a byte through a pipe (program provided >> >> below). >> >> >> >> Then, I capture a trace using perf sched, and I run sched lat, but >> >> on certain runs, I get: >> >> >> >> $ perf sched rec pong 2 >> >> $ perf sched lat >> >> Warning: Timestamp below last timeslice flush >> >> Warning: Timestamp below last timeslice flush >> >> Warning: Timestamp below last timeslice flush >> >> Warning: Timestamp below last timeslice flush >> >> Warning: Timestamp below last timeslice flush >> >> Warning: Timestamp below last timeslice flush >> >> Warning: Timestamp below last timeslice flush >> >> Warning: Timestamp below last timeslice flush >> >> Warning: TimesFound 4934 unknown events! >> >> Is this an older tool processing a perf.data file generated by a more recent tool? >> > >> > I've tried to reproduce this on an AMD phenom but haven't got any such >> > warning. >> > >> It does not happen at each run. > > Yeah I tried 5 times but wasn't lucky. 
That's not enough! > >> >> > Is there any chance you could "perf archive" your data and upload it somewhere >> > I can fetch it from? >> > >> I can send you the perf.data in a private Email. > > Thanks! Will dig it up and send it to you. ^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH] perf tools: Fix ordering with unstable tsc 2012-02-06 13:25 [BUG] perf: perf sched warning possibly due to clock granularity on AMD Stephane Eranian ` (2 preceding siblings ...) 2012-02-08 15:07 ` [BUG] perf: perf sched warning possibly due to clock granularity on AMD Frederic Weisbecker @ 2012-02-18 16:50 ` Frederic Weisbecker 2012-02-22 15:35 ` Stephane Eranian ` (2 more replies) 3 siblings, 3 replies; 31+ messages in thread From: Frederic Weisbecker @ 2012-02-18 16:50 UTC (permalink / raw) To: Arnaldo Carvalho de Melo Cc: LKML, Frederic Weisbecker, David Ahern, Peter Zijlstra, Stephane Eranian, Ingo Molnar On a system with a TSC considered as unstable, one can encounter this kind of warning: $ perf sched rec pong 2 $ perf sched lat Warning: Timestamp below last timeslice flush This happens when trace events trigger with a potentially high period, such as sched_stat_sleep, sched_stat_runtime, sched_stat_wait, etc... The perf event core then implements that weight by sending as many events as the given period. For example, as many as the time the task has been sleeping for the sched_stat_sleep event. If this happens while irqs are disabled with an unstable tsc and this takes more time than a jiffy, then the timestamps of the events get stuck at the value of the next jiffy because sched_clock_local() bounds the timestamp to that maximum. The local timer tick is supposed to update that boundary but it can't, given that irqs are disabled. We can then meet this kind of scenario in perf record: ===== CPU 0 =====    ==== CPU 1 ==== PASS n ... ... 1 1 1 2 1 3 <-- max recorded finished round event PASS n + 1 1 4 1 5 1 6 finished round event PASS n + 2 1 7 ... ... CPU 0 is stuck sending events with irqs disabled and with the stale timestamp. When we do the event reordering for perf script, for example, we flush all the events before timestamp 3 when we reach PASS n + 2, considering we can't have timestamps below 3 anymore.
But we still do have timestamps below 3 in PASS n + 2. To solve that issue, instead of considering that timestamps are globally monotonic, we assume they are only locally monotonic. Instead of recording the max timestamp on each pass, we record the max one per CPU on each pass and keep the smallest of these as the new barrier up to which we flush the events in PASS n + 2. This still relies on a bit of global monotonicity because if some CPU doesn't have events in PASS n, we expect it not to have events in PASS n + 2 past the barrier recorded in PASS n. So this is still not a totally robust ordering, but it's still better than what we had before. The only way to have a deterministic and solid ordering would be to use per-CPU perf.data files. Reported-by: Stephane Eranian <eranian@google.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: David Ahern <dsahern@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> --- tools/perf/util/evsel.c | 5 +- tools/perf/util/session.c | 146 +++++++++++++++++++++++++++++++++----------- tools/perf/util/session.h | 3 +- 3 files changed, 115 insertions(+), 39 deletions(-) diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c index 302d49a..1c8eb4b 100644 --- a/tools/perf/util/evsel.c +++ b/tools/perf/util/evsel.c @@ -119,9 +119,12 @@ void perf_evsel__config(struct perf_evsel *evsel, struct perf_record_opts *opts) if (opts->raw_samples) { attr->sample_type |= PERF_SAMPLE_TIME; attr->sample_type |= PERF_SAMPLE_RAW; - attr->sample_type |= PERF_SAMPLE_CPU; } + /* Need to know the CPU for tools that need to order events */ + if (attr->sample_type & PERF_SAMPLE_TIME) + attr->sample_type |= PERF_SAMPLE_CPU; + if (opts->no_delay) { attr->watermark = 0; attr->wakeup_events = 1; diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c index 9f833cf..f297342 100644 --- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c @@ -494,6 +494,8 @@ static void perf_session_free_sample_buffers(struct perf_session *session) list_del(&sq->list); free(sq); } + + free(os->last_cpu_timestamp); } static int perf_session_deliver_event(struct perf_session *session, @@ -549,56 +551,89 @@ static void flush_sample_queue(struct perf_session *s, } /* - * When perf record finishes a pass on every buffers, it records this pseudo - * event. - * We record the max timestamp t found in the pass n. - * Assuming these timestamps are monotonic across cpus, we know that if - * a buffer still has events with timestamps below t, they will be all - * available and then read in the pass n + 1. - * Hence when we start to read the pass n + 2, we can safely flush every - * events with timestamps below t. + * We make the assumption that timestamps are not globally monotonic but locally + * non-strictly monotonic. In practice, this is because if we are dealing with a + * machine with unstable TSC, the kernel bounds the result of the tsc between + * last_tick_time < tsc < next_tick_time. Thus, if a CPU disables interrupts for more + * than one jiffy, all of its timestamps will be equal to next_tick_time after we + * cross that jiffy, without any further progress whereas the other CPU continue + * with normal timestamps. This can happen if a CPU sends crazillions of events + * while interrupts are disabled. But there are potentially other random scenarios + * with unstable TSC that drives us to assume the monotonicity of time only per CPU + * and not globally. + * + * To solve this, when perf record finishes a round of write on every buffers, it + * records a pseudo event named "finished round". The frame of events that happen + * between two finished rounds is called a "pass". + * We record the max timestamp T[cpu] per CPU found over the events in the pass n. + * Then when we finish a round, we iterate over these T[cpu]and keep the smallest + * one: min(T). 
+ * + * Assuming these timestamps are locally monotonic (non-strictly), we can flush all + * queued events having a timestamp below min(T) when we start to process PASS n + 1. + * But we actually wait until we start PASS n + 2 in case a CPU did not have any + * event in PASS n but came in PASS n + 1 with events below min(T). We truly + * hope no CPU will come with events below min(T) after pass n + 1. This + * heuristically relies on some minimal global consistency. This should work in most + * real-world cases; the only way to ensure a truly safe ordering with regular + * flush will be to switch to per-CPU record files. + * * - * ============ PASS n ================= - * CPU 0 | CPU 1 - * | - * cnt1 timestamps | cnt2 timestamps - * 1 | 2 - * 2 | 3 - * - | 4 <--- max recorded + * ========================== PASS n ============================ + * CPU 0 | CPU 1 + * | + * cnt1 timestamps | cnt2 timestamps + * 1 | 2 + * 2 <--- max recorded | 3 + * - | 4 <--- max recorded + * min(T) = 2 * - * ============ PASS n + 1 ============== - * CPU 0 | CPU 1 - * | - * cnt1 timestamps | cnt2 timestamps - * 3 | 5 - * 4 | 6 - * 5 | 7 <---- max recorded + * ========================== PASS n + 1 ======================== + * CPU 0 | CPU 1 + * | + * cnt1 timestamps | cnt2 timestamps + * 3 | 5 + * 4 | 6 + * 5 <--- max recorded | 7 <---- max recorded + * min(T) = 5 * - * Flush every events below timestamp 4 + * Flush every events below timestamp 2 * - * ============ PASS n + 2 ============== - * CPU 0 | CPU 1 - * | - * cnt1 timestamps | cnt2 timestamps - * 6 | 8 - * 7 | 9 - * - | 10 + * ========================== PASS n + 2 ======================== + * CPU 0 | CPU 1 + * | + * cnt1 timestamps | cnt2 timestamps + * 6 | 8 + * 7 | 9 + * - | 10 * - * Flush every events below timestamp 7 - * etc... + * Flush every events below timestamp 5, etc...
*/ static int process_finished_round(struct perf_tool *tool, union perf_event *event __used, struct perf_session *session) { + unsigned int i; + u64 min = ULLONG_MAX; + struct ordered_samples *os = &session->ordered_samples; + flush_sample_queue(session, tool); - session->ordered_samples.next_flush = session->ordered_samples.max_timestamp; + + for (i = 0; i < session->nr_cpus; i++) { + if (os->last_cpu_timestamp[i] < min) + min = os->last_cpu_timestamp[i]; + + os->last_cpu_timestamp[i] = ULLONG_MAX; + } + + if (min != ULLONG_MAX) + os->next_flush = min; return 0; } /* The queue is ordered by time */ -static void __queue_event(struct sample_queue *new, struct perf_session *s) +static void __queue_event(struct sample_queue *new, struct perf_session *s, int cpu) { struct ordered_samples *os = &s->ordered_samples; struct sample_queue *sample = os->last_sample; @@ -607,10 +642,10 @@ static void __queue_event(struct sample_queue *new, struct perf_session *s) ++os->nr_samples; os->last_sample = new; + os->last_cpu_timestamp[cpu] = timestamp; if (!sample) { list_add(&new->list, &os->samples); - os->max_timestamp = timestamp; return; } @@ -624,7 +659,6 @@ static void __queue_event(struct sample_queue *new, struct perf_session *s) p = sample->list.next; if (p == &os->samples) { list_add_tail(&new->list, &os->samples); - os->max_timestamp = timestamp; return; } sample = list_entry(p, struct sample_queue, list); @@ -643,6 +677,34 @@ static void __queue_event(struct sample_queue *new, struct perf_session *s) } } +static int alloc_cpus_timestamp_array(struct perf_session *s, + struct perf_sample *sample, + struct ordered_samples *os) +{ + int i; + int nr_cpus; + + if (sample->cpu < s->nr_cpus) + return 0; + + nr_cpus = sample->cpu + 1; + + if (!os->last_cpu_timestamp) + os->last_cpu_timestamp = malloc(sizeof(u64) * nr_cpus); + else + os->last_cpu_timestamp = realloc(os->last_cpu_timestamp, + sizeof(u64) * nr_cpus); + if (!os->last_cpu_timestamp) + return -ENOMEM; + + for (i = 
s->nr_cpus; i < nr_cpus; i++) + os->last_cpu_timestamp[i] = ULLONG_MAX; + + s->nr_cpus = nr_cpus; + + return 0; +} + #define MAX_SAMPLE_BUFFER (64 * 1024 / sizeof(struct sample_queue)) static int perf_session_queue_event(struct perf_session *s, union perf_event *event, @@ -652,6 +714,12 @@ static int perf_session_queue_event(struct perf_session *s, union perf_event *ev struct list_head *sc = &os->sample_cache; u64 timestamp = sample->time; struct sample_queue *new; + int err; + + if (!(s->sample_type & PERF_SAMPLE_CPU)) { + pr_err("Warning: Need to record CPU on samples for ordering\n"); + return -EINVAL; + } if (!timestamp || timestamp == ~0ULL) return -ETIME; @@ -661,6 +729,10 @@ static int perf_session_queue_event(struct perf_session *s, union perf_event *ev return -EINVAL; } + err = alloc_cpus_timestamp_array(s, sample, os); + if (err) + return err; + if (!list_empty(sc)) { new = list_entry(sc->next, struct sample_queue, list); list_del(&new->list); @@ -681,7 +753,7 @@ static int perf_session_queue_event(struct perf_session *s, union perf_event *ev new->file_offset = file_offset; new->event = event; - __queue_event(new, s); + __queue_event(new, s, sample->cpu); return 0; } diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h index c8d9017..642591b 100644 --- a/tools/perf/util/session.h +++ b/tools/perf/util/session.h @@ -16,7 +16,7 @@ struct thread; struct ordered_samples { u64 last_flush; u64 next_flush; - u64 max_timestamp; + u64 *last_cpu_timestamp; struct list_head samples; struct list_head sample_cache; struct list_head to_free; @@ -50,6 +50,7 @@ struct perf_session { int cwdlen; char *cwd; struct ordered_samples ordered_samples; + unsigned int nr_cpus; char filename[1]; }; -- 1.7.5.4 ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH] perf tools: Fix ordering with unstable tsc 2012-02-18 16:50 ` [PATCH] perf tools: Fix ordering with unstable tsc Frederic Weisbecker @ 2012-02-22 15:35 ` Stephane Eranian 2012-02-22 15:39 ` David Ahern 2012-03-05 18:43 ` Frederic Weisbecker 2012-03-14 19:55 ` Arnaldo Carvalho de Melo 2 siblings, 1 reply; 31+ messages in thread From: Stephane Eranian @ 2012-02-22 15:35 UTC (permalink / raw) To: Frederic Weisbecker Cc: Arnaldo Carvalho de Melo, LKML, David Ahern, Peter Zijlstra, Ingo Molnar On Sat, Feb 18, 2012 at 5:50 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote: > On a system with a TSC considered as unstable, one can encounter this > kind of warning: > >   $ perf sched rec pong 2 >   $ perf sched lat >   Warning: Timestamp below last timeslice flush > > This happens when trace events trigger with a potentially high period, > such as sched_stat_sleep, sched_stat_runtime, sched_stat_wait, etc... > The perf event core then implement that weight by sending as many events > as the given period. For example as many as the time the task has been > sleeping in sched_stat_sleep event. > > If this happens while irqs are disabled with an unstable tsc and this takes > more time than a jiffy, then the timestamps of the events get stuck to > the value of that next jiffy because sched_clock_local() bounds the timestamp > to that maximum. The local timer tick is supposed to update that boundary but > it can't given that irqs are disabled. > > We can then meet this kind of scenario in perf record: > > ===== CPU 0 =====    ==== CPU 1 ==== > >        PASS n >   ...           ...
>    1            1 >    1            2 >    1            3 <-- max recorded > >      finished round event >       PASS n + 1 > >    1            4 >    1            5 >    1            6 > >      finished round event >       PASS n + 2 > >    1            7 >   ...           ... > > CPU 0 is stuck sending events with irqs disabled and with the stale > timestamp. When we do the events reordering for perf script for example, > we flush all the events before timestamp 3 when we reach PASS n + 2, > considering we can't anymore have timestamps below 3 now. > But we still do have timestamps below 3 on PASS n + 2. > > To solve that issue, instead of considering that timestamps are globally > monotonic, we assume they are locally monotonic. Instead of recording > the max timestamp on each pass, we check the max one per CPU on each > pass and keep the smallest over these as the new barrier up to which > we flush the events on the PASS n + 2. This still relies on a bit of > global monotonicity because if some CPU doesn't have events in PASS n, > we expect it not to have event in PASS n + 2 past the barrier recorded > in PASS n. So this is still not a totally robust ordering but it's still > better than what we had before. > > The only way to have a deterministic and solid ordering will be to use > per cpu perf.data files. 
> > Reported-by: Stephane Eranian <eranian@google.com> > Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> > Cc: David Ahern <dsahern@gmail.com> > Cc: Peter Zijlstra <peterz@infradead.org> > Cc: Stephane Eranian <eranian@google.com> > Cc: Ingo Molnar <mingo@elte.hu> > Cc: Arnaldo Carvalho de Melo <acme@redhat.com> > --- >  tools/perf/util/evsel.c  |   5 +- >  tools/perf/util/session.c |  146 +++++++++++++++++++++++++++++++++----------- >  tools/perf/util/session.h |   3 +- >  3 files changed, 115 insertions(+), 39 deletions(-) > > diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c > index 302d49a..1c8eb4b 100644 > --- a/tools/perf/util/evsel.c > +++ b/tools/perf/util/evsel.c > @@ -119,9 +119,12 @@ void perf_evsel__config(struct perf_evsel *evsel, struct perf_record_opts *opts) >     if (opts->raw_samples) { >         attr->sample_type    |= PERF_SAMPLE_TIME; >         attr->sample_type    |= PERF_SAMPLE_RAW; > -        attr->sample_type    |= PERF_SAMPLE_CPU; >     } > I don't get this bit here. You may want CPU information when capturing in raw + per-thread mode. > +    /* Need to know the CPU for tools that need to order events */ > +    if (attr->sample_type & PERF_SAMPLE_TIME) > +        attr->sample_type    |= PERF_SAMPLE_CPU; > + >     if (opts->no_delay) { >         attr->watermark = 0; >         attr->wakeup_events = 1; > diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c > index 9f833cf..f297342 100644 > --- a/tools/perf/util/session.c > +++ b/tools/perf/util/session.c > @@ -494,6 +494,8 @@ static void perf_session_free_sample_buffers(struct perf_session *session) >         list_del(&sq->list); >         free(sq); >     } > + > +    free(os->last_cpu_timestamp); >  } > >  static int perf_session_deliver_event(struct perf_session *session, > @@ -549,56 +551,89 @@ static void flush_sample_queue(struct perf_session *s, >  } > >  /* > - * When perf record finishes a pass on every buffers, it records this pseudo > - * event. 
> - * We record the max timestamp t found in the pass n. > - * Assuming these timestamps are monotonic across cpus, we know that if > - * a buffer still has events with timestamps below t, they will be all > - * available and then read in the pass n + 1. > - * Hence when we start to read the pass n + 2, we can safely flush every > - * events with timestamps below t. > + * We make the assumption that timestamps are not globally monotonic but locally > + * non-strictly monotonic. In practice, this is because if we are dealing with a > + * machine with unstable TSC, the kernel bounds the result of the tsc between > + * last_tick_time < tsc < next_tick_time. Thus, if a CPU disables interrupts for more > + * than one jiffy, all of its timestamps will be equal to next_tick_time after we > + * cross that jiffy, without any further progress whereas the other CPU continue > + * with normal timestamps. This can happen if a CPU sends crazillions of events > + * while interrupts are disabled. But there are potentially other random scenarios > + * with unstable TSC that drives us to assume the monotonicity of time only per CPU > + * and not globally. > + * > + * To solve this, when perf record finishes a round of write on every buffers, it > + * records a pseudo event named "finished round". The frame of events that happen > + * between two finished rounds is called a "pass". > + * We record the max timestamp T[cpu] per CPU found over the events in the pass n. > + * Then when we finish a round, we iterate over these T[cpu]and keep the smallest > + * one: min(T). > + * > + * Assuming these timestamps are locally monotonic (non strictly), we can flush all > + * queued events having a timestamp below min(T) when we start to process PASS n + 1. > + * But we actually wait until we start PASS n + 2 in case a CPU did not have any > + * event in PASS n but came in PASS n + 1 with events below min(T). We truly > + * hope no CPU will come with events below min(T) after pass n + 1. 
This > + * heuristicly rely on some minimal global consistancy. This should work in most > + * real world case, the only way to ensure a truly safe ordering with regular > + * flush will be to switch to per CPU record files. > + * >  * > - *   ============ PASS n ================= > - *    CPU 0     |  CPU 1 > - *           | > - *   cnt1 timestamps  |  cnt2 timestamps > - *      1      |     2 > - *      2      |     3 > - *      -      |     4  <--- max recorded > + *   ========================== PASS n ============================ > + *    CPU 0          |  CPU 1 > + *                | > + *   cnt1 timestamps       |  cnt2 timestamps > + *      1           |     2 > + *      2 <--- max recorded  |     3 > + *      -           |     4 <--- max recorded > + *              min(T) = 2 >  * > - *   ============ PASS n + 1 ============== > - *    CPU 0     |  CPU 1 > - *           | > - *   cnt1 timestamps  |  cnt2 timestamps > - *      3      |     5 > - *      4      |     6 > - *      5      |     7 <---- max recorded > + *   ========================== PASS n + 1 ======================== > + *    CPU 0          |  CPU 1 > + *                | > + *   cnt1 timestamps       |  cnt2 timestamps > + *      3           |     5 > + *      4           |     6 > + *      5 <--- max record   |     7 <---- max recorded > + *              min(T) = 5 >  * > - *    Flush every events below timestamp 4 > + *         Flush every events below timestamp 2 >  * > - *   ============ PASS n + 2 ============== > - *    CPU 0     |  CPU 1 > - *           | > - *   cnt1 timestamps  |  cnt2 timestamps > - *      6      |     8 > - *      7      |     9 > - *      -      |     10 > + *   ========================== PASS n + 2 ======================== > + *    CPU 0          |  CPU 1 > + *                | > + *   cnt1 timestamps       |  cnt2 timestamps > + *      6           |     8 > + *      7           |     9 > + *      -           |     10 >  * > - *    Flush every events below 
timestamp 7 > - *    etc... > + *         Flush every events below timestamp 5, etc... >  */ >  static int process_finished_round(struct perf_tool *tool, >                  union perf_event *event __used, >                  struct perf_session *session) >  { > +    unsigned int i; > +    u64 min = ULLONG_MAX; > +    struct ordered_samples *os = &session->ordered_samples; > + >     flush_sample_queue(session, tool); > -    session->ordered_samples.next_flush = session->ordered_samples.max_timestamp; > + > +    for (i = 0; i < session->nr_cpus; i++) { > +        if (os->last_cpu_timestamp[i] < min) > +            min = os->last_cpu_timestamp[i]; > + > +        os->last_cpu_timestamp[i] = ULLONG_MAX; > +    } > + > +    if (min != ULLONG_MAX) > +        os->next_flush = min; > >     return 0; >  } > >  /* The queue is ordered by time */ > -static void __queue_event(struct sample_queue *new, struct perf_session *s) > +static void __queue_event(struct sample_queue *new, struct perf_session *s, int cpu) >  { >     struct ordered_samples *os = &s->ordered_samples; >     struct sample_queue *sample = os->last_sample; > @@ -607,10 +642,10 @@ static void __queue_event(struct sample_queue *new, struct perf_session *s) > >     ++os->nr_samples; >     os->last_sample = new; > +    os->last_cpu_timestamp[cpu] = timestamp; > >     if (!sample) { >         list_add(&new->list, &os->samples); > -        os->max_timestamp = timestamp; >         return; >     } > > @@ -624,7 +659,6 @@ static void __queue_event(struct sample_queue *new, struct perf_session *s) >             p = sample->list.next; >             if (p == &os->samples) { >                 list_add_tail(&new->list, &os->samples); > -                os->max_timestamp = timestamp; >                 return; >             } >             sample = list_entry(p, struct sample_queue, list); > @@ -643,6 +677,34 @@ static void __queue_event(struct sample_queue *new, struct perf_session *s) >     } >  } > > +static int 
alloc_cpus_timestamp_array(struct perf_session *s, > +                   struct perf_sample *sample, > +                   struct ordered_samples *os) > +{ > +    int i; > +    int nr_cpus; > + > +    if (sample->cpu < s->nr_cpus) > +        return 0; > + > +    nr_cpus = sample->cpu + 1; > + > +    if (!os->last_cpu_timestamp) > +        os->last_cpu_timestamp = malloc(sizeof(u64) * nr_cpus); > +    else > +        os->last_cpu_timestamp = realloc(os->last_cpu_timestamp, > +                         sizeof(u64) * nr_cpus); > +    if (!os->last_cpu_timestamp) > +        return -ENOMEM; > + > +    for (i = s->nr_cpus; i < nr_cpus; i++) > +        os->last_cpu_timestamp[i] = ULLONG_MAX; > + > +    s->nr_cpus = nr_cpus; > + > +    return 0; > +} > + >  #define MAX_SAMPLE_BUFFER    (64 * 1024 / sizeof(struct sample_queue)) > >  static int perf_session_queue_event(struct perf_session *s, union perf_event *event, > @@ -652,6 +714,12 @@ static int perf_session_queue_event(struct perf_session *s, union perf_event *ev >     struct list_head *sc = &os->sample_cache; >     u64 timestamp = sample->time; >     struct sample_queue *new; > +    int err; > + > +    if (!(s->sample_type & PERF_SAMPLE_CPU)) { > +        pr_err("Warning: Need to record CPU on samples for ordering\n"); > +        return -EINVAL; > +    } > >     if (!timestamp || timestamp == ~0ULL) >         return -ETIME; > @@ -661,6 +729,10 @@ static int perf_session_queue_event(struct perf_session *s, union perf_event *ev >         return -EINVAL; >     } > > +    err = alloc_cpus_timestamp_array(s, sample, os); > +    if (err) > +        return err; > + >     if (!list_empty(sc)) { >         new = list_entry(sc->next, struct sample_queue, list); >         list_del(&new->list); > @@ -681,7 +753,7 @@ static int perf_session_queue_event(struct perf_session *s, union perf_event *ev >     new->file_offset = file_offset; >     new->event = event; > > -    __queue_event(new, s); > +    __queue_event(new, s, sample->cpu); 
> >     return 0; >  } > diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h > index c8d9017..642591b 100644 > --- a/tools/perf/util/session.h > +++ b/tools/perf/util/session.h > @@ -16,7 +16,7 @@ struct thread; >  struct ordered_samples { >     u64           last_flush; >     u64           next_flush; > -    u64           max_timestamp; > +    u64           *last_cpu_timestamp; >     struct list_head     samples; >     struct list_head     sample_cache; >     struct list_head     to_free; > @@ -50,6 +50,7 @@ struct perf_session { >     int           cwdlen; >     char           *cwd; >     struct ordered_samples  ordered_samples; > +    unsigned int       nr_cpus; >     char           filename[1]; >  }; > > -- > 1.7.5.4 ^ permalink raw reply [flat|nested] 31+ messages in thread
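To make the change above concrete: here is a small standalone sketch (hypothetical code, not taken from perf) contrasting the old flush barrier — the global max timestamp of a pass — with the new one — the minimum over the per-CPU maxima. With CPU 0 stuck at timestamp 1 while CPU 1 advances to 3, the old barrier (3) would overtake CPU 0's late events, whereas the new barrier (1) keeps them queued:

```c
#include <assert.h>
#include <stdint.h>

/*
 * cpu_max[i] holds the highest timestamp seen from CPU i during one
 * pass. Both helpers are illustrative only.
 */

/* Old policy: flush everything below the global max of the pass. */
static uint64_t barrier_global_max(const uint64_t *cpu_max, int nr_cpus)
{
    uint64_t max = 0;

    for (int i = 0; i < nr_cpus; i++)
        if (cpu_max[i] > max)
            max = cpu_max[i];
    return max;
}

/* New policy: flush only below the smallest per-CPU max, so a CPU
 * whose clock is stuck can never be overtaken by the barrier. */
static uint64_t barrier_min_of_max(const uint64_t *cpu_max, int nr_cpus)
{
    uint64_t min = UINT64_MAX;

    for (int i = 0; i < nr_cpus; i++)
        if (cpu_max[i] < min)
            min = cpu_max[i];
    return min;
}
```

With cpu_max = {1, 3} (CPU 0 stuck, CPU 1 advancing), the old policy yields 3 and the new one yields 1, matching the stuck-CPU scenario described in the changelog.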
* Re: [PATCH] perf tools: Fix ordering with unstable tsc 2012-02-22 15:35 ` Stephane Eranian @ 2012-02-22 15:39 ` David Ahern 0 siblings, 0 replies; 31+ messages in thread From: David Ahern @ 2012-02-22 15:39 UTC (permalink / raw) To: Stephane Eranian Cc: Frederic Weisbecker, Arnaldo Carvalho de Melo, LKML, Peter Zijlstra, Ingo Molnar On 2/22/12 8:35 AM, Stephane Eranian wrote: >> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c >> index 302d49a..1c8eb4b 100644 >> --- a/tools/perf/util/evsel.c >> +++ b/tools/perf/util/evsel.c >> @@ -119,9 +119,12 @@ void perf_evsel__config(struct perf_evsel *evsel, struct perf_record_opts *opts) >> if (opts->raw_samples) { >> attr->sample_type |= PERF_SAMPLE_TIME; >> attr->sample_type |= PERF_SAMPLE_RAW; >> - attr->sample_type |= PERF_SAMPLE_CPU; >> } >> > I don't get this bit here. You may want CPU information when capturing > in raw + per-thread mode. TIME is added to sample_type for raw samples and then CPU gets added below. > > >> + /* Need to know the CPU for tools that need to order events */ >> + if (attr->sample_type& PERF_SAMPLE_TIME) >> + attr->sample_type |= PERF_SAMPLE_CPU; >> + David ^ permalink raw reply [flat|nested] 31+ messages in thread
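David's point reduces to a tiny rule over the sample_type bitmask: any configuration that requests timestamps must now also record the CPU, regardless of which option set the TIME bit. A minimal sketch of that rule (the bit values are copied from the perf_event ABI; the helper name is made up):

```c
#include <assert.h>
#include <stdint.h>

/* Bits as defined by the perf_event ABI in <linux/perf_event.h>. */
#define PERF_SAMPLE_TIME (1U << 2)
#define PERF_SAMPLE_CPU  (1U << 7)
#define PERF_SAMPLE_RAW  (1U << 10)

/* Mirrors the patch hunk: recording timestamps implies recording the
 * originating CPU, so the reordering code can track per-CPU maxima. */
static uint64_t apply_cpu_rule(uint64_t sample_type)
{
    if (sample_type & PERF_SAMPLE_TIME)
        sample_type |= PERF_SAMPLE_CPU;
    return sample_type;
}
```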
* Re: [PATCH] perf tools: Fix ordering with unstable tsc 2012-02-18 16:50 ` [PATCH] perf tools: Fix ordering with unstable tsc Frederic Weisbecker 2012-02-22 15:35 ` Stephane Eranian @ 2012-03-05 18:43 ` Frederic Weisbecker 2012-03-14 19:55 ` Arnaldo Carvalho de Melo 2 siblings, 0 replies; 31+ messages in thread From: Frederic Weisbecker @ 2012-03-05 18:43 UTC (permalink / raw) To: Arnaldo Carvalho de Melo Cc: LKML, David Ahern, Peter Zijlstra, Stephane Eranian, Ingo Molnar ping? On Sat, Feb 18, 2012 at 05:50:37PM +0100, Frederic Weisbecker wrote: > On a system with a TSC considered as unstable, one can encounter this > kind of warning: > > $ perf sched rec pong 2 > $ perf sched lat > Warning: Timestamp below last timeslice flush > > This happens when trace events trigger with a potentially high period, > such as sched_stat_sleep, sched_stat_runtime, sched_stat_wait, etc... > The perf event core then implements that weight by sending as many events > as the given period. For example as many as the time the task has been > sleeping in the sched_stat_sleep event. > > If this happens while irqs are disabled with an unstable tsc and this takes > more time than a jiffy, then the timestamps of the events get stuck to > the value of that next jiffy because sched_clock_local() bounds the timestamp > to that maximum. The local timer tick is supposed to update that boundary but > it can't given that irqs are disabled. > > We can then meet this kind of scenario in perf record: > > ===== CPU 0 ===== ==== CPU 1 ==== > > PASS n > ... ... > 1 1 > 1 2 > 1 3 <-- max recorded > > finished round event > PASS n + 1 > > 1 4 > 1 5 > 1 6 > > finished round event > PASS n + 2 > > 1 7 > ... ... > > CPU 0 is stuck sending events with irqs disabled and with the stale > timestamp. When we do the events reordering for perf script for example, > we flush all the events before timestamp 3 when we reach PASS n + 2, > considering we can no longer have timestamps below 3. 
> But we still do have timestamps below 3 on PASS n + 2. > > To solve that issue, instead of considering that timestamps are globally > monotonic, we assume they are locally monotonic. Instead of recording > the max timestamp on each pass, we check the max one per CPU on each > pass and keep the smallest over these as the new barrier up to which > we flush the events on the PASS n + 2. This still relies on a bit of > global monotonicity because if some CPU doesn't have events in PASS n, > we expect it not to have events in PASS n + 2 past the barrier recorded > in PASS n. So this is still not a totally robust ordering but it's still > better than what we had before. > > The only way to have a deterministic and solid ordering will be to use > per-CPU perf.data files. > > Reported-by: Stephane Eranian <eranian@google.com> > Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> > Cc: David Ahern <dsahern@gmail.com> > Cc: Peter Zijlstra <peterz@infradead.org> > Cc: Stephane Eranian <eranian@google.com> > Cc: Ingo Molnar <mingo@elte.hu> > Cc: Arnaldo Carvalho de Melo <acme@redhat.com> > --- > tools/perf/util/evsel.c | 5 +- > tools/perf/util/session.c | 146 +++++++++++++++++++++++++++++++++----------- > tools/perf/util/session.h | 3 +- > 3 files changed, 115 insertions(+), 39 deletions(-) > > diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c > index 302d49a..1c8eb4b 100644 > --- a/tools/perf/util/evsel.c > +++ b/tools/perf/util/evsel.c > @@ -119,9 +119,12 @@ void perf_evsel__config(struct perf_evsel *evsel, struct perf_record_opts *opts) > if (opts->raw_samples) { > attr->sample_type |= PERF_SAMPLE_TIME; > attr->sample_type |= PERF_SAMPLE_RAW; > - attr->sample_type |= PERF_SAMPLE_CPU; > } > > + /* Need to know the CPU for tools that need to order events */ > + if (attr->sample_type & PERF_SAMPLE_TIME) > + attr->sample_type |= PERF_SAMPLE_CPU; > + > if (opts->no_delay) { > attr->watermark = 0; > attr->wakeup_events = 1; > diff --git 
a/tools/perf/util/session.c b/tools/perf/util/session.c > index 9f833cf..f297342 100644 > --- a/tools/perf/util/session.c > +++ b/tools/perf/util/session.c > @@ -494,6 +494,8 @@ static void perf_session_free_sample_buffers(struct perf_session *session) > list_del(&sq->list); > free(sq); > } > + > + free(os->last_cpu_timestamp); > } > > static int perf_session_deliver_event(struct perf_session *session, > @@ -549,56 +551,89 @@ static void flush_sample_queue(struct perf_session *s, > } > > /* > - * When perf record finishes a pass on every buffers, it records this pseudo > - * event. > - * We record the max timestamp t found in the pass n. > - * Assuming these timestamps are monotonic across cpus, we know that if > - * a buffer still has events with timestamps below t, they will be all > - * available and then read in the pass n + 1. > - * Hence when we start to read the pass n + 2, we can safely flush every > - * events with timestamps below t. > + * We make the assumption that timestamps are not globally monotonic but locally > + * non-strictly monotonic. In practice, this is because if we are dealing with a > + * machine with an unstable TSC, the kernel bounds the result of the tsc between > + * last_tick_time < tsc < next_tick_time. Thus, if a CPU disables interrupts for more > + * than one jiffy, all of its timestamps will be equal to next_tick_time after we > + * cross that jiffy, without any further progress, whereas the other CPUs continue > + * with normal timestamps. This can happen if a CPU sends crazillions of events > + * while interrupts are disabled. But there are potentially other random scenarios > + * with an unstable TSC that drive us to assume the monotonicity of time only per CPU > + * and not globally. > + * > + * To solve this, when perf record finishes a round of writes on every buffer, it > + * records a pseudo event named "finished round". The frame of events that happen > + * between two finished rounds is called a "pass". 
> + * We record the max timestamp T[cpu] per CPU found over the events in the pass n. > + * Then when we finish a round, we iterate over these T[cpu] and keep the smallest > + * one: min(T). > + * > + * Assuming these timestamps are locally monotonic (non-strictly), we can flush all > + * queued events having a timestamp below min(T) when we start to process PASS n + 1. > + * But we actually wait until we start PASS n + 2 in case a CPU did not have any > + * event in PASS n but came in PASS n + 1 with events below min(T). We truly > + * hope no CPU will come with events below min(T) after pass n + 1. This > + * heuristic relies on some minimal global consistency. This should work in most > + * real-world cases; the only way to ensure a truly safe ordering with regular > + * flushes will be to switch to per-CPU record files. > + * > * > - * ============ PASS n ================= > - * CPU 0 | CPU 1 > - * | > - * cnt1 timestamps | cnt2 timestamps > - * 1 | 2 > - * 2 | 3 > - * - | 4 <--- max recorded > + * ========================== PASS n ============================ > + * CPU 0 | CPU 1 > + * | > + * cnt1 timestamps | cnt2 timestamps > + * 1 | 2 > + * 2 <--- max recorded | 3 > + * - | 4 <--- max recorded > + * min(T) = 2 > * > - * ============ PASS n + 1 ============== > - * CPU 0 | CPU 1 > - * | > - * cnt1 timestamps | cnt2 timestamps > - * 3 | 5 > - * 4 | 6 > - * 5 | 7 <---- max recorded > + * ========================== PASS n + 1 ======================== > + * CPU 0 | CPU 1 > + * | > + * cnt1 timestamps | cnt2 timestamps > + * 3 | 5 > + * 4 | 6 > + * 5 <--- max recorded | 7 <---- max recorded > + * min(T) = 5 > * > - * Flush every events below timestamp 4 > + * Flush every events below timestamp 2 > * > - * ============ PASS n + 2 ============== > - * CPU 0 | CPU 1 > - * | > - * cnt1 timestamps | cnt2 timestamps > - * 6 | 8 > - * 7 | 9 > - * - | 10 > + * ========================== PASS n + 2 ======================== > + * CPU 0 | CPU 1 > + * | > + * cnt1 timestamps | 
cnt2 timestamps > + * 6 | 8 > + * 7 | 9 > + * - | 10 > * > - * Flush every events below timestamp 7 > - * etc... > + * Flush every events below timestamp 5, etc... > */ > static int process_finished_round(struct perf_tool *tool, > union perf_event *event __used, > struct perf_session *session) > { > + unsigned int i; > + u64 min = ULLONG_MAX; > + struct ordered_samples *os = &session->ordered_samples; > + > flush_sample_queue(session, tool); > - session->ordered_samples.next_flush = session->ordered_samples.max_timestamp; > + > + for (i = 0; i < session->nr_cpus; i++) { > + if (os->last_cpu_timestamp[i] < min) > + min = os->last_cpu_timestamp[i]; > + > + os->last_cpu_timestamp[i] = ULLONG_MAX; > + } > + > + if (min != ULLONG_MAX) > + os->next_flush = min; > > return 0; > } > > /* The queue is ordered by time */ > -static void __queue_event(struct sample_queue *new, struct perf_session *s) > +static void __queue_event(struct sample_queue *new, struct perf_session *s, int cpu) > { > struct ordered_samples *os = &s->ordered_samples; > struct sample_queue *sample = os->last_sample; > @@ -607,10 +642,10 @@ static void __queue_event(struct sample_queue *new, struct perf_session *s) > > ++os->nr_samples; > os->last_sample = new; > + os->last_cpu_timestamp[cpu] = timestamp; > > if (!sample) { > list_add(&new->list, &os->samples); > - os->max_timestamp = timestamp; > return; > } > > @@ -624,7 +659,6 @@ static void __queue_event(struct sample_queue *new, struct perf_session *s) > p = sample->list.next; > if (p == &os->samples) { > list_add_tail(&new->list, &os->samples); > - os->max_timestamp = timestamp; > return; > } > sample = list_entry(p, struct sample_queue, list); > @@ -643,6 +677,34 @@ static void __queue_event(struct sample_queue *new, struct perf_session *s) > } > } > > +static int alloc_cpus_timestamp_array(struct perf_session *s, > + struct perf_sample *sample, > + struct ordered_samples *os) > +{ > + int i; > + int nr_cpus; > + > + if (sample->cpu < s->nr_cpus) 
> + return 0; > + > + nr_cpus = sample->cpu + 1; > + > + if (!os->last_cpu_timestamp) > + os->last_cpu_timestamp = malloc(sizeof(u64) * nr_cpus); > + else > + os->last_cpu_timestamp = realloc(os->last_cpu_timestamp, > + sizeof(u64) * nr_cpus); > + if (!os->last_cpu_timestamp) > + return -ENOMEM; > + > + for (i = s->nr_cpus; i < nr_cpus; i++) > + os->last_cpu_timestamp[i] = ULLONG_MAX; > + > + s->nr_cpus = nr_cpus; > + > + return 0; > +} > + > #define MAX_SAMPLE_BUFFER (64 * 1024 / sizeof(struct sample_queue)) > > static int perf_session_queue_event(struct perf_session *s, union perf_event *event, > @@ -652,6 +714,12 @@ static int perf_session_queue_event(struct perf_session *s, union perf_event *ev > struct list_head *sc = &os->sample_cache; > u64 timestamp = sample->time; > struct sample_queue *new; > + int err; > + > + if (!(s->sample_type & PERF_SAMPLE_CPU)) { > + pr_err("Warning: Need to record CPU on samples for ordering\n"); > + return -EINVAL; > + } > > if (!timestamp || timestamp == ~0ULL) > return -ETIME; > @@ -661,6 +729,10 @@ static int perf_session_queue_event(struct perf_session *s, union perf_event *ev > return -EINVAL; > } > > + err = alloc_cpus_timestamp_array(s, sample, os); > + if (err) > + return err; > + > if (!list_empty(sc)) { > new = list_entry(sc->next, struct sample_queue, list); > list_del(&new->list); > @@ -681,7 +753,7 @@ static int perf_session_queue_event(struct perf_session *s, union perf_event *ev > new->file_offset = file_offset; > new->event = event; > > - __queue_event(new, s); > + __queue_event(new, s, sample->cpu); > > return 0; > } > diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h > index c8d9017..642591b 100644 > --- a/tools/perf/util/session.h > +++ b/tools/perf/util/session.h > @@ -16,7 +16,7 @@ struct thread; > struct ordered_samples { > u64 last_flush; > u64 next_flush; > - u64 max_timestamp; > + u64 *last_cpu_timestamp; > struct list_head samples; > struct list_head sample_cache; > struct list_head 
to_free; > @@ -50,6 +50,7 @@ struct perf_session { > int cwdlen; > char *cwd; > struct ordered_samples ordered_samples; > + unsigned int nr_cpus; > char filename[1]; > }; > > -- > 1.7.5.4 > ^ permalink raw reply [flat|nested] 31+ messages in thread
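A detail worth spelling out in the process_finished_round() hunk above is the ULLONG_MAX handling: each per-CPU slot is reset after the round so a silent CPU does not constrain the next minimum, and if no CPU produced events at all the previous flush barrier is kept rather than jumping to ~0. A standalone model of just that logic (illustrative, not the perf code itself):

```c
#include <assert.h>
#include <stdint.h>

#define TS_UNSET UINT64_MAX   /* stands in for ULLONG_MAX in the patch */

/*
 * last_ts[i] is the max timestamp CPU i produced this round, or
 * TS_UNSET if it produced none. Returns the new flush barrier.
 */
static uint64_t finish_round(uint64_t *last_ts, unsigned int nr_cpus,
                             uint64_t prev_barrier)
{
    uint64_t min = TS_UNSET;

    for (unsigned int i = 0; i < nr_cpus; i++) {
        if (last_ts[i] < min)
            min = last_ts[i];
        last_ts[i] = TS_UNSET;  /* reset the slot for the next round */
    }

    /* Empty round: keep the previous barrier instead of raising it. */
    return min != TS_UNSET ? min : prev_barrier;
}
```

Run against the PASS n example from the comment (per-CPU maxima 2 and 4), this yields a barrier of 2, and a subsequent empty round leaves that barrier unchanged.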
* Re: [PATCH] perf tools: Fix ordering with unstable tsc 2012-02-18 16:50 ` [PATCH] perf tools: Fix ordering with unstable tsc Frederic Weisbecker 2012-02-22 15:35 ` Stephane Eranian 2012-03-05 18:43 ` Frederic Weisbecker @ 2012-03-14 19:55 ` Arnaldo Carvalho de Melo 2012-03-14 20:07 ` David Ahern 2012-03-22 0:10 ` Frederic Weisbecker 2 siblings, 2 replies; 31+ messages in thread From: Arnaldo Carvalho de Melo @ 2012-03-14 19:55 UTC (permalink / raw) To: Frederic Weisbecker, Thomas Gleixner Cc: LKML, David Ahern, Peter Zijlstra, Stephane Eranian, Ingo Molnar Em Sat, Feb 18, 2012 at 05:50:37PM +0100, Frederic Weisbecker escreveu: > On a system with a TSC considered as unstable, one can encounter this > kind of warning: > > $ perf sched rec pong 2 > $ perf sched lat > Warning: Timestamp below last timeslice flush Some nits below. Thomas, You rewrote the ordering code in a1225dec, could you take a look at this new patch by Frédéric and provide an Acked-by? - Arnaldo > This happens when trace events trigger with a potentially high period, > such as sched_stat_sleep, sched_stat_runtime, sched_stat_wait, etc... > The perf event core then implement that weight by sending as many events > as the given period. For example as many as the time the task has been > sleeping in sched_stat_sleep event. > > If this happens while irqs are disabled with an unstable tsc and this takes > more time than a jiffy, then the timestamps of the events get stuck to > the value of that next jiffy because sched_clock_local() bounds the timestamp > to that maximum. The local timer tick is supposed to update that boundary but > it can't given that irqs are disabled. > > We can then meet this kind of scenario in perf record: > > ===== CPU 0 ===== ==== CPU 1 ==== > > PASS n > ... ... > 1 1 > 1 2 > 1 3 <-- max recorded > > finished round event > PASS n + 1 > > 1 4 > 1 5 > 1 6 > > finished round event > PASS n + 2 > > 1 7 > ... ... 
> > CPU 0 is stuck sending events with irqs disabled and with the stale > timestamp. When we do the events reordering for perf script for example, > we flush all the events before timestamp 3 when we reach PASS n + 2, > considering we can't anymore have timestamps below 3 now. > But we still do have timestamps below 3 on PASS n + 2. > > To solve that issue, instead of considering that timestamps are globally > monotonic, we assume they are locally monotonic. Instead of recording > the max timestamp on each pass, we check the max one per CPU on each > pass and keep the smallest over these as the new barrier up to which > we flush the events on the PASS n + 2. This still relies on a bit of > global monotonicity because if some CPU doesn't have events in PASS n, > we expect it not to have event in PASS n + 2 past the barrier recorded > in PASS n. So this is still not a totally robust ordering but it's still > better than what we had before. > > The only way to have a deterministic and solid ordering will be to use > per cpu perf.data files. 
> > Reported-by: Stephane Eranian <eranian@google.com> > Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> > Cc: David Ahern <dsahern@gmail.com> > Cc: Peter Zijlstra <peterz@infradead.org> > Cc: Stephane Eranian <eranian@google.com> > Cc: Ingo Molnar <mingo@elte.hu> > Cc: Arnaldo Carvalho de Melo <acme@redhat.com> > --- > tools/perf/util/evsel.c | 5 +- > tools/perf/util/session.c | 146 +++++++++++++++++++++++++++++++++----------- > tools/perf/util/session.h | 3 +- > 3 files changed, 115 insertions(+), 39 deletions(-) > > diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c > index 302d49a..1c8eb4b 100644 > --- a/tools/perf/util/evsel.c > +++ b/tools/perf/util/evsel.c > @@ -119,9 +119,12 @@ void perf_evsel__config(struct perf_evsel *evsel, struct perf_record_opts *opts) > if (opts->raw_samples) { > attr->sample_type |= PERF_SAMPLE_TIME; > attr->sample_type |= PERF_SAMPLE_RAW; > - attr->sample_type |= PERF_SAMPLE_CPU; > } > > + /* Need to know the CPU for tools that need to order events */ > + if (attr->sample_type & PERF_SAMPLE_TIME) > + attr->sample_type |= PERF_SAMPLE_CPU; > + > if (opts->no_delay) { > attr->watermark = 0; > attr->wakeup_events = 1; > diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c > index 9f833cf..f297342 100644 > --- a/tools/perf/util/session.c > +++ b/tools/perf/util/session.c > @@ -494,6 +494,8 @@ static void perf_session_free_sample_buffers(struct perf_session *session) > list_del(&sq->list); > free(sq); > } > + > + free(os->last_cpu_timestamp); > } > > static int perf_session_deliver_event(struct perf_session *session, > @@ -549,56 +551,89 @@ static void flush_sample_queue(struct perf_session *s, > } > > /* > - * When perf record finishes a pass on every buffers, it records this pseudo > - * event. > - * We record the max timestamp t found in the pass n. 
> - * Assuming these timestamps are monotonic across cpus, we know that if > - * a buffer still has events with timestamps below t, they will be all > - * available and then read in the pass n + 1. > - * Hence when we start to read the pass n + 2, we can safely flush every > - * events with timestamps below t. > + * We make the assumption that timestamps are not globally monotonic but locally > + * non-strictly monotonic. In practice, this is because if we are dealing with a > + * machine with unstable TSC, the kernel bounds the result of the tsc between > + * last_tick_time < tsc < next_tick_time. Thus, if a CPU disables interrupts for more > + * than one jiffy, all of its timestamps will be equal to next_tick_time after we > + * cross that jiffy, without any further progress whereas the other CPU continue > + * with normal timestamps. This can happen if a CPU sends crazillions of events > + * while interrupts are disabled. But there are potentially other random scenarios > + * with unstable TSC that drives us to assume the monotonicity of time only per CPU > + * and not globally. > + * > + * To solve this, when perf record finishes a round of write on every buffers, it > + * records a pseudo event named "finished round". The frame of events that happen > + * between two finished rounds is called a "pass". > + * We record the max timestamp T[cpu] per CPU found over the events in the pass n. > + * Then when we finish a round, we iterate over these T[cpu]and keep the smallest > + * one: min(T). > + * > + * Assuming these timestamps are locally monotonic (non strictly), we can flush all > + * queued events having a timestamp below min(T) when we start to process PASS n + 1. > + * But we actually wait until we start PASS n + 2 in case a CPU did not have any > + * event in PASS n but came in PASS n + 1 with events below min(T). We truly > + * hope no CPU will come with events below min(T) after pass n + 1. This > + * heuristicly rely on some minimal global consistancy. 
This should work in most > + * real world case, the only way to ensure a truly safe ordering with regular > + * flush will be to switch to per CPU record files. > + * > * > - * ============ PASS n ================= > - * CPU 0 | CPU 1 > - * | > - * cnt1 timestamps | cnt2 timestamps > - * 1 | 2 > - * 2 | 3 > - * - | 4 <--- max recorded > + * ========================== PASS n ============================ > + * CPU 0 | CPU 1 > + * | > + * cnt1 timestamps | cnt2 timestamps > + * 1 | 2 > + * 2 <--- max recorded | 3 > + * - | 4 <--- max recorded > + * min(T) = 2 > * > - * ============ PASS n + 1 ============== > - * CPU 0 | CPU 1 > - * | > - * cnt1 timestamps | cnt2 timestamps > - * 3 | 5 > - * 4 | 6 > - * 5 | 7 <---- max recorded > + * ========================== PASS n + 1 ======================== > + * CPU 0 | CPU 1 > + * | > + * cnt1 timestamps | cnt2 timestamps > + * 3 | 5 > + * 4 | 6 > + * 5 <--- max record | 7 <---- max recorded > + * min(T) = 5 > * > - * Flush every events below timestamp 4 > + * Flush every events below timestamp 2 > * > - * ============ PASS n + 2 ============== > - * CPU 0 | CPU 1 > - * | > - * cnt1 timestamps | cnt2 timestamps > - * 6 | 8 > - * 7 | 9 > - * - | 10 > + * ========================== PASS n + 2 ======================== > + * CPU 0 | CPU 1 > + * | > + * cnt1 timestamps | cnt2 timestamps > + * 6 | 8 > + * 7 | 9 > + * - | 10 > * > - * Flush every events below timestamp 7 > - * etc... > + * Flush every events below timestamp 5, etc... 
> */ > static int process_finished_round(struct perf_tool *tool, > union perf_event *event __used, > struct perf_session *session) > { > + unsigned int i; > + u64 min = ULLONG_MAX; > + struct ordered_samples *os = &session->ordered_samples; > + > flush_sample_queue(session, tool); > - session->ordered_samples.next_flush = session->ordered_samples.max_timestamp; > + > + for (i = 0; i < session->nr_cpus; i++) { > + if (os->last_cpu_timestamp[i] < min) > + min = os->last_cpu_timestamp[i]; > + > + os->last_cpu_timestamp[i] = ULLONG_MAX; > + } > + > + if (min != ULLONG_MAX) > + os->next_flush = min; > > return 0; > } > > /* The queue is ordered by time */ > -static void __queue_event(struct sample_queue *new, struct perf_session *s) > +static void __queue_event(struct sample_queue *new, struct perf_session *s, int cpu) > { > struct ordered_samples *os = &s->ordered_samples; > struct sample_queue *sample = os->last_sample; > @@ -607,10 +642,10 @@ static void __queue_event(struct sample_queue *new, struct perf_session *s) > > ++os->nr_samples; > os->last_sample = new; > + os->last_cpu_timestamp[cpu] = timestamp; > > if (!sample) { > list_add(&new->list, &os->samples); > - os->max_timestamp = timestamp; > return; > } > > @@ -624,7 +659,6 @@ static void __queue_event(struct sample_queue *new, struct perf_session *s) > p = sample->list.next; > if (p == &os->samples) { > list_add_tail(&new->list, &os->samples); > - os->max_timestamp = timestamp; > return; > } > sample = list_entry(p, struct sample_queue, list); > @@ -643,6 +677,34 @@ static void __queue_event(struct sample_queue *new, struct perf_session *s) > } > } > > +static int alloc_cpus_timestamp_array(struct perf_session *s, > + struct perf_sample *sample, > + struct ordered_samples *os) > +{ > + int i; > + int nr_cpus; > + > + if (sample->cpu < s->nr_cpus) > + return 0; Shouldn't we try to robustify this checking against HEADER_NRCPUS (if present)? 
I just tried to see how to find that info, but unfortunately it just prints it when calling perf_header__fprintf_info, it can be made to work. I need to work on storing those values in a struct accessible thru perf_session or perf_evlist, so that I can show that info on the TUI, perhaps at that time I can robustify this way. > + > + nr_cpus = sample->cpu + 1; > + > + if (!os->last_cpu_timestamp) > + os->last_cpu_timestamp = malloc(sizeof(u64) * nr_cpus); > + else > + os->last_cpu_timestamp = realloc(os->last_cpu_timestamp, > + sizeof(u64) * nr_cpus); If realloc fails we return -ENOMEM, but leak the old buffer. At some point we can have in the TUI/GUI a way for the user to ask for an specific perf.data file to be processed, if it fails to process due to the above realloc, we'll continue, allowing other perf.data files to be processed, but will have this leak. > + if (!os->last_cpu_timestamp) > + return -ENOMEM; > + > + for (i = s->nr_cpus; i < nr_cpus; i++) > + os->last_cpu_timestamp[i] = ULLONG_MAX; > + > + s->nr_cpus = nr_cpus; > + > + return 0; > +} > + > #define MAX_SAMPLE_BUFFER (64 * 1024 / sizeof(struct sample_queue)) > > static int perf_session_queue_event(struct perf_session *s, union perf_event *event, > @@ -652,6 +714,12 @@ static int perf_session_queue_event(struct perf_session *s, union perf_event *ev > struct list_head *sc = &os->sample_cache; > u64 timestamp = sample->time; > struct sample_queue *new; > + int err; > + > + if (!(s->sample_type & PERF_SAMPLE_CPU)) { > + pr_err("Warning: Need to record CPU on samples for ordering\n"); > + return -EINVAL; > + } > > if (!timestamp || timestamp == ~0ULL) > return -ETIME; > @@ -661,6 +729,10 @@ static int perf_session_queue_event(struct perf_session *s, union perf_event *ev > return -EINVAL; > } > > + err = alloc_cpus_timestamp_array(s, sample, os); > + if (err) > + return err; > + > if (!list_empty(sc)) { > new = list_entry(sc->next, struct sample_queue, list); > list_del(&new->list); > @@ -681,7 +753,7 @@ 
static int perf_session_queue_event(struct perf_session *s, union perf_event *ev > new->file_offset = file_offset; > new->event = event; > > - __queue_event(new, s); > + __queue_event(new, s, sample->cpu); > > return 0; > } > diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h > index c8d9017..642591b 100644 > --- a/tools/perf/util/session.h > +++ b/tools/perf/util/session.h > @@ -16,7 +16,7 @@ struct thread; > struct ordered_samples { > u64 last_flush; > u64 next_flush; > - u64 max_timestamp; > + u64 *last_cpu_timestamp; > struct list_head samples; > struct list_head sample_cache; > struct list_head to_free; > @@ -50,6 +50,7 @@ struct perf_session { > int cwdlen; > char *cwd; > struct ordered_samples ordered_samples; > + unsigned int nr_cpus; > char filename[1]; > }; > > -- > 1.7.5.4 ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] perf tools: Fix ordering with unstable tsc 2012-03-14 19:55 ` Arnaldo Carvalho de Melo @ 2012-03-14 20:07 ` David Ahern 2012-03-22 0:10 ` Frederic Weisbecker 1 sibling, 0 replies; 31+ messages in thread From: David Ahern @ 2012-03-14 20:07 UTC (permalink / raw) To: Arnaldo Carvalho de Melo, Frederic Weisbecker Cc: Thomas Gleixner, LKML, Peter Zijlstra, Stephane Eranian, Ingo Molnar On 3/14/12 1:55 PM, Arnaldo Carvalho de Melo wrote: > Shouldn't we try to robustify this checking against HEADER_NRCPUS (if > present)? > > I just tried to see how to find that info, but unfortunately it just > prints it when calling perf_header__fprintf_info, it can be made to > work. > > I need to work on storing those values in a struct accessible thru > perf_session or perf_evlist, so that I can show that info on the TUI, > perhaps at that time I can robustify this way. Xiao added a design for retrieving data for the features in the proposed kvm-events tool. See the header.c changes in http://thread.gmane.org/gmane.linux.kernel/1262482 David ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] perf tools: Fix ordering with unstable tsc
  2012-03-14 19:55 ` Arnaldo Carvalho de Melo
  2012-03-14 20:07 ` David Ahern
@ 2012-03-22  0:10 ` Frederic Weisbecker
  2012-03-22 15:28 ` Arnaldo Carvalho de Melo
  1 sibling, 1 reply; 31+ messages in thread
From: Frederic Weisbecker @ 2012-03-22 0:10 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo
Cc: Thomas Gleixner, LKML, David Ahern, Peter Zijlstra, Stephane Eranian, Ingo Molnar

On Wed, Mar 14, 2012 at 04:55:35PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Sat, Feb 18, 2012 at 05:50:37PM +0100, Frederic Weisbecker escreveu:
> > +static int alloc_cpus_timestamp_array(struct perf_session *s,
> > +				      struct perf_sample *sample,
> > +				      struct ordered_samples *os)
> > +{
> > +	int i;
> > +	int nr_cpus;
> > +
> > +	if (sample->cpu < s->nr_cpus)
> > +		return 0;
>
> Shouldn't we try to robustify this checking against HEADER_NRCPUS (if
> present)?
>
> I just tried to see how to find that info, but unfortunately it just
> prints it when calling perf_header__fprintf_info, it can be made to
> work.
>
> I need to work on storing those values in a struct accessible thru
> perf_session or perf_evlist, so that I can show that info on the TUI,
> perhaps at that time I can robustify this way.

Yeah, I thought about that too. I can try to make that work. I just
thought this was an optimization that I could add later (ie: first try
to see if the core idea of the patch is accepted).

Of course the real long term optimization is going to be one file per
CPU. There, the ordering will be much easier and deterministically
correct.

> > +	nr_cpus = sample->cpu + 1;
> > +
> > +	if (!os->last_cpu_timestamp)
> > +		os->last_cpu_timestamp = malloc(sizeof(u64) * nr_cpus);
> > +	else
> > +		os->last_cpu_timestamp = realloc(os->last_cpu_timestamp,
> > +						 sizeof(u64) * nr_cpus);
>
> If realloc fails we return -ENOMEM, but leak the old buffer.

Doh! The common trap with realloc...

> At some point we can have in the TUI/GUI a way for the user to ask for
> a specific perf.data file to be processed; if it fails to process due
> to the above realloc, we'll continue, allowing other perf.data files to
> be processed, but will have this leak.

Ok. Will fix. Thanks.
* Re: [PATCH] perf tools: Fix ordering with unstable tsc
  2012-03-22  0:10 ` Frederic Weisbecker
@ 2012-03-22 15:28 ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 31+ messages in thread
From: Arnaldo Carvalho de Melo @ 2012-03-22 15:28 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Thomas Gleixner, LKML, David Ahern, Peter Zijlstra, Stephane Eranian, Ingo Molnar

Em Thu, Mar 22, 2012 at 01:10:31AM +0100, Frederic Weisbecker escreveu:
> On Wed, Mar 14, 2012 at 04:55:35PM -0300, Arnaldo Carvalho de Melo wrote:
> > Em Sat, Feb 18, 2012 at 05:50:37PM +0100, Frederic Weisbecker escreveu:
> > > +static int alloc_cpus_timestamp_array(struct perf_session *s,
> > > +	if (sample->cpu < s->nr_cpus)

> > Shouldn't we try to robustify this checking against HEADER_NRCPUS (if
> > present)?

> Yeah I thought about that too. I can try to make that working.
> I just thought this was an optimization that I could later add (ie: first
> try to see if the core idea of the patch is accepted).

Right, I deferred that to tglx, but he must be busy as always :-P

> Of course the real long term optimization is going to have one file per
> CPU. There, the ordering will be much easier and deterministically
> correct.

Yeah.

> > > +	os->last_cpu_timestamp = realloc(os->last_cpu_timestamp,
> > > +					 sizeof(u64) * nr_cpus);

> > If realloc fails we return -ENOMEM, but leak the old buffer.

> Doh! the common trap with realloc...

:-)

> > At some point we can have in the TUI/GUI a way for the user to ask for
> > an specific perf.data file to be processed, if it fails to process due
> > to the above realloc, we'll continue, allowing other perf.data files to
> > be processed, but will have this leak.

> Ok. Will fix.

Thanks!

- Arnaldo
end of thread, other threads:[~2012-03-22 15:28 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-02-06 13:25 [BUG] perf: perf sched warning possibly due to clock granularity on AMD Stephane Eranian
2012-02-06 14:26 ` Peter Zijlstra
2012-02-06 14:31 ` Stephane Eranian
2012-02-06 17:17 ` Arnaldo Carvalho de Melo
2012-02-06 15:34 ` Borislav Petkov
2012-02-06 16:37 ` Peter Zijlstra
2012-02-06 16:46 ` Borislav Petkov
2012-02-06 16:54 ` Peter Zijlstra
2012-02-06 20:27 ` Borislav Petkov
2012-02-06 20:31 ` Peter Zijlstra
2012-02-06 20:37 ` Borislav Petkov
2012-02-06 21:19 ` Venki Pallipadi
2012-02-07  7:51 ` Peter Zijlstra
2012-02-07  8:32 ` Ingo Molnar
2012-02-07  9:06 ` Borislav Petkov
2012-02-07  9:50 ` Ingo Molnar
2012-02-07 12:08 ` [PATCH] x86, AMD: Set sched_clock_stable Borislav Petkov
2012-02-15 15:30 ` Peter Zijlstra
2012-02-07 19:43 ` [tip:perf/core] x86/sched/perf/AMD: " tip-bot for Borislav Petkov
2012-02-08 15:07 ` [BUG] perf: perf sched warning possibly due to clock granularity on AMD Frederic Weisbecker
2012-02-08 15:10 ` Stephane Eranian
2012-02-08 15:22 ` Frederic Weisbecker
2012-02-08 15:23 ` Stephane Eranian
2012-02-18 16:50 ` [PATCH] perf tools: Fix ordering with unstable tsc Frederic Weisbecker
2012-02-22 15:35 ` Stephane Eranian
2012-02-22 15:39 ` David Ahern
2012-03-05 18:43 ` Frederic Weisbecker
2012-03-14 19:55 ` Arnaldo Carvalho de Melo
2012-03-14 20:07 ` David Ahern
2012-03-22  0:10 ` Frederic Weisbecker
2012-03-22 15:28 ` Arnaldo Carvalho de Melo