* [GIT PULL] perf changes for v3.12
@ 2013-09-03 13:29 Ingo Molnar
  2013-09-03 13:37 ` Arnaldo Carvalho de Melo
                   ` (4 more replies)
  0 siblings, 5 replies; 36+ messages in thread
From: Ingo Molnar @ 2013-09-03 13:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Arnaldo Carvalho de Melo, Peter Zijlstra,
	Thomas Gleixner, Andrew Morton

Linus,

Please pull the latest perf-core-for-linus git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf-core-for-linus

   HEAD: 61bf86ad86443a710ae7eed372cccb8ed5038eda Merge tag 'perf-core-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core

As a first remark I'd like to point out that the obsolete '-f' (--force) 
option, which has not done anything for several releases, has been removed 
from 'perf record' and related utilities. Everyone please update muscle 
memory accordingly! :-)

Main changes on the perf kernel side:

   * Performance optimizations:
        . for trace events, by Steve Rostedt.
        . for time values, by Peter Zijlstra.

   * New hardware support:
        . for Intel Silvermont (22nm Atom) CPUs, by Zheng Yan
        . for Intel SNB-EP uncore PMUs, by Zheng Yan

   * Enhanced hardware support:
        . for Intel uncore PMUs: add filter support for QPI boxes, by Zheng Yan

   * Core perf events code enhancements and fixes:
        . for full-nohz feature handling, by Frederic Weisbecker
        . for group events, by Jiri Olsa
        . for call chains, by Frederic Weisbecker
        . for event stream parsing, by Adrian Hunter

   * New ABI details:
        . Add attr->mmap2 attribute, by Stephane Eranian
        . Add PERF_EVENT_IOC_ID ioctl to return event ID, by Jiri Olsa
        . Export u64 time_zero on the mmap header page to allow TSC
          calculation, by Adrian Hunter
        . Add dummy software event, by Adrian Hunter.
        . Add a new PERF_SAMPLE_IDENTIFIER to make samples always 
          parseable, by Adrian Hunter.
        . Make Power7 events available via sysfs, by Runzhen Wang.
        
   * Code cleanups and refactorings:
        . for nohz-full, by Frederic Weisbecker
        . for group events, by Jiri Olsa

   * Documentation updates:
        . for perf_event_type, by Peter Zijlstra

Main changes on the perf tooling side (some of these tooling changes 
utilize the above kernel side changes):

   * Lots of 'perf trace' enhancements:

        . Make 'perf trace' command line arguments consistent with
          'perf record', by David Ahern.
    
        . Allow specifying syscalls a la strace, by Arnaldo Carvalho de Melo.

        . Add --verbose and -o/--output options, by Arnaldo Carvalho de Melo.

        . Support ! in -e expressions, to filter a list of syscalls,
          by Arnaldo Carvalho de Melo.
 
        . Arg formatting improvements to allow masking arguments in 
          syscalls such as futex and open, where some arguments are 
          ignored and thus should not be printed, depending on other 
          args, by Arnaldo Carvalho de Melo.
    
        . Beautify the futex, open, openat, open_by_handle_at and lseek
          syscalls, by Arnaldo Carvalho de Melo.
    
        . Add option to analyze events in a file versus live, so that one can do:
    
           [root@zoo ~]# perf record -a -e raw_syscalls:* sleep 1
           [ perf record: Woken up 0 times to write data ]
           [ perf record: Captured and wrote 25.150 MB perf.data (~1098836 samples) ]
           [root@zoo ~]# perf trace -i perf.data -e futex --duration 1
              17.799 ( 1.020 ms): 7127 futex(uaddr: 0x7fff3f6c6674, op: 393, val: 1, utime: 0x7fff3f6c6470, ua
             113.344 (95.429 ms): 7127 futex(uaddr: 0x7fff3f6c6674, op: 393, val: 1, utime: 0x7fff3f6c6470, uaddr2: 0x7fff3f6c6648, val3: 4294967
             133.778 ( 1.042 ms): 18004 futex(uaddr: 0x7fff3f6c6674, op: 393, val: 1, utime: 0x7fff3f6c6470, uaddr2: 0x7fff3f6c6648, val3: 429496
           [root@zoo ~]#
    
          By David Ahern.
    
        . Honor target pid / tid options when analyzing a file, by David Ahern.
    
        . Introduce better formatting of syscall arguments, including, 
          so far, beautifiers for mmap, madvise and syscall return 
          values, by Arnaldo Carvalho de Melo.
    
        . Handle HUGEPAGE defines in the mmap beautifier, by David Ahern.


   * 'perf report/top' enhancements:

        . Do annotation using /proc/kcore and /proc/kallsyms when 
          available, removing the need for a vmlinux file for kernel 
          assembly annotation. This also improves this use case because 
          vmlinux contains just the initial kernel image, not what is 
          actually in use after code patching by things like 
          alternatives. By Adrian Hunter.
    
        . Add --ignore-callees=<regex> option to collapse undesired parts 
          of call graphs, by Greg Price.

        . Simplify symbol filtering by doing it at machine class level,
          by Adrian Hunter.

        . Add support for callchains in the gtk UI, by Namhyung Kim.
    
        . Add --objdump option to 'perf top', by Sukadev Bhattiprolu.

    
   * 'perf kvm' enhancements:

        . Add option to print only events that exceed a specified time
          duration, by David Ahern.

        . Improve stack trace printing, by David Ahern.
    
        . Update documentation of the live command, by David Ahern.

        . Add perf kvm stat live mode that combines aspects of 'perf kvm 
          stat' record and report, by David Ahern.
    
        . Add option to analyze specific VM in perf kvm stat report, by 
          David Ahern.
    
        . Do not require /lib/modules/* on a guest, by Jason Wessel.

    
   * 'perf script' enhancements:
     
        . Fix symbol offset computation for some dsos, by David Ahern.

        . Fix named threads support, by David Ahern.

        . Don't install scripting files when perl/python support
          is disabled, by Arnaldo Carvalho de Melo.

    
   * 'perf test' enhancements:

        . Add various improvements and fixes to the "vmlinux matches 
          kallsyms" 'perf test' entry, related to the /proc/kcore 
          annotation feature. By Arnaldo Carvalho de Melo.
    
        . Add sample parsing test, by Adrian Hunter.

        . Add test for reading object code, by Adrian Hunter.

        . Add attr record group sampling test, by Jiri Olsa.
    
        . Misc testing infrastructure improvements and other details,
          by Jiri Olsa.


   * 'perf list' enhancements:

        . Skip unsupported hardware events, by Namhyung Kim.
    
        . List pmu events, by Andi Kleen.

    
   * 'perf diff' enhancements:    
    
        . Add support for more than two files comparison, by Jiri Olsa.
    
    
   * 'perf sched' enhancements:
    
        . Various improvements, including removing reliance on some 
          scheduler tracepoints that provide the same information as the 
          PERF_RECORD_{FORK,EXIT} events. By Arnaldo Carvalho de Melo.

        . Remove odd build stall by moving a large struct initialization
          from a local variable to a global one, by Namhyung Kim.
    
   * 'perf stat' enhancements:

        . Add --initial-delay option to skip measuring for a defined
          startup phase, by Andi Kleen.

   * Generic perf tooling infrastructure/plumbing changes:

        . Tidy up sample parsing validation, by Adrian Hunter.
    
        . Fix up jobserver setup in the libtraceevent Makefile,
          by Arnaldo Carvalho de Melo.
    
        . Debug improvements, by Adrian Hunter.
    
        . Fix correlation of samples coming after PERF_RECORD_EXIT event, 
          by David Ahern.
    
        . Improve robustness of the topology parsing code,
          by Stephane Eranian.
    
        . Add group leader sampling, which allows just one event in a 
          group to sample while the other events have just their values 
          read, by Jiri Olsa.
    
        . Add support for a new modifier "D", which requests that the 
          event, or group of events, be pinned to the PMU.
          By Michael Ellerman.
    
        . Support callchain sorting based on addresses, by Andi Kleen.

        . Prep work for multi perf data file storage, by Jiri Olsa.

        . libtraceevent cleanups, by Namhyung Kim.

And lots and lots of other fixes and code reorganizations that did not 
make it into the list, see the shortlog, diffstat and the Git log for 
details!

 Thanks,

	Ingo

------------------>
Adrian Hunter (48):
      perf inject: Remove unused parameter
      perf tools: Fix missing tool parameter
      perf inject: Add missing 'finished_round'
      perf tools: Add const specifier to perf_pmu__find name parameter
      perf evlist: Tidy duplicated munmap code
      perf tools: Validate perf event header size
      perf tools: struct thread has a tid not a pid
      perf: Fix broken union in 'struct perf_event_mmap_page'
      perf/x86: Add ability to calculate TSC from perf sample timestamps
      perf tools: Add test for converting perf time to/from TSC
      perf symbols: avoid SyS kernel syscall aliases
      perf tests: Add test for reading object code
      perf symbols: Load kernel maps before using
      perf tools: Make it possible to read object code from vmlinux
      perf tests: Adjust the vmlinux symtab matches kallsyms test
      perf tools: Make it possible to read object code from kernel modules
      perf symbols: Add support for reading from /proc/kcore
      perf tests: Adjust the vmlinux symtab matches kallsyms test again
      perf tests: Add kcore to the object code reading test
      perf annotate: Allow disassembly using /proc/kcore
      perf annotate: Put dso name in symbol annotation title
      perf annotate: Remove nop at end of annotation
      perf annotate: Add call target name if it is missing
      perf machine: Add symbol filter to struct machine
      perf top: Set the machines symbol filter
      perf report: Set the machines symbol filter
      perf mem: Remove unused symbol filter member
      perf annotate: Set the machines symbol filter
      perf tools: Remove filter parameter of perf_event__preprocess_sample()
      perf tools: Remove filter parameter of thread__find_addr_location()
      perf tools: Remove filter parameter of thread__find_addr_map()
      perf tools: Re-implement debug print function for linking python/perf.so
      perf tools: Add debug prints
      perf tools: Add pid to struct thread
      perf tools: change machine__findnew_thread() to set thread pid
      perf evsel: Tidy up sample parsing overflow checking
      perf callchain: Remove unnecessary validation
      perf tools: Remove references to struct ip_event
      perf: make events stream always parsable
      perf evlist: Move perf_evlist__config() to a new source file
      perf tools: Add support for PERF_SAMPLE_IDENTIFIER
      perf tools: Add missing 'abi' member to 'struct regs_dump'
      perf tools: Expand perf_event__synthesize_sample()
      perf tools: Add a function to calculate sample event size
      perf tests: Add a sample parsing test
      perf: Add a dummy software event to keep tracking
      perf tools: Add support for PERF_COUNT_SW_DUMMY
      perf tests: Add 'keep tracking' test

Andi Kleen (10):
      perf tools: Default to cpu// for events v5
      perf list: List kernel supplied event aliases
      perf bench: Fix memcpy benchmark for large sizes
      perf tools: Support callchain sorting based on addresses
      perf tools: Move weight back to common sort keys
      perf evlist: Remove obsolete dummy execve
      perf evsel: Add support for enabling counters
      perf stat: Add support for --initial-delay option
      perf stat: Flush output after each line in interval mode
      perf tools: Try to increase the file descriptor limits on EMFILE

Arnaldo Carvalho de Melo (23):
      perf trace: Beautify 'connect' result
      perf python: Remove duplicate TID bit from mask
      perf annotate browser: Improve description of '?' hotkey
      perf trace: Allow specifying which syscalls to trace
      perf trace: Implement -o/--output filename
      perf tools: Don't install scripting files files when disabled
      perf trace: Support ! in -e expressions
      perf trace: Add --verbose option
      perf trace: Hide sys_exit messages about syscall id = -1
      perf trace: Introduce syscall arg formatters
      perf trace: Simplify sys_exit return printing
      perf trace: Allow printing syscall return values in hex
      perf trace: Add aliases to remaining syscalls of the sys_enter_newfoo
      perf trace: Allow overiding the formatting of syscall fields
      perf trace: Add beautifier for mmap prot parm
      perf trace: Add beautifier for mmap flags parm
      perf trace: Add beautifier for madvise behaviour/advice parm
      tools lib traceevent: Fixup jobserver setup
      perf trace: Allow syscall arg formatters to mask args
      perf trace: Add beautifier for futex 'operation' parm
      perf trace: Add beautifier for lseek's whence arg
      perf trace: Add beautifier for open's flags arg
      perf trace: Tell arg formatters the arg index

David Ahern (36):
      perf tools: Add methods for setting/retrieving priv element of thread struct
      perf evlist: Fix use of uninitialized variable
      perf tools: Don't free list head in parse_events__free_terms
      perf tests: Make terms a stack variable in test_term
      perf parse events: Demystify memory allocations
      perf script: Fix named threads support
      perf evsel: Handle ENODEV on default cycles event
      perf evsel: Actually show symbol offset in stack trace when requested
      perf tools: Fix compile of util/tsc.c
      perf top: move CONSOLE_CLEAR to header file
      perf stats: Add max and min stats
      perf session: Export a few functions for event processing
      perf kvm: Split out tracepoints from record args
      perf session: Export queue_event function
      perf kvm: Add live mode
      perf kvm: Add min and max stats to display
      perf kvm stat report: Add option to analyze specific VM
      perf kvm: Option to print events that exceed a duration
      perf kvm: Update documentation with live command
      perf sched: Simplify arguments to read_events
      perf sched: Remove thread lookup in sample handler
      perf sched: Remove sched_process_exit tracepoint
      perf sched: Remove sched_process_fork tracepoint
      perf tool: Simplify options to perf_evsel__print_ip
      perf evsel: Add option to print stack trace on single line
      perf evsel: Add option to limit stack depth in callchain dumps
      perf session: Change perf_session__has_traces to actually check for tracepoints
      perf tests: Fix compile failure on do_sort_something
      perf kvm: Remove force option to cmd_record
      perf trace: Make command line arguments consistent with perf-record
      perf tools: Sample after exit loses thread correlation
      perf evlist: Add tracepoint lookup by name
      perf trace: Add option to analyze events in a file versus live
      perf trace: Honor target pid / tid options when analyzing a file
      perf trace: Handle missing HUGEPAGE defines
      perf tools: Fix symbol offset computation for some dsos

Frederic Weisbecker (10):
      perf: Fix branch stack refcount leak on callchain init failure
      perf: Sanitize get_callchain_buffer()
      perf: Factor out event accounting code to account_event()/__free_event()
      perf: Split the per-cpu accounting part of the event accounting code
      perf: Migrate per cpu event accounting
      perf: Account freq events per cpu
      perf: Implement finer grained full dynticks kick
      watchdog: Make it work under full dynticks
      perf: Roll back callchain buffer refcount under the callchain mutex
      perf: Account freq events globally

Greg Price (1):
      perf report/top: Add option to collapse undesired parts of call graph

Ingo Molnar (1):
      perf annotate browser: Fix typo

Jason Wessel (1):
      perf machine: Do not require /lib/modules/* on a guest

Jiri Kosina (3):
      x86: Introduce int3 (breakpoint)-based instruction patching
      x86: Make jump_label use int3-based patching
      kprobes/x86: Call out into INT3 handler directly instead of using notifier

Jiri Olsa (46):
      perf: Remove the 'match' callback for auxiliary events processing
      perf tools: Remove cwd from perf_session struct
      perf tests: Omit end of the symbol check failure for test 1
      perf tests: Make TEST_ASSERT_VAL global
      perf tools: Remove callchain_cursor_reset call
      perf tools: Do not elide parent symbol column
      perf report: Fix perf_session__delete removal
      perf tools: Add struct perf_hpp_fmt into hpp callbacks
      perf tools: Centralize default columns init in perf_hpp__init
      perf diff: Introducing diff_data object to hold files
      perf diff: Switching the base hists to be pairs head
      perf hists: Marking dummy hists entries
      perf diff: Display data file info ahead of the diff output
      perf diff: Move diff related columns into diff command
      perf diff: Move columns into struct data__file
      perf diff: Change diff command to work over multiple data files
      perf diff: Update perf diff documentation for multiple data comparison
      perf diff: Making compute functions static
      perf diff: Add generic order option for compute sorting
      perf tools: Move hist_entry__period_snprintf into stdio code
      perf timechart: Use traceevent lib event-parse.h include
      perf timechart: Remove event types framework only user
      perf tools: Remove event types from perf data file
      perf record: Remove event types pushing
      perf tools: Remove event types framework completely
      perf tests: Check proper prev_state size for sched_switch tp
      perf session: Use session->fd instead of passing fd as argument
      perf header: Remove data_offset seek as it's not needed
      perf header: Remove attr_offset from perf_header
      perf header: Introduce feat_offset into perf_header
      perf header: Recognize version number for perf data file
      perf tests: Run ctags/cscope make tests only with needed binaries
      perf tests: Rename TMP to TMP_O tests/make variable
      perf tests: Add DESTDIR=TMP_DEST tests/make variable
      perf tests: Add 'make install/install-bin' tests into tests/make
      perf tests: Add broken install-* tests into tests/make
      perf: Add PERF_EVENT_IOC_ID ioctl to return event ID
      perf: Do not get values from disabled counters in group format read
      perf evlist: Use PERF_EVENT_IOC_ID perf ioctl to read event id
      perf tools: Add support for parsing PERF_SAMPLE_READ sample type
      perf evlist: Fix event ID retrieval for group format read case
      perf evlist: Add perf_evlist__id2sid method to get event ID related data
      perf evsel: Add PERF_SAMPLE_READ sample related processing
      perf tools: Add 'S' event/group modifier to read sample value
      perf tests: Add attr record group sampling test
      perf tests: Add parse events tests for leader sampling

Joe Perches (1):
      perf: Convert kmalloc_node(...GFP_ZERO...) to kzalloc_node()

Joonsoo Kim (1):
      Revert "tools lib lk: Fix for cross build"

Kirill A. Shutemov (1):
      perf tools: Fix build with perl 5.18

Masami Hiramatsu (3):
      kprobes/x86: Remove an incorrect comment about int3 in NMI/MCE
      kprobes/x86: Use text_poke_bp() instead of text_poke_smp*()
      kprobes/x86: Remove unused text_poke_smp() and text_poke_smp_batch() functions

Michael Ellerman (2):
      perf tools: Add support for pinned modifier
      perf tests: Add tests of new pinned modifier

Michael Wang (1):
      sched: Implement smarter wake-affine logic

Namhyung Kim (29):
      perf util: Move debugfs/tracing helper functions to util.c
      perf util: Use evsel->name to get tracepoint_paths
      tools lib traceevent: Remove unused install targets
      tools lib traceevent: Get rid of unused gui target
      tools lib traceevent: Add const qualifier to string arguments
      tools lib traceevent: Add trace_seq_reset()
      tools lib traceevent: Add page_size field to pevent
      tools lib traceevent: Port kbuffer parser routines
      perf util: Save page size in a trace file to pevent
      perf util: Save long size of traced system
      perf util: Make file/host_bigendian variable local
      perf util: Skip reading header_event file
      perf util: Parse header_page to get proper long size
      perf util: Get rid of unused header_page_* variables
      perf script: Adopt latency_format variable
      perf util: Rename read_*() functions in trace-event-info.c
      perf util: No need to call read_trace_init() in tracing_data_header()
      perf util: Remove unused enum and macro in trace-event.h
      perf sched: Move struct perf_sched definition out of cmd_sched()
      perf gtk/hists: Use GtkTreeStore instead of GtkListStore
      perf gtk/hists: Add support for callchains
      perf gtk/hists: Display callchain overhead also
      perf gtk/hists: Make column headers resizable
      perf gtk/hists: Add a double-click handler for callchains
      perf gtk/hists: Set rules hint for the hist browser
      perf symbols: Do not apply symfs for an absolute vmlinux path
      perf util: Add parse_nsec_time() function
      perf ui/gtk: Fix segmentation fault on perf_hpp__for_each_format loop
      perf list: Skip unsupported events

Oleg Nesterov (3):
      tracing/perf: Expand TRACE_EVENT(sched_stat_runtime)
      tracing/perf: Reimplement TP_perf_assign() logic
      tracing/perf: Avoid perf_trace_buf_*() in perf_trace_##call() when possible

Peter Zijlstra (3):
      perf: Update perf_event_type documentation
      sched: Micro-optimize the smart wake-affine logic
      perf: Do not compute time values unnecessarily

Robert Richter (2):
      perf tools: Use default include path notation for libtraceevent headers
      perf tools: Fix 'make tools/perf'

Runzhen Wang (1):
      perf tools: Make Power7 events available for perf

Stephane Eranian (2):
      perf tools: Improve robustness of topology parsing code
      perf: Add attr->mmap2 attribute to an event

Sukadev Bhattiprolu (2):
      perf tools: Rename cpu_map__all() to cpu_map__empty()
      perf top: Add --objdump option

Vince Weaver (1):
      perf: Export struct perf_branch_entry to userspace

Vladimir Davydov (1):
      sched: Move h_load calculation to task_h_load()

Yan, Zheng (5):
      perf/x86/intel/uncore: Add auxiliary pci device support
      perf/x86/intel/uncore: Add filter support for QPI boxes
      perf/x86/intel/uncore: Enable EV_SEL_EXT bit for PCU
      perf/x86: use INTEL_UEVENT_EXTRA_REG to define MSR_OFFCORE_RSP_X
      perf/x86: Add Silvermont (22nm Atom) support


 arch/powerpc/include/asm/perf_event_server.h       |   4 +-
 arch/powerpc/perf/power7-events-list.h             | 548 +++++++++++++++
 arch/powerpc/perf/power7-pmu.c                     | 148 +---
 arch/x86/Kconfig                                   |   5 -
 arch/x86/include/asm/alternative.h                 |  14 +-
 arch/x86/include/asm/tsc.h                         |   1 +
 arch/x86/kernel/alternative.c                      | 155 +++--
 arch/x86/kernel/cpu/perf_event.c                   |   6 +
 arch/x86/kernel/cpu/perf_event.h                   |   2 +
 arch/x86/kernel/cpu/perf_event_amd.c               |   3 +-
 arch/x86/kernel/cpu/perf_event_intel.c             | 180 ++++-
 arch/x86/kernel/cpu/perf_event_intel_ds.c          |  32 +-
 arch/x86/kernel/cpu/perf_event_intel_uncore.c      | 254 +++++--
 arch/x86/kernel/cpu/perf_event_intel_uncore.h      |  10 +
 arch/x86/kernel/jump_label.c                       |  16 +-
 arch/x86/kernel/kprobes/common.h                   |   5 -
 arch/x86/kernel/kprobes/core.c                     |   2 +-
 arch/x86/kernel/kprobes/opt.c                      | 110 +--
 arch/x86/kernel/traps.c                            |   4 +
 arch/x86/kernel/tsc.c                              |   6 +
 include/linux/perf_event.h                         |  24 -
 include/linux/sched.h                              |   3 +
 include/trace/events/sched.h                       |  22 +-
 include/trace/ftrace.h                             |  33 +-
 include/uapi/linux/perf_event.h                    | 123 +++-
 kernel/events/callchain.c                          |   3 +
 kernel/events/core.c                               | 381 +++++++----
 kernel/sched/core.c                                |   7 +-
 kernel/sched/fair.c                                | 105 ++-
 kernel/sched/sched.h                               |   8 +-
 kernel/watchdog.c                                  |   8 -
 tools/lib/lk/Makefile                              |  15 -
 tools/lib/traceevent/Makefile                      |  20 +-
 tools/lib/traceevent/event-parse.c                 |   7 +-
 tools/lib/traceevent/event-parse.h                 |  15 +-
 tools/lib/traceevent/kbuffer-parse.c               | 732 ++++++++++++++++++++
 tools/lib/traceevent/kbuffer.h                     |  67 ++
 tools/lib/traceevent/trace-seq.c                   |  13 +
 tools/perf/Documentation/perf-diff.txt             |  79 ++-
 tools/perf/Documentation/perf-kvm.txt              |  46 +-
 tools/perf/Documentation/perf-list.txt             |   6 +-
 tools/perf/Documentation/perf-report.txt           |  13 +-
 tools/perf/Documentation/perf-stat.txt             |   5 +
 tools/perf/Documentation/perf-top.txt              |   5 +
 tools/perf/Documentation/perf-trace.txt            |  24 +
 tools/perf/Makefile                                |  21 +-
 tools/perf/arch/x86/Makefile                       |   2 +
 tools/perf/arch/x86/util/tsc.c                     |  59 ++
 tools/perf/arch/x86/util/tsc.h                     |  20 +
 tools/perf/bench/mem-memcpy.c                      |   2 +
 tools/perf/builtin-annotate.c                      |   5 +-
 tools/perf/builtin-diff.c                          | 664 ++++++++++++++----
 tools/perf/builtin-inject.c                        |  53 +-
 tools/perf/builtin-kmem.c                          |   5 +-
 tools/perf/builtin-kvm.c                           | 754 +++++++++++++++++++--
 tools/perf/builtin-list.c                          |   3 +
 tools/perf/builtin-lock.c                          |   3 +-
 tools/perf/builtin-mem.c                           |   6 +-
 tools/perf/builtin-record.c                        |  13 -
 tools/perf/builtin-report.c                        |  88 +--
 tools/perf/builtin-sched.c                         | 161 ++---
 tools/perf/builtin-script.c                        |  42 +-
 tools/perf/builtin-stat.c                          |  24 +-
 tools/perf/builtin-timechart.c                     | 176 ++---
 tools/perf/builtin-top.c                           |  36 +-
 tools/perf/builtin-trace.c                         | 721 ++++++++++++++++++--
 tools/perf/config/Makefile                         |   5 +-
 tools/perf/perf.h                                  |   3 +
 tools/perf/python/twatch.py                        |   2 +-
 tools/perf/tests/attr/test-record-group-sampling   |  36 +
 tools/perf/tests/builtin-test.c                    |  18 +
 tools/perf/tests/code-reading.c                    | 572 ++++++++++++++++
 tools/perf/tests/dso-data.c                        |   8 -
 tools/perf/tests/evsel-tp-sched.c                  |   4 +-
 tools/perf/tests/hists_link.c                      |  27 +-
 tools/perf/tests/keep-tracking.c                   | 154 +++++
 tools/perf/tests/make                              |  67 +-
 tools/perf/tests/mmap-basic.c                      |   2 +-
 tools/perf/tests/parse-events.c                    | 190 +++++-
 tools/perf/tests/perf-time-to-tsc.c                | 177 +++++
 tools/perf/tests/sample-parsing.c                  | 316 +++++++++
 tools/perf/tests/tests.h                           |  12 +
 tools/perf/tests/vmlinux-kallsyms.c                |  49 +-
 tools/perf/ui/browsers/annotate.c                  |  20 +-
 tools/perf/ui/browsers/hists.c                     |  18 +-
 tools/perf/ui/gtk/hists.c                          | 128 +++-
 tools/perf/ui/hist.c                               | 258 +------
 tools/perf/ui/setup.c                              |   1 -
 tools/perf/ui/stdio/hist.c                         |  45 +-
 tools/perf/util/annotate.c                         |  60 +-
 tools/perf/util/build-id.c                         |  11 +-
 tools/perf/util/callchain.c                        |  15 +-
 tools/perf/util/callchain.h                        |  11 +-
 tools/perf/util/cpumap.h                           |   2 +-
 tools/perf/util/dso.c                              |  10 +-
 tools/perf/util/dso.h                              |  17 +
 tools/perf/util/event.c                            |  31 +-
 tools/perf/util/event.h                            |  41 +-
 tools/perf/util/evlist.c                           | 294 ++++++--
 tools/perf/util/evlist.h                           |  21 +-
 tools/perf/util/evsel.c                            | 581 ++++++++++++++--
 tools/perf/util/evsel.h                            |  19 +-
 tools/perf/util/header.c                           | 161 +----
 tools/perf/util/header.h                           |  40 +-
 tools/perf/util/hist.c                             |   4 +-
 tools/perf/util/hist.h                             |  26 +-
 tools/perf/util/include/linux/string.h             |   1 +
 tools/perf/util/machine.c                          | 155 +++--
 tools/perf/util/machine.h                          |  14 +-
 tools/perf/util/map.c                              |  67 +-
 tools/perf/util/map.h                              |  13 +
 tools/perf/util/parse-events.c                     | 174 +++--
 tools/perf/util/parse-events.h                     |  11 +-
 tools/perf/util/parse-events.l                     |   4 +-
 tools/perf/util/parse-events.y                     |  62 +-
 tools/perf/util/pmu.c                              |  87 ++-
 tools/perf/util/pmu.h                              |   5 +-
 tools/perf/util/python.c                           |  21 +
 tools/perf/util/record.c                           | 108 +++
 .../perf/util/scripting-engines/trace-event-perl.c |  14 +-
 .../util/scripting-engines/trace-event-python.c    |   9 +-
 tools/perf/util/session.c                          | 241 +++++--
 tools/perf/util/session.h                          |  14 +-
 tools/perf/util/sort.c                             |  12 +-
 tools/perf/util/sort.h                             |  13 +-
 tools/perf/util/stat.c                             |   6 +
 tools/perf/util/stat.h                             |   9 +
 tools/perf/util/string.c                           |  24 +
 tools/perf/util/symbol-elf.c                       | 174 ++++-
 tools/perf/util/symbol-minimal.c                   |   7 +
 tools/perf/util/symbol.c                           | 285 +++++++-
 tools/perf/util/symbol.h                           |   5 +
 tools/perf/util/thread.c                           |  11 +-
 tools/perf/util/thread.h                           |  23 +-
 tools/perf/util/tool.h                             |  10 +-
 tools/perf/util/top.h                              |   2 +
 tools/perf/util/trace-event-info.c                 |  96 +--
 tools/perf/util/trace-event-parse.c                |   6 -
 tools/perf/util/trace-event-read.c                 |  52 +-
 tools/perf/util/trace-event-scripting.c            |   3 +-
 tools/perf/util/trace-event.h                      |  21 +-
 tools/perf/util/unwind.c                           |   2 +-
 tools/perf/util/util.c                             |  92 +++
 tools/perf/util/util.h                             |   5 +
 144 files changed, 9132 insertions(+), 2309 deletions(-)
 create mode 100644 arch/powerpc/perf/power7-events-list.h
 create mode 100644 tools/lib/traceevent/kbuffer-parse.c
 create mode 100644 tools/lib/traceevent/kbuffer.h
 create mode 100644 tools/perf/arch/x86/util/tsc.c
 create mode 100644 tools/perf/arch/x86/util/tsc.h
 create mode 100644 tools/perf/tests/attr/test-record-group-sampling
 create mode 100644 tools/perf/tests/code-reading.c
 create mode 100644 tools/perf/tests/keep-tracking.c
 create mode 100644 tools/perf/tests/perf-time-to-tsc.c
 create mode 100644 tools/perf/tests/sample-parsing.c
 create mode 100644 tools/perf/util/record.c

[ Full diff omitted for size reasons. ]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [GIT PULL] perf changes for v3.12
  2013-09-03 13:29 [GIT PULL] perf changes for v3.12 Ingo Molnar
@ 2013-09-03 13:37 ` Arnaldo Carvalho de Melo
  2013-09-03 13:43   ` Ingo Molnar
  2013-09-03 17:02 ` Vince Weaver
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 36+ messages in thread
From: Arnaldo Carvalho de Melo @ 2013-09-03 13:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, linux-kernel, Peter Zijlstra, Thomas Gleixner,
	Andrew Morton, David Ahern, Adrian Hunter

Em Tue, Sep 03, 2013 at 03:29:33PM +0200, Ingo Molnar escreveu:
> Linus,
> 
> Please pull the latest perf-core-for-linus git tree from:

There were some misattributions I found, from memory, here, clarifying
FWIW:
 
>    * 'perf test' enhancements:
> 
>         . Add various improvements and fixes to the "vmlinux matches 
>           kallsyms" 'perf test' entry, related to the /proc/kcore 
>           annotation feature. By Arnaldo Carvalho de Melo.

by Adrian Hunter
     
>    * 'perf sched' enhancements:
>     
>         . Various improvements, including removing reliance on some 
>           scheduler tracepoints that provide the same information as the 
>           PERF_RECORD_{FORK,EXIT} events. By Arnaldo Carvalho de Melo.

By David Ahern.
 
:-)

- Arnaldo


* Re: [GIT PULL] perf changes for v3.12
  2013-09-03 13:37 ` Arnaldo Carvalho de Melo
@ 2013-09-03 13:43   ` Ingo Molnar
  0 siblings, 0 replies; 36+ messages in thread
From: Ingo Molnar @ 2013-09-03 13:43 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Linus Torvalds, linux-kernel, Peter Zijlstra, Thomas Gleixner,
	Andrew Morton, David Ahern, Adrian Hunter


* Arnaldo Carvalho de Melo <acme@infradead.org> wrote:

> Em Tue, Sep 03, 2013 at 03:29:33PM +0200, Ingo Molnar escreveu:
> > Linus,
> > 
> > Please pull the latest perf-core-for-linus git tree from:
> 
> There were some misattributions I found, from memory, here, clarifying
> FWIW:
>  
> >    * 'perf test' enhancements:
> > 
> >         . Add various improvements and fixes to the "vmlinux matches 
> >           kallsyms" 'perf test' entry, related to the /proc/kcore 
> >           annotation feature. By Arnaldo Carvalho de Melo.
> 
> by Adrian Hunter
>      
> >    * 'perf sched' enhancements:
> >     
> >         . Various improvements, including removing reliance on some 
> >           scheduler tracepoints that provide the same information as the 
> >           PERF_RECORD_{FORK,EXIT} events. By Arnaldo Carvalho de Melo.
> 
> By David Ahern.
>  
> :-)

Thanks, I knew I'd get something wrong when reorganizing and summarizing 
all the changelogs topically! ;-)

	Ingo


* Re: [GIT PULL] perf changes for v3.12
  2013-09-03 13:29 [GIT PULL] perf changes for v3.12 Ingo Molnar
  2013-09-03 13:37 ` Arnaldo Carvalho de Melo
@ 2013-09-03 17:02 ` Vince Weaver
  2013-09-04 17:53 ` Linus Torvalds
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 36+ messages in thread
From: Vince Weaver @ 2013-09-03 17:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, linux-kernel, Arnaldo Carvalho de Melo,
	Peter Zijlstra, Thomas Gleixner, Andrew Morton

On Tue, 3 Sep 2013, Ingo Molnar wrote:
>    * New ABI details:
>         . Make Power7 events available via sysfs, by Runzhen Wang.

So we're really going to add 100+ Power7 events to the stable sysfs ABI?
Are all the new events listed under the sysfs ABI documentation?

Are we going to add all of the x86 events too, adding 100kB of unswappable
event listings to every user's kernel whether they use perf_event or not?

Yes, I know I am a broken record on this issue, but I don't feel like 
anyone has clarified the future plans here.

Vince


* Re: [GIT PULL] perf changes for v3.12
  2013-09-03 13:29 [GIT PULL] perf changes for v3.12 Ingo Molnar
  2013-09-03 13:37 ` Arnaldo Carvalho de Melo
  2013-09-03 17:02 ` Vince Weaver
@ 2013-09-04 17:53 ` Linus Torvalds
  2013-09-05 10:56   ` Ingo Molnar
  2013-09-05 13:38 ` Ingo Molnar
  2013-09-08  2:17 ` Linus Torvalds
  4 siblings, 1 reply; 36+ messages in thread
From: Linus Torvalds @ 2013-09-04 17:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linux Kernel Mailing List, Arnaldo Carvalho de Melo,
	Peter Zijlstra, Thomas Gleixner, Andrew Morton


On Tue, Sep 3, 2013 at 6:29 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> Please pull the latest perf-core-for-linus git tree from:

I don't think this is new at all, but I just tried to do a perf
record/report of "make -j64 test" on git:

It's a big perf.data file (1.6G), but after it has done the
"processing time ordered events" thing it results in:

┌─Warning:───────────────────────────────────┐
│Processed 8672030 events and lost 71 chunks!│
│Check IO/CPU overload!                      │
│                                            │
│                                            │
│Press any key...                            │
└────────────────────────────────────────────┘

and then it just hangs using 100% CPU time. Pressing any key doesn't
do anything.

It may well still be *doing* something, and maybe it will come back
some day with results. But it sure doesn't show any indication that it
will.

Try this (in a current git source tree: note, by "git" I actually mean
git itself, not some random git repository):

    perf record -g -e cycles:pp make -j64 test >& out
    perf report

maybe you can reproduce it.

                Linus


* Re: [GIT PULL] perf changes for v3.12
  2013-09-04 17:53 ` Linus Torvalds
@ 2013-09-05 10:56   ` Ingo Molnar
  2013-09-05 12:42     ` Frederic Weisbecker
  0 siblings, 1 reply; 36+ messages in thread
From: Ingo Molnar @ 2013-09-05 10:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Arnaldo Carvalho de Melo,
	Peter Zijlstra, Thomas Gleixner, Andrew Morton,
	Frédéric Weisbecker


(Cc:-ed Frederic and Namhyung as well, it's about bad overhead in 
tools/perf/util/hist.c.)

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, Sep 3, 2013 at 6:29 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > Please pull the latest perf-core-for-linus git tree from:
> 
> I don't think this is new at all, but I just tried to do a perf
> record/report of "make -j64 test" on git:
> 
> It's a big perf.data file (1.6G), but after it has done the
> "processing time ordered events" thing it results in:
> 
> ┌─Warning:───────────────────────────────────┐
> │Processed 8672030 events and lost 71 chunks!│
> │Check IO/CPU overload!                      │
> │                                            │
> │                                            │
> │Press any key...                            │
> └────────────────────────────────────────────┘
> 
> and then it just hangs using 100% CPU time. Pressing any key doesn't
> do anything.
> 
> It may well still be *doing* something, and maybe it will come back
> some day with results. But it sure doesn't show any indication that it
> will.
> 
> Try this (in a current git source tree: note, by "git" I actually mean
> git itself, not some random git repository)::
> 
>     perf record -g -e cycles:pp make -j64 test >& out
>     perf report
> 
> maybe you can reproduce it.

I managed to reproduce it on a 32-way box via:

     perf record -g make -j64 bzImage >/dev/null 2>&1

It's easier to debug it without the TUI:

     perf --no-pager report --stdio

It turns out that even with a 400 MB perf.data the 'perf report' call will 
eventually finish - here it ran for almost half an hour(!) on a fast box.

Arnaldo, the large overhead is in hists__collapse_resort(), in particular 
it's doing append_chain_children() 99% of the time:

-  99.74%  perf  perf               [.] append_chain_children                                                         ◆
   - append_chain_children                                                                                            ▒
      - 99.76% merge_chain_branch                                                                                     ▒
         - merge_chain_branch                                                                                         ▒
            + 98.04% hists__collapse_resort                                                                           ▒
            + 1.96% merge_chain_branch                                                                                ▒
+   0.05%  perf  perf               [.] merge_chain_branch                                                            ▒
+   0.03%  perf  libc-2.17.so       [.] _int_free                                                                     ▒
+   0.03%  perf  libc-2.17.so       [.] __libc_calloc                                                                 ▒
+   0.02%  perf  [kernel.kallsyms]  [k] account_user_time                                                             ▒
+   0.02%  perf  libc-2.17.so       [.] _int_malloc                                                                   ▒

It seems to be stuck in hists__collapse_resort().

In particular the overhead arises because the following loop in 
append_chain_children():

        /* lookup in childrens */
        chain_for_each_child(rnode, root) {
                unsigned int ret = append_chain(rnode, cursor, period);

reaches very long counts and the algorithm gets quadratic (at least). The 
child count reaches over 100,000 entries in the end (!).

I don't think the high child count in itself is anomalous: a kernel build 
generates thousands of processes, tons of symbol ranges and tens of 
millions of call chain entries.

So I think what we need here is to speed up the lookup: put children into 
a secondary, ->pos,len indexed range-rbtree and do a binary search instead 
of a linear search over 100,000 child entries ... or something like that.

Btw., a side note, append_chain() is a rather confusing function in 
itself, with logic-inversion gems like:

                if (!found)
                        found = true;

All that should be cleaned up as well I guess.

The 'IO overload' message appears to be a separate, unrelated bug; it just 
annoyingly does not get refreshed away in the TUI before 
hists__collapse_resort() is called, and there's also no progress bar for 
the hists__collapse_resort() pass, so to the user it all looks like a 
deadlock.

So there's at least two bugs here:

  - the bad overhead in hists__collapse_resort()

  - bad usability if hists__collapse_resort() takes more than 1 second to finish

Thanks,

	Ingo


* Re: [GIT PULL] perf changes for v3.12
  2013-09-05 10:56   ` Ingo Molnar
@ 2013-09-05 12:42     ` Frederic Weisbecker
  2013-09-05 12:51       ` Ingo Molnar
  2013-09-10  8:06       ` Namhyung Kim
  0 siblings, 2 replies; 36+ messages in thread
From: Frederic Weisbecker @ 2013-09-05 12:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Linux Kernel Mailing List,
	Arnaldo Carvalho de Melo, Peter Zijlstra, Thomas Gleixner,
	Andrew Morton

On Thu, Sep 05, 2013 at 12:56:39PM +0200, Ingo Molnar wrote:
> 
> (Cc:-ed Frederic and Namhyung as well, it's about bad overhead in 
> tools/perf/util/hist.c.)
> 
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > On Tue, Sep 3, 2013 at 6:29 AM, Ingo Molnar <mingo@kernel.org> wrote:
> > >
> > > Please pull the latest perf-core-for-linus git tree from:
> > 
> > I don't think this is new at all, but I just tried to do a perf
> > record/report of "make -j64 test" on git:
> > 
> > It's a big perf.data file (1.6G), but after it has done the
> > "processing time ordered events" thing it results in:
> > 
> > ┌─Warning:───────────────────────────────────┐
> > │Processed 8672030 events and lost 71 chunks!│
> > │Check IO/CPU overload!                      │
> > │                                            │
> > │                                            │
> > │Press any key...                            │
> > └────────────────────────────────────────────┘
> > 
> > and then it just hangs using 100% CPU time. Pressing any key doesn't
> > do anything.
> > 
> > It may well still be *doing* something, and maybe it will come back
> > some day with results. But it sure doesn't show any indication that it
> > will.
> > 
> > Try this (in a current git source tree: note, by "git" I actually mean
> > git itself, not some random git repository)::
> > 
> >     perf record -g -e cycles:pp make -j64 test >& out
> >     perf report
> > 
> > maybe you can reproduce it.
> 
> I managed to reproduce it on a 32-way box via:
> 
>      perf record -g make -j64 bzImage >/dev/null 2>&1
> 
> It's easier to debug it without the TUI:
> 
>      perf --no-pager report --stdio
> 
> It turns out that even with a 400 MB perf.data the 'perf report' call will 
> eventually finish - here it ran for almost half an hour(!) on a fast box.
> 
> Arnaldo, the large overhead is in hists__collapse_resort(), in particular 
> it's doing append_chain_children() 99% of the time:
> 
> -  99.74%  perf  perf               [.] append_chain_children                                                         ◆
>    - append_chain_children                                                                                            ▒
>       - 99.76% merge_chain_branch                                                                                     ▒
>          - merge_chain_branch                                                                                         ▒
>             + 98.04% hists__collapse_resort                                                                           ▒
>             + 1.96% merge_chain_branch                                                                                ▒
> +   0.05%  perf  perf               [.] merge_chain_branch                                                            ▒
> +   0.03%  perf  libc-2.17.so       [.] _int_free                                                                     ▒
> +   0.03%  perf  libc-2.17.so       [.] __libc_calloc                                                                 ▒
> +   0.02%  perf  [kernel.kallsyms]  [k] account_user_time                                                             ▒
> +   0.02%  perf  libc-2.17.so       [.] _int_malloc                                                                   ▒
> 
> It seems to be stuck in hists__collapse_resort().
> 
> In particular the overhead arises because the following loop in 
> append_chain_children():
> 
>         /* lookup in childrens */
>         chain_for_each_child(rnode, root) {
>                 unsigned int ret = append_chain(rnode, cursor, period);
> 
> reaches very long counts and the algorithm gets quadratic (at least). The 
> child count reaches over 100,000 entries in the end (!).
> 
> I don't think the high child count in itself is anomalous: a kernel build 
> generates thousands of processes, tons of symbol ranges and tens of 
> millions of call chain entries.
> 
> So I think what we need here is to speed up the lookup: put children into 
> a secondary, ->pos,len indexed range-rbtree and do a binary search instead 
> of a linear search over 100,000 child entries ... or something like that.

You're right it's worth trying.

At least it might give better results for such high scale callchain trees.
I'll see what I can come up with.

> 
> Btw., a side note, append_chain() is a rather confusing function in 
> itself, with logic-inversion gems like:
> 
>                 if (!found)
>                         found = true;

The check is pointless yeah, I'll remove that.

> 
> All that should be cleaned up as well I guess.
> 
> The 'IO overload' message appears to be a separate, unrelated bug, it just 
> annoyingly does not get refreshed away in the TUI before 
> hists__collapse_resort() is called, and there's also no progress bar for 
> the hists__collapse_resort() pass, so to the user it all looks like a 
> deadlock.
> 
> So there's at least two bugs here:
> 
>   - the bad overhead in hists__collapse_resort()
> 
>   - bad usability if hists__collapse_resort() takes more than 1 second to finish

Also IIUC, collapsing/merging hists is only used for comm hists merging, due to
set_task_comm after exec.

Perhaps we can do better to anticipate the comm of a process based on tid/pid, fork
and comm events? This way we can avoid late collapses/merges. In the best case we
could get rid of collapses entirely and that would be some nice fresh air for util/hist.c

And ideally, the comm should be associated to a lifetime as a thread can change
its comm anytime.

At least we should be able to limit collapses significantly.
I can try to make callchains more scalable, but collapsing is a heavyweight operation
for them in any case.

> Thanks,
> 
> 	Ingo


* Re: [GIT PULL] perf changes for v3.12
  2013-09-05 12:42     ` Frederic Weisbecker
@ 2013-09-05 12:51       ` Ingo Molnar
  2013-09-05 12:58         ` Frederic Weisbecker
  2013-09-10  8:06       ` Namhyung Kim
  1 sibling, 1 reply; 36+ messages in thread
From: Ingo Molnar @ 2013-09-05 12:51 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Linus Torvalds, Linux Kernel Mailing List,
	Arnaldo Carvalho de Melo, Peter Zijlstra, Thomas Gleixner,
	Andrew Morton


* Frederic Weisbecker <fweisbec@gmail.com> wrote:

> > Btw., a side note, append_chain() is a rather confusing function in 
> > itself, with logic-inversion gems like:
> > 
> >                 if (!found)
> >                         found = true;
> 
> The check is pointless yeah, I'll remove that.

Are you sure it can be removed as-is? It inverts the 'found' condition, 
which then can be used later on AFAICS. (I haven't looked very deeply 
though.)

Thanks,

	Ingo


* Re: [GIT PULL] perf changes for v3.12
  2013-09-05 12:51       ` Ingo Molnar
@ 2013-09-05 12:58         ` Frederic Weisbecker
  0 siblings, 0 replies; 36+ messages in thread
From: Frederic Weisbecker @ 2013-09-05 12:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Linux Kernel Mailing List,
	Arnaldo Carvalho de Melo, Peter Zijlstra, Thomas Gleixner,
	Andrew Morton

On Thu, Sep 05, 2013 at 02:51:01PM +0200, Ingo Molnar wrote:
> 
> * Frederic Weisbecker <fweisbec@gmail.com> wrote:
> 
> > > Btw., a side note, append_chain() is a rather confusing function in 
> > > itself, with logic-inversion gems like:
> > > 
> > >                 if (!found)
> > >                         found = true;
> > 
> > The check is pointless yeah, I'll remove that.
> 
> Are you sure it can be removed as-is? It inverts the 'found' condition, 
> which then can be used later on AFAICS. (I haven't looked very deeply 
> though.)

Yeah, in fact found is initialized to false, so the if (!found) is always true there.
Maybe it's a leftover from a past version that relied on gotos or so.


* Re: [GIT PULL] perf changes for v3.12
  2013-09-03 13:29 [GIT PULL] perf changes for v3.12 Ingo Molnar
                   ` (2 preceding siblings ...)
  2013-09-04 17:53 ` Linus Torvalds
@ 2013-09-05 13:38 ` Ingo Molnar
  2013-09-08  2:17 ` Linus Torvalds
  4 siblings, 0 replies; 36+ messages in thread
From: Ingo Molnar @ 2013-09-05 13:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Arnaldo Carvalho de Melo, Peter Zijlstra,
	Thomas Gleixner, Andrew Morton


* Ingo Molnar <mingo@kernel.org> wrote:

>    * 'perf report/top' enhancements:
> 
>         . Do annotation using /proc/kcore and /proc/kallsyms when 
>           available, removing the forced need for a vmlinux file kernel 
>           assembly annotation. This also improves this use case because 
>           vmlinux has just the initial kernel image, not what is actually 
>           in use after various code patchings by things like alternatives.
>           By Adrian Hunter.

Btw., this new feature is most noticeable if you run 'perf top' as root on 
any random box: you can annotate based on the live kernel image straight 
away, without having the vmlinux.

This works on various distro kernels as well.

Thanks,

	Ingo


* Re: [GIT PULL] perf changes for v3.12
  2013-09-03 13:29 [GIT PULL] perf changes for v3.12 Ingo Molnar
                   ` (3 preceding siblings ...)
  2013-09-05 13:38 ` Ingo Molnar
@ 2013-09-08  2:17 ` Linus Torvalds
  2013-09-09 10:05   ` Peter Zijlstra
  4 siblings, 1 reply; 36+ messages in thread
From: Linus Torvalds @ 2013-09-08  2:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linux Kernel Mailing List, Arnaldo Carvalho de Melo,
	Peter Zijlstra, Thomas Gleixner

This is new for me, but I suspect it is more related to the new
Haswell CPU I have than necessarily the 3.12 perf pull request.

Regardless, nothing bad happened, but my dmesg has this in it:

   Unexpected number of pebs records 10

when I was profiling some git workloads. Full trace appended.

                Linus

---
 perf samples too long (2749 > 2500), lowering
kernel.perf_event_max_sample_rate to 50000
 perf samples too long (5200 > 5000), lowering
kernel.perf_event_max_sample_rate to 25000
 .. many seconds pass ..
 ------------[ cut here ]------------
 WARNING: CPU: 1 PID: 3270 at
arch/x86/kernel/cpu/perf_event_intel_ds.c:1003
intel_pmu_drain_pebs_hsw+0x91/0xa0()
 Unexpected number of pebs records 10
 Modules linked in: fuse ip6t_REJECT nf_conntrack_ipv4
nf_conntrack_ipv6 nf_defrag_ipv4 nf_defrag_ipv6 xt_conntrack
ip6table_filter nf_conntrack ip6_tables snd_hda_codec_realtek
snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hwdep snd_seq
snd_seq_device snd_pcm e1000e serio_raw snd_page_alloc snd_timer ptp
lpc_ich snd i2c_i801 pps_core mfd_core soundcore uinput nfsd
auth_rpcgss nfs_acl lockd sunrpc binfmt_misc hid_logitech_dj i915
i2c_algo_bit drm_kms_helper drm i2c_core video
 CPU: 1 PID: 3270 Comm: git Not tainted 3.11.0-07749-ge7d33bb5ea82-dirty #4
 Hardware name:                  /DH87RL, BIOS
RLH8710H.86A.0317.2013.0426.1724 04/26/2013
  0000000000000009 ffff88041fa46ca0 ffffffff815fc637 ffff88041fa46ce8
  ffff88041fa46cd8 ffffffff81051e78 ffff88040f42f000 ffff88040f42f780
  ffff88041fa46ef8 ffffffff81c15860 0000000000000002 ffff88041fa46d38
 Call Trace:
  <NMI>  [<ffffffff815fc637>] dump_stack+0x45/0x56
  [<ffffffff81051e78>] warn_slowpath_common+0x78/0xa0
  [<ffffffff81051ee7>] warn_slowpath_fmt+0x47/0x50
  [<ffffffff8101b051>] intel_pmu_drain_pebs_hsw+0x91/0xa0
  [<ffffffff8101c5d0>] intel_pmu_handle_irq+0x210/0x390
  [<ffffffff81604deb>] perf_event_nmi_handler+0x2b/0x50
  [<ffffffff81604670>] nmi_handle.isra.3+0x80/0x180
  [<ffffffff81604840>] do_nmi+0xd0/0x310
  [<ffffffff81603d37>] end_repeat_nmi+0x1e/0x2e
  <<EOE>>  [<ffffffff810167df>] perf_events_lapic_init+0x2f/0x40
  [<ffffffff81016a50>] x86_pmu_enable+0x260/0x310
  [<ffffffff81111d87>] perf_pmu_enable+0x27/0x30
  [<ffffffff81112140>] perf_event_context_sched_in+0x80/0xc0
  [<ffffffff811127eb>] __perf_event_task_sched_in+0x16b/0x180
  [<ffffffff8107c300>] finish_task_switch+0x70/0xa0
  [<ffffffff81600f48>] __schedule+0x368/0x7c0
  [<ffffffff816013c4>] schedule+0x24/0x70
  [<ffffffff816038a4>] retint_careful+0xd/0x21
 ---[ end trace 52dad7a9d8d96b5f ]---


* Re: [GIT PULL] perf changes for v3.12
  2013-09-08  2:17 ` Linus Torvalds
@ 2013-09-09 10:05   ` Peter Zijlstra
  2013-09-10 11:28     ` Stephane Eranian
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2013-09-09 10:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Linux Kernel Mailing List, Arnaldo Carvalho de Melo,
	Thomas Gleixner, eranian, andi

On Sat, Sep 07, 2013 at 07:17:28PM -0700, Linus Torvalds wrote:
> This is new for me, but I suspect it is more related to the new
> Haswell CPU I have than necessarily the 3.12 perf pull request.
> 
> Regardless, nothing bad happened, but my dmesg has this in it:
> 
>    Unexpected number of pebs records 10
> 
> when I was profiling some git workloads. Full trace appended.

>  ------------[ cut here ]------------
>  WARNING: CPU: 1 PID: 3270 at
> arch/x86/kernel/cpu/perf_event_intel_ds.c:1003
> intel_pmu_drain_pebs_hsw+0x91/0xa0()
>  Unexpected number of pebs records 10
>  Modules linked in: fuse ip6t_REJECT nf_conntrack_ipv4
> nf_conntrack_ipv6 nf_defrag_ipv4 nf_defrag_ipv6 xt_conntrack
> ip6table_filter nf_conntrack ip6_tables snd_hda_codec_realtek
> snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hwdep snd_seq
> snd_seq_device snd_pcm e1000e serio_raw snd_page_alloc snd_timer ptp
> lpc_ich snd i2c_i801 pps_core mfd_core soundcore uinput nfsd
> auth_rpcgss nfs_acl lockd sunrpc binfmt_misc hid_logitech_dj i915
> i2c_algo_bit drm_kms_helper drm i2c_core video
>  CPU: 1 PID: 3270 Comm: git Not tainted 3.11.0-07749-ge7d33bb5ea82-dirty #4
>  Hardware name:                  /DH87RL, BIOS
> RLH8710H.86A.0317.2013.0426.1724 04/26/2013
>   0000000000000009 ffff88041fa46ca0 ffffffff815fc637 ffff88041fa46ce8
>   ffff88041fa46cd8 ffffffff81051e78 ffff88040f42f000 ffff88040f42f780
>   ffff88041fa46ef8 ffffffff81c15860 0000000000000002 ffff88041fa46d38
>  Call Trace:
>   <NMI>  [<ffffffff815fc637>] dump_stack+0x45/0x56
>   [<ffffffff81051e78>] warn_slowpath_common+0x78/0xa0
>   [<ffffffff81051ee7>] warn_slowpath_fmt+0x47/0x50
>   [<ffffffff8101b051>] intel_pmu_drain_pebs_hsw+0x91/0xa0
>   [<ffffffff8101c5d0>] intel_pmu_handle_irq+0x210/0x390
>   [<ffffffff81604deb>] perf_event_nmi_handler+0x2b/0x50
>   [<ffffffff81604670>] nmi_handle.isra.3+0x80/0x180
>   [<ffffffff81604840>] do_nmi+0xd0/0x310
>   [<ffffffff81603d37>] end_repeat_nmi+0x1e/0x2e
>   <<EOE>>  [<ffffffff810167df>] perf_events_lapic_init+0x2f/0x40
>   [<ffffffff81016a50>] x86_pmu_enable+0x260/0x310
>   [<ffffffff81111d87>] perf_pmu_enable+0x27/0x30
>   [<ffffffff81112140>] perf_event_context_sched_in+0x80/0xc0
>   [<ffffffff811127eb>] __perf_event_task_sched_in+0x16b/0x180
>   [<ffffffff8107c300>] finish_task_switch+0x70/0xa0
>   [<ffffffff81600f48>] __schedule+0x368/0x7c0
>   [<ffffffff816013c4>] schedule+0x24/0x70
>   [<ffffffff816038a4>] retint_careful+0xd/0x21
>  ---[ end trace 52dad7a9d8d96b5f ]---

So I'm not entirely sure how this can happen. We do _not_ set
ds->pebs_event_reset[] so once a PEBS assist happens we do not
automatically re-engage. We wait for the PMI to happen, consume the data
and reprogram the hardware.

We allow for up to x86_pmu.max_pebs_events (4 on hsw afaik) of PEBS
entries to be in the buffer so each PEBS capable counter can trigger the
assist once while we 'wait' for the PMI to hit.

There's a little 'funny' with the hsw PEBS code in that it doesn't
unconditionally clear the buffer in intel_pmu_drain_pebs_hsw() but that
would only make a difference if n actually was < 0 -- again something
that shouldn't ever happen.

That said, there's two peculiar things about your report. First and
foremost HSW; I'm not at all sure how it deals with a transaction being
undone if a PEBS assist had happened during it. Does it reset the
hardware but not remove the data? If so, this could cause excess
entries. 

Stephane, Andi, is this at all possible? Section 18.10.5.1 (June 2013)
seems to suggest not; the way I read it PEBS assist will cause an abort
(and the PMI certainly would) and we'll process the entire thing outside
of transaction context.

That said, 18.10.5.1 mentions that entries from inside a transaction
will only have 0xb0 and 0xb8 aka perf_record::real_ip and
perf_record::tsx_tuning set, I don't think __intel_pmu_drain_pebs_nhm()
will do the right thing for them. In particular, when no
perf_record::status bit is set we'll drop the event on the ground.

Dropping them means we do not do intel_pmu_save_and_restart() to
reprogram the hardware and the counter would get 'stuck'. Then again,
without a status bit we have no idea what event to re-program so we're
pretty much up shit creek without no stinkin' paddle.

Another possibility is that the meaning of pebs_record::status has
changed -- it was 'broken' and Peggy agreed and said she would look into
it, not sure this has been 'fixed' but I couldn't find anything on it.
That said, I think the code always assumed it was 'sane' so it having
been fixed shouldn't matter.

Secondly you seem to have triggered the nested NMI replay path. I'm not
immediately sure how that could have messed up the PEBS state, but it
sure is a complication.

I think we can do something like the below as a pure code reduction --
completely untested:

---
 arch/x86/kernel/cpu/perf_event_intel_ds.c | 109 +++++++++---------------------
 1 file changed, 32 insertions(+), 77 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 63438aa..bfda50e 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -178,20 +178,14 @@ struct pebs_record_nhm {
 	u64 status, dla, dse, lat;
 };
 
-/*
- * Same as pebs_record_nhm, with two additional fields.
- */
 struct pebs_record_hsw {
-	struct pebs_record_nhm nhm;
-	/*
-	 * Real IP of the event. In the Intel documentation this
-	 * is called eventingrip.
-	 */
-	u64 real_ip;
-	/*
-	 * TSX tuning information field: abort cycles and abort flags.
-	 */
-	u64 tsx_tuning;
+	u64 flags, ip;
+	u64 ax, bx, cx, dx;
+	u64 si, di, bp, sp;
+	u64 r8,  r9,  r10, r11;
+	u64 r12, r13, r14, r15;
+	u64 status, dla, dse, lat;
+	u64 real_ip, tsx_tuning;
 };
 
 void init_debug_store_on_cpu(int cpu)
@@ -789,12 +783,11 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
 				   struct pt_regs *iregs, void *__pebs)
 {
 	/*
-	 * We cast to pebs_record_nhm to get the load latency data
-	 * if extra_reg MSR_PEBS_LD_LAT_THRESHOLD used
+	 * We cast to the biggest pebs_record but are careful not to
+	 * unconditionally access the 'extra' entries.
 	 */
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
-	struct pebs_record_nhm *pebs = __pebs;
-	struct pebs_record_hsw *pebs_hsw = __pebs;
+	struct pebs_record_hsw *pebs = __pebs;
 	struct perf_sample_data data;
 	struct pt_regs regs;
 	u64 sample_type;
@@ -853,7 +846,7 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
 	regs.sp = pebs->sp;
 
 	if (event->attr.precise_ip > 1 && x86_pmu.intel_cap.pebs_format >= 2) {
-		regs.ip = pebs_hsw->real_ip;
+		regs.ip = pebs->real_ip;
 		regs.flags |= PERF_EFLAGS_EXACT;
 	} else if (event->attr.precise_ip > 1 && intel_pmu_pebs_fixup_ip(&regs))
 		regs.flags |= PERF_EFLAGS_EXACT;
@@ -912,17 +905,34 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs)
 	__intel_pmu_pebs_event(event, iregs, at);
 }
 
-static void __intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, void *at,
-					void *top)
+static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
 	struct debug_store *ds = cpuc->ds;
 	struct perf_event *event = NULL;
+	void *at, *top;
 	u64 status = 0;
-	int bit;
+	int bit, n;
+
+	if (!x86_pmu.pebs_active)
+		return;
+
+	at  = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
+	top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index;
 
 	ds->pebs_index = ds->pebs_buffer_base;
 
+	n = (top - at) / x86_pmu.pebs_record_size;
+	if (n <= 0)
+		return;
+
+	/*
+	 * Should not happen, we program the threshold at 1 and do not
+	 * set a reset value.
+	 */
+	WARN_ONCE(n > x86_pmu.max_pebs_events,
+		  "Unexpected number of pebs records %d\n", n);
+
 	for (; at < top; at += x86_pmu.pebs_record_size) {
 		struct pebs_record_nhm *p = at;
 
@@ -950,61 +960,6 @@ static void __intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, void *at,
 	}
 }
 
-static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
-{
-	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
-	struct debug_store *ds = cpuc->ds;
-	struct pebs_record_nhm *at, *top;
-	int n;
-
-	if (!x86_pmu.pebs_active)
-		return;
-
-	at  = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
-	top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index;
-
-	ds->pebs_index = ds->pebs_buffer_base;
-
-	n = top - at;
-	if (n <= 0)
-		return;
-
-	/*
-	 * Should not happen, we program the threshold at 1 and do not
-	 * set a reset value.
-	 */
-	WARN_ONCE(n > x86_pmu.max_pebs_events,
-		  "Unexpected number of pebs records %d\n", n);
-
-	return __intel_pmu_drain_pebs_nhm(iregs, at, top);
-}
-
-static void intel_pmu_drain_pebs_hsw(struct pt_regs *iregs)
-{
-	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
-	struct debug_store *ds = cpuc->ds;
-	struct pebs_record_hsw *at, *top;
-	int n;
-
-	if (!x86_pmu.pebs_active)
-		return;
-
-	at  = (struct pebs_record_hsw *)(unsigned long)ds->pebs_buffer_base;
-	top = (struct pebs_record_hsw *)(unsigned long)ds->pebs_index;
-
-	n = top - at;
-	if (n <= 0)
-		return;
-	/*
-	 * Should not happen, we program the threshold at 1 and do not
-	 * set a reset value.
-	 */
-	WARN_ONCE(n > x86_pmu.max_pebs_events,
-		  "Unexpected number of pebs records %d\n", n);
-
-	return __intel_pmu_drain_pebs_nhm(iregs, at, top);
-}
-
 /*
  * BTS, PEBS probe and setup
  */
@@ -1039,7 +994,7 @@ void intel_ds_init(void)
 		case 2:
 			pr_cont("PEBS fmt2%c, ", pebs_type);
 			x86_pmu.pebs_record_size = sizeof(struct pebs_record_hsw);
-			x86_pmu.drain_pebs = intel_pmu_drain_pebs_hsw;
+			x86_pmu.drain_pebs = intel_pmu_drain_pebs_nhm;
 			break;
 
 		default:

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [GIT PULL] perf changes for v3.12
  2013-09-05 12:42     ` Frederic Weisbecker
  2013-09-05 12:51       ` Ingo Molnar
@ 2013-09-10  8:06       ` Namhyung Kim
  2013-09-10 11:18         ` Frederic Weisbecker
  1 sibling, 1 reply; 36+ messages in thread
From: Namhyung Kim @ 2013-09-10  8:06 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, Linus Torvalds, Linux Kernel Mailing List,
	Arnaldo Carvalho de Melo, Peter Zijlstra, Thomas Gleixner,
	Andrew Morton

Hi,

On Thu, 5 Sep 2013 14:42:44 +0200, Frederic Weisbecker wrote:
> On Thu, Sep 05, 2013 at 12:56:39PM +0200, Ingo Molnar wrote:
>> 
>> (Cc:-ed Frederic and Namhyung as well, it's about bad overhead in 
>> tools/perf/util/hist.c.)
>> 
>> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>> 
>> > On Tue, Sep 3, 2013 at 6:29 AM, Ingo Molnar <mingo@kernel.org> wrote:
>> > >
>> > > Please pull the latest perf-core-for-linus git tree from:
>> > 
>> > I don't think this is new at all, but I just tried to do a perf
>> > record/report of "make -j64 test" on git:
>> > 
>> > It's a big perf.data file (1.6G), but after it has done the
>> > "processing time ordered events" thing it results in:
>> > 
>> > ┌─Warning:───────────────────────────────────┐
>> > │Processed 8672030 events and lost 71 chunks!│
>> > │Check IO/CPU overload!                      │
>> > │                                            │
>> > │                                            │
>> > │Press any key...                            │
>> > └────────────────────────────────────────────┘
>> > 
>> > and then it just hangs using 100% CPU time. Pressing any key doesn't
>> > do anything.
>> > 
>> > It may well still be *doing* something, and maybe it will come back
>> > some day with results. But it sure doesn't show any indication that it
>> > will.
>> > 
>> > Try this (in a current git source tree: note, by "git" I actually mean
>> > git itself, not some random git repository)::
>> > 
>> >     perf record -g -e cycles:pp make -j64 test >& out
>> >     perf report
>> > 
>> > maybe you can reproduce it.
>> 
>> I managed to reproduce it on a 32-way box via:
>> 
>>      perf record -g make -j64 bzImage >/dev/null 2>&1
>> 
>> It's easier to debug it without the TUI:
>> 
>>      perf --no-pager report --stdio
>> 
>> It turns out that even with a 400 MB perf.data the 'perf report' call will 
>> eventually finish - here it ran for almost half an hour(!) on a fast box.
>> 
>> Arnaldo, the large overhead is in hists__collapse_resort(), in particular 
>> it's doing append_chain_children() 99% of the time:
>> 
>> -  99.74%  perf  perf               [.] append_chain_children                                                         ◆
>>    - append_chain_children                                                                                            ▒
>>       - 99.76% merge_chain_branch                                                                                     ▒
>>          - merge_chain_branch                                                                                         ▒
>>             + 98.04% hists__collapse_resort                                                                           ▒
>>             + 1.96% merge_chain_branch                                                                                ▒
>> +   0.05%  perf  perf               [.] merge_chain_branch                                                            ▒
>> +   0.03%  perf  libc-2.17.so       [.] _int_free                                                                     ▒
>> +   0.03%  perf  libc-2.17.so       [.] __libc_calloc                                                                 ▒
>> +   0.02%  perf  [kernel.kallsyms]  [k] account_user_time                                                             ▒
>> +   0.02%  perf  libc-2.17.so       [.] _int_malloc                                                                   ▒
>> 
>> It seems to be stuck in hists__collapse_resort().
>> 
>> In particular the overhead arises because the following loop in 
>> append_chain_children():
>> 
>>         /* lookup in childrens */
>>         chain_for_each_child(rnode, root) {
>>                 unsigned int ret = append_chain(rnode, cursor, period);
>> 
>> Reaches very long counts and the algorithm gets quadratic (at least). The 
>> child count reaches over 100,000 entries in the end (!).
>> 
>> I don't think the high child count in itself is anomalous: a kernel build 
>> generates thousands of processes, tons of symbol ranges and tens of 
>> millions of call chain entries.
>> 
>> So I think what we need here is to speed up the lookup: put children into 
>> a secondary, ->pos,len indexed range-rbtree and do a binary search instead 
>> of a linear search over 100,000 child entries ... or something like that.
>
> You're right it's worth trying.
>
> At least it might give better results for such high scale callchain trees.
> I'll see what I can come up with.

I just converted it to a normal rbtree and the total processing time went
down from 380s to 20s!  I couldn't work out how to use the range-rbtree
in this case, so there's probably room for further enhancement.  I'll
send the patches soon.

>
>> 
>> Btw., a side note, append_chain() is a rather confusing function in 
>> itself, with logic-inversion gems like:
>> 
>>                 if (!found)
>>                         found = true;
>
> The check is pointless yeah, I'll remove that.
>
>> 
>> All that should be cleaned up as well I guess.
>> 
>> The 'IO overload' message appears to be a separate, unrelated bug, it just 
>> annoyingly does not get refreshed away in the TUI before 
>> hists__collapse_resort() is called, and there's also no progress bar for 
>> the hists__collapse_resort() pass, so to the user it all looks like a 
>> deadlock.
>> 
>> So there's at least two bugs here:
>> 
>>   - the bad overhead in hists__collapse_resort()
>> 
>>   - bad usability if hists__collapse_resort() takes more than 1 second to finish
>
> Also IIUC, collapsing/merging hists is only used for comm hists merging, due to
> set_task_comm after exec.
>
> Perhaps we can do better to anticipate the comm of a process based on tid/pid, fork
> and comm events? This way we can avoid late collapses/merges. In the best case we
> could get rid of collapses entirely and that would be some nice fresh air for util/hist.c
>
> And ideally, the comm should be associated to a lifetime as a thread can change
> its comm anytime.

Right.  I also thought about why the separate collapsing stage is
needed.  Maybe we can collect hist entries that have the same comm at
insert time.  One problem I can imagine is that the target thread
changes its comm after some hist entries have already been collected.
In this case we should look up another thread which has the same old
comm and pass the entries to it.  But we don't have information about
which entries belong to a given thread, so for now it would require a
full traversal of the hist entries.  If we add that info to threads, I
think we can get rid of collapses entirely.

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [GIT PULL] perf changes for v3.12
  2013-09-10  8:06       ` Namhyung Kim
@ 2013-09-10 11:18         ` Frederic Weisbecker
  0 siblings, 0 replies; 36+ messages in thread
From: Frederic Weisbecker @ 2013-09-10 11:18 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Ingo Molnar, Linus Torvalds, Linux Kernel Mailing List,
	Arnaldo Carvalho de Melo, Peter Zijlstra, Thomas Gleixner,
	Andrew Morton

On Tue, Sep 10, 2013 at 05:06:06PM +0900, Namhyung Kim wrote:
> Hi,
> 
> On Thu, 5 Sep 2013 14:42:44 +0200, Frederic Weisbecker wrote:
> > On Thu, Sep 05, 2013 at 12:56:39PM +0200, Ingo Molnar wrote:
> >> 
> >> (Cc:-ed Frederic and Namhyung as well, it's about bad overhead in 
> >> tools/perf/util/hist.c.)
> >> 
> >> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >> 
> >> > On Tue, Sep 3, 2013 at 6:29 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >> > >
> >> > > Please pull the latest perf-core-for-linus git tree from:
> >> > 
> >> > I don't think this is new at all, but I just tried to do a perf
> >> > record/report of "make -j64 test" on git:
> >> > 
> >> > It's a big perf.data file (1.6G), but after it has done the
> >> > "processing time ordered events" thing it results in:
> >> > 
> >> > ┌─Warning:───────────────────────────────────┐
> >> > │Processed 8672030 events and lost 71 chunks!│
> >> > │Check IO/CPU overload!                      │
> >> > │                                            │
> >> > │                                            │
> >> > │Press any key...                            │
> >> > └────────────────────────────────────────────┘
> >> > 
> >> > and then it just hangs using 100% CPU time. Pressing any key doesn't
> >> > do anything.
> >> > 
> >> > It may well still be *doing* something, and maybe it will come back
> >> > some day with results. But it sure doesn't show any indication that it
> >> > will.
> >> > 
> >> > Try this (in a current git source tree: note, by "git" I actually mean
> >> > git itself, not some random git repository)::
> >> > 
> >> >     perf record -g -e cycles:pp make -j64 test >& out
> >> >     perf report
> >> > 
> >> > maybe you can reproduce it.
> >> 
> >> I managed to reproduce it on a 32-way box via:
> >> 
> >>      perf record -g make -j64 bzImage >/dev/null 2>&1
> >> 
> >> It's easier to debug it without the TUI:
> >> 
> >>      perf --no-pager report --stdio
> >> 
> >> It turns out that even with a 400 MB perf.data the 'perf report' call will 
> >> eventually finish - here it ran for almost half an hour(!) on a fast box.
> >> 
> >> Arnaldo, the large overhead is in hists__collapse_resort(), in particular 
> >> it's doing append_chain_children() 99% of the time:
> >> 
> >> -  99.74%  perf  perf               [.] append_chain_children                                                         ◆
> >>    - append_chain_children                                                                                            ▒
> >>       - 99.76% merge_chain_branch                                                                                     ▒
> >>          - merge_chain_branch                                                                                         ▒
> >>             + 98.04% hists__collapse_resort                                                                           ▒
> >>             + 1.96% merge_chain_branch                                                                                ▒
> >> +   0.05%  perf  perf               [.] merge_chain_branch                                                            ▒
> >> +   0.03%  perf  libc-2.17.so       [.] _int_free                                                                     ▒
> >> +   0.03%  perf  libc-2.17.so       [.] __libc_calloc                                                                 ▒
> >> +   0.02%  perf  [kernel.kallsyms]  [k] account_user_time                                                             ▒
> >> +   0.02%  perf  libc-2.17.so       [.] _int_malloc                                                                   ▒
> >> 
> >> It seems to be stuck in hists__collapse_resort().
> >> 
> >> In particular the overhead arises because the following loop in 
> >> append_chain_children():
> >> 
> >>         /* lookup in childrens */
> >>         chain_for_each_child(rnode, root) {
> >>                 unsigned int ret = append_chain(rnode, cursor, period);
> >> 
> >> Reaches very long counts and the algorithm gets quadratic (at least). The 
> >> child count reaches over 100,000 entries in the end (!).
> >> 
> >> I don't think the high child count in itself is anomalous: a kernel build 
> >> generates thousands of processes, tons of symbol ranges and tens of 
> >> millions of call chain entries.
> >> 
> >> So I think what we need here is to speed up the lookup: put children into 
> >> a secondary, ->pos,len indexed range-rbtree and do a binary search instead 
> >> of a linear search over 100,000 child entries ... or something like that.
> >
> > You're right it's worth trying.
> >
> > At least it might give better results for such high scale callchain trees.
> > I'll see what I can come up with.
> 
> I just converted it to a normal rbtree and the total processing time went
> down from 380s to 20s!  I couldn't work out how to use the range-rbtree
> in this case, so there's probably room for further enhancement.  I'll
> send the patches soon.
> 
> >
> >> 
> >> Btw., a side note, append_chain() is a rather confusing function in 
> >> itself, with logic-inversion gems like:
> >> 
> >>                 if (!found)
> >>                         found = true;
> >
> > The check is pointless yeah, I'll remove that.
> >
> >> 
> >> All that should be cleaned up as well I guess.
> >> 
> >> The 'IO overload' message appears to be a separate, unrelated bug, it just 
> >> annoyingly does not get refreshed away in the TUI before 
> >> hists__collapse_resort() is called, and there's also no progress bar for 
> >> the hists__collapse_resort() pass, so to the user it all looks like a 
> >> deadlock.
> >> 
> >> So there's at least two bugs here:
> >> 
> >>   - the bad overhead in hists__collapse_resort()
> >> 
> >>   - bad usability if hists__collapse_resort() takes more than 1 second to finish
> >
> > Also IIUC, collapsing/merging hists is only used for comm hists merging, due to
> > set_task_comm after exec.
> >
> > Perhaps we can do better to anticipate the comm of a process based on tid/pid, fork
> > and comm events? This way we can avoid late collapses/merges. In the best case we
> > could get rid of collapses entirely and that would be some nice fresh air for util/hist.c
> >
> > And ideally, the comm should be associated to a lifetime as a thread can change
> > its comm anytime.
> 
> Right.  I also thought about why the separate collapsing stage is
> needed.  Maybe we can collect hist entries that have the same comm at
> insert time.  One problem I can imagine is that the target thread
> changes its comm after some hist entries have already been collected.
> In this case we should look up another thread which has the same old
> comm and pass the entries to it.  But we don't have information about
> which entries belong to a given thread, so for now it would require a
> full traversal of the hist entries.  If we add that info to threads, I
> think we can get rid of collapses entirely.

Right, that's exactly what I'm working on. I should have something ready soon.

Thanks.
 
> Thanks,
> Namhyung

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [GIT PULL] perf changes for v3.12
  2013-09-09 10:05   ` Peter Zijlstra
@ 2013-09-10 11:28     ` Stephane Eranian
  2013-09-10 11:53       ` PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12) Ingo Molnar
  0 siblings, 1 reply; 36+ messages in thread
From: Stephane Eranian @ 2013-09-10 11:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Ingo Molnar, Linux Kernel Mailing List,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Andi Kleen

Hi,


And what was the perf record command line for this crash?


On Mon, Sep 9, 2013 at 12:05 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Sat, Sep 07, 2013 at 07:17:28PM -0700, Linus Torvalds wrote:
>> This is new for me, but I suspect it is more related to the new
>> Haswell CPU I have than necessarily the 3.12 perf pull request.
>>
>> Regardless, nothing bad happened, but my dmesg has this in it:
>>
>>    Unexpected number of pebs records 10
>>
>> when I was profiling some git workloads. Full trace appended.
>
>>  ------------[ cut here ]------------
>>  WARNING: CPU: 1 PID: 3270 at
>> arch/x86/kernel/cpu/perf_event_intel_ds.c:1003
>> intel_pmu_drain_pebs_hsw+0x91/0xa0()
>>  Unexpected number of pebs records 10
>>  Modules linked in: fuse ip6t_REJECT nf_conntrack_ipv4
>> nf_conntrack_ipv6 nf_defrag_ipv4 nf_defrag_ipv6 xt_conntrack
>> ip6table_filter nf_conntrack ip6_tables snd_hda_codec_realtek
>> snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hwdep snd_seq
>> snd_seq_device snd_pcm e1000e serio_raw snd_page_alloc snd_timer ptp
>> lpc_ich snd i2c_i801 pps_core mfd_core soundcore uinput nfsd
>> auth_rpcgss nfs_acl lockd sunrpc binfmt_misc hid_logitech_dj i915
>> i2c_algo_bit drm_kms_helper drm i2c_core video
>>  CPU: 1 PID: 3270 Comm: git Not tainted 3.11.0-07749-ge7d33bb5ea82-dirty #4
>>  Hardware name:                  /DH87RL, BIOS
>> RLH8710H.86A.0317.2013.0426.1724 04/26/2013
>>   0000000000000009 ffff88041fa46ca0 ffffffff815fc637 ffff88041fa46ce8
>>   ffff88041fa46cd8 ffffffff81051e78 ffff88040f42f000 ffff88040f42f780
>>   ffff88041fa46ef8 ffffffff81c15860 0000000000000002 ffff88041fa46d38
>>  Call Trace:
>>   <NMI>  [<ffffffff815fc637>] dump_stack+0x45/0x56
>>   [<ffffffff81051e78>] warn_slowpath_common+0x78/0xa0
>>   [<ffffffff81051ee7>] warn_slowpath_fmt+0x47/0x50
>>   [<ffffffff8101b051>] intel_pmu_drain_pebs_hsw+0x91/0xa0
>>   [<ffffffff8101c5d0>] intel_pmu_handle_irq+0x210/0x390
>>   [<ffffffff81604deb>] perf_event_nmi_handler+0x2b/0x50
>>   [<ffffffff81604670>] nmi_handle.isra.3+0x80/0x180
>>   [<ffffffff81604840>] do_nmi+0xd0/0x310
>>   [<ffffffff81603d37>] end_repeat_nmi+0x1e/0x2e
>>   <<EOE>>  [<ffffffff810167df>] perf_events_lapic_init+0x2f/0x40
>>   [<ffffffff81016a50>] x86_pmu_enable+0x260/0x310
>>   [<ffffffff81111d87>] perf_pmu_enable+0x27/0x30
>>   [<ffffffff81112140>] perf_event_context_sched_in+0x80/0xc0
>>   [<ffffffff811127eb>] __perf_event_task_sched_in+0x16b/0x180
>>   [<ffffffff8107c300>] finish_task_switch+0x70/0xa0
>>   [<ffffffff81600f48>] __schedule+0x368/0x7c0
>>   [<ffffffff816013c4>] schedule+0x24/0x70
>>   [<ffffffff816038a4>] retint_careful+0xd/0x21
>>  ---[ end trace 52dad7a9d8d96b5f ]---
>
> So I'm not entirely sure how this can happen. We do _not_ set
> ds->pebs_event_reset[] so once a PEBS assist happens we do not
> automatically re-engage. We wait for the PMI to happen, consume the data
> and reprogram the hardware.
>
> We allow for up to x86_pmu.max_pebs_events (4 on hsw afaik) of PEBS
> entries to be in the buffer so each PEBS capable counter can trigger the
> assist once while we 'wait' for the PMI to hit.
>
> There's a little 'funny' with the hsw PEBS code in that it doesn't
> unconditionally clear the buffer in intel_pmu_drain_pebs_hsw() but that
> would only make a difference if n actually was < 0 -- again something
> that shouldn't ever happen.
>
> That said, there's two peculiar things about your report. First and
> foremost HSW; I'm not at all sure how it deals with a transaction being
> undone if a PEBS assist had happened during it. Does it reset the
> hardware but not remove the data? If so, this could cause excess
> entries.
>
> Stephane, Andi, is this at all possible? Section 18.10.5.1 (June 2013)
> seems to suggest not; the way I read it PEBS assist will cause an abort
> (and the PMI certainly would) and we'll process the entire thing outside
> of transaction context.
>
> That said, 18.10.5.1 mentions that entries from inside a transaction
> will only have 0xb0 and 0xb8 aka perf_record::real_ip and
> perf_record::tsx_tuning set, I don't think __intel_pmu_drain_pebs_nhm()
> will do the right thing for them. In particular, when no
> perf_record::status bit is set we'll drop the event on the ground.
>
> Dropping them means we do not do intel_pmu_save_and_restart() to
> reprogram the hardware and the counter would get 'stuck'. Then again,
> without a status bit we have no idea what event to re-program so we're
> pretty much up shit creek without no stinkin' paddle.
>
> Another possibility is that the meaning of pebs_record::status has
> changed -- it was 'broken' and Peggy agreed and said she would look into
> it, not sure this has been 'fixed' but I couldn't find anything on it.
> That said, I think the code always assumed it was 'sane' so it having
> been fixed shouldn't matter.
>
> Secondly you seem to have triggered the nested NMI replay path. I'm not
> immediately sure how that could have messed up the PEBS state, but it
> sure is a complication.
>
> I think we can do something like the below as a pure code reduction --
> completely untested:
>
> ---
>  arch/x86/kernel/cpu/perf_event_intel_ds.c | 109 +++++++++---------------------
>  1 file changed, 32 insertions(+), 77 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
> index 63438aa..bfda50e 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
> @@ -178,20 +178,14 @@ struct pebs_record_nhm {
>         u64 status, dla, dse, lat;
>  };
>
> -/*
> - * Same as pebs_record_nhm, with two additional fields.
> - */
>  struct pebs_record_hsw {
> -       struct pebs_record_nhm nhm;
> -       /*
> -        * Real IP of the event. In the Intel documentation this
> -        * is called eventingrip.
> -        */
> -       u64 real_ip;
> -       /*
> -        * TSX tuning information field: abort cycles and abort flags.
> -        */
> -       u64 tsx_tuning;
> +       u64 flags, ip;
> +       u64 ax, bx, cx, dx;
> +       u64 si, di, bp, sp;
> +       u64 r8,  r9,  r10, r11;
> +       u64 r12, r13, r14, r15;
> +       u64 status, dla, dse, lat;
> +       u64 real_ip, tsx_tuning;
>  };
>
>  void init_debug_store_on_cpu(int cpu)
> @@ -789,12 +783,11 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
>                                    struct pt_regs *iregs, void *__pebs)
>  {
>         /*
> -        * We cast to pebs_record_nhm to get the load latency data
> -        * if extra_reg MSR_PEBS_LD_LAT_THRESHOLD used
> +        * We cast to the biggest pebs_record but are careful not to
> +        * unconditionally access the 'extra' entries.
>          */
>         struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
> -       struct pebs_record_nhm *pebs = __pebs;
> -       struct pebs_record_hsw *pebs_hsw = __pebs;
> +       struct pebs_record_hsw *pebs = __pebs;
>         struct perf_sample_data data;
>         struct pt_regs regs;
>         u64 sample_type;
> @@ -853,7 +846,7 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
>         regs.sp = pebs->sp;
>
>         if (event->attr.precise_ip > 1 && x86_pmu.intel_cap.pebs_format >= 2) {
> -               regs.ip = pebs_hsw->real_ip;
> +               regs.ip = pebs->real_ip;
>                 regs.flags |= PERF_EFLAGS_EXACT;
>         } else if (event->attr.precise_ip > 1 && intel_pmu_pebs_fixup_ip(&regs))
>                 regs.flags |= PERF_EFLAGS_EXACT;
> @@ -912,17 +905,34 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs)
>         __intel_pmu_pebs_event(event, iregs, at);
>  }
>
> -static void __intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, void *at,
> -                                       void *top)
> +static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
>  {
>         struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
>         struct debug_store *ds = cpuc->ds;
>         struct perf_event *event = NULL;
> +       void *at, *top;
>         u64 status = 0;
> -       int bit;
> +       int bit, n;
> +
> +       if (!x86_pmu.pebs_active)
> +               return;
> +
> +       at  = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
> +       top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index;
>
>         ds->pebs_index = ds->pebs_buffer_base;
>
> +       n = (top - at) / x86_pmu.pebs_record_size;
> +       if (n <= 0)
> +               return;
> +
> +       /*
> +        * Should not happen, we program the threshold at 1 and do not
> +        * set a reset value.
> +        */
> +       WARN_ONCE(n > x86_pmu.max_pebs_events,
> +                 "Unexpected number of pebs records %d\n", n);
> +
>         for (; at < top; at += x86_pmu.pebs_record_size) {
>                 struct pebs_record_nhm *p = at;
>
> @@ -950,61 +960,6 @@ static void __intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, void *at,
>         }
>  }
>
> -static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
> -{
> -       struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
> -       struct debug_store *ds = cpuc->ds;
> -       struct pebs_record_nhm *at, *top;
> -       int n;
> -
> -       if (!x86_pmu.pebs_active)
> -               return;
> -
> -       at  = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
> -       top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index;
> -
> -       ds->pebs_index = ds->pebs_buffer_base;
> -
> -       n = top - at;
> -       if (n <= 0)
> -               return;
> -
> -       /*
> -        * Should not happen, we program the threshold at 1 and do not
> -        * set a reset value.
> -        */
> -       WARN_ONCE(n > x86_pmu.max_pebs_events,
> -                 "Unexpected number of pebs records %d\n", n);
> -
> -       return __intel_pmu_drain_pebs_nhm(iregs, at, top);
> -}
> -
> -static void intel_pmu_drain_pebs_hsw(struct pt_regs *iregs)
> -{
> -       struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
> -       struct debug_store *ds = cpuc->ds;
> -       struct pebs_record_hsw *at, *top;
> -       int n;
> -
> -       if (!x86_pmu.pebs_active)
> -               return;
> -
> -       at  = (struct pebs_record_hsw *)(unsigned long)ds->pebs_buffer_base;
> -       top = (struct pebs_record_hsw *)(unsigned long)ds->pebs_index;
> -
> -       n = top - at;
> -       if (n <= 0)
> -               return;
> -       /*
> -        * Should not happen, we program the threshold at 1 and do not
> -        * set a reset value.
> -        */
> -       WARN_ONCE(n > x86_pmu.max_pebs_events,
> -                 "Unexpected number of pebs records %d\n", n);
> -
> -       return __intel_pmu_drain_pebs_nhm(iregs, at, top);
> -}
> -
>  /*
>   * BTS, PEBS probe and setup
>   */
> @@ -1039,7 +994,7 @@ void intel_ds_init(void)
>                 case 2:
>                         pr_cont("PEBS fmt2%c, ", pebs_type);
>                         x86_pmu.pebs_record_size = sizeof(struct pebs_record_hsw);
> -                       x86_pmu.drain_pebs = intel_pmu_drain_pebs_hsw;
> +                       x86_pmu.drain_pebs = intel_pmu_drain_pebs_nhm;
>                         break;
>
>                 default:

^ permalink raw reply	[flat|nested] 36+ messages in thread

* PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)
  2013-09-10 11:28     ` Stephane Eranian
@ 2013-09-10 11:53       ` Ingo Molnar
  2013-09-10 12:32         ` Stephane Eranian
  0 siblings, 1 reply; 36+ messages in thread
From: Ingo Molnar @ 2013-09-10 11:53 UTC (permalink / raw)
  To: eranian
  Cc: Peter Zijlstra, Linus Torvalds, Linux Kernel Mailing List,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Andi Kleen


* Stephane Eranian <eranian@googlemail.com> wrote:

> Hi,
> 
> 
> And what was the perf record command line for this crash?

AFAICS it wasn't a crash but the WARN_ON() in intel_pmu_drain_pebs_hsw(), 
at arch/x86/kernel/cpu/perf_event_intel_ds.c:1003.

        at  = (struct pebs_record_hsw *)(unsigned long)ds->pebs_buffer_base;
        top = (struct pebs_record_hsw *)(unsigned long)ds->pebs_index;

        n = top - at;
        if (n <= 0)
                return;

        /*
         * Should not happen, we program the threshold at 1 and do not
         * set a reset value.
         */
        WARN_ONCE(n > x86_pmu.max_pebs_events,
                  "Unexpected number of pebs records %d\n", n);

The command line Linus used was probably close to:

   perf record -e cycles:pp -g make -j64 bzImage

i.e. PEBS precise profiling, call chains, LBR is used to figure out the 
real instruction, but no '-a' per CPU profiling option, i.e. high 
frequency per task PMU context switching.

Note that AFAIK neither the kernel nor user-space used any TSX extensions, 
so this is the Haswell PMU in pure compatibility mode.

My (wild) guess is that unless all of us missed some subtle race in the 
PEBS code it's an (unknown?) erratum: the hardware got confused by the 
high frequency PMU switches, in this particular case where we got a new 
PMI right after a very short interval was programmed:

>>  Call Trace:
>>   <NMI>  [<ffffffff815fc637>] dump_stack+0x45/0x56
>>   [<ffffffff81051e78>] warn_slowpath_common+0x78/0xa0
>>   [<ffffffff81051ee7>] warn_slowpath_fmt+0x47/0x50
>>   [<ffffffff8101b051>] intel_pmu_drain_pebs_hsw+0x91/0xa0
>>   [<ffffffff8101c5d0>] intel_pmu_handle_irq+0x210/0x390
>>   [<ffffffff81604deb>] perf_event_nmi_handler+0x2b/0x50
>>   [<ffffffff81604670>] nmi_handle.isra.3+0x80/0x180
>>   [<ffffffff81604840>] do_nmi+0xd0/0x310
>>   [<ffffffff81603d37>] end_repeat_nmi+0x1e/0x2e
>>   <<EOE>>  [<ffffffff810167df>] perf_events_lapic_init+0x2f/0x40
>>   [<ffffffff81016a50>] x86_pmu_enable+0x260/0x310
>>   [<ffffffff81111d87>] perf_pmu_enable+0x27/0x30
>>   [<ffffffff81112140>] perf_event_context_sched_in+0x80/0xc0
>>   [<ffffffff811127eb>] __perf_event_task_sched_in+0x16b/0x180
>>   [<ffffffff8107c300>] finish_task_switch+0x70/0xa0
>>   [<ffffffff81600f48>] __schedule+0x368/0x7c0
>>   [<ffffffff816013c4>] schedule+0x24/0x70

Note that due to per task profiling the default (long, about 1 KHz) 
interval can get chopped up and can result in a very small period value 
being reprogrammed at PMU-sched-in time.

That kind of high-freq back-to-back activity could, in theory, confuse the 
PEBS hardware. Or the kernel :-)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)
  2013-09-10 11:53       ` PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12) Ingo Molnar
@ 2013-09-10 12:32         ` Stephane Eranian
  2013-09-10 12:42           ` Ramkumar Ramachandra
                             ` (2 more replies)
  0 siblings, 3 replies; 36+ messages in thread
From: Stephane Eranian @ 2013-09-10 12:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Linus Torvalds, Linux Kernel Mailing List,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Andi Kleen

Hi,

Ok, so I am able to reproduce the problem using a simpler
test case with a simple multithreaded program where
#threads >> #CPUs.

[ 2229.021934] WARNING: CPU: 6 PID: 17496 at
arch/x86/kernel/cpu/perf_event_intel_ds.c:1003
intel_pmu_drain_pebs_hsw+0xa8/0xc0()
[ 2229.021936] Unexpected number of pebs records 21

[ 2229.021966] Call Trace:
[ 2229.021967]  <NMI>  [<ffffffff8159dcd6>] dump_stack+0x46/0x58
[ 2229.021976]  [<ffffffff8108dfdc>] warn_slowpath_common+0x8c/0xc0
[ 2229.021979]  [<ffffffff8108e0c6>] warn_slowpath_fmt+0x46/0x50
[ 2229.021982]  [<ffffffff810646c8>] intel_pmu_drain_pebs_hsw+0xa8/0xc0
[ 2229.021986]  [<ffffffff810668f0>] intel_pmu_handle_irq+0x220/0x380
[ 2229.021991]  [<ffffffff810c1d35>] ? sched_clock_cpu+0xc5/0x120
[ 2229.021995]  [<ffffffff815a5a84>] perf_event_nmi_handler+0x34/0x60
[ 2229.021998]  [<ffffffff815a52b8>] nmi_handle.isra.3+0x88/0x180
[ 2229.022001]  [<ffffffff815a5490>] do_nmi+0xe0/0x330
[ 2229.022004]  [<ffffffff815a48f7>] end_repeat_nmi+0x1e/0x2e
[ 2229.022008]  [<ffffffff810652b3>] ? intel_pmu_pebs_enable_all+0x33/0x40
[ 2229.022011]  [<ffffffff810652b3>] ? intel_pmu_pebs_enable_all+0x33/0x40
[ 2229.022015]  [<ffffffff810652b3>] ? intel_pmu_pebs_enable_all+0x33/0x40
[ 2229.022016]  <<EOE>>  [<ffffffff810659f3>] intel_pmu_enable_all+0x23/0xa0
[ 2229.022021]  [<ffffffff8105ff84>] x86_pmu_enable+0x274/0x310
[ 2229.022025]  [<ffffffff81141927>] perf_pmu_enable+0x27/0x30
[ 2229.022029]  [<ffffffff81143219>] perf_event_context_sched_in+0x79/0xc0

Could be a HW race whereby the PEBS buffers of the two HT threads get mixed
up. I will add a couple more checks to verify that. The intr_thres should not
have changed. Yet it looks like we have a situation where the index is way
past the threshold.



On Tue, Sep 10, 2013 at 4:53 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Stephane Eranian <eranian@googlemail.com> wrote:
>
>> Hi,
>>
>>
>> And what was the perf record command line for this crash?
>
> AFAICS it wasn't a crash but the WARN_ON() in intel_pmu_drain_pebs_hsw(),
> at arch/x86/kernel/cpu/perf_event_intel_ds.c:1003.
>
>         at  = (struct pebs_record_hsw *)(unsigned long)ds->pebs_buffer_base;
>         top = (struct pebs_record_hsw *)(unsigned long)ds->pebs_index;
>
>         n = top - at;
>         if (n <= 0)
>                 return;
>
>         /*
>          * Should not happen, we program the threshold at 1 and do not
>          * set a reset value.
>          */
>         WARN_ONCE(n > x86_pmu.max_pebs_events,
>                   "Unexpected number of pebs records %d\n", n);
>
> The command line Linus used was probably close to:
>
>    perf record -e cycles:pp -g make -j64 bzImage
>
> i.e. PEBS precise profiling, call chains, LBR is used to figure out the
> real instruction, but no '-a' per CPU profiling option, i.e. high
> frequency per task PMU context switching.
>
> Note that AFAIK neither the kernel nor user-space used any TSX extensions,
> so this is the Haswell PMU in pure compatibility mode.
>
> My (wild) guess is that unless all of us missed some subtle race in the
> PEBS code it's an (unknown?) erratum: the hardware got confused by the
> high frequency PMU switches, in this particular case where we got a new
> PMI right after a very short interval was programmed:
>
>>>  Call Trace:
>>>   <NMI>  [<ffffffff815fc637>] dump_stack+0x45/0x56
>>>   [<ffffffff81051e78>] warn_slowpath_common+0x78/0xa0
>>>   [<ffffffff81051ee7>] warn_slowpath_fmt+0x47/0x50
>>>   [<ffffffff8101b051>] intel_pmu_drain_pebs_hsw+0x91/0xa0
>>>   [<ffffffff8101c5d0>] intel_pmu_handle_irq+0x210/0x390
>>>   [<ffffffff81604deb>] perf_event_nmi_handler+0x2b/0x50
>>>   [<ffffffff81604670>] nmi_handle.isra.3+0x80/0x180
>>>   [<ffffffff81604840>] do_nmi+0xd0/0x310
>>>   [<ffffffff81603d37>] end_repeat_nmi+0x1e/0x2e
>>>   <<EOE>>  [<ffffffff810167df>] perf_events_lapic_init+0x2f/0x40
>>>   [<ffffffff81016a50>] x86_pmu_enable+0x260/0x310
>>>   [<ffffffff81111d87>] perf_pmu_enable+0x27/0x30
>>>   [<ffffffff81112140>] perf_event_context_sched_in+0x80/0xc0
>>>   [<ffffffff811127eb>] __perf_event_task_sched_in+0x16b/0x180
>>>   [<ffffffff8107c300>] finish_task_switch+0x70/0xa0
>>>   [<ffffffff81600f48>] __schedule+0x368/0x7c0
>>>   [<ffffffff816013c4>] schedule+0x24/0x70
>
> Note that due to per task profiling the default (long, about 1 KHz)
> interval can get chopped up and can result in a very small period value
> being reprogrammed at PMU-sched-in time.
>
> That kind of high-freq back-to-back activity could, in theory, confuse the
> PEBS hardware. Or the kernel :-)
>
> Thanks,
>
>         Ingo

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)
  2013-09-10 12:32         ` Stephane Eranian
@ 2013-09-10 12:42           ` Ramkumar Ramachandra
  2013-09-10 12:51           ` Ramkumar Ramachandra
  2013-09-10 13:38           ` Ingo Molnar
  2 siblings, 0 replies; 36+ messages in thread
From: Ramkumar Ramachandra @ 2013-09-10 12:42 UTC (permalink / raw)
  To: eranian
  Cc: Ingo Molnar, Peter Zijlstra, Linus Torvalds,
	Linux Kernel Mailing List, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Andi Kleen

Stephane Eranian wrote:
> [ 2229.021966] Call Trace:
> [ 2229.021967]  <NMI>  [<ffffffff8159dcd6>] dump_stack+0x46/0x58
> [ 2229.021976]  [<ffffffff8108dfdc>] warn_slowpath_common+0x8c/0xc0
> [ 2229.021979]  [<ffffffff8108e0c6>] warn_slowpath_fmt+0x46/0x50
> [ 2229.021982]  [<ffffffff810646c8>] intel_pmu_drain_pebs_hsw+0xa8/0xc0
> [ 2229.021986]  [<ffffffff810668f0>] intel_pmu_handle_irq+0x220/0x380
> [ 2229.021991]  [<ffffffff810c1d35>] ? sched_clock_cpu+0xc5/0x120
> [ 2229.021995]  [<ffffffff815a5a84>] perf_event_nmi_handler+0x34/0x60
> [ 2229.021998]  [<ffffffff815a52b8>] nmi_handle.isra.3+0x88/0x180
> [ 2229.022001]  [<ffffffff815a5490>] do_nmi+0xe0/0x330
> [ 2229.022004]  [<ffffffff815a48f7>] end_repeat_nmi+0x1e/0x2e
> [ 2229.022008]  [<ffffffff810652b3>] ? intel_pmu_pebs_enable_all+0x33/0x40
> [ 2229.022011]  [<ffffffff810652b3>] ? intel_pmu_pebs_enable_all+0x33/0x40
> [ 2229.022015]  [<ffffffff810652b3>] ? intel_pmu_pebs_enable_all+0x33/0x40
> [ 2229.022016]  <<EOE>>  [<ffffffff810659f3>] intel_pmu_enable_all+0x23/0xa0
> [ 2229.022021]  [<ffffffff8105ff84>] x86_pmu_enable+0x274/0x310
> [ 2229.022025]  [<ffffffff81141927>] perf_pmu_enable+0x27/0x30
> [ 2229.022029]  [<ffffffff81143219>] perf_event_context_sched_in+0x79/0xc0

Unnecessary repetition of the same task on multiple cores seems to be
the issue, although I'm not sure. Haven't gone through mingo's trace
yet.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)
  2013-09-10 12:32         ` Stephane Eranian
  2013-09-10 12:42           ` Ramkumar Ramachandra
@ 2013-09-10 12:51           ` Ramkumar Ramachandra
  2013-09-10 12:55             ` Stephane Eranian
  2013-09-10 13:38           ` Ingo Molnar
  2 siblings, 1 reply; 36+ messages in thread
From: Ramkumar Ramachandra @ 2013-09-10 12:51 UTC (permalink / raw)
  To: eranian
  Cc: Ingo Molnar, Peter Zijlstra, Linus Torvalds,
	Linux Kernel Mailing List, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Andi Kleen

Stephane Eranian wrote:
> a simple multithreaded program where
> #threads >> #CPUs

To put it another way, does Intel's HT work for CPU intensive and IO
minimal tasks? I think HT assumes some amount of inefficient IO
coupled with pure CPU usage.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)
  2013-09-10 12:51           ` Ramkumar Ramachandra
@ 2013-09-10 12:55             ` Stephane Eranian
  2013-09-10 13:22               ` Ingo Molnar
  0 siblings, 1 reply; 36+ messages in thread
From: Stephane Eranian @ 2013-09-10 12:55 UTC (permalink / raw)
  To: Ramkumar Ramachandra
  Cc: Ingo Molnar, Peter Zijlstra, Linus Torvalds,
	Linux Kernel Mailing List, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Andi Kleen

On Tue, Sep 10, 2013 at 5:51 AM, Ramkumar Ramachandra
<artagnon@gmail.com> wrote:
> Stephane Eranian wrote:
>> a simple multithreaded program where
>> #threads >> #CPUs
>
> To put it another way, does Intel's HT work for CPU intensive and IO
> minimal tasks? I think HT assumes some amount of inefficient IO
> coupled with pure CPU usage.

But your remark has nothing to do with the PMU issue.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)
  2013-09-10 12:55             ` Stephane Eranian
@ 2013-09-10 13:22               ` Ingo Molnar
  0 siblings, 0 replies; 36+ messages in thread
From: Ingo Molnar @ 2013-09-10 13:22 UTC (permalink / raw)
  To: eranian
  Cc: Ramkumar Ramachandra, Peter Zijlstra, Linus Torvalds,
	Linux Kernel Mailing List, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Andi Kleen


* Stephane Eranian <eranian@googlemail.com> wrote:

> On Tue, Sep 10, 2013 at 5:51 AM, Ramkumar Ramachandra
> <artagnon@gmail.com> wrote:
> > Stephane Eranian wrote:
> >> a simple multithreaded program where
> >> #threads >> #CPUs
> >
> > To put it another way, does Intel's HT work for CPU intensive and IO
> > minimal tasks? I think HT assumes some amount of inefficient IO
> > coupled with pure CPU usage.
> 
> But your remark has nothing to do with the PMU issue.

To quote Al Viro:

  "One name: Mark V. Shaney..."

Update your email filters if it gets annoying.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)
  2013-09-10 12:32         ` Stephane Eranian
  2013-09-10 12:42           ` Ramkumar Ramachandra
  2013-09-10 12:51           ` Ramkumar Ramachandra
@ 2013-09-10 13:38           ` Ingo Molnar
  2013-09-10 14:15             ` Stephane Eranian
  2 siblings, 1 reply; 36+ messages in thread
From: Ingo Molnar @ 2013-09-10 13:38 UTC (permalink / raw)
  To: eranian
  Cc: Peter Zijlstra, Linus Torvalds, Linux Kernel Mailing List,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Andi Kleen


* Stephane Eranian <eranian@googlemail.com> wrote:

> Hi,
> 
> Ok, so I am able to reproduce the problem using a simpler
> test case with a simple multithreaded program where
> #threads >> #CPUs.

Does it go away if you use 'perf record --all-cpus'?

> [ 2229.021934] WARNING: CPU: 6 PID: 17496 at
> arch/x86/kernel/cpu/perf_event_intel_ds.c:1003
> intel_pmu_drain_pebs_hsw+0xa8/0xc0()
> [ 2229.021936] Unexpected number of pebs records 21
> 
> [ 2229.021966] Call Trace:
> [ 2229.021967]  <NMI>  [<ffffffff8159dcd6>] dump_stack+0x46/0x58
> [ 2229.021976]  [<ffffffff8108dfdc>] warn_slowpath_common+0x8c/0xc0
> [ 2229.021979]  [<ffffffff8108e0c6>] warn_slowpath_fmt+0x46/0x50
> [ 2229.021982]  [<ffffffff810646c8>] intel_pmu_drain_pebs_hsw+0xa8/0xc0
> [ 2229.021986]  [<ffffffff810668f0>] intel_pmu_handle_irq+0x220/0x380
> [ 2229.021991]  [<ffffffff810c1d35>] ? sched_clock_cpu+0xc5/0x120
> [ 2229.021995]  [<ffffffff815a5a84>] perf_event_nmi_handler+0x34/0x60
> [ 2229.021998]  [<ffffffff815a52b8>] nmi_handle.isra.3+0x88/0x180
> [ 2229.022001]  [<ffffffff815a5490>] do_nmi+0xe0/0x330
> [ 2229.022004]  [<ffffffff815a48f7>] end_repeat_nmi+0x1e/0x2e
> [ 2229.022008]  [<ffffffff810652b3>] ? intel_pmu_pebs_enable_all+0x33/0x40
> [ 2229.022011]  [<ffffffff810652b3>] ? intel_pmu_pebs_enable_all+0x33/0x40
> [ 2229.022015]  [<ffffffff810652b3>] ? intel_pmu_pebs_enable_all+0x33/0x40
> [ 2229.022016]  <<EOE>>  [<ffffffff810659f3>] intel_pmu_enable_all+0x23/0xa0
> [ 2229.022021]  [<ffffffff8105ff84>] x86_pmu_enable+0x274/0x310
> [ 2229.022025]  [<ffffffff81141927>] perf_pmu_enable+0x27/0x30
> [ 2229.022029]  [<ffffffff81143219>] perf_event_context_sched_in+0x79/0xc0
> 
> Could be a HW race whereby the PEBS of each HT threads get mixed up.

Yes, that seems plausible and would explain why the overrun is usually a 
small integer. We set up the DS with PEBS_BUFFER_SIZE == 4096, so with a 
record size of 192 bytes on HSW we should get index values of 0-21.

That fits within the indices range reported so far.

> [...] I will add a couple more checks to verify that. The intr_thres 
> should not have changed. Yet it looks like we have a situation where the 
> index is way past the threshold.

Btw., it would also be nice to add a check of ds->pebs_index against 
ds->pebs_absolute_maximum, to make sure the PEBS record index never goes 
outside the DS area. I.e. to protect against random corruption.

Right now we do only half a check:

        n = top - at;
        if (n <= 0)
                return;

this still allows an upwards overflow. We check x86_pmu.max_pebs_events 
but then let it continue:

        WARN_ONCE(n > x86_pmu.max_pebs_events,
                  "Unexpected number of pebs records %d\n", n);

        return __intel_pmu_drain_pebs_nhm(iregs, at, top);

Instead it should be something more robust, like:

	if (WARN_ONCE(n > max ...)) {
		/* Drain the PEBS buffer: */
		ds->pebs_index = ds->pebs_buffer_base;
		return;
	}

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)
  2013-09-10 13:38           ` Ingo Molnar
@ 2013-09-10 14:15             ` Stephane Eranian
  2013-09-10 14:29               ` Ingo Molnar
  2013-09-10 15:28               ` Peter Zijlstra
  0 siblings, 2 replies; 36+ messages in thread
From: Stephane Eranian @ 2013-09-10 14:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Linus Torvalds, Linux Kernel Mailing List,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Andi Kleen

On Tue, Sep 10, 2013 at 6:38 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Stephane Eranian <eranian@googlemail.com> wrote:
>
>> Hi,
>>
>> Ok, so I am able to reproduce the problem using a simpler
>> test case with a simple multithreaded program where
>> #threads >> #CPUs.
>
> Does it go away if you use 'perf record --all-cpus'?
>
Haven't tried that yet.

But I verified the DS pointers:
init:
CPU6 pebs base=ffff8808262de000 index=ffff8808262de000
intr=ffff8808262de0c0 max=ffff8808262defc0
crash:
CPU6 pebs base=ffff8808262de000 index=ffff8808262de9c0
intr=ffff8808262de0c0 max=ffff8808262defc0

Neither the base nor the max are modified.
The index simply goes beyond the threshold, but that by itself is not a bug.
It is 12 records past the threshold of 1, so 13 in total in my new crash report.

Two things to try:
- measure only one thread/core
- move the threshold a bit farther away (to get 2 or 3 entries)

The threshold is where to generate the interrupt. It does not mean where
to stop PEBS recording. So it is possible that on HSW we may get into a
situation where it takes time to get to the handler to stop the PMU. I
don't know how, given that we use NMI. Well, unless we were already
servicing an NMI at the time. But given that we stop the PMU almost
immediately in the handler, I don't see how that would be possible.
The other oddity on HSW is that we clear the NMI on entry to the handler
and not at the end. I have never gotten a good explanation as to why that
was necessary. So maybe it is related...





>> [ 2229.021934] WARNING: CPU: 6 PID: 17496 at
>> arch/x86/kernel/cpu/perf_event_intel_ds.c:1003
>> intel_pmu_drain_pebs_hsw+0xa8/0xc0()
>> [ 2229.021936] Unexpected number of pebs records 21
>>
>> [ 2229.021966] Call Trace:
>> [ 2229.021967]  <NMI>  [<ffffffff8159dcd6>] dump_stack+0x46/0x58
>> [ 2229.021976]  [<ffffffff8108dfdc>] warn_slowpath_common+0x8c/0xc0
>> [ 2229.021979]  [<ffffffff8108e0c6>] warn_slowpath_fmt+0x46/0x50
>> [ 2229.021982]  [<ffffffff810646c8>] intel_pmu_drain_pebs_hsw+0xa8/0xc0
>> [ 2229.021986]  [<ffffffff810668f0>] intel_pmu_handle_irq+0x220/0x380
>> [ 2229.021991]  [<ffffffff810c1d35>] ? sched_clock_cpu+0xc5/0x120
>> [ 2229.021995]  [<ffffffff815a5a84>] perf_event_nmi_handler+0x34/0x60
>> [ 2229.021998]  [<ffffffff815a52b8>] nmi_handle.isra.3+0x88/0x180
>> [ 2229.022001]  [<ffffffff815a5490>] do_nmi+0xe0/0x330
>> [ 2229.022004]  [<ffffffff815a48f7>] end_repeat_nmi+0x1e/0x2e
>> [ 2229.022008]  [<ffffffff810652b3>] ? intel_pmu_pebs_enable_all+0x33/0x40
>> [ 2229.022011]  [<ffffffff810652b3>] ? intel_pmu_pebs_enable_all+0x33/0x40
>> [ 2229.022015]  [<ffffffff810652b3>] ? intel_pmu_pebs_enable_all+0x33/0x40
>> [ 2229.022016]  <<EOE>>  [<ffffffff810659f3>] intel_pmu_enable_all+0x23/0xa0
>> [ 2229.022021]  [<ffffffff8105ff84>] x86_pmu_enable+0x274/0x310
>> [ 2229.022025]  [<ffffffff81141927>] perf_pmu_enable+0x27/0x30
>> [ 2229.022029]  [<ffffffff81143219>] perf_event_context_sched_in+0x79/0xc0
>>
>> Could be a HW race whereby the PEBS of each HT threads get mixed up.
>
> Yes, that seems plausible and would explain why the overrun is usually a
> small integer. We set up the DS with PEBS_BUFFER_SIZE == 4096, so with a
> record size of 192 bytes on HSW we should get index values of 0-21.
>
> That fits within the indices range reported so far.
>
>> [...] I will add a couple more checks to verify that. The intr_thres
>> should not have changed. Yet it looks like we have a situation where the
>> index is way past the threshold.
>
> Btw., it would also be nice to add a check of ds->pebs_index against
> ds->pebs_absolute_maximum, to make sure the PEBS record index never goes
> outside the DS area. I.e. to protect against random corruption.
>
> Right now we do only half a check:
>
>         n = top - at;
>         if (n <= 0)
>                 return;
>
> this still allows an upwards overflow. We check x86_pmu.max_pebs_events
> but then let it continue:
>
>         WARN_ONCE(n > x86_pmu.max_pebs_events,
>                   "Unexpected number of pebs records %d\n", n);
>
>         return __intel_pmu_drain_pebs_nhm(iregs, at, top);
>
> Instead it should be something more robust, like:
>
>         if (WARN_ONCE(n > max ...)) {
>                 /* Drain the PEBS buffer: */
>                 ds->pebs_index = ds->pebs_buffer_base;
>                 return;
>         }
>
> Thanks,
>
>         Ingo

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)
  2013-09-10 14:15             ` Stephane Eranian
@ 2013-09-10 14:29               ` Ingo Molnar
  2013-09-10 14:34                 ` Stephane Eranian
  2013-09-10 15:28               ` Peter Zijlstra
  1 sibling, 1 reply; 36+ messages in thread
From: Ingo Molnar @ 2013-09-10 14:29 UTC (permalink / raw)
  To: eranian
  Cc: Peter Zijlstra, Linus Torvalds, Linux Kernel Mailing List,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Andi Kleen


* Stephane Eranian <eranian@googlemail.com> wrote:

> On Tue, Sep 10, 2013 at 6:38 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > * Stephane Eranian <eranian@googlemail.com> wrote:
> >
> >> Hi,
> >>
> >> Ok, so I am able to reproduce the problem using a simpler
> >> test case with a simple multithreaded program where
> >> #threads >> #CPUs.
> >
> > Does it go away if you use 'perf record --all-cpus'?
> >
> Haven't tried that yet.
> 
> But I verified the DS pointers:
> init:
> CPU6 pebs base=ffff8808262de000 index=ffff8808262de000
> intr=ffff8808262de0c0 max=ffff8808262defc0
> crash:
> CPU6 pebs base=ffff8808262de000 index=ffff8808262de9c0
> intr=ffff8808262de0c0 max=ffff8808262defc0
> 
> Neither the base nor the max are modified.
> The index simply goes beyond the threshold but that's not a bug.
> It is 12 after the threshold of 1, so total 13 is my new crash report.
> 
> Two things to try:
> - measure only one thread/core
> - move the threshold a bit farther away (to get 2 or 3 entries)
> 
> The threshold is where to generate the interrupt. It does not mean where 
> to stop PEBS recording. So it is possible that in HSW, we may get into a 
> situation where it takes time to get to the handler to stop the PMU. I 
> don't know how given we use NMI. Well, unless we were already servicing 
> an NMI at the time. But given that we stop the PMU almost immediately in 
>> the handler, I don't see how that would be possible. The other oddity in 
>> HSW is that we clear the NMI on entry to the handler and not at the end. 
>> I have never gotten a good explanation as to why that was necessary. So 
> maybe it is related...

Do you mean:

        if (!x86_pmu.late_ack)
                apic_write(APIC_LVTPC, APIC_DM_NMI);

AFAICS that means the opposite: that we clear the NMI late, i.e. shortly 
before return, after we've processed the PMU.

Do the symptoms change if you remove the x86_pmu.late_ack setting line 
from:

        case 60: /* Haswell Client */
        case 70:
        case 71:
        case 63:
        case 69:
                x86_pmu.late_ack = true;

?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)
  2013-09-10 14:29               ` Ingo Molnar
@ 2013-09-10 14:34                 ` Stephane Eranian
  2013-09-10 17:14                   ` Ingo Molnar
  0 siblings, 1 reply; 36+ messages in thread
From: Stephane Eranian @ 2013-09-10 14:34 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Linus Torvalds, Linux Kernel Mailing List,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Andi Kleen

On Tue, Sep 10, 2013 at 7:29 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Stephane Eranian <eranian@googlemail.com> wrote:
>
>> On Tue, Sep 10, 2013 at 6:38 AM, Ingo Molnar <mingo@kernel.org> wrote:
>> >
>> > * Stephane Eranian <eranian@googlemail.com> wrote:
>> >
>> >> Hi,
>> >>
>> >> Ok, so I am able to reproduce the problem using a simpler
>> >> test case with a simple multithreaded program where
>> >> #threads >> #CPUs.
>> >
>> > Does it go away if you use 'perf record --all-cpus'?
>> >
>> Haven't tried that yet.
>>
>> But I verified the DS pointers:
>> init:
>> CPU6 pebs base=ffff8808262de000 index=ffff8808262de000
>> intr=ffff8808262de0c0 max=ffff8808262defc0
>> crash:
>> CPU6 pebs base=ffff8808262de000 index=ffff8808262de9c0
>> intr=ffff8808262de0c0 max=ffff8808262defc0
>>
>> Neither the base nor the max are modified.
>> The index simply goes beyond the threshold but that's not a bug.
>> It is 12 after the threshold of 1, so total 13 is my new crash report.
>>
>> Two things to try:
>> - measure only one thread/core
>> - move the threshold a bit farther away (to get 2 or 3 entries)
>>
>> The threshold is where to generate the interrupt. It does not mean where
>> to stop PEBS recording. So it is possible that in HSW, we may get into a
>> situation where it takes time to get to the handler to stop the PMU. I
>> don't know how given we use NMI. Well, unless we were already servicing
>> an NMI at the time. But given that we stop the PMU almost immediately in
>> the handler, I don't see how that would be possible. The other oddity in
>> HSW is that we clear the NMI on entry to the handler and not at the end.
>> I have never gotten a good explanation as to why that was necessary. So
>> maybe it is related...
>
> Do you mean:
>
>         if (!x86_pmu.late_ack)
>                 apic_write(APIC_LVTPC, APIC_DM_NMI);
>
> AFAICS that means the opposite: that we clear the NMI late, i.e. shortly
> before return, after we've processed the PMU.
>
Yeah, the opposite, I got confused.

Let me try reverting that.
Also curious about the influence of the LBR here.

> Do the symptoms change if you remove the x86_pmu.late_ack setting line
> from:
>
>         case 60: /* Haswell Client */
>         case 70:
>         case 71:
>         case 63:
>         case 69:
>                 x86_pmu.late_ack = true;
>
> ?
>
> Thanks,
>
>         Ingo

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)
  2013-09-10 14:15             ` Stephane Eranian
  2013-09-10 14:29               ` Ingo Molnar
@ 2013-09-10 15:28               ` Peter Zijlstra
  2013-09-10 16:14                 ` Stephane Eranian
  1 sibling, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2013-09-10 15:28 UTC (permalink / raw)
  To: eranian
  Cc: Ingo Molnar, Linus Torvalds, Linux Kernel Mailing List,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Andi Kleen

On Tue, Sep 10, 2013 at 07:15:19AM -0700, Stephane Eranian wrote:
> The threshold is where to generate the interrupt. It does not mean
> where to stop PEBS recording.

It does, since we don't set a reset value. So once a PEBS assist
happens, that counter stops until we reprogram it in the PMI.

> So it is possible that in HSW, we may
> get into a situation where it takes time to get to the handler to stop
> the PMU. I don't know how given we use NMI. Well, unless we were
> already servicing an NMI at the time. But given that we stop the PMU
> almost immediately in the handler, I don't see how that would be
> possible.  The other oddity in HSW is that we clear the NMI on entry
> to the handler and not at the end. I have never gotten a good explanation
> as to why that was necessary. So maybe it is related...

Even if the PMI was delayed that counter would never do another entry.
So I really don't see how we can have more than nr_counters PEBS entries
in the buffer.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)
  2013-09-10 15:28               ` Peter Zijlstra
@ 2013-09-10 16:14                 ` Stephane Eranian
  0 siblings, 0 replies; 36+ messages in thread
From: Stephane Eranian @ 2013-09-10 16:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Linus Torvalds, Linux Kernel Mailing List,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Andi Kleen

On Tue, Sep 10, 2013 at 5:28 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Sep 10, 2013 at 07:15:19AM -0700, Stephane Eranian wrote:
>> The threshold is where to generate the interrupt. It does not mean
>> where to stop PEBS recording.
>
> It does, since we don't set a reset value. So once a PEBS assist
> happens, that counter stops until we reprogram it in the PMI.
>
I agree with you.

I am curious to see what's in those entries. If they are all zeroes, then
we know only the pointer is bogus. Let me add some instrumentation
to check that.

>> So it is possible that in HSW, we may
>> get into a situation where it takes time to get to the handler to stop
>> the PMU. I don't know how given we use NMI. Well, unless we were
>> already servicing an NMI at the time. But given that we stop the PMU
>> almost immediately in the handler, I don't see how that would be
>> possible.  The other oddity in HSW is that we clear the NMI on entry
>> to the handler and not at the end. I have never gotten a good explanation
>> as to why that was necessary. So maybe it is related...
>
> Even if the PMI was delayed that counter would never do another entry.
> So I really don't see how we can have more than nr_counters PEBS entries
> in the buffer.

I agree.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)
  2013-09-10 14:34                 ` Stephane Eranian
@ 2013-09-10 17:14                   ` Ingo Molnar
  2013-09-16 11:07                     ` Stephane Eranian
  0 siblings, 1 reply; 36+ messages in thread
From: Ingo Molnar @ 2013-09-10 17:14 UTC (permalink / raw)
  To: eranian
  Cc: Peter Zijlstra, Linus Torvalds, Linux Kernel Mailing List,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Andi Kleen


* Stephane Eranian <eranian@googlemail.com> wrote:

> On Tue, Sep 10, 2013 at 7:29 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > * Stephane Eranian <eranian@googlemail.com> wrote:
> >
> >> On Tue, Sep 10, 2013 at 6:38 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >> >
> >> > * Stephane Eranian <eranian@googlemail.com> wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >> Ok, so I am able to reproduce the problem using a simpler
> >> >> test case with a simple multithreaded program where
> >> >> #threads >> #CPUs.
> >> >
> >> > Does it go away if you use 'perf record --all-cpus'?
> >> >
> >> Haven't tried that yet.
> >>
> >> But I verified the DS pointers:
> >> init:
> >> CPU6 pebs base=ffff8808262de000 index=ffff8808262de000
> >> intr=ffff8808262de0c0 max=ffff8808262defc0
> >> crash:
> >> CPU6 pebs base=ffff8808262de000 index=ffff8808262de9c0
> >> intr=ffff8808262de0c0 max=ffff8808262defc0
> >>
> >> Neither the base nor the max are modified.
> >> The index simply goes beyond the threshold but that's not a bug.
> >> It is 12 after the threshold of 1, so total 13 is my new crash report.
> >>
> >> Two things to try:
> >> - measure only one thread/core
> >> - move the threshold a bit farther away (to get 2 or 3 entries)
> >>
> >> The threshold is where to generate the interrupt. It does not mean where
> >> to stop PEBS recording. So it is possible that in HSW, we may get into a
> >> situation where it takes time to get to the handler to stop the PMU. I
> >> don't know how given we use NMI. Well, unless we were already servicing
> >> an NMI at the time. But given that we stop the PMU almost immediately in
> >> the handler, I don't see how that would be possible. The other oddity in
> >> HSW is that we clear the NMI on entry to the handler and not at the end.
> >> I have never gotten a good explanation as to why that was necessary. So
> >> maybe it is related...
> >
> > Do you mean:
> >
> >         if (!x86_pmu.late_ack)
> >                 apic_write(APIC_LVTPC, APIC_DM_NMI);
> >
> > AFAICS that means the opposite: that we clear the NMI late, i.e. shortly
> > before return, after we've processed the PMU.
> >
> Yeah, the opposity, I got confused.
> 
> Let me try reverting that.
> Also curious about the influence of the LBR here.

You could exclude any LBR interaction by doing tests with "-e cycles:p".

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)
  2013-09-10 17:14                   ` Ingo Molnar
@ 2013-09-16 11:07                     ` Stephane Eranian
  2013-09-16 15:41                       ` Ingo Molnar
  0 siblings, 1 reply; 36+ messages in thread
From: Stephane Eranian @ 2013-09-16 11:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Linus Torvalds, Linux Kernel Mailing List,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Andi Kleen

Hi,

Some updates on this problem.
I have been running tests all weekend long on my HSW.
I can reproduce the problem. What I know:

- It is not linked with callchain
- The extra entries are valid
- The reset values are still zeroes
- The problem does not happen on SNB with the same test case
- The PMU state looks sane when that happens.
- The problem occurs even when restricting to one CPU/core (taskset -c 0-3)

So it seems like the threshold is ignored. But I don't understand where
the reset values are coming from. So it looks more like a bug in micro-code
where under certain circumstances multiple entries get written.

[132537.622177] Unexpected number of pebs records 5
[132537.622177] CPU0 base=ffff88023681a000 index=ffff88023681a000
intr=ffff88023681a0c0 max=ffff88023681afc0
[132537.622181] CPU0 0 p=ffff88023681a000 ovf=1 rip=c3830b
[132537.622182] CPU0 1 p=ffff88023681a0c0 ovf=1 rip=c3ab65
[132537.622183] CPU0 2 p=ffff88023681a180 ovf=1 rip=4f1b72
[132537.622184] CPU0 3 p=ffff88023681a240 ovf=1 rip=8dde78
[132537.622185] CPU0 4 p=ffff88023681a300 ovf=1 rip=4d24aa
[132537.622185] CPU0 0 reset=0
[132537.622186] CPU0 1 reset=0
[132537.622186] CPU0 2 reset=0
[132537.622187] CPU0 3 reset=0
[132537.622187] CPU0 4 reset=0
[132537.622188] CPU0 5 reset=0
[132537.622188] CPU0 6 reset=0
[132537.622189] CPU0 7 reset=0
[132537.622189]
[132537.622190] CPU#0: ctrl:       0000000000000000
[132537.622191] CPU#0: status:     0000000000000000
[132537.622191] CPU#0: overflow:   0000000000000000
[132537.622192] CPU#0: fixed:      00000000000000b0
[132537.622192] CPU#0: pebs:       0000000000000000
[132537.622193] CPU#0: active:     0000000200000001
[132537.622194] CPU#0:   gen-PMC0 ctrl:  0000000010c301c2
[132537.622194] CPU#0:   gen-PMC0 count: 0000fffffff214ea
[132537.622195] CPU#0:   gen-PMC0 left:  00000000000deb16
[132537.622196] CPU#0:   gen-PMC1 ctrl:  0000000000000000
[132537.622196] CPU#0:   gen-PMC1 count: 0000000000000000
[132537.622197] CPU#0:   gen-PMC1 left:  0000000000000000
[132537.622197] CPU#0:   gen-PMC2 ctrl:  0000000000000000
[132537.622198] CPU#0:   gen-PMC2 count: 0000000000000000
[132537.622198] CPU#0:   gen-PMC2 left:  0000000000000000
[132537.622199] CPU#0:   gen-PMC3 ctrl:  0000000000000000
[132537.622199] CPU#0:   gen-PMC3 count: 0000000000000000
[132537.622200] CPU#0:   gen-PMC3 left:  0000000000000000
[132537.622200] CPU#0: fixed-PMC0 count: 0000000000000000
[132537.622201] CPU#0: fixed-PMC1 count: 0000fff819c21c2c
[132537.622202] CPU#0: fixed-PMC2 count: 0000000000000000

Something must be happening with the interrupt or HT. I will disable
HT next and also disable the NMI watchdog.
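For reference, the record counts in these warnings can be cross-checked
against the DS pointers in the dumps: the count is just the distance from
base to index divided by the record stride. A toy sketch (the 0xc0 stride
matches the 0xc0-spaced addresses in the dumps; names are illustrative,
not the kernel's):

```c
#include <stdint.h>

/* Illustrative sketch: the number of PEBS records sitting in the DS
 * buffer is the pointer distance divided by the record stride.  0xc0 is
 * the HSW record size implied by the 0xc0-spaced entry addresses in the
 * dumps above. */
#define PEBS_RECORD_SIZE 0xc0

static int pebs_record_count(uint64_t base, uint64_t index)
{
	return (int)((index - base) / PEBS_RECORD_SIZE);
}
```

Plugging in the crash values quoted further down in this message
(base=ffff8808262de000, index=ffff8808262de9c0) gives 13, matching the
earlier report, while intr = base + 0xc0 corresponds to a threshold of a
single record.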


On Tue, Sep 10, 2013 at 7:14 PM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Stephane Eranian <eranian@googlemail.com> wrote:
>
>> On Tue, Sep 10, 2013 at 7:29 AM, Ingo Molnar <mingo@kernel.org> wrote:
>> >
>> > * Stephane Eranian <eranian@googlemail.com> wrote:
>> >
>> >> On Tue, Sep 10, 2013 at 6:38 AM, Ingo Molnar <mingo@kernel.org> wrote:
>> >> >
>> >> > * Stephane Eranian <eranian@googlemail.com> wrote:
>> >> >
>> >> >> Hi,
>> >> >>
>> >> >> Ok, so I am able to reproduce the problem using a simpler
>> >> >> test case with a simple multithreaded program where
>> >> >> #threads >> #CPUs.
>> >> >
>> >> > Does it go away if you use 'perf record --all-cpus'?
>> >> >
>> >> Haven't tried that yet.
>> >>
>> >> But I verified the DS pointers:
>> >> init:
>> >> CPU6 pebs base=ffff8808262de000 index=ffff8808262de000
>> >> intr=ffff8808262de0c0 max=ffff8808262defc0
>> >> crash:
>> >> CPU6 pebs base=ffff8808262de000 index=ffff8808262de9c0
>> >> intr=ffff8808262de0c0 max=ffff8808262defc0
>> >>
>> >> Neither the base nor the max are modified.
>> >> The index simply goes beyond the threshold but that's not a bug.
>> >> It is 12 past the threshold of 1, so 13 in total in my new crash report.
>> >>
>> >> Two things to try:
>> >> - measure only one thread/core
>> >> - move the threshold a bit farther away (to get 2 or 3 entries)
>> >>
>> >> The threshold is where to generate the interrupt. It does not mean where
>> >> to stop PEBS recording. So it is possible that in HSW, we may get into a
>> >> situation where it takes time to get to the handler to stop the PMU. I
>> >> don't know how given we use NMI. Well, unless we were already servicing
>> >> an NMI at the time. But given that we stop the PMU almost immediately in
>> >> the handler, I don't see how that would be possible. The other oddity in
>> >> HSW is that we clear the NMI on entry to the handler and not at the end.
>> >> I have never gotten a good explanation as to why that was necessary. So
>> >> maybe it is related...
>> >
>> > Do you mean:
>> >
>> >         if (!x86_pmu.late_ack)
>> >                 apic_write(APIC_LVTPC, APIC_DM_NMI);
>> >
>> > AFAICS that means the opposite: that we clear the NMI late, i.e. shortly
>> > before return, after we've processed the PMU.
>> >
>> Yeah, the opposite, I got confused.
>>
>> Let me try reverting that.
>> Also curious about the influence of the LBR here.
>
> You could exclude any LBR interaction by doing tests with "-e cycles:p".
>
> Thanks,
>
>         Ingo

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)
  2013-09-16 11:07                     ` Stephane Eranian
@ 2013-09-16 15:41                       ` Ingo Molnar
  2013-09-16 16:29                         ` Peter Zijlstra
  0 siblings, 1 reply; 36+ messages in thread
From: Ingo Molnar @ 2013-09-16 15:41 UTC (permalink / raw)
  To: eranian
  Cc: Peter Zijlstra, Linus Torvalds, Linux Kernel Mailing List,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Andi Kleen


* Stephane Eranian <eranian@googlemail.com> wrote:

> Hi,
> 
> Some updates on this problem.
> I have been running tests all week-end long on my HSW.
> I can reproduce the problem. What I know:
> 
> - It is not linked with callchain
> - The extra entries are valid
> - The reset values are still zeroes
> - The problem does not happen on SNB with the same test case
> - The PMU state looks sane when that happens.
> - The problem occurs even when restricting to one CPU/core (taskset -c 0-3)
> 
> So it seems like the threshold is ignored. But I don't understand where 
> there reset values are coming from. So it looks more like a bug in 
> micro-code where under certain circumstances multiple entries get 
> written.

Either multiple entries are written, or the PMI/NMI is not asserted as it 
should be?

> Something must be happening with the interrupt or HT. I will disable HT 
> next and also disable the NMI watchdog.

Yes, interaction with the NMI watchdog events might also be possible.

If it's truly just the threshold that is broken occasionally in a 
statistically insignificant manner then the bug is relatively benign and 
we could work it around in the kernel by ignoring excess entries.

In that case we should probably not annoy users with the scary kernel 
warning and instead increase a debug count somewhere so that it's still 
detectable.
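A minimal sketch of that idea, assuming a debug counter exposed somewhere
like debugfs (all names here are illustrative, not existing kernel
symbols):

```c
/* Sketch of the suggested workaround: when the drain finds more records
 * than were armed, process only the expected number and account for the
 * rest in a debug counter instead of emitting a scary kernel warning. */
static unsigned long pebs_excess_records;	/* would be a debugfs count */

static int pebs_clamp_records(int found, int expected)
{
	if (found > expected) {
		pebs_excess_records += found - expected;
		return expected;	/* drop the excess silently */
	}
	return found;
}
```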

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)
  2013-09-16 15:41                       ` Ingo Molnar
@ 2013-09-16 16:29                         ` Peter Zijlstra
  2013-09-17  7:00                           ` Ingo Molnar
  2013-09-23 15:25                           ` Stephane Eranian
  0 siblings, 2 replies; 36+ messages in thread
From: Peter Zijlstra @ 2013-09-16 16:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: eranian, Linus Torvalds, Linux Kernel Mailing List,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Andi Kleen

On Mon, Sep 16, 2013 at 05:41:46PM +0200, Ingo Molnar wrote:
> 
> * Stephane Eranian <eranian@googlemail.com> wrote:
> 
> > Hi,
> > 
> > Some updates on this problem.
> > I have been running tests all week-end long on my HSW.
> > I can reproduce the problem. What I know:
> > 
> > - It is not linked with callchain
> > - The extra entries are valid
> > - The reset values are still zeroes
> > - The problem does not happen on SNB with the same test case
> > - The PMU state looks sane when that happens.
> > - The problem occurs even when restricting to one CPU/core (taskset -c 0-3)
> > 
> > So it seems like the threshold is ignored. But I don't understand where 
> > there reset values are coming from. So it looks more like a bug in 
> > micro-code where under certain circumstances multiple entries get 
> > written.
> 
> Either multiple entries are written, or the PMI/NMI is not asserted as it 
> should be?

No, both :-)

> > Something must be happening with the interrupt or HT. I will disable HT 
> > next and also disable the NMI watchdog.
> 
> Yes, interaction with the NMI watchdog events might also be possible.
> 
> If it's truly just the threshold that is broken occasionally in a 
> statistically insignificant manner then the bug is relatively benign and 
> we could work it around in the kernel by ignoring excess entries.
> 
> In that case we should probably not annoy users with the scary kernel 
> warning and instead increase a debug count somewhere so that it's still 
> detectable.

It's not just a broken threshold. When a PEBS event happens it can re-arm
itself, but only if you program a non-zero RESET value. We don't do that,
so each counter should only ever fire once.

We must do this because PEBS is broken on NHM+: pebs_record::status is a
direct copy of the overflow status field at the time of the assist, and
if you use the RESET mechanism nothing will clear the status bits, so you
cannot demux the PEBS records back to the event that generated them.

Worse, since it's the overflow that arms the assist, and the assist
happens some undefined number of cycles after that event, it is possible
for another assist to happen first.

That is, suppose both CNT0 and CNT1 have PEBS enabled and CNT0 overflows
first; it is then possible to find the CNT1 entry first in the buffer,
with both of them having status := 0x03.

Complete and utter trainwreck.

This is why we have a threshold of 1 and use NMI for PMI even for pure
PEBS, it minimizes the complete clusterfuck described above.
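The demux problem can be sketched as follows: a record's copied status
word is matched against the enabled PEBS counters, and only a single set
bit allows unambiguous attribution (a simplified illustration, not the
actual kernel code):

```c
#include <stdint.h>

/* Returns the counter index a PEBS record belongs to, or -1 when the
 * copied status word flags more than one enabled PEBS counter and the
 * record cannot be attributed unambiguously. */
static int pebs_demux(uint64_t record_status, uint64_t pebs_enabled)
{
	uint64_t bits = record_status & pebs_enabled;

	if (bits && (bits & (bits - 1)) == 0)	/* exactly one bit set */
		return __builtin_ctzll(bits);
	return -1;
}
```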


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)
  2013-09-16 16:29                         ` Peter Zijlstra
@ 2013-09-17  7:00                           ` Ingo Molnar
  2013-09-23 15:25                           ` Stephane Eranian
  1 sibling, 0 replies; 36+ messages in thread
From: Ingo Molnar @ 2013-09-17  7:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: eranian, Linus Torvalds, Linux Kernel Mailing List,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Andi Kleen


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Sep 16, 2013 at 05:41:46PM +0200, Ingo Molnar wrote:
> > 
> > * Stephane Eranian <eranian@googlemail.com> wrote:
> > 
> > > Hi,
> > > 
> > > Some updates on this problem.
> > > I have been running tests all week-end long on my HSW.
> > > I can reproduce the problem. What I know:
> > > 
> > > - It is not linked with callchain
> > > - The extra entries are valid
> > > - The reset values are still zeroes
> > > - The problem does not happen on SNB with the same test case
> > > - The PMU state looks sane when that happens.
> > > - The problem occurs even when restricting to one CPU/core (taskset -c 0-3)
> > > 
> > > So it seems like the threshold is ignored. But I don't understand where 
> > > there reset values are coming from. So it looks more like a bug in 
> > > micro-code where under certain circumstances multiple entries get 
> > > written.
> > 
> > Either multiple entries are written, or the PMI/NMI is not asserted as it 
> > should be?
> 
> No, both :-)
> 
> > > Something must be happening with the interrupt or HT. I will disable HT 
> > > next and also disable the NMI watchdog.
> > 
> > Yes, interaction with the NMI watchdog events might also be possible.
> > 
> > If it's truly just the threshold that is broken occasionally in a 
> > statistically insignificant manner then the bug is relatively benign and 
> > we could work it around in the kernel by ignoring excess entries.
> > 
> > In that case we should probably not annoy users with the scary kernel 
> > warning and instead increase a debug count somewhere so that it's still 
> > detectable.
> 
> Its not just a broken threshold. When a PEBS event happens it can re-arm 
> itself but only if you program a RESET value !0. We don't do that, so 
> each counter should only ever fire once.
> 
> We must do this because PEBS is broken on NHM+ in that the 
> pebs_record::status is a direct copy of the overflow status field at 
> time of the assist and if you use the RESET thing nothing will clear the 
> status bits and you cannot demux the PEBS events back to the event that 
> generated them.
> 
> Worse, since its the overflow that arms the assist, and the assist 
> happens at some undefined amount of cycles after this event it is 
> possible for another assist to happen first.
> 
> That is, suppose both CNT0 and CNT1 have PEBS enabled and CNT0 overflows 
> first it is possible to find the CNT1 entry first in the buffer with 
> both of them having status := 0x03.
> 
> Complete and utter trainwreck.
> 
> This is why we have a threshold of 1 and use NMI for PMI even for pure 
> PEBS, it minimizes the complete clusterfuck described above.

What I mean is that, as per the observations, the problem seems to be
statistical: it happens only once every couple of million records. So, as
long as no memory is corrupted (the PEBS records don't go outside the DS
area) it could be ignored when it happens and still produce a valid,
usable profile.

( Btw., we might want to introduce an 'error' event passed to tools,
  which they could process in a soft, statistical manner: only warn the
  user if the erroneous events go beyond 1% or 5%, etc. Kernel warnings
  are really not the best fit for such purposes. )

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)
  2013-09-16 16:29                         ` Peter Zijlstra
  2013-09-17  7:00                           ` Ingo Molnar
@ 2013-09-23 15:25                           ` Stephane Eranian
  2013-09-23 15:33                             ` Peter Zijlstra
  1 sibling, 1 reply; 36+ messages in thread
From: Stephane Eranian @ 2013-09-23 15:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Linus Torvalds, Linux Kernel Mailing List,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Andi Kleen

On Mon, Sep 16, 2013 at 6:29 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Sep 16, 2013 at 05:41:46PM +0200, Ingo Molnar wrote:
>>
>> * Stephane Eranian <eranian@googlemail.com> wrote:
>>
>> > Hi,
>> >
>> > Some updates on this problem.
>> > I have been running tests all week-end long on my HSW.
>> > I can reproduce the problem. What I know:
>> >
>> > - It is not linked with callchain
>> > - The extra entries are valid
>> > - The reset values are still zeroes
>> > - The problem does not happen on SNB with the same test case
>> > - The PMU state looks sane when that happens.
>> > - The problem occurs even when restricting to one CPU/core (taskset -c 0-3)
>> >
>> > So it seems like the threshold is ignored. But I don't understand where
>> > there reset values are coming from. So it looks more like a bug in
>> > micro-code where under certain circumstances multiple entries get
>> > written.
>>
>> Either multiple entries are written, or the PMI/NMI is not asserted as it
>> should be?
>
> No, both :-)
>
>> > Something must be happening with the interrupt or HT. I will disable HT
>> > next and also disable the NMI watchdog.
>>
>> Yes, interaction with the NMI watchdog events might also be possible.
>>
>> If it's truly just the threshold that is broken occasionally in a
>> statistically insignificant manner then the bug is relatively benign and
>> we could work it around in the kernel by ignoring excess entries.
>>
>> In that case we should probably not annoy users with the scary kernel
>> warning and instead increase a debug count somewhere so that it's still
>> detectable.
>
> Its not just a broken threshold. When a PEBS event happens it can re-arm
> itself but only if you program a RESET value !0. We don't do that, so
> each counter should only ever fire once.
>
> We must do this because PEBS is broken on NHM+ in that the
> pebs_record::status is a direct copy of the overflow status field at
> time of the assist and if you use the RESET thing nothing will clear the
> status bits and you cannot demux the PEBS events back to the event that
> generated them.
>
Trying to understand this problem better. You are saying that when
sampling multiple PEBS events there is a problem if you allow more than
one record per PEBS buffer, because the overflow status is not reset
properly.

For instance: the first record is caused by counter 0 with
ovfl_status=0x1, then the counter is reset. Then, if counter 1 is the
cause of the next record, that record has ovfl_status=0x3 instead of
ovfl_status=0x2? Is that what you are saying?

If so then yes, I agree this is a serious bug and we need to have Intel
fix it.

> Worse, since its the overflow that arms the assist, and the assist
> happens at some undefined amount of cycles after this event it is
> possible for another assist to happen first.
>
> That is, suppose both CNT0 and CNT1 have PEBS enabled and CNT0 overflows
> first it is possible to find the CNT1 entry first in the buffer with
> both of them having status := 0x03.
>
> Complete and utter trainwreck.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)
  2013-09-23 15:25                           ` Stephane Eranian
@ 2013-09-23 15:33                             ` Peter Zijlstra
  2013-09-23 17:11                               ` Stephane Eranian
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2013-09-23 15:33 UTC (permalink / raw)
  To: eranian
  Cc: Ingo Molnar, Linus Torvalds, Linux Kernel Mailing List,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Andi Kleen

On Mon, Sep 23, 2013 at 05:25:19PM +0200, Stephane Eranian wrote:
> > Its not just a broken threshold. When a PEBS event happens it can re-arm
> > itself but only if you program a RESET value !0. We don't do that, so
> > each counter should only ever fire once.
> >
> > We must do this because PEBS is broken on NHM+ in that the
> > pebs_record::status is a direct copy of the overflow status field at
> > time of the assist and if you use the RESET thing nothing will clear the
> > status bits and you cannot demux the PEBS events back to the event that
> > generated them.
> >
> Trying to understand this problem better. You are saying that in case you
> are sampling multiple PEBS events there is a problem if you allow more
> than one record per PEBS buffer because the overflow status is not reset
> properly.

That is what I wrote; but I'm not entirely sure that's correct. I think
it will reset the overflow bits once it does an actual reset after the
PEBS assist triggers, but see below.

> For instance, if first record is caused by counter 0, ovfl_status=0x1,
> then counter
> is reset. Then, if counter 1 is the cause of the next record, then
> that record has the
> ovfl_status=0x3 instead of ovfl_status=0x2? Is that what you are saying?
> 
> If so then yes, I agree this is a serious bug and we need to have Intel fix it.

But there's still the case where with 2 counters you can get:

cnt0 overflows; sets status |= 1 << 0, arms PEBS0 assist
cnt1 overflows; sets status |= 1 << 1, arms PEBS1 assist

PEBS0 ready to trigger
PEBS1 ready to trigger

Cnt1 event -> PEBS1 trigger, writes entry with status := 0x03
Cnt0 event -> PEBS0 trigger, writes entry with status := 0x03

At which point you'll have 2 events with the same status overflow bits
in 'reverse' order.

If we'd set RESET, the second entry would have status := 0x01, which
would be unambiguous again. But we'd still not know where to place the
0x03 entry.

With more PEBSn counters enabled and a threshold > 1 the chance of
having such scenarios is greatly increased.

The threshold := 1 case tries to avoid these cases by getting them out
as fast as possible and hopefully avoiding the second trigger.
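The ordering hazard above can be modelled as a toy sequence: overflows
set bits in a shared status word, and each assist later snapshots that
whole word into its record (purely illustrative, no real hardware state
involved):

```c
#include <stdint.h>

/* Toy model: cnt0 overflows first, cnt1 second, but cnt1's assist
 * reaches the DS buffer first.  Each assist copies the shared overflow
 * status word, so both entries end up with status == 0x03 and neither
 * the bits nor the buffer order identify the originating counter. */
static int pebs_race_is_ambiguous(void)
{
	uint64_t global_status = 0;
	uint64_t rec[2];

	global_status |= 1ULL << 0;	/* cnt0 overflows, arms PEBS0 */
	global_status |= 1ULL << 1;	/* cnt1 overflows, arms PEBS1 */

	rec[0] = global_status;		/* PEBS1 assist writes first */
	rec[1] = global_status;		/* PEBS0 assist writes second */

	return rec[0] == 0x03 && rec[1] == 0x03;
}
```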

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)
  2013-09-23 15:33                             ` Peter Zijlstra
@ 2013-09-23 17:11                               ` Stephane Eranian
  2013-09-23 17:24                                 ` Peter Zijlstra
  0 siblings, 1 reply; 36+ messages in thread
From: Stephane Eranian @ 2013-09-23 17:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Linus Torvalds, Linux Kernel Mailing List,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Andi Kleen

On Mon, Sep 23, 2013 at 5:33 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Sep 23, 2013 at 05:25:19PM +0200, Stephane Eranian wrote:
>> > Its not just a broken threshold. When a PEBS event happens it can re-arm
>> > itself but only if you program a RESET value !0. We don't do that, so
>> > each counter should only ever fire once.
>> >
>> > We must do this because PEBS is broken on NHM+ in that the
>> > pebs_record::status is a direct copy of the overflow status field at
>> > time of the assist and if you use the RESET thing nothing will clear the
>> > status bits and you cannot demux the PEBS events back to the event that
>> > generated them.
>> >
>> Trying to understand this problem better. You are saying that in case you
>> are sampling multiple PEBS events there is a problem if you allow more
>> than one record per PEBS buffer because the overflow status is not reset
>> properly.
>
> That is what I wrote; but I'm not entire sure that's correct. I think it
> will reset the overflow bits once it does an actual reset after the PEBS
> assist triggers, but see below.
>
>> For instance, if first record is caused by counter 0, ovfl_status=0x1,
>> then counter
>> is reset. Then, if counter 1 is the cause of the next record, then
>> that record has the
>> ovfl_status=0x3 instead of ovfl_status=0x2? Is that what you are saying?
>>
>> If so then yes, I agree this is a serious bug and we need to have Intel fix it.
>
> But there's still the case where with 2 counters you can get:
>
> cnt0 overflows; sets status |= 1 << 0, arms PEBS0 assist
> cnt1 overflows; sets status |= 1 << 1, arms PEBS1 assist
>
> PEBS0 ready to trigger
> PEBS1 ready to trigger
>
> Cnt1 event -> PEBS1 trigger, writes entry with status := 0x03
> Cnt0 event -> PEBS0 trigger, writes entry with status := 0x03
>
> At which point you'll have 2 events with the same status overflow bits
> in 'reverse' order.
>
OK, so what you are saying is that the ovfl_status is not maintained
privately per counter but is shared among all PEBS counters by the
ucode. That's how you end up with leakage between counters like that.

But the other thing I remember is that if two PEBS events overflow at
the same time, PEBS only writes one record, with 2 bits set in the
ovfl_status field. There is no point in creating two, because the machine
state will be the same for both. The kernel would just need to dispatch
the same PEBS record to all the events that overflowed.

Now, your case looks like that, except this is not what happened. So
you're misled into believing both counters overflowed at the same time
when in reality they did not.

I'd like to have a test case where I could reproduce this.

> If we'd set RESET, the second entry would have status : 0x01, which
> would be unambiguous again. But we'd still not know where to place the
> 0x03 entry.
>
> With more PEBSn counters enabled and a threshold > 1 the chance of
> having such scenarios is greatly increased.
>
> The threshold := 1 case tries to avoid these cases by getting them out
> as fast as possible and hopefully avoiding the second trigger.

I understand. We need to verify that the HSW problem is not related to
having 2 events running: cycles:pp and cycles for the NMI watchdog. What
if both interrupt at the same time? Normally, I'd expect the PEBS code to
drain the buffer for the PEBS event and then the regular handler to run
for the NMI watchdog event. But obviously, it does not happen that way. I
will continue to investigate.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12)
  2013-09-23 17:11                               ` Stephane Eranian
@ 2013-09-23 17:24                                 ` Peter Zijlstra
  0 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2013-09-23 17:24 UTC (permalink / raw)
  To: eranian
  Cc: Ingo Molnar, Linus Torvalds, Linux Kernel Mailing List,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Andi Kleen

On Mon, Sep 23, 2013 at 07:11:21PM +0200, Stephane Eranian wrote:
> Ok so what you are saying is that the ovfl_status is not maintained private
> to each counter but shared among all PEBS counters by ucode. That's
> how you end up leaking between counters like that.

I only remember asking for clarification because the SDM isn't clear on
this subject; the answer was that it simply copies whatever is in
MSR_CORE_PERF_GLOBAL_STATUS.

I explained how this would be a problem and it was agreed this needed
fixing -- not sure if that ever happened.

> But the other thing I remember is that if two PEBS events overflow
> at the same time, PEBS only write one record with 2 bits set  in the
> ovfl_status field. No point in creating two because the machine state
> will be the same for both. The kernel would just need to dispatch the
> same PEBS record to all the events that overflowed.

Hurm.. that makes life more interesting still. The current code only
delivers the record to the event of the first set bit. Changing this
would be simple though.
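The simple change could look roughly like this: walk every set bit
instead of stopping at the first one, delivering the same record to each
flagged event (a sketch under that assumption, not the actual handler
code):

```c
#include <stdint.h>

/* Returns how many events a single PEBS record gets dispatched to when
 * it is delivered to every flagged counter rather than only the first
 * set bit.  In the real handler, perf_event_overflow() would be invoked
 * once per index found here. */
static int pebs_deliver_all(uint64_t record_status, uint64_t pebs_enabled)
{
	uint64_t bits = record_status & pebs_enabled;
	int delivered = 0;

	while (bits) {
		int idx = __builtin_ctzll(bits);
		(void)idx;		/* deliver record to event idx */
		bits &= bits - 1;	/* clear lowest set bit */
		delivered++;
	}
	return delivered;
}
```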

> Now, your case appears like that, except this is not what happened.
> So you're misled to believe both counter overflowed at the same time
> when they did not in reality.
> 
> I'd like to have a test case where I could reproduce this.

Agreed, I've never tried to actually reproduce this. I suppose it would
be easiest to trigger when the one event is very rare and controlled.

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2013-09-23 17:25 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-09-03 13:29 [GIT PULL] perf changes for v3.12 Ingo Molnar
2013-09-03 13:37 ` Arnaldo Carvalho de Melo
2013-09-03 13:43   ` Ingo Molnar
2013-09-03 17:02 ` Vince Weaver
2013-09-04 17:53 ` Linus Torvalds
2013-09-05 10:56   ` Ingo Molnar
2013-09-05 12:42     ` Frederic Weisbecker
2013-09-05 12:51       ` Ingo Molnar
2013-09-05 12:58         ` Frederic Weisbecker
2013-09-10  8:06       ` Namhyung Kim
2013-09-10 11:18         ` Frederic Weisbecker
2013-09-05 13:38 ` Ingo Molnar
2013-09-08  2:17 ` Linus Torvalds
2013-09-09 10:05   ` Peter Zijlstra
2013-09-10 11:28     ` Stephane Eranian
2013-09-10 11:53       ` PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re: [GIT PULL] perf changes for v3.12) Ingo Molnar
2013-09-10 12:32         ` Stephane Eranian
2013-09-10 12:42           ` Ramkumar Ramachandra
2013-09-10 12:51           ` Ramkumar Ramachandra
2013-09-10 12:55             ` Stephane Eranian
2013-09-10 13:22               ` Ingo Molnar
2013-09-10 13:38           ` Ingo Molnar
2013-09-10 14:15             ` Stephane Eranian
2013-09-10 14:29               ` Ingo Molnar
2013-09-10 14:34                 ` Stephane Eranian
2013-09-10 17:14                   ` Ingo Molnar
2013-09-16 11:07                     ` Stephane Eranian
2013-09-16 15:41                       ` Ingo Molnar
2013-09-16 16:29                         ` Peter Zijlstra
2013-09-17  7:00                           ` Ingo Molnar
2013-09-23 15:25                           ` Stephane Eranian
2013-09-23 15:33                             ` Peter Zijlstra
2013-09-23 17:11                               ` Stephane Eranian
2013-09-23 17:24                                 ` Peter Zijlstra
2013-09-10 15:28               ` Peter Zijlstra
2013-09-10 16:14                 ` Stephane Eranian
