This patch set adds support for precise event sampling with IBS. It also
contains IBS fixes and updates not directly related to precise event
sampling that were found during testing. No perf tools changes are
required, so this set contains only kernel patches. Updated perf tools
patches will also be available; they are based on my previous postings
to this list and additionally implement IBS pseudo events.

With IBS there are two counting modes available, counting either cycles
or micro-ops. If the corresponding performance counter events (hw
events) are set up with the precise flag set, the request is redirected
to the ibs pmu:

 perf record -a -e cpu-cycles:p ...    # use ibs op counting cycle count
 perf record -a -e r076:p ...          # same as -e cpu-cycles:p
 perf record -a -e r0C1:p ...          # use ibs op counting micro-ops

Each IBS sample contains a linear address that points to the instruction
that caused the sample to trigger. With IBS the skid is 0. Even so, we
map IBS sampling to the following precise levels:

 1: RIP taken from the IBS sample or (if invalid) from the stack.
 2: RIP always taken from the IBS sample; samples with an invalid rip
    are dropped. Thus, samples of an event with two precise modifiers
    (e.g. r076:pp) contain only (precise) addresses detected with IBS.

Precise level 3 is reserved for other purposes in the future.

The patches are based on a trivial merge of tip/perf/core into
tip/perf/x86-ibs.
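The redirection of precise hw events to the IBS op pmu can be sketched as follows. This is an illustrative Python model, not kernel code; the dictionary, function name and the string event types are made up for the example, while the raw event codes (0x076, 0x0C1) and the IBS_OP_CNT_CTL bit come from the patch set:

```python
# IBS_OP_CNT_CTL (bit 19 of IbsOpCtl) selects micro-ops vs. cycles counting.
IBS_OP_CNT_CTL = 1 << 19

# (event type, event config) -> extra IBS op config bits (illustrative table)
PRECISE_EVENT_MAP = {
    ("HARDWARE", "CPU_CYCLES"): 0,    # cpu-cycles:p -> count cycles
    ("RAW", 0x0076): 0,               # r076:p -> same as cpu-cycles:p
    ("RAW", 0x00C1): IBS_OP_CNT_CTL,  # r0C1:p -> count micro-ops
}

def precise_event_config(event_type, config, precise_ip):
    """Return the IBS op config bits for a precise event, or None if the
    request is not a precise one and stays with the core pmu."""
    if precise_ip == 0:
        return None                   # not precise: regular counter path
    if precise_ip > 2:
        raise ValueError("precise level 3 is reserved")
    key = (event_type, config)
    if key not in PRECISE_EVENT_MAP:
        raise ValueError("event has no IBS op equivalent")
    return PRECISE_EVENT_MAP[key]
```

Levels 1 and 2 select the same counting mode; they differ only in whether samples with an invalid rip are kept (with the stack rip) or dropped.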
The merge and also the patches are available here: The following changes since commit 820b3e44dc22ac8072cd5ecf82d62193392fcca3: Merge remote-tracking branch 'tip/perf/core' into HEAD (2012-03-21 19:15:20 +0100) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/rric/oprofile.git perf-ibs -Robert Robert Richter (12): perf/x86-ibs: Fix update of period perf: Pass last sampling period to perf_sample_data_init() perf/x86-ibs: Enable ibs op micro-ops counting mode perf/x86-ibs: Fix frequency profiling perf/x86-ibs: Take instruction pointer from ibs sample perf/x86-ibs: Precise event sampling with IBS for AMD CPUs perf/x86-ibs: Rename some variables perf/x86-ibs: Trigger overflow if remaining period is too small perf/x86-ibs: Extend hw period that triggers overflow perf/x86-ibs: Implement workaround for IBS erratum #420 perf/x86-ibs: Catch spurious interrupts after stopping ibs perf/x86-ibs: Fix usage of IBS op current count arch/alpha/kernel/perf_event.c | 3 +- arch/arm/kernel/perf_event_v6.c | 4 +- arch/arm/kernel/perf_event_v7.c | 4 +- arch/arm/kernel/perf_event_xscale.c | 8 +- arch/mips/kernel/perf_event_mipsxx.c | 2 +- arch/powerpc/kernel/perf_event.c | 3 +- arch/powerpc/kernel/perf_event_fsl_emb.c | 3 +- arch/sparc/kernel/perf_event.c | 4 +- arch/x86/include/asm/perf_event.h | 6 +- arch/x86/kernel/cpu/perf_event.c | 4 +- arch/x86/kernel/cpu/perf_event_amd.c | 7 +- arch/x86/kernel/cpu/perf_event_amd_ibs.c | 274 +++++++++++++++++++++-------- arch/x86/kernel/cpu/perf_event_intel.c | 4 +- arch/x86/kernel/cpu/perf_event_intel_ds.c | 6 +- arch/x86/kernel/cpu/perf_event_p4.c | 6 +- include/linux/perf_event.h | 5 +- kernel/events/core.c | 9 +- 17 files changed, 237 insertions(+), 115 deletions(-) -- 1.7.8.4
The last sw period was not correctly updated on overflow and thus led to wrong distribution of events. We always need to properly initialize data.period in struct perf_sample_data. Signed-off-by: Robert Richter <robert.richter@amd.com> --- arch/x86/kernel/cpu/perf_event_amd_ibs.c | 27 ++++++++++++++------------- 1 files changed, 14 insertions(+), 13 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event_amd_ibs.c b/arch/x86/kernel/cpu/perf_event_amd_ibs.c index 573d248..6eb6451 100644 --- a/arch/x86/kernel/cpu/perf_event_amd_ibs.c +++ b/arch/x86/kernel/cpu/perf_event_amd_ibs.c @@ -386,7 +386,21 @@ static int perf_ibs_handle_irq(struct perf_ibs *perf_ibs, struct pt_regs *iregs) if (!(*buf++ & perf_ibs->valid_mask)) return 0; + /* + * Emulate IbsOpCurCnt in MSRC001_1033 (IbsOpCtl), not + * supported in all cpus. As this triggered an interrupt, we + * set the current count to the max count. + */ + config = ibs_data.regs[0]; + if (perf_ibs == &perf_ibs_op && !(ibs_caps & IBS_CAPS_RDWROPCNT)) { + config &= ~IBS_OP_CUR_CNT; + config |= (config & IBS_OP_MAX_CNT) << 36; + } + + perf_ibs_event_update(perf_ibs, event, config); perf_sample_data_init(&data, 0); + data.period = event->hw.last_period; + if (event->attr.sample_type & PERF_SAMPLE_RAW) { ibs_data.caps = ibs_caps; size = 1; @@ -405,19 +419,6 @@ static int perf_ibs_handle_irq(struct perf_ibs *perf_ibs, struct pt_regs *iregs) regs = *iregs; /* XXX: update ip from ibs sample */ - /* - * Emulate IbsOpCurCnt in MSRC001_1033 (IbsOpCtl), not - * supported in all cpus. As this triggered an interrupt, we - * set the current count to the max count. 
- */ - config = ibs_data.regs[0]; - if (perf_ibs == &perf_ibs_op && !(ibs_caps & IBS_CAPS_RDWROPCNT)) { - config &= ~IBS_OP_CUR_CNT; - config |= (config & IBS_OP_MAX_CNT) << 36; - } - - perf_ibs_event_update(perf_ibs, event, config); - overflow = perf_ibs_set_period(perf_ibs, hwc, &config); reenable = !(overflow && perf_event_overflow(event, &data, ®s)); config = (config >> 4) | (reenable ? perf_ibs->enable_mask : 0); -- 1.7.8.4
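The hunk moved above emulates IbsOpCurCnt on CPUs without IBS_CAPS_RDWROPCNT by copying the max count into the current-count bit field, since an interrupt implies the counter reached the max count. A small Python sketch of that bit manipulation; the mask values are taken from arch/x86/include/asm/perf_event.h, the function itself is just for illustration:

```python
# Masks as defined in arch/x86/include/asm/perf_event.h:
IBS_OP_MAX_CNT = 0x0000FFFF       # bits 0..15 of IbsOpCtl
IBS_OP_CUR_CNT = 0xFFFF0 << 32    # bits 36..51 (lower 4 bits are ignored)

def emulate_cur_cnt(config):
    """As the interrupt fired, the op counter must have reached the max
    count, so report the max count as the current count (bits 36..51)."""
    config &= ~IBS_OP_CUR_CNT                 # clear the unsupported field
    config |= (config & IBS_OP_MAX_CNT) << 36 # max cnt -> cur cnt position
    return config
```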
We always need to pass the last sample period to perf_sample_data_init(), otherwise the event distribution will be wrong. Thus, modify the function interface to take the required period as an argument. So basically a pattern like this: perf_sample_data_init(&data, ~0ULL); data.period = event->hw.last_period; will now look like this: perf_sample_data_init(&data, ~0ULL, event->hw.last_period); This avoids uninitialized data.period and simplifies the code. Signed-off-by: Robert Richter <robert.richter@amd.com> --- arch/alpha/kernel/perf_event.c | 3 +-- arch/arm/kernel/perf_event_v6.c | 4 +--- arch/arm/kernel/perf_event_v7.c | 4 +--- arch/arm/kernel/perf_event_xscale.c | 8 ++------ arch/mips/kernel/perf_event_mipsxx.c | 2 +- arch/powerpc/kernel/perf_event.c | 3 +-- arch/powerpc/kernel/perf_event_fsl_emb.c | 3 +-- arch/sparc/kernel/perf_event.c | 4 +--- arch/x86/kernel/cpu/perf_event.c | 4 +--- arch/x86/kernel/cpu/perf_event_amd_ibs.c | 3 +-- arch/x86/kernel/cpu/perf_event_intel.c | 4 +--- arch/x86/kernel/cpu/perf_event_intel_ds.c | 6 ++---- arch/x86/kernel/cpu/perf_event_p4.c | 6 +++--- include/linux/perf_event.h | 5 ++++- kernel/events/core.c | 9 ++++----- 15 files changed, 25 insertions(+), 43 deletions(-) diff --git a/arch/alpha/kernel/perf_event.c b/arch/alpha/kernel/perf_event.c index 0dae252..d821b17 100644 --- a/arch/alpha/kernel/perf_event.c +++ b/arch/alpha/kernel/perf_event.c @@ -824,7 +824,6 @@ static void alpha_perf_event_irq_handler(unsigned long la_ptr, idx = la_ptr; - perf_sample_data_init(&data, 0); for (j = 0; j < cpuc->n_events; j++) { if (cpuc->current_idx[j] == idx) break; @@ -848,7 +847,7 @@ static void alpha_perf_event_irq_handler(unsigned long la_ptr, hwc = &event->hw; alpha_perf_event_update(event, hwc, idx, alpha_pmu->pmc_max_period[idx]+1); - data.period = event->hw.last_period; + perf_sample_data_init(&data, 0, hwc->last_period); if (alpha_perf_event_set_period(event, hwc, idx)) { if (perf_event_overflow(event, &data, regs)) { diff --git 
a/arch/arm/kernel/perf_event_v6.c b/arch/arm/kernel/perf_event_v6.c index b78af0c..ab627a7 100644 --- a/arch/arm/kernel/perf_event_v6.c +++ b/arch/arm/kernel/perf_event_v6.c @@ -489,8 +489,6 @@ armv6pmu_handle_irq(int irq_num, */ armv6_pmcr_write(pmcr); - perf_sample_data_init(&data, 0); - cpuc = &__get_cpu_var(cpu_hw_events); for (idx = 0; idx < cpu_pmu->num_events; ++idx) { struct perf_event *event = cpuc->events[idx]; @@ -509,7 +507,7 @@ armv6pmu_handle_irq(int irq_num, hwc = &event->hw; armpmu_event_update(event, hwc, idx); - data.period = event->hw.last_period; + perf_sample_data_init(&data, 0, hwc->last_period); if (!armpmu_event_set_period(event, hwc, idx)) continue; diff --git a/arch/arm/kernel/perf_event_v7.c b/arch/arm/kernel/perf_event_v7.c index 4d7095a..ec0c6cc 100644 --- a/arch/arm/kernel/perf_event_v7.c +++ b/arch/arm/kernel/perf_event_v7.c @@ -953,8 +953,6 @@ static irqreturn_t armv7pmu_handle_irq(int irq_num, void *dev) */ regs = get_irq_regs(); - perf_sample_data_init(&data, 0); - cpuc = &__get_cpu_var(cpu_hw_events); for (idx = 0; idx < cpu_pmu->num_events; ++idx) { struct perf_event *event = cpuc->events[idx]; @@ -973,7 +971,7 @@ static irqreturn_t armv7pmu_handle_irq(int irq_num, void *dev) hwc = &event->hw; armpmu_event_update(event, hwc, idx); - data.period = event->hw.last_period; + perf_sample_data_init(&data, 0, hwc->last_period); if (!armpmu_event_set_period(event, hwc, idx)) continue; diff --git a/arch/arm/kernel/perf_event_xscale.c b/arch/arm/kernel/perf_event_xscale.c index 71a21e6..e34e725 100644 --- a/arch/arm/kernel/perf_event_xscale.c +++ b/arch/arm/kernel/perf_event_xscale.c @@ -248,8 +248,6 @@ xscale1pmu_handle_irq(int irq_num, void *dev) regs = get_irq_regs(); - perf_sample_data_init(&data, 0); - cpuc = &__get_cpu_var(cpu_hw_events); for (idx = 0; idx < cpu_pmu->num_events; ++idx) { struct perf_event *event = cpuc->events[idx]; @@ -263,7 +261,7 @@ xscale1pmu_handle_irq(int irq_num, void *dev) hwc = &event->hw; 
armpmu_event_update(event, hwc, idx); - data.period = event->hw.last_period; + perf_sample_data_init(&data, 0, hwc->last_period); if (!armpmu_event_set_period(event, hwc, idx)) continue; @@ -588,8 +586,6 @@ xscale2pmu_handle_irq(int irq_num, void *dev) regs = get_irq_regs(); - perf_sample_data_init(&data, 0); - cpuc = &__get_cpu_var(cpu_hw_events); for (idx = 0; idx < cpu_pmu->num_events; ++idx) { struct perf_event *event = cpuc->events[idx]; @@ -603,7 +599,7 @@ xscale2pmu_handle_irq(int irq_num, void *dev) hwc = &event->hw; armpmu_event_update(event, hwc, idx); - data.period = event->hw.last_period; + perf_sample_data_init(&data, 0, hwc->last_period); if (!armpmu_event_set_period(event, hwc, idx)) continue; diff --git a/arch/mips/kernel/perf_event_mipsxx.c b/arch/mips/kernel/perf_event_mipsxx.c index 811084f..ab73fa2 100644 --- a/arch/mips/kernel/perf_event_mipsxx.c +++ b/arch/mips/kernel/perf_event_mipsxx.c @@ -1325,7 +1325,7 @@ static int mipsxx_pmu_handle_shared_irq(void) regs = get_irq_regs(); - perf_sample_data_init(&data, 0); + perf_sample_data_init(&data, 0, 0); switch (counters) { #define HANDLE_COUNTER(n) \ diff --git a/arch/powerpc/kernel/perf_event.c b/arch/powerpc/kernel/perf_event.c index c2e27ed..df2b284 100644 --- a/arch/powerpc/kernel/perf_event.c +++ b/arch/powerpc/kernel/perf_event.c @@ -1268,8 +1268,7 @@ static void record_and_restart(struct perf_event *event, unsigned long val, if (record) { struct perf_sample_data data; - perf_sample_data_init(&data, ~0ULL); - data.period = event->hw.last_period; + perf_sample_data_init(&data, ~0ULL, event->hw.last_period); if (event->attr.sample_type & PERF_SAMPLE_ADDR) perf_get_data_addr(regs, &data.addr); diff --git a/arch/powerpc/kernel/perf_event_fsl_emb.c b/arch/powerpc/kernel/perf_event_fsl_emb.c index 0a6d2a9..106c533 100644 --- a/arch/powerpc/kernel/perf_event_fsl_emb.c +++ b/arch/powerpc/kernel/perf_event_fsl_emb.c @@ -613,8 +613,7 @@ static void record_and_restart(struct perf_event *event, unsigned 
long val, if (record) { struct perf_sample_data data; - perf_sample_data_init(&data, 0); - data.period = event->hw.last_period; + perf_sample_data_init(&data, 0, event->hw.last_period); if (perf_event_overflow(event, &data, regs)) fsl_emb_pmu_stop(event, 0); diff --git a/arch/sparc/kernel/perf_event.c b/arch/sparc/kernel/perf_event.c index 8e16a4a..333a14a 100644 --- a/arch/sparc/kernel/perf_event.c +++ b/arch/sparc/kernel/perf_event.c @@ -1294,8 +1294,6 @@ static int __kprobes perf_event_nmi_handler(struct notifier_block *self, regs = args->regs; - perf_sample_data_init(&data, 0); - cpuc = &__get_cpu_var(cpu_hw_events); /* If the PMU has the TOE IRQ enable bits, we need to do a @@ -1319,7 +1317,7 @@ static int __kprobes perf_event_nmi_handler(struct notifier_block *self, if (val & (1ULL << 31)) continue; - data.period = event->hw.last_period; + perf_sample_data_init(&data, 0, hwc->last_period); if (!sparc_perf_event_set_period(event, hwc, idx)) continue; diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c index 453ac94..56ae3af 100644 --- a/arch/x86/kernel/cpu/perf_event.c +++ b/arch/x86/kernel/cpu/perf_event.c @@ -1187,8 +1187,6 @@ int x86_pmu_handle_irq(struct pt_regs *regs) int idx, handled = 0; u64 val; - perf_sample_data_init(&data, 0); - cpuc = &__get_cpu_var(cpu_hw_events); /* @@ -1223,7 +1221,7 @@ int x86_pmu_handle_irq(struct pt_regs *regs) * event overflow */ handled++; - data.period = event->hw.last_period; + perf_sample_data_init(&data, 0, event->hw.last_period); if (!x86_perf_event_set_period(event)) continue; diff --git a/arch/x86/kernel/cpu/perf_event_amd_ibs.c b/arch/x86/kernel/cpu/perf_event_amd_ibs.c index 6eb6451..74b663c 100644 --- a/arch/x86/kernel/cpu/perf_event_amd_ibs.c +++ b/arch/x86/kernel/cpu/perf_event_amd_ibs.c @@ -398,8 +398,7 @@ static int perf_ibs_handle_irq(struct perf_ibs *perf_ibs, struct pt_regs *iregs) } perf_ibs_event_update(perf_ibs, event, config); - perf_sample_data_init(&data, 0); - data.period 
= event->hw.last_period; + perf_sample_data_init(&data, 0, hwc->last_period); if (event->attr.sample_type & PERF_SAMPLE_RAW) { ibs_data.caps = ibs_caps; diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c index 26b3e2f..166546e 100644 --- a/arch/x86/kernel/cpu/perf_event_intel.c +++ b/arch/x86/kernel/cpu/perf_event_intel.c @@ -1027,8 +1027,6 @@ static int intel_pmu_handle_irq(struct pt_regs *regs) u64 status; int handled; - perf_sample_data_init(&data, 0); - cpuc = &__get_cpu_var(cpu_hw_events); /* @@ -1082,7 +1080,7 @@ again: if (!intel_pmu_save_and_restart(event)) continue; - data.period = event->hw.last_period; + perf_sample_data_init(&data, 0, event->hw.last_period); if (has_branch_stack(event)) data.br_stack = &cpuc->lbr_stack; diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c index 7f64df1..5a3edc2 100644 --- a/arch/x86/kernel/cpu/perf_event_intel_ds.c +++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c @@ -316,8 +316,7 @@ int intel_pmu_drain_bts_buffer(void) ds->bts_index = ds->bts_buffer_base; - perf_sample_data_init(&data, 0); - data.period = event->hw.last_period; + perf_sample_data_init(&data, 0, event->hw.last_period); regs.ip = 0; /* @@ -564,8 +563,7 @@ static void __intel_pmu_pebs_event(struct perf_event *event, if (!intel_pmu_save_and_restart(event)) return; - perf_sample_data_init(&data, 0); - data.period = event->hw.last_period; + perf_sample_data_init(&data, 0, event->hw.last_period); /* * We use the interrupt regs as a base because the PEBS record diff --git a/arch/x86/kernel/cpu/perf_event_p4.c b/arch/x86/kernel/cpu/perf_event_p4.c index ef484d9..ed301c7 100644 --- a/arch/x86/kernel/cpu/perf_event_p4.c +++ b/arch/x86/kernel/cpu/perf_event_p4.c @@ -1005,8 +1005,6 @@ static int p4_pmu_handle_irq(struct pt_regs *regs) int idx, handled = 0; u64 val; - perf_sample_data_init(&data, 0); - cpuc = &__get_cpu_var(cpu_hw_events); for (idx = 0; idx < x86_pmu.num_counters; 
idx++) { @@ -1034,10 +1032,12 @@ static int p4_pmu_handle_irq(struct pt_regs *regs) handled += overflow; /* event overflow for sure */ - data.period = event->hw.last_period; + perf_sample_data_init(&data, 0, hwc->last_period); if (!x86_perf_event_set_period(event)) continue; + + if (perf_event_overflow(event, &data, regs)) x86_pmu_stop(event, 0); } diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 57ae485..12ac652 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -1076,11 +1076,14 @@ struct perf_sample_data { struct perf_branch_stack *br_stack; }; -static inline void perf_sample_data_init(struct perf_sample_data *data, u64 addr) +static inline void perf_sample_data_init(struct perf_sample_data *data, + u64 addr, u64 period) { + /* remaining struct members initialized in perf_prepare_sample() */ data->addr = addr; data->raw = NULL; data->br_stack = NULL; + data->period = period; } extern void perf_output_sample(struct perf_output_handle *handle, diff --git a/kernel/events/core.c b/kernel/events/core.c index c61234b..8833198 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -4957,7 +4957,7 @@ void __perf_sw_event(u32 event_id, u64 nr, struct pt_regs *regs, u64 addr) if (rctx < 0) return; - perf_sample_data_init(&data, addr); + perf_sample_data_init(&data, addr, 0); do_perf_sw_event(PERF_TYPE_SOFTWARE, event_id, nr, &data, regs); @@ -5215,7 +5215,7 @@ void perf_tp_event(u64 addr, u64 count, void *record, int entry_size, .data = record, }; - perf_sample_data_init(&data, addr); + perf_sample_data_init(&data, addr, 0); data.raw = &raw; hlist_for_each_entry_rcu(event, node, head, hlist_entry) { @@ -5318,7 +5318,7 @@ void perf_bp_event(struct perf_event *bp, void *data) struct perf_sample_data sample; struct pt_regs *regs = data; - perf_sample_data_init(&sample, bp->attr.bp_addr); + perf_sample_data_init(&sample, bp->attr.bp_addr, 0); if (!bp->hw.state && !perf_exclude_event(bp, regs)) perf_swevent_event(bp, 
1, &sample, regs); @@ -5344,8 +5344,7 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer) event->pmu->read(event); - perf_sample_data_init(&data, 0); - data.period = event->hw.last_period; + perf_sample_data_init(&data, 0, event->hw.last_period); regs = get_irq_regs(); if (regs && !perf_exclude_event(event, regs)) { -- 1.7.8.4
Allow enabling ibs op micro-ops counting mode. Signed-off-by: Robert Richter <robert.richter@amd.com> --- arch/x86/kernel/cpu/perf_event_amd_ibs.c | 2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event_amd_ibs.c b/arch/x86/kernel/cpu/perf_event_amd_ibs.c index 74b663c..6f00ee3 100644 --- a/arch/x86/kernel/cpu/perf_event_amd_ibs.c +++ b/arch/x86/kernel/cpu/perf_event_amd_ibs.c @@ -468,6 +468,8 @@ static __init int perf_event_ibs_init(void) return -ENODEV; /* ibs not supported by the cpu */ perf_ibs_pmu_init(&perf_ibs_fetch, "ibs_fetch"); + if (ibs_caps & IBS_CAPS_OPCNT) + perf_ibs_op.config_mask |= IBS_OP_CNT_CTL; perf_ibs_pmu_init(&perf_ibs_op, "ibs_op"); register_nmi_handler(NMI_LOCAL, &perf_ibs_nmi_handler, 0, "perf_ibs"); printk(KERN_INFO "perf: AMD IBS detected (0x%08x)\n", ibs_caps); -- 1.7.8.4
Fix profiling at a fixed frequency; in this case the freq value and sample period were set up incorrectly. Since sampling periods are adjusted, we now also allow periods that have the lower 4 bits set. Another fix is the setup of the hw counter: if we modify hwc->sample_period, we also need to update hwc->last_period and hwc->period_left. Signed-off-by: Robert Richter <robert.richter@amd.com> --- arch/x86/kernel/cpu/perf_event_amd_ibs.c | 18 ++++++++++++++++-- 1 files changed, 16 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event_amd_ibs.c b/arch/x86/kernel/cpu/perf_event_amd_ibs.c index 6f00ee3..eec3ea2 100644 --- a/arch/x86/kernel/cpu/perf_event_amd_ibs.c +++ b/arch/x86/kernel/cpu/perf_event_amd_ibs.c @@ -162,9 +162,16 @@ static int perf_ibs_init(struct perf_event *event) if (config & perf_ibs->cnt_mask) /* raw max_cnt may not be set */ return -EINVAL; - if (hwc->sample_period & 0x0f) - /* lower 4 bits can not be set in ibs max cnt */ + if (!event->attr.sample_freq && hwc->sample_period & 0x0f) + /* + * lower 4 bits can not be set in ibs max cnt, + * but allowing it in case we adjust the + * sample period to set a frequency. + */ return -EINVAL; + hwc->sample_period &= ~0x0FULL; + if (!hwc->sample_period) + hwc->sample_period = 0x10; } else { max_cnt = config & perf_ibs->cnt_mask; config &= ~perf_ibs->cnt_mask; @@ -175,6 +182,13 @@ static int perf_ibs_init(struct perf_event *event) if (!hwc->sample_period) return -EINVAL; + /* + * If we modify hwc->sample_period, we also need to update + * hwc->last_period and hwc->period_left. + */ + hwc->last_period = hwc->sample_period; + local64_set(&hwc->period_left, hwc->sample_period); + hwc->config_base = perf_ibs->msr; hwc->config = config; -- 1.7.8.4
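The period adjustment done in the hunk above can be sketched in Python. This is an illustrative model, not kernel code; the 0x0F alignment mask and the 0x10 minimum match the values in the patch:

```python
def align_ibs_period(sample_period):
    """Clear the lower 4 bits, which cannot be encoded in the ibs max
    cnt field, and enforce a non-zero minimum period of 0x10."""
    sample_period &= ~0x0F
    if not sample_period:
        sample_period = 0x10
    return sample_period
```

In frequency mode the core adjusts the period on the fly, so an unaligned period is no longer rejected with -EINVAL but aligned like this instead.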
Each IBS sample contains a linear address of the instruction that caused the sample to trigger. This address is more precise than the rip that was taken from the interrupt handler's stack. Update the rip with that address. We use this in the next patch to implement precise-event sampling on AMD systems using IBS. Signed-off-by: Robert Richter <robert.richter@amd.com> --- arch/x86/include/asm/perf_event.h | 6 ++- arch/x86/kernel/cpu/perf_event_amd_ibs.c | 48 +++++++++++++++++++---------- 2 files changed, 35 insertions(+), 19 deletions(-) diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h index 9cf6696..651172d 100644 --- a/arch/x86/include/asm/perf_event.h +++ b/arch/x86/include/asm/perf_event.h @@ -157,6 +157,7 @@ struct x86_pmu_capability { #define IBS_CAPS_OPCNT (1U<<4) #define IBS_CAPS_BRNTRGT (1U<<5) #define IBS_CAPS_OPCNTEXT (1U<<6) +#define IBS_CAPS_RIPINVALIDCHK (1U<<7) #define IBS_CAPS_DEFAULT (IBS_CAPS_AVAIL \ | IBS_CAPS_FETCHSAM \ @@ -169,14 +170,14 @@ struct x86_pmu_capability { #define IBSCTL_LVT_OFFSET_VALID (1ULL<<8) #define IBSCTL_LVT_OFFSET_MASK 0x0F -/* IbsFetchCtl bits/masks */ +/* ibs fetch bits/masks */ #define IBS_FETCH_RAND_EN (1ULL<<57) #define IBS_FETCH_VAL (1ULL<<49) #define IBS_FETCH_ENABLE (1ULL<<48) #define IBS_FETCH_CNT 0xFFFF0000ULL #define IBS_FETCH_MAX_CNT 0x0000FFFFULL -/* IbsOpCtl bits */ +/* ibs op bits/masks */ /* lower 4 bits of the current count are ignored: */ #define IBS_OP_CUR_CNT (0xFFFF0ULL<<32) #define IBS_OP_CNT_CTL (1ULL<<19) @@ -184,6 +185,7 @@ struct x86_pmu_capability { #define IBS_OP_ENABLE (1ULL<<17) #define IBS_OP_MAX_CNT 0x0000FFFFULL #define IBS_OP_MAX_CNT_EXT 0x007FFFFFULL /* not a register bit mask */ +#define IBS_RIP_INVALID (1ULL<<38) extern u32 get_ibs_caps(void); diff --git a/arch/x86/kernel/cpu/perf_event_amd_ibs.c b/arch/x86/kernel/cpu/perf_event_amd_ibs.c index eec3ea2..0321b64 100644 --- a/arch/x86/kernel/cpu/perf_event_amd_ibs.c +++ 
b/arch/x86/kernel/cpu/perf_event_amd_ibs.c @@ -9,6 +9,7 @@ #include <linux/perf_event.h> #include <linux/module.h> #include <linux/pci.h> +#include <linux/ptrace.h> #include <asm/apic.h> @@ -382,7 +383,7 @@ static int perf_ibs_handle_irq(struct perf_ibs *perf_ibs, struct pt_regs *iregs) struct perf_raw_record raw; struct pt_regs regs; struct perf_ibs_data ibs_data; - int offset, size, overflow, reenable; + int offset, size, check_rip, offset_max, throttle = 0; unsigned int msr; u64 *buf, config; @@ -413,28 +414,41 @@ static int perf_ibs_handle_irq(struct perf_ibs *perf_ibs, struct pt_regs *iregs) perf_ibs_event_update(perf_ibs, event, config); perf_sample_data_init(&data, 0, hwc->last_period); + if (!perf_ibs_set_period(perf_ibs, hwc, &config)) + goto out; /* no sw counter overflow */ + + ibs_data.caps = ibs_caps; + size = 1; + offset = 1; + check_rip = (perf_ibs == &perf_ibs_op && (ibs_caps & IBS_CAPS_RIPINVALIDCHK)); + if (event->attr.sample_type & PERF_SAMPLE_RAW) + offset_max = perf_ibs->offset_max; + else if (check_rip) + offset_max = 2; + else + offset_max = 1; + do { + rdmsrl(msr + offset, *buf++); + size++; + offset = find_next_bit(perf_ibs->offset_mask, + perf_ibs->offset_max, + offset + 1); + } while (offset < offset_max); + ibs_data.size = sizeof(u64) * size; + + regs = *iregs; + if (!check_rip || !(ibs_data.regs[2] & IBS_RIP_INVALID)) + instruction_pointer_set(®s, ibs_data.regs[1]); if (event->attr.sample_type & PERF_SAMPLE_RAW) { - ibs_data.caps = ibs_caps; - size = 1; - offset = 1; - do { - rdmsrl(msr + offset, *buf++); - size++; - offset = find_next_bit(perf_ibs->offset_mask, - perf_ibs->offset_max, - offset + 1); - } while (offset < perf_ibs->offset_max); - raw.size = sizeof(u32) + sizeof(u64) * size; + raw.size = sizeof(u32) + ibs_data.size; raw.data = ibs_data.data; data.raw = &raw; } - regs = *iregs; /* XXX: update ip from ibs sample */ - - overflow = perf_ibs_set_period(perf_ibs, hwc, &config); - reenable = !(overflow && 
perf_event_overflow(event, &data, ®s)); - config = (config >> 4) | (reenable ? perf_ibs->enable_mask : 0); + throttle = perf_event_overflow(event, &data, ®s); +out: + config = (config >> 4) | (throttle ? 0 : perf_ibs->enable_mask); perf_ibs_enable_event(hwc, config); perf_event_update_userpage(event); -- 1.7.8.4
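The register-read loop introduced above walks a bitmask of valid MSR offsets with find_next_bit() and stops early when only the raw data or only the rip check is needed. A Python model of that traversal; the helper is a minimal stand-in for the kernel's find_next_bit(), and the example mask in the test is hypothetical, not the real ibs op offset mask:

```python
def find_next_bit(mask, size, start):
    """Index of the next set bit >= start, or size if there is none
    (minimal stand-in for the kernel helper of the same name)."""
    for i in range(start, size):
        if mask >> i & 1:
            return i
    return size

def offsets_to_read(offset_mask, offset_max):
    """MSR offsets read after the control register (offset 0); mirrors
    the do/while loop in perf_ibs_handle_irq(), simplified so that the
    mask size and the stop limit are the same value."""
    offsets = []
    offset = 1                 # offset 0 (the ctl register) is read earlier
    while True:
        offsets.append(offset)
        offset = find_next_bit(offset_mask, offset_max, offset + 1)
        if offset >= offset_max:
            return offsets
```

A smaller offset_max (as used for the non-raw precise case) simply truncates the walk, so only the registers actually needed are read.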
This patch adds support for precise event sampling with IBS. There are two counting modes to count either cycles or micro-ops. If the corresponding performance counter events (hw events) are set up with the precise flag set, the request is redirected to the ibs pmu: perf record -a -e cpu-cycles:p ... # use ibs op counting cycle count perf record -a -e r076:p ... # same as -e cpu-cycles:p perf record -a -e r0C1:p ... # use ibs op counting micro-ops Each IBS sample contains a linear address that points to the instruction that caused the sample to trigger. With IBS the skid is 0. Even so, we map IBS sampling to the following precise levels: 1: RIP taken from IBS sample or (if invalid) from stack 2: RIP always taken from IBS sample, samples with an invalid rip are dropped. Thus samples of an event containing two precise modifiers (e.g. r076:pp) only contain (precise) addresses detected with IBS. Precise level 3 is reserved for other purposes in the future. Signed-off-by: Robert Richter <robert.richter@amd.com> --- arch/x86/kernel/cpu/perf_event_amd.c | 7 +++- arch/x86/kernel/cpu/perf_event_amd_ibs.c | 71 +++++++++++++++++++++++++++++- 2 files changed, 75 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c index 95e7fe1..4be3463 100644 --- a/arch/x86/kernel/cpu/perf_event_amd.c +++ b/arch/x86/kernel/cpu/perf_event_amd.c @@ -134,8 +134,13 @@ static u64 amd_pmu_event_map(int hw_event) static int amd_pmu_hw_config(struct perf_event *event) { - int ret = x86_pmu_hw_config(event); + int ret; + /* pass precise event sampling to ibs: */ + if (event->attr.precise_ip && get_ibs_caps()) + return -ENOENT; + + ret = x86_pmu_hw_config(event); if (ret) return ret; diff --git a/arch/x86/kernel/cpu/perf_event_amd_ibs.c b/arch/x86/kernel/cpu/perf_event_amd_ibs.c index 0321b64..05a359f 100644 --- a/arch/x86/kernel/cpu/perf_event_amd_ibs.c +++ b/arch/x86/kernel/cpu/perf_event_amd_ibs.c @@ -145,17 +145,82 @@ 
static struct perf_ibs *get_ibs_pmu(int type) return NULL; } +/* + * Use IBS for precise event sampling: + * + * perf record -a -e cpu-cycles:p ... # use ibs op counting cycle count + * perf record -a -e r076:p ... # same as -e cpu-cycles:p + * perf record -a -e r0C1:p ... # use ibs op counting micro-ops + * + * IbsOpCntCtl (bit 19) of IBS Execution Control Register (IbsOpCtl, + * MSRC001_1033) is used to select either cycle or micro-ops counting + * mode. + * + * We map IBS sampling to following precise levels: + * + * 1: RIP taken from IBS sample or (if invalid) from stack + * 2: RIP always taken from IBS sample, samples with an invalid rip + * are dropped. Thus samples of an event containing two precise + * modifiers (e.g. r076:pp) only contain (precise) addresses + * detected with IBS. + */ +static int perf_ibs_precise_event(struct perf_event *event, u64 *config) +{ + switch (event->attr.precise_ip) { + case 0: + return -ENOENT; + case 1: + case 2: + break; + default: + return -EOPNOTSUPP; + } + + switch (event->attr.type) { + case PERF_TYPE_HARDWARE: + switch (event->attr.config) { + case PERF_COUNT_HW_CPU_CYCLES: + *config = 0; + return 0; + } + break; + case PERF_TYPE_RAW: + switch (event->attr.config) { + case 0x0076: + *config = 0; + return 0; + case 0x00C1: + *config = IBS_OP_CNT_CTL; + return 0; + } + break; + default: + return -ENOENT; + } + + return -EOPNOTSUPP; +} + static int perf_ibs_init(struct perf_event *event) { struct hw_perf_event *hwc = &event->hw; struct perf_ibs *perf_ibs; u64 max_cnt, config; + int ret; perf_ibs = get_ibs_pmu(event->attr.type); - if (!perf_ibs) + if (perf_ibs) { + config = event->attr.config; + } else { + perf_ibs = &perf_ibs_op; + ret = perf_ibs_precise_event(event, &config); + if (ret) + return ret; + } + + if (event->pmu != &perf_ibs->pmu) return -ENOENT; - config = event->attr.config; if (config & ~perf_ibs->config_mask) return -EINVAL; @@ -439,6 +504,8 @@ static int perf_ibs_handle_irq(struct perf_ibs *perf_ibs, 
struct pt_regs *iregs) regs = *iregs; if (!check_rip || !(ibs_data.regs[2] & IBS_RIP_INVALID)) instruction_pointer_set(®s, ibs_data.regs[1]); + else if (event->attr.precise_ip > 1) + goto out; /* drop non-precise samples */ if (event->attr.sample_type & PERF_SAMPLE_RAW) { raw.size = sizeof(u32) + ibs_data.size; -- 1.7.8.4
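The two precise levels differ only in what happens when the hardware flags the sampled rip as invalid. A Python sketch of that decision; IBS_RIP_INVALID is bit 38 of IbsOpData as defined in the previous patch, while the function and its arguments are purely illustrative:

```python
IBS_RIP_INVALID = 1 << 38   # in IbsOpData (ibs_data.regs[2])

def sample_ip(stack_ip, ibs_rip, op_data, precise_ip):
    """Instruction pointer to report for a sample, or None if the
    sample must be dropped (precise level 2 with an invalid rip)."""
    if not op_data & IBS_RIP_INVALID:
        return ibs_rip          # precise address straight from the sample
    if precise_ip > 1:
        return None             # :pp -> drop samples with an invalid rip
    return stack_ip             # :p -> fall back to the interrupt stack rip
```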
Simple patch that just renames some variables for better understanding. Signed-off-by: Robert Richter <robert.richter@amd.com> --- arch/x86/kernel/cpu/perf_event_amd_ibs.c | 10 +++++----- 1 files changed, 5 insertions(+), 5 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event_amd_ibs.c b/arch/x86/kernel/cpu/perf_event_amd_ibs.c index 05a359f..6591b77 100644 --- a/arch/x86/kernel/cpu/perf_event_amd_ibs.c +++ b/arch/x86/kernel/cpu/perf_event_amd_ibs.c @@ -62,7 +62,7 @@ struct perf_ibs_data { }; static int -perf_event_set_period(struct hw_perf_event *hwc, u64 min, u64 max, u64 *count) +perf_event_set_period(struct hw_perf_event *hwc, u64 min, u64 max, u64 *hw_period) { s64 left = local64_read(&hwc->period_left); s64 period = hwc->sample_period; @@ -91,7 +91,7 @@ perf_event_set_period(struct hw_perf_event *hwc, u64 min, u64 max, u64 *count) if (left > max) left = max; - *count = (u64)left; + *hw_period = (u64)left; return overflow; } @@ -264,13 +264,13 @@ static int perf_ibs_init(struct perf_event *event) static int perf_ibs_set_period(struct perf_ibs *perf_ibs, struct hw_perf_event *hwc, u64 *period) { - int ret; + int overflow; /* ignore lower 4 bits in min count: */ - ret = perf_event_set_period(hwc, 1<<4, perf_ibs->max_period, period); + overflow = perf_event_set_period(hwc, 1<<4, perf_ibs->max_period, period); local64_set(&hwc->prev_count, 0); - return ret; + return overflow; } static u64 get_ibs_fetch_count(u64 config) -- 1.7.8.4
There are cases where the remaining period is smaller than the minimal possible value. In this case the counter is restarted with the minimal period. This is of no use, as the interrupt handler will trigger again immediately and will most likely hit itself, which biases the results. So, if the remaining period is within the min range, it is better not to restart the counter and instead trigger the overflow. Signed-off-by: Robert Richter <robert.richter@amd.com> --- arch/x86/kernel/cpu/perf_event_amd_ibs.c | 5 +---- 1 files changed, 1 insertions(+), 4 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event_amd_ibs.c b/arch/x86/kernel/cpu/perf_event_amd_ibs.c index 6591b77..1f53f16 100644 --- a/arch/x86/kernel/cpu/perf_event_amd_ibs.c +++ b/arch/x86/kernel/cpu/perf_event_amd_ibs.c @@ -78,16 +78,13 @@ perf_event_set_period(struct hw_perf_event *hwc, u64 min, u64 max, u64 *hw_perio overflow = 1; } - if (unlikely(left <= 0)) { + if (unlikely(left < (s64)min)) { left += period; local64_set(&hwc->period_left, left); hwc->last_period = period; overflow = 1; } - if (unlikely(left < min)) - left = min; - if (left > max) left = max; -- 1.7.8.4
If the last hw period is too short, we might hit the irq handler, which biases the results. Thus, try to have a maximal last period that triggers the sw overflow. Signed-off-by: Robert Richter <robert.richter@amd.com> --- arch/x86/kernel/cpu/perf_event_amd_ibs.c | 15 +++++++++++++-- 1 files changed, 13 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event_amd_ibs.c b/arch/x86/kernel/cpu/perf_event_amd_ibs.c index 1f53f16..f0271dd 100644 --- a/arch/x86/kernel/cpu/perf_event_amd_ibs.c +++ b/arch/x86/kernel/cpu/perf_event_amd_ibs.c @@ -85,8 +85,19 @@ perf_event_set_period(struct hw_perf_event *hwc, u64 min, u64 max, u64 *hw_perio overflow = 1; } - if (left > max) - left = max; + /* + * If the hw period that triggers the sw overflow is too short + * we might hit the irq handler. This biases the results. + * Thus we shorten the next-to-last period and set the last + * period to the max period. + */ + if (left > max) { + left -= max; + if (left > max) + left = max; + else if (left < min) + left = min; + } *hw_period = (u64)left; -- 1.7.8.4
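Taken together with the previous patch, the period setup now handles both a too-small remaining period (trigger the sw overflow now) and a too-large one (split it so the final hw period is not tiny). A Python rendering of the resulting logic; min and max are per-pmu values, and 0x10 with a 20-bit max count are used here only as example defaults:

```python
def set_period(period_left, sample_period, min_cnt=0x10, max_cnt=0xFFFF0):
    """Return (hw_period, overflow): the value to program into the hw
    counter and whether a sw overflow should be reported now."""
    left = period_left
    overflow = False
    if left <= -sample_period:      # counter fell far behind: restart
        left = sample_period
        overflow = True
    if left < min_cnt:              # too small to restart: overflow now
        left += sample_period
        overflow = True
    if left > max_cnt:              # shorten the next-to-last period so
        left -= max_cnt             # the last one can be the max period
        if left > max_cnt:
            left = max_cnt
        elif left < min_cnt:
            left = min_cnt
    return left, overflow
```

The sketch leaves out the hwc->period_left / hwc->last_period bookkeeping that the kernel code updates alongside these branches.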
When disabling ibs there might be the case where hardware continuously
generates interrupts. This is described in erratum #420 (Instruction-
Based Sampling Engine May Generate Interrupt that Cannot Be Cleared).
To avoid this we must clear the counter mask first and then clear the
enable bit, which is what this patch implements. See Revision Guide
for AMD Family 10h Processors, Publication #41322.

Note: We now keep track of the last-read ibs config value, which is
then used to disable ibs. To update the config value, we now pass a
pointer to the functions reading it.

Signed-off-by: Robert Richter <robert.richter@amd.com>
---
 arch/x86/kernel/cpu/perf_event_amd_ibs.c |   62 +++++++++++++++++++-----------
 1 files changed, 39 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_amd_ibs.c b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
index f0271dd..35a35be 100644
--- a/arch/x86/kernel/cpu/perf_event_amd_ibs.c
+++ b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
@@ -293,20 +293,36 @@ static u64 get_ibs_op_count(u64 config)
 
 static void
 perf_ibs_event_update(struct perf_ibs *perf_ibs, struct perf_event *event,
-		      u64 config)
+		      u64 *config)
 {
-	u64 count = perf_ibs->get_count(config);
+	u64 count = perf_ibs->get_count(*config);
 
 	while (!perf_event_try_update(event, count, 20)) {
-		rdmsrl(event->hw.config_base, config);
-		count = perf_ibs->get_count(config);
+		rdmsrl(event->hw.config_base, *config);
+		count = perf_ibs->get_count(*config);
 	}
 }
 
-/* Note: The enable mask must be encoded in the config argument. */
-static inline void perf_ibs_enable_event(struct hw_perf_event *hwc, u64 config)
+static inline void perf_ibs_enable_event(struct perf_ibs *perf_ibs,
+					 struct hw_perf_event *hwc, u64 config)
 {
-	wrmsrl(hwc->config_base, hwc->config | config);
+	wrmsrl(hwc->config_base, hwc->config | config | perf_ibs->enable_mask);
+}
+
+/*
+ * Erratum #420 Instruction-Based Sampling Engine May Generate
+ * Interrupt that Cannot Be Cleared:
+ *
+ * Must clear counter mask first, then clear the enable bit. See
+ * Revision Guide for AMD Family 10h Processors, Publication #41322.
+ */
+static inline void perf_ibs_disable_event(struct perf_ibs *perf_ibs,
+					  struct hw_perf_event *hwc, u64 config)
+{
+	config &= ~perf_ibs->cnt_mask;
+	wrmsrl(hwc->config_base, config);
+	config &= ~perf_ibs->enable_mask;
+	wrmsrl(hwc->config_base, config);
 }
 
 /*
@@ -320,7 +336,7 @@ static void perf_ibs_start(struct perf_event *event, int flags)
 	struct hw_perf_event *hwc = &event->hw;
 	struct perf_ibs *perf_ibs = container_of(event->pmu, struct perf_ibs, pmu);
 	struct cpu_perf_ibs *pcpu = this_cpu_ptr(perf_ibs->pcpu);
-	u64 config;
+	u64 period;
 
 	if (WARN_ON_ONCE(!(hwc->state & PERF_HES_STOPPED)))
 		return;
@@ -328,10 +344,9 @@ static void perf_ibs_start(struct perf_event *event, int flags)
 	WARN_ON_ONCE(!(hwc->state & PERF_HES_UPTODATE));
 	hwc->state = 0;
 
-	perf_ibs_set_period(perf_ibs, hwc, &config);
-	config = (config >> 4) | perf_ibs->enable_mask;
+	perf_ibs_set_period(perf_ibs, hwc, &period);
 	set_bit(IBS_STARTED, pcpu->state);
-	perf_ibs_enable_event(hwc, config);
+	perf_ibs_enable_event(perf_ibs, hwc, period >> 4);
 
 	perf_event_update_userpage(event);
 }
@@ -341,7 +356,7 @@ static void perf_ibs_stop(struct perf_event *event, int flags)
 	struct hw_perf_event *hwc = &event->hw;
 	struct perf_ibs *perf_ibs = container_of(event->pmu, struct perf_ibs, pmu);
 	struct cpu_perf_ibs *pcpu = this_cpu_ptr(perf_ibs->pcpu);
-	u64 val;
+	u64 config;
 	int stopping;
 
 	stopping = test_and_clear_bit(IBS_STARTED, pcpu->state);
@@ -349,12 +364,11 @@ static void perf_ibs_stop(struct perf_event *event, int flags)
 	if (!stopping && (hwc->state & PERF_HES_UPTODATE))
 		return;
 
-	rdmsrl(hwc->config_base, val);
+	rdmsrl(hwc->config_base, config);
 
 	if (stopping) {
 		set_bit(IBS_STOPPING, pcpu->state);
-		val &= ~perf_ibs->enable_mask;
-		wrmsrl(hwc->config_base, val);
+		perf_ibs_disable_event(perf_ibs, hwc, config);
 		WARN_ON_ONCE(hwc->state & PERF_HES_STOPPED);
 		hwc->state |= PERF_HES_STOPPED;
 	}
@@ -362,7 +376,7 @@ static void perf_ibs_stop(struct perf_event *event, int flags)
 	if (hwc->state & PERF_HES_UPTODATE)
 		return;
 
-	perf_ibs_event_update(perf_ibs, event, val);
+	perf_ibs_event_update(perf_ibs, event, &config);
 	hwc->state |= PERF_HES_UPTODATE;
 }
 
@@ -458,7 +472,7 @@ static int perf_ibs_handle_irq(struct perf_ibs *perf_ibs, struct pt_regs *iregs)
 	struct perf_ibs_data ibs_data;
 	int offset, size, check_rip, offset_max, throttle = 0;
 	unsigned int msr;
-	u64 *buf, config;
+	u64 *buf, *config, period;
 
 	if (!test_bit(IBS_STARTED, pcpu->state)) {
 		/* Catch spurious interrupts after stopping IBS: */
@@ -479,15 +493,15 @@ static int perf_ibs_handle_irq(struct perf_ibs *perf_ibs, struct pt_regs *iregs)
 	 * supported in all cpus. As this triggered an interrupt, we
 	 * set the current count to the max count.
 	 */
-	config = ibs_data.regs[0];
+	config = &ibs_data.regs[0];
 	if (perf_ibs == &perf_ibs_op && !(ibs_caps & IBS_CAPS_RDWROPCNT)) {
-		config &= ~IBS_OP_CUR_CNT;
-		config |= (config & IBS_OP_MAX_CNT) << 36;
+		*config &= ~IBS_OP_CUR_CNT;
+		*config |= (*config & IBS_OP_MAX_CNT) << 36;
 	}
 
 	perf_ibs_event_update(perf_ibs, event, config);
 	perf_sample_data_init(&data, 0, hwc->last_period);
-	if (!perf_ibs_set_period(perf_ibs, hwc, &config))
+	if (!perf_ibs_set_period(perf_ibs, hwc, &period))
 		goto out;	/* no sw counter overflow */
 
 	ibs_data.caps = ibs_caps;
@@ -523,8 +537,10 @@ static int perf_ibs_handle_irq(struct perf_ibs *perf_ibs, struct pt_regs *iregs)
 		throttle = perf_event_overflow(event, &data, &regs);
 out:
-	config = (config >> 4) | (throttle ? 0 : perf_ibs->enable_mask);
-	perf_ibs_enable_event(hwc, config);
+	if (throttle)
+		perf_ibs_disable_event(perf_ibs, hwc, *config);
+	else
+		perf_ibs_enable_event(perf_ibs, hwc, period >> 4);
 
 	perf_event_update_userpage(event);
 
-- 
1.7.8.4
After disabling IBS there could still be incoming NMIs with samples
that even have the valid bit cleared. Mark all these NMIs as handled
to avoid spurious interrupt messages.

Signed-off-by: Robert Richter <robert.richter@amd.com>
---
 arch/x86/kernel/cpu/perf_event_amd_ibs.c |   12 +++++++-----
 1 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_amd_ibs.c b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
index 35a35be..b44aa63 100644
--- a/arch/x86/kernel/cpu/perf_event_amd_ibs.c
+++ b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
@@ -475,11 +475,13 @@ static int perf_ibs_handle_irq(struct perf_ibs *perf_ibs, struct pt_regs *iregs)
 	u64 *buf, *config, period;
 
 	if (!test_bit(IBS_STARTED, pcpu->state)) {
-		/* Catch spurious interrupts after stopping IBS: */
-		if (!test_and_clear_bit(IBS_STOPPING, pcpu->state))
-			return 0;
-		rdmsrl(perf_ibs->msr, *ibs_data.regs);
-		return (*ibs_data.regs & perf_ibs->valid_mask) ? 1 : 0;
+		/*
+		 * Catch spurious interrupts after stopping IBS: After
+		 * disabling IBS there could still be incoming NMIs
+		 * with samples that even have the valid bit cleared.
+		 * Mark all these NMIs as handled.
+		 */
+		return test_and_clear_bit(IBS_STOPPING, pcpu->state) ? 1 : 0;
 	}
 
 	msr = hwc->config_base;
-- 
1.7.8.4
The value of IbsOpCurCnt rolls over when it reaches IbsOpMaxCnt; it is
then reset to zero by hardware. To get the correct count we need to
add the max count to it in case we received an ibs sample (valid bit
set).

Signed-off-by: Robert Richter <robert.richter@amd.com>
---
 arch/x86/kernel/cpu/perf_event_amd_ibs.c |   33 +++++++++++++++++++----------
 1 files changed, 21 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_amd_ibs.c b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
index b44aa63..0dfe952 100644
--- a/arch/x86/kernel/cpu/perf_event_amd_ibs.c
+++ b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
@@ -288,7 +288,15 @@ static u64 get_ibs_fetch_count(u64 config)
 
 static u64 get_ibs_op_count(u64 config)
 {
-	return (config & IBS_OP_CUR_CNT) >> 32;
+	u64 count = 0;
+
+	if (config & IBS_OP_VAL)
+		count += (config & IBS_OP_MAX_CNT) << 4; /* cnt rolled over */
+
+	if (ibs_caps & IBS_CAPS_RDWROPCNT)
+		count += (config & IBS_OP_CUR_CNT) >> 32;
+
+	return count;
 }
 
 static void
@@ -297,7 +305,12 @@ perf_ibs_event_update(struct perf_ibs *perf_ibs, struct perf_event *event,
 {
 	u64 count = perf_ibs->get_count(*config);
 
-	while (!perf_event_try_update(event, count, 20)) {
+	/*
+	 * Set width to 64 since we do not overflow on max width but
+	 * instead on max count. In perf_ibs_set_period() we clear
+	 * prev count manually on overflow.
+	 */
+	while (!perf_event_try_update(event, count, 64)) {
 		rdmsrl(event->hw.config_base, *config);
 		count = perf_ibs->get_count(*config);
 	}
@@ -376,6 +389,12 @@ static void perf_ibs_stop(struct perf_event *event, int flags)
 	if (hwc->state & PERF_HES_UPTODATE)
 		return;
 
+	/*
+	 * Clear valid bit to not count rollovers on update, rollovers
+	 * are only updated in the irq handler.
+	 */
+	config &= ~perf_ibs->valid_mask;
+
 	perf_ibs_event_update(perf_ibs, event, &config);
 	hwc->state |= PERF_HES_UPTODATE;
 }
 
@@ -490,17 +509,7 @@ static int perf_ibs_handle_irq(struct perf_ibs *perf_ibs, struct pt_regs *iregs)
 	if (!(*buf++ & perf_ibs->valid_mask))
 		return 0;
 
-	/*
-	 * Emulate IbsOpCurCnt in MSRC001_1033 (IbsOpCtl), not
-	 * supported in all cpus. As this triggered an interrupt, we
-	 * set the current count to the max count.
-	 */
 	config = &ibs_data.regs[0];
-	if (perf_ibs == &perf_ibs_op && !(ibs_caps & IBS_CAPS_RDWROPCNT)) {
-		*config &= ~IBS_OP_CUR_CNT;
-		*config |= (*config & IBS_OP_MAX_CNT) << 36;
-	}
-
 	perf_ibs_event_update(perf_ibs, event, config);
 	perf_sample_data_init(&data, 0, hwc->last_period);
 	if (!perf_ibs_set_period(perf_ibs, hwc, &period))
-- 
1.7.8.4
* Robert Richter <robert.richter@amd.com> wrote:
> perf record -a -e cpu-cycles:p ... # use ibs op counting cycle count
Cool - this makes IBS really useful!
Mind posting some perf annotate output of any well-known kernel
function showing skiddy '-e cpu-cycles' output versus skid-less
'-e cpu-cycles:p' output?
I'm curious how well this works in practice.
Thanks,
Ingo
On 02.04.12 21:11:23, Ingo Molnar wrote:
> Mind posting some perf annotate output of any well-known kernel
> function showing skiddy '-e cpu-cycles' output versus skid-less
> '-e cpu-cycles:p' output?
This is what I got for _raw_spin_lock_irqsave (first perfctr, second
ibs).
-Robert
# perf annotate -k vmlinux -s _raw_spin_lock_irqsave -i perf-r076.data | cat
Percent | Source code & Disassembly of vmlinux
------------------------------------------------
:
:
:
: Disassembly of section .text:
:
: ffffffff8145036a <_raw_spin_lock_irqsave>:
0.00 : ffffffff8145036a: push %rbp
0.00 : ffffffff8145036b: mov %rsp,%rbp
0.00 : ffffffff8145036e: callq ffffffff81456c40 <mcount>
0.00 : ffffffff81450373: pushfq
0.00 : ffffffff81450374: pop %rax
0.00 : ffffffff81450375: cli
0.00 : ffffffff81450376: mov $0x100,%edx
0.00 : ffffffff8145037b: lock xadd %dx,(%rdi)
0.00 : ffffffff81450380: mov %dl,%cl
0.00 : ffffffff81450382: shr $0x8,%dx
10.34 : ffffffff81450386: cmp %dl,%cl
0.00 : ffffffff81450388: je ffffffff81450390 <_raw_spin_lock_irqsave+0x26>
10.34 : ffffffff8145038a: pause
65.52 : ffffffff8145038c: mov (%rdi),%cl
13.79 : ffffffff8145038e: jmp ffffffff81450386 <_raw_spin_lock_irqsave+0x1c>
0.00 : ffffffff81450390: leaveq
0.00 : ffffffff81450391: retq
# perf annotate -k vmlinux -s _raw_spin_lock_irqsave -i perf-r076pp.data | cat
Percent | Source code & Disassembly of vmlinux
------------------------------------------------
:
:
:
: Disassembly of section .text:
:
: ffffffff8145036a <_raw_spin_lock_irqsave>:
0.00 : ffffffff8145036a: push %rbp
0.00 : ffffffff8145036b: mov %rsp,%rbp
0.00 : ffffffff8145036e: callq ffffffff81456c40 <mcount>
0.00 : ffffffff81450373: pushfq
0.00 : ffffffff81450374: pop %rax
0.00 : ffffffff81450375: cli
0.00 : ffffffff81450376: mov $0x100,%edx
0.00 : ffffffff8145037b: lock xadd %dx,(%rdi)
2.78 : ffffffff81450380: mov %dl,%cl
0.00 : ffffffff81450382: shr $0x8,%dx
2.78 : ffffffff81450386: cmp %dl,%cl
11.11 : ffffffff81450388: je ffffffff81450390 <_raw_spin_lock_irqsave+0x26>
72.22 : ffffffff8145038a: pause
2.78 : ffffffff8145038c: mov (%rdi),%cl
8.33 : ffffffff8145038e: jmp ffffffff81450386 <_raw_spin_lock_irqsave+0x1c>
0.00 : ffffffff81450390: leaveq
0.00 : ffffffff81450391: retq
--
Advanced Micro Devices, Inc.
Operating System Research Center
On Mon, 2012-04-02 at 20:19 +0200, Robert Richter wrote:
> + * We map IBS sampling to following precise levels:
> + *
> + * 1: RIP taken from IBS sample or (if invalid) from stack
> + * 2: RIP always taken from IBS sample, samples with an invalid rip
> + * are dropped. Thus samples of an event containing two precise
> + * modifiers (e.g. r076:pp) only contain (precise) addresses
> + * detected with IBS.
/*
* precise_ip:
*
* 0 - SAMPLE_IP can have arbitrary skid
* 1 - SAMPLE_IP must have constant skid
* 2 - SAMPLE_IP requested to have 0 skid
* 3 - SAMPLE_IP must have 0 skid
*
* See also PERF_RECORD_MISC_EXACT_IP
*/
your 1 doesn't have constant skid. I would suggest only supporting 2 and
letting userspace drop !PERF_RECORD_MISC_EXACT_IP records if so desired.
That said, mixing the IBS pmu into the regular core pmu isn't exactly
pretty..
On Mon, 2012-04-02 at 20:19 +0200, Robert Richter wrote:
> + * IbsOpCntCtl (bit 19) of IBS Execution Control Register (IbsOpCtl,
> + * MSRC001_1033) is used to select either cycle or micro-ops counting
> + * mode.
Ah is that what it does.. the BKDG doesn't appear to say this.
On Mon, 2012-04-02 at 20:19 +0200, Robert Richter wrote:
> + switch (event->attr.type) {
> + case PERF_TYPE_HARDWARE:
> + switch (event->attr.config) {
> + case PERF_COUNT_HW_CPU_CYCLES:
> + *config = 0;
> + return 0;
> + }
> + break;
> + case PERF_TYPE_RAW:
> + switch (event->attr.config) {
> + case 0x0076:
> + *config = 0;
> + return 0;
> + case 0x00C1:
> + *config = IBS_OP_CNT_CTL;
> + return 0;
> + }
> + break;
> + default:
> + return -ENOENT;
> + }
Another option would be to do this from amd_pmu_hw_config() after you've
already gotten rid of the whole attr.type thing.
On 14.04.12 12:22:10, Peter Zijlstra wrote:
> On Mon, 2012-04-02 at 20:19 +0200, Robert Richter wrote:
> > + * IbsOpCntCtl (bit 19) of IBS Execution Control Register (IbsOpCtl,
> > + * MSRC001_1033) is used to select either cycle or micro-ops counting
> > + * mode.
>
> Ah is that what it does.. the BKDG doesn't appear to say this.

"19 IbsOpCntCtl: periodic op counter count control. Revision B:
Reserved. Revision C: Read-write. Reset 0b. 1=Count dispatched ops
0=Count clock cycles."

It's here:

 MSRC001_1033 IBS Execution Control Register (IbsOpCtl)
 http://support.amd.com/us/Processor_TechDocs/31116.pdf

Ok, it might not be quite clear that "dispatched ops" is related to
EventSelect 0C1h Retired uops, but there is an exact mapping.

-Robert

--
Advanced Micro Devices, Inc.
Operating System Research Center
On 14.04.12 12:21:46, Peter Zijlstra wrote:
> On Mon, 2012-04-02 at 20:19 +0200, Robert Richter wrote:
> > + * We map IBS sampling to following precise levels:
> > + *
> > + * 1: RIP taken from IBS sample or (if invalid) from stack
> > + * 2: RIP always taken from IBS sample, samples with an invalid rip
> > + *    are dropped. Thus samples of an event containing two precise
> > + *    modifiers (e.g. r076:pp) only contain (precise) addresses
> > + *    detected with IBS.
>
> /*
>  * precise_ip:
>  *
>  *  0 - SAMPLE_IP can have arbitrary skid
>  *  1 - SAMPLE_IP must have constant skid
>  *  2 - SAMPLE_IP requested to have 0 skid
>  *  3 - SAMPLE_IP must have 0 skid
>  *
>  * See also PERF_RECORD_MISC_EXACT_IP
>  */
>
> your 1 doesn't have constant skid. I would suggest only supporting 2 and
> letting userspace drop !PERF_RECORD_MISC_EXACT_IP records if so desired.

Ah, didn't notice the PERF_RECORD_MISC_EXACT_IP flag. Will set this
flag for precise events.

Problem is that this flag is not yet well supported, only perf-top
uses it to count the total number of exact samples. Esp. perf-annotate
and perf-report do not support it, and there are no modifiers to
select precise-only sampling (or is this level 3?).

Both might be useful: You might need only precise-rip samples
(perf-annotate usage), on the other hand you want samples with every
clock/ops count overflow (e.g. to get a counting statistic). The
p-modifier specification (see perf-list) is not sufficient to select
both of it.

Another question I have: Isn't precise level 2 a special case of level
1 where the skid is constant and 0? The problem I see is, if people
want to measure precise rip, they simply use r076:p. Level 2 (r076:pp)
is actually better than 1, but they might think they are not able to
sample precise-rip if we throw an error for r076:p. Thus, I would
prefer to also allow level 1.

> That said, mixing the IBS pmu into the regular core pmu isn't exactly
> pretty..

IBS is currently the only way to do precise-rip sampling on amd cpus.
IBS events fit well with their corresponding perfctr events (0x76/
0xc1). So what don't you like with this approach? I will also post IBS
perf tool support where IBS can be directly used.

-Robert

--
Advanced Micro Devices, Inc.
Operating System Research Center
On 14.04.12 12:24:58, Peter Zijlstra wrote:
> On Mon, 2012-04-02 at 20:19 +0200, Robert Richter wrote:
> > + switch (event->attr.type) {
> > + case PERF_TYPE_HARDWARE:
> > + switch (event->attr.config) {
> > + case PERF_COUNT_HW_CPU_CYCLES:
> > + *config = 0;
> > + return 0;
> > + }
> > + break;
> > + case PERF_TYPE_RAW:
> > + switch (event->attr.config) {
> > + case 0x0076:
> > + *config = 0;
> > + return 0;
> > + case 0x00C1:
> > + *config = IBS_OP_CNT_CTL;
> > + return 0;
> > + }
> > + break;
> > + default:
> > + return -ENOENT;
> > + }
>
> Another option would be to do this from amd_pmu_hw_config() after you've
> already gotten rid of the whole attr.type thing.
I didn't want to have IBS setup code in amd_pmu_hw_config(). The
approach to pass the configuration for precise sampling to the ibs pmu
was the simplest to me.
-Robert
--
Advanced Micro Devices, Inc.
Operating System Research Center
On Mon, 2012-04-23 at 10:41 +0200, Robert Richter wrote:
> On 14.04.12 12:22:10, Peter Zijlstra wrote:
> > On Mon, 2012-04-02 at 20:19 +0200, Robert Richter wrote:
> > > + * IbsOpCntCtl (bit 19) of IBS Execution Control Register (IbsOpCtl,
> > > + * MSRC001_1033) is used to select either cycle or micro-ops counting
> > > + * mode.
> >
> > Ah is that what it does.. the BKDG doesn't appear to say this.
>
> "19 IbsOpCntCtl: periodic op counter count control. Revision B:
> Reserved. Revision C: Read-write. Reset 0b. 1=Count dispatched ops
> 0=Count clock cycles."
>
> It's here:
>
> MSRC001_1033 IBS Execution Control Register (IbsOpCtl)
> http://support.amd.com/us/Processor_TechDocs/31116.pdf
>
> Ok, it might not be quite clear that "dispatched ops" is related to
> EventSelect 0C1h Retired uops, but there is an exact mapping.
Ah, looks like my docs are stale.. my fam10 doc didn't have it specified
at all and my fam12 doc just listed the bit 19 as "periodic op counter
count control. Read-write." Which isn't very helpful.
Thanks!
On 23.04.12 11:56:59, Robert Richter wrote:
> On 14.04.12 12:21:46, Peter Zijlstra wrote:
> > On Mon, 2012-04-02 at 20:19 +0200, Robert Richter wrote:
> > > + * We map IBS sampling to following precise levels:
> > > + *
> > > + * 1: RIP taken from IBS sample or (if invalid) from stack
> > > + * 2: RIP always taken from IBS sample, samples with an invalid rip
> > > + *    are dropped. Thus samples of an event containing two precise
> > > + *    modifiers (e.g. r076:pp) only contain (precise) addresses
> > > + *    detected with IBS.
> >
> > /*
> >  * precise_ip:
> >  *
> >  *  0 - SAMPLE_IP can have arbitrary skid
> >  *  1 - SAMPLE_IP must have constant skid
> >  *  2 - SAMPLE_IP requested to have 0 skid
> >  *  3 - SAMPLE_IP must have 0 skid
> >  *
> >  * See also PERF_RECORD_MISC_EXACT_IP
> >  */
> >
> > your 1 doesn't have constant skid. I would suggest only supporting 2 and
> > letting userspace drop !PERF_RECORD_MISC_EXACT_IP records if so desired.
>
> Ah, didn't notice the PERF_RECORD_MISC_EXACT_IP flag. Will set this
> flag for precise events.

Peter,

I have a patch on top that implements the support of the
PERF_RECORD_MISC_EXACT_IP flag. But I am not quite sure about how to
use the precise levels. What do you suggest?

Thanks,

-Robert

[...]

--
Advanced Micro Devices, Inc.
Operating System Research Center
On Fri, Apr 27, 2012 at 2:34 PM, Robert Richter <robert.richter@amd.com> wrote:
> On 23.04.12 11:56:59, Robert Richter wrote:
>> On 14.04.12 12:21:46, Peter Zijlstra wrote:
>> > your 1 doesn't have constant skid. I would suggest only supporting 2 and
>> > letting userspace drop !PERF_RECORD_MISC_EXACT_IP records if so desired.
>>
>> Ah, didn't notice the PERF_RECORD_MISC_EXACT_IP flag. Will set this
>> flag for precise events.

Why not use 2? IBS has 0 skid, unless I am mistaken.

> Peter,
>
> I have a patch on top that implements the support of the
> PERF_RECORD_MISC_EXACT_IP flag. But I am not quite sure about how to
> use the precise levels. What do you suggest?
>
> Thanks,
>
> -Robert

[...]
On 27.04.12 14:39:21, Stephane Eranian wrote:
> On Fri, Apr 27, 2012 at 2:34 PM, Robert Richter <robert.richter@amd.com> wrote:
> > On 23.04.12 11:56:59, Robert Richter wrote:
> >> On 14.04.12 12:21:46, Peter Zijlstra wrote:
> >> > your 1 doesn't have constant skid. I would suggest only supporting 2 and
> >> > letting userspace drop !PERF_RECORD_MISC_EXACT_IP records if so desired.
> >>
> >> Ah, didn't notice the PERF_RECORD_MISC_EXACT_IP flag. Will set this
> >> flag for precise events.
>
> Why not use 2? IBS has 0 skid, unless I am mistaken.

Events with r076:p would fail then. But r076:pp is actually better and
a subset of level 1. Thus both levels should work.

And there is still the question how samples with imprecise rip should
be handled. Sometimes we want to get all samples and sometimes all
samples should always contain a precise rip, other samples should be
dropped then. But there is no option or modifier for this yet.

My suggestion was to use level 1 for all samples and level 2 for
samples that only contain a precise rip, saving level 3 for future
use.

-Robert

[...]

--
Advanced Micro Devices, Inc.
Operating System Research Center
Robert,
I did not follow the entire discussion, but based on your initial
post:
perf record -a -e cpu-cycles:p ... # use ibs op counting cycle count
perf record -a -e r076:p ... # same as -e cpu-cycles:p
perf record -a -e r0C1:p ... # use ibs op counting micro-ops
Each IBS sample contains a linear address that points to the
instruction that was causing the sample to trigger. With ibs we have
skid 0.
Though the skid is 0, we map IBS sampling to following precise levels:
1: RIP taken from IBS sample or (if invalid) from stack.
I assume by stack you mean pt_regs, right?
2: RIP always taken from IBS sample, samples with an invalid rip
are dropped. Thus samples of an event containing two precise
modifiers (e.g. r076:pp) only contain (precise) addresses
detected with IBS.
I don't think you need the distinction between 1 and 2. You can
always use the pt_regs IP as a fallback. You can mark that the
IP is precise with the MISC_EXACT flag in the sample header.
This is how it's done with PEBS. What's wrong with that?
It may actually be better than dropping samples silently as it
may introduce some bias.
On Fri, Apr 27, 2012 at 2:54 PM, Robert Richter <robert.richter@amd.com> wrote:
> On 27.04.12 14:39:21, Stephane Eranian wrote:
>> On Fri, Apr 27, 2012 at 2:34 PM, Robert Richter <robert.richter@amd.com> wrote:
>> > On 23.04.12 11:56:59, Robert Richter wrote:
>> >> On 14.04.12 12:21:46, Peter Zijlstra wrote:
>> >> > On Mon, 2012-04-02 at 20:19 +0200, Robert Richter wrote:
>> >> > > + * We map IBS sampling to following precise levels:
>> >> > > + *
>> >> > > + * 1: RIP taken from IBS sample or (if invalid) from stack
>> >> > > + * 2: RIP always taken from IBS sample, samples with an invalid rip
>> >> > > + * are dropped. Thus samples of an event containing two precise
>> >> > > + * modifiers (e.g. r076:pp) only contain (precise) addresses
>> >> > > + * detected with IBS.
>> >> >
>> >> > /*
>> >> > * precise_ip:
>> >> > *
>> >> > * 0 - SAMPLE_IP can have arbitrary skid
>> >> > * 1 - SAMPLE_IP must have constant skid
>> >> > * 2 - SAMPLE_IP requested to have 0 skid
>> >> > * 3 - SAMPLE_IP must have 0 skid
>> >> > *
>> >> > * See also PERF_RECORD_MISC_EXACT_IP
>> >> > */
>> >> >
>> >> > your 1 doesn't have constant skid. I would suggest only supporting 2 and
>> >> > letting userspace drop !PERF_RECORD_MISC_EXACT_IP records if so desired.
>> >>
>> >> Ah, didn't notice the PERF_RECORD_MISC_EXACT_IP flag. Will set this
>> >> flag for precise events.
>> >
>> Why not use 2? IBS has 0 skid, unless I am mistaken.
>
> Events with r076:p would fail then. But r076:pp is actually better and
> a subset of level 1. Thus both level should work.
>
> And there is still the question how samples with imprecise rip should
> be handled. Sometimes we want to get all samples and sometimes all
> samples should always contain a precise rip, other samples should be
> dropped then. But there is no option or modifier for this yet.
>
> My suggestions was to use level 1 for all samples and level 2 for
> samples that only contain a precise rip, saving level 3 for future
> use.
>
> -Robert
>
>>
>> > Peter,
>> >
>> > I have a patch on top that implements the support of the
>> > PERF_RECORD_MISC_EXACT_IP flag. But I am not quite sure about how to
>> > use the precise levels. What do you suggest?
>> >
>> > Thanks,
>> >
>> > -Robert
>> >
>> >>
>> >> Problem is that this flag is not yet well supported, only perf-top
>> >> uses it to count the total number of exact samples. Esp. perf-annotate
>> >> and perf-report do not support it, and there are no modifiers to
>> >> select precise-only sampling (or is this level 3?).
>> >>
>> >> Both might be useful: you might want only precise-rip samples (the
>> >> perf-annotate use case), but on the other hand you may want a sample
>> >> on every clock/ops count overflow (e.g. to get counting statistics).
>> >> The p-modifier specification (see perf-list) is not sufficient to
>> >> select both.
>> >>
>> >> Another question I have: isn't precise level 2 a special case of level
>> >> 1 where the skid is constant and 0? The problem I see is that if people
>> >> want to measure a precise rip, they will simply use r076:p. Level 2
>> >> (r076:pp) is actually better than 1, but they might conclude that
>> >> precise-rip sampling is unavailable if we throw an error for r076:p.
>> >> Thus, I would prefer to also allow level 1.
>> >>
>> >> > That said, mixing the IBS pmu into the regular core pmu isn't exactly
>> >> > pretty..
>> >>
>> >> IBS is currently the only way to do precise-rip sampling on AMD cpus.
>> >> IBS events fit well with their corresponding perfctr events
>> >> (0x76/0xc1). So what don't you like about this approach? I will also
>> >> post IBS perf tool support so that IBS can be used directly.
>> >>
>> >> -Robert
>> >>
>> >> --
>> >> Advanced Micro Devices, Inc.
>> >> Operating System Research Center
>> >
>> > --
>> > Advanced Micro Devices, Inc.
>> > Operating System Research Center
>> >
>>
>
> --
> Advanced Micro Devices, Inc.
> Operating System Research Center
>
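The p/pp modifiers discussed above end up in perf_event_attr.precise_ip on the kernel side. As a rough user-space sketch (the function name and parsing rule here are illustrative, not perf's actual parser in the perf tools), the modifier string maps to the precise level like this:

```c
#include <assert.h>

/*
 * Illustrative only: perf's real modifier parsing lives in the perf
 * tools; this just shows the p-count mapping discussed in the thread:
 * "p" -> precise_ip = 1, "pp" -> 2, "ppp" -> 3; anything else fails.
 */
int precise_level(const char *modifier)
{
	int level = 0;

	for (; *modifier == 'p'; modifier++)
		level++;

	/* trailing garbage or more than three p's is invalid */
	return (*modifier || level > 3) ? -1 : level;
}
```

With the levels proposed in this thread, a single p would select IBS with a pt_regs fallback rip, and pp would select precise-only samples.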
On 27.04.12 15:10:22, Stephane Eranian wrote:
> >  perf record -a -e cpu-cycles:p ...   # use ibs op counting cycle count
> >  perf record -a -e r076:p ...         # same as -e cpu-cycles:p
> >  perf record -a -e r0C1:p ...         # use ibs op counting micro-ops
> >
> > Each IBS sample contains a linear address that points to the
> > instruction that was causing the sample to trigger. With ibs we have
> > skid 0.
> >
> > Though the skid is 0, we map IBS sampling to following precise levels:
> >
> >  1: RIP taken from IBS sample or (if invalid) from stack.
>
> I assume by stack you mean pt_regs, right?

Right.

> >  2: RIP always taken from IBS sample, samples with an invalid rip
> >     are dropped. Thus samples of an event containing two precise
> >     modifiers (e.g. r076:pp) only contain (precise) addresses
> >     detected with IBS.
>
> I don't think you need the distinction between 1 and 2. You can
> always use the pt_regs IP as a fallback. You can mark that the
> IP is precise with the MISC_EXACT flag in the sample header.
> This is how it's done with PEBS. What's wrong with that?
> It may actually be better than dropping samples silently as it
> may introduce some bias.

There is nothing wrong with it. I have already implemented support for
the MISC_EXACT flag. But the flag is basically not used in the perf
tool, and there is no modifier or the like to request only precise
rips.

Suppose you want to use perf-annotate: then you only want precise
rips. With the levels suggested above you can do so with:

 perf record -a -e r076:pp ... | perf annotate ...

(Note the double-p.)

For non-biased sampling (e.g. counting or statistic numbers) you take
level 1 and you get every sample:

 perf record -a -e r076:p ...

There is no modifier to evaluate MISC_EXACT the same way. That's why I
chose the levels above. I didn't have a better idea.

-Robert

--
Advanced Micro Devices, Inc.
Operating System Research Center
On Fri, 2012-04-27 at 14:54 +0200, Robert Richter wrote:
> My suggestions was to use level 1 for all samples and level 2 for
> samples that only contain a precise rip, saving level 3 for future
> use.
No.
On Fri, 2012-04-27 at 17:18 +0200, Robert Richter wrote:
> There is nothing wrong with it. I already implemented that the
> MISC_EXACT flag is supported. But, the flag is basically not used in
> the perf tool and there is no modifier or so to only get a precise
> rip.
Just because userspace doesn't do the right thing doesn't mean it's a
good idea to wreck the kernel side of things.
Instead fix the userspace.
I'll simply not take patches that silently drops samples.
On Fri, Apr 27, 2012 at 5:30 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2012-04-27 at 17:18 +0200, Robert Richter wrote:
>> There is nothing wrong with it. I already implemented that the
>> MISC_EXACT flag is supported. But, the flag is basically not used in
>> the perf tool and there is no modifier or so to only get a precise
>> rip.
>
> Just because userspace doesn't dtrt doesnt mean its a good idea to wreck
> the kernel side of things.
>
> Instead fix the userspace.
>
I was going to suggest you add an option to perf annotate/report to
filter out non exact_ip samples. That can't be that hard to do.

> I'll simply not take patches that silently drops samples.
>
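Stephane's tool-side filter could look roughly like this. PERF_RECORD_MISC_EXACT_IP is the real flag from linux/perf_event.h (bit 14 of the sample header's misc field), but keep_sample() and the --exact-only option it models are only a sketch of such a perf annotate/report feature, not existing perf tools code:

```c
#include <assert.h>
#include <stdint.h>

/* flag value as defined in linux/perf_event.h */
#define PERF_RECORD_MISC_EXACT_IP	(1 << 14)

/*
 * Sketch of a tool-side filter: with a hypothetical --exact-only
 * option enabled, drop every sample whose header does not carry the
 * EXACT_IP flag set by the kernel (PEBS, or IBS with a valid rip).
 */
int keep_sample(uint16_t misc, int exact_only)
{
	if (!exact_only)
		return 1;	/* keep everything, unbiased statistics */
	return !!(misc & PERF_RECORD_MISC_EXACT_IP);
}
```

This keeps the kernel side simple: it always emits the sample and just marks rip precision, and the bias/no-bias decision moves into userspace.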
On 27.04.12 17:30:33, Peter Zijlstra wrote:
> On Fri, 2012-04-27 at 14:54 +0200, Robert Richter wrote:
> > My suggestions was to use level 1 for all samples and level 2 for
> > samples that only contain a precise rip, saving level 3 for future
> > use.
>
> No.
Ok, I will look at how to handle this in userspace.
But do you agree to have levels 1 and 2 mapped to ibs, not just
level 2 (since I don't want to fail with r076:p)?
-Robert
--
Advanced Micro Devices, Inc.
Operating System Research Center
On Fri, 2012-04-27 at 18:09 +0200, Robert Richter wrote:
>
> But do you agree to have level 1 and 2 mapped to ibs, not just only
> level 2 (since I don't want to fail with r076:p)?
Sure, you can have 1 and 2 mean the same. 2 wants 0 skid, 1 wants
constant skid, and 0 is a constant, therefore it's consistent.
On Fri, Apr 27, 2012 at 6:21 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2012-04-27 at 18:09 +0200, Robert Richter wrote:
>>
>> But do you agree to have level 1 and 2 mapped to ibs, not just only
>> level 2 (since I don't want to fail with r076:p)?
>
> Sure, you can have 1 and 2 mean the same. 2 wants 0 skid, 1 wants
> constant skid, 0 is a constant, therefore its consistent.
Yes. With IBS I would expect no difference in the samples between
level 1 and 2. But that's okay. As Peter says, it still fits within the
definitions of those levels.
Updated version. Level 1 and 2 are handled the same way now. Don't
drop samples in precise level 2 if rip is invalid, instead support the
PERF_EFLAGS_EXACT flag.

No changes in other patches of

[PATCH 00/12] perf/x86-ibs: Precise event sampling with IBS for AMD CPUs

-Robert

>From 6d646cefdea9958c3401110caecc958b41f6e84d Mon Sep 17 00:00:00 2001
From: Robert Richter <robert.richter@amd.com>
Date: Mon, 12 Mar 2012 12:54:32 +0100
Subject: [PATCH] perf/x86-ibs: Precise event sampling with IBS for AMD CPUs

This patch adds support for precise event sampling with IBS. There are
two counting modes to count either cycles or micro-ops. If the
corresponding performance counter events (hw events) are setup with
the precise flag set, the request is redirected to the ibs pmu:

 perf record -a -e cpu-cycles:p ...    # use ibs op counting cycle count
 perf record -a -e r076:p ...          # same as -e cpu-cycles:p
 perf record -a -e r0C1:p ...          # use ibs op counting micro-ops

Each ibs sample contains a linear address that points to the
instruction that was causing the sample to trigger. With ibs we have
skid 0. Thus, ibs supports precise levels 1 and 2. Samples are marked
with the PERF_EFLAGS_EXACT flag set. In rare cases the rip is invalid
when IBS was not able to record the rip correctly. Then the
PERF_EFLAGS_EXACT flag is cleared and the rip is taken from pt_regs.

V2:
* don't drop samples in precise level 2 if rip is invalid, instead
  support the PERF_EFLAGS_EXACT flag

Signed-off-by: Robert Richter <robert.richter@amd.com>
---
 arch/x86/kernel/cpu/perf_event_amd.c     |    7 +++-
 arch/x86/kernel/cpu/perf_event_amd_ibs.c |   73 ++++++++++++++++++++++++++++-
 2 files changed, 76 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c
index 95e7fe1..4be3463 100644
--- a/arch/x86/kernel/cpu/perf_event_amd.c
+++ b/arch/x86/kernel/cpu/perf_event_amd.c
@@ -134,8 +134,13 @@ static u64 amd_pmu_event_map(int hw_event)
 
 static int amd_pmu_hw_config(struct perf_event *event)
 {
-	int ret = x86_pmu_hw_config(event);
+	int ret;
 
+	/* pass precise event sampling to ibs: */
+	if (event->attr.precise_ip && get_ibs_caps())
+		return -ENOENT;
+
+	ret = x86_pmu_hw_config(event);
 	if (ret)
 		return ret;
 
diff --git a/arch/x86/kernel/cpu/perf_event_amd_ibs.c b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
index 0321b64..117b0aa 100644
--- a/arch/x86/kernel/cpu/perf_event_amd_ibs.c
+++ b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
@@ -145,17 +145,80 @@ static struct perf_ibs *get_ibs_pmu(int type)
 	return NULL;
 }
 
+/*
+ * Use IBS for precise event sampling:
+ *
+ *  perf record -a -e cpu-cycles:p ...    # use ibs op counting cycle count
+ *  perf record -a -e r076:p ...          # same as -e cpu-cycles:p
+ *  perf record -a -e r0C1:p ...          # use ibs op counting micro-ops
+ *
+ * IbsOpCntCtl (bit 19) of IBS Execution Control Register (IbsOpCtl,
+ * MSRC001_1033) is used to select either cycle or micro-ops counting
+ * mode.
+ *
+ * The rip of IBS samples has skid 0. Thus, IBS supports precise
+ * levels 1 and 2 and the PERF_EFLAGS_EXACT is set. In rare cases the
+ * rip is invalid when IBS was not able to record the rip correctly.
+ * We clear PERF_EFLAGS_EXACT and take the rip from pt_regs then.
+ *
+ */
+static int perf_ibs_precise_event(struct perf_event *event, u64 *config)
+{
+	switch (event->attr.precise_ip) {
+	case 0:
+		return -ENOENT;
+	case 1:
+	case 2:
+		break;
+	default:
+		return -EOPNOTSUPP;
+	}
+
+	switch (event->attr.type) {
+	case PERF_TYPE_HARDWARE:
+		switch (event->attr.config) {
+		case PERF_COUNT_HW_CPU_CYCLES:
+			*config = 0;
+			return 0;
+		}
+		break;
+	case PERF_TYPE_RAW:
+		switch (event->attr.config) {
+		case 0x0076:
+			*config = 0;
+			return 0;
+		case 0x00C1:
+			*config = IBS_OP_CNT_CTL;
+			return 0;
+		}
+		break;
+	default:
+		return -ENOENT;
+	}
+
+	return -EOPNOTSUPP;
+}
+
 static int perf_ibs_init(struct perf_event *event)
 {
 	struct hw_perf_event *hwc = &event->hw;
 	struct perf_ibs *perf_ibs;
 	u64 max_cnt, config;
+	int ret;
 
 	perf_ibs = get_ibs_pmu(event->attr.type);
-	if (!perf_ibs)
+	if (perf_ibs) {
+		config = event->attr.config;
+	} else {
+		perf_ibs = &perf_ibs_op;
+		ret = perf_ibs_precise_event(event, &config);
+		if (ret)
+			return ret;
+	}
+
+	if (event->pmu != &perf_ibs->pmu)
 		return -ENOENT;
 
-	config = event->attr.config;
 	if (config & ~perf_ibs->config_mask)
 		return -EINVAL;
 
@@ -437,8 +500,12 @@ static int perf_ibs_handle_irq(struct perf_ibs *perf_ibs, struct pt_regs *iregs)
 	ibs_data.size = sizeof(u64) * size;
 
 	regs = *iregs;
-	if (!check_rip || !(ibs_data.regs[2] & IBS_RIP_INVALID))
+	if (check_rip && (ibs_data.regs[2] & IBS_RIP_INVALID)) {
+		regs.flags &= ~PERF_EFLAGS_EXACT;
+	} else {
 		instruction_pointer_set(&regs, ibs_data.regs[1]);
+		regs.flags |= PERF_EFLAGS_EXACT;
+	}
 
 	if (event->attr.sample_type & PERF_SAMPLE_RAW) {
 		raw.size = sizeof(u32) + ibs_data.size;
-- 
1.7.8.4

-- 
Advanced Micro Devices, Inc.
Operating System Research Center
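The event-mapping logic in perf_ibs_precise_event() above can be exercised in isolation. Here is a user-space mirror of it for testing; the constants are copied from the patch and the perf uapi, while the function signature is simplified (plain arguments instead of struct perf_event):

```c
#include <assert.h>
#include <stdint.h>
#include <errno.h>

#define IBS_OP_CNT_CTL		(1ULL << 19)	/* IbsOpCntCtl in IbsOpCtl */
#define PERF_TYPE_HARDWARE	0
#define PERF_TYPE_RAW		4
#define PERF_COUNT_HW_CPU_CYCLES 0

/* user-space mirror of perf_ibs_precise_event() from the patch */
int ibs_precise_event(int precise_ip, uint32_t type, uint64_t event,
		      uint64_t *config)
{
	switch (precise_ip) {
	case 0:
		return -ENOENT;		/* not a precise request */
	case 1:
	case 2:
		break;			/* both map to IBS, skid 0 */
	default:
		return -EOPNOTSUPP;	/* level 3 reserved */
	}

	switch (type) {
	case PERF_TYPE_HARDWARE:
		if (event == PERF_COUNT_HW_CPU_CYCLES) {
			*config = 0;			/* cycle counting */
			return 0;
		}
		break;
	case PERF_TYPE_RAW:
		if (event == 0x0076) {			/* cpu-cycles */
			*config = 0;
			return 0;
		}
		if (event == 0x00C1) {			/* retired uops */
			*config = IBS_OP_CNT_CTL;	/* uop counting */
			return 0;
		}
		break;
	default:
		return -ENOENT;
	}

	return -EOPNOTSUPP;
}
```

This makes the redirection policy visible: only cycles and uops events with precise_ip 1 or 2 reach the ibs op pmu; everything else falls through with an error.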
On Wed, 2012-05-02 at 12:33 +0200, Robert Richter wrote:
> Updated version. Level 1 and 2 are handled the same way now. Don't
> drop samples in precise level 2 if rip is invalid, instead support the
> PERF_EFLAGS_EXACT flag.
>
> No changes in other patches of
>
> [PATCH 00/12] perf/x86-ibs: Precise event sampling with IBS for AMD CPUs
Thanks!, I managed to stomp all patches on top of -tip and shall be
trying it out on my aging opteron-1216.
On Wed, 2012-05-02 at 13:14 +0200, Peter Zijlstra wrote:
> On Wed, 2012-05-02 at 12:33 +0200, Robert Richter wrote:
> > Updated version. Level 1 and 2 are handled the same way now. Don't
> > drop samples in precise level 2 if rip is invalid, instead support the
> > PERF_EFLAGS_EXACT flag.
> >
> > No changes in other patches of
> >
> > [PATCH 00/12] perf/x86-ibs: Precise event sampling with IBS for AMD CPUs
>
> Thanks!, I managed to stomp all patches on top of -tip and shall be
> trying it out on my aging opteron-1216.
Hmm, that box isn't reporting X86_FEATURE_IBS, a quick trip to Wikipedia
tells me this is a K8 (Santa Ana), not Fam 10h. Means I don't actually
have any hardware to test this on :-(
I'll have to throw it to Ingo then, IIRC he's got an Istanbul part.
Commit-ID:  c75841a398d667d9968245b9519d93cedbfb4780
Gitweb:     http://git.kernel.org/tip/c75841a398d667d9968245b9519d93cedbfb4780
Author:     Robert Richter <robert.richter@amd.com>
AuthorDate: Mon, 2 Apr 2012 20:19:07 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 May 2012 15:23:11 +0200

perf/x86-ibs: Fix update of period

The last sw period was not correctly updated on overflow and thus led
to wrong distribution of events. We always need to properly initialize
data.period in struct perf_sample_data.

Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1333390758-10893-2-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/cpu/perf_event_amd_ibs.c |   27 ++++++++++++++-------------
 1 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_amd_ibs.c b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
index 8ff74d4..c8f69be 100644
--- a/arch/x86/kernel/cpu/perf_event_amd_ibs.c
+++ b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
@@ -386,7 +386,21 @@ static int perf_ibs_handle_irq(struct perf_ibs *perf_ibs, struct pt_regs *iregs)
 	if (!(*buf++ & perf_ibs->valid_mask))
 		return 0;
 
+	/*
+	 * Emulate IbsOpCurCnt in MSRC001_1033 (IbsOpCtl), not
+	 * supported in all cpus. As this triggered an interrupt, we
+	 * set the current count to the max count.
+	 */
+	config = ibs_data.regs[0];
+	if (perf_ibs == &perf_ibs_op && !(ibs_caps & IBS_CAPS_RDWROPCNT)) {
+		config &= ~IBS_OP_CUR_CNT;
+		config |= (config & IBS_OP_MAX_CNT) << 36;
+	}
+
+	perf_ibs_event_update(perf_ibs, event, config);
 	perf_sample_data_init(&data, 0);
+	data.period = event->hw.last_period;
+
 	if (event->attr.sample_type & PERF_SAMPLE_RAW) {
 		ibs_data.caps = ibs_caps;
 		size = 1;
@@ -405,19 +419,6 @@ static int perf_ibs_handle_irq(struct perf_ibs *perf_ibs, struct pt_regs *iregs)
 
 	regs = *iregs; /* XXX: update ip from ibs sample */
 
-	/*
-	 * Emulate IbsOpCurCnt in MSRC001_1033 (IbsOpCtl), not
-	 * supported in all cpus. As this triggered an interrupt, we
-	 * set the current count to the max count.
-	 */
-	config = ibs_data.regs[0];
-	if (perf_ibs == &perf_ibs_op && !(ibs_caps & IBS_CAPS_RDWROPCNT)) {
-		config &= ~IBS_OP_CUR_CNT;
-		config |= (config & IBS_OP_MAX_CNT) << 36;
-	}
-
-	perf_ibs_event_update(perf_ibs, event, config);
-
 	overflow = perf_ibs_set_period(perf_ibs, hwc, &config);
 	reenable = !(overflow && perf_event_overflow(event, &data, &regs));
 	config = (config >> 4) | (reenable ? perf_ibs->enable_mask : 0);
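The IbsOpCurCnt emulation moved above is pure bit arithmetic on the IbsOpCtl register value and can be checked standalone. The masks are copied from arch/x86/include/asm/perf_event.h; the helper name is ours:

```c
#include <assert.h>
#include <stdint.h>

/* masks as in arch/x86/include/asm/perf_event.h */
#define IBS_OP_CUR_CNT	(0xFFFF0ULL << 32)	/* bits 51:36 */
#define IBS_OP_MAX_CNT	0x0000FFFFULL		/* bits 15:0 */

/*
 * On cpus without IBS_CAPS_RDWROPCNT the current count cannot be
 * read back; since the NMI fired, assume it reached the max count:
 * copy IbsOpMaxCnt into the IbsOpCurCnt field.
 */
uint64_t emulate_op_cur_cnt(uint64_t config)
{
	config &= ~IBS_OP_CUR_CNT;
	config |= (config & IBS_OP_MAX_CNT) << 36;
	return config;
}
```

The shift by 36 lines up the 16-bit max-count field with bits 51:36 of the register, where the current count lives (its lower 4 bits are ignored by the hardware).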
Commit-ID:  fd0d000b2c34aa43d4e92dcf0dfaeda7e123008a
Gitweb:     http://git.kernel.org/tip/fd0d000b2c34aa43d4e92dcf0dfaeda7e123008a
Author:     Robert Richter <robert.richter@amd.com>
AuthorDate: Mon, 2 Apr 2012 20:19:08 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 May 2012 15:23:12 +0200

perf: Pass last sampling period to perf_sample_data_init()

We always need to pass the last sample period to
perf_sample_data_init(), otherwise the event distribution will be
wrong. Thus, modifying the function interface to take the required
period as an argument. So basically a pattern like this:

        perf_sample_data_init(&data, ~0ULL);
        data.period = event->hw.last_period;

will now be like that:

        perf_sample_data_init(&data, ~0ULL, event->hw.last_period);

Avoids uninitialized data.period and simplifies code.

Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1333390758-10893-3-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/alpha/kernel/perf_event.c            |    3 +--
 arch/arm/kernel/perf_event_v6.c           |    4 +---
 arch/arm/kernel/perf_event_v7.c           |    4 +---
 arch/arm/kernel/perf_event_xscale.c       |    8 ++------
 arch/mips/kernel/perf_event_mipsxx.c      |    2 +-
 arch/powerpc/perf/core-book3s.c           |    3 +--
 arch/powerpc/perf/core-fsl-emb.c          |    3 +--
 arch/sparc/kernel/perf_event.c            |    4 +---
 arch/x86/kernel/cpu/perf_event.c          |    4 +---
 arch/x86/kernel/cpu/perf_event_amd_ibs.c  |    3 +--
 arch/x86/kernel/cpu/perf_event_intel.c    |    4 +---
 arch/x86/kernel/cpu/perf_event_intel_ds.c |    6 ++----
 arch/x86/kernel/cpu/perf_event_p4.c       |    6 +++---
 include/linux/perf_event.h                |    5 ++++-
 kernel/events/core.c                      |    9 ++++-----
 15 files changed, 25 insertions(+), 43 deletions(-)

diff --git a/arch/alpha/kernel/perf_event.c b/arch/alpha/kernel/perf_event.c
index 0dae252..d821b17 100644
--- a/arch/alpha/kernel/perf_event.c
+++ b/arch/alpha/kernel/perf_event.c
@@ -824,7 +824,6 @@ static void alpha_perf_event_irq_handler(unsigned long la_ptr,
 
 	idx = la_ptr;
 
-	perf_sample_data_init(&data, 0);
 	for (j = 0; j < cpuc->n_events; j++) {
 		if (cpuc->current_idx[j] == idx)
 			break;
@@ -848,7 +847,7 @@ static void alpha_perf_event_irq_handler(unsigned long la_ptr,
 
 		hwc = &event->hw;
 		alpha_perf_event_update(event, hwc, idx, alpha_pmu->pmc_max_period[idx]+1);
-		data.period = event->hw.last_period;
+		perf_sample_data_init(&data, 0, hwc->last_period);
 
 		if (alpha_perf_event_set_period(event, hwc, idx)) {
 			if (perf_event_overflow(event, &data, regs)) {
diff --git a/arch/arm/kernel/perf_event_v6.c b/arch/arm/kernel/perf_event_v6.c
index b78af0c..ab627a7 100644
--- a/arch/arm/kernel/perf_event_v6.c
+++ b/arch/arm/kernel/perf_event_v6.c
@@ -489,8 +489,6 @@ armv6pmu_handle_irq(int irq_num,
 	 */
 	armv6_pmcr_write(pmcr);
 
-	perf_sample_data_init(&data, 0);
-
 	cpuc = &__get_cpu_var(cpu_hw_events);
 	for (idx = 0; idx < cpu_pmu->num_events; ++idx) {
 		struct perf_event *event = cpuc->events[idx];
@@ -509,7 +507,7 @@ armv6pmu_handle_irq(int irq_num,
 
 		hwc = &event->hw;
 		armpmu_event_update(event, hwc, idx);
-		data.period = event->hw.last_period;
+		perf_sample_data_init(&data, 0, hwc->last_period);
 		if (!armpmu_event_set_period(event, hwc, idx))
 			continue;
 
diff --git a/arch/arm/kernel/perf_event_v7.c b/arch/arm/kernel/perf_event_v7.c
index 00755d8..d3c5360 100644
--- a/arch/arm/kernel/perf_event_v7.c
+++ b/arch/arm/kernel/perf_event_v7.c
@@ -1077,8 +1077,6 @@ static irqreturn_t armv7pmu_handle_irq(int irq_num, void *dev)
 	 */
 	regs = get_irq_regs();
 
-	perf_sample_data_init(&data, 0);
-
 	cpuc = &__get_cpu_var(cpu_hw_events);
 	for (idx = 0; idx < cpu_pmu->num_events; ++idx) {
 		struct perf_event *event = cpuc->events[idx];
@@ -1097,7 +1095,7 @@ static irqreturn_t armv7pmu_handle_irq(int irq_num, void *dev)
 
 		hwc = &event->hw;
 		armpmu_event_update(event, hwc, idx);
-		data.period = event->hw.last_period;
+		perf_sample_data_init(&data, 0, hwc->last_period);
 		if (!armpmu_event_set_period(event, hwc, idx))
 			continue;
 
diff --git a/arch/arm/kernel/perf_event_xscale.c b/arch/arm/kernel/perf_event_xscale.c
index 71a21e6..e34e725 100644
--- a/arch/arm/kernel/perf_event_xscale.c
+++ b/arch/arm/kernel/perf_event_xscale.c
@@ -248,8 +248,6 @@ xscale1pmu_handle_irq(int irq_num, void *dev)
 
 	regs = get_irq_regs();
 
-	perf_sample_data_init(&data, 0);
-
 	cpuc = &__get_cpu_var(cpu_hw_events);
 	for (idx = 0; idx < cpu_pmu->num_events; ++idx) {
 		struct perf_event *event = cpuc->events[idx];
@@ -263,7 +261,7 @@ xscale1pmu_handle_irq(int irq_num, void *dev)
 
 		hwc = &event->hw;
 		armpmu_event_update(event, hwc, idx);
-		data.period = event->hw.last_period;
+		perf_sample_data_init(&data, 0, hwc->last_period);
 		if (!armpmu_event_set_period(event, hwc, idx))
 			continue;
 
@@ -588,8 +586,6 @@ xscale2pmu_handle_irq(int irq_num, void *dev)
 
 	regs = get_irq_regs();
 
-	perf_sample_data_init(&data, 0);
-
 	cpuc = &__get_cpu_var(cpu_hw_events);
 	for (idx = 0; idx < cpu_pmu->num_events; ++idx) {
 		struct perf_event *event = cpuc->events[idx];
@@ -603,7 +599,7 @@ xscale2pmu_handle_irq(int irq_num, void *dev)
 
 		hwc = &event->hw;
 		armpmu_event_update(event, hwc, idx);
-		data.period = event->hw.last_period;
+		perf_sample_data_init(&data, 0, hwc->last_period);
 		if (!armpmu_event_set_period(event, hwc, idx))
 			continue;
 
diff --git a/arch/mips/kernel/perf_event_mipsxx.c b/arch/mips/kernel/perf_event_mipsxx.c
index 811084f..ab73fa2 100644
--- a/arch/mips/kernel/perf_event_mipsxx.c
+++ b/arch/mips/kernel/perf_event_mipsxx.c
@@ -1325,7 +1325,7 @@ static int mipsxx_pmu_handle_shared_irq(void)
 
 	regs = get_irq_regs();
 
-	perf_sample_data_init(&data, 0);
+	perf_sample_data_init(&data, 0, 0);
 
 	switch (counters) {
 #define HANDLE_COUNTER(n)	\
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 02aee03..8f84bcb 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -1299,8 +1299,7 @@ static void record_and_restart(struct perf_event *event, unsigned long val,
 	if (record) {
 		struct perf_sample_data data;
 
-		perf_sample_data_init(&data, ~0ULL);
-		data.period = event->hw.last_period;
+		perf_sample_data_init(&data, ~0ULL, event->hw.last_period);
 
 		if (event->attr.sample_type & PERF_SAMPLE_ADDR)
 			perf_get_data_addr(regs, &data.addr);
diff --git a/arch/powerpc/perf/core-fsl-emb.c b/arch/powerpc/perf/core-fsl-emb.c
index 0a6d2a9..106c533 100644
--- a/arch/powerpc/perf/core-fsl-emb.c
+++ b/arch/powerpc/perf/core-fsl-emb.c
@@ -613,8 +613,7 @@ static void record_and_restart(struct perf_event *event, unsigned long val,
 	if (record) {
 		struct perf_sample_data data;
 
-		perf_sample_data_init(&data, 0);
-		data.period = event->hw.last_period;
+		perf_sample_data_init(&data, 0, event->hw.last_period);
 
 		if (perf_event_overflow(event, &data, regs))
 			fsl_emb_pmu_stop(event, 0);
diff --git a/arch/sparc/kernel/perf_event.c b/arch/sparc/kernel/perf_event.c
index 28559ce..5713957 100644
--- a/arch/sparc/kernel/perf_event.c
+++ b/arch/sparc/kernel/perf_event.c
@@ -1296,8 +1296,6 @@ static int __kprobes perf_event_nmi_handler(struct notifier_block *self,
 
 	regs = args->regs;
 
-	perf_sample_data_init(&data, 0);
-
 	cpuc = &__get_cpu_var(cpu_hw_events);
 
 	/* If the PMU has the TOE IRQ enable bits, we need to do a
@@ -1321,7 +1319,7 @@ static int __kprobes perf_event_nmi_handler(struct notifier_block *self,
 		if (val & (1ULL << 31))
 			continue;
 
-		data.period = event->hw.last_period;
+		perf_sample_data_init(&data, 0, hwc->last_period);
 		if (!sparc_perf_event_set_period(event, hwc, idx))
 			continue;
 
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index e33e9cf..e049d6d 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1183,8 +1183,6 @@ int x86_pmu_handle_irq(struct pt_regs *regs)
 	int idx, handled = 0;
 	u64 val;
 
-	perf_sample_data_init(&data, 0);
-
 	cpuc = &__get_cpu_var(cpu_hw_events);
 
 	/*
@@ -1219,7 +1217,7 @@ int x86_pmu_handle_irq(struct pt_regs *regs)
 		 * event overflow
 		 */
 		handled++;
-		data.period = event->hw.last_period;
+		perf_sample_data_init(&data, 0, event->hw.last_period);
 
 		if (!x86_perf_event_set_period(event))
 			continue;
diff --git a/arch/x86/kernel/cpu/perf_event_amd_ibs.c b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
index c8f69be..2317228 100644
--- a/arch/x86/kernel/cpu/perf_event_amd_ibs.c
+++ b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
@@ -398,8 +398,7 @@ static int perf_ibs_handle_irq(struct perf_ibs *perf_ibs, struct pt_regs *iregs)
 	}
 
 	perf_ibs_event_update(perf_ibs, event, config);
-	perf_sample_data_init(&data, 0);
-	data.period = event->hw.last_period;
+	perf_sample_data_init(&data, 0, hwc->last_period);
 
 	if (event->attr.sample_type & PERF_SAMPLE_RAW) {
 		ibs_data.caps = ibs_caps;
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 26b3e2f..166546e 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1027,8 +1027,6 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
 	u64 status;
 	int handled;
 
-	perf_sample_data_init(&data, 0);
-
 	cpuc = &__get_cpu_var(cpu_hw_events);
 
 	/*
@@ -1082,7 +1080,7 @@ again:
 		if (!intel_pmu_save_and_restart(event))
 			continue;
 
-		data.period = event->hw.last_period;
+		perf_sample_data_init(&data, 0, event->hw.last_period);
 
 		if (has_branch_stack(event))
 			data.br_stack = &cpuc->lbr_stack;
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 7f64df1..5a3edc2 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -316,8 +316,7 @@ int intel_pmu_drain_bts_buffer(void)
 
 	ds->bts_index = ds->bts_buffer_base;
 
-	perf_sample_data_init(&data, 0);
-	data.period = event->hw.last_period;
+	perf_sample_data_init(&data, 0, event->hw.last_period);
 	regs.ip     = 0;
 
 	/*
@@ -564,8 +563,7 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
 	if (!intel_pmu_save_and_restart(event))
 		return;
 
-	perf_sample_data_init(&data, 0);
-	data.period = event->hw.last_period;
+	perf_sample_data_init(&data, 0, event->hw.last_period);
 
 	/*
 	 * We use the interrupt regs as a base because the PEBS record
diff --git a/arch/x86/kernel/cpu/perf_event_p4.c b/arch/x86/kernel/cpu/perf_event_p4.c
index a2dfacf..47124a7 100644
--- a/arch/x86/kernel/cpu/perf_event_p4.c
+++ b/arch/x86/kernel/cpu/perf_event_p4.c
@@ -1005,8 +1005,6 @@ static int p4_pmu_handle_irq(struct pt_regs *regs)
 	int idx, handled = 0;
 	u64 val;
 
-	perf_sample_data_init(&data, 0);
-
 	cpuc = &__get_cpu_var(cpu_hw_events);
 
 	for (idx = 0; idx < x86_pmu.num_counters; idx++) {
@@ -1034,10 +1032,12 @@ static int p4_pmu_handle_irq(struct pt_regs *regs)
 		handled += overflow;
 
 		/* event overflow for sure */
-		data.period = event->hw.last_period;
+		perf_sample_data_init(&data, 0, hwc->last_period);
 
 		if (!x86_perf_event_set_period(event))
 			continue;
+
+
 		if (perf_event_overflow(event, &data, regs))
 			x86_pmu_stop(event, 0);
 	}
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index ddbb6a9..f325786 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1132,11 +1132,14 @@ struct perf_sample_data {
 	struct perf_branch_stack	*br_stack;
 };
 
-static inline void perf_sample_data_init(struct perf_sample_data *data, u64 addr)
+static inline void perf_sample_data_init(struct perf_sample_data *data,
+					 u64 addr, u64 period)
 {
+	/* remaining struct members initialized in perf_prepare_sample() */
 	data->addr = addr;
 	data->raw  = NULL;
 	data->br_stack = NULL;
+	data->period	= period;
 }
 
 extern void perf_output_sample(struct perf_output_handle *handle,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9789a56..00c58df 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4957,7 +4957,7 @@ void __perf_sw_event(u32 event_id, u64 nr, struct pt_regs *regs, u64 addr)
 	if (rctx < 0)
 		return;
 
-	perf_sample_data_init(&data, addr);
+	perf_sample_data_init(&data, addr, 0);
 
 	do_perf_sw_event(PERF_TYPE_SOFTWARE, event_id, nr, &data, regs);
 
@@ -5215,7 +5215,7 @@ void perf_tp_event(u64 addr, u64 count, void *record, int entry_size,
 		.data = record,
 	};
 
-	perf_sample_data_init(&data, addr);
+	perf_sample_data_init(&data, addr, 0);
 	data.raw = &raw;
 
 	hlist_for_each_entry_rcu(event, node, head, hlist_entry) {
@@ -5318,7 +5318,7 @@ void perf_bp_event(struct perf_event *bp, void *data)
 	struct perf_sample_data sample;
 	struct pt_regs *regs = data;
 
-	perf_sample_data_init(&sample, bp->attr.bp_addr);
+	perf_sample_data_init(&sample, bp->attr.bp_addr, 0);
 
 	if (!bp->hw.state && !perf_exclude_event(bp, regs))
 		perf_swevent_event(bp, 1, &sample, regs);
@@ -5344,8 +5344,7 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
 
 	event->pmu->read(event);
 
-	perf_sample_data_init(&data, 0);
-	data.period = event->hw.last_period;
+	perf_sample_data_init(&data, 0, event->hw.last_period);
 	regs = get_irq_regs();
 
 	if (regs && !perf_exclude_event(event, regs)) {
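The new helper's contract is easy to model outside the kernel. This mock reduces struct perf_sample_data to the four fields the inline touches (everything else is elided), mirroring the interface change in the commit:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* reduced mock of struct perf_sample_data: only the fields the
 * init helper touches are modeled here */
struct sample_data {
	uint64_t	addr;
	void		*raw;
	void		*br_stack;
	uint64_t	period;
};

/* mirrors the new perf_sample_data_init() interface: period is now
 * set at init time instead of being assigned separately (or, worse,
 * left uninitialized) by each caller */
void sample_data_init(struct sample_data *data, uint64_t addr,
		      uint64_t period)
{
	data->addr = addr;
	data->raw = NULL;
	data->br_stack = NULL;
	data->period = period;
}
```

The point of the signature change is visible here: a caller can no longer construct a sample without deciding on a period, which is what caused the skewed event distribution being fixed.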
Commit-ID:  7bf352384fda3f678a283928c6c5b2cd9da877e4
Gitweb:     http://git.kernel.org/tip/7bf352384fda3f678a283928c6c5b2cd9da877e4
Author:     Robert Richter <robert.richter@amd.com>
AuthorDate: Mon, 2 Apr 2012 20:19:09 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 May 2012 15:23:12 +0200

perf/x86-ibs: Enable ibs op micro-ops counting mode

Allow enabling ibs op micro-ops counting mode.

Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1333390758-10893-4-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/cpu/perf_event_amd_ibs.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_amd_ibs.c b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
index 2317228..ebf169f 100644
--- a/arch/x86/kernel/cpu/perf_event_amd_ibs.c
+++ b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
@@ -468,6 +468,8 @@ static __init int perf_event_ibs_init(void)
 		return -ENODEV;	/* ibs not supported by the cpu */
 
 	perf_ibs_pmu_init(&perf_ibs_fetch, "ibs_fetch");
+	if (ibs_caps & IBS_CAPS_OPCNT)
+		perf_ibs_op.config_mask |= IBS_OP_CNT_CTL;
 	perf_ibs_pmu_init(&perf_ibs_op, "ibs_op");
 	register_nmi_handler(NMI_LOCAL, perf_ibs_nmi_handler, 0, "perf_ibs");
 	printk(KERN_INFO "perf: AMD IBS detected (0x%08x)\n", ibs_caps);
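The two added lines are a capability gate on the accepted config mask: IBS_OP_CNT_CTL only becomes a legal config bit when the cpu advertises it. Modeled standalone (the helper is ours; the constants match asm/perf_event.h):

```c
#include <assert.h>
#include <stdint.h>

#define IBS_CAPS_OPCNT	(1U << 4)	/* cpuid: IbsOpCntCtl available */
#define IBS_OP_CNT_CTL	(1ULL << 19)	/* IbsOpCtl: count uops, not cycles */

/*
 * Illustrative: widen the pmu's accepted config mask only when the
 * cpu advertises micro-ops counting, so perf_ibs_init()'s
 * config-mask check rejects IBS_OP_CNT_CTL on older hardware.
 */
uint64_t op_config_mask(uint32_t ibs_caps, uint64_t base_mask)
{
	if (ibs_caps & IBS_CAPS_OPCNT)
		base_mask |= IBS_OP_CNT_CTL;
	return base_mask;
}
```

Because the mask is widened before pmu registration, no per-event check is needed later: an event asking for uop counting on an unsupported cpu simply fails the existing `config & ~config_mask` test.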
Commit-ID:  6accb9cf76080422d400a641d9068b6b2a2c216f
Gitweb:     http://git.kernel.org/tip/6accb9cf76080422d400a641d9068b6b2a2c216f
Author:     Robert Richter <robert.richter@amd.com>
AuthorDate: Mon, 2 Apr 2012 20:19:10 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 May 2012 15:23:13 +0200

perf/x86-ibs: Fix frequency profiling

When profiling at a fixed frequency, the freq value and sample period
were set up incorrectly. Since sampling periods are adjusted, we also
allow periods that have the lower 4 bits set.

Another fix is the setup of the hw counter: if we modify
hwc->sample_period, we also need to update hwc->last_period and
hwc->period_left.

Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1333390758-10893-5-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/cpu/perf_event_amd_ibs.c |   18 ++++++++++++++++--
 1 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_amd_ibs.c b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
index ebf169f..bc401bd 100644
--- a/arch/x86/kernel/cpu/perf_event_amd_ibs.c
+++ b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
@@ -162,9 +162,16 @@ static int perf_ibs_init(struct perf_event *event)
 		if (config & perf_ibs->cnt_mask)
 			/* raw max_cnt may not be set */
 			return -EINVAL;
-		if (hwc->sample_period & 0x0f)
-			/* lower 4 bits can not be set in ibs max cnt */
+		if (!event->attr.sample_freq && hwc->sample_period & 0x0f)
+			/*
+			 * lower 4 bits can not be set in ibs max cnt,
+			 * but allowing it in case we adjust the
+			 * sample period to set a frequency.
+			 */
 			return -EINVAL;
+		hwc->sample_period &= ~0x0FULL;
+		if (!hwc->sample_period)
+			hwc->sample_period = 0x10;
 	} else {
 		max_cnt = config & perf_ibs->cnt_mask;
 		config &= ~perf_ibs->cnt_mask;
@@ -175,6 +182,13 @@ static int perf_ibs_init(struct perf_event *event)
 	if (!hwc->sample_period)
 		return -EINVAL;
 
+	/*
+	 * If we modify hwc->sample_period, we also need to update
+	 * hwc->last_period and hwc->period_left.
+	 */
+	hwc->last_period = hwc->sample_period;
+	local64_set(&hwc->period_left, hwc->sample_period);
+
 	hwc->config_base = perf_ibs->msr;
 	hwc->config = config;
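The period fixup above is simple mask arithmetic: IbsOpMaxCnt ignores the lower 4 bits, so adjusted periods are rounded down to a multiple of 16, with 16 as the minimum. A standalone version of that rule (helper name ours):

```c
#include <assert.h>
#include <stdint.h>

/*
 * IbsOpMaxCnt cannot encode the lower 4 bits of a period, so a
 * frequency-adjusted sample period is rounded down to a multiple
 * of 16; 0x10 is the smallest usable period.
 */
uint64_t ibs_align_period(uint64_t period)
{
	period &= ~0x0FULL;
	if (!period)
		period = 0x10;
	return period;
}
```

Note the asymmetry in the patch: a fixed period with low bits set is still rejected with -EINVAL unless sample_freq is in use, because only frequency mode silently adjusts the period anyway.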
Commit-ID:  d47e8238cd76f1ffa7c8cd30e08b8e9074fd597e
Gitweb:     http://git.kernel.org/tip/d47e8238cd76f1ffa7c8cd30e08b8e9074fd597e
Author:     Robert Richter <robert.richter@amd.com>
AuthorDate: Mon, 2 Apr 2012 20:19:11 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 May 2012 15:23:13 +0200

perf/x86-ibs: Take instruction pointer from ibs sample

Each IBS sample contains a linear address of the instruction that
caused the sample to trigger. This address is more precise than the
rip that was taken from the interrupt handler's stack. Update the rip
with that address. We use this in the next patch to implement
precise-event sampling on AMD systems using IBS.

Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1333390758-10893-6-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/perf_event.h        |    6 ++-
 arch/x86/kernel/cpu/perf_event_amd_ibs.c |   48 +++++++++++++++++++----------
 2 files changed, 35 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 8a3c75d..4e40a64 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -158,6 +158,7 @@ struct x86_pmu_capability {
 #define IBS_CAPS_OPCNT			(1U<<4)
 #define IBS_CAPS_BRNTRGT		(1U<<5)
 #define IBS_CAPS_OPCNTEXT		(1U<<6)
+#define IBS_CAPS_RIPINVALIDCHK		(1U<<7)
 
 #define IBS_CAPS_DEFAULT		(IBS_CAPS_AVAIL		\
 					 | IBS_CAPS_FETCHSAM	\
@@ -170,14 +171,14 @@ struct x86_pmu_capability {
 #define IBSCTL_LVT_OFFSET_VALID		(1ULL<<8)
 #define IBSCTL_LVT_OFFSET_MASK		0x0F
 
-/* IbsFetchCtl bits/masks */
+/* ibs fetch bits/masks */
 #define IBS_FETCH_RAND_EN	(1ULL<<57)
 #define IBS_FETCH_VAL		(1ULL<<49)
 #define IBS_FETCH_ENABLE	(1ULL<<48)
 #define IBS_FETCH_CNT		0xFFFF0000ULL
 #define IBS_FETCH_MAX_CNT	0x0000FFFFULL
 
-/* IbsOpCtl bits */
+/* ibs op bits/masks */
+/* lower 4 bits of the current count are ignored:
+ */
 #define IBS_OP_CUR_CNT		(0xFFFF0ULL<<32)
 #define IBS_OP_CNT_CTL		(1ULL<<19)
@@ -185,6 +186,7 @@ struct x86_pmu_capability {
 #define IBS_OP_ENABLE		(1ULL<<17)
 #define IBS_OP_MAX_CNT		0x0000FFFFULL
 #define IBS_OP_MAX_CNT_EXT	0x007FFFFFULL	/* not a register bit mask */
+#define IBS_RIP_INVALID		(1ULL<<38)
 
 extern u32 get_ibs_caps(void);

diff --git a/arch/x86/kernel/cpu/perf_event_amd_ibs.c b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
index bc401bd..cc1f329 100644
--- a/arch/x86/kernel/cpu/perf_event_amd_ibs.c
+++ b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
@@ -9,6 +9,7 @@
 #include <linux/perf_event.h>
 #include <linux/module.h>
 #include <linux/pci.h>
+#include <linux/ptrace.h>
 
 #include <asm/apic.h>
 
@@ -382,7 +383,7 @@ static int perf_ibs_handle_irq(struct perf_ibs *perf_ibs, struct pt_regs *iregs)
 	struct perf_raw_record raw;
 	struct pt_regs regs;
 	struct perf_ibs_data ibs_data;
-	int offset, size, overflow, reenable;
+	int offset, size, check_rip, offset_max, throttle = 0;
 	unsigned int msr;
 	u64 *buf, config;
 
@@ -413,28 +414,41 @@ static int perf_ibs_handle_irq(struct perf_ibs *perf_ibs, struct pt_regs *iregs)
 	perf_ibs_event_update(perf_ibs, event, config);
 	perf_sample_data_init(&data, 0, hwc->last_period);
+	if (!perf_ibs_set_period(perf_ibs, hwc, &config))
+		goto out;	/* no sw counter overflow */
+
+	ibs_data.caps = ibs_caps;
+	size = 1;
+	offset = 1;
+	check_rip = (perf_ibs == &perf_ibs_op && (ibs_caps & IBS_CAPS_RIPINVALIDCHK));
+	if (event->attr.sample_type & PERF_SAMPLE_RAW)
+		offset_max = perf_ibs->offset_max;
+	else if (check_rip)
+		offset_max = 2;
+	else
+		offset_max = 1;
+	do {
+		rdmsrl(msr + offset, *buf++);
+		size++;
+		offset = find_next_bit(perf_ibs->offset_mask,
+				       perf_ibs->offset_max,
+				       offset + 1);
+	} while (offset < offset_max);
+	ibs_data.size = sizeof(u64) * size;
+
+	regs = *iregs;
+	if (!check_rip || !(ibs_data.regs[2] & IBS_RIP_INVALID))
+		instruction_pointer_set(&regs, ibs_data.regs[1]);
 
 	if (event->attr.sample_type & PERF_SAMPLE_RAW) {
-		ibs_data.caps = ibs_caps;
-		size = 1;
-		offset = 1;
-		do {
-			rdmsrl(msr + offset, *buf++);
-			size++;
-			offset = find_next_bit(perf_ibs->offset_mask,
-					       perf_ibs->offset_max,
-					       offset + 1);
-		} while (offset < perf_ibs->offset_max);
-		raw.size = sizeof(u32) + sizeof(u64) * size;
+		raw.size = sizeof(u32) + ibs_data.size;
 		raw.data = ibs_data.data;
 		data.raw = &raw;
 	}
 
-	regs = *iregs; /* XXX: update ip from ibs sample */
-
-	overflow = perf_ibs_set_period(perf_ibs, hwc, &config);
-	reenable = !(overflow && perf_event_overflow(event, &data, &regs));
-	config = (config >> 4) | (reenable ? perf_ibs->enable_mask : 0);
+	throttle = perf_event_overflow(event, &data, &regs);
+out:
+	config = (config >> 4) | (throttle ? 0 : perf_ibs->enable_mask);
 	perf_ibs_enable_event(hwc, config);
 
 	perf_event_update_userpage(event);
Commit-ID:  450bbd493d436f9eadd1b7828158f37559f26674
Gitweb:     http://git.kernel.org/tip/450bbd493d436f9eadd1b7828158f37559f26674
Author:     Robert Richter <robert.richter@amd.com>
AuthorDate: Mon, 12 Mar 2012 12:54:32 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 May 2012 15:23:14 +0200

perf/x86-ibs: Precise event sampling with IBS for AMD CPUs

This patch adds support for precise event sampling with IBS. There are
two counting modes to count either cycles or micro-ops. If the
corresponding performance counter events (hw events) are set up with
the precise flag, the request is redirected to the ibs pmu:

 perf record -a -e cpu-cycles:p ...    # use ibs op counting cycle count
 perf record -a -e r076:p ...          # same as -e cpu-cycles:p
 perf record -a -e r0C1:p ...          # use ibs op counting micro-ops

Each ibs sample contains a linear address that points to the
instruction that caused the sample to trigger. With ibs we have skid
0. Thus, ibs supports precise levels 1 and 2 and samples are marked
with the PERF_EFLAGS_EXACT flag set. In rare cases the rip is invalid
when IBS was not able to record the rip correctly. Then the
PERF_EFLAGS_EXACT flag is cleared and the rip is taken from pt_regs.
V2:
 * don't drop samples in precise level 2 if rip is invalid, instead
   support the PERF_EFLAGS_EXACT flag

Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120502103309.GP18810@erda.amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/cpu/perf_event_amd.c     |    7 +++-
 arch/x86/kernel/cpu/perf_event_amd_ibs.c |   73 ++++++++++++++++++++++++++++-
 2 files changed, 76 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c
index 589286f..6565226 100644
--- a/arch/x86/kernel/cpu/perf_event_amd.c
+++ b/arch/x86/kernel/cpu/perf_event_amd.c
@@ -134,8 +134,13 @@ static u64 amd_pmu_event_map(int hw_event)
 
 static int amd_pmu_hw_config(struct perf_event *event)
 {
-	int ret = x86_pmu_hw_config(event);
+	int ret;
 
+	/* pass precise event sampling to ibs: */
+	if (event->attr.precise_ip && get_ibs_caps())
+		return -ENOENT;
+
+	ret = x86_pmu_hw_config(event);
 	if (ret)
 		return ret;
 
diff --git a/arch/x86/kernel/cpu/perf_event_amd_ibs.c b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
index cc1f329..34dfa85 100644
--- a/arch/x86/kernel/cpu/perf_event_amd_ibs.c
+++ b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
@@ -145,17 +145,80 @@ static struct perf_ibs *get_ibs_pmu(int type)
 	return NULL;
 }
 
+/*
+ * Use IBS for precise event sampling:
+ *
+ *  perf record -a -e cpu-cycles:p ...    # use ibs op counting cycle count
+ *  perf record -a -e r076:p ...          # same as -e cpu-cycles:p
+ *  perf record -a -e r0C1:p ...          # use ibs op counting micro-ops
+ *
+ * IbsOpCntCtl (bit 19) of IBS Execution Control Register (IbsOpCtl,
+ * MSRC001_1033) is used to select either cycle or micro-ops counting
+ * mode.
+ *
+ * The rip of IBS samples has skid 0. Thus, IBS supports precise
+ * levels 1 and 2 and the PERF_EFLAGS_EXACT is set. In rare cases the
+ * rip is invalid when IBS was not able to record the rip correctly.
+ * We clear PERF_EFLAGS_EXACT and take the rip from pt_regs then.
+ */
+static int perf_ibs_precise_event(struct perf_event *event, u64 *config)
+{
+	switch (event->attr.precise_ip) {
+	case 0:
+		return -ENOENT;
+	case 1:
+	case 2:
+		break;
+	default:
+		return -EOPNOTSUPP;
+	}
+
+	switch (event->attr.type) {
+	case PERF_TYPE_HARDWARE:
+		switch (event->attr.config) {
+		case PERF_COUNT_HW_CPU_CYCLES:
+			*config = 0;
+			return 0;
+		}
+		break;
+	case PERF_TYPE_RAW:
+		switch (event->attr.config) {
+		case 0x0076:
+			*config = 0;
+			return 0;
+		case 0x00C1:
+			*config = IBS_OP_CNT_CTL;
+			return 0;
+		}
+		break;
+	default:
+		return -ENOENT;
+	}
+
+	return -EOPNOTSUPP;
+}
+
 static int perf_ibs_init(struct perf_event *event)
 {
 	struct hw_perf_event *hwc = &event->hw;
 	struct perf_ibs *perf_ibs;
 	u64 max_cnt, config;
+	int ret;
 
 	perf_ibs = get_ibs_pmu(event->attr.type);
-	if (!perf_ibs)
+	if (perf_ibs) {
+		config = event->attr.config;
+	} else {
+		perf_ibs = &perf_ibs_op;
+		ret = perf_ibs_precise_event(event, &config);
+		if (ret)
+			return ret;
+	}
+
+	if (event->pmu != &perf_ibs->pmu)
 		return -ENOENT;
 
-	config = event->attr.config;
 	if (config & ~perf_ibs->config_mask)
 		return -EINVAL;
 
@@ -437,8 +500,12 @@ static int perf_ibs_handle_irq(struct perf_ibs *perf_ibs, struct pt_regs *iregs)
 	ibs_data.size = sizeof(u64) * size;
 
 	regs = *iregs;
-	if (!check_rip || !(ibs_data.regs[2] & IBS_RIP_INVALID))
+	if (check_rip && (ibs_data.regs[2] & IBS_RIP_INVALID)) {
+		regs.flags &= ~PERF_EFLAGS_EXACT;
+	} else {
 		instruction_pointer_set(&regs, ibs_data.regs[1]);
+		regs.flags |= PERF_EFLAGS_EXACT;
+	}
 
 	if (event->attr.sample_type & PERF_SAMPLE_RAW) {
 		raw.size = sizeof(u32) + ibs_data.size;
Commit-ID:  98112d2e957e0d348f06d8a40f2f720204a70b55
Gitweb:     http://git.kernel.org/tip/98112d2e957e0d348f06d8a40f2f720204a70b55
Author:     Robert Richter <robert.richter@amd.com>
AuthorDate: Mon, 2 Apr 2012 20:19:13 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 May 2012 15:23:14 +0200

perf/x86-ibs: Rename some variables

Simple patch that just renames some variables for better
understanding.

Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1333390758-10893-8-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/cpu/perf_event_amd_ibs.c |   10 +++++-----
 1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_amd_ibs.c b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
index 34dfa85..29a1bff 100644
--- a/arch/x86/kernel/cpu/perf_event_amd_ibs.c
+++ b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
@@ -62,7 +62,7 @@ struct perf_ibs_data {
 };
 
 static int
-perf_event_set_period(struct hw_perf_event *hwc, u64 min, u64 max, u64 *count)
+perf_event_set_period(struct hw_perf_event *hwc, u64 min, u64 max, u64 *hw_period)
 {
 	s64 left = local64_read(&hwc->period_left);
 	s64 period = hwc->sample_period;
@@ -91,7 +91,7 @@ perf_event_set_period(struct hw_perf_event *hwc, u64 min, u64 max, u64 *count)
 	if (left > max)
 		left = max;
 
-	*count = (u64)left;
+	*hw_period = (u64)left;
 
 	return overflow;
 }
@@ -262,13 +262,13 @@ static int perf_ibs_init(struct perf_event *event)
 static int perf_ibs_set_period(struct perf_ibs *perf_ibs,
 			       struct hw_perf_event *hwc, u64 *period)
 {
-	int ret;
+	int overflow;
 
 	/* ignore lower 4 bits in min count: */
-	ret = perf_event_set_period(hwc, 1<<4, perf_ibs->max_period, period);
+	overflow = perf_event_set_period(hwc, 1<<4, perf_ibs->max_period, period);
 	local64_set(&hwc->prev_count, 0);
 
-	return ret;
+	return overflow;
 }
 
 static u64 get_ibs_fetch_count(u64 config)
Commit-ID:  fc006cf7cc7471e1bdf34e40111971e03622af6c
Gitweb:     http://git.kernel.org/tip/fc006cf7cc7471e1bdf34e40111971e03622af6c
Author:     Robert Richter <robert.richter@amd.com>
AuthorDate: Mon, 2 Apr 2012 20:19:14 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 May 2012 15:23:15 +0200

perf/x86-ibs: Trigger overflow if remaining period is too small

There are cases where the remaining period is smaller than the
minimal possible value. In this case the counter is restarted with
the minimal period. This is of no use as the interrupt handler
triggers immediately again, most likely interrupting itself, which
biases the results. So, if the remaining period is within the min
range, it is better not to restart the counter and to trigger the
overflow instead.

Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1333390758-10893-9-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/cpu/perf_event_amd_ibs.c |    5 +----
 1 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_amd_ibs.c b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
index 29a1bff..3e32908 100644
--- a/arch/x86/kernel/cpu/perf_event_amd_ibs.c
+++ b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
@@ -78,16 +78,13 @@ perf_event_set_period(struct hw_perf_event *hwc, u64 min, u64 max, u64 *hw_perio
 		overflow = 1;
 	}
 
-	if (unlikely(left <= 0)) {
+	if (unlikely(left < (s64)min)) {
 		left += period;
 		local64_set(&hwc->period_left, left);
 		hwc->last_period = period;
 		overflow = 1;
 	}
 
-	if (unlikely(left < min))
-		left = min;
-
 	if (left > max)
 		left = max;
Commit-ID:  7caaf4d8241feecafb87919402b0a6dbb1b71d9e
Gitweb:     http://git.kernel.org/tip/7caaf4d8241feecafb87919402b0a6dbb1b71d9e
Author:     Robert Richter <robert.richter@amd.com>
AuthorDate: Mon, 2 Apr 2012 20:19:15 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 May 2012 15:23:15 +0200

perf/x86-ibs: Extend hw period that triggers overflow

If the last hw period is too short we might hit the irq handler,
which biases the results. Thus, try to use a maximum last period that
triggers the sw overflow.

Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1333390758-10893-10-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/cpu/perf_event_amd_ibs.c |   15 +++++++++++++--
 1 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_amd_ibs.c b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
index 3e32908..cb51a3e 100644
--- a/arch/x86/kernel/cpu/perf_event_amd_ibs.c
+++ b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
@@ -85,8 +85,19 @@ perf_event_set_period(struct hw_perf_event *hwc, u64 min, u64 max, u64 *hw_perio
 		overflow = 1;
 	}
 
-	if (left > max)
-		left = max;
+	/*
+	 * If the hw period that triggers the sw overflow is too short
+	 * we might hit the irq handler. This biases the results.
+	 * Thus we shorten the next-to-last period and set the last
+	 * period to the max period.
+	 */
+	if (left > max) {
+		left -= max;
+		if (left > max)
+			left = max;
+		else if (left < min)
+			left = min;
+	}
 
 	*hw_period = (u64)left;
Commit-ID:  c9574fe0bdb9ac9a2698e02a712088ce8431e9f8
Gitweb:     http://git.kernel.org/tip/c9574fe0bdb9ac9a2698e02a712088ce8431e9f8
Author:     Robert Richter <robert.richter@amd.com>
AuthorDate: Mon, 2 Apr 2012 20:19:16 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 May 2012 15:23:16 +0200

perf/x86-ibs: Implement workaround for IBS erratum #420

When disabling ibs there might be the case where hardware
continuously generates interrupts. This is described in erratum #420
(Instruction-Based Sampling Engine May Generate Interrupt that Cannot
Be Cleared). To avoid this we must clear the counter mask first and
then clear the enable bit. This patch implements this. See Revision
Guide for AMD Family 10h Processors, Publication #41322.

Note: We now keep track of the last read ibs config value, which is
then used to disable ibs. To update the config value we now pass a
pointer to the functions reading it.

Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1333390758-10893-11-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/cpu/perf_event_amd_ibs.c |   62 +++++++++++++++++++-----------
 1 files changed, 39 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_amd_ibs.c b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
index cb51a3e..b14e711 100644
--- a/arch/x86/kernel/cpu/perf_event_amd_ibs.c
+++ b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
@@ -291,20 +291,36 @@ static u64 get_ibs_op_count(u64 config)
 
 static void
 perf_ibs_event_update(struct perf_ibs *perf_ibs, struct perf_event *event,
-		      u64 config)
+		      u64 *config)
 {
-	u64 count = perf_ibs->get_count(config);
+	u64 count = perf_ibs->get_count(*config);
 
 	while (!perf_event_try_update(event, count, 20)) {
-		rdmsrl(event->hw.config_base, config);
-		count = perf_ibs->get_count(config);
+		rdmsrl(event->hw.config_base, *config);
+		count = perf_ibs->get_count(*config);
 	}
 }
 
-/* Note: The enable mask must be encoded in the config argument. */
-static inline void perf_ibs_enable_event(struct hw_perf_event *hwc, u64 config)
+static inline void perf_ibs_enable_event(struct perf_ibs *perf_ibs,
+					 struct hw_perf_event *hwc, u64 config)
 {
-	wrmsrl(hwc->config_base, hwc->config | config);
+	wrmsrl(hwc->config_base, hwc->config | config | perf_ibs->enable_mask);
+}
+
+/*
+ * Erratum #420 Instruction-Based Sampling Engine May Generate
+ * Interrupt that Cannot Be Cleared:
+ *
+ * Must clear counter mask first, then clear the enable bit. See
+ * Revision Guide for AMD Family 10h Processors, Publication #41322.
+ */
+static inline void perf_ibs_disable_event(struct perf_ibs *perf_ibs,
+					  struct hw_perf_event *hwc, u64 config)
+{
+	config &= ~perf_ibs->cnt_mask;
+	wrmsrl(hwc->config_base, config);
+	config &= ~perf_ibs->enable_mask;
+	wrmsrl(hwc->config_base, config);
 }
 
 /*
@@ -318,7 +334,7 @@ static void perf_ibs_start(struct perf_event *event, int flags)
 	struct hw_perf_event *hwc = &event->hw;
 	struct perf_ibs *perf_ibs = container_of(event->pmu, struct perf_ibs, pmu);
 	struct cpu_perf_ibs *pcpu = this_cpu_ptr(perf_ibs->pcpu);
-	u64 config;
+	u64 period;
 
 	if (WARN_ON_ONCE(!(hwc->state & PERF_HES_STOPPED)))
 		return;
@@ -326,10 +342,9 @@ static void perf_ibs_start(struct perf_event *event, int flags)
 	WARN_ON_ONCE(!(hwc->state & PERF_HES_UPTODATE));
 	hwc->state = 0;
 
-	perf_ibs_set_period(perf_ibs, hwc, &config);
-	config = (config >> 4) | perf_ibs->enable_mask;
+	perf_ibs_set_period(perf_ibs, hwc, &period);
 	set_bit(IBS_STARTED, pcpu->state);
-	perf_ibs_enable_event(hwc, config);
+	perf_ibs_enable_event(perf_ibs, hwc, period >> 4);
 
 	perf_event_update_userpage(event);
 }
@@ -339,7 +354,7 @@ static void perf_ibs_stop(struct perf_event *event, int flags)
 	struct hw_perf_event *hwc = &event->hw;
 	struct perf_ibs *perf_ibs = container_of(event->pmu, struct perf_ibs, pmu);
 	struct cpu_perf_ibs *pcpu = this_cpu_ptr(perf_ibs->pcpu);
-	u64 val;
+	u64 config;
 	int stopping;
 
 	stopping = test_and_clear_bit(IBS_STARTED, pcpu->state);
@@ -347,12 +362,11 @@ static void perf_ibs_stop(struct perf_event *event, int flags)
 	if (!stopping && (hwc->state & PERF_HES_UPTODATE))
 		return;
 
-	rdmsrl(hwc->config_base, val);
+	rdmsrl(hwc->config_base, config);
 
 	if (stopping) {
 		set_bit(IBS_STOPPING, pcpu->state);
-		val &= ~perf_ibs->enable_mask;
-		wrmsrl(hwc->config_base, val);
+		perf_ibs_disable_event(perf_ibs, hwc, config);
 		WARN_ON_ONCE(hwc->state & PERF_HES_STOPPED);
 		hwc->state |= PERF_HES_STOPPED;
 	}
@@ -360,7 +374,7 @@ static void perf_ibs_stop(struct perf_event *event, int flags)
 	if (hwc->state & PERF_HES_UPTODATE)
 		return;
 
-	perf_ibs_event_update(perf_ibs, event, val);
+	perf_ibs_event_update(perf_ibs, event, &config);
 	hwc->state |= PERF_HES_UPTODATE;
 }
 
@@ -456,7 +470,7 @@ static int perf_ibs_handle_irq(struct perf_ibs *perf_ibs, struct pt_regs *iregs)
 	struct perf_ibs_data ibs_data;
 	int offset, size, check_rip, offset_max, throttle = 0;
 	unsigned int msr;
-	u64 *buf, config;
+	u64 *buf, *config, period;
 
 	if (!test_bit(IBS_STARTED, pcpu->state)) {
 		/* Catch spurious interrupts after stopping IBS: */
@@ -477,15 +491,15 @@ static int perf_ibs_handle_irq(struct perf_ibs *perf_ibs, struct pt_regs *iregs)
 	 * supported in all cpus. As this triggered an interrupt, we
 	 * set the current count to the max count.
 	 */
-	config = ibs_data.regs[0];
+	config = &ibs_data.regs[0];
 	if (perf_ibs == &perf_ibs_op && !(ibs_caps & IBS_CAPS_RDWROPCNT)) {
-		config &= ~IBS_OP_CUR_CNT;
-		config |= (config & IBS_OP_MAX_CNT) << 36;
+		*config &= ~IBS_OP_CUR_CNT;
+		*config |= (*config & IBS_OP_MAX_CNT) << 36;
 	}
 
 	perf_ibs_event_update(perf_ibs, event, config);
 	perf_sample_data_init(&data, 0, hwc->last_period);
-	if (!perf_ibs_set_period(perf_ibs, hwc, &config))
+	if (!perf_ibs_set_period(perf_ibs, hwc, &period))
 		goto out;	/* no sw counter overflow */
 
 	ibs_data.caps = ibs_caps;
@@ -523,8 +537,10 @@ static int perf_ibs_handle_irq(struct perf_ibs *perf_ibs, struct pt_regs *iregs)
 	throttle = perf_event_overflow(event, &data, &regs);
 out:
-	config = (config >> 4) | (throttle ? 0 : perf_ibs->enable_mask);
-	perf_ibs_enable_event(hwc, config);
+	if (throttle)
+		perf_ibs_disable_event(perf_ibs, hwc, *config);
+	else
+		perf_ibs_enable_event(perf_ibs, hwc, period >> 4);
 
 	perf_event_update_userpage(event);
Commit-ID:  fc5fb2b5e1874e5894e2ac503bfb744220db89a1
Gitweb:     http://git.kernel.org/tip/fc5fb2b5e1874e5894e2ac503bfb744220db89a1
Author:     Robert Richter <robert.richter@amd.com>
AuthorDate: Mon, 2 Apr 2012 20:19:17 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 May 2012 15:23:16 +0200

perf/x86-ibs: Catch spurious interrupts after stopping IBS

After disabling IBS there could still be incoming NMIs with samples
that even have the valid bit cleared. Mark all these NMIs as handled
to avoid spurious interrupt messages.

Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1333390758-10893-12-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/cpu/perf_event_amd_ibs.c |   12 +++++++-----
 1 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_amd_ibs.c b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
index b14e711..5a9f95b 100644
--- a/arch/x86/kernel/cpu/perf_event_amd_ibs.c
+++ b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
@@ -473,11 +473,13 @@ static int perf_ibs_handle_irq(struct perf_ibs *perf_ibs, struct pt_regs *iregs)
 	u64 *buf, *config, period;
 
 	if (!test_bit(IBS_STARTED, pcpu->state)) {
-		/* Catch spurious interrupts after stopping IBS: */
-		if (!test_and_clear_bit(IBS_STOPPING, pcpu->state))
-			return 0;
-		rdmsrl(perf_ibs->msr, *ibs_data.regs);
-		return (*ibs_data.regs & perf_ibs->valid_mask) ? 1 : 0;
+		/*
+		 * Catch spurious interrupts after stopping IBS: After
+		 * disabling IBS there could still be incoming NMIs
+		 * with samples that even have the valid bit cleared.
+		 * Mark all these NMIs as handled.
+		 */
+		return test_and_clear_bit(IBS_STOPPING, pcpu->state) ? 1 : 0;
 	}
 
 	msr = hwc->config_base;
Commit-ID:  8b1e13638d465863572c8207a5cfceeef0cf0441
Gitweb:     http://git.kernel.org/tip/8b1e13638d465863572c8207a5cfceeef0cf0441
Author:     Robert Richter <robert.richter@amd.com>
AuthorDate: Mon, 2 Apr 2012 20:19:18 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 May 2012 15:23:17 +0200

perf/x86-ibs: Fix usage of IBS op current count

The value of IbsOpCurCnt rolls over when it reaches IbsOpMaxCnt.
Thus, it is reset to zero by hardware. To get the correct count we
need to add the max count to it in case we received an ibs sample
(valid bit set).

Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1333390758-10893-13-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/cpu/perf_event_amd_ibs.c |   33 +++++++++++++++++++----------
 1 files changed, 21 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_amd_ibs.c b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
index 5a9f95b..da9bcdc 100644
--- a/arch/x86/kernel/cpu/perf_event_amd_ibs.c
+++ b/arch/x86/kernel/cpu/perf_event_amd_ibs.c
@@ -286,7 +286,15 @@ static u64 get_ibs_fetch_count(u64 config)
 
 static u64 get_ibs_op_count(u64 config)
 {
-	return (config & IBS_OP_CUR_CNT) >> 32;
+	u64 count = 0;
+
+	if (config & IBS_OP_VAL)
+		count += (config & IBS_OP_MAX_CNT) << 4; /* cnt rolled over */
+
+	if (ibs_caps & IBS_CAPS_RDWROPCNT)
+		count += (config & IBS_OP_CUR_CNT) >> 32;
+
+	return count;
 }
 
 static void
@@ -295,7 +303,12 @@ perf_ibs_event_update(struct perf_ibs *perf_ibs, struct perf_event *event,
 {
 	u64 count = perf_ibs->get_count(*config);
 
-	while (!perf_event_try_update(event, count, 20)) {
+	/*
+	 * Set width to 64 since we do not overflow on max width but
+	 * instead on max count. In perf_ibs_set_period() we clear
+	 * prev count manually on overflow.
+	 */
+	while (!perf_event_try_update(event, count, 64)) {
 		rdmsrl(event->hw.config_base, *config);
 		count = perf_ibs->get_count(*config);
 	}
@@ -374,6 +387,12 @@ static void perf_ibs_stop(struct perf_event *event, int flags)
 	if (hwc->state & PERF_HES_UPTODATE)
 		return;
 
+	/*
+	 * Clear valid bit to not count rollovers on update, rollovers
+	 * are only updated in the irq handler.
+	 */
+	config &= ~perf_ibs->valid_mask;
+
 	perf_ibs_event_update(perf_ibs, event, &config);
 	hwc->state |= PERF_HES_UPTODATE;
 }
@@ -488,17 +507,7 @@ static int perf_ibs_handle_irq(struct perf_ibs *perf_ibs, struct pt_regs *iregs)
 	if (!(*buf++ & perf_ibs->valid_mask))
 		return 0;
 
-	/*
-	 * Emulate IbsOpCurCnt in MSRC001_1033 (IbsOpCtl), not
-	 * supported in all cpus. As this triggered an interrupt, we
-	 * set the current count to the max count.
-	 */
 	config = &ibs_data.regs[0];
-	if (perf_ibs == &perf_ibs_op && !(ibs_caps & IBS_CAPS_RDWROPCNT)) {
-		*config &= ~IBS_OP_CUR_CNT;
-		*config |= (*config & IBS_OP_MAX_CNT) << 36;
-	}
-
 	perf_ibs_event_update(perf_ibs, event, config);
 	perf_sample_data_init(&data, 0, hwc->last_period);
 	if (!perf_ibs_set_period(perf_ibs, hwc, &period))