* [PATCH V5 0/6] large PEBS interrupt threshold
@ 2015-02-23 14:25 Kan Liang
  2015-02-23 14:25 ` [PATCH V5 1/6] perf, x86: use the PEBS auto reload mechanism when possible Kan Liang
                   ` (5 more replies)
  0 siblings, 6 replies; 23+ messages in thread
From: Kan Liang @ 2015-02-23 14:25 UTC (permalink / raw)
  To: a.p.zijlstra, linux-kernel; +Cc: mingo, acme, eranian, andi, Kan Liang

This patch series implements a large PEBS interrupt threshold.
Currently the PEBS interrupt threshold is forced to one. A larger
threshold can significantly reduce the sampling overhead, especially
for frequently occurring events (like cycles, branches or load/stores)
sampled with a small period.
For example, recording the cycles event with a 10003 sampling period
while running kernbench: the elapsed time drops from 32.7 seconds to
16.5 seconds, i.e. the run is 2X faster.
For more details, please refer to patch 3's description; the test
commands are reproduced below.
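
The test commands used for that comparison (taken from patch 3's
description) were:

  plain: "perf record -e cycles:p -c $period -- kernbench -M -H"
  multi: "perf record --no-time -e cycles:p -c $period -- kernbench -M -H"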

Limitations:
 - It cannot supply a callgraph.
 - It requires setting a fixed period.
 - It cannot supply a time stamp.
 - To supply a TID it requires flushing on context switch.
If these requirements cannot be met, the threshold is set to one; see
the sketch below.
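
Concretely, patch 3 encodes these constraints as a sample_type mask
(PEBS_FREERUNNING_FLAGS). A simplified sketch of the gating in
intel_pmu_pebs_enable(), leaving out the temporary disable that patch 5
removes:

	if ((hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) &&	/* fixed period */
	    !(event->attr.sample_type & ~PEBS_FREERUNNING_FLAGS)) {
		/* free running: interrupt only when the buffer is nearly full */
		threshold = ds->pebs_absolute_maximum -
			x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
	} else {
		/* fall back to a threshold of one record */
		threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
	}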

Collisions:
When PEBS events happen close to each other, their records can be
collapsed into a single one, and it is not possible to reconstruct
all the events that caused it. When a collision happens, we drop the
PEBS record.
In practice, collisions are extremely rare as long as different events
are used. We tested the worst case with four frequently occurring
events (cycles:p,instructions:p,branches:p,mem-stores:p); the
collision rate was only 0.34%.
The only way to get a lot of collisions is to count the same thing
multiple times, which is not a useful configuration.
For details about collisions, please refer to patch 4's description.
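
In code terms, patch 4 treats a PEBS record whose status field has more
than one counter bit set as a collision and skips it:

	bit = find_first_bit((unsigned long *)&p->status,
				x86_pmu.max_pebs_events);
	if (bit >= x86_pmu.max_pebs_events)
		continue;
	if (p->status != (1 << bit))
		continue;	/* collided record, drop it */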

changes since v1:
  - drop patch 'perf, core: Add all PMUs to pmu_idr'
  - add comments for case that multiple counters overflow simultaneously
changes since v2:
  - rename perf_sched_cb_{enable,disable} to perf_sched_cb_user_{inc,dec}
  - use flag to indicate auto reload mechanism
  - move the code that sets up PEBS sample data into a separate function
  - output the PEBS records in batch
  - enable this for all (PEBS-capable) hardware
  - more description for multiplexing
changes since v3:
  - ignore conflicting PEBS record
changes since v4:
  - Do more tests for collision and update comments

Yan, Zheng (6):
  perf, x86: use the PEBS auto reload mechanism when possible
  perf, x86: introduce setup_pebs_sample_data()
  perf, x86: large PEBS interrupt threshold
  perf, x86: handle multiple records in PEBS buffer
  perf, x86: drain PEBS buffer during context switch
  perf, x86: enlarge PEBS buffer

 arch/x86/kernel/cpu/perf_event.c           |  15 +-
 arch/x86/kernel/cpu/perf_event.h           |   5 +-
 arch/x86/kernel/cpu/perf_event_intel.c     |  11 +-
 arch/x86/kernel/cpu/perf_event_intel_ds.c  | 283 ++++++++++++++++++++---------
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |   3 -
 5 files changed, 224 insertions(+), 93 deletions(-)

-- 
1.8.3.2



* [PATCH V5 1/6] perf, x86: use the PEBS auto reload mechanism when possible
  2015-02-23 14:25 [PATCH V5 0/6] large PEBS interrupt threshold Kan Liang
@ 2015-02-23 14:25 ` Kan Liang
  2015-03-30 12:06   ` Peter Zijlstra
  2015-02-23 14:25 ` [PATCH V5 2/6] perf, x86: introduce setup_pebs_sample_data() Kan Liang
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 23+ messages in thread
From: Kan Liang @ 2015-02-23 14:25 UTC (permalink / raw)
  To: a.p.zijlstra, linux-kernel; +Cc: mingo, acme, eranian, andi, Kan Liang

From: Yan, Zheng <zheng.z.yan@intel.com>

When a fixed period is specified, this patch makes perf use the PEBS
auto reload mechanism. This makes normal profiling faster, because
it avoids one costly MSR write in the PMI handler.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Kan Liang <kan.liang@intel.com>
---
 arch/x86/kernel/cpu/perf_event.c          | 15 +++++++++------
 arch/x86/kernel/cpu/perf_event.h          |  2 +-
 arch/x86/kernel/cpu/perf_event_intel_ds.c |  9 +++++++++
 3 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index e0dab5c..8a77a94 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -990,13 +990,16 @@ int x86_perf_event_set_period(struct perf_event *event)
 
 	per_cpu(pmc_prev_left[idx], smp_processor_id()) = left;
 
-	/*
-	 * The hw event starts counting from this event offset,
-	 * mark it to be able to extra future deltas:
-	 */
-	local64_set(&hwc->prev_count, (u64)-left);
+	if (!(hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) ||
+	    local64_read(&hwc->prev_count) != (u64)-left) {
+		/*
+		 * The hw event starts counting from this event offset,
+		 * mark it to be able to extra future deltas:
+		 */
+		local64_set(&hwc->prev_count, (u64)-left);
 
-	wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
+		wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
+	}
 
 	/*
 	 * Due to erratum on certan cpu we need
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index a371d27..bc4ae3b 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -72,7 +72,7 @@ struct event_constraint {
 #define PERF_X86_EVENT_PEBS_LD_HSW	0x10 /* haswell style datala, load */
 #define PERF_X86_EVENT_PEBS_NA_HSW	0x20 /* haswell style datala, unknown */
 #define PERF_X86_EVENT_RDPMC_ALLOWED	0x40 /* grant rdpmc permission */
-
+#define PERF_X86_EVENT_AUTO_RELOAD	0x80 /* use PEBS auto-reload */
 
 struct amd_nb {
 	int nb_id;  /* NorthBridge id */
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 0739833..728fb88 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -673,6 +673,8 @@ void intel_pmu_pebs_enable(struct perf_event *event)
 	struct hw_perf_event *hwc = &event->hw;
 
 	hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;
+	if (!event->attr.freq)
+		hwc->flags |= PERF_X86_EVENT_AUTO_RELOAD;
 
 	cpuc->pebs_enabled |= 1ULL << hwc->idx;
 
@@ -680,6 +682,12 @@ void intel_pmu_pebs_enable(struct perf_event *event)
 		cpuc->pebs_enabled |= 1ULL << (hwc->idx + 32);
 	else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
 		cpuc->pebs_enabled |= 1ULL << 63;
+
+	/* Use auto-reload if possible to save a MSR write in the PMI */
+	if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
+		ds->pebs_event_reset[hwc->idx] =
+			(u64)-hwc->sample_period & x86_pmu.cntval_mask;
+	}
 }
 
 void intel_pmu_pebs_disable(struct perf_event *event)
@@ -698,6 +706,7 @@ void intel_pmu_pebs_disable(struct perf_event *event)
 		wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
 
 	hwc->config |= ARCH_PERFMON_EVENTSEL_INT;
+	hwc->flags &= ~PERF_X86_EVENT_AUTO_RELOAD;
 }
 
 void intel_pmu_pebs_enable_all(void)
-- 
1.8.3.2



* [PATCH V5 2/6] perf, x86: introduce setup_pebs_sample_data()
  2015-02-23 14:25 [PATCH V5 0/6] large PEBS interrupt threshold Kan Liang
  2015-02-23 14:25 ` [PATCH V5 1/6] perf, x86: use the PEBS auto reload mechanism when possible Kan Liang
@ 2015-02-23 14:25 ` Kan Liang
  2015-02-23 14:25 ` [PATCH V5 3/6] perf, x86: large PEBS interrupt threshold Kan Liang
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 23+ messages in thread
From: Kan Liang @ 2015-02-23 14:25 UTC (permalink / raw)
  To: a.p.zijlstra, linux-kernel; +Cc: mingo, acme, eranian, andi, Kan Liang

From: Yan, Zheng <zheng.z.yan@intel.com>

Move the code that sets up the PEBS sample data into a separate function.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Kan Liang <kan.liang@intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel_ds.c | 95 +++++++++++++++++--------------
 1 file changed, 52 insertions(+), 43 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 728fb88..1c16700 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -838,8 +838,10 @@ static inline u64 intel_hsw_transaction(struct pebs_record_hsw *pebs)
 	return txn;
 }
 
-static void __intel_pmu_pebs_event(struct perf_event *event,
-				   struct pt_regs *iregs, void *__pebs)
+static void setup_pebs_sample_data(struct perf_event *event,
+				   struct pt_regs *iregs, void *__pebs,
+				   struct perf_sample_data *data,
+				   struct pt_regs *regs)
 {
 #define PERF_X86_EVENT_PEBS_HSW_PREC \
 		(PERF_X86_EVENT_PEBS_ST_HSW | \
@@ -851,30 +853,25 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
 	 */
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	struct pebs_record_hsw *pebs = __pebs;
-	struct perf_sample_data data;
-	struct pt_regs regs;
 	u64 sample_type;
 	int fll, fst, dsrc;
 	int fl = event->hw.flags;
 
-	if (!intel_pmu_save_and_restart(event))
-		return;
-
 	sample_type = event->attr.sample_type;
 	dsrc = sample_type & PERF_SAMPLE_DATA_SRC;
 
 	fll = fl & PERF_X86_EVENT_PEBS_LDLAT;
 	fst = fl & (PERF_X86_EVENT_PEBS_ST | PERF_X86_EVENT_PEBS_HSW_PREC);
 
-	perf_sample_data_init(&data, 0, event->hw.last_period);
+	perf_sample_data_init(data, 0, event->hw.last_period);
 
-	data.period = event->hw.last_period;
+	data->period = event->hw.last_period;
 
 	/*
 	 * Use latency for weight (only avail with PEBS-LL)
 	 */
 	if (fll && (sample_type & PERF_SAMPLE_WEIGHT))
-		data.weight = pebs->lat;
+		data->weight = pebs->lat;
 
 	/*
 	 * data.data_src encodes the data source
@@ -887,7 +884,7 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
 			val = precise_datala_hsw(event, pebs->dse);
 		else if (fst)
 			val = precise_store_data(pebs->dse);
-		data.data_src.val = val;
+		data->data_src.val = val;
 	}
 
 	/*
@@ -900,58 +897,70 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
 	 * PERF_SAMPLE_IP and PERF_SAMPLE_CALLCHAIN to function properly.
 	 * A possible PERF_SAMPLE_REGS will have to transfer all regs.
 	 */
-	regs = *iregs;
-	regs.flags = pebs->flags;
-	set_linear_ip(&regs, pebs->ip);
-	regs.bp = pebs->bp;
-	regs.sp = pebs->sp;
+	*regs = *iregs;
+	regs->flags = pebs->flags;
+	set_linear_ip(regs, pebs->ip);
+	regs->bp = pebs->bp;
+	regs->sp = pebs->sp;
 
 	if (sample_type & PERF_SAMPLE_REGS_INTR) {
-		regs.ax = pebs->ax;
-		regs.bx = pebs->bx;
-		regs.cx = pebs->cx;
-		regs.dx = pebs->dx;
-		regs.si = pebs->si;
-		regs.di = pebs->di;
-		regs.bp = pebs->bp;
-		regs.sp = pebs->sp;
-
-		regs.flags = pebs->flags;
+		regs->ax = pebs->ax;
+		regs->bx = pebs->bx;
+		regs->cx = pebs->cx;
+		regs->dx = pebs->dx;
+		regs->si = pebs->si;
+		regs->di = pebs->di;
+		regs->bp = pebs->bp;
+		regs->sp = pebs->sp;
+
+		regs->flags = pebs->flags;
 #ifndef CONFIG_X86_32
-		regs.r8 = pebs->r8;
-		regs.r9 = pebs->r9;
-		regs.r10 = pebs->r10;
-		regs.r11 = pebs->r11;
-		regs.r12 = pebs->r12;
-		regs.r13 = pebs->r13;
-		regs.r14 = pebs->r14;
-		regs.r15 = pebs->r15;
+		regs->r8 = pebs->r8;
+		regs->r9 = pebs->r9;
+		regs->r10 = pebs->r10;
+		regs->r11 = pebs->r11;
+		regs->r12 = pebs->r12;
+		regs->r13 = pebs->r13;
+		regs->r14 = pebs->r14;
+		regs->r15 = pebs->r15;
 #endif
 	}
 
 	if (event->attr.precise_ip > 1 && x86_pmu.intel_cap.pebs_format >= 2) {
-		regs.ip = pebs->real_ip;
-		regs.flags |= PERF_EFLAGS_EXACT;
-	} else if (event->attr.precise_ip > 1 && intel_pmu_pebs_fixup_ip(&regs))
-		regs.flags |= PERF_EFLAGS_EXACT;
+		regs->ip = pebs->real_ip;
+		regs->flags |= PERF_EFLAGS_EXACT;
+	} else if (event->attr.precise_ip > 1 && intel_pmu_pebs_fixup_ip(regs))
+		regs->flags |= PERF_EFLAGS_EXACT;
 	else
-		regs.flags &= ~PERF_EFLAGS_EXACT;
+		regs->flags &= ~PERF_EFLAGS_EXACT;
 
 	if ((sample_type & PERF_SAMPLE_ADDR) &&
 	    x86_pmu.intel_cap.pebs_format >= 1)
-		data.addr = pebs->dla;
+		data->addr = pebs->dla;
 
 	if (x86_pmu.intel_cap.pebs_format >= 2) {
 		/* Only set the TSX weight when no memory weight. */
 		if ((sample_type & PERF_SAMPLE_WEIGHT) && !fll)
-			data.weight = intel_hsw_weight(pebs);
+			data->weight = intel_hsw_weight(pebs);
 
 		if (sample_type & PERF_SAMPLE_TRANSACTION)
-			data.txn = intel_hsw_transaction(pebs);
+			data->txn = intel_hsw_transaction(pebs);
 	}
 
 	if (has_branch_stack(event))
-		data.br_stack = &cpuc->lbr_stack;
+		data->br_stack = &cpuc->lbr_stack;
+}
+
+static void __intel_pmu_pebs_event(struct perf_event *event,
+				   struct pt_regs *iregs, void *__pebs)
+{
+	struct perf_sample_data data;
+	struct pt_regs regs;
+
+	if (!intel_pmu_save_and_restart(event))
+		return;
+
+	setup_pebs_sample_data(event, iregs, __pebs, &data, &regs);
 
 	if (perf_event_overflow(event, &data, &regs))
 		x86_pmu_stop(event, 0);
-- 
1.8.3.2



* [PATCH V5 3/6] perf, x86: large PEBS interrupt threshold
  2015-02-23 14:25 [PATCH V5 0/6] large PEBS interrupt threshold Kan Liang
  2015-02-23 14:25 ` [PATCH V5 1/6] perf, x86: use the PEBS auto reload mechanism when possible Kan Liang
  2015-02-23 14:25 ` [PATCH V5 2/6] perf, x86: introduce setup_pebs_sample_data() Kan Liang
@ 2015-02-23 14:25 ` Kan Liang
  2015-03-02 17:08   ` Stephane Eranian
  2015-03-30 13:54   ` Peter Zijlstra
  2015-02-23 14:25 ` [PATCH V5 4/6] perf, x86: handle multiple records in PEBS buffer Kan Liang
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 23+ messages in thread
From: Kan Liang @ 2015-02-23 14:25 UTC (permalink / raw)
  To: a.p.zijlstra, linux-kernel; +Cc: mingo, acme, eranian, andi, Kan Liang

From: Yan, Zheng <zheng.z.yan@intel.com>

PEBS always had the capability to log samples to its buffers without
an interrupt. Traditionally perf has not used this but always set the
PEBS threshold to one.

For frequently occurring events (like cycles, branches or load/stores)
this in turn requires using a relatively high sampling period to avoid
overloading the system with PMI processing. This in turn increases the
sampling error.

For the common cases we still need to use the PMI because the PEBS
hardware has various limitations. The biggest one is that it cannot
supply a callgraph. It also requires setting a fixed period, as the
hardware does not support adaptive period. Another issue is that it
cannot supply a time stamp and some other options. To supply a TID it
requires flushing on context switch. It can however supply the IP, the
load/store address, TSX information, registers, and some other things.

So we can make PEBS work for some specific cases, basically as long as
you can do without a callgraph and can set the period you can use this
new PEBS mode.

The main benefit is the ability to support much lower sampling periods
(down to -c 1000) without excessive overhead.

One use case is, for example, to increase the resolution of the c2c tool.
Another is double-checking when you suspect that standard sampling has
too much sampling error.

Some numbers on the overhead, using cycle soak, comparing the elapsed
time from "kernbench -M -H" between plain (threshold set to one) and
multi (large threshold).
The test command for plain:
  "perf record -e cycles:p -c $period -- kernbench -M -H"
The test command for multi:
  "perf record --no-time -e cycles:p -c $period -- kernbench -M -H"
(The only difference between the multi and plain commands is the time
stamp option. Since time stamps are not supported with a large PEBS
threshold, --no-time also serves as the switch that enables the large
threshold during the test.)

period    plain(Sec)  multi(Sec)  Delta
10003     32.7        16.5        16.2
20003     30.2        16.2        14.0
40003     18.6        14.1        4.5
80003     16.8        14.6        2.2
100003    16.9        14.1        2.8
800003    15.4        15.7        -0.3
1000003   15.3        15.2        0.2
2000003   15.3        15.1        0.1

With periods below 100003, plain (threshold one) causes much more
overhead. With a 10003 sampling period, the elapsed time for multi is
2X better than for plain.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Kan Liang <kan.liang@intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel_ds.c | 40 +++++++++++++++++++++++++++----
 1 file changed, 36 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 1c16700..16fdb18 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -250,7 +250,7 @@ static int alloc_pebs_buffer(int cpu)
 {
 	struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
 	int node = cpu_to_node(cpu);
-	int max, thresh = 1; /* always use a single PEBS record */
+	int max;
 	void *buffer, *ibuffer;
 
 	if (!x86_pmu.pebs)
@@ -280,9 +280,6 @@ static int alloc_pebs_buffer(int cpu)
 	ds->pebs_absolute_maximum = ds->pebs_buffer_base +
 		max * x86_pmu.pebs_record_size;
 
-	ds->pebs_interrupt_threshold = ds->pebs_buffer_base +
-		thresh * x86_pmu.pebs_record_size;
-
 	return 0;
 }
 
@@ -667,15 +664,35 @@ struct event_constraint *intel_pebs_constraints(struct perf_event *event)
 	return &emptyconstraint;
 }
 
+/*
+ * Flags PEBS can handle without an PMI.
+ *
+ * TID can only be handled by flushing at context switch.
+ */
+#define PEBS_FREERUNNING_FLAGS	\
+	(PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_ADDR | \
+	PERF_SAMPLE_ID | PERF_SAMPLE_CPU | PERF_SAMPLE_STREAM_ID | \
+	PERF_SAMPLE_DATA_SRC | PERF_SAMPLE_IDENTIFIER | \
+	PERF_SAMPLE_TRANSACTION)
+
+static inline bool pebs_is_enabled(struct cpu_hw_events *cpuc)
+{
+	return (cpuc->pebs_enabled & ((1ULL << MAX_PEBS_EVENTS) - 1));
+}
+
 void intel_pmu_pebs_enable(struct perf_event *event)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	struct hw_perf_event *hwc = &event->hw;
+	struct debug_store *ds = cpuc->ds;
+	bool first_pebs;
+	u64 threshold;
 
 	hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;
 	if (!event->attr.freq)
 		hwc->flags |= PERF_X86_EVENT_AUTO_RELOAD;
 
+	first_pebs = !pebs_is_enabled(cpuc);
 	cpuc->pebs_enabled |= 1ULL << hwc->idx;
 
 	if (event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT)
@@ -683,6 +700,21 @@ void intel_pmu_pebs_enable(struct perf_event *event)
 	else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
 		cpuc->pebs_enabled |= 1ULL << 63;
 
+	/*
+	 * When the event is constrained enough we can use a larger
+	 * threshold and run the event with less frequent PMI.
+	 */
+	if (0 && /* disable this temporarily */
+	    (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) &&
+	    !(event->attr.sample_type & ~PEBS_FREERUNNING_FLAGS)) {
+		threshold = ds->pebs_absolute_maximum -
+			x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
+	} else {
+		threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
+	}
+	if (first_pebs || ds->pebs_interrupt_threshold > threshold)
+		ds->pebs_interrupt_threshold = threshold;
+
 	/* Use auto-reload if possible to save a MSR write in the PMI */
 	if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
 		ds->pebs_event_reset[hwc->idx] =
-- 
1.8.3.2



* [PATCH V5 4/6] perf, x86: handle multiple records in PEBS buffer
  2015-02-23 14:25 [PATCH V5 0/6] large PEBS interrupt threshold Kan Liang
                   ` (2 preceding siblings ...)
  2015-02-23 14:25 ` [PATCH V5 3/6] perf, x86: large PEBS interrupt threshold Kan Liang
@ 2015-02-23 14:25 ` Kan Liang
  2015-03-30 13:45   ` Peter Zijlstra
  2015-02-23 14:25 ` [PATCH V5 5/6] perf, x86: drain PEBS buffer during context switch Kan Liang
  2015-02-23 14:25 ` [PATCH V5 6/6] perf, x86: enlarge PEBS buffer Kan Liang
  5 siblings, 1 reply; 23+ messages in thread
From: Kan Liang @ 2015-02-23 14:25 UTC (permalink / raw)
  To: a.p.zijlstra, linux-kernel; +Cc: mingo, acme, eranian, andi, Kan Liang

From: Yan, Zheng <zheng.z.yan@intel.com>

When the PEBS interrupt threshold is larger than one, the PEBS buffer
may include multiple records for each PEBS event. This patch makes
the code first count how many records each PEBS event has, then
output the samples in batch.

One corner case worth mentioning is that the PEBS hardware doesn't
deal well with collisions, which occur when PEBS events happen close
to each other. The records for such events can be collapsed into a
single one, and it is not possible to reconstruct all the events that
caused the PEBS record. However, in practice collisions are extremely
rare as long as different events are used. The periods are typically
very large, so any collision is unlikely. When a collision happens,
we drop the PEBS record.

Here are some numbers on collisions. Four frequently occurring events
(cycles:p,instructions:p,branches:p,mem-stores:p) were tested:

Test events which are sampled together                   collision rate
cycles:p,instructions:p                                  0.25%
cycles:p,instructions:p,branches:p                       0.30%
cycles:p,instructions:p,branches:p,mem-stores:p          0.35%

cycles:p,cycles:p                                        98.52%

Collisions are extremely rare as long as different events are used.
The only way to get a lot of collisions is to count the same thing
multiple times, which is not a useful configuration.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Kan Liang <kan.liang@intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel_ds.c | 114 ++++++++++++++++++++----------
 1 file changed, 77 insertions(+), 37 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 16fdb18..cf74a52 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -984,18 +984,52 @@ static void setup_pebs_sample_data(struct perf_event *event,
 }
 
 static void __intel_pmu_pebs_event(struct perf_event *event,
-				   struct pt_regs *iregs, void *__pebs)
+				   struct pt_regs *iregs,
+				   void *at, void *top, int count)
 {
+	struct perf_output_handle handle;
+	struct perf_event_header header;
 	struct perf_sample_data data;
 	struct pt_regs regs;
 
-	if (!intel_pmu_save_and_restart(event))
+	if (!intel_pmu_save_and_restart(event) &&
+	    !(event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD))
 		return;
 
-	setup_pebs_sample_data(event, iregs, __pebs, &data, &regs);
+	setup_pebs_sample_data(event, iregs, at, &data, &regs);
 
-	if (perf_event_overflow(event, &data, &regs))
+	if (perf_event_overflow(event, &data, &regs)) {
 		x86_pmu_stop(event, 0);
+		return;
+	}
+
+	if (count <= 1)
+		return;
+
+	at += x86_pmu.pebs_record_size;
+	count--;
+
+	perf_sample_data_init(&data, 0, event->hw.last_period);
+	perf_prepare_sample(&header, &data, event, &regs);
+
+	if (perf_output_begin(&handle, event, header.size * count))
+		return;
+
+	for (; at < top; at += x86_pmu.pebs_record_size) {
+		struct pebs_record_nhm *p = at;
+
+		if (p->status != (1 << event->hw.idx))
+			continue;
+
+		setup_pebs_sample_data(event, iregs, at, &data, &regs);
+		perf_output_sample(&handle, &header, &data, event);
+
+		count--;
+		if (count == 0)
+			break;
+	}
+
+	perf_output_end(&handle);
 }
 
 static void intel_pmu_drain_pebs_core(struct pt_regs *iregs)
@@ -1036,61 +1070,67 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs)
 	WARN_ONCE(n > 1, "bad leftover pebs %d\n", n);
 	at += n - 1;
 
-	__intel_pmu_pebs_event(event, iregs, at);
+	__intel_pmu_pebs_event(event, iregs, at, top, 1);
 }
 
 static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	struct debug_store *ds = cpuc->ds;
-	struct perf_event *event = NULL;
-	void *at, *top;
-	u64 status = 0;
+	struct perf_event *event;
+	void *base, *at, *top;
 	int bit;
+	int counts[MAX_PEBS_EVENTS] = {};
 
 	if (!x86_pmu.pebs_active)
 		return;
 
-	at  = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
+	base = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
 	top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index;
 
 	ds->pebs_index = ds->pebs_buffer_base;
 
-	if (unlikely(at > top))
+	if (unlikely(base >= top))
 		return;
 
-	/*
-	 * Should not happen, we program the threshold at 1 and do not
-	 * set a reset value.
-	 */
-	WARN_ONCE(top - at > x86_pmu.max_pebs_events * x86_pmu.pebs_record_size,
-		  "Unexpected number of pebs records %ld\n",
-		  (long)(top - at) / x86_pmu.pebs_record_size);
-
-	for (; at < top; at += x86_pmu.pebs_record_size) {
+	for (at = base; at < top; at += x86_pmu.pebs_record_size) {
 		struct pebs_record_nhm *p = at;
 
-		for_each_set_bit(bit, (unsigned long *)&p->status,
-				 x86_pmu.max_pebs_events) {
-			event = cpuc->events[bit];
-			if (!test_bit(bit, cpuc->active_mask))
-				continue;
-
-			WARN_ON_ONCE(!event);
-
-			if (!event->attr.precise_ip)
-				continue;
-
-			if (__test_and_set_bit(bit, (unsigned long *)&status))
-				continue;
-
-			break;
-		}
+		bit = find_first_bit((unsigned long *)&p->status,
+					x86_pmu.max_pebs_events);
+		if (bit >= x86_pmu.max_pebs_events)
+			continue;
+		/*
+		 * The PEBS hardware does not deal well with collisions,
+		 * when the same event happens near to each other. The
+		 * records for the events can be collapsed into a single
+		 * one, and it's not possible to reconstruct all events
+		 * that caused the PEBS record. However in practice, the
+		 * collisions are extremely rare. If collision happened,
+		 * we drop the record. its the safest choice.
+		 */
+		if (p->status != (1 << bit))
+			continue;
+		if (!test_bit(bit, cpuc->active_mask))
+			continue;
+		event = cpuc->events[bit];
+		WARN_ON_ONCE(!event);
+		if (!event->attr.precise_ip)
+			continue;
+		counts[bit]++;
+	}
 
-		if (!event || bit >= x86_pmu.max_pebs_events)
+	for (bit = 0; bit < x86_pmu.max_pebs_events; bit++) {
+		if (counts[bit] == 0)
 			continue;
+		event = cpuc->events[bit];
+		for (at = base; at < top; at += x86_pmu.pebs_record_size) {
+			struct pebs_record_nhm *p = at;
 
-		__intel_pmu_pebs_event(event, iregs, at);
+			if (p->status == (1 << bit))
+				break;
+		}
+		__intel_pmu_pebs_event(event, iregs, at, top, counts[bit]);
 	}
 }
 
-- 
1.8.3.2



* [PATCH V5 5/6] perf, x86: drain PEBS buffer during context switch
  2015-02-23 14:25 [PATCH V5 0/6] large PEBS interrupt threshold Kan Liang
                   ` (3 preceding siblings ...)
  2015-02-23 14:25 ` [PATCH V5 4/6] perf, x86: handle multiple records in PEBS buffer Kan Liang
@ 2015-02-23 14:25 ` Kan Liang
  2015-03-30 13:50   ` Peter Zijlstra
  2015-02-23 14:25 ` [PATCH V5 6/6] perf, x86: enlarge PEBS buffer Kan Liang
  5 siblings, 1 reply; 23+ messages in thread
From: Kan Liang @ 2015-02-23 14:25 UTC (permalink / raw)
  To: a.p.zijlstra, linux-kernel; +Cc: mingo, acme, eranian, andi, Kan Liang

From: Yan, Zheng <zheng.z.yan@intel.com>

Flush the PEBS buffer during context switch if the PEBS interrupt
threshold is larger than one. This allows perf to supply the TID for
sample outputs.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Kan Liang <kan.liang@intel.com>
---
 arch/x86/kernel/cpu/perf_event.h           |  3 +++
 arch/x86/kernel/cpu/perf_event_intel.c     | 11 +++++++++-
 arch/x86/kernel/cpu/perf_event_intel_ds.c  | 33 ++++++++++++++++++++++++++++--
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |  3 ---
 4 files changed, 44 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index bc4ae3b..b4f6431 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -151,6 +151,7 @@ struct cpu_hw_events {
 	 */
 	struct debug_store	*ds;
 	u64			pebs_enabled;
+	bool			pebs_sched_cb_enabled;
 
 	/*
 	 * Intel LBR bits
@@ -739,6 +740,8 @@ void intel_pmu_pebs_enable_all(void);
 
 void intel_pmu_pebs_disable_all(void);
 
+void intel_pmu_pebs_sched_task(struct perf_event_context *ctx, bool sched_in);
+
 void intel_ds_init(void);
 
 void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 9f1dd18..bdb2ffd 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2030,6 +2030,15 @@ static void intel_pmu_cpu_dying(int cpu)
 	fini_debug_store_on_cpu(cpu);
 }
 
+static void intel_pmu_sched_task(struct perf_event_context *ctx,
+				 bool sched_in)
+{
+	if (x86_pmu.pebs_active)
+		intel_pmu_pebs_sched_task(ctx, sched_in);
+	if (x86_pmu.lbr_nr)
+		intel_pmu_lbr_sched_task(ctx, sched_in);
+}
+
 PMU_FORMAT_ATTR(offcore_rsp, "config1:0-63");
 
 PMU_FORMAT_ATTR(ldlat, "config1:0-15");
@@ -2081,7 +2090,7 @@ static __initconst const struct x86_pmu intel_pmu = {
 	.cpu_starting		= intel_pmu_cpu_starting,
 	.cpu_dying		= intel_pmu_cpu_dying,
 	.guest_get_msrs		= intel_guest_get_msrs,
-	.sched_task		= intel_pmu_lbr_sched_task,
+	.sched_task		= intel_pmu_sched_task,
 };
 
 static __init void intel_clovertown_quirk(void)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index cf74a52..b8d0451 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -664,6 +664,19 @@ struct event_constraint *intel_pebs_constraints(struct perf_event *event)
 	return &emptyconstraint;
 }
 
+static inline void intel_pmu_drain_pebs_buffer(void)
+{
+	struct pt_regs regs;
+
+	x86_pmu.drain_pebs(&regs);
+}
+
+void intel_pmu_pebs_sched_task(struct perf_event_context *ctx, bool sched_in)
+{
+	if (!sched_in)
+		intel_pmu_drain_pebs_buffer();
+}
+
 /*
  * Flags PEBS can handle without an PMI.
  *
@@ -704,13 +717,20 @@ void intel_pmu_pebs_enable(struct perf_event *event)
 	 * When the event is constrained enough we can use a larger
 	 * threshold and run the event with less frequent PMI.
 	 */
-	if (0 && /* disable this temporarily */
-	    (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) &&
+	if ((hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) &&
 	    !(event->attr.sample_type & ~PEBS_FREERUNNING_FLAGS)) {
 		threshold = ds->pebs_absolute_maximum -
 			x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
+		if (first_pebs) {
+			perf_sched_cb_inc(event->ctx->pmu);
+			cpuc->pebs_sched_cb_enabled = true;
+		}
 	} else {
 		threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
+		if (cpuc->pebs_sched_cb_enabled) {
+			perf_sched_cb_dec(event->ctx->pmu);
+			cpuc->pebs_sched_cb_enabled = false;
+		}
 	}
 	if (first_pebs || ds->pebs_interrupt_threshold > threshold)
 		ds->pebs_interrupt_threshold = threshold;
@@ -726,8 +746,17 @@ void intel_pmu_pebs_disable(struct perf_event *event)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	struct hw_perf_event *hwc = &event->hw;
+	struct debug_store *ds = cpuc->ds;
+
+	if (ds->pebs_interrupt_threshold >
+	    ds->pebs_buffer_base + x86_pmu.pebs_record_size)
+		intel_pmu_drain_pebs_buffer();
 
 	cpuc->pebs_enabled &= ~(1ULL << hwc->idx);
+	if (cpuc->pebs_sched_cb_enabled && !pebs_is_enabled(cpuc)) {
+		perf_sched_cb_dec(event->ctx->pmu);
+		cpuc->pebs_sched_cb_enabled = false;
+	}
 
 	if (event->hw.constraint->flags & PERF_X86_EVENT_PEBS_LDLAT)
 		cpuc->pebs_enabled &= ~(1ULL << (hwc->idx + 32));
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 0473874..1edd3b9 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -256,9 +256,6 @@ void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	struct x86_perf_task_context *task_ctx;
 
-	if (!x86_pmu.lbr_nr)
-		return;
-
 	/*
 	 * If LBR callstack feature is enabled and the stack was saved when
 	 * the task was scheduled out, restore the stack. Otherwise flush
-- 
1.8.3.2



* [PATCH V5 6/6] perf, x86: enlarge PEBS buffer
  2015-02-23 14:25 [PATCH V5 0/6] large PEBS interrupt threshold Kan Liang
                   ` (4 preceding siblings ...)
  2015-02-23 14:25 ` [PATCH V5 5/6] perf, x86: drain PEBS buffer during context switch Kan Liang
@ 2015-02-23 14:25 ` Kan Liang
  5 siblings, 0 replies; 23+ messages in thread
From: Kan Liang @ 2015-02-23 14:25 UTC (permalink / raw)
  To: a.p.zijlstra, linux-kernel; +Cc: mingo, acme, eranian, andi, Kan Liang

From: Yan, Zheng <zheng.z.yan@intel.com>

Currently the PEBS buffer size is 4k, which can only hold about 21
PEBS records. This patch enlarges the PEBS buffer to 64k (the same
as the BTS buffer), which can hold about 330 PEBS records. This
significantly reduces the number of PMIs when a large PEBS interrupt
threshold is used.
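
(As a rough cross-check, assuming the ~192-byte Haswell-style PEBS
record: 4096 / 192 is about 21 records per 4k buffer, and 65536 / 192
is about 341, which matches the roughly 330 usable records once the
max_pebs_events records of headroom below the interrupt threshold are
subtracted.)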

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Kan Liang <kan.liang@intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel_ds.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index b8d0451..eba42a1 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -11,7 +11,7 @@
 #define BTS_RECORD_SIZE		24
 
 #define BTS_BUFFER_SIZE		(PAGE_SIZE << 4)
-#define PEBS_BUFFER_SIZE	PAGE_SIZE
+#define PEBS_BUFFER_SIZE	(PAGE_SIZE << 4)
 #define PEBS_FIXUP_SIZE		PAGE_SIZE
 
 /*
-- 
1.8.3.2



* Re: [PATCH V5 3/6] perf, x86: large PEBS interrupt threshold
  2015-02-23 14:25 ` [PATCH V5 3/6] perf, x86: large PEBS interrupt threshold Kan Liang
@ 2015-03-02 17:08   ` Stephane Eranian
  2015-03-02 17:59     ` Andi Kleen
  2015-03-30 13:54   ` Peter Zijlstra
  1 sibling, 1 reply; 23+ messages in thread
From: Stephane Eranian @ 2015-03-02 17:08 UTC (permalink / raw)
  To: Kan Liang
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Arnaldo Carvalho de Melo, Andi Kleen

Hi,

I spent some time looking at this patch series and testing some scenarios.

On Mon, Feb 23, 2015 at 9:25 AM, Kan Liang <kan.liang@intel.com> wrote:
>
> From: Yan, Zheng <zheng.z.yan@intel.com>
>
> PEBS always had the capability to log samples to its buffers without
> an interrupt. Traditionally perf has not used this but always set the
> PEBS threshold to one.
>
> For frequently occurring events (like cycles or branches or load/store)
> this in term requires using a relatively high sampling period to avoid
> overloading the system, by only processing PMIs. This in term increases
> sampling error.
>
> For the common cases we still need to use the PMI because the PEBS
> hardware has various limitations. The biggest one is that it can not
> supply a callgraph. It also requires setting a fixed period, as the
> hardware does not support adaptive period. Another issue is that it
> cannot supply a time stamp and some other options. To supply a TID it
> requires flushing on context switch. It can however supply the IP, the
> load/store address, TSX information, registers, and some other things.
>
> So we can make PEBS work for some specific cases, basically as long as
> you can do without a callgraph and can set the period you can use this
> new PEBS mode.
>
> The main benefit is the ability to support much lower sampling period
> (down to -c 1000) without extensive overhead.
>
> One use cases is for example to increase the resolution of the c2c tool.
> Another is double checking when you suspect the standard sampling has
> too much sampling error.
>
> Some numbers on the overhead, using cycle soak, comparing the elapsed
> time from "kernbench -M -H" between plain (threshold set to one) and
> multi (large threshold).
> The test command for plain:
>   "perf record -e cycles:p -c $period -- kernbench -M -H"
> The test command for multi:
>   "perf record --no-time -e cycles:p -c $period -- kernbench -M -H"
> (The only difference of test command between multi and plain is time
> stamp options. Since time stamp is not supported by large PEBS
> threshold, it can be used as a flag to indicate if large threshold is
> enabled during the test.)
>
> period    plain(Sec)  multi(Sec)  Delta
> 10003     32.7        16.5        16.2
> 20003     30.2        16.2        14.0
> 40003     18.6        14.1        4.5
> 80003     16.8        14.6        2.2
> 100003    16.9        14.1        2.8
> 800003    15.4        15.7        -0.3

Who collects with such small periods? I believe there would be other
side effects.

>
> 1000003   15.3        15.2        0.2
> 2000003   15.3        15.1        0.1


At more reasonable periods, the benefit seems to vanish. I don't quite know
of a scenario where one would absolutely need a very small period. This is
about statistical sampling and not tracing.

I believe that you need to know exactly what you are doing, i.e., what you
are measuring for this improvement to make a difference and yet provide
useful data. I believe in system-wide mode, the benefit vanishes quickly
because you may not know in advance what you are measuring. The
key problem is the lack of a timestamp, which helps order the MMAP records
vs. sample records.  I tried with my recently released jitted code patch
series and sure enough with the --no-time, you cannot correlate samples
to symbols in the jitted code anymore. This is a case where you need
the timestamps to synchronize user level sampling data (obtained from
the jit compiler) with perf samples. Unless I know I am measuring jitted
code, I would not be able to use the PEBS buffer improvements without
losing symbolization capabilities.

I believe the patch may only be useful for per-process monitoring,
where the user knows exactly what is measured. But then maybe a
much lower overhead is not the key factor in the collection. However,
if you were to measure in a production environment, you'd need to
minimize overhead, yet most commonly this is done in system-wide
mode.

Overall, I think the patch is only useful to a small set of monitoring scenarios
and for people using "extreme" sampling periods. It requires the user to be
aware of the behavior of the monitoring app (because of the no-timestamp
requirement).


>
> With periods below 100003, plain (threshold one) cause much more
> overhead. With 10003 sampling period, the Elapsed Time for multi is
> even 2X faster than plain.
>
> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
> Signed-off-by: Kan Liang <kan.liang@intel.com>
> ---
>  arch/x86/kernel/cpu/perf_event_intel_ds.c | 40 +++++++++++++++++++++++++++----
>  1 file changed, 36 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
> index 1c16700..16fdb18 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
> @@ -250,7 +250,7 @@ static int alloc_pebs_buffer(int cpu)
>  {
>         struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
>         int node = cpu_to_node(cpu);
> -       int max, thresh = 1; /* always use a single PEBS record */
> +       int max;
>         void *buffer, *ibuffer;
>
>         if (!x86_pmu.pebs)
> @@ -280,9 +280,6 @@ static int alloc_pebs_buffer(int cpu)
>         ds->pebs_absolute_maximum = ds->pebs_buffer_base +
>                 max * x86_pmu.pebs_record_size;
>
> -       ds->pebs_interrupt_threshold = ds->pebs_buffer_base +
> -               thresh * x86_pmu.pebs_record_size;
> -
>         return 0;
>  }
>
> @@ -667,15 +664,35 @@ struct event_constraint *intel_pebs_constraints(struct perf_event *event)
>         return &emptyconstraint;
>  }
>
> +/*
> + * Flags PEBS can handle without an PMI.
> + *
> + * TID can only be handled by flushing at context switch.
> + */
> +#define PEBS_FREERUNNING_FLAGS \
> +       (PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_ADDR | \
> +       PERF_SAMPLE_ID | PERF_SAMPLE_CPU | PERF_SAMPLE_STREAM_ID | \
> +       PERF_SAMPLE_DATA_SRC | PERF_SAMPLE_IDENTIFIER | \
> +       PERF_SAMPLE_TRANSACTION)
> +
> +static inline bool pebs_is_enabled(struct cpu_hw_events *cpuc)
> +{
> +       return (cpuc->pebs_enabled & ((1ULL << MAX_PEBS_EVENTS) - 1));
> +}
> +
>  void intel_pmu_pebs_enable(struct perf_event *event)
>  {
>         struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>         struct hw_perf_event *hwc = &event->hw;
> +       struct debug_store *ds = cpuc->ds;
> +       bool first_pebs;
> +       u64 threshold;
>
>         hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;
>         if (!event->attr.freq)
>                 hwc->flags |= PERF_X86_EVENT_AUTO_RELOAD;
>
> +       first_pebs = !pebs_is_enabled(cpuc);
>         cpuc->pebs_enabled |= 1ULL << hwc->idx;
>
>         if (event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT)
> @@ -683,6 +700,21 @@ void intel_pmu_pebs_enable(struct perf_event *event)
>         else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
>                 cpuc->pebs_enabled |= 1ULL << 63;
>
> +       /*
> +        * When the event is constrained enough we can use a larger
> +        * threshold and run the event with less frequent PMI.
> +        */
> +       if (0 && /* disable this temporarily */
> +           (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) &&
> +           !(event->attr.sample_type & ~PEBS_FREERUNNING_FLAGS)) {
> +               threshold = ds->pebs_absolute_maximum -
> +                       x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
> +       } else {
> +               threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
> +       }
> +       if (first_pebs || ds->pebs_interrupt_threshold > threshold)
> +               ds->pebs_interrupt_threshold = threshold;
> +
>         /* Use auto-reload if possible to save a MSR write in the PMI */
>         if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
>                 ds->pebs_event_reset[hwc->idx] =
> --
> 1.8.3.2
>


* Re: [PATCH V5 3/6] perf, x86: large PEBS interrupt threshold
  2015-03-02 17:08   ` Stephane Eranian
@ 2015-03-02 17:59     ` Andi Kleen
  2015-03-02 18:07       ` Stephane Eranian
  0 siblings, 1 reply; 23+ messages in thread
From: Andi Kleen @ 2015-03-02 17:59 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Kan Liang, Peter Zijlstra, LKML, Ingo Molnar,
	Arnaldo Carvalho de Melo, Andi Kleen

> At more reasonable periods, the benefit seems to vanish. I don't quite know
> of a scenario where one would absolutely need very small period. This is
> about statistical sampling and not tracing.

In workloads with small irregular events (e.g. irregularly
spaced event handlers that run for less than a millisecond), a small
period is the only reliable way to catch them.

You're right the mode is not very useful for anything
that runs the same code continuously.

However irregular scenarios are becoming more and more
important. For example this is useful for characterizing
complex GUI applications.

> I believe that you need to know exactly what you are doing, i.e., what you
> are measuring for this improvement to make a difference and yet provide
> useful data. I believe in system-wide mode, the benefit vanishes quickly
> because you may not know in advance what you are measuring. The
> key problem is the lack of timestamp which help order the MMAP records
> vs. sample records.  I tried with my recently released Jitted code patch
> series and sure enough with the --no-time, you cannot correlate samples
> to symbols in the jitted code anymore. This is a case where you need
> the timestamps to synchronize user level sampling data (obtained from
> the jit compiler) with perf samples. Unless I know I am measuring jitted
> code, I would not be able to use the PEBS buffer improvements without
> losing symbolization capabilities.

This is only an issue when the JIT regularly reuses virtual address space.
If it mostly uses new addresses there is no issue, as the
mappings are unique. I believe JITs rarely reuse.

> Overall, I think the patch is only useful to a small set of monitoring scenarios
> and for people using "extreme" sampling periods. It requires the user to be
> aware of the behavior of the monitoring app (because of the no-timestamp
> requirement).

Yes right now with all the limitations you need to know
what you're doing. However it's still useful for experts.

-Andi


* Re: [PATCH V5 3/6] perf, x86: large PEBS interrupt threshold
  2015-03-02 17:59     ` Andi Kleen
@ 2015-03-02 18:07       ` Stephane Eranian
  0 siblings, 0 replies; 23+ messages in thread
From: Stephane Eranian @ 2015-03-02 18:07 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Kan Liang, Peter Zijlstra, LKML, Ingo Molnar, Arnaldo Carvalho de Melo

On Mon, Mar 2, 2015 at 12:59 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> At more reasonable periods, the benefit seems to vanish. I don't quite know
>> of a scenario where one would absolutely need very small period. This is
>> about statistical sampling and not tracing.
>
> In workloads with small irregular events (e.g. irregularly
> space event handlers that run for less than a milli-second) a small
> period is the only reliable way to catch them.
>
> You're right the mode is not very useful for anything
> that runs the same code continuously.
>
> However irregular scenarios are becoming more and more
> important. For example this is useful for characterizing
> complex GUI applications.
>
>> I believe that you need to know exactly what you are doing, i.e., what you
>> are measuring for this improvement to make a difference and yet provide
>> useful data. I believe in system-wide mode, the benefit vanishes quickly
>> because you may not know in advance what you are measuring. The
>> key problem is the lack of timestamp which help order the MMAP records
>> vs. sample records.  I tried with my recently released Jitted code patch
>> series and sure enough with the --no-time, you cannot correlate samples
>> to symbols in the jitted code anymore. This is a case where you need
>> the timestamps to synchronize user level sampling data (obtained from
>> the jit compiler) with perf samples. Unless I know I am measuring jitted
>> code, I would not be able to use the PEBS buffer improvements without
>> losing symbolization capabilities.
>
> This is only when the JIT regularly reuses virtual address space.
> If it mostly uses new addresses there is no issue, as the
> mappings are unique. I believe JITs rarely reuse.
>
Not just in that case. In my test case, there is no re-jitting.
It is just that you need to reconcile the timestamps in the jit
meta-data file (jitdump)
and the perf.data so that you can insert the MMAP covering the jitted
code in the
right place. Otherwise the big anon mmap() remains in place and masks the MMAPs
generated from the jitdump. This is based on how my JIT patch series works.

>> Overall, I think the patch is only useful to a small set of monitoring scenarios
>> and for people using "extreme" sampling periods. It requires the user to be
>> aware of the behavior of the monitoring app (because of the no-timestamp
>> requirement).
>
> Yes right now with all the limitations you need to know
> what you're doing. However it's still useful for experts.
>
You need to know that no part of the address space is reused
over time, i.e., no overlapping or recycled mmap regions.


* Re: [PATCH V5 1/6] perf, x86: use the PEBS auto reload mechanism when possible
  2015-02-23 14:25 ` [PATCH V5 1/6] perf, x86: use the PEBS auto reload mechanism when possible Kan Liang
@ 2015-03-30 12:06   ` Peter Zijlstra
  2015-03-30 14:02     ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2015-03-30 12:06 UTC (permalink / raw)
  To: Kan Liang; +Cc: linux-kernel, mingo, acme, eranian, andi

On Mon, Feb 23, 2015 at 09:25:51AM -0500, Kan Liang wrote:
> +++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
> @@ -673,6 +673,8 @@ void intel_pmu_pebs_enable(struct perf_event *event)
>  	struct hw_perf_event *hwc = &event->hw;
>  
>  	hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;
> +	if (!event->attr.freq)
> +		hwc->flags |= PERF_X86_EVENT_AUTO_RELOAD;
>  
>  	cpuc->pebs_enabled |= 1ULL << hwc->idx;

Why not in x86_pmu_event_init()? This is not something that will change,
ever, right?

> @@ -680,6 +682,12 @@ void intel_pmu_pebs_enable(struct perf_event *event)
>  		cpuc->pebs_enabled |= 1ULL << (hwc->idx + 32);
>  	else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
>  		cpuc->pebs_enabled |= 1ULL << 63;
> +
> +	/* Use auto-reload if possible to save a MSR write in the PMI */
> +	if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
> +		ds->pebs_event_reset[hwc->idx] =
> +			(u64)-hwc->sample_period & x86_pmu.cntval_mask;
> +	}
>  }
>  
>  void intel_pmu_pebs_disable(struct perf_event *event)
> @@ -698,6 +706,7 @@ void intel_pmu_pebs_disable(struct perf_event *event)
>  		wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
>  
>  	hwc->config |= ARCH_PERFMON_EVENTSEL_INT;
> +	hwc->flags &= ~PERF_X86_EVENT_AUTO_RELOAD;

Equally, we should not ever have to clear this. You cannot change
attributes after the event is created; if that flag ever gets set, it
stays valid.


* Re: [PATCH V5 4/6] perf, x86: handle multiple records in PEBS buffer
  2015-02-23 14:25 ` [PATCH V5 4/6] perf, x86: handle multiple records in PEBS buffer Kan Liang
@ 2015-03-30 13:45   ` Peter Zijlstra
  2015-03-30 17:19     ` Liang, Kan
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2015-03-30 13:45 UTC (permalink / raw)
  To: Kan Liang; +Cc: linux-kernel, mingo, acme, eranian, andi

On Mon, Feb 23, 2015 at 09:25:54AM -0500, Kan Liang wrote:
> From: Yan, Zheng <zheng.z.yan@intel.com>
> 
> When PEBS interrupt threshold is larger than one, the PEBS buffer
> may include multiple records for each PEBS event. This patch makes
> the code first count how many records each PEBS event has, then
> output the samples in batch.
> 
> One corner case needs to mention is that the PEBS hardware doesn't
> deal well with collisions, when PEBS events happen near to each
> other. The records for the events can be collapsed into a single
> one, and it's not possible to reconstruct all events that caused
> the PEBS record, However in practice collisions are extremely rare,
> as long as different events are used. The periods are typically very
> large, so any collision is unlikely. When collision happens, we drop
> the PEBS record.
> 
> Here are some numbers about collisions.
> Four frequently occurring events
> (cycles:p,instructions:p,branches:p,mem-stores:p) are tested
> 
> Test events which are sampled together                   collision rate
> cycles:p,instructions:p                                  0.25%
> cycles:p,instructions:p,branches:p                       0.30%
> cycles:p,instructions:p,branches:p,mem-stores:p          0.35%
> 
> cycles:p,cycles:p                                        98.52%
> 
> collisions are extremely rare as long as different events are used. The
> only way you can get a lot of collision is when you count the same thing
> multiple times. But it is not a useful configuration.

This fails to mention the other problem the status field has.  You also
did not specify what exact condition you counted as a collision.

The PEBS status field is a copy of the GLOBAL_STATUS MSR at assist time,
which means that:

 - it's possible (and harmless) for the status field to contain set bits
   for !PEBS events -- the proposed code is buggy here.
 - it's possible to have multiple PEBS bits set even though the record
   really was for a single event -- if you count everything with
   multiple PEBS bits set as a collision you're counting wrong.

So once again, a coherent story here please.

>  static void __intel_pmu_pebs_event(struct perf_event *event,
> +				   struct pt_regs *iregs,
> +				   void *at, void *top, int count)
>  {
> +	struct perf_output_handle handle;
> +	struct perf_event_header header;
>  	struct perf_sample_data data;
>  	struct pt_regs regs;
>  
> +	if (!intel_pmu_save_and_restart(event) &&
> +	    !(event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD))
>  		return;
>  
> +	setup_pebs_sample_data(event, iregs, at, &data, &regs);
>  
> +	if (perf_event_overflow(event, &data, &regs)) {
>  		x86_pmu_stop(event, 0);
> +		return;
> +	}
> +
> +	if (count <= 1)
> +		return;
> +
> +	at += x86_pmu.pebs_record_size;
> +	count--;
> +
> +	perf_sample_data_init(&data, 0, event->hw.last_period);
> +	perf_prepare_sample(&header, &data, event, &regs);
> +
> +	if (perf_output_begin(&handle, event, header.size * count))
> +		return;
> +
> +	for (; at < top; at += x86_pmu.pebs_record_size) {
> +		struct pebs_record_nhm *p = at;
> +
> +		if (p->status != (1 << event->hw.idx))
> +			continue;
> +
> +		setup_pebs_sample_data(event, iregs, at, &data, &regs);
> +		perf_output_sample(&handle, &header, &data, event);
> +
> +		count--;
> +		if (count == 0)
> +			break;
> +	}
> +
> +	perf_output_end(&handle);
>  }

This could use a comment on why it is done like this. I have vague
memories, but a comment helps everybody who doesn't have those memories
-- which will include me in a year or so.

What I cannot remember is why we call overflow on the first, not the
last event.

>  static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
>  {
>  	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>  	struct debug_store *ds = cpuc->ds;
> +	struct perf_event *event;
> +	void *base, *at, *top;
>  	int bit;
> +	int counts[MAX_PEBS_EVENTS] = {};
>  
>  	if (!x86_pmu.pebs_active)
>  		return;
>  
> +	base = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
>  	top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index;
>  
>  	ds->pebs_index = ds->pebs_buffer_base;
>  
> +	if (unlikely(base >= top))
>  		return;
>  
> +	for (at = base; at < top; at += x86_pmu.pebs_record_size) {
>  		struct pebs_record_nhm *p = at;
>  
> +		bit = find_first_bit((unsigned long *)&p->status,
> +					x86_pmu.max_pebs_events);
> +		if (bit >= x86_pmu.max_pebs_events)
> +			continue;
> +		/*
> +		 * The PEBS hardware does not deal well with collisions,
> +		 * when the same event happens near to each other. The
> +		 * records for the events can be collapsed into a single
> +		 * one, and it's not possible to reconstruct all events
> +		 * that caused the PEBS record. However in practice, the
> +		 * collisions are extremely rare. If collision happened,
> +		 * we drop the record. its the safest choice.
> +		 */
> +		if (p->status != (1 << bit))
> +			continue;

As per the above, this is buggy. You should start by masking p->status
with x86_pmu.pebs_active to clear all !PEBS counter bits.

> +		if (!test_bit(bit, cpuc->active_mask))
> +			continue;
> +		event = cpuc->events[bit];
> +		WARN_ON_ONCE(!event);
> +		if (!event->attr.precise_ip)
> +			continue;
> +		counts[bit]++;
> +	}
>  
> +	for (bit = 0; bit < x86_pmu.max_pebs_events; bit++) {
> +		if (counts[bit] == 0)
>  			continue;
> +		event = cpuc->events[bit];
> +		for (at = base; at < top; at += x86_pmu.pebs_record_size) {
> +			struct pebs_record_nhm *p = at;
>  
> +			if (p->status == (1 << bit))
> +				break;
> +		}
> +		__intel_pmu_pebs_event(event, iregs, at, top, counts[bit]);
>  	}
>  }


* Re: [PATCH V5 5/6] perf, x86: drain PEBS buffer during context switch
  2015-02-23 14:25 ` [PATCH V5 5/6] perf, x86: drain PEBS buffer during context switch Kan Liang
@ 2015-03-30 13:50   ` Peter Zijlstra
  0 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2015-03-30 13:50 UTC (permalink / raw)
  To: Kan Liang; +Cc: linux-kernel, mingo, acme, eranian, andi

On Mon, Feb 23, 2015 at 09:25:55AM -0500, Kan Liang wrote:
> From: Yan, Zheng <zheng.z.yan@intel.com>
> 
> Flush the PEBS buffer during context switch if PEBS interrupt threshold
> is larger than one. This allows perf to supply TID for sample outputs.
> 
> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
> Signed-off-by: Kan Liang <kan.liang@intel.com>
> ---
>  arch/x86/kernel/cpu/perf_event.h           |  3 +++
>  arch/x86/kernel/cpu/perf_event_intel.c     | 11 +++++++++-
>  arch/x86/kernel/cpu/perf_event_intel_ds.c  | 33 ++++++++++++++++++++++++++++--
>  arch/x86/kernel/cpu/perf_event_intel_lbr.c |  3 ---
>  4 files changed, 44 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
> index bc4ae3b..b4f6431 100644
> --- a/arch/x86/kernel/cpu/perf_event.h
> +++ b/arch/x86/kernel/cpu/perf_event.h
> @@ -151,6 +151,7 @@ struct cpu_hw_events {
>  	 */
>  	struct debug_store	*ds;
>  	u64			pebs_enabled;
> +	bool			pebs_sched_cb_enabled;
>  
>  	/*
>  	 * Intel LBR bits

Why do we need that extra state? I would've expected to see an inc/dec
for every AUTO_RELOAD event that gets added/removed.

> @@ -704,13 +717,20 @@ void intel_pmu_pebs_enable(struct perf_event *event)
>  	 * When the event is constrained enough we can use a larger
>  	 * threshold and run the event with less frequent PMI.
>  	 */
> -	if (0 && /* disable this temporarily */
> -	    (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) &&
> +	if ((hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) &&
>  	    !(event->attr.sample_type & ~PEBS_FREERUNNING_FLAGS)) {
>  		threshold = ds->pebs_absolute_maximum -
>  			x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
> +		if (first_pebs) {
> +			perf_sched_cb_inc(event->ctx->pmu);
> +			cpuc->pebs_sched_cb_enabled = true;
> +		}
>  	} else {
>  		threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
> +		if (cpuc->pebs_sched_cb_enabled) {
> +			perf_sched_cb_dec(event->ctx->pmu);
> +			cpuc->pebs_sched_cb_enabled = false;
> +		}
>  	}
>  	if (first_pebs || ds->pebs_interrupt_threshold > threshold)
>  		ds->pebs_interrupt_threshold = threshold;

I'm confused; why do you do a sched_cb_dec for every event that isn't
AUTO_RELOAD?
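
For illustration, the pairing described above would look roughly like this
(a sketch, not the patch's code; it assumes intel_pmu_pebs_disable() is the
symmetric place to drop the reference):

	/* in intel_pmu_pebs_enable() */
	if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD)
		perf_sched_cb_inc(event->ctx->pmu);

	/* in intel_pmu_pebs_disable() */
	if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD)
		perf_sched_cb_dec(event->ctx->pmu);

With the inc/dec tied to each AUTO_RELOAD event, the per-cpu
pebs_sched_cb_enabled bool and the asymmetric dec in the else branch above
become unnecessary.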

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V5 3/6] perf, x86: large PEBS interrupt threshold
  2015-02-23 14:25 ` [PATCH V5 3/6] perf, x86: large PEBS interrupt threshold Kan Liang
  2015-03-02 17:08   ` Stephane Eranian
@ 2015-03-30 13:54   ` Peter Zijlstra
  1 sibling, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2015-03-30 13:54 UTC (permalink / raw)
  To: Kan Liang; +Cc: linux-kernel, mingo, acme, eranian, andi

On Mon, Feb 23, 2015 at 09:25:53AM -0500, Kan Liang wrote:

> +/*
> + * Flags PEBS can handle without an PMI.
> + *
> + * TID can only be handled by flushing at context switch.
> + */
> +#define PEBS_FREERUNNING_FLAGS	\
> +	(PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_ADDR | \
> +	PERF_SAMPLE_ID | PERF_SAMPLE_CPU | PERF_SAMPLE_STREAM_ID | \
> +	PERF_SAMPLE_DATA_SRC | PERF_SAMPLE_IDENTIFIER | \
> +	PERF_SAMPLE_TRANSACTION)
> +

> @@ -683,6 +700,21 @@ void intel_pmu_pebs_enable(struct perf_event *event)
>  	else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
>  		cpuc->pebs_enabled |= 1ULL << 63;
>  
> +	/*
> +	 * When the event is constrained enough we can use a larger
> +	 * threshold and run the event with less frequent PMI.
> +	 */
> +	if (0 && /* disable this temporarily */
> +	    (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) &&
> +	    !(event->attr.sample_type & ~PEBS_FREERUNNING_FLAGS)) {

Why not do the FREERUNNING flags check in intel_pmu_hw_config() when you
set AUTO_RELOAD, and make AUTO_RELOAD conditional on it?

That way you only have a single test here. AUTO_RELOAD implies the
sample_type conforms.
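
A rough sketch of that rearrangement (illustrative only; the exact placement
inside intel_pmu_hw_config() and the reuse of the !event->attr.freq test from
patch 1 are assumptions):

	/* in intel_pmu_hw_config(), where precise_ip is handled */
	if (event->attr.precise_ip && !event->attr.freq &&
	    !(event->attr.sample_type & ~PEBS_FREERUNNING_FLAGS))
		event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD;

intel_pmu_pebs_enable() is then left with the single
hwc->flags & PERF_X86_EVENT_AUTO_RELOAD test when picking the threshold.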

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V5 1/6] perf, x86: use the PEBS auto reload mechanism when possible
  2015-03-30 12:06   ` Peter Zijlstra
@ 2015-03-30 14:02     ` Peter Zijlstra
  0 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2015-03-30 14:02 UTC (permalink / raw)
  To: Kan Liang; +Cc: linux-kernel, mingo, acme, eranian, andi

On Mon, Mar 30, 2015 at 02:06:07PM +0200, Peter Zijlstra wrote:
> On Mon, Feb 23, 2015 at 09:25:51AM -0500, Kan Liang wrote:
> > +++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
> > @@ -673,6 +673,8 @@ void intel_pmu_pebs_enable(struct perf_event *event)
> >  	struct hw_perf_event *hwc = &event->hw;
> >  
> >  	hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;
> > +	if (!event->attr.freq)
> > +		hwc->flags |= PERF_X86_EVENT_AUTO_RELOAD;
> >  
> >  	cpuc->pebs_enabled |= 1ULL << hwc->idx;
> 
> Why not in x86_pmu_event_init()? This is not something that will change,
> ever, right?

intel_pmu_hw_config() would be the better place, as said in a later
email.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [PATCH V5 4/6] perf, x86: handle multiple records in PEBS buffer
  2015-03-30 13:45   ` Peter Zijlstra
@ 2015-03-30 17:19     ` Liang, Kan
  2015-03-30 17:25       ` Andi Kleen
  0 siblings, 1 reply; 23+ messages in thread
From: Liang, Kan @ 2015-03-30 17:19 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, mingo, acme, eranian, andi



> > One corner case that needs to be mentioned is that the PEBS hardware
> > doesn't deal well with collisions, when PEBS events happen close to
> > each other. The records for the events can be collapsed into a single
> > one, and it's not possible to reconstruct all events that caused the
> > PEBS record. However, in practice collisions are extremely rare, as
> > long as different events are used. The periods are typically very
> > large, so any collision is unlikely. When a collision happens, we drop
> > the PEBS record.
> >
> > Here are some numbers about collisions.
> > Four frequently occurring events
> > (cycles:p,instructions:p,branches:p,mem-stores:p) are tested
> >
> > Test events which are sampled together                   collision rate
> > cycles:p,instructions:p                                  0.25%
> > cycles:p,instructions:p,branches:p                       0.30%
> > cycles:p,instructions:p,branches:p,mem-stores:p          0.35%
> >
> > cycles:p,cycles:p                                        98.52%
> >
> > Collisions are extremely rare as long as different events are used.
> > The only way you can get a lot of collisions is when you count the same
> > thing multiple times. But that is not a useful configuration.
> 
> This fails to mention the other problem the status field has.  You also did
> not specify what exact condition you counted as a collision.
> 
> The PEBS status field is a copy of the GLOBAL_STATUS MSR at assist time,
> this means that:
> 
>  - its possible (and harmless) for the status field to contain set bits
>    for !PEBS events -- the proposed code is buggy here.
I will fix it.
>  - its possible to have multiple PEBS bits set even though the event
>    really only was for a single event -- if you count everything with
>    multiple PEBS bits set as a collision you're counting wrong.
> 

In what situation would multiple PEBS bits be set for a single event?
Could you please give me an example?

Thanks,
Kan


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V5 4/6] perf, x86: handle multiple records in PEBS buffer
  2015-03-30 17:19     ` Liang, Kan
@ 2015-03-30 17:25       ` Andi Kleen
  2015-03-30 17:43         ` Liang, Kan
  0 siblings, 1 reply; 23+ messages in thread
From: Andi Kleen @ 2015-03-30 17:25 UTC (permalink / raw)
  To: Liang, Kan; +Cc: Peter Zijlstra, linux-kernel, mingo, acme, eranian, andi

> >  - its possible (and harmless) for the status field to contain set bits
> >    for !PEBS events -- the proposed code is buggy here.
> I will fix it.
> >  - its possible to have multiple PEBS bits set even though the event
> >    really only was for a single event -- if you count everything with
> >    multiple PEBS bits set as a collision you're counting wrong.
> > 
> 
> In what situation multiple PEBS bits was set for a single event?
> Could you please give me an example?

The field in the PEBS record is just a copy of GLOBAL_STATUS (minus
some extra bits), so if there were already bits set in GLOBAL_STATUS
that hadn't been cleared yet, you may see multiple bits set.

In a proper configuration this should be rare, as we expect the
events to be different and run on different effective frequencies,
so GLOBAL_STATUS should be already cleared.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [PATCH V5 4/6] perf, x86: handle multiple records in PEBS buffer
  2015-03-30 17:25       ` Andi Kleen
@ 2015-03-30 17:43         ` Liang, Kan
  2015-03-30 17:45           ` Andi Kleen
  2015-03-30 20:07           ` Peter Zijlstra
  0 siblings, 2 replies; 23+ messages in thread
From: Liang, Kan @ 2015-03-30 17:43 UTC (permalink / raw)
  To: Andi Kleen, Peter Zijlstra; +Cc: linux-kernel, mingo, acme, eranian



> > >  - its possible (and harmless) for the status field to contain set bits
> > >    for !PEBS events -- the proposed code is buggy here.
> > I will fix it.
> > >  - its possible to have multiple PEBS bits set even though the event
> > >    really only was for a single event -- if you count everything with
> > >    multiple PEBS bits set as a collision you're counting wrong.
> > >
> >
> > In what situation multiple PEBS bits was set for a single event?
> > Could you please give me an example?
> 
> The field in the PEBS record is just a copy of GLOBAL_STATUS (minus some
> extra bits), so if there were already bits in GLOBAL_STATUS that haven't
> been cleared yet you may see multiple bits
> 
> In a proper configuration this should be rare, as we expect the events to
> be different and run on different effective frequencies, so
> GLOBAL_STATUS should be already cleared.

The case you mentioned has already been covered in the description.
I tried both multiple different events and multiple instances of the same event.
As you said, the collision rate for multiple different events is low.
The collision rate for multiple instances of the same event is very high,
but that's not a useful configuration.

As I understand it, Peter means that multiple PEBS bits could be set
when sampling only a single event.

Peter, could you please clarify?

Thanks,
Kan


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V5 4/6] perf, x86: handle multiple records in PEBS buffer
  2015-03-30 17:43         ` Liang, Kan
@ 2015-03-30 17:45           ` Andi Kleen
  2015-03-30 20:07           ` Peter Zijlstra
  1 sibling, 0 replies; 23+ messages in thread
From: Andi Kleen @ 2015-03-30 17:45 UTC (permalink / raw)
  To: Liang, Kan; +Cc: Andi Kleen, Peter Zijlstra, linux-kernel, mingo, acme, eranian

> 
> The case you mentioned has already covered in the description. 

AFAIK the description was for the "two PEBS events are happening nearby,
so the hardware merges them" case. That is different from what I
described.

Both can happen (very rarely).

> I tried both multiple different events and multiple same events.
> As you said, the collision for multiple different events is rare.
> The collision for multiple same events is very high, but it's not a useful
> configuration
> 
> As my understanding, Peter means the multiple PEBS bits could be set
> when sampling only one single event.  

-Andi

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V5 4/6] perf, x86: handle multiple records in PEBS buffer
  2015-03-30 17:43         ` Liang, Kan
  2015-03-30 17:45           ` Andi Kleen
@ 2015-03-30 20:07           ` Peter Zijlstra
  2015-03-30 20:11             ` Andi Kleen
  1 sibling, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2015-03-30 20:07 UTC (permalink / raw)
  To: Liang, Kan; +Cc: Andi Kleen, linux-kernel, mingo, acme, eranian

On Mon, Mar 30, 2015 at 05:43:45PM +0000, Liang, Kan wrote:
> As my understanding, Peter means the multiple PEBS bits could be set
> when sampling only one single event.  
> 
> Peter, could you please clarify?


So the chain of events of a PEBS counter is as follows:

 A) the CTRn value reaches 0:
    - the corresponding bit in GLOBAL_STATUS gets set
    - we start arming the hardware assist

<unspecified amount of time later --
	this could cover multiple events of interest>

 B) the hardware assist is armed, any next event will trigger it

 C) a matching event happens:
    - the hardware assist triggers and generates a PEBS record
      this includes a copy of GLOBAL_STATUS at this moment
    - if we auto-reload we (re)set CTRn
    - we clear the relevant bit in GLOBAL_STATUS


CASE 1:

Now if we have a chain of events like:

 A0, B0, A1, C0

The event generated for counter0 will include a status with counter1
set, even though it's not at all related to the record. A similar thing
can happen with a !PEBS event if it just happens to overflow at the
right moment.

CASE 2:

If otoh we modify things like:

 A0, B0, A1, B1, C01

Where C01 is an event that triggers both hardware assists (the
instruction matches both criteria), we will generate only a single
record, again with both counters listed in the status field.

This time the record pertains to both.


Note that these two cases are different but indistinguishable from the
data as generated -- although one can try to infer which it was from
future events. The first case must continue to generate an event for
counter1 (hopefully with counter0 clear; otherwise continue the induction
until you get a record with only one bit set).

Also note that it's not entirely clear in case two whether the event needs
to be the very same one or just 'close'; this is not (well) specified, but
conceptually the case presented is the easier one.
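
To make the consequence for a drain loop concrete, a sketch (names are
illustrative; this is not proposed code):

	/* inside the drain loop, for each record p */
	u64 status = p->status & cpuc->pebs_enabled;

	if (hweight64(status) > 1) {
		/*
		 * Either CASE 1 (a stale bit from an unrelated, still
		 * pending overflow) or CASE 2 (one instruction satisfied
		 * two assists). The record alone cannot tell them apart:
		 * in CASE 1 it belongs to just one counter, in CASE 2 to
		 * both. Dropping it is safe but lossy either way, and
		 * counting every such record as a "collision" over-counts
		 * CASE 1.
		 */
		continue;
	}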

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V5 4/6] perf, x86: handle multiple records in PEBS buffer
  2015-03-30 20:07           ` Peter Zijlstra
@ 2015-03-30 20:11             ` Andi Kleen
  2015-03-30 21:24               ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: Andi Kleen @ 2015-03-30 20:11 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Liang, Kan, Andi Kleen, linux-kernel, mingo, acme, eranian


Fair enough.

For now we can simply allow multi-PEBS only when a single
PEBS event is active (non-PEBS events don't matter, as we
can just ignore those).

This would always be the case on Atom, which only has a single PEBS
counter.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
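
A minimal sketch of the restriction described above (illustrative; the
hweight64() test on cpuc->pebs_enabled is an assumption, not code from this
series):

	/* in intel_pmu_pebs_enable(), when picking the threshold */
	u64 pebs_counters = cpuc->pebs_enabled &
			    ((1ULL << x86_pmu.max_pebs_events) - 1);

	if ((hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) &&
	    hweight64(pebs_counters) == 1)
		threshold = ds->pebs_absolute_maximum -
			x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
	else
		threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;

On Atom, with its single PEBS counter, the hweight64() test is trivially
satisfied, so the large threshold would always be used there.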

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V5 4/6] perf, x86: handle multiple records in PEBS buffer
  2015-03-30 20:11             ` Andi Kleen
@ 2015-03-30 21:24               ` Peter Zijlstra
  2015-03-30 21:53                 ` Andi Kleen
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2015-03-30 21:24 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Liang, Kan, linux-kernel, mingo, acme, eranian

On Mon, Mar 30, 2015 at 10:11:16PM +0200, Andi Kleen wrote:
> 
> Fair enough.
> 
> For now we can simply only allow multi pebs if only a single
> PEBS event is active (non PEBS events don't matter, as we
> can just ignore those)
> 
> This would be always on Atom, which only has a single PEBS
> counter.

I didn't say we can't do this for the big cores; I just said I want a
coherent story about what and why we're doing this.

The explanation was missing half the story, that's not good.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V5 4/6] perf, x86: handle multiple records in PEBS buffer
  2015-03-30 21:24               ` Peter Zijlstra
@ 2015-03-30 21:53                 ` Andi Kleen
  0 siblings, 0 replies; 23+ messages in thread
From: Andi Kleen @ 2015-03-30 21:53 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Andi Kleen, Liang, Kan, linux-kernel, mingo, acme, eranian

> I just said I want a
> coherent story about what and why we're doing this.

What:
is clear, isn't it? Use this capability, which has long been designed
into PEBS but not used by perf.

Why:
The goal is to go to higher sampling rates with comparable or less overhead.
Higher sampling rates improve analysis of anything that is not
mostly doing the same thing over and over.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread

Thread overview: 23+ messages
2015-02-23 14:25 [PATCH V5 0/6] large PEBS interrupt threshold Kan Liang
2015-02-23 14:25 ` [PATCH V5 1/6] perf, x86: use the PEBS auto reload mechanism when possible Kan Liang
2015-03-30 12:06   ` Peter Zijlstra
2015-03-30 14:02     ` Peter Zijlstra
2015-02-23 14:25 ` [PATCH V5 2/6] perf, x86: introduce setup_pebs_sample_data() Kan Liang
2015-02-23 14:25 ` [PATCH V5 3/6] perf, x86: large PEBS interrupt threshold Kan Liang
2015-03-02 17:08   ` Stephane Eranian
2015-03-02 17:59     ` Andi Kleen
2015-03-02 18:07       ` Stephane Eranian
2015-03-30 13:54   ` Peter Zijlstra
2015-02-23 14:25 ` [PATCH V5 4/6] perf, x86: handle multiple records in PEBS buffer Kan Liang
2015-03-30 13:45   ` Peter Zijlstra
2015-03-30 17:19     ` Liang, Kan
2015-03-30 17:25       ` Andi Kleen
2015-03-30 17:43         ` Liang, Kan
2015-03-30 17:45           ` Andi Kleen
2015-03-30 20:07           ` Peter Zijlstra
2015-03-30 20:11             ` Andi Kleen
2015-03-30 21:24               ` Peter Zijlstra
2015-03-30 21:53                 ` Andi Kleen
2015-02-23 14:25 ` [PATCH V5 5/6] perf, x86: drain PEBS buffer during context switch Kan Liang
2015-03-30 13:50   ` Peter Zijlstra
2015-02-23 14:25 ` [PATCH V5 6/6] perf, x86: enlarge PEBS buffer Kan Liang
