* [PATCH V6 0/6] large PEBS interrupt threshold
@ 2015-04-09 16:37 Kan Liang
  2015-04-09 16:37 ` [PATCH V6 1/6] perf, x86: use the PEBS auto reload mechanism when possible Kan Liang
                   ` (5 more replies)
  0 siblings, 6 replies; 25+ messages in thread
From: Kan Liang @ 2015-04-09 16:37 UTC (permalink / raw)
  To: a.p.zijlstra, linux-kernel; +Cc: mingo, acme, eranian, andi, Kan Liang

This patch series implements a large PEBS interrupt threshold.
Currently, the PEBS interrupt threshold is forced to one. A larger PEBS
interrupt threshold can significantly reduce the sampling overhead,
especially for frequently occurring events
(like cycles, branches, or loads/stores) with a small sampling period.
For example, when recording the cycles event while running kernbench
with a 10003 sampling period, the elapsed time drops from 32.7 seconds
to 16.5 seconds, which is 2X faster.
For more details, please refer to patch 3's description.

Limitations:
 - It cannot supply a callgraph.
 - It requires setting a fixed period.
 - It cannot supply a time stamp.
 - To supply a TID it requires flushing on context switch.
If any of these requirements is not met, the threshold is set to one.
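
A rough sketch of that check (not the actual hunk; the real change is in
intel_pmu_hw_config() in patch 3):

  /*
   * Sketch only: use auto reload / a large threshold when the period is
   * fixed and every requested sample_type bit can be satisfied from the
   * PEBS record alone (PEBS_FREERUNNING_FLAGS, introduced in patch 3).
   */
  if (event->attr.precise_ip && !event->attr.freq &&
      !(event->attr.sample_type & ~PEBS_FREERUNNING_FLAGS))
          event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD;
  /* otherwise the interrupt threshold stays at one record */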

Collisions:
When PEBS events happen close to each other, the records for the events
can be collapsed into a single one, and it is not possible to
reconstruct which events caused it. When a collision happens, we drop
the PEBS record.
In practice, collisions are extremely rare as long as different events
are used. We tested the worst case with four frequently occurring
events (cycles:p,instructions:p,branches:p,mem-stores:p);
the collision rate is only 0.34%.
The only way to get a lot of collisions is to count the same thing
multiple times, but that is not a useful configuration.
For details about collisions, please refer to patch 4's description.

changes since v1:
  - drop patch 'perf, core: Add all PMUs to pmu_idr'
  - add comments for case that multiple counters overflow simultaneously
changes since v2:
  - rename perf_sched_cb_{enable,disable} to perf_sched_cb_user_{inc,dec}
  - use flag to indicate auto reload mechanism
  - move code that sets up PEBS sample data to a separate function
  - output the PEBS records in batch
  - enable this for all (PEBS capable) hardware
  - more description for the multiplex
changes since v3:
  - ignore conflicting PEBS record
changes since v4:
  - Do more tests for collision and update comments
changes since v5:
  - Move autoreload and large PEBS available check to intel_pmu_hw_config
  - make AUTO_RELOAD conditional on large PEBS
  - !PEBS bug fix
  - coherent story about what a collision is and how we handle it
  - Remove extra state pebs_sched_cb_enabled

Yan, Zheng (6):
  perf, x86: use the PEBS auto reload mechanism when possible
  perf, x86: introduce setup_pebs_sample_data()
  perf, x86: large PEBS interrupt threshold
  perf, x86: handle multiple records in PEBS buffer
  perf, x86: drain PEBS buffer during context switch
  perf, x86: enlarge PEBS buffer

 arch/x86/kernel/cpu/perf_event.c           |  15 +-
 arch/x86/kernel/cpu/perf_event.h           |  15 +-
 arch/x86/kernel/cpu/perf_event_intel.c     |  21 +-
 arch/x86/kernel/cpu/perf_event_intel_ds.c  | 316 ++++++++++++++++++++++-------
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |   3 -
 include/linux/perf_event.h                 |   4 +
 kernel/events/core.c                       |   6 +-
 7 files changed, 293 insertions(+), 87 deletions(-)

-- 
1.7.11.7


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH V6 1/6] perf, x86: use the PEBS auto reload mechanism when possible
  2015-04-09 16:37 [PATCH V6 0/6] large PEBS interrupt threshold Kan Liang
@ 2015-04-09 16:37 ` Kan Liang
  2015-04-09 16:37 ` [PATCH V6 2/6] perf, x86: introduce setup_pebs_sample_data() Kan Liang
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 25+ messages in thread
From: Kan Liang @ 2015-04-09 16:37 UTC (permalink / raw)
  To: a.p.zijlstra, linux-kernel; +Cc: mingo, acme, eranian, andi, Kan Liang

From: Yan, Zheng <zheng.z.yan@intel.com>

When a fixed period is specified, this patch makes perf use the PEBS
auto reload mechanism. This makes normal profiling faster, because
it avoids one costly MSR write in the PMI handler.
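
For reference, the reset value used by auto reload lives in the DS area
next to the PEBS buffer pointers. An abridged sketch of struct
debug_store (defined in arch/x86/kernel/cpu/perf_event.h):

  struct debug_store {
          ...
          u64     pebs_buffer_base;
          u64     pebs_index;
          u64     pebs_absolute_maximum;
          u64     pebs_interrupt_threshold;
          /* hardware reloads the counter from here after each PEBS record */
          u64     pebs_event_reset[MAX_PEBS_EVENTS];
  };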

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Kan Liang <kan.liang@intel.com>
---
 arch/x86/kernel/cpu/perf_event.c          | 15 +++++++++------
 arch/x86/kernel/cpu/perf_event.h          |  2 +-
 arch/x86/kernel/cpu/perf_event_intel.c    |  8 ++++++--
 arch/x86/kernel/cpu/perf_event_intel_ds.c |  7 +++++++
 4 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 87848eb..8cc1153 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1058,13 +1058,16 @@ int x86_perf_event_set_period(struct perf_event *event)
 
 	per_cpu(pmc_prev_left[idx], smp_processor_id()) = left;
 
-	/*
-	 * The hw event starts counting from this event offset,
-	 * mark it to be able to extra future deltas:
-	 */
-	local64_set(&hwc->prev_count, (u64)-left);
+	if (!(hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) ||
+	    local64_read(&hwc->prev_count) != (u64)-left) {
+		/*
+		 * The hw event starts counting from this event offset,
+		 * mark it to be able to extra future deltas:
+		 */
+		local64_set(&hwc->prev_count, (u64)-left);
 
-	wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
+		wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
+	}
 
 	/*
 	 * Due to erratum on certan cpu we need
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 329f035..1290bb6 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -74,7 +74,7 @@ struct event_constraint {
 #define PERF_X86_EVENT_EXCL		0x40 /* HT exclusivity on counter */
 #define PERF_X86_EVENT_DYNAMIC		0x80 /* dynamic alloc'd constraint */
 #define PERF_X86_EVENT_RDPMC_ALLOWED	0x40 /* grant rdpmc permission */
-
+#define PERF_X86_EVENT_AUTO_RELOAD	0x80 /* use PEBS auto-reload */
 
 struct amd_nb {
 	int nb_id;  /* NorthBridge id */
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 9da2400..0a7b5ca 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2305,8 +2305,12 @@ static int intel_pmu_hw_config(struct perf_event *event)
 	if (ret)
 		return ret;
 
-	if (event->attr.precise_ip && x86_pmu.pebs_aliases)
-		x86_pmu.pebs_aliases(event);
+	if (event->attr.precise_ip) {
+		if (!event->attr.freq)
+			event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD;
+		if (x86_pmu.pebs_aliases)
+			x86_pmu.pebs_aliases(event);
+	}
 
 	if (needs_branch_stack(event)) {
 		ret = intel_pmu_setup_lbr_filter(event);
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index ca69ea5..7c6dd8e 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -680,6 +680,7 @@ void intel_pmu_pebs_enable(struct perf_event *event)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	struct hw_perf_event *hwc = &event->hw;
+	struct debug_store *ds = cpuc->ds;
 
 	hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;
 
@@ -689,6 +690,12 @@ void intel_pmu_pebs_enable(struct perf_event *event)
 		cpuc->pebs_enabled |= 1ULL << (hwc->idx + 32);
 	else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
 		cpuc->pebs_enabled |= 1ULL << 63;
+
+	/* Use auto-reload if possible to save a MSR write in the PMI */
+	if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
+		ds->pebs_event_reset[hwc->idx] =
+			(u64)(-hwc->sample_period) & x86_pmu.cntval_mask;
+	}
 }
 
 void intel_pmu_pebs_disable(struct perf_event *event)
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH V6 2/6] perf, x86: introduce setup_pebs_sample_data()
  2015-04-09 16:37 [PATCH V6 0/6] large PEBS interrupt threshold Kan Liang
  2015-04-09 16:37 ` [PATCH V6 1/6] perf, x86: use the PEBS auto reload mechanism when possible Kan Liang
@ 2015-04-09 16:37 ` Kan Liang
  2015-04-09 16:37 ` [PATCH V6 3/6] perf, x86: large PEBS interrupt threshold Kan Liang
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 25+ messages in thread
From: Kan Liang @ 2015-04-09 16:37 UTC (permalink / raw)
  To: a.p.zijlstra, linux-kernel; +Cc: mingo, acme, eranian, andi, Kan Liang

From: Yan, Zheng <zheng.z.yan@intel.com>

Move the code that sets up PEBS sample data into a separate function.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Kan Liang <kan.liang@intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel_ds.c | 95 +++++++++++++++++--------------
 1 file changed, 52 insertions(+), 43 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 7c6dd8e..e3916d5 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -845,8 +845,10 @@ static inline u64 intel_hsw_transaction(struct pebs_record_hsw *pebs)
 	return txn;
 }
 
-static void __intel_pmu_pebs_event(struct perf_event *event,
-				   struct pt_regs *iregs, void *__pebs)
+static void setup_pebs_sample_data(struct perf_event *event,
+				   struct pt_regs *iregs, void *__pebs,
+				   struct perf_sample_data *data,
+				   struct pt_regs *regs)
 {
 #define PERF_X86_EVENT_PEBS_HSW_PREC \
 		(PERF_X86_EVENT_PEBS_ST_HSW | \
@@ -858,30 +860,25 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
 	 */
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	struct pebs_record_hsw *pebs = __pebs;
-	struct perf_sample_data data;
-	struct pt_regs regs;
 	u64 sample_type;
 	int fll, fst, dsrc;
 	int fl = event->hw.flags;
 
-	if (!intel_pmu_save_and_restart(event))
-		return;
-
 	sample_type = event->attr.sample_type;
 	dsrc = sample_type & PERF_SAMPLE_DATA_SRC;
 
 	fll = fl & PERF_X86_EVENT_PEBS_LDLAT;
 	fst = fl & (PERF_X86_EVENT_PEBS_ST | PERF_X86_EVENT_PEBS_HSW_PREC);
 
-	perf_sample_data_init(&data, 0, event->hw.last_period);
+	perf_sample_data_init(data, 0, event->hw.last_period);
 
-	data.period = event->hw.last_period;
+	data->period = event->hw.last_period;
 
 	/*
 	 * Use latency for weight (only avail with PEBS-LL)
 	 */
 	if (fll && (sample_type & PERF_SAMPLE_WEIGHT))
-		data.weight = pebs->lat;
+		data->weight = pebs->lat;
 
 	/*
 	 * data.data_src encodes the data source
@@ -894,7 +891,7 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
 			val = precise_datala_hsw(event, pebs->dse);
 		else if (fst)
 			val = precise_store_data(pebs->dse);
-		data.data_src.val = val;
+		data->data_src.val = val;
 	}
 
 	/*
@@ -907,58 +904,70 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
 	 * PERF_SAMPLE_IP and PERF_SAMPLE_CALLCHAIN to function properly.
 	 * A possible PERF_SAMPLE_REGS will have to transfer all regs.
 	 */
-	regs = *iregs;
-	regs.flags = pebs->flags;
-	set_linear_ip(&regs, pebs->ip);
-	regs.bp = pebs->bp;
-	regs.sp = pebs->sp;
+	*regs = *iregs;
+	regs->flags = pebs->flags;
+	set_linear_ip(regs, pebs->ip);
+	regs->bp = pebs->bp;
+	regs->sp = pebs->sp;
 
 	if (sample_type & PERF_SAMPLE_REGS_INTR) {
-		regs.ax = pebs->ax;
-		regs.bx = pebs->bx;
-		regs.cx = pebs->cx;
-		regs.dx = pebs->dx;
-		regs.si = pebs->si;
-		regs.di = pebs->di;
-		regs.bp = pebs->bp;
-		regs.sp = pebs->sp;
-
-		regs.flags = pebs->flags;
+		regs->ax = pebs->ax;
+		regs->bx = pebs->bx;
+		regs->cx = pebs->cx;
+		regs->dx = pebs->dx;
+		regs->si = pebs->si;
+		regs->di = pebs->di;
+		regs->bp = pebs->bp;
+		regs->sp = pebs->sp;
+
+		regs->flags = pebs->flags;
 #ifndef CONFIG_X86_32
-		regs.r8 = pebs->r8;
-		regs.r9 = pebs->r9;
-		regs.r10 = pebs->r10;
-		regs.r11 = pebs->r11;
-		regs.r12 = pebs->r12;
-		regs.r13 = pebs->r13;
-		regs.r14 = pebs->r14;
-		regs.r15 = pebs->r15;
+		regs->r8 = pebs->r8;
+		regs->r9 = pebs->r9;
+		regs->r10 = pebs->r10;
+		regs->r11 = pebs->r11;
+		regs->r12 = pebs->r12;
+		regs->r13 = pebs->r13;
+		regs->r14 = pebs->r14;
+		regs->r15 = pebs->r15;
 #endif
 	}
 
 	if (event->attr.precise_ip > 1 && x86_pmu.intel_cap.pebs_format >= 2) {
-		regs.ip = pebs->real_ip;
-		regs.flags |= PERF_EFLAGS_EXACT;
-	} else if (event->attr.precise_ip > 1 && intel_pmu_pebs_fixup_ip(&regs))
-		regs.flags |= PERF_EFLAGS_EXACT;
+		regs->ip = pebs->real_ip;
+		regs->flags |= PERF_EFLAGS_EXACT;
+	} else if (event->attr.precise_ip > 1 && intel_pmu_pebs_fixup_ip(regs))
+		regs->flags |= PERF_EFLAGS_EXACT;
 	else
-		regs.flags &= ~PERF_EFLAGS_EXACT;
+		regs->flags &= ~PERF_EFLAGS_EXACT;
 
 	if ((sample_type & PERF_SAMPLE_ADDR) &&
 	    x86_pmu.intel_cap.pebs_format >= 1)
-		data.addr = pebs->dla;
+		data->addr = pebs->dla;
 
 	if (x86_pmu.intel_cap.pebs_format >= 2) {
 		/* Only set the TSX weight when no memory weight. */
 		if ((sample_type & PERF_SAMPLE_WEIGHT) && !fll)
-			data.weight = intel_hsw_weight(pebs);
+			data->weight = intel_hsw_weight(pebs);
 
 		if (sample_type & PERF_SAMPLE_TRANSACTION)
-			data.txn = intel_hsw_transaction(pebs);
+			data->txn = intel_hsw_transaction(pebs);
 	}
 
 	if (has_branch_stack(event))
-		data.br_stack = &cpuc->lbr_stack;
+		data->br_stack = &cpuc->lbr_stack;
+}
+
+static void __intel_pmu_pebs_event(struct perf_event *event,
+				   struct pt_regs *iregs, void *__pebs)
+{
+	struct perf_sample_data data;
+	struct pt_regs regs;
+
+	if (!intel_pmu_save_and_restart(event))
+		return;
+
+	setup_pebs_sample_data(event, iregs, __pebs, &data, &regs);
 
 	if (perf_event_overflow(event, &data, &regs))
 		x86_pmu_stop(event, 0);
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH V6 3/6] perf, x86: large PEBS interrupt threshold
  2015-04-09 16:37 [PATCH V6 0/6] large PEBS interrupt threshold Kan Liang
  2015-04-09 16:37 ` [PATCH V6 1/6] perf, x86: use the PEBS auto reload mechanism when possible Kan Liang
  2015-04-09 16:37 ` [PATCH V6 2/6] perf, x86: introduce setup_pebs_sample_data() Kan Liang
@ 2015-04-09 16:37 ` Kan Liang
  2015-04-15 17:14   ` Peter Zijlstra
                     ` (2 more replies)
  2015-04-09 16:37 ` [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer Kan Liang
                   ` (2 subsequent siblings)
  5 siblings, 3 replies; 25+ messages in thread
From: Kan Liang @ 2015-04-09 16:37 UTC (permalink / raw)
  To: a.p.zijlstra, linux-kernel; +Cc: mingo, acme, eranian, andi, Kan Liang

From: Yan, Zheng <zheng.z.yan@intel.com>

PEBS has always had the capability to log samples to its buffers
without an interrupt. Traditionally perf has not used this, but has
always set the PEBS threshold to one.

For frequently occurring events (like cycles, branches, or loads/stores)
this in turn requires using a relatively high sampling period to avoid
overloading the system with PMI processing. This in turn increases
sampling error.

For the common cases we still need to use the PMI because the PEBS
hardware has various limitations. The biggest one is that it cannot
supply a callgraph. It also requires setting a fixed period, as the
hardware does not support an adaptive period. Another issue is that it
cannot supply a time stamp and some other options. To supply a TID it
requires flushing on context switch. It can however supply the IP, the
load/store address, TSX information, registers, and some other things.

So we can make PEBS work for some specific cases: basically, as long
as you can do without a callgraph and can set a fixed period, you can
use this new PEBS mode.

The main benefit is the ability to support much lower sampling periods
(down to -c 1000) without excessive overhead.

One use case is, for example, to increase the resolution of the c2c
tool. Another is double-checking when you suspect that standard
sampling has too much sampling error.

Some numbers on the overhead, using cycle soak, comparing the elapsed
time from "kernbench -M -H" between plain (threshold set to one) and
multi (large threshold).
The test command for plain:
  "perf record --time -e cycles:p -c $period -- kernbench -M -H"
The test command for multi:
  "perf record --no-time -e cycles:p -c $period -- kernbench -M -H"
(The only difference between the multi and plain test commands is the
time stamp option. Since time stamps are not supported with a large
PEBS threshold, this option can be used as a flag to indicate whether
the large threshold was enabled during the test.)

period    plain(Sec)  multi(Sec)  Delta
10003     32.7        16.5        16.2
20003     30.2        16.2        14.0
40003     18.6        14.1        4.5
80003     16.8        14.6        2.2
100003    16.9        14.1        2.8
800003    15.4        15.7        -0.3
1000003   15.3        15.2        0.2
2000003   15.3        15.1        0.1

With periods below 100003, plain (threshold one) causes much more
overhead. With a 10003 sampling period, the elapsed time for multi is
2X faster than plain.

This patch also makes AUTO_RELOAD conditional on large PEBS. Auto
reload is only enabled with a fixed period and large PEBS.
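
Conceptually, the threshold selection in this patch (restated as a
simplified sketch; the rationale in the comment is our assumption, the
code itself does not spell it out):

  if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD)
          /*
           * Large threshold: stop max_pebs_events records short of the
           * end of the buffer, presumably so records that land after the
           * threshold is crossed but before the PMI drains the buffer
           * still fit.
           */
          threshold = ds->pebs_absolute_maximum -
                  x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
  else
          /* classic behaviour: PMI after every single record */
          threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;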

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Kan Liang <kan.liang@intel.com>
---
 arch/x86/kernel/cpu/perf_event.h          | 11 +++++++++++
 arch/x86/kernel/cpu/perf_event_intel.c    |  4 +++-
 arch/x86/kernel/cpu/perf_event_intel_ds.c | 30 +++++++++++++++++++++++++++---
 3 files changed, 41 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 1290bb6..5b677a9 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -87,6 +87,17 @@ struct amd_nb {
 #define MAX_PEBS_EVENTS		8
 
 /*
+ * Flags PEBS can handle without an PMI.
+ *
+ * TID can only be handled by flushing at context switch.
+ */
+#define PEBS_FREERUNNING_FLAGS \
+	(PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_ADDR | \
+	PERF_SAMPLE_ID | PERF_SAMPLE_CPU | PERF_SAMPLE_STREAM_ID | \
+	PERF_SAMPLE_DATA_SRC | PERF_SAMPLE_IDENTIFIER | \
+	PERF_SAMPLE_TRANSACTION)
+
+/*
  * A debug store configuration.
  *
  * We only support architectures that use 64bit fields.
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 0a7b5ca..6c8579a 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2306,7 +2306,9 @@ static int intel_pmu_hw_config(struct perf_event *event)
 		return ret;
 
 	if (event->attr.precise_ip) {
-		if (!event->attr.freq)
+		/* only enable auto reload when fix period and large PEBS */
+		if (!event->attr.freq &&
+		    !(event->attr.sample_type & ~PEBS_FREERUNNING_FLAGS))
 			event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD;
 		if (x86_pmu.pebs_aliases)
 			x86_pmu.pebs_aliases(event);
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index e3916d5..3ce7f59 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -250,7 +250,7 @@ static int alloc_pebs_buffer(int cpu)
 {
 	struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
 	int node = cpu_to_node(cpu);
-	int max, thresh = 1; /* always use a single PEBS record */
+	int max;
 	void *buffer, *ibuffer;
 
 	if (!x86_pmu.pebs)
@@ -280,8 +280,9 @@ static int alloc_pebs_buffer(int cpu)
 	ds->pebs_absolute_maximum = ds->pebs_buffer_base +
 		max * x86_pmu.pebs_record_size;
 
-	ds->pebs_interrupt_threshold = ds->pebs_buffer_base +
-		thresh * x86_pmu.pebs_record_size;
+	if (x86_pmu.intel_cap.pebs_format < 1)
+		ds->pebs_interrupt_threshold = ds->pebs_buffer_base +
+			x86_pmu.pebs_record_size;
 
 	return 0;
 }
@@ -676,14 +677,22 @@ struct event_constraint *intel_pebs_constraints(struct perf_event *event)
 	return &emptyconstraint;
 }
 
+static inline bool pebs_is_enabled(struct cpu_hw_events *cpuc)
+{
+	return (cpuc->pebs_enabled & ((1ULL << MAX_PEBS_EVENTS) - 1));
+}
+
 void intel_pmu_pebs_enable(struct perf_event *event)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	struct hw_perf_event *hwc = &event->hw;
 	struct debug_store *ds = cpuc->ds;
+	bool first_pebs;
+	u64 threshold;
 
 	hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;
 
+	first_pebs = !pebs_is_enabled(cpuc);
 	cpuc->pebs_enabled |= 1ULL << hwc->idx;
 
 	if (event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT)
@@ -691,11 +700,26 @@ void intel_pmu_pebs_enable(struct perf_event *event)
 	else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
 		cpuc->pebs_enabled |= 1ULL << 63;
 
+	/*
+	 * When the event is constrained enough we can use a larger
+	 * threshold and run the event with less frequent PMI.
+	 */
+	if (0 && /* disable this temporarily */
+	    (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD)) {
+		threshold = ds->pebs_absolute_maximum -
+			x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
+	} else {
+		threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
+	}
+
 	/* Use auto-reload if possible to save a MSR write in the PMI */
 	if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
 		ds->pebs_event_reset[hwc->idx] =
 			(u64)(-hwc->sample_period) & x86_pmu.cntval_mask;
 	}
+
+	if (first_pebs || ds->pebs_interrupt_threshold > threshold)
+		ds->pebs_interrupt_threshold = threshold;
 }
 
 void intel_pmu_pebs_disable(struct perf_event *event)
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer
  2015-04-09 16:37 [PATCH V6 0/6] large PEBS interrupt threshold Kan Liang
                   ` (2 preceding siblings ...)
  2015-04-09 16:37 ` [PATCH V6 3/6] perf, x86: large PEBS interrupt threshold Kan Liang
@ 2015-04-09 16:37 ` Kan Liang
  2015-04-09 21:01   ` Andi Kleen
                     ` (3 more replies)
  2015-04-09 16:37 ` [PATCH V6 5/6] perf, x86: drain PEBS buffer during context switch Kan Liang
  2015-04-09 16:37 ` [PATCH V6 6/6] perf, x86: enlarge PEBS buffer Kan Liang
  5 siblings, 4 replies; 25+ messages in thread
From: Kan Liang @ 2015-04-09 16:37 UTC (permalink / raw)
  To: a.p.zijlstra, linux-kernel; +Cc: mingo, acme, eranian, andi, Kan Liang

From: Yan, Zheng <zheng.z.yan@intel.com>

When the PEBS interrupt threshold is larger than one, the PEBS buffer
may include multiple records for each PEBS event. This patch makes
the code first count how many records each PEBS event has, then
output the samples in batch.

One corner case that needs to be mentioned is that the PEBS hardware
doesn't deal well with collisions. The records for the events can be
collapsed into a single one, and it's not possible to reconstruct all
events that caused the PEBS record.
Here are some cases which can be called collisions.
 - PEBS events happen close to each other, so the hardware merges them.
 - PEBS events happen close to each other, but they are not merged.
   The GLOBAL_STATUS bit for the first counter is cleared before the
   event for the next counter is generated. Only the first record is
   treated as a collision.
 - Same as case 2, but the first counter is not cleared before the
   event for the next counter is generated. All the records are treated
   as collisions until a record with only one PEBS event bit set appears.

GLOBAL_STATUS can be set by both PEBS and non-PEBS events. Multiple
non-PEBS bits being set does not count as a collision.

In practice collisions are extremely rare, as long as different PEBS
events are used. The periods are typically very large, so any collision
is unlikely. When a collision happens, we drop the PEBS record.
The only way to get a lot of collisions is to count the same thing
multiple times, but that is not a useful configuration.

Here are some numbers on collisions.
Four frequently occurring events
(cycles:p,instructions:p,branches:p,mem-stores:p) were tested:

Test events which are sampled together                   collision rate
cycles:p,instructions:p                                  0.25%
cycles:p,instructions:p,branches:p                       0.30%
cycles:p,instructions:p,branches:p,mem-stores:p          0.35%

cycles:p,cycles:p                                        98.52%
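
A minimal sketch of the drop-on-collision rule described above, per
record in the drain loop (the real code in this patch additionally
locates the first precise_ip event and then walks the records per
event):

  u64 pebs_bits = nonpebs_bit_clear(p->status);  /* helper added by this patch */

  if (hweight64(pebs_bits) != 1)
          continue;                     /* zero or several PEBS bits: drop the record */
  counts[__ffs(pebs_bits)]++;           /* exactly one PEBS event: keep it */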

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Kan Liang <kan.liang@intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel_ds.c | 157 +++++++++++++++++++++++++-----
 include/linux/perf_event.h                |   4 +
 kernel/events/core.c                      |   6 +-
 3 files changed, 137 insertions(+), 30 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 3ce7f59..fafbf97 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -888,6 +888,9 @@ static void setup_pebs_sample_data(struct perf_event *event,
 	int fll, fst, dsrc;
 	int fl = event->hw.flags;
 
+	if (pebs == NULL)
+		return;
+
 	sample_type = event->attr.sample_type;
 	dsrc = sample_type & PERF_SAMPLE_DATA_SRC;
 
@@ -982,7 +985,7 @@ static void setup_pebs_sample_data(struct perf_event *event,
 		data->br_stack = &cpuc->lbr_stack;
 }
 
-static void __intel_pmu_pebs_event(struct perf_event *event,
+static void __intel_pmu_pebs_event_core(struct perf_event *event,
 				   struct pt_regs *iregs, void *__pebs)
 {
 	struct perf_sample_data data;
@@ -997,6 +1000,89 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
 		x86_pmu_stop(event, 0);
 }
 
+/* Clear all non-PEBS bits */
+static u64
+nonpebs_bit_clear(u64 pebs_status)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+	struct perf_event *event;
+	int bit;
+
+	for_each_set_bit(bit, (unsigned long *)&pebs_status, 64) {
+
+		if (bit >= x86_pmu.max_pebs_events)
+			clear_bit(bit, (unsigned long *)&pebs_status);
+		else {
+			event = cpuc->events[bit];
+			WARN_ON_ONCE(!event);
+
+			if (!event->attr.precise_ip)
+				clear_bit(bit, (unsigned long *)&pebs_status);
+		}
+	}
+
+	return pebs_status;
+}
+
+static inline void *
+get_next_pebs_record_by_bit(void *base, void *top, int bit)
+{
+	void *at;
+	u64 pebs_status;
+
+	if (base == NULL)
+		return NULL;
+
+	for (at = base; at < top; at += x86_pmu.pebs_record_size) {
+		struct pebs_record_nhm *p = at;
+
+		if (p->status & (1 << bit)) {
+
+			if (p->status == (1 << bit))
+				return at;
+
+			/* clear non-PEBS bit and re-check */
+			pebs_status = nonpebs_bit_clear(p->status);
+			if (pebs_status == (1 << bit))
+				return at;
+		}
+	}
+	return NULL;
+}
+
+static void __intel_pmu_pebs_event_nhm(struct perf_event *event,
+				   struct pt_regs *iregs,
+				   void *base, void *top,
+				   int bit, int count)
+{
+	struct perf_sample_data data;
+	struct pt_regs regs;
+	int i;
+	void *at = get_next_pebs_record_by_bit(base, top, bit);
+
+	if (!intel_pmu_save_and_restart(event) &&
+	    !(event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD))
+		return;
+
+	if (count > 1) {
+		for (i = 0; i < count - 1; i++) {
+			setup_pebs_sample_data(event, iregs, at, &data, &regs);
+			perf_event_output(event, &data, &regs);
+			at += x86_pmu.pebs_record_size;
+			at = get_next_pebs_record_by_bit(at, top, bit);
+		}
+	}
+
+	setup_pebs_sample_data(event, iregs, at, &data, &regs);
+
+	/* all records are processed, handle event overflow now */
+	if (perf_event_overflow(event, &data, &regs)) {
+		x86_pmu_stop(event, 0);
+		return;
+	}
+
+}
+
 static void intel_pmu_drain_pebs_core(struct pt_regs *iregs)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
@@ -1035,61 +1121,78 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs)
 	WARN_ONCE(n > 1, "bad leftover pebs %d\n", n);
 	at += n - 1;
 
-	__intel_pmu_pebs_event(event, iregs, at);
+	__intel_pmu_pebs_event_core(event, iregs, at);
 }
 
 static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	struct debug_store *ds = cpuc->ds;
-	struct perf_event *event = NULL;
-	void *at, *top;
-	u64 status = 0;
+	struct perf_event *event;
+	void *base, *at, *top;
 	int bit;
+	int counts[MAX_PEBS_EVENTS] = {};
 
 	if (!x86_pmu.pebs_active)
 		return;
 
-	at  = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
+	base = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
 	top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index;
 
 	ds->pebs_index = ds->pebs_buffer_base;
 
-	if (unlikely(at > top))
+	if (unlikely(base >= top))
 		return;
 
-	/*
-	 * Should not happen, we program the threshold at 1 and do not
-	 * set a reset value.
-	 */
-	WARN_ONCE(top - at > x86_pmu.max_pebs_events * x86_pmu.pebs_record_size,
-		  "Unexpected number of pebs records %ld\n",
-		  (long)(top - at) / x86_pmu.pebs_record_size);
-
-	for (; at < top; at += x86_pmu.pebs_record_size) {
+	for (at = base; at < top; at += x86_pmu.pebs_record_size) {
 		struct pebs_record_nhm *p = at;
 
 		for_each_set_bit(bit, (unsigned long *)&p->status,
 				 x86_pmu.max_pebs_events) {
 			event = cpuc->events[bit];
-			if (!test_bit(bit, cpuc->active_mask))
-				continue;
-
 			WARN_ON_ONCE(!event);
 
-			if (!event->attr.precise_ip)
-				continue;
+			if (event->attr.precise_ip)
+				break;
+		}
 
-			if (__test_and_set_bit(bit, (unsigned long *)&status))
-				continue;
+		if (bit >= x86_pmu.max_pebs_events)
+			continue;
+		if (!test_bit(bit, cpuc->active_mask))
+			continue;
+		/*
+		 * The PEBS hardware does not deal well with the situation
+		 * when events happen near to each other and multiple bits
+		 * are set. But it should happen rarely.
+		 *
+		 * If these events include one PEBS and multiple non-PEBS
+		 * events, it doesn't impact PEBS record. The record will
+		 * be handled normally. (slow path)
+		 *
+		 * If these events include two or more PEBS events, the
+		 * records for the events can be collapsed into a single
+		 * one, and it's not possible to reconstruct all events
+		 * that caused the PEBS record. It's called collision.
+		 * If collision happened, the record will be dropped.
+		 *
+		 */
+		if (p->status != (1 << bit)) {
+			u64 pebs_status;
 
-			break;
+			/* slow path */
+			pebs_status = nonpebs_bit_clear(p->status);
+			if (pebs_status != (1 << bit))
+				continue;
 		}
+		counts[bit]++;
+	}
 
-		if (!event || bit >= x86_pmu.max_pebs_events)
+	for (bit = 0; bit < x86_pmu.max_pebs_events; bit++) {
+		if (counts[bit] == 0)
 			continue;
-
-		__intel_pmu_pebs_event(event, iregs, at);
+		event = cpuc->events[bit];
+		__intel_pmu_pebs_event_nhm(event, iregs, base,
+					   top, bit, counts[bit]);
 	}
 }
 
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 61992cf..750007e 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -734,6 +734,10 @@ extern int perf_event_overflow(struct perf_event *event,
 				 struct perf_sample_data *data,
 				 struct pt_regs *regs);
 
+extern void perf_event_output(struct perf_event *event,
+				struct perf_sample_data *data,
+				struct pt_regs *regs);
+
 static inline bool is_sampling_event(struct perf_event *event)
 {
 	return event->attr.sample_period != 0;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 06917d5..a8d0e92 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5360,9 +5360,9 @@ void perf_prepare_sample(struct perf_event_header *header,
 	}
 }
 
-static void perf_event_output(struct perf_event *event,
-				struct perf_sample_data *data,
-				struct pt_regs *regs)
+void perf_event_output(struct perf_event *event,
+			struct perf_sample_data *data,
+			struct pt_regs *regs)
 {
 	struct perf_output_handle handle;
 	struct perf_event_header header;
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH V6 5/6] perf, x86: drain PEBS buffer during context switch
  2015-04-09 16:37 [PATCH V6 0/6] large PEBS interrupt threshold Kan Liang
                   ` (3 preceding siblings ...)
  2015-04-09 16:37 ` [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer Kan Liang
@ 2015-04-09 16:37 ` Kan Liang
  2015-04-09 16:37 ` [PATCH V6 6/6] perf, x86: enlarge PEBS buffer Kan Liang
  5 siblings, 0 replies; 25+ messages in thread
From: Kan Liang @ 2015-04-09 16:37 UTC (permalink / raw)
  To: a.p.zijlstra, linux-kernel; +Cc: mingo, acme, eranian, andi, Kan Liang

From: Yan, Zheng <zheng.z.yan@intel.com>

Flush the PEBS buffer during a context switch if the PEBS interrupt
threshold is larger than one. This allows perf to supply the TID for
sample outputs.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Kan Liang <kan.liang@intel.com>
---
 arch/x86/kernel/cpu/perf_event.h           |  2 ++
 arch/x86/kernel/cpu/perf_event_intel.c     | 11 +++++++-
 arch/x86/kernel/cpu/perf_event_intel_ds.c  | 45 +++++++++++++++++++++++++-----
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |  3 --
 4 files changed, 50 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 5b677a9..446f21b 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -871,6 +871,8 @@ void intel_pmu_pebs_enable_all(void);
 
 void intel_pmu_pebs_disable_all(void);
 
+void intel_pmu_pebs_sched_task(struct perf_event_context *ctx, bool sched_in);
+
 void intel_ds_init(void);
 
 void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 6c8579a..d647d7e 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2729,6 +2729,15 @@ static void intel_pmu_cpu_dying(int cpu)
 	fini_debug_store_on_cpu(cpu);
 }
 
+static void intel_pmu_sched_task(struct perf_event_context *ctx,
+				 bool sched_in)
+{
+	if (x86_pmu.pebs_active)
+		intel_pmu_pebs_sched_task(ctx, sched_in);
+	if (x86_pmu.lbr_nr)
+		intel_pmu_lbr_sched_task(ctx, sched_in);
+}
+
 PMU_FORMAT_ATTR(offcore_rsp, "config1:0-63");
 
 PMU_FORMAT_ATTR(ldlat, "config1:0-15");
@@ -2780,7 +2789,7 @@ static __initconst const struct x86_pmu intel_pmu = {
 	.cpu_starting		= intel_pmu_cpu_starting,
 	.cpu_dying		= intel_pmu_cpu_dying,
 	.guest_get_msrs		= intel_guest_get_msrs,
-	.sched_task		= intel_pmu_lbr_sched_task,
+	.sched_task		= intel_pmu_sched_task,
 };
 
 static __init void intel_clovertown_quirk(void)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index fafbf97..3d8950a 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -550,6 +550,19 @@ int intel_pmu_drain_bts_buffer(void)
 	return 1;
 }
 
+static inline void intel_pmu_drain_pebs_buffer(void)
+{
+	struct pt_regs regs;
+
+	x86_pmu.drain_pebs(&regs);
+}
+
+void intel_pmu_pebs_sched_task(struct perf_event_context *ctx, bool sched_in)
+{
+	if (!sched_in)
+		intel_pmu_drain_pebs_buffer();
+}
+
 /*
  * PEBS
  */
@@ -704,18 +717,28 @@ void intel_pmu_pebs_enable(struct perf_event *event)
 	 * When the event is constrained enough we can use a larger
 	 * threshold and run the event with less frequent PMI.
 	 */
-	if (0 && /* disable this temporarily */
-	    (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD)) {
+	if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
 		threshold = ds->pebs_absolute_maximum -
 			x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
-	} else {
-		threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
-	}
 
-	/* Use auto-reload if possible to save a MSR write in the PMI */
-	if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
+		/* Use auto-reload if possible to save a MSR write in the PMI */
 		ds->pebs_event_reset[hwc->idx] =
 			(u64)(-hwc->sample_period) & x86_pmu.cntval_mask;
+
+		if (first_pebs)
+			perf_sched_cb_inc(event->ctx->pmu);
+	} else {
+		threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
+
+		ds->pebs_event_reset[hwc->idx] = 0;
+
+		/*
+		 * If not all events can use larger buffer,
+		 * roll back to threshold = 1
+		 */
+		if (!first_pebs &&
+		    (ds->pebs_interrupt_threshold > threshold))
+			perf_sched_cb_dec(event->ctx->pmu);
 	}
 
 	if (first_pebs || ds->pebs_interrupt_threshold > threshold)
@@ -726,6 +749,7 @@ void intel_pmu_pebs_disable(struct perf_event *event)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	struct hw_perf_event *hwc = &event->hw;
+	struct debug_store *ds = cpuc->ds;
 
 	cpuc->pebs_enabled &= ~(1ULL << hwc->idx);
 
@@ -734,6 +758,13 @@ void intel_pmu_pebs_disable(struct perf_event *event)
 	else if (event->hw.constraint->flags & PERF_X86_EVENT_PEBS_ST)
 		cpuc->pebs_enabled &= ~(1ULL << 63);
 
+	if (ds->pebs_interrupt_threshold >
+	    ds->pebs_buffer_base + x86_pmu.pebs_record_size) {
+		intel_pmu_drain_pebs_buffer();
+		if (!pebs_is_enabled(cpuc))
+			perf_sched_cb_dec(event->ctx->pmu);
+	}
+
 	if (cpuc->enabled)
 		wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
 
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 94e5b50..c8a72cc 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -262,9 +262,6 @@ void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	struct x86_perf_task_context *task_ctx;
 
-	if (!x86_pmu.lbr_nr)
-		return;
-
 	/*
 	 * If LBR callstack feature is enabled and the stack was saved when
 	 * the task was scheduled out, restore the stack. Otherwise flush
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH V6 6/6] perf, x86: enlarge PEBS buffer
  2015-04-09 16:37 [PATCH V6 0/6] large PEBS interrupt threshold Kan Liang
                   ` (4 preceding siblings ...)
  2015-04-09 16:37 ` [PATCH V6 5/6] perf, x86: drain PEBS buffer during context switch Kan Liang
@ 2015-04-09 16:37 ` Kan Liang
  5 siblings, 0 replies; 25+ messages in thread
From: Kan Liang @ 2015-04-09 16:37 UTC (permalink / raw)
  To: a.p.zijlstra, linux-kernel; +Cc: mingo, acme, eranian, andi, Kan Liang

From: Yan, Zheng <zheng.z.yan@intel.com>

Currently the PEBS buffer size is 4k, which can only hold about 21
PEBS records. This patch enlarges the PEBS buffer size to 64k
(the same as the BTS buffer); 64k of memory can hold about 330 PEBS
records. This significantly reduces the number of PMIs when a large
PEBS interrupt threshold is used.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Kan Liang <kan.liang@intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel_ds.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 3d8950a..dfbcaad 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -11,7 +11,7 @@
 #define BTS_RECORD_SIZE		24
 
 #define BTS_BUFFER_SIZE		(PAGE_SIZE << 4)
-#define PEBS_BUFFER_SIZE	PAGE_SIZE
+#define PEBS_BUFFER_SIZE	(PAGE_SIZE << 4)
 #define PEBS_FIXUP_SIZE		PAGE_SIZE
 
 /*
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer
  2015-04-09 16:37 ` [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer Kan Liang
@ 2015-04-09 21:01   ` Andi Kleen
  2015-04-15 18:28   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 25+ messages in thread
From: Andi Kleen @ 2015-04-09 21:01 UTC (permalink / raw)
  To: Kan Liang; +Cc: a.p.zijlstra, linux-kernel, mingo, acme, eranian, andi

> +	for_each_set_bit(bit, (unsigned long *)&pebs_status, 64) {
> +
> +		if (bit >= x86_pmu.max_pebs_events)
> +			clear_bit(bit, (unsigned long *)&pebs_status);
> +		else {
> +			event = cpuc->events[bit];
> +			WARN_ON_ONCE(!event);
> +
> +			if (!event->attr.precise_ip)
> +				clear_bit(bit, (unsigned long *)&pebs_status);

Precompute the mask of non-PEBS events first in the caller.
Then this function would be just a & ~mask.

BTW clear_bit is atomic, if you're local you should always use
__clear_bit.
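
For illustration only (nonpebs_mask is a hypothetical mask of counters
that do not back PEBS events, precomputed once by the caller), the
function body would then reduce to roughly:

  return pebs_status & ~nonpebs_mask;

and any purely local bit manipulation would use the non-atomic
__clear_bit().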


> +}
> +
> +static inline void *
> +get_next_pebs_record_by_bit(void *base, void *top, int bit)
> +{
> +	void *at;
> +	u64 pebs_status;
> +
> +	if (base == NULL)
> +		return NULL;
> +
> +	for (at = base; at < top; at += x86_pmu.pebs_record_size) {
> +		struct pebs_record_nhm *p = at;
> +
> +		if (p->status & (1 << bit)) {

Use test_bit.

> +
> +			if (p->status == (1 << bit))
> +				return at;
> +
> +			/* clear non-PEBS bit and re-check */
> +			pebs_status = nonpebs_bit_clear(p->status);
> +			if (pebs_status == (1 << bit))
> +				return at;
> +		}
> +	}
> +	return NULL;

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH V6 3/6] perf, x86: large PEBS interrupt threshold
  2015-04-09 16:37 ` [PATCH V6 3/6] perf, x86: large PEBS interrupt threshold Kan Liang
@ 2015-04-15 17:14   ` Peter Zijlstra
  2015-04-15 17:48     ` Liang, Kan
  2015-04-15 18:05   ` Peter Zijlstra
  2015-04-16 18:45   ` Peter Zijlstra
  2 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2015-04-15 17:14 UTC (permalink / raw)
  To: Kan Liang; +Cc: linux-kernel, mingo, acme, eranian, andi

On Thu, Apr 09, 2015 at 12:37:43PM -0400, Kan Liang wrote:
> This patch also make AUTO_RELOAD conditional on large PEBS. Auto reload
> only be enabled when fix period and large PEBS.

What's a large PEBS?

> +++ b/arch/x86/kernel/cpu/perf_event.h
> @@ -87,6 +87,17 @@ struct amd_nb {
>  #define MAX_PEBS_EVENTS		8
>  
>  /*
> + * Flags PEBS can handle without an PMI.
> + *
> + * TID can only be handled by flushing at context switch.
> + */
> +#define PEBS_FREERUNNING_FLAGS \
> +	(PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_ADDR | \
> +	PERF_SAMPLE_ID | PERF_SAMPLE_CPU | PERF_SAMPLE_STREAM_ID | \
> +	PERF_SAMPLE_DATA_SRC | PERF_SAMPLE_IDENTIFIER | \
> +	PERF_SAMPLE_TRANSACTION)
> +
> +/*
>   * A debug store configuration.
>   *
>   * We only support architectures that use 64bit fields.
> diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
> index 0a7b5ca..6c8579a 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel.c
> @@ -2306,7 +2306,9 @@ static int intel_pmu_hw_config(struct perf_event *event)
>  		return ret;
>  
>  	if (event->attr.precise_ip) {
> -		if (!event->attr.freq)
> +		/* only enable auto reload when fix period and large PEBS */
> +		if (!event->attr.freq &&
> +		    !(event->attr.sample_type & ~PEBS_FREERUNNING_FLAGS))
>  			event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD;
>  		if (x86_pmu.pebs_aliases)
>  			x86_pmu.pebs_aliases(event);

I suspect you meant the above change right?

But this negates part of the benefit of the auto reload; where
previously it saved an MSR write for pretty much all PEBS usage it now
becomes a burden for pretty much everyone.

Why cannot we retain the win for all PEBS users?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH V6 3/6] perf, x86: large PEBS interrupt threshold
  2015-04-15 17:14   ` Peter Zijlstra
@ 2015-04-15 17:48     ` Liang, Kan
  2015-04-15 18:10       ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Liang, Kan @ 2015-04-15 17:48 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, mingo, acme, eranian, andi



> -----Original Message-----
> From: Peter Zijlstra [mailto:peterz@infradead.org]
> Sent: Wednesday, April 15, 2015 1:15 PM
> To: Liang, Kan
> Cc: linux-kernel@vger.kernel.org; mingo@kernel.org;
> acme@infradead.org; eranian@google.com; andi@firstfloor.org
> Subject: Re: [PATCH V6 3/6] perf, x86: large PEBS interrupt threshold
> 
> On Thu, Apr 09, 2015 at 12:37:43PM -0400, Kan Liang wrote:
> > This patch also make AUTO_RELOAD conditional on large PEBS. Auto
> > reload only be enabled when fix period and large PEBS.
> 
> What's a large PEBS?
>
It should be "large PEBS available", i.e.
!(event->attr.sample_type & ~PEBS_FREERUNNING_FLAGS)

> > +++ b/arch/x86/kernel/cpu/perf_event.h
> > @@ -87,6 +87,17 @@ struct amd_nb {
> >  #define MAX_PEBS_EVENTS		8
> >
> >  /*
> > + * Flags PEBS can handle without an PMI.
> > + *
> > + * TID can only be handled by flushing at context switch.
> > + */
> > +#define PEBS_FREERUNNING_FLAGS \
> > +	(PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_ADDR | \
> > +	PERF_SAMPLE_ID | PERF_SAMPLE_CPU |
> PERF_SAMPLE_STREAM_ID | \
> > +	PERF_SAMPLE_DATA_SRC | PERF_SAMPLE_IDENTIFIER | \
> > +	PERF_SAMPLE_TRANSACTION)
> > +
> > +/*
> >   * A debug store configuration.
> >   *
> >   * We only support architectures that use 64bit fields.
> > diff --git a/arch/x86/kernel/cpu/perf_event_intel.c
> > b/arch/x86/kernel/cpu/perf_event_intel.c
> > index 0a7b5ca..6c8579a 100644
> > --- a/arch/x86/kernel/cpu/perf_event_intel.c
> > +++ b/arch/x86/kernel/cpu/perf_event_intel.c
> > @@ -2306,7 +2306,9 @@ static int intel_pmu_hw_config(struct
> perf_event *event)
> >  		return ret;
> >
> >  	if (event->attr.precise_ip) {
> > -		if (!event->attr.freq)
> > +		/* only enable auto reload when fix period and large PEBS
> */
> > +		if (!event->attr.freq &&
> > +		    !(event->attr.sample_type &
> ~PEBS_FREERUNNING_FLAGS))
> >  			event->hw.flags |=
> PERF_X86_EVENT_AUTO_RELOAD;
> >  		if (x86_pmu.pebs_aliases)
> >  			x86_pmu.pebs_aliases(event);
> 
> I suspect you meant the above change right?

Yes.

> 
> But this negates part of the benefit of the auto reload; where previously it
> saved an MSR write for pretty much all PEBS usage it now becomes a
> burden for pretty much everyone.
> 
> Why cannot we retain the win for all PEBS users?

The change tries to address your comments.
https://lkml.org/lkml/2015/3/30/294

Yes, we can retain the win. If so, I think we need to introduce another
flag like PERF_X86_EVENT_LARGE_PEBS and check it in pebs_is_enabled.
Or just keep the previous V5 patch unchanged.  

Thanks,
Kan



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH V6 3/6] perf, x86: large PEBS interrupt threshold
  2015-04-09 16:37 ` [PATCH V6 3/6] perf, x86: large PEBS interrupt threshold Kan Liang
  2015-04-15 17:14   ` Peter Zijlstra
@ 2015-04-15 18:05   ` Peter Zijlstra
  2015-04-15 18:35     ` Liang, Kan
  2015-04-16 18:45   ` Peter Zijlstra
  2 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2015-04-15 18:05 UTC (permalink / raw)
  To: Kan Liang; +Cc: linux-kernel, mingo, acme, eranian, andi

On Thu, Apr 09, 2015 at 12:37:43PM -0400, Kan Liang wrote:
> @@ -280,8 +280,9 @@ static int alloc_pebs_buffer(int cpu)
>  	ds->pebs_absolute_maximum = ds->pebs_buffer_base +
>  		max * x86_pmu.pebs_record_size;
>  
> -	ds->pebs_interrupt_threshold = ds->pebs_buffer_base +
> -		thresh * x86_pmu.pebs_record_size;
> +	if (x86_pmu.intel_cap.pebs_format < 1)
> +		ds->pebs_interrupt_threshold = ds->pebs_buffer_base +
> +			x86_pmu.pebs_record_size;
>  
>  	return 0;
>  }

I can't seem to figure out what this is about.. help?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH V6 3/6] perf, x86: large PEBS interrupt threshold
  2015-04-15 17:48     ` Liang, Kan
@ 2015-04-15 18:10       ` Peter Zijlstra
  0 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2015-04-15 18:10 UTC (permalink / raw)
  To: Liang, Kan; +Cc: linux-kernel, mingo, acme, eranian, andi

On Wed, Apr 15, 2015 at 05:48:39PM +0000, Liang, Kan wrote:
> > > +++ b/arch/x86/kernel/cpu/perf_event_intel.c
> > > @@ -2306,7 +2306,9 @@ static int intel_pmu_hw_config(struct perf_event *event)
> > >  		return ret;
> > >
> > >  	if (event->attr.precise_ip) {
> > > -		if (!event->attr.freq)
> > > +		/* only enable auto reload when fix period and large PEBS */
> > > +		if (!event->attr.freq &&
> > > +		    !(event->attr.sample_type & ~PEBS_FREERUNNING_FLAGS))
> > >  			event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD;
> > >  		if (x86_pmu.pebs_aliases)
> > >  			x86_pmu.pebs_aliases(event);
> > 
> > I suspect you meant the above change right?
> 
> Yes.
> 
> > 
> > But this negates part of the benefit of the auto reload; where previously it
> > saved an MSR write for pretty much all PEBS usage it now becomes a
> > burden for pretty much everyone.
> > 
> > Why cannot we retain the win for all PEBS users?
> 
> The change tries to address your comments.
> https://lkml.org/lkml/2015/3/30/294
> 
> Yes, we can retain the win. If so, I think we need to introduce another
> flag like PERF_X86_EVENT_LARGE_PEBS and check it in pebs_is_enabled.
> Or just keep the previous V5 patch unchanged.  

Right, something like so. 

--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -65,16 +65,17 @@ struct event_constraint {
 /*
  * struct hw_perf_event.flags flags
  */
-#define PERF_X86_EVENT_PEBS_LDLAT	0x1 /* ld+ldlat data address sampling */
-#define PERF_X86_EVENT_PEBS_ST		0x2 /* st data address sampling */
-#define PERF_X86_EVENT_PEBS_ST_HSW	0x4 /* haswell style datala, store */
-#define PERF_X86_EVENT_COMMITTED	0x8 /* event passed commit_txn */
-#define PERF_X86_EVENT_PEBS_LD_HSW	0x10 /* haswell style datala, load */
-#define PERF_X86_EVENT_PEBS_NA_HSW	0x20 /* haswell style datala, unknown */
-#define PERF_X86_EVENT_EXCL		0x40 /* HT exclusivity on counter */
-#define PERF_X86_EVENT_DYNAMIC		0x80 /* dynamic alloc'd constraint */
-#define PERF_X86_EVENT_RDPMC_ALLOWED	0x40 /* grant rdpmc permission */
-#define PERF_X86_EVENT_AUTO_RELOAD	0x80 /* use PEBS auto-reload */
+#define PERF_X86_EVENT_PEBS_LDLAT	0x0001 /* ld+ldlat data address sampling */
+#define PERF_X86_EVENT_PEBS_ST		0x0002 /* st data address sampling */
+#define PERF_X86_EVENT_PEBS_ST_HSW	0x0004 /* haswell style datala, store */
+#define PERF_X86_EVENT_COMMITTED	0x0008 /* event passed commit_txn */
+#define PERF_X86_EVENT_PEBS_LD_HSW	0x0010 /* haswell style datala, load */
+#define PERF_X86_EVENT_PEBS_NA_HSW	0x0020 /* haswell style datala, unknown */
+#define PERF_X86_EVENT_EXCL		0x0040 /* HT exclusivity on counter */
+#define PERF_X86_EVENT_DYNAMIC		0x0080 /* dynamic alloc'd constraint */
+#define PERF_X86_EVENT_RDPMC_ALLOWED	0x0100 /* grant rdpmc permission */
+#define PERF_X86_EVENT_AUTO_RELOAD	0x0200 /* use PEBS auto-reload */
+#define PERF_X86_EVENT_FREERUNNING	0x0400 /* use freerunning PEBS */
 
 struct amd_nb {
 	int nb_id;  /* NorthBridge id */
@@ -87,6 +88,17 @@ struct amd_nb {
 #define MAX_PEBS_EVENTS		8
 
 /*
+ * Flags PEBS can handle without an PMI.
+ *
+ * TID can only be handled by flushing at context switch.
+ */
+#define PEBS_FREERUNNING_FLAGS \
+	(PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_ADDR | \
+	PERF_SAMPLE_ID | PERF_SAMPLE_CPU | PERF_SAMPLE_STREAM_ID | \
+	PERF_SAMPLE_DATA_SRC | PERF_SAMPLE_IDENTIFIER | \
+	PERF_SAMPLE_TRANSACTION)
+
+/*
  * A debug store configuration.
  *
  * We only support architectures that use 64bit fields.
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2306,8 +2306,12 @@ static int intel_pmu_hw_config(struct pe
 		return ret;
 
 	if (event->attr.precise_ip) {
-		if (!event->attr.freq)
+		if (!event->attr.freq) {
 			event->hw.flags |= PERF_X86_EVENT_AUTO_RELOAD;
+
+			if (!(event->attr.sample_type & ~PEBS_FREERUNNING_FLAGS))
+				event->hw.flags |= PERF_X86_EVENT_FREERUNNING;
+		}
 		if (x86_pmu.pebs_aliases)
 			x86_pmu.pebs_aliases(event);
 	}
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -250,7 +250,7 @@ static int alloc_pebs_buffer(int cpu)
 {
 	struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
 	int node = cpu_to_node(cpu);
-	int max, thresh = 1; /* always use a single PEBS record */
+	int max;
 	void *buffer, *ibuffer;
 
 	if (!x86_pmu.pebs)
@@ -277,11 +277,13 @@ static int alloc_pebs_buffer(int cpu)
 
 	ds->pebs_buffer_base = (u64)(unsigned long)buffer;
 	ds->pebs_index = ds->pebs_buffer_base;
-	ds->pebs_absolute_maximum = ds->pebs_buffer_base +
-		max * x86_pmu.pebs_record_size;
+	ds->pebs_absolute_maximum =
+		ds->pebs_buffer_base + max * x86_pmu.pebs_record_size;
 
-	ds->pebs_interrupt_threshold = ds->pebs_buffer_base +
-		thresh * x86_pmu.pebs_record_size;
+	if (x86_pmu.intel_cap.pebs_format < 1) {
+		ds->pebs_interrupt_threshold =
+			ds->pebs_buffer_base + x86_pmu.pebs_record_size;
+	}
 
 	return 0;
 }
@@ -684,14 +686,22 @@ struct event_constraint *intel_pebs_cons
 	return &emptyconstraint;
 }
 
+static inline bool pebs_is_enabled(struct cpu_hw_events *cpuc)
+{
+	return (cpuc->pebs_enabled & ((1ULL << MAX_PEBS_EVENTS) - 1));
+}
+
 void intel_pmu_pebs_enable(struct perf_event *event)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	struct hw_perf_event *hwc = &event->hw;
 	struct debug_store *ds = cpuc->ds;
+	bool first_pebs;
+	u64 threshold;
 
 	hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;
 
+	first_pebs = !pebs_is_enabled(cpuc);
 	cpuc->pebs_enabled |= 1ULL << hwc->idx;
 
 	if (event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT)
@@ -699,11 +709,26 @@ void intel_pmu_pebs_enable(struct perf_e
 	else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
 		cpuc->pebs_enabled |= 1ULL << 63;
 
+	/*
+	 * When the event is constrained enough we can use a larger
+	 * threshold and run the event with less frequent PMI.
+	 */
+	if (0 && /* disable this temporarily */
+	    (hwc->flags & PERF_X86_EVENT_FREERUNNING)) {
+		threshold = ds->pebs_absolute_maximum -
+			x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
+	} else {
+		threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
+	}
+
 	/* Use auto-reload if possible to save a MSR write in the PMI */
 	if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
 		ds->pebs_event_reset[hwc->idx] =
 			(u64)(-hwc->sample_period) & x86_pmu.cntval_mask;
 	}
+
+	if (first_pebs || ds->pebs_interrupt_threshold > threshold)
+		ds->pebs_interrupt_threshold = threshold;
 }
 
 void intel_pmu_pebs_disable(struct perf_event *event)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer
  2015-04-09 16:37 ` [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer Kan Liang
  2015-04-09 21:01   ` Andi Kleen
@ 2015-04-15 18:28   ` Peter Zijlstra
  2015-04-15 18:36   ` Peter Zijlstra
  2015-04-16 12:53   ` Peter Zijlstra
  3 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2015-04-15 18:28 UTC (permalink / raw)
  To: Kan Liang; +Cc: linux-kernel, mingo, acme, eranian, andi

On Thu, Apr 09, 2015 at 12:37:44PM -0400, Kan Liang wrote:

> The only way you can get a lot of collision is when you count the same
> thing multiple times. But it is not a useful configuration.

Not entirely true; I _think_ you can be unfortunate if you measure with
a userspace only PEBS event along with either a kernel or unrestricted
PEBS event. Imagine the event triggering and setting the overflow flag
right before entering the kernel. Then all kernel side events will end
up with multiple bits set.



^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH V6 3/6] perf, x86: large PEBS interrupt threshold
  2015-04-15 18:05   ` Peter Zijlstra
@ 2015-04-15 18:35     ` Liang, Kan
  2015-04-15 18:41       ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Liang, Kan @ 2015-04-15 18:35 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, mingo, acme, eranian, andi


> 
> On Thu, Apr 09, 2015 at 12:37:43PM -0400, Kan Liang wrote:
> > @@ -280,8 +280,9 @@ static int alloc_pebs_buffer(int cpu)
> >  	ds->pebs_absolute_maximum = ds->pebs_buffer_base +
> >  		max * x86_pmu.pebs_record_size;
> >
> > -	ds->pebs_interrupt_threshold = ds->pebs_buffer_base +
> > -		thresh * x86_pmu.pebs_record_size;
> > +	if (x86_pmu.intel_cap.pebs_format < 1)
> > +		ds->pebs_interrupt_threshold = ds->pebs_buffer_base +
> > +			x86_pmu.pebs_record_size;
> >
> >  	return 0;
> >  }
> 
> I can't seem to figure out what this is about.. help?

We moved the AUTO_RELOAD and large PEBS checks to intel_pmu_hw_config.
But earlier platforms call x86_pmu_hw_config instead,
so we force a single PEBS record for old platforms.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer
  2015-04-09 16:37 ` [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer Kan Liang
  2015-04-09 21:01   ` Andi Kleen
  2015-04-15 18:28   ` Peter Zijlstra
@ 2015-04-15 18:36   ` Peter Zijlstra
  2015-04-16 12:53   ` Peter Zijlstra
  3 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2015-04-15 18:36 UTC (permalink / raw)
  To: Kan Liang; +Cc: linux-kernel, mingo, acme, eranian, andi

On Thu, Apr 09, 2015 at 12:37:44PM -0400, Kan Liang wrote:
> +/* Clear all non-PEBS bits */
> +static u64
> +nonpebs_bit_clear(u64 pebs_status)
> +{
> +	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> +	struct perf_event *event;
> +	int bit;
> +
> +	for_each_set_bit(bit, (unsigned long *)&pebs_status, 64) {
> +
> +		if (bit >= x86_pmu.max_pebs_events)
> +			clear_bit(bit, (unsigned long *)&pebs_status);
> +		else {
> +			event = cpuc->events[bit];
> +			WARN_ON_ONCE(!event);
> +
> +			if (!event->attr.precise_ip)
> +				clear_bit(bit, (unsigned long *)&pebs_status);
> +		}
> +	}
> +
> +	return pebs_status;
> +}

What was wrong with:

	status = p->status & cpuc->pebs_enabled;

?

We use the same index bits in the PEBS_ENABLE MSR as in the status reg,
right? If you're really paranoid you can mask out the high (>31) bits
too I suppose.
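
A minimal sketch of the simpler masking being suggested here (illustrative
only; it reuses the names from the quoted patch and is not meant as a
drop-in replacement):

static u64 pebs_status_bits(struct pebs_record_nhm *p,
			    struct cpu_hw_events *cpuc)
{
	/*
	 * PEBS_ENABLE uses the same counter index bits as the record's
	 * status field, so a single AND drops all !PEBS counters.
	 */
	u64 status = p->status & cpuc->pebs_enabled;

	/* Paranoia: also drop anything above the PEBS-capable counters. */
	status &= (1ULL << x86_pmu.max_pebs_events) - 1;

	return status;
}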

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH V6 3/6] perf, x86: large PEBS interrupt threshold
  2015-04-15 18:35     ` Liang, Kan
@ 2015-04-15 18:41       ` Peter Zijlstra
  0 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2015-04-15 18:41 UTC (permalink / raw)
  To: Liang, Kan; +Cc: linux-kernel, mingo, acme, eranian, andi

On Wed, Apr 15, 2015 at 06:35:48PM +0000, Liang, Kan wrote:
> 
> > 
> > On Thu, Apr 09, 2015 at 12:37:43PM -0400, Kan Liang wrote:
> > > @@ -280,8 +280,9 @@ static int alloc_pebs_buffer(int cpu)
> > >  	ds->pebs_absolute_maximum = ds->pebs_buffer_base +
> > >  		max * x86_pmu.pebs_record_size;
> > >
> > > -	ds->pebs_interrupt_threshold = ds->pebs_buffer_base +
> > > -		thresh * x86_pmu.pebs_record_size;
> > > +	if (x86_pmu.intel_cap.pebs_format < 1)
> > > +		ds->pebs_interrupt_threshold = ds->pebs_buffer_base +
> > > +			x86_pmu.pebs_record_size;
> > >
> > >  	return 0;
> > >  }
> > 
> > I can't seem to figure out what this is about.. help?
> 
> We move the AUTO_RELOAD and large PEBS checks to intel_pmu_hw_config.
> But earlier platforms go through x86_pmu_hw_config instead.
> So we force a single PEBS record for those old platforms.

We're talking about intel_pmu vs core_pmu right? I don't think core_pmu
supports PEBS at all.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer
  2015-04-09 16:37 ` [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer Kan Liang
                     ` (2 preceding siblings ...)
  2015-04-15 18:36   ` Peter Zijlstra
@ 2015-04-16 12:53   ` Peter Zijlstra
  2015-04-17  8:11     ` Peter Zijlstra
  3 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2015-04-16 12:53 UTC (permalink / raw)
  To: Kan Liang; +Cc: linux-kernel, mingo, acme, eranian, andi

On Thu, Apr 09, 2015 at 12:37:44PM -0400, Kan Liang wrote:
> From: Yan, Zheng <zheng.z.yan@intel.com>
> 
> When PEBS interrupt threshold is larger than one, the PEBS buffer
> may include multiple records for each PEBS event. This patch makes
> the code first count how many records each PEBS event has, then
> output the samples in batch.
> 
> One corner case that needs to be mentioned is that the PEBS hardware
> doesn't deal well with collisions. The records for the events can be
> collapsed into a single one, and it's not possible to reconstruct all
> events that caused the PEBS record.
> Here are some cases which can be called collisions.
>  - PEBS events happen close to each other, so the hardware merges them.
>  - PEBS events happen close to each other, but they are not merged.
>    The GLOBAL_STATUS bit for the first counter is cleared before the
>    event for the next counter is generated. Only the first record is
>    treated as a collision.
>  - Same as case 2, but the first counter's bit isn't cleared before the
>    event for the next counter is generated. All records are treated as
>    collisions until a record with only one PEBS bit set is seen.
> 
> GLOBAL_STATUS bits can be set by both PEBS and non-PEBS events. Multiple
> non-PEBS bits being set doesn't count as a collision.
> 
> In practice collisions are extremely rare, as long as different PEBS
> events are used. The periods are typically very large, so any collision
> is unlikely. When collision happens, we drop the PEBS record.
> The only way you can get a lot of collision is when you count the same
> thing multiple times. But it is not a useful configuration.
> 
> Here are some numbers about collisions.
> Four frequently occurring events
> (cycles:p,instructions:p,branches:p,mem-stores:p) are tested
> 
> Test events which are sampled together                   collision rate
> cycles:p,instructions:p                                  0.25%
> cycles:p,instructions:p,branches:p                       0.30%
> cycles:p,instructions:p,branches:p,mem-stores:p          0.35%
> 
> cycles:p,cycles:p                                        98.52%

*sigh* you're really going to make me write this :-(

The sad part is that I had already written a large part of it for you in
a previous email (
lkml.kernel.org/r/20150330200710.GO27490@worktop.programming.kicks-ass.net
).

And yes, writing a good Changelog takes a lot of time and effort,
sometimes more than the actual patch, and that is OK.

There's a *PLEASE CLARIFY* in the below, please do that.

Also the below talks about a PERF_RECORD_SAMPLES_LOST, please also
implement that.

---

When the PEBS interrupt threshold is larger than one record and the
machine supports multiple PEBS events, the records of these events are
mixed up and we need to demultiplex them.

Demuxing the records is hard because the hardware is deficient. The
hardware has two issues that, when combined, create impossible scenarios
to demux.

The first issue is that the 'status' field of the PEBS record is a copy
of the GLOBAL_STATUS MSR at PEBS assist time. To see why this is a
problem let us first describe the regular PEBS cycle:

 A) the CTRn value reaches 0:
    - the corresponding bit in GLOBAL_STATUS gets set
    - we start arming the hardware assist

    < some unspecified amount of time later --
        this could cover multiple events of interest >

 B) the hardware assist is armed, any next event will trigger it

 C) a matching event happens:
    - the hardware assist triggers and generates a PEBS record
      this includes a copy of GLOBAL_STATUS at this moment
    - if we auto-reload we (re)set CTRn
    - we clear the relevant bit in GLOBAL_STATUS

Now consider the following chain of events:

 A0, B0, A1, C0

The event generated for counter 0 will include a status with counter 1
set, even though it's not at all related to the record. A similar thing
can happen with a !PEBS event if it just happens to overflow at the
right moment.

The second issue is that the hardware will only emit one record for two
or more counters if the event that triggers the assist is 'close' --
*PLEASE CLARIFY* either the very same instruction or retired in the same
cycle?

For instance, consider this chain of events:

 A0, B0, A1, B1, C01

Where C01 is an event that triggers both hardware assists (the
instruction matches both criteria), we will generate but a single
record, but again with both counters listed in the status field.

This time the record pertains to both events.

Note that these two cases are different but indistinguishable with the
data as generated. Therefore demuxing records with multiple PEBS bits
(we can safely ignore status bits for !PEBS counters) is impossible.

Furthermore we cannot emit the record to both events because that might
cause a data leak -- the events might not have the same privileges -- so
what this patch does is discard such records.

The assumption/hope is that such discards will be rare, and to make sure
the user is not left in the dark about this we'll emit a
PERF_RECORD_SAMPLES_LOST record with the number of possible discards.
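
Expressed as code, the discard policy described above is roughly the
following (a sketch only: the helper name and the 'lost' counter are
made up here, and PERF_RECORD_SAMPLES_LOST is the record proposed above,
not something that exists yet):

static bool pebs_record_demuxable(struct pebs_record_nhm *p,
				  struct cpu_hw_events *cpuc, int *lost)
{
	u64 status = p->status & cpuc->pebs_enabled &
		     ((1ULL << x86_pmu.max_pebs_events) - 1);

	/*
	 * More than one PEBS counter listed: the record could belong to
	 * any of them, and handing it to all of them could leak data
	 * across events with different privileges.  Drop it and account
	 * the loss so it can be reported via PERF_RECORD_SAMPLES_LOST.
	 */
	if (hweight64(status) > 1) {
		(*lost)++;
		return false;
	}

	return status != 0;
}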


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH V6 3/6] perf, x86: large PEBS interrupt threshold
  2015-04-09 16:37 ` [PATCH V6 3/6] perf, x86: large PEBS interrupt threshold Kan Liang
  2015-04-15 17:14   ` Peter Zijlstra
  2015-04-15 18:05   ` Peter Zijlstra
@ 2015-04-16 18:45   ` Peter Zijlstra
  2 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2015-04-16 18:45 UTC (permalink / raw)
  To: Kan Liang; +Cc: linux-kernel, mingo, acme, eranian, andi

On Thu, Apr 09, 2015 at 12:37:43PM -0400, Kan Liang wrote:
> +	/*
> +	 * When the event is constrained enough we can use a larger
> +	 * threshold and run the event with less frequent PMI.
> +	 */
> +	if (0 && /* disable this temporarily */

This is a good hint your patches are ordered wrong.

> +	    (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD)) {
> +		threshold = ds->pebs_absolute_maximum -
> +			x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
> +	} else {
> +		threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
> +	}
> +

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer
  2015-04-16 12:53   ` Peter Zijlstra
@ 2015-04-17  8:11     ` Peter Zijlstra
  2015-04-17 12:50       ` Liang, Kan
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2015-04-17  8:11 UTC (permalink / raw)
  To: Kan Liang; +Cc: linux-kernel, mingo, acme, eranian, andi

On Thu, Apr 16, 2015 at 02:53:42PM +0200, Peter Zijlstra wrote:
> When the PEBS interrupt threshold is larger than one record and the
> machine supports multiple PEBS events, the records of these events are
> mixed up and we need to demultiplex them.
> 
> Demuxing the records is hard because the hardware is deficient. The
> hardware has two issues that, when combined, create impossible scenarios
> to demux.
> 
> The first issue is that the 'status' field of the PEBS record is a copy
> of the GLOBAL_STATUS MSR at PEBS assist time. To see why this is a
> problem let us first describe the regular PEBS cycle:
> 
>  A) the CTRn value reaches 0:
>     - the corresponding bit in GLOBAL_STATUS gets set
>     - we start arming the hardware assist
> 
>     < some unspecified amount of time later --
>         this could cover multiple events of interest >
> 
>  B) the hardware assist is armed, any next event will trigger it
> 
>  C) a matching event happens:
>     - the hardware assist triggers and generates a PEBS record
>       this includes a copy of GLOBAL_STATUS at this moment
>     - if we auto-reload we (re)set CTRn

Is this actually true? Do we reload here or on A ?

>     - we clear the relevant bit in GLOBAL_STATUS

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer
  2015-04-17  8:11     ` Peter Zijlstra
@ 2015-04-17 12:50       ` Liang, Kan
  2015-04-17 13:12         ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Liang, Kan @ 2015-04-17 12:50 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, mingo, acme, eranian, andi



> >
> >  A) the CTRn value reaches 0:
> >     - the corresponding bit in GLOBAL_STATUS gets set
> >     - we start arming the hardware assist
> >
> >     < some unspecified amount of time later --
> >         this could cover multiple events of interest >
> >
> >  B) the hardware assist is armed, any next event will trigger it
> >
> >  C) a matching event happens:
> >     - the hardware assist triggers and generates a PEBS record
> >       this includes a copy of GLOBAL_STATUS at this moment
> >     - if we auto-reload we (re)set CTRn
> 
> Is this actually true? Do we reload here or on A ?
> 

Yes, on C.
According to SDM Volume 3, 18.7.1.1, the reset value will be
loaded after each PEBS record is written, which is done
by hw assist.


> >     - we clear the relevant bit in GLOBAL_STATUS

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer
  2015-04-17 12:50       ` Liang, Kan
@ 2015-04-17 13:12         ` Peter Zijlstra
  2015-04-17 14:19           ` Liang, Kan
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2015-04-17 13:12 UTC (permalink / raw)
  To: Liang, Kan; +Cc: linux-kernel, mingo, acme, eranian, andi

On Fri, Apr 17, 2015 at 12:50:33PM +0000, Liang, Kan wrote:
> 
> 
> > >
> > >  A) the CTRn value reaches 0:
> > >     - the corresponding bit in GLOBAL_STATUS gets set
> > >     - we start arming the hardware assist
> > >
> > >     < some unspecified amount of time later --
> > >         this could cover multiple events of interest >
> > >
> > >  B) the hardware assist is armed, any next event will trigger it
> > >
> > >  C) a matching event happens:
> > >     - the hardware assist triggers and generates a PEBS record
> > >       this includes a copy of GLOBAL_STATUS at this moment
> > >     - if we auto-reload we (re)set CTRn
> > 
> > Is this actually true? Do we reload here or on A ?
> > 
> 
> Yes, on C.
> According to SDM Volume 3, 18.7.1.1, the reset value will be
> loaded after each PEBS record is written, which is done
> by hw assist.

OK, then I did indeed remember that right.

But that brings us to patch 1 of this series, how is that correct in the
face of this? There is an arbitrary delay (A->B) added to the period.
And the Changelog of course never did bother to make that clear.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer
  2015-04-17 13:12         ` Peter Zijlstra
@ 2015-04-17 14:19           ` Liang, Kan
  2015-04-17 14:44             ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Liang, Kan @ 2015-04-17 14:19 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, mingo, acme, eranian, andi



> -----Original Message-----
> From: Peter Zijlstra [mailto:peterz@infradead.org]
> Sent: Friday, April 17, 2015 9:13 AM
> To: Liang, Kan
> Cc: linux-kernel@vger.kernel.org; mingo@kernel.org;
> acme@infradead.org; eranian@google.com; andi@firstfloor.org
> Subject: Re: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS
> buffer
> 
> On Fri, Apr 17, 2015 at 12:50:33PM +0000, Liang, Kan wrote:
> >
> >
> > > >
> > > >  A) the CTRn value reaches 0:
> > > >     - the corresponding bit in GLOBAL_STATUS gets set
> > > >     - we start arming the hardware assist
> > > >
> > > >     < some unspecified amount of time later --
> > > >         this could cover multiple events of interest >
> > > >
> > > >  B) the hardware assist is armed, any next event will trigger it
> > > >
> > > >  C) a matching event happens:
> > > >     - the hardware assist triggers and generates a PEBS record
> > > >       this includes a copy of GLOBAL_STATUS at this moment
> > > >     - if we auto-reload we (re)set CTRn
> > >
> > > Is this actually true? Do we reload here or on A ?
> > >
> >
> > Yes, on C.
> > According to SDM Volume 3, 18.7.1.1, the reset value will be loaded
> > after each PEBS record is written, which is done by hw assist.
> 
> OK, then I did indeed remember that right.
> 
> But that brings us to patch 1 of this series, how is that correct in the face of
> this? There is an arbitrary delay (A->B) added to the period.
> And the Changelog of course never did bother to make that clear.

OK. I will update the changelog for patch 1 as below.
---
When a fixed period is specified, this patch makes perf use the PEBS
auto reload mechanism. This makes normal profiling faster, because
it avoids one costly MSR write in the PMI handler.
However, the reset value will be loaded by the hardware assist, so
there is a slight delay compared to the previous non-auto-reload
mechanism. The delay is arbitrary but very small.

Thanks,
Kan
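
For reference, the difference the changelog describes boils down to
roughly this (a simplified sketch of the two reload paths, not the
actual patch):

	/*
	 * Without auto reload: the PMI handler rewrites the counter for
	 * every sample (one costly MSR write per PMI).
	 */
	wrmsrl(hwc->event_base,
	       (u64)(-hwc->sample_period) & x86_pmu.cntval_mask);

	/*
	 * With auto reload: program the reset value once in the DS area;
	 * the hardware assist reloads the counter after writing each PEBS
	 * record (step C in the quoted description), so the PMI does not
	 * need to touch the counter MSR.
	 */
	ds->pebs_event_reset[hwc->idx] =
		(u64)(-hwc->sample_period) & x86_pmu.cntval_mask;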



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer
  2015-04-17 14:19           ` Liang, Kan
@ 2015-04-17 14:44             ` Peter Zijlstra
  2015-04-17 18:20               ` Andi Kleen
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2015-04-17 14:44 UTC (permalink / raw)
  To: Liang, Kan; +Cc: linux-kernel, mingo, acme, eranian, andi

On Fri, Apr 17, 2015 at 02:19:58PM +0000, Liang, Kan wrote:

> > But that brings us to patch 1 of this series, how is that correct in the face of
> > this? There is an arbitrary delay (A->B) added to the period.
> > And the Changelog of course never did bother to make that clear.
> 
> OK. I will update the changelog for patch 1 as below.
> ---
> When a fixed period is specified, this patch makes perf use the PEBS
> auto reload mechanism. This makes normal profiling faster, because
> it avoids one costly MSR write in the PMI handler.

> However, the reset value will be loaded by the hardware assist, so
> there is a slight delay compared to the previous non-auto-reload
> mechanism. The delay is arbitrary but very small.

What is very small? And doesn't that mean it's significant at exactly the
point this patch series is aimed at, namely very short periods?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer
  2015-04-17 14:44             ` Peter Zijlstra
@ 2015-04-17 18:20               ` Andi Kleen
  2015-04-17 18:25                 ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Andi Kleen @ 2015-04-17 18:20 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Liang, Kan, linux-kernel, mingo, acme, eranian, andi

On Fri, Apr 17, 2015 at 04:44:07PM +0200, Peter Zijlstra wrote:
> On Fri, Apr 17, 2015 at 02:19:58PM +0000, Liang, Kan wrote:
> 
> > > But that brings us to patch 1 of this series, how is that correct in the face of
> > > this? There is an arbitrary delay (A->B) added to the period.
> > > And the Changelog of course never did bother to make that clear.

That's how perf and other profilers always behaved. The PMI
is not part of the period. The automatic PEBS reload is not in any way
different. It's much faster than a PMI, but it's also not zero cost.

This is not a gap in measurement though -- there is no other code
running during that time on that CPU. It's simply overhead from the
measurement mechanism.

> > 
> > OK. I will update the changelog for patch 1 as below.
> > ---
> > When a fixed period is specified, this patch makes perf use the PEBS
> > auto reload mechanism. This makes normal profiling faster, because
> > it avoids one costly MSR write in the PMI handler.
>
> > However, the reset value will be loaded by the hardware assist, so
> > there is a slight delay compared to the previous non-auto-reload
> > mechanism. The delay is arbitrary but very small.
> 
> What is very small? And doesn't that mean it's significant at exactly the
> point this patch series is aimed at, namely very short periods?

The assist cost is 400-800 cycles, assuming common cases with everything
cached. The minimum period the patch currently uses is 10000. In that
extreme case the overhead can be ~10% when cycles are sampled.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer
  2015-04-17 18:20               ` Andi Kleen
@ 2015-04-17 18:25                 ` Peter Zijlstra
  0 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2015-04-17 18:25 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Liang, Kan, linux-kernel, mingo, acme, eranian

On Fri, Apr 17, 2015 at 08:20:37PM +0200, Andi Kleen wrote:
> On Fri, Apr 17, 2015 at 04:44:07PM +0200, Peter Zijlstra wrote:
> > On Fri, Apr 17, 2015 at 02:19:58PM +0000, Liang, Kan wrote:
> > 
> > > > But that brings us to patch 1 of this series, how is that correct in the face of
> > > > this? There is an arbitrary delay (A->B) added to the period.
> > > > And the Changelog of course never did bother to make that clear.
> 
> That's how perf and other profilers always behaved. The PMI
> is not part of the period. The automatic PEBS reload is not in any way
> different. It's much faster than a PMI, but it's also not zero cost.
> 
> This is not a gap in measurement though -- there is no other code
> running during that time on that CPU. It's simply overhead from the
> measurement mechanism.
> 
> > > 
> > > OK. I will update the changelog for patch 1 as below.
> > > ---
> > > When a fixed period is specified, this patch makes perf use the PEBS
> > > auto reload mechanism. This makes normal profiling faster, because
> > > it avoids one costly MSR write in the PMI handler.
> >
> > > However, the reset value will be loaded by the hardware assist, so
> > > there is a slight delay compared to the previous non-auto-reload
> > > mechanism. The delay is arbitrary but very small.
> > 
> > What is very small? And doesn't that mean it's significant at exactly the
> > point this patch series is aimed at, namely very short periods?
> 
> The assist cost is 400-800 cycles, assuming common cases with everything
> cached. The minimum period the patch currently uses is 10000. In that
> extreme case the overhead can be ~10% when cycles are sampled.

Thanks, please include all this information.

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2015-04-17 18:25 UTC | newest]

Thread overview: 25+ messages
2015-04-09 16:37 [PATCH V6 0/6] large PEBS interrupt threshold Kan Liang
2015-04-09 16:37 ` [PATCH V6 1/6] perf, x86: use the PEBS auto reload mechanism when possible Kan Liang
2015-04-09 16:37 ` [PATCH V6 2/6] perf, x86: introduce setup_pebs_sample_data() Kan Liang
2015-04-09 16:37 ` [PATCH V6 3/6] perf, x86: large PEBS interrupt threshold Kan Liang
2015-04-15 17:14   ` Peter Zijlstra
2015-04-15 17:48     ` Liang, Kan
2015-04-15 18:10       ` Peter Zijlstra
2015-04-15 18:05   ` Peter Zijlstra
2015-04-15 18:35     ` Liang, Kan
2015-04-15 18:41       ` Peter Zijlstra
2015-04-16 18:45   ` Peter Zijlstra
2015-04-09 16:37 ` [PATCH V6 4/6] perf, x86: handle multiple records in PEBS buffer Kan Liang
2015-04-09 21:01   ` Andi Kleen
2015-04-15 18:28   ` Peter Zijlstra
2015-04-15 18:36   ` Peter Zijlstra
2015-04-16 12:53   ` Peter Zijlstra
2015-04-17  8:11     ` Peter Zijlstra
2015-04-17 12:50       ` Liang, Kan
2015-04-17 13:12         ` Peter Zijlstra
2015-04-17 14:19           ` Liang, Kan
2015-04-17 14:44             ` Peter Zijlstra
2015-04-17 18:20               ` Andi Kleen
2015-04-17 18:25                 ` Peter Zijlstra
2015-04-09 16:37 ` [PATCH V6 5/6] perf, x86: drain PEBS buffer during context switch Kan Liang
2015-04-09 16:37 ` [PATCH V6 6/6] perf, x86: enlarge PEBS buffer Kan Liang
