* [PATCH V2 0/4] bug fix mmap read and rdpmc read in large PEBS
@ 2017-12-20 19:42 kan.liang
From: kan.liang @ 2017-12-20 19:42 UTC
  To: peterz, mingo, acme, linux-kernel; +Cc: tglx, jolsa, eranian, ak, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

------
Changes since V1:
 - Check PERF_X86_EVENT_AUTO_RELOAD before calling
   intel_pmu_save_and_restart()
 - Introduce a special-purpose intel_pmu_save_and_restart_reload()
   just for AUTO_RELOAD.
 - New patch to disable userspace RDPMC usage if large PEBS is enabled.

------

There is a bug when reading event->count with large PEBS enabled.
Here is an example:
 #./read_count
 0x71f0
 0x122c0
 0x1000000001c54
 0x100000001257d
 0x200000000bdc5

The bug is caused by two issues.
- In x86_perf_event_update(), the calculation of event->count does not
  take the auto-reload values into account.
- In x86_pmu_read(), the values still undrained in the large PEBS
  buffer are not counted.

These issues were introduced with the auto-reload mechanism enabled by
commit 851559e35fd5 ("perf/x86/intel: Use the PEBS auto reload
mechanism when possible").

Also, userspace RDPMC usage is broken for large PEBS.

That issue was introduced when large PEBS was enabled by
commit b8241d20699e ("perf/x86/intel: Implement batched PEBS interrupt
handling (large PEBS interrupt threshold)").

The source code of read_count is as follows.

#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

typedef unsigned long long u64;

static long pagesize;

struct cpu {
        int fd;
        struct perf_event_mmap_page *buf;
};

int perf_open(struct cpu *ctx, int cpu)
{
        struct perf_event_attr attr = {
                .type = PERF_TYPE_HARDWARE,
                .size = sizeof(struct perf_event_attr),
                .sample_period = 100000,
                .config = 0,
                .sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID |
                                PERF_SAMPLE_TIME | PERF_SAMPLE_CPU,
                .precise_ip = 3,
                .mmap = 1,
                .comm = 1,
                .task = 1,
                .mmap2 = 1,
                .sample_id_all = 1,
                .comm_exec = 1,
        };
        ctx->buf = NULL;
        ctx->fd = syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
        if (ctx->fd < 0) {
                perror("perf_event_open");
                return -1;
        }
        return 0;
}

void perf_close(struct cpu *ctx)
{
        close(ctx->fd);
        if (ctx->buf)
                munmap(ctx->buf, pagesize);
}

int main(int ac, char **av)
{
        struct cpu ctx;
        u64 count;

        pagesize = sysconf(_SC_PAGESIZE);
        if (perf_open(&ctx, 0) < 0)
                return 1;

        while (1) {
                sleep(5);

                if (read(ctx.fd, &count, 8) != 8) {
                        perror("counter read");
                        break;
                }
                printf("0x%llx\n", count);

        }
        perf_close(&ctx);
        return 0;
}
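
The example can be built and run as below (assuming the source is
saved as read_count.c; the file name is illustrative):

 # gcc -o read_count read_count.c
 # ./read_count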

Kan Liang (4):
  perf/x86/intel: fix event update for auto-reload
  perf/x86: introduce read function for x86_pmu
  perf/x86/intel: drain PEBS buffer in event read
  perf/x86: fix: disable userspace RDPMC usage for large PEBS

 arch/x86/events/core.c       |  5 ++-
 arch/x86/events/intel/core.c |  9 +++++
 arch/x86/events/intel/ds.c   | 79 ++++++++++++++++++++++++++++++++++++++++++--
 arch/x86/events/perf_event.h |  3 ++
 4 files changed, 93 insertions(+), 3 deletions(-)

-- 
2.7.4

* [PATCH V2 1/4] perf/x86/intel: fix event update for auto-reload
From: kan.liang @ 2017-12-20 19:42 UTC
  To: peterz, mingo, acme, linux-kernel; +Cc: tglx, jolsa, eranian, ak, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

There is a bug when reading event->count with large PEBS enabled.
Here is an example:
 #./read_count
 0x71f0
 0x122c0
 0x1000000001c54
 0x100000001257d
 0x200000000bdc5

The auto-reload mechanism is enabled for PEBS events in fixed-period
mode, but the calculation of event->count does not take the auto-reload
values into account. Anyone reading event->count will get a wrong
result, e.g. x86_pmu_read(). The calculation of hwc->period_left is
wrong as well, which impacts the accuracy of the period for the first
record in a PEBS multi-record batch.

The issue was introduced with the auto-reload mechanism enabled by
commit 851559e35fd5 ("perf/x86/intel: Use the PEBS auto reload mechanism
when possible").

Introduce intel_pmu_save_and_restart_reload(), used only for auto-reload.

The formula to calculate event->count is as below:
event->count = period left from last time +
               (reload_times - 1) * reload_val +
               latency of PMI handler

prev_count is the last observed hardware counter value. Just as in the
non-auto-reload case, its absolute value is the period of the first
record. It must not be updated on each reload, because the hardware
counter is not 'observed' on each auto-reload.

For the second and later records, the period is exactly the reload
value. Simply add (reload_times - 1) * reload_val to event->count.

The calculation of the PMI handler latency differs slightly from the
non-auto-reload case: because the counting start point is -reload_value,
the raw count needs to be adjusted by adding reload_value.
period_left needs the same adjustment.

Nothing needs to be done in x86_perf_event_set_period(), because the
period is fixed and period_left is already adjusted.
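
As an illustrative calculation (the numbers are made up for this
example): assume reload_val = 100000, reload_times = 4, and the
sign-extended counter values prev_count = -60000 and
new_raw_count = -99500. Then:

 period left from last time = -prev_count          =  60000
 full reload periods        = (4 - 1) * 100000     = 300000
 latency of PMI handler     = reload_val + new_raw_count
                            = 100000 + (-99500)    =    500
 event->count grows by        60000 + 300000 + 500 = 360500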

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/intel/ds.c | 69 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 67 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 3674a4b..cc1f373 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1251,17 +1251,82 @@ get_next_pebs_record_by_bit(void *base, void *top, int bit)
 	return NULL;
 }
 
+/*
+ * Specific intel_pmu_save_and_restart() for auto-reload.
+ */
+static int intel_pmu_save_and_restart_reload(struct perf_event *event,
+					     u64 reload_val,
+					     int reload_times)
+{
+	struct hw_perf_event *hwc = &event->hw;
+	int shift = 64 - x86_pmu.cntval_bits;
+	u64 prev_raw_count, new_raw_count;
+	u64 delta;
+
+	if ((reload_times == 0) || (reload_val == 0))
+		return intel_pmu_save_and_restart(event);
+
+	/*
+	 * Careful: an NMI might modify the previous event value.
+	 *
+	 * Our tactic to handle this is to first atomically read and
+	 * exchange a new raw count - then add that new-prev delta
+	 * count to the generic event atomically:
+	 */
+again:
+	prev_raw_count = local64_read(&hwc->prev_count);
+	rdpmcl(hwc->event_base_rdpmc, new_raw_count);
+
+	if (local64_cmpxchg(&hwc->prev_count, prev_raw_count,
+					new_raw_count) != prev_raw_count)
+		goto again;
+
+	/*
+	 * Now we have the new raw value and have updated the prev
+	 * timestamp already. We can now calculate the elapsed delta
+	 * (event-)time and add that to the generic event.
+	 *
+	 * Careful, not all hw sign-extends above the physical width
+	 * of the count.
+	 *
+	 * event->count = period left from last time +
+	 *                (reload_times - 1) * reload_val +
+	 *                latency of PMI handler
+	 * The period left from last time is -prev_count.
+	 * The counting start point is always -reload_val,
+	 * so the real latency of the PMI handler is reload_val + new_raw_count.
+	 */
+	delta = (reload_val << shift) + (new_raw_count << shift) -
+		(prev_raw_count << shift);
+	delta >>= shift;
+
+	local64_add(reload_val * (reload_times - 1), &event->count);
+	local64_add(delta, &event->count);
+	local64_sub(delta, &hwc->period_left);
+
+	return x86_perf_event_set_period(event);
+}
+
 static void __intel_pmu_pebs_event(struct perf_event *event,
 				   struct pt_regs *iregs,
 				   void *base, void *top,
 				   int bit, int count)
 {
+	struct hw_perf_event *hwc = &event->hw;
 	struct perf_sample_data data;
 	struct pt_regs regs;
 	void *at = get_next_pebs_record_by_bit(base, top, bit);
 
-	if (!intel_pmu_save_and_restart(event) &&
-	    !(event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD))
+	if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
+		/*
+		 * Auto-reload is currently only enabled in fixed-period
+		 * mode. The reload value is always hwc->sample_period.
+		 * This may need to change if auto-reload is later
+		 * enabled in freq mode.
+		 */
+		intel_pmu_save_and_restart_reload(event, hwc->sample_period,
+						  count);
+	} else if (!intel_pmu_save_and_restart(event))
 		return;
 
 	while (count > 1) {
-- 
2.7.4

* [PATCH V2 2/4] perf/x86: introduce read function for x86_pmu
From: kan.liang @ 2017-12-20 19:42 UTC
  To: peterz, mingo, acme, linux-kernel; +Cc: tglx, jolsa, eranian, ak, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Large PEBS needs special handling when reading the event count.
Introduce an optional read() callback in struct x86_pmu and use it from
x86_pmu_read() when present.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/core.c       | 2 ++
 arch/x86/events/perf_event.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 140d332..acd7ffc 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1884,6 +1884,8 @@ early_initcall(init_hw_perf_events);
 
 static inline void x86_pmu_read(struct perf_event *event)
 {
+	if (x86_pmu.read)
+		return x86_pmu.read(event);
 	x86_perf_event_update(event);
 }
 
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index f7aaadf..67426e51 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -534,6 +534,7 @@ struct x86_pmu {
 	void		(*disable)(struct perf_event *);
 	void		(*add)(struct perf_event *);
 	void		(*del)(struct perf_event *);
+	void		(*read)(struct perf_event *event);
 	int		(*hw_config)(struct perf_event *event);
 	int		(*schedule_events)(struct cpu_hw_events *cpuc, int n, int *assign);
 	unsigned	eventsel;
-- 
2.7.4

* [PATCH V2 3/4] perf/x86/intel: drain PEBS buffer in event read
From: kan.liang @ 2017-12-20 19:42 UTC
  To: peterz, mingo, acme, linux-kernel; +Cc: tglx, jolsa, eranian, ak, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

When the PEBS interrupt threshold is larger than one, there is no way to
get the exact auto-reload times and values needed for the event update
without flushing the PEBS buffer.

Drain the PEBS buffer on event read when large PEBS is enabled.

When the threshold is one, no special handling is needed even if
auto-reload is enabled, because auto-reload only takes effect on counter
overflow, and there is no overflow in an event read.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/intel/core.c |  9 +++++++++
 arch/x86/events/intel/ds.c   | 10 ++++++++++
 arch/x86/events/perf_event.h |  2 ++
 3 files changed, 21 insertions(+)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 09c26a4..bdc35f8 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2060,6 +2060,14 @@ static void intel_pmu_del_event(struct perf_event *event)
 		intel_pmu_pebs_del(event);
 }
 
+static void intel_pmu_read_event(struct perf_event *event)
+{
+	if (event->attr.precise_ip)
+		return intel_pmu_pebs_read(event);
+
+	x86_perf_event_update(event);
+}
+
 static void intel_pmu_enable_fixed(struct hw_perf_event *hwc)
 {
 	int idx = hwc->idx - INTEL_PMC_IDX_FIXED;
@@ -3495,6 +3503,7 @@ static __initconst const struct x86_pmu intel_pmu = {
 	.disable		= intel_pmu_disable_event,
 	.add			= intel_pmu_add_event,
 	.del			= intel_pmu_del_event,
+	.read			= intel_pmu_read_event,
 	.hw_config		= intel_pmu_hw_config,
 	.schedule_events	= x86_schedule_events,
 	.eventsel		= MSR_ARCH_PERFMON_EVENTSEL0,
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index cc1f373..2027560 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -926,6 +926,16 @@ void intel_pmu_pebs_del(struct perf_event *event)
 	pebs_update_state(needed_cb, cpuc, event->ctx->pmu);
 }
 
+void intel_pmu_pebs_read(struct perf_event *event)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+	if (pebs_needs_sched_cb(cpuc))
+		return intel_pmu_drain_pebs_buffer();
+
+	x86_perf_event_update(event);
+}
+
 void intel_pmu_pebs_disable(struct perf_event *event)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 67426e51..93ec3b4 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -928,6 +928,8 @@ void intel_pmu_pebs_add(struct perf_event *event);
 
 void intel_pmu_pebs_del(struct perf_event *event);
 
+void intel_pmu_pebs_read(struct perf_event *event);
+
 void intel_pmu_pebs_enable(struct perf_event *event);
 
 void intel_pmu_pebs_disable(struct perf_event *event);
-- 
2.7.4

* [PATCH V2 4/4] perf/x86: fix: disable userspace RDPMC usage for large PEBS
From: kan.liang @ 2017-12-20 19:42 UTC
  To: peterz, mingo, acme, linux-kernel; +Cc: tglx, jolsa, eranian, ak, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Userspace RDPMC has never worked with large PEBS since large PEBS was
introduced by
commit b8241d20699e ("perf/x86/intel: Implement batched PEBS interrupt
handling (large PEBS interrupt threshold)").

When the PEBS interrupt threshold is larger than one, there is no way
for userspace RDPMC to get the exact auto-reload times and values.

Disable userspace RDPMC usage when large PEBS is enabled.
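
For illustration, a minimal sketch (not part of this patch) of how
userspace is expected to honor the change, assuming pc points at the
event's mmap'ed perf_event_mmap_page; the helper name is hypothetical:

static int rdpmc_available(volatile struct perf_event_mmap_page *pc)
{
        /*
         * With this patch the kernel leaves cap_user_rdpmc clear for
         * large-PEBS events, so a conforming reader falls back to the
         * read() system call. index == 0 means the event is not
         * currently scheduled on a hardware counter.
         */
        return pc->cap_user_rdpmc && pc->index;
}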

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index acd7ffc..703bd81 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2120,7 +2120,8 @@ static int x86_pmu_event_init(struct perf_event *event)
 			event->destroy(event);
 	}
 
-	if (READ_ONCE(x86_pmu.attr_rdpmc))
+	if (READ_ONCE(x86_pmu.attr_rdpmc) &&
+	    !(event->hw.flags & PERF_X86_EVENT_FREERUNNING))
 		event->hw.flags |= PERF_X86_EVENT_RDPMC_ALLOWED;
 
 	return err;
-- 
2.7.4

* Re: [PATCH V2 4/4] perf/x86: fix: disable userspace RDPMC usage for large PEBS
From: Andi Kleen @ 2017-12-20 21:41 UTC
  To: kan.liang; +Cc: peterz, mingo, acme, linux-kernel, tglx, jolsa, eranian

On Wed, Dec 20, 2017 at 11:42:51AM -0800, kan.liang@linux.intel.com wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
> 
> Userspace RDPMC has never worked with large PEBS since large PEBS was
> introduced by
> commit b8241d20699e ("perf/x86/intel: Implement batched PEBS interrupt
> handling (large PEBS interrupt threshold)").
> 
> When the PEBS interrupt threshold is larger than one, there is no way
> for userspace RDPMC to get the exact auto-reload times and values.
> 
> Disable userspace RDPMC usage when large PEBS is enabled.

The only drawback is that the event wraps quickly. I suspect in many
cases it's still usable for measuring short regions.

I wouldn't force disable it, just let the users deal with it.

-Andi

* Re: [PATCH V2 4/4] perf/x86: fix: disable userspace RDPMC usage for large PEBS
From: Liang, Kan @ 2017-12-20 22:05 UTC
  To: Andi Kleen; +Cc: peterz, mingo, acme, linux-kernel, tglx, jolsa, eranian



On 12/20/2017 4:41 PM, Andi Kleen wrote:
> On Wed, Dec 20, 2017 at 11:42:51AM -0800, kan.liang@linux.intel.com wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> Userspace RDPMC has never worked with large PEBS since large PEBS was
>> introduced by
>> commit b8241d20699e ("perf/x86/intel: Implement batched PEBS interrupt
>> handling (large PEBS interrupt threshold)").
>>
>> When the PEBS interrupt threshold is larger than one, there is no way
>> for userspace RDPMC to get the exact auto-reload times and values.
>>
>> Disable userspace RDPMC usage when large PEBS is enabled.
> 
> The only drawback is that the event wraps quickly. 

It's broken. The value read via RDPMC will always be between
-reload_val and 0. Without the reload times, I think it's impossible to
calculate the total count.

> I suspect in many
> cases it's still usable for measuring short regions.
>

Only if the short region is shorter than reload_val.
Otherwise, the reload times are needed.

Thanks,
Kan

> I wouldn't force disable it, just let the users deal with it.
> 
> -Andi
> 
