linux-kernel.vger.kernel.org archive mirror
* [PATCH V3 0/5] bugs fix for large PEBS mmap read and rdpmc read
@ 2018-01-29 16:29 kan.liang
  2018-01-29 16:29 ` [PATCH V3 1/5] perf/x86/intel: fix event update for auto-reload kan.liang
                   ` (5 more replies)
  0 siblings, 6 replies; 26+ messages in thread
From: kan.liang @ 2018-01-29 16:29 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel; +Cc: acme, tglx, jolsa, eranian, ak, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

------

Changes since V2:
 - Refined the changelog
 - Introduced specific read function for large PEBS.
   The previous generic PEBS read function is confusing.
   Disabled PMU in pmu::read() path for large PEBS.
   Handled the corner case when reload_times == 0.
 - Modified the parameter of intel_pmu_save_and_restart_reload()
   Discarded local64_cmpxchg
 - Added fixes tag
 - Added WARN to handle reload_times == 0 || reload_val == 0

Changes since V1:
 - Check PERF_X86_EVENT_AUTO_RELOAD before call
   intel_pmu_save_and_restore()
 - Introduce a special purpose intel_pmu_save_and_restart()
   just for AUTO_RELOAD.
 - New patch to disable userspace RDPMC usage if large PEBS is enabled.

------

There is a bug when reading event->count while large PEBS is enabled.
Here is an example.
 #./read_count
 0x71f0
 0x122c0
 0x1000000001c54
 0x100000001257d
 0x200000000bdc5

The bug is caused by two issues.
- In x86_perf_event_update, the calculation of event->count does not
  take the auto-reload values into account.
- x86_pmu_read does not count the undrained values in the large PEBS
  buffer.

The first issue was introduced when the auto-reload mechanism was
enabled by commit 851559e35fd5 ("perf/x86/intel: Use the PEBS auto
reload mechanism when possible").

Patch 1 fixes the issue in x86_perf_event_update.

The second issue was introduced by commit b8241d20699e
("perf/x86/intel: Implement batched PEBS interrupt handling (large PEBS
interrupt threshold)").

Patches 2-4 fix the issue in x86_pmu_read.

Besides these two issues, userspace RDPMC is broken for large PEBS as
well.
The RDPMC issue was also introduced by commit b8241d20699e
("perf/x86/intel: Implement batched PEBS interrupt handling (large PEBS
interrupt threshold)").

Patch 5 fixes the RDPMC issue.

The source code of read_count is as below.

#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

typedef unsigned long long u64;

static int pagesize;

struct cpu {
        int fd;
        struct perf_event_mmap_page *buf;
};

int perf_open(struct cpu *ctx, int cpu)
{
        struct perf_event_attr attr = {
                .type = PERF_TYPE_HARDWARE,
                .size = sizeof(struct perf_event_attr),
                .sample_period = 100000,
                .config = 0,
                .sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID |
                                PERF_SAMPLE_TIME | PERF_SAMPLE_CPU,
                .precise_ip = 3,
                .mmap = 1,
                .comm = 1,
                .task = 1,
                .mmap2 = 1,
                .sample_id_all = 1,
                .comm_exec = 1,
        };
        ctx->buf = NULL;
        ctx->fd = syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
        if (ctx->fd < 0) {
                perror("perf_event_open");
                return -1;
        }
        return 0;
}

void perf_close(struct cpu *ctx)
{
        close(ctx->fd);
        if (ctx->buf)
                munmap(ctx->buf, pagesize);
}

int main(int ac, char **av)
{
        struct cpu ctx;
        u64 count;

        perf_open(&ctx, 0);

        while (1) {
                sleep(5);

                if (read(ctx.fd, &count, 8) != 8) {
                        perror("counter read");
                        break;
                }
                printf("0x%llx\n", count);

        }
        perf_close(&ctx);
}

Kan Liang (5):
  perf/x86/intel: fix event update for auto-reload
  perf/x86: introduce read function for x86_pmu
  perf/x86/intel/ds: introduce read function for large pebs
  perf/x86/intel: introduce read function for intel_pmu
  perf/x86: fix: disable userspace RDPMC usage for large PEBS

 arch/x86/events/core.c       |  5 ++-
 arch/x86/events/intel/core.c |  9 +++++
 arch/x86/events/intel/ds.c   | 85 ++++++++++++++++++++++++++++++++++++++++++--
 arch/x86/events/perf_event.h |  3 ++
 4 files changed, 99 insertions(+), 3 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH V3 1/5] perf/x86/intel: fix event update for auto-reload
  2018-01-29 16:29 [PATCH V3 0/5] bugs fix for large PEBS mmap read and rdpmc read kan.liang
@ 2018-01-29 16:29 ` kan.liang
  2018-02-06 15:06   ` Peter Zijlstra
  2018-01-29 16:29 ` [PATCH V3 2/5] perf/x86: introduce read function for x86_pmu kan.liang
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: kan.liang @ 2018-01-29 16:29 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel; +Cc: acme, tglx, jolsa, eranian, ak, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

There is a bug when reading event->count while large PEBS is enabled.
Here is an example.
 #./read_count
 0x71f0
 0x122c0
 0x1000000001c54
 0x100000001257d
 0x200000000bdc5

In fixed period mode, the auto-reload mechanism can be enabled for
PEBS events, but the calculation of event->count does not take the
auto-reload values into account. Anyone who reads event->count will get
a wrong result, e.g. x86_pmu_read. The calculation of hwc->period_left
is also wrong, which affects the calculated period of the first record
in a PEBS multiple-record batch.

The issue was introduced when the auto-reload mechanism was enabled by
commit 851559e35fd5 ("perf/x86/intel: Use the PEBS auto reload mechanism
when possible").

Introduce intel_pmu_save_and_restart_reload() to calculate event->count
for the auto-reload case only.

There is a small gap between 'PEBS hardware is armed' and 'the NMI is
handled'. Because of the gap, the first record also needs special
handling.
The formula to calculate the increment of event->count is as below.
The increment = the period of the first record +
                (reload_times - 1) * reload_val +
                the gap
- 'The period of the first record' is the period left from the last
  PMI, which can be obtained from the previous event value.
- For the second and later records, the period is exactly the reload
  value. Simply add (reload_times - 1) * reload_val.
- Because of the auto-reload, the start point of counting is always
  (-reload_val). So the calculation of 'the gap' needs to be corrected
  by adding reload_val.
  The period_left needs the same adjustment as well.

Nothing needs to be done in x86_perf_event_set_period(), because this
is fixed period mode and period_left has already been adjusted.

Fixes: 851559e35fd5 ("perf/x86/intel: Use the PEBS auto reload mechanism
when possible")
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/intel/ds.c | 69 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 67 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 8156e47..6533426 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1303,17 +1303,82 @@ get_next_pebs_record_by_bit(void *base, void *top, int bit)
 	return NULL;
 }
 
+/*
+ * Specific intel_pmu_save_and_restart() for auto-reload.
+ * It is only called from drain_pebs().
+ */
+static int intel_pmu_save_and_restart_reload(struct perf_event *event,
+					     int reload_times)
+{
+	struct hw_perf_event *hwc = &event->hw;
+	int shift = 64 - x86_pmu.cntval_bits;
+	u64 reload_val = hwc->sample_period;
+	u64 prev_raw_count, new_raw_count;
+	u64 delta;
+
+	WARN_ON((reload_times == 0) || (reload_val == 0));
+
+	/*
+	 * drain_pebs() only happens when the PMU is disabled.
+	 * It does not need to specially handle the previous event value.
+	 * The hwc->prev_count will be updated in x86_perf_event_set_period().
+	 */
+	prev_raw_count = local64_read(&hwc->prev_count);
+	rdpmcl(hwc->event_base_rdpmc, new_raw_count);
+
+	/*
+	 * Now we have the new raw value and have updated the prev
+	 * timestamp already. We can now calculate the elapsed delta
+	 * (event-)time and add that to the generic event.
+	 *
+	 * Careful, not all hw sign-extends above the physical width
+	 * of the count.
+	 *
+	 * There is a small gap between 'PEBS hardware is armed' and 'the NMI
+	 * is handled'. Because of the gap, the first record also needs to be
+	 * specially handled.
+	 * The formula to calculate the increments of event->count is as below.
+	 * The increments = the period for first record +
+	 *                  (reload_times - 1) * reload_val +
+	 *                  the gap
+	 * 'The period for first record' can be obtained from -prev_raw_count.
+	 *
+	 * 'The gap' = new_raw_count + reload_val. Because the start point of
+	 * counting is always -reload_val for auto-reload.
+	 *
+	 * The period_left needs to do the same adjustment by adding
+	 * reload_val.
+	 */
+	delta = (reload_val << shift) + (new_raw_count << shift) -
+		(prev_raw_count << shift);
+	delta >>= shift;
+
+	local64_add(reload_val * (reload_times - 1), &event->count);
+	local64_add(delta, &event->count);
+	local64_sub(delta, &hwc->period_left);
+
+	return x86_perf_event_set_period(event);
+}
+
 static void __intel_pmu_pebs_event(struct perf_event *event,
 				   struct pt_regs *iregs,
 				   void *base, void *top,
 				   int bit, int count)
 {
+	struct hw_perf_event *hwc = &event->hw;
 	struct perf_sample_data data;
 	struct pt_regs regs;
 	void *at = get_next_pebs_record_by_bit(base, top, bit);
 
-	if (!intel_pmu_save_and_restart(event) &&
-	    !(event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD))
+	if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
+		/*
+		 * Now, auto-reload is only enabled in fixed period mode.
+		 * The reload value is always hwc->sample_period.
+		 * May need to change it, if auto-reload is enabled in
+		 * freq mode later.
+		 */
+		intel_pmu_save_and_restart_reload(event, count);
+	} else if (!intel_pmu_save_and_restart(event))
 		return;
 
 	while (count > 1) {
-- 
2.7.4


* [PATCH V3 2/5] perf/x86: introduce read function for x86_pmu
  2018-01-29 16:29 [PATCH V3 0/5] bugs fix for large PEBS mmap read and rdpmc read kan.liang
  2018-01-29 16:29 ` [PATCH V3 1/5] perf/x86/intel: fix event update for auto-reload kan.liang
@ 2018-01-29 16:29 ` kan.liang
  2018-01-29 16:29 ` [PATCH V3 3/5] perf/x86/intel/ds: introduce read function for large pebs kan.liang
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 26+ messages in thread
From: kan.liang @ 2018-01-29 16:29 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel; +Cc: acme, tglx, jolsa, eranian, ak, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Large PEBS needs special handling in the event count read path.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/core.c       | 2 ++
 arch/x86/events/perf_event.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 140d332..acd7ffc 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1884,6 +1884,8 @@ early_initcall(init_hw_perf_events);
 
 static inline void x86_pmu_read(struct perf_event *event)
 {
+	if (x86_pmu.read)
+		return x86_pmu.read(event);
 	x86_perf_event_update(event);
 }
 
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 8e4ea143..805400b 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -519,6 +519,7 @@ struct x86_pmu {
 	void		(*disable)(struct perf_event *);
 	void		(*add)(struct perf_event *);
 	void		(*del)(struct perf_event *);
+	void		(*read)(struct perf_event *event);
 	int		(*hw_config)(struct perf_event *event);
 	int		(*schedule_events)(struct cpu_hw_events *cpuc, int n, int *assign);
 	unsigned	eventsel;
-- 
2.7.4


* [PATCH V3 3/5] perf/x86/intel/ds: introduce read function for large pebs
  2018-01-29 16:29 [PATCH V3 0/5] bugs fix for large PEBS mmap read and rdpmc read kan.liang
  2018-01-29 16:29 ` [PATCH V3 1/5] perf/x86/intel: fix event update for auto-reload kan.liang
  2018-01-29 16:29 ` [PATCH V3 2/5] perf/x86: introduce read function for x86_pmu kan.liang
@ 2018-01-29 16:29 ` kan.liang
  2018-01-29 16:29 ` [PATCH V3 4/5] perf/x86/intel: fix pmu read for large PEBS kan.liang
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 26+ messages in thread
From: kan.liang @ 2018-01-29 16:29 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel; +Cc: acme, tglx, jolsa, eranian, ak, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

When the PEBS interrupt threshold is larger than one, there is no way
to get the exact auto-reload times and values needed for the event
update, unless the PEBS buffer is flushed.

Introduce intel_pmu_large_pebs_read() to drain the PEBS buffer on
event read when large PEBS is enabled.
To prevent races, drain_pebs() is only called while the PMU is
disabled.

Unconditionally call x86_perf_event_update() for large PEBS:
- pmu::read() can easily be called twice in a short period, in which
  case there may be no samples in the PEBS buffer, so
  x86_perf_event_update() is needed to update the count.
- Calling x86_perf_event_update() is harmless in the other cases.
- It is safe. Auto-reload is not a concern, because the PMU is
  disabled.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/intel/ds.c   | 16 ++++++++++++++++
 arch/x86/events/perf_event.h |  2 ++
 2 files changed, 18 insertions(+)

diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 6533426..1c11fa2 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1303,6 +1303,22 @@ get_next_pebs_record_by_bit(void *base, void *top, int bit)
 	return NULL;
 }
 
+int intel_pmu_large_pebs_read(struct perf_event *event)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+	/* Check if the event has large pebs */
+	if (!pebs_needs_sched_cb(cpuc))
+		return 0;
+
+	perf_pmu_disable(event->pmu);
+	intel_pmu_drain_pebs_buffer();
+	x86_perf_event_update(event);
+	perf_pmu_enable(event->pmu);
+
+	return 1;
+}
+
 /*
  * Specific intel_pmu_save_and_restart() for auto-reload.
  * It is only called from drain_pebs().
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 805400b..7d3cd32 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -923,6 +923,8 @@ void intel_pmu_pebs_disable_all(void);
 
 void intel_pmu_pebs_sched_task(struct perf_event_context *ctx, bool sched_in);
 
+int intel_pmu_large_pebs_read(struct perf_event *event);
+
 void intel_ds_init(void);
 
 void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in);
-- 
2.7.4


* [PATCH V3 4/5] perf/x86/intel: fix pmu read for large PEBS
  2018-01-29 16:29 [PATCH V3 0/5] bugs fix for large PEBS mmap read and rdpmc read kan.liang
                   ` (2 preceding siblings ...)
  2018-01-29 16:29 ` [PATCH V3 3/5] perf/x86/intel/ds: introduce read function for large pebs kan.liang
@ 2018-01-29 16:29 ` kan.liang
  2018-01-29 16:29 ` [PATCH V3 5/5] perf/x86: fix: disable userspace RDPMC usage " kan.liang
  2018-01-30  9:16 ` [PATCH V3 0/5] bugs fix for large PEBS mmap read and rdpmc read Stephane Eranian
  5 siblings, 0 replies; 26+ messages in thread
From: kan.liang @ 2018-01-29 16:29 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel; +Cc: acme, tglx, jolsa, eranian, ak, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Large PEBS needs special handling in the event count read path.

It is only available for intel_pmu.

Only large PEBS needs special handling. When the threshold is one, no
special handling is needed even if auto-reload is enabled, because:
- auto-reload only takes effect on event overflow, and once an overflow
  occurs the NMI handler automatically calls drain_pebs();
- for all other cases, x86_perf_event_update() is good enough.

Fixes: b8241d20699e ("perf/x86/intel: Implement batched PEBS interrupt
handling (large PEBS interrupt threshold)")
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/intel/core.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 731153a..1610a9d 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2060,6 +2060,14 @@ static void intel_pmu_del_event(struct perf_event *event)
 		intel_pmu_pebs_del(event);
 }
 
+static void intel_pmu_read_event(struct perf_event *event)
+{
+	if (intel_pmu_large_pebs_read(event))
+		return;
+
+	x86_perf_event_update(event);
+}
+
 static void intel_pmu_enable_fixed(struct hw_perf_event *hwc)
 {
 	int idx = hwc->idx - INTEL_PMC_IDX_FIXED;
@@ -3495,6 +3503,7 @@ static __initconst const struct x86_pmu intel_pmu = {
 	.disable		= intel_pmu_disable_event,
 	.add			= intel_pmu_add_event,
 	.del			= intel_pmu_del_event,
+	.read			= intel_pmu_read_event,
 	.hw_config		= intel_pmu_hw_config,
 	.schedule_events	= x86_schedule_events,
 	.eventsel		= MSR_ARCH_PERFMON_EVENTSEL0,
-- 
2.7.4


* [PATCH V3 5/5] perf/x86: fix: disable userspace RDPMC usage for large PEBS
  2018-01-29 16:29 [PATCH V3 0/5] bugs fix for large PEBS mmap read and rdpmc read kan.liang
                   ` (3 preceding siblings ...)
  2018-01-29 16:29 ` [PATCH V3 4/5] perf/x86/intel: fix pmu read for large PEBS kan.liang
@ 2018-01-29 16:29 ` kan.liang
  2018-01-30  9:16 ` [PATCH V3 0/5] bugs fix for large PEBS mmap read and rdpmc read Stephane Eranian
  5 siblings, 0 replies; 26+ messages in thread
From: kan.liang @ 2018-01-29 16:29 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel; +Cc: acme, tglx, jolsa, eranian, ak, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Userspace RDPMC has never worked for large PEBS since large PEBS was
introduced by commit b8241d20699e ("perf/x86/intel: Implement batched
PEBS interrupt handling (large PEBS interrupt threshold)").

When the PEBS interrupt threshold is larger than one, there is no way
to get the exact auto-reload times and values for userspace RDPMC.

Disable the userspace RDPMC usage when large PEBS is enabled.

Fixes: b8241d20699e ("perf/x86/intel: Implement batched PEBS interrupt
handling (large PEBS interrupt threshold)")
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index acd7ffc..703bd81 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2120,7 +2120,8 @@ static int x86_pmu_event_init(struct perf_event *event)
 			event->destroy(event);
 	}
 
-	if (READ_ONCE(x86_pmu.attr_rdpmc))
+	if (READ_ONCE(x86_pmu.attr_rdpmc) &&
+	    !(event->hw.flags & PERF_X86_EVENT_FREERUNNING))
 		event->hw.flags |= PERF_X86_EVENT_RDPMC_ALLOWED;
 
 	return err;
-- 
2.7.4


* Re: [PATCH V3 0/5] bugs fix for large PEBS mmap read and rdpmc read
  2018-01-29 16:29 [PATCH V3 0/5] bugs fix for large PEBS mmap read and rdpmc read kan.liang
                   ` (4 preceding siblings ...)
  2018-01-29 16:29 ` [PATCH V3 5/5] perf/x86: fix: disable userspace RDPMC usage " kan.liang
@ 2018-01-30  9:16 ` Stephane Eranian
  2018-01-30 13:39   ` Jiri Olsa
  2018-01-30 14:41   ` Liang, Kan
  5 siblings, 2 replies; 26+ messages in thread
From: Stephane Eranian @ 2018-01-30  9:16 UTC (permalink / raw)
  To: kan.liang
  Cc: Peter Zijlstra, Ingo Molnar, LKML, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Jiri Olsa, Andi Kleen

Hi,

On Mon, Jan 29, 2018 at 8:29 AM, <kan.liang@linux.intel.com> wrote:
>
> From: Kan Liang <kan.liang@linux.intel.com>
>
> ------
>
> Changes since V2:
>  - Refined the changelog
>  - Introduced specific read function for large PEBS.
>    The previous generic PEBS read function is confusing.
>    Disabled PMU in pmu::read() path for large PEBS.
>    Handled the corner case when reload_times == 0.
>  - Modified the parameter of intel_pmu_save_and_restart_reload()
>    Discarded local64_cmpxchg
>  - Added fixes tag
>  - Added WARN to handle reload_times == 0 || reload_val == 0
>
> Changes since V1:
>  - Check PERF_X86_EVENT_AUTO_RELOAD before call
>    intel_pmu_save_and_restore()

It is not yet clear to me why PERF_SAMPLE_PERIOD is not allowed
with large PEBS. Large PEBS requires a fixed period, so the kernel
could reconstruct the period from the event and store it in the
sampling buffer.

I tried using large PEBS recently, and despite trying different option
combinations with perf record, I was not able to get it to work.

$ perf record  -c 1  -e cpu/event=0xd1,umask=0x10,period=19936/pp
--no-timestamp --no-period -a -C 0

But I was able to make this work with a much older kernel.

Another annoyance I ran into is that perf record requires -c period in
order not to set PERF_SAMPLE_PERIOD in the event.

If I do:
perf record  -c 1  -e cpu/event=0xd1,umask=0x10,period=19936/pp
--no-timestamp --no-period -a -C 0

I get

perf_event_attr:
  type                             4
  size                             112
  config                           0x10d1
  { sample_period, sample_freq }   199936
  sample_type                      IP|TID|CPU

But if I do:
perf record  -e cpu/event=0xd1,umask=0x10,period=19936/pp
--no-timestamp --no-period -a -C 0

I get

perf_event_attr:
  type                             4
  size                             112
  config                           0x10d1
  { sample_period, sample_freq }   199936
  sample_type                      IP|TID|CPU|PERIOD

Perf should check whether all events have a period= term; if so, it
should not set PERF_SAMPLE_PERIOD, even more so when only one event is
defined.

Also it does not seem to honor --no-period.

All of this makes it hard to use large PEBS in general.

So I think your series should also address this part.

>  - Introduce a special purpose intel_pmu_save_and_restart()
>    just for AUTO_RELOAD.
>  - New patch to disable userspace RDPMC usage if large PEBS is enabled.
>
> ------
>
> There is a bug when mmap read event->count with large PEBS enabled.
> Here is an example.
>  #./read_count
>  0x71f0
>  0x122c0
>  0x1000000001c54
>  0x100000001257d
>  0x200000000bdc5
>
> The bug is caused by two issues.
> - In x86_perf_event_update, the calculation of event->count does not
>   take the auto-reload values into account.
> - In x86_pmu_read, it doesn't count the undrained values in large PEBS
>   buffers.
>
> The first issue was introduced with the auto-reload mechanism enabled
> since commit 851559e35fd5 ("perf/x86/intel: Use the PEBS auto reload
> mechanism when possible")
>
> Patch 1 fixed the issue in x86_perf_event_update.
>
> The second issue was introduced since commit b8241d20699e
> ("perf/x86/intel: Implement batched PEBS interrupt handling (large PEBS
> interrupt threshold)")
>
> Patch 2-4 fixed the issue in x86_pmu_read.
>
> Besides the two issues, the userspace RDPMC usage is broken for large
> PEBS as well.
> The RDPMC issue was also introduced since commit b8241d20699e
> ("perf/x86/intel: Implement batched PEBS interrupt handling (large PEBS
> interrupt threshold)")
>
> Patch 5 fixed the RDPMC issue.
>
> The source code of read_count is as below.
>
> struct cpu {
>         int fd;
>         struct perf_event_mmap_page *buf;
> };
>
> int perf_open(struct cpu *ctx, int cpu)
> {
>         struct perf_event_attr attr = {
>                 .type = PERF_TYPE_HARDWARE,
>                 .size = sizeof(struct perf_event_attr),
>                 .sample_period = 100000,
>                 .config = 0,
>                 .sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID |
>                                 PERF_SAMPLE_TIME | PERF_SAMPLE_CPU,
>                 .precise_ip = 3,
>                 .mmap = 1,
>                 .comm = 1,
>                 .task = 1,
>                 .mmap2 = 1,
>                 .sample_id_all = 1,
>                 .comm_exec = 1,
>         };
>         ctx->buf = NULL;
>         ctx->fd = syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
>         if (ctx->fd < 0) {
>                 perror("perf_event_open");
>                 return -1;
>         }
>         return 0;
> }
>
> void perf_close(struct cpu *ctx)
> {
>         close(ctx->fd);
>         if (ctx->buf)
>                 munmap(ctx->buf, pagesize);
> }
>
> int main(int ac, char **av)
> {
>         struct cpu ctx;
>         u64 count;
>
>         perf_open(&ctx, 0);
>
>         while (1) {
>                 sleep(5);
>
>                 if (read(ctx.fd, &count, 8) != 8) {
>                         perror("counter read");
>                         break;
>                 }
>                 printf("0x%llx\n", count);
>
>         }
>         perf_close(&ctx);
> }
>
> Kan Liang (5):
>   perf/x86/intel: fix event update for auto-reload
>   perf/x86: introduce read function for x86_pmu
>   perf/x86/intel/ds: introduce read function for large pebs
>   perf/x86/intel: introduce read function for intel_pmu
>   perf/x86: fix: disable userspace RDPMC usage for large PEBS
>
>  arch/x86/events/core.c       |  5 ++-
>  arch/x86/events/intel/core.c |  9 +++++
>  arch/x86/events/intel/ds.c   | 85 ++++++++++++++++++++++++++++++++++++++++++--
>  arch/x86/events/perf_event.h |  3 ++
>  4 files changed, 99 insertions(+), 3 deletions(-)
>
> --
> 2.7.4
>


* Re: [PATCH V3 0/5] bugs fix for large PEBS mmap read and rdpmc read
  2018-01-30  9:16 ` [PATCH V3 0/5] bugs fix for large PEBS mmap read and rdpmc read Stephane Eranian
@ 2018-01-30 13:39   ` Jiri Olsa
  2018-01-30 14:59     ` Liang, Kan
  2018-01-30 14:41   ` Liang, Kan
  1 sibling, 1 reply; 26+ messages in thread
From: Jiri Olsa @ 2018-01-30 13:39 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: kan.liang, Peter Zijlstra, Ingo Molnar, LKML,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Andi Kleen

On Tue, Jan 30, 2018 at 01:16:39AM -0800, Stephane Eranian wrote:
> Hi,
> 
> On Mon, Jan 29, 2018 at 8:29 AM, <kan.liang@linux.intel.com> wrote:
> >
> > From: Kan Liang <kan.liang@linux.intel.com>
> >
> > ------
> >
> > Changes since V2:
> >  - Refined the changelog
> >  - Introduced specific read function for large PEBS.
> >    The previous generic PEBS read function is confusing.
> >    Disabled PMU in pmu::read() path for large PEBS.
> >    Handled the corner case when reload_times == 0.
> >  - Modified the parameter of intel_pmu_save_and_restart_reload()
> >    Discarded local64_cmpxchg
> >  - Added fixes tag
> >  - Added WARN to handle reload_times == 0 || reload_val == 0
> >
> > Changes since V1:
> >  - Check PERF_X86_EVENT_AUTO_RELOAD before call
> >    intel_pmu_save_and_restore()
> 
> It is not yet clear to me why PERF_SAMPLE_PERIOD is not allowed
> with large PEBS. Large PEBS requires fixed period. So the kernel could
> make up the period from the event and store it in the sampling buffer.
> 
> I tried using large PEBS recently, and despite trying different option
> combination of perf record, I was not able to get it to work.
> 
> $ perf record  -c 1  -e cpu/event=0xd1,umask=0x10,period=19936/pp
> --no-timestamp --no-period -a -C 0
> 
> But I was able to make this work with a much older kernel.
> 
> Another annoyance I ran into is with perf record requiring -c period
> in order not to set
> PERF_SAMPLE_PERIOD in the event.
> 
> If I do:
> perf record  -c 1  -e cpu/event=0xd1,umask=0x10,period=19936/pp
> --no-timestamp --no-period -a -C 0
> 
> I get
> 
> perf_event_attr:
>   type                             4
>   size                             112
>   config                           0x10d1
>   { sample_period, sample_freq }   199936
>   sample_type                      IP|TID|CPU
> 
> But if I do:
> perf record  -e cpu/event=0xd1,umask=0x10,period=19936/pp
> --no-timestamp --no-period -a -C 0
> 
> I get
> 
> perf_event_attr:
>   type                             4
>   size                             112
>   config                           0x10d1
>   { sample_period, sample_freq }   199936
>   sample_type                      IP|TID|CPU|PERIOD
> 
> Perf should check if all events have a period=, then it should not
> pass PERF_SAMPLE_PERIOD, even
> more so when only one event is defined.
> 
> Also it does not seem to honor --no-period.

yep, there's a bug in the period=x term handling:
we did not re/set the sample_type based on it

attached patch fixes that for me, also takes into account
the --no/-period options

jirka


---
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index f251e824edac..907267206973 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -1566,7 +1566,8 @@ static struct option __record_options[] = {
 	OPT_BOOLEAN_SET('T', "timestamp", &record.opts.sample_time,
 			&record.opts.sample_time_set,
 			"Record the sample timestamps"),
-	OPT_BOOLEAN('P', "period", &record.opts.period, "Record the sample period"),
+	OPT_BOOLEAN_SET('P', "period", &record.opts.period, &record.opts.period_set,
+			"Record the sample period"),
 	OPT_BOOLEAN('n', "no-samples", &record.opts.no_samples,
 		    "don't sample"),
 	OPT_BOOLEAN_SET('N', "no-buildid-cache", &record.no_buildid_cache,
diff --git a/tools/perf/perf.h b/tools/perf/perf.h
index 2357f4ccc9c7..cfe46236a5e5 100644
--- a/tools/perf/perf.h
+++ b/tools/perf/perf.h
@@ -50,6 +50,7 @@ struct record_opts {
 	bool	     sample_time_set;
 	bool	     sample_cpu;
 	bool	     period;
+	bool	     period_set;
 	bool	     running_time;
 	bool	     full_auxtrace;
 	bool	     auxtrace_snapshot_mode;
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 66fa45198a11..ff359c9ece2e 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -745,12 +745,14 @@ static void apply_config_terms(struct perf_evsel *evsel,
 			if (!(term->weak && opts->user_interval != ULLONG_MAX)) {
 				attr->sample_period = term->val.period;
 				attr->freq = 0;
+				perf_evsel__reset_sample_bit(evsel, PERIOD);
 			}
 			break;
 		case PERF_EVSEL__CONFIG_TERM_FREQ:
 			if (!(term->weak && opts->user_freq != UINT_MAX)) {
 				attr->sample_freq = term->val.freq;
 				attr->freq = 1;
+				perf_evsel__set_sample_bit(evsel, PERIOD);
 			}
 			break;
 		case PERF_EVSEL__CONFIG_TERM_TIME:
@@ -969,9 +971,6 @@ void perf_evsel__config(struct perf_evsel *evsel, struct record_opts *opts,
 	if (target__has_cpu(&opts->target) || opts->sample_cpu)
 		perf_evsel__set_sample_bit(evsel, CPU);
 
-	if (opts->period)
-		perf_evsel__set_sample_bit(evsel, PERIOD);
-
 	/*
 	 * When the user explicitly disabled time don't force it here.
 	 */
@@ -1073,6 +1072,14 @@ void perf_evsel__config(struct perf_evsel *evsel, struct record_opts *opts,
 	apply_config_terms(evsel, opts, track);
 
 	evsel->ignore_missing_thread = opts->ignore_missing_thread;
+
+	/* The --period option takes the precedence. */
+	if (opts->period_set) {
+		if (opts->period)
+			perf_evsel__set_sample_bit(evsel, PERIOD);
+		else
+			perf_evsel__reset_sample_bit(evsel, PERIOD);
+	}
 }
 
 static int perf_evsel__alloc_fd(struct perf_evsel *evsel, int ncpus, int nthreads)


* Re: [PATCH V3 0/5] bugs fix for large PEBS mmap read and rdpmc read
  2018-01-30  9:16 ` [PATCH V3 0/5] bugs fix for large PEBS mmap read and rdpmc read Stephane Eranian
  2018-01-30 13:39   ` Jiri Olsa
@ 2018-01-30 14:41   ` Liang, Kan
  1 sibling, 0 replies; 26+ messages in thread
From: Liang, Kan @ 2018-01-30 14:41 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Peter Zijlstra, Ingo Molnar, LKML, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Jiri Olsa, Andi Kleen



On 1/30/2018 4:16 AM, Stephane Eranian wrote:
> Hi,
> 
> On Mon, Jan 29, 2018 at 8:29 AM, <kan.liang@linux.intel.com> wrote:
>>
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> ------
>>
>> Changes since V2:
>>   - Refined the changelog
>>   - Introduced specific read function for large PEBS.
>>     The previous generic PEBS read function is confusing.
>>     Disabled PMU in pmu::read() path for large PEBS.
>>     Handled the corner case when reload_times == 0.
>>   - Modified the parameter of intel_pmu_save_and_restart_reload()
>>     Discarded local64_cmpxchg
>>   - Added fixes tag
>>   - Added WARN to handle reload_times == 0 || reload_val == 0
>>
>> Changes since V1:
>>   - Check PERF_X86_EVENT_AUTO_RELOAD before call
>>     intel_pmu_save_and_restore()
> 
> It is not yet clear to me why PERF_SAMPLE_PERIOD is not allowed
> with large PEBS. Large PEBS requires fixed period. So the kernel could
> make up the period from the event and store it in the sampling buffer.

PERF_SAMPLE_PERIOD is implicitly set in freq mode.
For now, large PEBS doesn't support freq mode.

> 
> I tried using large PEBS recently, and despite trying different option
> combination of perf record, I was not able to get it to work.
> 
> $ perf record  -c 1  -e cpu/event=0xd1,umask=0x10,period=19936/pp
> --no-timestamp --no-period -a -C 0
> 
> But I was able to make this work with a much older kernel.

Is there any error message?

> 
> Another annoyance I ran into is with perf record requiring -c period
> in order not to set
> PERF_SAMPLE_PERIOD in the event.
> 
> If I do:
> perf record  -c 1  -e cpu/event=0xd1,umask=0x10,period=19936/pp
> --no-timestamp --no-period -a -C 0
> 
> I get
> 
> perf_event_attr:
>    type                             4
>    size                             112
>    config                           0x10d1
>    { sample_period, sample_freq }   199936
>    sample_type                      IP|TID|CPU
> 
> But if I do:
> perf record  -e cpu/event=0xd1,umask=0x10,period=19936/pp
> --no-timestamp --no-period -a -C 0
> 
> I get
> 
> perf_event_attr:
>    type                             4
>    size                             112
>    config                           0x10d1
>    { sample_period, sample_freq }   199936
>    sample_type                      IP|TID|CPU|PERIOD
> 
> Perf should check if all events have a period=, then it should not
> pass PERF_SAMPLE_PERIOD, even
> more so when only one event is defined.

As I understand it, "period=" is a per-event term. It should not 
change the global setting.
I think it's better to still use "--no-period" to control the PERIOD bit.

> 
> Also it does not seem to honor --no-period.
>

Right, the perf tool doesn't clear PERIOD for --no-period.
	if (opts->period)
		perf_evsel__set_sample_bit(evsel, PERIOD);

I will submit a patch to fix it.

> All of this makes it hard to use large PEBS in general.
> 
> So I think your series should also address this part.

The issue looks like a perf tool problem. I will take a look at it, but 
will probably address it in a separate patch set.

Thanks,
Kan
> 
>>   - Introduce a special purpose intel_pmu_save_and_restart()
>>     just for AUTO_RELOAD.
>>   - New patch to disable userspace RDPMC usage if large PEBS is enabled.
>>
>> ------
>>
>> There is a bug when mmap read event->count with large PEBS enabled.
>> Here is an example.
>>   #./read_count
>>   0x71f0
>>   0x122c0
>>   0x1000000001c54
>>   0x100000001257d
>>   0x200000000bdc5
>>
>> The bug is caused by two issues.
>> - In x86_perf_event_update, the calculation of event->count does not
>>    take the auto-reload values into account.
>> - In x86_pmu_read, it doesn't count the undrained values in large PEBS
>>    buffers.
>>
>> The first issue was introduced with the auto-reload mechanism enabled
>> since commit 851559e35fd5 ("perf/x86/intel: Use the PEBS auto reload
>> mechanism when possible")
>>
>> Patch 1 fixed the issue in x86_perf_event_update.
>>
>> The second issue was introduced since commit b8241d20699e
>> ("perf/x86/intel: Implement batched PEBS interrupt handling (large PEBS
>> interrupt threshold)")
>>
>> Patch 2-4 fixed the issue in x86_pmu_read.
>>
>> Besides the two issues, the userspace RDPMC usage is broken for large
>> PEBS as well.
>> The RDPMC issue was also introduced since commit b8241d20699e
>> ("perf/x86/intel: Implement batched PEBS interrupt handling (large PEBS
>> interrupt threshold)")
>>
>> Patch 5 fixed the RDPMC issue.
>>
>> The source code of read_count is as below.
>>
>> struct cpu {
>>          int fd;
>>          struct perf_event_mmap_page *buf;
>> };
>>
>> int perf_open(struct cpu *ctx, int cpu)
>> {
>>          struct perf_event_attr attr = {
>>                  .type = PERF_TYPE_HARDWARE,
>>                  .size = sizeof(struct perf_event_attr),
>>                  .sample_period = 100000,
>>                  .config = 0,
>>                  .sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID |
>>                                  PERF_SAMPLE_TIME | PERF_SAMPLE_CPU,
>>                  .precise_ip = 3,
>>                  .mmap = 1,
>>                  .comm = 1,
>>                  .task = 1,
>>                  .mmap2 = 1,
>>                  .sample_id_all = 1,
>>                  .comm_exec = 1,
>>          };
>>          ctx->buf = NULL;
>>          ctx->fd = syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
>>          if (ctx->fd < 0) {
>>                  perror("perf_event_open");
>>                  return -1;
>>          }
>>          return 0;
>> }
>>
>> void perf_close(struct cpu *ctx)
>> {
>>          close(ctx->fd);
>>          if (ctx->buf)
>>                  munmap(ctx->buf, pagesize);
>> }
>>
>> int main(int ac, char **av)
>> {
>>          struct cpu ctx;
>>          u64 count;
>>
>>          perf_open(&ctx, 0);
>>
>>          while (1) {
>>                  sleep(5);
>>
>>                  if (read(ctx.fd, &count, 8) != 8) {
>>                          perror("counter read");
>>                          break;
>>                  }
>>                  printf("0x%llx\n", count);
>>
>>          }
>>          perf_close(&ctx);
>> }
>>
>> Kan Liang (5):
>>    perf/x86/intel: fix event update for auto-reload
>>    perf/x86: introduce read function for x86_pmu
>>    perf/x86/intel/ds: introduce read function for large pebs
>>    perf/x86/intel: introduce read function for intel_pmu
>>    perf/x86: fix: disable userspace RDPMC usage for large PEBS
>>
>>   arch/x86/events/core.c       |  5 ++-
>>   arch/x86/events/intel/core.c |  9 +++++
>>   arch/x86/events/intel/ds.c   | 85 ++++++++++++++++++++++++++++++++++++++++++--
>>   arch/x86/events/perf_event.h |  3 ++
>>   4 files changed, 99 insertions(+), 3 deletions(-)
>>
>> --
>> 2.7.4
>>


* Re: [PATCH V3 0/5] bugs fix for large PEBS mmap read and rdpmc read
  2018-01-30 13:39   ` Jiri Olsa
@ 2018-01-30 14:59     ` Liang, Kan
  2018-01-30 15:04       ` Jiri Olsa
  0 siblings, 1 reply; 26+ messages in thread
From: Liang, Kan @ 2018-01-30 14:59 UTC (permalink / raw)
  To: Jiri Olsa, Stephane Eranian
  Cc: Peter Zijlstra, Ingo Molnar, LKML, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Andi Kleen



On 1/30/2018 8:39 AM, Jiri Olsa wrote:
> On Tue, Jan 30, 2018 at 01:16:39AM -0800, Stephane Eranian wrote:
>> Hi,
>>
>> On Mon, Jan 29, 2018 at 8:29 AM, <kan.liang@linux.intel.com> wrote:
>>>
>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>
>>> ------
>>>
>>> Changes since V2:
>>>   - Refined the changelog
>>>   - Introduced specific read function for large PEBS.
>>>     The previous generic PEBS read function is confusing.
>>>     Disabled PMU in pmu::read() path for large PEBS.
>>>     Handled the corner case when reload_times == 0.
>>>   - Modified the parameter of intel_pmu_save_and_restart_reload()
>>>     Discarded local64_cmpxchg
>>>   - Added fixes tag
>>>   - Added WARN to handle reload_times == 0 || reload_val == 0
>>>
>>> Changes since V1:
>>>   - Check PERF_X86_EVENT_AUTO_RELOAD before call
>>>     intel_pmu_save_and_restore()
>>
>> It is not yet clear to me why PERF_SAMPLE_PERIOD is not allowed
>> with large PEBS. Large PEBS requires fixed period. So the kernel could
>> make up the period from the event and store it in the sampling buffer.
>>
>> I tried using large PEBS recently, and despite trying different option
>> combination of perf record, I was not able to get it to work.
>>
>> $ perf record  -c 1  -e cpu/event=0xd1,umask=0x10,period=19936/pp
>> --no-timestamp --no-period -a -C 0
>>
>> But I was able to make this work with a much older kernel.
>>
>> Another annoyance I ran into is with perf record requiring -c period
>> in order not to set
>> PERF_SAMPLE_PERIOD in the event.
>>
>> If I do:
>> perf record  -c 1  -e cpu/event=0xd1,umask=0x10,period=19936/pp
>> --no-timestamp --no-period -a -C 0
>>
>> I get
>>
>> perf_event_attr:
>>    type                             4
>>    size                             112
>>    config                           0x10d1
>>    { sample_period, sample_freq }   199936
>>    sample_type                      IP|TID|CPU
>>
>> But if I do:
>> perf record  -e cpu/event=0xd1,umask=0x10,period=19936/pp
>> --no-timestamp --no-period -a -C 0
>>
>> I get
>>
>> perf_event_attr:
>>    type                             4
>>    size                             112
>>    config                           0x10d1
>>    { sample_period, sample_freq }   199936
>>    sample_type                      IP|TID|CPU|PERIOD
>>
>> Perf should check if all events have a period=, then it should not
>> pass PERF_SAMPLE_PERIOD, even
>> more so when only one event is defined.
>>
>> Also it does not seem to honor --no-period.
> 
> yep, there's a bug in period=x term handling
> we did not re/set the sample_type based on that
> 
> attached patch fixes that for me, also takes into account
> the --no/-period options
> 
> jirka
> 
> 
> ---
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index f251e824edac..907267206973 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -1566,7 +1566,8 @@ static struct option __record_options[] = {
>   	OPT_BOOLEAN_SET('T', "timestamp", &record.opts.sample_time,
>   			&record.opts.sample_time_set,
>   			"Record the sample timestamps"),
> -	OPT_BOOLEAN('P', "period", &record.opts.period, "Record the sample period"),
> +	OPT_BOOLEAN_SET('P', "period", &record.opts.period, &record.opts.period_set,
> +			"Record the sample period"),
>   	OPT_BOOLEAN('n', "no-samples", &record.opts.no_samples,
>   		    "don't sample"),
>   	OPT_BOOLEAN_SET('N', "no-buildid-cache", &record.no_buildid_cache,
> diff --git a/tools/perf/perf.h b/tools/perf/perf.h
> index 2357f4ccc9c7..cfe46236a5e5 100644
> --- a/tools/perf/perf.h
> +++ b/tools/perf/perf.h
> @@ -50,6 +50,7 @@ struct record_opts {
>   	bool	     sample_time_set;
>   	bool	     sample_cpu;
>   	bool	     period;
> +	bool	     period_set;
>   	bool	     running_time;
>   	bool	     full_auxtrace;
>   	bool	     auxtrace_snapshot_mode;
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index 66fa45198a11..ff359c9ece2e 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -745,12 +745,14 @@ static void apply_config_terms(struct perf_evsel *evsel,
>   			if (!(term->weak && opts->user_interval != ULLONG_MAX)) {
>   				attr->sample_period = term->val.period;
>   				attr->freq = 0;
> +				perf_evsel__reset_sample_bit(evsel, PERIOD);
>   			}
>   			break;
>   		case PERF_EVSEL__CONFIG_TERM_FREQ:
>   			if (!(term->weak && opts->user_freq != UINT_MAX)) {
>   				attr->sample_freq = term->val.freq;
>   				attr->freq = 1;
> +				perf_evsel__set_sample_bit(evsel, PERIOD);
>   			}

If we do so, some events could be in fixed mode without PERIOD set, 
while other events could be in freq mode with PERIOD set.
That could be a problem for large PEBS. The PEBS buffer is shared among 
events, and it doesn't support freq mode yet.

Thanks,
Kan

>   			break;
>   		case PERF_EVSEL__CONFIG_TERM_TIME:
> @@ -969,9 +971,6 @@ void perf_evsel__config(struct perf_evsel *evsel, struct record_opts *opts,
>   	if (target__has_cpu(&opts->target) || opts->sample_cpu)
>   		perf_evsel__set_sample_bit(evsel, CPU);
>   
> -	if (opts->period)
> -		perf_evsel__set_sample_bit(evsel, PERIOD);
> -
>   	/*
>   	 * When the user explicitly disabled time don't force it here.
>   	 */
> @@ -1073,6 +1072,14 @@ void perf_evsel__config(struct perf_evsel *evsel, struct record_opts *opts,
>   	apply_config_terms(evsel, opts, track);
>   
>   	evsel->ignore_missing_thread = opts->ignore_missing_thread;
> +
> +	/* The --period option takes the precedence. */
> +	if (opts->period_set) {
> +		if (opts->period)
> +			perf_evsel__set_sample_bit(evsel, PERIOD);
> +		else
> +			perf_evsel__reset_sample_bit(evsel, PERIOD);
> +	}
>   }
>   
>   static int perf_evsel__alloc_fd(struct perf_evsel *evsel, int ncpus, int nthreads)
> 


* Re: [PATCH V3 0/5] bugs fix for large PEBS mmap read and rdpmc read
  2018-01-30 14:59     ` Liang, Kan
@ 2018-01-30 15:04       ` Jiri Olsa
  2018-01-30 15:25         ` Liang, Kan
  0 siblings, 1 reply; 26+ messages in thread
From: Jiri Olsa @ 2018-01-30 15:04 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Stephane Eranian, Peter Zijlstra, Ingo Molnar, LKML,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Andi Kleen

On Tue, Jan 30, 2018 at 09:59:15AM -0500, Liang, Kan wrote:
> 
> 
> On 1/30/2018 8:39 AM, Jiri Olsa wrote:
> > On Tue, Jan 30, 2018 at 01:16:39AM -0800, Stephane Eranian wrote:
> > > Hi,
> > > 
> > > On Mon, Jan 29, 2018 at 8:29 AM, <kan.liang@linux.intel.com> wrote:
> > > > 
> > > > From: Kan Liang <kan.liang@linux.intel.com>
> > > > 
> > > > ------
> > > > 
> > > > Changes since V2:
> > > >   - Refined the changelog
> > > >   - Introduced specific read function for large PEBS.
> > > >     The previous generic PEBS read function is confusing.
> > > >     Disabled PMU in pmu::read() path for large PEBS.
> > > >     Handled the corner case when reload_times == 0.
> > > >   - Modified the parameter of intel_pmu_save_and_restart_reload()
> > > >     Discarded local64_cmpxchg
> > > >   - Added fixes tag
> > > >   - Added WARN to handle reload_times == 0 || reload_val == 0
> > > > 
> > > > Changes since V1:
> > > >   - Check PERF_X86_EVENT_AUTO_RELOAD before call
> > > >     intel_pmu_save_and_restore()
> > > 
> > > It is not yet clear to me why PERF_SAMPLE_PERIOD is not allowed
> > > with large PEBS. Large PEBS requires fixed period. So the kernel could
> > > make up the period from the event and store it in the sampling buffer.
> > > 
> > > I tried using large PEBS recently, and despite trying different option
> > > combination of perf record, I was not able to get it to work.
> > > 
> > > $ perf record  -c 1  -e cpu/event=0xd1,umask=0x10,period=19936/pp
> > > --no-timestamp --no-period -a -C 0
> > > 
> > > But I was able to make this work with a much older kernel.
> > > 
> > > Another annoyance I ran into is with perf record requiring -c period
> > > in order not to set
> > > PERF_SAMPLE_PERIOD in the event.
> > > 
> > > If I do:
> > > perf record  -c 1  -e cpu/event=0xd1,umask=0x10,period=19936/pp
> > > --no-timestamp --no-period -a -C 0
> > > 
> > > I get
> > > 
> > > perf_event_attr:
> > >    type                             4
> > >    size                             112
> > >    config                           0x10d1
> > >    { sample_period, sample_freq }   199936
> > >    sample_type                      IP|TID|CPU
> > > 
> > > But if I do:
> > > perf record  -e cpu/event=0xd1,umask=0x10,period=19936/pp
> > > --no-timestamp --no-period -a -C 0
> > > 
> > > I get
> > > 
> > > perf_event_attr:
> > >    type                             4
> > >    size                             112
> > >    config                           0x10d1
> > >    { sample_period, sample_freq }   199936
> > >    sample_type                      IP|TID|CPU|PERIOD
> > > 
> > > Perf should check if all events have a period=, then it should not
> > > pass PERF_SAMPLE_PERIOD, even
> > > more so when only one event is defined.
> > > 
> > > Also it does not seem to honor --no-period.
> > 
> > yep, there's a bug in period=x term handling
> > we did not re/set the sample_type based on that
> > 
> > attached patch fixes that for me, also takes into account
> > the --no/-period options
> > 
> > jirka
> > 
> > 
> > ---
> > diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> > index f251e824edac..907267206973 100644
> > --- a/tools/perf/builtin-record.c
> > +++ b/tools/perf/builtin-record.c
> > @@ -1566,7 +1566,8 @@ static struct option __record_options[] = {
> >   	OPT_BOOLEAN_SET('T', "timestamp", &record.opts.sample_time,
> >   			&record.opts.sample_time_set,
> >   			"Record the sample timestamps"),
> > -	OPT_BOOLEAN('P', "period", &record.opts.period, "Record the sample period"),
> > +	OPT_BOOLEAN_SET('P', "period", &record.opts.period, &record.opts.period_set,
> > +			"Record the sample period"),
> >   	OPT_BOOLEAN('n', "no-samples", &record.opts.no_samples,
> >   		    "don't sample"),
> >   	OPT_BOOLEAN_SET('N', "no-buildid-cache", &record.no_buildid_cache,
> > diff --git a/tools/perf/perf.h b/tools/perf/perf.h
> > index 2357f4ccc9c7..cfe46236a5e5 100644
> > --- a/tools/perf/perf.h
> > +++ b/tools/perf/perf.h
> > @@ -50,6 +50,7 @@ struct record_opts {
> >   	bool	     sample_time_set;
> >   	bool	     sample_cpu;
> >   	bool	     period;
> > +	bool	     period_set;
> >   	bool	     running_time;
> >   	bool	     full_auxtrace;
> >   	bool	     auxtrace_snapshot_mode;
> > diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> > index 66fa45198a11..ff359c9ece2e 100644
> > --- a/tools/perf/util/evsel.c
> > +++ b/tools/perf/util/evsel.c
> > @@ -745,12 +745,14 @@ static void apply_config_terms(struct perf_evsel *evsel,
> >   			if (!(term->weak && opts->user_interval != ULLONG_MAX)) {
> >   				attr->sample_period = term->val.period;
> >   				attr->freq = 0;
> > +				perf_evsel__reset_sample_bit(evsel, PERIOD);
> >   			}
> >   			break;
> >   		case PERF_EVSEL__CONFIG_TERM_FREQ:
> >   			if (!(term->weak && opts->user_freq != UINT_MAX)) {
> >   				attr->sample_freq = term->val.freq;
> >   				attr->freq = 1;
> > +				perf_evsel__set_sample_bit(evsel, PERIOD);
> >   			}
> 
> If we do so, some events could be in fixed mode without PERIOD set. Other
> events could be in freq mode with PERIOD set.

it also sets the attr->freq, so it's not in fixed mode

jirka


* Re: [PATCH V3 0/5] bugs fix for large PEBS mmap read and rdpmc read
  2018-01-30 15:04       ` Jiri Olsa
@ 2018-01-30 15:25         ` Liang, Kan
  2018-01-30 16:36           ` Stephane Eranian
  0 siblings, 1 reply; 26+ messages in thread
From: Liang, Kan @ 2018-01-30 15:25 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Stephane Eranian, Peter Zijlstra, Ingo Molnar, LKML,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Andi Kleen



On 1/30/2018 10:04 AM, Jiri Olsa wrote:
> On Tue, Jan 30, 2018 at 09:59:15AM -0500, Liang, Kan wrote:
>>
>>
>> On 1/30/2018 8:39 AM, Jiri Olsa wrote:
>>> On Tue, Jan 30, 2018 at 01:16:39AM -0800, Stephane Eranian wrote:
>>>> Hi,
>>>>
>>>> On Mon, Jan 29, 2018 at 8:29 AM, <kan.liang@linux.intel.com> wrote:
>>>>>
>>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>>
>>>>> ------
>>>>>
>>>>> Changes since V2:
>>>>>    - Refined the changelog
>>>>>    - Introduced specific read function for large PEBS.
>>>>>      The previous generic PEBS read function is confusing.
>>>>>      Disabled PMU in pmu::read() path for large PEBS.
>>>>>      Handled the corner case when reload_times == 0.
>>>>>    - Modified the parameter of intel_pmu_save_and_restart_reload()
>>>>>      Discarded local64_cmpxchg
>>>>>    - Added fixes tag
>>>>>    - Added WARN to handle reload_times == 0 || reload_val == 0
>>>>>
>>>>> Changes since V1:
>>>>>    - Check PERF_X86_EVENT_AUTO_RELOAD before call
>>>>>      intel_pmu_save_and_restore()
>>>>
>>>> It is not yet clear to me why PERF_SAMPLE_PERIOD is not allowed
>>>> with large PEBS. Large PEBS requires fixed period. So the kernel could
>>>> make up the period from the event and store it in the sampling buffer.
>>>>
>>>> I tried using large PEBS recently, and despite trying different option
>>>> combination of perf record, I was not able to get it to work.
>>>>
>>>> $ perf record  -c 1  -e cpu/event=0xd1,umask=0x10,period=19936/pp
>>>> --no-timestamp --no-period -a -C 0
>>>>
>>>> But I was able to make this work with a much older kernel.
>>>>
>>>> Another annoyance I ran into is with perf record requiring -c period
>>>> in order not to set
>>>> PERF_SAMPLE_PERIOD in the event.
>>>>
>>>> If I do:
>>>> perf record  -c 1  -e cpu/event=0xd1,umask=0x10,period=19936/pp
>>>> --no-timestamp --no-period -a -C 0
>>>>
>>>> I get
>>>>
>>>> perf_event_attr:
>>>>     type                             4
>>>>     size                             112
>>>>     config                           0x10d1
>>>>     { sample_period, sample_freq }   199936
>>>>     sample_type                      IP|TID|CPU
>>>>
>>>> But if I do:
>>>> perf record  -e cpu/event=0xd1,umask=0x10,period=19936/pp
>>>> --no-timestamp --no-period -a -C 0
>>>>
>>>> I get
>>>>
>>>> perf_event_attr:
>>>>     type                             4
>>>>     size                             112
>>>>     config                           0x10d1
>>>>     { sample_period, sample_freq }   199936
>>>>     sample_type                      IP|TID|CPU|PERIOD
>>>>
>>>> Perf should check if all events have a period=, then it should not
>>>> pass PERF_SAMPLE_PERIOD, even
>>>> more so when only one event is defined.
>>>>
>>>> Also it does not seem to honor --no-period.
>>>
>>> yep, there's a bug in period=x term handling
>>> we did not re/set the sample_type based on that
>>>
>>> attached patch fixes that for me, also takes into account
>>> the --no/-period options
>>>
>>> jirka
>>>
>>>
>>> ---
>>> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
>>> index f251e824edac..907267206973 100644
>>> --- a/tools/perf/builtin-record.c
>>> +++ b/tools/perf/builtin-record.c
>>> @@ -1566,7 +1566,8 @@ static struct option __record_options[] = {
>>>    	OPT_BOOLEAN_SET('T', "timestamp", &record.opts.sample_time,
>>>    			&record.opts.sample_time_set,
>>>    			"Record the sample timestamps"),
>>> -	OPT_BOOLEAN('P', "period", &record.opts.period, "Record the sample period"),
>>> +	OPT_BOOLEAN_SET('P', "period", &record.opts.period, &record.opts.period_set,
>>> +			"Record the sample period"),
>>>    	OPT_BOOLEAN('n', "no-samples", &record.opts.no_samples,
>>>    		    "don't sample"),
>>>    	OPT_BOOLEAN_SET('N', "no-buildid-cache", &record.no_buildid_cache,
>>> diff --git a/tools/perf/perf.h b/tools/perf/perf.h
>>> index 2357f4ccc9c7..cfe46236a5e5 100644
>>> --- a/tools/perf/perf.h
>>> +++ b/tools/perf/perf.h
>>> @@ -50,6 +50,7 @@ struct record_opts {
>>>    	bool	     sample_time_set;
>>>    	bool	     sample_cpu;
>>>    	bool	     period;
>>> +	bool	     period_set;
>>>    	bool	     running_time;
>>>    	bool	     full_auxtrace;
>>>    	bool	     auxtrace_snapshot_mode;
>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>>> index 66fa45198a11..ff359c9ece2e 100644
>>> --- a/tools/perf/util/evsel.c
>>> +++ b/tools/perf/util/evsel.c
>>> @@ -745,12 +745,14 @@ static void apply_config_terms(struct perf_evsel *evsel,
>>>    			if (!(term->weak && opts->user_interval != ULLONG_MAX)) {
>>>    				attr->sample_period = term->val.period;
>>>    				attr->freq = 0;
>>> +				perf_evsel__reset_sample_bit(evsel, PERIOD);
>>>    			}
>>>    			break;
>>>    		case PERF_EVSEL__CONFIG_TERM_FREQ:
>>>    			if (!(term->weak && opts->user_freq != UINT_MAX)) {
>>>    				attr->sample_freq = term->val.freq;
>>>    				attr->freq = 1;
>>> +				perf_evsel__set_sample_bit(evsel, PERIOD);
>>>    			}
>>
>> If we do so, some events could be in fixed mode without PERIOD set. Other
>> events could be in freq mode with PERIOD set.
> 
> it also sets the attr->freq, so it's not in fixed mode
> 

I mean the TERM_PERIOD case. It probably enables large PEBS.
 >>>    				attr->freq = 0;
 >>> +				perf_evsel__reset_sample_bit(evsel, PERIOD);

Events in fixed mode could enable large PEBS. Events in freq mode 
should not.
I think that could be a problem if some events try to enable large PEBS 
while others do not.

Thanks,
Kan


* Re: [PATCH V3 0/5] bugs fix for large PEBS mmap read and rdpmc read
  2018-01-30 15:25         ` Liang, Kan
@ 2018-01-30 16:36           ` Stephane Eranian
  2018-01-30 16:48             ` Liang, Kan
  0 siblings, 1 reply; 26+ messages in thread
From: Stephane Eranian @ 2018-01-30 16:36 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Jiri Olsa, Peter Zijlstra, Ingo Molnar, LKML,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Andi Kleen

On Tue, Jan 30, 2018 at 7:25 AM, Liang, Kan <kan.liang@linux.intel.com> wrote:
>
>
> On 1/30/2018 10:04 AM, Jiri Olsa wrote:
>>
>> On Tue, Jan 30, 2018 at 09:59:15AM -0500, Liang, Kan wrote:
>>>
>>>
>>>
>>> On 1/30/2018 8:39 AM, Jiri Olsa wrote:
>>>>
>>>> On Tue, Jan 30, 2018 at 01:16:39AM -0800, Stephane Eranian wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> On Mon, Jan 29, 2018 at 8:29 AM, <kan.liang@linux.intel.com> wrote:
>>>>>>
>>>>>>
>>>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>>>
>>>>>> ------
>>>>>>
>>>>>> Changes since V2:
>>>>>>    - Refined the changelog
>>>>>>    - Introduced specific read function for large PEBS.
>>>>>>      The previous generic PEBS read function is confusing.
>>>>>>      Disabled PMU in pmu::read() path for large PEBS.
>>>>>>      Handled the corner case when reload_times == 0.
>>>>>>    - Modified the parameter of intel_pmu_save_and_restart_reload()
>>>>>>      Discarded local64_cmpxchg
>>>>>>    - Added fixes tag
>>>>>>    - Added WARN to handle reload_times == 0 || reload_val == 0
>>>>>>
>>>>>> Changes since V1:
>>>>>>    - Check PERF_X86_EVENT_AUTO_RELOAD before call
>>>>>>      intel_pmu_save_and_restore()
>>>>>
>>>>>
>>>>> It is not yet clear to me why PERF_SAMPLE_PERIOD is not allowed
>>>>> with large PEBS. Large PEBS requires fixed period. So the kernel could
>>>>> make up the period from the event and store it in the sampling buffer.
>>>>>
>>>>> I tried using large PEBS recently, and despite trying different option
>>>>> combination of perf record, I was not able to get it to work.
>>>>>
>>>>> $ perf record  -c 1  -e cpu/event=0xd1,umask=0x10,period=19936/pp
>>>>> --no-timestamp --no-period -a -C 0
>>>>>
>>>>> But I was able to make this work with a much older kernel.
>>>>>
>>>>> Another annoyance I ran into is with perf record requiring -c period
>>>>> in order not to set
>>>>> PERF_SAMPLE_PERIOD in the event.
>>>>>
>>>>> If I do:
>>>>> perf record  -c 1  -e cpu/event=0xd1,umask=0x10,period=19936/pp
>>>>> --no-timestamp --no-period -a -C 0
>>>>>
>>>>> I get
>>>>>
>>>>> perf_event_attr:
>>>>>     type                             4
>>>>>     size                             112
>>>>>     config                           0x10d1
>>>>>     { sample_period, sample_freq }   199936
>>>>>     sample_type                      IP|TID|CPU
>>>>>
>>>>> But if I do:
>>>>> perf record  -e cpu/event=0xd1,umask=0x10,period=19936/pp
>>>>> --no-timestamp --no-period -a -C 0
>>>>>
>>>>> I get
>>>>>
>>>>> perf_event_attr:
>>>>>     type                             4
>>>>>     size                             112
>>>>>     config                           0x10d1
>>>>>     { sample_period, sample_freq }   199936
>>>>>     sample_type                      IP|TID|CPU|PERIOD
>>>>>
>>>>> Perf should check if all events have a period=, then it should not
>>>>> pass PERF_SAMPLE_PERIOD, even
>>>>> more so when only one event is defined.
>>>>>
>>>>> Also it does not seem to honor --no-period.
>>>>
>>>>
>>>> yep, there's a bug in period=x term handling
>>>> we did not re/set the sample_type based on that
>>>>
>>>> attached patch fixes that for me, also takes into account
>>>> the --no/-period options
>>>>
>>>> jirka
>>>>
>>>>
>>>> ---
>>>> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
>>>> index f251e824edac..907267206973 100644
>>>> --- a/tools/perf/builtin-record.c
>>>> +++ b/tools/perf/builtin-record.c
>>>> @@ -1566,7 +1566,8 @@ static struct option __record_options[] = {
>>>>         OPT_BOOLEAN_SET('T', "timestamp", &record.opts.sample_time,
>>>>                         &record.opts.sample_time_set,
>>>>                         "Record the sample timestamps"),
>>>> -       OPT_BOOLEAN('P', "period", &record.opts.period, "Record the
>>>> sample period"),
>>>> +       OPT_BOOLEAN_SET('P', "period", &record.opts.period,
>>>> &record.opts.period_set,
>>>> +                       "Record the sample period"),
>>>>         OPT_BOOLEAN('n', "no-samples", &record.opts.no_samples,
>>>>                     "don't sample"),
>>>>         OPT_BOOLEAN_SET('N', "no-buildid-cache",
>>>> &record.no_buildid_cache,
>>>> diff --git a/tools/perf/perf.h b/tools/perf/perf.h
>>>> index 2357f4ccc9c7..cfe46236a5e5 100644
>>>> --- a/tools/perf/perf.h
>>>> +++ b/tools/perf/perf.h
>>>> @@ -50,6 +50,7 @@ struct record_opts {
>>>>         bool         sample_time_set;
>>>>         bool         sample_cpu;
>>>>         bool         period;
>>>> +       bool         period_set;
>>>>         bool         running_time;
>>>>         bool         full_auxtrace;
>>>>         bool         auxtrace_snapshot_mode;
>>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>>>> index 66fa45198a11..ff359c9ece2e 100644
>>>> --- a/tools/perf/util/evsel.c
>>>> +++ b/tools/perf/util/evsel.c
>>>> @@ -745,12 +745,14 @@ static void apply_config_terms(struct perf_evsel
>>>> *evsel,
>>>>                         if (!(term->weak && opts->user_interval !=
>>>> ULLONG_MAX)) {
>>>>                                 attr->sample_period = term->val.period;
>>>>                                 attr->freq = 0;
>>>> +                               perf_evsel__reset_sample_bit(evsel,
>>>> PERIOD);
>>>>                         }
>>>>                         break;
>>>>                 case PERF_EVSEL__CONFIG_TERM_FREQ:
>>>>                         if (!(term->weak && opts->user_freq !=
>>>> UINT_MAX)) {
>>>>                                 attr->sample_freq = term->val.freq;
>>>>                                 attr->freq = 1;
>>>> +                               perf_evsel__set_sample_bit(evsel,
>>>> PERIOD);
>>>>                         }
>>>
>>>
>>> If we do so, some events could be in fixed mode without PERIOD set. Other
>>> events could be in freq mode with PERIOD set.
>>
>>
>> it also sets the attr->freq, so it's not in fixed mode
>>
>
> I mean the TERM_PERIOD. It probably enable the large PEBS.
>>>>                             attr->freq = 0;
>>>> +                           perf_evsel__reset_sample_bit(evsel, PERIOD);
>
> The events in fixed mode could enable large PEBS. Events in freq mode should
> not enable large PEBS.
> I think that could be a problem if some events try to enable large PEBS,
> while others not.
>
You only enable large PEBS if 100% of the events use fixed periods, 
either via -c period or because they all use an individual period=p. 
The --no-period option could also be used to remove the period for 
measurements where the period is not needed.

> Thanks,
> Kan
>
>


* Re: [PATCH V3 0/5] bugs fix for large PEBS mmap read and rdpmc read
  2018-01-30 16:36           ` Stephane Eranian
@ 2018-01-30 16:48             ` Liang, Kan
  2018-01-30 18:52               ` Jiri Olsa
  0 siblings, 1 reply; 26+ messages in thread
From: Liang, Kan @ 2018-01-30 16:48 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Jiri Olsa, Peter Zijlstra, Ingo Molnar, LKML,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Andi Kleen



On 1/30/2018 11:36 AM, Stephane Eranian wrote:
> On Tue, Jan 30, 2018 at 7:25 AM, Liang, Kan <kan.liang@linux.intel.com> wrote:
>>
>>
>> On 1/30/2018 10:04 AM, Jiri Olsa wrote:
>>>
>>> On Tue, Jan 30, 2018 at 09:59:15AM -0500, Liang, Kan wrote:
>>>>
>>>>
>>>>
>>>> On 1/30/2018 8:39 AM, Jiri Olsa wrote:
>>>>>
>>>>> On Tue, Jan 30, 2018 at 01:16:39AM -0800, Stephane Eranian wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On Mon, Jan 29, 2018 at 8:29 AM, <kan.liang@linux.intel.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>>>>
>>>>>>> ------
>>>>>>>
>>>>>>> Changes since V2:
>>>>>>>     - Refined the changelog
>>>>>>>     - Introduced specific read function for large PEBS.
>>>>>>>       The previous generic PEBS read function is confusing.
>>>>>>>       Disabled PMU in pmu::read() path for large PEBS.
>>>>>>>       Handled the corner case when reload_times == 0.
>>>>>>>     - Modified the parameter of intel_pmu_save_and_restart_reload()
>>>>>>>       Discarded local64_cmpxchg
>>>>>>>     - Added fixes tag
>>>>>>>     - Added WARN to handle reload_times == 0 || reload_val == 0
>>>>>>>
>>>>>>> Changes since V1:
>>>>>>>     - Check PERF_X86_EVENT_AUTO_RELOAD before call
>>>>>>>       intel_pmu_save_and_restore()
>>>>>>
>>>>>>
>>>>>> It is not yet clear to me why PERF_SAMPLE_PERIOD is not allowed
>>>>>> with large PEBS. Large PEBS requires fixed period. So the kernel could
>>>>>> make up the period from the event and store it in the sampling buffer.
>>>>>>
>>>>>> I tried using large PEBS recently, and despite trying different option
>>>>>> combination of perf record, I was not able to get it to work.
>>>>>>
>>>>>> $ perf record  -c 1  -e cpu/event=0xd1,umask=0x10,period=19936/pp
>>>>>> --no-timestamp --no-period -a -C 0
>>>>>>
>>>>>> But I was able to make this work with a much older kernel.
>>>>>>
>>>>>> Another annoyance I ran into is with perf record requiring -c period
>>>>>> in order not to set
>>>>>> PERF_SAMPLE_PERIOD in the event.
>>>>>>
>>>>>> If I do:
>>>>>> perf record  -c 1  -e cpu/event=0xd1,umask=0x10,period=19936/pp
>>>>>> --no-timestamp --no-period -a -C 0
>>>>>>
>>>>>> I get
>>>>>>
>>>>>> perf_event_attr:
>>>>>>      type                             4
>>>>>>      size                             112
>>>>>>      config                           0x10d1
>>>>>>      { sample_period, sample_freq }   199936
>>>>>>      sample_type                      IP|TID|CPU
>>>>>>
>>>>>> But if I do:
>>>>>> perf record  -e cpu/event=0xd1,umask=0x10,period=19936/pp
>>>>>> --no-timestamp --no-period -a -C 0
>>>>>>
>>>>>> I get
>>>>>>
>>>>>> perf_event_attr:
>>>>>>      type                             4
>>>>>>      size                             112
>>>>>>      config                           0x10d1
>>>>>>      { sample_period, sample_freq }   199936
>>>>>>      sample_type                      IP|TID|CPU|PERIOD
>>>>>>
>>>>>> Perf should check if all events have a period=, then it should not
>>>>>> pass PERF_SAMPLE_PERIOD, even
>>>>>> more so when only one event is defined.
>>>>>>
>>>>>> Also it does not seem to honor --no-period.
>>>>>
>>>>>
>>>>> yep, there's a bug in period=x term handling
>>>>> we did not re/set the sample_type based on that
>>>>>
>>>>> attached patch fixes that for me, also takes into account
>>>>> the --no/-period options
>>>>>
>>>>> jirka
>>>>>
>>>>>
>>>>> ---
>>>>> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
>>>>> index f251e824edac..907267206973 100644
>>>>> --- a/tools/perf/builtin-record.c
>>>>> +++ b/tools/perf/builtin-record.c
>>>>> @@ -1566,7 +1566,8 @@ static struct option __record_options[] = {
>>>>>          OPT_BOOLEAN_SET('T', "timestamp", &record.opts.sample_time,
>>>>>                          &record.opts.sample_time_set,
>>>>>                          "Record the sample timestamps"),
>>>>> -       OPT_BOOLEAN('P', "period", &record.opts.period, "Record the
>>>>> sample period"),
>>>>> +       OPT_BOOLEAN_SET('P', "period", &record.opts.period,
>>>>> &record.opts.period_set,
>>>>> +                       "Record the sample period"),
>>>>>          OPT_BOOLEAN('n', "no-samples", &record.opts.no_samples,
>>>>>                      "don't sample"),
>>>>>          OPT_BOOLEAN_SET('N', "no-buildid-cache",
>>>>> &record.no_buildid_cache,
>>>>> diff --git a/tools/perf/perf.h b/tools/perf/perf.h
>>>>> index 2357f4ccc9c7..cfe46236a5e5 100644
>>>>> --- a/tools/perf/perf.h
>>>>> +++ b/tools/perf/perf.h
>>>>> @@ -50,6 +50,7 @@ struct record_opts {
>>>>>          bool         sample_time_set;
>>>>>          bool         sample_cpu;
>>>>>          bool         period;
>>>>> +       bool         period_set;
>>>>>          bool         running_time;
>>>>>          bool         full_auxtrace;
>>>>>          bool         auxtrace_snapshot_mode;
>>>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>>>>> index 66fa45198a11..ff359c9ece2e 100644
>>>>> --- a/tools/perf/util/evsel.c
>>>>> +++ b/tools/perf/util/evsel.c
>>>>> @@ -745,12 +745,14 @@ static void apply_config_terms(struct perf_evsel
>>>>> *evsel,
>>>>>                          if (!(term->weak && opts->user_interval !=
>>>>> ULLONG_MAX)) {
>>>>>                                  attr->sample_period = term->val.period;
>>>>>                                  attr->freq = 0;
>>>>> +                               perf_evsel__reset_sample_bit(evsel,
>>>>> PERIOD);
>>>>>                          }
>>>>>                          break;
>>>>>                  case PERF_EVSEL__CONFIG_TERM_FREQ:
>>>>>                          if (!(term->weak && opts->user_freq !=
>>>>> UINT_MAX)) {
>>>>>                                  attr->sample_freq = term->val.freq;
>>>>>                                  attr->freq = 1;
>>>>> +                               perf_evsel__set_sample_bit(evsel,
>>>>> PERIOD);
>>>>>                          }
>>>>
>>>>
>>>> If we do so, some events could be in fixed mode without PERIOD set. Other
>>>> events could be in freq mode with PERIOD set.
>>>
>>>
>>> it also sets the attr->freq, so it's not in fixed mode
>>>
>>
>> I mean the TERM_PERIOD. It probably enables the large PEBS.
>>>>>                              attr->freq = 0;
>>>>> +                           perf_evsel__reset_sample_bit(evsel, PERIOD);
>>
>> The events in fixed mode could enable large PEBS. Events in freq mode should
>> not enable large PEBS.
>> I think that could be a problem if some events try to enable large PEBS,
>> while others not.
>>
> You only enable large PEBS if 100% of the events use fixed periods,
> either via -c period
> or because they all use individual period=p. The --no-period could
> also be used to remove
> the period for measurements where the period is not needed.


Oh, right, the kernel has already guaranteed that.
	if (cpuc->n_pebs == cpuc->n_large_pebs) {
		threshold = ds->pebs_absolute_maximum -
			x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
	} else {

Sorry for the noise.

jirka's patch looks good to me.

Thanks,
Kan

^ permalink raw reply	[flat|nested] 26+ messages in thread
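The kernel-side guarantee Kan quotes can be sketched standalone. Below is a hedged reconstruction of that threshold choice from intel/ds.c; the field names mirror the kernel's debug-store bookkeeping, but the struct layout and the numbers are simplified stand-ins for this demo:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's DS/PEBS bookkeeping. */
struct pebs_cfg {
	uint64_t pebs_buffer_base;
	uint64_t pebs_absolute_maximum;
	int n_pebs;            /* all active PEBS events */
	int n_large_pebs;      /* those eligible for large PEBS */
	int max_pebs_events;
	int pebs_record_size;
};

/* Only when every active PEBS event qualifies for large PEBS is the
 * interrupt threshold pushed near the end of the buffer, letting many
 * records accumulate per PMI; otherwise the threshold sits one record
 * past the base, so every record raises an interrupt. */
static uint64_t pebs_threshold(const struct pebs_cfg *c)
{
	if (c->n_pebs == c->n_large_pebs)
		return c->pebs_absolute_maximum -
		       (uint64_t)c->max_pebs_events * c->pebs_record_size;
	return c->pebs_buffer_base + c->pebs_record_size;
}
```

So a mix of fixed-period and freq-mode PEBS events on one CPU never batches: the `n_pebs == n_large_pebs` test falls through to the single-record threshold, which is why the freq-mode events cannot corrupt the large-PEBS accounting.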

* Re: [PATCH V3 0/5] bugs fix for large PEBS mmap read and rdpmc read
  2018-01-30 16:48             ` Liang, Kan
@ 2018-01-30 18:52               ` Jiri Olsa
  2018-01-30 19:56                 ` Stephane Eranian
  0 siblings, 1 reply; 26+ messages in thread
From: Jiri Olsa @ 2018-01-30 18:52 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Stephane Eranian, Peter Zijlstra, Ingo Molnar, LKML,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Andi Kleen

On Tue, Jan 30, 2018 at 11:48:18AM -0500, Liang, Kan wrote:

SNIP

> > > 
> > > The events in fixed mode could enable large PEBS. Events in freq mode should
> > > not enable large PEBS.
> > > I think that could be a problem if some events try to enable large PEBS,
> > > while others not.
> > > 
> > You only enable large PEBS if 100% of the events use fixed periods,
> > either via -c period
> > or because they all use individual period=p. The --no-period could
> > also be used to remove
> > the period for measurements where the period is not needed.
> 
> 
> Oh, right, the kernel has already guaranteed that.
> 	if (cpuc->n_pebs == cpuc->n_large_pebs) {
> 		threshold = ds->pebs_absolute_maximum -
> 			x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
> 	} else {
> 
> Sorry for the noise.
> 
> jirka's patch looks good to me.

cool, I'll post it later this week

thanks,
jirka

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V3 0/5] bugs fix for large PEBS mmap read and rdpmc read
  2018-01-30 18:52               ` Jiri Olsa
@ 2018-01-30 19:56                 ` Stephane Eranian
  2018-01-31  3:59                   ` Andi Kleen
  0 siblings, 1 reply; 26+ messages in thread
From: Stephane Eranian @ 2018-01-30 19:56 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Liang, Kan, Peter Zijlstra, Ingo Molnar, LKML,
	Arnaldo Carvalho de Melo, Thomas Gleixner, Andi Kleen

On Tue, Jan 30, 2018 at 10:52 AM, Jiri Olsa <jolsa@redhat.com> wrote:
>
> On Tue, Jan 30, 2018 at 11:48:18AM -0500, Liang, Kan wrote:
>
> SNIP
>
> > > >
> > > > The events in fixed mode could enable large PEBS. Events in freq mode should
> > > > not enable large PEBS.
> > > > I think that could be a problem if some events try to enable large PEBS,
> > > > while others not.
> > > >
> > > You only enable large PEBS if 100% of the events use fixed periods,
> > > either via -c period
> > > or because they all use individual period=p. The --no-period could
> > > also be used to remove
> > > the period for measurements where the period is not needed.
> >
> >
> > Oh, right, the kernel has already guaranteed that.
> >       if (cpuc->n_pebs == cpuc->n_large_pebs) {
> >               threshold = ds->pebs_absolute_maximum -
> >                       x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
> >       } else {
> >
> > Sorry for the noise.
> >
> > jirka's patch looks good to me.
>
> cool, I'll post it later this week
>
Still, the part I am missing here is why asking for
PERF_SAMPLE_PERIOD voids large PEBS.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V3 0/5] bugs fix for large PEBS mmap read and rdpmc read
  2018-01-30 19:56                 ` Stephane Eranian
@ 2018-01-31  3:59                   ` Andi Kleen
  2018-01-31  9:15                     ` Jiri Olsa
  0 siblings, 1 reply; 26+ messages in thread
From: Andi Kleen @ 2018-01-31  3:59 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Jiri Olsa, Liang, Kan, Peter Zijlstra, Ingo Molnar, LKML,
	Arnaldo Carvalho de Melo, Thomas Gleixner

> Still, the part I am missing here, is why asking for
> PERF_SAMPLE_PERIOD voids large PEBS.

I think it was disabled together with frequency mode
(which we could support too, but it's a bit more work)

But yes PERIOD could probably be allowed.


-Andi

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V3 0/5] bugs fix for large PEBS mmap read and rdpmc read
  2018-01-31  3:59                   ` Andi Kleen
@ 2018-01-31  9:15                     ` Jiri Olsa
  2018-01-31 13:15                       ` Jiri Olsa
  0 siblings, 1 reply; 26+ messages in thread
From: Jiri Olsa @ 2018-01-31  9:15 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Stephane Eranian, Liang, Kan, Peter Zijlstra, Ingo Molnar, LKML,
	Arnaldo Carvalho de Melo, Thomas Gleixner

On Tue, Jan 30, 2018 at 07:59:41PM -0800, Andi Kleen wrote:
> > Still, the part I am missing here, is why asking for
> > PERF_SAMPLE_PERIOD voids large PEBS.
> 
> I think it was disabled together with frequency mode
> (which we could support too, but it's a bit more work)
> 
> But yes PERIOD could probably be allowed.

looks like it's just a matter of adding PERF_SAMPLE_PERIOD
into PEBS_FREERUNNING_FLAGS, we already populate period
in setup_pebs_sample_data

jirka


---
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 8e4ea143ed96..78f91ec1056e 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -93,7 +93,8 @@ struct amd_nb {
 	PERF_SAMPLE_ID | PERF_SAMPLE_CPU | PERF_SAMPLE_STREAM_ID | \
 	PERF_SAMPLE_DATA_SRC | PERF_SAMPLE_IDENTIFIER | \
 	PERF_SAMPLE_TRANSACTION | PERF_SAMPLE_PHYS_ADDR | \
-	PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)
+	PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER | \
+	PERF_SAMPLE_PERIOD)
 
 #define PEBS_REGS \
 	(PERF_REG_X86_AX | \

^ permalink raw reply related	[flat|nested] 26+ messages in thread
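The gate the patch above relaxes can be sketched as a minimal subset test: an event may use the free-running (large) PEBS buffer only if every bit in its sample_type can be supplied from the PEBS record itself. The bit values below are illustrative stand-ins, not the real PERF_SAMPLE_* encoding:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sample_type bits (NOT the real perf_event_attr values). */
#define SAMPLE_IP        (1ULL << 0)
#define SAMPLE_TID       (1ULL << 1)
#define SAMPLE_PERIOD    (1ULL << 2)
#define SAMPLE_CALLCHAIN (1ULL << 3)  /* not reconstructible from PEBS */

/* The mask before and after adding PERF_SAMPLE_PERIOD. */
#define FREERUNNING_OLD  (SAMPLE_IP | SAMPLE_TID)
#define FREERUNNING_NEW  (SAMPLE_IP | SAMPLE_TID | SAMPLE_PERIOD)

/* Large PEBS is usable iff sample_type is a subset of the allowed mask. */
static int can_use_large_pebs(uint64_t sample_type, uint64_t allowed)
{
	return (sample_type & ~allowed) == 0;
}
```

With the old mask, requesting the period bit knocks the event out of the large-PEBS path; once PERIOD joins the mask, the same event passes the subset test, which matches what Jiri observes below with `perf record ... --period`.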

* Re: [PATCH V3 0/5] bugs fix for large PEBS mmap read and rdpmc read
  2018-01-31  9:15                     ` Jiri Olsa
@ 2018-01-31 13:15                       ` Jiri Olsa
  2018-01-31 15:15                         ` Liang, Kan
  0 siblings, 1 reply; 26+ messages in thread
From: Jiri Olsa @ 2018-01-31 13:15 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Stephane Eranian, Liang, Kan, Peter Zijlstra, Ingo Molnar, LKML,
	Arnaldo Carvalho de Melo, Thomas Gleixner

On Wed, Jan 31, 2018 at 10:15:39AM +0100, Jiri Olsa wrote:
> On Tue, Jan 30, 2018 at 07:59:41PM -0800, Andi Kleen wrote:
> > > Still, the part I am missing here, is why asking for
> > > PERF_SAMPLE_PERIOD voids large PEBS.
> > 
> > I think it was disabled together with frequency mode
> > (which we could support too, but it's a bit more work)
> > 
> > But yes PERIOD could probably be allowed.
> 
> looks like it's just a matter of adding PERF_SAMPLE_PERIOD
> into PEBS_FREERUNNING_FLAGS, we already populate period
> in setup_pebs_sample_data
> 
> jirka
> 
> 
> ---
> diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
> index 8e4ea143ed96..78f91ec1056e 100644
> --- a/arch/x86/events/perf_event.h
> +++ b/arch/x86/events/perf_event.h
> @@ -93,7 +93,8 @@ struct amd_nb {
>  	PERF_SAMPLE_ID | PERF_SAMPLE_CPU | PERF_SAMPLE_STREAM_ID | \
>  	PERF_SAMPLE_DATA_SRC | PERF_SAMPLE_IDENTIFIER | \
>  	PERF_SAMPLE_TRANSACTION | PERF_SAMPLE_PHYS_ADDR | \
> -	PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)
> +	PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER | \
> +	PERF_SAMPLE_PERIOD)

seems to work, getting a large PEBS event for the following command line:

  perf record -e cycles:P -c 100 --no-timestamp -C 0 --period


jirka

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V3 0/5] bugs fix for large PEBS mmap read and rdpmc read
  2018-01-31 13:15                       ` Jiri Olsa
@ 2018-01-31 15:15                         ` Liang, Kan
  2018-01-31 15:41                           ` Jiri Olsa
  0 siblings, 1 reply; 26+ messages in thread
From: Liang, Kan @ 2018-01-31 15:15 UTC (permalink / raw)
  To: Jiri Olsa, Andi Kleen
  Cc: Stephane Eranian, Peter Zijlstra, Ingo Molnar, LKML,
	Arnaldo Carvalho de Melo, Thomas Gleixner



On 1/31/2018 8:15 AM, Jiri Olsa wrote:
> On Wed, Jan 31, 2018 at 10:15:39AM +0100, Jiri Olsa wrote:
>> On Tue, Jan 30, 2018 at 07:59:41PM -0800, Andi Kleen wrote:
>>>> Still, the part I am missing here, is why asking for
>>>> PERF_SAMPLE_PERIOD voids large PEBS.
>>>
>>> I think it was disabled together with frequency mode
>>> (which we could support too, but it's a bit more work)
>>>
>>> But yes PERIOD could probably be allowed.
>>
>> looks like it's just a matter of adding PERF_SAMPLE_PERIOD
>> into PEBS_FREERUNNING_FLAGS, we already populate period
>> in setup_pebs_sample_data
>>
>> jirka
>>
>>
>> ---
>> diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
>> index 8e4ea143ed96..78f91ec1056e 100644
>> --- a/arch/x86/events/perf_event.h
>> +++ b/arch/x86/events/perf_event.h
>> @@ -93,7 +93,8 @@ struct amd_nb {
>>   	PERF_SAMPLE_ID | PERF_SAMPLE_CPU | PERF_SAMPLE_STREAM_ID | \
>>   	PERF_SAMPLE_DATA_SRC | PERF_SAMPLE_IDENTIFIER | \
>>   	PERF_SAMPLE_TRANSACTION | PERF_SAMPLE_PHYS_ADDR | \
>> -	PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)
>> +	PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER | \
>> +	PERF_SAMPLE_PERIOD)
> 
> seems to work, getting large PEBS event for following command line:
> 
>    perf record -e cycles:P -c 100 --no-timestamp -C 0 --period
> 
> 

Yes, I tried the patch. The large PEBS can be enabled with the PERIOD 
flag. Everything looks good.

Jirka, could you please post the patch as well?

Thanks,
Kan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V3 0/5] bugs fix for large PEBS mmap read and rdpmc read
  2018-01-31 15:15                         ` Liang, Kan
@ 2018-01-31 15:41                           ` Jiri Olsa
  0 siblings, 0 replies; 26+ messages in thread
From: Jiri Olsa @ 2018-01-31 15:41 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Andi Kleen, Stephane Eranian, Peter Zijlstra, Ingo Molnar, LKML,
	Arnaldo Carvalho de Melo, Thomas Gleixner

On Wed, Jan 31, 2018 at 10:15:33AM -0500, Liang, Kan wrote:
> 
> 
> On 1/31/2018 8:15 AM, Jiri Olsa wrote:
> > On Wed, Jan 31, 2018 at 10:15:39AM +0100, Jiri Olsa wrote:
> > > On Tue, Jan 30, 2018 at 07:59:41PM -0800, Andi Kleen wrote:
> > > > > Still, the part I am missing here, is why asking for
> > > > > PERF_SAMPLE_PERIOD voids large PEBS.
> > > > 
> > > > I think it was disabled together with frequency mode
> > > > (which we could support too, but it's a bit more work)
> > > > 
> > > > But yes PERIOD could probably be allowed.
> > > 
> > > looks like it's just a matter of adding PERF_SAMPLE_PERIOD
> > > into PEBS_FREERUNNING_FLAGS, we already populate period
> > > in setup_pebs_sample_data
> > > 
> > > jirka
> > > 
> > > 
> > > ---
> > > diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
> > > index 8e4ea143ed96..78f91ec1056e 100644
> > > --- a/arch/x86/events/perf_event.h
> > > +++ b/arch/x86/events/perf_event.h
> > > @@ -93,7 +93,8 @@ struct amd_nb {
> > >   	PERF_SAMPLE_ID | PERF_SAMPLE_CPU | PERF_SAMPLE_STREAM_ID | \
> > >   	PERF_SAMPLE_DATA_SRC | PERF_SAMPLE_IDENTIFIER | \
> > >   	PERF_SAMPLE_TRANSACTION | PERF_SAMPLE_PHYS_ADDR | \
> > > -	PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)
> > > +	PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER | \
> > > +	PERF_SAMPLE_PERIOD)
> > 
> > seems to work, getting large PEBS event for following command line:
> > 
> >    perf record -e cycles:P -c 100 --no-timestamp -C 0 --period
> > 
> > 
> 
> Yes, I tried the patch. The large PEBS can be enabled with the PERIOD flag.
> Everything looks good.
> 
> Jirka, could you please post the patch as well?

yes, will post that one as well

jirka

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V3 1/5] perf/x86/intel: fix event update for auto-reload
  2018-01-29 16:29 ` [PATCH V3 1/5] perf/x86/intel: fix event update for auto-reload kan.liang
@ 2018-02-06 15:06   ` Peter Zijlstra
  2018-02-06 17:58     ` Liang, Kan
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2018-02-06 15:06 UTC (permalink / raw)
  To: kan.liang; +Cc: mingo, linux-kernel, acme, tglx, jolsa, eranian, ak

On Mon, Jan 29, 2018 at 08:29:29AM -0800, kan.liang@linux.intel.com wrote:
> +/*
> + * Specific intel_pmu_save_and_restart() for auto-reload.
> + * It is only called from drain_pebs().
> + */
> +static int intel_pmu_save_and_restart_reload(struct perf_event *event,
> +					     int reload_times)
> +{
> +	struct hw_perf_event *hwc = &event->hw;
> +	int shift = 64 - x86_pmu.cntval_bits;
> +	u64 reload_val = hwc->sample_period;
> +	u64 prev_raw_count, new_raw_count;
> +	u64 delta;
> +
> +	WARN_ON((reload_times == 0) || (reload_val == 0));

I'm struggling to see why !count is a problem. You can call read() twice
without having extra PEBS samples, right?

Which also seems to suggest we have a problem in intel_pmu_drain_pebs*()
because those will simply not call us when there aren't any PEBS events
in the buffer, even though the count value can have changed. Your patch
doesn't appear to address that.

> +
> +	/*
> +	 * drain_pebs() only happens when the PMU is disabled.
> +	 * It does not need to handle the previous event value specially.
> +	 * The hwc->prev_count will be updated in x86_perf_event_set_period().

That's completely buggered. You shouldn't need to call
x86_perf_event_set_period() _at_all_.

> +	 */

	WARN_ON(this_cpu_read(cpu_hw_events.enabled));

> +	prev_raw_count = local64_read(&hwc->prev_count);
> +	rdpmcl(hwc->event_base_rdpmc, new_raw_count);
> +
> +	/*
> +	 * Now we have the new raw value and have updated the prev
> +	 * timestamp already. We can now calculate the elapsed delta
> +	 * (event-)time and add that to the generic event.
> +	 *
> +	 * Careful, not all hw sign-extends above the physical width
> +	 * of the count.
> +	 *
> +	 * There is a small gap between 'PEBS hardware is armed' and 'the NMI
> +	 * is handled'. Because of the gap, the first record also needs to be
> +	 * specially handled.

True, but not at all relevant.

> +	 * The formula to calculate the increments of event->count is as below.
> +	 * The increments = the period for first record +
> +	 *                  (reload_times - 1) * reload_val +
> +	 *                  the gap
> +	 * 'The period for first record' can be got from -prev_raw_count.
> +	 *
> +	 * 'The gap' = new_raw_count + reload_val. Because the start point of
> +	 * counting is always -reload_val for auto-reload.

That is still very much confused, what you seem to want is something
like:

	Since the counter increments a negative counter value and
	overflows on the sign switch, giving the interval:

		[-period, 0]

	the difference between two consecutive reads is:

	  A) value2 - value1;
	     when no overflows have happened in between,

	  B) (0 - value1) + (value2 - (-period));
	     when one overflow happened in between,

	  C) (0 - value1) + (n - 1) * (period) + (value2 - (-period));
	     when @n overflows happened in between.

	Here A) is the obvious difference, B) is the extension to the
	discrete interval, where the first term is to the top of the
	interval and the second term is from the bottom of the next
	interval and C) the extension to multiple intervals, where the
	middle term is the whole intervals covered.

	An equivalent of C, by reduction, is:

	  value2 - value1 + n * period

> +	 *
> +	 * The period_left needs to do the same adjustment by adding
> +	 * reload_val.
> +	 */
> +	delta = (reload_val << shift) + (new_raw_count << shift) -
> +		(prev_raw_count << shift);
> +	delta >>= shift;
> +
> +	local64_add(reload_val * (reload_times - 1), &event->count);
> +	local64_add(delta, &event->count);
> +	local64_sub(delta, &hwc->period_left);

Nobody cares about period_left, since AUTO_RELOAD already limits the
periods to what the hardware supports, right?

> +
> +	return x86_perf_event_set_period(event);
> +}

With the exception of handling 'empty' buffers, I ended up with the
below. Please try again.

---
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1156,16 +1156,13 @@ int x86_perf_event_set_period(struct per
 
 	per_cpu(pmc_prev_left[idx], smp_processor_id()) = left;
 
-	if (!(hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) ||
-	    local64_read(&hwc->prev_count) != (u64)-left) {
-		/*
-		 * The hw event starts counting from this event offset,
-		 * mark it to be able to extra future deltas:
-		 */
-		local64_set(&hwc->prev_count, (u64)-left);
+	/*
+	 * The hw event starts counting from this event offset,
+	 * mark it to be able to extra future deltas:
+	 */
+	local64_set(&hwc->prev_count, (u64)-left);
 
-		wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
-	}
+	wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
 
 	/*
 	 * Due to erratum on certan cpu we need
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1306,17 +1306,89 @@ get_next_pebs_record_by_bit(void *base,
 	return NULL;
 }
 
+/*
+ * Special variant of intel_pmu_save_and_restart() for auto-reload.
+ */
+static int
+intel_pmu_save_and_restart_reload(struct perf_event *event, int count)
+{
+	struct hw_perf_event *hwc = &event->hw;
+	int shift = 64 - x86_pmu.cntval_bits;
+	u64 period = hwc->sample_period;
+	u64 prev_raw_count, new_raw_count;
+	u64 delta;
+
+	WARN_ON(!period);
+
+	/*
+	 * drain_pebs() only happens when the PMU is disabled.
+	 */
+	WARN_ON(this_cpu_read(cpu_hw_events.enabled));
+
+	prev_raw_count = local64_read(&hwc->prev_count);
+	rdpmcl(hwc->event_base_rdpmc, new_raw_count);
+	local64_set(&hwx->prev_count, new_raw_count);
+
+	/*
+	 * Careful, not all hw sign-extends above the physical width
+	 * of the counter.
+	 */
+	delta = (new_raw_count << shift) - (prev_raw_count << shift);
+	delta >>= shift;
+
+	/*
+	 * Since the counter increments a negative counter value and
+	 * overflows on the sign switch, giving the interval:
+	 *
+	 *   [-period, 0]
+	 *
+	 * the difference between two consecutive reads is:
+	 *
+	 *   A) value2 - value1;
+	 *      when no overflows have happened in between,
+	 *
+	 *   B) (0 - value1) + (value2 - (-period));
+	 *      when one overflow happened in between,
+	 *
+	 *   C) (0 - value1) + (n - 1) * (period) + (value2 - (-period));
+	 *      when @n overflows happened in between.
+	 *
+	 * Here A) is the obvious difference, B) is the extension to the
+	 * discrete interval, where the first term is to the top of the
+	 * interval and the second term is from the bottom of the next
+	 * interval and C) the extension to multiple intervals, where the
+	 * middle term is the whole intervals covered.
+	 *
+	 * An equivalent of C, by reduction, is:
+	 *
+	 *   value2 - value1 + n * period
+	 */
+	local64_add(delta + period * count, &event->count);
+
+	perf_event_update_userpage(event);
+
+	return 0;
+}
+
 static void __intel_pmu_pebs_event(struct perf_event *event,
 				   struct pt_regs *iregs,
 				   void *base, void *top,
 				   int bit, int count)
 {
+	struct hw_perf_event *hwc = &event->hw;
 	struct perf_sample_data data;
 	struct pt_regs regs;
 	void *at = get_next_pebs_record_by_bit(base, top, bit);
 
-	if (!intel_pmu_save_and_restart(event) &&
-	    !(event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD))
+	if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
+		/*
+		 * Now, auto-reload is only enabled in fixed period mode.
+		 * The reload value is always hwc->sample_period.
+		 * May need to change it, if auto-reload is enabled in
+		 * freq mode later.
+		 */
+		intel_pmu_save_and_restart_reload(event, count);
+	} else if (!intel_pmu_save_and_restart(event))
 		return;
 
 	while (count > 1) {

^ permalink raw reply	[flat|nested] 26+ messages in thread
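The interval argument in the patch comment can be checked in isolation. A minimal sketch, with CNTVAL_BITS standing in for x86_pmu.cntval_bits (48 on recent Intel parts) and the helper names invented for this demo:

```c
#include <assert.h>
#include <stdint.h>

#define CNTVAL_BITS 48
#define SHIFT (64 - CNTVAL_BITS)

/* Sign-extend a raw counter value that the hardware may not have
 * sign-extended above its physical width.  Relies on arithmetic right
 * shift of signed values, as the kernel code does. */
static int64_t sext(uint64_t raw)
{
	return (int64_t)(raw << SHIFT) >> SHIFT;
}

/* The counter counts up from -period and overflows at 0, so two reads
 * with @n overflows in between differ by:  value2 - value1 + n * period
 * (the reduced form of case C in the comment above). */
static int64_t reload_delta(uint64_t prev_raw, uint64_t new_raw,
			    uint64_t period, int n)
{
	return sext(new_raw) - sext(prev_raw) + (int64_t)period * n;
}
```

With period 100, reading -30 then -25 with no overflow (case A) gives 5; reading -30 then -80 with two overflows (case C) gives 30 + 100 + 20 = 150, i.e. value2 - value1 + n * period.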

* Re: [PATCH V3 1/5] perf/x86/intel: fix event update for auto-reload
  2018-02-06 15:06   ` Peter Zijlstra
@ 2018-02-06 17:58     ` Liang, Kan
  2018-02-09 14:09       ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Liang, Kan @ 2018-02-06 17:58 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, acme, tglx, jolsa, eranian, ak



> With the exception of handling 'empty' buffers, I ended up with the
> below. Please try again.
>

There are two small errors. After fixing them, the patch works well.

> ---
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -1156,16 +1156,13 @@ int x86_perf_event_set_period(struct per
>   
>   	per_cpu(pmc_prev_left[idx], smp_processor_id()) = left;
>   
> -	if (!(hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) ||
> -	    local64_read(&hwc->prev_count) != (u64)-left) {
> -		/*
> -		 * The hw event starts counting from this event offset,
> -		 * mark it to be able to extra future deltas:
> -		 */
> -		local64_set(&hwc->prev_count, (u64)-left);
> +	/*
> +	 * The hw event starts counting from this event offset,
> +	 * mark it to be able to extra future deltas:
> +	 */
> +	local64_set(&hwc->prev_count, (u64)-left);
>   
> -		wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
> -	}
> +	wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
>   
>   	/*
>   	 * Due to erratum on certan cpu we need
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
> @@ -1306,17 +1306,89 @@ get_next_pebs_record_by_bit(void *base,
>   	return NULL;
>   }
>   
> +/*
> + * Special variant of intel_pmu_save_and_restart() for auto-reload.
> + */
> +static int
> +intel_pmu_save_and_restart_reload(struct perf_event *event, int count)
> +{
> +	struct hw_perf_event *hwc = &event->hw;
> +	int shift = 64 - x86_pmu.cntval_bits;
> +	u64 period = hwc->sample_period;
> +	u64 prev_raw_count, new_raw_count;
> +	u64 delta;
> +
> +	WARN_ON(!period);
> +
> +	/*
> +	 * drain_pebs() only happens when the PMU is disabled.
> +	 */
> +	WARN_ON(this_cpu_read(cpu_hw_events.enabled));
> +
> +	prev_raw_count = local64_read(&hwc->prev_count);
> +	rdpmcl(hwc->event_base_rdpmc, new_raw_count);
> +	local64_set(&hwx->prev_count, new_raw_count);

Here is a typo. It should be &hwc->prev_count.

> +
> +	/*
> +	 * Careful, not all hw sign-extends above the physical width
> +	 * of the counter.
> +	 */
> +	delta = (new_raw_count << shift) - (prev_raw_count << shift);
> +	delta >>= shift;

new_raw_count could be smaller than prev_raw_count when a reload
happened between the two reads. The sign bit will then be set, and the
logical delta >>= shift on a u64 gives a wrong (huge) result.

I think we can add a period here to prevent it.
+       delta = (period << shift) + (new_raw_count << shift) -
+               (prev_raw_count << shift);
+       delta >>= shift;
......
+       local64_add(delta + period * (count - 1), &event->count);


Thanks,
Kan
> +
> +	/*
> +	 * Since the counter increments a negative counter value and
> +	 * overflows on the sign switch, giving the interval:
> +	 *
> +	 *   [-period, 0]
> +	 *
> +	 * the difference between two consecutive reads is:
> +	 *
> +	 *   A) value2 - value1;
> +	 *      when no overflows have happened in between,
> +	 *
> +	 *   B) (0 - value1) + (value2 - (-period));
> +	 *      when one overflow happened in between,
> +	 *
> +	 *   C) (0 - value1) + (n - 1) * (period) + (value2 - (-period));
> +	 *      when @n overflows happened in between.
> +	 *
> +	 * Here A) is the obvious difference, B) is the extension to the
> +	 * discrete interval, where the first term is to the top of the
> +	 * interval and the second term is from the bottom of the next
> +	 * interval and C) the extension to multiple intervals, where the
> +	 * middle term is the whole intervals covered.
> +	 *
> +	 * An equivalent of C, by reduction, is:
> +	 *
> +	 *   value2 - value1 + n * period
> +	 */
> +	local64_add(delta + period * count, &event->count);
> +
> +	perf_event_update_userpage(event);
> +
> +	return 0;
> +}
> +
>   static void __intel_pmu_pebs_event(struct perf_event *event,
>   				   struct pt_regs *iregs,
>   				   void *base, void *top,
>   				   int bit, int count)
>   {
> +	struct hw_perf_event *hwc = &event->hw;
>   	struct perf_sample_data data;
>   	struct pt_regs regs;
>   	void *at = get_next_pebs_record_by_bit(base, top, bit);
>   
> -	if (!intel_pmu_save_and_restart(event) &&
> -	    !(event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD))
> +	if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
> +		/*
> +		 * Now, auto-reload is only enabled in fixed period mode.
> +		 * The reload value is always hwc->sample_period.
> +		 * May need to change it, if auto-reload is enabled in
> +		 * freq mode later.
> +		 */
> +		intel_pmu_save_and_restart_reload(event, count);
> +	} else if (!intel_pmu_save_and_restart(event))
>   		return;
>   
>   	while (count > 1) {
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread
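The sign-bit hazard Kan describes, and the period-biased fix he suggests, can be reproduced standalone. The 48-bit width and the function names below are illustrative, not the kernel's:

```c
#include <assert.h>
#include <stdint.h>

#define CNTVAL_BITS 48
#define SHIFT (64 - CNTVAL_BITS)

/* The original u64 arithmetic: the final shift is logical, so a
 * negative difference (new < prev, i.e. a reload happened in between)
 * is not sign-extended and comes out as a huge positive value. */
static uint64_t delta_unsigned(uint64_t prev_raw, uint64_t new_raw)
{
	uint64_t delta = (new_raw << SHIFT) - (prev_raw << SHIFT);
	return delta >> SHIFT;
}

/* Kan's suggested fix: bias by one period before the shift so the
 * intermediate value stays non-negative; his full formula then adds
 * period * (count - 1) for the remaining reloads. */
static uint64_t delta_biased(uint64_t prev_raw, uint64_t new_raw,
			     uint64_t period)
{
	uint64_t delta = (period << SHIFT) +
			 (new_raw << SHIFT) - (prev_raw << SHIFT);
	return delta >> SHIFT;
}
```

With period 100 and one reload between reads of -30 and then -80, the true elapsed count is 30 + 20 = 50: the unbiased version returns a value near 2^48, while the biased version returns 50.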

* Re: [PATCH V3 1/5] perf/x86/intel: fix event update for auto-reload
  2018-02-06 17:58     ` Liang, Kan
@ 2018-02-09 14:09       ` Peter Zijlstra
  2018-02-09 15:49         ` Liang, Kan
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2018-02-09 14:09 UTC (permalink / raw)
  To: Liang, Kan; +Cc: mingo, linux-kernel, acme, tglx, jolsa, eranian, ak

On Tue, Feb 06, 2018 at 12:58:23PM -0500, Liang, Kan wrote:
> 
> 
> > With the exception of handling 'empty' buffers, I ended up with the
> > below. Please try again.
> > 
> 
> There are two small errors. After fixing them, the patch works well.

Well, it still doesn't do A, two read()s without a PEBS record in between.
So that needs fixing. What 3/5 does, calling x86_perf_event_update() after
drain_pebs(), is actively wrong after this patch.

> > +
> > +	/*
> > +	 * Careful, not all hw sign-extends above the physical width
> > +	 * of the counter.
> > +	 */
> > +	delta = (new_raw_count << shift) - (prev_raw_count << shift);
> > +	delta >>= shift;
> 
> new_raw_count could be smaller than prev_raw_count. In that case the
> sign bit will be set and the shifted delta can be wrong.
> 
> I think we can add a period here to prevent it.
> +       delta = (period << shift) + (new_raw_count << shift) -
> +               (prev_raw_count << shift);
> +       delta >>= shift;
> ......
> +       local64_add(delta + period * (count - 1), &event->count);
> 

Right, it does, but that wrecks case A again, because then we get here
with !@count.

Maybe something like:


	s64 new, old;

	new = ((s64)(new_raw_count << shift) >> shift);
	old = ((s64)(old_raw_count << shift) >> shift);

	local64_add(new - old + count * period, &event->count);


And then make intel_pmu_drain_pebs_*() call this function even when !n.


* Re: [PATCH V3 1/5] perf/x86/intel: fix event update for auto-reload
  2018-02-09 14:09       ` Peter Zijlstra
@ 2018-02-09 15:49         ` Liang, Kan
  2018-02-10 14:09           ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Liang, Kan @ 2018-02-09 15:49 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, acme, tglx, jolsa, eranian, ak



On 2/9/2018 9:09 AM, Peter Zijlstra wrote:
> On Tue, Feb 06, 2018 at 12:58:23PM -0500, Liang, Kan wrote:
>>
>>
>>> With the exception of handling 'empty' buffers, I ended up with the
>>> below. Please try again.
>>>
>>
>> There are two small errors. After fixing them, the patch works well.
> 
> Well, it still doesn't do A, two read()s without a PEBS record in between.
> So that needs fixing. What 3/5 does, calling x86_perf_event_update() after
> drain_pebs(), is actively wrong after this patch.
>

In my understanding, for case A, drain_pebs() will return immediately;
it cannot reach this code, because there is no PEBS record ready, so
ds->pebs_index should still equal ds->pebs_buffer_base.

3/5 is to handle case A.

Thanks,
Kan

>>> +
>>> +	/*
>>> +	 * Careful, not all hw sign-extends above the physical width
>>> +	 * of the counter.
>>> +	 */
>>> +	delta = (new_raw_count << shift) - (prev_raw_count << shift);
>>> +	delta >>= shift;
>>
>> new_raw_count could be smaller than prev_raw_count. In that case the
>> sign bit will be set and the shifted delta can be wrong.
>>
>> I think we can add a period here to prevent it.
>> +       delta = (period << shift) + (new_raw_count << shift) -
>> +               (prev_raw_count << shift);
>> +       delta >>= shift;
>> ......
>> +       local64_add(delta + period * (count - 1), &event->count);
>>
> 
> Right, it does, but that wrecks case A again, because then we get here
> with !@count.
> 
> Maybe something like:
> 
> 
> 	s64 new, old;
> 
> 	new = ((s64)(new_raw_count << shift) >> shift);
> 	old = ((s64)(old_raw_count << shift) >> shift);
> 
> 	local64_add(new - old + count * period, &event->count);
> 
> 
> And then make intel_pmu_drain_pebs_*() call this function even when !n.
> 


* Re: [PATCH V3 1/5] perf/x86/intel: fix event update for auto-reload
  2018-02-09 15:49         ` Liang, Kan
@ 2018-02-10 14:09           ` Peter Zijlstra
  0 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2018-02-10 14:09 UTC (permalink / raw)
  To: Liang, Kan; +Cc: mingo, linux-kernel, acme, tglx, jolsa, eranian, ak

On Fri, Feb 09, 2018 at 10:49:35AM -0500, Liang, Kan wrote:
> 
> 
> On 2/9/2018 9:09 AM, Peter Zijlstra wrote:
> > On Tue, Feb 06, 2018 at 12:58:23PM -0500, Liang, Kan wrote:
> > > 
> > > 
> > > > With the exception of handling 'empty' buffers, I ended up with the
> > > > below. Please try again.
> > > > 
> > > 
> > > There are two small errors. After fixing them, the patch works well.
> > 
> > Well, it still doesn't do A, two read()s without a PEBS record in between.
> > So that needs fixing. What 3/5 does, calling x86_perf_event_update() after
> > drain_pebs(), is actively wrong after this patch.
> > 
> 
> In my understanding, for case A, drain_pebs() will return immediately;
> it cannot reach this code, because there is no PEBS record ready, so
> ds->pebs_index should still equal ds->pebs_buffer_base.

Right, so fix that.

> 3/5 is to handle case A.

3/5 is terminally broken; you should not call x86_perf_event_update()
on an auto-reload event _ever_ after my patch.


end of thread, other threads:[~2018-02-10 14:09 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-29 16:29 [PATCH V3 0/5] bugs fix for large PEBS mmap read and rdpmc read kan.liang
2018-01-29 16:29 ` [PATCH V3 1/5] perf/x86/intel: fix event update for auto-reload kan.liang
2018-02-06 15:06   ` Peter Zijlstra
2018-02-06 17:58     ` Liang, Kan
2018-02-09 14:09       ` Peter Zijlstra
2018-02-09 15:49         ` Liang, Kan
2018-02-10 14:09           ` Peter Zijlstra
2018-01-29 16:29 ` [PATCH V3 2/5] perf/x86: introduce read function for x86_pmu kan.liang
2018-01-29 16:29 ` [PATCH V3 3/5] perf/x86/intel/ds: introduce read function for large pebs kan.liang
2018-01-29 16:29 ` [PATCH V3 4/5] perf/x86/intel: fix pmu read for large PEBS kan.liang
2018-01-29 16:29 ` [PATCH V3 5/5] perf/x86: fix: disable userspace RDPMC usage " kan.liang
2018-01-30  9:16 ` [PATCH V3 0/5] bugs fix for large PEBS mmap read and rdpmc read Stephane Eranian
2018-01-30 13:39   ` Jiri Olsa
2018-01-30 14:59     ` Liang, Kan
2018-01-30 15:04       ` Jiri Olsa
2018-01-30 15:25         ` Liang, Kan
2018-01-30 16:36           ` Stephane Eranian
2018-01-30 16:48             ` Liang, Kan
2018-01-30 18:52               ` Jiri Olsa
2018-01-30 19:56                 ` Stephane Eranian
2018-01-31  3:59                   ` Andi Kleen
2018-01-31  9:15                     ` Jiri Olsa
2018-01-31 13:15                       ` Jiri Olsa
2018-01-31 15:15                         ` Liang, Kan
2018-01-31 15:41                           ` Jiri Olsa
2018-01-30 14:41   ` Liang, Kan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).