linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/2] perf: Add munmap callback
@ 2018-10-24 15:11 kan.liang
  2018-10-24 15:11 ` [PATCH 2/2] perf/x86/intel: Fix missing physical address in large PEBS kan.liang
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: kan.liang @ 2018-10-24 15:11 UTC (permalink / raw)
  To: peterz, tglx, mingo, acme, linux-kernel; +Cc: bp, ak, eranian, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

To calculate the physical address, perf needs to walk the pages tables.
The related mapping may has already been removed from pages table in
some cases (e.g. large PEBS). The virtual address recorded in the first
PEBS records may already be unmapped before draining PEBS buffers.

Add a munmap callback to notify the PMU of any unmapping, which only be
invoked when the munmap is implemented.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 include/linux/perf_event.h |  3 +++
 kernel/events/core.c       | 25 +++++++++++++++++++++++++
 mm/mmap.c                  |  1 +
 3 files changed, 29 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 53c500f0ca79..7f0a9258ce1f 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -400,6 +400,7 @@ struct pmu {
 	 */
 	void (*sched_task)		(struct perf_event_context *ctx,
 					bool sched_in);
+	void (*munmap)			(void);
 	/*
 	 * PMU specific data size
 	 */
@@ -1113,6 +1114,7 @@ static inline void perf_event_task_sched_out(struct task_struct *prev,
 }
 
 extern void perf_event_mmap(struct vm_area_struct *vma);
+extern void perf_event_munmap(void);
 extern struct perf_guest_info_callbacks *perf_guest_cbs;
 extern int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks);
 extern int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks);
@@ -1333,6 +1335,7 @@ static inline int perf_unregister_guest_info_callbacks
 (struct perf_guest_info_callbacks *callbacks)				{ return 0; }
 
 static inline void perf_event_mmap(struct vm_area_struct *vma)		{ }
+static inline void perf_event_munmap(void)				{ }
 static inline void perf_event_exec(void)				{ }
 static inline void perf_event_comm(struct task_struct *tsk, bool exec)	{ }
 static inline void perf_event_namespaces(struct task_struct *tsk)	{ }
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5a97f34bc14c..00338d6fbed7 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3184,6 +3184,31 @@ static void perf_pmu_sched_task(struct task_struct *prev,
 	}
 }
 
+void perf_event_munmap(void)
+{
+	struct perf_cpu_context *cpuctx;
+	unsigned long flags;
+	struct pmu *pmu;
+
+	local_irq_save(flags);
+	list_for_each_entry(cpuctx, this_cpu_ptr(&sched_cb_list), sched_cb_entry) {
+		pmu = cpuctx->ctx.pmu;
+
+		if (!pmu->munmap)
+			continue;
+
+		perf_ctx_lock(cpuctx, cpuctx->task_ctx);
+		perf_pmu_disable(pmu);
+
+		pmu->munmap();
+
+		perf_pmu_enable(pmu);
+
+		perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+	}
+	local_irq_restore(flags);
+}
+
 static void perf_event_switch(struct task_struct *task,
 			      struct task_struct *next_prev, bool sched_in);
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 5f2b2b184c60..61978ad8c480 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2777,6 +2777,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 	/*
 	 * Remove the vma's, and unmap the actual pages
 	 */
+	perf_event_munmap();
 	detach_vmas_to_be_unmapped(mm, vma, prev, end);
 	unmap_region(mm, vma, prev, start, end);
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/2] perf/x86/intel: Fix missing physical address in large PEBS
  2018-10-24 15:11 [PATCH 1/2] perf: Add munmap callback kan.liang
@ 2018-10-24 15:11 ` kan.liang
  2018-10-24 16:23 ` [PATCH 1/2] perf: Add munmap callback Andi Kleen
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 21+ messages in thread
From: kan.liang @ 2018-10-24 15:11 UTC (permalink / raw)
  To: peterz, tglx, mingo, acme, linux-kernel; +Cc: bp, ak, eranian, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The physical addresses for the last several samples are always lost in
large PEBS. For example,
 #perf record -e mem-loads:uP --phys-data -c10000 -- ./dtlb
 [ perf record: Woken up 1 times to write data ]
 [ perf record: Captured and wrote 0.103 MB perf.data (2661 samples) ]
 #perf script -Ftime,event,sym,phys_addr  | tail
  1595.162483: mem-loads:uP: DoDependentLoads 0
  1595.162484: mem-loads:uP: DoDependentLoads 0
  1595.162484: mem-loads:uP: DoDependentLoads 0
  1595.162485: mem-loads:uP: DoDependentLoads 0

The problem happens because the mapping has been removed before walking
through the pages table.
To avoid this, drain the PEBS buffer on munmap.

With the patch,
 #perf script -Ftime,event,sym,phys_addr  | tail
  190.425922: mem-loads:uP: DoDependentLoads 3ce180a80
  190.425922: mem-loads:uP: DoDependentLoads 3c59ef540
  190.425922: mem-loads:uP: DoDependentLoads 3e3c73dc0
  190.425923: mem-loads:uP: DoDependentLoads 3d6c0d440

Fixes: fc7ce9c74c3a ("perf/core, x86: Add PERF_SAMPLE_PHYS_ADDR")
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/core.c       | 7 +++++++
 arch/x86/events/intel/core.c | 6 ++++++
 arch/x86/events/intel/ds.c   | 8 ++++++++
 arch/x86/events/perf_event.h | 4 +++-
 4 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index de32741d041a..9b23b49a0778 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2288,6 +2288,12 @@ static void x86_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
 		x86_pmu.sched_task(ctx, sched_in);
 }
 
+static void x86_pmu_munmap(void)
+{
+	if (x86_pmu.munmap)
+		x86_pmu.munmap();
+}
+
 void perf_check_microcode(void)
 {
 	if (x86_pmu.check_microcode)
@@ -2317,6 +2323,7 @@ static struct pmu pmu = {
 
 	.event_idx		= x86_pmu_event_idx,
 	.sched_task		= x86_pmu_sched_task,
+	.munmap			= x86_pmu_munmap,
 	.task_ctx_size          = sizeof(struct x86_perf_task_context),
 };
 
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 0fb8659b20d8..db393f6b3f53 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3548,6 +3548,11 @@ static void intel_pmu_sched_task(struct perf_event_context *ctx,
 	intel_pmu_lbr_sched_task(ctx, sched_in);
 }
 
+static void intel_pmu_munmap(void)
+{
+	intel_pmu_pebs_munmap();
+}
+
 PMU_FORMAT_ATTR(offcore_rsp, "config1:0-63");
 
 PMU_FORMAT_ATTR(ldlat, "config1:0-15");
@@ -3669,6 +3674,7 @@ static __initconst const struct x86_pmu intel_pmu = {
 	.cpu_dying		= intel_pmu_cpu_dying,
 	.guest_get_msrs		= intel_guest_get_msrs,
 	.sched_task		= intel_pmu_sched_task,
+	.munmap			= intel_pmu_munmap,
 };
 
 static __init void intel_clovertown_quirk(void)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index b7b01d762d32..a0ca0e7c005c 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -893,6 +893,14 @@ void intel_pmu_pebs_sched_task(struct perf_event_context *ctx, bool sched_in)
 		intel_pmu_drain_pebs_buffer();
 }
 
+void intel_pmu_pebs_munmap(void)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+	if (pebs_needs_sched_cb(cpuc))
+		intel_pmu_drain_pebs_buffer();
+}
+
 static inline void pebs_update_threshold(struct cpu_hw_events *cpuc)
 {
 	struct debug_store *ds = cpuc->ds;
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index adae087cecdd..e1a8b8b928e8 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -591,7 +591,7 @@ struct x86_pmu {
 	void		(*check_microcode)(void);
 	void		(*sched_task)(struct perf_event_context *ctx,
 				      bool sched_in);
-
+	void		(*munmap)(void);
 	/*
 	 * Intel Arch Perfmon v2+
 	 */
@@ -932,6 +932,8 @@ void intel_pmu_pebs_disable_all(void);
 
 void intel_pmu_pebs_sched_task(struct perf_event_context *ctx, bool sched_in);
 
+void intel_pmu_pebs_munmap(void);
+
 void intel_pmu_auto_reload_read(struct perf_event *event);
 
 void intel_ds_init(void);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] perf: Add munmap callback
  2018-10-24 15:11 [PATCH 1/2] perf: Add munmap callback kan.liang
  2018-10-24 15:11 ` [PATCH 2/2] perf/x86/intel: Fix missing physical address in large PEBS kan.liang
@ 2018-10-24 16:23 ` Andi Kleen
  2018-10-24 16:32   ` Arnaldo Carvalho de Melo
  2018-10-24 19:30 ` Stephane Eranian
  2018-10-25  0:29 ` Peter Zijlstra
  3 siblings, 1 reply; 21+ messages in thread
From: Andi Kleen @ 2018-10-24 16:23 UTC (permalink / raw)
  To: kan.liang; +Cc: peterz, tglx, mingo, acme, linux-kernel, bp, eranian

> +void perf_event_munmap(void)
> +{
> +	struct perf_cpu_context *cpuctx;
> +	unsigned long flags;
> +	struct pmu *pmu;
> +
> +	local_irq_save(flags);
> +	list_for_each_entry(cpuctx, this_cpu_ptr(&sched_cb_list), sched_cb_entry) {

Would be good have a fast path here that checks for the list being empty
without disabling the interrupts. munmap can be somewhat hot. I think
it's ok to make it slower with perf running, but we shouldn't impact
it without perf.

-Andi

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] perf: Add munmap callback
  2018-10-24 16:23 ` [PATCH 1/2] perf: Add munmap callback Andi Kleen
@ 2018-10-24 16:32   ` Arnaldo Carvalho de Melo
  2018-10-24 18:12     ` Liang, Kan
  0 siblings, 1 reply; 21+ messages in thread
From: Arnaldo Carvalho de Melo @ 2018-10-24 16:32 UTC (permalink / raw)
  To: Andi Kleen
  Cc: kan.liang, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	linux-kernel, bp, Stephane Eranian

Em Wed, Oct 24, 2018 at 09:23:34AM -0700, Andi Kleen escreveu:
> > +void perf_event_munmap(void)
> > +{
> > +	struct perf_cpu_context *cpuctx;
> > +	unsigned long flags;
> > +	struct pmu *pmu;
> > +
> > +	local_irq_save(flags);
> > +	list_for_each_entry(cpuctx, this_cpu_ptr(&sched_cb_list), sched_cb_entry) {
> 
> Would be good have a fast path here that checks for the list being empty
> without disabling the interrupts. munmap can be somewhat hot. I think
> it's ok to make it slower with perf running, but we shouldn't impact
> it without perf.

Right, look at how its counterpart, perf_event_mmap() works:

void perf_event_mmap(struct vm_area_struct *vma)
{
        struct perf_mmap_event mmap_event;

        if (!atomic_read(&nr_mmap_events))
                return;
<SNIP>
}

- Arnaldo

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] perf: Add munmap callback
  2018-10-24 16:32   ` Arnaldo Carvalho de Melo
@ 2018-10-24 18:12     ` Liang, Kan
  2018-10-24 18:28       ` Andi Kleen
  2018-10-24 19:15       ` Arnaldo Carvalho de Melo
  0 siblings, 2 replies; 21+ messages in thread
From: Liang, Kan @ 2018-10-24 18:12 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Andi Kleen
  Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar, linux-kernel, bp,
	Stephane Eranian



On 10/24/2018 12:32 PM, Arnaldo Carvalho de Melo wrote:
> Em Wed, Oct 24, 2018 at 09:23:34AM -0700, Andi Kleen escreveu:
>>> +void perf_event_munmap(void)
>>> +{
>>> +	struct perf_cpu_context *cpuctx;
>>> +	unsigned long flags;
>>> +	struct pmu *pmu;
>>> +
>>> +	local_irq_save(flags);
>>> +	list_for_each_entry(cpuctx, this_cpu_ptr(&sched_cb_list), sched_cb_entry) {
>>
>> Would be good have a fast path here that checks for the list being empty
>> without disabling the interrupts. munmap can be somewhat hot. I think
>> it's ok to make it slower with perf running, but we shouldn't impact
>> it without perf.
> 
> Right, look at how its counterpart, perf_event_mmap() works:
> 
> void perf_event_mmap(struct vm_area_struct *vma)
> {
>          struct perf_mmap_event mmap_event;
> 
>          if (!atomic_read(&nr_mmap_events))
>                  return;
> <SNIP>
> }
> 

Thanks. I'll add the nr_mmap_events check in V2.

Thanks,
Kan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] perf: Add munmap callback
  2018-10-24 18:12     ` Liang, Kan
@ 2018-10-24 18:28       ` Andi Kleen
  2018-10-25  0:31         ` Peter Zijlstra
  2018-10-24 19:15       ` Arnaldo Carvalho de Melo
  1 sibling, 1 reply; 21+ messages in thread
From: Andi Kleen @ 2018-10-24 18:28 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Arnaldo Carvalho de Melo, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, linux-kernel, bp, Stephane Eranian

> > void perf_event_mmap(struct vm_area_struct *vma)
> > {
> >          struct perf_mmap_event mmap_event;
> > 
> >          if (!atomic_read(&nr_mmap_events))
> >                  return;
> > <SNIP>
> > }
> > 
> 
> Thanks. I'll add the nr_mmap_events check in V2.

No, that's the wrong check here. The PEBS flush is independent of mmap
events being requested.

If anything would need to check for any PEBS events active, which
would need a new counter.  I think the easiest is to just check if 
this_cpu_ptr(&sched_cb_list)
is empty, which should be good enough.

-Andi

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] perf: Add munmap callback
  2018-10-24 18:12     ` Liang, Kan
  2018-10-24 18:28       ` Andi Kleen
@ 2018-10-24 19:15       ` Arnaldo Carvalho de Melo
  1 sibling, 0 replies; 21+ messages in thread
From: Arnaldo Carvalho de Melo @ 2018-10-24 19:15 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Andi Kleen, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	linux-kernel, bp, Stephane Eranian

Em Wed, Oct 24, 2018 at 02:12:54PM -0400, Liang, Kan escreveu:
> 
> 
> On 10/24/2018 12:32 PM, Arnaldo Carvalho de Melo wrote:
> > Em Wed, Oct 24, 2018 at 09:23:34AM -0700, Andi Kleen escreveu:
> > > > +void perf_event_munmap(void)
> > > > +{
> > > > +	struct perf_cpu_context *cpuctx;
> > > > +	unsigned long flags;
> > > > +	struct pmu *pmu;
> > > > +
> > > > +	local_irq_save(flags);
> > > > +	list_for_each_entry(cpuctx, this_cpu_ptr(&sched_cb_list), sched_cb_entry) {
> > > 
> > > Would be good have a fast path here that checks for the list being empty
> > > without disabling the interrupts. munmap can be somewhat hot. I think
> > > it's ok to make it slower with perf running, but we shouldn't impact
> > > it without perf.
> > 
> > Right, look at how its counterpart, perf_event_mmap() works:
> > 
> > void perf_event_mmap(struct vm_area_struct *vma)
> > {
> >          struct perf_mmap_event mmap_event;
> > 
> >          if (!atomic_read(&nr_mmap_events))
> >                  return;
> > <SNIP>
> > }
> > 
> 
> Thanks. I'll add the nr_mmap_events check in V2.

That would be a nr_munmap_events, if this is tied to PERF_RECORD_MUNMAP
(right?), which it isn't right now, check Andi's response, mine was more
of a "hey, perf_event_mmap does an atomic check, before grabbing any
locks".

- Arnaldo

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] perf: Add munmap callback
  2018-10-24 15:11 [PATCH 1/2] perf: Add munmap callback kan.liang
  2018-10-24 15:11 ` [PATCH 2/2] perf/x86/intel: Fix missing physical address in large PEBS kan.liang
  2018-10-24 16:23 ` [PATCH 1/2] perf: Add munmap callback Andi Kleen
@ 2018-10-24 19:30 ` Stephane Eranian
  2018-10-25  0:23   ` Peter Zijlstra
  2018-11-01 14:09   ` Liang, Kan
  2018-10-25  0:29 ` Peter Zijlstra
  3 siblings, 2 replies; 21+ messages in thread
From: Stephane Eranian @ 2018-10-24 19:30 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Arnaldo Carvalho de Melo, LKML, Borislav Petkov, Andi Kleen

Hi,

On Wed, Oct 24, 2018 at 8:12 AM <kan.liang@linux.intel.com> wrote:
>
> From: Kan Liang <kan.liang@linux.intel.com>
>
> To calculate the physical address, perf needs to walk the pages tables.
> The related mapping may has already been removed from pages table in
> some cases (e.g. large PEBS). The virtual address recorded in the first
> PEBS records may already be unmapped before draining PEBS buffers.
>
> Add a munmap callback to notify the PMU of any unmapping, which only be
> invoked when the munmap is implemented.
>
The need for this new record type extends beyond physical address conversions
and PEBS. A long while ago, someone reported issues with symbolization related
to perf lacking munmap tracking. It had to do with vma merging. I think the
sequence of mmaps was as follows in the problematic case:
1.   addr1 = mmap(8192);
2.   munmap(addr1 + 4096, 4096)
3.   addr2 = mmap(addr1+4096, 4096)

If successful, that yields addr2 = addr1 + 4096 (could also get the
same without forcing the address).

In that case, if I recall correctly, the vma for 1st mapping (now at
4k) and that of the 2nd mapping (4k)
get merged into a single 8k vma and this is what perf_events will
record for PERF_RECORD_MMAP.
On the perf tool side, it is assumed that if two timestamped mappings
overlap then, the latter overrides
the former. In this case, perf would loose the mapping of the first
4kb and assume all symbols comes from
2nd mapping. Hopefully I got the scenario right. If so, then you'd
need PERF_RECORD_UNMAP to
disambiguate assuming the perf tool is modified accordingly.


>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> ---
>  include/linux/perf_event.h |  3 +++
>  kernel/events/core.c       | 25 +++++++++++++++++++++++++
>  mm/mmap.c                  |  1 +
>  3 files changed, 29 insertions(+)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 53c500f0ca79..7f0a9258ce1f 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -400,6 +400,7 @@ struct pmu {
>          */
>         void (*sched_task)              (struct perf_event_context *ctx,
>                                         bool sched_in);
> +       void (*munmap)                  (void);
>         /*
>          * PMU specific data size
>          */
> @@ -1113,6 +1114,7 @@ static inline void perf_event_task_sched_out(struct task_struct *prev,
>  }
>
>  extern void perf_event_mmap(struct vm_area_struct *vma);
> +extern void perf_event_munmap(void);
>  extern struct perf_guest_info_callbacks *perf_guest_cbs;
>  extern int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks);
>  extern int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks);
> @@ -1333,6 +1335,7 @@ static inline int perf_unregister_guest_info_callbacks
>  (struct perf_guest_info_callbacks *callbacks)                          { return 0; }
>
>  static inline void perf_event_mmap(struct vm_area_struct *vma)         { }
> +static inline void perf_event_munmap(void)                             { }
>  static inline void perf_event_exec(void)                               { }
>  static inline void perf_event_comm(struct task_struct *tsk, bool exec) { }
>  static inline void perf_event_namespaces(struct task_struct *tsk)      { }
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 5a97f34bc14c..00338d6fbed7 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3184,6 +3184,31 @@ static void perf_pmu_sched_task(struct task_struct *prev,
>         }
>  }
>
> +void perf_event_munmap(void)
> +{
> +       struct perf_cpu_context *cpuctx;
> +       unsigned long flags;
> +       struct pmu *pmu;
> +
> +       local_irq_save(flags);
> +       list_for_each_entry(cpuctx, this_cpu_ptr(&sched_cb_list), sched_cb_entry) {
> +               pmu = cpuctx->ctx.pmu;
> +
> +               if (!pmu->munmap)
> +                       continue;
> +
> +               perf_ctx_lock(cpuctx, cpuctx->task_ctx);
> +               perf_pmu_disable(pmu);
> +
> +               pmu->munmap();
> +
> +               perf_pmu_enable(pmu);
> +
> +               perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> +       }
> +       local_irq_restore(flags);
> +}
> +
>  static void perf_event_switch(struct task_struct *task,
>                               struct task_struct *next_prev, bool sched_in);
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 5f2b2b184c60..61978ad8c480 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2777,6 +2777,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
>         /*
>          * Remove the vma's, and unmap the actual pages
>          */
> +       perf_event_munmap();
>         detach_vmas_to_be_unmapped(mm, vma, prev, end);
>         unmap_region(mm, vma, prev, start, end);
>
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] perf: Add munmap callback
  2018-10-24 19:30 ` Stephane Eranian
@ 2018-10-25  0:23   ` Peter Zijlstra
  2018-10-25  0:25     ` Stephane Eranian
  2018-11-01 14:09   ` Liang, Kan
  1 sibling, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2018-10-25  0:23 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Liang, Kan, Thomas Gleixner, Ingo Molnar,
	Arnaldo Carvalho de Melo, LKML, Borislav Petkov, Andi Kleen

On Wed, Oct 24, 2018 at 12:30:52PM -0700, Stephane Eranian wrote:
> Hi,
> 
> On Wed, Oct 24, 2018 at 8:12 AM <kan.liang@linux.intel.com> wrote:
> >
> > From: Kan Liang <kan.liang@linux.intel.com>
> >
> > To calculate the physical address, perf needs to walk the pages tables.
> > The related mapping may has already been removed from pages table in
> > some cases (e.g. large PEBS). The virtual address recorded in the first
> > PEBS records may already be unmapped before draining PEBS buffers.
> >
> > Add a munmap callback to notify the PMU of any unmapping, which only be
> > invoked when the munmap is implemented.
> >
> The need for this new record type extends beyond physical address conversions
> and PEBS. A long while ago, someone reported issues with symbolization related
> to perf lacking munmap tracking. It had to do with vma merging. I think the
> sequence of mmaps was as follows in the problematic case:
> 1.   addr1 = mmap(8192);
> 2.   munmap(addr1 + 4096, 4096)
> 3.   addr2 = mmap(addr1+4096, 4096)
> 
> If successful, that yields addr2 = addr1 + 4096 (could also get the
> same without forcing the address).

That is actually a different problem. And you're right, we never did fix
that.

I agree with you that Kan's Changelog is somewhat cryptic; it took me at
least 3 times reading and looking at what the patches actually do to
understand :/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] perf: Add munmap callback
  2018-10-25  0:23   ` Peter Zijlstra
@ 2018-10-25  0:25     ` Stephane Eranian
  2018-10-25  0:34       ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Stephane Eranian @ 2018-10-25  0:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Liang, Kan, Thomas Gleixner, Ingo Molnar,
	Arnaldo Carvalho de Melo, LKML, Borislav Petkov, Andi Kleen

On Wed, Oct 24, 2018 at 5:23 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Oct 24, 2018 at 12:30:52PM -0700, Stephane Eranian wrote:
> > Hi,
> >
> > On Wed, Oct 24, 2018 at 8:12 AM <kan.liang@linux.intel.com> wrote:
> > >
> > > From: Kan Liang <kan.liang@linux.intel.com>
> > >
> > > To calculate the physical address, perf needs to walk the pages tables.
> > > The related mapping may has already been removed from pages table in
> > > some cases (e.g. large PEBS). The virtual address recorded in the first
> > > PEBS records may already be unmapped before draining PEBS buffers.
> > >
> > > Add a munmap callback to notify the PMU of any unmapping, which only be
> > > invoked when the munmap is implemented.
> > >
> > The need for this new record type extends beyond physical address conversions
> > and PEBS. A long while ago, someone reported issues with symbolization related
> > to perf lacking munmap tracking. It had to do with vma merging. I think the
> > sequence of mmaps was as follows in the problematic case:
> > 1.   addr1 = mmap(8192);
> > 2.   munmap(addr1 + 4096, 4096)
> > 3.   addr2 = mmap(addr1+4096, 4096)
> >
> > If successful, that yields addr2 = addr1 + 4096 (could also get the
> > same without forcing the address).
>
> That is actually a different problem. And you're right, we never did fix
> that.
>
it is a different problem but the solution is the same: PERF_RECORD_UNMAP!
That's why I mentioned it here. To show that this is needed for more
than one reason ;-)


> I agree with you that Kan's Changelog is somewhat cryptic; it took me at
> least 3 times reading and looking at what the patches actually do to
> understand :/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] perf: Add munmap callback
  2018-10-24 15:11 [PATCH 1/2] perf: Add munmap callback kan.liang
                   ` (2 preceding siblings ...)
  2018-10-24 19:30 ` Stephane Eranian
@ 2018-10-25  0:29 ` Peter Zijlstra
  2018-10-25 14:00   ` Liang, Kan
  3 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2018-10-25  0:29 UTC (permalink / raw)
  To: kan.liang; +Cc: tglx, mingo, acme, linux-kernel, bp, ak, eranian

On Wed, Oct 24, 2018 at 08:11:15AM -0700, kan.liang@linux.intel.com wrote:
> +void perf_event_munmap(void)
> +{
> +	struct perf_cpu_context *cpuctx;
> +	unsigned long flags;
> +	struct pmu *pmu;
> +
> +	local_irq_save(flags);

It is impossible to get here with IRQs already disabled.

> +	list_for_each_entry(cpuctx, this_cpu_ptr(&sched_cb_list), sched_cb_entry) {
> +		pmu = cpuctx->ctx.pmu;
> +
> +		if (!pmu->munmap)
> +			continue;
> +
> +		perf_ctx_lock(cpuctx, cpuctx->task_ctx);
> +		perf_pmu_disable(pmu);
> +
> +		pmu->munmap();
> +
> +		perf_pmu_enable(pmu);
> +
> +		perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> +	}
> +	local_irq_restore(flags);
> +}
> +
>  static void perf_event_switch(struct task_struct *task,
>  			      struct task_struct *next_prev, bool sched_in);
>  
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 5f2b2b184c60..61978ad8c480 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2777,6 +2777,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
>  	/*
>  	 * Remove the vma's, and unmap the actual pages
>  	 */
> +	perf_event_munmap();

I think that if you add the munmap hook, you should do it right and at
least do it such that we can solve the other munmap problem.

>  	detach_vmas_to_be_unmapped(mm, vma, prev, end);
>  	unmap_region(mm, vma, prev, start, end);
>  
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] perf: Add munmap callback
  2018-10-24 18:28       ` Andi Kleen
@ 2018-10-25  0:31         ` Peter Zijlstra
  0 siblings, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2018-10-25  0:31 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Liang, Kan, Arnaldo Carvalho de Melo, Thomas Gleixner,
	Ingo Molnar, linux-kernel, bp, Stephane Eranian

On Wed, Oct 24, 2018 at 11:28:54AM -0700, Andi Kleen wrote:
> > > void perf_event_mmap(struct vm_area_struct *vma)
> > > {
> > >          struct perf_mmap_event mmap_event;
> > > 
> > >          if (!atomic_read(&nr_mmap_events))
> > >                  return;
> > > <SNIP>
> > > }
> > > 
> > 
> > Thanks. I'll add the nr_mmap_events check in V2.
> 
> No, that's the wrong check here. The PEBS flush is independent of mmap
> events being requested.
> 
> If anything would need to check for any PEBS events active, which
> would need a new counter.  I think the easiest is to just check if 
> this_cpu_ptr(&sched_cb_list)
> is empty, which should be good enough.

It is just the CLI+STI, not PUSHF;CLI+POPF that are required and that is
a lot cheaper already. Also, you need to have preemption disabled in
order to check the per-cpu cb list.

So I don't think it really makes much sense to try and frob a fast path
in there.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] perf: Add munmap callback
  2018-10-25  0:25     ` Stephane Eranian
@ 2018-10-25  0:34       ` Peter Zijlstra
  2018-10-25  0:44         ` Stephane Eranian
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2018-10-25  0:34 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Liang, Kan, Thomas Gleixner, Ingo Molnar,
	Arnaldo Carvalho de Melo, LKML, Borislav Petkov, Andi Kleen

On Wed, Oct 24, 2018 at 05:25:59PM -0700, Stephane Eranian wrote:
> On Wed, Oct 24, 2018 at 5:23 PM Peter Zijlstra <peterz@infradead.org> wrote:

> > That is actually a different problem. And you're right, we never did fix
> > that.
> >
> it is a different problem but the solution is the same: PERF_RECORD_UNMAP!

But he's not actually doing PERF_RECORD_UNMAP, nor is his perf_munmap()
hook sufficient to actually generate those.

Now I agree that if he's going to do an munmap hook, he should do it
'right' and at the very least allow for PERF_RECORD_UNMAP to be done,
but ideally simply pick up and finish that patch we had back then.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] perf: Add munmap callback
  2018-10-25  0:34       ` Peter Zijlstra
@ 2018-10-25  0:44         ` Stephane Eranian
  0 siblings, 0 replies; 21+ messages in thread
From: Stephane Eranian @ 2018-10-25  0:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Liang, Kan, Thomas Gleixner, Ingo Molnar,
	Arnaldo Carvalho de Melo, LKML, Borislav Petkov, Andi Kleen

On Wed, Oct 24, 2018 at 5:34 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Oct 24, 2018 at 05:25:59PM -0700, Stephane Eranian wrote:
> > On Wed, Oct 24, 2018 at 5:23 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> > > That is actually a different problem. And you're right, we never did fix
> > > that.
> > >
> > it is a different problem but the solution is the same: PERF_RECORD_UNMAP!
>
> But he's not actually doing PERF_RECORD_UNMAP, nor is his perf_munmap()
> hook sufficient to actually generate those.
>
You're right. But I saw that and thought, this is going in the right
direction: tracking munmap.
I agree with you that we should use this opportunity to track unmap
for his purpose but also
the issue I raised, which needs a new record type based on the new unmap hook.

> Now I agree that if he's going to do an munmap hook, he should do it
> 'right' and at the very least allow for PERF_RECORD_UNMAP to be done,
> but ideally simply pick up and finish that patch we had back then.
Agreed.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] perf: Add munmap callback
  2018-10-25  0:29 ` Peter Zijlstra
@ 2018-10-25 14:00   ` Liang, Kan
  2018-10-30 12:51     ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Liang, Kan @ 2018-10-25 14:00 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: tglx, mingo, acme, linux-kernel, bp, ak, eranian



On 10/24/2018 8:29 PM, Peter Zijlstra wrote:
> On Wed, Oct 24, 2018 at 08:11:15AM -0700, kan.liang@linux.intel.com wrote:
>> +void perf_event_munmap(void)
>> +{
>> +	struct perf_cpu_context *cpuctx;
>> +	unsigned long flags;
>> +	struct pmu *pmu;
>> +
>> +	local_irq_save(flags);
> 
> It is impossible to get here with IRQs already disabled.

I don't think so. Based on my test, IRQs are still enabled here. I once 
observed dead lock with my stress test. So I have to explicitly disable 
the IRQs here.

> 
>> +	list_for_each_entry(cpuctx, this_cpu_ptr(&sched_cb_list), sched_cb_entry) {
>> +		pmu = cpuctx->ctx.pmu;
>> +
>> +		if (!pmu->munmap)
>> +			continue;
>> +
>> +		perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>> +		perf_pmu_disable(pmu);
>> +
>> +		pmu->munmap();
>> +
>> +		perf_pmu_enable(pmu);
>> +
>> +		perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
>> +	}
>> +	local_irq_restore(flags);
>> +}
>> +
>>   static void perf_event_switch(struct task_struct *task,
>>   			      struct task_struct *next_prev, bool sched_in);
>>   
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index 5f2b2b184c60..61978ad8c480 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -2777,6 +2777,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
>>   	/*
>>   	 * Remove the vma's, and unmap the actual pages
>>   	 */
>> +	perf_event_munmap();
> 
> I think that if you add the munmap hook, you should do it right and at
> least do it such that we can solve the other munmap problem.
>

Is this patch you mentioned?
https://lkml.org/lkml/2017/1/27/452

I will take a look and find a right place for both problems.

Thanks,
Kan
>>   	detach_vmas_to_be_unmapped(mm, vma, prev, end);
>>   	unmap_region(mm, vma, prev, start, end);
>>   
>> -- 
>> 2.17.1
>>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] perf: Add munmap callback
  2018-10-25 14:00   ` Liang, Kan
@ 2018-10-30 12:51     ` Peter Zijlstra
  0 siblings, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2018-10-30 12:51 UTC (permalink / raw)
  To: Liang, Kan; +Cc: tglx, mingo, acme, linux-kernel, bp, ak, eranian

On Thu, Oct 25, 2018 at 10:00:07AM -0400, Liang, Kan wrote:
> Is this patch you mentioned?
> https://lkml.org/lkml/2017/1/27/452

Yes.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] perf: Add munmap callback
  2018-10-24 19:30 ` Stephane Eranian
  2018-10-25  0:23   ` Peter Zijlstra
@ 2018-11-01 14:09   ` Liang, Kan
  2018-11-05 10:59     ` Stephane Eranian
  1 sibling, 1 reply; 21+ messages in thread
From: Liang, Kan @ 2018-11-01 14:09 UTC (permalink / raw)
  To: Stephane Eranian, Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, Arnaldo Carvalho de Melo, LKML,
	Borislav Petkov, Andi Kleen



On 10/24/2018 3:30 PM, Stephane Eranian wrote:
> The need for this new record type extends beyond physical address conversions
> and PEBS. A long while ago, someone reported issues with symbolization related
> to perf lacking munmap tracking. It had to do with vma merging. I think the
> sequence of mmaps was as follows in the problematic case:
> 1.   addr1 = mmap(8192);
> 2.   munmap(addr1 + 4096, 4096)
> 3.   addr2 = mmap(addr1+4096, 4096)
> 
> If successful, that yields addr2 = addr1 + 4096 (could also get the
> same without forcing the address).
> 
> In that case, if I recall correctly, the vma for 1st mapping (now at
> 4k) and that of the 2nd mapping (4k)
> get merged into a single 8k vma and this is what perf_events will
> record for PERF_RECORD_MMAP.
> On the perf tool side, it is assumed that if two timestamped mappings
> overlap then, the latter overrides
> the former. In this case, perf would loose the mapping of the first
> 4kb and assume all symbols comes from
> 2nd mapping. Hopefully I got the scenario right. If so, then you'd
> need PERF_RECORD_UNMAP to
> disambiguate assuming the perf tool is modified accordingly.
> 

Hi Stephane and Peter,

I went through the link(https://lkml.org/lkml/2017/1/27/452). I'm trying 
to understand the problematic case.

It looks like the issue can only be triggered by perf inject --jit. 
Because it can inject extra MMAP events.
As my understanding,  Linux kernel only try to merge VMAs if they are 
both from anon or they are both from the same file. --jit breaks the 
rule, and makes the merged VMA partly from anon, partly from file.
Now, there is a new MMAP event which range covers the modified VMA.
Without the help of MUNMAP event, perf tool have no idea if the new one 
is a newly merged VMA (modified VMA + a new VMA) or a brand new VMA.
Current code just simply overwrite the modified VMAs. The VMA 
information which --jit injected may be lost. The symbolization may be 
lost as well.

Except --jit, the VMAs information should be consistent between kernel 
and perf tools. We shouldn't observe the problem. MUNMAP event is not 
needed.

Is my understanding correct?

Do you have a test case for the problem?

Thanks,
Kan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] perf: Add munmap callback
  2018-11-01 14:09   ` Liang, Kan
@ 2018-11-05 10:59     ` Stephane Eranian
  2018-11-05 15:43       ` Liang, Kan
  0 siblings, 1 reply; 21+ messages in thread
From: Stephane Eranian @ 2018-11-05 10:59 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Arnaldo Carvalho de Melo, LKML, Borislav Petkov, Andi Kleen

Hi Kan,

I built a small test case for you to demonstrate the issue for code and data.
Compile the test program and then do:
For text:
$ perf record  ./mmap
$ perf report -D | fgrep MMAP2

The test program mmaps 2 pages, unmaps the second, and remap 1 page
over the freed space.
If you look at the MMAP2 record, you will not be able to reconstruct
what happened and perf will
get confused should it try to symbolize from the address range

With Text:
PERF_RECORD_MMAP2 5937/5937: [0x400000(0x1000) @ 0 08:01 400938
824817672]: r-xp /home/eranian/mmap
PERF_RECORD_MMAP2 5937/5937: [0x7f7c01019000(0x2000) @ 0x7f7c01019000
00:00 0 0]: rwxp //anon
PERF_RECORD_MMAP2 5937/5937: [0x7f7c01019000(0x2000) @ 0x7f7c01019000
00:00 0 0]: rwxp //anon

^^^^^^^^^^^^^^^^^^^^^^^^ captures the whole VMA but not the mapping
change in user space

For data:
$ perf record -d  ./mmap
$ perf report -D | fgrep MMAP2
With data:
PERF_RECORD_MMAP2 6430/6430: [0x400000(0x1000) @ 0 08:01 400938
3278843184]: r-xp /home/eranian/mmap
PERF_RECORD_MMAP2 6430/6430: [0x7f4aa704b000(0x2000) @ 0x7f4aa704b000
00:00 0 0]: rw-p //anon
PERF_RECORD_MMAP2 6430/6430: [0x7f4aa704b000(0x2000) @ 0x7f4aa704b000
00:00 0 0]: rw-p //anon

Same test case with data.
Perf will think the entire 2 pages have been replaced when in fact
only the second has.
I believe the problem is likely to impact data and jitted code cache

#include <sys/types.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <err.h>
#include <getopt.h>

int main(int argc, char **argv)
{
void *addr1, *addr2;
size_t pgsz = sysconf(_SC_PAGESIZE);
int n = 2;
int ret;
int c, mode = 0;

while ((c = getopt(argc, argv, "hd")) != -1) {
switch (c) {
case 'h':
printf("[-h]\tget this help\n");
printf("[-d]\tuse data mmaps (no PROT_EXEC)\n");
return 0;
case 'd':
mode = PROT_EXEC;
break;
default:
errx(1, "unknown option");
}
}
/* default to data */
if (mode == 0)
mode = PROT_WRITE;

/*
* mmap 2 contiugous pages
*/
addr1 = mmap(NULL, n * pgsz, PROT_READ| mode, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
if (addr1 == (void *)MAP_FAILED)
err(1, "mmap 1 failed");

printf("addr1=[%p : %p]\n", addr1, addr1 + n * pgsz);

/*
* unmap only the second page
*/
ret = munmap(addr1 + pgsz, pgsz);
if (ret == -1)
err(1, "munmp failed");

/*
* mmap 1 page at the location of the unmap page (should reuse virtual space)
* This creates a continuous region built from two mmaps and
potentially two different sources
* especially with jitted runtimes
*/
addr2 = mmap(addr1 + pgsz, 1 * pgsz, PROT_READ|PROT_WRITE | mode,
MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);

printf("addr2=%p\n", addr2);

if (addr2 == (void *)MAP_FAILED)
err(1, "mmap 2 failed");
if (addr2 != (addr1 + pgsz))
errx(1, "wrong mmap2 address");

sleep(1);

return 0;
}

On Thu, Nov 1, 2018 at 7:10 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>
>
>
> On 10/24/2018 3:30 PM, Stephane Eranian wrote:
> > The need for this new record type extends beyond physical address conversions
> > and PEBS. A long while ago, someone reported issues with symbolization related
> > to perf lacking munmap tracking. It had to do with vma merging. I think the
> > sequence of mmaps was as follows in the problematic case:
> > 1.   addr1 = mmap(8192);
> > 2.   munmap(addr1 + 4096, 4096)
> > 3.   addr2 = mmap(addr1+4096, 4096)
> >
> > If successful, that yields addr2 = addr1 + 4096 (could also get the
> > same without forcing the address).
> >
> > In that case, if I recall correctly, the vma for 1st mapping (now at
> > 4k) and that of the 2nd mapping (4k)
> > get merged into a single 8k vma and this is what perf_events will
> > record for PERF_RECORD_MMAP.
> > On the perf tool side, it is assumed that if two timestamped mappings
> > overlap then, the latter overrides
> > the former. In this case, perf would loose the mapping of the first
> > 4kb and assume all symbols comes from
> > 2nd mapping. Hopefully I got the scenario right. If so, then you'd
> > need PERF_RECORD_UNMAP to
> > disambiguate assuming the perf tool is modified accordingly.
> >
>
> Hi Stephane and Peter,
>
> I went through the link(https://lkml.org/lkml/2017/1/27/452). I'm trying
> to understand the problematic case.
>
> It looks like the issue can only be triggered by perf inject --jit.
> Because it can inject extra MMAP events.
> As my understanding,  Linux kernel only try to merge VMAs if they are
> both from anon or they are both from the same file. --jit breaks the
> rule, and makes the merged VMA partly from anon, partly from file.
> Now, there is a new MMAP event which range covers the modified VMA.
> Without the help of MUNMAP event, perf tool have no idea if the new one
> is a newly merged VMA (modified VMA + a new VMA) or a brand new VMA.
> Current code just simply overwrite the modified VMAs. The VMA
> information which --jit injected may be lost. The symbolization may be
> lost as well.
>
> Except --jit, the VMAs information should be consistent between kernel
> and perf tools. We shouldn't observe the problem. MUNMAP event is not
> needed.
>
> Is my understanding correct?
>
> Do you have a test case for the problem?
>
> Thanks,
> Kan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] perf: Add munmap callback
  2018-11-05 10:59     ` Stephane Eranian
@ 2018-11-05 15:43       ` Liang, Kan
  2018-11-06 15:00         ` Stephane Eranian
  0 siblings, 1 reply; 21+ messages in thread
From: Liang, Kan @ 2018-11-05 15:43 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Arnaldo Carvalho de Melo, LKML, Borislav Petkov, Andi Kleen



On 11/5/2018 5:59 AM, Stephane Eranian wrote:
> Hi Kan,
> 
> I built a small test case for you to demonstrate the issue for code and data.
> Compile the test program and then do:
> For text:
> $ perf record  ./mmap
> $ perf report -D | fgrep MMAP2
> 
> The test program mmaps 2 pages, unmaps the second, and remap 1 page
> over the freed space.
> If you look at the MMAP2 record, you will not be able to reconstruct
> what happened and perf will
> get confused should it try to symbolize from the address range
> 
> With Text:
> PERF_RECORD_MMAP2 5937/5937: [0x400000(0x1000) @ 0 08:01 400938
> 824817672]: r-xp /home/eranian/mmap
> PERF_RECORD_MMAP2 5937/5937: [0x7f7c01019000(0x2000) @ 0x7f7c01019000
> 00:00 0 0]: rwxp //anon
> PERF_RECORD_MMAP2 5937/5937: [0x7f7c01019000(0x2000) @ 0x7f7c01019000
> 00:00 0 0]: rwxp //anon
> 
> ^^^^^^^^^^^^^^^^^^^^^^^^ captures the whole VMA but not the mapping
> change in user space
> 
> For data:
> $ perf record -d  ./mmap
> $ perf report -D | fgrep MMAP2
> With data:
> PERF_RECORD_MMAP2 6430/6430: [0x400000(0x1000) @ 0 08:01 400938
> 3278843184]: r-xp /home/eranian/mmap
> PERF_RECORD_MMAP2 6430/6430: [0x7f4aa704b000(0x2000) @ 0x7f4aa704b000
> 00:00 0 0]: rw-p //anon
> PERF_RECORD_MMAP2 6430/6430: [0x7f4aa704b000(0x2000) @ 0x7f4aa704b000
> 00:00 0 0]: rw-p //anon
> 
> Same test case with data.
> Perf will think the entire 2 pages have been replaced when in fact
> only the second has.
> I believe the problem is likely to impact data and jitted code cache
> 
> #include <sys/types.h>
> #include <stdio.h>
> #include <unistd.h>
> #include <sys/mman.h>
> #include <err.h>
> #include <getopt.h>
> 
> int main(int argc, char **argv)
> {
> void *addr1, *addr2;
> size_t pgsz = sysconf(_SC_PAGESIZE);
> int n = 2;
> int ret;
> int c, mode = 0;
> 
> while ((c = getopt(argc, argv, "hd")) != -1) {
> switch (c) {
> case 'h':
> printf("[-h]\tget this help\n");
> printf("[-d]\tuse data mmaps (no PROT_EXEC)\n");
> return 0;
> case 'd':
> mode = PROT_EXEC;
> break;
> default:
> errx(1, "unknown option");
> }
> }
> /* default to data */
> if (mode == 0)
> mode = PROT_WRITE;
> 
> /*
> * mmap 2 contiugous pages
> */
> addr1 = mmap(NULL, n * pgsz, PROT_READ| mode, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
> if (addr1 == (void *)MAP_FAILED)
> err(1, "mmap 1 failed");
> 
> printf("addr1=[%p : %p]\n", addr1, addr1 + n * pgsz);
> 
> /*
> * unmap only the second page
> */
> ret = munmap(addr1 + pgsz, pgsz);
> if (ret == -1)
> err(1, "munmp failed");
> 
> /*
> * mmap 1 page at the location of the unmap page (should reuse virtual space)
> * This creates a continuous region built from two mmaps and
> potentially two different sources
> * especially with jitted runtimes
> */

The two mmaps are both anon. As my understanding, we cannot symbolize 
from the anonymous address, can we?
If we cannot, why we have to distinguish with them? I think we do not 
need to know their sources for symbolization.

As my understanding, only --jit can inject MMAP event, which tag an 
anon. Perf can symbolize the address after that. Then the unmap is needed.

Thanks,
Kan
> addr2 = mmap(addr1 + pgsz, 1 * pgsz, PROT_READ|PROT_WRITE | mode,
> MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
> 
> printf("addr2=%p\n", addr2);
> 
> if (addr2 == (void *)MAP_FAILED)
> err(1, "mmap 2 failed");
> if (addr2 != (addr1 + pgsz))
> errx(1, "wrong mmap2 address");
> 
> sleep(1);
> 
> return 0;
> }
> 
> On Thu, Nov 1, 2018 at 7:10 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>
>>
>>
>> On 10/24/2018 3:30 PM, Stephane Eranian wrote:
>>> The need for this new record type extends beyond physical address conversions
>>> and PEBS. A long while ago, someone reported issues with symbolization related
>>> to perf lacking munmap tracking. It had to do with vma merging. I think the
>>> sequence of mmaps was as follows in the problematic case:
>>> 1.   addr1 = mmap(8192);
>>> 2.   munmap(addr1 + 4096, 4096)
>>> 3.   addr2 = mmap(addr1+4096, 4096)
>>>
>>> If successful, that yields addr2 = addr1 + 4096 (could also get the
>>> same without forcing the address).
>>>
>>> In that case, if I recall correctly, the vma for 1st mapping (now at
>>> 4k) and that of the 2nd mapping (4k)
>>> get merged into a single 8k vma and this is what perf_events will
>>> record for PERF_RECORD_MMAP.
>>> On the perf tool side, it is assumed that if two timestamped mappings
>>> overlap then, the latter overrides
>>> the former. In this case, perf would loose the mapping of the first
>>> 4kb and assume all symbols comes from
>>> 2nd mapping. Hopefully I got the scenario right. If so, then you'd
>>> need PERF_RECORD_UNMAP to
>>> disambiguate assuming the perf tool is modified accordingly.
>>>
>>
>> Hi Stephane and Peter,
>>
>> I went through the link(https://lkml.org/lkml/2017/1/27/452). I'm trying
>> to understand the problematic case.
>>
>> It looks like the issue can only be triggered by perf inject --jit.
>> Because it can inject extra MMAP events.
>> As my understanding,  Linux kernel only try to merge VMAs if they are
>> both from anon or they are both from the same file. --jit breaks the
>> rule, and makes the merged VMA partly from anon, partly from file.
>> Now, there is a new MMAP event which range covers the modified VMA.
>> Without the help of MUNMAP event, perf tool have no idea if the new one
>> is a newly merged VMA (modified VMA + a new VMA) or a brand new VMA.
>> Current code just simply overwrite the modified VMAs. The VMA
>> information which --jit injected may be lost. The symbolization may be
>> lost as well.
>>
>> Except --jit, the VMAs information should be consistent between kernel
>> and perf tools. We shouldn't observe the problem. MUNMAP event is not
>> needed.
>>
>> Is my understanding correct?
>>
>> Do you have a test case for the problem?
>>
>> Thanks,
>> Kan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] perf: Add munmap callback
  2018-11-05 15:43       ` Liang, Kan
@ 2018-11-06 15:00         ` Stephane Eranian
  2018-11-06 16:47           ` Liang, Kan
  0 siblings, 1 reply; 21+ messages in thread
From: Stephane Eranian @ 2018-11-06 15:00 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Arnaldo Carvalho de Melo, LKML, Borislav Petkov, Andi Kleen

On Mon, Nov 5, 2018 at 7:43 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>
>
>
> On 11/5/2018 5:59 AM, Stephane Eranian wrote:
> > Hi Kan,
> >
> > I built a small test case for you to demonstrate the issue for code and data.
> > Compile the test program and then do:
> > For text:
> > $ perf record  ./mmap
> > $ perf report -D | fgrep MMAP2
> >
> > The test program mmaps 2 pages, unmaps the second, and remap 1 page
> > over the freed space.
> > If you look at the MMAP2 record, you will not be able to reconstruct
> > what happened and perf will
> > get confused should it try to symbolize from the address range
> >
> > With Text:
> > PERF_RECORD_MMAP2 5937/5937: [0x400000(0x1000) @ 0 08:01 400938
> > 824817672]: r-xp /home/eranian/mmap
> > PERF_RECORD_MMAP2 5937/5937: [0x7f7c01019000(0x2000) @ 0x7f7c01019000
> > 00:00 0 0]: rwxp //anon
> > PERF_RECORD_MMAP2 5937/5937: [0x7f7c01019000(0x2000) @ 0x7f7c01019000
> > 00:00 0 0]: rwxp //anon
> >
> > ^^^^^^^^^^^^^^^^^^^^^^^^ captures the whole VMA but not the mapping
> > change in user space
> >
> > For data:
> > $ perf record -d  ./mmap
> > $ perf report -D | fgrep MMAP2
> > With data:
> > PERF_RECORD_MMAP2 6430/6430: [0x400000(0x1000) @ 0 08:01 400938
> > 3278843184]: r-xp /home/eranian/mmap
> > PERF_RECORD_MMAP2 6430/6430: [0x7f4aa704b000(0x2000) @ 0x7f4aa704b000
> > 00:00 0 0]: rw-p //anon
> > PERF_RECORD_MMAP2 6430/6430: [0x7f4aa704b000(0x2000) @ 0x7f4aa704b000
> > 00:00 0 0]: rw-p //anon
> >
> > Same test case with data.
> > Perf will think the entire 2 pages have been replaced when in fact
> > only the second has.
> > I believe the problem is likely to impact data and jitted code cache
> >
> > #include <sys/types.h>
> > #include <stdio.h>
> > #include <unistd.h>
> > #include <sys/mman.h>
> > #include <err.h>
> > #include <getopt.h>
> >
> > int main(int argc, char **argv)
> > {
> > void *addr1, *addr2;
> > size_t pgsz = sysconf(_SC_PAGESIZE);
> > int n = 2;
> > int ret;
> > int c, mode = 0;
> >
> > while ((c = getopt(argc, argv, "hd")) != -1) {
> > switch (c) {
> > case 'h':
> > printf("[-h]\tget this help\n");
> > printf("[-d]\tuse data mmaps (no PROT_EXEC)\n");
> > return 0;
> > case 'd':
> > mode = PROT_EXEC;
> > break;
> > default:
> > errx(1, "unknown option");
> > }
> > }
> > /* default to data */
> > if (mode == 0)
> > mode = PROT_WRITE;
> >
> > /*
> > * mmap 2 contiugous pages
> > */
> > addr1 = mmap(NULL, n * pgsz, PROT_READ| mode, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
> > if (addr1 == (void *)MAP_FAILED)
> > err(1, "mmap 1 failed");
> >
> > printf("addr1=[%p : %p]\n", addr1, addr1 + n * pgsz);
> >
> > /*
> > * unmap only the second page
> > */
> > ret = munmap(addr1 + pgsz, pgsz);
> > if (ret == -1)
> > err(1, "munmp failed");
> >
> > /*
> > * mmap 1 page at the location of the unmap page (should reuse virtual space)
> > * This creates a continuous region built from two mmaps and
> > potentially two different sources
> > * especially with jitted runtimes
> > */
>
> The two mmaps are both anon. As my understanding, we cannot symbolize
> from the anonymous address, can we?
Can't we build the same test case using an actual file mapping (both
mmap from the same file)?

> If we cannot, why we have to distinguish with them? I think we do not
> need to know their sources for symbolization.
>
> As my understanding, only --jit can inject MMAP event, which tag an
> anon. Perf can symbolize the address after that. Then the unmap is needed.
>
Yes,  perf inject --jit injects timestamped MMAP2 records to cover the
jitted regions
which helps symbolize anons.

> Thanks,
> Kan
> > addr2 = mmap(addr1 + pgsz, 1 * pgsz, PROT_READ|PROT_WRITE | mode,
> > MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
> >
> > printf("addr2=%p\n", addr2);
> >
> > if (addr2 == (void *)MAP_FAILED)
> > err(1, "mmap 2 failed");
> > if (addr2 != (addr1 + pgsz))
> > errx(1, "wrong mmap2 address");
> >
> > sleep(1);
> >
> > return 0;
> > }
> >
> > On Thu, Nov 1, 2018 at 7:10 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> >>
> >>
> >>
> >> On 10/24/2018 3:30 PM, Stephane Eranian wrote:
> >>> The need for this new record type extends beyond physical address conversions
> >>> and PEBS. A long while ago, someone reported issues with symbolization related
> >>> to perf lacking munmap tracking. It had to do with vma merging. I think the
> >>> sequence of mmaps was as follows in the problematic case:
> >>> 1.   addr1 = mmap(8192);
> >>> 2.   munmap(addr1 + 4096, 4096)
> >>> 3.   addr2 = mmap(addr1+4096, 4096)
> >>>
> >>> If successful, that yields addr2 = addr1 + 4096 (could also get the
> >>> same without forcing the address).
> >>>
> >>> In that case, if I recall correctly, the vma for 1st mapping (now at
> >>> 4k) and that of the 2nd mapping (4k)
> >>> get merged into a single 8k vma and this is what perf_events will
> >>> record for PERF_RECORD_MMAP.
> >>> On the perf tool side, it is assumed that if two timestamped mappings
> >>> overlap then, the latter overrides
> >>> the former. In this case, perf would loose the mapping of the first
> >>> 4kb and assume all symbols comes from
> >>> 2nd mapping. Hopefully I got the scenario right. If so, then you'd
> >>> need PERF_RECORD_UNMAP to
> >>> disambiguate assuming the perf tool is modified accordingly.
> >>>
> >>
> >> Hi Stephane and Peter,
> >>
> >> I went through the link(https://lkml.org/lkml/2017/1/27/452). I'm trying
> >> to understand the problematic case.
> >>
> >> It looks like the issue can only be triggered by perf inject --jit.
> >> Because it can inject extra MMAP events.
> >> As my understanding,  Linux kernel only try to merge VMAs if they are
> >> both from anon or they are both from the same file. --jit breaks the
> >> rule, and makes the merged VMA partly from anon, partly from file.
> >> Now, there is a new MMAP event which range covers the modified VMA.
> >> Without the help of MUNMAP event, perf tool have no idea if the new one
> >> is a newly merged VMA (modified VMA + a new VMA) or a brand new VMA.
> >> Current code just simply overwrite the modified VMAs. The VMA
> >> information which --jit injected may be lost. The symbolization may be
> >> lost as well.
> >>
> >> Except --jit, the VMAs information should be consistent between kernel
> >> and perf tools. We shouldn't observe the problem. MUNMAP event is not
> >> needed.
> >>
> >> Is my understanding correct?
> >>
> >> Do you have a test case for the problem?
> >>
> >> Thanks,
> >> Kan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] perf: Add munmap callback
  2018-11-06 15:00         ` Stephane Eranian
@ 2018-11-06 16:47           ` Liang, Kan
  0 siblings, 0 replies; 21+ messages in thread
From: Liang, Kan @ 2018-11-06 16:47 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Arnaldo Carvalho de Melo, LKML, Borislav Petkov, Andi Kleen



On 11/6/2018 10:00 AM, Stephane Eranian wrote:
>>> /*
>>> * mmap 1 page at the location of the unmap page (should reuse virtual space)
>>> * This creates a continuous region built from two mmaps and
>>> potentially two different sources
>>> * especially with jitted runtimes
>>> */
>> The two mmaps are both anon. As my understanding, we cannot symbolize
>> from the anonymous address, can we?
> Can't we build the same test case using an actual file mapping (both
> mmap from the same file)?
> 

If they are from same file, both mmap have same symbolization name.
We don't need to distinguish them either.

I thought the symbolization issue can only happen when two mmaps are 
from different sources. Right?

Thanks,
Kan

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2018-11-06 16:47 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-24 15:11 [PATCH 1/2] perf: Add munmap callback kan.liang
2018-10-24 15:11 ` [PATCH 2/2] perf/x86/intel: Fix missing physical address in large PEBS kan.liang
2018-10-24 16:23 ` [PATCH 1/2] perf: Add munmap callback Andi Kleen
2018-10-24 16:32   ` Arnaldo Carvalho de Melo
2018-10-24 18:12     ` Liang, Kan
2018-10-24 18:28       ` Andi Kleen
2018-10-25  0:31         ` Peter Zijlstra
2018-10-24 19:15       ` Arnaldo Carvalho de Melo
2018-10-24 19:30 ` Stephane Eranian
2018-10-25  0:23   ` Peter Zijlstra
2018-10-25  0:25     ` Stephane Eranian
2018-10-25  0:34       ` Peter Zijlstra
2018-10-25  0:44         ` Stephane Eranian
2018-11-01 14:09   ` Liang, Kan
2018-11-05 10:59     ` Stephane Eranian
2018-11-05 15:43       ` Liang, Kan
2018-11-06 15:00         ` Stephane Eranian
2018-11-06 16:47           ` Liang, Kan
2018-10-25  0:29 ` Peter Zijlstra
2018-10-25 14:00   ` Liang, Kan
2018-10-30 12:51     ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).