* [PATCH V8 0/4] Add the page size in the perf record (kernel)
@ 2020-09-21 15:26 kan.liang
  2020-09-21 15:26 ` [PATCH V8 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE kan.liang
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: kan.liang @ 2020-09-21 15:26 UTC (permalink / raw)
  To: peterz, mingo, acme, linux-kernel
  Cc: mark.rutland, alexander.shishkin, jolsa, eranian, ak,
	dave.hansen, kirill.shutemov, mpe, benh, paulus, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Changes since V7
- Use active_mm to replace mm and init_mm
- Update the commit message of patch 1

Changes since V6
- Return the MMU page size of a given virtual address, not the kernel
  software page size
- Add PERF_SAMPLE_DATA_PAGE_SIZE support for Power
- Allow large PEBS for PERF_SAMPLE_CODE_PAGE_SIZE
- Only include the kernel patches. The perf tool patches will be posted
  later separately once the kernel patches are accepted.

Changes since V5
- Introduce a new universal page walker for the page size in the perf
  subsystem.
- Rebased on Peter's tree.

Current perf can report both virtual addresses and physical addresses,
but not the page size. Without the page size information of the utilized
page, users cannot decide whether to promote/demote large pages to
optimize memory usage.
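
To illustrate the intended usage, below is a minimal userspace sketch
that requests the new sample bits via perf_event_open(). The event
choice, the sample period, and the absence of error handling and
ring-buffer parsing are simplifications for illustration only; on x86,
data addresses in practice require a precise memory event (PEBS):

	#include <linux/perf_event.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	int open_page_size_event(void)
	{
		struct perf_event_attr attr = {
			.type		= PERF_TYPE_HARDWARE,
			.config		= PERF_COUNT_HW_CPU_CYCLES,
			.size		= sizeof(attr),
			.sample_period	= 100000,
			/* request the new page-size fields in each sample */
			.sample_type	= PERF_SAMPLE_IP | PERF_SAMPLE_ADDR |
					  PERF_SAMPLE_DATA_PAGE_SIZE |
					  PERF_SAMPLE_CODE_PAGE_SIZE,
		};

		/* measure the calling thread on any CPU */
		return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	}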

The patch set was first submitted over a year ago:
https://lkml.kernel.org/r/1549648509-12704-1-git-send-email-kan.liang@linux.intel.com
It introduced a __weak function, perf_get_page_size(), aimed at
retrieving the page size for a given virtual address in the generic
code, and implemented an x86-specific version of perf_get_page_size().
However, the proposal was rejected because it was a pure x86
implementation:
https://lkml.kernel.org/r/20190208200731.GN32511@hirez.programming.kicks-ass.net

At that time, it was not easy to support perf_get_page_size()
universally, because some key functions, e.g., p?d_large(), were not
supported by some architectures.

Now, the generic p?d_leaf() functions have been added to the kernel:
https://lkml.kernel.org/r/20191218162402.45610-2-steven.price@arm.com
Starting from V6, a new universal perf_get_page_size() function is
implemented based on the generic p?d_leaf() functions.

On some platforms, e.g., x86, the page walker is invoked in an NMI
handler, so it must be NMI-safe and low overhead. Besides, the page
walker should work for both user and kernel virtual addresses. The
existing generic page walkers, e.g., walk_page_range_novma(), are
somewhat complex and not guaranteed to be NMI-safe, and follow_page()
only handles user virtual addresses. So a simpler page walk function is
implemented here.

Kan Liang (3):
  perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE
  perf/x86/intel: Support PERF_SAMPLE_DATA_PAGE_SIZE
  powerpc/perf: Support PERF_SAMPLE_DATA_PAGE_SIZE

Stephane Eranian (1):
  perf/core: Add support for PERF_SAMPLE_CODE_PAGE_SIZE

 arch/powerpc/perf/core-book3s.c |   6 +-
 arch/x86/events/intel/ds.c      |  11 +++-
 arch/x86/events/perf_event.h    |   2 +-
 include/linux/perf_event.h      |   2 +
 include/uapi/linux/perf_event.h |   6 +-
 kernel/events/core.c            | 104 +++++++++++++++++++++++++++++++-
 6 files changed, 123 insertions(+), 8 deletions(-)

-- 
2.17.1



* [PATCH V8 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE
  2020-09-21 15:26 [PATCH V8 0/4] Add the page size in the perf record (kernel) kan.liang
@ 2020-09-21 15:26 ` kan.liang
  2020-09-30  7:15   ` Stephane Eranian
  2020-09-21 15:26 ` [PATCH V8 2/4] perf/x86/intel: Support PERF_SAMPLE_DATA_PAGE_SIZE kan.liang
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: kan.liang @ 2020-09-21 15:26 UTC (permalink / raw)
  To: peterz, mingo, acme, linux-kernel
  Cc: mark.rutland, alexander.shishkin, jolsa, eranian, ak,
	dave.hansen, kirill.shutemov, mpe, benh, paulus, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Current perf can report both virtual addresses and physical addresses,
but not the MMU page size. Without the MMU page size information of the
utilized page, users cannot decide whether to promote/demote large pages
to optimize memory usage.

Add a new sample type for the data MMU page size.

Current perf already has a facility to collect data virtual addresses.
A page walker is required to walk the page tables and calculate the
MMU page size from a given virtual address.

On some platforms, e.g., x86, the page walker is invoked in an NMI
handler, so the page walker must be NMI-safe and low overhead. Besides,
the page walker should work for both user and kernel virtual addresses.
The existing generic page walkers, e.g., walk_page_range_novma(), are
somewhat complex and not guaranteed to be NMI-safe, and follow_page()
only handles user virtual addresses.

Add a new function perf_get_page_size() to walk the page tables and
calculate the MMU page size. In the function:
- Interrupts have to be disabled to prevent any teardown of the page
  tables.
- The active_mm is used for the page walker. Compared with mm, the
  active_mm is a better choice: it's always non-NULL. For a user
  thread, it always points to the real address space. For a kernel
  thread, it "takes over" the mm of the thread that switched to it,
  so it does not always use the page tables of init_mm.
- The MMU page size is calculated from the page table level.

The method should work for all architectures, but it has only been
verified on x86. Should there be an architecture that supports perf
where the method doesn't work, it can be fixed separately later.
Reporting the wrong page size would not be fatal for the architecture.

Some features still under discussion may impact the method in the
future. Quoting Dave Hansen:
  "There are lots of weird things folks are trying to do with the page
   tables, like Address Space Isolation.  For instance, if you get a
   perf NMI when running userspace, current->mm->pgd is *different* than
   the PGD that was in use when userspace was running. It's close enough
   today, but it might not stay that way."
If that case happens later, lots of consecutive page walk errors will
occur. The worst case is that lots of page-size '0' values are
returned, which would not be fatal.
A check is implemented in the perf tool to detect this case. Once it
triggers, a kernel patch can be implemented accordingly.
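
For reference, with the record layout documented in the uapi header
below, a sample taken with only PERF_SAMPLE_ADDR and
PERF_SAMPLE_DATA_PAGE_SIZE set carries exactly two u64 values after the
header. A minimal decoding sketch, valid only under that assumption
(any other enabled sample bit changes the offsets):

	#include <stdio.h>
	#include <linux/perf_event.h>

	struct sample_addr_pgsz {
		struct perf_event_header header;
		__u64 addr;		/* PERF_SAMPLE_ADDR */
		__u64 data_page_size;	/* PERF_SAMPLE_DATA_PAGE_SIZE */
	};

	static void handle_record(const struct perf_event_header *hdr)
	{
		const struct sample_addr_pgsz *s = (const void *)hdr;

		if (hdr->type != PERF_RECORD_SAMPLE)
			return;
		/* a page size of 0 means the page walk failed */
		printf("addr=0x%llx data page size=%llu\n",
		       (unsigned long long)s->addr,
		       (unsigned long long)s->data_page_size);
	}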

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 include/linux/perf_event.h      |  1 +
 include/uapi/linux/perf_event.h |  4 +-
 kernel/events/core.c            | 93 +++++++++++++++++++++++++++++++++
 3 files changed, 97 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 0c19d279b97f..7e3785dd27d9 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1034,6 +1034,7 @@ struct perf_sample_data {
 
 	u64				phys_addr;
 	u64				cgroup;
+	u64				data_page_size;
 } ____cacheline_aligned;
 
 /* default value for data source */
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 077e7ee69e3d..cc6ea346e9f9 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -143,8 +143,9 @@ enum perf_event_sample_format {
 	PERF_SAMPLE_PHYS_ADDR			= 1U << 19,
 	PERF_SAMPLE_AUX				= 1U << 20,
 	PERF_SAMPLE_CGROUP			= 1U << 21,
+	PERF_SAMPLE_DATA_PAGE_SIZE		= 1U << 22,
 
-	PERF_SAMPLE_MAX = 1U << 22,		/* non-ABI */
+	PERF_SAMPLE_MAX = 1U << 23,		/* non-ABI */
 
 	__PERF_SAMPLE_CALLCHAIN_EARLY		= 1ULL << 63, /* non-ABI; internal use */
 };
@@ -896,6 +897,7 @@ enum perf_event_type {
 	 *	{ u64			phys_addr;} && PERF_SAMPLE_PHYS_ADDR
 	 *	{ u64			size;
 	 *	  char			data[size]; } && PERF_SAMPLE_AUX
+	 *	{ u64			data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
 	 * };
 	 */
 	PERF_RECORD_SAMPLE			= 9,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 45edb85344a1..dd329a8f99f7 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -51,6 +51,7 @@
 #include <linux/proc_ns.h>
 #include <linux/mount.h>
 #include <linux/min_heap.h>
+#include <linux/highmem.h>
 
 #include "internal.h"
 
@@ -1894,6 +1895,9 @@ static void __perf_event_header_size(struct perf_event *event, u64 sample_type)
 	if (sample_type & PERF_SAMPLE_CGROUP)
 		size += sizeof(data->cgroup);
 
+	if (sample_type & PERF_SAMPLE_DATA_PAGE_SIZE)
+		size += sizeof(data->data_page_size);
+
 	event->header_size = size;
 }
 
@@ -6937,6 +6941,9 @@ void perf_output_sample(struct perf_output_handle *handle,
 	if (sample_type & PERF_SAMPLE_CGROUP)
 		perf_output_put(handle, data->cgroup);
 
+	if (sample_type & PERF_SAMPLE_DATA_PAGE_SIZE)
+		perf_output_put(handle, data->data_page_size);
+
 	if (sample_type & PERF_SAMPLE_AUX) {
 		perf_output_put(handle, data->aux_size);
 
@@ -6994,6 +7001,84 @@ static u64 perf_virt_to_phys(u64 virt)
 	return phys_addr;
 }
 
+#ifdef CONFIG_MMU
+
+/*
+ * Return the MMU page size of a given virtual address
+ */
+static u64 __perf_get_page_size(struct mm_struct *mm, unsigned long addr)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+
+	pgd = pgd_offset(mm, addr);
+	if (pgd_none(*pgd))
+		return 0;
+
+	p4d = p4d_offset(pgd, addr);
+	if (!p4d_present(*p4d))
+		return 0;
+
+	if (p4d_leaf(*p4d))
+		return 1ULL << P4D_SHIFT;
+
+	pud = pud_offset(p4d, addr);
+	if (!pud_present(*pud))
+		return 0;
+
+	if (pud_leaf(*pud))
+		return 1ULL << PUD_SHIFT;
+
+	pmd = pmd_offset(pud, addr);
+	if (!pmd_present(*pmd))
+		return 0;
+
+	if (pmd_leaf(*pmd))
+		return 1ULL << PMD_SHIFT;
+
+	pte = pte_offset_map(pmd, addr);
+	if (!pte_present(*pte)) {
+		pte_unmap(pte);
+		return 0;
+	}
+
+	pte_unmap(pte);
+	return PAGE_SIZE;
+}
+
+#else
+
+static u64 __perf_get_page_size(struct mm_struct *mm, unsigned long addr)
+{
+	return 0;
+}
+
+#endif
+
+static u64 perf_get_page_size(unsigned long addr)
+{
+	unsigned long flags;
+	u64 size;
+
+	if (!addr)
+		return 0;
+
+	/*
+	 * Software page-table walkers must disable IRQs,
+	 * which prevents any tear down of the page tables.
+	 */
+	local_irq_save(flags);
+
+	size = __perf_get_page_size(current->active_mm, addr);
+
+	local_irq_restore(flags);
+
+	return size;
+}
+
 static struct perf_callchain_entry __empty_callchain = { .nr = 0, };
 
 struct perf_callchain_entry *
@@ -7149,6 +7234,14 @@ void perf_prepare_sample(struct perf_event_header *header,
 	}
 #endif
 
+	/*
+	 * PERF_SAMPLE_DATA_PAGE_SIZE requires PERF_SAMPLE_ADDR. If the user
+	 * doesn't require PERF_SAMPLE_ADDR, the kernel implicitly retrieves
+	 * data->addr, but the value will not be dumped to userspace.
+	 */
+	if (sample_type & PERF_SAMPLE_DATA_PAGE_SIZE)
+		data->data_page_size = perf_get_page_size(data->addr);
+
 	if (sample_type & PERF_SAMPLE_AUX) {
 		u64 size;
 
-- 
2.17.1



* [PATCH V8 2/4] perf/x86/intel: Support PERF_SAMPLE_DATA_PAGE_SIZE
  2020-09-21 15:26 [PATCH V8 0/4] Add the page size in the perf record (kernel) kan.liang
  2020-09-21 15:26 ` [PATCH V8 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE kan.liang
@ 2020-09-21 15:26 ` kan.liang
  2020-09-21 15:26 ` [PATCH V8 3/4] powerpc/perf: " kan.liang
  2020-09-21 15:26 ` [PATCH V8 4/4] perf/core: Add support for PERF_SAMPLE_CODE_PAGE_SIZE kan.liang
  3 siblings, 0 replies; 15+ messages in thread
From: kan.liang @ 2020-09-21 15:26 UTC (permalink / raw)
  To: peterz, mingo, acme, linux-kernel
  Cc: mark.rutland, alexander.shishkin, jolsa, eranian, ak,
	dave.hansen, kirill.shutemov, mpe, benh, paulus, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The new sample type, PERF_SAMPLE_DATA_PAGE_SIZE, requires the virtual
address. Update data->addr if the sample type is set.

Large PEBS is disabled with this sample type, because perf doesn't
support munmap tracking yet. The PEBS buffer for large PEBS cannot be
flushed on each munmap, so a wrong page size could be calculated. Large
PEBS can be enabled separately later, once munmap tracking is
supported.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/intel/ds.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 404315df1e16..444e5f061d04 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -959,7 +959,8 @@ static void adaptive_pebs_record_size_update(void)
 
 #define PERF_PEBS_MEMINFO_TYPE	(PERF_SAMPLE_ADDR | PERF_SAMPLE_DATA_SRC |   \
 				PERF_SAMPLE_PHYS_ADDR | PERF_SAMPLE_WEIGHT | \
-				PERF_SAMPLE_TRANSACTION)
+				PERF_SAMPLE_TRANSACTION |		     \
+				PERF_SAMPLE_DATA_PAGE_SIZE)
 
 static u64 pebs_update_adaptive_cfg(struct perf_event *event)
 {
@@ -1335,6 +1336,10 @@ static u64 get_data_src(struct perf_event *event, u64 aux)
 	return val;
 }
 
+#define PERF_SAMPLE_ADDR_TYPE	(PERF_SAMPLE_ADDR |		\
+				 PERF_SAMPLE_PHYS_ADDR |	\
+				 PERF_SAMPLE_DATA_PAGE_SIZE)
+
 static void setup_pebs_fixed_sample_data(struct perf_event *event,
 				   struct pt_regs *iregs, void *__pebs,
 				   struct perf_sample_data *data,
@@ -1449,7 +1454,7 @@ static void setup_pebs_fixed_sample_data(struct perf_event *event,
 	}
 
 
-	if ((sample_type & (PERF_SAMPLE_ADDR | PERF_SAMPLE_PHYS_ADDR)) &&
+	if ((sample_type & PERF_SAMPLE_ADDR_TYPE) &&
 	    x86_pmu.intel_cap.pebs_format >= 1)
 		data->addr = pebs->dla;
 
@@ -1577,7 +1582,7 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
 		if (sample_type & PERF_SAMPLE_DATA_SRC)
 			data->data_src.val = get_data_src(event, meminfo->aux);
 
-		if (sample_type & (PERF_SAMPLE_ADDR | PERF_SAMPLE_PHYS_ADDR))
+		if (sample_type & PERF_SAMPLE_ADDR_TYPE)
 			data->addr = meminfo->address;
 
 		if (sample_type & PERF_SAMPLE_TRANSACTION)
-- 
2.17.1



* [PATCH V8 3/4] powerpc/perf: Support PERF_SAMPLE_DATA_PAGE_SIZE
  2020-09-21 15:26 [PATCH V8 0/4] Add the page size in the perf record (kernel) kan.liang
  2020-09-21 15:26 ` [PATCH V8 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE kan.liang
  2020-09-21 15:26 ` [PATCH V8 2/4] perf/x86/intel: Support PERF_SAMPLE_DATA_PAGE_SIZE kan.liang
@ 2020-09-21 15:26 ` kan.liang
  2020-09-21 15:26 ` [PATCH V8 4/4] perf/core: Add support for PERF_SAMPLE_CODE_PAGE_SIZE kan.liang
  3 siblings, 0 replies; 15+ messages in thread
From: kan.liang @ 2020-09-21 15:26 UTC (permalink / raw)
  To: peterz, mingo, acme, linux-kernel
  Cc: mark.rutland, alexander.shishkin, jolsa, eranian, ak,
	dave.hansen, kirill.shutemov, mpe, benh, paulus, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The new sample type, PERF_SAMPLE_DATA_PAGE_SIZE, requires the virtual
address. Update data->addr if the sample type is set.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/powerpc/perf/core-book3s.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 78fe34986594..ce22bd23082d 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -2065,6 +2065,9 @@ static struct pmu power_pmu = {
 	.sched_task	= power_pmu_sched_task,
 };
 
+#define PERF_SAMPLE_ADDR_TYPE  (PERF_SAMPLE_ADDR |		\
+				PERF_SAMPLE_PHYS_ADDR |		\
+				PERF_SAMPLE_DATA_PAGE_SIZE)
 /*
  * A counter has overflowed; update its count and record
  * things if requested.  Note that interrupts are hard-disabled
@@ -2120,8 +2123,7 @@ static void record_and_restart(struct perf_event *event, unsigned long val,
 
 		perf_sample_data_init(&data, ~0ULL, event->hw.last_period);
 
-		if (event->attr.sample_type &
-		    (PERF_SAMPLE_ADDR | PERF_SAMPLE_PHYS_ADDR))
+		if (event->attr.sample_type & PERF_SAMPLE_ADDR_TYPE)
 			perf_get_data_addr(event, regs, &data.addr);
 
 		if (event->attr.sample_type & PERF_SAMPLE_BRANCH_STACK) {
-- 
2.17.1



* [PATCH V8 4/4] perf/core: Add support for PERF_SAMPLE_CODE_PAGE_SIZE
  2020-09-21 15:26 [PATCH V8 0/4] Add the page size in the perf record (kernel) kan.liang
                   ` (2 preceding siblings ...)
  2020-09-21 15:26 ` [PATCH V8 3/4] powerpc/perf: " kan.liang
@ 2020-09-21 15:26 ` kan.liang
  3 siblings, 0 replies; 15+ messages in thread
From: kan.liang @ 2020-09-21 15:26 UTC (permalink / raw)
  To: peterz, mingo, acme, linux-kernel
  Cc: mark.rutland, alexander.shishkin, jolsa, eranian, ak,
	dave.hansen, kirill.shutemov, mpe, benh, paulus, Kan Liang

From: Stephane Eranian <eranian@google.com>

When studying code layout, it is useful to capture the page size of the
sampled code address.

Add a new sample type for code page size.
The new sample type requires collecting the ip. The code page size can
then be calculated with the NMI-safe perf_get_page_size().

For large PEBS, it's very unlikely that the mapping for the earlier
PEBS records is already gone, so the feature is also enabled for large
PEBS. The worst case is that page-size '0' is returned.
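
For reference, when only PERF_SAMPLE_IP and PERF_SAMPLE_CODE_PAGE_SIZE
are set, the sample record body reduces to the two fields sketched
below (an illustration that assumes exactly those two sample bits).
Note that perf_prepare_sample() collects data->ip for this sample type
even when PERF_SAMPLE_IP is not requested, but the ip value is only
written out when PERF_SAMPLE_IP is set:

	#include <linux/perf_event.h>

	struct sample_ip_code_pgsz {
		struct perf_event_header header;
		__u64 ip;		/* PERF_SAMPLE_IP */
		__u64 code_page_size;	/* PERF_SAMPLE_CODE_PAGE_SIZE */
	};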

Co-developed-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/events/perf_event.h    |  2 +-
 include/linux/perf_event.h      |  1 +
 include/uapi/linux/perf_event.h |  4 +++-
 kernel/events/core.c            | 11 ++++++++++-
 4 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 345442410a4d..10629ef1b626 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -132,7 +132,7 @@ struct amd_nb {
 	PERF_SAMPLE_DATA_SRC | PERF_SAMPLE_IDENTIFIER | \
 	PERF_SAMPLE_TRANSACTION | PERF_SAMPLE_PHYS_ADDR | \
 	PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER | \
-	PERF_SAMPLE_PERIOD)
+	PERF_SAMPLE_PERIOD | PERF_SAMPLE_CODE_PAGE_SIZE)
 
 #define PEBS_GP_REGS			\
 	((1ULL << PERF_REG_X86_AX)    | \
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 7e3785dd27d9..e533b03af053 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1035,6 +1035,7 @@ struct perf_sample_data {
 	u64				phys_addr;
 	u64				cgroup;
 	u64				data_page_size;
+	u64				code_page_size;
 } ____cacheline_aligned;
 
 /* default value for data source */
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index cc6ea346e9f9..c2f20ee3124d 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -144,8 +144,9 @@ enum perf_event_sample_format {
 	PERF_SAMPLE_AUX				= 1U << 20,
 	PERF_SAMPLE_CGROUP			= 1U << 21,
 	PERF_SAMPLE_DATA_PAGE_SIZE		= 1U << 22,
+	PERF_SAMPLE_CODE_PAGE_SIZE		= 1U << 23,
 
-	PERF_SAMPLE_MAX = 1U << 23,		/* non-ABI */
+	PERF_SAMPLE_MAX = 1U << 24,		/* non-ABI */
 
 	__PERF_SAMPLE_CALLCHAIN_EARLY		= 1ULL << 63, /* non-ABI; internal use */
 };
@@ -898,6 +899,7 @@ enum perf_event_type {
 	 *	{ u64			size;
 	 *	  char			data[size]; } && PERF_SAMPLE_AUX
 	 *	{ u64			data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
+	 *	{ u64			code_page_size;} && PERF_SAMPLE_CODE_PAGE_SIZE
 	 * };
 	 */
 	PERF_RECORD_SAMPLE			= 9,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index dd329a8f99f7..5fb27cb845fb 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1898,6 +1898,9 @@ static void __perf_event_header_size(struct perf_event *event, u64 sample_type)
 	if (sample_type & PERF_SAMPLE_DATA_PAGE_SIZE)
 		size += sizeof(data->data_page_size);
 
+	if (sample_type & PERF_SAMPLE_CODE_PAGE_SIZE)
+		size += sizeof(data->code_page_size);
+
 	event->header_size = size;
 }
 
@@ -6944,6 +6947,9 @@ void perf_output_sample(struct perf_output_handle *handle,
 	if (sample_type & PERF_SAMPLE_DATA_PAGE_SIZE)
 		perf_output_put(handle, data->data_page_size);
 
+	if (sample_type & PERF_SAMPLE_CODE_PAGE_SIZE)
+		perf_output_put(handle, data->code_page_size);
+
 	if (sample_type & PERF_SAMPLE_AUX) {
 		perf_output_put(handle, data->aux_size);
 
@@ -7114,7 +7120,7 @@ void perf_prepare_sample(struct perf_event_header *header,
 
 	__perf_event_header__init_id(header, data, event);
 
-	if (sample_type & PERF_SAMPLE_IP)
+	if (sample_type & (PERF_SAMPLE_IP | PERF_SAMPLE_CODE_PAGE_SIZE))
 		data->ip = perf_instruction_pointer(regs);
 
 	if (sample_type & PERF_SAMPLE_CALLCHAIN) {
@@ -7242,6 +7248,9 @@ void perf_prepare_sample(struct perf_event_header *header,
 	if (sample_type & PERF_SAMPLE_DATA_PAGE_SIZE)
 		data->data_page_size = perf_get_page_size(data->addr);
 
+	if (sample_type & PERF_SAMPLE_CODE_PAGE_SIZE)
+		data->code_page_size = perf_get_page_size(data->ip);
+
 	if (sample_type & PERF_SAMPLE_AUX) {
 		u64 size;
 
-- 
2.17.1



* Re: [PATCH V8 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE
  2020-09-21 15:26 ` [PATCH V8 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE kan.liang
@ 2020-09-30  7:15   ` Stephane Eranian
  2020-09-30 13:51     ` Dave Hansen
  2020-09-30 14:42     ` Liang, Kan
  0 siblings, 2 replies; 15+ messages in thread
From: Stephane Eranian @ 2020-09-30  7:15 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, LKML,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Andi Kleen,
	Dave Hansen, kirill.shutemov, Michael Ellerman, benh,
	Paul Mackerras

On Mon, Sep 21, 2020 at 8:29 AM <kan.liang@linux.intel.com> wrote:
>
> From: Kan Liang <kan.liang@linux.intel.com>
>
> Current perf can report both virtual addresses and physical addresses,
> but not the MMU page size. Without the MMU page size information of the
> utilized page, users cannot decide whether to promote/demote large pages
> to optimize memory usage.
>
> Add a new sample type for the data MMU page size.
>
> Current perf already has a facility to collect data virtual addresses.
> A page walker is required to walk the page tables and calculate the
> MMU page size from a given virtual address.
>
> On some platforms, e.g., x86, the page walker is invoked in an NMI
> handler, so the page walker must be NMI-safe and low overhead. Besides,
> the page walker should work for both user and kernel virtual addresses.
> The existing generic page walkers, e.g., walk_page_range_novma(), are
> somewhat complex and not guaranteed to be NMI-safe, and follow_page()
> only handles user virtual addresses.
>
> Add a new function perf_get_page_size() to walk the page tables and
> calculate the MMU page size. In the function:
> - Interrupts have to be disabled to prevent any teardown of the page
>   tables.
> - The active_mm is used for the page walker. Compared with mm, the
>   active_mm is a better choice: it's always non-NULL. For a user
>   thread, it always points to the real address space. For a kernel
>   thread, it "takes over" the mm of the thread that switched to it,
>   so it does not always use the page tables of init_mm.
> - The MMU page size is calculated from the page table level.
>
> The method should work for all architectures, but it has only been
> verified on x86. Should there be an architecture that supports perf
> where the method doesn't work, it can be fixed separately later.
> Reporting the wrong page size would not be fatal for the architecture.
>
> Some features still under discussion may impact the method in the
> future. Quoting Dave Hansen:
>   "There are lots of weird things folks are trying to do with the page
>    tables, like Address Space Isolation.  For instance, if you get a
>    perf NMI when running userspace, current->mm->pgd is *different* than
>    the PGD that was in use when userspace was running. It's close enough
>    today, but it might not stay that way."
> If that case happens later, lots of consecutive page walk errors will
> occur. The worst case is that lots of page-size '0' values are
> returned, which would not be fatal.
> A check is implemented in the perf tool to detect this case. Once it
> triggers, a kernel patch can be implemented accordingly.
>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> ---
>  include/linux/perf_event.h      |  1 +
>  include/uapi/linux/perf_event.h |  4 +-
>  kernel/events/core.c            | 93 +++++++++++++++++++++++++++++++++
>  3 files changed, 97 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 0c19d279b97f..7e3785dd27d9 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -1034,6 +1034,7 @@ struct perf_sample_data {
>
>         u64                             phys_addr;
>         u64                             cgroup;
> +       u64                             data_page_size;
>  } ____cacheline_aligned;
>
>  /* default value for data source */
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 077e7ee69e3d..cc6ea346e9f9 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -143,8 +143,9 @@ enum perf_event_sample_format {
>         PERF_SAMPLE_PHYS_ADDR                   = 1U << 19,
>         PERF_SAMPLE_AUX                         = 1U << 20,
>         PERF_SAMPLE_CGROUP                      = 1U << 21,
> +       PERF_SAMPLE_DATA_PAGE_SIZE              = 1U << 22,
>
> -       PERF_SAMPLE_MAX = 1U << 22,             /* non-ABI */
> +       PERF_SAMPLE_MAX = 1U << 23,             /* non-ABI */
>
>         __PERF_SAMPLE_CALLCHAIN_EARLY           = 1ULL << 63, /* non-ABI; internal use */
>  };
> @@ -896,6 +897,7 @@ enum perf_event_type {
>          *      { u64                   phys_addr;} && PERF_SAMPLE_PHYS_ADDR
>          *      { u64                   size;
>          *        char                  data[size]; } && PERF_SAMPLE_AUX
> +        *      { u64                   data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
>          * };
>          */
>         PERF_RECORD_SAMPLE                      = 9,
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 45edb85344a1..dd329a8f99f7 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -51,6 +51,7 @@
>  #include <linux/proc_ns.h>
>  #include <linux/mount.h>
>  #include <linux/min_heap.h>
> +#include <linux/highmem.h>
>
>  #include "internal.h"
>
> @@ -1894,6 +1895,9 @@ static void __perf_event_header_size(struct perf_event *event, u64 sample_type)
>         if (sample_type & PERF_SAMPLE_CGROUP)
>                 size += sizeof(data->cgroup);
>
> +       if (sample_type & PERF_SAMPLE_DATA_PAGE_SIZE)
> +               size += sizeof(data->data_page_size);
> +
>         event->header_size = size;
>  }
>
> @@ -6937,6 +6941,9 @@ void perf_output_sample(struct perf_output_handle *handle,
>         if (sample_type & PERF_SAMPLE_CGROUP)
>                 perf_output_put(handle, data->cgroup);
>
> +       if (sample_type & PERF_SAMPLE_DATA_PAGE_SIZE)
> +               perf_output_put(handle, data->data_page_size);
> +
>         if (sample_type & PERF_SAMPLE_AUX) {
>                 perf_output_put(handle, data->aux_size);
>
> @@ -6994,6 +7001,84 @@ static u64 perf_virt_to_phys(u64 virt)
>         return phys_addr;
>  }
>
> +#ifdef CONFIG_MMU
> +
> +/*
> + * Return the MMU page size of a given virtual address
> + */
> +static u64 __perf_get_page_size(struct mm_struct *mm, unsigned long addr)
> +{
> +       pgd_t *pgd;
> +       p4d_t *p4d;
> +       pud_t *pud;
> +       pmd_t *pmd;
> +       pte_t *pte;
> +
> +       pgd = pgd_offset(mm, addr);
> +       if (pgd_none(*pgd))
> +               return 0;
> +
> +       p4d = p4d_offset(pgd, addr);
> +       if (!p4d_present(*p4d))
> +               return 0;
> +
> +       if (p4d_leaf(*p4d))
> +               return 1ULL << P4D_SHIFT;
> +
> +       pud = pud_offset(p4d, addr);
> +       if (!pud_present(*pud))
> +               return 0;
> +
> +       if (pud_leaf(*pud))
> +               return 1ULL << PUD_SHIFT;
> +
> +       pmd = pmd_offset(pud, addr);
> +       if (!pmd_present(*pmd))
> +               return 0;
> +
> +       if (pmd_leaf(*pmd))
> +               return 1ULL << PMD_SHIFT;
> +
> +       pte = pte_offset_map(pmd, addr);
> +       if (!pte_present(*pte)) {
> +               pte_unmap(pte);
> +               return 0;
> +       }
> +
> +       pte_unmap(pte);
> +       return PAGE_SIZE;
> +}
> +
> +#else
> +
> +static u64 __perf_get_page_size(struct mm_struct *mm, unsigned long addr)
> +{
> +       return 0;
> +}
> +
> +#endif
> +
> +static u64 perf_get_page_size(unsigned long addr)
> +{
> +       unsigned long flags;
> +       u64 size;
> +
> +       if (!addr)
> +               return 0;
> +
> +       /*
> +        * Software page-table walkers must disable IRQs,
> +        * which prevents any tear down of the page tables.
> +        */
> +       local_irq_save(flags);
> +
> +       size = __perf_get_page_size(current->active_mm, addr);
> +

When I tested on my kernel, it panicked because I suspect
current->active_mm could be NULL. Adding a check for NULL avoided the
problem. But I suspect this is not the correct solution.

>
> +       local_irq_restore(flags);
> +
> +       return size;
> +}
> +
>  static struct perf_callchain_entry __empty_callchain = { .nr = 0, };
>
>  struct perf_callchain_entry *
> @@ -7149,6 +7234,14 @@ void perf_prepare_sample(struct perf_event_header *header,
>         }
>  #endif
>
> +       /*
> +        * PERF_SAMPLE_DATA_PAGE_SIZE requires PERF_SAMPLE_ADDR. If the user
> +        * doesn't require PERF_SAMPLE_ADDR, the kernel implicitly retrieves
> +        * data->addr, but the value will not be dumped to userspace.
> +        */
> +       if (sample_type & PERF_SAMPLE_DATA_PAGE_SIZE)
> +               data->data_page_size = perf_get_page_size(data->addr);
> +
>         if (sample_type & PERF_SAMPLE_AUX) {
>                 u64 size;
>
> --
> 2.17.1
>


* Re: [PATCH V8 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE
  2020-09-30  7:15   ` Stephane Eranian
@ 2020-09-30 13:51     ` Dave Hansen
  2020-09-30 14:42     ` Liang, Kan
  1 sibling, 0 replies; 15+ messages in thread
From: Dave Hansen @ 2020-09-30 13:51 UTC (permalink / raw)
  To: Stephane Eranian, Liang, Kan
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, LKML,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Andi Kleen,
	kirill.shutemov, Michael Ellerman, benh, Paul Mackerras

On 9/30/20 12:15 AM, Stephane Eranian wrote:
>> +       /*
>> +        * Software page-table walkers must disable IRQs,
>> +        * which prevents any tear down of the page tables.
>> +        */
>> +       local_irq_save(flags);
>> +
>> +       size = __perf_get_page_size(current->active_mm, addr);
>> +
> When I tested on my kernel, it panicked because I suspect
> current->active_mm could be NULL. Adding a check for NULL avoided the
> problem. But I suspect this is not the correct solution.

Did you happen to capture the oops?  I can _imagine_ scenarios where
current->active_mm could be NULL, I just can't find any obvious ones in
the code.


* Re: [PATCH V8 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE
  2020-09-30  7:15   ` Stephane Eranian
  2020-09-30 13:51     ` Dave Hansen
@ 2020-09-30 14:42     ` Liang, Kan
  2020-09-30 14:48       ` Dave Hansen
  1 sibling, 1 reply; 15+ messages in thread
From: Liang, Kan @ 2020-09-30 14:42 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, LKML,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Andi Kleen,
	Dave Hansen, kirill.shutemov, Michael Ellerman, benh,
	Paul Mackerras



On 9/30/2020 3:15 AM, Stephane Eranian wrote:
>> +static u64 perf_get_page_size(unsigned long addr)
>> +{
>> +       unsigned long flags;
>> +       u64 size;
>> +
>> +       if (!addr)
>> +               return 0;
>> +
>> +       /*
>> +        * Software page-table walkers must disable IRQs,
>> +        * which prevents any tear down of the page tables.
>> +        */
>> +       local_irq_save(flags);
>> +
>> +       size = __perf_get_page_size(current->active_mm, addr);
>> +
> When I tested on my kernel, it panicked because I suspect
> current->active_mm could be NULL. Adding a check for NULL avoided the
> problem. But I suspect this is not the correct solution.
> 

I guess the NULL active_mm should be a rare case. If so, I think it's 
not bad to add a check and return 0 page size.


Thanks,
Kan


* Re: [PATCH V8 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE
  2020-09-30 14:42     ` Liang, Kan
@ 2020-09-30 14:48       ` Dave Hansen
  2020-09-30 17:17         ` Stephane Eranian
  2020-09-30 17:30         ` Peter Zijlstra
  0 siblings, 2 replies; 15+ messages in thread
From: Dave Hansen @ 2020-09-30 14:48 UTC (permalink / raw)
  To: Liang, Kan, Stephane Eranian
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, LKML,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Andi Kleen,
	kirill.shutemov, Michael Ellerman, benh, Paul Mackerras

On 9/30/20 7:42 AM, Liang, Kan wrote:
>> When I tested on my kernel, it panicked because I suspect
>> current->active_mm could be NULL. Adding a check for NULL avoided the
>> problem. But I suspect this is not the correct solution.
> 
> I guess the NULL active_mm should be a rare case. If so, I think it's
> not bad to add a check and return 0 page size.

I think it would be best to understand why ->active_mm is NULL instead
of just papering over the problem.  If it is papered over, and this is
common, you might end up effectively turning off your shiny new feature
inadvertently.


* Re: [PATCH V8 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE
  2020-09-30 14:48       ` Dave Hansen
@ 2020-09-30 17:17         ` Stephane Eranian
  2020-09-30 17:30         ` Peter Zijlstra
  1 sibling, 0 replies; 15+ messages in thread
From: Stephane Eranian @ 2020-09-30 17:17 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Liang, Kan, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, LKML, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Andi Kleen, kirill.shutemov, Michael Ellerman, benh,
	Paul Mackerras

On Wed, Sep 30, 2020 at 7:48 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 9/30/20 7:42 AM, Liang, Kan wrote:
> >> When I tested on my kernel, it panicked because I suspect
> >> current->active_mm could be NULL. Adding a check for NULL avoided the
> >> problem. But I suspect this is not the correct solution.
> >
> > I guess the NULL active_mm should be a rare case. If so, I think it's
> > not bad to add a check and return 0 page size.
>
> I think it would be best to understand why ->active_mm is NULL instead
> of just papering over the problem.  If it is papered over, and this is
> common, you might end up effectively turning off your shiny new feature
> inadvertently.

I tried that on a backport of the patch to an older kernel. Maybe the
behavior of active_mm has changed compared to tip.git.
I will try again with tip.git.


* Re: [PATCH V8 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE
  2020-09-30 14:48       ` Dave Hansen
  2020-09-30 17:17         ` Stephane Eranian
@ 2020-09-30 17:30         ` Peter Zijlstra
  2020-09-30 22:45           ` Stephane Eranian
  2020-10-01 13:49           ` Dave Hansen
  1 sibling, 2 replies; 15+ messages in thread
From: Peter Zijlstra @ 2020-09-30 17:30 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Liang, Kan, Stephane Eranian, Ingo Molnar,
	Arnaldo Carvalho de Melo, LKML, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Andi Kleen, kirill.shutemov, Michael Ellerman, benh,
	Paul Mackerras

On Wed, Sep 30, 2020 at 07:48:48AM -0700, Dave Hansen wrote:
> On 9/30/20 7:42 AM, Liang, Kan wrote:
> >> When I tested on my kernel, it panicked because I suspect
> >> current->active_mm could be NULL. Adding a check for NULL avoided the
> >> problem. But I suspect this is not the correct solution.
> > 
> > I guess the NULL active_mm should be a rare case. If so, I think it's
> > not bad to add a check and return 0 page size.
> 
> I think it would be best to understand why ->active_mm is NULL instead
> of just papering over the problem.  If it is papered over, and this is
> common, you might end up effectively turning off your shiny new feature
> inadvertently.

context_switch() can set prev->active_mm to NULL when it transfers it to
@next. It does this before @current is updated. So an NMI that comes in
between this active_mm swizzling and updating @current will see
!active_mm.

In general though, I think using ->active_mm is a mistake. That
code should be doing something like:


	mm = current->mm;
	if (!mm)
		mm = &init_mm;




* Re: [PATCH V8 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE
  2020-09-30 17:30         ` Peter Zijlstra
@ 2020-09-30 22:45           ` Stephane Eranian
  2020-10-01 13:30             ` Liang, Kan
  2020-10-01 13:49           ` Dave Hansen
  1 sibling, 1 reply; 15+ messages in thread
From: Stephane Eranian @ 2020-09-30 22:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Hansen, Liang, Kan, Ingo Molnar, Arnaldo Carvalho de Melo,
	LKML, Mark Rutland, Alexander Shishkin, Jiri Olsa, Andi Kleen,
	kirill.shutemov, Michael Ellerman, benh, Paul Mackerras

On Wed, Sep 30, 2020 at 10:30 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Sep 30, 2020 at 07:48:48AM -0700, Dave Hansen wrote:
> > On 9/30/20 7:42 AM, Liang, Kan wrote:
> > >> When I tested on my kernel, it panicked because I suspect
> > >> current->active_mm could be NULL. Adding a check for NULL avoided the
> > >> problem. But I suspect this is not the correct solution.
> > >
> > > I guess the NULL active_mm should be a rare case. If so, I think it's
> > > not bad to add a check and return 0 page size.
> >
> > I think it would be best to understand why ->active_mm is NULL instead
> > of just papering over the problem.  If it is papered over, and this is
> > common, you might end up effectively turning off your shiny new feature
> > inadvertently.
>
> context_switch() can set prev->active_mm to NULL when it transfers it to
> @next. It does this before @current is updated. So an NMI that comes in
> between this active_mm swizzling and updating @current will see
> !active_mm.
>
I think Peter is right. This code is called in the context of NMI, so
if active_mm is set to NULL inside
a locked section, this is not enough to protect the perf_events code
from seeing it.

> In general though, I think using ->active_mm is a mistake. That
> code should be doing something like:
>
>
>         mm = current->mm;
>         if (!mm)
>                 mm = &init_mm;
>
>


* Re: [PATCH V8 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE
  2020-09-30 22:45           ` Stephane Eranian
@ 2020-10-01 13:30             ` Liang, Kan
  0 siblings, 0 replies; 15+ messages in thread
From: Liang, Kan @ 2020-10-01 13:30 UTC (permalink / raw)
  To: Stephane Eranian, Peter Zijlstra
  Cc: Dave Hansen, Ingo Molnar, Arnaldo Carvalho de Melo, LKML,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Andi Kleen,
	kirill.shutemov, Michael Ellerman, benh, Paul Mackerras



On 9/30/2020 6:45 PM, Stephane Eranian wrote:
> On Wed, Sep 30, 2020 at 10:30 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> On Wed, Sep 30, 2020 at 07:48:48AM -0700, Dave Hansen wrote:
>>> On 9/30/20 7:42 AM, Liang, Kan wrote:
>>>>> When I tested on my kernel, it panicked because I suspect
>>>>> current->active_mm could be NULL. Adding a check for NULL avoided the
>>>>> problem. But I suspect this is not the correct solution.
>>>>
>>>> I guess the NULL active_mm should be a rare case. If so, I think it's
>>>> not bad to add a check and return 0 page size.
>>>
>>> I think it would be best to understand why ->active_mm is NULL instead
>>> of just papering over the problem.  If it is papered over, and this is
>>> common, you might end up effectively turning off your shiny new feature
>>> inadvertently.
>>
>> context_switch() can set prev->active_mm to NULL when it transfers it to
>> @next. It does this before @current is updated. So an NMI that comes in
>> between this active_mm swizzling and updating @current will see
>> !active_mm.
>>
> I think Peter is right. This code is called in the context of NMI, so
> if active_mm is set to NULL inside
> a locked section, this is not enough to protect the perf_events code
> from seeing it.
> 
>> In general though, I think using ->active_mm is a mistake. That
>> code should be doing something like:
>>
>>
>>          mm = current->mm;
>>          if (!mm)
>>                  mm = &init_mm;
>>
>>

I will send a V9 with the changes Peter suggests.

Thanks,
Kan


* Re: [PATCH V8 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE
  2020-09-30 17:30         ` Peter Zijlstra
  2020-09-30 22:45           ` Stephane Eranian
@ 2020-10-01 13:49           ` Dave Hansen
  2020-10-01 14:23             ` Peter Zijlstra
  1 sibling, 1 reply; 15+ messages in thread
From: Dave Hansen @ 2020-10-01 13:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Liang, Kan, Stephane Eranian, Ingo Molnar,
	Arnaldo Carvalho de Melo, LKML, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Andi Kleen, kirill.shutemov, Michael Ellerman, benh,
	Paul Mackerras

On 9/30/20 10:30 AM, Peter Zijlstra wrote:
> In general though, I think using ->active_mm is a mistake. That
> code should be doing something like:
> 
> 
> 	mm = current->mm;
> 	if (!mm)
> 		mm = &init_mm;
> 

I was hoping that using ->active_mm would give us the *actual* copy of
the page tables that's loaded in CR3 for kernel threads.  But, there are
few if any practical advantages of doing that at the moment.

Using that ^ is fine with me.


* Re: [PATCH V8 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE
  2020-10-01 13:49           ` Dave Hansen
@ 2020-10-01 14:23             ` Peter Zijlstra
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Zijlstra @ 2020-10-01 14:23 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Liang, Kan, Stephane Eranian, Ingo Molnar,
	Arnaldo Carvalho de Melo, LKML, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Andi Kleen, kirill.shutemov, Michael Ellerman, benh,
	Paul Mackerras, Andy Lutomirski

On Thu, Oct 01, 2020 at 06:49:36AM -0700, Dave Hansen wrote:
> On 9/30/20 10:30 AM, Peter Zijlstra wrote:
> > In general though, I think using ->active_mm is a mistake. That
> > code should be doing something like:
> > 
> > 
> > 	mm = current->mm;
> > 	if (!mm)
> > 		mm = &init_mm;
> > 
> 
> I was hoping that using ->active_mm would give us the *actual* copy of
> the page tables that's loaded in CR3 for kernel threads.  But, there are
> few if any practical advantages of doing that at the moment.

Some of us hate active_mm with a passion and want to remove it entirely
(/me waves at amluto).

Also, if !current->mm, it had better not be accessing anything outside
of the kernel map, so &init_mm should cover that just fine. And given
the kernel maps are shared between all CR3s, they'll see the exact same
pagetables.

