linux-kernel.vger.kernel.org archive mirror
* Re: [PATCH] x86/mm: Tracking linear mapping split events since boot
       [not found] <BYAPR01MB40856478D5BE74CB6A7D5578CFBD9@BYAPR01MB4085.prod.exchangelabs.com>
@ 2021-01-25 20:15 ` Dave Hansen
  2021-01-25 20:32   ` Tejun Heo
  0 siblings, 1 reply; 33+ messages in thread
From: Dave Hansen @ 2021-01-25 20:15 UTC (permalink / raw)
  To: Saravanan D, x86, dave.hansen, luto, peterz; +Cc: linux-kernel, kernel-team

On 1/25/21 12:11 PM, Saravanan D wrote:
> Numerous hugepage splits in the linear mapping would give
> admins the signal to narrow down the sluggishness caused by TLB
> miss/reload.
> 
> One of the many lasting (as we don't coalesce back) sources for huge page
> splits is tracing as the granular page attribute/permission changes would
> force the kernel to split code segments mapped to huge pages to smaller
> ones thereby increasing the probability of TLB miss/reload even after
> tracing has been stopped.
> 
> The split event information will be displayed at the bottom of
> /proc/meminfo
> ....
> DirectMap4k:     3505112 kB
> DirectMap2M:    19464192 kB
> DirectMap1G:    12582912 kB
> DirectMap2MSplits:  1705
> DirectMap1GSplits:    20

This seems much more like something we'd want in /proc/vmstat or as a
tracepoint than meminfo.  A tracepoint would be especially nice because
the trace buffer could actually be examined if an admin finds an
excessive number of these.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] x86/mm: Tracking linear mapping split events since boot
  2021-01-25 20:15 ` [PATCH] x86/mm: Tracking linear mapping split events since boot Dave Hansen
@ 2021-01-25 20:32   ` Tejun Heo
  2021-01-26  0:47     ` Dave Hansen
  0 siblings, 1 reply; 33+ messages in thread
From: Tejun Heo @ 2021-01-25 20:32 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Saravanan D, x86, dave.hansen, luto, peterz, linux-kernel, kernel-team

Hello,

On Mon, Jan 25, 2021 at 12:15:51PM -0800, Dave Hansen wrote:
> > DirectMap4k:     3505112 kB
> > DirectMap2M:    19464192 kB
> > DirectMap1G:    12582912 kB
> > DirectMap2MSplits:  1705
> > DirectMap1GSplits:    20
> 
> This seems much more like something we'd want in /proc/vmstat or as a
> tracepoint than meminfo.  A tracepoint would be especially nice because
> the trace buffer could actually be examined if an admin finds an
> excessive number of these.

Adding a TP sure can be helpful but I'm not sure how that'd make counters
unnecessary given that the accumulated number of events since boot is what
matters.

Thanks.

-- 
tejun


* Re: [PATCH] x86/mm: Tracking linear mapping split events since boot
  2021-01-25 20:32   ` Tejun Heo
@ 2021-01-26  0:47     ` Dave Hansen
  2021-01-26  0:53       ` Tejun Heo
  0 siblings, 1 reply; 33+ messages in thread
From: Dave Hansen @ 2021-01-26  0:47 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Saravanan D, x86, dave.hansen, luto, peterz, linux-kernel, kernel-team

On 1/25/21 12:32 PM, Tejun Heo wrote:
> On Mon, Jan 25, 2021 at 12:15:51PM -0800, Dave Hansen wrote:
>>> DirectMap4k:     3505112 kB
>>> DirectMap2M:    19464192 kB
>>> DirectMap1G:    12582912 kB
>>> DirectMap2MSplits:  1705
>>> DirectMap1GSplits:    20
>> This seems much more like something we'd want in /proc/vmstat or as a
>> tracepoint than meminfo.  A tracepoint would be especially nice because
>> the trace buffer could actually be examined if an admin finds an
>> excessive number of these.
> Adding a TP sure can be helpful but I'm not sure how that'd make counters
> unnecessary given that the accumulated number of events since boot is what
> matters.

Kinda.  The thing that *REALLY* matters is how many of these splits were
avoidable and *could* be coalesced.

The patch here does not actually separate out pre-boot from post-boot,
so it's pretty hard to tell whether the splits came from something like
tracing, which is totally unnecessary, or whether they were the result of
something at boot that we can't do anything about.

This would be a lot more useful if you could reset the counters.  Then
just reset them from userspace at boot.  Adding read-write debugfs
exports for these should be pretty trivial.
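[Editor's note: a minimal sketch of the userspace side of this suggestion. The debugfs file names are hypothetical, modeled on the naming used later in this thread, and a real script would need root and a mounted debugfs.]

```python
# Sketch: reset the split-event counters from userspace early at boot.
# The file names under `base` are hypothetical (taken from the V2 patch
# in this thread); adjust `base` to wherever debugfs is mounted.
from pathlib import Path

def reset_split_counters(base="/sys/kernel/debug/x86"):
    """Write 0 to each split-event counter the kernel exposes; return
    the names of the counters that were actually reset."""
    reset = []
    for name in ("direct_map_2M_split", "direct_map_4M_split",
                 "direct_map_1G_split"):
        node = Path(base) / name
        if node.exists():
            node.write_text("0")  # the proposed kernel side accepts only 0
            reset.append(name)
    return reset
```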


* Re: [PATCH] x86/mm: Tracking linear mapping split events since boot
  2021-01-26  0:47     ` Dave Hansen
@ 2021-01-26  0:53       ` Tejun Heo
  2021-01-26  1:04         ` Dave Hansen
  0 siblings, 1 reply; 33+ messages in thread
From: Tejun Heo @ 2021-01-26  0:53 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Saravanan D, x86, dave.hansen, luto, peterz, linux-kernel, kernel-team

Hello, Dave.

On Mon, Jan 25, 2021 at 04:47:42PM -0800, Dave Hansen wrote:
> The patch here does not actually separate out pre-boot from post-boot,
> so it's pretty hard to tell if the splits came from something like
> tracing which is totally unnecessary or they were the result of
> something at boot that we can't do anything about.

Ah, right, didn't know they also included splits during boot. It'd be a lot
more useful if they were counting post-boot splits.

> This would be a lot more useful if you could reset the counters.  Then
> just reset them from userspace at boot.  Adding read-write debugfs
> exports for these should be pretty trivial.

While this would work for hands-on cases, I'm a bit worried that it might
be harder to gain confidence in these numbers in large production environments.

Thanks.

-- 
tejun


* Re: [PATCH] x86/mm: Tracking linear mapping split events since boot
  2021-01-26  0:53       ` Tejun Heo
@ 2021-01-26  1:04         ` Dave Hansen
  2021-01-26  1:17           ` Tejun Heo
  2021-01-27 17:51           ` [PATCH V2] x86/mm: Tracking linear mapping split events Saravanan D
  0 siblings, 2 replies; 33+ messages in thread
From: Dave Hansen @ 2021-01-26  1:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Saravanan D, x86, dave.hansen, luto, peterz, linux-kernel, kernel-team

On 1/25/21 4:53 PM, Tejun Heo wrote:
>> This would be a lot more useful if you could reset the counters.  Then
>> just reset them from userspace at boot.  Adding read-write debugfs
>> exports for these should be pretty trivial.
> While this would work for hands-on cases, I'm a bit worried that this might
> be more challenging to gain confidence in large production environments.

Which part?  Large production environments don't trust data from
debugfs?  Or don't trust it if it might have been reset?

You could stick the "reset" switch in debugfs, and dump something out in
dmesg like we do for /proc/sys/vm/drop_caches so it's not a surprise
that it happened.

BTW, counts of *events* don't really belong in meminfo.  These really do
belong in /proc/vmstat if anything.


* Re: [PATCH] x86/mm: Tracking linear mapping split events since boot
  2021-01-26  1:04         ` Dave Hansen
@ 2021-01-26  1:17           ` Tejun Heo
  2021-01-27 17:51           ` [PATCH V2] x86/mm: Tracking linear mapping split events Saravanan D
  1 sibling, 0 replies; 33+ messages in thread
From: Tejun Heo @ 2021-01-26  1:17 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Saravanan D, x86, dave.hansen, luto, peterz, linux-kernel, kernel-team

Hello,

On Mon, Jan 25, 2021 at 05:04:00PM -0800, Dave Hansen wrote:
> Which part?  Large production environments don't trust data from
> debugfs?  Or don't trust it if it might have been reset?

When the last reset was. Not saying it's impossible or anything but in
general it's a lot better to have the counters be monotonically
increasing, with time/event stamped markers, than to have the counters
reset or modified in other ways, because the ownership of a specific
counter might not be obvious to everyone and accidents and mistakes happen.

Note that the "time/event stamped markers" above don't need to and shouldn't
be in the kernel. It can be managed by whoever that wants to monitor a given
time period and there can be any number of them.
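[Editor's note: the userspace bookkeeping described above can be sketched as follows. This is a hypothetical monitor, not part of the patch; it snapshots the monotonic counters itself and reports per-window deltas. The counter names follow the vmstat names proposed in this thread.]

```python
# Sketch: userspace-side snapshot/delta bookkeeping over monotonic
# counters. Nothing in the kernel ever gets reset; each monitor keeps
# its own baseline and any number of monitors can coexist.

def parse_vmstat(text):
    """Parse /proc/vmstat-style "name value" lines into a dict."""
    counters = {}
    for line in text.splitlines():
        name, _, value = line.partition(" ")
        if value.strip().isdigit():
            counters[name] = int(value.strip())
    return counters

def splits_since(baseline, current, prefix="direct_map"):
    """Split events that happened between two snapshots."""
    return {name: current[name] - baseline.get(name, 0)
            for name in current if name.startswith(prefix)}

# Usage: snapshot once at the start of the monitoring window ...
before = parse_vmstat("direct_map_2M_splits 128\ndirect_map_1G_splits 7\n")
# ... and again at the end; the delta is this window's split count.
after = parse_vmstat("direct_map_2M_splits 167\ndirect_map_1G_splits 7\n")
print(splits_since(before, after))
```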

> You could stick the "reset" switch in debugfs, and dump something out in
> dmesg like we do for /proc/sys/vm/drop_caches so it's not a surprise
> that it happened.

Processing dmesgs can work too but isn't particularly reliable or scalable.

> BTW, counts of *events* don't really belong in meminfo.  These really do
> belong in /proc/vmstat if anything.

Oh yeah, I don't have a strong opinion on where the counters should go.

Thanks.

-- 
tejun


* [PATCH V2] x86/mm: Tracking linear mapping split events
  2021-01-26  1:04         ` Dave Hansen
  2021-01-26  1:17           ` Tejun Heo
@ 2021-01-27 17:51           ` Saravanan D
  2021-01-27 21:03             ` Tejun Heo
  1 sibling, 1 reply; 33+ messages in thread
From: Saravanan D @ 2021-01-27 17:51 UTC (permalink / raw)
  To: x86, dave.hansen, luto, peterz; +Cc: linux-kernel, kernel-team, Saravanan D

Numerous hugepage splits in the linear mapping would give
admins the signal to narrow down the sluggishness caused by TLB
miss/reload.

To help with debugging, we introduce monotonic lifetime hugepage
split event counts since SYSTEM_RUNNING to be displayed as part of
/proc/vmstat on x86 servers.

The lifetime split event information will be displayed at the bottom of
/proc/vmstat
....
swap_ra 0
swap_ra_hit 0
direct_map_2M_splits 139
direct_map_4M_splits 0
direct_map_1G_splits 7
nr_unstable 0
....

Ancillary debugfs split event counts exported to userspace via read-write
endpoints : /sys/kernel/debug/x86/direct_map_[2M|4M|1G]_split

dmesg log when user resets the debugfs split event count for
debugging
....
[  232.470531] debugfs 2M Pages split event count(128) reset to 0
....

One of the many lasting (as we don't coalesce back) sources for huge page
splits is tracing as the granular page attribute/permission changes would
force the kernel to split code segments mapped to huge pages to smaller
ones thereby increasing the probability of TLB miss/reload even after
tracing has been stopped.

Signed-off-by: Saravanan D <saravanand@fb.com>
---
 arch/x86/mm/pat/set_memory.c  | 117 ++++++++++++++++++++++++++++++++++
 include/linux/vm_event_item.h |   8 +++
 mm/vmstat.c                   |   8 +++
 3 files changed, 133 insertions(+)

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 16f878c26667..97b6ef8dbd12 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -16,6 +16,8 @@
 #include <linux/pci.h>
 #include <linux/vmalloc.h>
 #include <linux/libnvdimm.h>
+#include <linux/vmstat.h>
+#include <linux/kernel.h>
 
 #include <asm/e820/api.h>
 #include <asm/processor.h>
@@ -76,6 +78,104 @@ static inline pgprot_t cachemode2pgprot(enum page_cache_mode pcm)
 
 #ifdef CONFIG_PROC_FS
 static unsigned long direct_pages_count[PG_LEVEL_NUM];
+static unsigned long split_page_event_count[PG_LEVEL_NUM];
+
+#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
+static int direct_map_2M_split_set(void *data, u64 val)
+{
+	switch (val) {
+	case 0:
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	pr_info("debugfs 2M Pages split event count(%lu) reset to 0",
+		  split_page_event_count[PG_LEVEL_2M]);
+	split_page_event_count[PG_LEVEL_2M] = 0;
+
+	return 0;
+}
+
+static int direct_map_2M_split_get(void *data, u64 *val)
+{
+	*val = split_page_event_count[PG_LEVEL_2M];
+	return 0;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(fops_direct_map_2M_split, direct_map_2M_split_get,
+			 direct_map_2M_split_set, "%llu\n");
+#else
+static int direct_map_4M_split_set(void *data, u64 val)
+{
+	switch (val) {
+	case 0:
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	pr_info("debugfs 4M Pages split event count(%lu) reset to 0",
+		  split_page_event_count[PG_LEVEL_2M]);
+	split_page_event_count[PG_LEVEL_2M] = 0;
+
+	return 0;
+}
+
+static int direct_map_4M_split_get(void *data, u64 *val)
+{
+	*val = split_page_event_count[PG_LEVEL_2M];
+	return 0;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(fops_direct_map_4M_split, direct_map_4M_split_get,
+			 direct_map_4M_split_set, "%llu\n");
+#endif
+
+static int direct_map_1G_split_set(void *data, u64 val)
+{
+	switch (val) {
+	case 0:
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	pr_info("debugfs 1G Pages split event count(%lu) reset to 0",
+		  split_page_event_count[PG_LEVEL_1G]);
+	split_page_event_count[PG_LEVEL_1G] = 0;
+
+	return 0;
+}
+
+static int direct_map_1G_split_get(void *data, u64 *val)
+{
+	*val = split_page_event_count[PG_LEVEL_1G];
+	return 0;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(fops_direct_map_1G_split, direct_map_1G_split_get,
+			 direct_map_1G_split_set, "%llu\n");
+
+static __init int direct_map_split_debugfs_init(void)
+{
+#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
+	debugfs_create_file("direct_map_2M_split", 0600,
+			    arch_debugfs_dir, NULL,
+			    &fops_direct_map_2M_split);
+#else
+	debugfs_create_file("direct_map_4M_split", 0600,
+			    arch_debugfs_dir, NULL,
+			    &fops_direct_map_4M_split);
+#endif
+	if (direct_gbpages)
+		debugfs_create_file("direct_map_1G_split", 0600,
+				    arch_debugfs_dir, NULL,
+				    &fops_direct_map_1G_split);
+	return 0;
+}
+
+late_initcall(direct_map_split_debugfs_init);
 
 void update_page_count(int level, unsigned long pages)
 {
@@ -85,12 +185,29 @@ void update_page_count(int level, unsigned long pages)
 	spin_unlock(&pgd_lock);
 }
 
+void update_split_page_event_count(int level)
+{
+	if (system_state == SYSTEM_RUNNING) {
+		split_page_event_count[level]++;
+		if (level == PG_LEVEL_2M) {
+#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
+			count_vm_event(DIRECT_MAP_2M_SPLIT);
+#else
+			count_vm_event(DIRECT_MAP_4M_SPLIT);
+#endif
+		} else if (level == PG_LEVEL_1G) {
+			count_vm_event(DIRECT_MAP_1G_SPLIT);
+		}
+	}
+}
+
 static void split_page_count(int level)
 {
 	if (direct_pages_count[level] == 0)
 		return;
 
 	direct_pages_count[level]--;
+	update_split_page_event_count(level);
 	direct_pages_count[level - 1] += PTRS_PER_PTE;
 }
 
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 18e75974d4e3..439742d2435e 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -120,6 +120,14 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_SWAP
 		SWAP_RA,
 		SWAP_RA_HIT,
+#endif
+#if defined(__x86_64__)
+#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
+		DIRECT_MAP_2M_SPLIT,
+#else
+		DIRECT_MAP_4M_SPLIT,
+#endif
+		DIRECT_MAP_1G_SPLIT,
 #endif
 		NR_VM_EVENT_ITEMS
 };
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f8942160fc95..beaa2bb4f9dc 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1350,6 +1350,14 @@ const char * const vmstat_text[] = {
 	"swap_ra",
 	"swap_ra_hit",
 #endif
+#if defined(__x86_64__)
+#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
+	"direct_map_2M_splits",
+#else
+	"direct_map_4M_splits",
+#endif
+	"direct_map_1G_splits",
+#endif
 #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
 };
 #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
-- 
2.24.1



* Re: [PATCH V2] x86/mm: Tracking linear mapping split events
  2021-01-27 17:51           ` [PATCH V2] x86/mm: Tracking linear mapping split events Saravanan D
@ 2021-01-27 21:03             ` Tejun Heo
  2021-01-27 21:32               ` Dave Hansen
  0 siblings, 1 reply; 33+ messages in thread
From: Tejun Heo @ 2021-01-27 21:03 UTC (permalink / raw)
  To: Saravanan D; +Cc: x86, dave.hansen, luto, peterz, linux-kernel, kernel-team

Hello,

On Wed, Jan 27, 2021 at 09:51:24AM -0800, Saravanan D wrote:
> Numerous hugepage splits in the linear mapping would give
> admins the signal to narrow down the sluggishness caused by TLB
> miss/reload.
> 
> To help with debugging, we introduce monotonic lifetime hugepage
> split event counts since SYSTEM_RUNNING to be displayed as part of
> /proc/vmstat on x86 servers.
> 
> The lifetime split event information will be displayed at the bottom of
> /proc/vmstat
> ....
> swap_ra 0
> swap_ra_hit 0
> direct_map_2M_splits 139
> direct_map_4M_splits 0
> direct_map_1G_splits 7
> nr_unstable 0
> ....

This looks great to me.

> 
> Ancillary debugfs split event counts exported to userspace via read-write
> endpoints : /sys/kernel/debug/x86/direct_map_[2M|4M|1G]_split
> 
> dmesg log when user resets the debugfs split event count for
> debugging
> ....
> [  232.470531] debugfs 2M Pages split event count(128) reset to 0
> ....

I'm not convinced this part is necessary or even beneficial.

> One of the many lasting (as we don't coalesce back) sources for huge page
> splits is tracing as the granular page attribute/permission changes would
> force the kernel to split code segments mapped to huge pages to smaller
> ones thereby increasing the probability of TLB miss/reload even after
> tracing has been stopped.
> 
> Signed-off-by: Saravanan D <saravanand@fb.com>
> ---
>  arch/x86/mm/pat/set_memory.c  | 117 ++++++++++++++++++++++++++++++++++
>  include/linux/vm_event_item.h |   8 +++
>  mm/vmstat.c                   |   8 +++
>  3 files changed, 133 insertions(+)

So, now the majority of the added code is to add debugfs knobs which don't
provide anything that userland can't already do by simply reading the
monotonic counters.

Dave, are you still set on the resettable counters?

Thanks.

-- 
tejun


* Re: [PATCH V2] x86/mm: Tracking linear mapping split events
  2021-01-27 21:03             ` Tejun Heo
@ 2021-01-27 21:32               ` Dave Hansen
  2021-01-27 21:36                 ` Tejun Heo
  0 siblings, 1 reply; 33+ messages in thread
From: Dave Hansen @ 2021-01-27 21:32 UTC (permalink / raw)
  To: Tejun Heo, Saravanan D
  Cc: x86, dave.hansen, luto, peterz, linux-kernel, kernel-team

On 1/27/21 1:03 PM, Tejun Heo wrote:
>> The lifetime split event information will be displayed at the bottom of
>> /proc/vmstat
>> ....
>> swap_ra 0
>> swap_ra_hit 0
>> direct_map_2M_splits 139
>> direct_map_4M_splits 0
>> direct_map_1G_splits 7
>> nr_unstable 0
>> ....
> 
> This looks great to me.

Yeah, this looks fine to me.  It's way better than meminfo.

>>  arch/x86/mm/pat/set_memory.c  | 117 ++++++++++++++++++++++++++++++++++
>>  include/linux/vm_event_item.h |   8 +++
>>  mm/vmstat.c                   |   8 +++
>>  3 files changed, 133 insertions(+)
> 
> So, now the majority of the added code is to add debugfs knobs which don't
> provide anything that userland can't already do by simply reading the
> monotonic counters.
> 
> Dave, are you still set on the resettable counters?

Not *really*.  But, you either need them to be resettable, or you need
to expect your users to take snapshots and compare changes over time.
Considering how much more code it is, though, I'm not super attached to it.


* Re: [PATCH V2] x86/mm: Tracking linear mapping split events
  2021-01-27 21:32               ` Dave Hansen
@ 2021-01-27 21:36                 ` Tejun Heo
  2021-01-27 21:42                   ` Saravanan D
  2021-01-27 22:50                   ` [PATCH V3] " Saravanan D
  0 siblings, 2 replies; 33+ messages in thread
From: Tejun Heo @ 2021-01-27 21:36 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Saravanan D, x86, dave.hansen, luto, peterz, linux-kernel, kernel-team

Hello,

On Wed, Jan 27, 2021 at 01:32:03PM -0800, Dave Hansen wrote:
> >>  arch/x86/mm/pat/set_memory.c  | 117 ++++++++++++++++++++++++++++++++++
> >>  include/linux/vm_event_item.h |   8 +++
> >>  mm/vmstat.c                   |   8 +++
> >>  3 files changed, 133 insertions(+)
> > 
> > So, now the majority of the added code is to add debugfs knobs which don't
> > provide anything that userland can't already do by simply reading the
> > monotonic counters.
> > 
> > Dave, are you still set on the resettable counters?
> 
> Not *really*.  But, you either need them to be resettable, or you need
> to expect your users to take snapshots and compare changes over time.
> Considering how much more code it is, though, I'm not super attached to it.

Saravanan, can you please drop the debugfs portion and repost?

Thanks.

-- 
tejun


* Re: [PATCH V2] x86/mm: Tracking linear mapping split events
  2021-01-27 21:36                 ` Tejun Heo
@ 2021-01-27 21:42                   ` Saravanan D
  2021-01-27 22:50                   ` [PATCH V3] " Saravanan D
  1 sibling, 0 replies; 33+ messages in thread
From: Saravanan D @ 2021-01-27 21:42 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Dave Hansen, x86, dave.hansen, luto, peterz, linux-kernel, kernel-team

Hi Tejun,

> Saravanan, can you please drop the debugfs portion and repost?
Sure.

Saravanan D


* [PATCH V3] x86/mm: Tracking linear mapping split events
  2021-01-27 21:36                 ` Tejun Heo
  2021-01-27 21:42                   ` Saravanan D
@ 2021-01-27 22:50                   ` Saravanan D
  2021-01-27 23:00                     ` Randy Dunlap
  2021-01-27 23:41                     ` Dave Hansen
  1 sibling, 2 replies; 33+ messages in thread
From: Saravanan D @ 2021-01-27 22:50 UTC (permalink / raw)
  To: x86, dave.hansen, luto, peterz; +Cc: linux-kernel, kernel-team, Saravanan D

To help with debugging the sluggishness caused by TLB miss/reload,
we introduce monotonic lifetime hugepage split event counts since
system state SYSTEM_RUNNING to be displayed as part of
/proc/vmstat on x86 servers.

The lifetime split event information will be displayed at the bottom of
/proc/vmstat
....
swap_ra 0
swap_ra_hit 0
direct_map_2M_splits 167
direct_map_1G_splits 6
nr_unstable 0
....

One of the many lasting (as we don't coalesce back) sources for huge page
splits is tracing as the granular page attribute/permission changes would
force the kernel to split code segments mapped to huge pages to smaller
ones thereby increasing the probability of TLB miss/reload even after
tracing has been stopped.

Signed-off-by: Saravanan D <saravanand@fb.com>
---
 arch/x86/mm/pat/set_memory.c  | 18 ++++++++++++++++++
 include/linux/vm_event_item.h |  8 ++++++++
 mm/vmstat.c                   |  8 ++++++++
 3 files changed, 34 insertions(+)

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 16f878c26667..3ea6316df089 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -16,6 +16,8 @@
 #include <linux/pci.h>
 #include <linux/vmalloc.h>
 #include <linux/libnvdimm.h>
+#include <linux/vmstat.h>
+#include <linux/kernel.h>
 
 #include <asm/e820/api.h>
 #include <asm/processor.h>
@@ -85,12 +87,28 @@ void update_page_count(int level, unsigned long pages)
 	spin_unlock(&pgd_lock);
 }
 
+void update_split_page_event_count(int level)
+{
+	if (system_state == SYSTEM_RUNNING) {
+		if (level == PG_LEVEL_2M) {
+#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
+			count_vm_event(DIRECT_MAP_2M_SPLIT);
+#else
+			count_vm_event(DIRECT_MAP_4M_SPLIT);
+#endif
+		} else if (level == PG_LEVEL_1G) {
+			count_vm_event(DIRECT_MAP_1G_SPLIT);
+		}
+	}
+}
+
 static void split_page_count(int level)
 {
 	if (direct_pages_count[level] == 0)
 		return;
 
 	direct_pages_count[level]--;
+	update_split_page_event_count(level);
 	direct_pages_count[level - 1] += PTRS_PER_PTE;
 }
 
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 18e75974d4e3..439742d2435e 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -120,6 +120,14 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_SWAP
 		SWAP_RA,
 		SWAP_RA_HIT,
+#endif
+#if defined(__x86_64__)
+#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
+		DIRECT_MAP_2M_SPLIT,
+#else
+		DIRECT_MAP_4M_SPLIT,
+#endif
+		DIRECT_MAP_1G_SPLIT,
 #endif
 		NR_VM_EVENT_ITEMS
 };
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f8942160fc95..beaa2bb4f9dc 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1350,6 +1350,14 @@ const char * const vmstat_text[] = {
 	"swap_ra",
 	"swap_ra_hit",
 #endif
+#if defined(__x86_64__)
+#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
+	"direct_map_2M_splits",
+#else
+	"direct_map_4M_splits",
+#endif
+	"direct_map_1G_splits",
+#endif
 #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
 };
 #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
-- 
2.24.1



* Re: [PATCH V3] x86/mm: Tracking linear mapping split events
  2021-01-27 22:50                   ` [PATCH V3] " Saravanan D
@ 2021-01-27 23:00                     ` Randy Dunlap
  2021-01-27 23:56                       ` Saravanan D
  2021-01-27 23:41                     ` Dave Hansen
  1 sibling, 1 reply; 33+ messages in thread
From: Randy Dunlap @ 2021-01-27 23:00 UTC (permalink / raw)
  To: Saravanan D, x86, dave.hansen, luto, peterz; +Cc: linux-kernel, kernel-team

On 1/27/21 2:50 PM, Saravanan D wrote:
> To help with debugging the sluggishness caused by TLB miss/reload,
> we introduce monotonic lifetime hugepage split event counts since
> system state SYSTEM_RUNNING to be displayed as part of
> /proc/vmstat on x86 servers.
> 
> The lifetime split event information will be displayed at the bottom of
> /proc/vmstat
> ....
> swap_ra 0
> swap_ra_hit 0
> direct_map_2M_splits 167
> direct_map_1G_splits 6
> nr_unstable 0
> ....
> 
> One of the many lasting (as we don't coalesce back) sources for huge page
> splits is tracing as the granular page attribute/permission changes would
> force the kernel to split code segments mapped to huge pages to smaller
> ones thereby increasing the probability of TLB miss/reload even after
> tracing has been stopped.
> 
> Signed-off-by: Saravanan D <saravanand@fb.com>
> ---
>  arch/x86/mm/pat/set_memory.c  | 18 ++++++++++++++++++
>  include/linux/vm_event_item.h |  8 ++++++++
>  mm/vmstat.c                   |  8 ++++++++
>  3 files changed, 34 insertions(+)

Documentation/ update, please.

-- 
~Randy



* Re: [PATCH V3] x86/mm: Tracking linear mapping split events
  2021-01-27 22:50                   ` [PATCH V3] " Saravanan D
  2021-01-27 23:00                     ` Randy Dunlap
@ 2021-01-27 23:41                     ` Dave Hansen
  2021-01-28  0:15                       ` Saravanan D
  2021-01-28  4:35                       ` [PATCH V4] " Saravanan D
  1 sibling, 2 replies; 33+ messages in thread
From: Dave Hansen @ 2021-01-27 23:41 UTC (permalink / raw)
  To: Saravanan D, x86, dave.hansen, luto, peterz; +Cc: linux-kernel, kernel-team

On 1/27/21 2:50 PM, Saravanan D wrote:
> +#if defined(__x86_64__)

We don't use __x86_64__ in the kernel.  This should be CONFIG_X86.

> +#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
> +	"direct_map_2M_splits",
> +#else
> +	"direct_map_4M_splits",
> +#endif
> +	"direct_map_1G_splits",
> +#endif

These #ifdefs are hideous, and repeated.

I'd rather have no 32-bit support than expose us to this ugliness.
Worst case, the 32-bit non-PAE folks (i.e. almost nobody in the world)
can just live with seeing "2M" when the mappings are really 4M.  Or, you
*could* name these after the page table levels:

	direct_map_pmd_splits
	direct_map_pud_splits

or the level from the bottom where the split occurred:

	direct_map_level2_splits
	direct_map_level3_splits

That has the bonus of being usable on other architectures.

Oh, and 1G splits aren't possible on non-PAE 32-bit.  There are only 2
levels: 4M and 4k, which would make what you have above:

> +#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
> +	"direct_map_2M_splits",
> +	"direct_map_1G_splits",
> +#else
> +	"direct_map_4M_splits",
> +#endif

I don't think there's ever a 1G/4M case.


* Re: [PATCH V3] x86/mm: Tracking linear mapping split events
  2021-01-27 23:00                     ` Randy Dunlap
@ 2021-01-27 23:56                       ` Saravanan D
  0 siblings, 0 replies; 33+ messages in thread
From: Saravanan D @ 2021-01-27 23:56 UTC (permalink / raw)
  To: Randy Dunlap; +Cc: x86, dave.hansen, luto, peterz, linux-kernel, kernel-team

Hi Randy,
> Documentation/ update, please.
I will include it in the V4 patch.

- Saravanan D


* Re: [PATCH V3] x86/mm: Tracking linear mapping split events
  2021-01-27 23:41                     ` Dave Hansen
@ 2021-01-28  0:15                       ` Saravanan D
  2021-01-28  4:35                       ` [PATCH V4] " Saravanan D
  1 sibling, 0 replies; 33+ messages in thread
From: Saravanan D @ 2021-01-28  0:15 UTC (permalink / raw)
  To: Dave Hansen; +Cc: x86, dave.hansen, luto, peterz, linux-kernel, kernel-team

Hi Dave,

> We don't use __x86_64__ in the kernel.  This should be CONFIG_X86.
Noted. I will correct this in V4

> or the level from the bottom where the split occurred:
> 
> 	direct_map_level2_splits
> 	direct_map_level3_splits
> 
> That has the bonus of being usable on other architectures.
Naming them after page table levels makes a lot of sense. Two new vmstat
event counters that are relevant for everyone, without the need for #ifdef
page size craziness.

- Saravanan D


* [PATCH V4] x86/mm: Tracking linear mapping split events
  2021-01-27 23:41                     ` Dave Hansen
  2021-01-28  0:15                       ` Saravanan D
@ 2021-01-28  4:35                       ` Saravanan D
  2021-01-28  4:51                         ` Matthew Wilcox
  1 sibling, 1 reply; 33+ messages in thread
From: Saravanan D @ 2021-01-28  4:35 UTC (permalink / raw)
  To: x86, dave.hansen, luto, peterz, corbet
  Cc: linux-kernel, kernel-team, linux-doc, Saravanan D

To help with debugging the sluggishness caused by TLB miss/reload,
we introduce monotonic lifetime hugepage split event counts since
system state SYSTEM_RUNNING to be displayed as part of
/proc/vmstat on x86 servers.

The lifetime split event information will be displayed at the bottom of
/proc/vmstat
....
swap_ra 0
swap_ra_hit 0
direct_map_level2_splits 94
direct_map_level3_splits 4
nr_unstable 0
....

One of the many lasting (as we don't coalesce back) sources for huge page
splits is tracing as the granular page attribute/permission changes would
force the kernel to split code segments mapped to huge pages to smaller
ones thereby increasing the probability of TLB miss/reload even after
tracing has been stopped.

Documentation regarding linear mapping split events added to admin-guide
as requested in V3 of the patch.

Signed-off-by: Saravanan D <saravanand@fb.com>
---
 .../admin-guide/mm/direct_mapping_splits.rst  | 59 +++++++++++++++++++
 Documentation/admin-guide/mm/index.rst        |  1 +
 arch/x86/mm/pat/set_memory.c                  | 13 ++++
 include/linux/vm_event_item.h                 |  4 ++
 mm/vmstat.c                                   |  4 ++
 5 files changed, 81 insertions(+)
 create mode 100644 Documentation/admin-guide/mm/direct_mapping_splits.rst

diff --git a/Documentation/admin-guide/mm/direct_mapping_splits.rst b/Documentation/admin-guide/mm/direct_mapping_splits.rst
new file mode 100644
index 000000000000..298751391deb
--- /dev/null
+++ b/Documentation/admin-guide/mm/direct_mapping_splits.rst
@@ -0,0 +1,59 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Direct Mapping Splits
+=====================
+
+The kernel maps all of physical memory in linear/direct mapped pages, where
+translation of a virtual kernel address to its physical address is achieved
+through a simple offset subtraction. CPUs maintain a cache of these
+translations in fast caches called TLBs. CPU architectures like x86 allow
+direct mapping large portions of memory into hugepages (2M, 1G, etc.) at
+various page table levels.
+
+Maintaining huge direct mapped pages greatly reduces TLB miss pressure.
+The splintering of huge direct pages into smaller ones does result in
+a measurable performance hit caused by frequent TLB misses and reloads.
+
+One of the many lasting (as the kernel does not coalesce them back)
+sources of huge page splits is tracing: the granular page
+attribute/permission changes force the kernel to split the hugepages
+backing code segments into smaller ones, thus increasing the
+probability of TLB misses/reloads even after tracing has been stopped.
+
+On x86 systems, we can track the splitting of huge direct mapped pages
+through lifetime event counters in ``/proc/vmstat``
+
+	direct_map_level2_splits xxx
+	direct_map_level3_splits yyy
+
+where:
+
+direct_map_level2_splits
+	are 2M/4M hugepage split events
+direct_map_level3_splits
+	are 1G hugepage split events
+
+The distribution of direct mapped system memory in various page sizes
+post splits can be viewed through ``/proc/meminfo`` whose output
+will include the following lines depending upon supporting CPU
+architecture
+
+	DirectMap4k:    xxxxx kB
+	DirectMap2M:    yyyyy kB
+	DirectMap1G:    zzzzz kB
+
+where:
+
+DirectMap4k
+	is the total amount of direct mapped memory (in kB)
+	accessed through 4k pages
+DirectMap2M
+	is the total amount of direct mapped memory (in kB)
+	accessed through 2M pages
+DirectMap1G
+	is the total amount of direct mapped memory (in kB)
+	accessed through 1G pages
+
+
+-- Saravanan D, Jan 27, 2021
diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
index 4b14d8b50e9e..9439780f3f07 100644
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -38,3 +38,4 @@ the Linux memory management.
    soft-dirty
    transhuge
    userfaultfd
+   direct_mapping_splits
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 16f878c26667..767cade53bdc 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -16,6 +16,8 @@
 #include <linux/pci.h>
 #include <linux/vmalloc.h>
 #include <linux/libnvdimm.h>
+#include <linux/vmstat.h>
+#include <linux/kernel.h>
 
 #include <asm/e820/api.h>
 #include <asm/processor.h>
@@ -85,12 +87,23 @@ void update_page_count(int level, unsigned long pages)
 	spin_unlock(&pgd_lock);
 }
 
+void update_split_page_event_count(int level)
+{
+	if (system_state == SYSTEM_RUNNING) {
+		if (level == PG_LEVEL_2M)
+			count_vm_event(DIRECT_MAP_LEVEL2_SPLIT);
+		else if (level == PG_LEVEL_1G)
+			count_vm_event(DIRECT_MAP_LEVEL3_SPLIT);
+	}
+}
+
 static void split_page_count(int level)
 {
 	if (direct_pages_count[level] == 0)
 		return;
 
 	direct_pages_count[level]--;
+	update_split_page_event_count(level);
 	direct_pages_count[level - 1] += PTRS_PER_PTE;
 }
 
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 18e75974d4e3..7c06c2bdc33b 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -120,6 +120,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_SWAP
 		SWAP_RA,
 		SWAP_RA_HIT,
+#endif
+#ifdef CONFIG_X86
+		DIRECT_MAP_LEVEL2_SPLIT,
+		DIRECT_MAP_LEVEL3_SPLIT,
 #endif
 		NR_VM_EVENT_ITEMS
 };
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f8942160fc95..a43ac4ac98a2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1350,6 +1350,10 @@ const char * const vmstat_text[] = {
 	"swap_ra",
 	"swap_ra_hit",
 #endif
+#ifdef CONFIG_X86
+	"direct_map_level2_splits",
+	"direct_map_level3_splits",
+#endif
 #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
 };
 #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH V4] x86/mm: Tracking linear mapping split events
  2021-01-28  4:35                       ` [PATCH V4] " Saravanan D
@ 2021-01-28  4:51                         ` Matthew Wilcox
       [not found]                           ` <20210128104934.2916679-1-saravanand@fb.com>
  0 siblings, 1 reply; 33+ messages in thread
From: Matthew Wilcox @ 2021-01-28  4:51 UTC (permalink / raw)
  To: Saravanan D
  Cc: x86, dave.hansen, luto, peterz, corbet, linux-kernel,
	kernel-team, linux-doc, linux-mm, Song Liu

You forgot to cc linux-mm.  Adding.  Also I think you should be cc'ing
Song.

On Wed, Jan 27, 2021 at 08:35:47PM -0800, Saravanan D wrote:
> To help with debugging the sluggishness caused by TLB miss/reload,
> we introduce monotonic lifetime hugepage split event counts since
> system state: SYSTEM_RUNNING to be displayed as part of
> /proc/vmstat in x86 servers
> 
> The lifetime split event information will be displayed at the bottom of
> /proc/vmstat
> ....
> swap_ra 0
> swap_ra_hit 0
> direct_map_level2_splits 94
> direct_map_level3_splits 4
> nr_unstable 0
> ....
> 
> One of the many lasting (as we don't coalesce back) sources for huge page
> splits is tracing as the granular page attribute/permission changes would
> force the kernel to split code segments mapped to huge pages to smaller
> ones thereby increasing the probability of TLB miss/reload even after
> tracing has been stopped.

Are you talking about kernel text here or application text?

In either case, I don't know why you're saying we don't coalesce
back after tracing is disabled.  I was under the impression we did
(either actively in the case of the kernel or via khugepaged for
user text).

> Documentation regarding linear mapping split events added to admin-guide
> as requested in V3 of the patch.
> 
> Signed-off-by: Saravanan D <saravanand@fb.com>
> ---
>  .../admin-guide/mm/direct_mapping_splits.rst  | 59 +++++++++++++++++++
>  Documentation/admin-guide/mm/index.rst        |  1 +
>  arch/x86/mm/pat/set_memory.c                  | 13 ++++
>  include/linux/vm_event_item.h                 |  4 ++
>  mm/vmstat.c                                   |  4 ++
>  5 files changed, 81 insertions(+)
>  create mode 100644 Documentation/admin-guide/mm/direct_mapping_splits.rst
> 
> diff --git a/Documentation/admin-guide/mm/direct_mapping_splits.rst b/Documentation/admin-guide/mm/direct_mapping_splits.rst
> new file mode 100644
> index 000000000000..298751391deb
> --- /dev/null
> +++ b/Documentation/admin-guide/mm/direct_mapping_splits.rst
> @@ -0,0 +1,59 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================
> +Direct Mapping Splits
> +=====================
> +
> +Kernel maps all of physical memory in linear/direct mapped pages with
> +translation of virtual kernel address to physical address is achieved
> +through a simple subtraction of offset. CPUs maintain a cache of these
> +translations on fast caches called TLBs. CPU architectures like x86 allow
> +direct mapping large portions of memory into hugepages (2M, 1G, etc) in
> +various page table levels.
> +
> +Maintaining huge direct mapped pages greatly reduces TLB miss pressure.
> +The splintering of huge direct pages into smaller ones does result in
> +a measurable performance hit caused by frequent TLB miss and reloads.
> +
> +One of the many lasting (as we don't coalesce back) sources for huge page
> +splits is tracing as the granular page attribute/permission changes would
> +force the kernel to split code segments mapped to hugepages to smaller
> +ones thus increasing the probability of TLB miss/reloads even after
> +tracing has been stopped.
> +
> +On x86 systems, we can track the splitting of huge direct mapped pages
> +through lifetime event counters in ``/proc/vmstat``
> +
> +	direct_map_level2_splits xxx
> +	direct_map_level3_splits yyy
> +
> +where:
> +
> +direct_map_level2_splits
> +	are 2M/4M hugepage split events
> +direct_map_level3_splits
> +	are 1G hugepage split events
> +
> +The distribution of direct mapped system memory in various page sizes
> +post splits can be viewed through ``/proc/meminfo`` whose output
> +will include the following lines depending upon supporting CPU
> +architecture
> +
> +	DirectMap4k:    xxxxx kB
> +	DirectMap2M:    yyyyy kB
> +	DirectMap1G:    zzzzz kB
> +
> +where:
> +
> +DirectMap4k
> +	is the total amount of direct mapped memory (in kB)
> +	accessed through 4k pages
> +DirectMap2M
> +	is the total amount of direct mapped memory (in kB)
> +	accessed through 2M pages
> +DirectMap1G
> +	is the total amount of direct mapped memory (in kB)
> +	accessed through 1G pages
> +
> +
> +-- Saravanan D, Jan 27, 2021
> diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
> index 4b14d8b50e9e..9439780f3f07 100644
> --- a/Documentation/admin-guide/mm/index.rst
> +++ b/Documentation/admin-guide/mm/index.rst
> @@ -38,3 +38,4 @@ the Linux memory management.
>     soft-dirty
>     transhuge
>     userfaultfd
> +   direct_mapping_splits
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index 16f878c26667..767cade53bdc 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -16,6 +16,8 @@
>  #include <linux/pci.h>
>  #include <linux/vmalloc.h>
>  #include <linux/libnvdimm.h>
> +#include <linux/vmstat.h>
> +#include <linux/kernel.h>
>  
>  #include <asm/e820/api.h>
>  #include <asm/processor.h>
> @@ -85,12 +87,23 @@ void update_page_count(int level, unsigned long pages)
>  	spin_unlock(&pgd_lock);
>  }
>  
> +void update_split_page_event_count(int level)
> +{
> +	if (system_state == SYSTEM_RUNNING) {
> +		if (level == PG_LEVEL_2M)
> +			count_vm_event(DIRECT_MAP_LEVEL2_SPLIT);
> +		else if (level == PG_LEVEL_1G)
> +			count_vm_event(DIRECT_MAP_LEVEL3_SPLIT);
> +	}
> +}
> +
>  static void split_page_count(int level)
>  {
>  	if (direct_pages_count[level] == 0)
>  		return;
>  
>  	direct_pages_count[level]--;
> +	update_split_page_event_count(level);
>  	direct_pages_count[level - 1] += PTRS_PER_PTE;
>  }
>  
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 18e75974d4e3..7c06c2bdc33b 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -120,6 +120,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  #ifdef CONFIG_SWAP
>  		SWAP_RA,
>  		SWAP_RA_HIT,
> +#endif
> +#ifdef CONFIG_X86
> +		DIRECT_MAP_LEVEL2_SPLIT,
> +		DIRECT_MAP_LEVEL3_SPLIT,
>  #endif
>  		NR_VM_EVENT_ITEMS
>  };
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index f8942160fc95..a43ac4ac98a2 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1350,6 +1350,10 @@ const char * const vmstat_text[] = {
>  	"swap_ra",
>  	"swap_ra_hit",
>  #endif
> +#ifdef CONFIG_X86
> +	"direct_map_level2_splits",
> +	"direct_map_level3_splits",
> +#endif
>  #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
>  };
>  #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
> -- 
> 2.24.1
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V5] x86/mm: Tracking linear mapping split events
       [not found]                           ` <20210128104934.2916679-1-saravanand@fb.com>
@ 2021-01-28 15:04                             ` Matthew Wilcox
  2021-01-28 19:49                               ` Saravanan D
  2021-01-28 16:33                             ` Zi Yan
       [not found]                             ` <3aec2d10-f4c3-d07a-356f-6f1001679181@intel.com>
  2 siblings, 1 reply; 33+ messages in thread
From: Matthew Wilcox @ 2021-01-28 15:04 UTC (permalink / raw)
  To: Saravanan D
  Cc: x86, dave.hansen, luto, peterz, corbet, linux-kernel,
	kernel-team, linux-doc, linux-mm, songliubraving

On Thu, Jan 28, 2021 at 02:49:34AM -0800, Saravanan D wrote:
> One of the many lasting (as we don't coalesce back) sources for huge page
> splits is tracing as the granular page attribute/permission changes would
> force the kernel to split code segments mapped to huge pages to smaller
> ones thereby increasing the probability of TLB miss/reload even after
> tracing has been stopped.

You didn't answer my question.

Is this tracing of userspace programs causing splits, or is it kernel
tracing?  Also, we have lots of kinds of tracing these days; are you
referring to kprobes?  tracepoints?  ftrace?  Something else?

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V5] x86/mm: Tracking linear mapping split events
       [not found]                           ` <20210128104934.2916679-1-saravanand@fb.com>
  2021-01-28 15:04                             ` [PATCH V5] " Matthew Wilcox
@ 2021-01-28 16:33                             ` Zi Yan
  2021-01-28 16:41                               ` Dave Hansen
  2021-01-28 16:59                               ` Song Liu
       [not found]                             ` <3aec2d10-f4c3-d07a-356f-6f1001679181@intel.com>
  2 siblings, 2 replies; 33+ messages in thread
From: Zi Yan @ 2021-01-28 16:33 UTC (permalink / raw)
  To: Saravanan D, Xing Zhengjun
  Cc: x86, dave.hansen, luto, peterz, corbet, willy, linux-kernel,
	kernel-team, linux-doc, linux-mm, songliubraving

On 28 Jan 2021, at 5:49, Saravanan D wrote:

> To help with debugging the sluggishness caused by TLB miss/reload,
> we introduce monotonic lifetime hugepage split event counts since
> system state: SYSTEM_RUNNING to be displayed as part of
> /proc/vmstat in x86 servers
>
> The lifetime split event information will be displayed at the bottom of
> /proc/vmstat
> ....
> swap_ra 0
> swap_ra_hit 0
> direct_map_level2_splits 94
> direct_map_level3_splits 4
> nr_unstable 0
> ....
>
> One of the many lasting (as we don't coalesce back) sources for huge page
> splits is tracing as the granular page attribute/permission changes would
> force the kernel to split code segments mapped to huge pages to smaller
> ones thereby increasing the probability of TLB miss/reload even after
> tracing has been stopped.

It is interesting to see this statement saying splitting kernel direct mappings
causes performance loss, when Zhengjun (cc’d) from Intel recently posted
a kernel direct mapping performance report[1] saying 1GB mappings are good
but not much better than 2MB and 4KB mappings.

I would love to hear the stories from both sides. Or maybe I
misunderstand something.


[1]https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/
>
> Documentation regarding linear mapping split events added to admin-guide
> as requested in V3 of the patch.
>
> Signed-off-by: Saravanan D <saravanand@fb.com>
> ---
>  .../admin-guide/mm/direct_mapping_splits.rst  | 59 +++++++++++++++++++
>  Documentation/admin-guide/mm/index.rst        |  1 +
>  arch/x86/mm/pat/set_memory.c                  |  8 +++
>  include/linux/vm_event_item.h                 |  4 ++
>  mm/vmstat.c                                   |  4 ++
>  5 files changed, 76 insertions(+)
>  create mode 100644 Documentation/admin-guide/mm/direct_mapping_splits.rst
>
> diff --git a/Documentation/admin-guide/mm/direct_mapping_splits.rst b/Documentation/admin-guide/mm/direct_mapping_splits.rst
> new file mode 100644
> index 000000000000..298751391deb
> --- /dev/null
> +++ b/Documentation/admin-guide/mm/direct_mapping_splits.rst
> @@ -0,0 +1,59 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================
> +Direct Mapping Splits
> +=====================
> +
> +Kernel maps all of physical memory in linear/direct mapped pages with
> +translation of virtual kernel address to physical address is achieved
> +through a simple subtraction of offset. CPUs maintain a cache of these
> +translations on fast caches called TLBs. CPU architectures like x86 allow
> +direct mapping large portions of memory into hugepages (2M, 1G, etc) in
> +various page table levels.
> +
> +Maintaining huge direct mapped pages greatly reduces TLB miss pressure.
> +The splintering of huge direct pages into smaller ones does result in
> +a measurable performance hit caused by frequent TLB miss and reloads.
> +
> +One of the many lasting (as we don't coalesce back) sources for huge page
> +splits is tracing as the granular page attribute/permission changes would
> +force the kernel to split code segments mapped to hugepages to smaller
> +ones thus increasing the probability of TLB miss/reloads even after
> +tracing has been stopped.
> +
> +On x86 systems, we can track the splitting of huge direct mapped pages
> +through lifetime event counters in ``/proc/vmstat``
> +
> +	direct_map_level2_splits xxx
> +	direct_map_level3_splits yyy
> +
> +where:
> +
> +direct_map_level2_splits
> +	are 2M/4M hugepage split events
> +direct_map_level3_splits
> +	are 1G hugepage split events
> +
> +The distribution of direct mapped system memory in various page sizes
> +post splits can be viewed through ``/proc/meminfo`` whose output
> +will include the following lines depending upon supporting CPU
> +architecture
> +
> +	DirectMap4k:    xxxxx kB
> +	DirectMap2M:    yyyyy kB
> +	DirectMap1G:    zzzzz kB
> +
> +where:
> +
> +DirectMap4k
> +	is the total amount of direct mapped memory (in kB)
> +	accessed through 4k pages
> +DirectMap2M
> +	is the total amount of direct mapped memory (in kB)
> +	accessed through 2M pages
> +DirectMap1G
> +	is the total amount of direct mapped memory (in kB)
> +	accessed through 1G pages
> +
> +
> +-- Saravanan D, Jan 27, 2021
> diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
> index 4b14d8b50e9e..9439780f3f07 100644
> --- a/Documentation/admin-guide/mm/index.rst
> +++ b/Documentation/admin-guide/mm/index.rst
> @@ -38,3 +38,4 @@ the Linux memory management.
>     soft-dirty
>     transhuge
>     userfaultfd
> +   direct_mapping_splits
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index 16f878c26667..a7b3c5f1d316 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -16,6 +16,8 @@
>  #include <linux/pci.h>
>  #include <linux/vmalloc.h>
>  #include <linux/libnvdimm.h>
> +#include <linux/vmstat.h>
> +#include <linux/kernel.h>
>
>  #include <asm/e820/api.h>
>  #include <asm/processor.h>
> @@ -91,6 +93,12 @@ static void split_page_count(int level)
>  		return;
>
>  	direct_pages_count[level]--;
> +	if (system_state == SYSTEM_RUNNING) {
> +		if (level == PG_LEVEL_2M)
> +			count_vm_event(DIRECT_MAP_LEVEL2_SPLIT);
> +		else if (level == PG_LEVEL_1G)
> +			count_vm_event(DIRECT_MAP_LEVEL3_SPLIT);
> +	}
>  	direct_pages_count[level - 1] += PTRS_PER_PTE;
>  }
>
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 18e75974d4e3..7c06c2bdc33b 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -120,6 +120,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  #ifdef CONFIG_SWAP
>  		SWAP_RA,
>  		SWAP_RA_HIT,
> +#endif
> +#ifdef CONFIG_X86
> +		DIRECT_MAP_LEVEL2_SPLIT,
> +		DIRECT_MAP_LEVEL3_SPLIT,
>  #endif
>  		NR_VM_EVENT_ITEMS
>  };
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index f8942160fc95..a43ac4ac98a2 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1350,6 +1350,10 @@ const char * const vmstat_text[] = {
>  	"swap_ra",
>  	"swap_ra_hit",
>  #endif
> +#ifdef CONFIG_X86
> +	"direct_map_level2_splits",
> +	"direct_map_level3_splits",
> +#endif
>  #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
>  };
>  #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
> -- 
> 2.24.1


—
Best Regards,
Yan Zi


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V5] x86/mm: Tracking linear mapping split events
  2021-01-28 16:33                             ` Zi Yan
@ 2021-01-28 16:41                               ` Dave Hansen
  2021-01-28 16:56                                 ` Zi Yan
  2021-01-28 16:59                               ` Song Liu
  1 sibling, 1 reply; 33+ messages in thread
From: Dave Hansen @ 2021-01-28 16:41 UTC (permalink / raw)
  To: Zi Yan, Saravanan D, Xing Zhengjun
  Cc: x86, dave.hansen, luto, peterz, corbet, willy, linux-kernel,
	kernel-team, linux-doc, linux-mm, songliubraving

On 1/28/21 8:33 AM, Zi Yan wrote:
>> One of the many lasting (as we don't coalesce back) sources for
>> huge page splits is tracing as the granular page
>> attribute/permission changes would force the kernel to split code
>> segments mapped to huge pages to smaller ones thereby increasing
>> the probability of TLB miss/reload even after tracing has been
>> stopped.
> It is interesting to see this statement saying splitting kernel
> direct mappings causes performance loss, when Zhengjun (cc’d) from
> Intel recently posted a kernel direct mapping performance report[1]
> saying 1GB mappings are good but not much better than 2MB and 4KB
> mappings.

No, that's not what the report said.

*Overall*, there is no clear winner between 4k, 2M and 1G.  In other
words, no one page size is best for *ALL* workloads.

There were *ABSOLUTELY* individual workloads in those tests that saw
significant deltas between the direct map sizes.  There are also
real-world workloads that feel the impact here.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V5] x86/mm: Tracking linear mapping split events
  2021-01-28 16:41                               ` Dave Hansen
@ 2021-01-28 16:56                                 ` Zi Yan
  0 siblings, 0 replies; 33+ messages in thread
From: Zi Yan @ 2021-01-28 16:56 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Saravanan D, Xing Zhengjun, x86, dave.hansen, luto, peterz,
	corbet, willy, linux-kernel, kernel-team, linux-doc, linux-mm,
	songliubraving

On 28 Jan 2021, at 11:41, Dave Hansen wrote:

> On 1/28/21 8:33 AM, Zi Yan wrote:
>>> One of the many lasting (as we don't coalesce back) sources for
>>> huge page splits is tracing as the granular page
>>> attribute/permission changes would force the kernel to split code
>>> segments mapped to huge pages to smaller ones thereby increasing
>>> the probability of TLB miss/reload even after tracing has been
>>> stopped.
>> It is interesting to see this statement saying splitting kernel
>> direct mappings causes performance loss, when Zhengjun (cc’d) from
>> Intel recently posted a kernel direct mapping performance report[1]
>> saying 1GB mappings are good but not much better than 2MB and 4KB
>> mappings.
>
> No, that's not what the report said.
>
> *Overall*, there is no clear winner between 4k, 2M and 1G.  In other
> words, no one page size is best for *ALL* workloads.
>
> There were *ABSOLUTELY* individual workloads in those tests that saw
> significant deltas between the direct map sizes.  There are also
> real-world workloads that feel the impact here.

Yes, it is what I understand from the report. But this patch says
“
Maintaining huge direct mapped pages greatly reduces TLB miss pressure.
The splintering of huge direct pages into smaller ones does result in
a measurable performance hit caused by frequent TLB miss and reloads.
”,

indicating large mappings (2MB, 1GB) are generally better. It is
different from what the report said, right?

The above text could be improved to make sure readers get both sides
of the story and not get afraid of performance loss after seeing
a lot of direct_map_xxx_splits events.



—
Best Regards,
Yan Zi


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V5] x86/mm: Tracking linear mapping split events
  2021-01-28 16:33                             ` Zi Yan
  2021-01-28 16:41                               ` Dave Hansen
@ 2021-01-28 16:59                               ` Song Liu
  1 sibling, 0 replies; 33+ messages in thread
From: Song Liu @ 2021-01-28 16:59 UTC (permalink / raw)
  To: Zi Yan
  Cc: Saravanan D, Xing Zhengjun, the arch/x86 maintainers,
	dave hansen@linux intel com, Andy Lutomirski, Peter Ziljstra,
	corbet, Matthew Wilcox, linux-kernel, Kernel Team, linux-doc,
	linux-mm



> On Jan 28, 2021, at 8:33 AM, Zi Yan <ziy@nvidia.com> wrote:
> 
> On 28 Jan 2021, at 5:49, Saravanan D wrote:
> 
>> To help with debugging the sluggishness caused by TLB miss/reload,
>> we introduce monotonic lifetime hugepage split event counts since
>> system state: SYSTEM_RUNNING to be displayed as part of
>> /proc/vmstat in x86 servers
>> 
>> The lifetime split event information will be displayed at the bottom of
>> /proc/vmstat
>> ....
>> swap_ra 0
>> swap_ra_hit 0
>> direct_map_level2_splits 94
>> direct_map_level3_splits 4
>> nr_unstable 0
>> ....
>> 
>> One of the many lasting (as we don't coalesce back) sources for huge page
>> splits is tracing as the granular page attribute/permission changes would
>> force the kernel to split code segments mapped to huge pages to smaller
>> ones thereby increasing the probability of TLB miss/reload even after
>> tracing has been stopped.
> 
> It is interesting to see this statement saying splitting kernel direct mappings
> causes performance loss, when Zhengjun (cc’d) from Intel recently posted
> a kernel direct mapping performance report[1] saying 1GB mappings are good
> but not much better than 2MB and 4KB mappings.
> 
> I would love to hear the stories from both sides. Or maybe I misunderstand
> anything.

We had an issue about 1.5 years ago, when ftrace splits 2MB kernel text page
table entry into 512x 4kB ones. This split caused ~1% performance regression. 
That instance was fixed in [1]. 

Saravanan, could you please share more information about the split? Is it
possible to avoid the split? If not, can we regroup after tracing is disabled?

We have the split-and-regroup logic for application .text on THP. When a uprobe
is attached to the THP text, we have to split the 2MB page table entry. So we
introduced a mechanism to regroup the 2MB page table entry when all uprobes are
removed from the THP [2]. 

Thanks,
Song

[1] commit 7af0145067bc ("x86/mm/cpa: Prevent large page split when ftrace flips RW on kernel text")
[2] commit f385cb85a42f ("uprobe: collapse THP pmd after removing all uprobes")

> 
> 
> [1]https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/
>> 
>> Documentation regarding linear mapping split events added to admin-guide
>> as requested in V3 of the patch.
>> 
>> Signed-off-by: Saravanan D <saravanand@fb.com>
>> ---
>> .../admin-guide/mm/direct_mapping_splits.rst  | 59 +++++++++++++++++++
>> Documentation/admin-guide/mm/index.rst        |  1 +
>> arch/x86/mm/pat/set_memory.c                  |  8 +++
>> include/linux/vm_event_item.h                 |  4 ++
>> mm/vmstat.c                                   |  4 ++
>> 5 files changed, 76 insertions(+)
>> create mode 100644 Documentation/admin-guide/mm/direct_mapping_splits.rst
>> 
>> diff --git a/Documentation/admin-guide/mm/direct_mapping_splits.rst b/Documentation/admin-guide/mm/direct_mapping_splits.rst
>> new file mode 100644
>> index 000000000000..298751391deb
>> --- /dev/null
>> +++ b/Documentation/admin-guide/mm/direct_mapping_splits.rst
>> @@ -0,0 +1,59 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +
>> +=====================
>> +Direct Mapping Splits
>> +=====================
>> +
>> +Kernel maps all of physical memory in linear/direct mapped pages with
>> +translation of virtual kernel address to physical address is achieved
>> +through a simple subtraction of offset. CPUs maintain a cache of these
>> +translations on fast caches called TLBs. CPU architectures like x86 allow
>> +direct mapping large portions of memory into hugepages (2M, 1G, etc) in
>> +various page table levels.
>> +
>> +Maintaining huge direct mapped pages greatly reduces TLB miss pressure.
>> +The splintering of huge direct pages into smaller ones does result in
>> +a measurable performance hit caused by frequent TLB miss and reloads.
>> +
>> +One of the many lasting (as we don't coalesce back) sources for huge page
>> +splits is tracing as the granular page attribute/permission changes would
>> +force the kernel to split code segments mapped to hugepages to smaller
>> +ones thus increasing the probability of TLB miss/reloads even after
>> +tracing has been stopped.
>> +
>> +On x86 systems, we can track the splitting of huge direct mapped pages
>> +through lifetime event counters in ``/proc/vmstat``
>> +
>> +	direct_map_level2_splits xxx
>> +	direct_map_level3_splits yyy
>> +
>> +where:
>> +
>> +direct_map_level2_splits
>> +	are 2M/4M hugepage split events
>> +direct_map_level3_splits
>> +	are 1G hugepage split events
>> +
>> +The distribution of direct mapped system memory in various page sizes
>> +post splits can be viewed through ``/proc/meminfo`` whose output
>> +will include the following lines depending upon supporting CPU
>> +architecture
>> +
>> +	DirectMap4k:    xxxxx kB
>> +	DirectMap2M:    yyyyy kB
>> +	DirectMap1G:    zzzzz kB
>> +
>> +where:
>> +
>> +DirectMap4k
>> +	is the total amount of direct mapped memory (in kB)
>> +	accessed through 4k pages
>> +DirectMap2M
>> +	is the total amount of direct mapped memory (in kB)
>> +	accessed through 2M pages
>> +DirectMap1G
>> +	is the total amount of direct mapped memory (in kB)
>> +	accessed through 1G pages
>> +
>> +
>> +-- Saravanan D, Jan 27, 2021
>> diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
>> index 4b14d8b50e9e..9439780f3f07 100644
>> --- a/Documentation/admin-guide/mm/index.rst
>> +++ b/Documentation/admin-guide/mm/index.rst
>> @@ -38,3 +38,4 @@ the Linux memory management.
>>    soft-dirty
>>    transhuge
>>    userfaultfd
>> +   direct_mapping_splits
>> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
>> index 16f878c26667..a7b3c5f1d316 100644
>> --- a/arch/x86/mm/pat/set_memory.c
>> +++ b/arch/x86/mm/pat/set_memory.c
>> @@ -16,6 +16,8 @@
>> #include <linux/pci.h>
>> #include <linux/vmalloc.h>
>> #include <linux/libnvdimm.h>
>> +#include <linux/vmstat.h>
>> +#include <linux/kernel.h>
>> 
>> #include <asm/e820/api.h>
>> #include <asm/processor.h>
>> @@ -91,6 +93,12 @@ static void split_page_count(int level)
>> 		return;
>> 
>> 	direct_pages_count[level]--;
>> +	if (system_state == SYSTEM_RUNNING) {
>> +		if (level == PG_LEVEL_2M)
>> +			count_vm_event(DIRECT_MAP_LEVEL2_SPLIT);
>> +		else if (level == PG_LEVEL_1G)
>> +			count_vm_event(DIRECT_MAP_LEVEL3_SPLIT);
>> +	}
>> 	direct_pages_count[level - 1] += PTRS_PER_PTE;
>> }
>> 
>> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
>> index 18e75974d4e3..7c06c2bdc33b 100644
>> --- a/include/linux/vm_event_item.h
>> +++ b/include/linux/vm_event_item.h
>> @@ -120,6 +120,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>> #ifdef CONFIG_SWAP
>> 		SWAP_RA,
>> 		SWAP_RA_HIT,
>> +#endif
>> +#ifdef CONFIG_X86
>> +		DIRECT_MAP_LEVEL2_SPLIT,
>> +		DIRECT_MAP_LEVEL3_SPLIT,
>> #endif
>> 		NR_VM_EVENT_ITEMS
>> };
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index f8942160fc95..a43ac4ac98a2 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -1350,6 +1350,10 @@ const char * const vmstat_text[] = {
>> 	"swap_ra",
>> 	"swap_ra_hit",
>> #endif
>> +#ifdef CONFIG_X86
>> +	"direct_map_level2_splits",
>> +	"direct_map_level3_splits",
>> +#endif
>> #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
>> };
>> #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
>> -- 
>> 2.24.1
> 
> 
> —
> Best Regards,
> Yan Zi


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V5] x86/mm: Tracking linear mapping split events
  2021-01-28 15:04                             ` [PATCH V5] " Matthew Wilcox
@ 2021-01-28 19:49                               ` Saravanan D
  0 siblings, 0 replies; 33+ messages in thread
From: Saravanan D @ 2021-01-28 19:49 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: x86, dave.hansen, luto, peterz, corbet, linux-kernel,
	kernel-team, linux-doc, linux-mm, songliubraving

Hi Matthew,

> Is this tracing of userspace programs causing splits, or is it kernel
> tracing?  Also, we have lots of kinds of tracing these days; are you
> referring to kprobes?  tracepoints?  ftrace?  Something else?

It has to be kernel tracing (kprobes, tracepoints) as we are dealing with 
direct mapping splits.

Kernel's direct mapping:
``ffff888000000000 | -119.5  TB | ffffc87fffffffff |   64 TB | direct mapping of all physical memory (page_offset_base)``

The kernel text range:
``ffffffff80000000 |   -2    GB | ffffffff9fffffff |  512 MB | kernel text mapping, mapped to physical address 0``

Source : Documentation/x86/x86_64/mm.rst

The kernel code segment (0x20000000 = 512 MB) points to the same physical
addresses already mapped in the direct mapping range.

When we enable kernel tracing, the kernel has to modify attributes/permissions
of the text segment pages that are direct mapped, causing them to split.

Tracking direct_pages_count[] in arch/x86/mm/pat/set_memory.c shows there
are only splits from higher levels; they never coalesce back.

Splits when we turn on dynamic tracing
....
cat /proc/vmstat | grep -i direct_map_level
direct_map_level2_splits 784
direct_map_level3_splits 12
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @ [pid, comm] = count(); }'
cat /proc/vmstat | grep -i direct_map_level
direct_map_level2_splits 789
direct_map_level3_splits 12
....
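
A small helper to automate that before/after comparison (a sketch: the
counter names are the ones this patch introduces, and on a kernel without
the patch the grep simply finds nothing and we say so):

```shell
#!/usr/bin/env bash
# Sketch: snapshot the split counters, run a tracing workload, then diff.
snap() {
    grep -E '^direct_map_level[23]_splits' /proc/vmstat 2>/dev/null \
        || echo "split counters not present"
}
before=$(snap)
# ... enable kprobes / tracepoints / a bpftrace workload here ...
after=$(snap)
# Any new splits show up as changed counter values in the diff.
diff <(printf '%s\n' "$before") <(printf '%s\n' "$after") || true
```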

Thanks,
Saravanan D

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V5] x86/mm: Tracking linear mapping split events
       [not found]                             ` <3aec2d10-f4c3-d07a-356f-6f1001679181@intel.com>
@ 2021-01-28 21:20                               ` Saravanan D
       [not found]                                 ` <20210128233430.1460964-1-saravanand@fb.com>
  0 siblings, 1 reply; 33+ messages in thread
From: Saravanan D @ 2021-01-28 21:20 UTC (permalink / raw)
  To: Dave Hansen
  Cc: x86, dave.hansen, luto, peterz, corbet, willy, linux-kernel,
	kernel-team, linux-doc, linux-mm, songliubraving

Hi Dave,
> 
> Eek.  There really doesn't appear to be a place in Documentation/ that
> we've documented vmstat entries.
> 
> Maybe you can start:
> 
> 	Documentation/admin-guide/mm/vmstat.rst
> 
I was also very surprised that there is no existing documentation for
vmstat, which led me to add a page to the admin-guide that now requires a
lot of caveats.

Starting a new documentation for vmstat goes beyond the scope of this patch.
I am inclined to remove Documentation from the next version [V6] of the patch.

I presume that a detailed commit log [V6] explaining why direct mapped kernel
page splits never coalesce, how kernel tracing causes some of those
splits, and why they are worth tracking can do the job.

Proposed [V6] Commit Log:
>>>
To help with debugging the sluggishness caused by TLB misses/reloads,
we introduce monotonic counts of [direct mapped] hugepage split events
since system state SYSTEM_RUNNING, displayed as part of
/proc/vmstat on x86 servers.

The lifetime split event information will be displayed at the bottom of
/proc/vmstat
....
swap_ra 0
swap_ra_hit 0
direct_map_level2_splits 94
direct_map_level3_splits 4
nr_unstable 0
....

One of the many lasting sources of direct hugepage splits is kernel
tracing (kprobes, tracepoints).

Note that the kernel's code segment [512 MB] points to the same
physical addresses that have already been mapped in the kernel's
direct mapping range.

Source : Documentation/x86/x86_64/mm.rst

When we enable kernel tracing, the kernel has to modify attributes/permissions
of the text segment hugepages that are direct mapped causing them to split.

Kernel's direct mapped hugepages do not coalesce back after split and
remain in place for the remainder of the lifetime.

An instance of direct page splits when we turn on
dynamic kernel tracing
....
cat /proc/vmstat | grep -i direct_map_level
direct_map_level2_splits 784
direct_map_level3_splits 12
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @ [pid, comm] = count(); }'
cat /proc/vmstat | grep -i direct_map_level
direct_map_level2_splits 789
direct_map_level3_splits 12
....
<<<

Thanks,
Saravanan D

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V6] x86/mm: Tracking linear mapping split events
       [not found]                                 ` <20210128233430.1460964-1-saravanand@fb.com>
@ 2021-01-28 23:41                                   ` Tejun Heo
  2021-01-29 19:27                                   ` Johannes Weiner
  2021-02-08 23:30                                   ` Dave Hansen
  2 siblings, 0 replies; 33+ messages in thread
From: Tejun Heo @ 2021-01-28 23:41 UTC (permalink / raw)
  To: Saravanan D
  Cc: x86, dave.hansen, luto, peterz, willy, linux-kernel, kernel-team,
	linux-mm, songliubraving, hannes

On Thu, Jan 28, 2021 at 03:34:30PM -0800, Saravanan D wrote:
> To help with debugging the sluggishness caused by TLB miss/reload,
> we introduce monotonic hugepage [direct mapped] split event counts since
> system state: SYSTEM_RUNNING to be displayed as part of
> /proc/vmstat in x86 servers
...
> Signed-off-by: Saravanan D <saravanand@fb.com>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V6] x86/mm: Tracking linear mapping split events
       [not found]                                 ` <20210128233430.1460964-1-saravanand@fb.com>
  2021-01-28 23:41                                   ` [PATCH V6] " Tejun Heo
@ 2021-01-29 19:27                                   ` Johannes Weiner
  2021-02-08 23:17                                     ` Saravanan D
  2021-02-08 23:30                                   ` Dave Hansen
  2 siblings, 1 reply; 33+ messages in thread
From: Johannes Weiner @ 2021-01-29 19:27 UTC (permalink / raw)
  To: Saravanan D
  Cc: x86, dave.hansen, luto, peterz, willy, linux-kernel, kernel-team,
	linux-mm, songliubraving, tj

On Thu, Jan 28, 2021 at 03:34:30PM -0800, Saravanan D wrote:
> To help with debugging the sluggishness caused by TLB miss/reload,
> we introduce monotonic hugepage [direct mapped] split event counts since
> system state: SYSTEM_RUNNING to be displayed as part of
> /proc/vmstat in x86 servers
> 
> The lifetime split event information will be displayed at the bottom of
> /proc/vmstat
> ....
> swap_ra 0
> swap_ra_hit 0
> direct_map_level2_splits 94
> direct_map_level3_splits 4
> nr_unstable 0
> ....
> 
> One of the many lasting sources of direct hugepage splits is kernel
> tracing (kprobes, tracepoints).
> 
> Note that the kernel's code segment [512 MB] points to the same
> physical addresses that have been already mapped in the kernel's
> direct mapping range.
> 
> Source : Documentation/x86/x86_64/mm.rst
> 
> When we enable kernel tracing, the kernel has to modify
> attributes/permissions
> of the text segment hugepages that are direct mapped causing them to
> split.
> 
> Kernel's direct mapped hugepages do not coalesce back after split and
> remain in place for the remainder of the lifetime.
> 
> An instance of direct page splits when we turn on
> dynamic kernel tracing
> ....
> cat /proc/vmstat | grep -i direct_map_level
> direct_map_level2_splits 784
> direct_map_level3_splits 12
> bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @ [pid, comm] = count(); }'
> cat /proc/vmstat | grep -i direct_map_level
> direct_map_level2_splits 789
> direct_map_level3_splits 12
> ....
> 
> Signed-off-by: Saravanan D <saravanand@fb.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V6] x86/mm: Tracking linear mapping split events
  2021-01-29 19:27                                   ` Johannes Weiner
@ 2021-02-08 23:17                                     ` Saravanan D
  0 siblings, 0 replies; 33+ messages in thread
From: Saravanan D @ 2021-02-08 23:17 UTC (permalink / raw)
  To: Johannes Weiner, tj, x86
  Cc: dave.hansen, luto, peterz, willy, linux-kernel, kernel-team,
	linux-mm, songliubraving, tj

Hi all,

So far I have received two acks for the V6 version of my patch:

> Acked-by: Tejun Heo <tj@kernel.org>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Are there any more objections?

Thanks,
Saravanan D

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V6] x86/mm: Tracking linear mapping split events
       [not found]                                 ` <20210128233430.1460964-1-saravanand@fb.com>
  2021-01-28 23:41                                   ` [PATCH V6] " Tejun Heo
  2021-01-29 19:27                                   ` Johannes Weiner
@ 2021-02-08 23:30                                   ` Dave Hansen
  2 siblings, 0 replies; 33+ messages in thread
From: Dave Hansen @ 2021-02-08 23:30 UTC (permalink / raw)
  To: Saravanan D, x86, dave.hansen, luto, peterz, willy
  Cc: linux-kernel, kernel-team, linux-mm, songliubraving, tj, hannes

On 1/28/21 3:34 PM, Saravanan D wrote:
> 
> One of the many lasting sources of direct hugepage splits is kernel
> tracing (kprobes, tracepoints).
> 
> Note that the kernel's code segment [512 MB] points to the same
> physical addresses that have been already mapped in the kernel's
> direct mapping range.

Looks fine to me:

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V6] x86/mm: Tracking linear mapping split events
  2021-03-06  0:57 ` Andrew Morton
@ 2021-03-08 15:06   ` Johannes Weiner
  0 siblings, 0 replies; 33+ messages in thread
From: Johannes Weiner @ 2021-03-08 15:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Saravanan D, mingo, x86, dave.hansen, tj, linux-kernel, kernel-team

On Fri, Mar 05, 2021 at 04:57:15PM -0800, Andrew Morton wrote:
> On Thu, 18 Feb 2021 15:57:44 -0800 Saravanan D <saravanand@fb.com> wrote:
> 
> > To help with debugging the sluggishness caused by TLB miss/reload,
> > we introduce monotonic hugepage [direct mapped] split event counts since
> > system state: SYSTEM_RUNNING to be displayed as part of
> > /proc/vmstat in x86 servers
> >
> > ...
> >
> > --- a/arch/x86/mm/pat/set_memory.c
> > +++ b/arch/x86/mm/pat/set_memory.c
> > @@ -120,6 +120,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> >  #ifdef CONFIG_SWAP
> >  		SWAP_RA,
> >  		SWAP_RA_HIT,
> > +#endif
> > +#ifdef CONFIG_X86
> > +		DIRECT_MAP_LEVEL2_SPLIT,
> > +		DIRECT_MAP_LEVEL3_SPLIT,
> >  #endif
> >  		NR_VM_EVENT_ITEMS
> >  };
> 
> This is the first appearance of arch-specific fields in /proc/vmstat.
> 
> I don't really see a problem with this - vmstat is basically a dumping
> ground of random developer stuff.  But was this the best place in which
> to present this data?

IMO it's a big plus for discoverability.

One of the first things I tend to do when triaging mysterious memory
issues is going to /proc/vmstat and seeing if anything looks abnormal.
There is value in making that file comprehensive for all things that
could indicate memory-related pathologies.

The impetus for adding these is a real-world TLB regression, caused by
kprobes chewing up the direct mapping, that took longer to debug than
necessary. We have the /proc/meminfo lines on the DirectMap, but those
are more useful when you already have a theory - they simply don't
make problems immediately stand out the same way.
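
Both views can be pulled together in a quick triage snippet (a sketch:
either grep may come up empty on non-x86 kernels, or on kernels without
this patch):

```shell
#!/usr/bin/env bash
direct_map_report() {
    # Static picture: how the direct map is currently carved up.
    grep -i '^DirectMap' /proc/meminfo 2>/dev/null
    # Dynamic picture: lifetime split counts (the counters this patch adds).
    grep '^direct_map_level' /proc/vmstat 2>/dev/null
    return 0
}
direct_map_report
```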

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V6] x86/mm: Tracking linear mapping split events
  2021-02-18 23:57 Saravanan D
  2021-03-01 22:43 ` Tejun Heo
@ 2021-03-06  0:57 ` Andrew Morton
  2021-03-08 15:06   ` Johannes Weiner
  1 sibling, 1 reply; 33+ messages in thread
From: Andrew Morton @ 2021-03-06  0:57 UTC (permalink / raw)
  To: Saravanan D
  Cc: mingo, x86, dave.hansen, tj, hannes, linux-kernel, kernel-team

On Thu, 18 Feb 2021 15:57:44 -0800 Saravanan D <saravanand@fb.com> wrote:

> To help with debugging the sluggishness caused by TLB miss/reload,
> we introduce monotonic hugepage [direct mapped] split event counts since
> system state: SYSTEM_RUNNING to be displayed as part of
> /proc/vmstat in x86 servers
>
> ...
>
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -120,6 +120,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  #ifdef CONFIG_SWAP
>  		SWAP_RA,
>  		SWAP_RA_HIT,
> +#endif
> +#ifdef CONFIG_X86
> +		DIRECT_MAP_LEVEL2_SPLIT,
> +		DIRECT_MAP_LEVEL3_SPLIT,
>  #endif
>  		NR_VM_EVENT_ITEMS
>  };

This is the first appearance of arch-specific fields in /proc/vmstat.

I don't really see a problem with this - vmstat is basically a dumping
ground of random developer stuff.  But was this the best place in which
to present this data?

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V6] x86/mm: Tracking linear mapping split events
  2021-02-18 23:57 Saravanan D
@ 2021-03-01 22:43 ` Tejun Heo
  2021-03-06  0:57 ` Andrew Morton
  1 sibling, 0 replies; 33+ messages in thread
From: Tejun Heo @ 2021-03-01 22:43 UTC (permalink / raw)
  To: Saravanan D
  Cc: akpm, mingo, x86, dave.hansen, hannes, linux-kernel, kernel-team

Hello,

On Thu, Feb 18, 2021 at 03:57:44PM -0800, Saravanan D wrote:
> To help with debugging the sluggishness caused by TLB miss/reload,
> we introduce monotonic hugepage [direct mapped] split event counts since
> system state: SYSTEM_RUNNING to be displayed as part of
> /proc/vmstat in x86 servers
...
> Signed-off-by: Saravanan D <saravanand@fb.com>
> Acked-by: Tejun Heo <tj@kernel.org>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Dave Hansen <dave.hansen@linux.intel.com>

Andrew, do you mind picking this one up? It has enough acks and can go
through either mm or x86 tree.

Thank you.

-- 
tejun

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH V6] x86/mm: Tracking linear mapping split events
@ 2021-02-18 23:57 Saravanan D
  2021-03-01 22:43 ` Tejun Heo
  2021-03-06  0:57 ` Andrew Morton
  0 siblings, 2 replies; 33+ messages in thread
From: Saravanan D @ 2021-02-18 23:57 UTC (permalink / raw)
  To: akpm, mingo, x86
  Cc: dave.hansen, tj, hannes, linux-kernel, kernel-team, Saravanan D

To help with debugging the sluggishness caused by TLB misses/reloads,
we introduce monotonic counts of [direct mapped] hugepage split events
since system state SYSTEM_RUNNING, displayed as part of
/proc/vmstat on x86 servers.

The lifetime split event information will be displayed at the bottom of
/proc/vmstat
....
swap_ra 0
swap_ra_hit 0
direct_map_level2_splits 94
direct_map_level3_splits 4
nr_unstable 0
....

One of the many lasting sources of direct hugepage splits is kernel
tracing (kprobes, tracepoints).

Note that the kernel's code segment [512 MB] points to the same
physical addresses that have already been mapped in the kernel's
direct mapping range.

Source : Documentation/x86/x86_64/mm.rst

When we enable kernel tracing, the kernel has to modify
attributes/permissions of the text segment hugepages that are
direct mapped, causing them to split.

Kernel's direct mapped hugepages do not coalesce back after split and
remain in place for the remainder of the lifetime.

An instance of direct page splits when we turn on
dynamic kernel tracing
....
cat /proc/vmstat | grep -i direct_map_level
direct_map_level2_splits 784
direct_map_level3_splits 12
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @ [pid, comm] = count(); }'
cat /proc/vmstat | grep -i direct_map_level
direct_map_level2_splits 789
direct_map_level3_splits 12
....

Signed-off-by: Saravanan D <saravanand@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
---
This patch has been acked and can be routed through either x86 or -mm
Please let me know if there's anything needed. Thanks.
---
 arch/x86/mm/pat/set_memory.c  | 8 ++++++++
 include/linux/vm_event_item.h | 4 ++++
 mm/vmstat.c                   | 4 ++++
 3 files changed, 16 insertions(+)

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 16f878c26667..a7b3c5f1d316 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -16,6 +16,8 @@
 #include <linux/pci.h>
 #include <linux/vmalloc.h>
 #include <linux/libnvdimm.h>
+#include <linux/vmstat.h>
+#include <linux/kernel.h>
 
 #include <asm/e820/api.h>
 #include <asm/processor.h>
@@ -91,6 +93,12 @@ static void split_page_count(int level)
 		return;
 
 	direct_pages_count[level]--;
+	if (system_state == SYSTEM_RUNNING) {
+		if (level == PG_LEVEL_2M)
+			count_vm_event(DIRECT_MAP_LEVEL2_SPLIT);
+		else if (level == PG_LEVEL_1G)
+			count_vm_event(DIRECT_MAP_LEVEL3_SPLIT);
+	}
 	direct_pages_count[level - 1] += PTRS_PER_PTE;
 }
 
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 18e75974d4e3..7c06c2bdc33b 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -120,6 +120,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_SWAP
 		SWAP_RA,
 		SWAP_RA_HIT,
+#endif
+#ifdef CONFIG_X86
+		DIRECT_MAP_LEVEL2_SPLIT,
+		DIRECT_MAP_LEVEL3_SPLIT,
 #endif
 		NR_VM_EVENT_ITEMS
 };
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f8942160fc95..a43ac4ac98a2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1350,6 +1350,10 @@ const char * const vmstat_text[] = {
 	"swap_ra",
 	"swap_ra_hit",
 #endif
+#ifdef CONFIG_X86
+	"direct_map_level2_splits",
+	"direct_map_level3_splits",
+#endif
 #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
 };
 #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
-- 
2.24.1
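
The bookkeeping in split_page_count() above can be sketched as a toy model
(Python purely for illustration; the level numbering follows the patch,
1 = 4K, 2 = 2M, 3 = 1G, PTRS_PER_PTE is 512 on x86-64, and the starting
counts are hypothetical):

```python
# Toy model of direct_pages_count[] bookkeeping from split_page_count().
PTRS_PER_PTE = 512

direct_pages_count = {1: 0, 2: 9504, 3: 12}  # hypothetical initial state
splits = {2: 0, 3: 0}                        # the new vmstat event counters

def split_page_count(level):
    """One hugepage at `level` becomes 512 pages at `level - 1`."""
    if direct_pages_count[level] == 0:
        return
    direct_pages_count[level] -= 1
    splits[level] += 1                       # count_vm_event(DIRECT_MAP_LEVEL*_SPLIT)
    direct_pages_count[level - 1] += PTRS_PER_PTE

split_page_count(3)  # a 1G page splits into 512 x 2M pages
split_page_count(2)  # a 2M page splits into 512 x 4K pages
print(direct_pages_count, splits)
```

Splits only ever move pages downward in level, which is why the counters
are monotonic: nothing in this path (or in the kernel) merges them back.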


^ permalink raw reply related	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2021-03-08 15:07 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <BYAPR01MB40856478D5BE74CB6A7D5578CFBD9@BYAPR01MB4085.prod.exchangelabs.com>
2021-01-25 20:15 ` [PATCH] x86/mm: Tracking linear mapping split events since boot Dave Hansen
2021-01-25 20:32   ` Tejun Heo
2021-01-26  0:47     ` Dave Hansen
2021-01-26  0:53       ` Tejun Heo
2021-01-26  1:04         ` Dave Hansen
2021-01-26  1:17           ` Tejun Heo
2021-01-27 17:51           ` [PATCH V2] x86/mm: Tracking linear mapping split events Saravanan D
2021-01-27 21:03             ` Tejun Heo
2021-01-27 21:32               ` Dave Hansen
2021-01-27 21:36                 ` Tejun Heo
2021-01-27 21:42                   ` Saravanan D
2021-01-27 22:50                   ` [PATCH V3] " Saravanan D
2021-01-27 23:00                     ` Randy Dunlap
2021-01-27 23:56                       ` Saravanan D
2021-01-27 23:41                     ` Dave Hansen
2021-01-28  0:15                       ` Saravanan D
2021-01-28  4:35                       ` [PATCH V4] " Saravanan D
2021-01-28  4:51                         ` Matthew Wilcox
     [not found]                           ` <20210128104934.2916679-1-saravanand@fb.com>
2021-01-28 15:04                             ` [PATCH V5] " Matthew Wilcox
2021-01-28 19:49                               ` Saravanan D
2021-01-28 16:33                             ` Zi Yan
2021-01-28 16:41                               ` Dave Hansen
2021-01-28 16:56                                 ` Zi Yan
2021-01-28 16:59                               ` Song Liu
     [not found]                             ` <3aec2d10-f4c3-d07a-356f-6f1001679181@intel.com>
2021-01-28 21:20                               ` Saravanan D
     [not found]                                 ` <20210128233430.1460964-1-saravanand@fb.com>
2021-01-28 23:41                                   ` [PATCH V6] " Tejun Heo
2021-01-29 19:27                                   ` Johannes Weiner
2021-02-08 23:17                                     ` Saravanan D
2021-02-08 23:30                                   ` Dave Hansen
2021-02-18 23:57 Saravanan D
2021-03-01 22:43 ` Tejun Heo
2021-03-06  0:57 ` Andrew Morton
2021-03-08 15:06   ` Johannes Weiner
