* Re: [PATCH] x86/mm: Tracking linear mapping split events since boot
  [not found] <BYAPR01MB40856478D5BE74CB6A7D5578CFBD9@BYAPR01MB4085.prod.exchangelabs.com>
@ 2021-01-25 20:15 ` Dave Hansen
  2021-01-25 20:32   ` Tejun Heo
  0 siblings, 1 reply; 29+ messages in thread

From: Dave Hansen @ 2021-01-25 20:15 UTC (permalink / raw)
To: Saravanan D, x86, dave.hansen, luto, peterz; +Cc: linux-kernel, kernel-team

On 1/25/21 12:11 PM, Saravanan D wrote:
> Numerous hugepage splits in the linear mapping would give
> admins the signal to narrow down the sluggishness caused by TLB
> miss/reload.
>
> One of the many lasting (as we don't coalesce back) sources for huge page
> splits is tracing as the granular page attribute/permission changes would
> force the kernel to split code segments mapped to huge pages to smaller
> ones thereby increasing the probability of TLB miss/reload even after
> tracing has been stopped.
>
> The split event information will be displayed at the bottom of
> /proc/meminfo
> ....
> DirectMap4k:     3505112 kB
> DirectMap2M:    19464192 kB
> DirectMap1G:    12582912 kB
> DirectMap2MSplits: 1705
> DirectMap1GSplits: 20

This seems much more like something we'd want in /proc/vmstat or as a
tracepoint than meminfo.  A tracepoint would be especially nice because
the trace buffer could actually be examined if an admin finds an
excessive number of these.

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [PATCH] x86/mm: Tracking linear mapping split events since boot
  2021-01-25 20:15 ` [PATCH] x86/mm: Tracking linear mapping split events since boot Dave Hansen
@ 2021-01-25 20:32   ` Tejun Heo
  2021-01-26  0:47     ` Dave Hansen
  0 siblings, 1 reply; 29+ messages in thread

From: Tejun Heo @ 2021-01-25 20:32 UTC (permalink / raw)
To: Dave Hansen
Cc: Saravanan D, x86, dave.hansen, luto, peterz, linux-kernel, kernel-team

Hello,

On Mon, Jan 25, 2021 at 12:15:51PM -0800, Dave Hansen wrote:
> > DirectMap4k:     3505112 kB
> > DirectMap2M:    19464192 kB
> > DirectMap1G:    12582912 kB
> > DirectMap2MSplits: 1705
> > DirectMap1GSplits: 20
>
> This seems much more like something we'd want in /proc/vmstat or as a
> tracepoint than meminfo.  A tracepoint would be especially nice because
> the trace buffer could actually be examined if an admin finds an
> excessive number of these.

Adding a TP sure can be helpful but I'm not sure how that'd make counters
unnecessary given that the accumulated number of events since boot is what
matters.

Thanks.

--
tejun
* Re: [PATCH] x86/mm: Tracking linear mapping split events since boot
  2021-01-25 20:32 ` Tejun Heo
@ 2021-01-26  0:47   ` Dave Hansen
  2021-01-26  0:53     ` Tejun Heo
  0 siblings, 1 reply; 29+ messages in thread

From: Dave Hansen @ 2021-01-26 0:47 UTC (permalink / raw)
To: Tejun Heo
Cc: Saravanan D, x86, dave.hansen, luto, peterz, linux-kernel, kernel-team

On 1/25/21 12:32 PM, Tejun Heo wrote:
> On Mon, Jan 25, 2021 at 12:15:51PM -0800, Dave Hansen wrote:
>>> DirectMap4k:     3505112 kB
>>> DirectMap2M:    19464192 kB
>>> DirectMap1G:    12582912 kB
>>> DirectMap2MSplits: 1705
>>> DirectMap1GSplits: 20
>> This seems much more like something we'd want in /proc/vmstat or as a
>> tracepoint than meminfo.  A tracepoint would be especially nice because
>> the trace buffer could actually be examined if an admin finds an
>> excessive number of these.
> Adding a TP sure can be helpful but I'm not sure how that'd make counters
> unnecessary given that the accumulated number of events since boot is what
> matters.

Kinda.  The thing that *REALLY* matters is how many of these splits were
avoidable and *could* be coalesced.

The patch here does not actually separate out pre-boot from post-boot,
so it's pretty hard to tell if the splits came from something like
tracing, which is totally unnecessary, or they were the result of
something at boot that we can't do anything about.

This would be a lot more useful if you could reset the counters.  Then
just reset them from userspace at boot.  Adding read-write debugfs
exports for these should be pretty trivial.
* Re: [PATCH] x86/mm: Tracking linear mapping split events since boot
  2021-01-26  0:47 ` Dave Hansen
@ 2021-01-26  0:53   ` Tejun Heo
  2021-01-26  1:04     ` Dave Hansen
  0 siblings, 1 reply; 29+ messages in thread

From: Tejun Heo @ 2021-01-26 0:53 UTC (permalink / raw)
To: Dave Hansen
Cc: Saravanan D, x86, dave.hansen, luto, peterz, linux-kernel, kernel-team

Hello, Dave.

On Mon, Jan 25, 2021 at 04:47:42PM -0800, Dave Hansen wrote:
> The patch here does not actually separate out pre-boot from post-boot,
> so it's pretty hard to tell if the splits came from something like
> tracing which is totally unnecessary or they were the result of
> something at boot that we can't do anything about.

Ah, right, didn't know they also included splits during boot. It'd be a
lot more useful if they were counting post-boot splits.

> This would be a lot more useful if you could reset the counters.  Then
> just reset them from userspace at boot.  Adding read-write debugfs
> exports for these should be pretty trivial.

While this would work for hands-on cases, I'm a bit worried that this might
be more challenging to gain confidence in large production environments.

Thanks.

--
tejun
* Re: [PATCH] x86/mm: Tracking linear mapping split events since boot
  2021-01-26  0:53 ` Tejun Heo
@ 2021-01-26  1:04   ` Dave Hansen
  2021-01-26  1:17     ` Tejun Heo
  2021-01-27 17:51     ` [PATCH V2] x86/mm: Tracking linear mapping split events Saravanan D
  0 siblings, 2 replies; 29+ messages in thread

From: Dave Hansen @ 2021-01-26 1:04 UTC (permalink / raw)
To: Tejun Heo
Cc: Saravanan D, x86, dave.hansen, luto, peterz, linux-kernel, kernel-team

On 1/25/21 4:53 PM, Tejun Heo wrote:
>> This would be a lot more useful if you could reset the counters.  Then
>> just reset them from userspace at boot.  Adding read-write debugfs
>> exports for these should be pretty trivial.
> While this would work for hands-on cases, I'm a bit worried that this might
> be more challenging to gain confidence in large production environments.

Which part?  Large production environments don't trust data from
debugfs?  Or don't trust it if it might have been reset?

You could stick the "reset" switch in debugfs, and dump something out in
dmesg like we do for /proc/sys/vm/drop_caches so it's not a surprise
that it happened.

BTW, counts of *events* don't really belong in meminfo.  These really do
belong in /proc/vmstat if anything.
* Re: [PATCH] x86/mm: Tracking linear mapping split events since boot
  2021-01-26  1:04 ` Dave Hansen
@ 2021-01-26  1:17   ` Tejun Heo
  0 siblings, 0 replies; 29+ messages in thread

From: Tejun Heo @ 2021-01-26 1:17 UTC (permalink / raw)
To: Dave Hansen
Cc: Saravanan D, x86, dave.hansen, luto, peterz, linux-kernel, kernel-team

Hello,

On Mon, Jan 25, 2021 at 05:04:00PM -0800, Dave Hansen wrote:
> Which part?  Large production environments don't trust data from
> debugfs?  Or don't trust it if it might have been reset?

When the last reset was. Not saying it's impossible or anything but in
general it's a lot better to have the counters be monotonically
increasing with time/event stamped markers than the counters themselves
getting reset or modified in other ways, because the ownership of a
specific counter might not be obvious to everyone and accidents and
mistakes happen.

Note that the "time/event stamped markers" above don't need to and
shouldn't be in the kernel. It can be managed by whoever wants to
monitor a given time period and there can be any number of them.

> You could stick the "reset" switch in debugfs, and dump something out in
> dmesg like we do for /proc/sys/vm/drop_caches so it's not a surprise
> that it happened.

Processing dmesgs can work too but isn't particularly reliable or
scalable.

> BTW, counts of *events* don't really belong in meminfo.  These really do
> belong in /proc/vmstat if anything.

Oh yeah, I don't have a strong opinion on where the counters should go.

Thanks.

--
tejun
* [PATCH V2] x86/mm: Tracking linear mapping split events
  2021-01-26  1:04 ` Dave Hansen
  2021-01-26  1:17   ` Tejun Heo
@ 2021-01-27 17:51   ` Saravanan D
  2021-01-27 21:03     ` Tejun Heo
  1 sibling, 1 reply; 29+ messages in thread

From: Saravanan D @ 2021-01-27 17:51 UTC (permalink / raw)
To: x86, dave.hansen, luto, peterz; +Cc: linux-kernel, kernel-team, Saravanan D

Numerous hugepage splits in the linear mapping would give
admins the signal to narrow down the sluggishness caused by TLB
miss/reload.

To help with debugging, we introduce monotonic lifetime hugepage
split event counts since SYSTEM_RUNNING to be displayed as part of
/proc/vmstat in x86 servers

The lifetime split event information will be displayed at the bottom of
/proc/vmstat
....
swap_ra 0
swap_ra_hit 0
direct_map_2M_splits 139
direct_map_4M_splits 0
direct_map_1G_splits 7
nr_unstable 0
....

Ancillary debugfs split event counts exported to userspace via read-write
endpoints : /sys/kernel/debug/x86/direct_map_[2M|4M|1G]_split

dmesg log when user resets the debugfs split event count for
debugging
....
[  232.470531] debugfs 2M Pages split event count(128) reset to 0
....

One of the many lasting (as we don't coalesce back) sources for huge page
splits is tracing as the granular page attribute/permission changes would
force the kernel to split code segments mapped to huge pages to smaller
ones thereby increasing the probability of TLB miss/reload even after
tracing has been stopped.

Signed-off-by: Saravanan D <saravanand@fb.com>
---
 arch/x86/mm/pat/set_memory.c  | 117 ++++++++++++++++++++++++++++++++++
 include/linux/vm_event_item.h |   8 +++
 mm/vmstat.c                   |   8 +++
 3 files changed, 133 insertions(+)

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 16f878c26667..97b6ef8dbd12 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -16,6 +16,8 @@
 #include <linux/pci.h>
 #include <linux/vmalloc.h>
 #include <linux/libnvdimm.h>
+#include <linux/vmstat.h>
+#include <linux/kernel.h>

 #include <asm/e820/api.h>
 #include <asm/processor.h>
@@ -76,6 +78,104 @@ static inline pgprot_t cachemode2pgprot(enum page_cache_mode pcm)
 #ifdef CONFIG_PROC_FS
 static unsigned long direct_pages_count[PG_LEVEL_NUM];

+static unsigned long split_page_event_count[PG_LEVEL_NUM];
+
+#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
+static int direct_map_2M_split_set(void *data, u64 val)
+{
+	switch (val) {
+	case 0:
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	pr_info("debugfs 2M Pages split event count(%lu) reset to 0",
+		split_page_event_count[PG_LEVEL_2M]);
+	split_page_event_count[PG_LEVEL_2M] = 0;
+
+	return 0;
+}
+
+static int direct_map_2M_split_get(void *data, u64 *val)
+{
+	*val = split_page_event_count[PG_LEVEL_2M];
+	return 0;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(fops_direct_map_2M_split, direct_map_2M_split_get,
+			 direct_map_2M_split_set, "%llu\n");
+#else
+static int direct_map_4M_split_set(void *data, u64 val)
+{
+	switch (val) {
+	case 0:
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	pr_info("debugfs 4M Pages split event count(%lu) reset to 0",
+		split_page_event_count[PG_LEVEL_2M]);
+	split_page_event_count[PG_LEVEL_2M] = 0;
+
+	return 0;
+}
+
+static int direct_map_4M_split_get(void *data, u64 *val)
+{
+	*val = split_page_event_count[PG_LEVEL_2M];
+	return 0;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(fops_direct_map_4M_split, direct_map_4M_split_get,
+			 direct_map_4M_split_set, "%llu\n");
+#endif
+
+static int direct_map_1G_split_set(void *data, u64 val)
+{
+	switch (val) {
+	case 0:
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	pr_info("debugfs 1G Pages split event count(%lu) reset to 0",
+		split_page_event_count[PG_LEVEL_1G]);
+	split_page_event_count[PG_LEVEL_1G] = 0;
+
+	return 0;
+}
+
+static int direct_map_1G_split_get(void *data, u64 *val)
+{
+	*val = split_page_event_count[PG_LEVEL_1G];
+	return 0;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(fops_direct_map_1G_split, direct_map_1G_split_get,
+			 direct_map_1G_split_set, "%llu\n");
+
+static __init int direct_map_split_debugfs_init(void)
+{
+#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
+	debugfs_create_file("direct_map_2M_split", 0600,
+			    arch_debugfs_dir, NULL,
+			    &fops_direct_map_2M_split);
+#else
+	debugfs_create_file("direct_map_4M_split", 0600,
+			    arch_debugfs_dir, NULL,
+			    &fops_direct_map_4M_split);
+#endif
+	if (direct_gbpages)
+		debugfs_create_file("direct_map_1G_split", 0600,
+				    arch_debugfs_dir, NULL,
+				    &fops_direct_map_1G_split);
+	return 0;
+}
+
+late_initcall(direct_map_split_debugfs_init);

 void update_page_count(int level, unsigned long pages)
 {
@@ -85,12 +185,29 @@ void update_page_count(int level, unsigned long pages)
 	spin_unlock(&pgd_lock);
 }

+void update_split_page_event_count(int level)
+{
+	if (system_state == SYSTEM_RUNNING) {
+		split_page_event_count[level]++;
+		if (level == PG_LEVEL_2M) {
+#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
+			count_vm_event(DIRECT_MAP_2M_SPLIT);
+#else
+			count_vm_event(DIRECT_MAP_4M_SPLIT);
+#endif
+		} else if (level == PG_LEVEL_1G) {
+			count_vm_event(DIRECT_MAP_1G_SPLIT);
+		}
+	}
+}
+
 static void split_page_count(int level)
 {
 	if (direct_pages_count[level] == 0)
 		return;

 	direct_pages_count[level]--;
+	update_split_page_event_count(level);
 	direct_pages_count[level - 1] += PTRS_PER_PTE;
 }

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 18e75974d4e3..439742d2435e 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -120,6 +120,14 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_SWAP
 		SWAP_RA,
 		SWAP_RA_HIT,
+#endif
+#if defined(__x86_64__)
+#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
+		DIRECT_MAP_2M_SPLIT,
+#else
+		DIRECT_MAP_4M_SPLIT,
+#endif
+		DIRECT_MAP_1G_SPLIT,
 #endif
 		NR_VM_EVENT_ITEMS
 };
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f8942160fc95..beaa2bb4f9dc 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1350,6 +1350,14 @@ const char * const vmstat_text[] = {
 	"swap_ra",
 	"swap_ra_hit",
 #endif
+#if defined(__x86_64__)
+#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
+	"direct_map_2M_splits",
+#else
+	"direct_map_4M_splits",
+#endif
+	"direct_map_1G_splits",
+#endif
 #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
 };
 #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
--
2.24.1
* Re: [PATCH V2] x86/mm: Tracking linear mapping split events
  2021-01-27 17:51 ` [PATCH V2] x86/mm: Tracking linear mapping split events Saravanan D
@ 2021-01-27 21:03   ` Tejun Heo
  2021-01-27 21:32     ` Dave Hansen
  0 siblings, 1 reply; 29+ messages in thread

From: Tejun Heo @ 2021-01-27 21:03 UTC (permalink / raw)
To: Saravanan D; +Cc: x86, dave.hansen, luto, peterz, linux-kernel, kernel-team

Hello,

On Wed, Jan 27, 2021 at 09:51:24AM -0800, Saravanan D wrote:
> Numerous hugepage splits in the linear mapping would give
> admins the signal to narrow down the sluggishness caused by TLB
> miss/reload.
>
> To help with debugging, we introduce monotonic lifetime hugepage
> split event counts since SYSTEM_RUNNING to be displayed as part of
> /proc/vmstat in x86 servers
>
> The lifetime split event information will be displayed at the bottom of
> /proc/vmstat
> ....
> swap_ra 0
> swap_ra_hit 0
> direct_map_2M_splits 139
> direct_map_4M_splits 0
> direct_map_1G_splits 7
> nr_unstable 0
> ....

This looks great to me.

> Ancillary debugfs split event counts exported to userspace via read-write
> endpoints : /sys/kernel/debug/x86/direct_map_[2M|4M|1G]_split
>
> dmesg log when user resets the debugfs split event count for
> debugging
> ....
> [  232.470531] debugfs 2M Pages split event count(128) reset to 0
> ....

I'm not convinced this part is necessary or even beneficial.

> One of the many lasting (as we don't coalesce back) sources for huge page
> splits is tracing as the granular page attribute/permission changes would
> force the kernel to split code segments mapped to huge pages to smaller
> ones thereby increasing the probability of TLB miss/reload even after
> tracing has been stopped.
>
> Signed-off-by: Saravanan D <saravanand@fb.com>
> ---
>  arch/x86/mm/pat/set_memory.c  | 117 ++++++++++++++++++++++++++++++++++
>  include/linux/vm_event_item.h |   8 +++
>  mm/vmstat.c                   |   8 +++
>  3 files changed, 133 insertions(+)

So, now the majority of the added code is to add debugfs knobs which don't
provide anything that userland can't already do by simply reading the
monotonic counters.

Dave, are you still set on the resettable counters?

Thanks.

--
tejun
* Re: [PATCH V2] x86/mm: Tracking linear mapping split events
  2021-01-27 21:03 ` Tejun Heo
@ 2021-01-27 21:32   ` Dave Hansen
  2021-01-27 21:36     ` Tejun Heo
  0 siblings, 1 reply; 29+ messages in thread

From: Dave Hansen @ 2021-01-27 21:32 UTC (permalink / raw)
To: Tejun Heo, Saravanan D
Cc: x86, dave.hansen, luto, peterz, linux-kernel, kernel-team

On 1/27/21 1:03 PM, Tejun Heo wrote:
>> The lifetime split event information will be displayed at the bottom of
>> /proc/vmstat
>> ....
>> swap_ra 0
>> swap_ra_hit 0
>> direct_map_2M_splits 139
>> direct_map_4M_splits 0
>> direct_map_1G_splits 7
>> nr_unstable 0
>> ....
>
> This looks great to me.

Yeah, this looks fine to me.  It's way better than meminfo.

>> arch/x86/mm/pat/set_memory.c  | 117 ++++++++++++++++++++++++++++++++++
>> include/linux/vm_event_item.h |   8 +++
>> mm/vmstat.c                   |   8 +++
>> 3 files changed, 133 insertions(+)
>
> So, now the majority of the added code is to add debugfs knobs which don't
> provide anything that userland can't already do by simply reading the
> monotonic counters.
>
> Dave, are you still set on the resettable counters?

Not *really*.  But, you either need them to be resettable, or you need
to expect your users to take snapshots and compare changes over time.
Considering how much more code it is, though, I'm not super attached to it.
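[Editor's note: the snapshot-and-compare workflow Dave describes needs no kernel
support at all. A minimal userspace sketch — the counter names come from the V2
patch, while `parse_vmstat` and `split_delta` are illustrative helpers and the
sample strings stand in for two reads of /proc/vmstat:]

```python
def parse_vmstat(text):
    """Parse /proc/vmstat-style "name value" lines into a dict."""
    counters = {}
    for line in text.splitlines():
        name, _, value = line.partition(" ")
        if value.strip().isdigit():
            counters[name] = int(value)
    return counters


def split_delta(before, after,
                keys=("direct_map_2M_splits", "direct_map_1G_splits")):
    """Split events that occurred between two snapshots of the counters."""
    return {k: after.get(k, 0) - before.get(k, 0) for k in keys}


# Stand-ins for reading /proc/vmstat once at boot and again later.
at_boot = parse_vmstat("direct_map_2M_splits 139\ndirect_map_1G_splits 7")
later = parse_vmstat("direct_map_2M_splits 164\ndirect_map_1G_splits 7")
print(split_delta(at_boot, later))
# → {'direct_map_2M_splits': 25, 'direct_map_1G_splits': 0}
```

A monitoring agent would keep its own time-stamped snapshots, which is exactly
the userspace-managed "time/event stamped markers" Tejun argues for earlier in
the thread.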
* Re: [PATCH V2] x86/mm: Tracking linear mapping split events
  2021-01-27 21:32 ` Dave Hansen
@ 2021-01-27 21:36   ` Tejun Heo
  2021-01-27 21:42     ` Saravanan D
  2021-01-27 22:50     ` [PATCH V3] " Saravanan D
  0 siblings, 2 replies; 29+ messages in thread

From: Tejun Heo @ 2021-01-27 21:36 UTC (permalink / raw)
To: Dave Hansen
Cc: Saravanan D, x86, dave.hansen, luto, peterz, linux-kernel, kernel-team

Hello,

On Wed, Jan 27, 2021 at 01:32:03PM -0800, Dave Hansen wrote:
>>> arch/x86/mm/pat/set_memory.c  | 117 ++++++++++++++++++++++++++++++++++
>>> include/linux/vm_event_item.h |   8 +++
>>> mm/vmstat.c                   |   8 +++
>>> 3 files changed, 133 insertions(+)
>>
>> So, now the majority of the added code is to add debugfs knobs which don't
>> provide anything that userland can't already do by simply reading the
>> monotonic counters.
>>
>> Dave, are you still set on the resettable counters?
>
> Not *really*.  But, you either need them to be resettable, or you need
> to expect your users to take snapshots and compare changes over time.
> Considering how much more code it is, though, I'm not super attached to it.

Saravanan, can you please drop the debugfs portion and repost?

Thanks.

--
tejun
* Re: [PATCH V2] x86/mm: Tracking linear mapping split events
  2021-01-27 21:36 ` Tejun Heo
@ 2021-01-27 21:42   ` Saravanan D
  0 siblings, 0 replies; 29+ messages in thread

From: Saravanan D @ 2021-01-27 21:42 UTC (permalink / raw)
To: Tejun Heo
Cc: Dave Hansen, x86, dave.hansen, luto, peterz, linux-kernel, kernel-team

Hi Tejun,

> Saravanan, can you please drop the debugfs portion and repost?

Sure.

Saravanan D
* [PATCH V3] x86/mm: Tracking linear mapping split events
  2021-01-27 21:36 ` Tejun Heo
  2021-01-27 21:42   ` Saravanan D
@ 2021-01-27 22:50   ` Saravanan D
  2021-01-27 23:00     ` Randy Dunlap
  2021-01-27 23:41     ` Dave Hansen
  1 sibling, 2 replies; 29+ messages in thread

From: Saravanan D @ 2021-01-27 22:50 UTC (permalink / raw)
To: x86, dave.hansen, luto, peterz; +Cc: linux-kernel, kernel-team, Saravanan D

To help with debugging the sluggishness caused by TLB miss/reload,
we introduce monotonic lifetime hugepage split event counts since
system state: SYSTEM_RUNNING to be displayed as part of
/proc/vmstat in x86 servers

The lifetime split event information will be displayed at the bottom of
/proc/vmstat
....
swap_ra 0
swap_ra_hit 0
direct_map_2M_splits 167
direct_map_1G_splits 6
nr_unstable 0
....

One of the many lasting (as we don't coalesce back) sources for huge page
splits is tracing as the granular page attribute/permission changes would
force the kernel to split code segments mapped to huge pages to smaller
ones thereby increasing the probability of TLB miss/reload even after
tracing has been stopped.

Signed-off-by: Saravanan D <saravanand@fb.com>
---
 arch/x86/mm/pat/set_memory.c  | 18 ++++++++++++++++++
 include/linux/vm_event_item.h |  8 ++++++++
 mm/vmstat.c                   |  8 ++++++++
 3 files changed, 34 insertions(+)

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 16f878c26667..3ea6316df089 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -16,6 +16,8 @@
 #include <linux/pci.h>
 #include <linux/vmalloc.h>
 #include <linux/libnvdimm.h>
+#include <linux/vmstat.h>
+#include <linux/kernel.h>

 #include <asm/e820/api.h>
 #include <asm/processor.h>
@@ -85,12 +87,28 @@ void update_page_count(int level, unsigned long pages)
 	spin_unlock(&pgd_lock);
 }

+void update_split_page_event_count(int level)
+{
+	if (system_state == SYSTEM_RUNNING) {
+		if (level == PG_LEVEL_2M) {
+#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
+			count_vm_event(DIRECT_MAP_2M_SPLIT);
+#else
+			count_vm_event(DIRECT_MAP_4M_SPLIT);
+#endif
+		} else if (level == PG_LEVEL_1G) {
+			count_vm_event(DIRECT_MAP_1G_SPLIT);
+		}
+	}
+}
+
 static void split_page_count(int level)
 {
 	if (direct_pages_count[level] == 0)
 		return;

 	direct_pages_count[level]--;
+	update_split_page_event_count(level);
 	direct_pages_count[level - 1] += PTRS_PER_PTE;
 }

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 18e75974d4e3..439742d2435e 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -120,6 +120,14 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_SWAP
 		SWAP_RA,
 		SWAP_RA_HIT,
+#endif
+#if defined(__x86_64__)
+#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
+		DIRECT_MAP_2M_SPLIT,
+#else
+		DIRECT_MAP_4M_SPLIT,
+#endif
+		DIRECT_MAP_1G_SPLIT,
 #endif
 		NR_VM_EVENT_ITEMS
 };
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f8942160fc95..beaa2bb4f9dc 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1350,6 +1350,14 @@ const char * const vmstat_text[] = {
 	"swap_ra",
 	"swap_ra_hit",
 #endif
+#if defined(__x86_64__)
+#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
+	"direct_map_2M_splits",
+#else
+	"direct_map_4M_splits",
+#endif
+	"direct_map_1G_splits",
+#endif
 #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
 };
 #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
--
2.24.1
* Re: [PATCH V3] x86/mm: Tracking linear mapping split events
  2021-01-27 22:50 ` [PATCH V3] " Saravanan D
@ 2021-01-27 23:00   ` Randy Dunlap
  2021-01-27 23:56     ` Saravanan D
  0 siblings, 1 reply; 29+ messages in thread

From: Randy Dunlap @ 2021-01-27 23:00 UTC (permalink / raw)
To: Saravanan D, x86, dave.hansen, luto, peterz; +Cc: linux-kernel, kernel-team

On 1/27/21 2:50 PM, Saravanan D wrote:
> To help with debugging the sluggishness caused by TLB miss/reload,
> we introduce monotonic lifetime hugepage split event counts since
> system state: SYSTEM_RUNNING to be displayed as part of
> /proc/vmstat in x86 servers
>
> The lifetime split event information will be displayed at the bottom of
> /proc/vmstat
> ....
> swap_ra 0
> swap_ra_hit 0
> direct_map_2M_splits 167
> direct_map_1G_splits 6
> nr_unstable 0
> ....
>
> One of the many lasting (as we don't coalesce back) sources for huge page
> splits is tracing as the granular page attribute/permission changes would
> force the kernel to split code segments mapped to huge pages to smaller
> ones thereby increasing the probability of TLB miss/reload even after
> tracing has been stopped.
>
> Signed-off-by: Saravanan D <saravanand@fb.com>
> ---
>  arch/x86/mm/pat/set_memory.c  | 18 ++++++++++++++++++
>  include/linux/vm_event_item.h |  8 ++++++++
>  mm/vmstat.c                   |  8 ++++++++
>  3 files changed, 34 insertions(+)

Documentation/ update, please.

--
~Randy
* Re: [PATCH V3] x86/mm: Tracking linear mapping split events
  2021-01-27 23:00 ` Randy Dunlap
@ 2021-01-27 23:56   ` Saravanan D
  0 siblings, 0 replies; 29+ messages in thread

From: Saravanan D @ 2021-01-27 23:56 UTC (permalink / raw)
To: Randy Dunlap; +Cc: x86, dave.hansen, luto, peterz, linux-kernel, kernel-team

Hi Randy,

> Documentation/ update, please.

I will include it in the V4 patch.

- Saravanan D
* Re: [PATCH V3] x86/mm: Tracking linear mapping split events
  2021-01-27 22:50 ` [PATCH V3] " Saravanan D
  2021-01-27 23:00   ` Randy Dunlap
@ 2021-01-27 23:41   ` Dave Hansen
  2021-01-28  0:15     ` Saravanan D
  2021-01-28  4:35     ` [PATCH V4] " Saravanan D
  1 sibling, 2 replies; 29+ messages in thread

From: Dave Hansen @ 2021-01-27 23:41 UTC (permalink / raw)
To: Saravanan D, x86, dave.hansen, luto, peterz; +Cc: linux-kernel, kernel-team

On 1/27/21 2:50 PM, Saravanan D wrote:
> +#if defined(__x86_64__)

We don't use __x86_64__ in the kernel.  This should be CONFIG_X86.

> +#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
> +	"direct_map_2M_splits",
> +#else
> +	"direct_map_4M_splits",
> +#endif
> +	"direct_map_1G_splits",
> +#endif

These #ifdefs are hideous, and repeated.  I'd rather have no 32-bit
support than expose us to this ugliness.  Worst case, the 32-bit non-PAE
folks (i.e. almost nobody in the world) can just live with seeing "2M"
when the mappings are really 4M.

Or, you *could* name these after the page table levels:

	direct_map_pmd_splits
	direct_map_pud_splits

or the level from the bottom where the split occurred:

	direct_map_level2_splits
	direct_map_level3_splits

That has the bonus of being usable on other architectures.

Oh, and 1G splits aren't possible on non-PAE 32-bit.  There are only 2
levels: 4M and 4k, which would make what you have above:

> +#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
> +	"direct_map_2M_splits",
> +	"direct_map_1G_splits",
> +#else
> +	"direct_map_4M_splits",
> +#endif

I don't think there's ever a 1G/4M case.
* Re: [PATCH V3] x86/mm: Tracking linear mapping split events
  2021-01-27 23:41 ` Dave Hansen
@ 2021-01-28  0:15   ` Saravanan D
  0 siblings, 0 replies; 29+ messages in thread

From: Saravanan D @ 2021-01-28 0:15 UTC (permalink / raw)
To: Dave Hansen; +Cc: x86, dave.hansen, luto, peterz, linux-kernel, kernel-team

Hi Dave,

> We don't use __x86_64__ in the kernel.  This should be CONFIG_X86.

Noted. I will correct this in V4.

> or the level from the bottom where the split occurred:
>
>	direct_map_level2_splits
>	direct_map_level3_splits
>
> That has the bonus of being usable on other architectures.

Naming them after page table levels makes a lot of sense. Two new vmstat
event counters that are relevant for all, without the need for #ifdef
page size craziness.

- Saravanan D
* [PATCH V4] x86/mm: Tracking linear mapping split events
  2021-01-27 23:41 ` Dave Hansen
  2021-01-28  0:15   ` Saravanan D
@ 2021-01-28  4:35   ` Saravanan D
  2021-01-28  4:51     ` Matthew Wilcox
  1 sibling, 1 reply; 29+ messages in thread

From: Saravanan D @ 2021-01-28 4:35 UTC (permalink / raw)
To: x86, dave.hansen, luto, peterz, corbet
Cc: linux-kernel, kernel-team, linux-doc, Saravanan D

To help with debugging the sluggishness caused by TLB miss/reload,
we introduce monotonic lifetime hugepage split event counts since
system state: SYSTEM_RUNNING to be displayed as part of
/proc/vmstat in x86 servers

The lifetime split event information will be displayed at the bottom of
/proc/vmstat
....
swap_ra 0
swap_ra_hit 0
direct_map_level2_splits 94
direct_map_level3_splits 4
nr_unstable 0
....

One of the many lasting (as we don't coalesce back) sources for huge page
splits is tracing as the granular page attribute/permission changes would
force the kernel to split code segments mapped to huge pages to smaller
ones thereby increasing the probability of TLB miss/reload even after
tracing has been stopped.

Documentation regarding linear mapping split events added to admin-guide
as requested in V3 of the patch.

Signed-off-by: Saravanan D <saravanand@fb.com>
---
 .../admin-guide/mm/direct_mapping_splits.rst  | 59 +++++++++++++++++++
 Documentation/admin-guide/mm/index.rst        |  1 +
 arch/x86/mm/pat/set_memory.c                  | 13 ++++
 include/linux/vm_event_item.h                 |  4 ++
 mm/vmstat.c                                   |  4 ++
 5 files changed, 81 insertions(+)
 create mode 100644 Documentation/admin-guide/mm/direct_mapping_splits.rst

diff --git a/Documentation/admin-guide/mm/direct_mapping_splits.rst b/Documentation/admin-guide/mm/direct_mapping_splits.rst
new file mode 100644
index 000000000000..298751391deb
--- /dev/null
+++ b/Documentation/admin-guide/mm/direct_mapping_splits.rst
@@ -0,0 +1,59 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Direct Mapping Splits
+=====================
+
+The kernel maps all of physical memory in linear/direct mapped pages;
+translation of a kernel virtual address to its physical address is
+achieved through a simple offset subtraction. CPUs maintain a cache of
+these translations in fast caches called TLBs. CPU architectures like
+x86 allow direct mapping large portions of memory into hugepages
+(2M, 1G, etc.) at various page table levels.
+
+Maintaining huge direct mapped pages greatly reduces TLB miss pressure.
+The splintering of huge direct pages into smaller ones does result in
+a measurable performance hit caused by frequent TLB misses and reloads.
+
+One of the many lasting (as we don't coalesce back) sources for huge page
+splits is tracing, as the granular page attribute/permission changes
+force the kernel to split code segments mapped to hugepages into smaller
+ones, thus increasing the probability of TLB misses/reloads even after
+tracing has been stopped.
+
+On x86 systems, we can track the splitting of huge direct mapped pages
+through lifetime event counters in ``/proc/vmstat``
+
+	direct_map_level2_splits xxx
+	direct_map_level3_splits yyy
+
+where:
+
+direct_map_level2_splits
+	are 2M/4M hugepage split events
+direct_map_level3_splits
+	are 1G hugepage split events
+
+The distribution of direct mapped system memory across the various page
+sizes after splits can be viewed through ``/proc/meminfo``, whose output
+will include the following lines depending upon the CPU
+architecture
+
+	DirectMap4k:    xxxxx kB
+	DirectMap2M:    yyyyy kB
+	DirectMap1G:    zzzzz kB
+
+where:
+
+DirectMap4k
+	is the total amount of direct mapped memory (in kB)
+	accessed through 4k pages
+DirectMap2M
+	is the total amount of direct mapped memory (in kB)
+	accessed through 2M pages
+DirectMap1G
+	is the total amount of direct mapped memory (in kB)
+	accessed through 1G pages
+
+
+-- Saravanan D, Jan 27, 2021
diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
index 4b14d8b50e9e..9439780f3f07 100644
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -38,3 +38,4 @@ the Linux memory management.
    soft-dirty
    transhuge
    userfaultfd
+   direct_mapping_splits
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 16f878c26667..767cade53bdc 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -16,6 +16,8 @@
 #include <linux/pci.h>
 #include <linux/vmalloc.h>
 #include <linux/libnvdimm.h>
+#include <linux/vmstat.h>
+#include <linux/kernel.h>

 #include <asm/e820/api.h>
 #include <asm/processor.h>
@@ -85,12 +87,23 @@ void update_page_count(int level, unsigned long pages)
 	spin_unlock(&pgd_lock);
 }

+void update_split_page_event_count(int level)
+{
+	if (system_state == SYSTEM_RUNNING) {
+		if (level == PG_LEVEL_2M)
+			count_vm_event(DIRECT_MAP_LEVEL2_SPLIT);
+		else if (level == PG_LEVEL_1G)
+			count_vm_event(DIRECT_MAP_LEVEL3_SPLIT);
+	}
+}
+
 static void split_page_count(int level)
 {
 	if (direct_pages_count[level] == 0)
 		return;

 	direct_pages_count[level]--;
+	update_split_page_event_count(level);
 	direct_pages_count[level - 1] += PTRS_PER_PTE;
 }

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 18e75974d4e3..7c06c2bdc33b 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -120,6 +120,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_SWAP
 		SWAP_RA,
 		SWAP_RA_HIT,
+#endif
+#ifdef CONFIG_X86
+		DIRECT_MAP_LEVEL2_SPLIT,
+		DIRECT_MAP_LEVEL3_SPLIT,
 #endif
 		NR_VM_EVENT_ITEMS
 };
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f8942160fc95..a43ac4ac98a2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1350,6 +1350,10 @@ const char * const vmstat_text[] = {
 	"swap_ra",
 	"swap_ra_hit",
 #endif
+#ifdef CONFIG_X86
+	"direct_map_level2_splits",
+	"direct_map_level3_splits",
+#endif
 #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
 };
 #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
--
2.24.1
* Re: [PATCH V4] x86/mm: Tracking linear mapping split events 2021-01-28 4:35 ` [PATCH V4] " Saravanan D @ 2021-01-28 4:51 ` Matthew Wilcox [not found] ` <20210128104934.2916679-1-saravanand@fb.com> 0 siblings, 1 reply; 29+ messages in thread From: Matthew Wilcox @ 2021-01-28 4:51 UTC (permalink / raw) To: Saravanan D Cc: x86, dave.hansen, luto, peterz, corbet, linux-kernel, kernel-team, linux-doc, linux-mm, Song Liu You forgot to cc linux-mm. Adding. Also I think you should be cc'ing Song. On Wed, Jan 27, 2021 at 08:35:47PM -0800, Saravanan D wrote: > To help with debugging the sluggishness caused by TLB miss/reload, > we introduce monotonic lifetime hugepage split event counts since > system state: SYSTEM_RUNNING to be displayed as part of > /proc/vmstat in x86 servers > > The lifetime split event information will be displayed at the bottom of > /proc/vmstat > .... > swap_ra 0 > swap_ra_hit 0 > direct_map_level2_splits 94 > direct_map_level3_splits 4 > nr_unstable 0 > .... > > One of the many lasting (as we don't coalesce back) sources for huge page > splits is tracing as the granular page attribute/permission changes would > force the kernel to split code segments mapped to huge pages to smaller > ones thereby increasing the probability of TLB miss/reload even after > tracing has been stopped. Are you talking about kernel text here or application text? In either case, I don't know why you're saying we don't coalesce back after tracing is disabled. I was under the impression we did (either actively in the case of the kernel or via khugepaged for user text). > Documentation regarding linear mapping split events added to admin-guide > as requested in V3 of the patch. 
> > Signed-off-by: Saravanan D <saravanand@fb.com> > --- > .../admin-guide/mm/direct_mapping_splits.rst | 59 +++++++++++++++++++ > Documentation/admin-guide/mm/index.rst | 1 + > arch/x86/mm/pat/set_memory.c | 13 ++++ > include/linux/vm_event_item.h | 4 ++ > mm/vmstat.c | 4 ++ > 5 files changed, 81 insertions(+) > create mode 100644 Documentation/admin-guide/mm/direct_mapping_splits.rst > > diff --git a/Documentation/admin-guide/mm/direct_mapping_splits.rst b/Documentation/admin-guide/mm/direct_mapping_splits.rst > new file mode 100644 > index 000000000000..298751391deb > --- /dev/null > +++ b/Documentation/admin-guide/mm/direct_mapping_splits.rst > @@ -0,0 +1,59 @@ > +.. SPDX-License-Identifier: GPL-2.0 > + > +===================== > +Direct Mapping Splits > +===================== > + > +Kernel maps all of physical memory in linear/direct mapped pages with > +translation of virtual kernel address to physical address is achieved > +through a simple subtraction of offset. CPUs maintain a cache of these > +translations on fast caches called TLBs. CPU architectures like x86 allow > +direct mapping large portions of memory into hugepages (2M, 1G, etc) in > +various page table levels. > + > +Maintaining huge direct mapped pages greatly reduces TLB miss pressure. > +The splintering of huge direct pages into smaller ones does result in > +a measurable performance hit caused by frequent TLB miss and reloads. > + > +One of the many lasting (as we don't coalesce back) sources for huge page > +splits is tracing as the granular page attribute/permission changes would > +force the kernel to split code segments mapped to hugepages to smaller > +ones thus increasing the probability of TLB miss/reloads even after > +tracing has been stopped. 
> + > +On x86 systems, we can track the splitting of huge direct mapped pages > +through lifetime event counters in ``/proc/vmstat`` > + > + direct_map_level2_splits xxx > + direct_map_level3_splits yyy > + > +where: > + > +direct_map_level2_splits > + are 2M/4M hugepage split events > +direct_map_level3_splits > + are 1G hugepage split events > + > +The distribution of direct mapped system memory in various page sizes > +post splits can be viewed through ``/proc/meminfo`` whose output > +will include the following lines depending upon supporting CPU > +architecture > + > + DirectMap4k: xxxxx kB > + DirectMap2M: yyyyy kB > + DirectMap1G: zzzzz kB > + > +where: > + > +DirectMap4k > + is the total amount of direct mapped memory (in kB) > + accessed through 4k pages > +DirectMap2M > + is the total amount of direct mapped memory (in kB) > + accessed through 2M pages > +DirectMap1G > + is the total amount of direct mapped memory (in kB) > + accessed through 1G pages > + > + > +-- Saravanan D, Jan 27, 2021 > diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst > index 4b14d8b50e9e..9439780f3f07 100644 > --- a/Documentation/admin-guide/mm/index.rst > +++ b/Documentation/admin-guide/mm/index.rst > @@ -38,3 +38,4 @@ the Linux memory management. 
> soft-dirty > transhuge > userfaultfd > + direct_mapping_splits > diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c > index 16f878c26667..767cade53bdc 100644 > --- a/arch/x86/mm/pat/set_memory.c > +++ b/arch/x86/mm/pat/set_memory.c > @@ -16,6 +16,8 @@ > #include <linux/pci.h> > #include <linux/vmalloc.h> > #include <linux/libnvdimm.h> > +#include <linux/vmstat.h> > +#include <linux/kernel.h> > > #include <asm/e820/api.h> > #include <asm/processor.h> > @@ -85,12 +87,23 @@ void update_page_count(int level, unsigned long pages) > spin_unlock(&pgd_lock); > } > > +void update_split_page_event_count(int level) > +{ > + if (system_state == SYSTEM_RUNNING) { > + if (level == PG_LEVEL_2M) > + count_vm_event(DIRECT_MAP_LEVEL2_SPLIT); > + else if (level == PG_LEVEL_1G) > + count_vm_event(DIRECT_MAP_LEVEL3_SPLIT); > + } > +} > + > static void split_page_count(int level) > { > if (direct_pages_count[level] == 0) > return; > > direct_pages_count[level]--; > + update_split_page_event_count(level); > direct_pages_count[level - 1] += PTRS_PER_PTE; > } > > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h > index 18e75974d4e3..7c06c2bdc33b 100644 > --- a/include/linux/vm_event_item.h > +++ b/include/linux/vm_event_item.h > @@ -120,6 +120,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, > #ifdef CONFIG_SWAP > SWAP_RA, > SWAP_RA_HIT, > +#endif > +#ifdef CONFIG_X86 > + DIRECT_MAP_LEVEL2_SPLIT, > + DIRECT_MAP_LEVEL3_SPLIT, > #endif > NR_VM_EVENT_ITEMS > }; > diff --git a/mm/vmstat.c b/mm/vmstat.c > index f8942160fc95..a43ac4ac98a2 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -1350,6 +1350,10 @@ const char * const vmstat_text[] = { > "swap_ra", > "swap_ra_hit", > #endif > +#ifdef CONFIG_X86 > + "direct_map_level2_splits", > + "direct_map_level3_splits", > +#endif > #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */ > }; > #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */ > -- > 2.24.1 > ^ 
permalink raw reply [flat|nested] 29+ messages in thread
[parent not found: <20210128104934.2916679-1-saravanand@fb.com>]
* Re: [PATCH V5] x86/mm: Tracking linear mapping split events [not found] ` <20210128104934.2916679-1-saravanand@fb.com> @ 2021-01-28 15:04 ` Matthew Wilcox 2021-01-28 19:49 ` Saravanan D 2021-01-28 16:33 ` Zi Yan [not found] ` <3aec2d10-f4c3-d07a-356f-6f1001679181@intel.com> 2 siblings, 1 reply; 29+ messages in thread From: Matthew Wilcox @ 2021-01-28 15:04 UTC (permalink / raw) To: Saravanan D Cc: x86, dave.hansen, luto, peterz, corbet, linux-kernel, kernel-team, linux-doc, linux-mm, songliubraving On Thu, Jan 28, 2021 at 02:49:34AM -0800, Saravanan D wrote: > One of the many lasting (as we don't coalesce back) sources for huge page > splits is tracing as the granular page attribute/permission changes would > force the kernel to split code segments mapped to huge pages to smaller > ones thereby increasing the probability of TLB miss/reload even after > tracing has been stopped. You didn't answer my question. Is this tracing of userspace programs causing splits, or is it kernel tracing? Also, we have lots of kinds of tracing these days; are you referring to kprobes? tracepoints? ftrace? Something else? ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH V5] x86/mm: Tracking linear mapping split events 2021-01-28 15:04 ` [PATCH V5] " Matthew Wilcox @ 2021-01-28 19:49 ` Saravanan D 0 siblings, 0 replies; 29+ messages in thread From: Saravanan D @ 2021-01-28 19:49 UTC (permalink / raw) To: Matthew Wilcox Cc: x86, dave.hansen, luto, peterz, corbet, linux-kernel, kernel-team, linux-doc, linux-mm, songliubraving Hi Matthew, > Is this tracing of userspace programs causing splits, or is it kernel > tracing? Also, we have lots of kinds of tracing these days; are you > referring to kprobes? tracepoints? ftrace? Something else? It has to be kernel tracing (kprobes, tracepoints) as we are dealing with direct mapping splits. The kernel's direct mapping `` ffff888000000000 | -119.5 TB | ffffc87fffffffff | 64 TB | direct mapping of all physical memory (page_offset_base)`` The kernel text range ``ffffffff80000000 | -2 GB | ffffffff9fffffff | 512 MB | kernel text mapping, mapped to physical address 0`` Source : Documentation/x86/x86_64/mm.rst The kernel code segment points to the same physical addresses already mapped in the direct mapping range (0x20000000 = 512 MB) When we enable kernel tracing, we would have to modify attributes/permissions of the text segment pages that are direct mapped, causing them to split. When we track direct_pages_count[] in arch/x86/mm/pat/set_memory.c, there are only splits from higher levels; they never coalesce back. Splits when we turn on dynamic tracing .... cat /proc/vmstat | grep -i direct_map_level direct_map_level2_splits 784 direct_map_level3_splits 12 bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @ [pid, comm] = count(); }' cat /proc/vmstat | grep -i direct_map_level direct_map_level2_splits 789 direct_map_level3_splits 12 .... Thanks, Saravanan D ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH V5] x86/mm: Tracking linear mapping split events [not found] ` <20210128104934.2916679-1-saravanand@fb.com> 2021-01-28 15:04 ` [PATCH V5] " Matthew Wilcox @ 2021-01-28 16:33 ` Zi Yan 2021-01-28 16:41 ` Dave Hansen 2021-01-28 16:59 ` Song Liu [not found] ` <3aec2d10-f4c3-d07a-356f-6f1001679181@intel.com> 2 siblings, 2 replies; 29+ messages in thread From: Zi Yan @ 2021-01-28 16:33 UTC (permalink / raw) To: Saravanan D, Xing Zhengjun Cc: x86, dave.hansen, luto, peterz, corbet, willy, linux-kernel, kernel-team, linux-doc, linux-mm, songliubraving [-- Attachment #1: Type: text/plain, Size: 6539 bytes --] On 28 Jan 2021, at 5:49, Saravanan D wrote: > To help with debugging the sluggishness caused by TLB miss/reload, > we introduce monotonic lifetime hugepage split event counts since > system state: SYSTEM_RUNNING to be displayed as part of > /proc/vmstat in x86 servers > > The lifetime split event information will be displayed at the bottom of > /proc/vmstat > .... > swap_ra 0 > swap_ra_hit 0 > direct_map_level2_splits 94 > direct_map_level3_splits 4 > nr_unstable 0 > .... > > One of the many lasting (as we don't coalesce back) sources for huge page > splits is tracing as the granular page attribute/permission changes would > force the kernel to split code segments mapped to huge pages to smaller > ones thereby increasing the probability of TLB miss/reload even after > tracing has been stopped. It is interesting to see this statement saying splitting kernel direct mappings causes performance loss, when Zhengjun (cc’d) from Intel recently posted a kernel direct mapping performance report[1] saying 1GB mappings are good but not much better than 2MB and 4KB mappings. I would love to hear the stories from both sides. Or maybe I misunderstand anything. [1]https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/ > > Documentation regarding linear mapping split events added to admin-guide > as requested in V3 of the patch. 
> > Signed-off-by: Saravanan D <saravanand@fb.com> > --- > .../admin-guide/mm/direct_mapping_splits.rst | 59 +++++++++++++++++++ > Documentation/admin-guide/mm/index.rst | 1 + > arch/x86/mm/pat/set_memory.c | 8 +++ > include/linux/vm_event_item.h | 4 ++ > mm/vmstat.c | 4 ++ > 5 files changed, 76 insertions(+) > create mode 100644 Documentation/admin-guide/mm/direct_mapping_splits.rst > > diff --git a/Documentation/admin-guide/mm/direct_mapping_splits.rst b/Documentation/admin-guide/mm/direct_mapping_splits.rst > new file mode 100644 > index 000000000000..298751391deb > --- /dev/null > +++ b/Documentation/admin-guide/mm/direct_mapping_splits.rst > @@ -0,0 +1,59 @@ > +.. SPDX-License-Identifier: GPL-2.0 > + > +===================== > +Direct Mapping Splits > +===================== > + > +Kernel maps all of physical memory in linear/direct mapped pages with > +translation of virtual kernel address to physical address is achieved > +through a simple subtraction of offset. CPUs maintain a cache of these > +translations on fast caches called TLBs. CPU architectures like x86 allow > +direct mapping large portions of memory into hugepages (2M, 1G, etc) in > +various page table levels. > + > +Maintaining huge direct mapped pages greatly reduces TLB miss pressure. > +The splintering of huge direct pages into smaller ones does result in > +a measurable performance hit caused by frequent TLB miss and reloads. > + > +One of the many lasting (as we don't coalesce back) sources for huge page > +splits is tracing as the granular page attribute/permission changes would > +force the kernel to split code segments mapped to hugepages to smaller > +ones thus increasing the probability of TLB miss/reloads even after > +tracing has been stopped. 
> + > +On x86 systems, we can track the splitting of huge direct mapped pages > +through lifetime event counters in ``/proc/vmstat`` > + > + direct_map_level2_splits xxx > + direct_map_level3_splits yyy > + > +where: > + > +direct_map_level2_splits > + are 2M/4M hugepage split events > +direct_map_level3_splits > + are 1G hugepage split events > + > +The distribution of direct mapped system memory in various page sizes > +post splits can be viewed through ``/proc/meminfo`` whose output > +will include the following lines depending upon supporting CPU > +architecture > + > + DirectMap4k: xxxxx kB > + DirectMap2M: yyyyy kB > + DirectMap1G: zzzzz kB > + > +where: > + > +DirectMap4k > + is the total amount of direct mapped memory (in kB) > + accessed through 4k pages > +DirectMap2M > + is the total amount of direct mapped memory (in kB) > + accessed through 2M pages > +DirectMap1G > + is the total amount of direct mapped memory (in kB) > + accessed through 1G pages > + > + > +-- Saravanan D, Jan 27, 2021 > diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst > index 4b14d8b50e9e..9439780f3f07 100644 > --- a/Documentation/admin-guide/mm/index.rst > +++ b/Documentation/admin-guide/mm/index.rst > @@ -38,3 +38,4 @@ the Linux memory management. 
> soft-dirty > transhuge > userfaultfd > + direct_mapping_splits > diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c > index 16f878c26667..a7b3c5f1d316 100644 > --- a/arch/x86/mm/pat/set_memory.c > +++ b/arch/x86/mm/pat/set_memory.c > @@ -16,6 +16,8 @@ > #include <linux/pci.h> > #include <linux/vmalloc.h> > #include <linux/libnvdimm.h> > +#include <linux/vmstat.h> > +#include <linux/kernel.h> > > #include <asm/e820/api.h> > #include <asm/processor.h> > @@ -91,6 +93,12 @@ static void split_page_count(int level) > return; > > direct_pages_count[level]--; > + if (system_state == SYSTEM_RUNNING) { > + if (level == PG_LEVEL_2M) > + count_vm_event(DIRECT_MAP_LEVEL2_SPLIT); > + else if (level == PG_LEVEL_1G) > + count_vm_event(DIRECT_MAP_LEVEL3_SPLIT); > + } > direct_pages_count[level - 1] += PTRS_PER_PTE; > } > > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h > index 18e75974d4e3..7c06c2bdc33b 100644 > --- a/include/linux/vm_event_item.h > +++ b/include/linux/vm_event_item.h > @@ -120,6 +120,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, > #ifdef CONFIG_SWAP > SWAP_RA, > SWAP_RA_HIT, > +#endif > +#ifdef CONFIG_X86 > + DIRECT_MAP_LEVEL2_SPLIT, > + DIRECT_MAP_LEVEL3_SPLIT, > #endif > NR_VM_EVENT_ITEMS > }; > diff --git a/mm/vmstat.c b/mm/vmstat.c > index f8942160fc95..a43ac4ac98a2 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -1350,6 +1350,10 @@ const char * const vmstat_text[] = { > "swap_ra", > "swap_ra_hit", > #endif > +#ifdef CONFIG_X86 > + "direct_map_level2_splits", > + "direct_map_level3_splits", > +#endif > #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */ > }; > #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */ > -- > 2.24.1 — Best Regards, Yan Zi [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 854 bytes --] ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH V5] x86/mm: Tracking linear mapping split events 2021-01-28 16:33 ` Zi Yan @ 2021-01-28 16:41 ` Dave Hansen 2021-01-28 16:56 ` Zi Yan 2021-01-28 16:59 ` Song Liu 1 sibling, 1 reply; 29+ messages in thread From: Dave Hansen @ 2021-01-28 16:41 UTC (permalink / raw) To: Zi Yan, Saravanan D, Xing Zhengjun Cc: x86, dave.hansen, luto, peterz, corbet, willy, linux-kernel, kernel-team, linux-doc, linux-mm, songliubraving On 1/28/21 8:33 AM, Zi Yan wrote: >> One of the many lasting (as we don't coalesce back) sources for >> huge page splits is tracing as the granular page >> attribute/permission changes would force the kernel to split code >> segments mapped to huge pages to smaller ones thereby increasing >> the probability of TLB miss/reload even after tracing has been >> stopped. > It is interesting to see this statement saying splitting kernel > direct mappings causes performance loss, when Zhengjun (cc’d) from > Intel recently posted a kernel direct mapping performance report[1] > saying 1GB mappings are good but not much better than 2MB and 4KB > mappings. No, that's not what the report said. *Overall*, there is no clear winner between 4k, 2M and 1G. In other words, no one page size is best for *ALL* workloads. There were *ABSOLUTELY* individual workloads in those tests that saw significant deltas between the direct map sizes. There are also real-world workloads that feel the impact here. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH V5] x86/mm: Tracking linear mapping split events 2021-01-28 16:41 ` Dave Hansen @ 2021-01-28 16:56 ` Zi Yan 0 siblings, 0 replies; 29+ messages in thread From: Zi Yan @ 2021-01-28 16:56 UTC (permalink / raw) To: Dave Hansen Cc: Saravanan D, Xing Zhengjun, x86, dave.hansen, luto, peterz, corbet, willy, linux-kernel, kernel-team, linux-doc, linux-mm, songliubraving [-- Attachment #1: Type: text/plain, Size: 1716 bytes --] On 28 Jan 2021, at 11:41, Dave Hansen wrote: > On 1/28/21 8:33 AM, Zi Yan wrote: >>> One of the many lasting (as we don't coalesce back) sources for >>> huge page splits is tracing as the granular page >>> attribute/permission changes would force the kernel to split code >>> segments mapped to huge pages to smaller ones thereby increasing >>> the probability of TLB miss/reload even after tracing has been >>> stopped. >> It is interesting to see this statement saying splitting kernel >> direct mappings causes performance loss, when Zhengjun (cc’d) from >> Intel recently posted a kernel direct mapping performance report[1] >> saying 1GB mappings are good but not much better than 2MB and 4KB >> mappings. > > No, that's not what the report said. > > *Overall*, there is no clear winner between 4k, 2M and 1G. In other > words, no one page size is best for *ALL* workloads. > > There were *ABSOLUTELY* individual workloads in those tests that saw > significant deltas between the direct map sizes. There are also > real-world workloads that feel the impact here. Yes, it is what I understand from the report. But this patch says “ Maintaining huge direct mapped pages greatly reduces TLB miss pressure. The splintering of huge direct pages into smaller ones does result in a measurable performance hit caused by frequent TLB miss and reloads. ”, indicating large mappings (2MB, 1GB) are generally better. It is different from what the report said, right? 
The above text could be improved to make sure readers get both sides of the story and are not alarmed about performance loss merely from seeing a lot of direct_map_xxx_splits events. — Best Regards, Yan Zi [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 854 bytes --] ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH V5] x86/mm: Tracking linear mapping split events 2021-01-28 16:33 ` Zi Yan 2021-01-28 16:41 ` Dave Hansen @ 2021-01-28 16:59 ` Song Liu 1 sibling, 0 replies; 29+ messages in thread From: Song Liu @ 2021-01-28 16:59 UTC (permalink / raw) To: Zi Yan Cc: Saravanan D, Xing Zhengjun, the arch/x86 maintainers, dave.hansen@linux.intel.com, Andy Lutomirski, Peter Zijlstra, corbet, Matthew Wilcox, linux-kernel, Kernel Team, linux-doc, linux-mm > On Jan 28, 2021, at 8:33 AM, Zi Yan <ziy@nvidia.com> wrote: > > On 28 Jan 2021, at 5:49, Saravanan D wrote: > >> To help with debugging the sluggishness caused by TLB miss/reload, >> we introduce monotonic lifetime hugepage split event counts since >> system state: SYSTEM_RUNNING to be displayed as part of >> /proc/vmstat in x86 servers >> >> The lifetime split event information will be displayed at the bottom of >> /proc/vmstat >> .... >> swap_ra 0 >> swap_ra_hit 0 >> direct_map_level2_splits 94 >> direct_map_level3_splits 4 >> nr_unstable 0 >> .... >> >> One of the many lasting (as we don't coalesce back) sources for huge page >> splits is tracing as the granular page attribute/permission changes would >> force the kernel to split code segments mapped to huge pages to smaller >> ones thereby increasing the probability of TLB miss/reload even after >> tracing has been stopped. > > It is interesting to see this statement saying splitting kernel direct mappings > causes performance loss, when Zhengjun (cc’d) from Intel recently posted > a kernel direct mapping performance report[1] saying 1GB mappings are good > but not much better than 2MB and 4KB mappings. > > I would love to hear the stories from both sides. Or maybe I misunderstand > anything. We had an issue about 1.5 years ago, when ftrace split the 2MB kernel text page table entry into 512x 4kB ones. This split caused a ~1% performance regression. That instance was fixed in [1]. Saravanan, could you please share more information about the split. 
Is it possible to avoid the split? If not, can we regroup after tracing is disabled? We have the split-and-regroup logic for application .text on THP. When uprobe is attached to the THP text, we have to split the 2MB page table entry. So we introduced mechanism to regroup the 2MB page table entry when all uprobes are removed from the THP [2]. Thanks, Song [1] commit 7af0145067bc ("x86/mm/cpa: Prevent large page split when ftrace flips RW on kernel text") [2] commit f385cb85a42f ("uprobe: collapse THP pmd after removing all uprobes") > > > [1]https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/ >> >> Documentation regarding linear mapping split events added to admin-guide >> as requested in V3 of the patch. >> >> Signed-off-by: Saravanan D <saravanand@fb.com> >> --- >> .../admin-guide/mm/direct_mapping_splits.rst | 59 +++++++++++++++++++ >> Documentation/admin-guide/mm/index.rst | 1 + >> arch/x86/mm/pat/set_memory.c | 8 +++ >> include/linux/vm_event_item.h | 4 ++ >> mm/vmstat.c | 4 ++ >> 5 files changed, 76 insertions(+) >> create mode 100644 Documentation/admin-guide/mm/direct_mapping_splits.rst >> >> diff --git a/Documentation/admin-guide/mm/direct_mapping_splits.rst b/Documentation/admin-guide/mm/direct_mapping_splits.rst >> new file mode 100644 >> index 000000000000..298751391deb >> --- /dev/null >> +++ b/Documentation/admin-guide/mm/direct_mapping_splits.rst >> @@ -0,0 +1,59 @@ >> +.. SPDX-License-Identifier: GPL-2.0 >> + >> +===================== >> +Direct Mapping Splits >> +===================== >> + >> +Kernel maps all of physical memory in linear/direct mapped pages with >> +translation of virtual kernel address to physical address is achieved >> +through a simple subtraction of offset. CPUs maintain a cache of these >> +translations on fast caches called TLBs. CPU architectures like x86 allow >> +direct mapping large portions of memory into hugepages (2M, 1G, etc) in >> +various page table levels. 
>> + >> +Maintaining huge direct mapped pages greatly reduces TLB miss pressure. >> +The splintering of huge direct pages into smaller ones does result in >> +a measurable performance hit caused by frequent TLB miss and reloads. >> + >> +One of the many lasting (as we don't coalesce back) sources for huge page >> +splits is tracing as the granular page attribute/permission changes would >> +force the kernel to split code segments mapped to hugepages to smaller >> +ones thus increasing the probability of TLB miss/reloads even after >> +tracing has been stopped. >> + >> +On x86 systems, we can track the splitting of huge direct mapped pages >> +through lifetime event counters in ``/proc/vmstat`` >> + >> + direct_map_level2_splits xxx >> + direct_map_level3_splits yyy >> + >> +where: >> + >> +direct_map_level2_splits >> + are 2M/4M hugepage split events >> +direct_map_level3_splits >> + are 1G hugepage split events >> + >> +The distribution of direct mapped system memory in various page sizes >> +post splits can be viewed through ``/proc/meminfo`` whose output >> +will include the following lines depending upon supporting CPU >> +architecture >> + >> + DirectMap4k: xxxxx kB >> + DirectMap2M: yyyyy kB >> + DirectMap1G: zzzzz kB >> + >> +where: >> + >> +DirectMap4k >> + is the total amount of direct mapped memory (in kB) >> + accessed through 4k pages >> +DirectMap2M >> + is the total amount of direct mapped memory (in kB) >> + accessed through 2M pages >> +DirectMap1G >> + is the total amount of direct mapped memory (in kB) >> + accessed through 1G pages >> + >> + >> +-- Saravanan D, Jan 27, 2021 >> diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst >> index 4b14d8b50e9e..9439780f3f07 100644 >> --- a/Documentation/admin-guide/mm/index.rst >> +++ b/Documentation/admin-guide/mm/index.rst >> @@ -38,3 +38,4 @@ the Linux memory management. 
>> soft-dirty >> transhuge >> userfaultfd >> + direct_mapping_splits >> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c >> index 16f878c26667..a7b3c5f1d316 100644 >> --- a/arch/x86/mm/pat/set_memory.c >> +++ b/arch/x86/mm/pat/set_memory.c >> @@ -16,6 +16,8 @@ >> #include <linux/pci.h> >> #include <linux/vmalloc.h> >> #include <linux/libnvdimm.h> >> +#include <linux/vmstat.h> >> +#include <linux/kernel.h> >> >> #include <asm/e820/api.h> >> #include <asm/processor.h> >> @@ -91,6 +93,12 @@ static void split_page_count(int level) >> return; >> >> direct_pages_count[level]--; >> + if (system_state == SYSTEM_RUNNING) { >> + if (level == PG_LEVEL_2M) >> + count_vm_event(DIRECT_MAP_LEVEL2_SPLIT); >> + else if (level == PG_LEVEL_1G) >> + count_vm_event(DIRECT_MAP_LEVEL3_SPLIT); >> + } >> direct_pages_count[level - 1] += PTRS_PER_PTE; >> } >> >> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h >> index 18e75974d4e3..7c06c2bdc33b 100644 >> --- a/include/linux/vm_event_item.h >> +++ b/include/linux/vm_event_item.h >> @@ -120,6 +120,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, >> #ifdef CONFIG_SWAP >> SWAP_RA, >> SWAP_RA_HIT, >> +#endif >> +#ifdef CONFIG_X86 >> + DIRECT_MAP_LEVEL2_SPLIT, >> + DIRECT_MAP_LEVEL3_SPLIT, >> #endif >> NR_VM_EVENT_ITEMS >> }; >> diff --git a/mm/vmstat.c b/mm/vmstat.c >> index f8942160fc95..a43ac4ac98a2 100644 >> --- a/mm/vmstat.c >> +++ b/mm/vmstat.c >> @@ -1350,6 +1350,10 @@ const char * const vmstat_text[] = { >> "swap_ra", >> "swap_ra_hit", >> #endif >> +#ifdef CONFIG_X86 >> + "direct_map_level2_splits", >> + "direct_map_level3_splits", >> +#endif >> #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */ >> }; >> #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */ >> -- >> 2.24.1 > > > — > Best Regards, > Yan Zi ^ permalink raw reply [flat|nested] 29+ messages in thread
[parent not found: <3aec2d10-f4c3-d07a-356f-6f1001679181@intel.com>]
* Re: [PATCH V5] x86/mm: Tracking linear mapping split events [not found] ` <3aec2d10-f4c3-d07a-356f-6f1001679181@intel.com> @ 2021-01-28 21:20 ` Saravanan D [not found] ` <20210128233430.1460964-1-saravanand@fb.com> 0 siblings, 1 reply; 29+ messages in thread From: Saravanan D @ 2021-01-28 21:20 UTC (permalink / raw) To: Dave Hansen Cc: x86, dave.hansen, luto, peterz, corbet, willy, linux-kernel, kernel-team, linux-doc, linux-mm, songliubraving Hi Dave, > > Eek. There really doesn't appear to be a place in Documentation/ that > we've documented vmstat entries. > > Maybe you can start: > > Documentation/admin-guide/mm/vmstat.rst > I was also very surprised that no documentation exists for vmstat; that led me to add a page to the admin-guide, which now requires a lot of caveats. Starting new documentation for vmstat goes beyond the scope of this patch. I am inclined to remove the documentation from the next version [V6] of the patch. I presume that a detailed commit log [V6] explaining why direct mapped kernel page splits will never coalesce, how kernel tracing causes some of those splits and why it is worth tracking them can do the job. Proposed [V6] Commit Log: >>> To help with debugging the sluggishness caused by TLB miss/reload, we introduce monotonic hugepage [direct mapped] split event counts since system state: SYSTEM_RUNNING to be displayed as part of /proc/vmstat in x86 servers The lifetime split event information will be displayed at the bottom of /proc/vmstat .... swap_ra 0 swap_ra_hit 0 direct_map_level2_splits 94 direct_map_level3_splits 4 nr_unstable 0 .... One of the many lasting sources of direct hugepage splits is kernel tracing (kprobes, tracepoints). Note that the kernel's code segment [512 MB] points to the same physical addresses that have been already mapped in the kernel's direct mapping range. 
Source : Documentation/x86/x86_64/mm.rst When we enable kernel tracing, the kernel has to modify attributes/permissions of the text segment hugepages that are direct mapped causing them to split. Kernel's direct mapped hugepages do not coalesce back after split and remain in place for the remainder of the lifetime. An instance of direct page splits when we turn on dynamic kernel tracing .... cat /proc/vmstat | grep -i direct_map_level direct_map_level2_splits 784 direct_map_level3_splits 12 bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @ [pid, comm] = count(); }' cat /proc/vmstat | grep -i direct_map_level direct_map_level2_splits 789 direct_map_level3_splits 12 .... <<< Thanks, Saravanan D ^ permalink raw reply [flat|nested] 29+ messages in thread
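The before/after readings in the demo above amount to a delta measurement around a tracing session. The same idea can be sketched in a few lines of Python; the function names are mine, and the two counters only exist on kernels carrying this patch:

```python
def split_counts(text):
    """Extract the two direct-map split counters from vmstat-style text."""
    counts = {"direct_map_level2_splits": 0, "direct_map_level3_splits": 0}
    for line in text.splitlines():
        name, _, value = line.partition(" ")
        if name in counts and value.strip().isdigit():
            counts[name] = int(value)
    return counts


def split_delta(before_text, after_text):
    """How many new splits occurred between two /proc/vmstat snapshots."""
    before, after = split_counts(before_text), split_counts(after_text)
    return {k: after[k] - before[k] for k in before}


# The readings from the bpftrace demo in the commit log: five new 2M
# splits attributable to the tracing session, no new 1G splits.
before = "direct_map_level2_splits 784\ndirect_map_level3_splits 12\n"
after = "direct_map_level2_splits 789\ndirect_map_level3_splits 12\n"
print(split_delta(before, after))
```

In practice the two snapshots would be read from `/proc/vmstat` immediately before starting and after stopping the tracer; since the counters are monotonic, the difference isolates the splits the session caused.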
[parent not found: <20210128233430.1460964-1-saravanand@fb.com>]
* Re: [PATCH V6] x86/mm: Tracking linear mapping split events
  [not found] ` <20210128233430.1460964-1-saravanand@fb.com>
@ 2021-01-28 23:41   ` Tejun Heo
  2021-01-29 19:27   ` Johannes Weiner
  2021-02-08 23:30   ` Dave Hansen
  2 siblings, 0 replies; 29+ messages in thread
From: Tejun Heo @ 2021-01-28 23:41 UTC (permalink / raw)
  To: Saravanan D
  Cc: x86, dave.hansen, luto, peterz, willy, linux-kernel, kernel-team,
	linux-mm, songliubraving, hannes

On Thu, Jan 28, 2021 at 03:34:30PM -0800, Saravanan D wrote:
> To help with debugging the sluggishness caused by TLB miss/reload,
> we introduce monotonic hugepage [direct mapped] split event counts since
> system state: SYSTEM_RUNNING to be displayed as part of
> /proc/vmstat in x86 servers
...
> Signed-off-by: Saravanan D <saravanand@fb.com>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [PATCH V6] x86/mm: Tracking linear mapping split events
  [not found] ` <20210128233430.1460964-1-saravanand@fb.com>
  2021-01-28 23:41   ` [PATCH V6] " Tejun Heo
@ 2021-01-29 19:27   ` Johannes Weiner
  2021-02-08 23:17     ` Saravanan D
  2021-02-08 23:30   ` Dave Hansen
  2 siblings, 1 reply; 29+ messages in thread
From: Johannes Weiner @ 2021-01-29 19:27 UTC (permalink / raw)
  To: Saravanan D
  Cc: x86, dave.hansen, luto, peterz, willy, linux-kernel, kernel-team,
	linux-mm, songliubraving, tj

On Thu, Jan 28, 2021 at 03:34:30PM -0800, Saravanan D wrote:
> To help with debugging the sluggishness caused by TLB miss/reload,
> we introduce monotonic hugepage [direct mapped] split event counts since
> system state: SYSTEM_RUNNING to be displayed as part of
> /proc/vmstat in x86 servers
>
> The lifetime split event information will be displayed at the bottom of
> /proc/vmstat
> ....
> swap_ra 0
> swap_ra_hit 0
> direct_map_level2_splits 94
> direct_map_level3_splits 4
> nr_unstable 0
> ....
>
> One of the many lasting sources of direct hugepage splits is kernel
> tracing (kprobes, tracepoints).
>
> Note that the kernel's code segment [512 MB] points to the same
> physical addresses that have been already mapped in the kernel's
> direct mapping range.
>
> Source : Documentation/x86/x86_64/mm.rst
>
> When we enable kernel tracing, the kernel has to modify
> attributes/permissions of the text segment hugepages that are
> direct mapped causing them to split.
>
> Kernel's direct mapped hugepages do not coalesce back after split and
> remain in place for the remainder of the lifetime.
>
> An instance of direct page splits when we turn on
> dynamic kernel tracing
> ....
> cat /proc/vmstat | grep -i direct_map_level
> direct_map_level2_splits 784
> direct_map_level3_splits 12
> bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @ [pid, comm] = count(); }'
> cat /proc/vmstat | grep -i direct_map_level
> direct_map_level2_splits 789
> direct_map_level3_splits 12
> ....
>
> Signed-off-by: Saravanan D <saravanand@fb.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [PATCH V6] x86/mm: Tracking linear mapping split events
  2021-01-29 19:27   ` Johannes Weiner
@ 2021-02-08 23:17     ` Saravanan D
  0 siblings, 0 replies; 29+ messages in thread
From: Saravanan D @ 2021-02-08 23:17 UTC (permalink / raw)
  To: Johannes Weiner, tj, x86
  Cc: dave.hansen, luto, peterz, willy, linux-kernel, kernel-team,
	linux-mm, songliubraving, tj

Hi all,

So far I have received two acks for the V6 version of my patch:

> Acked-by: Tejun Heo <tj@kernel.org>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Are there any more objections?

Thanks,
Saravanan D

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [PATCH V6] x86/mm: Tracking linear mapping split events
  [not found] ` <20210128233430.1460964-1-saravanand@fb.com>
  2021-01-28 23:41   ` [PATCH V6] " Tejun Heo
  2021-01-29 19:27   ` Johannes Weiner
@ 2021-02-08 23:30   ` Dave Hansen
  2 siblings, 0 replies; 29+ messages in thread
From: Dave Hansen @ 2021-02-08 23:30 UTC (permalink / raw)
  To: Saravanan D, x86, dave.hansen, luto, peterz, willy
  Cc: linux-kernel, kernel-team, linux-mm, songliubraving, tj, hannes

On 1/28/21 3:34 PM, Saravanan D wrote:
>
> One of the many lasting sources of direct hugepage splits is kernel
> tracing (kprobes, tracepoints).
>
> Note that the kernel's code segment [512 MB] points to the same
> physical addresses that have been already mapped in the kernel's
> direct mapping range.

Looks fine to me:

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>

^ permalink raw reply	[flat|nested] 29+ messages in thread
end of thread, other threads:[~2021-02-08 23:30 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <BYAPR01MB40856478D5BE74CB6A7D5578CFBD9@BYAPR01MB4085.prod.exchangelabs.com>
2021-01-25 20:15 ` [PATCH] x86/mm: Tracking linear mapping split events since boot Dave Hansen
2021-01-25 20:32   ` Tejun Heo
2021-01-26  0:47     ` Dave Hansen
2021-01-26  0:53       ` Tejun Heo
2021-01-26  1:04         ` Dave Hansen
2021-01-26  1:17           ` Tejun Heo
2021-01-27 17:51             ` [PATCH V2] x86/mm: Tracking linear mapping split events Saravanan D
2021-01-27 21:03               ` Tejun Heo
2021-01-27 21:32                 ` Dave Hansen
2021-01-27 21:36                   ` Tejun Heo
2021-01-27 21:42                   ` Saravanan D
2021-01-27 22:50                     ` [PATCH V3] " Saravanan D
2021-01-27 23:00                       ` Randy Dunlap
2021-01-27 23:56                         ` Saravanan D
2021-01-27 23:41                       ` Dave Hansen
2021-01-28  0:15                         ` Saravanan D
2021-01-28  4:35                           ` [PATCH V4] " Saravanan D
2021-01-28  4:51                             ` Matthew Wilcox
     [not found]                               ` <20210128104934.2916679-1-saravanand@fb.com>
2021-01-28 15:04                                 ` [PATCH V5] " Matthew Wilcox
2021-01-28 19:49                                   ` Saravanan D
2021-01-28 16:33                                 ` Zi Yan
2021-01-28 16:41                                   ` Dave Hansen
2021-01-28 16:56                                     ` Zi Yan
2021-01-28 16:59                                   ` Song Liu
     [not found]                                 ` <3aec2d10-f4c3-d07a-356f-6f1001679181@intel.com>
2021-01-28 21:20                                   ` Saravanan D
     [not found]                                     ` <20210128233430.1460964-1-saravanand@fb.com>
2021-01-28 23:41                                       ` [PATCH V6] " Tejun Heo
2021-01-29 19:27                                       ` Johannes Weiner
2021-02-08 23:17                                         ` Saravanan D
2021-02-08 23:30                                       ` Dave Hansen