linux-mm.kvack.org archive mirror
* Re: [PATCH V4] x86/mm: Tracking linear mapping split events
       [not found] ` <20210128043547.1560435-1-saravanand@fb.com>
@ 2021-01-28  4:51   ` Matthew Wilcox
  2021-01-28 10:49     ` [PATCH V5] " Saravanan D
  0 siblings, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2021-01-28  4:51 UTC (permalink / raw)
  To: Saravanan D
  Cc: x86, dave.hansen, luto, peterz, corbet, linux-kernel,
	kernel-team, linux-doc, linux-mm, Song Liu

You forgot to cc linux-mm.  Adding.  Also I think you should be cc'ing
Song.

On Wed, Jan 27, 2021 at 08:35:47PM -0800, Saravanan D wrote:
> To help with debugging the sluggishness caused by TLB miss/reload,
> we introduce monotonic lifetime hugepage split event counts since
> system state: SYSTEM_RUNNING to be displayed as part of
> /proc/vmstat in x86 servers
> 
> The lifetime split event information will be displayed at the bottom of
> /proc/vmstat
> ....
> swap_ra 0
> swap_ra_hit 0
> direct_map_level2_splits 94
> direct_map_level3_splits 4
> nr_unstable 0
> ....
> 
> One of the many lasting (as we don't coalesce back) sources for huge page
> splits is tracing as the granular page attribute/permission changes would
> force the kernel to split code segments mapped to huge pages to smaller
> ones thereby increasing the probability of TLB miss/reload even after
> tracing has been stopped.

Are you talking about kernel text here or application text?

In either case, I don't know why you're saying we don't coalesce
back after tracing is disabled.  I was under the impression we did
(either actively in the case of the kernel or via khugepaged for
user text).

> Documentation regarding linear mapping split events added to admin-guide
> as requested in V3 of the patch.
> 
> Signed-off-by: Saravanan D <saravanand@fb.com>
> ---
>  .../admin-guide/mm/direct_mapping_splits.rst  | 59 +++++++++++++++++++
>  Documentation/admin-guide/mm/index.rst        |  1 +
>  arch/x86/mm/pat/set_memory.c                  | 13 ++++
>  include/linux/vm_event_item.h                 |  4 ++
>  mm/vmstat.c                                   |  4 ++
>  5 files changed, 81 insertions(+)
>  create mode 100644 Documentation/admin-guide/mm/direct_mapping_splits.rst
> 
> diff --git a/Documentation/admin-guide/mm/direct_mapping_splits.rst b/Documentation/admin-guide/mm/direct_mapping_splits.rst
> new file mode 100644
> index 000000000000..298751391deb
> --- /dev/null
> +++ b/Documentation/admin-guide/mm/direct_mapping_splits.rst
> @@ -0,0 +1,59 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================
> +Direct Mapping Splits
> +=====================
> +
> +Kernel maps all of physical memory in linear/direct mapped pages with
> +translation of virtual kernel address to physical address is achieved
> +through a simple subtraction of offset. CPUs maintain a cache of these
> +translations on fast caches called TLBs. CPU architectures like x86 allow
> +direct mapping large portions of memory into hugepages (2M, 1G, etc) in
> +various page table levels.
> +
> +Maintaining huge direct mapped pages greatly reduces TLB miss pressure.
> +The splintering of huge direct pages into smaller ones does result in
> +a measurable performance hit caused by frequent TLB miss and reloads.
> +
> +One of the many lasting (as we don't coalesce back) sources for huge page
> +splits is tracing as the granular page attribute/permission changes would
> +force the kernel to split code segments mapped to hugepages to smaller
> +ones thus increasing the probability of TLB miss/reloads even after
> +tracing has been stopped.
> +
> +On x86 systems, we can track the splitting of huge direct mapped pages
> +through lifetime event counters in ``/proc/vmstat``
> +
> +	direct_map_level2_splits xxx
> +	direct_map_level3_splits yyy
> +
> +where:
> +
> +direct_map_level2_splits
> +	are 2M/4M hugepage split events
> +direct_map_level3_splits
> +	are 1G hugepage split events
> +
> +The distribution of direct mapped system memory in various page sizes
> +post splits can be viewed through ``/proc/meminfo`` whose output
> +will include the following lines depending upon supporting CPU
> +architecture
> +
> +	DirectMap4k:    xxxxx kB
> +	DirectMap2M:    yyyyy kB
> +	DirectMap1G:    zzzzz kB
> +
> +where:
> +
> +DirectMap4k
> +	is the total amount of direct mapped memory (in kB)
> +	accessed through 4k pages
> +DirectMap2M
> +	is the total amount of direct mapped memory (in kB)
> +	accessed through 2M pages
> +DirectMap1G
> +	is the total amount of direct mapped memory (in kB)
> +	accessed through 1G pages
> +
> +
> +-- Saravanan D, Jan 27, 2021
> diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
> index 4b14d8b50e9e..9439780f3f07 100644
> --- a/Documentation/admin-guide/mm/index.rst
> +++ b/Documentation/admin-guide/mm/index.rst
> @@ -38,3 +38,4 @@ the Linux memory management.
>     soft-dirty
>     transhuge
>     userfaultfd
> +   direct_mapping_splits
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index 16f878c26667..767cade53bdc 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -16,6 +16,8 @@
>  #include <linux/pci.h>
>  #include <linux/vmalloc.h>
>  #include <linux/libnvdimm.h>
> +#include <linux/vmstat.h>
> +#include <linux/kernel.h>
>  
>  #include <asm/e820/api.h>
>  #include <asm/processor.h>
> @@ -85,12 +87,23 @@ void update_page_count(int level, unsigned long pages)
>  	spin_unlock(&pgd_lock);
>  }
>  
> +void update_split_page_event_count(int level)
> +{
> +	if (system_state == SYSTEM_RUNNING) {
> +		if (level == PG_LEVEL_2M)
> +			count_vm_event(DIRECT_MAP_LEVEL2_SPLIT);
> +		else if (level == PG_LEVEL_1G)
> +			count_vm_event(DIRECT_MAP_LEVEL3_SPLIT);
> +	}
> +}
> +
>  static void split_page_count(int level)
>  {
>  	if (direct_pages_count[level] == 0)
>  		return;
>  
>  	direct_pages_count[level]--;
> +	update_split_page_event_count(level);
>  	direct_pages_count[level - 1] += PTRS_PER_PTE;
>  }
>  
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 18e75974d4e3..7c06c2bdc33b 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -120,6 +120,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  #ifdef CONFIG_SWAP
>  		SWAP_RA,
>  		SWAP_RA_HIT,
> +#endif
> +#ifdef CONFIG_X86
> +		DIRECT_MAP_LEVEL2_SPLIT,
> +		DIRECT_MAP_LEVEL3_SPLIT,
>  #endif
>  		NR_VM_EVENT_ITEMS
>  };
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index f8942160fc95..a43ac4ac98a2 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1350,6 +1350,10 @@ const char * const vmstat_text[] = {
>  	"swap_ra",
>  	"swap_ra_hit",
>  #endif
> +#ifdef CONFIG_X86
> +	"direct_map_level2_splits",
> +	"direct_map_level3_splits",
> +#endif
>  #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
>  };
>  #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
> -- 
> 2.24.1
> 



* [PATCH V5] x86/mm: Tracking linear mapping split events
  2021-01-28  4:51   ` [PATCH V4] x86/mm: Tracking linear mapping split events Matthew Wilcox
@ 2021-01-28 10:49     ` Saravanan D
  2021-01-28 15:04       ` Matthew Wilcox
                         ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Saravanan D @ 2021-01-28 10:49 UTC (permalink / raw)
  To: x86, dave.hansen, luto, peterz, corbet, willy
  Cc: linux-kernel, kernel-team, linux-doc, linux-mm, songliubraving,
	Saravanan D

To help with debugging the sluggishness caused by TLB miss/reload,
we introduce monotonic lifetime hugepage split event counts since
system state: SYSTEM_RUNNING to be displayed as part of
/proc/vmstat in x86 servers

The lifetime split event information will be displayed at the bottom of
/proc/vmstat
....
swap_ra 0
swap_ra_hit 0
direct_map_level2_splits 94
direct_map_level3_splits 4
nr_unstable 0
....

One of the many lasting (as we don't coalesce back) sources for huge page
splits is tracing as the granular page attribute/permission changes would
force the kernel to split code segments mapped to huge pages to smaller
ones thereby increasing the probability of TLB miss/reload even after
tracing has been stopped.

Documentation regarding linear mapping split events added to admin-guide
as requested in V3 of the patch.

Signed-off-by: Saravanan D <saravanand@fb.com>
---
 .../admin-guide/mm/direct_mapping_splits.rst  | 59 +++++++++++++++++++
 Documentation/admin-guide/mm/index.rst        |  1 +
 arch/x86/mm/pat/set_memory.c                  |  8 +++
 include/linux/vm_event_item.h                 |  4 ++
 mm/vmstat.c                                   |  4 ++
 5 files changed, 76 insertions(+)
 create mode 100644 Documentation/admin-guide/mm/direct_mapping_splits.rst

diff --git a/Documentation/admin-guide/mm/direct_mapping_splits.rst b/Documentation/admin-guide/mm/direct_mapping_splits.rst
new file mode 100644
index 000000000000..298751391deb
--- /dev/null
+++ b/Documentation/admin-guide/mm/direct_mapping_splits.rst
@@ -0,0 +1,59 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Direct Mapping Splits
+=====================
+
+The kernel maps all of physical memory in linear/direct mapped pages, where
+translation of a virtual kernel address to a physical address is achieved
+through a simple subtraction of an offset. CPUs keep these translations
+in fast caches called TLBs. CPU architectures like x86 allow
+direct mapping large portions of memory into hugepages (2M, 1G, etc) in
+various page table levels.
+
+Maintaining huge direct mapped pages greatly reduces TLB miss pressure.
+The splintering of huge direct pages into smaller ones does result in
+a measurable performance hit caused by frequent TLB miss and reloads.
+
+One of the many lasting (as we don't coalesce back) sources for huge page
+splits is tracing as the granular page attribute/permission changes would
+force the kernel to split code segments mapped to hugepages to smaller
+ones thus increasing the probability of TLB miss/reloads even after
+tracing has been stopped.
+
+On x86 systems, we can track the splitting of huge direct mapped pages
+through lifetime event counters in ``/proc/vmstat``
+
+	direct_map_level2_splits xxx
+	direct_map_level3_splits yyy
+
+where:
+
+direct_map_level2_splits
+	are 2M/4M hugepage split events
+direct_map_level3_splits
+	are 1G hugepage split events
+
+The distribution of direct mapped system memory in various page sizes
+post splits can be viewed through ``/proc/meminfo`` whose output
+will include the following lines depending upon supporting CPU
+architecture
+
+	DirectMap4k:    xxxxx kB
+	DirectMap2M:    yyyyy kB
+	DirectMap1G:    zzzzz kB
+
+where:
+
+DirectMap4k
+	is the total amount of direct mapped memory (in kB)
+	accessed through 4k pages
+DirectMap2M
+	is the total amount of direct mapped memory (in kB)
+	accessed through 2M pages
+DirectMap1G
+	is the total amount of direct mapped memory (in kB)
+	accessed through 1G pages
+
+
+-- Saravanan D, Jan 27, 2021
diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
index 4b14d8b50e9e..9439780f3f07 100644
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -38,3 +38,4 @@ the Linux memory management.
    soft-dirty
    transhuge
    userfaultfd
+   direct_mapping_splits
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 16f878c26667..a7b3c5f1d316 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -16,6 +16,8 @@
 #include <linux/pci.h>
 #include <linux/vmalloc.h>
 #include <linux/libnvdimm.h>
+#include <linux/vmstat.h>
+#include <linux/kernel.h>
 
 #include <asm/e820/api.h>
 #include <asm/processor.h>
@@ -91,6 +93,12 @@ static void split_page_count(int level)
 		return;
 
 	direct_pages_count[level]--;
+	if (system_state == SYSTEM_RUNNING) {
+		if (level == PG_LEVEL_2M)
+			count_vm_event(DIRECT_MAP_LEVEL2_SPLIT);
+		else if (level == PG_LEVEL_1G)
+			count_vm_event(DIRECT_MAP_LEVEL3_SPLIT);
+	}
 	direct_pages_count[level - 1] += PTRS_PER_PTE;
 }
 
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 18e75974d4e3..7c06c2bdc33b 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -120,6 +120,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_SWAP
 		SWAP_RA,
 		SWAP_RA_HIT,
+#endif
+#ifdef CONFIG_X86
+		DIRECT_MAP_LEVEL2_SPLIT,
+		DIRECT_MAP_LEVEL3_SPLIT,
 #endif
 		NR_VM_EVENT_ITEMS
 };
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f8942160fc95..a43ac4ac98a2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1350,6 +1350,10 @@ const char * const vmstat_text[] = {
 	"swap_ra",
 	"swap_ra_hit",
 #endif
+#ifdef CONFIG_X86
+	"direct_map_level2_splits",
+	"direct_map_level3_splits",
+#endif
 #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
 };
 #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
-- 
2.24.1
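
For anyone wanting to watch these counters from userspace, here is a
minimal sketch in C. It is not part of the patch. It assumes the two
counter names shown in the commit message are present in /proc/vmstat;
vmstat_lookup() and print_split_counters() are hypothetical helper names
chosen for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Look up one "name value" line in /proc/vmstat-style text.
 * Returns 0 and stores the value in *out on success, -1 if the
 * counter is absent (for example on a kernel without this patch).
 * Hypothetical helper, not a kernel or libc API.
 */
int vmstat_lookup(const char *text, const char *name,
		  unsigned long long *out)
{
	size_t len;
	const char *line;

	assert(text && name && out);
	len = strlen(name);

	for (line = text; line && *line; ) {
		/* Match the whole name, followed by the separating space. */
		if (strncmp(line, name, len) == 0 && line[len] == ' ') {
			*out = strtoull(line + len + 1, NULL, 10);
			return 0;
		}
		/* Advance to the start of the next line, if there is one. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/* Print the two split counters found in an already opened vmstat stream. */
void print_split_counters(FILE *f)
{
	char line[256];
	unsigned long long v;

	while (fgets(line, sizeof(line), f)) {
		if (vmstat_lookup(line, "direct_map_level2_splits", &v) == 0)
			printf("2M/4M splits: %llu\n", v);
		else if (vmstat_lookup(line, "direct_map_level3_splits", &v) == 0)
			printf("1G splits:    %llu\n", v);
	}
}
```

A caller would open /proc/vmstat with fopen() and pass the stream to
print_split_counters(). On a kernel without the patch nothing is printed.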




* Re: [PATCH V5] x86/mm: Tracking linear mapping split events
  2021-01-28 10:49     ` [PATCH V5] " Saravanan D
@ 2021-01-28 15:04       ` Matthew Wilcox
  2021-01-28 19:49         ` Saravanan D
  2021-01-28 16:33       ` Zi Yan
  2021-01-28 19:17       ` Dave Hansen
  2 siblings, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2021-01-28 15:04 UTC (permalink / raw)
  To: Saravanan D
  Cc: x86, dave.hansen, luto, peterz, corbet, linux-kernel,
	kernel-team, linux-doc, linux-mm, songliubraving

On Thu, Jan 28, 2021 at 02:49:34AM -0800, Saravanan D wrote:
> One of the many lasting (as we don't coalesce back) sources for huge page
> splits is tracing as the granular page attribute/permission changes would
> force the kernel to split code segments mapped to huge pages to smaller
> ones thereby increasing the probability of TLB miss/reload even after
> tracing has been stopped.

You didn't answer my question.

Is this tracing of userspace programs causing splits, or is it kernel
tracing?  Also, we have lots of kinds of tracing these days; are you
referring to kprobes?  tracepoints?  ftrace?  Something else?



* Re: [PATCH V5] x86/mm: Tracking linear mapping split events
  2021-01-28 10:49     ` [PATCH V5] " Saravanan D
  2021-01-28 15:04       ` Matthew Wilcox
@ 2021-01-28 16:33       ` Zi Yan
  2021-01-28 16:41         ` Dave Hansen
  2021-01-28 16:59         ` Song Liu
  2021-01-28 19:17       ` Dave Hansen
  2 siblings, 2 replies; 15+ messages in thread
From: Zi Yan @ 2021-01-28 16:33 UTC (permalink / raw)
  To: Saravanan D, Xing Zhengjun
  Cc: x86, dave.hansen, luto, peterz, corbet, willy, linux-kernel,
	kernel-team, linux-doc, linux-mm, songliubraving


On 28 Jan 2021, at 5:49, Saravanan D wrote:

> To help with debugging the sluggishness caused by TLB miss/reload,
> we introduce monotonic lifetime hugepage split event counts since
> system state: SYSTEM_RUNNING to be displayed as part of
> /proc/vmstat in x86 servers
>
> The lifetime split event information will be displayed at the bottom of
> /proc/vmstat
> ....
> swap_ra 0
> swap_ra_hit 0
> direct_map_level2_splits 94
> direct_map_level3_splits 4
> nr_unstable 0
> ....
>
> One of the many lasting (as we don't coalesce back) sources for huge page
> splits is tracing as the granular page attribute/permission changes would
> force the kernel to split code segments mapped to huge pages to smaller
> ones thereby increasing the probability of TLB miss/reload even after
> tracing has been stopped.

It is interesting to see this statement saying splitting kernel direct mappings
causes performance loss, when Zhengjun (cc’d) from Intel recently posted
a kernel direct mapping performance report[1] saying 1GB mappings are good
but not much better than 2MB and 4KB mappings.

I would love to hear the stories from both sides. Or maybe I misunderstood
something.


[1]https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/


—
Best Regards,
Yan Zi



* Re: [PATCH V5] x86/mm: Tracking linear mapping split events
  2021-01-28 16:33       ` Zi Yan
@ 2021-01-28 16:41         ` Dave Hansen
  2021-01-28 16:56           ` Zi Yan
  2021-01-28 16:59         ` Song Liu
  1 sibling, 1 reply; 15+ messages in thread
From: Dave Hansen @ 2021-01-28 16:41 UTC (permalink / raw)
  To: Zi Yan, Saravanan D, Xing Zhengjun
  Cc: x86, dave.hansen, luto, peterz, corbet, willy, linux-kernel,
	kernel-team, linux-doc, linux-mm, songliubraving

On 1/28/21 8:33 AM, Zi Yan wrote:
>> One of the many lasting (as we don't coalesce back) sources for
>> huge page splits is tracing as the granular page
>> attribute/permission changes would force the kernel to split code
>> segments mapped to huge pages to smaller ones thereby increasing
>> the probability of TLB miss/reload even after tracing has been
>> stopped.
> It is interesting to see this statement saying splitting kernel
> direct mappings causes performance loss, when Zhengjun (cc’d) from
> Intel recently posted a kernel direct mapping performance report[1]
> saying 1GB mappings are good but not much better than 2MB and 4KB
> mappings.

No, that's not what the report said.

*Overall*, there is no clear winner between 4k, 2M and 1G.  In other
words, no one page size is best for *ALL* workloads.

There were *ABSOLUTELY* individual workloads in those tests that saw
significant deltas between the direct map sizes.  There are also
real-world workloads that feel the impact here.



* Re: [PATCH V5] x86/mm: Tracking linear mapping split events
  2021-01-28 16:41         ` Dave Hansen
@ 2021-01-28 16:56           ` Zi Yan
  0 siblings, 0 replies; 15+ messages in thread
From: Zi Yan @ 2021-01-28 16:56 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Saravanan D, Xing Zhengjun, x86, dave.hansen, luto, peterz,
	corbet, willy, linux-kernel, kernel-team, linux-doc, linux-mm,
	songliubraving


On 28 Jan 2021, at 11:41, Dave Hansen wrote:

> On 1/28/21 8:33 AM, Zi Yan wrote:
>>> One of the many lasting (as we don't coalesce back) sources for
>>> huge page splits is tracing as the granular page
>>> attribute/permission changes would force the kernel to split code
>>> segments mapped to huge pages to smaller ones thereby increasing
>>> the probability of TLB miss/reload even after tracing has been
>>> stopped.
>> It is interesting to see this statement saying splitting kernel
>> direct mappings causes performance loss, when Zhengjun (cc’d) from
>> Intel recently posted a kernel direct mapping performance report[1]
>> saying 1GB mappings are good but not much better than 2MB and 4KB
>> mappings.
>
> No, that's not what the report said.
>
> *Overall*, there is no clear winner between 4k, 2M and 1G.  In other
> words, no one page size is best for *ALL* workloads.
>
> There were *ABSOLUTELY* individual workloads in those tests that saw
> significant deltas between the direct map sizes.  There are also
> real-world workloads that feel the impact here.

Yes, that is what I understood from the report. But this patch says
“
Maintaining huge direct mapped pages greatly reduces TLB miss pressure.
The splintering of huge direct pages into smaller ones does result in
a measurable performance hit caused by frequent TLB miss and reloads.
”,

indicating large mappings (2MB, 1GB) are generally better. It is
different from what the report said, right?

The above text could be improved to make sure readers get both sides
of the story and are not alarmed about performance loss after seeing
a lot of direct_map_xxx_splits events.
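
One way to put a split count in context is to look at how much of the
direct map is still covered by huge mappings. Below is a minimal
userspace sketch in C, assuming the DirectMap4k/DirectMap2M/DirectMap1G
lines described in the patch's documentation are present in
/proc/meminfo (some x86 configurations report DirectMap4M instead, which
this sketch ignores). The struct and both function names are made up for
illustration.

```c
#include <assert.h>
#include <stdio.h>

/* kB of the direct map covered by 4k, 2M and 1G mappings. */
struct direct_map_kb {
	unsigned long long k4, m2, g1;
};

/*
 * Parse one /proc/meminfo line. Returns 1 if it was one of the
 * DirectMap lines (and stores the value), 0 otherwise.
 */
int parse_direct_map_line(const char *line, struct direct_map_kb *dm)
{
	assert(line && dm);
	if (sscanf(line, "DirectMap4k: %llu kB", &dm->k4) == 1)
		return 1;
	if (sscanf(line, "DirectMap2M: %llu kB", &dm->m2) == 1)
		return 1;
	if (sscanf(line, "DirectMap1G: %llu kB", &dm->g1) == 1)
		return 1;
	return 0;
}

/* Percentage of the direct map still covered by 2M or 1G mappings. */
double huge_share_percent(const struct direct_map_kb *dm)
{
	unsigned long long total = dm->k4 + dm->m2 + dm->g1;

	if (total == 0)
		return 0.0;
	return 100.0 * (double)(dm->m2 + dm->g1) / (double)total;
}
```

A high split count with a huge-mapping share that is still large says
something quite different from the same count on a system whose direct
map is mostly 4k, and neither number alone shows whether a given
workload is affected.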



—
Best Regards,
Yan Zi



* Re: [PATCH V5] x86/mm: Tracking linear mapping split events
  2021-01-28 16:33       ` Zi Yan
  2021-01-28 16:41         ` Dave Hansen
@ 2021-01-28 16:59         ` Song Liu
  1 sibling, 0 replies; 15+ messages in thread
From: Song Liu @ 2021-01-28 16:59 UTC (permalink / raw)
  To: Zi Yan
  Cc: Saravanan D, Xing Zhengjun, the arch/x86 maintainers,
	dave hansen@linux intel com, Andy Lutomirski, Peter Zijlstra,
	corbet, Matthew Wilcox, linux-kernel, Kernel Team, linux-doc,
	linux-mm



> On Jan 28, 2021, at 8:33 AM, Zi Yan <ziy@nvidia.com> wrote:
> 
> On 28 Jan 2021, at 5:49, Saravanan D wrote:
> 
>> To help with debugging the sluggishness caused by TLB miss/reload,
>> we introduce monotonic lifetime hugepage split event counts since
>> system state: SYSTEM_RUNNING to be displayed as part of
>> /proc/vmstat in x86 servers
>> 
>> The lifetime split event information will be displayed at the bottom of
>> /proc/vmstat
>> ....
>> swap_ra 0
>> swap_ra_hit 0
>> direct_map_level2_splits 94
>> direct_map_level3_splits 4
>> nr_unstable 0
>> ....
>> 
>> One of the many lasting (as we don't coalesce back) sources for huge page
>> splits is tracing as the granular page attribute/permission changes would
>> force the kernel to split code segments mapped to huge pages to smaller
>> ones thereby increasing the probability of TLB miss/reload even after
>> tracing has been stopped.
> 
> It is interesting to see this statement saying splitting kernel direct mappings
> causes performance loss, when Zhengjun (cc’d) from Intel recently posted
> a kernel direct mapping performance report[1] saying 1GB mappings are good
> but not much better than 2MB and 4KB mappings.
> 
> I would love to hear the stories from both sides. Or maybe I misunderstand
> anything.

We had an issue about 1.5 years ago, when ftrace split a 2MB kernel text page
table entry into 512 4kB ones. This split caused a ~1% performance regression.
That instance was fixed in [1].
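
The 512x figure follows from the x86-64 page table layout: a 4096-byte
table of 8-byte entries holds 512 entries, so one 2MB mapping becomes
512 4kB mappings. A rough userspace model of the bookkeeping that
split_page_count() does in the patch is sketched below. The struct and
model_split() are made up for illustration, the enum is a simplified
stand-in for the kernel's pg_level, and the SYSTEM_RUNNING check from
the patch is left out.

```c
#include <assert.h>

/* Simplified stand-in for the kernel's enum pg_level (assumption). */
enum { PG_LEVEL_4K = 1, PG_LEVEL_2M, PG_LEVEL_1G, PG_LEVEL_NUM };

/* Assumed x86-64 value: 4096-byte table / 8-byte entries. */
#define PTRS_PER_PTE 512

struct direct_map_model {
	unsigned long pages[PG_LEVEL_NUM];	/* mappings at each level */
	unsigned long splits[PG_LEVEL_NUM];	/* lifetime split events */
};

/* Userspace model of the accounting for one split at the given level. */
void model_split(struct direct_map_model *m, int level)
{
	assert(m && level > PG_LEVEL_4K && level < PG_LEVEL_NUM);

	/* Nothing mapped at this level, so nothing to split. */
	if (m->pages[level] == 0)
		return;

	m->pages[level]--;			/* one huge mapping goes away */
	m->splits[level]++;			/* the event counter only grows */
	m->pages[level - 1] += PTRS_PER_PTE;	/* replaced by 512 smaller ones */
}
```

Note that the split counter in this model never decreases, which matches
the "monotonic lifetime" wording of the commit message: it records events
and says nothing about whether mappings were later regrouped.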

Saravanan, could you please share more information about the split? Is it
possible to avoid the split? If not, can we regroup after tracing is disabled?

We have the split-and-regroup logic for application .text on THP. When a uprobe
is attached to the THP text, we have to split the 2MB page table entry. So we
introduced a mechanism to regroup the 2MB page table entry when all uprobes are
removed from the THP [2].

Thanks,
Song

[1] commit 7af0145067bc ("x86/mm/cpa: Prevent large page split when ftrace flips RW on kernel text")
[2] commit f385cb85a42f ("uprobe: collapse THP pmd after removing all uprobes")



* Re: [PATCH V5] x86/mm: Tracking linear mapping split events
  2021-01-28 10:49     ` [PATCH V5] " Saravanan D
  2021-01-28 15:04       ` Matthew Wilcox
  2021-01-28 16:33       ` Zi Yan
@ 2021-01-28 19:17       ` Dave Hansen
  2021-01-28 21:20         ` Saravanan D
  2 siblings, 1 reply; 15+ messages in thread
From: Dave Hansen @ 2021-01-28 19:17 UTC (permalink / raw)
  To: Saravanan D, x86, dave.hansen, luto, peterz, corbet, willy
  Cc: linux-kernel, kernel-team, linux-doc, linux-mm, songliubraving

On 1/28/21 2:49 AM, Saravanan D wrote:
> +++ b/Documentation/admin-guide/mm/direct_mapping_splits.rst
> @@ -0,0 +1,59 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================
> +Direct Mapping Splits
> +=====================
> +
> +Kernel maps all of physical memory in linear/direct mapped pages with
> +translation of virtual kernel address to physical address is achieved
> +through a simple subtraction of offset. CPUs maintain a cache of these
> +translations on fast caches called TLBs. CPU architectures like x86 allow
> +direct mapping large portions of memory into hugepages (2M, 1G, etc) in
> +various page table levels.
> +
> +Maintaining huge direct mapped pages greatly reduces TLB miss pressure.
> +The splintering of huge direct pages into smaller ones does result in
> +a measurable performance hit caused by frequent TLB miss and reloads.

Eek.  There really doesn't appear to be a place in Documentation/ that
we've documented vmstat entries.

Maybe you can start:

	Documentation/admin-guide/mm/vmstat.rst

Also, I don't think we need background on the direct map or TLBs here.
Just get to the point and describe what the files do, don't justify why
they are there.

I also agree with Willy that you should qualify some of the strong
statements (if they remain) in your changelog and documentation.

This:

	Maintaining huge direct mapped pages
	greatly reduces TLB miss pressure.

for instance isn't universally true.  There were CPUs with a very small
number of 1G TLB entries.  Using 1G pages on those systems often led to
*GREATER* TLB pressure and lower performance.


* Re: [PATCH V5] x86/mm: Tracking linear mapping split events
  2021-01-28 15:04       ` Matthew Wilcox
@ 2021-01-28 19:49         ` Saravanan D
  0 siblings, 0 replies; 15+ messages in thread
From: Saravanan D @ 2021-01-28 19:49 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: x86, dave.hansen, luto, peterz, corbet, linux-kernel,
	kernel-team, linux-doc, linux-mm, songliubraving

Hi Matthew,

> Is this tracing of userspace programs causing splits, or is it kernel
> tracing?  Also, we have lots of kinds of tracing these days; are you
> referring to kprobes?  tracepoints?  ftrace?  Something else?

It has to be kernel tracing (kprobes, tracepoints) as we are dealing with 
direct mapping splits.

Kernel's direct mapping:
``ffff888000000000 | -119.5  TB | ffffc87fffffffff |   64 TB | direct mapping of all physical memory (page_offset_base)``

The kernel text range:
``ffffffff80000000 |   -2    GB | ffffffff9fffffff |  512 MB | kernel text mapping, mapped to physical address 0``

Source : Documentation/x86/x86_64/mm.rst

The kernel code segment points to the same physical addresses that are already
mapped in the direct mapping range (0x20000000 = 512 MB).
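
To make the aliasing concrete, here is a minimal userspace sketch (Python,
not kernel code) of the arithmetic implied by the two mm.rst lines quoted
above. The function names are made up for illustration. It assumes KASLR is
disabled and that the kernel text mapping starts at physical address 0, as
the quoted line says; on a KASLR or relocated kernel the bases differ.

```python
# Bases copied from the Documentation/x86/x86_64/mm.rst lines quoted above.
# Assumption: no KASLR, so page_offset_base has its default value.
DIRECT_MAP_BASE = 0xFFFF888000000000   # start of the direct mapping
KERNEL_TEXT_BASE = 0xFFFFFFFF80000000  # start of the kernel text mapping
KERNEL_TEXT_SIZE = 0x20000000          # 512 MB


def direct_map_alias(phys):
    """Virtual address of a physical address in the direct mapping."""
    if phys < 0:
        raise ValueError("physical address must be non-negative")
    return DIRECT_MAP_BASE + phys


def text_alias(phys):
    """Virtual address of a physical address in the kernel text mapping.

    Assumes the text mapping covers physical 0 .. 512 MB (hypothetical
    simplification; a relocated kernel has a non-zero physical base).
    """
    if not 0 <= phys < KERNEL_TEXT_SIZE:
        raise ValueError("physical address outside the 512 MB text window")
    return KERNEL_TEXT_BASE + phys


if __name__ == "__main__":
    phys = 0x1000000  # 16 MB, an arbitrary example address
    print(hex(direct_map_alias(phys)))  # alias inside the direct mapping
    print(hex(text_alias(phys)))        # alias inside the text mapping
```

Both results name the same physical page, which is why a permission change
made for the text alias also has to be applied to the direct-map alias.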

When we enable kernel tracing, we have to modify the attributes/permissions
of the direct-mapped text segment pages, which causes them to split.

When we track direct_pages_count[] in arch/x86/mm/pat/set_memory.c, there are
only splits from higher levels. They never coalesce back.

Splits when we turn on dynamic tracing
....
cat /proc/vmstat | grep -i direct_map_level
direct_map_level2_splits 784
direct_map_level3_splits 12
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @ [pid, comm] = count(); }'
cat /proc/vmstat | grep -i direct_map_level
direct_map_level2_splits 789
direct_map_level3_splits 12
....
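
The before/after comparison above can also be scripted. This is only a
sketch: it assumes a kernel that exposes the two counters proposed in this
patch, and the helper names (parse_split_counters, split_delta) are made up
for illustration. The sample text reuses the numbers shown above.

```python
def parse_split_counters(vmstat_text):
    """Extract the direct_map_level*_splits counters from /proc/vmstat text."""
    counters = {}
    for line in vmstat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0].startswith("direct_map_level"):
            counters[fields[0]] = int(fields[1])
    return counters


def split_delta(before, after):
    """Per-counter increase between two snapshots."""
    return {name: after[name] - before.get(name, 0) for name in after}


def read_vmstat(path="/proc/vmstat"):
    """Read the live file; only useful on a kernel carrying this patch."""
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    # Snapshots copied from the output above instead of the live file.
    before = parse_split_counters(
        "direct_map_level2_splits 784\ndirect_map_level3_splits 12\n")
    after = parse_split_counters(
        "direct_map_level2_splits 789\ndirect_map_level3_splits 12\n")
    for name, delta in sorted(split_delta(before, after).items()):
        print(name, delta)
```

On a live system, take one read_vmstat() snapshot before starting the tracer
and one after, and feed both through the same two helpers.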

Thanks,
Saravanan D


* Re: [PATCH V5] x86/mm: Tracking linear mapping split events
  2021-01-28 19:17       ` Dave Hansen
@ 2021-01-28 21:20         ` Saravanan D
  2021-01-28 23:34           ` [PATCH V6] " Saravanan D
  0 siblings, 1 reply; 15+ messages in thread
From: Saravanan D @ 2021-01-28 21:20 UTC (permalink / raw)
  To: Dave Hansen
  Cc: x86, dave.hansen, luto, peterz, corbet, willy, linux-kernel,
	kernel-team, linux-doc, linux-mm, songliubraving

Hi Dave,
> 
> Eek.  There really doesn't appear to be a place in Documentation/ that
> we've documented vmstat entries.
> 
> Maybe you can start:
> 
> 	Documentation/admin-guide/mm/vmstat.rst
> 
I was also very surprised that there is no documentation for vmstat.
That led me to add a page in admin-guide, which now requires a lot of
caveats.

Starting new documentation for vmstat goes beyond the scope of this patch.
I am inclined to remove Documentation from the next version [V6] of the patch.

I presume that a detailed commit log [V6] explaining why direct mapped kernel
page splits will never coalesce, how kernel tracing causes some of those
splits and why it is worth tracking them can do the job.

Proposed [V6] Commit Log:
>>>
To help with debugging the sluggishness caused by TLB miss/reload,
we introduce monotonic hugepage [direct mapped] split event counts since
system state: SYSTEM_RUNNING to be displayed as part of
/proc/vmstat in x86 servers

The lifetime split event information will be displayed at the bottom of
/proc/vmstat
....
swap_ra 0
swap_ra_hit 0
direct_map_level2_splits 94
direct_map_level3_splits 4
nr_unstable 0
....

One of the many lasting sources of direct hugepage splits is kernel
tracing (kprobes, tracepoints).

Note that the kernel's code segment [512 MB] points to the same 
physical addresses that have been already mapped in the kernel's 
direct mapping range.

Source : Documentation/x86/x86_64/mm.rst

When we enable kernel tracing, the kernel has to modify attributes/permissions
of the text segment hugepages that are direct mapped causing them to split.

Kernel's direct mapped hugepages do not coalesce back after split and
remain in place for the remainder of the lifetime.

An instance of direct page splits when we turn on
dynamic kernel tracing
....
cat /proc/vmstat | grep -i direct_map_level
direct_map_level2_splits 784
direct_map_level3_splits 12
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @ [pid, comm] = count(); }'
cat /proc/vmstat | grep -i direct_map_level
direct_map_level2_splits 789
direct_map_level3_splits 12
....
<<<

Thanks,
Saravanan D


* [PATCH V6] x86/mm: Tracking linear mapping split events
  2021-01-28 21:20         ` Saravanan D
@ 2021-01-28 23:34           ` Saravanan D
  2021-01-28 23:41             ` Tejun Heo
                               ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Saravanan D @ 2021-01-28 23:34 UTC (permalink / raw)
  To: x86, dave.hansen, luto, peterz, willy
  Cc: linux-kernel, kernel-team, linux-mm, songliubraving, tj, hannes,
	Saravanan D

To help with debugging the sluggishness caused by TLB miss/reload,
we introduce monotonic hugepage [direct mapped] split event counts since
system state: SYSTEM_RUNNING to be displayed as part of
/proc/vmstat in x86 servers

The lifetime split event information will be displayed at the bottom of
/proc/vmstat
....
swap_ra 0
swap_ra_hit 0
direct_map_level2_splits 94
direct_map_level3_splits 4
nr_unstable 0
....

One of the many lasting sources of direct hugepage splits is kernel
tracing (kprobes, tracepoints).

Note that the kernel's code segment [512 MB] points to the same
physical addresses that have been already mapped in the kernel's
direct mapping range.

Source : Documentation/x86/x86_64/mm.rst

When we enable kernel tracing, the kernel has to modify attributes/permissions
of the text segment hugepages that are direct mapped causing them to split.

Kernel's direct mapped hugepages do not coalesce back after split and
remain in place for the remainder of the lifetime.

An instance of direct page splits when we turn on
dynamic kernel tracing
....
cat /proc/vmstat | grep -i direct_map_level
direct_map_level2_splits 784
direct_map_level3_splits 12
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @ [pid, comm] = count(); }'
cat /proc/vmstat | grep -i direct_map_level
direct_map_level2_splits 789
direct_map_level3_splits 12
....

Signed-off-by: Saravanan D <saravanand@fb.com>
---
 arch/x86/mm/pat/set_memory.c  | 8 ++++++++
 include/linux/vm_event_item.h | 4 ++++
 mm/vmstat.c                   | 4 ++++
 3 files changed, 16 insertions(+)

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 16f878c26667..a7b3c5f1d316 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -16,6 +16,8 @@
 #include <linux/pci.h>
 #include <linux/vmalloc.h>
 #include <linux/libnvdimm.h>
+#include <linux/vmstat.h>
+#include <linux/kernel.h>
 
 #include <asm/e820/api.h>
 #include <asm/processor.h>
@@ -91,6 +93,12 @@ static void split_page_count(int level)
 		return;
 
 	direct_pages_count[level]--;
+	if (system_state == SYSTEM_RUNNING) {
+		if (level == PG_LEVEL_2M)
+			count_vm_event(DIRECT_MAP_LEVEL2_SPLIT);
+		else if (level == PG_LEVEL_1G)
+			count_vm_event(DIRECT_MAP_LEVEL3_SPLIT);
+	}
 	direct_pages_count[level - 1] += PTRS_PER_PTE;
 }
 
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 18e75974d4e3..7c06c2bdc33b 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -120,6 +120,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_SWAP
 		SWAP_RA,
 		SWAP_RA_HIT,
+#endif
+#ifdef CONFIG_X86
+		DIRECT_MAP_LEVEL2_SPLIT,
+		DIRECT_MAP_LEVEL3_SPLIT,
 #endif
 		NR_VM_EVENT_ITEMS
 };
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f8942160fc95..a43ac4ac98a2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1350,6 +1350,10 @@ const char * const vmstat_text[] = {
 	"swap_ra",
 	"swap_ra_hit",
 #endif
+#ifdef CONFIG_X86
+	"direct_map_level2_splits",
+	"direct_map_level3_splits",
+#endif
 #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
 };
 #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
-- 
2.24.1

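For readers who want to see the accounting in the set_memory.c hunk above
in action, the following is a rough userspace model in Python, not the
kernel implementation. The class and method names are made up for
illustration. It assumes PG_LEVEL_4K/2M/1G are 1/2/3, PTRS_PER_PTE is 512,
and that the early return guards against an empty level; the kB conversion
assumes x86_64 page sizes.

```python
# Assumed values of the kernel constants (x86_64).
PG_LEVEL_4K, PG_LEVEL_2M, PG_LEVEL_1G = 1, 2, 3
PTRS_PER_PTE = 512


class DirectMapModel:
    """Toy model of direct_pages_count[] plus the two new vm events."""

    def __init__(self, pages_4k, pages_2m, pages_1g, running=True):
        self.direct_pages_count = {
            PG_LEVEL_4K: pages_4k,
            PG_LEVEL_2M: pages_2m,
            PG_LEVEL_1G: pages_1g,
        }
        self.events = {
            "direct_map_level2_splits": 0,
            "direct_map_level3_splits": 0,
        }
        # Stands in for system_state == SYSTEM_RUNNING.
        self.running = running

    def split_page_count(self, level):
        # Assumed guard: nothing to split at this level.
        if self.direct_pages_count[level] == 0:
            return
        self.direct_pages_count[level] -= 1
        if self.running:
            if level == PG_LEVEL_2M:
                self.events["direct_map_level2_splits"] += 1
            elif level == PG_LEVEL_1G:
                self.events["direct_map_level3_splits"] += 1
        # One huge page becomes 512 pages of the next smaller size.
        self.direct_pages_count[level - 1] += PTRS_PER_PTE

    def meminfo_kb(self):
        """DirectMap* values in kB, in the style of /proc/meminfo."""
        return {
            "DirectMap4k": self.direct_pages_count[PG_LEVEL_4K] * 4,
            "DirectMap2M": self.direct_pages_count[PG_LEVEL_2M] * 2048,
            "DirectMap1G": self.direct_pages_count[PG_LEVEL_1G] * 1048576,
        }


if __name__ == "__main__":
    model = DirectMapModel(pages_4k=0, pages_2m=10, pages_1g=2)
    model.split_page_count(PG_LEVEL_1G)
    model.split_page_count(PG_LEVEL_2M)
    print(model.events)
    print(model.meminfo_kb())
```

The total of the three DirectMap values stays constant across splits; only
the distribution moves toward smaller pages, and the event counters only
ever go up.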


* Re: [PATCH V6] x86/mm: Tracking linear mapping split events
  2021-01-28 23:34           ` [PATCH V6] " Saravanan D
@ 2021-01-28 23:41             ` Tejun Heo
  2021-01-29 19:27             ` Johannes Weiner
  2021-02-08 23:30             ` Dave Hansen
  2 siblings, 0 replies; 15+ messages in thread
From: Tejun Heo @ 2021-01-28 23:41 UTC (permalink / raw)
  To: Saravanan D
  Cc: x86, dave.hansen, luto, peterz, willy, linux-kernel, kernel-team,
	linux-mm, songliubraving, hannes

On Thu, Jan 28, 2021 at 03:34:30PM -0800, Saravanan D wrote:
> To help with debugging the sluggishness caused by TLB miss/reload,
> we introduce monotonic hugepage [direct mapped] split event counts since
> system state: SYSTEM_RUNNING to be displayed as part of
> /proc/vmstat in x86 servers
...
> Signed-off-by: Saravanan D <saravanand@fb.com>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun


* Re: [PATCH V6] x86/mm: Tracking linear mapping split events
  2021-01-28 23:34           ` [PATCH V6] " Saravanan D
  2021-01-28 23:41             ` Tejun Heo
@ 2021-01-29 19:27             ` Johannes Weiner
  2021-02-08 23:17               ` Saravanan D
  2021-02-08 23:30             ` Dave Hansen
  2 siblings, 1 reply; 15+ messages in thread
From: Johannes Weiner @ 2021-01-29 19:27 UTC (permalink / raw)
  To: Saravanan D
  Cc: x86, dave.hansen, luto, peterz, willy, linux-kernel, kernel-team,
	linux-mm, songliubraving, tj

On Thu, Jan 28, 2021 at 03:34:30PM -0800, Saravanan D wrote:
> To help with debugging the sluggishness caused by TLB miss/reload,
> we introduce monotonic hugepage [direct mapped] split event counts since
> system state: SYSTEM_RUNNING to be displayed as part of
> /proc/vmstat in x86 servers
> 
> The lifetime split event information will be displayed at the bottom of
> /proc/vmstat
> ....
> swap_ra 0
> swap_ra_hit 0
> direct_map_level2_splits 94
> direct_map_level3_splits 4
> nr_unstable 0
> ....
> 
> One of the many lasting sources of direct hugepage splits is kernel
> tracing (kprobes, tracepoints).
> 
> Note that the kernel's code segment [512 MB] points to the same
> physical addresses that have been already mapped in the kernel's
> direct mapping range.
> 
> Source : Documentation/x86/x86_64/mm.rst
> 
> When we enable kernel tracing, the kernel has to modify attributes/permissions
> of the text segment hugepages that are direct mapped causing them to split.
> 
> Kernel's direct mapped hugepages do not coalesce back after split and
> remain in place for the remainder of the lifetime.
> 
> An instance of direct page splits when we turn on
> dynamic kernel tracing
> ....
> cat /proc/vmstat | grep -i direct_map_level
> direct_map_level2_splits 784
> direct_map_level3_splits 12
> bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @ [pid, comm] = count(); }'
> cat /proc/vmstat | grep -i direct_map_level
> direct_map_level2_splits 789
> direct_map_level3_splits 12
> ....
> 
> Signed-off-by: Saravanan D <saravanand@fb.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


* Re: [PATCH V6] x86/mm: Tracking linear mapping split events
  2021-01-29 19:27             ` Johannes Weiner
@ 2021-02-08 23:17               ` Saravanan D
  0 siblings, 0 replies; 15+ messages in thread
From: Saravanan D @ 2021-02-08 23:17 UTC (permalink / raw)
  To: Johannes Weiner, tj, x86
  Cc: dave.hansen, luto, peterz, willy, linux-kernel, kernel-team,
	linux-mm, songliubraving, tj

Hi all,

So far I have received two acks for the V6 version of my patch:

> Acked-by: Tejun Heo <tj@kernel.org>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Are there any more objections?

Thanks,
Saravanan D


* Re: [PATCH V6] x86/mm: Tracking linear mapping split events
  2021-01-28 23:34           ` [PATCH V6] " Saravanan D
  2021-01-28 23:41             ` Tejun Heo
  2021-01-29 19:27             ` Johannes Weiner
@ 2021-02-08 23:30             ` Dave Hansen
  2 siblings, 0 replies; 15+ messages in thread
From: Dave Hansen @ 2021-02-08 23:30 UTC (permalink / raw)
  To: Saravanan D, x86, dave.hansen, luto, peterz, willy
  Cc: linux-kernel, kernel-team, linux-mm, songliubraving, tj, hannes

On 1/28/21 3:34 PM, Saravanan D wrote:
> 
> One of the many lasting sources of direct hugepage splits is kernel
> tracing (kprobes, tracepoints).
> 
> Note that the kernel's code segment [512 MB] points to the same
> physical addresses that have been already mapped in the kernel's
> direct mapping range.

Looks fine to me:

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>



Thread overview: 15+ messages
     [not found] <a936a943-9d8f-7e3c-af38-1c99ae176e1f@intel.com>
     [not found] ` <20210128043547.1560435-1-saravanand@fb.com>
2021-01-28  4:51   ` [PATCH V4] x86/mm: Tracking linear mapping split events Matthew Wilcox
2021-01-28 10:49     ` [PATCH V5] " Saravanan D
2021-01-28 15:04       ` Matthew Wilcox
2021-01-28 19:49         ` Saravanan D
2021-01-28 16:33       ` Zi Yan
2021-01-28 16:41         ` Dave Hansen
2021-01-28 16:56           ` Zi Yan
2021-01-28 16:59         ` Song Liu
2021-01-28 19:17       ` Dave Hansen
2021-01-28 21:20         ` Saravanan D
2021-01-28 23:34           ` [PATCH V6] " Saravanan D
2021-01-28 23:41             ` Tejun Heo
2021-01-29 19:27             ` Johannes Weiner
2021-02-08 23:17               ` Saravanan D
2021-02-08 23:30             ` Dave Hansen
