On 28 Jan 2021, at 5:49, Saravanan D wrote:

> To help with debugging the sluggishness caused by TLB miss/reload,
> we introduce monotonic lifetime hugepage split event counts since
> system state: SYSTEM_RUNNING to be displayed as part of
> /proc/vmstat in x86 servers
>
> The lifetime split event information will be displayed at the bottom of
> /proc/vmstat
> ....
> swap_ra 0
> swap_ra_hit 0
> direct_map_level2_splits 94
> direct_map_level3_splits 4
> nr_unstable 0
> ....
>
> One of the many lasting (as we don't coalesce back) sources for huge page
> splits is tracing as the granular page attribute/permission changes would
> force the kernel to split code segments mapped to huge pages to smaller
> ones thereby increasing the probability of TLB miss/reload even after
> tracing has been stopped.

It is interesting to see this statement saying that splitting kernel direct
mappings causes performance loss, when Zhengjun (cc’d) from Intel recently
posted a kernel direct mapping performance report [1] saying that 1GB
mappings are good but not much better than 2MB and 4KB mappings.

I would love to hear the stories from both sides. Or maybe I misunderstand
something.

[1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/

> Documentation regarding linear mapping split events added to admin-guide
> as requested in V3 of the patch.
> Signed-off-by: Saravanan D
> ---
>  .../admin-guide/mm/direct_mapping_splits.rst | 59 +++++++++++++++++++
>  Documentation/admin-guide/mm/index.rst       |  1 +
>  arch/x86/mm/pat/set_memory.c                 |  8 +++
>  include/linux/vm_event_item.h                |  4 ++
>  mm/vmstat.c                                  |  4 ++
>  5 files changed, 76 insertions(+)
>  create mode 100644 Documentation/admin-guide/mm/direct_mapping_splits.rst
>
> diff --git a/Documentation/admin-guide/mm/direct_mapping_splits.rst b/Documentation/admin-guide/mm/direct_mapping_splits.rst
> new file mode 100644
> index 000000000000..298751391deb
> --- /dev/null
> +++ b/Documentation/admin-guide/mm/direct_mapping_splits.rst
> @@ -0,0 +1,59 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================
> +Direct Mapping Splits
> +=====================
> +
> +The kernel maps all of physical memory as linear/direct mapped pages;
> +the translation of a kernel virtual address to its physical address is
> +achieved through a simple offset subtraction. CPUs maintain a cache of
> +these translations in fast caches called TLBs. CPU architectures like
> +x86 allow direct mapping large portions of memory into hugepages
> +(2M, 1G, etc.) at various page table levels.
> +
> +Maintaining huge direct mapped pages greatly reduces TLB miss pressure.
> +The splintering of huge direct pages into smaller ones does result in
> +a measurable performance hit caused by frequent TLB misses and reloads.
> +
> +One of the many lasting (as we don't coalesce back) sources of huge page
> +splits is tracing, as the granular page attribute/permission changes
> +would force the kernel to split code segments mapped to hugepages into
> +smaller ones, thus increasing the probability of TLB misses/reloads even
> +after tracing has been stopped.
> +
> +On x86 systems, we can track the splitting of huge direct mapped pages
> +through lifetime event counters in ``/proc/vmstat``
> +
> +	direct_map_level2_splits xxx
> +	direct_map_level3_splits yyy
> +
> +where:
> +
> +direct_map_level2_splits
> +	are 2M/4M hugepage split events
> +direct_map_level3_splits
> +	are 1G hugepage split events
> +
> +The distribution of direct mapped system memory in various page sizes
> +post splits can be viewed through ``/proc/meminfo`` whose output
> +will include the following lines depending upon supporting CPU
> +architecture
> +
> +	DirectMap4k:    xxxxx kB
> +	DirectMap2M:    yyyyy kB
> +	DirectMap1G:    zzzzz kB
> +
> +where:
> +
> +DirectMap4k
> +	is the total amount of direct mapped memory (in kB)
> +	accessed through 4k pages
> +DirectMap2M
> +	is the total amount of direct mapped memory (in kB)
> +	accessed through 2M pages
> +DirectMap1G
> +	is the total amount of direct mapped memory (in kB)
> +	accessed through 1G pages
> +
> +
> +-- Saravanan D, Jan 27, 2021
> diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
> index 4b14d8b50e9e..9439780f3f07 100644
> --- a/Documentation/admin-guide/mm/index.rst
> +++ b/Documentation/admin-guide/mm/index.rst
> @@ -38,3 +38,4 @@ the Linux memory management.
>    soft-dirty
>    transhuge
>    userfaultfd
> +   direct_mapping_splits
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index 16f878c26667..a7b3c5f1d316 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -16,6 +16,8 @@
>  #include
>  #include
>  #include
> +#include
> +#include
>
>  #include
>  #include
> @@ -91,6 +93,12 @@ static void split_page_count(int level)
>  		return;
>
>  	direct_pages_count[level]--;
> +	if (system_state == SYSTEM_RUNNING) {
> +		if (level == PG_LEVEL_2M)
> +			count_vm_event(DIRECT_MAP_LEVEL2_SPLIT);
> +		else if (level == PG_LEVEL_1G)
> +			count_vm_event(DIRECT_MAP_LEVEL3_SPLIT);
> +	}
>  	direct_pages_count[level - 1] += PTRS_PER_PTE;
>  }
>
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 18e75974d4e3..7c06c2bdc33b 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -120,6 +120,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  #ifdef CONFIG_SWAP
>  		SWAP_RA,
>  		SWAP_RA_HIT,
> +#endif
> +#ifdef CONFIG_X86
> +		DIRECT_MAP_LEVEL2_SPLIT,
> +		DIRECT_MAP_LEVEL3_SPLIT,
>  #endif
>  		NR_VM_EVENT_ITEMS
>  };
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index f8942160fc95..a43ac4ac98a2 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1350,6 +1350,10 @@ const char * const vmstat_text[] = {
>  	"swap_ra",
>  	"swap_ra_hit",
>  #endif
> +#ifdef CONFIG_X86
> +	"direct_map_level2_splits",
> +	"direct_map_level3_splits",
> +#endif
>  #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
>  };
>  #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
> --
> 2.24.1

—
Best Regards,
Yan Zi