* [PATCH 1/3] mm, proc: be more verbose about unstable VMA flags in /proc/<pid>/smaps
2018-12-11 14:36 [PATCH 0/3] THP eligibility reporting via proc Michal Hocko
@ 2018-12-11 14:36 ` Michal Hocko
2018-12-11 14:36 ` [PATCH 2/3] mm, thp, proc: report THP eligibility for each vma Michal Hocko
2018-12-11 14:36 ` [PATCH 3/3] mm, proc: report PR_SET_THP_DISABLE in proc Michal Hocko
2 siblings, 0 replies; 4+ messages in thread
From: Michal Hocko @ 2018-12-11 14:36 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-api, linux-mm, LKML, Michal Hocko, Dan Williams,
David Rientjes, Jan Kara, Mike Rapoport, Vlastimil Babka
From: Michal Hocko <mhocko@suse.com>
Even though vma flags exported via /proc/<pid>/smaps are explicitly
documented to be not guaranteed for future compatibility the warning
doesn't go far enough because it doesn't mention semantic changes to
those flags. And they are important as well because these flags are
a deep implementation internal to the MM code and the semantic might
change at any time.
Let's consider two recent examples:
http://lkml.kernel.org/r/20181002100531.GC4135@quack2.suse.cz
: commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has
: removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the
: mean time certain customer of ours started poking into /proc/<pid>/smaps
: and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA
: flags, the application just fails to start complaining that DAX support is
: missing in the kernel.
http://lkml.kernel.org/r/alpine.DEB.2.21.1809241054050.224429@chino.kir.corp.google.com
: Commit 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active")
: introduced a regression in that userspace cannot always determine the set
: of vmas where thp is ineligible.
: Userspace relies on the "nh" flag being emitted as part of /proc/pid/smaps
: to determine if a vma is eligible to be backed by hugepages.
: Previous to this commit, prctl(PR_SET_THP_DISABLE, 1) would cause thp to
: be disabled and emit "nh" as a flag for the corresponding vmas as part of
: /proc/pid/smaps. After the commit, thp is disabled by means of an mm
: flag and "nh" is not emitted.
: This causes smaps parsing libraries to assume a vma is eligible for thp
: and ends up puzzling the user on why its memory is not backed by thp.
In both cases userspace was relying on a semantic of a specific VMA
flag. The primary reason why that happened is a lack of a proper
internface. While this has been worked on and it will be fixed properly,
it seems that our wording could see some refinement and be more vocal
about semantic aspect of these flags as well.
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Rientjes <rientjes@google.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
Documentation/filesystems/proc.txt | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 12a5e6e693b6..2a4e63f5122c 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -496,7 +496,9 @@ flags associated with the particular virtual memory area in two letter encoded
Note that there is no guarantee that every flag and associated mnemonic will
be present in all further kernel releases. Things get changed, the flags may
-be vanished or the reverse -- new added.
+be vanished or the reverse -- new added. Interpretation of their meaning
+might change in future as well. So each consumer of these flags has to
+follow each specific kernel version for the exact semantic.
This file is only present if the CONFIG_MMU kernel configuration option is
enabled.
--
2.19.2
^ permalink raw reply related [flat|nested] 4+ messages in thread
* [PATCH 2/3] mm, thp, proc: report THP eligibility for each vma
2018-12-11 14:36 [PATCH 0/3] THP eligibility reporting via proc Michal Hocko
2018-12-11 14:36 ` [PATCH 1/3] mm, proc: be more verbose about unstable VMA flags in /proc/<pid>/smaps Michal Hocko
@ 2018-12-11 14:36 ` Michal Hocko
2018-12-11 14:36 ` [PATCH 3/3] mm, proc: report PR_SET_THP_DISABLE in proc Michal Hocko
2 siblings, 0 replies; 4+ messages in thread
From: Michal Hocko @ 2018-12-11 14:36 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-api, linux-mm, LKML, Michal Hocko, Vlastimil Babka
From: Michal Hocko <mhocko@suse.com>
Userspace falls short when trying to find out whether a specific memory
range is eligible for THP. There are usecases that would like to know
that
http://lkml.kernel.org/r/alpine.DEB.2.21.1809251248450.50347@chino.kir.corp.google.com
: This is used to identify heap mappings that should be able to fault thp
: but do not, and they normally point to a low-on-memory or fragmentation
: issue.
The only way to deduce this now is to query for hg resp. nh flags and
confronting the state with the global setting. Except that there is
also PR_SET_THP_DISABLE that might change the picture. So the final
logic is not trivial. Moreover the eligibility of the vma depends on
the type of VMA as well. In the past we have supported only anononymous
memory VMAs but things have changed and shmem based vmas are supported
as well these days and the query logic gets even more complicated
because the eligibility depends on the mount option and another global
configuration knob.
Simplify the current state and report the THP eligibility in
/proc/<pid>/smaps for each existing vma. Reuse transparent_hugepage_enabled
for this purpose. The original implementation of this function assumes
that the caller knows that the vma itself is supported for THP so make
the core checks into __transparent_hugepage_enabled and use it for
existing callers. __show_smap just use the new transparent_hugepage_enabled
which also checks the vma support status (please note that this one has
to be out of line due to include dependency issues).
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
Documentation/filesystems/proc.txt | 3 +++
fs/proc/task_mmu.c | 2 ++
include/linux/huge_mm.h | 13 ++++++++++++-
mm/huge_memory.c | 12 +++++++++++-
mm/memory.c | 4 ++--
5 files changed, 30 insertions(+), 4 deletions(-)
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 2a4e63f5122c..cd465304bec4 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -425,6 +425,7 @@ SwapPss: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
+THPeligible: 0
VmFlags: rd ex mr mw me dw
the first of these lines shows the same information as is displayed for the
@@ -462,6 +463,8 @@ replaced by copy-on-write) part of the underlying shmem object out on swap.
"SwapPss" shows proportional swap share of this mapping. Unlike "Swap", this
does not take into account swapped out page of underlying shmem objects.
"Locked" indicates whether the mapping is locked in memory or not.
+"THPeligible" indicates whether the mapping is eligible for THP pages - 1 if
+true, 0 otherwise.
"VmFlags" field deserves a separate description. This member represents the kernel
flags associated with the particular virtual memory area in two letter encoded
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 47c3764c469b..c9f160eb9fbc 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -790,6 +790,8 @@ static int show_smap(struct seq_file *m, void *v)
__show_smap(m, &mss);
+ seq_printf(m, "THPeligible: %d\n", transparent_hugepage_enabled(vma));
+
if (arch_pkeys_enabled())
seq_printf(m, "ProtectionKey: %8u\n", vma_pkey(vma));
show_smap_vma_flags(m, vma);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4663ee96cf59..381e872bfde0 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -93,7 +93,11 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
extern unsigned long transparent_hugepage_flags;
-static inline bool transparent_hugepage_enabled(struct vm_area_struct *vma)
+/*
+ * to be used on vmas which are known to support THP.
+ * Use transparent_hugepage_enabled otherwise
+ */
+static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
{
if (vma->vm_flags & VM_NOHUGEPAGE)
return false;
@@ -117,6 +121,8 @@ static inline bool transparent_hugepage_enabled(struct vm_area_struct *vma)
return false;
}
+bool transparent_hugepage_enabled(struct vm_area_struct *vma);
+
#define transparent_hugepage_use_zero_page() \
(transparent_hugepage_flags & \
(1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG))
@@ -257,6 +263,11 @@ static inline bool thp_migration_supported(void)
#define hpage_nr_pages(x) 1
+static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
+{
+ return false;
+}
+
static inline bool transparent_hugepage_enabled(struct vm_area_struct *vma)
{
return false;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 55478ab3c83b..f64733c23067 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -62,6 +62,16 @@ static struct shrinker deferred_split_shrinker;
static atomic_t huge_zero_refcount;
struct page *huge_zero_page __read_mostly;
+bool transparent_hugepage_enabled(struct vm_area_struct *vma)
+{
+ if (vma_is_anonymous(vma))
+ return __transparent_hugepage_enabled(vma);
+ if (shmem_mapping(vma->vm_file->f_mapping) && shmem_huge_enabled(vma))
+ return __transparent_hugepage_enabled(vma);
+
+ return false;
+}
+
static struct page *get_huge_zero_page(void)
{
struct page *zero_page;
@@ -1303,7 +1313,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
get_page(page);
spin_unlock(vmf->ptl);
alloc:
- if (transparent_hugepage_enabled(vma) &&
+ if (__transparent_hugepage_enabled(vma) &&
!transparent_hugepage_debug_cow()) {
huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
new_page = alloc_pages_vma(huge_gfp, HPAGE_PMD_ORDER, vma,
diff --git a/mm/memory.c b/mm/memory.c
index 4ad2d293ddc2..3c2716ec7fbd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3830,7 +3830,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
vmf.pud = pud_alloc(mm, p4d, address);
if (!vmf.pud)
return VM_FAULT_OOM;
- if (pud_none(*vmf.pud) && transparent_hugepage_enabled(vma)) {
+ if (pud_none(*vmf.pud) && __transparent_hugepage_enabled(vma)) {
ret = create_huge_pud(&vmf);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
@@ -3856,7 +3856,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
vmf.pmd = pmd_alloc(mm, vmf.pud, address);
if (!vmf.pmd)
return VM_FAULT_OOM;
- if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
+ if (pmd_none(*vmf.pmd) && __transparent_hugepage_enabled(vma)) {
ret = create_huge_pmd(&vmf);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
--
2.19.2
^ permalink raw reply related [flat|nested] 4+ messages in thread
* [PATCH 3/3] mm, proc: report PR_SET_THP_DISABLE in proc
2018-12-11 14:36 [PATCH 0/3] THP eligibility reporting via proc Michal Hocko
2018-12-11 14:36 ` [PATCH 1/3] mm, proc: be more verbose about unstable VMA flags in /proc/<pid>/smaps Michal Hocko
2018-12-11 14:36 ` [PATCH 2/3] mm, thp, proc: report THP eligibility for each vma Michal Hocko
@ 2018-12-11 14:36 ` Michal Hocko
2 siblings, 0 replies; 4+ messages in thread
From: Michal Hocko @ 2018-12-11 14:36 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-api, linux-mm, LKML, Michal Hocko, Vlastimil Babka
From: Michal Hocko <mhocko@suse.com>
David Rientjes has reported that 1860033237d4 ("mm: make
PR_SET_THP_DISABLE immediately active") has changed the way how
we report THPable VMAs to the userspace. Their monitoring tool is
triggering false alarms on PR_SET_THP_DISABLE tasks because it considers
an insufficient THP usage as a memory fragmentation resp. memory
pressure issue.
Before the said commit each newly created VMA inherited VM_NOHUGEPAGE
flag and that got exposed to the userspace via /proc/<pid>/smaps file.
This implementation had its downsides as explained in the commit message
but it is true that the userspace doesn't have any means to query for
the process wide THP enabled/disabled status.
PR_SET_THP_DISABLE is a process wide flag so it makes a lot of sense
to export in the process wide context rather than per-vma. Introduce
a new field to /proc/<pid>/status which export this status. If
PR_SET_THP_DISABLE is used then it reports false same as when the THP is
not compiled in. It doesn't consider the global THP status because we
already export that information via sysfs
Fixes: 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active")
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
Documentation/filesystems/proc.txt | 3 +++
fs/proc/array.c | 10 ++++++++++
2 files changed, 13 insertions(+)
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index cd465304bec4..b24fd9bccc99 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -182,6 +182,7 @@ For example, to get the status information of a process, all you have to do is
VmSwap: 0 kB
HugetlbPages: 0 kB
CoreDumping: 0
+ THP_enabled: 1
Threads: 1
SigQ: 0/28578
SigPnd: 0000000000000000
@@ -256,6 +257,8 @@ Table 1-2: Contents of the status files (as of 4.8)
HugetlbPages size of hugetlb memory portions
CoreDumping process's memory is currently being dumped
(killing the process may lead to a corrupted core)
+ THP_enabled process is allowed to use THP (returns 0 when
+ PR_SET_THP_DISABLE is set on the process
Threads number of threads
SigQ number of signals queued/max. number for queue
SigPnd bitmap of pending signals for the thread
diff --git a/fs/proc/array.c b/fs/proc/array.c
index 0ceb3b6b37e7..9d428d5a0ac8 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -392,6 +392,15 @@ static inline void task_core_dumping(struct seq_file *m, struct mm_struct *mm)
seq_putc(m, '\n');
}
+static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm)
+{
+ bool thp_enabled = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE);
+
+ if (thp_enabled)
+ thp_enabled = !test_bit(MMF_DISABLE_THP, &mm->flags);
+ seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
+}
+
int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *task)
{
@@ -406,6 +415,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
if (mm) {
task_mem(m, mm);
task_core_dumping(m, mm);
+ task_thp_status(m, mm);
mmput(mm);
}
task_sig(m, task);
--
2.19.2
^ permalink raw reply related [flat|nested] 4+ messages in thread