Linux-api Archive on lore.kernel.org
* [RFC PATCH 0/3] THP eligibility reporting via proc
@ 2018-11-20 10:35 Michal Hocko
  2018-11-20 10:35 ` [RFC PATCH 1/3] mm, proc: be more verbose about unstable VMA flags in /proc/<pid>/smaps Michal Hocko
                   ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Michal Hocko @ 2018-11-20 10:35 UTC (permalink / raw)
  To: linux-api
  Cc: Andrew Morton, Alexey Dobriyan, linux-mm, LKML, Dan Williams,
	David Rientjes, Jan Kara, Michal Hocko

Hi,
this series of three patches aims at making THP eligibility reporting
much more robust and long term sustainable. The trigger for the change
is a regression report [1] and the long follow up discussion. In short
the specific application didn't have a good API to query whether a particular
mapping can be backed by THP so it has used VMA flags to work around that.
These flags represent a deep internal state of VMAs and as such they should
be used by userspace with a great deal of caution.

A similar thing has happened for [2] when users complained that VM_MIXEDMAP is
no longer set on DAX mappings. Again a lack of a proper API led to an
abuse.

The first patch in the series tries to emphasise that the semantic
of flags might change and any application consuming those should be really
careful.

The remaining two patches provide a more suitable interface to address [1]
and provide a consistent API to query the THP status both for each VMA
and process wide as well.

[1] http://lkml.kernel.org/r/alpine.DEB.2.21.1809241054050.224429@chino.kir.corp.google.com
[2] http://lkml.kernel.org/r/20181002100531.GC4135@quack2.suse.cz

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [RFC PATCH 1/3] mm, proc: be more verbose about unstable VMA flags in /proc/<pid>/smaps
  2018-11-20 10:35 [RFC PATCH 0/3] THP eligibility reporting via proc Michal Hocko
@ 2018-11-20 10:35 ` Michal Hocko
  2018-11-20 10:51   ` Jan Kara
                     ` (3 more replies)
  2018-11-20 10:35 ` [RFC PATCH 2/3] mm, thp, proc: report THP eligibility for each vma Michal Hocko
                   ` (2 subsequent siblings)
  3 siblings, 4 replies; 27+ messages in thread
From: Michal Hocko @ 2018-11-20 10:35 UTC (permalink / raw)
  To: linux-api
  Cc: Andrew Morton, Alexey Dobriyan, linux-mm, LKML, Michal Hocko,
	Jan Kara, Dan Williams, David Rientjes

From: Michal Hocko <mhocko@suse.com>

Even though vma flags exported via /proc/<pid>/smaps are explicitly
documented to be not guaranteed for future compatibility the warning
doesn't go far enough because it doesn't mention semantic changes to
those flags. And they are important as well because these flags are
a deep implementation internal to the MM code and the semantic might
change at any time.

Let's consider two recent examples:
http://lkml.kernel.org/r/20181002100531.GC4135@quack2.suse.cz
: commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has
: removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the
: mean time certain customer of ours started poking into /proc/<pid>/smaps
: and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA
: flags, the application just fails to start complaining that DAX support is
: missing in the kernel.

http://lkml.kernel.org/r/alpine.DEB.2.21.1809241054050.224429@chino.kir.corp.google.com
: Commit 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active")
: introduced a regression in that userspace cannot always determine the set
: of vmas where thp is ineligible.
: Userspace relies on the "nh" flag being emitted as part of /proc/pid/smaps
: to determine if a vma is eligible to be backed by hugepages.
: Previous to this commit, prctl(PR_SET_THP_DISABLE, 1) would cause thp to
: be disabled and emit "nh" as a flag for the corresponding vmas as part of
: /proc/pid/smaps.  After the commit, thp is disabled by means of an mm
: flag and "nh" is not emitted.
: This causes smaps parsing libraries to assume a vma is eligible for thp
: and ends up puzzling the user on why its memory is not backed by thp.

In both cases userspace was relying on a semantic of a specific VMA
flag. The primary reason why that happened is a lack of a proper
interface. While this has been worked on and it will be fixed properly,
it seems that our wording could see some refinement and be more vocal
about the semantic aspect of these flags as well.
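
To make the fragility concrete, here is a minimal sketch of the kind of
check the reports above describe: looking for a two letter mnemonic such
as "nh" on a VmFlags line. The helper name vmflags_has is hypothetical
and is not taken from any of the affected applications.

```c
#include <string.h>

/*
 * Return 1 if the given two letter mnemonic (e.g. "nh") is present on a
 * "VmFlags:" line from /proc/<pid>/smaps, 0 otherwise.
 * Hypothetical helper, for illustration only.
 */
int vmflags_has(const char *line, const char *mnemonic)
{
	static const char prefix[] = "VmFlags:";
	const char *p;

	if (strncmp(line, prefix, sizeof(prefix) - 1) != 0)
		return 0;

	p = line + sizeof(prefix) - 1;
	while (*p) {
		/* skip separators between mnemonics */
		p += strspn(p, " \t\n");
		if (strncmp(p, mnemonic, 2) == 0 && strcspn(p, " \t\n") == 2)
			return 1;
		/* advance past the current mnemonic */
		p += strcspn(p, " \t\n");
	}
	return 0;
}
```

A caller that treats a missing "nh" as "THP eligible" is relying on
exactly the kind of internal semantic that changed in the second example.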

Cc: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 Documentation/filesystems/proc.txt | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 12a5e6e693b6..b1fda309f067 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -496,7 +496,9 @@ flags associated with the particular virtual memory area in two letter encoded
 
 Note that there is no guarantee that every flag and associated mnemonic will
 be present in all further kernel releases. Things get changed, the flags may
-be vanished or the reverse -- new added.
+be vanished or the reverse -- new added. Interpretation of their meaning
+might change in future as well. So each consumer of these flags has to
+follow each specific kernel version for the exact semantic.
 
 This file is only present if the CONFIG_MMU kernel configuration option is
 enabled.
-- 
2.19.1

* [RFC PATCH 2/3] mm, thp, proc: report THP eligibility for each vma
  2018-11-20 10:35 [RFC PATCH 0/3] THP eligibility reporting via proc Michal Hocko
  2018-11-20 10:35 ` [RFC PATCH 1/3] mm, proc: be more verbose about unstable VMA flags in /proc/<pid>/smaps Michal Hocko
@ 2018-11-20 10:35 ` Michal Hocko
  2018-11-20 11:42   ` Michal Hocko
  2018-11-23 15:07   ` Vlastimil Babka
  2018-11-20 10:35 ` [RFC PATCH 3/3] mm, proc: report PR_SET_THP_DISABLE in proc Michal Hocko
  2018-12-07 10:55 ` [RFC PATCH 0/3] THP eligibility reporting via proc Michal Hocko
  3 siblings, 2 replies; 27+ messages in thread
From: Michal Hocko @ 2018-11-20 10:35 UTC (permalink / raw)
  To: linux-api; +Cc: Andrew Morton, Alexey Dobriyan, linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Userspace falls short when trying to find out whether a specific memory
range is eligible for THP. There are usecases that would like to know
that:
http://lkml.kernel.org/r/alpine.DEB.2.21.1809251248450.50347@chino.kir.corp.google.com
: This is used to identify heap mappings that should be able to fault thp
: but do not, and they normally point to a low-on-memory or fragmentation
: issue.

The only way to deduce this now is to query for the hg or nh flags and
compare the state against the global setting. Except that there is
also PR_SET_THP_DISABLE that might change the picture. So the final
logic is not trivial. Moreover the eligibility of the vma depends on
the type of VMA as well. In the past we have supported only anonymous
memory VMAs but things have changed and shmem based vmas are supported
as well these days. The query logic gets even more complicated
because the eligibility depends on the mount option and another global
configuration knob.

Simplify the current state and report the THP eligibility in
/proc/<pid>/smaps for each existing vma. Reuse transparent_hugepage_enabled
for this purpose. The original implementation of this function assumes
that the caller knows that the vma itself is supported for THP so move
the core checks into __transparent_hugepage_enabled and use it for
existing callers. show_smap just uses the new transparent_hugepage_enabled
which also checks the vma support status (please note that this one has
to be out of line due to include dependency issues).
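
For illustration, a userspace consumer of the new field could look
roughly like the sketch below. It assumes the "THPeligible:" line format
proposed in this patch; the function names parse_thp_eligible and
count_thp_eligible are hypothetical and not part of any existing library.

```c
#include <stdio.h>

/*
 * Parse one line of /proc/<pid>/smaps. Returns 1 and stores 0 or 1 in
 * *eligible if the line is a "THPeligible:" line, returns 0 otherwise.
 */
int parse_thp_eligible(const char *line, int *eligible)
{
	int val;

	if (sscanf(line, "THPeligible: %d", &val) != 1)
		return 0;
	*eligible = val != 0;
	return 1;
}

/*
 * Walk an smaps formatted stream. Returns the number of mappings
 * reported as THP eligible and stores the number of mappings that
 * carried a THPeligible line at all in *total.
 */
int count_thp_eligible(FILE *f, int *total)
{
	char line[512];
	int eligible, count = 0;

	*total = 0;
	while (fgets(line, sizeof(line), f)) {
		if (!parse_thp_eligible(line, &eligible))
			continue;
		(*total)++;
		count += eligible;
	}
	return count;
}
```

A monitoring tool would open /proc/<pid>/smaps with fopen() and pass the
stream to count_thp_eligible(). A total of 0 suggests a kernel which does
not provide the field, so the caller should not read that as "nothing is
eligible".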

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 Documentation/filesystems/proc.txt |  3 +++
 fs/proc/task_mmu.c                 |  2 ++
 include/linux/huge_mm.h            | 13 ++++++++++++-
 mm/huge_memory.c                   | 12 +++++++++++-
 mm/memory.c                        |  4 ++--
 5 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index b1fda309f067..06562bab509a 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -425,6 +425,7 @@ SwapPss:               0 kB
 KernelPageSize:        4 kB
 MMUPageSize:           4 kB
 Locked:                0 kB
+THPeligible:           0
 VmFlags: rd ex mr mw me dw
 
 the first of these lines shows the same information as is displayed for the
@@ -462,6 +463,8 @@ replaced by copy-on-write) part of the underlying shmem object out on swap.
 "SwapPss" shows proportional swap share of this mapping. Unlike "Swap", this
 does not take into account swapped out page of underlying shmem objects.
 "Locked" indicates whether the mapping is locked in memory or not.
+"THPeligible" indicates whether the mapping is eligible for THP pages - 1 if
+true, 0 otherwise.
 
 "VmFlags" field deserves a separate description. This member represents the kernel
 flags associated with the particular virtual memory area in two letter encoded
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 47c3764c469b..c9f160eb9fbc 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -790,6 +790,8 @@ static int show_smap(struct seq_file *m, void *v)
 
 	__show_smap(m, &mss);
 
+	seq_printf(m, "THPeligible:    %d\n", transparent_hugepage_enabled(vma));
+
 	if (arch_pkeys_enabled())
 		seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
 	show_smap_vma_flags(m, vma);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4663ee96cf59..381e872bfde0 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -93,7 +93,11 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
 
 extern unsigned long transparent_hugepage_flags;
 
-static inline bool transparent_hugepage_enabled(struct vm_area_struct *vma)
+/*
+ * to be used on vmas which are known to support THP.
+ * Use transparent_hugepage_enabled otherwise
+ */
+static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
 {
 	if (vma->vm_flags & VM_NOHUGEPAGE)
 		return false;
@@ -117,6 +121,8 @@ static inline bool transparent_hugepage_enabled(struct vm_area_struct *vma)
 	return false;
 }
 
+bool transparent_hugepage_enabled(struct vm_area_struct *vma);
+
 #define transparent_hugepage_use_zero_page()				\
 	(transparent_hugepage_flags &					\
 	 (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG))
@@ -257,6 +263,11 @@ static inline bool thp_migration_supported(void)
 
 #define hpage_nr_pages(x) 1
 
+static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
+{
+	return false;
+}
+
 static inline bool transparent_hugepage_enabled(struct vm_area_struct *vma)
 {
 	return false;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 55478ab3c83b..f64733c23067 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -62,6 +62,16 @@ static struct shrinker deferred_split_shrinker;
 static atomic_t huge_zero_refcount;
 struct page *huge_zero_page __read_mostly;
 
+bool transparent_hugepage_enabled(struct vm_area_struct *vma)
+{
+	if (vma_is_anonymous(vma))
+		return __transparent_hugepage_enabled(vma);
+	if (shmem_mapping(vma->vm_file->f_mapping) && shmem_huge_enabled(vma))
+		return __transparent_hugepage_enabled(vma);
+
+	return false;
+}
+
 static struct page *get_huge_zero_page(void)
 {
 	struct page *zero_page;
@@ -1303,7 +1313,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 	get_page(page);
 	spin_unlock(vmf->ptl);
 alloc:
-	if (transparent_hugepage_enabled(vma) &&
+	if (__transparent_hugepage_enabled(vma) &&
 	    !transparent_hugepage_debug_cow()) {
 		huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
 		new_page = alloc_pages_vma(huge_gfp, HPAGE_PMD_ORDER, vma,
diff --git a/mm/memory.c b/mm/memory.c
index 4ad2d293ddc2..3c2716ec7fbd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3830,7 +3830,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 	vmf.pud = pud_alloc(mm, p4d, address);
 	if (!vmf.pud)
 		return VM_FAULT_OOM;
-	if (pud_none(*vmf.pud) && transparent_hugepage_enabled(vma)) {
+	if (pud_none(*vmf.pud) && __transparent_hugepage_enabled(vma)) {
 		ret = create_huge_pud(&vmf);
 		if (!(ret & VM_FAULT_FALLBACK))
 			return ret;
@@ -3856,7 +3856,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 	vmf.pmd = pmd_alloc(mm, vmf.pud, address);
 	if (!vmf.pmd)
 		return VM_FAULT_OOM;
-	if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
+	if (pmd_none(*vmf.pmd) && __transparent_hugepage_enabled(vma)) {
 		ret = create_huge_pmd(&vmf);
 		if (!(ret & VM_FAULT_FALLBACK))
 			return ret;
-- 
2.19.1

* [RFC PATCH 3/3] mm, proc: report PR_SET_THP_DISABLE in proc
  2018-11-20 10:35 [RFC PATCH 0/3] THP eligibility reporting via proc Michal Hocko
  2018-11-20 10:35 ` [RFC PATCH 1/3] mm, proc: be more verbose about unstable VMA flags in /proc/<pid>/smaps Michal Hocko
  2018-11-20 10:35 ` [RFC PATCH 2/3] mm, thp, proc: report THP eligibility for each vma Michal Hocko
@ 2018-11-20 10:35 ` Michal Hocko
  2018-11-20 11:42   ` Michal Hocko
                     ` (2 more replies)
  2018-12-07 10:55 ` [RFC PATCH 0/3] THP eligibility reporting via proc Michal Hocko
  3 siblings, 3 replies; 27+ messages in thread
From: Michal Hocko @ 2018-11-20 10:35 UTC (permalink / raw)
  To: linux-api; +Cc: Andrew Morton, Alexey Dobriyan, linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

David Rientjes has reported that 1860033237d4 ("mm: make
PR_SET_THP_DISABLE immediately active") has changed the way
we report THPable VMAs to the userspace. Their monitoring tool is
triggering false alarms on PR_SET_THP_DISABLE tasks because it considers
an insufficient THP usage as a memory fragmentation or memory
pressure issue.

Before the said commit each newly created VMA inherited VM_NOHUGEPAGE
flag and that got exposed to the userspace via /proc/<pid>/smaps file.
This implementation had its downsides as explained in the commit message
but it is true that the userspace doesn't have any means to query for
the process wide THP enabled/disabled status.

PR_SET_THP_DISABLE is a process wide flag so it makes a lot of sense
to export it in the process wide context rather than per-vma. Introduce
a new field in /proc/<pid>/status which exports this status. If
PR_SET_THP_DISABLE is used then it reports false, the same as when THP is
not compiled in. It doesn't consider the global THP status because we
already export that information via sysfs.
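
As a rough sketch of how a tool might consume the new field, the helper
below scans a status formatted stream for "THP_enabled:". It assumes the
field name and format proposed in this patch; read_thp_enabled is a
hypothetical name.

```c
#include <stdio.h>

/*
 * Scan a /proc/<pid>/status formatted stream for the "THP_enabled:"
 * field. Returns 1 or 0 for the reported value, or -1 if the field is
 * absent (for example on a kernel without this patch).
 */
int read_thp_enabled(FILE *f)
{
	char line[256];
	int val;

	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "THP_enabled: %d", &val) == 1)
			return val != 0;
	}
	return -1;
}
```

A return value of 1 only says that THP is compiled in and
PR_SET_THP_DISABLE is not set. The global THP setting still has to be
checked via sysfs separately.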

Fixes: 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active")
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 Documentation/filesystems/proc.txt |  3 +++
 fs/proc/array.c                    | 10 ++++++++++
 2 files changed, 13 insertions(+)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 06562bab509a..7995e9322889 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -182,6 +182,7 @@ For example, to get the status information of a process, all you have to do is
   VmSwap:        0 kB
   HugetlbPages:          0 kB
   CoreDumping:    0
+  THP_enabled:	  1
   Threads:        1
   SigQ:   0/28578
   SigPnd: 0000000000000000
@@ -256,6 +257,8 @@ Table 1-2: Contents of the status files (as of 4.8)
  HugetlbPages                size of hugetlb memory portions
  CoreDumping                 process's memory is currently being dumped
                              (killing the process may lead to a corrupted core)
+ THP_enabled		     process is allowed to use THP (returns 0 when
+			     PR_SET_THP_DISABLE is set on the process)
  Threads                     number of threads
  SigQ                        number of signals queued/max. number for queue
  SigPnd                      bitmap of pending signals for the thread
diff --git a/fs/proc/array.c b/fs/proc/array.c
index 0ceb3b6b37e7..9d428d5a0ac8 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -392,6 +392,15 @@ static inline void task_core_dumping(struct seq_file *m, struct mm_struct *mm)
 	seq_putc(m, '\n');
 }
 
+static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm)
+{
+	bool thp_enabled = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE);
+
+	if (thp_enabled)
+		thp_enabled = !test_bit(MMF_DISABLE_THP, &mm->flags);
+	seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
+}
+
 int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
 			struct pid *pid, struct task_struct *task)
 {
@@ -406,6 +415,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
 	if (mm) {
 		task_mem(m, mm);
 		task_core_dumping(m, mm);
+		task_thp_status(m, mm);
 		mmput(mm);
 	}
 	task_sig(m, task);
-- 
2.19.1

* Re: [RFC PATCH 1/3] mm, proc: be more verbose about unstable VMA flags in /proc/<pid>/smaps
  2018-11-20 10:35 ` [RFC PATCH 1/3] mm, proc: be more verbose about unstable VMA flags in /proc/<pid>/smaps Michal Hocko
@ 2018-11-20 10:51   ` Jan Kara
  2018-11-20 11:41     ` Michal Hocko
  2018-11-21  0:01     ` David Rientjes
  2018-11-20 18:32   ` Dan Williams
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 27+ messages in thread
From: Jan Kara @ 2018-11-20 10:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-api, Andrew Morton, Alexey Dobriyan, linux-mm, LKML,
	Michal Hocko, Jan Kara, Dan Williams, David Rientjes

On Tue 20-11-18 11:35:13, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Even though vma flags exported via /proc/<pid>/smaps are explicitly
> documented to be not guaranteed for future compatibility the warning
> doesn't go far enough because it doesn't mention semantic changes to
> those flags. And they are important as well because these flags are
> a deep implementation internal to the MM code and the semantic might
> change at any time.
> 
> Let's consider two recent examples:
> http://lkml.kernel.org/r/20181002100531.GC4135@quack2.suse.cz
> : commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has
> : removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the
> : mean time certain customer of ours started poking into /proc/<pid>/smaps
> : and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA
> : flags, the application just fails to start complaining that DAX support is
> : missing in the kernel.
> 
> http://lkml.kernel.org/r/alpine.DEB.2.21.1809241054050.224429@chino.kir.corp.google.com
> : Commit 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active")
> : introduced a regression in that userspace cannot always determine the set
> : of vmas where thp is ineligible.
> : Userspace relies on the "nh" flag being emitted as part of /proc/pid/smaps
> : to determine if a vma is eligible to be backed by hugepages.
> : Previous to this commit, prctl(PR_SET_THP_DISABLE, 1) would cause thp to
> : be disabled and emit "nh" as a flag for the corresponding vmas as part of
> : /proc/pid/smaps.  After the commit, thp is disabled by means of an mm
> : flag and "nh" is not emitted.
> : This causes smaps parsing libraries to assume a vma is eligible for thp
> : and ends up puzzling the user on why its memory is not backed by thp.
> 
> In both cases userspace was relying on a semantic of a specific VMA
> flag. The primary reason why that happened is a lack of a proper
> interface. While this has been worked on and it will be fixed properly,
> it seems that our wording could see some refinement and be more vocal
> about semantic aspect of these flags as well.
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: David Rientjes <rientjes@google.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Honestly, it just shows that no amount of documentation is going to stop
userspace from abusing API that's exposing too much if there's no better
alternative. But this is a good clarification regardless. So feel free to
add:

Acked-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  Documentation/filesystems/proc.txt | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> index 12a5e6e693b6..b1fda309f067 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -496,7 +496,9 @@ flags associated with the particular virtual memory area in two letter encoded
>  
>  Note that there is no guarantee that every flag and associated mnemonic will
>  be present in all further kernel releases. Things get changed, the flags may
> -be vanished or the reverse -- new added.
> +be vanished or the reverse -- new added. Interpretation of their meaning
> +might change in future as well. So each consumer of these flags has to
> +follow each specific kernel version for the exact semantic.
>  
>  This file is only present if the CONFIG_MMU kernel configuration option is
>  enabled.
> -- 
> 2.19.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [RFC PATCH 1/3] mm, proc: be more verbose about unstable VMA flags in /proc/<pid>/smaps
  2018-11-20 10:51   ` Jan Kara
@ 2018-11-20 11:41     ` Michal Hocko
  2018-11-21  0:01     ` David Rientjes
  1 sibling, 0 replies; 27+ messages in thread
From: Michal Hocko @ 2018-11-20 11:41 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-api, Andrew Morton, Alexey Dobriyan, linux-mm, LKML,
	Dan Williams, David Rientjes

On Tue 20-11-18 11:51:35, Jan Kara wrote:
> Honestly, it just shows that no amount of documentation is going to stop
> userspace from abusing API that's exposing too much if there's no better
> alternative.

Yeah, I agree. And we should never have exposed such low level stuff in the
first place. But, well, this ship has already sailed...

> But this is a good clarification regardless. So feel free to
> add:
> 
> Acked-by: Jan Kara <jack@suse.cz>

Thanks!
-- 
Michal Hocko
SUSE Labs

* Re: [RFC PATCH 2/3] mm, thp, proc: report THP eligibility for each vma
  2018-11-20 10:35 ` [RFC PATCH 2/3] mm, thp, proc: report THP eligibility for each vma Michal Hocko
@ 2018-11-20 11:42   ` Michal Hocko
  2018-11-23 15:07   ` Vlastimil Babka
  1 sibling, 0 replies; 27+ messages in thread
From: Michal Hocko @ 2018-11-20 11:42 UTC (permalink / raw)
  To: linux-api; +Cc: Andrew Morton, Alexey Dobriyan, linux-mm, LKML, David Rientjes

Damn, David somehow didn't make it to the CC list. Sorry about that.

On Tue 20-11-18 11:35:14, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Userspace falls short when trying to find out whether a specific memory
> range is eligible for THP. There are usecases that would like to know
> that
> http://lkml.kernel.org/r/alpine.DEB.2.21.1809251248450.50347@chino.kir.corp.google.com
> : This is used to identify heap mappings that should be able to fault thp
> : but do not, and they normally point to a low-on-memory or fragmentation
> : issue.
> 
> The only way to deduce this now is to query for hg resp. nh flags and
> confronting the state with the global setting. Except that there is
> also PR_SET_THP_DISABLE that might change the picture. So the final
> logic is not trivial. Moreover the eligibility of the vma depends on
> the type of VMA as well. In the past we have supported only anonymous
> memory VMAs but things have changed and shmem based vmas are supported
> as well these days and the query logic gets even more complicated
> because the eligibility depends on the mount option and another global
> configuration knob.
> 
> Simplify the current state and report the THP eligibility in
> /proc/<pid>/smaps for each existing vma. Reuse transparent_hugepage_enabled
> for this purpose. The original implementation of this function assumes
> that the caller knows that the vma itself is supported for THP so make
> the core checks into __transparent_hugepage_enabled and use it for
> existing callers. __show_smap just use the new transparent_hugepage_enabled
> which also checks the vma support status (please note that this one has
> to be out of line due to include dependency issues).
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  Documentation/filesystems/proc.txt |  3 +++
>  fs/proc/task_mmu.c                 |  2 ++
>  include/linux/huge_mm.h            | 13 ++++++++++++-
>  mm/huge_memory.c                   | 12 +++++++++++-
>  mm/memory.c                        |  4 ++--
>  5 files changed, 30 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> index b1fda309f067..06562bab509a 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -425,6 +425,7 @@ SwapPss:               0 kB
>  KernelPageSize:        4 kB
>  MMUPageSize:           4 kB
>  Locked:                0 kB
> +THPeligible:           0
>  VmFlags: rd ex mr mw me dw
>  
>  the first of these lines shows the same information as is displayed for the
> @@ -462,6 +463,8 @@ replaced by copy-on-write) part of the underlying shmem object out on swap.
>  "SwapPss" shows proportional swap share of this mapping. Unlike "Swap", this
>  does not take into account swapped out page of underlying shmem objects.
>  "Locked" indicates whether the mapping is locked in memory or not.
> +"THPeligible" indicates whether the mapping is eligible for THP pages - 1 if
> +true, 0 otherwise.
>  
>  "VmFlags" field deserves a separate description. This member represents the kernel
>  flags associated with the particular virtual memory area in two letter encoded
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 47c3764c469b..c9f160eb9fbc 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -790,6 +790,8 @@ static int show_smap(struct seq_file *m, void *v)
>  
>  	__show_smap(m, &mss);
>  
> +	seq_printf(m, "THPeligible:    %d\n", transparent_hugepage_enabled(vma));
> +
>  	if (arch_pkeys_enabled())
>  		seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
>  	show_smap_vma_flags(m, vma);
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 4663ee96cf59..381e872bfde0 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -93,7 +93,11 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
>  
>  extern unsigned long transparent_hugepage_flags;
>  
> -static inline bool transparent_hugepage_enabled(struct vm_area_struct *vma)
> +/*
> + * to be used on vmas which are known to support THP.
> + * Use transparent_hugepage_enabled otherwise
> + */
> +static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
>  {
>  	if (vma->vm_flags & VM_NOHUGEPAGE)
>  		return false;
> @@ -117,6 +121,8 @@ static inline bool transparent_hugepage_enabled(struct vm_area_struct *vma)
>  	return false;
>  }
>  
> +bool transparent_hugepage_enabled(struct vm_area_struct *vma);
> +
>  #define transparent_hugepage_use_zero_page()				\
>  	(transparent_hugepage_flags &					\
>  	 (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG))
> @@ -257,6 +263,11 @@ static inline bool thp_migration_supported(void)
>  
>  #define hpage_nr_pages(x) 1
>  
> +static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
> +{
> +	return false;
> +}
> +
>  static inline bool transparent_hugepage_enabled(struct vm_area_struct *vma)
>  {
>  	return false;
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 55478ab3c83b..f64733c23067 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -62,6 +62,16 @@ static struct shrinker deferred_split_shrinker;
>  static atomic_t huge_zero_refcount;
>  struct page *huge_zero_page __read_mostly;
>  
> +bool transparent_hugepage_enabled(struct vm_area_struct *vma)
> +{
> +	if (vma_is_anonymous(vma))
> +		return __transparent_hugepage_enabled(vma);
> +	if (shmem_mapping(vma->vm_file->f_mapping) && shmem_huge_enabled(vma))
> +		return __transparent_hugepage_enabled(vma);
> +
> +	return false;
> +}
> +
>  static struct page *get_huge_zero_page(void)
>  {
>  	struct page *zero_page;
> @@ -1303,7 +1313,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
>  	get_page(page);
>  	spin_unlock(vmf->ptl);
>  alloc:
> -	if (transparent_hugepage_enabled(vma) &&
> +	if (__transparent_hugepage_enabled(vma) &&
>  	    !transparent_hugepage_debug_cow()) {
>  		huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
>  		new_page = alloc_pages_vma(huge_gfp, HPAGE_PMD_ORDER, vma,
> diff --git a/mm/memory.c b/mm/memory.c
> index 4ad2d293ddc2..3c2716ec7fbd 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3830,7 +3830,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
>  	vmf.pud = pud_alloc(mm, p4d, address);
>  	if (!vmf.pud)
>  		return VM_FAULT_OOM;
> -	if (pud_none(*vmf.pud) && transparent_hugepage_enabled(vma)) {
> +	if (pud_none(*vmf.pud) && __transparent_hugepage_enabled(vma)) {
>  		ret = create_huge_pud(&vmf);
>  		if (!(ret & VM_FAULT_FALLBACK))
>  			return ret;
> @@ -3856,7 +3856,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
>  	vmf.pmd = pmd_alloc(mm, vmf.pud, address);
>  	if (!vmf.pmd)
>  		return VM_FAULT_OOM;
> -	if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
> +	if (pmd_none(*vmf.pmd) && __transparent_hugepage_enabled(vma)) {
>  		ret = create_huge_pmd(&vmf);
>  		if (!(ret & VM_FAULT_FALLBACK))
>  			return ret;
> -- 
> 2.19.1

-- 
Michal Hocko
SUSE Labs

* Re: [RFC PATCH 3/3] mm, proc: report PR_SET_THP_DISABLE in proc
  2018-11-20 10:35 ` [RFC PATCH 3/3] mm, proc: report PR_SET_THP_DISABLE in proc Michal Hocko
@ 2018-11-20 11:42   ` Michal Hocko
  2018-11-23 15:49   ` Vlastimil Babka
  2018-11-27  0:33   ` William Kucharski
  2 siblings, 0 replies; 27+ messages in thread
From: Michal Hocko @ 2018-11-20 11:42 UTC (permalink / raw)
  To: linux-api; +Cc: Andrew Morton, Alexey Dobriyan, linux-mm, LKML, David Rientjes

Damn, David somehow didn't make it to the CC list. Sorry about that.

On Tue 20-11-18 11:35:15, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> David Rientjes has reported that 1860033237d4 ("mm: make
> PR_SET_THP_DISABLE immediately active") has changed the way how
> we report THPable VMAs to the userspace. Their monitoring tool is
> triggering false alarms on PR_SET_THP_DISABLE tasks because it considers
> an insufficient THP usage as a memory fragmentation resp. memory
> pressure issue.
> 
> Before the said commit each newly created VMA inherited VM_NOHUGEPAGE
> flag and that got exposed to the userspace via /proc/<pid>/smaps file.
> This implementation had its downsides as explained in the commit message
> but it is true that the userspace doesn't have any means to query for
> the process wide THP enabled/disabled status.
> 
> PR_SET_THP_DISABLE is a process wide flag so it makes a lot of sense
> to export in the process wide context rather than per-vma. Introduce
> a new field to /proc/<pid>/status which export this status.  If
> PR_SET_THP_DISABLE is used then it reports false same as when the THP is
> not compiled in. It doesn't consider the global THP status because we
> already export that information via sysfs
> 
> Fixes: 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active")
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  Documentation/filesystems/proc.txt |  3 +++
>  fs/proc/array.c                    | 10 ++++++++++
>  2 files changed, 13 insertions(+)
> 
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> index 06562bab509a..7995e9322889 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -182,6 +182,7 @@ For example, to get the status information of a process, all you have to do is
>    VmSwap:        0 kB
>    HugetlbPages:          0 kB
>    CoreDumping:    0
> +  THP_enabled:	  1
>    Threads:        1
>    SigQ:   0/28578
>    SigPnd: 0000000000000000
> @@ -256,6 +257,8 @@ Table 1-2: Contents of the status files (as of 4.8)
>   HugetlbPages                size of hugetlb memory portions
>   CoreDumping                 process's memory is currently being dumped
>                               (killing the process may lead to a corrupted core)
> + THP_enabled		     process is allowed to use THP (returns 0 when
> +			     PR_SET_THP_DISABLE is set on the process)
>   Threads                     number of threads
>   SigQ                        number of signals queued/max. number for queue
>   SigPnd                      bitmap of pending signals for the thread
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index 0ceb3b6b37e7..9d428d5a0ac8 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -392,6 +392,15 @@ static inline void task_core_dumping(struct seq_file *m, struct mm_struct *mm)
>  	seq_putc(m, '\n');
>  }
>  
> +static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm)
> +{
> +	bool thp_enabled = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE);
> +
> +	if (thp_enabled)
> +		thp_enabled = !test_bit(MMF_DISABLE_THP, &mm->flags);
> +	seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
> +}
> +
>  int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
>  			struct pid *pid, struct task_struct *task)
>  {
> @@ -406,6 +415,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
>  	if (mm) {
>  		task_mem(m, mm);
>  		task_core_dumping(m, mm);
> +		task_thp_status(m, mm);
>  		mmput(mm);
>  	}
>  	task_sig(m, task);
> -- 
> 2.19.1
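
For illustration, a monitoring tool could consume the new field roughly
as sketched below. This is only a sketch: the helper names
parse_thp_enabled_line() and thp_enabled_from_status() are made up for
the example, and the "THP_enabled:" format is assumed to stay as printed
by task_thp_status() in this RFC. A return value of -1 covers kernels
that do not export the field at all.

```c
#include <stdio.h>

/*
 * Parse one line of /proc/<pid>/status.
 * Returns 1 or 0 for a "THP_enabled:" line, -1 for any other line.
 * (Hypothetical helper, not part of the patch.)
 */
static int parse_thp_enabled_line(const char *line)
{
	int val;

	/* The blank in the format also matches the tab the kernel prints. */
	if (sscanf(line, "THP_enabled: %d", &val) == 1)
		return val ? 1 : 0;
	return -1;
}

/*
 * Returns 1 if the process may use THP, 0 if it may not, and -1 if the
 * file cannot be opened or the field is missing (e.g. an older kernel).
 * (Hypothetical helper, not part of the patch.)
 */
static int thp_enabled_from_status(const char *path)
{
	char line[256];
	int ret = -1;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		ret = parse_thp_enabled_line(line);
		if (ret >= 0)
			break;
	}
	fclose(f);
	return ret;
}
```

A caller would pass "/proc/self/status" or "/proc/<pid>/status" and
should treat -1 as "unknown" rather than as "disabled".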

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/3] mm, proc: be more verbose about unstable VMA flags in /proc/<pid>/smaps
  2018-11-20 10:35 ` [RFC PATCH 1/3] mm, proc: be more verbose about unstable VMA flags in /proc/<pid>/smaps Michal Hocko
  2018-11-20 10:51   ` Jan Kara
@ 2018-11-20 18:32   ` Dan Williams
  2018-11-21  7:05     ` Michal Hocko
  2018-11-21 17:54   ` Mike Rapoport
  2018-11-23 13:47   ` Vlastimil Babka
  3 siblings, 1 reply; 27+ messages in thread
From: Dan Williams @ 2018-11-20 18:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Linux API, Andrew Morton, adobriyan, Linux MM,
	Linux Kernel Mailing List, Michal Hocko, Jan Kara,
	David Rientjes

On Tue, Nov 20, 2018 at 2:35 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> From: Michal Hocko <mhocko@suse.com>
>
> Even though vma flags exported via /proc/<pid>/smaps are explicitly
> documented to be not guaranteed for future compatibility the warning
> doesn't go far enough because it doesn't mention semantic changes to
> those flags. And they are important as well because these flags are
> a deep implementation internal to the MM code and the semantic might
> change at any time.
>
> Let's consider two recent examples:
> http://lkml.kernel.org/r/20181002100531.GC4135@quack2.suse.cz
> : commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has
> : removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the
> : mean time certain customer of ours started poking into /proc/<pid>/smaps
> : and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA
> : flags, the application just fails to start complaining that DAX support is
> : missing in the kernel.
>
> http://lkml.kernel.org/r/alpine.DEB.2.21.1809241054050.224429@chino.kir.corp.google.com
> : Commit 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active")
> : introduced a regression in that userspace cannot always determine the set
> : of vmas where thp is ineligible.
> : Userspace relies on the "nh" flag being emitted as part of /proc/pid/smaps
> : to determine if a vma is eligible to be backed by hugepages.
> : Previous to this commit, prctl(PR_SET_THP_DISABLE, 1) would cause thp to
> : be disabled and emit "nh" as a flag for the corresponding vmas as part of
> : /proc/pid/smaps.  After the commit, thp is disabled by means of an mm
> : flag and "nh" is not emitted.
> : This causes smaps parsing libraries to assume a vma is eligible for thp
> : and ends up puzzling the user on why its memory is not backed by thp.
>
> In both cases userspace was relying on a semantic of a specific VMA
> flag. The primary reason why that happened is a lack of a proper
> interface. While this has been worked on and it will be fixed properly,
> it seems that our wording could see some refinement and be more vocal
> about semantic aspect of these flags as well.
>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: David Rientjes <rientjes@google.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  Documentation/filesystems/proc.txt | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> index 12a5e6e693b6..b1fda309f067 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -496,7 +496,9 @@ flags associated with the particular virtual memory area in two letter encoded
>
>  Note that there is no guarantee that every flag and associated mnemonic will
>  be present in all further kernel releases. Things get changed, the flags may
> -be vanished or the reverse -- new added.
> +be vanished or the reverse -- new added. Interpretatation of their meaning
> +might change in future as well. So each consumnent of these flags have to
> +follow each specific kernel version for the exact semantic.

Can we start to claw some of this back? Perhaps with a config option
to hide the flags to put applications on notice? I recall that when I
introduced CONFIG_IO_STRICT_DEVMEM it caused enough regressions that
distros did not enable it, but now a few years out I'm finding that it
is enabled in more places.

In any event,

Acked-by: Dan Williams <dan.j.williams@intel.com>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/3] mm, proc: be more verbose about unstable VMA flags in /proc/<pid>/smaps
  2018-11-20 10:51   ` Jan Kara
  2018-11-20 11:41     ` Michal Hocko
@ 2018-11-21  0:01     ` David Rientjes
  2018-11-21  6:56       ` Michal Hocko
  1 sibling, 1 reply; 27+ messages in thread
From: David Rientjes @ 2018-11-21  0:01 UTC (permalink / raw)
  To: Jan Kara
  Cc: Michal Hocko, linux-api, Andrew Morton, Alexey Dobriyan,
	linux-mm, LKML, Michal Hocko, Dan Williams

On Tue, 20 Nov 2018, Jan Kara wrote:

> > Even though vma flags exported via /proc/<pid>/smaps are explicitly
> > documented to be not guaranteed for future compatibility the warning
> > doesn't go far enough because it doesn't mention semantic changes to
> > those flags. And they are important as well because these flags are
> > a deep implementation internal to the MM code and the semantic might
> > change at any time.
> > 
> > Let's consider two recent examples:
> > http://lkml.kernel.org/r/20181002100531.GC4135@quack2.suse.cz
> > : commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has
> > : removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the
> > : mean time certain customer of ours started poking into /proc/<pid>/smaps
> > : and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA
> > : flags, the application just fails to start complaining that DAX support is
> > : missing in the kernel.
> > 
> > http://lkml.kernel.org/r/alpine.DEB.2.21.1809241054050.224429@chino.kir.corp.google.com
> > : Commit 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active")
> > : introduced a regression in that userspace cannot always determine the set
> > : of vmas where thp is ineligible.
> > : Userspace relies on the "nh" flag being emitted as part of /proc/pid/smaps
> > : to determine if a vma is eligible to be backed by hugepages.
> > : Previous to this commit, prctl(PR_SET_THP_DISABLE, 1) would cause thp to
> > : be disabled and emit "nh" as a flag for the corresponding vmas as part of
> > : /proc/pid/smaps.  After the commit, thp is disabled by means of an mm
> > : flag and "nh" is not emitted.
> > : This causes smaps parsing libraries to assume a vma is eligible for thp
> > : and ends up puzzling the user on why its memory is not backed by thp.
> > 
> > In both cases userspace was relying on a semantic of a specific VMA
> > flag. The primary reason why that happened is a lack of a proper
> > interface. While this has been worked on and it will be fixed properly,
> > it seems that our wording could see some refinement and be more vocal
> > about semantic aspect of these flags as well.
> > 
> > Cc: Jan Kara <jack@suse.cz>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: David Rientjes <rientjes@google.com>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> Honestly, it just shows that no amount of documentation is going to stop
> userspace from abusing API that's exposing too much if there's no better
> alternative. But this is a good clarification regardless. So feel free to
> add:
> 
> Acked-by: Jan Kara <jack@suse.cz>
> 

I'm not sure what is expected of a userspace developer who finds they have 
a single way to determine if something is enabled/disabled.  Should they 
refer to the documentation and see that the flag may be unstable so they 
write a kernel patch and have it merged upstream before using it?  What to 
do when they don't control the kernel version they are running on?

Anyway, mentioning that the vm flags here only have meaning depending on 
the kernel version seems like a worthwhile addition:

Acked-by: David Rientjes <rientjes@google.com>
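
For context, the check that such a tool ends up with is typically just a
scan of the VmFlags line, something like the sketch below. vmflags_has()
is a made-up helper and the two-letter, whitespace-separated format is
assumed from the current proc.txt; as the patch says, neither the
presence nor the meaning of a mnemonic such as "nh" is guaranteed across
kernel versions.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if a "VmFlags:" line from /proc/<pid>/smaps carries the given
 * two-letter mnemonic (e.g. "nh"), 0 if it does not, and -1 if the line
 * is not a VmFlags line at all.
 * (Hypothetical helper, for illustration only.)
 */
static int vmflags_has(const char *line, const char *flag)
{
	static const char prefix[] = "VmFlags:";
	static const char seps[] = " \t\n";
	const char *p;

	if (strncmp(line, prefix, sizeof(prefix) - 1) != 0)
		return -1;

	/* Walk the whitespace-separated mnemonics after the prefix. */
	p = line + sizeof(prefix) - 1;
	p += strspn(p, seps);
	while (*p) {
		size_t len = strcspn(p, seps);

		if (len == 2 && strncmp(p, flag, 2) == 0)
			return 1;
		p += len;
		p += strspn(p, seps);
	}
	return 0;
}
```

A result of 0 for "nh" only says the mnemonic was not printed; after
1860033237d4 it no longer implies that the vma is THP eligible.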

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/3] mm, proc: be more verbose about unstable VMA flags in /proc/<pid>/smaps
  2018-11-21  0:01     ` David Rientjes
@ 2018-11-21  6:56       ` Michal Hocko
  0 siblings, 0 replies; 27+ messages in thread
From: Michal Hocko @ 2018-11-21  6:56 UTC (permalink / raw)
  To: David Rientjes
  Cc: Jan Kara, linux-api, Andrew Morton, Alexey Dobriyan, linux-mm,
	LKML, Dan Williams

On Tue 20-11-18 16:01:47, David Rientjes wrote:
> On Tue, 20 Nov 2018, Jan Kara wrote:
> 
> > > Even though vma flags exported via /proc/<pid>/smaps are explicitly
> > > documented to be not guaranteed for future compatibility the warning
> > > doesn't go far enough because it doesn't mention semantic changes to
> > > those flags. And they are important as well because these flags are
> > > a deep implementation internal to the MM code and the semantic might
> > > change at any time.
> > > 
> > > Let's consider two recent examples:
> > > http://lkml.kernel.org/r/20181002100531.GC4135@quack2.suse.cz
> > > : commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has
> > > : removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the
> > > : mean time certain customer of ours started poking into /proc/<pid>/smaps
> > > : and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA
> > > : flags, the application just fails to start complaining that DAX support is
> > > : missing in the kernel.
> > > 
> > > http://lkml.kernel.org/r/alpine.DEB.2.21.1809241054050.224429@chino.kir.corp.google.com
> > > : Commit 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active")
> > > : introduced a regression in that userspace cannot always determine the set
> > > : of vmas where thp is ineligible.
> > > : Userspace relies on the "nh" flag being emitted as part of /proc/pid/smaps
> > > : to determine if a vma is eligible to be backed by hugepages.
> > > : Previous to this commit, prctl(PR_SET_THP_DISABLE, 1) would cause thp to
> > > : be disabled and emit "nh" as a flag for the corresponding vmas as part of
> > > : /proc/pid/smaps.  After the commit, thp is disabled by means of an mm
> > > : flag and "nh" is not emitted.
> > > : This causes smaps parsing libraries to assume a vma is eligible for thp
> > > : and ends up puzzling the user on why its memory is not backed by thp.
> > > 
> > > In both cases userspace was relying on a semantic of a specific VMA
> > > flag. The primary reason why that happened is a lack of a proper
> > > > interface. While this has been worked on and it will be fixed properly,
> > > it seems that our wording could see some refinement and be more vocal
> > > about semantic aspect of these flags as well.
> > > 
> > > Cc: Jan Kara <jack@suse.cz>
> > > Cc: Dan Williams <dan.j.williams@intel.com>
> > > Cc: David Rientjes <rientjes@google.com>
> > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > 
> > Honestly, it just shows that no amount of documentation is going to stop
> > userspace from abusing API that's exposing too much if there's no better
> > alternative. But this is a good clarification regardless. So feel free to
> > add:
> > 
> > Acked-by: Jan Kara <jack@suse.cz>
> > 
> 
> I'm not sure what is expected of a userspace developer who finds they have 
> a single way to determine if something is enabled/disabled.  Should they 
> refer to the documentation and see that the flag may be unstable so they 
> write a kernel patch and have it merged upstream before using it?  What to 
> do when they don't control the kernel version they are running on?

Well, I would treat it as any standard feature request. Ask for the
feature upstream and work with the community to come up with a reasonable
and stable API.

> Anyway, mentioning that the vm flags here only have meaning depending on 
> the kernel version seems like a worthwhile addition:
> 
> Acked-by: David Rientjes <rientjes@google.com>

Thanks!

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/3] mm, proc: be more verbose about unstable VMA flags in /proc/<pid>/smaps
  2018-11-20 18:32   ` Dan Williams
@ 2018-11-21  7:05     ` Michal Hocko
  2018-11-21 18:01       ` Mike Rapoport
  0 siblings, 1 reply; 27+ messages in thread
From: Michal Hocko @ 2018-11-21  7:05 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux API, Andrew Morton, adobriyan, Linux MM,
	Linux Kernel Mailing List, Jan Kara, David Rientjes

On Tue 20-11-18 10:32:07, Dan Williams wrote:
> On Tue, Nov 20, 2018 at 2:35 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > From: Michal Hocko <mhocko@suse.com>
> >
> > Even though vma flags exported via /proc/<pid>/smaps are explicitly
> > documented to be not guaranteed for future compatibility the warning
> > doesn't go far enough because it doesn't mention semantic changes to
> > those flags. And they are important as well because these flags are
> > a deep implementation internal to the MM code and the semantic might
> > change at any time.
> >
> > Let's consider two recent examples:
> > http://lkml.kernel.org/r/20181002100531.GC4135@quack2.suse.cz
> > : commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has
> > : removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the
> > : mean time certain customer of ours started poking into /proc/<pid>/smaps
> > : and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA
> > : flags, the application just fails to start complaining that DAX support is
> > : missing in the kernel.
> >
> > http://lkml.kernel.org/r/alpine.DEB.2.21.1809241054050.224429@chino.kir.corp.google.com
> > : Commit 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active")
> > : introduced a regression in that userspace cannot always determine the set
> > : of vmas where thp is ineligible.
> > : Userspace relies on the "nh" flag being emitted as part of /proc/pid/smaps
> > : to determine if a vma is eligible to be backed by hugepages.
> > : Previous to this commit, prctl(PR_SET_THP_DISABLE, 1) would cause thp to
> > : be disabled and emit "nh" as a flag for the corresponding vmas as part of
> > : /proc/pid/smaps.  After the commit, thp is disabled by means of an mm
> > : flag and "nh" is not emitted.
> > : This causes smaps parsing libraries to assume a vma is eligible for thp
> > : and ends up puzzling the user on why its memory is not backed by thp.
> >
> > In both cases userspace was relying on a semantic of a specific VMA
> > flag. The primary reason why that happened is a lack of a proper
> > > interface. While this has been worked on and it will be fixed properly,
> > it seems that our wording could see some refinement and be more vocal
> > about semantic aspect of these flags as well.
> >
> > Cc: Jan Kara <jack@suse.cz>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: David Rientjes <rientjes@google.com>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  Documentation/filesystems/proc.txt | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> > index 12a5e6e693b6..b1fda309f067 100644
> > --- a/Documentation/filesystems/proc.txt
> > +++ b/Documentation/filesystems/proc.txt
> > @@ -496,7 +496,9 @@ flags associated with the particular virtual memory area in two letter encoded
> >
> >  Note that there is no guarantee that every flag and associated mnemonic will
> >  be present in all further kernel releases. Things get changed, the flags may
> > -be vanished or the reverse -- new added.
> > +be vanished or the reverse -- new added. Interpretatation of their meaning
> > +might change in future as well. So each consumnent of these flags have to
> > +follow each specific kernel version for the exact semantic.
> 
> Can we start to claw some of this back? Perhaps with a config option
> to hide the flags to put applications on notice?

I would love to. My knowledge of CRIU is very minimal, but my
understanding is that this is the primary consumer of those flags. And
checkpointing is so close to the specific kernel version that I assume
that this abuse is somehow justified. We can hide it behind
CONFIG_CHECKPOINT_RESTORE but is that going to help? I presume that many
distro kernels will have the config enabled.

> I recall that when I
> introduced CONFIG_IO_STRICT_DEVMEM it caused enough regressions that
> distros did not enable it, but now a few years out I'm finding that it
> is enabled in more places.
> 
> In any event,
> 
> Acked-by: Dan Williams <dan.j.williams@intel.com>

Thanks!

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/3] mm, proc: be more verbose about unstable VMA flags in /proc/<pid>/smaps
  2018-11-20 10:35 ` [RFC PATCH 1/3] mm, proc: be more verbose about unstable VMA flags in /proc/<pid>/smaps Michal Hocko
  2018-11-20 10:51   ` Jan Kara
  2018-11-20 18:32   ` Dan Williams
@ 2018-11-21 17:54   ` Mike Rapoport
  2018-11-21 17:58     ` Michal Hocko
  2018-11-23 13:47   ` Vlastimil Babka
  3 siblings, 1 reply; 27+ messages in thread
From: Mike Rapoport @ 2018-11-21 17:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-api, Andrew Morton, Alexey Dobriyan, linux-mm, LKML,
	Michal Hocko, Jan Kara, Dan Williams, David Rientjes

On Tue, Nov 20, 2018 at 11:35:13AM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Even though vma flags exported via /proc/<pid>/smaps are explicitly
> documented to be not guaranteed for future compatibility the warning
> doesn't go far enough because it doesn't mention semantic changes to
> those flags. And they are important as well because these flags are
> a deep implementation internal to the MM code and the semantic might
> change at any time.
> 
> Let's consider two recent examples:
> http://lkml.kernel.org/r/20181002100531.GC4135@quack2.suse.cz
> : commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has
> : removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the
> : mean time certain customer of ours started poking into /proc/<pid>/smaps
> : and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA
> : flags, the application just fails to start complaining that DAX support is
> : missing in the kernel.
> 
> http://lkml.kernel.org/r/alpine.DEB.2.21.1809241054050.224429@chino.kir.corp.google.com
> : Commit 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active")
> : introduced a regression in that userspace cannot always determine the set
> : of vmas where thp is ineligible.
> : Userspace relies on the "nh" flag being emitted as part of /proc/pid/smaps
> : to determine if a vma is eligible to be backed by hugepages.
> : Previous to this commit, prctl(PR_SET_THP_DISABLE, 1) would cause thp to
> : be disabled and emit "nh" as a flag for the corresponding vmas as part of
> : /proc/pid/smaps.  After the commit, thp is disabled by means of an mm
> : flag and "nh" is not emitted.
> : This causes smaps parsing libraries to assume a vma is eligible for thp
> : and ends up puzzling the user on why its memory is not backed by thp.
> 
> In both cases userspace was relying on a semantic of a specific VMA
> flag. The primary reason why that happened is a lack of a proper
> interface. While this has been worked on and it will be fixed properly,
> it seems that our wording could see some refinement and be more vocal
> about semantic aspect of these flags as well.
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: David Rientjes <rientjes@google.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  Documentation/filesystems/proc.txt | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> index 12a5e6e693b6..b1fda309f067 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -496,7 +496,9 @@ flags associated with the particular virtual memory area in two letter encoded
> 
>  Note that there is no guarantee that every flag and associated mnemonic will
>  be present in all further kernel releases. Things get changed, the flags may
> -be vanished or the reverse -- new added.
> +be vanished or the reverse -- new added. Interpretatation of their meaning
> +might change in future as well. So each consumnent of these flags have to

                                           consumer?                 has

> +follow each specific kernel version for the exact semantic.
> 
>  This file is only present if the CONFIG_MMU kernel configuration option is
>  enabled.
> -- 
> 2.19.1
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/3] mm, proc: be more verbose about unstable VMA flags in /proc/<pid>/smaps
  2018-11-21 17:54   ` Mike Rapoport
@ 2018-11-21 17:58     ` Michal Hocko
  0 siblings, 0 replies; 27+ messages in thread
From: Michal Hocko @ 2018-11-21 17:58 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-api, Andrew Morton, Alexey Dobriyan, linux-mm, LKML,
	Jan Kara, Dan Williams, David Rientjes

On Wed 21-11-18 18:54:28, Mike Rapoport wrote:
> On Tue, Nov 20, 2018 at 11:35:13AM +0100, Michal Hocko wrote:
[...]
> > diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> > index 12a5e6e693b6..b1fda309f067 100644
> > --- a/Documentation/filesystems/proc.txt
> > +++ b/Documentation/filesystems/proc.txt
> > @@ -496,7 +496,9 @@ flags associated with the particular virtual memory area in two letter encoded
> > 
> >  Note that there is no guarantee that every flag and associated mnemonic will
> >  be present in all further kernel releases. Things get changed, the flags may
> > -be vanished or the reverse -- new added.
> > +be vanished or the reverse -- new added. Interpretatation of their meaning
> > +might change in future as well. So each consumnent of these flags have to
> 
>                                            consumer?                 has

fixed. Thanks!

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/3] mm, proc: be more verbose about unstable VMA flags in /proc/<pid>/smaps
  2018-11-21  7:05     ` Michal Hocko
@ 2018-11-21 18:01       ` Mike Rapoport
  0 siblings, 0 replies; 27+ messages in thread
From: Mike Rapoport @ 2018-11-21 18:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dan Williams, Linux API, Andrew Morton, adobriyan, Linux MM,
	Linux Kernel Mailing List, Jan Kara, David Rientjes

On Wed, Nov 21, 2018 at 08:05:00AM +0100, Michal Hocko wrote:
> On Tue 20-11-18 10:32:07, Dan Williams wrote:
> > On Tue, Nov 20, 2018 at 2:35 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > From: Michal Hocko <mhocko@suse.com>
> > >
> > > Even though vma flags exported via /proc/<pid>/smaps are explicitly
> > > documented to be not guaranteed for future compatibility the warning
> > > doesn't go far enough because it doesn't mention semantic changes to
> > > those flags. And they are important as well because these flags are
> > > a deep implementation internal to the MM code and the semantic might
> > > change at any time.
> > >
> > > Let's consider two recent examples:
> > > http://lkml.kernel.org/r/20181002100531.GC4135@quack2.suse.cz
> > > : commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has
> > > : removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the
> > > : mean time certain customer of ours started poking into /proc/<pid>/smaps
> > > : and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA
> > > : flags, the application just fails to start complaining that DAX support is
> > > : missing in the kernel.
> > >
> > > http://lkml.kernel.org/r/alpine.DEB.2.21.1809241054050.224429@chino.kir.corp.google.com
> > > : Commit 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active")
> > > : introduced a regression in that userspace cannot always determine the set
> > > : of vmas where thp is ineligible.
> > > : Userspace relies on the "nh" flag being emitted as part of /proc/pid/smaps
> > > : to determine if a vma is eligible to be backed by hugepages.
> > > : Previous to this commit, prctl(PR_SET_THP_DISABLE, 1) would cause thp to
> > > : be disabled and emit "nh" as a flag for the corresponding vmas as part of
> > > : /proc/pid/smaps.  After the commit, thp is disabled by means of an mm
> > > : flag and "nh" is not emitted.
> > > : This causes smaps parsing libraries to assume a vma is eligible for thp
> > > : and ends up puzzling the user on why its memory is not backed by thp.
> > >
> > > In both cases userspace was relying on a semantic of a specific VMA
> > > flag. The primary reason why that happened is a lack of a proper
> > > > interface. While this has been worked on and it will be fixed properly,
> > > it seems that our wording could see some refinement and be more vocal
> > > about semantic aspect of these flags as well.
> > >
> > > Cc: Jan Kara <jack@suse.cz>
> > > Cc: Dan Williams <dan.j.williams@intel.com>
> > > Cc: David Rientjes <rientjes@google.com>
> > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > ---
> > >  Documentation/filesystems/proc.txt | 4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> > > index 12a5e6e693b6..b1fda309f067 100644
> > > --- a/Documentation/filesystems/proc.txt
> > > +++ b/Documentation/filesystems/proc.txt
> > > @@ -496,7 +496,9 @@ flags associated with the particular virtual memory area in two letter encoded
> > >
> > >  Note that there is no guarantee that every flag and associated mnemonic will
> > >  be present in all further kernel releases. Things get changed, the flags may
> > > -be vanished or the reverse -- new added.
> > > +be vanished or the reverse -- new added. Interpretatation of their meaning
> > > +might change in future as well. So each consumnent of these flags have to
> > > +follow each specific kernel version for the exact semantic.
> > 
> > Can we start to claw some of this back? Perhaps with a config option
> > to hide the flags to put applications on notice?
> 
> I would love to. My knowledge of CRIU is very minimal, but my
> understanding is that this is the primary consumer of those flags. And
> checkpointing is so close to the specific kernel version that I assume
> that this abuse is somehow justified.

CRIU relies on vmflags to recreate exactly the same address space layout at
restore time.

> We can hide it behind CONFIG_CHECKPOINT_RESTORE but is that going to
> help? I presume that many distro kernels will have the config enabled.

They do :)
 
> > I recall that when I
> > introduced CONFIG_IO_STRICT_DEVMEM it caused enough regressions that
> > distros did not enable it, but now a few years out I'm finding that it
> > is enabled in more places.
> > 
> > In any event,
> > 
> > Acked-by: Dan Williams <dan.j.williams@intel.com>

Forgot that in my previous nit-picking e-mail:

Acked-by: Mike Rapoport <rppt@linux.ibm.com>
 
> Thanks!
> 
> -- 
> Michal Hocko
> SUSE Labs
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/3] mm, proc: be more verbose about unstable VMA flags in /proc/<pid>/smaps
  2018-11-20 10:35 ` [RFC PATCH 1/3] mm, proc: be more verbose about unstable VMA flags in /proc/<pid>/smaps Michal Hocko
                     ` (2 preceding siblings ...)
  2018-11-21 17:54   ` Mike Rapoport
@ 2018-11-23 13:47   ` Vlastimil Babka
  3 siblings, 0 replies; 27+ messages in thread
From: Vlastimil Babka @ 2018-11-23 13:47 UTC (permalink / raw)
  To: Michal Hocko, linux-api
  Cc: Andrew Morton, Alexey Dobriyan, linux-mm, LKML, Michal Hocko,
	Jan Kara, Dan Williams, David Rientjes

On 11/20/18 11:35 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Even though vma flags exported via /proc/<pid>/smaps are explicitly
> documented to be not guaranteed for future compatibility the warning
> doesn't go far enough because it doesn't mention semantic changes to
> those flags. And they are important as well because these flags are
> a deep implementation internal to the MM code and the semantic might
> change at any time.
> 
> Let's consider two recent examples:
> http://lkml.kernel.org/r/20181002100531.GC4135@quack2.suse.cz
> : commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has
> : removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the
> : mean time certain customer of ours started poking into /proc/<pid>/smaps
> : and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA
> : flags, the application just fails to start complaining that DAX support is
> : missing in the kernel.
> 
> http://lkml.kernel.org/r/alpine.DEB.2.21.1809241054050.224429@chino.kir.corp.google.com
> : Commit 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active")
> : introduced a regression in that userspace cannot always determine the set
> : of vmas where thp is ineligible.
> : Userspace relies on the "nh" flag being emitted as part of /proc/pid/smaps
> : to determine if a vma is eligible to be backed by hugepages.
> : Previous to this commit, prctl(PR_SET_THP_DISABLE, 1) would cause thp to
> : be disabled and emit "nh" as a flag for the corresponding vmas as part of
> : /proc/pid/smaps.  After the commit, thp is disabled by means of an mm
> : flag and "nh" is not emitted.
> : This causes smaps parsing libraries to assume a vma is eligible for thp
> : and ends up puzzling the user on why its memory is not backed by thp.
> 
> In both cases userspace was relying on a semantic of a specific VMA
> flag. The primary reason why that happened is a lack of a proper
> interface. While this has been worked on and it will be fixed properly,
> it seems that our wording could see some refinement and be more vocal
> about semantic aspect of these flags as well.
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: David Rientjes <rientjes@google.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Agreed, although no amount of docs will override the
do-not-break-userspace rule I'm afraid :)

Acked-by: Vlastimil Babka <vbabka@suse.cz>

On top of typos reported by Mike:

> ---
>  Documentation/filesystems/proc.txt | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> index 12a5e6e693b6..b1fda309f067 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -496,7 +496,9 @@ flags associated with the particular virtual memory area in two letter encoded
>  
>  Note that there is no guarantee that every flag and associated mnemonic will
>  be present in all further kernel releases. Things get changed, the flags may
> -be vanished or the reverse -- new added.
> +be vanished or the reverse -- new added. Interpretatation of their meaning

                                            ^ interpretation

> +might change in future as well. So each consumnent of these flags have to
> +follow each specific kernel version for the exact semantic.
>  
>  This file is only present if the CONFIG_MMU kernel configuration option is
>  enabled.
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 2/3] mm, thp, proc: report THP eligibility for each vma
  2018-11-20 10:35 ` [RFC PATCH 2/3] mm, thp, proc: report THP eligibility for each vma Michal Hocko
  2018-11-20 11:42   ` Michal Hocko
@ 2018-11-23 15:07   ` Vlastimil Babka
  2018-11-23 15:21     ` Michal Hocko
  1 sibling, 1 reply; 27+ messages in thread
From: Vlastimil Babka @ 2018-11-23 15:07 UTC (permalink / raw)
  To: Michal Hocko, linux-api
  Cc: Andrew Morton, Alexey Dobriyan, linux-mm, LKML, Michal Hocko

On 11/20/18 11:35 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Userspace falls short when trying to find out whether a specific memory
> range is eligible for THP. There are usecases that would like to know
> that
> http://lkml.kernel.org/r/alpine.DEB.2.21.1809251248450.50347@chino.kir.corp.google.com
> : This is used to identify heap mappings that should be able to fault thp
> : but do not, and they normally point to a low-on-memory or fragmentation
> : issue.
> 
> The only way to deduce this now is to query for hg resp. nh flags and
> confronting the state with the global setting. Except that there is
> also PR_SET_THP_DISABLE that might change the picture. So the final
> logic is not trivial. Moreover the eligibility of the vma depends on
> the type of VMA as well. In the past we have supported only anonymous
> memory VMAs but things have changed and shmem based vmas are supported
> as well these days and the query logic gets even more complicated
> because the eligibility depends on the mount option and another global
> configuration knob.
> 
> Simplify the current state and report the THP eligibility in
> /proc/<pid>/smaps for each existing vma. Reuse transparent_hugepage_enabled
> for this purpose. The original implementation of this function assumes
> that the caller knows that the vma itself is supported for THP so make
> the core checks into __transparent_hugepage_enabled and use it for
> existing callers. __show_smap just use the new transparent_hugepage_enabled
> which also checks the vma support status (please note that this one has
> to be out of line due to include dependency issues).
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Not thrilled by this, but kernel is always better suited to report this,
than userspace piecing it together from multiple sources, relying on
possibly outdated knowledge of kernel implementation details...

Acked-by: Vlastimil Babka <vbabka@suse.cz>

A nitpick:

> ---
>  Documentation/filesystems/proc.txt |  3 +++
>  fs/proc/task_mmu.c                 |  2 ++
>  include/linux/huge_mm.h            | 13 ++++++++++++-
>  mm/huge_memory.c                   | 12 +++++++++++-
>  mm/memory.c                        |  4 ++--
>  5 files changed, 30 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> index b1fda309f067..06562bab509a 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -425,6 +425,7 @@ SwapPss:               0 kB
>  KernelPageSize:        4 kB
>  MMUPageSize:           4 kB
>  Locked:                0 kB
> +THPeligible:           0

I would use THP_Eligible. There are already fields with underscore in smaps.
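
For illustration only, a minimal sketch of how a monitoring tool might read
the per-VMA field once it exists, instead of decoding VmFlags. It assumes the
field keeps the THPeligible spelling from this RFC (the name is still under
discussion here) and that each VMA block in smaps starts with the usual
"start-end perms ..." header line. parse_thp_eligible is a hypothetical helper
name, not something from the patch.

```python
import re

# Header line of a VMA block in smaps, e.g. "7f0000000000-7f0000200000 rw-p ..."
_VMA_HEADER = re.compile(r"^([0-9a-f]+)-([0-9a-f]+)\s")


def parse_thp_eligible(smaps_text):
    """Return a list of (start, end, eligible) tuples, one per VMA.

    eligible is None when the block carries no THPeligible line (for
    example on a kernel without this patch), so callers can tell
    "unknown" apart from "not eligible".
    """
    result = []
    current = None  # [start, end, eligible] of the VMA being parsed
    for line in smaps_text.splitlines():
        m = _VMA_HEADER.match(line)
        if m:
            # A new VMA header closes the previous block.
            if current is not None:
                result.append(tuple(current))
            current = [int(m.group(1), 16), int(m.group(2), 16), None]
        elif line.startswith("THPeligible:") and current is not None:
            current[2] = line.split(":", 1)[1].strip() == "1"
    if current is not None:
        result.append(tuple(current))
    return result
```

For a live process the input would be the contents of /proc/<pid>/smaps, e.g.
parse_thp_eligible(open("/proc/self/smaps").read()).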

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 2/3] mm, thp, proc: report THP eligibility for each vma
  2018-11-23 15:07   ` Vlastimil Babka
@ 2018-11-23 15:21     ` Michal Hocko
  2018-11-23 15:24       ` Vlastimil Babka
  0 siblings, 1 reply; 27+ messages in thread
From: Michal Hocko @ 2018-11-23 15:21 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: linux-api, Andrew Morton, Alexey Dobriyan, linux-mm, LKML

On Fri 23-11-18 16:07:06, Vlastimil Babka wrote:
> On 11/20/18 11:35 AM, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > Userspace falls short when trying to find out whether a specific memory
> > range is eligible for THP. There are usecases that would like to know
> > that
> > http://lkml.kernel.org/r/alpine.DEB.2.21.1809251248450.50347@chino.kir.corp.google.com
> > : This is used to identify heap mappings that should be able to fault thp
> > : but do not, and they normally point to a low-on-memory or fragmentation
> > : issue.
> > 
> > The only way to deduce this now is to query for hg resp. nh flags and
> > confronting the state with the global setting. Except that there is
> > also PR_SET_THP_DISABLE that might change the picture. So the final
> > logic is not trivial. Moreover the eligibility of the vma depends on
> > the type of VMA as well. In the past we have supported only anonymous
> > memory VMAs but things have changed and shmem based vmas are supported
> > as well these days and the query logic gets even more complicated
> > because the eligibility depends on the mount option and another global
> > configuration knob.
> > 
> > Simplify the current state and report the THP eligibility in
> > /proc/<pid>/smaps for each existing vma. Reuse transparent_hugepage_enabled
> > for this purpose. The original implementation of this function assumes
> > that the caller knows that the vma itself is supported for THP so make
> > the core checks into __transparent_hugepage_enabled and use it for
> > existing callers. __show_smap just use the new transparent_hugepage_enabled
> > which also checks the vma support status (please note that this one has
> > to be out of line due to include dependency issues).
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> Not thrilled by this,

Any specific concern?

> but kernel is always better suited to report this,
> than userspace piecing it together from multiple sources, relying on
> possibly outdated knowledge of kernel implementation details...

yep.

> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thanks!

> A nitpick:
> 
> > ---
> >  Documentation/filesystems/proc.txt |  3 +++
> >  fs/proc/task_mmu.c                 |  2 ++
> >  include/linux/huge_mm.h            | 13 ++++++++++++-
> >  mm/huge_memory.c                   | 12 +++++++++++-
> >  mm/memory.c                        |  4 ++--
> >  5 files changed, 30 insertions(+), 4 deletions(-)
> > 
> > diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> > index b1fda309f067..06562bab509a 100644
> > --- a/Documentation/filesystems/proc.txt
> > +++ b/Documentation/filesystems/proc.txt
> > @@ -425,6 +425,7 @@ SwapPss:               0 kB
> >  KernelPageSize:        4 kB
> >  MMUPageSize:           4 kB
> >  Locked:                0 kB
> > +THPeligible:           0
> 
> I would use THP_Eligible. There are already fields with underscore in smaps.

I do not feel strongly. I will wait for more comments and see whether
there is some consensus.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 2/3] mm, thp, proc: report THP eligibility for each vma
  2018-11-23 15:21     ` Michal Hocko
@ 2018-11-23 15:24       ` Vlastimil Babka
  0 siblings, 0 replies; 27+ messages in thread
From: Vlastimil Babka @ 2018-11-23 15:24 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-api, Andrew Morton, Alexey Dobriyan, linux-mm, LKML

On 11/23/18 4:21 PM, Michal Hocko wrote:
> On Fri 23-11-18 16:07:06, Vlastimil Babka wrote:
>> On 11/20/18 11:35 AM, Michal Hocko wrote:
>>> From: Michal Hocko <mhocko@suse.com>
>>>
>>> Userspace falls short when trying to find out whether a specific memory
>>> range is eligible for THP. There are usecases that would like to know
>>> that
>>> http://lkml.kernel.org/r/alpine.DEB.2.21.1809251248450.50347@chino.kir.corp.google.com
>>> : This is used to identify heap mappings that should be able to fault thp
>>> : but do not, and they normally point to a low-on-memory or fragmentation
>>> : issue.
>>>
>>> The only way to deduce this now is to query for hg resp. nh flags and
>>> confronting the state with the global setting. Except that there is
>>> also PR_SET_THP_DISABLE that might change the picture. So the final
>>> logic is not trivial. Moreover the eligibility of the vma depends on
>>> the type of VMA as well. In the past we have supported only anonymous
>>> memory VMAs but things have changed and shmem based vmas are supported
>>> as well these days and the query logic gets even more complicated
>>> because the eligibility depends on the mount option and another global
>>> configuration knob.
>>>
>>> Simplify the current state and report the THP eligibility in
>>> /proc/<pid>/smaps for each existing vma. Reuse transparent_hugepage_enabled
>>> for this purpose. The original implementation of this function assumes
>>> that the caller knows that the vma itself is supported for THP so make
>>> the core checks into __transparent_hugepage_enabled and use it for
>>> existing callers. __show_smap just use the new transparent_hugepage_enabled
>>> which also checks the vma support status (please note that this one has
>>> to be out of line due to include dependency issues).
>>>
>>> Signed-off-by: Michal Hocko <mhocko@suse.com>
>>
>> Not thrilled by this,
> 
> Any specific concern?

The kitchen sink that smaps slowly becomes, with associated overhead
(i.e. one of reasons there's now smaps_rollup). Would be much nicer if
userspace had some way to say which fields it's interested in. But I
have no good ideas for that right now :/

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 3/3] mm, proc: report PR_SET_THP_DISABLE in proc
  2018-11-20 10:35 ` [RFC PATCH 3/3] mm, proc: report PR_SET_THP_DISABLE in proc Michal Hocko
  2018-11-20 11:42   ` Michal Hocko
@ 2018-11-23 15:49   ` Vlastimil Babka
  2018-11-27  0:33   ` William Kucharski
  2 siblings, 0 replies; 27+ messages in thread
From: Vlastimil Babka @ 2018-11-23 15:49 UTC (permalink / raw)
  To: Michal Hocko, linux-api
  Cc: Andrew Morton, Alexey Dobriyan, linux-mm, LKML, Michal Hocko

On 11/20/18 11:35 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> David Rientjes has reported that 1860033237d4 ("mm: make
> PR_SET_THP_DISABLE immediately active") has changed the way how
> we report THPable VMAs to the userspace. Their monitoring tool is
> triggering false alarms on PR_SET_THP_DISABLE tasks because it considers
> an insufficient THP usage as a memory fragmentation resp. memory
> pressure issue.
> 
> Before the said commit each newly created VMA inherited VM_NOHUGEPAGE
> flag and that got exposed to the userspace via /proc/<pid>/smaps file.
> This implementation had its downsides as explained in the commit message
> but it is true that the userspace doesn't have any means to query for
> the process wide THP enabled/disabled status.
> 
> PR_SET_THP_DISABLE is a process wide flag so it makes a lot of sense
> to export in the process wide context rather than per-vma. Introduce

Agreed.

> a new field to /proc/<pid>/status which export this status.  If
> PR_SET_THP_DISABLE is used then it reports false same as when the THP is
> not compiled in. It doesn't consider the global THP status because we
> already export that information via sysfs
> 
> Fixes: 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active")
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>
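
As a rough sketch of the consumer side of the process-wide report: the name of
the new /proc/<pid>/status field is not shown in this part of the thread, so
the default "THP_enabled" below is an assumption, as is the 0/1 value format.
thp_enabled_from_status is a hypothetical helper.

```python
def thp_enabled_from_status(status_text, field="THP_enabled"):
    """Return True/False for the process-wide field, or None if absent.

    status_text is the contents of /proc/<pid>/status. The field name is
    an assumption; pass a different one if the kernel spells it otherwise.
    """
    prefix = field + ":"
    for line in status_text.splitlines():
        if line.startswith(prefix):
            return line[len(prefix):].strip() == "1"
    # Field not present: older kernel, or a different field name.
    return None
```

Returning None for a missing field keeps "kernel does not report this" distinct
from "THP disabled for the process", which matters for tools that must run on
both old and new kernels.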

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 3/3] mm, proc: report PR_SET_THP_DISABLE in proc
  2018-11-20 10:35 ` [RFC PATCH 3/3] mm, proc: report PR_SET_THP_DISABLE in proc Michal Hocko
  2018-11-20 11:42   ` Michal Hocko
  2018-11-23 15:49   ` Vlastimil Babka
@ 2018-11-27  0:33   ` William Kucharski
  2018-11-27 13:17     ` Michal Hocko
  2 siblings, 1 reply; 27+ messages in thread
From: William Kucharski @ 2018-11-27  0:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-api, Andrew Morton, Alexey Dobriyan, linux-mm, LKML, Michal Hocko



This determines whether the page can theoretically be THP-mapped, but is the intention to also check for proper alignment and/or preexisting PAGESIZE page cache mappings for the address range?

I'm having to deal with both these issues in the text page THP prototype I've been working on for some time now.

    Thanks,
         William Kucharski

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 3/3] mm, proc: report PR_SET_THP_DISABLE in proc
  2018-11-27  0:33   ` William Kucharski
@ 2018-11-27 13:17     ` Michal Hocko
  2018-11-27 14:50       ` William Kucharski
  0 siblings, 1 reply; 27+ messages in thread
From: Michal Hocko @ 2018-11-27 13:17 UTC (permalink / raw)
  To: William Kucharski
  Cc: linux-api, Andrew Morton, Alexey Dobriyan, linux-mm, LKML

On Mon 26-11-18 17:33:32, William Kucharski wrote:
> 
> 
> This determines whether the page can theoretically be THP-mapped, but
> is the intention to also check for proper alignment and/or preexisting
> PAGESIZE page cache mappings for the address range?

This is only about the process wide flag to disable THP. I do not see
how this can be alignment related. I suspect you wanted to ask in the
smaps patch?

> I'm having to deal with both these issues in the text page THP
> prototype I've been working on for some time now.

Could you be more specific about the issue and how the alignment comes
into the game? The only thing I can think of is to not report VMAs
smaller than the THP as eligible. Is this what you are looking for?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 3/3] mm, proc: report PR_SET_THP_DISABLE in proc
  2018-11-27 13:17     ` Michal Hocko
@ 2018-11-27 14:50       ` William Kucharski
  2018-11-27 16:25         ` Michal Hocko
  2018-11-27 16:50         ` Vlastimil Babka
  0 siblings, 2 replies; 27+ messages in thread
From: William Kucharski @ 2018-11-27 14:50 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-api, Andrew Morton, Alexey Dobriyan, linux-mm, LKML



> On Nov 27, 2018, at 6:17 AM, Michal Hocko <mhocko@kernel.org> wrote:
> 
> This is only about the process wide flag to disable THP. I do not see
> how this can be alignment related. I suspect you wanted to ask in the
> smaps patch?

No, answered below.

> 
>> I'm having to deal with both these issues in the text page THP
>> prototype I've been working on for some time now.
> 
> Could you be more specific about the issue and how the alignment comes
> into the game? The only thing I can think of is to not report VMAs
> smaller than the THP as eligible. Is this what you are looking for?

Basically, if the faulting VA is one that cannot be mapped with a THP
due to alignment or size constraints, it may be "eligible" for THP
mapping but ultimately can't be.

I was just double checking that this was meant to be more of a check done
before code elsewhere performs additional checks and does the actual THP
mapping, not an all-encompassing go/no go check for THP mapping.

    Thanks,
         William Kucharski

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 3/3] mm, proc: report PR_SET_THP_DISABLE in proc
  2018-11-27 14:50       ` William Kucharski
@ 2018-11-27 16:25         ` Michal Hocko
  2018-11-27 16:50         ` Vlastimil Babka
  1 sibling, 0 replies; 27+ messages in thread
From: Michal Hocko @ 2018-11-27 16:25 UTC (permalink / raw)
  To: William Kucharski
  Cc: linux-api, Andrew Morton, Alexey Dobriyan, linux-mm, LKML

On Tue 27-11-18 07:50:08, William Kucharski wrote:
> 
> 
> > On Nov 27, 2018, at 6:17 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > This is only about the process wide flag to disable THP. I do not see
> > how this can be alignment related. I suspect you wanted to ask in the
> > smaps patch?
> 
> No, answered below.
> 
> > 
> >> I'm having to deal with both these issues in the text page THP
> >> prototype I've been working on for some time now.
> > 
> > Could you be more specific about the issue and how the alignment comes
> > into the game? The only thing I can think of is to not report VMAs
> > smaller than the THP as eligible. Is this what you are looking for?
> 
> Basically, if the faulting VA is one that cannot be mapped with a THP
> due to alignment or size constraints, it may be "eligible" for THP
> mapping but ultimately can't be.
> 
> I was just double checking that this was meant to be more of a check done
> before code elsewhere performs additional checks and does the actual THP
> mapping, not an all-encompassing go/no go check for THP mapping.

I am still not sure I follow you completely here. This just reports
per-task eligibility. The system wide eligibility is reported via sysfs
and the per vma eligibility is reported via /proc/<pid>/smaps.
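
To make the system-wide layer concrete, a small sketch of reading the global
setting: /sys/kernel/mm/transparent_hugepage/enabled lists the possible modes
and marks the active one with square brackets, e.g. "always [madvise] never".
active_thp_mode is a hypothetical helper that only parses that text.

```python
import re


def active_thp_mode(sysfs_text):
    """Return the bracketed (active) mode from the sysfs 'enabled' file.

    Returns None if no bracketed entry is found, e.g. when the text does
    not come from a kernel with THP support.
    """
    m = re.search(r"\[(\w+)\]", sysfs_text)
    return m.group(1) if m else None
```

A tool would combine this with the per-task and per-VMA values rather than
treat any single layer as the full answer.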

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 3/3] mm, proc: report PR_SET_THP_DISABLE in proc
  2018-11-27 14:50       ` William Kucharski
  2018-11-27 16:25         ` Michal Hocko
@ 2018-11-27 16:50         ` Vlastimil Babka
  2018-11-27 17:06           ` William Kucharski
  1 sibling, 1 reply; 27+ messages in thread
From: Vlastimil Babka @ 2018-11-27 16:50 UTC (permalink / raw)
  To: William Kucharski, Michal Hocko
  Cc: linux-api, Andrew Morton, Alexey Dobriyan, linux-mm, LKML

On 11/27/18 3:50 PM, William Kucharski wrote:
> 
> I was just double checking that this was meant to be more of a check done
> before code elsewhere performs additional checks and does the actual THP
> mapping, not an all-encompassing go/no go check for THP mapping.

Yes, the code doing the actual mapping is still checking also alignment etc.
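
A short arithmetic sketch of why an "eligible" VMA may still never get a huge
page. This is not the kernel's check, only an illustration of the alignment
and size constraint discussed above; the 2 MiB huge page size is an assumption
(PMD-sized THP on x86-64) and has_aligned_hugepage_range is a hypothetical
name.

```python
HPAGE_SIZE = 2 * 1024 * 1024  # assumption: 2 MiB PMD-sized THP, as on x86-64


def has_aligned_hugepage_range(start, end, hpage_size=HPAGE_SIZE):
    """True if [start, end) contains at least one naturally aligned
    huge-page-sized region."""
    first = (start + hpage_size - 1) // hpage_size * hpage_size  # round start up
    last = end // hpage_size * hpage_size                        # round end down
    return last - first >= hpage_size
```

A VMA that starts one 4 kB page past a 2 MiB boundary and is just under 2 MiB
long fails this test even though nothing in its flags forbids THP.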

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 3/3] mm, proc: report PR_SET_THP_DISABLE in proc
  2018-11-27 16:50         ` Vlastimil Babka
@ 2018-11-27 17:06           ` William Kucharski
  0 siblings, 0 replies; 27+ messages in thread
From: William Kucharski @ 2018-11-27 17:06 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michal Hocko, linux-api, Andrew Morton, Alexey Dobriyan, linux-mm, LKML



> On Nov 27, 2018, at 9:50 AM, Vlastimil Babka <vbabka@suse.cz> wrote:
> 
> On 11/27/18 3:50 PM, William Kucharski wrote:
>> 
>> I was just double checking that this was meant to be more of a check done
>> before code elsewhere performs additional checks and does the actual THP
>> mapping, not an all-encompassing go/no go check for THP mapping.
> 
> Yes, the code doing the actual mapping is still checking also alignment etc.

Thanks, yes, that is what I was getting at.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/3] THP eligibility reporting via proc
  2018-11-20 10:35 [RFC PATCH 0/3] THP eligibility reporting via proc Michal Hocko
                   ` (2 preceding siblings ...)
  2018-11-20 10:35 ` [RFC PATCH 3/3] mm, proc: report PR_SET_THP_DISABLE in proc Michal Hocko
@ 2018-12-07 10:55 ` Michal Hocko
  3 siblings, 0 replies; 27+ messages in thread
From: Michal Hocko @ 2018-12-07 10:55 UTC (permalink / raw)
  To: linux-api
  Cc: Andrew Morton, Alexey Dobriyan, linux-mm, LKML, Dan Williams,
	David Rientjes, Jan Kara

On Tue 20-11-18 11:35:12, Michal Hocko wrote:
> Hi,
> this series of three patches aims at making THP eligibility reporting
> much more robust and long term sustainable. The trigger for the change
> is a regression report [1] and the long follow up discussion. In short
> the specific application didn't have good API to query whether a particular
> mapping can be backed by THP so it has used VMA flags to workaround that.
> These flags represent a deep internal state of VMAs and as such they should
> be used by userspace with a great deal of caution.
> 
> A similar has happened for [2] when users complained that VM_MIXEDMAP is
> no longer set on DAX mappings. Again a lack of a proper API led to an
> abuse.
> 
> The first patch in the series tries to emphasise that that the semantic
> of flags might change and any application consuming those should be really
> careful.
> 
> The remaining two patches provide a more suitable interface to address [1]
> and provide a consistent API to query the THP status both for each VMA
> and process wide as well.

Are there any other comments on these? I haven't heard any pushback so
far so I will re-send with RFC dropped early next week.

> 
> [1] http://lkml.kernel.org/r/alpine.DEB.2.21.1809241054050.224429@chino.kir.corp.google.com
> [2] http://lkml.kernel.org/r/20181002100531.GC4135@quack2.suse.cz
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 27+ messages in thread

Thread overview: 27+ messages
2018-11-20 10:35 [RFC PATCH 0/3] THP eligibility reporting via proc Michal Hocko
2018-11-20 10:35 ` [RFC PATCH 1/3] mm, proc: be more verbose about unstable VMA flags in /proc/<pid>/smaps Michal Hocko
2018-11-20 10:51   ` Jan Kara
2018-11-20 11:41     ` Michal Hocko
2018-11-21  0:01     ` David Rientjes
2018-11-21  6:56       ` Michal Hocko
2018-11-20 18:32   ` Dan Williams
2018-11-21  7:05     ` Michal Hocko
2018-11-21 18:01       ` Mike Rapoport
2018-11-21 17:54   ` Mike Rapoport
2018-11-21 17:58     ` Michal Hocko
2018-11-23 13:47   ` Vlastimil Babka
2018-11-20 10:35 ` [RFC PATCH 2/3] mm, thp, proc: report THP eligibility for each vma Michal Hocko
2018-11-20 11:42   ` Michal Hocko
2018-11-23 15:07   ` Vlastimil Babka
2018-11-23 15:21     ` Michal Hocko
2018-11-23 15:24       ` Vlastimil Babka
2018-11-20 10:35 ` [RFC PATCH 3/3] mm, proc: report PR_SET_THP_DISABLE in proc Michal Hocko
2018-11-20 11:42   ` Michal Hocko
2018-11-23 15:49   ` Vlastimil Babka
2018-11-27  0:33   ` William Kucharski
2018-11-27 13:17     ` Michal Hocko
2018-11-27 14:50       ` William Kucharski
2018-11-27 16:25         ` Michal Hocko
2018-11-27 16:50         ` Vlastimil Babka
2018-11-27 17:06           ` William Kucharski
2018-12-07 10:55 ` [RFC PATCH 0/3] THP eligibility reporting via proc Michal Hocko
