* [PACTH v2 0/3] Implement /proc/<pid>/totmaps
@ 2016-08-12 22:04 robert.foss
  2016-08-12 22:04 ` [PACTH v2 1/3] mm, proc: " robert.foss
                   ` (3 more replies)
  0 siblings, 4 replies; 49+ messages in thread
From: robert.foss @ 2016-08-12 22:04 UTC (permalink / raw)
  To: corbet, akpm, vbabka, koct9i, mhocko, hughd, robert.foss,
	n-horiguchi, minchan, john.stultz, ross.zwisler, jmarchan,
	hannes, keescook, viro, gorcunov, plaguedbypenguins, rientjes,
	eric.engestrom, jdanis, calvinowens, adobriyan, jann, sonnyrao,
	kirill.shutemov, ldufour, linux-doc, linux-kernel, Ben Zhang,
	Bryan Freed, Filipe Brandenburger, Mateusz Guzik

From: Robert Foss <robert.foss@collabora.com>

This series implements /proc/PID/totmaps, a tool for retrieving summarized
information about the mappings of a process.
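
For illustration, consuming the file programmatically could look roughly
like the sketch below (not part of the series; the field names follow the
documentation added in patch 3/3, the rest is just an example):

#include <stdio.h>

int main(int argc, char **argv)
{
	char path[64], line[256];
	unsigned long pss_kb = 0;
	FILE *f;

	if (argc < 2)
		return 1;
	snprintf(path, sizeof(path), "/proc/%s/totmaps", argv[1]);
	f = fopen(path, "r");
	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		/* totmaps already sums PSS over all mappings */
		if (sscanf(line, "Pss: %lu kB", &pss_kb) == 1)
			printf("pss: %lu kB\n", pss_kb);
	}
	fclose(f);
	return 0;
}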

Changes since v1:
- Removed IS_ERR check from get_task_mm() function
- Changed comment format
- Moved proc_totmaps_operations declaration inside internal.h
- Switched to using do_maps_open() in totmaps_open() function,
  which provides privilege checking
- Error handling reworked for totmaps_open() function
- Switched to stack allocated struct mem_size_stats mss_sum in
  totmaps_proc_show() function
- Removed get_task_mm() in totmaps_proc_show() since priv->mm
  already is available
- Added support to proc_map_release() for priv==NULL, to allow the
  function to be used for all failure cases
- Added proc_totmaps_op and helper functions for it
- Added documentation in a separate patch
- Removed totmaps_release() since it was just a wrapper for
  proc_map_release()


Robert Foss (3):
  mm, proc: Implement /proc/<pid>/totmaps
  Documentation/filesystems: Fixed typo
  Documentation/filesystems: Added /proc/PID/totmaps documentation

 Documentation/filesystems/proc.txt |  23 ++++++-
 fs/proc/base.c                     |   1 +
 fs/proc/internal.h                 |   3 +
 fs/proc/task_mmu.c                 | 134 +++++++++++++++++++++++++++++++++++++
 4 files changed, 160 insertions(+), 1 deletion(-)

-- 
2.7.4


* [PACTH v2 1/3] mm, proc: Implement /proc/<pid>/totmaps
  2016-08-12 22:04 [PACTH v2 0/3] Implement /proc/<pid>/totmaps robert.foss
@ 2016-08-12 22:04 ` robert.foss
  2016-08-13 14:39   ` Jann Horn
  2016-08-12 22:04 ` [PACTH v2 2/3] Documentation/filesystems: Fixed typo robert.foss
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 49+ messages in thread
From: robert.foss @ 2016-08-12 22:04 UTC (permalink / raw)
  To: corbet, akpm, vbabka, koct9i, mhocko, hughd, robert.foss,
	n-horiguchi, minchan, john.stultz, ross.zwisler, jmarchan,
	hannes, keescook, viro, gorcunov, plaguedbypenguins, rientjes,
	eric.engestrom, jdanis, calvinowens, adobriyan, jann, sonnyrao,
	kirill.shutemov, ldufour, linux-doc, linux-kernel, Ben Zhang,
	Bryan Freed, Filipe Brandenburger, Mateusz Guzik

From: Robert Foss <robert.foss@collabora.com>

This is based on earlier work by Thiago Goncales. It implements a new
per-process proc file which summarizes the contents of the smaps file
but doesn't display any addresses. It gives more detailed information
than statm, such as the PSS (proportional set size). It differs from the
original implementation in that it doesn't use the full-blown set of
seq operations, uses a different termination condition, and doesn't
display "Locked", as that was broken in the original implementation.

This new proc file provides information faster than parsing the potentially
huge smaps file.

Tested-by: Robert Foss <robert.foss@collabora.com>
Signed-off-by: Robert Foss <robert.foss@collabora.com>

Signed-off-by: Sonny Rao <sonnyrao@chromium.org>
---
 fs/proc/base.c     |   1 +
 fs/proc/internal.h |   3 ++
 fs/proc/task_mmu.c | 134 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 138 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index a11eb71..de3acdf 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2855,6 +2855,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 	REG("clear_refs", S_IWUSR, proc_clear_refs_operations),
 	REG("smaps",      S_IRUGO, proc_pid_smaps_operations),
 	REG("pagemap",    S_IRUSR, proc_pagemap_operations),
+	REG("totmaps",    S_IRUGO, proc_totmaps_operations),
 #endif
 #ifdef CONFIG_SECURITY
 	DIR("attr",       S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index aa27810..c55e1fe 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -281,6 +281,7 @@ struct proc_maps_private {
 	struct mm_struct *mm;
 #ifdef CONFIG_MMU
 	struct vm_area_struct *tail_vma;
+	struct mem_size_stats *mss;
 #endif
 #ifdef CONFIG_NUMA
 	struct mempolicy *task_mempolicy;
@@ -297,6 +298,8 @@ extern const struct file_operations proc_pid_smaps_operations;
 extern const struct file_operations proc_tid_smaps_operations;
 extern const struct file_operations proc_clear_refs_operations;
 extern const struct file_operations proc_pagemap_operations;
+extern const struct file_operations proc_totmaps_operations;
+
 
 extern unsigned long task_vsize(struct mm_struct *);
 extern unsigned long task_statm(struct mm_struct *,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 4648c7f..b7612e9 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -246,6 +246,9 @@ static int proc_map_release(struct inode *inode, struct file *file)
 	struct seq_file *seq = file->private_data;
 	struct proc_maps_private *priv = seq->private;
 
+	if (!priv)
+		return 0;
+
 	if (priv->mm)
 		mmdrop(priv->mm);
 
@@ -802,6 +805,75 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
 	return 0;
 }
 
+static void add_smaps_sum(struct mem_size_stats *mss,
+		struct mem_size_stats *mss_sum)
+{
+	mss_sum->resident += mss->resident;
+	mss_sum->pss += mss->pss;
+	mss_sum->shared_clean += mss->shared_clean;
+	mss_sum->shared_dirty += mss->shared_dirty;
+	mss_sum->private_clean += mss->private_clean;
+	mss_sum->private_dirty += mss->private_dirty;
+	mss_sum->referenced += mss->referenced;
+	mss_sum->anonymous += mss->anonymous;
+	mss_sum->anonymous_thp += mss->anonymous_thp;
+	mss_sum->swap += mss->swap;
+}
+
+static int totmaps_proc_show(struct seq_file *m, void *data)
+{
+	struct proc_maps_private *priv = m->private;
+	struct mm_struct *mm = priv->mm;
+	struct vm_area_struct *vma;
+	struct mem_size_stats mss_sum;
+
+	memset(&mss_sum, 0, sizeof(mss_sum));
+	down_read(&mm->mmap_sem);
+	hold_task_mempolicy(priv);
+
+	for (vma = mm->mmap; vma != priv->tail_vma; vma = vma->vm_next) {
+		struct mem_size_stats mss;
+		struct mm_walk smaps_walk = {
+			.pmd_entry = smaps_pte_range,
+			.mm = vma->vm_mm,
+			.private = &mss,
+		};
+
+		if (vma->vm_mm && !is_vm_hugetlb_page(vma)) {
+			memset(&mss, 0, sizeof(mss));
+			walk_page_vma(vma, &smaps_walk);
+			add_smaps_sum(&mss, &mss_sum);
+		}
+	}
+
+	seq_printf(m,
+		   "Rss:            %8lu kB\n"
+		   "Pss:            %8lu kB\n"
+		   "Shared_Clean:   %8lu kB\n"
+		   "Shared_Dirty:   %8lu kB\n"
+		   "Private_Clean:  %8lu kB\n"
+		   "Private_Dirty:  %8lu kB\n"
+		   "Referenced:     %8lu kB\n"
+		   "Anonymous:      %8lu kB\n"
+		   "AnonHugePages:  %8lu kB\n"
+		   "Swap:           %8lu kB\n",
+		   mss_sum.resident >> 10,
+		   (unsigned long)(mss_sum.pss >> (10 + PSS_SHIFT)),
+		   mss_sum.shared_clean  >> 10,
+		   mss_sum.shared_dirty  >> 10,
+		   mss_sum.private_clean >> 10,
+		   mss_sum.private_dirty >> 10,
+		   mss_sum.referenced >> 10,
+		   mss_sum.anonymous >> 10,
+		   mss_sum.anonymous_thp >> 10,
+		   mss_sum.swap >> 10);
+
+	release_task_mempolicy(priv);
+	up_read(&mm->mmap_sem);
+
+	return 0;
+}
+
 static int show_pid_smap(struct seq_file *m, void *v)
 {
 	return show_smap(m, v, 1);
@@ -812,6 +884,28 @@ static int show_tid_smap(struct seq_file *m, void *v)
 	return show_smap(m, v, 0);
 }
 
+static void *m_totmaps_start(struct seq_file *p, loff_t *pos)
+{
+	return NULL + (*pos == 0);
+}
+
+static void *m_totmaps_next(struct seq_file *p, void *v, loff_t *pos)
+{
+	++*pos;
+	return NULL;
+}
+
+static void m_totmaps_stop(struct seq_file *p, void *v)
+{
+}
+
+static const struct seq_operations proc_totmaps_op = {
+	.start	= m_totmaps_start,
+	.next	= m_totmaps_next,
+	.stop	= m_totmaps_stop,
+	.show	= totmaps_proc_show
+};
+
 static const struct seq_operations proc_pid_smaps_op = {
 	.start	= m_start,
 	.next	= m_next,
@@ -836,6 +930,39 @@ static int tid_smaps_open(struct inode *inode, struct file *file)
 	return do_maps_open(inode, file, &proc_tid_smaps_op);
 }
 
+static int totmaps_open(struct inode *inode, struct file *file)
+{
+	struct proc_maps_private *priv = NULL;
+	struct seq_file *seq;
+	int ret;
+
+	ret = do_maps_open(inode, file, &proc_totmaps_op);
+	if (ret)
+		goto error;
+
+	seq = file->private_data;
+	priv = seq->private;
+
+	/*
+	 * We need to grab references to the task_struct
+	 * at open time, because there's a potential information
+	 * leak where the totmaps file is opened and held open
+	 * while the underlying pid to task mapping changes
+	 * underneath it
+	 */
+	priv->task = get_proc_task(inode);
+	if (!priv->task) {
+		ret = -ESRCH;
+		goto error;
+	}
+
+	return 0;
+
+error:
+	proc_map_release(inode, file);
+	return ret;
+}
+
 const struct file_operations proc_pid_smaps_operations = {
 	.open		= pid_smaps_open,
 	.read		= seq_read,
@@ -850,6 +977,13 @@ const struct file_operations proc_tid_smaps_operations = {
 	.release	= proc_map_release,
 };
 
+const struct file_operations proc_totmaps_operations = {
+	.open		= totmaps_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= proc_map_release,
+};
+
 enum clear_refs_types {
 	CLEAR_REFS_ALL = 1,
 	CLEAR_REFS_ANON,
-- 
2.7.4


* [PACTH v2 2/3] Documentation/filesystems: Fixed typo
  2016-08-12 22:04 [PACTH v2 0/3] Implement /proc/<pid>/totmaps robert.foss
  2016-08-12 22:04 ` [PACTH v2 1/3] mm, proc: " robert.foss
@ 2016-08-12 22:04 ` robert.foss
  2016-08-12 22:04 ` [PACTH v2 3/3] Documentation/filesystems: Added /proc/PID/totmaps documentation robert.foss
  2016-08-14  9:04 ` [PACTH v2 0/3] Implement /proc/<pid>/totmaps Michal Hocko
  3 siblings, 0 replies; 49+ messages in thread
From: robert.foss @ 2016-08-12 22:04 UTC (permalink / raw)
  To: corbet, akpm, vbabka, koct9i, mhocko, hughd, robert.foss,
	n-horiguchi, minchan, john.stultz, ross.zwisler, jmarchan,
	hannes, keescook, viro, gorcunov, plaguedbypenguins, rientjes,
	eric.engestrom, jdanis, calvinowens, adobriyan, jann, sonnyrao,
	kirill.shutemov, ldufour, linux-doc, linux-kernel, Ben Zhang,
	Bryan Freed, Filipe Brandenburger, Mateusz Guzik

From: Robert Foss <robert.foss@collabora.com>

Fixed a -> an typo.

Signed-off-by: Robert Foss <robert.foss@collabora.com>
---
 Documentation/filesystems/proc.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index e8d0075..7d001be 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -145,7 +145,7 @@ Table 1-1: Process specific entries in /proc
 		symbol the task is blocked in - or "0" if not blocked.
  pagemap	Page table
  stack		Report full stack trace, enable via CONFIG_STACKTRACE
- smaps		a extension based on maps, showing the memory consumption of
+ smaps		an extension based on maps, showing the memory consumption of
 		each mapping and flags associated with it
  numa_maps	an extension based on maps, showing the memory locality and
 		binding policy as well as mem usage (in pages) of each mapping.
-- 
2.7.4


* [PACTH v2 3/3] Documentation/filesystems: Added /proc/PID/totmaps documentation
  2016-08-12 22:04 [PACTH v2 0/3] Implement /proc/<pid>/totmaps robert.foss
  2016-08-12 22:04 ` [PACTH v2 1/3] mm, proc: " robert.foss
  2016-08-12 22:04 ` [PACTH v2 2/3] Documentation/filesystems: Fixed typo robert.foss
@ 2016-08-12 22:04 ` robert.foss
  2016-08-14  9:04 ` [PACTH v2 0/3] Implement /proc/<pid>/totmaps Michal Hocko
  3 siblings, 0 replies; 49+ messages in thread
From: robert.foss @ 2016-08-12 22:04 UTC (permalink / raw)
  To: corbet, akpm, vbabka, koct9i, mhocko, hughd, robert.foss,
	n-horiguchi, minchan, john.stultz, ross.zwisler, jmarchan,
	hannes, keescook, viro, gorcunov, plaguedbypenguins, rientjes,
	eric.engestrom, jdanis, calvinowens, adobriyan, jann, sonnyrao,
	kirill.shutemov, ldufour, linux-doc, linux-kernel, Ben Zhang,
	Bryan Freed, Filipe Brandenburger, Mateusz Guzik

From: Robert Foss <robert.foss@collabora.com>

Added documentation covering /proc/PID/totmaps.

Signed-off-by: Robert Foss <robert.foss@collabora.com>
---
 Documentation/filesystems/proc.txt | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 7d001be..c06ff33 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -11,6 +11,7 @@ Version 1.3                                              Kernel version 2.2.12
 					      Kernel version 2.4.0-test11-pre4
 ------------------------------------------------------------------------------
 fixes/update part 1.1  Stefani Seibold <stefani@seibold.net>       June 9 2009
+add totmaps            Robert Foss <robert.foss@collabora.com>  August 12 2016
 
 Table of Contents
 -----------------
@@ -147,6 +148,8 @@ Table 1-1: Process specific entries in /proc
  stack		Report full stack trace, enable via CONFIG_STACKTRACE
  smaps		an extension based on maps, showing the memory consumption of
 		each mapping and flags associated with it
+ totmaps        an extension based on maps, showing the total memory
+                consumption of all mappings
  numa_maps	an extension based on maps, showing the memory locality and
 		binding policy as well as mem usage (in pages) of each mapping.
 ..............................................................................
@@ -512,6 +515,24 @@ be vanished or the reverse -- new added.
 This file is only present if the CONFIG_MMU kernel configuration option is
 enabled.
 
+The /proc/PID/totmaps is an extension based on maps, showing the memory
+consumption totals for all of the process's mappings. It lists the sums of the
+same statistics as /proc/PID/smaps.
+
+The process' mappings will be summarized as a series of lines like the
+following:
+
+Rss:                4256 kB
+Pss:                1170 kB
+Shared_Clean:       2720 kB
+Shared_Dirty:       1136 kB
+Private_Clean:         0 kB
+Private_Dirty:       400 kB
+Referenced:         4256 kB
+Anonymous:          1536 kB
+AnonHugePages:         0 kB
+Swap:                  0 kB
+
 The /proc/PID/clear_refs is used to reset the PG_Referenced and ACCESSED/YOUNG
 bits on both physical and virtual pages associated with a process, and the
 soft-dirty bit on pte (see Documentation/vm/soft-dirty.txt for details).
-- 
2.7.4


* Re: [PACTH v2 1/3] mm, proc: Implement /proc/<pid>/totmaps
  2016-08-12 22:04 ` [PACTH v2 1/3] mm, proc: " robert.foss
@ 2016-08-13 14:39   ` Jann Horn
  2016-08-15 13:57     ` Robert Foss
  0 siblings, 1 reply; 49+ messages in thread
From: Jann Horn @ 2016-08-13 14:39 UTC (permalink / raw)
  To: robert.foss
  Cc: corbet, akpm, vbabka, koct9i, mhocko, hughd, n-horiguchi,
	minchan, john.stultz, ross.zwisler, jmarchan, hannes, keescook,
	viro, gorcunov, plaguedbypenguins, rientjes, eric.engestrom,
	jdanis, calvinowens, adobriyan, sonnyrao, kirill.shutemov,
	ldufour, linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik


On Fri, Aug 12, 2016 at 06:04:20PM -0400, robert.foss@collabora.com wrote:
> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> index aa27810..c55e1fe 100644
> --- a/fs/proc/internal.h
> +++ b/fs/proc/internal.h
> @@ -281,6 +281,7 @@ struct proc_maps_private {
>  	struct mm_struct *mm;
>  #ifdef CONFIG_MMU
>  	struct vm_area_struct *tail_vma;
> +	struct mem_size_stats *mss;

This is unused now, right?


> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 4648c7f..b7612e9 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -246,6 +246,9 @@ static int proc_map_release(struct inode *inode, struct file *file)
>  	struct seq_file *seq = file->private_data;
>  	struct proc_maps_private *priv = seq->private;
>  
> +	if (!priv)
> +		return 0;
> +

You might want to get rid of this, see below.


> +static int totmaps_open(struct inode *inode, struct file *file)
> +{
> +	struct proc_maps_private *priv = NULL;
> +	struct seq_file *seq;
> +	int ret;
> +
> +	ret = do_maps_open(inode, file, &proc_totmaps_op);
> +	if (ret)
> +		goto error;
[...]
> +error:
> +	proc_map_release(inode, file);
> +	return ret;

I don't think this is correct. Have a look at the other callers of
do_maps_open() - none of them do any cleanup steps on error, they
just return. I think the "goto error" here should be a return
instead.

Have a look at the error cases that can cause do_maps_open() to
fail: do_maps_open() just calls proc_maps_open(). If the
__seq_open_private() call fails because of memory pressure,
file->private_data is still NULL, and your newly added NULL check
in proc_map_release() causes proc_map_release() to be a no-op
there. But if proc_maps_open() fails later on, things get nasty:
If, for example, proc_mem_open() fails because of a ptrace
permission denial, __seq_open_private -> seq_open has already set
file->private_data to a struct seq_file *, and then
proc_maps_open(), prior to passing on the error code, calls
seq_release_private -> seq_release, which frees that
struct seq_file * without NULLing the private_data pointer.
As far as I can tell, proc_map_release() would then run into
a use-after-free scenario.
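
(To illustrate, as a sketch rather than a tested patch: if the cleanup is
dropped here, and the task reference turns out to be unnecessary as
discussed below, the open path collapses to the same shape as the other
*_open() helpers.)

static int totmaps_open(struct inode *inode, struct file *file)
{
	/* do_maps_open() unwinds its own partial setup on failure */
	return do_maps_open(inode, file, &proc_totmaps_op);
}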


> +	priv->task = get_proc_task(inode);
> +	if (!priv->task) {
> +		ret = -ESRCH;
> +		goto error;
> +	}

You're not actually using ->task anywhere in the current version,
right? Can this be deleted?


> +const struct file_operations proc_totmaps_operations = {
[...]
> +	.release	= proc_map_release,

This won't release priv->task, causing a memory leak (exploitable
through a reference counter overflow of the task_struct usage
counter).



* Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
  2016-08-12 22:04 [PACTH v2 0/3] Implement /proc/<pid>/totmaps robert.foss
                   ` (2 preceding siblings ...)
  2016-08-12 22:04 ` [PACTH v2 3/3] Documentation/filesystems: Added /proc/PID/totmaps documentation robert.foss
@ 2016-08-14  9:04 ` Michal Hocko
  2016-08-15 13:00   ` Robert Foss
  3 siblings, 1 reply; 49+ messages in thread
From: Michal Hocko @ 2016-08-14  9:04 UTC (permalink / raw)
  To: robert.foss
  Cc: corbet, akpm, vbabka, koct9i, hughd, n-horiguchi, minchan,
	john.stultz, ross.zwisler, jmarchan, hannes, keescook, viro,
	gorcunov, plaguedbypenguins, rientjes, eric.engestrom, jdanis,
	calvinowens, adobriyan, jann, sonnyrao, kirill.shutemov, ldufour,
	linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik

On Fri 12-08-16 18:04:19, robert.foss@collabora.com wrote:
> From: Robert Foss <robert.foss@collabora.com>
> 
> This series implements /proc/PID/totmaps, a tool for retrieving summarized
> information about the mappings of a process.

The changelog is absolutely missing the use case. Why do we need this?
Why are the existing interfaces not sufficient?

> Changes since v1:
> - Removed IS_ERR check from get_task_mm() function
> - Changed comment format
> - Moved proc_totmaps_operations declaration inside internal.h
> - Switched to using do_maps_open() in totmaps_open() function,
>   which provides privilege checking
> - Error handling reworked for totmaps_open() function
> - Switched to stack allocated struct mem_size_stats mss_sum in
>   totmaps_proc_show() function
> - Removed get_task_mm() in totmaps_proc_show() since priv->mm
>   already is available
> - Added support to proc_map_release() for priv==NULL, to allow the
>   function to be used for all failure cases
> - Added proc_totmaps_op and helper functions for it
> - Added documentation in a separate patch
> - Removed totmaps_release() since it was just a wrapper for
>   proc_map_release()
> 
> 
> Robert Foss (3):
>   mm, proc: Implement /proc/<pid>/totmaps
>   Documentation/filesystems: Fixed typo
>   Documentation/filesystems: Added /proc/PID/totmaps documentation
> 
>  Documentation/filesystems/proc.txt |  23 ++++++-
>  fs/proc/base.c                     |   1 +
>  fs/proc/internal.h                 |   3 +
>  fs/proc/task_mmu.c                 | 134 +++++++++++++++++++++++++++++++++++++
>  4 files changed, 160 insertions(+), 1 deletion(-)
> 
> -- 
> 2.7.4
> 

-- 
Michal Hocko
SUSE Labs


* Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
  2016-08-14  9:04 ` [PACTH v2 0/3] Implement /proc/<pid>/totmaps Michal Hocko
@ 2016-08-15 13:00   ` Robert Foss
  2016-08-15 13:42     ` Michal Hocko
  0 siblings, 1 reply; 49+ messages in thread
From: Robert Foss @ 2016-08-15 13:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: corbet, akpm, vbabka, koct9i, hughd, n-horiguchi, minchan,
	john.stultz, ross.zwisler, jmarchan, hannes, keescook, viro,
	gorcunov, plaguedbypenguins, rientjes, eric.engestrom, jdanis,
	calvinowens, adobriyan, jann, sonnyrao, kirill.shutemov, ldufour,
	linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik



On 2016-08-14 05:04 AM, Michal Hocko wrote:
> On Fri 12-08-16 18:04:19, robert.foss@collabora.com wrote:
>> From: Robert Foss <robert.foss@collabora.com>
>>
>> This series implements /proc/PID/totmaps, a tool for retrieving summarized
>> information about the mappings of a process.
>
> The changelog is absolutely missing the use case. Why do we need this?
> Why are the existing interfaces not sufficient?

You are absolutely right, more information is in 1/3.
But the gist of it is that it provides a faster and more convenient way 
of accessing the information in /proc/PID/smaps.

>
>> Changes since v1:
>> - Removed IS_ERR check from get_task_mm() function
>> - Changed comment format
>> - Moved proc_totmaps_operations declaration inside internal.h
>> - Switched to using do_maps_open() in totmaps_open() function,
>>   which provides privilege checking
>> - Error handling reworked for totmaps_open() function
>> - Switched to stack allocated struct mem_size_stats mss_sum in
>>   totmaps_proc_show() function
>> - Removed get_task_mm() in totmaps_proc_show() since priv->mm
>>   already is available
>> - Added support to proc_map_release() for priv==NULL, to allow the
>>   function to be used for all failure cases
>> - Added proc_totmaps_op and helper functions for it
>> - Added documentation in a separate patch
>> - Removed totmaps_release() since it was just a wrapper for
>>   proc_map_release()
>>
>>
>> Robert Foss (3):
>>   mm, proc: Implement /proc/<pid>/totmaps
>>   Documentation/filesystems: Fixed typo
>>   Documentation/filesystems: Added /proc/PID/totmaps documentation
>>
>>  Documentation/filesystems/proc.txt |  23 ++++++-
>>  fs/proc/base.c                     |   1 +
>>  fs/proc/internal.h                 |   3 +
>>  fs/proc/task_mmu.c                 | 134 +++++++++++++++++++++++++++++++++++++
>>  4 files changed, 160 insertions(+), 1 deletion(-)
>>
>> --
>> 2.7.4
>>
>


* Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
  2016-08-15 13:00   ` Robert Foss
@ 2016-08-15 13:42     ` Michal Hocko
  2016-08-15 16:25       ` Robert Foss
  0 siblings, 1 reply; 49+ messages in thread
From: Michal Hocko @ 2016-08-15 13:42 UTC (permalink / raw)
  To: Robert Foss
  Cc: corbet, akpm, vbabka, koct9i, hughd, n-horiguchi, minchan,
	john.stultz, ross.zwisler, jmarchan, hannes, keescook, viro,
	gorcunov, plaguedbypenguins, rientjes, eric.engestrom, jdanis,
	calvinowens, adobriyan, jann, sonnyrao, kirill.shutemov, ldufour,
	linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik

On Mon 15-08-16 09:00:04, Robert Foss wrote:
> 
> 
> On 2016-08-14 05:04 AM, Michal Hocko wrote:
> > On Fri 12-08-16 18:04:19, robert.foss@collabora.com wrote:
> > > From: Robert Foss <robert.foss@collabora.com>
> > > 
> > > This series implements /proc/PID/totmaps, a tool for retrieving summarized
> > > information about the mappings of a process.
> > 
> > The changelog is absolutely missing the use case. Why do we need this?
> > Why are the existing interfaces not sufficient?
> 
> You are absolutely right, more information is in 1/3.

Patch 1 is silent about the use case as well. It is usually recommended
to describe the motivation for the change in the cover letter.

> But the gist of it is that it provides a faster and more convenient way of
> accessing the information in /proc/PID/smaps.

I am sorry to insist but this is far from a description I was hoping
for. Why do we need a more convenient API? Please note that this is a
userspace API which we will have to maintain for ever. We have made many
mistakes in the past where exporting some information made sense at the
time while it turned out being a mistake only later on. So let's make
sure we will not fall into the same trap again.

So please make sure you describe the use case, why the current API is
insufficient and why it cannot be tweaked to provide the information you
are looking for.
-- 
Michal Hocko
SUSE Labs


* Re: [PACTH v2 1/3] mm, proc: Implement /proc/<pid>/totmaps
  2016-08-13 14:39   ` Jann Horn
@ 2016-08-15 13:57     ` Robert Foss
  2016-08-15 20:14       ` Robert Foss
  0 siblings, 1 reply; 49+ messages in thread
From: Robert Foss @ 2016-08-15 13:57 UTC (permalink / raw)
  To: Jann Horn
  Cc: corbet, akpm, vbabka, koct9i, mhocko, hughd, n-horiguchi,
	minchan, john.stultz, ross.zwisler, jmarchan, hannes, keescook,
	viro, gorcunov, plaguedbypenguins, rientjes, eric.engestrom,
	jdanis, calvinowens, adobriyan, sonnyrao, kirill.shutemov,
	ldufour, linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik



On 2016-08-13 10:39 AM, Jann Horn wrote:
> On Fri, Aug 12, 2016 at 06:04:20PM -0400, robert.foss@collabora.com wrote:
>> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
>> index aa27810..c55e1fe 100644
>> --- a/fs/proc/internal.h
>> +++ b/fs/proc/internal.h
>> @@ -281,6 +281,7 @@ struct proc_maps_private {
>>  	struct mm_struct *mm;
>>  #ifdef CONFIG_MMU
>>  	struct vm_area_struct *tail_vma;
>> +	struct mem_size_stats *mss;
>
> This is unused now, right?

Fixing it in v3.

>
>
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index 4648c7f..b7612e9 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -246,6 +246,9 @@ static int proc_map_release(struct inode *inode, struct file *file)
>>  	struct seq_file *seq = file->private_data;
>>  	struct proc_maps_private *priv = seq->private;
>>
>> +	if (!priv)
>> +		return 0;
>> +
>
> You might want to get rid of this, see below.

Fixing it in v3.

>
>
>> +static int totmaps_open(struct inode *inode, struct file *file)
>> +{
>> +	struct proc_maps_private *priv = NULL;
>> +	struct seq_file *seq;
>> +	int ret;
>> +
>> +	ret = do_maps_open(inode, file, &proc_totmaps_op);
>> +	if (ret)
>> +		goto error;
> [...]
>> +error:
>> +	proc_map_release(inode, file);
>> +	return ret;
>
> I don't think this is correct. Have a look at the other callers of
> do_maps_open() - none of them do any cleanup steps on error, they
> just return. I think the "goto error" here should be a return
> instead.
>
> Have a look at the error cases that can cause do_maps_open() to
> fail: do_maps_open() just calls proc_maps_open(). If the
> __seq_open_private() call fails because of memory pressure,
> file->private_data is still NULL, and your newly added NULL check
> in proc_map_release() causes proc_map_release() to be a no-op
> there. But if proc_maps_open() fails later on, things get nasty:
> If, for example, proc_mem_open() fails because of a ptrace
> permission denial, __seq_open_private -> seq_open has already set
> file->private_data to a struct seq_file *, and then
> proc_maps_open(), prior to passing on the error code, calls
> seq_release_private -> seq_release, which frees that
> struct seq_file * without NULLing the private_data pointer.
> As far as I can tell, proc_map_release() would then run into
> a use-after-free scenario.
>
>
>> +	priv->task = get_proc_task(inode);
>> +	if (!priv->task) {
>> +		ret = -ESRCH;
>> +		goto error;
>> +	}
>
> You're not actually using ->task anywhere in the current version,
> right? Can this be deleted?
>
>
>> +const struct file_operations proc_totmaps_operations = {
> [...]
>> +	.release	= proc_map_release,
>
> This won't release priv->task, causing a memory leak (exploitable
> through a reference counter overflow of the task_struct usage
> counter).
>

Thanks for the thorough walkthrough, it is much appreciated.

priv->task does not appear to be used any more, and can be removed.
When "priv->task = get_proc_task(inode)" is removed, totmaps_open()
starts to look just like the other XXX_open functions.

I'll send out v3 as soon as testing has been done.


* Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
  2016-08-15 13:42     ` Michal Hocko
@ 2016-08-15 16:25       ` Robert Foss
  2016-08-16  7:12         ` Michal Hocko
  0 siblings, 1 reply; 49+ messages in thread
From: Robert Foss @ 2016-08-15 16:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: corbet, akpm, vbabka, koct9i, hughd, n-horiguchi, minchan,
	john.stultz, ross.zwisler, jmarchan, hannes, keescook, viro,
	gorcunov, plaguedbypenguins, rientjes, eric.engestrom, jdanis,
	calvinowens, adobriyan, jann, sonnyrao, kirill.shutemov, ldufour,
	linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik



On 2016-08-15 09:42 AM, Michal Hocko wrote:
> On Mon 15-08-16 09:00:04, Robert Foss wrote:
>>
>>
>> On 2016-08-14 05:04 AM, Michal Hocko wrote:
>>> On Fri 12-08-16 18:04:19, robert.foss@collabora.com wrote:
>>>> From: Robert Foss <robert.foss@collabora.com>
>>>>
>>>> This series implements /proc/PID/totmaps, a tool for retrieving summarized
>>>> information about the mappings of a process.
>>>
>>> The changelog is absolutely missing the use case. Why do we need this?
>>> Why are the existing interfaces not sufficient?
>>
>> You are absolutely right, more information is in 1/3.
>
> Patch 1 is silent about the use case as well. It is usually recommended
> to describe the motivation for the change in the cover letter.

I'll change it for v3.

>
>> But the gist of it is that it provides a faster and more convenient way of
>> accessing the information in /proc/PID/smaps.
>
> I am sorry to insist but this is far from a description I was hoping
> for. Why do we need a more convenient API? Please note that this is a
> userspace API which we will have to maintain for ever. We have made many
> mistakes in the past where exporting some information made sense at the
> time while it turned out being a mistake only later on. So let's make
> sure we will not fall into the same trap again.
>
> So please make sure you describe the use case, why the current API is
> insufficient and why it cannot be tweaked to provide the information you
> are looking for.
>

I'll add a more elaborate description to the v3 cover letter.
In v1, there was a discussion which I think presented the practical 
applications rather well:

https://lkml.org/lkml/2016/8/9/628

or the quote from Sonny Rao pasted below:

 > The use case is to speed up monitoring of
 > memory consumption in environments where RSS isn't precise.
 >
 > For example Chrome tends to have many processes which have hundreds of VMAs
 > with a substantial amount of shared memory, and the error of using
 > RSS rather than PSS tends to be very large when looking at overall
 > memory consumption.  PSS isn't kept as a single number that's exported
 > like RSS, so to calculate PSS means having to parse a very large smaps
 > file.
 >
 > This process is slow and has to be repeated for many processes, and we
 > found that just the act of doing the parsing was taking up a
 > significant amount of CPU time, so this patch is an attempt to make
 > that process cheaper.

If a reformatted version of this still isn't adequate or desirable for 
the cover-letter, please give me another heads up.

Thanks!


* Re: [PACTH v2 1/3] mm, proc: Implement /proc/<pid>/totmaps
  2016-08-15 13:57     ` Robert Foss
@ 2016-08-15 20:14       ` Robert Foss
  0 siblings, 0 replies; 49+ messages in thread
From: Robert Foss @ 2016-08-15 20:14 UTC (permalink / raw)
  To: Jann Horn
  Cc: corbet, akpm, vbabka, koct9i, mhocko, hughd, n-horiguchi,
	minchan, john.stultz, ross.zwisler, jmarchan, hannes, keescook,
	viro, gorcunov, plaguedbypenguins, rientjes, eric.engestrom,
	jdanis, calvinowens, adobriyan, sonnyrao, kirill.shutemov,
	ldufour, linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik



On 2016-08-15 09:57 AM, Robert Foss wrote:
>
>
> On 2016-08-13 10:39 AM, Jann Horn wrote:
>> On Fri, Aug 12, 2016 at 06:04:20PM -0400, robert.foss@collabora.com
>> wrote:
>>> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
>>> index aa27810..c55e1fe 100644
>>> --- a/fs/proc/internal.h
>>> +++ b/fs/proc/internal.h
>>> @@ -281,6 +281,7 @@ struct proc_maps_private {
>>>      struct mm_struct *mm;
>>>  #ifdef CONFIG_MMU
>>>      struct vm_area_struct *tail_vma;
>>> +    struct mem_size_stats *mss;
>>
>> This is unused now, right?
>
> Fixing it in v3.
>
>>
>>
>>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>>> index 4648c7f..b7612e9 100644
>>> --- a/fs/proc/task_mmu.c
>>> +++ b/fs/proc/task_mmu.c
>>> @@ -246,6 +246,9 @@ static int proc_map_release(struct inode *inode,
>>> struct file *file)
>>>      struct seq_file *seq = file->private_data;
>>>      struct proc_maps_private *priv = seq->private;
>>>
>>> +    if (!priv)
>>> +        return 0;
>>> +
>>
>> You might want to get rid of this, see below.
>
> Fixing it in v3.
>
>>
>>
>>> +static int totmaps_open(struct inode *inode, struct file *file)
>>> +{
>>> +    struct proc_maps_private *priv = NULL;
>>> +    struct seq_file *seq;
>>> +    int ret;
>>> +
>>> +    ret = do_maps_open(inode, file, &proc_totmaps_op);
>>> +    if (ret)
>>> +        goto error;
>> [...]
>>> +error:
>>> +    proc_map_release(inode, file);
>>> +    return ret;
>>
>> I don't think this is correct. Have a look at the other callers of
>> do_maps_open() - none of them do any cleanup steps on error, they
>> just return. I think the "goto error" here should be a return
>> instead.
>>
>> Have a look at the error cases that can cause do_maps_open() to
>> fail: do_maps_open() just calls proc_maps_open(). If the
>> __seq_open_private() call fails because of memory pressure,
>> file->private_data is still NULL, and your newly added NULL check
>> in proc_map_release() causes proc_map_release() to be a no-op
>> there. But if proc_maps_open() fails later on, things get nasty:
>> If, for example, proc_mem_open() fails because of a ptrace
>> permission denial, __seq_open_private -> seq_open has already set
>> file->private_data to a struct seq_file *, and then
>> proc_maps_open(), prior to passing on the error code, calls
>> seq_release_private -> seq_release, which frees that
>> struct seq_file * without NULLing the private_data pointer.
>> As far as I can tell, proc_map_release() would then run into
>> a use-after-free scenario.
>>
>>
>>> +    priv->task = get_proc_task(inode);
>>> +    if (!priv->task) {
>>> +        ret = -ESRCH;
>>> +        goto error;
>>> +    }
>>
>> You're not actually using ->task anywhere in the current version,
>> right? Can this be deleted?

Actually, priv->task is used by hold_task_mempolicy() in 
totmaps_proc_show(), which as far as I understand it is needed due to 
the "vma = mm->mmap" looping we do.


>>
>>
>>> +const struct file_operations proc_totmaps_operations = {
>> [...]
>>> +    .release    = proc_map_release,
>>
>> This won't release priv->task, causing a memory leak (exploitable
>> through a reference counter overflow of the task_struct usage
>> counter).
>>
>
> Thanks for the thorough walkthrough, it is much appreciated.
>
> priv->task does not appear to be used any more, and can be removed.
> When "priv->task = get_proc_task(inode)" is removed, totmaps_open()
> starts to look just like the other XXX_open functions.
>
> I'll send out v3 as soon as testing has been done.


* Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
  2016-08-15 16:25       ` Robert Foss
@ 2016-08-16  7:12         ` Michal Hocko
  2016-08-16 16:46           ` Robert Foss
  0 siblings, 1 reply; 49+ messages in thread
From: Michal Hocko @ 2016-08-16  7:12 UTC (permalink / raw)
  To: Robert Foss
  Cc: corbet, akpm, vbabka, koct9i, hughd, n-horiguchi, minchan,
	john.stultz, ross.zwisler, jmarchan, hannes, keescook, viro,
	gorcunov, plaguedbypenguins, rientjes, eric.engestrom, jdanis,
	calvinowens, adobriyan, jann, sonnyrao, kirill.shutemov, ldufour,
	linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik

On Mon 15-08-16 12:25:10, Robert Foss wrote:
> 
> 
> On 2016-08-15 09:42 AM, Michal Hocko wrote:
[...]
> > The use case is to speed up monitoring of
> > memory consumption in environments where RSS isn't precise.
> >
> > For example Chrome tends to have many processes which have hundreds of VMAs
> > with a substantial amount of shared memory, and the error of using
> > RSS rather than PSS tends to be very large when looking at overall
> > memory consumption.  PSS isn't kept as a single number that's exported
> > like RSS, so to calculate PSS means having to parse a very large smaps
> > file.
> >
> > This process is slow and has to be repeated for many processes, and we
> > found that just the act of doing the parsing was taking up a
> > significant amount of CPU time, so this patch is an attempt to make
> > that process cheaper.

Well, this is slow because it requires the pte walk otherwise you cannot
know how many ptes map the particular shared page. Your patch
(totmaps_proc_show) does the very same page table walk because in fact
it is unavoidable. So what exactly is the difference except for the
userspace parsing which is quite trivial e.g. my currently running Firefox
has
$ awk '/^[0-9a-f]/{print}' /proc/4950/smaps | wc -l
984

quite some VMAs, yet parsing it spends basically all the time in the kernel...

$ /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}' /proc/4950/smaps 
rss:1112288 pss:1096435
        Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss} /proc/4950/smaps"
        User time (seconds): 0.00
        System time (seconds): 0.02
        Percent of CPU this job got: 91%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.02

So I am not really sure I see the performance benefit.
-- 
Michal Hocko
SUSE Labs


* Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
  2016-08-16  7:12         ` Michal Hocko
@ 2016-08-16 16:46           ` Robert Foss
  2016-08-17  8:22             ` Michal Hocko
  0 siblings, 1 reply; 49+ messages in thread
From: Robert Foss @ 2016-08-16 16:46 UTC (permalink / raw)
  To: Michal Hocko, sonnyrao
  Cc: corbet, akpm, vbabka, koct9i, hughd, n-horiguchi, minchan,
	john.stultz, ross.zwisler, jmarchan, hannes, keescook, viro,
	gorcunov, plaguedbypenguins, rientjes, eric.engestrom, jdanis,
	calvinowens, adobriyan, jann, kirill.shutemov, ldufour,
	linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik



On 2016-08-16 03:12 AM, Michal Hocko wrote:
> On Mon 15-08-16 12:25:10, Robert Foss wrote:
>>
>>
>> On 2016-08-15 09:42 AM, Michal Hocko wrote:
> [...]
>>> The use case is to speed up monitoring of
>>> memory consumption in environments where RSS isn't precise.
>>>
>>> For example Chrome tends to have many processes which have hundreds of VMAs
>>> with a substantial amount of shared memory, and the error of using
>>> RSS rather than PSS tends to be very large when looking at overall
>>> memory consumption.  PSS isn't kept as a single number that's exported
>>> like RSS, so to calculate PSS means having to parse a very large smaps
>>> file.
>>>
>>> This process is slow and has to be repeated for many processes, and we
>>> found that just the act of doing the parsing was taking up a
>>> significant amount of CPU time, so this patch is an attempt to make
>>> that process cheaper.
>
> Well, this is slow because it requires the pte walk otherwise you cannot
> know how many ptes map the particular shared page. Your patch
> (totmaps_proc_show) does the very same page table walk because in fact
> it is unavoidable. So what exactly is the difference except for the
> userspace parsing which is quite trivial e.g. my currently running Firefox
> has
> $ awk '/^[0-9a-f]/{print}' /proc/4950/smaps | wc -l
> 984
>
> quite some VMAs, yet parsing it spends basically all the time in the kernel...
>
> $ /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}' /proc/4950/smaps
> rss:1112288 pss:1096435
>         Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss} /proc/4950/smaps"
>         User time (seconds): 0.00
>         System time (seconds): 0.02
>         Percent of CPU this job got: 91%
>         Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.02
>
> So I am not really sure I see the performance benefit.
>

I did some performance measurements of my own, and it would seem like 
there is about a 2x performance gain to be had. To me that is 
substantial, and a larger gain than commonly seen.

There is naturally also the benefit that this is a lot easier to interact
with programmatically.

$ ps aux | grep firefox
robertfoss   5025 24.3 13.7 3562820 2219616 ?     Rl   Aug15 277:44 
/usr/lib/firefox/firefox https://allg.one/xpb
$ awk '/^[0-9a-f]/{print}' /proc/5025/smaps | wc -l
1503


$ /usr/bin/time -v -p zsh -c "(repeat 25 {cat /proc/5025/totmaps})"
[...]
	Command being timed: "zsh -c (repeat 25 {cat /proc/5025/totmaps})"
	User time (seconds): 0.00
	System time (seconds): 0.40
	Percent of CPU this job got: 90%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.45


$ /usr/bin/time -v -p zsh -c "repeat 25 { awk '/^Rss/{rss+=\$2} 
/^Pss/{pss+=\$2} END {printf \"rss:%d pss:%d\n\", rss, pss}\' 
/proc/5025/smaps }"
[...]
	Command being timed: "zsh -c repeat 25 { awk '/^Rss/{rss+=$2} 
/^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}\' 
/proc/5025/smaps }"
	User time (seconds): 0.37
	System time (seconds): 0.45
	Percent of CPU this job got: 92%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.89


* Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
  2016-08-16 16:46           ` Robert Foss
@ 2016-08-17  8:22             ` Michal Hocko
  2016-08-17  9:31               ` Jann Horn
  0 siblings, 1 reply; 49+ messages in thread
From: Michal Hocko @ 2016-08-17  8:22 UTC (permalink / raw)
  To: Robert Foss
  Cc: sonnyrao, corbet, akpm, vbabka, koct9i, hughd, n-horiguchi,
	minchan, john.stultz, ross.zwisler, jmarchan, hannes, keescook,
	viro, gorcunov, plaguedbypenguins, rientjes, eric.engestrom,
	jdanis, calvinowens, adobriyan, jann, kirill.shutemov, ldufour,
	linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik

On Tue 16-08-16 12:46:51, Robert Foss wrote:
[...]
> $ /usr/bin/time -v -p zsh -c "repeat 25 { awk '/^Rss/{rss+=\$2}
> /^Pss/{pss+=\$2} END {printf \"rss:%d pss:%d\n\", rss, pss}\'
> /proc/5025/smaps }"
> [...]
> 	Command being timed: "zsh -c repeat 25 { awk '/^Rss/{rss+=$2}
> /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}\' /proc/5025/smaps
> }"
> 	User time (seconds): 0.37
> 	System time (seconds): 0.45
> 	Percent of CPU this job got: 92%
> 	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.89

This is really unexpected. Where is the user time spent? Anyway, rather
than measuring some random processes I've tried to measure something
resembling the worst case. So I've created a simple program to mmap as
much as possible:

#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdio.h>
int main()
{
	while (mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_ANON|MAP_SHARED|MAP_POPULATE, -1, 0) != MAP_FAILED)
		;

	printf("pid:%d\n", getpid());
	pause();
	return 0;
}

so depending on /proc/sys/vm/max_map_count you will get the maximum
possible mmaps. I am using a default so 65k mappings. Then I have
retried your 25x file parsing:
$ cat s.sh
#!/bin/sh

pid=$1
for i in $(seq 25)
do
	awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}' /proc/$pid/smaps
done

But I am getting different results from you:
$ awk '/^[0-9a-f]/{print}' /proc/14808/smaps | wc -l
65532
[...]
        Command being timed: "sh s.sh 14808"
        User time (seconds): 0.00
        System time (seconds): 20.10
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:20.20

The results are stable when I try multiple times, in fact there
shouldn't be any reason for them not to be. Then I went on to increase
max_map_count to 250k and that behaves consistently:
$ awk '/^[0-9a-f]/{print}' /proc/16093/smaps | wc -l     
250002
[...]
        Command being timed: "sh s.sh 16093"
        User time (seconds): 0.00
        System time (seconds): 77.93
        Percent of CPU this job got: 98%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:19.09

So with a reasonable user space the parsing is really not all that time
consuming wrt. smaps handling. That being said I am still very skeptical
about a dedicated proc file which accomplishes what userspace can do
in a trivial way.
-- 
Michal Hocko
SUSE Labs


* Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
  2016-08-17  8:22             ` Michal Hocko
@ 2016-08-17  9:31               ` Jann Horn
  2016-08-17 13:03                 ` Michal Hocko
  0 siblings, 1 reply; 49+ messages in thread
From: Jann Horn @ 2016-08-17  9:31 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Robert Foss, sonnyrao, corbet, akpm, vbabka, koct9i, hughd,
	n-horiguchi, minchan, john.stultz, ross.zwisler, jmarchan,
	hannes, keescook, viro, gorcunov, plaguedbypenguins, rientjes,
	eric.engestrom, jdanis, calvinowens, adobriyan, kirill.shutemov,
	ldufour, linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik


On Wed, Aug 17, 2016 at 10:22:00AM +0200, Michal Hocko wrote:
> On Tue 16-08-16 12:46:51, Robert Foss wrote:
> [...]
> > $ /usr/bin/time -v -p zsh -c "repeat 25 { awk '/^Rss/{rss+=\$2}
> > /^Pss/{pss+=\$2} END {printf \"rss:%d pss:%d\n\", rss, pss}\'
> > /proc/5025/smaps }"
> > [...]
> > 	Command being timed: "zsh -c repeat 25 { awk '/^Rss/{rss+=$2}
> > /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}\' /proc/5025/smaps
> > }"
> > 	User time (seconds): 0.37
> > 	System time (seconds): 0.45
> > 	Percent of CPU this job got: 92%
> > 	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.89
> 
> This is really unexpected. Where is the user time spent? Anyway, rather
> than measuring some random processes I've tried to measure something
> resembling the worst case. So I've created a simple program to mmap as
> much as possible:
> 
> #include <sys/mman.h>
> #include <sys/types.h>
> #include <unistd.h>
> #include <stdio.h>
> int main()
> {
> 	while (mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_ANON|MAP_SHARED|MAP_POPULATE, -1, 0) != MAP_FAILED)
> 		;
> 
> 	printf("pid:%d\n", getpid());
> 	pause();
> 	return 0;
> }

Ah, nice, that's a reasonable test program. :)


> So with a reasonable user space the parsing is really not all that time
> consuming wrt. smaps handling. That being said I am still very skeptical
> about a dedicated proc file which accomplishes what userspace can do
> in a trivial way.

Now, since your numbers showed that all the time is spent in the kernel,
also create this test program to just read that file over and over again:

$ cat justreadloop.c
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>
#include <err.h>
#include <stdio.h>

char buf[1000000];

int main(int argc, char **argv) {
  printf("pid:%d\n", getpid());
  while (1) {
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) continue;
    if (read(fd, buf, sizeof(buf)) < 0)
      err(1, "read");
    close(fd);
  }
}
$ gcc -Wall -o justreadloop justreadloop.c
$ 

Now launch your test:

$ ./mapstuff 
pid:29397

point justreadloop at it:

$ ./justreadloop /proc/29397/smaps
pid:32567

... and then check the performance stats of justreadloop:

# perf top -p 32567

This is what I see:

Samples: 232K of event 'cycles:ppp', Event count (approx.): 60448424325
Overhead  Shared Object     Symbol
  30,43%  [kernel]          [k] format_decode
   9,12%  [kernel]          [k] number
   7,66%  [kernel]          [k] vsnprintf
   7,06%  [kernel]          [k] __lock_acquire
   3,23%  [kernel]          [k] lock_release
   2,85%  [kernel]          [k] debug_lockdep_rcu_enabled
   2,25%  [kernel]          [k] skip_atoi
   2,13%  [kernel]          [k] lock_acquire
   2,05%  [kernel]          [k] show_smap

That's at least 30.43% + 9.12% + 7.66% = 47.21% of the task's kernel
time spent on evaluating format strings. The new interface
wouldn't have to spend that much time on format strings because there
isn't so much text to format. (My kernel is built with a
bunch of debug options - the results might look very different on
distro kernels or so, so please try this yourself.)

I guess it could be argued that this is not just a problem with
smaps, but also a problem with format strings (or text-based interfaces
in general) just being slow in general.

(Here is a totally random and crazy thought: Can we put something into
the kernel build process that replaces printf calls that use simple
format strings with equivalent non-printf calls? Move the cost of
evaluating the format string to compile time?)
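
(A hand-rolled version of that transformation, just to illustrate the
idea; this reuses the seq_file helpers already used elsewhere in
fs/proc rather than any build-time machinery, and it drops the %8
field-width padding:)

	/* format string parsed at runtime on every read: */
	seq_printf(m, "Rss:            %8lu kB\n", mss.resident >> 10);

	/* the same fields emitted without a format string: */
	seq_puts(m, "Rss:            ");
	seq_put_decimal_ull(m, 0, mss.resident >> 10);
	seq_puts(m, " kB\n");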



* Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
  2016-08-17  9:31               ` Jann Horn
@ 2016-08-17 13:03                 ` Michal Hocko
  2016-08-17 16:48                   ` Robert Foss
  2016-08-17 18:57                   ` Sonny Rao
  0 siblings, 2 replies; 49+ messages in thread
From: Michal Hocko @ 2016-08-17 13:03 UTC (permalink / raw)
  To: Jann Horn
  Cc: Robert Foss, sonnyrao, corbet, akpm, vbabka, koct9i, hughd,
	n-horiguchi, minchan, john.stultz, ross.zwisler, jmarchan,
	hannes, keescook, viro, gorcunov, plaguedbypenguins, rientjes,
	eric.engestrom, jdanis, calvinowens, adobriyan, kirill.shutemov,
	ldufour, linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik

On Wed 17-08-16 11:31:25, Jann Horn wrote:
> On Wed, Aug 17, 2016 at 10:22:00AM +0200, Michal Hocko wrote:
> > On Tue 16-08-16 12:46:51, Robert Foss wrote:
> > [...]
> > > $ /usr/bin/time -v -p zsh -c "repeat 25 { awk '/^Rss/{rss+=\$2}
> > > /^Pss/{pss+=\$2} END {printf \"rss:%d pss:%d\n\", rss, pss}\'
> > > /proc/5025/smaps }"
> > > [...]
> > > 	Command being timed: "zsh -c repeat 25 { awk '/^Rss/{rss+=$2}
> > > /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}\' /proc/5025/smaps
> > > }"
> > > 	User time (seconds): 0.37
> > > 	System time (seconds): 0.45
> > > 	Percent of CPU this job got: 92%
> > > 	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.89
> > 
> > This is really unexpected. Where is the user time spent? Anyway, rather
> > than measuring some random processes I've tried to measure something
> > resembling the worst case. So I've created a simple program to mmap as
> > much as possible:
> > 
> > #include <sys/mman.h>
> > #include <sys/types.h>
> > #include <unistd.h>
> > #include <stdio.h>
> > int main()
> > {
> > 	while (mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_ANON|MAP_SHARED|MAP_POPULATE, -1, 0) != MAP_FAILED)
> > 		;
> > 
> > 	printf("pid:%d\n", getpid());
> > 	pause();
> > 	return 0;
> > }
> 
> Ah, nice, that's a reasonable test program. :)
> 
> 
> > So with a reasonable user space the parsing is really not all that time
> > consuming wrt. smaps handling. That being said I am still very skeptical
> > about a dedicated proc file which accomplishes what userspace can do
> > in a trivial way.
> 
> Now, since your numbers showed that all the time is spent in the kernel,
> also create this test program to just read that file over and over again:
> 
> $ cat justreadloop.c
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <sched.h>
> #include <unistd.h>
> #include <err.h>
> #include <stdio.h>
> 
> char buf[1000000];
> 
> int main(int argc, char **argv) {
>   printf("pid:%d\n", getpid());
>   while (1) {
>     int fd = open(argv[1], O_RDONLY);
>     if (fd < 0) continue;
>     if (read(fd, buf, sizeof(buf)) < 0)
>       err(1, "read");
>     close(fd);
>   }
> }
> $ gcc -Wall -o justreadloop justreadloop.c
> $ 
> 
> Now launch your test:
> 
> $ ./mapstuff 
> pid:29397
> 
> point justreadloop at it:
> 
> $ ./justreadloop /proc/29397/smaps
> pid:32567
> 
> ... and then check the performance stats of justreadloop:
> 
> # perf top -p 32567
> 
> This is what I see:
> 
> Samples: 232K of event 'cycles:ppp', Event count (approx.): 60448424325
> Overhead  Shared Object     Symbol
>   30,43%  [kernel]          [k] format_decode
>    9,12%  [kernel]          [k] number
>    7,66%  [kernel]          [k] vsnprintf
>    7,06%  [kernel]          [k] __lock_acquire
>    3,23%  [kernel]          [k] lock_release
>    2,85%  [kernel]          [k] debug_lockdep_rcu_enabled
>    2,25%  [kernel]          [k] skip_atoi
>    2,13%  [kernel]          [k] lock_acquire
>    2,05%  [kernel]          [k] show_smap

This is a lot! I would expect the rmap walk to consume more, but it
doesn't even show up in the top consumers.
 
> That's at least 30.43% + 9.12% + 7.66% = 47.21% of the task's kernel
> time spent on evaluating format strings. The new interface
> wouldn't have to spend that much time on format strings because there
> isn't so much text to format.

well, this is true of course but I would much rather try to reduce the
overhead of the smaps file than add a new file. The following should help
already. I've measured a ~7% systime reduction. I guess there is still
some room for improvement but I have to say I'm far from being convinced
about a new proc file just because we suck at dumping information to
userspace. If this was something like /proc/<pid>/stat which is
essentially read all the time then it would be a different question, but
are rss and pss going to be read all that often? If yes, why? These are
the questions which should be answered before we even start considering
the implementation.
---
>From 2a6883a7278ff8979808cb8e2dbcefe5ea3bf672 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Wed, 17 Aug 2016 14:00:13 +0200
Subject: [PATCH] proc, smaps: reduce printing overhead

seq_printf (used by show_smap) can be pretty expensive when dumping a
lot of numbers.  Say we would like to get Rss and Pss from a particular
process.  In order to measure a pathological case let's generate as many
mappings as possible:

$ cat max_mmap.c
int main()
{
	while (mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_ANON|MAP_SHARED|MAP_POPULATE, -1, 0) != MAP_FAILED)
		;

	printf("pid:%d\n", getpid());
	pause();
	return 0;
}

$ awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}' /proc/$pid/smaps

would do the trick. The whole runtime is spent in kernel space, which is
not that unexpected because smaps is not the cheapest one (we have to
do the rmap walk etc.).

        Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss} /proc/3050/smaps"
        User time (seconds): 0.01
        System time (seconds): 0.44
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.47

But the perf says:
    22.55%  awk      [kernel.kallsyms]  [k] format_decode
    14.65%  awk      [kernel.kallsyms]  [k] vsnprintf
     6.40%  awk      [kernel.kallsyms]  [k] number
     2.53%  awk      [kernel.kallsyms]  [k] shmem_mapping
     2.53%  awk      [kernel.kallsyms]  [k] show_smap
     1.81%  awk      [kernel.kallsyms]  [k] lock_acquire

we are spending most of the time actually generating the output, which is
quite lame. Let's replace seq_printf with seq_puts and seq_put_decimal_ull.
This gives us:
        Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss} /proc/3067/smaps"
        User time (seconds): 0.00
        System time (seconds): 0.41
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.42

which is a ~7% improvement. Perf says:
    28.87%  awk      [kernel.kallsyms]  [k] seq_puts
     5.30%  awk      [kernel.kallsyms]  [k] vsnprintf
     4.54%  awk      [kernel.kallsyms]  [k] format_decode
     3.73%  awk      [kernel.kallsyms]  [k] show_smap
     2.56%  awk      [kernel.kallsyms]  [k] shmem_mapping
     1.92%  awk      [kernel.kallsyms]  [k] number
     1.80%  awk      [kernel.kallsyms]  [k] lock_acquire
     1.75%  awk      [kernel.kallsyms]  [k] print_name_value_kb

Reported-by: Jann Horn <jann@thejh.net>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/proc/task_mmu.c | 63 ++++++++++++++++++++++--------------------------------
 1 file changed, 25 insertions(+), 38 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 187d84ef9de9..41c24c0811da 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -721,6 +721,13 @@ void __weak arch_show_smap(struct seq_file *m, struct vm_area_struct *vma)
 {
 }
 
+static void print_name_value_kb(struct seq_file *m, const char *name, unsigned long val)
+{
+	seq_puts(m, name);
+	seq_put_decimal_ull(m, 0, val);
+	seq_puts(m, " kB\n");
+}
+
 static int show_smap(struct seq_file *m, void *v, int is_pid)
 {
 	struct vm_area_struct *vma = v;
@@ -765,45 +772,25 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
 
 	show_map_vma(m, vma, is_pid);
 
-	seq_printf(m,
-		   "Size:           %8lu kB\n"
-		   "Rss:            %8lu kB\n"
-		   "Pss:            %8lu kB\n"
-		   "Shared_Clean:   %8lu kB\n"
-		   "Shared_Dirty:   %8lu kB\n"
-		   "Private_Clean:  %8lu kB\n"
-		   "Private_Dirty:  %8lu kB\n"
-		   "Referenced:     %8lu kB\n"
-		   "Anonymous:      %8lu kB\n"
-		   "AnonHugePages:  %8lu kB\n"
-		   "ShmemPmdMapped: %8lu kB\n"
-		   "Shared_Hugetlb: %8lu kB\n"
-		   "Private_Hugetlb: %7lu kB\n"
-		   "Swap:           %8lu kB\n"
-		   "SwapPss:        %8lu kB\n"
-		   "KernelPageSize: %8lu kB\n"
-		   "MMUPageSize:    %8lu kB\n"
-		   "Locked:         %8lu kB\n",
-		   (vma->vm_end - vma->vm_start) >> 10,
-		   mss.resident >> 10,
-		   (unsigned long)(mss.pss >> (10 + PSS_SHIFT)),
-		   mss.shared_clean  >> 10,
-		   mss.shared_dirty  >> 10,
-		   mss.private_clean >> 10,
-		   mss.private_dirty >> 10,
-		   mss.referenced >> 10,
-		   mss.anonymous >> 10,
-		   mss.anonymous_thp >> 10,
-		   mss.shmem_thp >> 10,
-		   mss.shared_hugetlb >> 10,
-		   mss.private_hugetlb >> 10,
-		   mss.swap >> 10,
-		   (unsigned long)(mss.swap_pss >> (10 + PSS_SHIFT)),
-		   vma_kernel_pagesize(vma) >> 10,
-		   vma_mmu_pagesize(vma) >> 10,
-		   (vma->vm_flags & VM_LOCKED) ?
+	print_name_value_kb(m, "Size:           ", (vma->vm_end - vma->vm_start) >> 10);
+	print_name_value_kb(m, "Rss:            ", mss.resident >> 10);
+	print_name_value_kb(m, "Pss:            ", (unsigned long)(mss.pss >> (10 + PSS_SHIFT)));
+	print_name_value_kb(m, "Shared_Clean:   ", mss.shared_clean  >> 10);
+	print_name_value_kb(m, "Shared_Dirty:   ", mss.shared_dirty  >> 10);
+	print_name_value_kb(m, "Private_Clean:  ", mss.private_clean >> 10);
+	print_name_value_kb(m, "Private_Dirty:  ", mss.private_dirty >> 10);
+	print_name_value_kb(m, "Referenced:     ", mss.referenced >> 10);
+	print_name_value_kb(m, "Anonymous:      ", mss.anonymous >> 10);
+	print_name_value_kb(m, "AnonHugePages:  ", mss.anonymous_thp >> 10);
+	print_name_value_kb(m, "ShmemPmdMapped: ", mss.shmem_thp >> 10);
+	print_name_value_kb(m, "Shared_Hugetlb: ", mss.shared_hugetlb >> 10);
+	print_name_value_kb(m, "Private_Hugetlb: ", mss.private_hugetlb >> 10);
+	print_name_value_kb(m, "Swap:           ", mss.swap >> 10);
+	print_name_value_kb(m, "SwapPss:        ", (unsigned long)(mss.swap_pss >> (10 + PSS_SHIFT)));
+	print_name_value_kb(m, "KernelPageSize: ", vma_kernel_pagesize(vma) >> 10);
+	print_name_value_kb(m, "MMUPageSize:    ", vma_mmu_pagesize(vma) >> 10);
+	print_name_value_kb(m, "Locked:         ", (vma->vm_flags & VM_LOCKED) ?
 			(unsigned long)(mss.pss >> (10 + PSS_SHIFT)) : 0);
-
 	arch_show_smap(m, vma);
 	show_smap_vma_flags(m, vma);
 	m_cache_vma(m, vma);
-- 
2.8.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
  2016-08-17 13:03                 ` Michal Hocko
@ 2016-08-17 16:48                   ` Robert Foss
  2016-08-17 18:57                   ` Sonny Rao
  1 sibling, 0 replies; 49+ messages in thread
From: Robert Foss @ 2016-08-17 16:48 UTC (permalink / raw)
  To: Michal Hocko, Jann Horn, sonnyrao
  Cc: corbet, akpm, vbabka, koct9i, hughd, n-horiguchi, minchan,
	john.stultz, ross.zwisler, jmarchan, hannes, keescook, viro,
	gorcunov, plaguedbypenguins, rientjes, eric.engestrom, jdanis,
	calvinowens, adobriyan, kirill.shutemov, ldufour, linux-doc,
	linux-kernel, Ben Zhang, Bryan Freed, Filipe Brandenburger,
	Mateusz Guzik



On 2016-08-17 09:03 AM, Michal Hocko wrote:
> On Wed 17-08-16 11:31:25, Jann Horn wrote:
>> On Wed, Aug 17, 2016 at 10:22:00AM +0200, Michal Hocko wrote:
>>> On Tue 16-08-16 12:46:51, Robert Foss wrote:
>>> [...]
>>>> $ /usr/bin/time -v -p zsh -c "repeat 25 { awk '/^Rss/{rss+=\$2}
>>>> /^Pss/{pss+=\$2} END {printf \"rss:%d pss:%d\n\", rss, pss}\'
>>>> /proc/5025/smaps }"
>>>> [...]
>>>> 	Command being timed: "zsh -c repeat 25 { awk '/^Rss/{rss+=$2}
>>>> /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}\' /proc/5025/smaps
>>>> }"
>>>> 	User time (seconds): 0.37
>>>> 	System time (seconds): 0.45
>>>> 	Percent of CPU this job got: 92%
>>>> 	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.89
>>>
>>> This is really unexpected. Where is the user time spent? Anyway, rather
>>> than measuring some random processes I've tried to measure something
>>> resembling the worst case. So I've created a simple program to mmap as
>>> much as possible:
>>>
>>> #include <sys/mman.h>
>>> #include <sys/types.h>
>>> #include <unistd.h>
>>> #include <stdio.h>
>>> int main()
>>> {
>>> 	while (mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_ANON|MAP_SHARED|MAP_POPULATE, -1, 0) != MAP_FAILED)
>>> 		;
>>>
>>> 	printf("pid:%d\n", getpid());
>>> 	pause();
>>> 	return 0;
>>> }
>>
>> Ah, nice, that's a reasonable test program. :)
>>
>>
>>> So with a reasonable user space the parsing is really not all that time
>>> consuming wrt. smaps handling. That being said I am still very skeptical
>>> about a dedicated proc file which accomplishes what userspace can done
>>> in a trivial way.
>>
>> Now, since your numbers showed that all the time is spent in the kernel,
>> also create this test program to just read that file over and over again:
>>
>> $ cat justreadloop.c
>> #include <sys/types.h>
>> #include <sys/stat.h>
>> #include <fcntl.h>
>> #include <sched.h>
>> #include <unistd.h>
>> #include <err.h>
>> #include <stdio.h>
>>
>> char buf[1000000];
>>
>> int main(int argc, char **argv) {
>>   printf("pid:%d\n", getpid());
>>   while (1) {
>>     int fd = open(argv[1], O_RDONLY);
>>     if (fd < 0) continue;
>>     if (read(fd, buf, sizeof(buf)) < 0)
>>       err(1, "read");
>>     close(fd);
>>   }
>> }
>> $ gcc -Wall -o justreadloop justreadloop.c
>> $
>>
>> Now launch your test:
>>
>> $ ./mapstuff
>> pid:29397
>>
>> point justreadloop at it:
>>
>> $ ./justreadloop /proc/29397/smaps
>> pid:32567
>>
>> ... and then check the performance stats of justreadloop:
>>
>> # perf top -p 32567
>>
>> This is what I see:
>>
>> Samples: 232K of event 'cycles:ppp', Event count (approx.): 60448424325
>> Overhead  Shared Object     Symbol
>>   30,43%  [kernel]          [k] format_decode
>>    9,12%  [kernel]          [k] number
>>    7,66%  [kernel]          [k] vsnprintf
>>    7,06%  [kernel]          [k] __lock_acquire
>>    3,23%  [kernel]          [k] lock_release
>>    2,85%  [kernel]          [k] debug_lockdep_rcu_enabled
>>    2,25%  [kernel]          [k] skip_atoi
>>    2,13%  [kernel]          [k] lock_acquire
>>    2,05%  [kernel]          [k] show_smap
>
> This is a lot! I would expect the rmap walk to consume more but it even
> doesn't show up in the top consumers.
>
>> That's at least 30.43% + 9.12% + 7.66% = 47.21% of the task's kernel
>> time spent on evaluating format strings. The new interface
>> wouldn't have to spend that much time on format strings because there
>> isn't so much text to format.
>
> well, this is true of course but I would much rather try to reduce the
> overhead of smaps file than add a new file. The following should help
> already. I've measured ~7% systime cut down. I guess there is still some
> room for improvements but I have to say I'm far from being convinced about
> a new proc file just because we suck at dumping information to the
> userspace. If this was something like /proc/<pid>/stat which is
> essentially read all the time then it would be a different question but
> is the rss, pss going to be all that often? If yes why? These are the
> questions which should be answered before we even start considering the
> implementation.

@Sonny Rao: Maybe you can comment on how often and for how many processes 
this information is needed, and why it is useful.

> ---
> From 2a6883a7278ff8979808cb8e2dbcefe5ea3bf672 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Wed, 17 Aug 2016 14:00:13 +0200
> Subject: [PATCH] proc, smaps: reduce printing overhead
>
> seq_printf (used by show_smap) can be pretty expensive when dumping a
> lot of numbers.  Say we would like to get Rss and Pss from a particular
> process.  In order to measure a pathological case let's generate as many
> mappings as possible:
>
> $ cat max_mmap.c
> int main()
> {
> 	while (mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_ANON|MAP_SHARED|MAP_POPULATE, -1, 0) != MAP_FAILED)
> 		;
>
> 	printf("pid:%d\n", getpid());
> 	pause();
> 	return 0;
> }
>
> $ awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}' /proc/$pid/smaps
>
> would do a trick. The whole runtime is in the kernel space which is not
> that that unexpected because smaps is not the cheapest one (we have to
> do rmap walk etc.).
>
>         Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss} /proc/3050/smaps"
>         User time (seconds): 0.01
>         System time (seconds): 0.44
>         Percent of CPU this job got: 99%
>         Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.47
>
> But the perf says:
>     22.55%  awk      [kernel.kallsyms]  [k] format_decode
>     14.65%  awk      [kernel.kallsyms]  [k] vsnprintf
>      6.40%  awk      [kernel.kallsyms]  [k] number
>      2.53%  awk      [kernel.kallsyms]  [k] shmem_mapping
>      2.53%  awk      [kernel.kallsyms]  [k] show_smap
>      1.81%  awk      [kernel.kallsyms]  [k] lock_acquire
>
> we are spending most of the time actually generating the output which is
> quite lame. Let's replace seq_printf by seq_puts and seq_put_decimal_ull.
> This will give us:
>         Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss} /proc/3067/smaps"
>         User time (seconds): 0.00
>         System time (seconds): 0.41
>         Percent of CPU this job got: 99%
>         Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.42
>
> which will give us ~7% improvement. Perf says:
>     28.87%  awk      [kernel.kallsyms]  [k] seq_puts
>      5.30%  awk      [kernel.kallsyms]  [k] vsnprintf
>      4.54%  awk      [kernel.kallsyms]  [k] format_decode
>      3.73%  awk      [kernel.kallsyms]  [k] show_smap
>      2.56%  awk      [kernel.kallsyms]  [k] shmem_mapping
>      1.92%  awk      [kernel.kallsyms]  [k] number
>      1.80%  awk      [kernel.kallsyms]  [k] lock_acquire
>      1.75%  awk      [kernel.kallsyms]  [k] print_name_value_kb
>
> Reported-by: Jann Horn <jann@thejh.net>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  fs/proc/task_mmu.c | 63 ++++++++++++++++++++++--------------------------------
>  1 file changed, 25 insertions(+), 38 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 187d84ef9de9..41c24c0811da 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -721,6 +721,13 @@ void __weak arch_show_smap(struct seq_file *m, struct vm_area_struct *vma)
>  {
>  }
>
> +static void print_name_value_kb(struct seq_file *m, const char *name, unsigned long val)
> +{
> +	seq_puts(m, name);
> +	seq_put_decimal_ull(m, 0, val);
> +	seq_puts(m, " kB\n");
> +}
> +
>  static int show_smap(struct seq_file *m, void *v, int is_pid)
>  {
>  	struct vm_area_struct *vma = v;
> @@ -765,45 +772,25 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
>
>  	show_map_vma(m, vma, is_pid);
>
> -	seq_printf(m,
> -		   "Size:           %8lu kB\n"
> -		   "Rss:            %8lu kB\n"
> -		   "Pss:            %8lu kB\n"
> -		   "Shared_Clean:   %8lu kB\n"
> -		   "Shared_Dirty:   %8lu kB\n"
> -		   "Private_Clean:  %8lu kB\n"
> -		   "Private_Dirty:  %8lu kB\n"
> -		   "Referenced:     %8lu kB\n"
> -		   "Anonymous:      %8lu kB\n"
> -		   "AnonHugePages:  %8lu kB\n"
> -		   "ShmemPmdMapped: %8lu kB\n"
> -		   "Shared_Hugetlb: %8lu kB\n"
> -		   "Private_Hugetlb: %7lu kB\n"
> -		   "Swap:           %8lu kB\n"
> -		   "SwapPss:        %8lu kB\n"
> -		   "KernelPageSize: %8lu kB\n"
> -		   "MMUPageSize:    %8lu kB\n"
> -		   "Locked:         %8lu kB\n",
> -		   (vma->vm_end - vma->vm_start) >> 10,
> -		   mss.resident >> 10,
> -		   (unsigned long)(mss.pss >> (10 + PSS_SHIFT)),
> -		   mss.shared_clean  >> 10,
> -		   mss.shared_dirty  >> 10,
> -		   mss.private_clean >> 10,
> -		   mss.private_dirty >> 10,
> -		   mss.referenced >> 10,
> -		   mss.anonymous >> 10,
> -		   mss.anonymous_thp >> 10,
> -		   mss.shmem_thp >> 10,
> -		   mss.shared_hugetlb >> 10,
> -		   mss.private_hugetlb >> 10,
> -		   mss.swap >> 10,
> -		   (unsigned long)(mss.swap_pss >> (10 + PSS_SHIFT)),
> -		   vma_kernel_pagesize(vma) >> 10,
> -		   vma_mmu_pagesize(vma) >> 10,
> -		   (vma->vm_flags & VM_LOCKED) ?
> +	print_name_value_kb(m, "Size:           ", (vma->vm_end - vma->vm_start) >> 10);
> +	print_name_value_kb(m, "Rss:            ", mss.resident >> 10);
> +	print_name_value_kb(m, "Pss:            ", (unsigned long)(mss.pss >> (10 + PSS_SHIFT)));
> +	print_name_value_kb(m, "Shared_Clean:   ", mss.shared_clean  >> 10);
> +	print_name_value_kb(m, "Shared_Dirty:   ", mss.shared_dirty  >> 10);
> +	print_name_value_kb(m, "Private_Clean:  ", mss.private_clean >> 10);
> +	print_name_value_kb(m, "Private_Dirty:  ", mss.private_dirty >> 10);
> +	print_name_value_kb(m, "Referenced:     ", mss.referenced >> 10);
> +	print_name_value_kb(m, "Anonymous:      ", mss.anonymous >> 10);
> +	print_name_value_kb(m, "AnonHugePages:  ", mss.anonymous_thp >> 10);
> +	print_name_value_kb(m, "ShmemPmdMapped: ", mss.shmem_thp >> 10);
> +	print_name_value_kb(m, "Shared_Hugetlb: ", mss.shared_hugetlb >> 10);
> +	print_name_value_kb(m, "Private_Hugetlb: ", mss.private_hugetlb >> 10);
> +	print_name_value_kb(m, "Swap:           ", mss.swap >> 10);
> +	print_name_value_kb(m, "SwapPss:        ", (unsigned long)(mss.swap_pss >> (10 + PSS_SHIFT)));
> +	print_name_value_kb(m, "KernelPageSize: ", vma_kernel_pagesize(vma) >> 10);
> +	print_name_value_kb(m, "MMUPageSize:    ", vma_mmu_pagesize(vma) >> 10);
> +	print_name_value_kb(m, "Locked:         ", (vma->vm_flags & VM_LOCKED) ?
>  			(unsigned long)(mss.pss >> (10 + PSS_SHIFT)) : 0);
> -
>  	arch_show_smap(m, vma);
>  	show_smap_vma_flags(m, vma);
>  	m_cache_vma(m, vma);
>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
  2016-08-17 13:03                 ` Michal Hocko
  2016-08-17 16:48                   ` Robert Foss
@ 2016-08-17 18:57                   ` Sonny Rao
  2016-08-18  7:44                     ` Michal Hocko
  1 sibling, 1 reply; 49+ messages in thread
From: Sonny Rao @ 2016-08-17 18:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jann Horn, Robert Foss, corbet, Andrew Morton, Vlastimil Babka,
	Konstantin Khlebnikov, Hugh Dickins, Naoya Horiguchi,
	Minchan Kim, John Stultz, ross.zwisler, jmarchan,
	Johannes Weiner, Kees Cook, Al Viro, Cyrill Gorcunov,
	Robin Humble, David Rientjes, eric.engestrom, Janis Danisevskis,
	calvinowens, Alexey Dobriyan, Kirill A. Shutemov, ldufour,
	linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik

On Wed, Aug 17, 2016 at 6:03 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Wed 17-08-16 11:31:25, Jann Horn wrote:
>> On Wed, Aug 17, 2016 at 10:22:00AM +0200, Michal Hocko wrote:
>> > On Tue 16-08-16 12:46:51, Robert Foss wrote:
>> > [...]
>> > > $ /usr/bin/time -v -p zsh -c "repeat 25 { awk '/^Rss/{rss+=\$2}
>> > > /^Pss/{pss+=\$2} END {printf \"rss:%d pss:%d\n\", rss, pss}\'
>> > > /proc/5025/smaps }"
>> > > [...]
>> > >   Command being timed: "zsh -c repeat 25 { awk '/^Rss/{rss+=$2}
>> > > /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}\' /proc/5025/smaps
>> > > }"
>> > >   User time (seconds): 0.37
>> > >   System time (seconds): 0.45
>> > >   Percent of CPU this job got: 92%
>> > >   Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.89
>> >
>> > This is really unexpected. Where is the user time spent? Anyway, rather
>> > than measuring some random processes I've tried to measure something
>> > resembling the worst case. So I've created a simple program to mmap as
>> > much as possible:
>> >
>> > #include <sys/mman.h>
>> > #include <sys/types.h>
>> > #include <unistd.h>
>> > #include <stdio.h>
>> > int main()
>> > {
>> >     while (mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_ANON|MAP_SHARED|MAP_POPULATE, -1, 0) != MAP_FAILED)
>> >             ;
>> >
>> >     printf("pid:%d\n", getpid());
>> >     pause();
>> >     return 0;
>> > }
>>
>> Ah, nice, that's a reasonable test program. :)
>>
>>
>> > So with a reasonable user space the parsing is really not all that time
>> > consuming wrt. smaps handling. That being said I am still very skeptical
>> > about a dedicated proc file which accomplishes what userspace can done
>> > in a trivial way.
>>
>> Now, since your numbers showed that all the time is spent in the kernel,
>> also create this test program to just read that file over and over again:
>>
>> $ cat justreadloop.c
>> #include <sys/types.h>
>> #include <sys/stat.h>
>> #include <fcntl.h>
>> #include <sched.h>
>> #include <unistd.h>
>> #include <err.h>
>> #include <stdio.h>
>>
>> char buf[1000000];
>>
>> int main(int argc, char **argv) {
>>   printf("pid:%d\n", getpid());
>>   while (1) {
>>     int fd = open(argv[1], O_RDONLY);
>>     if (fd < 0) continue;
>>     if (read(fd, buf, sizeof(buf)) < 0)
>>       err(1, "read");
>>     close(fd);
>>   }
>> }
>> $ gcc -Wall -o justreadloop justreadloop.c
>> $
>>
>> Now launch your test:
>>
>> $ ./mapstuff
>> pid:29397
>>
>> point justreadloop at it:
>>
>> $ ./justreadloop /proc/29397/smaps
>> pid:32567
>>
>> ... and then check the performance stats of justreadloop:
>>
>> # perf top -p 32567
>>
>> This is what I see:
>>
>> Samples: 232K of event 'cycles:ppp', Event count (approx.): 60448424325
>> Overhead  Shared Object     Symbol
>>   30,43%  [kernel]          [k] format_decode
>>    9,12%  [kernel]          [k] number
>>    7,66%  [kernel]          [k] vsnprintf
>>    7,06%  [kernel]          [k] __lock_acquire
>>    3,23%  [kernel]          [k] lock_release
>>    2,85%  [kernel]          [k] debug_lockdep_rcu_enabled
>>    2,25%  [kernel]          [k] skip_atoi
>>    2,13%  [kernel]          [k] lock_acquire
>>    2,05%  [kernel]          [k] show_smap
>
> This is a lot! I would expect the rmap walk to consume more but it even
> doesn't show up in the top consumers.
>
>> That's at least 30.43% + 9.12% + 7.66% = 47.21% of the task's kernel
>> time spent on evaluating format strings. The new interface
>> wouldn't have to spend that much time on format strings because there
>> isn't so much text to format.
>
> well, this is true of course but I would much rather try to reduce the
> overhead of smaps file than add a new file. The following should help
> already. I've measured ~7% systime cut down. I guess there is still some
> room for improvements but I have to say I'm far from being convinced about
> a new proc file just because we suck at dumping information to the
> userspace.
> If this was something like /proc/<pid>/stat which is
> essentially read all the time then it would be a different question but
> is the rss, pss going to be all that often? If yes why?

If the question is why we need to read RSS, PSS, Private_*, Swap
and the other fields so often:

I have two use cases so far involving monitoring per-process memory
usage, and we usually need to read stats for about 25 processes.

Here's a timing example on a fairly recent ARM system: a 4-core RK3288
running at 1.8GHz.

localhost ~ # time cat /proc/25946/smaps > /dev/null

real    0m0.036s
user    0m0.020s
sys     0m0.020s

localhost ~ # time cat /proc/25946/totmaps > /dev/null

real    0m0.027s
user    0m0.010s
sys     0m0.010s
localhost ~ #

I'll ignore the user time for now. We see about 20 ms of system
time with smaps and 10 ms with totmaps; with 20 similar processes that
would be 400 milliseconds of CPU time for the kernel to get this
information from smaps vs 200 milliseconds with totmaps. Even totmaps
is still pretty slow, but much better than smaps.

Use cases:
1) Basic task monitoring -- a "top"-like view that shows memory
consumption including PSS, Private and Swap. A 1 second update interval
means about 40% of one CPU is spent in the kernel gathering the data
with smaps (see the sketch below).

2) User space OOM handling -- we'd rather do a more graceful shutdown
than let the kernel's OOM killer activate, and we need to gather this
information to make that decision much faster than 400ms.
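
For concreteness, a minimal sketch of the monitoring loop from 1) above:
walk a set of PIDs and sum the Rss/Pss/Swap lines of each
/proc/<pid>/smaps. The command-line PID list and the choice of fields are
illustrative assumptions, not part of the proposed interface; the point is
that it exercises the same read-and-parse path the 400ms figure comes from.

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Illustrative only: sum the Rss/Pss/Swap fields of one pid's smaps. */
static int sum_smaps(pid_t pid, unsigned long *rss, unsigned long *pss,
		     unsigned long *swap)
{
	char path[64], line[256];
	unsigned long kb;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/smaps", (int)pid);
	f = fopen(path, "r");
	if (!f)
		return -1;

	*rss = *pss = *swap = 0;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "Rss: %lu kB", &kb) == 1)
			*rss += kb;
		else if (sscanf(line, "Pss: %lu kB", &kb) == 1)
			*pss += kb;
		else if (sscanf(line, "Swap: %lu kB", &kb) == 1)
			*swap += kb;
	}
	fclose(f);
	return 0;
}

int main(int argc, char **argv)
{
	int i;

	/* A real monitor would poll ~25 PIDs once per second. */
	for (i = 1; i < argc; i++) {
		unsigned long rss, pss, swap;
		pid_t pid = atoi(argv[i]);

		if (sum_smaps(pid, &rss, &pss, &swap) == 0)
			printf("pid %d: rss:%lu kB pss:%lu kB swap:%lu kB\n",
			       (int)pid, rss, pss, swap);
	}
	return 0;
}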

> These are the
> questions which should be answered before we even start considering the
> implementation.
> ---
> From 2a6883a7278ff8979808cb8e2dbcefe5ea3bf672 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Wed, 17 Aug 2016 14:00:13 +0200
> Subject: [PATCH] proc, smaps: reduce printing overhead
>
> seq_printf (used by show_smap) can be pretty expensive when dumping a
> lot of numbers.  Say we would like to get Rss and Pss from a particular
> process.  In order to measure a pathological case let's generate as many
> mappings as possible:
>
> $ cat max_mmap.c
> int main()
> {
>         while (mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_ANON|MAP_SHARED|MAP_POPULATE, -1, 0) != MAP_FAILED)
>                 ;
>
>         printf("pid:%d\n", getpid());
>         pause();
>         return 0;
> }
>
> $ awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}' /proc/$pid/smaps
>
> would do a trick. The whole runtime is in the kernel space which is not
> that that unexpected because smaps is not the cheapest one (we have to
> do rmap walk etc.).
>
>         Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss} /proc/3050/smaps"
>         User time (seconds): 0.01
>         System time (seconds): 0.44
>         Percent of CPU this job got: 99%
>         Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.47
>
> But the perf says:
>     22.55%  awk      [kernel.kallsyms]  [k] format_decode
>     14.65%  awk      [kernel.kallsyms]  [k] vsnprintf
>      6.40%  awk      [kernel.kallsyms]  [k] number
>      2.53%  awk      [kernel.kallsyms]  [k] shmem_mapping
>      2.53%  awk      [kernel.kallsyms]  [k] show_smap
>      1.81%  awk      [kernel.kallsyms]  [k] lock_acquire
>
> we are spending most of the time actually generating the output which is
> quite lame. Let's replace seq_printf by seq_puts and seq_put_decimal_ull.
> This will give us:
>         Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss} /proc/3067/smaps"
>         User time (seconds): 0.00
>         System time (seconds): 0.41
>         Percent of CPU this job got: 99%
>         Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.42
>
> which will give us ~7% improvement. Perf says:
>     28.87%  awk      [kernel.kallsyms]  [k] seq_puts
>      5.30%  awk      [kernel.kallsyms]  [k] vsnprintf
>      4.54%  awk      [kernel.kallsyms]  [k] format_decode
>      3.73%  awk      [kernel.kallsyms]  [k] show_smap
>      2.56%  awk      [kernel.kallsyms]  [k] shmem_mapping
>      1.92%  awk      [kernel.kallsyms]  [k] number
>      1.80%  awk      [kernel.kallsyms]  [k] lock_acquire
>      1.75%  awk      [kernel.kallsyms]  [k] print_name_value_kb
>
> Reported-by: Jann Horn <jann@thejh.net>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  fs/proc/task_mmu.c | 63 ++++++++++++++++++++++--------------------------------
>  1 file changed, 25 insertions(+), 38 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 187d84ef9de9..41c24c0811da 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -721,6 +721,13 @@ void __weak arch_show_smap(struct seq_file *m, struct vm_area_struct *vma)
>  {
>  }
>
> +static void print_name_value_kb(struct seq_file *m, const char *name, unsigned long val)
> +{
> +       seq_puts(m, name);
> +       seq_put_decimal_ull(m, 0, val);
> +       seq_puts(m, " kB\n");
> +}
> +
>  static int show_smap(struct seq_file *m, void *v, int is_pid)
>  {
>         struct vm_area_struct *vma = v;
> @@ -765,45 +772,25 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
>
>         show_map_vma(m, vma, is_pid);
>
> -       seq_printf(m,
> -                  "Size:           %8lu kB\n"
> -                  "Rss:            %8lu kB\n"
> -                  "Pss:            %8lu kB\n"
> -                  "Shared_Clean:   %8lu kB\n"
> -                  "Shared_Dirty:   %8lu kB\n"
> -                  "Private_Clean:  %8lu kB\n"
> -                  "Private_Dirty:  %8lu kB\n"
> -                  "Referenced:     %8lu kB\n"
> -                  "Anonymous:      %8lu kB\n"
> -                  "AnonHugePages:  %8lu kB\n"
> -                  "ShmemPmdMapped: %8lu kB\n"
> -                  "Shared_Hugetlb: %8lu kB\n"
> -                  "Private_Hugetlb: %7lu kB\n"
> -                  "Swap:           %8lu kB\n"
> -                  "SwapPss:        %8lu kB\n"
> -                  "KernelPageSize: %8lu kB\n"
> -                  "MMUPageSize:    %8lu kB\n"
> -                  "Locked:         %8lu kB\n",
> -                  (vma->vm_end - vma->vm_start) >> 10,
> -                  mss.resident >> 10,
> -                  (unsigned long)(mss.pss >> (10 + PSS_SHIFT)),
> -                  mss.shared_clean  >> 10,
> -                  mss.shared_dirty  >> 10,
> -                  mss.private_clean >> 10,
> -                  mss.private_dirty >> 10,
> -                  mss.referenced >> 10,
> -                  mss.anonymous >> 10,
> -                  mss.anonymous_thp >> 10,
> -                  mss.shmem_thp >> 10,
> -                  mss.shared_hugetlb >> 10,
> -                  mss.private_hugetlb >> 10,
> -                  mss.swap >> 10,
> -                  (unsigned long)(mss.swap_pss >> (10 + PSS_SHIFT)),
> -                  vma_kernel_pagesize(vma) >> 10,
> -                  vma_mmu_pagesize(vma) >> 10,
> -                  (vma->vm_flags & VM_LOCKED) ?
> +       print_name_value_kb(m, "Size:           ", (vma->vm_end - vma->vm_start) >> 10);
> +       print_name_value_kb(m, "Rss:            ", mss.resident >> 10);
> +       print_name_value_kb(m, "Pss:            ", (unsigned long)(mss.pss >> (10 + PSS_SHIFT)));
> +       print_name_value_kb(m, "Shared_Clean:   ", mss.shared_clean  >> 10);
> +       print_name_value_kb(m, "Shared_Dirty:   ", mss.shared_dirty  >> 10);
> +       print_name_value_kb(m, "Private_Clean:  ", mss.private_clean >> 10);
> +       print_name_value_kb(m, "Private_Dirty:  ", mss.private_dirty >> 10);
> +       print_name_value_kb(m, "Referenced:     ", mss.referenced >> 10);
> +       print_name_value_kb(m, "Anonymous:      ", mss.anonymous >> 10);
> +       print_name_value_kb(m, "AnonHugePages:  ", mss.anonymous_thp >> 10);
> +       print_name_value_kb(m, "ShmemPmdMapped: ", mss.shmem_thp >> 10);
> +       print_name_value_kb(m, "Shared_Hugetlb: ", mss.shared_hugetlb >> 10);
> +       print_name_value_kb(m, "Private_Hugetlb: ", mss.private_hugetlb >> 10);
> +       print_name_value_kb(m, "Swap:           ", mss.swap >> 10);
> +       print_name_value_kb(m, "SwapPss:        ", (unsigned long)(mss.swap_pss >> (10 + PSS_SHIFT)));
> +       print_name_value_kb(m, "KernelPageSize: ", vma_kernel_pagesize(vma) >> 10);
> +       print_name_value_kb(m, "MMUPageSize:    ", vma_mmu_pagesize(vma) >> 10);
> +       print_name_value_kb(m, "Locked:         ", (vma->vm_flags & VM_LOCKED) ?
>                         (unsigned long)(mss.pss >> (10 + PSS_SHIFT)) : 0);
> -
>         arch_show_smap(m, vma);
>         show_smap_vma_flags(m, vma);
>         m_cache_vma(m, vma);
> --
> 2.8.1
>
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
  2016-08-17 18:57                   ` Sonny Rao
@ 2016-08-18  7:44                     ` Michal Hocko
  2016-08-18 17:47                       ` Sonny Rao
  0 siblings, 1 reply; 49+ messages in thread
From: Michal Hocko @ 2016-08-18  7:44 UTC (permalink / raw)
  To: Sonny Rao
  Cc: Jann Horn, Robert Foss, corbet, Andrew Morton, Vlastimil Babka,
	Konstantin Khlebnikov, Hugh Dickins, Naoya Horiguchi,
	Minchan Kim, John Stultz, ross.zwisler, jmarchan,
	Johannes Weiner, Kees Cook, Al Viro, Cyrill Gorcunov,
	Robin Humble, David Rientjes, eric.engestrom, Janis Danisevskis,
	calvinowens, Alexey Dobriyan, Kirill A. Shutemov, ldufour,
	linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik

On Wed 17-08-16 11:57:56, Sonny Rao wrote:
> On Wed, Aug 17, 2016 at 6:03 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > On Wed 17-08-16 11:31:25, Jann Horn wrote:
[...]
> >> That's at least 30.43% + 9.12% + 7.66% = 47.21% of the task's kernel
> >> time spent on evaluating format strings. The new interface
> >> wouldn't have to spend that much time on format strings because there
> >> isn't so much text to format.
> >
> > well, this is true of course but I would much rather try to reduce the
> > overhead of smaps file than add a new file. The following should help
> > already. I've measured ~7% systime cut down. I guess there is still some
> > room for improvements but I have to say I'm far from being convinced about
> > a new proc file just because we suck at dumping information to the
> > userspace.
> > If this was something like /proc/<pid>/stat which is
> > essentially read all the time then it would be a different question but
> > is the rss, pss going to be all that often? If yes why?
> 
> If the question is why do we need to read RSS, PSS, Private_*, Swap
> and the other fields so often?
> 
> I have two use cases so far involving monitoring per-process memory
> usage, and we usually need to read stats for about 25 processes.
> 
> Here's a timing example on an fairly recent ARM system 4 core RK3288
> running at 1.8Ghz
> 
> localhost ~ # time cat /proc/25946/smaps > /dev/null
> 
> real    0m0.036s
> user    0m0.020s
> sys     0m0.020s
> 
> localhost ~ # time cat /proc/25946/totmaps > /dev/null
> 
> real    0m0.027s
> user    0m0.010s
> sys     0m0.010s
> localhost ~ #
> 
> I'll ignore the user time for now, and we see about 20 ms of system
> time with smaps and 10 ms with totmaps, with 20 similar processes it
> would be 400 milliseconds of cpu time for the kernel to get this
> information from smaps vs 200 milliseconds with totmaps.  Even totmaps
> is still pretty slow, but much better than smaps.
> 
> Use cases:
> 1) Basic task monitoring -- like "top" that shows memory consumption
> including PSS, Private, Swap
>     1 second update means about 40% of one CPU is spent in the kernel
> gathering the data with smaps

I would argue that even 20% is way too much for such monitoring. What
is the value of doing it so often that 20 vs 40ms really matters?

> 2) User space OOM handling -- we'd rather do a more graceful shutdown
> than let the kernel's OOM killer activate and need to gather this
> information and we'd like to be able to get this information to make
> the decision much faster than 400ms

Global OOM handling in userspace is really dubious if you ask me. I
understand you want something better than SIGKILL, and in fact this is
already possible with the memory cgroup controller (btw. memcg will give
you cheap access to rss and to the amount of shared and swapped out memory
as well). Anyway, if you are getting close to OOM your system will most
probably be really busy, and chances are that reading your new file will
also take much more time. I am also not quite sure how pss is useful for
oom decisions.
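
For comparison, here is a minimal sketch of the memcg route mentioned
above: read the aggregated counters from the group's memory.stat instead
of walking per-process smaps. The cgroup v1 mount point and the group name
below are assumptions for illustration, and the swap counter only shows up
when swap accounting is enabled.

#include <stdio.h>
#include <string.h>

int main(void)
{
	/* Assumed v1 layout; adjust the mount point and group name. */
	const char *path = "/sys/fs/cgroup/memory/mygroup/memory.stat";
	char line[256], key[64];
	unsigned long long val;
	FILE *f = fopen(path, "r");

	if (!f) {
		perror("fopen");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "%63s %llu", key, &val) != 2)
			continue;
		/* memory.stat exposes per-group totals such as "cache",
		 * "rss", "mapped_file" and "swap". */
		if (!strcmp(key, "rss") || !strcmp(key, "cache") ||
		    !strcmp(key, "swap"))
			printf("%s %llu\n", key, val);
	}
	fclose(f);
	return 0;
}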

Don't get me wrong, /proc/<pid>/totmaps might be suitable for your
specific usecase, but so far I haven't heard any sound argument for it
being generally usable. It is true that smaps is unnecessarily costly, but
at least I can see some room for improvement. A simple patch I've
posted cut the formatting overhead by 7%. Maybe we can do more.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
  2016-08-18  7:44                     ` Michal Hocko
@ 2016-08-18 17:47                       ` Sonny Rao
  2016-08-18 18:01                         ` Michal Hocko
  0 siblings, 1 reply; 49+ messages in thread
From: Sonny Rao @ 2016-08-18 17:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jann Horn, Robert Foss, corbet, Andrew Morton, Vlastimil Babka,
	Konstantin Khlebnikov, Hugh Dickins, Naoya Horiguchi,
	Minchan Kim, John Stultz, ross.zwisler, jmarchan,
	Johannes Weiner, Kees Cook, Al Viro, Cyrill Gorcunov,
	Robin Humble, David Rientjes, eric.engestrom, Janis Danisevskis,
	calvinowens, Alexey Dobriyan, Kirill A. Shutemov, ldufour,
	linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik

On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Wed 17-08-16 11:57:56, Sonny Rao wrote:
>> On Wed, Aug 17, 2016 at 6:03 AM, Michal Hocko <mhocko@kernel.org> wrote:
>> > On Wed 17-08-16 11:31:25, Jann Horn wrote:
> [...]
>> >> That's at least 30.43% + 9.12% + 7.66% = 47.21% of the task's kernel
>> >> time spent on evaluating format strings. The new interface
>> >> wouldn't have to spend that much time on format strings because there
>> >> isn't so much text to format.
>> >
>> > well, this is true of course but I would much rather try to reduce the
>> > overhead of smaps file than add a new file. The following should help
>> > already. I've measured ~7% systime cut down. I guess there is still some
>> > room for improvements but I have to say I'm far from being convinced about
>> > a new proc file just because we suck at dumping information to the
>> > userspace.
>> > If this was something like /proc/<pid>/stat which is
>> > essentially read all the time then it would be a different question but
>> > is the rss, pss going to be all that often? If yes why?
>>
>> If the question is why do we need to read RSS, PSS, Private_*, Swap
>> and the other fields so often?
>>
>> I have two use cases so far involving monitoring per-process memory
>> usage, and we usually need to read stats for about 25 processes.
>>
>> Here's a timing example on an fairly recent ARM system 4 core RK3288
>> running at 1.8Ghz
>>
>> localhost ~ # time cat /proc/25946/smaps > /dev/null
>>
>> real    0m0.036s
>> user    0m0.020s
>> sys     0m0.020s
>>
>> localhost ~ # time cat /proc/25946/totmaps > /dev/null
>>
>> real    0m0.027s
>> user    0m0.010s
>> sys     0m0.010s
>> localhost ~ #
>>
>> I'll ignore the user time for now, and we see about 20 ms of system
>> time with smaps and 10 ms with totmaps, with 20 similar processes it
>> would be 400 milliseconds of cpu time for the kernel to get this
>> information from smaps vs 200 milliseconds with totmaps.  Even totmaps
>> is still pretty slow, but much better than smaps.
>>
>> Use cases:
>> 1) Basic task monitoring -- like "top" that shows memory consumption
>> including PSS, Private, Swap
>>     1 second update means about 40% of one CPU is spent in the kernel
>> gathering the data with smaps
>
> I would argue that even 20% is way too much for such a monitoring. What
> is the value to do it so often tha 20 vs 40ms really matters?

Yeah it is too much (I believe I said that) but it's significantly better.

>> 2) User space OOM handling -- we'd rather do a more graceful shutdown
>> than let the kernel's OOM killer activate and need to gather this
>> information and we'd like to be able to get this information to make
>> the decision much faster than 400ms
>
> Global OOM handling in userspace is really dubious if you ask me. I
> understand you want something better than SIGKILL and in fact this is
> already possible with memory cgroup controller (btw. memcg will give
> you a cheap access to rss, amount of shared, swapped out memory as
> well). Anyway if you are getting close to the OOM your system will most
> probably be really busy and chances are that also reading your new file
> will take much more time. I am also not quite sure how is pss useful for
> oom decisions.

I mentioned it before, but based on experience RSS just isn't good
enough -- there's too much sharing going on in our use case to make
the correct decision based on RSS.  If RSS were good enough, simply
put, this patch wouldn't exist.  So even with memcg I think we'd have
the same problem?

>
> Don't take me wrong, /proc/<pid>/totmaps might be suitable for your
> specific usecase but so far I haven't heard any sound argument for it to
> be generally usable. It is true that smaps is unnecessarily costly but
> at least I can see some room for improvements. A simple patch I've
> posted cut the formatting overhead by 7%. Maybe we can do more.

It seems like a general problem that if you want these values the
existing kernel interface can be very expensive, so it would be
generally usable by any application which wants a per-process PSS,
private data, dirty data or swap value. I mentioned two use cases,
but I guess I don't understand the comment about why it's not usable
by other use cases.

> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
  2016-08-18 17:47                       ` Sonny Rao
@ 2016-08-18 18:01                         ` Michal Hocko
  2016-08-18 21:05                           ` Robert Foss
                                             ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Michal Hocko @ 2016-08-18 18:01 UTC (permalink / raw)
  To: Sonny Rao
  Cc: Jann Horn, Robert Foss, corbet, Andrew Morton, Vlastimil Babka,
	Konstantin Khlebnikov, Hugh Dickins, Naoya Horiguchi,
	Minchan Kim, John Stultz, ross.zwisler, jmarchan,
	Johannes Weiner, Kees Cook, Al Viro, Cyrill Gorcunov,
	Robin Humble, David Rientjes, eric.engestrom, Janis Danisevskis,
	calvinowens, Alexey Dobriyan, Kirill A. Shutemov, ldufour,
	linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik

On Thu 18-08-16 10:47:57, Sonny Rao wrote:
> On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > On Wed 17-08-16 11:57:56, Sonny Rao wrote:
[...]
> >> 2) User space OOM handling -- we'd rather do a more graceful shutdown
> >> than let the kernel's OOM killer activate and need to gather this
> >> information and we'd like to be able to get this information to make
> >> the decision much faster than 400ms
> >
> > Global OOM handling in userspace is really dubious if you ask me. I
> > understand you want something better than SIGKILL and in fact this is
> > already possible with memory cgroup controller (btw. memcg will give
> > you a cheap access to rss, amount of shared, swapped out memory as
> > well). Anyway if you are getting close to the OOM your system will most
> > probably be really busy and chances are that also reading your new file
> > will take much more time. I am also not quite sure how is pss useful for
> > oom decisions.
> 
> I mentioned it before, but based on experience RSS just isn't good
> enough -- there's too much sharing going on in our use case to make
> the correct decision based on RSS.  If RSS were good enough, simply
> put, this patch wouldn't exist.

But that doesn't answer my question, I am afraid. So how exactly do you
use pss for oom decisions?

> So even with memcg I think we'd have the same problem?

memcg will give you instant anon and shared counters for all processes in
the memcg.

> > Don't take me wrong, /proc/<pid>/totmaps might be suitable for your
> > specific usecase but so far I haven't heard any sound argument for it to
> > be generally usable. It is true that smaps is unnecessarily costly but
> > at least I can see some room for improvements. A simple patch I've
> > posted cut the formatting overhead by 7%. Maybe we can do more.
> 
> It seems like a general problem that if you want these values the
> existing kernel interface can be very expensive, so it would be
> generally usable by any application which wants a per process PSS,
> private data, dirty data or swap value.

yes this is really unfortunate. And if at all possible we should address
that. Precise values require the expensive rmap walk. We can introduce
some caching to help with that. But so far it seems the biggest overhead
is simply formatting the output, and that should be addressed before any
new proc file is added.
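
As a rough userspace analogy for that formatting overhead (this is not the
kernel code, and absolute numbers depend on the libc and machine), the
sketch below compares a printf-style conversion of a single smaps-like
line against a hand-rolled decimal writer, mirroring the seq_printf vs
seq_put_decimal_ull change in the patch above:

#include <stdio.h>
#include <string.h>
#include <time.h>

static const char prefix[] = "Rss:            ";

/* Write v in decimal into dst without padding; returns the length. */
static size_t put_decimal(char *dst, unsigned long v)
{
	char tmp[24];
	size_t i, n = 0;

	do {
		tmp[n++] = '0' + (v % 10);
		v /= 10;
	} while (v);
	for (i = 0; i < n; i++)
		dst[i] = tmp[n - 1 - i];
	return n;
}

static double secs(struct timespec a, struct timespec b)
{
	return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
	const unsigned long iters = 10 * 1000 * 1000;
	volatile size_t sink = 0;
	struct timespec a, b;
	char buf[64];
	unsigned long i;

	/* Format-string path, similar in spirit to seq_printf. */
	clock_gettime(CLOCK_MONOTONIC, &a);
	for (i = 0; i < iters; i++)
		sink += snprintf(buf, sizeof(buf), "Rss:            %8lu kB\n", i);
	clock_gettime(CLOCK_MONOTONIC, &b);
	printf("snprintf: %.3fs\n", secs(a, b));

	/* Direct emit path, similar in spirit to seq_puts + seq_put_decimal_ull. */
	clock_gettime(CLOCK_MONOTONIC, &a);
	for (i = 0; i < iters; i++) {
		char *p = buf;

		memcpy(p, prefix, sizeof(prefix) - 1);
		p += sizeof(prefix) - 1;
		p += put_decimal(p, i);
		memcpy(p, " kB\n", 4);
		p += 4;
		sink += p - buf;
	}
	clock_gettime(CLOCK_MONOTONIC, &b);
	printf("manual:   %.3fs\n", secs(a, b));

	return 0;
}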

> I mentioned two use cases, but I guess I don't understand the comment
> about why it's not usable by other use cases.

I might be wrong here, but the use of pss is quite limited and I do not
remember anybody asking for large optimizations in that area. I still do
not understand your use cases properly, so I am quite skeptical about the
general usefulness of a new file.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
  2016-08-18 18:01                         ` Michal Hocko
@ 2016-08-18 21:05                           ` Robert Foss
  2016-08-19  6:27                             ` Sonny Rao
  2016-08-19  2:26                           ` Minchan Kim
  2016-08-19  6:43                           ` [PACTH v2 0/3] Implement /proc/<pid>/totmaps Sonny Rao
  2 siblings, 1 reply; 49+ messages in thread
From: Robert Foss @ 2016-08-18 21:05 UTC (permalink / raw)
  To: Michal Hocko, Sonny Rao
  Cc: Jann Horn, corbet, Andrew Morton, Vlastimil Babka,
	Konstantin Khlebnikov, Hugh Dickins, Naoya Horiguchi,
	Minchan Kim, John Stultz, ross.zwisler, jmarchan,
	Johannes Weiner, Kees Cook, Al Viro, Cyrill Gorcunov,
	Robin Humble, David Rientjes, eric.engestrom, Janis Danisevskis,
	calvinowens, Alexey Dobriyan, Kirill A. Shutemov, ldufour,
	linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik



On 2016-08-18 02:01 PM, Michal Hocko wrote:
> On Thu 18-08-16 10:47:57, Sonny Rao wrote:
>> On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko <mhocko@kernel.org> wrote:
>>> On Wed 17-08-16 11:57:56, Sonny Rao wrote:
> [...]
>>>> 2) User space OOM handling -- we'd rather do a more graceful shutdown
>>>> than let the kernel's OOM killer activate and need to gather this
>>>> information and we'd like to be able to get this information to make
>>>> the decision much faster than 400ms
>>>
>>> Global OOM handling in userspace is really dubious if you ask me. I
>>> understand you want something better than SIGKILL and in fact this is
>>> already possible with memory cgroup controller (btw. memcg will give
>>> you a cheap access to rss, amount of shared, swapped out memory as
>>> well). Anyway if you are getting close to the OOM your system will most
>>> probably be really busy and chances are that also reading your new file
>>> will take much more time. I am also not quite sure how is pss useful for
>>> oom decisions.
>>
>> I mentioned it before, but based on experience RSS just isn't good
>> enough -- there's too much sharing going on in our use case to make
>> the correct decision based on RSS.  If RSS were good enough, simply
>> put, this patch wouldn't exist.
>
> But that doesn't answer my question, I am afraid. So how exactly do you
> use pss for oom decisions?
>
>> So even with memcg I think we'd have the same problem?
>
> memcg will give you instant anon, shared counters for all processes in
> the memcg.

Is it technically feasible to add instant pss support to memcg?

@Sonny Rao: Would using cgroups be acceptable for chromiumos?

>
>>> Don't take me wrong, /proc/<pid>/totmaps might be suitable for your
>>> specific usecase but so far I haven't heard any sound argument for it to
>>> be generally usable. It is true that smaps is unnecessarily costly but
>>> at least I can see some room for improvements. A simple patch I've
>>> posted cut the formatting overhead by 7%. Maybe we can do more.
>>
>> It seems like a general problem that if you want these values the
>> existing kernel interface can be very expensive, so it would be
>> generally usable by any application which wants a per process PSS,
>> private data, dirty data or swap value.
>
> yes this is really unfortunate. And if at all possible we should address
> that. Precise values require the expensive rmap walk. We can introduce
> some caching to help that. But so far it seems the biggest overhead is
> to simply format the output and that should be addressed before any new
> proc file is added.
>
>> I mentioned two use cases, but I guess I don't understand the comment
>> about why it's not usable by other use cases.
>
> I might be wrong here but a use of pss is quite limited and I do not
> remember anybody asking for large optimizations in that area. I still do
> not understand your use cases properly so I am quite skeptical about a
> general usefulness of a new file.
>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
  2016-08-18 18:01                         ` Michal Hocko
  2016-08-18 21:05                           ` Robert Foss
@ 2016-08-19  2:26                           ` Minchan Kim
  2016-08-19  6:47                             ` Sonny Rao
  2016-08-19  8:05                             ` Michal Hocko
  2016-08-19  6:43                           ` [PACTH v2 0/3] Implement /proc/<pid>/totmaps Sonny Rao
  2 siblings, 2 replies; 49+ messages in thread
From: Minchan Kim @ 2016-08-19  2:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Sonny Rao, Jann Horn, Robert Foss, corbet, Andrew Morton,
	Vlastimil Babka, Konstantin Khlebnikov, Hugh Dickins,
	Naoya Horiguchi, John Stultz, ross.zwisler, jmarchan,
	Johannes Weiner, Kees Cook, Al Viro, Cyrill Gorcunov,
	Robin Humble, David Rientjes, eric.engestrom, Janis Danisevskis,
	calvinowens, Alexey Dobriyan, Kirill A. Shutemov, ldufour,
	linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik

Hi Michal,

On Thu, Aug 18, 2016 at 08:01:04PM +0200, Michal Hocko wrote:
> On Thu 18-08-16 10:47:57, Sonny Rao wrote:
> > On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > > On Wed 17-08-16 11:57:56, Sonny Rao wrote:
> [...]
> > >> 2) User space OOM handling -- we'd rather do a more graceful shutdown
> > >> than let the kernel's OOM killer activate and need to gather this
> > >> information and we'd like to be able to get this information to make
> > >> the decision much faster than 400ms
> > >
> > > Global OOM handling in userspace is really dubious if you ask me. I
> > > understand you want something better than SIGKILL and in fact this is
> > > already possible with memory cgroup controller (btw. memcg will give
> > > you a cheap access to rss, amount of shared, swapped out memory as
> > > well). Anyway if you are getting close to the OOM your system will most
> > > probably be really busy and chances are that also reading your new file
> > > will take much more time. I am also not quite sure how is pss useful for
> > > oom decisions.
> > 
> > I mentioned it before, but based on experience RSS just isn't good
> > enough -- there's too much sharing going on in our use case to make
> > the correct decision based on RSS.  If RSS were good enough, simply
> > put, this patch wouldn't exist.
> 
> But that doesn't answer my question, I am afraid. So how exactly do you
> use pss for oom decisions?

My case is not about OOM decisions, but I agree it would be great if we
could get *fast* smaps summary information.

PSS is a really great tool for figuring out how processes consume memory,
more accurately than RSS. We have been using it for per-process memory
monitoring. Although it is not used for OOM decisions, it would be great
if it were sped up, because we don't want to spend a lot of CPU time just
on monitoring.

For our usecase, we don't need AnonHugePages, ShmemPmdMapped, Shared_Hugetlb,
Private_Hugetlb, KernelPageSize or MMUPageSize because we never enable THP
or hugetlb. Additionally, Locked can be derived from the vma flags, so we
don't need it either. We don't even need the address ranges for plain
monitoring, when we aren't investigating in detail.

Although none of this is severe overhead, why emit information that isn't
used? And it keeps bloating, day by day. :( With that, userspace tools
have to spend more time parsing it, which is pointless.

Having said that, I'm not a fan of creating a new stat knob for this,
either. How about appending the summary information at the end of smaps?
Then monitoring users can just open the file, lseek to (end - 1) and
read only the summary.

Thanks.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
  2016-08-18 21:05                           ` Robert Foss
@ 2016-08-19  6:27                             ` Sonny Rao
  0 siblings, 0 replies; 49+ messages in thread
From: Sonny Rao @ 2016-08-19  6:27 UTC (permalink / raw)
  To: Robert Foss
  Cc: Michal Hocko, Jann Horn, Jonathan Corbet, Andrew Morton,
	Vlastimil Babka, Konstantin Khlebnikov, Hugh Dickins,
	Naoya Horiguchi, Minchan Kim, John Stultz, ross.zwisler,
	jmarchan, Johannes Weiner, Kees Cook, Al Viro, Cyrill Gorcunov,
	Robin Humble, David Rientjes, eric.engestrom, Janis Danisevskis,
	calvinowens, Alexey Dobriyan, Kirill A. Shutemov, ldufour,
	linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik

On Thu, Aug 18, 2016 at 2:05 PM, Robert Foss <robert.foss@collabora.com> wrote:
>
>
> On 2016-08-18 02:01 PM, Michal Hocko wrote:
>>
>> On Thu 18-08-16 10:47:57, Sonny Rao wrote:
>>>
>>> On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko <mhocko@kernel.org> wrote:
>>>>
>>>> On Wed 17-08-16 11:57:56, Sonny Rao wrote:
>>
>> [...]
>>>>>
>>>>> 2) User space OOM handling -- we'd rather do a more graceful shutdown
>>>>> than let the kernel's OOM killer activate and need to gather this
>>>>> information and we'd like to be able to get this information to make
>>>>> the decision much faster than 400ms
>>>>
>>>>
>>>> Global OOM handling in userspace is really dubious if you ask me. I
>>>> understand you want something better than SIGKILL and in fact this is
>>>> already possible with memory cgroup controller (btw. memcg will give
>>>> you a cheap access to rss, amount of shared, swapped out memory as
>>>> well). Anyway if you are getting close to the OOM your system will most
>>>> probably be really busy and chances are that also reading your new file
>>>> will take much more time. I am also not quite sure how is pss useful for
>>>> oom decisions.
>>>
>>>
>>> I mentioned it before, but based on experience RSS just isn't good
>>> enough -- there's too much sharing going on in our use case to make
>>> the correct decision based on RSS.  If RSS were good enough, simply
>>> put, this patch wouldn't exist.
>>
>>
>> But that doesn't answer my question, I am afraid. So how exactly do you
>> use pss for oom decisions?
>>
>>> So even with memcg I think we'd have the same problem?
>>
>>
>> memcg will give you instant anon, shared counters for all processes in
>> the memcg.
>
>
> Is it technically feasible to add instant pss support to memcg?
>
> @Sonny Rao: Would using cgroups be acceptable for chromiumos?

It's possible, though I think we'd end up putting each renderer in
its own cgroup to get the PSS stat, so it seems a bit like overkill.
I think memcg also has some overhead that we'd need to quantify, but I
could be mistaken about that.

>
>
>>
>>>> Don't take me wrong, /proc/<pid>/totmaps might be suitable for your
>>>> specific usecase but so far I haven't heard any sound argument for it to
>>>> be generally usable. It is true that smaps is unnecessarily costly but
>>>> at least I can see some room for improvements. A simple patch I've
>>>> posted cut the formatting overhead by 7%. Maybe we can do more.
>>>
>>>
>>> It seems like a general problem that if you want these values the
>>> existing kernel interface can be very expensive, so it would be
>>> generally usable by any application which wants a per process PSS,
>>> private data, dirty data or swap value.
>>
>>
>> yes this is really unfortunate. And if at all possible we should address
>> that. Precise values require the expensive rmap walk. We can introduce
>> some caching to help that. But so far it seems the biggest overhead is
>> to simply format the output and that should be addressed before any new
>> proc file is added.
>>
>>> I mentioned two use cases, but I guess I don't understand the comment
>>> about why it's not usable by other use cases.
>>
>>
>> I might be wrong here but a use of pss is quite limited and I do not
>> remember anybody asking for large optimizations in that area. I still do
>> not understand your use cases properly so I am quite skeptical about a
>> general usefulness of a new file.
>>
>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
  2016-08-18 18:01                         ` Michal Hocko
  2016-08-18 21:05                           ` Robert Foss
  2016-08-19  2:26                           ` Minchan Kim
@ 2016-08-19  6:43                           ` Sonny Rao
  2016-08-19  7:59                             ` Michal Hocko
  2 siblings, 1 reply; 49+ messages in thread
From: Sonny Rao @ 2016-08-19  6:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jann Horn, Robert Foss, Jonathan Corbet, Andrew Morton,
	Vlastimil Babka, Konstantin Khlebnikov, Hugh Dickins,
	Naoya Horiguchi, Minchan Kim, John Stultz, ross.zwisler,
	jmarchan, Johannes Weiner, Kees Cook, Al Viro, Cyrill Gorcunov,
	Robin Humble, David Rientjes, eric.engestrom, Janis Danisevskis,
	calvinowens, Alexey Dobriyan, Kirill A. Shutemov, ldufour,
	linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik

On Thu, Aug 18, 2016 at 11:01 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Thu 18-08-16 10:47:57, Sonny Rao wrote:
>> On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko <mhocko@kernel.org> wrote:
>> > On Wed 17-08-16 11:57:56, Sonny Rao wrote:
> [...]
>> >> 2) User space OOM handling -- we'd rather do a more graceful shutdown
>> >> than let the kernel's OOM killer activate and need to gather this
>> >> information and we'd like to be able to get this information to make
>> >> the decision much faster than 400ms
>> >
>> > Global OOM handling in userspace is really dubious if you ask me. I
>> > understand you want something better than SIGKILL and in fact this is
>> > already possible with memory cgroup controller (btw. memcg will give
>> > you a cheap access to rss, amount of shared, swapped out memory as
>> > well). Anyway if you are getting close to the OOM your system will most
>> > probably be really busy and chances are that also reading your new file
>> > will take much more time. I am also not quite sure how is pss useful for
>> > oom decisions.
>>
>> I mentioned it before, but based on experience RSS just isn't good
>> enough -- there's too much sharing going on in our use case to make
>> the correct decision based on RSS.  If RSS were good enough, simply
>> put, this patch wouldn't exist.
>
> But that doesn't answer my question, I am afraid. So how exactly do you
> use pss for oom decisions?

We use PSS to calculate the memory used by a process among all the
processes in the system, in the case of Chrome this tells us how much
each renderer process (which is roughly tied to a particular "tab" in
Chrome) is using and how much it has swapped out, so we know what the
worst offenders are -- I'm not sure what's unclear about that?

Chrome tends to use a lot of shared memory so we found PSS to be
better than RSS, and I can give you examples of the  RSS and PSS on
real systems to illustrate the magnitude of the difference between
those two numbers if that would be useful.
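
For illustration, a minimal userspace sketch of that accounting, assuming
a monitor that sums the Pss: and Swap: lines of /proc/<pid>/smaps for each
PID it is told to watch (the field choice is just an assumption here):

/* Sketch: sum the Pss:/Swap: lines of /proc/<pid>/smaps for each PID
 * given on the command line -- roughly the per-renderer accounting
 * described above. */
#include <stdio.h>
#include <stdlib.h>

static int smaps_totals(int pid, unsigned long *pss_kb, unsigned long *swap_kb)
{
	char path[64], line[256];
	unsigned long val;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/smaps", pid);
	f = fopen(path, "r");
	if (!f)
		return -1;
	*pss_kb = *swap_kb = 0;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "Pss: %lu kB", &val) == 1)
			*pss_kb += val;
		else if (sscanf(line, "Swap: %lu kB", &val) == 1)
			*swap_kb += val;
	}
	fclose(f);
	return 0;
}

int main(int argc, char **argv)
{
	int i;

	for (i = 1; i < argc; i++) {
		unsigned long pss, swap;
		int pid = atoi(argv[i]);

		if (smaps_totals(pid, &pss, &swap) == 0)
			printf("pid %d: Pss %lu kB  Swap %lu kB\n", pid, pss, swap);
	}
	return 0;
}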

>
>> So even with memcg I think we'd have the same problem?
>
> memcg will give you instant anon, shared counters for all processes in
> the memcg.
>

We want to be able to get per-process granularity quickly.  I'm not
sure if memcg provides that exactly?
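
For comparison, a sketch of the memcg side, assuming a v1 memory
controller mounted at /sys/fs/cgroup/memory and a hypothetical
chrome/renderers group -- memory.stat reports per-cgroup aggregates in
bytes, not per-process numbers:

/* Sketch: read per-cgroup counters from a v1 memcg.  The mount point and
 * group path are assumptions for illustration. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	const char *path = "/sys/fs/cgroup/memory/chrome/renderers/memory.stat";
	char key[64];
	unsigned long long val;
	FILE *f = fopen(path, "r");

	if (!f)
		return 1;
	while (fscanf(f, "%63s %llu", key, &val) == 2) {
		if (!strcmp(key, "rss") || !strcmp(key, "cache") ||
		    !strcmp(key, "swap"))
			printf("%s %llu bytes\n", key, val);
	}
	fclose(f);
	return 0;
}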

>> > Don't take me wrong, /proc/<pid>/totmaps might be suitable for your
>> > specific usecase but so far I haven't heard any sound argument for it to
>> > be generally usable. It is true that smaps is unnecessarily costly but
>> > at least I can see some room for improvements. A simple patch I've
>> > posted cut the formatting overhead by 7%. Maybe we can do more.
>>
>> It seems like a general problem that if you want these values the
>> existing kernel interface can be very expensive, so it would be
>> generally usable by any application which wants a per process PSS,
>> private data, dirty data or swap value.
>
> yes this is really unfortunate. And if at all possible we should address
> that. Precise values require the expensive rmap walk. We can introduce
> some caching to help that. But so far it seems the biggest overhead is
> to simply format the output and that should be addressed before any new
> proc file is added.
>
>> I mentioned two use cases, but I guess I don't understand the comment
>> about why it's not usable by other use cases.
>
> I might be wrong here but a use of pss is quite limited and I do not
> remember anybody asking for large optimizations in that area. I still do
> not understand your use cases properly so I am quite skeptical about a
> general usefulness of a new file.

How do you know that usage of PSS is quite limited?  I can only say
that we've been using it on Chromium OS for at least four years and
have found it very valuable, and I think I've explained the use cases
in this thread. If you have more specific questions then I can try to
clarify.

>
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
  2016-08-19  2:26                           ` Minchan Kim
@ 2016-08-19  6:47                             ` Sonny Rao
  2016-08-19  8:05                             ` Michal Hocko
  1 sibling, 0 replies; 49+ messages in thread
From: Sonny Rao @ 2016-08-19  6:47 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michal Hocko, Jann Horn, Robert Foss, Jonathan Corbet,
	Andrew Morton, Vlastimil Babka, Konstantin Khlebnikov,
	Hugh Dickins, Naoya Horiguchi, John Stultz, ross.zwisler,
	jmarchan, Johannes Weiner, Kees Cook, Al Viro, Cyrill Gorcunov,
	Robin Humble, David Rientjes, eric.engestrom, Janis Danisevskis,
	calvinowens, Alexey Dobriyan, Kirill A. Shutemov, ldufour,
	linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik

On Thu, Aug 18, 2016 at 7:26 PM, Minchan Kim <minchan@kernel.org> wrote:
> Hi Michal,
>
> On Thu, Aug 18, 2016 at 08:01:04PM +0200, Michal Hocko wrote:
>> On Thu 18-08-16 10:47:57, Sonny Rao wrote:
>> > On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko <mhocko@kernel.org> wrote:
>> > > On Wed 17-08-16 11:57:56, Sonny Rao wrote:
>> [...]
>> > >> 2) User space OOM handling -- we'd rather do a more graceful shutdown
>> > >> than let the kernel's OOM killer activate and need to gather this
>> > >> information and we'd like to be able to get this information to make
>> > >> the decision much faster than 400ms
>> > >
>> > > Global OOM handling in userspace is really dubious if you ask me. I
>> > > understand you want something better than SIGKILL and in fact this is
>> > > already possible with memory cgroup controller (btw. memcg will give
>> > > you a cheap access to rss, amount of shared, swapped out memory as
>> > > well). Anyway if you are getting close to the OOM your system will most
>> > > probably be really busy and chances are that also reading your new file
>> > > will take much more time. I am also not quite sure how is pss useful for
>> > > oom decisions.
>> >
>> > I mentioned it before, but based on experience RSS just isn't good
>> > enough -- there's too much sharing going on in our use case to make
>> > the correct decision based on RSS.  If RSS were good enough, simply
>> > put, this patch wouldn't exist.
>>
>> But that doesn't answer my question, I am afraid. So how exactly do you
>> use pss for oom decisions?
>
> My case is not for OOM decision but I agree it would be great if we can get
> *fast* smap summary information.
>
> PSS is really great tool to figure out how processes consume memory
> more exactly rather than RSS. We have been used it for monitoring
> of memory for per-process. Although it is not used for OOM decision,
> it would be great if it is speed up because we don't want to spend
> many CPU time for just monitoring.
>
> For our usecase, we don't need AnonHugePages, ShmemPmdMapped, Shared_Hugetlb,
> Private_Hugetlb, KernelPageSize, MMUPageSize because we never enable THP and
> hugetlb. Additionally, Locked can be known via vma flags so we don't need it,
> either. Even, we don't need address range for just monitoring when we don't
> investigate in detail.
>
> Although they are not severe overhead, why does it emit the useless
> information? Even bloat day by day. :( With that, userspace tools should
> spend more time to parse which is pointless.
>
> Having said that, I'm not fan of creating new stat knob for that, either.
> How about appending summary information in the end of smap?
> So, monitoring users can just open the file and lseek to the (end - 1) and
> read the summary only.
>

That would work fine for us as long as it's fast -- i.e. we don't
still have to do all the expensive per-VMA format conversion in the
kernel.

> Thanks.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
  2016-08-19  6:43                           ` [PACTH v2 0/3] Implement /proc/<pid>/totmaps Sonny Rao
@ 2016-08-19  7:59                             ` Michal Hocko
  2016-08-19 17:57                               ` Sonny Rao
  0 siblings, 1 reply; 49+ messages in thread
From: Michal Hocko @ 2016-08-19  7:59 UTC (permalink / raw)
  To: Sonny Rao
  Cc: Jann Horn, Robert Foss, Jonathan Corbet, Andrew Morton,
	Vlastimil Babka, Konstantin Khlebnikov, Hugh Dickins,
	Naoya Horiguchi, Minchan Kim, John Stultz, ross.zwisler,
	jmarchan, Johannes Weiner, Kees Cook, Al Viro, Cyrill Gorcunov,
	Robin Humble, David Rientjes, eric.engestrom, Janis Danisevskis,
	calvinowens, Alexey Dobriyan, Kirill A. Shutemov, ldufour,
	linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik

On Thu 18-08-16 23:43:39, Sonny Rao wrote:
> On Thu, Aug 18, 2016 at 11:01 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > On Thu 18-08-16 10:47:57, Sonny Rao wrote:
> >> On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko <mhocko@kernel.org> wrote:
> >> > On Wed 17-08-16 11:57:56, Sonny Rao wrote:
> > [...]
> >> >> 2) User space OOM handling -- we'd rather do a more graceful shutdown
> >> >> than let the kernel's OOM killer activate and need to gather this
> >> >> information and we'd like to be able to get this information to make
> >> >> the decision much faster than 400ms
> >> >
> >> > Global OOM handling in userspace is really dubious if you ask me. I
> >> > understand you want something better than SIGKILL and in fact this is
> >> > already possible with memory cgroup controller (btw. memcg will give
> >> > you a cheap access to rss, amount of shared, swapped out memory as
> >> > well). Anyway if you are getting close to the OOM your system will most
> >> > probably be really busy and chances are that also reading your new file
> >> > will take much more time. I am also not quite sure how is pss useful for
> >> > oom decisions.
> >>
> >> I mentioned it before, but based on experience RSS just isn't good
> >> enough -- there's too much sharing going on in our use case to make
> >> the correct decision based on RSS.  If RSS were good enough, simply
> >> put, this patch wouldn't exist.
> >
> > But that doesn't answer my question, I am afraid. So how exactly do you
> > use pss for oom decisions?
> 
> We use PSS to calculate the memory used by a process among all the
> processes in the system, in the case of Chrome this tells us how much
> each renderer process (which is roughly tied to a particular "tab" in
> Chrome) is using and how much it has swapped out, so we know what the
> worst offenders are -- I'm not sure what's unclear about that?

So let me ask more specifically. How can you make any decision based on
the pss when you do not know _what_ is the shared resource. In other
words if you select a task to terminate based on the pss then you have to
kill others who share the same resource otherwise you do not release
that shared resource. Not to mention that such a shared resource might
be on tmpfs/shmem and it won't get released even after all processes
which map it are gone.

I am sorry for being dense but it is still not clear to me how the
single pss number can be used for oom or, in general, any serious
decisions. The counter might be useful of course for debugging purposes
or to have a general overview but then arguing about 40 vs 20ms sounds a
bit strange to me.

> Chrome tends to use a lot of shared memory so we found PSS to be
> better than RSS, and I can give you examples of the  RSS and PSS on
> real systems to illustrate the magnitude of the difference between
> those two numbers if that would be useful.
> 
> >
> >> So even with memcg I think we'd have the same problem?
> >
> > memcg will give you instant anon, shared counters for all processes in
> > the memcg.
> >
> 
> We want to be able to get per-process granularity quickly.  I'm not
> sure if memcg provides that exactly?

I will give you that information if you do process-per-memcg but that
doesn't sound ideal. I thought those 20-something processes you were
talking about are treated together but it seems I misunderstood.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
  2016-08-19  2:26                           ` Minchan Kim
  2016-08-19  6:47                             ` Sonny Rao
@ 2016-08-19  8:05                             ` Michal Hocko
  2016-08-19 18:20                               ` Sonny Rao
  2016-08-22  0:07                               ` Minchan Kim
  1 sibling, 2 replies; 49+ messages in thread
From: Michal Hocko @ 2016-08-19  8:05 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sonny Rao, Jann Horn, Robert Foss, corbet, Andrew Morton,
	Vlastimil Babka, Konstantin Khlebnikov, Hugh Dickins,
	Naoya Horiguchi, John Stultz, ross.zwisler, jmarchan,
	Johannes Weiner, Kees Cook, Al Viro, Cyrill Gorcunov,
	Robin Humble, David Rientjes, eric.engestrom, Janis Danisevskis,
	calvinowens, Alexey Dobriyan, Kirill A. Shutemov, ldufour,
	linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik

On Fri 19-08-16 11:26:34, Minchan Kim wrote:
> Hi Michal,
> 
> On Thu, Aug 18, 2016 at 08:01:04PM +0200, Michal Hocko wrote:
> > On Thu 18-08-16 10:47:57, Sonny Rao wrote:
> > > On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > > > On Wed 17-08-16 11:57:56, Sonny Rao wrote:
> > [...]
> > > >> 2) User space OOM handling -- we'd rather do a more graceful shutdown
> > > >> than let the kernel's OOM killer activate and need to gather this
> > > >> information and we'd like to be able to get this information to make
> > > >> the decision much faster than 400ms
> > > >
> > > > Global OOM handling in userspace is really dubious if you ask me. I
> > > > understand you want something better than SIGKILL and in fact this is
> > > > already possible with memory cgroup controller (btw. memcg will give
> > > > you a cheap access to rss, amount of shared, swapped out memory as
> > > > well). Anyway if you are getting close to the OOM your system will most
> > > > probably be really busy and chances are that also reading your new file
> > > > will take much more time. I am also not quite sure how is pss useful for
> > > > oom decisions.
> > > 
> > > I mentioned it before, but based on experience RSS just isn't good
> > > enough -- there's too much sharing going on in our use case to make
> > > the correct decision based on RSS.  If RSS were good enough, simply
> > > put, this patch wouldn't exist.
> > 
> > But that doesn't answer my question, I am afraid. So how exactly do you
> > use pss for oom decisions?
> 
> My case is not for OOM decision but I agree it would be great if we can get
> *fast* smap summary information.
> 
> PSS is really great tool to figure out how processes consume memory
> more exactly rather than RSS. We have been used it for monitoring
> of memory for per-process. Although it is not used for OOM decision,
> it would be great if it is speed up because we don't want to spend
> many CPU time for just monitoring.
> 
> For our usecase, we don't need AnonHugePages, ShmemPmdMapped, Shared_Hugetlb,
> Private_Hugetlb, KernelPageSize, MMUPageSize because we never enable THP and
> hugetlb. Additionally, Locked can be known via vma flags so we don't need it,
> either. Even, we don't need address range for just monitoring when we don't
> investigate in detail.
> 
> Although they are not severe overhead, why does it emit the useless
> information? Even bloat day by day. :( With that, userspace tools should
> spend more time to parse which is pointless.

So far it doesn't really seem that the parsing is the biggest problem.
The major cycles killer is the output formatting and that doesn't sound
like a problem we are not able to address. And I would even argue that
we want to address it in a generic way as much as possible.

> Having said that, I'm not fan of creating new stat knob for that, either.
> How about appending summary information in the end of smap?
> So, monitoring users can just open the file and lseek to the (end - 1) and
> read the summary only.

That might confuse existing parsers. Besides that we already have
/proc/<pid>/statm which gives cumulative numbers already. I am not sure
how often it is used and whether the pte walk is too expensive for
existing users but that should be explored and evaluated before a new
file is created.
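
For reference, a sketch of reading the cumulative numbers statm already
exposes; the seven fields are reported in pages and need a page-size
conversion:

/* Sketch: the cumulative numbers /proc/<pid>/statm already provides.
 * All seven fields are page counts. */
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	unsigned long size, resident, shared, text, lib, data, dt;
	long kb_per_page = sysconf(_SC_PAGESIZE) / 1024;
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%s/statm", argc > 1 ? argv[1] : "self");
	f = fopen(path, "r");
	if (!f)
		return 1;
	if (fscanf(f, "%lu %lu %lu %lu %lu %lu %lu",
		   &size, &resident, &shared, &text, &lib, &data, &dt) != 7) {
		fclose(f);
		return 1;
	}
	printf("resident: %lu kB  shared: %lu kB\n",
	       resident * kb_per_page, shared * kb_per_page);
	fclose(f);
	return 0;
}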

The /proc became a dump of everything people found interesting just
because we were too easy to allow those additions. Do not repeat those
mistakes, please!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
  2016-08-19  7:59                             ` Michal Hocko
@ 2016-08-19 17:57                               ` Sonny Rao
  2016-08-22  7:54                                 ` Michal Hocko
  0 siblings, 1 reply; 49+ messages in thread
From: Sonny Rao @ 2016-08-19 17:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jann Horn, Robert Foss, Jonathan Corbet, Andrew Morton,
	Vlastimil Babka, Konstantin Khlebnikov, Hugh Dickins,
	Naoya Horiguchi, Minchan Kim, John Stultz, ross.zwisler,
	jmarchan, Johannes Weiner, Kees Cook, Al Viro, Cyrill Gorcunov,
	Robin Humble, David Rientjes, eric.engestrom, Janis Danisevskis,
	calvinowens, Alexey Dobriyan, Kirill A. Shutemov, ldufour,
	linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik

On Fri, Aug 19, 2016 at 12:59 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Thu 18-08-16 23:43:39, Sonny Rao wrote:
>> On Thu, Aug 18, 2016 at 11:01 AM, Michal Hocko <mhocko@kernel.org> wrote:
>> > On Thu 18-08-16 10:47:57, Sonny Rao wrote:
>> >> On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko <mhocko@kernel.org> wrote:
>> >> > On Wed 17-08-16 11:57:56, Sonny Rao wrote:
>> > [...]
>> >> >> 2) User space OOM handling -- we'd rather do a more graceful shutdown
>> >> >> than let the kernel's OOM killer activate and need to gather this
>> >> >> information and we'd like to be able to get this information to make
>> >> >> the decision much faster than 400ms
>> >> >
>> >> > Global OOM handling in userspace is really dubious if you ask me. I
>> >> > understand you want something better than SIGKILL and in fact this is
>> >> > already possible with memory cgroup controller (btw. memcg will give
>> >> > you a cheap access to rss, amount of shared, swapped out memory as
>> >> > well). Anyway if you are getting close to the OOM your system will most
>> >> > probably be really busy and chances are that also reading your new file
>> >> > will take much more time. I am also not quite sure how is pss useful for
>> >> > oom decisions.
>> >>
>> >> I mentioned it before, but based on experience RSS just isn't good
>> >> enough -- there's too much sharing going on in our use case to make
>> >> the correct decision based on RSS.  If RSS were good enough, simply
>> >> put, this patch wouldn't exist.
>> >
>> > But that doesn't answer my question, I am afraid. So how exactly do you
>> > use pss for oom decisions?
>>
>> We use PSS to calculate the memory used by a process among all the
>> processes in the system, in the case of Chrome this tells us how much
>> each renderer process (which is roughly tied to a particular "tab" in
>> Chrome) is using and how much it has swapped out, so we know what the
>> worst offenders are -- I'm not sure what's unclear about that?
>
> So let me ask more specifically. How can you make any decision based on
> the pss when you do not know _what_ is the shared resource. In other
> words if you select a task to terminate based on the pss then you have to
> kill others who share the same resource otherwise you do not release
> that shared resource. Not to mention that such a shared resource might
> be on tmpfs/shmem and it won't get released even after all processes
> which map it are gone.

Ok I see why you're confused now, sorry.

In our case we do know what is being shared in general because
the sharing is mostly between those processes that we're looking at
and not other random processes or tmpfs, so PSS gives us useful data
in the context of these processes which are sharing the data
especially for monitoring between the set of these renderer processes.

We also use the private clean and private dirty and swap fields to
make a few metrics for the processes and charge each process for its
private, shared, and swap data. Private clean and dirty are used for
estimating a lower bound on how much memory would be freed.  Swap and
PSS also give us some indication of additional memory which might get
freed up.
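
A sketch of how those fields might be combined -- the formulas are
assumptions for illustration only (the example values are taken from a
totmaps dump later in this thread), not the exact metrics used on
Chromium OS:

#include <stdio.h>

/* The fields a monitor would track per process, in kB. */
struct proc_mem {
	unsigned long pss_kb;
	unsigned long private_clean_kb;
	unsigned long private_dirty_kb;
	unsigned long swap_kb;
};

/* Lower bound on memory released if only this process exits. */
static unsigned long min_freed_kb(const struct proc_mem *m)
{
	return m->private_clean_kb + m->private_dirty_kb;
}

/* Rough upper estimate: the process's proportional share plus its swap. */
static unsigned long est_footprint_kb(const struct proc_mem *m)
{
	return m->pss_kb + m->swap_kb;
}

int main(void)
{
	struct proc_mem m = {
		.pss_kb = 54717, .private_clean_kb = 0,
		.private_dirty_kb = 53600, .swap_kb = 13148,
	};

	printf("min freed: %lu kB  est footprint: %lu kB\n",
	       min_freed_kb(&m), est_footprint_kb(&m));
	return 0;
}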

>
> I am sorry for being dense but it is still not clear to me how the
> single pss number can be used for oom or, in general, any serious
> decisions. The counter might be useful of course for debugging purposes
> or to have a general overview but then arguing about 40 vs 20ms sounds a
> bit strange to me.

Yeah, so it's more than just the single PSS number; PSS,
Private_Clean, Private_Dirty, and Swap are all interesting numbers for
making these decisions.

>
>> Chrome tends to use a lot of shared memory so we found PSS to be
>> better than RSS, and I can give you examples of the  RSS and PSS on
>> real systems to illustrate the magnitude of the difference between
>> those two numbers if that would be useful.
>>
>> >
>> >> So even with memcg I think we'd have the same problem?
>> >
>> > memcg will give you instant anon, shared counters for all processes in
>> > the memcg.
>> >
>>
>> We want to be able to get per-process granularity quickly.  I'm not
>> sure if memcg provides that exactly?
>
> I will give you that information if you do process-per-memcg but that
> doesn't sound ideal. I thought those 20-something processes you were
> talking about are treated together but it seems I misunderstood.
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
  2016-08-19  8:05                             ` Michal Hocko
@ 2016-08-19 18:20                               ` Sonny Rao
  2016-08-22  0:07                               ` Minchan Kim
  1 sibling, 0 replies; 49+ messages in thread
From: Sonny Rao @ 2016-08-19 18:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, Jann Horn, Robert Foss, Jonathan Corbet,
	Andrew Morton, Vlastimil Babka, Konstantin Khlebnikov,
	Hugh Dickins, Naoya Horiguchi, John Stultz, ross.zwisler,
	jmarchan, Johannes Weiner, Kees Cook, Al Viro, Cyrill Gorcunov,
	Robin Humble, David Rientjes, eric.engestrom, Janis Danisevskis,
	calvinowens, Alexey Dobriyan, Kirill A. Shutemov, ldufour,
	linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik

On Fri, Aug 19, 2016 at 1:05 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Fri 19-08-16 11:26:34, Minchan Kim wrote:
>> Hi Michal,
>>
>> On Thu, Aug 18, 2016 at 08:01:04PM +0200, Michal Hocko wrote:
>> > On Thu 18-08-16 10:47:57, Sonny Rao wrote:
>> > > On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko <mhocko@kernel.org> wrote:
>> > > > On Wed 17-08-16 11:57:56, Sonny Rao wrote:
>> > [...]
>> > > >> 2) User space OOM handling -- we'd rather do a more graceful shutdown
>> > > >> than let the kernel's OOM killer activate and need to gather this
>> > > >> information and we'd like to be able to get this information to make
>> > > >> the decision much faster than 400ms
>> > > >
>> > > > Global OOM handling in userspace is really dubious if you ask me. I
>> > > > understand you want something better than SIGKILL and in fact this is
>> > > > already possible with memory cgroup controller (btw. memcg will give
>> > > > you a cheap access to rss, amount of shared, swapped out memory as
>> > > > well). Anyway if you are getting close to the OOM your system will most
>> > > > probably be really busy and chances are that also reading your new file
>> > > > will take much more time. I am also not quite sure how is pss useful for
>> > > > oom decisions.
>> > >
>> > > I mentioned it before, but based on experience RSS just isn't good
>> > > enough -- there's too much sharing going on in our use case to make
>> > > the correct decision based on RSS.  If RSS were good enough, simply
>> > > put, this patch wouldn't exist.
>> >
>> > But that doesn't answer my question, I am afraid. So how exactly do you
>> > use pss for oom decisions?
>>
>> My case is not for OOM decision but I agree it would be great if we can get
>> *fast* smap summary information.
>>
>> PSS is really great tool to figure out how processes consume memory
>> more exactly rather than RSS. We have been used it for monitoring
>> of memory for per-process. Although it is not used for OOM decision,
>> it would be great if it is speed up because we don't want to spend
>> many CPU time for just monitoring.
>>
>> For our usecase, we don't need AnonHugePages, ShmemPmdMapped, Shared_Hugetlb,
>> Private_Hugetlb, KernelPageSize, MMUPageSize because we never enable THP and
>> hugetlb. Additionally, Locked can be known via vma flags so we don't need it,
>> either. Even, we don't need address range for just monitoring when we don't
>> investigate in detail.
>>
>> Although they are not severe overhead, why does it emit the useless
>> information? Even bloat day by day. :( With that, userspace tools should
>> spend more time to parse which is pointless.
>
> So far it doesn't really seem that the parsing is the biggest problem.
> The major cycles killer is the output formatting and that doesn't sound
> like a problem we are not able to address. And I would even argue that
> we want to address it in a generic way as much as possible.
>
>> Having said that, I'm not fan of creating new stat knob for that, either.
>> How about appending summary information in the end of smap?
>> So, monitoring users can just open the file and lseek to the (end - 1) and
>> read the summary only.
>
> That might confuse existing parsers. Besides that we already have
> /proc/<pid>/statm which gives cumulative numbers already. I am not sure
> how often it is used and whether the pte walk is too expensive for
> existing users but that should be explored and evaluated before a new
> file is created.
>
> The /proc became a dump of everything people found interesting just
> because we were to easy to allow those additions. Do not repeat those
> mistakes, please!

Another thing I noticed was that we lock down smaps on Chromium OS.  I
think this is to avoid exposing more information than necessary via
proc.  The totmaps file gives us just the information we need and
nothing else.   I certainly don't think we need a proc file for this
use case -- do you think a new system call is better or something
else?

> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
  2016-08-19  8:05                             ` Michal Hocko
  2016-08-19 18:20                               ` Sonny Rao
@ 2016-08-22  0:07                               ` Minchan Kim
  2016-08-22  7:40                                 ` Michal Hocko
  1 sibling, 1 reply; 49+ messages in thread
From: Minchan Kim @ 2016-08-22  0:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Sonny Rao, Jann Horn, Robert Foss, corbet, Andrew Morton,
	Vlastimil Babka, Konstantin Khlebnikov, Hugh Dickins,
	Naoya Horiguchi, John Stultz, ross.zwisler, jmarchan,
	Johannes Weiner, Kees Cook, Al Viro, Cyrill Gorcunov,
	Robin Humble, David Rientjes, eric.engestrom, Janis Danisevskis,
	calvinowens, Alexey Dobriyan, Kirill A. Shutemov, ldufour,
	linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik

On Fri, Aug 19, 2016 at 10:05:32AM +0200, Michal Hocko wrote:
> On Fri 19-08-16 11:26:34, Minchan Kim wrote:
> > Hi Michal,
> > 
> > On Thu, Aug 18, 2016 at 08:01:04PM +0200, Michal Hocko wrote:
> > > On Thu 18-08-16 10:47:57, Sonny Rao wrote:
> > > > On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > > > > On Wed 17-08-16 11:57:56, Sonny Rao wrote:
> > > [...]
> > > > >> 2) User space OOM handling -- we'd rather do a more graceful shutdown
> > > > >> than let the kernel's OOM killer activate and need to gather this
> > > > >> information and we'd like to be able to get this information to make
> > > > >> the decision much faster than 400ms
> > > > >
> > > > > Global OOM handling in userspace is really dubious if you ask me. I
> > > > > understand you want something better than SIGKILL and in fact this is
> > > > > already possible with memory cgroup controller (btw. memcg will give
> > > > > you a cheap access to rss, amount of shared, swapped out memory as
> > > > > well). Anyway if you are getting close to the OOM your system will most
> > > > > probably be really busy and chances are that also reading your new file
> > > > > will take much more time. I am also not quite sure how is pss useful for
> > > > > oom decisions.
> > > > 
> > > > I mentioned it before, but based on experience RSS just isn't good
> > > > enough -- there's too much sharing going on in our use case to make
> > > > the correct decision based on RSS.  If RSS were good enough, simply
> > > > put, this patch wouldn't exist.
> > > 
> > > But that doesn't answer my question, I am afraid. So how exactly do you
> > > use pss for oom decisions?
> > 
> > My case is not for OOM decision but I agree it would be great if we can get
> > *fast* smap summary information.
> > 
> > PSS is really great tool to figure out how processes consume memory
> > more exactly rather than RSS. We have been used it for monitoring
> > of memory for per-process. Although it is not used for OOM decision,
> > it would be great if it is speed up because we don't want to spend
> > many CPU time for just monitoring.
> > 
> > For our usecase, we don't need AnonHugePages, ShmemPmdMapped, Shared_Hugetlb,
> > Private_Hugetlb, KernelPageSize, MMUPageSize because we never enable THP and
> > hugetlb. Additionally, Locked can be known via vma flags so we don't need it,
> > either. Even, we don't need address range for just monitoring when we don't
> > investigate in detail.
> > 
> > Although they are not severe overhead, why does it emit the useless
> > information? Even bloat day by day. :( With that, userspace tools should
> > spend more time to parse which is pointless.
> 
> So far it doesn't really seem that the parsing is the biggest problem.
> The major cycles killer is the output formatting and that doesn't sound

I cannot understand how kernel space is more expensive.
Hmm. I tested your test program on my machine.


#!/bin/sh
./smap_test &
pid=$!

for i in $(seq 25)
do
        cat /proc/$pid/smaps > /dev/null
done
kill $pid

root@bbox:/home/barrios/test/smap# time ./s_v.sh
pid:21925
real    0m3.365s
user    0m0.031s
sys     0m3.046s


vs.

#!/bin/sh
./smap_test &
pid=$!

for i in $(seq 25)
do
        awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {}' \
         /proc/$pid/smaps
done
kill $pid

root@bbox:/home/barrios/test/smap# time ./s.sh 
pid:21973

real    0m17.812s
user    0m12.612s
sys     0m5.187s

perf report says

    39.56%  awk        gawk               [.] dfaexec                             
     7.61%  awk        [kernel.kallsyms]  [k] format_decode                       
     6.37%  awk        gawk               [.] avoid_dfa                           
     5.85%  awk        gawk               [.] interpret                           
     5.69%  awk        [kernel.kallsyms]  [k] __memcpy                            
     4.37%  awk        [kernel.kallsyms]  [k] vsnprintf                           
     2.69%  awk        [kernel.kallsyms]  [k] number.isra.13                      
     2.10%  awk        gawk               [.] research                            
     1.91%  awk        gawk               [.] 0x00000000000351d0                  
     1.49%  awk        gawk               [.] free_wstr                           
     1.27%  awk        gawk               [.] unref                               
     1.19%  awk        gawk               [.] reset_record                        
     0.95%  awk        gawk               [.] set_record                          
     0.95%  awk        gawk               [.] get_field                           
     0.94%  awk        [kernel.kallsyms]  [k] show_smap                           

Parsing is much more expensive than the kernel side.
Could you retest your test program?

> like a problem we are not able to address. And I would even argue that
> we want to address it in a generic way as much as possible.

Sure. What solution do you think as generic way?

> 
> > Having said that, I'm not fan of creating new stat knob for that, either.
> > How about appending summary information in the end of smap?
> > So, monitoring users can just open the file and lseek to the (end - 1) and
> > read the summary only.
> 
> That might confuse existing parsers. Besides that we already have
> /proc/<pid>/statm which gives cumulative numbers already. I am not sure
> how often it is used and whether the pte walk is too expensive for
> existing users but that should be explored and evaluated before a new
> file is created.
> 
> The /proc became a dump of everything people found interesting just
> because we were to easy to allow those additions. Do not repeat those
> mistakes, please!
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
  2016-08-22  0:07                               ` Minchan Kim
@ 2016-08-22  7:40                                 ` Michal Hocko
  2016-08-22 14:12                                   ` Minchan Kim
  0 siblings, 1 reply; 49+ messages in thread
From: Michal Hocko @ 2016-08-22  7:40 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sonny Rao, Jann Horn, Robert Foss, corbet, Andrew Morton,
	Vlastimil Babka, Konstantin Khlebnikov, Hugh Dickins,
	Naoya Horiguchi, John Stultz, ross.zwisler, jmarchan,
	Johannes Weiner, Kees Cook, Al Viro, Cyrill Gorcunov,
	Robin Humble, David Rientjes, eric.engestrom, Janis Danisevskis,
	calvinowens, Alexey Dobriyan, Kirill A. Shutemov, ldufour,
	linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik

On Mon 22-08-16 09:07:45, Minchan Kim wrote:
[...]
> #!/bin/sh
> ./smap_test &
> pid=$!
> 
> for i in $(seq 25)
> do
>         awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {}' \
>          /proc/$pid/smaps
> done
> kill $pid
> 
> root@bbox:/home/barrios/test/smap# time ./s.sh 
> pid:21973
> 
> real    0m17.812s
> user    0m12.612s
> sys     0m5.187s

retested on the bare metal (x86_64 - 2CPUs)
        Command being timed: "sh s.sh"
        User time (seconds): 0.00
        System time (seconds): 18.08
        Percent of CPU this job got: 98%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:18.29

multiple runs are quite consistent in those numbers. I am running with
$ awk --version
GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4, GNU MP 6.1.0)

> > like a problem we are not able to address. And I would even argue that
> > we want to address it in a generic way as much as possible.
> 
> Sure. What solution do you think as generic way?

either optimize seq_printf or replace it with something faster.
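
As a rough userspace analogy for that formatting cost (this is not the
kernel's seq_file code), the sketch below times snprintf against a
hand-rolled decimal emitter for smaps-style lines; absolute numbers will
vary with the libc and machine:

/* Userspace analogy only -- not kernel code.  Formats N smaps-style lines
 * with snprintf, then with a minimal hand-rolled decimal emitter. */
#include <stdio.h>
#include <string.h>
#include <time.h>

static char *put_dec(char *p, unsigned long v)
{
	char tmp[24];
	int i = 0;

	do {
		tmp[i++] = '0' + (v % 10);
		v /= 10;
	} while (v);
	while (i)
		*p++ = tmp[--i];
	return p;
}

static double secs(struct timespec a, struct timespec b)
{
	return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
	enum { N = 10 * 1000 * 1000 };
	static char buf[64];
	struct timespec a, b;
	unsigned long i;

	clock_gettime(CLOCK_MONOTONIC, &a);
	for (i = 0; i < N; i++)
		snprintf(buf, sizeof(buf), "Pss:%8lu kB\n", i);
	clock_gettime(CLOCK_MONOTONIC, &b);
	printf("snprintf:    %.2fs\n", secs(a, b));

	clock_gettime(CLOCK_MONOTONIC, &a);
	for (i = 0; i < N; i++) {
		char *p = buf;

		memcpy(p, "Pss:", 4); p += 4;
		p = put_dec(p, i);
		memcpy(p, " kB\n", 5);	/* copies the trailing NUL too */
	}
	clock_gettime(CLOCK_MONOTONIC, &b);
	printf("hand-rolled: %.2fs\n", secs(a, b));
	return 0;
}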

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
  2016-08-19 17:57                               ` Sonny Rao
@ 2016-08-22  7:54                                 ` Michal Hocko
  2016-08-22 22:44                                   ` Sonny Rao
  0 siblings, 1 reply; 49+ messages in thread
From: Michal Hocko @ 2016-08-22  7:54 UTC (permalink / raw)
  To: Sonny Rao
  Cc: Jann Horn, Robert Foss, Jonathan Corbet, Andrew Morton,
	Vlastimil Babka, Konstantin Khlebnikov, Hugh Dickins,
	Naoya Horiguchi, Minchan Kim, John Stultz, ross.zwisler,
	jmarchan, Johannes Weiner, Kees Cook, Al Viro, Cyrill Gorcunov,
	Robin Humble, David Rientjes, eric.engestrom, Janis Danisevskis,
	calvinowens, Alexey Dobriyan, Kirill A. Shutemov, ldufour,
	linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik

On Fri 19-08-16 10:57:48, Sonny Rao wrote:
> On Fri, Aug 19, 2016 at 12:59 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > On Thu 18-08-16 23:43:39, Sonny Rao wrote:
> >> On Thu, Aug 18, 2016 at 11:01 AM, Michal Hocko <mhocko@kernel.org> wrote:
> >> > On Thu 18-08-16 10:47:57, Sonny Rao wrote:
> >> >> On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko <mhocko@kernel.org> wrote:
> >> >> > On Wed 17-08-16 11:57:56, Sonny Rao wrote:
> >> > [...]
> >> >> >> 2) User space OOM handling -- we'd rather do a more graceful shutdown
> >> >> >> than let the kernel's OOM killer activate and need to gather this
> >> >> >> information and we'd like to be able to get this information to make
> >> >> >> the decision much faster than 400ms
> >> >> >
> >> >> > Global OOM handling in userspace is really dubious if you ask me. I
> >> >> > understand you want something better than SIGKILL and in fact this is
> >> >> > already possible with memory cgroup controller (btw. memcg will give
> >> >> > you a cheap access to rss, amount of shared, swapped out memory as
> >> >> > well). Anyway if you are getting close to the OOM your system will most
> >> >> > probably be really busy and chances are that also reading your new file
> >> >> > will take much more time. I am also not quite sure how is pss useful for
> >> >> > oom decisions.
> >> >>
> >> >> I mentioned it before, but based on experience RSS just isn't good
> >> >> enough -- there's too much sharing going on in our use case to make
> >> >> the correct decision based on RSS.  If RSS were good enough, simply
> >> >> put, this patch wouldn't exist.
> >> >
> >> > But that doesn't answer my question, I am afraid. So how exactly do you
> >> > use pss for oom decisions?
> >>
> >> We use PSS to calculate the memory used by a process among all the
> >> processes in the system, in the case of Chrome this tells us how much
> >> each renderer process (which is roughly tied to a particular "tab" in
> >> Chrome) is using and how much it has swapped out, so we know what the
> >> worst offenders are -- I'm not sure what's unclear about that?
> >
> > So let me ask more specifically. How can you make any decision based on
> > the pss when you do not know _what_ is the shared resource. In other
> > words if you select a task to terminate based on the pss then you have to
> > kill others who share the same resource otherwise you do not release
> > that shared resource. Not to mention that such a shared resource might
> > be on tmpfs/shmem and it won't get released even after all processes
> > which map it are gone.
> 
> Ok I see why you're confused now, sorry.
> 
> In our case that we do know what is being shared in general because
> the sharing is mostly between those processes that we're looking at
> and not other random processes or tmpfs, so PSS gives us useful data
> in the context of these processes which are sharing the data
> especially for monitoring between the set of these renderer processes.

OK, I see and agree that pss might be useful when you _know_ what is
shared. But this sounds quite specific to a particular workload. How
many users are in a similar situation? In other words, if we present
a single number without the context, how useful will it be in
general? Is it possible that presenting such a number could be even
misleading for somebody who doesn't have an idea which resources are
shared? These are all questions which should be answered before we
actually add this number (be it a new/existing proc file or a syscall).
I still believe that the number without wider context is just not all
that useful.

> We also use the private clean and private dirty and swap fields to
> make a few metrics for the processes and charge each process for it's
> private, shared, and swap data. Private clean and dirty are used for
> estimating a lower bound on how much memory would be freed.

I can imagine that this kind of information might be useful and
presented in /proc/<pid>/statm. The question is whether some of the
existing consumers would see the performance impact due to the page table
walk. Anyway even these counters might get quite tricky because even
shareable resources are considered private if the process is the only
one to map them (so again this might be a file on tmpfs...).

> Swap and
> PSS also give us some indication of additional memory which might get
> freed up.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
  2016-08-22  7:40                                 ` Michal Hocko
@ 2016-08-22 14:12                                   ` Minchan Kim
  2016-08-22 14:37                                     ` Robert Foss
  2016-08-22 16:45                                     ` Michal Hocko
  0 siblings, 2 replies; 49+ messages in thread
From: Minchan Kim @ 2016-08-22 14:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, Sonny Rao, Jann Horn, Robert Foss, corbet,
	Andrew Morton, Vlastimil Babka, Konstantin Khlebnikov,
	Hugh Dickins, Naoya Horiguchi, John Stultz, ross.zwisler,
	jmarchan, Johannes Weiner, Kees Cook, Al Viro, Cyrill Gorcunov,
	Robin Humble, David Rientjes, eric.engestrom, Janis Danisevskis,
	calvinowens, Alexey Dobriyan, Kirill A. Shutemov, ldufour,
	linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik

[-- Attachment #1: Type: text/plain, Size: 1711 bytes --]

On Mon, Aug 22, 2016 at 09:40:52AM +0200, Michal Hocko wrote:
> On Mon 22-08-16 09:07:45, Minchan Kim wrote:
> [...]
> > #!/bin/sh
> > ./smap_test &
> > pid=$!
> > 
> > for i in $(seq 25)
> > do
> >         awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {}' \
> >          /proc/$pid/smaps
> > done
> > kill $pid
> > 
> > root@bbox:/home/barrios/test/smap# time ./s.sh 
> > pid:21973
> > 
> > real    0m17.812s
> > user    0m12.612s
> > sys     0m5.187s
> 
> retested on the bare metal (x86_64 - 2CPUs)
>         Command being timed: "sh s.sh"
>         User time (seconds): 0.00
>         System time (seconds): 18.08
>         Percent of CPU this job got: 98%
>         Elapsed (wall clock) time (h:mm:ss or m:ss): 0:18.29
> 
> multiple runs are quite consistent in those numbers. I am running with
> $ awk --version
> GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4, GNU MP 6.1.0)
> 
> > > like a problem we are not able to address. And I would even argue that
> > > we want to address it in a generic way as much as possible.
> > 
> > Sure. What solution do you think as generic way?
> 
> either optimize seq_printf or replace it with something faster.

If it's the real culprit, I agree. However, I tested your test program on
my 2 x86 machines and my friend's machine.

Ubuntu, Fedora, Arch

They have awk 4.0.1 and 4.1.3.

Results are the same. Userspace spends more time, as I mentioned.

[root@blaptop smap_test]# time awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}' /proc/3552/smaps
rss:263484 pss:262188

real    0m0.770s
user    0m0.574s
sys     0m0.197s

I will attach my test program source.
I hope you guys can test it and repost the results, because that's the key
to the direction of this patchset.

Thanks.

[-- Attachment #2: smap_test.c --]
[-- Type: text/plain, Size: 316 bytes --]

/* Create as many 4kB anonymous VMAs as the kernel allows, then sleep so
 * /proc/<pid>/smaps can be sampled while they are mapped. */
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
	unsigned long nr_vma = 0;

	while (1) {
		/* MAP_SHARED anon mappings do not merge, so each call adds a VMA;
		 * MAP_POPULATE faults the page in so it is resident. */
		if (mmap(0, 4096, PROT_READ|PROT_WRITE,
			 MAP_ANONYMOUS|MAP_SHARED|MAP_POPULATE, -1, 0) == MAP_FAILED)
			break;
		nr_vma++;
	}

	printf("pid:%d nr_vma:%lu\n", getpid(), nr_vma);
	pause();
	return 0;
}


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
  2016-08-22 14:12                                   ` Minchan Kim
@ 2016-08-22 14:37                                     ` Robert Foss
  2016-08-22 16:45                                     ` Michal Hocko
  1 sibling, 0 replies; 49+ messages in thread
From: Robert Foss @ 2016-08-22 14:37 UTC (permalink / raw)
  To: Minchan Kim, Michal Hocko
  Cc: Sonny Rao, Jann Horn, corbet, Andrew Morton, Vlastimil Babka,
	Konstantin Khlebnikov, Hugh Dickins, Naoya Horiguchi,
	John Stultz, ross.zwisler, jmarchan, Johannes Weiner, Kees Cook,
	Al Viro, Cyrill Gorcunov, Robin Humble, David Rientjes,
	eric.engestrom, Janis Danisevskis, calvinowens, Alexey Dobriyan,
	Kirill A. Shutemov, ldufour, linux-doc, linux-kernel, Ben Zhang,
	Bryan Freed, Filipe Brandenburger, Mateusz Guzik



On 2016-08-22 10:12 AM, Minchan Kim wrote:
> On Mon, Aug 22, 2016 at 09:40:52AM +0200, Michal Hocko wrote:
>> On Mon 22-08-16 09:07:45, Minchan Kim wrote:
>> [...]
>>> #!/bin/sh
>>> ./smap_test &
>>> pid=$!
>>>
>>> for i in $(seq 25)
>>> do
>>>         awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {}' \
>>>          /proc/$pid/smaps
>>> done
>>> kill $pid
>>>
>>> root@bbox:/home/barrios/test/smap# time ./s.sh
>>> pid:21973
>>>
>>> real    0m17.812s
>>> user    0m12.612s
>>> sys     0m5.187s
>>
>> retested on the bare metal (x86_64 - 2CPUs)
>>         Command being timed: "sh s.sh"
>>         User time (seconds): 0.00
>>         System time (seconds): 18.08
>>         Percent of CPU this job got: 98%
>>         Elapsed (wall clock) time (h:mm:ss or m:ss): 0:18.29
>>
>> multiple runs are quite consistent in those numbers. I am running with
>> $ awk --version
>> GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4, GNU MP 6.1.0)
>>

$ ./smap_test &
pid:19658 nr_vma:65514

$ time awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d 
pss:%d\n", rss, pss}' /proc/19658/smaps
rss:263452 pss:262151

real	0m0.625s
user	0m0.404s
sys	0m0.216s

$ awk --version
GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4, GNU MP 6.1.0)

>>>> like a problem we are not able to address. And I would even argue that
>>>> we want to address it in a generic way as much as possible.
>>>
>>> Sure. What solution do you think as generic way?
>>
>> either optimize seq_printf or replace it with something faster.
>
> If it's real culprit, I agree. However, I tested your test program on
> my 2 x86 machines and my friend's machine.
>
> Ubuntu, Fedora, Arch
>
> They have awk 4.0.1 and 4.1.3.
>
> Result are same. Userspace speand more times I mentioned.
>
> [root@blaptop smap_test]# time awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}' /proc/3552/smaps
> rss:263484 pss:262188
>
> real    0m0.770s
> user    0m0.574s
> sys     0m0.197s
>
> I will attach my test progrma source.
> I hope you guys test and repost the result because it's the key for direction
> of patchset.
>
> Thanks.
>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
  2016-08-22 14:12                                   ` Minchan Kim
  2016-08-22 14:37                                     ` Robert Foss
@ 2016-08-22 16:45                                     ` Michal Hocko
  2016-08-22 17:29                                       ` Michal Hocko
  1 sibling, 1 reply; 49+ messages in thread
From: Michal Hocko @ 2016-08-22 16:45 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sonny Rao, Jann Horn, Robert Foss, corbet, Andrew Morton,
	Vlastimil Babka, Konstantin Khlebnikov, Hugh Dickins,
	Naoya Horiguchi, John Stultz, ross.zwisler, jmarchan,
	Johannes Weiner, Kees Cook, Al Viro, Cyrill Gorcunov,
	Robin Humble, David Rientjes, eric.engestrom, Janis Danisevskis,
	calvinowens, Alexey Dobriyan, Kirill A. Shutemov, ldufour,
	linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik

On Mon 22-08-16 23:12:41, Minchan Kim wrote:
> On Mon, Aug 22, 2016 at 09:40:52AM +0200, Michal Hocko wrote:
> > On Mon 22-08-16 09:07:45, Minchan Kim wrote:
> > [...]
> > > #!/bin/sh
> > > ./smap_test &
> > > pid=$!
> > > 
> > > for i in $(seq 25)
> > > do
> > >         awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {}' \
> > >          /proc/$pid/smaps
> > > done
> > > kill $pid
> > > 
> > > root@bbox:/home/barrios/test/smap# time ./s.sh 
> > > pid:21973
> > > 
> > > real    0m17.812s
> > > user    0m12.612s
> > > sys     0m5.187s
> > 
> > retested on the bare metal (x86_64 - 2CPUs)
> >         Command being timed: "sh s.sh"
> >         User time (seconds): 0.00
> >         System time (seconds): 18.08
> >         Percent of CPU this job got: 98%
> >         Elapsed (wall clock) time (h:mm:ss or m:ss): 0:18.29
> > 
> > multiple runs are quite consistent in those numbers. I am running with
> > $ awk --version
> > GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4, GNU MP 6.1.0)
> > 
> > > > like a problem we are not able to address. And I would even argue that
> > > > we want to address it in a generic way as much as possible.
> > > 
> > > Sure. What solution do you think as generic way?
> > 
> > either optimize seq_printf or replace it with something faster.
> 
> If it's real culprit, I agree. However, I tested your test program on
> my 2 x86 machines and my friend's machine.
> 
> Ubuntu, Fedora, Arch
> 
> They have awk 4.0.1 and 4.1.3.
> 
> Result are same. Userspace speand more times I mentioned.
> 
> [root@blaptop smap_test]# time awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}' /proc/3552/smaps
> rss:263484 pss:262188
> 
> real    0m0.770s
> user    0m0.574s
> sys     0m0.197s
> 
> I will attach my test progrma source.
> I hope you guys test and repost the result because it's the key for direction
> of patchset.

Hmm, this is really interesting. I have checked a different machine and
it shows different results. Same code, slightly different version of awk
(4.1.0) and the results are different
        Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss} /proc/48925/smaps"
        User time (seconds): 0.43
        System time (seconds): 0.27

I have no idea why those numbers are so different on my laptop
yet. It surely looks suspicious. I will try to debug this further
tomorrow. Anyway, the performance is just one side of the problem. I
have tried to express my concerns about a single exported pss value in
another email. Please try to step back and think about how useful this
information is without knowing which resource we are talking about.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
  2016-08-22 16:45                                     ` Michal Hocko
@ 2016-08-22 17:29                                       ` Michal Hocko
  2016-08-22 17:47                                         ` Michal Hocko
  0 siblings, 1 reply; 49+ messages in thread
From: Michal Hocko @ 2016-08-22 17:29 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sonny Rao, Jann Horn, Robert Foss, corbet, Andrew Morton,
	Vlastimil Babka, Konstantin Khlebnikov, Hugh Dickins,
	Naoya Horiguchi, John Stultz, ross.zwisler, jmarchan,
	Johannes Weiner, Kees Cook, Al Viro, Cyrill Gorcunov,
	Robin Humble, David Rientjes, eric.engestrom, Janis Danisevskis,
	calvinowens, Alexey Dobriyan, Kirill A. Shutemov, ldufour,
	linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik

On Mon 22-08-16 18:45:54, Michal Hocko wrote:
[...]
> I have no idea why those numbers are so different on my laptop
> yet. It surely looks suspicious. I will try to debug this further
> tomorrow.

Hmm, so I've tried to use my version of awk on the other machine and vice
versa and it didn't make any difference. So this is independent of the
awk version it seems. So I've tried to strace /usr/bin/time and
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 0}, ru_stime={0, 688438}, ...}) = 9128

so the kernel indeed reports 0 user time for some reason. Note I
was testing with 4.7 and right now with 4.8.0-rc3 kernel (no local
modifications). The other machine which reports non-0 utime is 3.12
SLES kernel. Maybe I am hitting some accounting bug. At first I was
suspecting CONFIG_NO_HZ_FULL because that is the main difference between
my and the other machine but then I've noticed that the tests I was
doing in kvm have this disabled too.. so it must be something else.

Weird...
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
  2016-08-22 17:29                                       ` Michal Hocko
@ 2016-08-22 17:47                                         ` Michal Hocko
  2016-08-23  8:26                                           ` Michal Hocko
  0 siblings, 1 reply; 49+ messages in thread
From: Michal Hocko @ 2016-08-22 17:47 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sonny Rao, Jann Horn, Robert Foss, corbet, Andrew Morton,
	Vlastimil Babka, Konstantin Khlebnikov, Hugh Dickins,
	Naoya Horiguchi, John Stultz, ross.zwisler, jmarchan,
	Johannes Weiner, Kees Cook, Al Viro, Cyrill Gorcunov,
	Robin Humble, David Rientjes, eric.engestrom, Janis Danisevskis,
	calvinowens, Alexey Dobriyan, Kirill A. Shutemov, ldufour,
	linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik

On Mon 22-08-16 19:29:36, Michal Hocko wrote:
> On Mon 22-08-16 18:45:54, Michal Hocko wrote:
> [...]
> > I have no idea why those numbers are so different on my laptop
> > yet. It surely looks suspicious. I will try to debug this further
> > tomorrow.
> 
> Hmm, so I've tried to use my version of awk on other machine and vice
> versa and it didn't make any difference. So this is independent on the
> awk version it seems. So I've tried to strace /usr/bin/time and
> wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 0}, ru_stime={0, 688438}, ...}) = 9128
> 
> so the kernel indeed reports 0 user time for some reason. Note I
> was testing with 4.7 and right now with 4.8.0-rc3 kernel (no local
> modifications). The other machine which reports non-0 utime is 3.12
> SLES kernel. Maybe I am hitting some accounting bug. At first I was
> suspecting CONFIG_NO_HZ_FULL because that is the main difference between
> my and the other machine but then I've noticed that the tests I was
> doing in kvm have this disabled too.. so it must be something else.

4.5 reports non-0 utime while 4.6 reports zero. The NO_HZ configuration is
the same in both kernels.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
  2016-08-22  7:54                                 ` Michal Hocko
@ 2016-08-22 22:44                                   ` Sonny Rao
  2016-08-24 10:14                                     ` Marcin Jabrzyk
  2016-08-29 14:37                                     ` Michal Hocko
  0 siblings, 2 replies; 49+ messages in thread
From: Sonny Rao @ 2016-08-22 22:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jann Horn, Robert Foss, Jonathan Corbet, Andrew Morton,
	Vlastimil Babka, Konstantin Khlebnikov, Hugh Dickins,
	Naoya Horiguchi, Minchan Kim, John Stultz, ross.zwisler,
	jmarchan, Johannes Weiner, Kees Cook, Al Viro, Cyrill Gorcunov,
	Robin Humble, David Rientjes, eric.engestrom, Janis Danisevskis,
	calvinowens, Alexey Dobriyan, Kirill A. Shutemov, ldufour,
	linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik

On Mon, Aug 22, 2016 at 12:54 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Fri 19-08-16 10:57:48, Sonny Rao wrote:
>> On Fri, Aug 19, 2016 at 12:59 AM, Michal Hocko <mhocko@kernel.org> wrote:
>> > On Thu 18-08-16 23:43:39, Sonny Rao wrote:
>> >> On Thu, Aug 18, 2016 at 11:01 AM, Michal Hocko <mhocko@kernel.org> wrote:
>> >> > On Thu 18-08-16 10:47:57, Sonny Rao wrote:
>> >> >> On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko <mhocko@kernel.org> wrote:
>> >> >> > On Wed 17-08-16 11:57:56, Sonny Rao wrote:
>> >> > [...]
>> >> >> >> 2) User space OOM handling -- we'd rather do a more graceful shutdown
>> >> >> >> than let the kernel's OOM killer activate and need to gather this
>> >> >> >> information and we'd like to be able to get this information to make
>> >> >> >> the decision much faster than 400ms
>> >> >> >
>> >> >> > Global OOM handling in userspace is really dubious if you ask me. I
>> >> >> > understand you want something better than SIGKILL and in fact this is
>> >> >> > already possible with memory cgroup controller (btw. memcg will give
>> >> >> > you a cheap access to rss, amount of shared, swapped out memory as
>> >> >> > well). Anyway if you are getting close to the OOM your system will most
>> >> >> > probably be really busy and chances are that also reading your new file
>> >> >> > will take much more time. I am also not quite sure how is pss useful for
>> >> >> > oom decisions.
>> >> >>
>> >> >> I mentioned it before, but based on experience RSS just isn't good
>> >> >> enough -- there's too much sharing going on in our use case to make
>> >> >> the correct decision based on RSS.  If RSS were good enough, simply
>> >> >> put, this patch wouldn't exist.
>> >> >
>> >> > But that doesn't answer my question, I am afraid. So how exactly do you
>> >> > use pss for oom decisions?
>> >>
>> >> We use PSS to calculate the memory used by a process among all the
>> >> processes in the system, in the case of Chrome this tells us how much
>> >> each renderer process (which is roughly tied to a particular "tab" in
>> >> Chrome) is using and how much it has swapped out, so we know what the
>> >> worst offenders are -- I'm not sure what's unclear about that?
>> >
>> > So let me ask more specifically. How can you make any decision based on
>> > the pss when you do not know _what_ is the shared resource. In other
>> > words if you select a task to terminate based on the pss then you have to
>> > kill others who share the same resource otherwise you do not release
>> > that shared resource. Not to mention that such a shared resource might
>> > be on tmpfs/shmem and it won't get released even after all processes
>> > which map it are gone.
>>
>> Ok I see why you're confused now, sorry.
>>
>> In our case that we do know what is being shared in general because
>> the sharing is mostly between those processes that we're looking at
>> and not other random processes or tmpfs, so PSS gives us useful data
>> in the context of these processes which are sharing the data
>> especially for monitoring between the set of these renderer processes.
>
> OK, I see and agree that pss might be useful when you _know_ what is
> shared. But this sounds quite specific to a particular workload. How
> many users are in a similar situation? In other words, if we present
> a single number without the context, how much useful it will be in
> general? Is it possible that presenting such a number could be even
> misleading for somebody who doesn't have an idea which resources are
> shared? These are all questions which should be answered before we
> actually add this number (be it a new/existing proc file or a syscall).
> I still believe that the number without wider context is just not all
> that useful.


I see the specific point about PSS -- because you need to know what
is being shared, or otherwise use it in a whole-system context -- but I
still think the whole-system context is a valid and generally useful
thing.  But what about private_clean and private_dirty?  Surely
those are more generally useful for calculating a lower bound on
process memory usage without additional knowledge?

At the end of the day all of these metrics are approximations, and it
comes down to how far off the various approximations are and what
trade-offs we are willing to make.
RSS is the cheapest but also the most coarse.

PSS (with the correct context) and Private data plus swap are much
better but also more expensive due to the PT walk.
As far as I know, to get anything but RSS we have to go through smaps
or use memcg.  Swap seems to be available in /proc/<pid>/status.
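
(As a minimal illustration of the cheap path -- nothing from the patch
set, just the existing interface: both counters are plain fields in
/proc/<pid>/status, so reading them needs no page table walk at all.)

$ grep -E '^(VmRSS|VmSwap):' /proc/self/status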

I looked at the "shared" value in /proc/<pid>/statm but it doesn't
seem to correlate well with the shared value in smaps -- not sure why?

It might be useful to show the magnitude of difference of using RSS vs
PSS/Private in the case of the Chrome renderer processes.  On the
system I was looking at there were about 40 of these processes, but I
picked a few to give an idea:

localhost ~ # cat /proc/21550/totmaps
Rss:               98972 kB
Pss:               54717 kB
Shared_Clean:      19020 kB
Shared_Dirty:      26352 kB
Private_Clean:         0 kB
Private_Dirty:     53600 kB
Referenced:        92184 kB
Anonymous:         46524 kB
AnonHugePages:     24576 kB
Swap:              13148 kB


RSS is 80% higher than PSS and 84% higher than private data

localhost ~ # cat /proc/21470/totmaps
Rss:              118420 kB
Pss:               70938 kB
Shared_Clean:      22212 kB
Shared_Dirty:      26520 kB
Private_Clean:         0 kB
Private_Dirty:     69688 kB
Referenced:       111500 kB
Anonymous:         79928 kB
AnonHugePages:     24576 kB
Swap:              12964 kB

RSS is 66% higher than PSS and 69% higher than private data

localhost ~ # cat /proc/21435/totmaps
Rss:               97156 kB
Pss:               50044 kB
Shared_Clean:      21920 kB
Shared_Dirty:      26400 kB
Private_Clean:         0 kB
Private_Dirty:     48836 kB
Referenced:        90012 kB
Anonymous:         75228 kB
AnonHugePages:     24576 kB
Swap:              13064 kB

RSS is 94% higher than PSS and 98% higher than private data.

It looks like there's a set of about 40MB of shared pages which causes
the difference in this case.
Swap was roughly even on these, but I don't think that will always be the case.
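
(For reference, the percentages above can be recomputed straight from a
totmaps dump; this one-liner is only a convenience and assumes the field
layout shown above, with everything already in kB.)

$ awk '/^Rss/{rss=$2} /^Pss/{pss=$2} /^Private_/{priv+=$2}
       END{printf "RSS is %d%% higher than PSS and %d%% higher than private data\n", int(100*(rss/pss-1)), int(100*(rss/priv-1))}' /proc/21550/totmaps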


>
>> We also use the private clean and private dirty and swap fields to
>> make a few metrics for the processes and charge each process for it's
>> private, shared, and swap data. Private clean and dirty are used for
>> estimating a lower bound on how much memory would be freed.
>
> I can imagine that this kind of information might be useful and
> presented in /proc/<pid>/statm. The question is whether some of the
> existing consumers would see the performance impact due to he page table
> walk. Anyway even these counters might get quite tricky because even
> shareable resources are considered private if the process is the only
> one to map them (so again this might be a file on tmpfs...).
>
>> Swap and
>> PSS also give us some indication of additional memory which might get
>> freed up.
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
  2016-08-22 17:47                                         ` Michal Hocko
@ 2016-08-23  8:26                                           ` Michal Hocko
  2016-08-23 14:33                                             ` utime accounting regression since 4.6 (was: Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps) Michal Hocko
  0 siblings, 1 reply; 49+ messages in thread
From: Michal Hocko @ 2016-08-23  8:26 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sonny Rao, Jann Horn, Robert Foss, corbet, Andrew Morton,
	Vlastimil Babka, Konstantin Khlebnikov, Hugh Dickins,
	Naoya Horiguchi, John Stultz, ross.zwisler, jmarchan,
	Johannes Weiner, Kees Cook, Al Viro, Cyrill Gorcunov,
	Robin Humble, David Rientjes, eric.engestrom, Janis Danisevskis,
	calvinowens, Alexey Dobriyan, Kirill A. Shutemov, ldufour,
	linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik

On Mon 22-08-16 19:47:09, Michal Hocko wrote:
> On Mon 22-08-16 19:29:36, Michal Hocko wrote:
> > On Mon 22-08-16 18:45:54, Michal Hocko wrote:
> > [...]
> > > I have no idea why those numbers are so different on my laptop
> > > yet. It surely looks suspicious. I will try to debug this further
> > > tomorrow.
> > 
> > Hmm, so I've tried to use my version of awk on other machine and vice
> > versa and it didn't make any difference. So this is independent on the
> > awk version it seems. So I've tried to strace /usr/bin/time and
> > wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 0}, ru_stime={0, 688438}, ...}) = 9128
> > 
> > so the kernel indeed reports 0 user time for some reason. Note I
> > was testing with 4.7 and right now with 4.8.0-rc3 kernel (no local
> > modifications). The other machine which reports non-0 utime is 3.12
> > SLES kernel. Maybe I am hitting some accounting bug. At first I was
> > suspecting CONFIG_NO_HZ_FULL because that is the main difference between
> > my and the other machine but then I've noticed that the tests I was
> > doing in kvm have this disabled too.. so it must be something else.
> 
> 4.5 reports non-0 while 4.6 zero utime. NO_HZ configuration is the same
> in both kernels.

and one more thing. It is not like utime accounting would be completely
broken and always report 0. Other commands report non-0 values even on
4.6 kernels. I will try to bisect this down later today.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 49+ messages in thread

* utime accounting regression since 4.6 (was: Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps)
  2016-08-23  8:26                                           ` Michal Hocko
@ 2016-08-23 14:33                                             ` Michal Hocko
  2016-08-23 21:46                                               ` Rik van Riel
  0 siblings, 1 reply; 49+ messages in thread
From: Michal Hocko @ 2016-08-23 14:33 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Minchan Kim, Sonny Rao, Jann Horn, Robert Foss, corbet,
	Andrew Morton, Vlastimil Babka, Konstantin Khlebnikov,
	Hugh Dickins, Naoya Horiguchi, John Stultz, ross.zwisler,
	jmarchan, Johannes Weiner, Kees Cook, Al Viro, Cyrill Gorcunov,
	Robin Humble, David Rientjes, eric.engestrom, Janis Danisevskis,
	calvinowens, Alexey Dobriyan, Kirill A. Shutemov, ldufour,
	linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik

[-- Attachment #1: Type: text/plain, Size: 4939 bytes --]

On Tue 23-08-16 10:26:03, Michal Hocko wrote:
> On Mon 22-08-16 19:47:09, Michal Hocko wrote:
> > On Mon 22-08-16 19:29:36, Michal Hocko wrote:
> > > On Mon 22-08-16 18:45:54, Michal Hocko wrote:
> > > [...]
> > > > I have no idea why those numbers are so different on my laptop
> > > > yet. It surely looks suspicious. I will try to debug this further
> > > > tomorrow.
> > > 
> > > Hmm, so I've tried to use my version of awk on other machine and vice
> > > versa and it didn't make any difference. So this is independent on the
> > > awk version it seems. So I've tried to strace /usr/bin/time and
> > > wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 0}, ru_stime={0, 688438}, ...}) = 9128
> > > 
> > > so the kernel indeed reports 0 user time for some reason. Note I
> > > was testing with 4.7 and right now with 4.8.0-rc3 kernel (no local
> > > modifications). The other machine which reports non-0 utime is 3.12
> > > SLES kernel. Maybe I am hitting some accounting bug. At first I was
> > > suspecting CONFIG_NO_HZ_FULL because that is the main difference between
> > > my and the other machine but then I've noticed that the tests I was
> > > doing in kvm have this disabled too.. so it must be something else.
> > 
> > 4.5 reports non-0 while 4.6 zero utime. NO_HZ configuration is the same
> > in both kernels.
> 
> and one more thing. It is not like utime accounting would be completely
> broken and always report 0. Other commands report non-0 values even on
> 4.6 kernels. I will try to bisect this down later today.

OK, so it seems I found it. I was quite lucky because account_user_time
is not all that popular a function and there were basically no changes
besides Rik's ff9a9b4c4334 ("sched, time: Switch VIRT_CPU_ACCOUNTING_GEN
to jiffy granularity"), and that seems to cause the regression. Reverting
the commit on top of the current mmotm seems to fix the issue for me.
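
(A bisection of this kind boils down to something like the following;
the test predicate below is only a sketch -- using v4.5/v4.6 as the
good/bad endpoints and 'max_map' as the helper from [1] -- not a
transcript of the actual run.)

$ git bisect start v4.6 v4.5
  # build and boot each candidate kernel, then on the test box:
$ utime=$(/usr/bin/time -f '%U' awk '/^Rss/{rss+=$2}' /proc/$(pgrep max_map)/smaps 2>&1 >/dev/null)
$ if [ "$utime" = "0.00" ]; then git bisect bad; else git bisect good; fi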

And just to give Rik more context: while debugging the overhead of
/proc/<pid>/smaps I am getting misleading output from /usr/bin/time -v
(source for ./max_mmap is [1])

root@test1:~# uname -r
4.5.0-rc6-bisect1-00025-gff9a9b4c4334
root@test1:~# ./max_map 
pid:2990 maps:65515
root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}' /proc/2990/smaps
rss:263368 pss:262203
        Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss} /proc/2990/smaps"
        User time (seconds): 0.00
        System time (seconds): 0.45
        Percent of CPU this job got: 98%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.46
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 1796
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 1
        Minor (reclaiming a frame) page faults: 83
        Voluntary context switches: 6
        Involuntary context switches: 6
        Swaps: 0
        File system inputs: 248
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

See the User time being 0 (as you can see above in the quoted text it
is not a rounding error in userspace or something similar because wait4
really returns 0). Now with the revert
root@test1:~# uname -r
4.5.0-rc6-revert-00026-g7fc86f968bf5
root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}' /proc/3015/smaps
rss:263316 pss:262199
        Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss} /proc/3015/smaps"
        User time (seconds): 0.18
        System time (seconds): 0.29
        Percent of CPU this job got: 97%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.50
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 1760
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 1
        Minor (reclaiming a frame) page faults: 79
        Voluntary context switches: 5
        Involuntary context switches: 7
        Swaps: 0
        File system inputs: 248
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

So it looks like the whole user time is accounted as system time.
My config is attached and yes, I do have CONFIG_VIRT_CPU_ACCOUNTING_GEN
enabled. Could you have a look, please?

[1] http://lkml.kernel.org/r/20160817082200.GA10547@dhcp22.suse.cz
-- 
Michal Hocko
SUSE Labs

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 23016 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: utime accounting regression since 4.6 (was: Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps)
  2016-08-23 14:33                                             ` utime accounting regression since 4.6 (was: Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps) Michal Hocko
@ 2016-08-23 21:46                                               ` Rik van Riel
  2016-08-24 16:56                                                 ` Michal Hocko
  2016-09-30  9:49                                                 ` Michal Hocko
  0 siblings, 2 replies; 49+ messages in thread
From: Rik van Riel @ 2016-08-23 21:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, Sonny Rao, Jann Horn, Robert Foss, corbet,
	Andrew Morton, Vlastimil Babka, Konstantin Khlebnikov,
	Hugh Dickins, Naoya Horiguchi, John Stultz, ross.zwisler,
	jmarchan, Johannes Weiner, Kees Cook, Al Viro, Cyrill Gorcunov,
	Robin Humble, David Rientjes, eric.engestrom, Janis Danisevskis,
	calvinowens, Alexey Dobriyan, Kirill A. Shutemov, ldufour,
	linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik

On Tue, 2016-08-23 at 16:33 +0200, Michal Hocko wrote:
> On Tue 23-08-16 10:26:03, Michal Hocko wrote:
> > On Mon 22-08-16 19:47:09, Michal Hocko wrote:
> > > On Mon 22-08-16 19:29:36, Michal Hocko wrote:
> > > > On Mon 22-08-16 18:45:54, Michal Hocko wrote:
> > > > [...]
> > > > > I have no idea why those numbers are so different on my
> > > > > laptop
> > > > > yet. It surely looks suspicious. I will try to debug this
> > > > > further
> > > > > tomorrow.
> > > > 
> > > > Hmm, so I've tried to use my version of awk on other machine
> > > > and vice
> > > > versa and it didn't make any difference. So this is independent
> > > > on the
> > > > awk version it seems. So I've tried to strace /usr/bin/time and
> > > > wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0,
> > > > {ru_utime={0, 0}, ru_stime={0, 688438}, ...}) = 9128
> > > > 
> > > > so the kernel indeed reports 0 user time for some reason. Note
> > > > I
> > > > was testing with 4.7 and right now with 4.8.0-rc3 kernel (no
> > > > local
> > > > modifications). The other machine which reports non-0 utime is
> > > > 3.12
> > > > SLES kernel. Maybe I am hitting some accounting bug. At first I
> > > > was
> > > > suspecting CONFIG_NO_HZ_FULL because that is the main
> > > > difference between
> > > > my and the other machine but then I've noticed that the tests I
> > > > was
> > > > doing in kvm have this disabled too.. so it must be something
> > > > else.
> > > 
> > > 4.5 reports non-0 while 4.6 zero utime. NO_HZ configuration is
> > > the same
> > > in both kernels.
> > 
> > and one more thing. It is not like utime accounting would be
> > completely
> > broken and always report 0. Other commands report non-0 values even
> > on
> > 4.6 kernels. I will try to bisect this down later today.
> 
> OK, so it seems I found it. I was quite lucky because
> account_user_time
> is not all that popular function and there were basically no changes
> besides Riks ff9a9b4c4334 ("sched, time: Switch
> VIRT_CPU_ACCOUNTING_GEN
> to jiffy granularity") and that seems to cause the regression.
> Reverting
> the commit on top of the current mmotm seems to fix the issue for me.
> 
> And just to give Rik more context. While debugging overhead of the
> /proc/<pid>/smaps I am getting a misleading output from /usr/bin/time
> -v
> (source for ./max_mmap is [1])
> 
> root@test1:~# uname -r
> 4.5.0-rc6-bisect1-00025-gff9a9b4c4334
> root@test1:~# ./max_map 
> pid:2990 maps:65515
> root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2}
> END {printf "rss:%d pss:%d\n", rss, pss}' /proc/2990/smaps
> rss:263368 pss:262203
>         Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
> {printf "rss:%d pss:%d\n", rss, pss} /proc/2990/smaps"
>         User time (seconds): 0.00
>         System time (seconds): 0.45
>         Percent of CPU this job got: 98%
> 

> root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2}
> END {printf "rss:%d pss:%d\n", rss, pss}' /proc/3015/smaps
> rss:263316 pss:262199
>         Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
> {printf "rss:%d pss:%d\n", rss, pss} /proc/3015/smaps"
>         User time (seconds): 0.18
>         System time (seconds): 0.29
>         Percent of CPU this job got: 97%

The patch in question makes user and system
time accounting essentially tick-based. If
jiffies changes while the task is in user
mode, time gets accounted as user time, if
jiffies changes while the task is in system
mode, time gets accounted as system time.

If you get "unlucky", with a job like the
above, it is possible all time gets accounted
to system time.
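
(A toy model of that sampling, for illustration only -- it is not the
real accounting code. Sample a synthetic task at a 250 Hz tick: it
spends 3.6 ms of every 4 ms in a syscall and 0.4 ms in userspace, and
because the short user burst never spans a tick boundary, every sample
lands on the system side.)

$ awk 'BEGIN {
        tick = 4000; period = 4000; kernel_part = 3600    # microseconds
        for (t = tick; t <= 500000; t += tick)             # half a second of ticks
                if (t % period < kernel_part) sys++; else user++
        printf "sampled: user=%d system=%d ticks (real split: 10%% user / 90%% system)\n", user, sys
  }'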

This would be true both with the system running
with a periodic timer tick (before and after my
patch is applied), and in nohz_idle mode (after
my patch).

However, it does seem quite unlikely that you
get zero user time, since you have 125 timer
ticks in half a second. Furthermore, you do not
even have NO_HZ_FULL enabled...

Does the workload consistently get zero user
time?

If so, we need to dig further to see under
what precise circumstances that happens.

On my laptop, with kernel 4.6.3-300.fc24.x86_64
I get this:

$ /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf
"rss:%d pss:%d\n", rss, pss}' /proc/19825/smaps
rss:263368 pss:262145
	Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
{printf "rss:%d pss:%d\n", rss, pss} /proc/19825/smaps"
	User time (seconds): 0.64
	System time (seconds): 0.19
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.83

The main difference between your and my
NO_HZ config seems to be that NO_HZ_FULL
is set here. However, it is not enabled
at run time, so both of our systems
should only really get NO_HZ_IDLE
effectively.

Running tasks should get sampled with the
regular timer tick, while they are running.

In other words, vtime accounting should be
disabled in both of our tests, for everything
except the idle task.

Do I need to do anything special to reproduce
your bug, besides running the max mmap program
and the awk script?

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
  2016-08-22 22:44                                   ` Sonny Rao
@ 2016-08-24 10:14                                     ` Marcin Jabrzyk
  2016-08-30  8:20                                       ` Michal Hocko
  2016-08-29 14:37                                     ` Michal Hocko
  1 sibling, 1 reply; 49+ messages in thread
From: Marcin Jabrzyk @ 2016-08-24 10:14 UTC (permalink / raw)
  To: Sonny Rao, Michal Hocko
  Cc: Jann Horn, Robert Foss, Jonathan Corbet, Andrew Morton,
	Vlastimil Babka, Konstantin Khlebnikov, Hugh Dickins,
	Naoya Horiguchi, Minchan Kim, John Stultz, ross.zwisler,
	jmarchan, Johannes Weiner, Kees Cook, Al Viro, Cyrill Gorcunov,
	Robin Humble, David Rientjes, eric.engestrom, Janis Danisevskis,
	calvinowens, Alexey Dobriyan, Kirill A. Shutemov, ldufour,
	linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik



On 23/08/16 00:44, Sonny Rao wrote:
> On Mon, Aug 22, 2016 at 12:54 AM, Michal Hocko <mhocko@kernel.org> wrote:
>> On Fri 19-08-16 10:57:48, Sonny Rao wrote:
>>> On Fri, Aug 19, 2016 at 12:59 AM, Michal Hocko <mhocko@kernel.org> wrote:
>>>> On Thu 18-08-16 23:43:39, Sonny Rao wrote:
>>>>> On Thu, Aug 18, 2016 at 11:01 AM, Michal Hocko <mhocko@kernel.org> wrote:
>>>>>> On Thu 18-08-16 10:47:57, Sonny Rao wrote:
>>>>>>> On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko <mhocko@kernel.org> wrote:
>>>>>>>> On Wed 17-08-16 11:57:56, Sonny Rao wrote:
>>>>>> [...]
>>>>>>>>> 2) User space OOM handling -- we'd rather do a more graceful shutdown
>>>>>>>>> than let the kernel's OOM killer activate and need to gather this
>>>>>>>>> information and we'd like to be able to get this information to make
>>>>>>>>> the decision much faster than 400ms
>>>>>>>>
>>>>>>>> Global OOM handling in userspace is really dubious if you ask me. I
>>>>>>>> understand you want something better than SIGKILL and in fact this is
>>>>>>>> already possible with memory cgroup controller (btw. memcg will give
>>>>>>>> you a cheap access to rss, amount of shared, swapped out memory as
>>>>>>>> well). Anyway if you are getting close to the OOM your system will most
>>>>>>>> probably be really busy and chances are that also reading your new file
>>>>>>>> will take much more time. I am also not quite sure how is pss useful for
>>>>>>>> oom decisions.
>>>>>>>
>>>>>>> I mentioned it before, but based on experience RSS just isn't good
>>>>>>> enough -- there's too much sharing going on in our use case to make
>>>>>>> the correct decision based on RSS.  If RSS were good enough, simply
>>>>>>> put, this patch wouldn't exist.
>>>>>>
>>>>>> But that doesn't answer my question, I am afraid. So how exactly do you
>>>>>> use pss for oom decisions?
>>>>>
>>>>> We use PSS to calculate the memory used by a process among all the
>>>>> processes in the system, in the case of Chrome this tells us how much
>>>>> each renderer process (which is roughly tied to a particular "tab" in
>>>>> Chrome) is using and how much it has swapped out, so we know what the
>>>>> worst offenders are -- I'm not sure what's unclear about that?
>>>>
>>>> So let me ask more specifically. How can you make any decision based on
>>>> the pss when you do not know _what_ is the shared resource. In other
>>>> words if you select a task to terminate based on the pss then you have to
>>>> kill others who share the same resource otherwise you do not release
>>>> that shared resource. Not to mention that such a shared resource might
>>>> be on tmpfs/shmem and it won't get released even after all processes
>>>> which map it are gone.
>>>
>>> Ok I see why you're confused now, sorry.
>>>
>>> In our case that we do know what is being shared in general because
>>> the sharing is mostly between those processes that we're looking at
>>> and not other random processes or tmpfs, so PSS gives us useful data
>>> in the context of these processes which are sharing the data
>>> especially for monitoring between the set of these renderer processes.
>>
>> OK, I see and agree that pss might be useful when you _know_ what is
>> shared. But this sounds quite specific to a particular workload. How
>> many users are in a similar situation? In other words, if we present
>> a single number without the context, how much useful it will be in
>> general? Is it possible that presenting such a number could be even
>> misleading for somebody who doesn't have an idea which resources are
>> shared? These are all questions which should be answered before we
>> actually add this number (be it a new/existing proc file or a syscall).
>> I still believe that the number without wider context is just not all
>> that useful.
>
>
> I see the specific point about  PSS -- because you need to know what
> is being shared or otherwise use it in a whole system context, but I
> still think the whole system context is a valid and generally useful
> thing.  But what about the private_clean and private_dirty?  Surely
> those are more generally useful for calculating a lower bound on
> process memory usage without additional knowledge?
>
> At the end of the day all of these metrics are approximations, and it
> comes down to how far off the various approximations are and what
> trade offs we are willing to make.
> RSS is the cheapest but the most coarse.
>
> PSS (with the correct context) and Private data plus swap are much
> better but also more expensive due to the PT walk.
> As far as I know, to get anything but RSS we have to go through smaps
> or use memcg.  Swap seems to be available in /proc/<pid>/status.
>
> I looked at the "shared" value in /proc/<pid>/statm but it doesn't
> seem to correlate well with the shared value in smaps -- not sure why?
>
> It might be useful to show the magnitude of difference of using RSS vs
> PSS/Private in the case of the Chrome renderer processes.  On the
> system I was looking at there were about 40 of these processes, but I
> picked a few to give an idea:
>
> localhost ~ # cat /proc/21550/totmaps
> Rss:               98972 kB
> Pss:               54717 kB
> Shared_Clean:      19020 kB
> Shared_Dirty:      26352 kB
> Private_Clean:         0 kB
> Private_Dirty:     53600 kB
> Referenced:        92184 kB
> Anonymous:         46524 kB
> AnonHugePages:     24576 kB
> Swap:              13148 kB
>
>
> RSS is 80% higher than PSS and 84% higher than private data
>
> localhost ~ # cat /proc/21470/totmaps
> Rss:              118420 kB
> Pss:               70938 kB
> Shared_Clean:      22212 kB
> Shared_Dirty:      26520 kB
> Private_Clean:         0 kB
> Private_Dirty:     69688 kB
> Referenced:       111500 kB
> Anonymous:         79928 kB
> AnonHugePages:     24576 kB
> Swap:              12964 kB
>
> RSS is 66% higher than PSS and 69% higher than private data
>
> localhost ~ # cat /proc/21435/totmaps
> Rss:               97156 kB
> Pss:               50044 kB
> Shared_Clean:      21920 kB
> Shared_Dirty:      26400 kB
> Private_Clean:         0 kB
> Private_Dirty:     48836 kB
> Referenced:        90012 kB
> Anonymous:         75228 kB
> AnonHugePages:     24576 kB
> Swap:              13064 kB
>
> RSS is 94% higher than PSS and 98% higher than private data.
>
> It looks like there's a set of about 40MB of shared pages which cause
> the difference in this case.
> Swap was roughly even on these but I don't think it's always going to be true.
>
>

Sorry to hijack the thread, but I found it recently
and I guess it's the best place to present our point.
We are working on our custom OS based on Linux and we have also suffered
from the /proc/<pid>/smaps file. As in Chrome, we tried to improve our
internal application memory management policies (Low Memory Killer)
using data provided by smaps, but we failed due to the very long time
needed to read and properly parse the file.

We've also observed that the RSS measurement is often much higher than
PSS, which seems to be closer to the real memory usage of a process.
Using smaps we would be able to calculate the USS and know the exact
minimum amount of memory that would be freed after terminating some
process. Those are very important sources of information as they give
us the possibility to provide the best possible app life-cycle.

We have also tried to use smaps in an application for OS developers
as a source of detailed information about the memory usage of the system.
To check possible ways of improvement we tried totmaps from the earlier
version of this series. In a sample case for our app the CPU usage as
presented by 'top' decreased from ~60% to ~4.5% only by changing the
data source from smaps to totmaps (a sketch of such a monitoring loop
follows below).
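
(A sketch of the kind of per-process polling loop meant above, reduced
to its essence; 'our-app' is a placeholder process name and the field
selection is only an example. The per-pid smaps walk is what eats the
CPU, and a pre-summed file like totmaps replaces exactly that part.)

while sleep 1; do
        for pid in $(pgrep our-app); do
                awk '/^Pss:/{pss+=$2} /^Private_/{priv+=$2} /^Swap:/{swap+=$2}
                     END{printf "%s pss:%d priv:%d swap:%d kB\n", FILENAME, pss, priv, swap}' "/proc/$pid/smaps"
        done
done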

So we are also very interested in an interface such as totmaps, as it
gives detailed and complete memory usage information to user space; in
our case much of the information provided by smaps is simply not useful
at all.

We are also using, or have tried using, other interfaces like status,
statm, cgroups.memory etc., but totmaps/smaps is still the best
interface to get all of the per-process information in a single place.

>>
>>> We also use the private clean and private dirty and swap fields to
>>> make a few metrics for the processes and charge each process for it's
>>> private, shared, and swap data. Private clean and dirty are used for
>>> estimating a lower bound on how much memory would be freed.
>>
>> I can imagine that this kind of information might be useful and
>> presented in /proc/<pid>/statm. The question is whether some of the
>> existing consumers would see the performance impact due to he page table
>> walk. Anyway even these counters might get quite tricky because even
>> shareable resources are considered private if the process is the only
>> one to map them (so again this might be a file on tmpfs...).
>>
>>> Swap and
>>> PSS also give us some indication of additional memory which might get
>>> freed up.
>> --
>> Michal Hocko
>> SUSE Labs
>
>

-- 
Marcin Jabrzyk
Samsung R&D Institute Poland
Samsung Electronics

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: utime accounting regression since 4.6 (was: Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps)
  2016-08-23 21:46                                               ` Rik van Riel
@ 2016-08-24 16:56                                                 ` Michal Hocko
  2016-09-30  9:49                                                 ` Michal Hocko
  1 sibling, 0 replies; 49+ messages in thread
From: Michal Hocko @ 2016-08-24 16:56 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Minchan Kim, Sonny Rao, Jann Horn, Robert Foss, corbet,
	Andrew Morton, Vlastimil Babka, Konstantin Khlebnikov,
	Hugh Dickins, Naoya Horiguchi, John Stultz, ross.zwisler,
	jmarchan, Johannes Weiner, Kees Cook, Al Viro, Cyrill Gorcunov,
	Robin Humble, David Rientjes, eric.engestrom, Janis Danisevskis,
	calvinowens, Alexey Dobriyan, Kirill A. Shutemov, ldufour,
	linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik

[-- Attachment #1: Type: text/plain, Size: 789 bytes --]

On Tue 23-08-16 17:46:11, Rik van Riel wrote:
[...]
> Does the workload consistently get zero user
> time?

root@test1:~# for i in $(seq 20); do /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}' /proc/3028/smaps; done 2>&1 | grep "User time" | sort | uniq -c
     20         User time (seconds): 0.00

Sounds like pretty much consistent bad luck to me. I've attached the
strace of a single awk run; maybe this will help you understand the timing...

[...]

> Do I need to do anything special to reproduce
> your bug, besides running the max mmap program
> and the awk script?

That's how I noticed this. Nothing special is really needed. I am running
that in a kvm virtual machine but bare metal gives me the same
results.

-- 
Michal Hocko
SUSE Labs

[-- Attachment #2: awk.strace.gz --]
[-- Type: application/gzip, Size: 328871 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
  2016-08-22 22:44                                   ` Sonny Rao
  2016-08-24 10:14                                     ` Marcin Jabrzyk
@ 2016-08-29 14:37                                     ` Michal Hocko
  2016-08-30  8:15                                       ` Michal Hocko
  1 sibling, 1 reply; 49+ messages in thread
From: Michal Hocko @ 2016-08-29 14:37 UTC (permalink / raw)
  To: Sonny Rao
  Cc: Jann Horn, Robert Foss, Jonathan Corbet, Andrew Morton,
	Vlastimil Babka, Konstantin Khlebnikov, Hugh Dickins,
	Naoya Horiguchi, Minchan Kim, John Stultz, ross.zwisler,
	jmarchan, Johannes Weiner, Kees Cook, Al Viro, Cyrill Gorcunov,
	Robin Humble, David Rientjes, eric.engestrom, Janis Danisevskis,
	calvinowens, Alexey Dobriyan, Kirill A. Shutemov, ldufour,
	linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik

[Sorry for a late reply, I was busy with other stuff]

On Mon 22-08-16 15:44:53, Sonny Rao wrote:
> On Mon, Aug 22, 2016 at 12:54 AM, Michal Hocko <mhocko@kernel.org> wrote:
[...]
> But what about the private_clean and private_dirty?  Surely
> those are more generally useful for calculating a lower bound on
> process memory usage without additional knowledge?

I guess private_clean can be used as a reasonable estimate.
private_dirty less so, because it may refer to e.g. tmpfs which is not
mapped by any other process, and so no memory would be freed after unmap
without removing the file.

> At the end of the day all of these metrics are approximations, and it
> comes down to how far off the various approximations are and what
> trade offs we are willing to make.
> RSS is the cheapest but the most coarse.

I agree on this part definitely. I also understand that what we provide
currently is quite confusing and not really helpful. But I am afraid
that the accounting is far from trivial to make right for all the
possible cases.

> PSS (with the correct context) and Private data plus swap are much
> better but also more expensive due to the PT walk.

Maybe we can be more clever and do some form of caching. I haven't
thought that through to see how hard it would be. I mean we could
cache some data per mm_struct and invalidate it only when the current
value gets too far out of sync.
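
(To make that concrete, a purely userspace analogue of such caching --
not the in-kernel form it would actually take. The helper name, the
/tmp cache file and the 5 second staleness window are all arbitrary
choices for the sketch.)

pss_of() {                       # pss_of <pid>: Pss sum in kB, cached
        local pid=$1 cache=/tmp/pss.$1
        local now=$(date +%s) mtime=$(stat -c %Y "$cache" 2>/dev/null || echo 0)
        if [ $((now - mtime)) -ge 5 ]; then
                awk '/^Pss:/{pss+=$2} END{print pss}' "/proc/$pid/smaps" > "$cache"
        fi
        cat "$cache"
}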

> As far as I know, to get anything but RSS we have to go through smaps
> or use memcg.  Swap seems to be available in /proc/<pid>/status.
> 
> I looked at the "shared" value in /proc/<pid>/statm but it doesn't
> seem to correlate well with the shared value in smaps -- not sure why?

task_statm() only approximates "shared" as get_mm_counter(mm, MM_FILEPAGES) +
get_mm_counter(mm, MM_SHMEMPAGES), i.e. all the file and shmem pages
accounted to the mm. If they are not shared by anybody else they would be
considered private by smaps.
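
(The mismatch is easy to see by putting the two notions of "shared"
side by side; a small helper, reusing 21550 as the example pid and
converting statm's page count to kB.)

$ pid=21550
$ echo "statm shared:   $(( $(awk '{print $3}' /proc/$pid/statm) * $(getconf PAGESIZE) / 1024 )) kB"
$ echo "smaps Shared_*: $(awk '/^Shared_/{s+=$2} END{print s}' /proc/$pid/smaps) kB"

The first number counts every resident file/shmem page of the mm, the
second only pages mapped more than once, which is why they do not line up.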

> It might be useful to show the magnitude of difference of using RSS vs
> PSS/Private in the case of the Chrome renderer processes.  On the
> system I was looking at there were about 40 of these processes, but I
> picked a few to give an idea:
> 
> localhost ~ # cat /proc/21550/totmaps
> Rss:               98972 kB
> Pss:               54717 kB
> Shared_Clean:      19020 kB
> Shared_Dirty:      26352 kB
> Private_Clean:         0 kB
> Private_Dirty:     53600 kB
> Referenced:        92184 kB
> Anonymous:         46524 kB
> AnonHugePages:     24576 kB
> Swap:              13148 kB
> 
> 
> RSS is 80% higher than PSS and 84% higher than private data
> 
> localhost ~ # cat /proc/21470/totmaps
> Rss:              118420 kB
> Pss:               70938 kB
> Shared_Clean:      22212 kB
> Shared_Dirty:      26520 kB
> Private_Clean:         0 kB
> Private_Dirty:     69688 kB
> Referenced:       111500 kB
> Anonymous:         79928 kB
> AnonHugePages:     24576 kB
> Swap:              12964 kB
> 
> RSS is 66% higher than PSS and 69% higher than private data
> 
> localhost ~ # cat /proc/21435/totmaps
> Rss:               97156 kB
> Pss:               50044 kB
> Shared_Clean:      21920 kB
> Shared_Dirty:      26400 kB
> Private_Clean:         0 kB
> Private_Dirty:     48836 kB
> Referenced:        90012 kB
> Anonymous:         75228 kB
> AnonHugePages:     24576 kB
> Swap:              13064 kB
> 
> RSS is 94% higher than PSS and 98% higher than private data.
> 
> It looks like there's a set of about 40MB of shared pages which cause
> the difference in this case.
> Swap was roughly even on these but I don't think it's always going to be true.

OK, I see that those processes differ in the way they are using
memory, but I am not really sure what kind of conclusion you can draw
from that.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
  2016-08-29 14:37                                     ` Michal Hocko
@ 2016-08-30  8:15                                       ` Michal Hocko
  0 siblings, 0 replies; 49+ messages in thread
From: Michal Hocko @ 2016-08-30  8:15 UTC (permalink / raw)
  To: Sonny Rao
  Cc: Jann Horn, Robert Foss, Jonathan Corbet, Andrew Morton,
	Vlastimil Babka, Konstantin Khlebnikov, Hugh Dickins,
	Naoya Horiguchi, Minchan Kim, John Stultz, ross.zwisler,
	jmarchan, Johannes Weiner, Kees Cook, Al Viro, Cyrill Gorcunov,
	Robin Humble, David Rientjes, eric.engestrom, Janis Danisevskis,
	calvinowens, Alexey Dobriyan, Kirill A. Shutemov, ldufour,
	linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik

On Mon 29-08-16 16:37:04, Michal Hocko wrote:
> [Sorry for a late reply, I was busy with other stuff]
> 
> On Mon 22-08-16 15:44:53, Sonny Rao wrote:
> > On Mon, Aug 22, 2016 at 12:54 AM, Michal Hocko <mhocko@kernel.org> wrote:
> [...]
> > But what about the private_clean and private_dirty?  Surely
> > those are more generally useful for calculating a lower bound on
> > process memory usage without additional knowledge?
> 
> I guess private_clean can be used as a reasonable estimate.

I was thinking about this more and I think I am wrong here. Even truly
MAP_PRIVATE|MAP_ANON memory will be in private_dirty. So private_clean
becomes not all that interesting and is just as misleading as its _dirty
variant (an mmapped file after [m]sync should become _clean), and that
doesn't mean the memory will get freed after the process which maps it
terminates. Take shmem as an example again.

> private_dirty less so because it may refer to e.g. tmpfs which is not
> mapped by other process and so no memory would be freed after unmap
> without removing the file.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
  2016-08-24 10:14                                     ` Marcin Jabrzyk
@ 2016-08-30  8:20                                       ` Michal Hocko
  0 siblings, 0 replies; 49+ messages in thread
From: Michal Hocko @ 2016-08-30  8:20 UTC (permalink / raw)
  To: Marcin Jabrzyk
  Cc: Sonny Rao, Jann Horn, Robert Foss, Jonathan Corbet,
	Andrew Morton, Vlastimil Babka, Konstantin Khlebnikov,
	Hugh Dickins, Naoya Horiguchi, Minchan Kim, John Stultz,
	ross.zwisler, jmarchan, Johannes Weiner, Kees Cook, Al Viro,
	Cyrill Gorcunov, Robin Humble, David Rientjes, eric.engestrom,
	Janis Danisevskis, calvinowens, Alexey Dobriyan,
	Kirill A. Shutemov, ldufour, linux-doc, linux-kernel, Ben Zhang,
	Filipe Brandenburger, Mateusz Guzik

On Wed 24-08-16 12:14:06, Marcin Jabrzyk wrote:
[...]
> Sorry to hijack the thread, but I've found it recently
> and I guess it's the best place to present our point.
> We are working at our custom OS based on Linux and we also suffered much
> by /proc/<pid>/smaps file. As in Chrome we tried to improve our internal
> application memory management polices (Low Memory Killer) using data
> provided by smaps but we failed due to very long time needed for reading
> and parsing properly the file.

I was already questioning Pss and also Private_* for any memory killer
purpose earlier in the thread, because cumulative numbers over all
mappings can be really meaningless -- especially when you do not know
which resource is shared and by whom. Maybe you can describe how
you are using those cumulative numbers for your decisions and prove me
wrong, but I simply haven't heard any sound arguments so far. Everything
was just "we know what we are doing in our environment so we know those
resources and therefore those numbers make sense to us". But with all due
respect this is not a reason to add a user-visible API to the kernel.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: utime accounting regression since 4.6 (was: Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps)
  2016-08-23 21:46                                               ` Rik van Riel
  2016-08-24 16:56                                                 ` Michal Hocko
@ 2016-09-30  9:49                                                 ` Michal Hocko
  2016-09-30 12:35                                                   ` Rik van Riel
  1 sibling, 1 reply; 49+ messages in thread
From: Michal Hocko @ 2016-09-30  9:49 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Minchan Kim, Sonny Rao, Jann Horn, Robert Foss, corbet,
	Andrew Morton, Vlastimil Babka, Konstantin Khlebnikov,
	Hugh Dickins, Naoya Horiguchi, John Stultz, ross.zwisler,
	jmarchan, Johannes Weiner, Kees Cook, Al Viro, Cyrill Gorcunov,
	Robin Humble, David Rientjes, eric.engestrom, Janis Danisevskis,
	calvinowens, Alexey Dobriyan, Kirill A. Shutemov, ldufour,
	linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik, Mel Gorman, Mike Galbraith

[CC Mike and Mel as they have seen some accounting oddities
 when doing performance testing. They can share details but
 essentially the system time just gets too high]

For your reference the email thread started
http://lkml.kernel.org/r/20160823143330.GL23577@dhcp22.suse.cz

I suspect this is mainly for short-lived processes - like a kernel compile:
$ /usr/bin/time -v make mm/mmap.o
[...]
        User time (seconds): 0.45
        System time (seconds): 0.82
        Percent of CPU this job got: 111%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.14
$ rm mm/mmap.o
$ /usr/bin/time -v make mm/mmap.o
[...]
        User time (seconds): 0.47
        System time (seconds): 1.55
        Percent of CPU this job got: 107%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.88

This is quite unexpected for a cache-hot compile. I would expect most of
the time to be spent in userspace.

$ perf report | grep kernel.vmlinux
     2.01%  as        [kernel.vmlinux]         [k] page_fault
     0.59%  cc1       [kernel.vmlinux]         [k] page_fault
     0.15%  git       [kernel.vmlinux]         [k] page_fault
     0.12%  bash      [kernel.vmlinux]         [k] page_fault
     0.11%  sh        [kernel.vmlinux]         [k] page_fault
     0.08%  gcc       [kernel.vmlinux]         [k] page_fault
     0.06%  make      [kernel.vmlinux]         [k] page_fault
     0.04%  rm        [kernel.vmlinux]         [k] page_fault
     0.03%  ld        [kernel.vmlinux]         [k] page_fault
     0.02%  bash      [kernel.vmlinux]         [k] entry_SYSCALL_64
     0.01%  git       [kernel.vmlinux]         [k] entry_SYSCALL_64
     0.01%  cat       [kernel.vmlinux]         [k] page_fault
     0.01%  collect2  [kernel.vmlinux]         [k] page_fault
     0.00%  sh        [kernel.vmlinux]         [k] entry_SYSCALL_64
     0.00%  rm        [kernel.vmlinux]         [k] entry_SYSCALL_64
     0.00%  grep      [kernel.vmlinux]         [k] page_fault

doesn't show anything unexpected.

Original Rik's reply follows:

On Tue 23-08-16 17:46:11, Rik van Riel wrote:
> On Tue, 2016-08-23 at 16:33 +0200, Michal Hocko wrote:
[...]
> > OK, so it seems I found it. I was quite lucky because
> > account_user_time
> > is not all that popular function and there were basically no changes
> > besides Riks ff9a9b4c4334 ("sched, time: Switch
> > VIRT_CPU_ACCOUNTING_GEN
> > to jiffy granularity") and that seems to cause the regression.
> > Reverting
> > the commit on top of the current mmotm seems to fix the issue for me.
> > 
> > And just to give Rik more context. While debugging overhead of the
> > /proc/<pid>/smaps I am getting a misleading output from /usr/bin/time
> > -v
> > (source for ./max_mmap is [1])
> > 
> > root@test1:~# uname -r
> > 4.5.0-rc6-bisect1-00025-gff9a9b4c4334
> > root@test1:~# ./max_map 
> > pid:2990 maps:65515
> > root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2}
> > END {printf "rss:%d pss:%d\n", rss, pss}' /proc/2990/smaps
> > rss:263368 pss:262203
> >         Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
> > {printf "rss:%d pss:%d\n", rss, pss} /proc/2990/smaps"
> >         User time (seconds): 0.00
> >         System time (seconds): 0.45
> >         Percent of CPU this job got: 98%
> > 
> 
> > root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2}
> > END {printf "rss:%d pss:%d\n", rss, pss}' /proc/3015/smaps
> > rss:263316 pss:262199
> >         Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
> > {printf "rss:%d pss:%d\n", rss, pss} /proc/3015/smaps"
> >         User time (seconds): 0.18
> >         System time (seconds): 0.29
> >         Percent of CPU this job got: 97%
> 
> The patch in question makes user and system
> time accounting essentially tick-based. If
> jiffies changes while the task is in user
> mode, time gets accounted as user time, if
> jiffies changes while the task is in system
> mode, time gets accounted as system time.
> 
> If you get "unlucky", with a job like the
> above, it is possible all time gets accounted
> to system time.
> 
> This would be true both with the system running
> with a periodic timer tick (before and after my
> patch is applied), and in nohz_idle mode (after
> my patch).
> 
> However, it does seem quite unlikely that you
> get zero user time, since you have 125 timer
> ticks in half a second. Furthermore, you do not
> even have NO_HZ_FULL enabled...
> 
> Does the workload consistently get zero user
> time?
> 
> If so, we need to dig further to see under
> what precise circumstances that happens.
> 
> On my laptop, with kernel 4.6.3-300.fc24.x86_64
> I get this:
> 
> $ /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf
> "rss:%d pss:%d\n", rss, pss}' /proc/19825/smaps
> rss:263368 pss:262145
> 	Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
> {printf "rss:%d pss:%d\n", rss, pss} /proc/19825/smaps"
> 	User time (seconds): 0.64
> 	System time (seconds): 0.19
> 	Percent of CPU this job got: 99%
> 	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.83
> 
> The main difference between your and my
> NO_HZ config seems to be that NO_HZ_FULL
> is set here. However, it is not enabled
> at run time, so both of our systems
> should only really get NO_HZ_IDLE
> effectively.
> 
> Running tasks should get sampled with the
> regular timer tick, while they are running.
> 
> In other words, vtime accounting should be
> disabled in both of our tests, for everything
> except the idle task.
> 
> Do I need to do anything special to reproduce
> your bug, besides running the max mmap program
> and the awk script?
> 
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: utime accounting regression since 4.6 (was: Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps)
  2016-09-30  9:49                                                 ` Michal Hocko
@ 2016-09-30 12:35                                                   ` Rik van Riel
  0 siblings, 0 replies; 49+ messages in thread
From: Rik van Riel @ 2016-09-30 12:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, Sonny Rao, Jann Horn, Robert Foss, corbet,
	Andrew Morton, Vlastimil Babka, Konstantin Khlebnikov,
	Hugh Dickins, Naoya Horiguchi, John Stultz, ross.zwisler,
	jmarchan, Johannes Weiner, Kees Cook, Al Viro, Cyrill Gorcunov,
	Robin Humble, David Rientjes, eric.engestrom, Janis Danisevskis,
	calvinowens, Alexey Dobriyan, Kirill A. Shutemov, ldufour,
	linux-doc, linux-kernel, Ben Zhang, Bryan Freed,
	Filipe Brandenburger, Mateusz Guzik, Mel Gorman, Mike Galbraith

[-- Attachment #1: Type: text/plain, Size: 1192 bytes --]

On Fri, 2016-09-30 at 11:49 +0200, Michal Hocko wrote:
> [CC Mike and Mel as they have seen some accounting oddities
>  when doing performance testing. They can share details but
>  essentially the system time just gets too high]
> 
> For your reference the email thread started
> http://lkml.kernel.org/r/20160823143330.GL23577@dhcp22.suse.cz
> 
> I suspect this is mainly for short lived processes - like kernel
> compile
> $ /usr/bin/time -v make mm/mmap.o
> [...]
>         User time (seconds): 0.45
>         System time (seconds): 0.82
>         Percent of CPU this job got: 111%
>         Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.14
> $ rm mm/mmap.o
> $ /usr/bin/time -v make mm/mmap.o
> [...]
>         User time (seconds): 0.47
>         System time (seconds): 1.55
>         Percent of CPU this job got: 107%
>         Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.88

I was not able to get the "expected" results from
your last reproducer, but this one does happen on
my system, too.

The bad news is, I still have no clue what is causing
it...

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread

Thread overview: 49+ messages
2016-08-12 22:04 [PACTH v2 0/3] Implement /proc/<pid>/totmaps robert.foss
2016-08-12 22:04 ` [PACTH v2 1/3] mm, proc: " robert.foss
2016-08-13 14:39   ` Jann Horn
2016-08-15 13:57     ` Robert Foss
2016-08-15 20:14       ` Robert Foss
2016-08-12 22:04 ` [PACTH v2 2/3] Documentation/filesystems: Fixed typo robert.foss
2016-08-12 22:04 ` [PACTH v2 3/3] Documentation/filesystems: Added /proc/PID/totmaps documentation robert.foss
2016-08-14  9:04 ` [PACTH v2 0/3] Implement /proc/<pid>/totmaps Michal Hocko
2016-08-15 13:00   ` Robert Foss
2016-08-15 13:42     ` Michal Hocko
2016-08-15 16:25       ` Robert Foss
2016-08-16  7:12         ` Michal Hocko
2016-08-16 16:46           ` Robert Foss
2016-08-17  8:22             ` Michal Hocko
2016-08-17  9:31               ` Jann Horn
2016-08-17 13:03                 ` Michal Hocko
2016-08-17 16:48                   ` Robert Foss
2016-08-17 18:57                   ` Sonny Rao
2016-08-18  7:44                     ` Michal Hocko
2016-08-18 17:47                       ` Sonny Rao
2016-08-18 18:01                         ` Michal Hocko
2016-08-18 21:05                           ` Robert Foss
2016-08-19  6:27                             ` Sonny Rao
2016-08-19  2:26                           ` Minchan Kim
2016-08-19  6:47                             ` Sonny Rao
2016-08-19  8:05                             ` Michal Hocko
2016-08-19 18:20                               ` Sonny Rao
2016-08-22  0:07                               ` Minchan Kim
2016-08-22  7:40                                 ` Michal Hocko
2016-08-22 14:12                                   ` Minchan Kim
2016-08-22 14:37                                     ` Robert Foss
2016-08-22 16:45                                     ` Michal Hocko
2016-08-22 17:29                                       ` Michal Hocko
2016-08-22 17:47                                         ` Michal Hocko
2016-08-23  8:26                                           ` Michal Hocko
2016-08-23 14:33                                             ` utime accounting regression since 4.6 (was: Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps) Michal Hocko
2016-08-23 21:46                                               ` Rik van Riel
2016-08-24 16:56                                                 ` Michal Hocko
2016-09-30  9:49                                                 ` Michal Hocko
2016-09-30 12:35                                                   ` Rik van Riel
2016-08-19  6:43                           ` [PACTH v2 0/3] Implement /proc/<pid>/totmaps Sonny Rao
2016-08-19  7:59                             ` Michal Hocko
2016-08-19 17:57                               ` Sonny Rao
2016-08-22  7:54                                 ` Michal Hocko
2016-08-22 22:44                                   ` Sonny Rao
2016-08-24 10:14                                     ` Marcin Jabrzyk
2016-08-30  8:20                                       ` Michal Hocko
2016-08-29 14:37                                     ` Michal Hocko
2016-08-30  8:15                                       ` Michal Hocko
