* [PATCH v5 0/6] enhance shmem process and swap accounting
@ 2015-11-18  9:29 ` Vlastimil Babka
  0 siblings, 0 replies; 19+ messages in thread
From: Vlastimil Babka @ 2015-11-18  9:29 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: linux-kernel, Jerome Marchand, Hugh Dickins, Michal Hocko,
	Peter Zijlstra, Oleg Nesterov, linux-api, linux-doc,
	Vlastimil Babka, Konstantin Khlebnikov, Michal Hocko

Changes since v4:
o Rebase on next-20151118
o Hugh pointed out a problem with private mappings of tmpfs files where
  smaps would show a sum of shmem object's swapped out pages and swapped
  out COWed pages. Fixed this by falling back to the find_get_page() approach.
  Patches are now layered by employing find_get_page() first, and then
  optimizing the non-private mappings on top (with some measurements).
o Expanded commit messages.

Changes since v3:
o Rebase on next-20151002
o Apply (feedb)acks from Michal Hocko and Konstantin Khlebnikov (Thanks!)
  - drop CONFIG_SHMEM ifdefs, as it was the 2nd suggestion already
  - add comments about not taking i_mutex in patch 2
o Rename VmAnon/VmFile/VmShm to RssAnon/RssFile... to make it hopefully more
  obvious that it's a breakdown of VmRSS. Naming things sucks.

Changes since v2:
o Rebase on next-20150805.
o This means that /proc/pid/smaps has the proportional swap share (SwapPss:)
  field as per https://lkml.org/lkml/2015/6/15/274
  It's not clear what to do with shmem here so it's 0 for now.
  - swapped out shmem doesn't have swap entries, so we would have to look at who
    else has the shmem object (partially) mapped
  - to be more precise we should also check if their range actually includes
    the offset in question, which could get rather involved
  - or is there some easy way I don't see?
o Konstantin suggested for patch 3/4 that I drop the CONFIG_SHMEM #ifdefs.
  I didn't see the point in going against tinyfication when the work is
  already done, but I can do that if more people think it's better and it
  would otherwise block the series.

Changes since v1:
o In Patch 2, rely on SHMEM_I(inode)->swapped if possible, and fall back to the
  radix tree iterator on partially mapped shmem objects, i.e. decouple shmem
  swap usage determination from the page walk, for performance reasons.
  Thanks to Jerome and Konstantin for the tips.
  The downside is that mm/shmem.c had to be touched.

This series is based on Jerome Marchand's [1] so let me quote the first
paragraph from there:

There are several shortcomings in the accounting of shared memory
(sysV shm, shared anonymous mappings, mappings of tmpfs files). The
values in /proc/<pid>/status and statm don't allow one to distinguish
between shmem memory and a shared mapping to a regular file, even
though their implications on memory usage are quite different: at
reclaim, file mappings can be dropped or written back to disk, while
shmem needs a place in swap. As for shmem pages that are swapped out or
in the swap cache, they aren't accounted at all.

My original motivation is that a customer found it (IMHO rightfully)
confusing that e.g. top's output for process swap usage is unreliable with
respect to swapped out shmem pages, which are not accounted for.

The fundamental difference between private anonymous and shmem pages is that
on swap-out, the latter have their PTEs converted to pte_none rather than to
swap entries (swapents). As such, they are not counted towards the number of
swapents visible e.g. in the /proc/pid/status VmSwap row. It might
theoretically be possible to use swapents when swapping out shmem (without
extra cost, as one has to change all mappers anyway), and on swap-in convert
the swapent only for the faulting process, leaving the swapents in other
processes until they also fault (so again no extra cost). But I don't know how
many assumptions this would break, and it would be too disruptive a change for
a relatively small benefit.
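
To illustrate, here is a simplified sketch (not the actual fs/proc/task_mmu.c
code) of why the smaps page walk cannot see swapped out shmem pages: only a
swap-entry PTE contributes to the swap counter, while a pte_none entry is
ambiguous.

static void swap_accounting_sketch(pte_t *pte, struct mem_size_stats *mss)
{
	if (is_swap_pte(*pte)) {
		/* private anonymous page swapped out: PTE holds a swapent */
		mss->swap += PAGE_SIZE;
	} else if (pte_none(*pte)) {
		/*
		 * Either never faulted in, or a swapped out shmem page;
		 * we cannot tell the two apart without consulting the
		 * shmem object (see patch 2/6).
		 */
	}
}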

Instead, my approach is to document the limitation of VmSwap, and provide means
to determine the swap usage for shmem areas for those who are interested and
willing to pay the price, using /proc/pid/smaps. Because outside of ipcs, I
don't think it's currently possible to determine the usage at all. The
previous patchset [1] did introduce new shmem-specific fields into smaps
output, and functions to determine the values. I take a simpler approach,
noting that smaps output already has a "Swap: X kB" line, where currently X ==
0 always for shmem areas. I think we can just consider this a bug and provide
the proper value by consulting the radix tree, as e.g. mincore_page() does. In the
patch changelog I explain why this is also not perfect (and cannot be without
swapents), but still arguably much better than showing a 0.

The last two patches are adapted from Jerome's patchset and provide a VmRSS
breakdown to RssAnon, RssFile and RssShm in /proc/pid/status. Hugh noted that
this is a welcome addition, and I agree that it might help e.g. with debugging
process memory usage, at an albeit non-zero, but still rather low, cost of an
extra per-mm counter and some page flag checks.
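
As a usage illustration only (the field names RssAnon:, RssFile: and RssShm:
are the ones proposed by this series), the breakdown could be read from
userspace with a sketch like:

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/self/status", "r");

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		/* print VmRSS and its proposed breakdown fields */
		if (!strncmp(line, "VmRSS:", 6) ||
		    !strncmp(line, "RssAnon:", 8) ||
		    !strncmp(line, "RssFile:", 8) ||
		    !strncmp(line, "RssShm:", 7))
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}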

[1] http://lwn.net/Articles/611966/

Jerome Marchand (2):
  mm, shmem: add internal shmem resident memory accounting
  mm, procfs: breakdown RSS for anon, shmem and file in /proc/pid/status

Vlastimil Babka (4):
  mm, documentation: clarify /proc/pid/status VmSwap limitations for
    shmem
  mm, proc: account for shmem swap in /proc/pid/smaps
  mm, proc: reduce cost of /proc/pid/smaps for shmem mappings
  mm, proc: reduce cost of /proc/pid/smaps for unpopulated shmem
    mappings

 Documentation/filesystems/proc.txt | 21 ++++++++--
 arch/s390/mm/pgtable.c             |  5 +--
 fs/proc/task_mmu.c                 | 70 ++++++++++++++++++++++++++++++--
 include/linux/mm.h                 | 18 ++++++++-
 include/linux/mm_types.h           |  7 ++--
 include/linux/shmem_fs.h           |  4 ++
 kernel/events/uprobes.c            |  2 +-
 mm/memory.c                        | 30 +++++---------
 mm/oom_kill.c                      |  5 ++-
 mm/rmap.c                          | 12 ++----
 mm/shmem.c                         | 81 ++++++++++++++++++++++++++++++++++++++
 11 files changed, 208 insertions(+), 47 deletions(-)

-- 
2.6.3


* [PATCH v5 1/6] mm, documentation: clarify /proc/pid/status VmSwap limitations for shmem
@ 2015-11-18  9:29   ` Vlastimil Babka
  0 siblings, 0 replies; 19+ messages in thread
From: Vlastimil Babka @ 2015-11-18  9:29 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: linux-kernel, Jerome Marchand, Hugh Dickins, Michal Hocko,
	Peter Zijlstra, Oleg Nesterov, linux-api, linux-doc,
	Vlastimil Babka, Konstantin Khlebnikov, Michal Hocko

The documentation for /proc/pid/status does not mention that the value of
VmSwap counts only swapped out anonymous private pages, and not swapped out
pages of the underlying shmem objects (for shmem mappings). This is not
obvious, so document this limitation.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
---
 Documentation/filesystems/proc.txt | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 402ab99..9f13b6e 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -238,7 +238,8 @@ Table 1-2: Contents of the status files (as of 4.1)
  VmLib                       size of shared library code
  VmPTE                       size of page table entries
  VmPMD                       size of second level page tables
- VmSwap                      size of swap usage (the number of referred swapents)
+ VmSwap                      amount of swap used by anonymous private data
+                             (shmem swap usage is not included)
  HugetlbPages                size of hugetlb memory portions
  Threads                     number of threads
  SigQ                        number of signals queued/max. number for queue
-- 
2.6.3


* [PATCH v5 2/6] mm, proc: account for shmem swap in /proc/pid/smaps
  2015-11-18  9:29 ` Vlastimil Babka
@ 2015-11-18  9:29   ` Vlastimil Babka
  -1 siblings, 0 replies; 19+ messages in thread
From: Vlastimil Babka @ 2015-11-18  9:29 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: linux-kernel, Jerome Marchand, Hugh Dickins, Michal Hocko,
	Peter Zijlstra, Oleg Nesterov, linux-api, linux-doc,
	Vlastimil Babka, Konstantin Khlebnikov, Michal Hocko

Currently, /proc/pid/smaps will always show "Swap: 0 kB" for shmem-backed
mappings, even if the mapped portion does contain pages that were swapped out.
This is because unlike private anonymous mappings, shmem does not change the
pte to a swap entry, but to pte_none, when swapping the page out. In the smaps
page walk, such a page thus looks like it was never faulted in.

This patch changes smaps_pte_entry() to determine the swap status for such
pte_none entries for shmem mappings, similarly to how mincore_page() does it.
Swapped out shmem pages are thus accounted for. For private mappings of tmpfs
files that COWed some of the pages, the swapped out status of the original
shmem pages is naturally ignored. If some of the private copies were also
swapped out, they are accounted via their page table swap entries, so the
reported swap usage is then a sum of both swapped out private copies, and
swapped out shmem pages that were not COWed. No double accounting can thus
happen.

The accounting is arguably still not as precise as for private anonymous
mappings, since we will now also count pages that the process in question never
accessed, but that another process populated and then let become swapped out.
I believe it is still less confusing and subtle than not showing any swap
usage by shmem mappings at all. The swapped out counter might be of interest
to users who would like to prevent future swapins during a performance
critical operation, by pre-faulting the pages at their convenience (see the
sketch below). Especially for larger swapped out regions, the cost of swapin
is much higher than a fresh page allocation. So a differentiation between
pte_none and swapped out is important for those usecases.
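
As one possible example (a hypothetical helper, not part of this series), such
pre-faulting of a region that smaps reports as swapped out could look roughly
like this, assuming madvise(MADV_WILLNEED) is sufficient to trigger the
swap-in; touching each page would be the fallback:

#include <sys/mman.h>

static int prefault_region(void *addr, size_t len)
{
	/* ask the kernel to read the region back in ahead of time */
	return madvise(addr, len, MADV_WILLNEED);
}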

One downside of this patch is that it makes /proc/pid/smaps more expensive for
shmem mappings, as we consult the radix tree for each pte_none entry, so the
overall complexity is O(n*log(n)). I have measured this on a process that
creates a 2GB mapping and dirties single pages with a stride of 2MB, and timed
how long it takes to cat /proc/pid/smaps of this process 100 times.
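
The test program is roughly the following sketch (only the mapping flags and
protection differ between the cases measured here and in the next patches; the
anonymous case uses MAP_ANONYMOUS|MAP_PRIVATE instead of the file):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_SIZE	(2UL << 30)	/* 2GB mapping */
#define STRIDE		(2UL << 20)	/* dirty one page every 2MB */

int main(void)
{
	char *map;
	unsigned long off;
	int fd = open("/dev/shm/file", O_RDWR | O_CREAT, 0600);

	if (fd < 0 || ftruncate(fd, MAP_SIZE))
		return 1;
	map = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return 1;
	for (off = 0; off < MAP_SIZE; off += STRIDE)
		map[off] = 1;	/* fault in and dirty a single page */
	pause();		/* keep the mapping alive while smaps is read */
	return 0;
}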

Private anonymous mapping:

real    0m0.949s
user    0m0.116s
sys     0m0.348s

Mapping of a /dev/shm/file:

real    0m3.831s
user    0m0.180s
sys     0m3.212s

The difference is rather substantial, so the next patch will reduce the cost
for shared or read-only mappings.

In a less controlled experiment, I've gathered pids of processes on my desktop
that have either '/dev/shm/*' or 'SYSV*' in smaps. This included the Chrome
browser and some KDE processes. Again, I've run cat /proc/pid/smaps on each
100 times.
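
A sketch of one way to gather such pids (simply scanning each /proc/<pid>/smaps
for the two patterns; the exact method used for the measurement may differ):

#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	DIR *proc = opendir("/proc");
	struct dirent *de;
	char path[64], line[512];

	if (!proc)
		return 1;
	while ((de = readdir(proc))) {
		FILE *f;
		int found = 0;

		if (!isdigit((unsigned char)de->d_name[0]))
			continue;
		snprintf(path, sizeof(path), "/proc/%s/smaps", de->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;
		while (!found && fgets(line, sizeof(line), f))
			found = strstr(line, "/dev/shm/") || strstr(line, "SYSV");
		fclose(f);
		if (found)
			printf("%s\n", de->d_name);
	}
	closedir(proc);
	return 0;
}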

Before this patch:

real    0m9.050s
user    0m0.518s
sys     0m8.066s

After this patch:

real    0m9.221s
user    0m0.541s
sys     0m8.187s

This suggests low impact on average systems.

Note that this patch doesn't attempt to adjust the SwapPss field for shmem
mappings, which would need extra work to determine who else could have the
pages mapped. Thus the value stays zero except for COWed swapped out pages in
a shmem mapping, which are accounted as usual.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 Documentation/filesystems/proc.txt |  5 +++-
 fs/proc/task_mmu.c                 | 51 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 9f13b6e..fdeb5b3 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -460,7 +460,10 @@ and a page is modified, the file page is replaced by a private anonymous copy.
 hugetlbfs page which is *not* counted in "RSS" or "PSS" field for historical
 reasons. And these are not included in {Shared,Private}_{Clean,Dirty} field.
 "Swap" shows how much would-be-anonymous memory is also used, but out on swap.
-"SwapPss" shows proportional swap share of this mapping.
+For shmem mappings, "Swap" includes also the size of the mapped (and not
+replaced by copy-on-write) part of the underlying shmem object out on swap.
+"SwapPss" shows proportional swap share of this mapping. Unlike "Swap", this
+does not take into account swapped out page of underlying shmem objects.
 "Locked" indicates whether the mapping is locked in memory or not.
 
 "VmFlags" field deserves a separate description. This member represents the kernel
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 9e0938b..7e0c4c2 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -451,6 +451,7 @@ struct mem_size_stats {
 	unsigned long private_hugetlb;
 	u64 pss;
 	u64 swap_pss;
+	bool check_shmem_swap;
 };
 
 static void smaps_account(struct mem_size_stats *mss, struct page *page,
@@ -500,6 +501,45 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
 	}
 }
 
+#ifdef CONFIG_SHMEM
+static unsigned long smaps_shmem_swap(struct vm_area_struct *vma,
+		unsigned long addr)
+{
+	struct page *page;
+
+	page = find_get_entry(vma->vm_file->f_mapping,
+					linear_page_index(vma, addr));
+	if (!page)
+		return 0;
+
+	if (radix_tree_exceptional_entry(page))
+		return PAGE_SIZE;
+
+	page_cache_release(page);
+	return 0;
+
+}
+
+static int smaps_pte_hole(unsigned long addr, unsigned long end,
+		struct mm_walk *walk)
+{
+	struct mem_size_stats *mss = walk->private;
+
+	while (addr < end) {
+		mss->swap += smaps_shmem_swap(walk->vma, addr);
+		addr += PAGE_SIZE;
+	}
+
+	return 0;
+}
+#else
+static unsigned long smaps_shmem_swap(struct vm_area_struct *vma,
+		unsigned long addr)
+{
+	return 0;
+}
+#endif
+
 static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 		struct mm_walk *walk)
 {
@@ -527,6 +567,9 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 			}
 		} else if (is_migration_entry(swpent))
 			page = migration_entry_to_page(swpent);
+	} else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
+							&& pte_none(*pte))) {
+		mss->swap += smaps_shmem_swap(vma, addr);
 	}
 
 	if (!page)
@@ -686,6 +729,14 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
 	};
 
 	memset(&mss, 0, sizeof mss);
+
+#ifdef CONFIG_SHMEM
+	if (vma->vm_file && shmem_mapping(vma->vm_file->f_mapping)) {
+		mss.check_shmem_swap = true;
+		smaps_walk.pte_hole = smaps_pte_hole;
+	}
+#endif
+
 	/* mmap_sem is held in m_start */
 	walk_page_vma(vma, &smaps_walk);
 
-- 
2.6.3


* [PATCH v5 3/6] mm, proc: reduce cost of /proc/pid/smaps for shmem mappings
  2015-11-18  9:29 ` Vlastimil Babka
@ 2015-11-18  9:29   ` Vlastimil Babka
  -1 siblings, 0 replies; 19+ messages in thread
From: Vlastimil Babka @ 2015-11-18  9:29 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: linux-kernel, Jerome Marchand, Hugh Dickins, Michal Hocko,
	Peter Zijlstra, Oleg Nesterov, linux-api, linux-doc,
	Vlastimil Babka, Konstantin Khlebnikov, Michal Hocko

The previous patch has improved swap accounting for shmem mappings, which
however made /proc/pid/smaps more expensive for them, as we consult
the radix tree for each pte_none entry, so the overall complexity is
O(n*log(n)).

We can reduce this significantly for mappings that cannot contain COWed pages,
because then we can either use the statistics that the shmem object itself
tracks (if the mapping contains the whole object, or the swap usage of the
whole object is zero), or use the radix tree iterator, which is much more
efficient than repeated find_get_entry() calls.

This patch therefore introduces a function shmem_swap_usage(vma) and makes
/proc/pid/smaps use it when possible. Only for writable private mappings of
shmem objects (i.e. tmpfs files) with the shmem object itself (partially)
swapped out do we have to resort to the find_get_entry() approach. Hopefully
such mappings are relatively uncommon.

To demonstrate the difference, I have measured this on a process that creates
a 2GB mapping and dirties single pages with a stride of 2MB, and timed how long
it takes to cat /proc/pid/smaps of this process 100 times.

Private writable mapping of a /dev/shm/file (the most complex case):

real    0m3.831s
user    0m0.180s
sys     0m3.212s

Shared mapping covering almost all of a partially swapped out /dev/shm/file
(which needs to employ the radix tree iterator):

real    0m1.351s
user    0m0.096s
sys     0m0.768s

Same, but with the /dev/shm/file not swapped out (so no radix tree walk needed):

real    0m0.935s
user    0m0.128s
sys     0m0.344s

Private anonymous mapping:

real    0m0.949s
user    0m0.116s
sys     0m0.348s

The cost is now much closer to the private anonymous mapping case, unless the
shmem mapping is private and writable.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 fs/proc/task_mmu.c       | 22 +++++++++++++--
 include/linux/shmem_fs.h |  2 ++
 mm/shmem.c               | 70 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 92 insertions(+), 2 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 7e0c4c2..491e675 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -14,6 +14,7 @@
 #include <linux/swapops.h>
 #include <linux/mmu_notifier.h>
 #include <linux/page_idle.h>
+#include <linux/shmem_fs.h>
 
 #include <asm/elf.h>
 #include <asm/uaccess.h>
@@ -732,8 +733,25 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
 
 #ifdef CONFIG_SHMEM
 	if (vma->vm_file && shmem_mapping(vma->vm_file->f_mapping)) {
-		mss.check_shmem_swap = true;
-		smaps_walk.pte_hole = smaps_pte_hole;
+		/*
+		 * For shared or readonly shmem mappings we know that all
+		 * swapped out pages belong to the shmem object, and we can
+		 * obtain the swap value much more efficiently. For private
+		 * writable mappings, we might have COW pages that are
+		 * not affected by the parent swapped out pages of the shmem
+		 * object, so we have to distinguish them during the page walk.
+		 * Unless we know that the shmem object (or the part mapped by
+		 * our VMA) has no swapped out pages at all.
+		 */
+		unsigned long shmem_swapped = shmem_swap_usage(vma);
+
+		if (!shmem_swapped || (vma->vm_flags & VM_SHARED) ||
+					!(vma->vm_flags & VM_WRITE)) {
+			mss.swap = shmem_swapped;
+		} else {
+			mss.check_shmem_swap = true;
+			smaps_walk.pte_hole = smaps_pte_hole;
+		}
 	}
 #endif
 
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 50777b5..bd58be5 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -60,6 +60,8 @@ extern struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
 extern void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end);
 extern int shmem_unuse(swp_entry_t entry, struct page *page);
 
+extern unsigned long shmem_swap_usage(struct vm_area_struct *vma);
+
 static inline struct page *shmem_read_mapping_page(
 				struct address_space *mapping, pgoff_t index)
 {
diff --git a/mm/shmem.c b/mm/shmem.c
index 529a7d5..bc0f676 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -360,6 +360,76 @@ static int shmem_free_swap(struct address_space *mapping,
 }
 
 /*
+ * Determine (in bytes) how many of the shmem object's pages mapped by the
+ * given vma is swapped out.
+ *
+ * This is safe to call without i_mutex or mapping->tree_lock thanks to RCU,
+ * as long as the inode doesn't go away and racy results are not a problem.
+ */
+unsigned long shmem_swap_usage(struct vm_area_struct *vma)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+	struct shmem_inode_info *info = SHMEM_I(inode);
+	struct address_space *mapping = inode->i_mapping;
+	unsigned long swapped;
+	pgoff_t start, end;
+	struct radix_tree_iter iter;
+	void **slot;
+	struct page *page;
+
+	/* Be careful as we don't hold info->lock */
+	swapped = READ_ONCE(info->swapped);
+
+	/*
+	 * The easier cases are when the shmem object has nothing in swap, or
+	 * the vma maps it whole. Then we can simply use the stats that we
+	 * already track.
+	 */
+	if (!swapped)
+		return 0;
+
+	if (!vma->vm_pgoff && vma->vm_end - vma->vm_start >= inode->i_size)
+		return swapped << PAGE_SHIFT;
+
+	swapped = 0;
+
+	/* Here comes the more involved part */
+	start = linear_page_index(vma, vma->vm_start);
+	end = linear_page_index(vma, vma->vm_end);
+
+	rcu_read_lock();
+
+restart:
+	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
+		if (iter.index >= end)
+			break;
+
+		page = radix_tree_deref_slot(slot);
+
+		/*
+		 * This should only be possible to happen at index 0, so we
+		 * don't need to reset the counter, nor do we risk infinite
+		 * restarts.
+		 */
+		if (radix_tree_deref_retry(page))
+			goto restart;
+
+		if (radix_tree_exceptional_entry(page))
+			swapped++;
+
+		if (need_resched()) {
+			cond_resched_rcu();
+			start = iter.index + 1;
+			goto restart;
+		}
+	}
+
+	rcu_read_unlock();
+
+	return swapped << PAGE_SHIFT;
+}
+
+/*
  * SysV IPC SHM_UNLOCK restore Unevictable pages to their evictable lists.
  */
 void shmem_unlock_mapping(struct address_space *mapping)
-- 
2.6.3


* [PATCH v5 4/6] mm, proc: reduce cost of /proc/pid/smaps for unpopulated shmem mappings
  2015-11-18  9:29 ` Vlastimil Babka
@ 2015-11-18  9:29   ` Vlastimil Babka
  -1 siblings, 0 replies; 19+ messages in thread
From: Vlastimil Babka @ 2015-11-18  9:29 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: linux-kernel, Jerome Marchand, Hugh Dickins, Michal Hocko,
	Peter Zijlstra, Oleg Nesterov, linux-api, linux-doc,
	Vlastimil Babka, Konstantin Khlebnikov, Michal Hocko

Following the previous patch, further reduction of /proc/pid/smaps cost is
possible for private writable shmem mappings with unpopulated areas where
the page walk invokes the .pte_hole function. We can use the radix tree iterator
for each such area instead of calling find_get_entry() in a loop. This is
possible at the extra maintenance cost of introducing another shmem function
shmem_partial_swap_usage().

To demonstrate the difference, I have measured this on a process that creates a
private writable 2GB mapping of a partially swapped out /dev/shm/file (which
cannot employ the optimizations from the previous patch) and doesn't populate
it at all. I timed how long it takes to cat /proc/pid/smaps of this process
100 times.

Before this patch:

real    0m3.831s
user    0m0.180s
sys     0m3.212s

After this patch:

real    0m1.176s
user    0m0.180s
sys     0m0.684s

The time is similar to the case where the radix tree iterator is employed on
the whole mapping.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 fs/proc/task_mmu.c       | 42 ++++++++++---------------------
 include/linux/shmem_fs.h |  2 ++
 mm/shmem.c               | 65 ++++++++++++++++++++++++++++--------------------
 3 files changed, 53 insertions(+), 56 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 491e675..a0338ec 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -503,42 +503,16 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
 }
 
 #ifdef CONFIG_SHMEM
-static unsigned long smaps_shmem_swap(struct vm_area_struct *vma,
-		unsigned long addr)
-{
-	struct page *page;
-
-	page = find_get_entry(vma->vm_file->f_mapping,
-					linear_page_index(vma, addr));
-	if (!page)
-		return 0;
-
-	if (radix_tree_exceptional_entry(page))
-		return PAGE_SIZE;
-
-	page_cache_release(page);
-	return 0;
-
-}
-
 static int smaps_pte_hole(unsigned long addr, unsigned long end,
 		struct mm_walk *walk)
 {
 	struct mem_size_stats *mss = walk->private;
 
-	while (addr < end) {
-		mss->swap += smaps_shmem_swap(walk->vma, addr);
-		addr += PAGE_SIZE;
-	}
+	mss->swap += shmem_partial_swap_usage(
+			walk->vma->vm_file->f_mapping, addr, end);
 
 	return 0;
 }
-#else
-static unsigned long smaps_shmem_swap(struct vm_area_struct *vma,
-		unsigned long addr)
-{
-	return 0;
-}
 #endif
 
 static void smaps_pte_entry(pte_t *pte, unsigned long addr,
@@ -570,7 +544,17 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 			page = migration_entry_to_page(swpent);
 	} else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
 							&& pte_none(*pte))) {
-		mss->swap += smaps_shmem_swap(vma, addr);
+		page = find_get_entry(vma->vm_file->f_mapping,
+						linear_page_index(vma, addr));
+		if (!page)
+			return;
+
+		if (radix_tree_exceptional_entry(page))
+			mss->swap += PAGE_SIZE;
+		else
+			page_cache_release(page);
+
+		return;
 	}
 
 	if (!page)
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index bd58be5..a43f41c 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -61,6 +61,8 @@ extern void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end);
 extern int shmem_unuse(swp_entry_t entry, struct page *page);
 
 extern unsigned long shmem_swap_usage(struct vm_area_struct *vma);
+extern unsigned long shmem_partial_swap_usage(struct address_space *mapping,
+						pgoff_t start, pgoff_t end);
 
 static inline struct page *shmem_read_mapping_page(
 				struct address_space *mapping, pgoff_t index)
diff --git a/mm/shmem.c b/mm/shmem.c
index bc0f676..8689a58 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -361,41 +361,18 @@ static int shmem_free_swap(struct address_space *mapping,
 
 /*
  * Determine (in bytes) how many of the shmem object's pages mapped by the
- * given vma is swapped out.
+ * given offsets are swapped out.
  *
  * This is safe to call without i_mutex or mapping->tree_lock thanks to RCU,
  * as long as the inode doesn't go away and racy results are not a problem.
  */
-unsigned long shmem_swap_usage(struct vm_area_struct *vma)
+unsigned long shmem_partial_swap_usage(struct address_space *mapping,
+						pgoff_t start, pgoff_t end)
 {
-	struct inode *inode = file_inode(vma->vm_file);
-	struct shmem_inode_info *info = SHMEM_I(inode);
-	struct address_space *mapping = inode->i_mapping;
-	unsigned long swapped;
-	pgoff_t start, end;
 	struct radix_tree_iter iter;
 	void **slot;
 	struct page *page;
-
-	/* Be careful as we don't hold info->lock */
-	swapped = READ_ONCE(info->swapped);
-
-	/*
-	 * The easier cases are when the shmem object has nothing in swap, or
-	 * the vma maps it whole. Then we can simply use the stats that we
-	 * already track.
-	 */
-	if (!swapped)
-		return 0;
-
-	if (!vma->vm_pgoff && vma->vm_end - vma->vm_start >= inode->i_size)
-		return swapped << PAGE_SHIFT;
-
-	swapped = 0;
-
-	/* Here comes the more involved part */
-	start = linear_page_index(vma, vma->vm_start);
-	end = linear_page_index(vma, vma->vm_end);
+	unsigned long swapped = 0;
 
 	rcu_read_lock();
 
@@ -430,6 +407,40 @@ unsigned long shmem_swap_usage(struct vm_area_struct *vma)
 }
 
 /*
+ * Determine (in bytes) how many of the shmem object's pages mapped by the
+ * given vma is swapped out.
+ *
+ * This is safe to call without i_mutex or mapping->tree_lock thanks to RCU,
+ * as long as the inode doesn't go away and racy results are not a problem.
+ */
+unsigned long shmem_swap_usage(struct vm_area_struct *vma)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+	struct shmem_inode_info *info = SHMEM_I(inode);
+	struct address_space *mapping = inode->i_mapping;
+	unsigned long swapped;
+
+	/* Be careful as we don't hold info->lock */
+	swapped = READ_ONCE(info->swapped);
+
+	/*
+	 * The easier cases are when the shmem object has nothing in swap, or
+	 * the vma maps it whole. Then we can simply use the stats that we
+	 * already track.
+	 */
+	if (!swapped)
+		return 0;
+
+	if (!vma->vm_pgoff && vma->vm_end - vma->vm_start >= inode->i_size)
+		return swapped << PAGE_SHIFT;
+
+	/* Here comes the more involved part */
+	return shmem_partial_swap_usage(mapping,
+			linear_page_index(vma, vma->vm_start),
+			linear_page_index(vma, vma->vm_end));
+}
+
+/*
  * SysV IPC SHM_UNLOCK restore Unevictable pages to their evictable lists.
  */
 void shmem_unlock_mapping(struct address_space *mapping)
-- 
2.6.3


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v5 4/6] mm, proc: reduce cost of /proc/pid/smaps for unpopulated shmem mappings
@ 2015-11-18  9:29   ` Vlastimil Babka
  0 siblings, 0 replies; 19+ messages in thread
From: Vlastimil Babka @ 2015-11-18  9:29 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: linux-kernel, Jerome Marchand, Hugh Dickins, Michal Hocko,
	Peter Zijlstra, Oleg Nesterov, linux-api, linux-doc,
	Vlastimil Babka, Konstantin Khlebnikov, Michal Hocko

Following the previous patch, further reduction of /proc/pid/smaps cost is
possible for private writable shmem mappings with unpopulated areas where
the page walk invokes the .pte_hole function. We can use radix tree iterator
for each such area instead of calling find_get_entry() in a loop. This is
possible at the extra maintenance cost of introducing another shmem function
shmem_partial_swap_usage().

To demonstrate the diference, I have measured this on a process that creates a
private writable 2GB mapping of a partially swapped out /dev/shm/file (which
cannot employ the optimizations from the prvious patch) and doesn't populate it
at all. I time how long does it take to cat /proc/pid/smaps of this process 100
times.

Before this patch:

real    0m3.831s
user    0m0.180s
sys     0m3.212s

After this patch:

real    0m1.176s
user    0m0.180s
sys     0m0.684s

The time is similar to case where radix tree iterator is employed on the whole
mapping.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 fs/proc/task_mmu.c       | 42 ++++++++++---------------------
 include/linux/shmem_fs.h |  2 ++
 mm/shmem.c               | 65 ++++++++++++++++++++++++++++--------------------
 3 files changed, 53 insertions(+), 56 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 491e675..a0338ec 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -503,42 +503,16 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
 }
 
 #ifdef CONFIG_SHMEM
-static unsigned long smaps_shmem_swap(struct vm_area_struct *vma,
-		unsigned long addr)
-{
-	struct page *page;
-
-	page = find_get_entry(vma->vm_file->f_mapping,
-					linear_page_index(vma, addr));
-	if (!page)
-		return 0;
-
-	if (radix_tree_exceptional_entry(page))
-		return PAGE_SIZE;
-
-	page_cache_release(page);
-	return 0;
-
-}
-
 static int smaps_pte_hole(unsigned long addr, unsigned long end,
 		struct mm_walk *walk)
 {
 	struct mem_size_stats *mss = walk->private;
 
-	while (addr < end) {
-		mss->swap += smaps_shmem_swap(walk->vma, addr);
-		addr += PAGE_SIZE;
-	}
+	mss->swap += shmem_partial_swap_usage(
+			walk->vma->vm_file->f_mapping, addr, end);
 
 	return 0;
 }
-#else
-static unsigned long smaps_shmem_swap(struct vm_area_struct *vma,
-		unsigned long addr)
-{
-	return 0;
-}
 #endif
 
 static void smaps_pte_entry(pte_t *pte, unsigned long addr,
@@ -570,7 +544,17 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 			page = migration_entry_to_page(swpent);
 	} else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
 							&& pte_none(*pte))) {
-		mss->swap += smaps_shmem_swap(vma, addr);
+		page = find_get_entry(vma->vm_file->f_mapping,
+						linear_page_index(vma, addr));
+		if (!page)
+			return;
+
+		if (radix_tree_exceptional_entry(page))
+			mss->swap += PAGE_SIZE;
+		else
+			page_cache_release(page);
+
+		return;
 	}
 
 	if (!page)
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index bd58be5..a43f41c 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -61,6 +61,8 @@ extern void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end);
 extern int shmem_unuse(swp_entry_t entry, struct page *page);
 
 extern unsigned long shmem_swap_usage(struct vm_area_struct *vma);
+extern unsigned long shmem_partial_swap_usage(struct address_space *mapping,
+						pgoff_t start, pgoff_t end);
 
 static inline struct page *shmem_read_mapping_page(
 				struct address_space *mapping, pgoff_t index)
diff --git a/mm/shmem.c b/mm/shmem.c
index bc0f676..8689a58 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -361,41 +361,18 @@ static int shmem_free_swap(struct address_space *mapping,
 
 /*
  * Determine (in bytes) how many of the shmem object's pages mapped by the
- * given vma is swapped out.
+ * given offsets are swapped out.
  *
  * This is safe to call without i_mutex or mapping->tree_lock thanks to RCU,
  * as long as the inode doesn't go away and racy results are not a problem.
  */
-unsigned long shmem_swap_usage(struct vm_area_struct *vma)
+unsigned long shmem_partial_swap_usage(struct address_space *mapping,
+						pgoff_t start, pgoff_t end)
 {
-	struct inode *inode = file_inode(vma->vm_file);
-	struct shmem_inode_info *info = SHMEM_I(inode);
-	struct address_space *mapping = inode->i_mapping;
-	unsigned long swapped;
-	pgoff_t start, end;
 	struct radix_tree_iter iter;
 	void **slot;
 	struct page *page;
-
-	/* Be careful as we don't hold info->lock */
-	swapped = READ_ONCE(info->swapped);
-
-	/*
-	 * The easier cases are when the shmem object has nothing in swap, or
-	 * the vma maps it whole. Then we can simply use the stats that we
-	 * already track.
-	 */
-	if (!swapped)
-		return 0;
-
-	if (!vma->vm_pgoff && vma->vm_end - vma->vm_start >= inode->i_size)
-		return swapped << PAGE_SHIFT;
-
-	swapped = 0;
-
-	/* Here comes the more involved part */
-	start = linear_page_index(vma, vma->vm_start);
-	end = linear_page_index(vma, vma->vm_end);
+	unsigned long swapped = 0;
 
 	rcu_read_lock();
 
@@ -430,6 +407,40 @@ unsigned long shmem_swap_usage(struct vm_area_struct *vma)
 }
 
 /*
+ * Determine (in bytes) how many of the shmem object's pages mapped by the
+ * given vma is swapped out.
+ *
+ * This is safe to call without i_mutex or mapping->tree_lock thanks to RCU,
+ * as long as the inode doesn't go away and racy results are not a problem.
+ */
+unsigned long shmem_swap_usage(struct vm_area_struct *vma)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+	struct shmem_inode_info *info = SHMEM_I(inode);
+	struct address_space *mapping = inode->i_mapping;
+	unsigned long swapped;
+
+	/* Be careful as we don't hold info->lock */
+	swapped = READ_ONCE(info->swapped);
+
+	/*
+	 * The easier cases are when the shmem object has nothing in swap, or
+	 * the vma maps it whole. Then we can simply use the stats that we
+	 * already track.
+	 */
+	if (!swapped)
+		return 0;
+
+	if (!vma->vm_pgoff && vma->vm_end - vma->vm_start >= inode->i_size)
+		return swapped << PAGE_SHIFT;
+
+	/* Here comes the more involved part */
+	return shmem_partial_swap_usage(mapping,
+			linear_page_index(vma, vma->vm_start),
+			linear_page_index(vma, vma->vm_end));
+}
+
+/*
  * SysV IPC SHM_UNLOCK restore Unevictable pages to their evictable lists.
  */
 void shmem_unlock_mapping(struct address_space *mapping)
-- 
2.6.3


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v5 5/6] mm, shmem: add internal shmem resident memory accounting
  2015-11-18  9:29 ` Vlastimil Babka
@ 2015-11-18  9:29   ` Vlastimil Babka
  -1 siblings, 0 replies; 19+ messages in thread
From: Vlastimil Babka @ 2015-11-18  9:29 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: linux-kernel, Jerome Marchand, Hugh Dickins, Michal Hocko,
	Peter Zijlstra, Oleg Nesterov, linux-api, linux-doc,
	Vlastimil Babka, Konstantin Khlebnikov, Michal Hocko

From: Jerome Marchand <jmarchan@redhat.com>

Currently, looking at /proc/<pid>/status or statm, there is no way to
distinguish shmem pages from pages mapped to a regular file (shmem pages are
mapped to /dev/zero), even though their implications for actual memory use are
quite different.

The internal accounting currently counts shmem pages together with regular
files. As a preparation for extending the userspace interfaces, this patch adds
an MM_SHMEMPAGES counter to mm_rss_stat to account for shmem pages separately
from MM_FILEPAGES. The next patch will expose it to userspace; this patch
doesn't change the exported values yet, as it adds MM_SHMEMPAGES to
MM_FILEPAGES at the places where MM_FILEPAGES was used before. The only
user-visible change after this patch is the OOM killer message, which now
reports "shmem-rss" separately from "file-rss".

[vbabka@suse.cz: forward-porting, tweak changelog]
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hugh Dickins <hughd@google.com>
---
 arch/s390/mm/pgtable.c   |  5 +----
 fs/proc/task_mmu.c       |  3 ++-
 include/linux/mm.h       | 18 +++++++++++++++++-
 include/linux/mm_types.h |  7 ++++---
 kernel/events/uprobes.c  |  2 +-
 mm/memory.c              | 30 ++++++++++--------------------
 mm/oom_kill.c            |  5 +++--
 mm/rmap.c                | 12 +++---------
 8 files changed, 41 insertions(+), 41 deletions(-)

diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 34f3790..43b9a48 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -603,10 +603,7 @@ static void gmap_zap_swap_entry(swp_entry_t entry, struct mm_struct *mm)
 	else if (is_migration_entry(entry)) {
 		struct page *page = migration_entry_to_page(entry);
 
-		if (PageAnon(page))
-			dec_mm_counter(mm, MM_ANONPAGES);
-		else
-			dec_mm_counter(mm, MM_FILEPAGES);
+		dec_mm_counter(mm, mm_counter(page));
 	}
 	free_swap_and_cache(entry);
 }
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index a0338ec..19d190b 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -83,7 +83,8 @@ unsigned long task_statm(struct mm_struct *mm,
 			 unsigned long *shared, unsigned long *text,
 			 unsigned long *data, unsigned long *resident)
 {
-	*shared = get_mm_counter(mm, MM_FILEPAGES);
+	*shared = get_mm_counter(mm, MM_FILEPAGES) +
+			get_mm_counter(mm, MM_SHMEMPAGES);
 	*text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
 								>> PAGE_SHIFT;
 	*data = mm->total_vm - mm->shared_vm;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fc9a3b8..25cdec3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1342,10 +1342,26 @@ static inline void dec_mm_counter(struct mm_struct *mm, int member)
 	atomic_long_dec(&mm->rss_stat.count[member]);
 }
 
+/* Optimized variant when page is already known not to be PageAnon */
+static inline int mm_counter_file(struct page *page)
+{
+	if (PageSwapBacked(page))
+		return MM_SHMEMPAGES;
+	return MM_FILEPAGES;
+}
+
+static inline int mm_counter(struct page *page)
+{
+	if (PageAnon(page))
+		return MM_ANONPAGES;
+	return mm_counter_file(page);
+}
+
 static inline unsigned long get_mm_rss(struct mm_struct *mm)
 {
 	return get_mm_counter(mm, MM_FILEPAGES) +
-		get_mm_counter(mm, MM_ANONPAGES);
+		get_mm_counter(mm, MM_ANONPAGES) +
+		get_mm_counter(mm, MM_SHMEMPAGES);
 }
 
 static inline unsigned long get_mm_hiwater_rss(struct mm_struct *mm)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5e37e91..22beb22 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -361,9 +361,10 @@ struct core_state {
 };
 
 enum {
-	MM_FILEPAGES,
-	MM_ANONPAGES,
-	MM_SWAPENTS,
+	MM_FILEPAGES,	/* Resident file mapping pages */
+	MM_ANONPAGES,	/* Resident anonymous pages */
+	MM_SWAPENTS,	/* Anonymous swap entries */
+	MM_SHMEMPAGES,	/* Resident shared memory pages */
 	NR_MM_COUNTERS
 };
 
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 5137399..373a3ab 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -181,7 +181,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	lru_cache_add_active_or_unevictable(kpage, vma);
 
 	if (!PageAnon(page)) {
-		dec_mm_counter(mm, MM_FILEPAGES);
+		dec_mm_counter(mm, mm_counter_file(page));
 		inc_mm_counter(mm, MM_ANONPAGES);
 	}
 
diff --git a/mm/memory.c b/mm/memory.c
index 7f3b9f2..f5b8e8c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -826,10 +826,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		} else if (is_migration_entry(entry)) {
 			page = migration_entry_to_page(entry);
 
-			if (PageAnon(page))
-				rss[MM_ANONPAGES]++;
-			else
-				rss[MM_FILEPAGES]++;
+			rss[mm_counter(page)]++;
 
 			if (is_write_migration_entry(entry) &&
 					is_cow_mapping(vm_flags)) {
@@ -868,10 +865,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	if (page) {
 		get_page(page);
 		page_dup_rmap(page, false);
-		if (PageAnon(page))
-			rss[MM_ANONPAGES]++;
-		else
-			rss[MM_FILEPAGES]++;
+		rss[mm_counter(page)]++;
 	}
 
 out_set_pte:
@@ -1107,9 +1101,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			tlb_remove_tlb_entry(tlb, pte, addr);
 			if (unlikely(!page))
 				continue;
-			if (PageAnon(page))
-				rss[MM_ANONPAGES]--;
-			else {
+
+			if (!PageAnon(page)) {
 				if (pte_dirty(ptent)) {
 					force_flush = 1;
 					set_page_dirty(page);
@@ -1117,8 +1110,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				if (pte_young(ptent) &&
 				    likely(!(vma->vm_flags & VM_SEQ_READ)))
 					mark_page_accessed(page);
-				rss[MM_FILEPAGES]--;
 			}
+			rss[mm_counter(page)]--;
 			page_remove_rmap(page, false);
 			if (unlikely(page_mapcount(page) < 0))
 				print_bad_pte(vma, addr, ptent, page);
@@ -1140,11 +1133,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			struct page *page;
 
 			page = migration_entry_to_page(entry);
-
-			if (PageAnon(page))
-				rss[MM_ANONPAGES]--;
-			else
-				rss[MM_FILEPAGES]--;
+			rss[mm_counter(page)]--;
 		}
 		if (unlikely(!free_swap_and_cache(entry)))
 			print_bad_pte(vma, addr, ptent, NULL);
@@ -1454,7 +1443,7 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
 
 	/* Ok, finally just insert the thing.. */
 	get_page(page);
-	inc_mm_counter_fast(mm, MM_FILEPAGES);
+	inc_mm_counter_fast(mm, mm_counter_file(page));
 	page_add_file_rmap(page);
 	set_pte_at(mm, addr, pte, mk_pte(page, prot));
 
@@ -2091,7 +2080,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (likely(pte_same(*page_table, orig_pte))) {
 		if (old_page) {
 			if (!PageAnon(old_page)) {
-				dec_mm_counter_fast(mm, MM_FILEPAGES);
+				dec_mm_counter_fast(mm,
+						mm_counter_file(old_page));
 				inc_mm_counter_fast(mm, MM_ANONPAGES);
 			}
 		} else {
@@ -2815,7 +2805,7 @@ void do_set_pte(struct vm_area_struct *vma, unsigned long address,
 		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
 		page_add_new_anon_rmap(page, vma, address, false);
 	} else {
-		inc_mm_counter_fast(vma->vm_mm, MM_FILEPAGES);
+		inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
 		page_add_file_rmap(page);
 	}
 	set_pte_at(vma->vm_mm, address, pte, entry);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d13a339..5314b20 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -585,10 +585,11 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	 */
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
 	mark_oom_victim(victim);
-	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
+	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
 		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
-		K(get_mm_counter(victim->mm, MM_FILEPAGES)));
+		K(get_mm_counter(victim->mm, MM_FILEPAGES)),
+		K(get_mm_counter(victim->mm, MM_SHMEMPAGES)));
 	task_unlock(victim);
 
 	/*
diff --git a/mm/rmap.c b/mm/rmap.c
index e045fea..2d87940 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1450,10 +1450,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		if (PageHuge(page)) {
 			hugetlb_count_sub(1 << compound_order(page), mm);
 		} else {
-			if (PageAnon(page))
-				dec_mm_counter(mm, MM_ANONPAGES);
-			else
-				dec_mm_counter(mm, MM_FILEPAGES);
+			dec_mm_counter(mm, mm_counter(page));
 		}
 		set_pte_at(mm, address, pte,
 			   swp_entry_to_pte(make_hwpoison_entry(page)));
@@ -1463,10 +1460,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		 * interest anymore. Simply discard the pte, vmscan
 		 * will take care of the rest.
 		 */
-		if (PageAnon(page))
-			dec_mm_counter(mm, MM_ANONPAGES);
-		else
-			dec_mm_counter(mm, MM_FILEPAGES);
+		dec_mm_counter(mm, mm_counter(page));
 	} else if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION)) {
 		swp_entry_t entry;
 		pte_t swp_pte;
@@ -1506,7 +1500,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			swp_pte = pte_swp_mksoft_dirty(swp_pte);
 		set_pte_at(mm, address, pte, swp_pte);
 	} else
-		dec_mm_counter(mm, MM_FILEPAGES);
+		dec_mm_counter(mm, mm_counter_file(page));
 
 	page_remove_rmap(page, PageHuge(page));
 	page_cache_release(page);
-- 
2.6.3


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v5 6/6] mm, procfs: breakdown RSS for anon, shmem and file in /proc/pid/status
  2015-11-18  9:29 ` Vlastimil Babka
@ 2015-11-18  9:29   ` Vlastimil Babka
  -1 siblings, 0 replies; 19+ messages in thread
From: Vlastimil Babka @ 2015-11-18  9:29 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: linux-kernel, Jerome Marchand, Hugh Dickins, Michal Hocko,
	Peter Zijlstra, Oleg Nesterov, linux-api, linux-doc,
	Vlastimil Babka, Konstantin Khlebnikov, Michal Hocko

From: Jerome Marchand <jmarchan@redhat.com>

There are several shortcomings with the accounting of shared memory (SysV shm,
shared anonymous mappings, mappings of a tmpfs file). The values in
/proc/<pid>/status and <...>/statm don't allow one to distinguish between shmem
memory and a shared mapping to a regular file, even though their implications
for memory usage are quite different: during reclaim, a file mapping can be
dropped or written back to disk, while shmem needs a place in swap.

Also, to distinguish the memory occupied by anonymous and file mappings, one
has to read the /proc/pid/statm file, which has a field for the file mappings
(again, including shmem) and for the total memory occupied by these mappings
(i.e. equivalent to VmRSS in the <...>/status file). Getting the value for
anonymous mappings only is thus not exactly user-friendly (the statm file is
intended to be rather efficiently machine-readable).

To address both of these shortcomings, this patch adds a breakdown of VmRSS in
/proc/<pid>/status via the new fields RssAnon, RssFile and RssShmem, making use
of the previous preparatory patch. These fields tell the user the memory
occupied by private anonymous pages, mapped regular files and shmem,
respectively. Other existing fields in the /status and /statm files are left
unchanged. The /statm file can be extended in the future, if there's a need
for that.

Example (part of) /proc/pid/status output including the new Rss* fields:

VmPeak:  2001008 kB
VmSize:  2001004 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:      5108 kB
VmRSS:      5108 kB
RssAnon:              92 kB
RssFile:            1324 kB
RssShmem:           3692 kB
VmData:      192 kB
VmStk:       136 kB
VmExe:         4 kB
VmLib:      1784 kB
VmPTE:      3928 kB
VmPMD:        20 kB
VmSwap:        0 kB
HugetlbPages:          0 kB
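
As a usage illustration (not part of the patch), a consumer could read the new
fields back and check the documented invariant VmRSS = RssAnon + RssFile +
RssShmem; a rough sketch (the counters are sampled racily, so small transient
differences are possible):

	#include <stdio.h>

	int main(void)
	{
		char line[256];
		unsigned long rss = 0, anon = 0, file = 0, shmem = 0;
		FILE *f = fopen("/proc/self/status", "r");

		if (!f)
			return 1;
		while (fgets(line, sizeof(line), f)) {
			sscanf(line, "VmRSS: %lu kB", &rss);
			sscanf(line, "RssAnon: %lu kB", &anon);
			sscanf(line, "RssFile: %lu kB", &file);
			sscanf(line, "RssShmem: %lu kB", &shmem);
		}
		fclose(f);
		printf("VmRSS %lu kB, sum of Rss* %lu kB\n",
		       rss, anon + file + shmem);
		return 0;
	}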

[vbabka@suse.cz: forward-porting, tweak changelog]
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hugh Dickins <hughd@google.com>
---
 Documentation/filesystems/proc.txt | 13 +++++++++++--
 fs/proc/task_mmu.c                 | 14 ++++++++++++--
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index fdeb5b3..ffcd495 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -169,6 +169,9 @@ For example, to get the status information of a process, all you have to do is
   VmLck:         0 kB
   VmHWM:       476 kB
   VmRSS:       476 kB
+  RssAnon:             352 kB
+  RssFile:             120 kB
+  RssShmem:              4 kB
   VmData:      156 kB
   VmStk:        88 kB
   VmExe:        68 kB
@@ -231,7 +234,12 @@ Table 1-2: Contents of the status files (as of 4.1)
  VmSize                      total program size
  VmLck                       locked memory size
  VmHWM                       peak resident set size ("high water mark")
- VmRSS                       size of memory portions
+ VmRSS                       size of memory portions. It contains the three
+                             following parts (VmRSS = RssAnon + RssFile + RssShmem)
+ RssAnon                     size of resident anonymous memory
+ RssFile                     size of resident file mappings
+ RssShmem                    size of resident shmem memory (includes SysV shm,
+                             mapping of tmpfs and shared anonymous mappings)
  VmData                      size of data, stack, and text segments
  VmStk                       size of data, stack, and text segments
  VmExe                       size of text segment
@@ -266,7 +274,8 @@ Table 1-3: Contents of the statm files (as of 2.6.8-rc3)
  Field    Content
  size     total program size (pages)		(same as VmSize in status)
  resident size of memory portions (pages)	(same as VmRSS in status)
- shared   number of pages that are shared	(i.e. backed by a file)
+ shared   number of pages that are shared	(i.e. backed by a file, same
+						as RssFile+RssShmem in status)
  trs      number of pages that are 'code'	(not including libs; broken,
 							includes data segment)
  lrs      number of pages of library		(always 0 on 2.6)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 19d190b..67aaaad 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -23,9 +23,13 @@
 
 void task_mem(struct seq_file *m, struct mm_struct *mm)
 {
-	unsigned long data, text, lib, swap, ptes, pmds;
+	unsigned long data, text, lib, swap, ptes, pmds, anon, file, shmem;
 	unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
 
+	anon = get_mm_counter(mm, MM_ANONPAGES);
+	file = get_mm_counter(mm, MM_FILEPAGES);
+	shmem = get_mm_counter(mm, MM_SHMEMPAGES);
+
 	/*
 	 * Note: to minimize their overhead, mm maintains hiwater_vm and
 	 * hiwater_rss only when about to *lower* total_vm or rss.  Any
@@ -36,7 +40,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 	hiwater_vm = total_vm = mm->total_vm;
 	if (hiwater_vm < mm->hiwater_vm)
 		hiwater_vm = mm->hiwater_vm;
-	hiwater_rss = total_rss = get_mm_rss(mm);
+	hiwater_rss = total_rss = anon + file + shmem;
 	if (hiwater_rss < mm->hiwater_rss)
 		hiwater_rss = mm->hiwater_rss;
 
@@ -53,6 +57,9 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 		"VmPin:\t%8lu kB\n"
 		"VmHWM:\t%8lu kB\n"
 		"VmRSS:\t%8lu kB\n"
+		"RssAnon:\t%8lu kB\n"
+		"RssFile:\t%8lu kB\n"
+		"RssShmem:\t%8lu kB\n"
 		"VmData:\t%8lu kB\n"
 		"VmStk:\t%8lu kB\n"
 		"VmExe:\t%8lu kB\n"
@@ -66,6 +73,9 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 		mm->pinned_vm << (PAGE_SHIFT-10),
 		hiwater_rss << (PAGE_SHIFT-10),
 		total_rss << (PAGE_SHIFT-10),
+		anon << (PAGE_SHIFT-10),
+		file << (PAGE_SHIFT-10),
+		shmem << (PAGE_SHIFT-10),
 		data << (PAGE_SHIFT-10),
 		mm->stack_vm << (PAGE_SHIFT-10), text, lib,
 		ptes >> 10,
-- 
2.6.3


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 3/6] mm, proc: reduce cost of /proc/pid/smaps for shmem mappings
  2015-11-18  9:29   ` Vlastimil Babka
@ 2015-11-19 10:04     ` Michal Hocko
  -1 siblings, 0 replies; 19+ messages in thread
From: Michal Hocko @ 2015-11-19 10:04 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, linux-mm, linux-kernel, Jerome Marchand,
	Hugh Dickins, Peter Zijlstra, Oleg Nesterov, linux-api,
	linux-doc, Konstantin Khlebnikov

On Wed 18-11-15 10:29:33, Vlastimil Babka wrote:
> The previous patch has improved swap accounting for shmem mapping, which
> however made /proc/pid/smaps more expensive for shmem mappings, as we consult
> the radix tree for each pte_none entry, so the overall complexity is
> O(n*log(n)).
> 
> We can reduce this significantly for mappings that cannot contain COWed pages,
> because then we can either use the statistics that the shmem object itself tracks
> (if the mapping contains the whole object, or the swap usage of the whole
> object is zero), or use the radix tree iterator, which is much more effective
> than repeated find_get_entry() calls.
> 
> This patch therefore introduces a function shmem_swap_usage(vma) and makes
> /proc/pid/smaps use it when possible. Only for writable private mappings of
> shmem objects (i.e. tmpfs files) with the shmem object itself (partially)
> swapped out do we have to resort to the find_get_entry() approach. Hopefully
> such mappings are relatively uncommon.
> 
> To demonstrate the difference, I have measured this on a process that creates
> a 2GB mapping and dirties single pages with a stride of 2MB, and time how long
> it takes to cat /proc/pid/smaps of this process 100 times.
> 
> Private writable mapping of a /dev/shm/file (the most complex case):
> 
> real    0m3.831s
> user    0m0.180s
> sys     0m3.212s
> 
> Shared mapping of an almost full mapping of a partially swapped /dev/shm/file
> (which needs to employ the radix tree iterator).
> 
> real    0m1.351s
> user    0m0.096s
> sys     0m0.768s
> 
> Same, but with /dev/shm/file not swapped (so no radix tree walk needed)
> 
> real    0m0.935s
> user    0m0.128s
> sys     0m0.344s
> 
> Private anonymous mapping:
> 
> real    0m0.949s
> user    0m0.116s
> sys     0m0.348s
> 
> The cost is now much closer to the private anonymous mapping case, unless the
> shmem mapping is private and writable.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Looks good to me
Acked-by: Michal Hocko <mhocko@suse.com>
[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 4/6] mm, proc: reduce cost of /proc/pid/smaps for unpopulated shmem mappings
  2015-11-18  9:29   ` Vlastimil Babka
@ 2015-11-19 10:13     ` Michal Hocko
  -1 siblings, 0 replies; 19+ messages in thread
From: Michal Hocko @ 2015-11-19 10:13 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, linux-mm, linux-kernel, Jerome Marchand,
	Hugh Dickins, Peter Zijlstra, Oleg Nesterov, linux-api,
	linux-doc, Konstantin Khlebnikov

On Wed 18-11-15 10:29:34, Vlastimil Babka wrote:
> Following the previous patch, further reduction of /proc/pid/smaps cost is
> possible for private writable shmem mappings with unpopulated areas where
> the page walk invokes the .pte_hole function. We can use radix tree iterator
> for each such area instead of calling find_get_entry() in a loop. This is
> possible at the extra maintenance cost of introducing another shmem function
> shmem_partial_swap_usage().
> 
> To demonstrate the difference, I have measured this on a process that creates a
> private writable 2GB mapping of a partially swapped out /dev/shm/file (which
> cannot employ the optimizations from the previous patch) and doesn't populate it
> at all. I time how long it takes to cat /proc/pid/smaps of this process 100
> times.
> 
> Before this patch:
> 
> real    0m3.831s
> user    0m0.180s
> sys     0m3.212s
> 
> After this patch:
> 
> real    0m1.176s
> user    0m0.180s
> sys     0m0.684s
> 
> The time is similar to the case where the radix tree iterator is employed on
> the whole mapping.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Looks good as well.
Acked-by: Michal Hocko <mhocko@suse.com>
[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2015-11-19 10:13 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-18  9:29 [PATCH v5 0/6] enhance shmem process and swap accounting Vlastimil Babka
2015-11-18  9:29 ` Vlastimil Babka
2015-11-18  9:29 ` [PATCH v5 1/6] mm, documentation: clarify /proc/pid/status VmSwap limitations for shmem Vlastimil Babka
2015-11-18  9:29   ` Vlastimil Babka
2015-11-18  9:29   ` Vlastimil Babka
2015-11-18  9:29 ` [PATCH v5 2/6] mm, proc: account for shmem swap in /proc/pid/smaps Vlastimil Babka
2015-11-18  9:29   ` Vlastimil Babka
2015-11-18  9:29 ` [PATCH v5 3/6] mm, proc: reduce cost of /proc/pid/smaps for shmem mappings Vlastimil Babka
2015-11-18  9:29   ` Vlastimil Babka
2015-11-19 10:04   ` Michal Hocko
2015-11-19 10:04     ` Michal Hocko
2015-11-18  9:29 ` [PATCH v5 4/6] mm, proc: reduce cost of /proc/pid/smaps for unpopulated " Vlastimil Babka
2015-11-18  9:29   ` Vlastimil Babka
2015-11-19 10:13   ` Michal Hocko
2015-11-19 10:13     ` Michal Hocko
2015-11-18  9:29 ` [PATCH v5 5/6] mm, shmem: add internal shmem resident memory accounting Vlastimil Babka
2015-11-18  9:29   ` Vlastimil Babka
2015-11-18  9:29 ` [PATCH v5 6/6] mm, procfs: breakdown RSS for anon, shmem and file in /proc/pid/status Vlastimil Babka
2015-11-18  9:29   ` Vlastimil Babka
