Linux-man Archive on lore.kernel.org
* [PATCH] proc.5: Document inaccurate RSS due to SPLIT_RSS_COUNTING
@ 2020-10-12 11:49 Jann Horn
  2020-10-12 14:52 ` Jann Horn
  2020-10-12 15:07 ` Michal Hocko
  0 siblings, 2 replies; 5+ messages in thread
From: Jann Horn @ 2020-10-12 11:49 UTC
  To: mtk.manpages; +Cc: linux-man, linux-mm, Mark Mossberg

Since 34e55232e59f7b19050267a05ff1226e5cd122a5 (introduced back in
v2.6.34), Linux uses per-thread RSS counters to reduce cache contention on
the per-mm counters. With a 4K page size, that means that you can end up
with the counters off by up to 252KiB per thread.

Example:

$ cat rsstest.c
#include <stdlib.h>
#include <err.h>
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/eventfd.h>
#include <sys/prctl.h>
void dump(int pid) {
  char cmd[1000];
  sprintf(cmd,
    "grep '^VmRSS' /proc/%d/status;"
    "grep '^Rss:' /proc/%d/smaps_rollup;"
    "echo",
    pid, pid
  );
  system(cmd);
}
int main(void) {
  eventfd_t dummy;
  int child_wait = eventfd(0, EFD_SEMAPHORE|EFD_CLOEXEC);
  int child_resume = eventfd(0, EFD_SEMAPHORE|EFD_CLOEXEC);
  if (child_wait == -1 || child_resume == -1) err(1, "eventfd");
  pid_t child = fork();
  if (child == -1) err(1, "fork");
  if (child == 0) {
    if (prctl(PR_SET_PDEATHSIG, SIGKILL)) err(1, "PDEATHSIG");
    if (getppid() == 1) exit(0);
    char *mapping = mmap(NULL, 80 * 0x1000, PROT_READ|PROT_WRITE,
                         MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    if (mapping == MAP_FAILED) err(1, "mmap");
    eventfd_write(child_wait, 1);
    eventfd_read(child_resume, &dummy);
    for (int i=0; i<40; i++) mapping[0x1000 * i] = 1;
    eventfd_write(child_wait, 1);
    eventfd_read(child_resume, &dummy);
    for (int i=40; i<80; i++) mapping[0x1000 * i] = 1;
    eventfd_write(child_wait, 1);
    eventfd_read(child_resume, &dummy);
    exit(0);
  }

  eventfd_read(child_wait, &dummy);
  dump(child);
  eventfd_write(child_resume, 1);

  eventfd_read(child_wait, &dummy);
  dump(child);
  eventfd_write(child_resume, 1);

  eventfd_read(child_wait, &dummy);
  dump(child);
  eventfd_write(child_resume, 1);

  exit(0);
}
$ gcc -o rsstest rsstest.c && ./rsstest
VmRSS:	      68 kB
Rss:                 616 kB

VmRSS:	      68 kB
Rss:                 776 kB

VmRSS:	     812 kB
Rss:                 936 kB

$


Let's document that those counters aren't entirely accurate.

Reported-by: Mark Mossberg <mark.mossberg@gmail.com>
Signed-off-by: Jann Horn <jannh@google.com>
---
 man5/proc.5 | 35 +++++++++++++++++++++++++++++++++--
 1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/man5/proc.5 b/man5/proc.5
index ed309380b53b..13208811efb0 100644
--- a/man5/proc.5
+++ b/man5/proc.5
@@ -2265,6 +2265,9 @@ This is just the pages which
 count toward text, data, or stack space.
 This does not include pages
 which have not been demand-loaded in, or which are swapped out.
+This value is inaccurate; see
+.I /proc/[pid]/statm
+below.
 .TP
 (25) \fIrsslim\fP \ %lu
 Current soft limit in bytes on the rss of the process;
@@ -2409,9 +2412,9 @@ The columns are:
 size       (1) total program size
            (same as VmSize in \fI/proc/[pid]/status\fP)
 resident   (2) resident set size
-           (same as VmRSS in \fI/proc/[pid]/status\fP)
+           (inaccurate; same as VmRSS in \fI/proc/[pid]/status\fP)
 shared     (3) number of resident shared pages (i.e., backed by a file)
-           (same as RssFile+RssShmem in \fI/proc/[pid]/status\fP)
+           (inaccurate; same as RssFile+RssShmem in \fI/proc/[pid]/status\fP)
 text       (4) text (code)
 .\" (not including libs; broken, includes data segment)
 lib        (5) library (unused since Linux 2.6; always 0)
@@ -2420,6 +2423,16 @@ data       (6) data + stack
 dt         (7) dirty pages (unused since Linux 2.6; always 0)
 .EE
 .in
+.IP
+.\" See SPLIT_RSS_COUNTING in the kernel.
+.\" Inaccuracy is bounded by TASK_RSS_EVENTS_THRESH.
+Some of these values are somewhat inaccurate (up to 63 pages per thread) because
+of a kernel-internal scalability optimization.
+If accurate values are required, use
+.I /proc/[pid]/smaps
+or
+.I /proc/[pid]/smaps_rollup
+instead, which are much slower but provide accurate, detailed information.
 .TP
 .I /proc/[pid]/status
 Provides much of the information in
@@ -2596,6 +2609,9 @@ directly access physical memory.
 .IP *
 .IR VmHWM :
 Peak resident set size ("high water mark").
+This value is inaccurate; see
+.I /proc/[pid]/statm
+above.
 .IP *
 .IR VmRSS :
 Resident set size.
@@ -2604,16 +2620,25 @@ Note that the value here is the sum of
 .IR RssFile ,
 and
 .IR RssShmem .
+This value is inaccurate; see
+.I /proc/[pid]/statm
+above.
 .IP *
 .IR RssAnon :
 Size of resident anonymous memory.
 .\" commit bf9683d6990589390b5178dafe8fd06808869293
 (since Linux 4.5).
+This value is inaccurate; see
+.I /proc/[pid]/statm
+above.
 .IP *
 .IR RssFile :
 Size of resident file mappings.
 .\" commit bf9683d6990589390b5178dafe8fd06808869293
 (since Linux 4.5).
+This value is inaccurate; see
+.I /proc/[pid]/statm
+above.
 .IP *
 .IR RssShmem :
 Size of resident shared memory (includes System V shared memory,
@@ -2622,6 +2647,9 @@ mappings from
 and shared anonymous mappings).
 .\" commit bf9683d6990589390b5178dafe8fd06808869293
 (since Linux 4.5).
+This value is inaccurate; see
+.I /proc/[pid]/statm
+above.
 .IP *
 .IR VmData ", " VmStk ", " VmExe :
 Size of data, stack, and text segments.
@@ -2640,6 +2668,9 @@ Size of second-level page tables (added in Linux 4.0; removed in Linux 4.15).
 .\" commit b084d4353ff99d824d3bc5a5c2c22c70b1fba722
 Swapped-out virtual memory size by anonymous private pages;
 shmem swap usage is not included (since Linux 2.6.34).
+This value is inaccurate; see
+.I /proc/[pid]/statm
+above.
 .IP *
 .IR HugetlbPages :
 Size of hugetlb memory portions

base-commit: 92e4056a29156598d057045ad25f59d44fcd1bb5
-- 
2.28.0.1011.ga647a8990f-goog



* Re: [PATCH] proc.5: Document inaccurate RSS due to SPLIT_RSS_COUNTING
  2020-10-12 11:49 [PATCH] proc.5: Document inaccurate RSS due to SPLIT_RSS_COUNTING Jann Horn
@ 2020-10-12 14:52 ` Jann Horn
  2020-10-12 15:07 ` Michal Hocko
  1 sibling, 0 replies; 5+ messages in thread
From: Jann Horn @ 2020-10-12 14:52 UTC
  To: Michael Kerrisk-manpages; +Cc: linux-man, Linux-MM, Mark Mossberg

On Mon, Oct 12, 2020 at 1:49 PM Jann Horn <jannh@google.com> wrote:
> Since 34e55232e59f7b19050267a05ff1226e5cd122a5 (introduced back in
> v2.6.34), Linux uses per-thread RSS counters to reduce cache contention on
> the per-mm counters. With a 4K page size, that means that you can end up
> with the counters off by up to 252KiB per thread.

Actually, as Mark Mossberg pointed out to me off-thread, the counters
can be off by many times more than that. This can be reproduced with,
e.g., the following:

#include <stdlib.h>
#include <err.h>
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/eventfd.h>
#include <sys/prctl.h>
void dump(int pid) {
  char cmd[1000];
  sprintf(cmd,
    "grep '^VmRSS' /proc/%d/status;"
    "grep '^Rss:' /proc/%d/smaps_rollup;"
    "echo",
    pid, pid
  );
  system(cmd);
}
int main(void) {
  eventfd_t dummy;
  int child_wait = eventfd(0, EFD_SEMAPHORE|EFD_CLOEXEC);
  int child_resume = eventfd(0, EFD_SEMAPHORE|EFD_CLOEXEC);
  if (child_wait == -1 || child_resume == -1) err(1, "eventfd");
  pid_t child = fork();
  if (child == -1) err(1, "fork");
  if (child == 0) {
    if (prctl(PR_SET_PDEATHSIG, SIGKILL)) err(1, "PDEATHSIG");
    if (getppid() == 1) exit(0);
    char *mapping = mmap(NULL, 80 * 0x1000, PROT_READ|PROT_WRITE,
                         MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    if (mapping == MAP_FAILED) err(1, "mmap");
    for (int i=0; 1; i++) {
      eventfd_write(child_wait, 1);
      eventfd_read(child_resume, &dummy);
      if (i == 80) break;
      mapping[0x1000 * i] = 1;
    }
    exit(0);
  }

  for (int i=0; i<81; i++) {
    eventfd_read(child_wait, &dummy);
    dump(child);
    eventfd_write(child_resume, 1);
  }

  exit(0);
}


I'm not entirely sure why though.


* Re: [PATCH] proc.5: Document inaccurate RSS due to SPLIT_RSS_COUNTING
  2020-10-12 11:49 [PATCH] proc.5: Document inaccurate RSS due to SPLIT_RSS_COUNTING Jann Horn
  2020-10-12 14:52 ` Jann Horn
@ 2020-10-12 15:07 ` Michal Hocko
  2020-10-12 15:20   ` Jann Horn
  1 sibling, 1 reply; 5+ messages in thread
From: Michal Hocko @ 2020-10-12 15:07 UTC
  To: Jann Horn; +Cc: mtk.manpages, linux-man, linux-mm, Mark Mossberg

On Mon 12-10-20 13:49:40, Jann Horn wrote:
> Since 34e55232e59f7b19050267a05ff1226e5cd122a5 (introduced back in
> v2.6.34), Linux uses per-thread RSS counters to reduce cache contention on
> the per-mm counters. With a 4K page size, that means that you can end up
> with the counters off by up to 252KiB per thread.

Do we actually have any strong case to keep this exception to the
accounting? 
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] proc.5: Document inaccurate RSS due to SPLIT_RSS_COUNTING
  2020-10-12 15:07 ` Michal Hocko
@ 2020-10-12 15:20   ` Jann Horn
  2020-10-12 15:33     ` Michal Hocko
  0 siblings, 1 reply; 5+ messages in thread
From: Jann Horn @ 2020-10-12 15:20 UTC
  To: Michal Hocko; +Cc: Michael Kerrisk-manpages, linux-man, Linux-MM, Mark Mossberg

On Mon, Oct 12, 2020 at 5:07 PM Michal Hocko <mhocko@suse.com> wrote:
> On Mon 12-10-20 13:49:40, Jann Horn wrote:
> > Since 34e55232e59f7b19050267a05ff1226e5cd122a5 (introduced back in
> > v2.6.34), Linux uses per-thread RSS counters to reduce cache contention on
> > the per-mm counters. With a 4K page size, that means that you can end up
> > with the counters off by up to 252KiB per thread.
>
> Do we actually have any strong case to keep this exception to the
> accounting?

I have no clue. The concept of "concurrently modified cache lines are
bad" seemed vaguely reasonable to me... but I have no idea how much
impact this actually has on massively multithreaded processes.


* Re: [PATCH] proc.5: Document inaccurate RSS due to SPLIT_RSS_COUNTING
  2020-10-12 15:20   ` Jann Horn
@ 2020-10-12 15:33     ` Michal Hocko
  0 siblings, 0 replies; 5+ messages in thread
From: Michal Hocko @ 2020-10-12 15:33 UTC
  To: Jann Horn; +Cc: Michael Kerrisk-manpages, linux-man, Linux-MM, Mark Mossberg

On Mon 12-10-20 17:20:08, Jann Horn wrote:
> On Mon, Oct 12, 2020 at 5:07 PM Michal Hocko <mhocko@suse.com> wrote:
> > On Mon 12-10-20 13:49:40, Jann Horn wrote:
> > > Since 34e55232e59f7b19050267a05ff1226e5cd122a5 (introduced back in
> > > v2.6.34), Linux uses per-thread RSS counters to reduce cache contention on
> > > the per-mm counters. With a 4K page size, that means that you can end up
> > > with the counters off by up to 252KiB per thread.
> >
> > Do we actually have any strong case to keep this exception to the
> > accounting?
> 
> I have no clue. The concept of "concurrently modified cache lines are
> bad" seemed vaguely reasonable to me... but I have no idea how much
> impact this actually has on massively multithreaded processes.

I do remember some discussion where the imprecision turned out to be a
real problem (Android?).

Anyway, I have to say that 34e55232e59f ("mm: avoid false sharing of
mm_counter") sounds quite dubious to me and it begs for re-evaluation.

Btw. thanks for trying to document this weird behavior. This is
certainly useful, but I suspect that dropping it might be even
better.

-- 
Michal Hocko
SUSE Labs
