linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Sonny Rao <sonnyrao@chromium.org>
To: Michal Hocko <mhocko@kernel.org>
Cc: Jann Horn <jann@thejh.net>,
	Robert Foss <robert.foss@collabora.com>,
	corbet@lwn.net, Andrew Morton <akpm@linux-foundation.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	Konstantin Khlebnikov <koct9i@gmail.com>,
	Hugh Dickins <hughd@google.com>,
	Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
	Minchan Kim <minchan@kernel.org>,
	John Stultz <john.stultz@linaro.org>,
	ross.zwisler@linux.intel.com, jmarchan@redhat.com,
	Johannes Weiner <hannes@cmpxchg.org>,
	Kees Cook <keescook@chromium.org>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Cyrill Gorcunov <gorcunov@openvz.org>,
	Robin Humble <plaguedbypenguins@gmail.com>,
	David Rientjes <rientjes@google.com>,
	eric.engestrom@imgtec.com, Janis Danisevskis <jdanis@google.com>,
	calvinowens@fb.com, Alexey Dobriyan <adobriyan@gmail.com>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	ldufour@linux.vnet.ibm.com, linux-doc@vger.kernel.org,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Ben Zhang <benzh@chromium.org>, Bryan Freed <bfreed@chromium.org>,
	Filipe Brandenburger <filbranden@chromium.org>,
	Mateusz Guzik <mguzik@redhat.com>
Subject: Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
Date: Wed, 17 Aug 2016 11:57:56 -0700	[thread overview]
Message-ID: <CAPz6YkW3Ph4mi++qY4cJiQ1PwhnxLr5=E4oCHjf5nYJHMhRcew@mail.gmail.com> (raw)
In-Reply-To: <20160817130320.GC20703@dhcp22.suse.cz>

On Wed, Aug 17, 2016 at 6:03 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Wed 17-08-16 11:31:25, Jann Horn wrote:
>> On Wed, Aug 17, 2016 at 10:22:00AM +0200, Michal Hocko wrote:
>> > On Tue 16-08-16 12:46:51, Robert Foss wrote:
>> > [...]
>> > > $ /usr/bin/time -v -p zsh -c "repeat 25 { awk '/^Rss/{rss+=\$2}
>> > > /^Pss/{pss+=\$2} END {printf \"rss:%d pss:%d\n\", rss, pss}\'
>> > > /proc/5025/smaps }"
>> > > [...]
>> > >   Command being timed: "zsh -c repeat 25 { awk '/^Rss/{rss+=$2}
>> > > /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}\' /proc/5025/smaps
>> > > }"
>> > >   User time (seconds): 0.37
>> > >   System time (seconds): 0.45
>> > >   Percent of CPU this job got: 92%
>> > >   Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.89
>> >
>> > This is really unexpected. Where is the user time spent? Anyway, rather
>> > than measuring some random processes I've tried to measure something
>> > resembling the worst case. So I've created a simple program to mmap as
>> > much as possible:
>> >
>> > #include <sys/mman.h>
>> > #include <sys/types.h>
>> > #include <unistd.h>
>> > #include <stdio.h>
>> > int main()
>> > {
>> >     while (mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_ANON|MAP_SHARED|MAP_POPULATE, -1, 0) != MAP_FAILED)
>> >             ;
>> >
>> >     printf("pid:%d\n", getpid());
>> >     pause();
>> >     return 0;
>> > }
>>
>> Ah, nice, that's a reasonable test program. :)
>>
>>
>> > So with a reasonable user space the parsing is really not all that time
>> > consuming wrt. smaps handling. That being said I am still very skeptical
>> > about a dedicated proc file which accomplishes what userspace can done
>> > in a trivial way.
>>
>> Now, since your numbers showed that all the time is spent in the kernel,
>> also create this test program to just read that file over and over again:
>>
>> $ cat justreadloop.c
>> #include <sys/types.h>
>> #include <sys/stat.h>
>> #include <fcntl.h>
>> #include <sched.h>
>> #include <unistd.h>
>> #include <err.h>
>> #include <stdio.h>
>>
>> char buf[1000000];
>>
>> int main(int argc, char **argv) {
>>   printf("pid:%d\n", getpid());
>>   while (1) {
>>     int fd = open(argv[1], O_RDONLY);
>>     if (fd < 0) continue;
>>     if (read(fd, buf, sizeof(buf)) < 0)
>>       err(1, "read");
>>     close(fd);
>>   }
>> }
>> $ gcc -Wall -o justreadloop justreadloop.c
>> $
>>
>> Now launch your test:
>>
>> $ ./mapstuff
>> pid:29397
>>
>> point justreadloop at it:
>>
>> $ ./justreadloop /proc/29397/smaps
>> pid:32567
>>
>> ... and then check the performance stats of justreadloop:
>>
>> # perf top -p 32567
>>
>> This is what I see:
>>
>> Samples: 232K of event 'cycles:ppp', Event count (approx.): 60448424325
>> Overhead  Shared Object     Symbol
>>   30,43%  [kernel]          [k] format_decode
>>    9,12%  [kernel]          [k] number
>>    7,66%  [kernel]          [k] vsnprintf
>>    7,06%  [kernel]          [k] __lock_acquire
>>    3,23%  [kernel]          [k] lock_release
>>    2,85%  [kernel]          [k] debug_lockdep_rcu_enabled
>>    2,25%  [kernel]          [k] skip_atoi
>>    2,13%  [kernel]          [k] lock_acquire
>>    2,05%  [kernel]          [k] show_smap
>
> This is a lot! I would expect the rmap walk to consume more but it even
> doesn't show up in the top consumers.
>
>> That's at least 30.43% + 9.12% + 7.66% = 47.21% of the task's kernel
>> time spent on evaluating format strings. The new interface
>> wouldn't have to spend that much time on format strings because there
>> isn't so much text to format.
>
> well, this is true of course but I would much rather try to reduce the
> overhead of smaps file than add a new file. The following should help
> already. I've measured ~7% systime cut down. I guess there is still some
> room for improvements but I have to say I'm far from being convinced about
> a new proc file just because we suck at dumping information to the
> userspace.
> If this was something like /proc/<pid>/stat which is
> essentially read all the time then it would be a different question but
> is the rss, pss going to be all that often? If yes why?

If the question is why do we need to read RSS, PSS, Private_*, Swap
and the other fields so often?

I have two use cases so far involving monitoring per-process memory
usage, and we usually need to read stats for about 25 processes.

Here's a timing example on an fairly recent ARM system 4 core RK3288
running at 1.8Ghz

localhost ~ # time cat /proc/25946/smaps > /dev/null

real    0m0.036s
user    0m0.020s
sys     0m0.020s

localhost ~ # time cat /proc/25946/totmaps > /dev/null

real    0m0.027s
user    0m0.010s
sys     0m0.010s
localhost ~ #

I'll ignore the user time for now, and we see about 20 ms of system
time with smaps and 10 ms with totmaps, with 20 similar processes it
would be 400 milliseconds of cpu time for the kernel to get this
information from smaps vs 200 milliseconds with totmaps.  Even totmaps
is still pretty slow, but much better than smaps.

Use cases:
1) Basic task monitoring -- like "top" that shows memory consumption
including PSS, Private, Swap
    1 second update means about 40% of one CPU is spent in the kernel
gathering the data with smaps

2) User space OOM handling -- we'd rather do a more graceful shutdown
than let the kernel's OOM killer activate and need to gather this
information
    and we'd like to be able to get this information to make the
decision much faster than 400ms

> These are the
> questions which should be answered before we even start considering the
> implementation.
> ---
> From 2a6883a7278ff8979808cb8e2dbcefe5ea3bf672 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Wed, 17 Aug 2016 14:00:13 +0200
> Subject: [PATCH] proc, smaps: reduce printing overhead
>
> seq_printf (used by show_smap) can be pretty expensive when dumping a
> lot of numbers.  Say we would like to get Rss and Pss from a particular
> process.  In order to measure a pathological case let's generate as many
> mappings as possible:
>
> $ cat max_mmap.c
> int main()
> {
>         while (mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_ANON|MAP_SHARED|MAP_POPULATE, -1, 0) != MAP_FAILED)
>                 ;
>
>         printf("pid:%d\n", getpid());
>         pause();
>         return 0;
> }
>
> $ awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}' /proc/$pid/smaps
>
> would do a trick. The whole runtime is in the kernel space which is not
> that that unexpected because smaps is not the cheapest one (we have to
> do rmap walk etc.).
>
>         Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss} /proc/3050/smaps"
>         User time (seconds): 0.01
>         System time (seconds): 0.44
>         Percent of CPU this job got: 99%
>         Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.47
>
> But the perf says:
>     22.55%  awk      [kernel.kallsyms]  [k] format_decode
>     14.65%  awk      [kernel.kallsyms]  [k] vsnprintf
>      6.40%  awk      [kernel.kallsyms]  [k] number
>      2.53%  awk      [kernel.kallsyms]  [k] shmem_mapping
>      2.53%  awk      [kernel.kallsyms]  [k] show_smap
>      1.81%  awk      [kernel.kallsyms]  [k] lock_acquire
>
> we are spending most of the time actually generating the output which is
> quite lame. Let's replace seq_printf by seq_puts and seq_put_decimal_ull.
> This will give us:
>         Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss} /proc/3067/smaps"
>         User time (seconds): 0.00
>         System time (seconds): 0.41
>         Percent of CPU this job got: 99%
>         Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.42
>
> which will give us ~7% improvement. Perf says:
>     28.87%  awk      [kernel.kallsyms]  [k] seq_puts
>      5.30%  awk      [kernel.kallsyms]  [k] vsnprintf
>      4.54%  awk      [kernel.kallsyms]  [k] format_decode
>      3.73%  awk      [kernel.kallsyms]  [k] show_smap
>      2.56%  awk      [kernel.kallsyms]  [k] shmem_mapping
>      1.92%  awk      [kernel.kallsyms]  [k] number
>      1.80%  awk      [kernel.kallsyms]  [k] lock_acquire
>      1.75%  awk      [kernel.kallsyms]  [k] print_name_value_kb
>
> Reported-by: Jann Horn <jann@thejh.net>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  fs/proc/task_mmu.c | 63 ++++++++++++++++++++++--------------------------------
>  1 file changed, 25 insertions(+), 38 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 187d84ef9de9..41c24c0811da 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -721,6 +721,13 @@ void __weak arch_show_smap(struct seq_file *m, struct vm_area_struct *vma)
>  {
>  }
>
> +static void print_name_value_kb(struct seq_file *m, const char *name, unsigned long val)
> +{
> +       seq_puts(m, name);
> +       seq_put_decimal_ull(m, 0, val);
> +       seq_puts(m, " kB\n");
> +}
> +
>  static int show_smap(struct seq_file *m, void *v, int is_pid)
>  {
>         struct vm_area_struct *vma = v;
> @@ -765,45 +772,25 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
>
>         show_map_vma(m, vma, is_pid);
>
> -       seq_printf(m,
> -                  "Size:           %8lu kB\n"
> -                  "Rss:            %8lu kB\n"
> -                  "Pss:            %8lu kB\n"
> -                  "Shared_Clean:   %8lu kB\n"
> -                  "Shared_Dirty:   %8lu kB\n"
> -                  "Private_Clean:  %8lu kB\n"
> -                  "Private_Dirty:  %8lu kB\n"
> -                  "Referenced:     %8lu kB\n"
> -                  "Anonymous:      %8lu kB\n"
> -                  "AnonHugePages:  %8lu kB\n"
> -                  "ShmemPmdMapped: %8lu kB\n"
> -                  "Shared_Hugetlb: %8lu kB\n"
> -                  "Private_Hugetlb: %7lu kB\n"
> -                  "Swap:           %8lu kB\n"
> -                  "SwapPss:        %8lu kB\n"
> -                  "KernelPageSize: %8lu kB\n"
> -                  "MMUPageSize:    %8lu kB\n"
> -                  "Locked:         %8lu kB\n",
> -                  (vma->vm_end - vma->vm_start) >> 10,
> -                  mss.resident >> 10,
> -                  (unsigned long)(mss.pss >> (10 + PSS_SHIFT)),
> -                  mss.shared_clean  >> 10,
> -                  mss.shared_dirty  >> 10,
> -                  mss.private_clean >> 10,
> -                  mss.private_dirty >> 10,
> -                  mss.referenced >> 10,
> -                  mss.anonymous >> 10,
> -                  mss.anonymous_thp >> 10,
> -                  mss.shmem_thp >> 10,
> -                  mss.shared_hugetlb >> 10,
> -                  mss.private_hugetlb >> 10,
> -                  mss.swap >> 10,
> -                  (unsigned long)(mss.swap_pss >> (10 + PSS_SHIFT)),
> -                  vma_kernel_pagesize(vma) >> 10,
> -                  vma_mmu_pagesize(vma) >> 10,
> -                  (vma->vm_flags & VM_LOCKED) ?
> +       print_name_value_kb(m, "Size:           ", (vma->vm_end - vma->vm_start) >> 10);
> +       print_name_value_kb(m, "Rss:            ", mss.resident >> 10);
> +       print_name_value_kb(m, "Pss:            ", (unsigned long)(mss.pss >> (10 + PSS_SHIFT)));
> +       print_name_value_kb(m, "Shared_Clean:   ", mss.shared_clean  >> 10);
> +       print_name_value_kb(m, "Shared_Dirty:   ", mss.shared_dirty  >> 10);
> +       print_name_value_kb(m, "Private_Clean:  ", mss.private_clean >> 10);
> +       print_name_value_kb(m, "Private_Dirty:  ", mss.private_dirty >> 10);
> +       print_name_value_kb(m, "Referenced:     ", mss.referenced >> 10);
> +       print_name_value_kb(m, "Anonymous:      ", mss.anonymous >> 10);
> +       print_name_value_kb(m, "AnonHugePages:  ", mss.anonymous_thp >> 10);
> +       print_name_value_kb(m, "ShmemPmdMapped: ", mss.shmem_thp >> 10);
> +       print_name_value_kb(m, "Shared_Hugetlb: ", mss.shared_hugetlb >> 10);
> +       print_name_value_kb(m, "Private_Hugetlb: ", mss.private_hugetlb >> 10);
> +       print_name_value_kb(m, "Swap:           ", mss.swap >> 10);
> +       print_name_value_kb(m, "SwapPss:        ", (unsigned long)(mss.swap_pss >> (10 + PSS_SHIFT)));
> +       print_name_value_kb(m, "KernelPageSize: ", vma_kernel_pagesize(vma) >> 10);
> +       print_name_value_kb(m, "MMUPageSize:    ", vma_mmu_pagesize(vma) >> 10);
> +       print_name_value_kb(m, "Locked:         ", (vma->vm_flags & VM_LOCKED) ?
>                         (unsigned long)(mss.pss >> (10 + PSS_SHIFT)) : 0);
> -
>         arch_show_smap(m, vma);
>         show_smap_vma_flags(m, vma);
>         m_cache_vma(m, vma);
> --
> 2.8.1
>
> --
> Michal Hocko
> SUSE Labs

  parent reply	other threads:[~2016-08-17 19:04 UTC|newest]

Thread overview: 49+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-08-12 22:04 [PACTH v2 0/3] Implement /proc/<pid>/totmaps robert.foss
2016-08-12 22:04 ` [PACTH v2 1/3] mm, proc: " robert.foss
2016-08-13 14:39   ` Jann Horn
2016-08-15 13:57     ` Robert Foss
2016-08-15 20:14       ` Robert Foss
2016-08-12 22:04 ` [PACTH v2 2/3] Documentation/filesystems: Fixed typo robert.foss
2016-08-12 22:04 ` [PACTH v2 3/3] Documentation/filesystems: Added /proc/PID/totmaps documentation robert.foss
2016-08-14  9:04 ` [PACTH v2 0/3] Implement /proc/<pid>/totmaps Michal Hocko
2016-08-15 13:00   ` Robert Foss
2016-08-15 13:42     ` Michal Hocko
2016-08-15 16:25       ` Robert Foss
2016-08-16  7:12         ` Michal Hocko
2016-08-16 16:46           ` Robert Foss
2016-08-17  8:22             ` Michal Hocko
2016-08-17  9:31               ` Jann Horn
2016-08-17 13:03                 ` Michal Hocko
2016-08-17 16:48                   ` Robert Foss
2016-08-17 18:57                   ` Sonny Rao [this message]
2016-08-18  7:44                     ` Michal Hocko
2016-08-18 17:47                       ` Sonny Rao
2016-08-18 18:01                         ` Michal Hocko
2016-08-18 21:05                           ` Robert Foss
2016-08-19  6:27                             ` Sonny Rao
2016-08-19  2:26                           ` Minchan Kim
2016-08-19  6:47                             ` Sonny Rao
2016-08-19  8:05                             ` Michal Hocko
2016-08-19 18:20                               ` Sonny Rao
2016-08-22  0:07                               ` Minchan Kim
2016-08-22  7:40                                 ` Michal Hocko
2016-08-22 14:12                                   ` Minchan Kim
2016-08-22 14:37                                     ` Robert Foss
2016-08-22 16:45                                     ` Michal Hocko
2016-08-22 17:29                                       ` Michal Hocko
2016-08-22 17:47                                         ` Michal Hocko
2016-08-23  8:26                                           ` Michal Hocko
2016-08-23 14:33                                             ` utime accounting regression since 4.6 (was: Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps) Michal Hocko
2016-08-23 21:46                                               ` Rik van Riel
2016-08-24 16:56                                                 ` Michal Hocko
2016-09-30  9:49                                                 ` Michal Hocko
2016-09-30 12:35                                                   ` Rik van Riel
2016-08-19  6:43                           ` [PACTH v2 0/3] Implement /proc/<pid>/totmaps Sonny Rao
2016-08-19  7:59                             ` Michal Hocko
2016-08-19 17:57                               ` Sonny Rao
2016-08-22  7:54                                 ` Michal Hocko
2016-08-22 22:44                                   ` Sonny Rao
2016-08-24 10:14                                     ` Marcin Jabrzyk
2016-08-30  8:20                                       ` Michal Hocko
2016-08-29 14:37                                     ` Michal Hocko
2016-08-30  8:15                                       ` Michal Hocko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAPz6YkW3Ph4mi++qY4cJiQ1PwhnxLr5=E4oCHjf5nYJHMhRcew@mail.gmail.com' \
    --to=sonnyrao@chromium.org \
    --cc=adobriyan@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=benzh@chromium.org \
    --cc=bfreed@chromium.org \
    --cc=calvinowens@fb.com \
    --cc=corbet@lwn.net \
    --cc=eric.engestrom@imgtec.com \
    --cc=filbranden@chromium.org \
    --cc=gorcunov@openvz.org \
    --cc=hannes@cmpxchg.org \
    --cc=hughd@google.com \
    --cc=jann@thejh.net \
    --cc=jdanis@google.com \
    --cc=jmarchan@redhat.com \
    --cc=john.stultz@linaro.org \
    --cc=keescook@chromium.org \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=koct9i@gmail.com \
    --cc=ldufour@linux.vnet.ibm.com \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mguzik@redhat.com \
    --cc=mhocko@kernel.org \
    --cc=minchan@kernel.org \
    --cc=n-horiguchi@ah.jp.nec.com \
    --cc=plaguedbypenguins@gmail.com \
    --cc=rientjes@google.com \
    --cc=robert.foss@collabora.com \
    --cc=ross.zwisler@linux.intel.com \
    --cc=vbabka@suse.cz \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).