* How to sample memory usage cheaply?
@ 2017-03-30 20:04 Benjamin King
  2017-03-31 12:06 ` Milian Wolff
  2017-04-03 19:09 ` Benjamin King
  0 siblings, 2 replies; 6+ messages in thread
From: Benjamin King @ 2017-03-30 20:04 UTC (permalink / raw)
  To: linux-perf-users

Hi,

I'd like to get a big picture of where a memory-hogging process uses physical
memory. I'm interested in call graphs, but in terms of Brendan's treatise
(http://www.brendangregg.com/FlameGraphs/memoryflamegraphs.html), I'd love to
analyze page faults a bit before continuing with the more expensive tracing
of malloc and friends.

My problem is that we mmap some readonly files more than once to save memory,
but each individual mapping creates page faults.

This makes sense, but how can I measure physical memory properly, then?
Parsing the Pss rows in /proc/<pid>/smaps does work, but seems a bit clumsy.
Is there a better way (e.g. with callstacks) to measure physical memory
growth for a process?
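
(For concreteness, "parsing" here means something along the lines of summing
the Pss values, which smaps reports in kB:
-----
$ awk '/^Pss:/ { kb += $2 } END { print kb " kB" }' /proc/<pid>/smaps
-----
)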

cheers,
  Benjamin

PS:
 Here is a measurement with a leaky toy program. It uses a 1GB zero-filled
 file and drops the file system caches prior to the measurement to encourage
 major page faults for the first mapping only. It does not work as expected
 at all:
-----
$ gcc -O0 mmap_faults.c
$ fallocate -z -l $((1<<30)) 1gb_of_garbage.dat 
$ sudo sysctl -w vm.drop_caches=3
vm.drop_caches = 3
$ perf stat -eminor-faults,major-faults ./a.out

 Performance counter stats for './a.out':

            327,726      minor-faults                                                
                  1      major-faults
$ cat mmap_faults.c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define numMaps 20
#define length 1u<<30
#define path "1gb_of_garbage.dat"

int main()
{
  int sum = 0;
  for ( int j = 0; j < numMaps; ++j ) 
  { 
    const char *result =
      (const char*)mmap( NULL, length, PROT_READ, MAP_PRIVATE,
          open( path, O_RDONLY ), 0 );

    for ( int i = 0; i < length; i += 4096 ) 
      sum += result[ i ];
  } 
  return sum;
}
-----
Shouldn't I see ~5 million page faults (20GB / 4K)?
Shouldn't I see more major page faults?
The same happens when the garbage file is filled from /dev/urandom.
It is even weirder when MAP_POPULATE'ing the file.


* Re: How to sample memory usage cheaply?
  2017-03-30 20:04 How to sample memory usage cheaply? Benjamin King
@ 2017-03-31 12:06 ` Milian Wolff
  2017-04-01  7:41   ` Benjamin King
  2017-04-03 19:09 ` Benjamin King
  1 sibling, 1 reply; 6+ messages in thread
From: Milian Wolff @ 2017-03-31 12:06 UTC (permalink / raw)
  To: Benjamin King; +Cc: linux-perf-users

On Thursday, 30 March 2017 22:04:04 CEST Benjamin King wrote:
> Hi,
> 
> I'd like to get a big picture of where a memory hogging process uses
> physical memory. I'm interested in call graphs, but in terms of Brendans
> treatise (http://www.brendangregg.com/FlameGraphs/memoryflamegraphs.html),
> I'd love to analyze page faults a bit before continuing with the more
> expensive tracing of malloc and friends.

I suggest you try out my heaptrack tool. While "expensive", it works quite
well even for applications that allocate a lot of heap memory. It's super easy
to use, and if it works for you, you don't need to venture into the kind of
low-level profiling you are describing below.

https://www.kdab.com/heaptrack-v1-0-0-release/

If you are actually using low-level stuff directly, i.e. have a custom memory
pool implementation, heaptrack also offers an API similar to Valgrind's, so
that you can annotate your custom heap allocations.
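
Basic usage is roughly the following (sketch; the exact name of the output
file depends on the application and version):
-----
$ heaptrack ./yourapp its-arguments
$ heaptrack_gui heaptrack.yourapp.12345.gz   # or heaptrack_print
-----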

> My problem is that we mmap some readonly files more than once to save
> memory, but each individual mapping creates page faults.
> 
> This makes sense, but how can I measure physical memory properly, then?
> Parsing the Pss-Rows in /proc/<pid>/smaps does work, but seems a bit clumsy.
> Is there a better way (e.g. with callstacks) to measure physical memory
> growth for a process?

From what I understand, wouldn't you get this by tracing sbrk + mmap with call
stacks? Do note that file-backed mmaps can be shared across processes, so
you'll have to take that into account as is done for Pss. But in my experience,
when you want to improve a single application's memory consumption, it usually
boils down to non-shared heap memory anyway, i.e. sbrk and anon mmaps.
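
Something along these lines (untested sketch; access to the syscalls:*
tracepoints may require root or a relaxed perf_event_paranoid) should get you
those call stacks:
-----
$ perf record -e syscalls:sys_enter_brk,syscalls:sys_enter_mmap \
      --call-graph dwarf -- ./yourapp
$ perf report --no-children
-----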

Even if you look at the call stacks for these syscalls, they usually point at
the obvious places (i.e. mempools), and you won't see what is _actually_ using
the memory of those pools. Heaptrack or Massif is much better in that regard.

Hope that helps, happy profiling!

> PS:
>  Here is some measurement with a leaky toy program. It uses a 1GB
> zero-filled file and drops file system caches prior to the measurement to
> encourage major page faults for the first mapping only. Does not work at
> all: -----
> $ gcc -O0 mmap_faults.c
> $ fallocate -z -l $((1<<30)) 1gb_of_garbage.dat
> $ sudo sysctl -w vm.drop_caches=3
> vm.drop_caches = 3
> $ perf stat -eminor-faults,major-faults ./a.out
> 
>  Performance counter stats for './a.out':
> 
>             327,726      minor-faults
>                   1      major-faults
> $ cat mmap_faults.c
> #include <fcntl.h>
> #include <sys/mman.h>
> #include <unistd.h>
> 
> #define numMaps 20
> #define length 1u<<30
> #define path "1gb_of_garbage.dat"
> 
> int main()
> {
>   int sum = 0;
>   for ( int j = 0; j < numMaps; ++j )
>   {
>     const char *result =
>       (const char*)mmap( NULL, length, PROT_READ, MAP_PRIVATE,
>           open( path, O_RDONLY ), 0 );
> 
>     for ( int i = 0; i < length; i += 4096 )
>       sum += result[ i ];
>   }
>   return sum;
> }
> -----
> Shouldn't I see ~5 Million page faults (20GB/4K)?
> Shouldn't I see more major page faults?
> Same thing when garbage-file is filled from /dev/urandom
> Even weirder when MAP_POPULATE'ing the file
> --
> To unsubscribe from this list: send the line "unsubscribe linux-perf-users"
> in the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Milian Wolff | milian.wolff@kdab.com | Software Engineer
KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
Tel: +49-30-521325470
KDAB - The Qt Experts


* Re: How to sample memory usage cheaply?
  2017-03-31 12:06 ` Milian Wolff
@ 2017-04-01  7:41   ` Benjamin King
  2017-04-01 13:54     ` Vince Weaver
  0 siblings, 1 reply; 6+ messages in thread
From: Benjamin King @ 2017-04-01  7:41 UTC (permalink / raw)
  To: Milian Wolff; +Cc: linux-perf-users

Hi!

On Fri, Mar 31, 2017 at 02:06:28PM +0200, Milian Wolff wrote:
>> I'd love to analyze page faults a bit before continuing with the more
>> expensive tracing of malloc and friends.
>
>I suggest you try out my heaptrack tool.

I did! Also the heap profiler from Google perftools and a homegrown approach
using uprobes on malloc, free, mmap and munmap. My bcc-fu is too weak to do
me any good right now. The workload that I am working with takes a few hours
to generate, however, so more overhead means a lot of latency before I can
observe the effect of the changes that I make.

I guess the issue I am currently facing is a bit more fundamental, since I
don't have a good way to trace/sample instances where my process is using a
page of physical memory *for the first time*. Bonus points for detecting when
the use count on behalf of my process drops to zero.

I'd like to tell these events apart from instances where I am using the same
physical page from more than one virtual page. It feels like there should be
a way to do so, but I don't know how.

Also, the 'perf stat' output that I sent does not make much sense to me right
now. For example, when I add MAP_POPULATE to the flags for mmap, I only see
~40 minor page faults, which I do not understand at all.

>> This makes sense, but how can I measure physical memory properly, then?
>> Parsing the Pss-Rows in /proc/<pid>/smaps does work, but seems a bit clumsy.
>> Is there a better way (e.g. with callstacks) to measure physical memory
>> growth for a process?
>
>From what I understand, wouldn't you get this by tracing sbrk + mmap with call
>stacks?

No, not quite, since different threads in my process mmap the same file. That
counts the file twice in terms of virtual memory, but I'm interested in the
load on physical memory.


> Do note that file-backed mmaps can be shared across processes, so
>you'll have to take that into account as done for Pss.  But in my experience,
>when you want to improve a single application's memory consumption, it usually
>boils down to non-shared heap memory anyways. I.e. sbrk and anon mmaps.

True, and that's what I'm after eventually, but the multiple mappings skew
the picture a bit right now.

To make progress, I first need to figure out how to properly measure physical
memory usage.

>But if you look at the callstacks for these syscalls, they usually point at
>the obvious places (i.e. mempools), but you won't see what is _actually_ using
>the memory of these pools. Heaptrack or massif is much better in that regard.

>
>Hope that helps, happy profiling!

Thanks,
  Benjamin


>
>> PS:
>>  Here is some measurement with a leaky toy program. It uses a 1GB
>> zero-filled file and drops file system caches prior to the measurement to
>> encourage major page faults for the first mapping only. Does not work at
>> all: -----
>> $ gcc -O0 mmap_faults.c
>> $ fallocate -z -l $((1<<30)) 1gb_of_garbage.dat
>> $ sudo sysctl -w vm.drop_caches=3
>> vm.drop_caches = 3
>> $ perf stat -eminor-faults,major-faults ./a.out
>>
>>  Performance counter stats for './a.out':
>>
>>             327,726      minor-faults
>>                   1      major-faults
>> $ cat mmap_faults.c
>> #include <fcntl.h>
>> #include <sys/mman.h>
>> #include <unistd.h>
>>
>> #define numMaps 20
>> #define length 1u<<30
>> #define path "1gb_of_garbage.dat"
>>
>> int main()
>> {
>>   int sum = 0;
>>   for ( int j = 0; j < numMaps; ++j )
>>   {
>>     const char *result =
>>       (const char*)mmap( NULL, length, PROT_READ, MAP_PRIVATE,
>>           open( path, O_RDONLY ), 0 );
>>
>>     for ( int i = 0; i < length; i += 4096 )
>>       sum += result[ i ];
>>   }
>>   return sum;
>> }
>> -----
>> Shouldn't I see ~5 Million page faults (20GB/4K)?
>> Shouldn't I see more major page faults?
>> Same thing when garbage-file is filled from /dev/urandom
>> Even weirder when MAP_POPULATE'ing the file
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-perf-users"
>> in the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>-- 
>Milian Wolff | milian.wolff@kdab.com | Software Engineer
>KDAB (Deutschland) GmbH&Co KG, a KDAB Group company
>Tel: +49-30-521325470
>KDAB - The Qt Experts


* Re: How to sample memory usage cheaply?
  2017-04-01  7:41   ` Benjamin King
@ 2017-04-01 13:54     ` Vince Weaver
  2017-04-01 16:27       ` Benjamin King
  0 siblings, 1 reply; 6+ messages in thread
From: Vince Weaver @ 2017-04-01 13:54 UTC (permalink / raw)
  To: Benjamin King; +Cc: Milian Wolff, linux-perf-users


On Sat, 1 Apr 2017, Benjamin King wrote:
> Also, the 'perf stat' output that I sent does not make much sense to me right
> now. For example, when I add MAP_POPULATE to the flags for mmap, I only see
> ~40 minor page faults, which I do not understand at all.

Have you tried accessing your file in random order? I think the kernel is
likely doing some sort of readahead/prefetching. I think you can change that
behavior with the madvise() syscall.
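
For illustration, an untested variant of the toy program with that hint added
might look like this (MADV_RANDOM tells the kernel to expect random access and
to skip readahead):
-----
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define length (1u<<30)
#define path "1gb_of_garbage.dat"

int main()
{
  const char *result =
    (const char*)mmap( NULL, length, PROT_READ, MAP_PRIVATE,
        open( path, O_RDONLY ), 0 );
  if ( result == MAP_FAILED )
    return 1;

  /* hint: random access, so the kernel should not fault in neighbouring pages */
  madvise( (void*)result, length, MADV_RANDOM );

  int sum = 0;
  for ( unsigned i = 0; i < length; i += 4096 )
    sum += result[ i ];
  return sum;
}
-----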

Vince


* Re: How to sample memory usage cheaply?
  2017-04-01 13:54     ` Vince Weaver
@ 2017-04-01 16:27       ` Benjamin King
  0 siblings, 0 replies; 6+ messages in thread
From: Benjamin King @ 2017-04-01 16:27 UTC (permalink / raw)
  To: Vince Weaver; +Cc: Milian Wolff, linux-perf-users

Hi Vince,

>> Also, the 'perf stat' output that I sent does not make much sense to me right
>> now. For example, when I add MAP_POPULATE to the flags for mmap, I only see
>> ~40 minor page faults, which I do not understand at all.
>
>have you tried accessing your file in random order?  I think the kernel is
>likely doing some sort of readahead/prefetching.  I think you can change
>that behavior with the madvise() syscall

No, I did not try that; I am just reading every 4096th byte. I had assumed
that any way of populating the page table of my process would show up as a
page fault. I guess I have to read a bit more on that...

Cheers,
  Benjamin


* Re: How to sample memory usage cheaply?
  2017-03-30 20:04 How to sample memory usage cheaply? Benjamin King
  2017-03-31 12:06 ` Milian Wolff
@ 2017-04-03 19:09 ` Benjamin King
  1 sibling, 0 replies; 6+ messages in thread
From: Benjamin King @ 2017-04-03 19:09 UTC (permalink / raw)
  To: linux-perf-users

On Thu, Mar 30, 2017 at 10:04:04PM +0200, Benjamin King wrote:

Hi,

I learned a bit more about observing memory with perf. If this is not the
right place to discuss this anymore, just tell me to shut up :-)

Wrapping this up a bit:
>I'd like to get a big picture of where a memory hogging process uses physical
>memory. I'm interested in call graphs, [...] I'd love to
>analyze page faults

I have learned that the first use of a physical page is called a "page
allocation", which can be traced via the event kmem:mm_page_alloc. This is
the physical analogue of, and different from, the "page faults" that happen
in virtual memory.

Mapping a file with MAP_POPULATE after dropping the filesystem caches
(sysctl -w vm.drop_caches=3) will show the right number in kmem:mm_page_alloc,
namely size/4K (4K being the page size on my system).
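
With the toy program from my first mail adjusted to pass MAP_POPULATE, that
measurement looks roughly like this (sketch; reading kernel tracepoints
typically requires root):
-----
$ sudo sysctl -w vm.drop_caches=3
$ sudo perf stat -e kmem:mm_page_alloc,minor-faults,major-faults ./a.out
$ sudo perf record -e kmem:mm_page_alloc -g ./a.out   # same, with call graphs
-----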

If I do the same again without dropping the caches in between, mm_page_alloc
does not show the same number, but rather the number of pages it takes to hold
the page table entries. This is nice and fairly complete, but I still hope to
find a way to observe when a page from the filesystem cache is referenced for
the first time by my process. That would allow me to do without the cache
dropping.

Page faults in virtual memory are more opaque to me. They only seem to be
counted when the system did not already prepare the page tables via
prefetching. For example, MAP_POPULATE'd mappings will not count towards page
faults, neither minor nor major ones.

To control some of the prefetching, there is a debugfs knob called
/sys/kernel/debug/fault_around_bytes, but reducing this to the minimum on my
machine does not produce a page fault number that I could easily explain, at
least not in the MAP_POPULATE case. It might work better when actually
reading data from the mapped file.
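
(For reference: the knob is in bytes and can be lowered to a single page
roughly like this, assuming debugfs is mounted at /sys/kernel/debug:
-----
$ echo 4096 | sudo tee /sys/kernel/debug/fault_around_bytes
-----
If the default really is 64K, i.e. 16 pages, then the ~327k minor faults from
my first mail would at least roughly match the expected ~5.2M faults divided
by 16.)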

Anticipating page faults and preventing them proactively is a nice service
from the OS, but I would be delighted if there were a way to trace this as
well, similar to how mm_page_alloc counts each and every physical allocation.
This would make page faults more useful as a poor man's memory tracker.

Cheers,
  Benjamin

