Date: Thu, 12 Mar 2015 23:58:37 +0900
From: Namhyung Kim
To: Ingo Molnar
Cc: Arnaldo Carvalho de Melo, Peter Zijlstra, Jiri Olsa, LKML, David Ahern, Minchan Kim, Joonsoo Kim
Subject: Re: [RFC/PATCHSET 0/6] perf kmem: Implement page allocation analysis (v1)
Message-ID: <20150312145837.GA1398@danjae>
References: <1426145571-3065-1-git-send-email-namhyung@kernel.org> <20150312104119.GA5978@gmail.com>
In-Reply-To: <20150312104119.GA5978@gmail.com>

Hi Ingo,

On Thu, Mar 12, 2015 at 11:41:19AM +0100, Ingo Molnar wrote:
> * Namhyung Kim wrote:
>
> > Hello,
> >
> > Currently perf kmem command only analyzes SLAB memory allocation. And
> > I'd like to introduce page allocation analysis also. Users can use
> > --slab and/or --page option to select it. If none of these options
> > are used, it does slab allocation analysis for backward compatibility.
> >
> > The patch 1-3 are bugfix and cleanups. Patch 4 implements basic
> > support for page allocation analysis, patch 5 deals with the callsite
> > and finally patch 6 implements sorting.
> >
> > In this patchset, I used two kmem events: kmem:mm_page_alloc and
> > kmem:mm_page_free for analysis as they can track every memory
> > allocation/free path AFAIK. However, unlike slab tracepoint events,
> > those page allocation events don't provide callsite info directly.
> > So I recorded callchains and extracted callsites like below:
>
> Really cool features!

Thanks!

>
> I have a couple of output typography observations:
>
> > Normal page allocation callchains look like this:
> >
> >   360a7e __alloc_pages_nodemask
> >   3a711c alloc_pages_current
> >   357bc7 __page_cache_alloc   <-- callsite
> >   357cf6 pagecache_get_page
> >    48b0a prepare_pages
> >    494d3 __btrfs_buffered_write
> >    49cdf btrfs_file_write_iter
> >   3ceb6e new_sync_write
> >   3cf447 vfs_write
> >   3cff99 sys_write
> >   7556e9 system_call
> >     f880 __write_nocancel
> >    33eb9 cmd_record
> >    4b38e cmd_kmem
> >    7aa23 run_builtin
> >    27a9a main
> >    20800 __libc_start_main
> >
> > But first two are internal page allocation functions so it should be
> > skipped. To determine such allocation functions, I used following regex:
> >
> >   ^_?_?(alloc|get_free|get_zeroed)_pages?
> >
> > This gave me a following list of functions (you can see this with -v):
> >
> >   alloc func: __get_free_pages
> >   alloc func: get_zeroed_page
> >   alloc func: alloc_pages_exact
> >   alloc func: __alloc_pages_direct_compact
> >   alloc func: __alloc_pages_nodemask
> >   alloc func: alloc_page_interleave
> >   alloc func: alloc_pages_current
> >   alloc func: alloc_pages_vma
> >   alloc func: alloc_page_buffers
> >   alloc func: alloc_pages_exact_nid
> >
> > After skipping those function, it got '__page_cache_alloc'.
> >
> > Other information such as allocation order, migration type and gfp
> > flags are provided by tracepoint events.
> >
> > Basically the output will be sorted by total allocation bytes, but you
> > can change it by using -s/--sort option. The following sort keys are
> > added to support page analysis: page, order, mtype, gfp. Existing
> > 'callsite', 'bytes' and 'hit' sort keys also can be used.
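
For readers following the quoted description, the callsite selection can be sketched roughly as below. This is only an illustration in Python, not the perf code: the regex is the one quoted above, but the helper name find_callsite and the "first non-matching frame wins" rule are assumptions drawn from the description.

```python
import re

# Regex quoted in the cover letter for internal page allocation functions.
ALLOC_FUNC_RE = re.compile(r"^_?_?(alloc|get_free|get_zeroed)_pages?")


def find_callsite(callchain):
    """Return the first symbol that is not a page allocation function.

    `callchain` is a list of symbol names ordered from the innermost frame
    outwards, as in the quoted example. Returns None if every frame matches.
    (Hypothetical helper, assumed behaviour.)
    """
    for symbol in callchain:
        if not ALLOC_FUNC_RE.match(symbol):
            return symbol
    return None


# First frames of the callchain quoted above.
chain = [
    "__alloc_pages_nodemask",
    "alloc_pages_current",
    "__page_cache_alloc",
    "pagecache_get_page",
    "prepare_pages",
]
print(find_callsite(chain))
```

With the quoted callchain, the first two frames match the regex and are skipped, so the sketch reports '__page_cache_alloc', as in the description.
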
> >
> > An example follows:
> >
> >   # perf kmem record --slab --page sleep 1
> >   [ perf record: Woken up 0 times to write data ]
> >   [ perf record: Captured and wrote 49.277 MB perf.data (191027 samples) ]
> >
> >   # perf kmem stat --page --caller -l 10 -s order,hit
> >
> >   --------------------------------------------------------------------------------------------
> >   Total_alloc/Per |   Hit | Order | Migrate type | GFP flag | Callsite
>
> s/Per/Size
> s/Hit/Hits
> s/Migrate type/Migration type
> s/GFP flag/GFP flags
>
> ?

OK, will change. (They'll take up a bit more column space, though.)

>
> >   --------------------------------------------------------------------------------------------
> >       65536/16384 |     4 |     2 |  RECLAIMABLE | 00285250 | new_slab
> >     51347456/4096 | 12536 |     0 |      MOVABLE | 0102005a | __page_cache_alloc
> >        53248/4096 |    13 |     0 |    UNMOVABLE | 002084d0 | pte_alloc_one
> >        40960/4096 |    10 |     0 |      MOVABLE | 000280da | handle_mm_fault
> >        28672/4096 |     7 |     0 |    UNMOVABLE | 000000d0 | __pollwait
> >        20480/4096 |     5 |     0 |      MOVABLE | 000200da | do_wp_page
> >        20480/4096 |     5 |     0 |      MOVABLE | 000200da | do_cow_fault
> >        16384/4096 |     4 |     0 |    UNMOVABLE | 00000200 | __tlb_remove_page
> >        16384/4096 |     4 |     0 |    UNMOVABLE | 000084d0 | __pmd_alloc
> >         8192/4096 |     2 |     0 |    UNMOVABLE | 000084d0 | __pud_alloc
> >               ... |   ... |   ... |          ... |      ... | ...
> >   --------------------------------------------------------------------------------------------
> >
> > SUMMARY (page allocator)
> > ========================
> > Total alloc requested: 12593
> > Total alloc failure  : 0
> > Total bytes allocated: 51630080
> > Total free requested : 115
> > Total free unmatched : 67
> > Total bytes freed    : 471040
>
> I'd suggest the following changes to the format:
>
> - Collapse stats into 3 groups: 'allocated+freed', 'allocated only',
>   'freed only', depending on how much of their lifetime we've
>   managed to trace. These groups are really distinct and it makes
>   little sense to mix up their stats.

Good idea.
Actually, I'm thinking about a new option that shows only live
allocations (excluding freed pages) in the table. FYI, the current
numbers are total allocated memory (including freed pages).

>
> - Add commas to the numbers, to make it easier to read and compare
>   larger numbers.

OK

>
> - Right-align the numbers, to make them easy to compare when they
>   are placed under each other.

OK

>
> - Merge the 'count' and 'bytes' stats into a single line, so that
>   it's more compact, easier to navigate, but also only comparable
>   type numbers are placed under each other.

OK

>
> I.e. something like this (mockup) output:
>
>    SUMMARY (page allocator)
>    ========================
>
>    Pages allocated+freed:       12,593   [     51,630,080 bytes ]
>
>    Pages allocated-only:         2,342   [      1,235,010 bytes ]
>    Pages freed-only:                67   [        135,311 bytes ]
>
>    Page allocation failures :        0

Looks a lot better! One thing I need to tell you is that the numbers
are not pages but requests.

> >
> > Order     UNMOVABLE   RECLAIMABLE       MOVABLE      RESERVED   CMA/ISOLATE
> > -----  ------------  ------------  ------------  ------------  ------------
> >     0            32             0         12557             0             0
> >     1             0             0             0             0             0
> >     2             0             4             0             0             0
> >     3             0             0             0             0             0
> >     4             0             0             0             0             0
> >     5             0             0             0             0             0
> >     6             0             0             0             0             0
> >     7             0             0             0             0             0
> >     8             0             0             0             0             0
> >     9             0             0             0             0             0
> >    10             0             0             0             0             0
>
> Here I'd suggest the following refinements:
>
> - Use '.' instead of '0', to make actual nonzero values stand out
>   visually, while still keeping a tabular format

OK

>
> - Merge the 'Reserved', 'CMA/Isolate' columns into a single 'Special'
>   column: this will be zero in 99.9% of the cases, as those pages
>   mostly deal with driver interfaces, mostly used during init/deinit.

I'm not sure about the CMA pages..

>
> - Capitalize less.

OK

>
> - Use comma-separated numbers for better readability.

OK

>
> So something like this:
>
>
> Order     Unmovable   Reclaimable       Movable       Special
> -----  ------------  ------------  ------------  ------------
>     0            32             .        12,557             .
>     1             .             .             .             .
>     2             .             4             .             .
>     3             .             .             .             .
>     4             .             .             .             .
>     5             .             .             .             .
>     6             .             .             .             .
>     7             .             .             .             .
>     8             .             .             .             .
>     9             .             .             .             .
>    10             .             .             .             .
>
>
> Look for example how easily noticeable the '4' value is now, while it
> was pretty easy to miss in the original table.

Indeed!

> >
> > I have some idea how to improve it. But I'd also like to hear other
> > idea, suggestion, feedback and so on.
>
> So there's one thing that would be useful: to track pages allocated on
> one node, but freed on another. Those kinds of allocation/free
> patterns are especially expensive and might make sense to visualize.

I think it can be done easily as slab analysis already contains the
info.

Thanks for your useful feedback!
Namhyung
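
P.S. For illustration only, the table conventions agreed on above ('.' for zero, comma-separated and right-aligned numbers, a merged 'Special' column) could be prototyped as below. This is a rough Python sketch and not perf code: the helper names format_cell and format_order_table and the 12-character column width are assumptions, and the sample counts are the ones from the table in this thread.

```python
def format_cell(count, width=12):
    """Right-align a count; show '.' for zero and add thousands separators."""
    text = "." if count == 0 else f"{count:,}"
    return text.rjust(width)


def format_order_table(counts,
                       columns=("Unmovable", "Reclaimable", "Movable", "Special")):
    """Render per-order request counts as a plain-text table.

    `counts` maps an allocation order to a tuple with one count per column.
    (Hypothetical helper, assumed layout.)
    """
    lines = ["Order  " + "  ".join(name.rjust(12) for name in columns)]
    lines.append("-----  " + "  ".join("-" * 12 for _ in columns))
    for order in sorted(counts):
        cells = "  ".join(format_cell(n) for n in counts[order])
        lines.append(f"{order:>5}  {cells}")
    return "\n".join(lines)


# Counts from the example output; 'Special' is Reserved plus CMA/Isolate.
sample = {order: (0, 0, 0, 0) for order in range(11)}
sample[0] = (32, 0, 12557, 0)
sample[2] = (0, 4, 0, 0)
print(format_order_table(sample))
```
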