Date: Thu, 12 Mar 2015 11:41:19 +0100
From: Ingo Molnar
To: Namhyung Kim
Cc: Arnaldo Carvalho de Melo, Peter Zijlstra, Jiri Olsa, LKML, David Ahern, Minchan Kim, Joonsoo Kim
Subject: Re: [RFC/PATCHSET 0/6] perf kmem: Implement page allocation analysis (v1)
Message-ID: <20150312104119.GA5978@gmail.com>
References: <1426145571-3065-1-git-send-email-namhyung@kernel.org>
In-Reply-To: <1426145571-3065-1-git-send-email-namhyung@kernel.org>

* Namhyung Kim wrote:

> Hello,
>
> Currently perf kmem command only analyzes SLAB memory allocation. And
> I'd like to introduce page allocation analysis also. Users can use
> --slab and/or --page option to select it. If none of these options
> are used, it does slab allocation analysis for backward compatibility.
>
> The patch 1-3 are bugfix and cleanups. Patch 4 implements basic
> support for page allocation analysis, patch 5 deals with the callsite
> and finally patch 6 implements sorting.
>
> In this patchset, I used two kmem events: kmem:mm_page_alloc and
> kmem_page_free for analysis as they can track every memory
> allocation/free path AFAIK. However, unlike slab tracepoint events,
> those page allocation events don't provide callsite info directly. So
> I recorded callchains and extracted callsites like below:

Really cool features!
I have a couple of output typography observations:

> Normal page allocation callchains look like this:
>
>   360a7e __alloc_pages_nodemask
>   3a711c alloc_pages_current
>   357bc7 __page_cache_alloc   <-- callsite
>   357cf6 pagecache_get_page
>    48b0a prepare_pages
>    494d3 __btrfs_buffered_write
>    49cdf btrfs_file_write_iter
>   3ceb6e new_sync_write
>   3cf447 vfs_write
>   3cff99 sys_write
>   7556e9 system_call
>     f880 __write_nocancel
>    33eb9 cmd_record
>    4b38e cmd_kmem
>    7aa23 run_builtin
>    27a9a main
>    20800 __libc_start_main
>
> But first two are internal page allocation functions so it should be
> skipped. To determine such allocation functions, I used following regex:
>
>   ^_?_?(alloc|get_free|get_zeroed)_pages?
>
> This gave me a following list of functions (you can see this with -v):
>
>   alloc func: __get_free_pages
>   alloc func: get_zeroed_page
>   alloc func: alloc_pages_exact
>   alloc func: __alloc_pages_direct_compact
>   alloc func: __alloc_pages_nodemask
>   alloc func: alloc_page_interleave
>   alloc func: alloc_pages_current
>   alloc func: alloc_pages_vma
>   alloc func: alloc_page_buffers
>   alloc func: alloc_pages_exact_nid
>
> After skipping those function, it got '__page_cache_alloc'.
>
> Other information such as allocation order, migration type and gfp
> flags are provided by tracepoint events.
>
> Basically the output will be sorted by total allocation bytes, but you
> can change it by using -s/--sort option. The following sort keys are
> added to support page analysis: page, order, mtype, gfp. Existing
> 'callsite', 'bytes' and 'hit' sort keys also can be used.
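
As an aside on the callsite extraction quoted above, here is a minimal
sketch of the idea in Python (not perf's C code, and not taken from the
patchset). It assumes the callchain is already resolved to symbol names,
innermost frame first, and it matches names directly against the quoted
regex. The names ALLOC_FUNC_RE and find_callsite are hypothetical.

```python
import re

# The regex quoted in the patchset description; everything else here is illustrative.
ALLOC_FUNC_RE = re.compile(r"^_?_?(alloc|get_free|get_zeroed)_pages?")


def find_callsite(callchain):
    """Return the first symbol that is not a page allocation function.

    `callchain` is a list of symbol names, innermost frame first.
    Returns None when every frame matches the regex.
    """
    for sym in callchain:
        if not ALLOC_FUNC_RE.match(sym):
            return sym
    return None


chain = [
    "__alloc_pages_nodemask",
    "alloc_pages_current",
    "__page_cache_alloc",
    "pagecache_get_page",
    "prepare_pages",
]
print(find_callsite(chain))  # prints: __page_cache_alloc
```

The first two frames match the regex and are skipped, so the third frame
is reported as the callsite, as in the callchain quoted above.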
>
> An example follows:
>
>   # perf kmem record --slab --page sleep 1
>   [ perf record: Woken up 0 times to write data ]
>   [ perf record: Captured and wrote 49.277 MB perf.data (191027 samples) ]
>
>   # perf kmem stat --page --caller -l 10 -s order,hit
>
> --------------------------------------------------------------------------------------------
> Total_alloc/Per |      Hit | Order | Migrate type | GFP flag | Callsite

s/Per/Size
s/Hit/Hits
s/Migrate type/Migration type
s/GFP flag/GFP flags

?

> --------------------------------------------------------------------------------------------
>     65536/16384 |        4 |     2 |  RECLAIMABLE | 00285250 | new_slab
>   51347456/4096 |    12536 |     0 |      MOVABLE | 0102005a | __page_cache_alloc
>      53248/4096 |       13 |     0 |    UNMOVABLE | 002084d0 | pte_alloc_one
>      40960/4096 |       10 |     0 |      MOVABLE | 000280da | handle_mm_fault
>      28672/4096 |        7 |     0 |    UNMOVABLE | 000000d0 | __pollwait
>      20480/4096 |        5 |     0 |      MOVABLE | 000200da | do_wp_page
>      20480/4096 |        5 |     0 |      MOVABLE | 000200da | do_cow_fault
>      16384/4096 |        4 |     0 |    UNMOVABLE | 00000200 | __tlb_remove_page
>      16384/4096 |        4 |     0 |    UNMOVABLE | 000084d0 | __pmd_alloc
>       8192/4096 |        2 |     0 |    UNMOVABLE | 000084d0 | __pud_alloc
>             ... |      ... |   ... |          ... |      ... | ...
> --------------------------------------------------------------------------------------------
>
> SUMMARY (page allocator)
> ========================
> Total alloc requested: 12593
> Total alloc failure  : 0
> Total bytes allocated: 51630080
> Total free  requested: 115
> Total free  unmatched: 67
> Total bytes freed    : 471040

I'd suggest the following changes to the format:

 - Collapse stats into 3 groups: 'allocated+freed', 'allocated only',
   'freed only', depending on how much of their lifetime we've
   managed to trace. These groups are really distinct and it makes
   little sense to mix up their stats.
 - Add commas to the numbers, to make it easier to read and compare
   larger numbers.
 - Right-align the numbers, to make them easy to compare when they
   are placed under each other.
 - Merge the 'count' and 'bytes' stats into a single line, so that
   it's more compact and easier to navigate, and so that only numbers
   of a comparable type are placed under each other.

I.e. something like this (mockup) output:

  SUMMARY (page allocator)
  ========================

  Pages allocated+freed:        12,593   [ 51,630,080 bytes ]
  Pages allocated-only:          2,342   [  1,235,010 bytes ]
  Pages freed-only:                 67   [    135,311 bytes ]

  Page allocation failures :         0

> Order     UNMOVABLE   RECLAIMABLE       MOVABLE      RESERVED   CMA/ISOLATE
> -----  ------------  ------------  ------------  ------------  ------------
>     0            32             0         12557             0             0
>     1             0             0             0             0             0
>     2             0             4             0             0             0
>     3             0             0             0             0             0
>     4             0             0             0             0             0
>     5             0             0             0             0             0
>     6             0             0             0             0             0
>     7             0             0             0             0             0
>     8             0             0             0             0             0
>     9             0             0             0             0             0
>    10             0             0             0             0             0

Here I'd suggest the following refinements:

 - Use '.' instead of '0', to make actual nonzero values stand out
   visually, while still keeping a tabular format.
 - Merge the 'Reserved' and 'CMA/Isolate' columns into a single
   'Special' column: this will be zero in 99.9% of the cases, as
   those pages mostly deal with driver interfaces, mostly used
   during init/deinit.
 - Capitalize less.
 - Use comma-separated numbers for better readability.

So something like this:

  Order     Unmovable   Reclaimable       Movable       Special
  -----  ------------  ------------  ------------  ------------
      0            32             .        12,557             .
      1             .             .             .             .
      2             .             4             .             .
      3             .             .             .             .
      4             .             .             .             .
      5             .             .             .             .
      6             .             .             .             .
      7             .             .             .             .
      8             .             .             .             .
      9             .             .             .             .
     10             .             .             .             .

Look, for example, at how easily noticeable the '4' value is now, while
it was pretty easy to miss in the original table.

> I have some idea how to improve it. But I'd also like to hear other
> idea, suggestion, feedback and so on.

So there's one thing that would be useful: to track pages allocated on
one node, but freed on another. Those kinds of allocation/free patterns
are especially expensive and might make sense to visualize.

Thanks,

        Ingo
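
To make the formatting suggestions above concrete, here is a rough
sketch in Python (perf itself is C, so this only illustrates the layout
rules). It right-aligns counts, adds thousands separators, prints '.'
for zero and folds the last two migration-type columns into 'Special'.
All function names and column widths are hypothetical.

```python
def fmt_count(n, width=12):
    """Right-align n with thousands separators; render zero as '.'."""
    text = "." if n == 0 else f"{n:,}"
    return text.rjust(width)


def merge_special(row):
    """Fold the last two columns (Reserved, CMA/Isolate) into one 'Special' count."""
    return list(row[:3]) + [row[3] + row[4]]


def format_order_table(rows):
    """rows[i] holds the five per-migration-type counts for order i."""
    names = ("Unmovable", "Reclaimable", "Movable", "Special")
    lines = [
        "Order  " + "  ".join(name.rjust(12) for name in names),
        "-----  " + "  ".join("-" * 12 for _ in names),
    ]
    for order, row in enumerate(rows):
        cells = "  ".join(fmt_count(c) for c in merge_special(row))
        lines.append(f"{order:>5}  {cells}")
    return "\n".join(lines)


def summary_line(label, pages, nbytes):
    """One line per group: page count and byte count side by side."""
    return f"{label:<26}{pages:>10,}   [ {nbytes:>10,} bytes ]"


rows = [[32, 0, 12557, 0, 0], [0, 0, 0, 0, 0], [0, 4, 0, 0, 0]]
print(format_order_table(rows))
print(summary_line("Pages allocated+freed:", 12593, 51630080))
```

The input rows are the first three orders from the quoted table, and the
summary values are the ones used in the mockup.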