From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751428AbbJBFXK (ORCPT ); Fri, 2 Oct 2015 01:23:10 -0400 Received: from mail-pa0-f48.google.com ([209.85.220.48]:36003 "EHLO mail-pa0-f48.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751325AbbJBFXH (ORCPT ); Fri, 2 Oct 2015 01:23:07 -0400 From: Namhyung Kim To: Arnaldo Carvalho de Melo Cc: Ingo Molnar , Peter Zijlstra , Jiri Olsa , LKML , Frederic Weisbecker , Stephane Eranian , David Ahern , Andi Kleen Subject: [RFC/PATCH 00/38] perf tools: Speed-up perf report by using multi thread (v5) Date: Fri, 2 Oct 2015 14:18:41 +0900 Message-Id: <1443763159-29098-1-git-send-email-namhyung@kernel.org> X-Mailer: git-send-email 2.6.0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello, This patchset converts perf report to use multiple threads in order to speed up the processing on large data files. I can see a minimum ~30% of speedup with this change. The code is still experimental and contains many rough edges. But I'd like to share and give some feedbacks. * changes in v5) - move struct machines from perf_session to perf_tool (Arnaldo) - add --num-thread option to perf report (David) - separate track_mmap code for review (Jiri) - fix some minor bugs (Jiri) * changes in v4) - rename to *_find(new)_by_time() - also sort struct map by time - use 'perf_has_index' global variable - fix an off-by-one bug in index generation - rebased onto the recent atomic thread refcounting - remove missing threads tree in favor of thread rwlock * changes in v3) - handle header (metadata) same as sample data (at index 0) - maintain libunwind address space in map_groups instead of thread - use *_time API only for indexed data file - resolve callchain with the *_time API - use dso__data_get/put_fd() to protect access to fd - synthesize COMM event for command line workload * changes in v2) - rework with single indexed data file rather than multiple files in a directory The perf report processes (sample) events like below: 1. preprocess sample to get matching thread/dso/symbol info 2. insert it to hists rbtree (with callchain tree) based on the info 3. optionally collapse hist entries that match given sort key(s) 4. resort hist entries (by overhead) for output 5. display the hist entries The stage 1 is a preprocessing and mostly act like a read-only operation in that it doesn't change a machine state during the sample processing. Meta events like fork, comm and mmap can change the machine/thread state but symbols can be loaded during the processing (stage 2). The stage 2 consumes most of the time especially with callchains and --children option is enabled. And this work can be easily patitioned as each sample is independent to others. But the resulting hists must be combined/collapsed to a single global hists before going to further steps. The stage 3 is optional and only needed by certain sort keys - but with stage 2 paralellized, it needs to be done always. The stage 4 and 5 works on whole hists so must be done serially. So my approach is like this: Partially do stage 1 first - but only for meta events that changes machine state. To do this I add a dummy tracking event to perf record and make it collect such meta events only. They are saved as normal data and processed before sample events at perf report time. This also requires to handle multiple sample data concurrently and to find a corresponding machine state when processing samples. On a large profiling session, many tasks were created and exited so pid might be recycled (even more than once!). To deal with it, I managed to have thread, map_groups, map and comm in time sorted. The only remaining thing is symbol loading as it's done lazily when sample requires it. With that being done, the stage 2 can be done by multiple threads. I also save each sample data (per-cpu or per-thread) in separate files during record and then merge them into a single data file with an index table. On perf report time, each region of sample data will be processed by each thread. And symbol loading is protected by a mutex lock. For DWARF post-unwinding, dso cache data also needs to be protected by a lock and this caused a huge contention. I made it to search the rbtree speculatively first and then, if it didn't find one, search it again under the dso lock. Please take a look at it if it's acceptable. The patch 1-9 are to support indexing for data file. With --index option, perf record will create a intermediate directory and then save meta events and sample data to separate files. And finally it'll build an index table and concatenate the data files (and also remove the intermediate directory). The patch 10-24 are to manage machine and thread state using timestamp so that it can be searched when processing samples. The patch 25-35 are to implement parallel report. And on patch 36 I implemented 'perf data index' command to build an index table for a given data file. The last two patches are to improve speed when using multi-thread. This patchset didn't change perf record to use multi-thread. But I think it can be easily done later if needed. Note that output has a slight difference to original version when compared using indexed data file. But they're mostly unresolved symbols for callchains. Here is the result: This is just elapsed time measured by 'perf stat -r 5'. Note that overall performance in v5 slows down than v4 by ~30% possibly due to more agressive use of atomic thread refcounting and rwlocks. The data file was recorded during kernel build with fp callchain and size is 2.1GB. The machine has 6 core with hyper-threading enabled and I got a similar result on my laptop too. perf report --children --no-children + --call-graph none ------------- ------------- ------------------- current 497.568804423 131.312637209 58.972991789 with index 443.527574644 110.671594167 40.755612227 + --multi-thread 304.064133574 63.175497780 18.815425955 This result is with 7.7GB data file using libunwind for callchain. perf report --children --no-children + --call-graph none ------------- ------------- ------------------- current 20.600862660 13.723705468 6.865225263 with index 17.926679276 11.492099330 5.254594583 + --multi-thread 10.549591227 6.891150645 3.261140410 This result is with same file but using libdw for callchain unwind. perf report --children --no-children + --call-graph none ------------- ------------- ------------------- current 259.553111664 229.953713435 7.083999430 with index 239.071895690 222.996473823 4.967384924 + --multi-thread 119.465870204 110.833195062 3.144185106 On my archlinux system, callchain unwind using libdw is much slower than libunwind. I'm using elfutils version 0.161. Also I don't know why --children takes less time than --no-children. Anyway we can see the --multi-thread performance is much better for each case. This patch is based on acme/perf/core - commit ("dbc67409fa91 perf list: Do event name substring search as last resort when no events found"). You can get it from 'perf/threaded-v5' branch on my tree at: git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git Please take a look and play with it. Any comments are welcome! :) Thanks, Namhyung Namhyung Kim (38): perf tools: Use a software dummy event to track task/mmap events perf tools: Save mmap_param.len instead of mask perf tools: Move auxtrace_mmap field to struct perf_evlist perf tools: pass perf_mmap desc directly perf tools: Create separate mmap for dummy tracking event perf tools: Extend perf_evlist__mmap_ex() to use track mmap perf tools: Add HEADER_DATA_INDEX feature perf tools: Handle indexed data file properly perf record: Add --index option for building index table perf report: Skip dummy tracking event perf tools: Introduce thread__comm(_str)_by_time() helpers perf tools: Add a test case for thread comm handling perf tools: Use thread__comm_by_time() when adding hist entries perf tools: Convert dead thread list into rbtree perf tools: Introduce machine__find*_thread_by_time() perf tools: Add a test case for timed thread handling perf tools: Maintain map groups list in a leader thread perf tools: Introduce thread__find_addr_location_by_time() and friends perf callchain: Use thread__find_addr_location_by_time() and friends perf tools: Add a test case for timed map groups handling perf tools: Save timestamp of a map creation perf tools: Introduce map_groups__{insert,find}_by_time() perf tools: Use map_groups__find_addr_by_time() perf tools: Add testcase for managing maps with time perf callchain: Maintain libunwind's address space in map_groups perf session: Pass struct events stats to event processing functions perf hists: Pass hists struct to hist_entry_iter struct perf tools: Move BUILD_ID_SIZE definition to perf.h perf tools: Introduce machines__new/delete() perf session: Separate struct machines from session perf report: Parallelize perf report using multi-thread perf tools: Fix progress ui to support multi thread perf report: Add --multi-thread option and config item perf session: Handle index files generally perf report: Add --num-thread option to control number of thread perf data: Implement 'index' subcommand perf tools: Reduce lock contention in perf_event__preprocess_sample() perf tools: Skip dso front cache for indexed data file tools/perf/Documentation/perf-data.txt | 25 +- tools/perf/Documentation/perf-record.txt | 4 + tools/perf/Documentation/perf-report.txt | 10 + tools/perf/builtin-annotate.c | 12 +- tools/perf/builtin-buildid-cache.c | 14 +- tools/perf/builtin-buildid-list.c | 16 +- tools/perf/builtin-data.c | 358 +++++++++++++++++++++++++++- tools/perf/builtin-diff.c | 16 +- tools/perf/builtin-evlist.c | 18 +- tools/perf/builtin-inject.c | 6 + tools/perf/builtin-kmem.c | 14 +- tools/perf/builtin-kvm.c | 14 +- tools/perf/builtin-lock.c | 7 +- tools/perf/builtin-mem.c | 14 +- tools/perf/builtin-record.c | 196 +++++++++++++-- tools/perf/builtin-report.c | 102 +++++++- tools/perf/builtin-sched.c | 8 +- tools/perf/builtin-script.c | 34 ++- tools/perf/builtin-timechart.c | 12 +- tools/perf/builtin-top.c | 26 +- tools/perf/builtin-trace.c | 10 +- tools/perf/perf.c | 1 + tools/perf/perf.h | 6 + tools/perf/tests/Build | 4 + tools/perf/tests/builtin-test.c | 16 ++ tools/perf/tests/dwarf-unwind.c | 13 +- tools/perf/tests/hists_common.c | 3 +- tools/perf/tests/hists_cumulate.c | 1 + tools/perf/tests/hists_filter.c | 1 + tools/perf/tests/hists_link.c | 6 +- tools/perf/tests/hists_output.c | 1 + tools/perf/tests/tests.h | 4 + tools/perf/tests/thread-comm.c | 47 ++++ tools/perf/tests/thread-lookup-time.c | 179 ++++++++++++++ tools/perf/tests/thread-map-time.c | 90 +++++++ tools/perf/tests/thread-mg-share.c | 7 +- tools/perf/tests/thread-mg-time.c | 93 ++++++++ tools/perf/tests/topology.c | 19 +- tools/perf/ui/browsers/hists.c | 30 ++- tools/perf/ui/gtk/hists.c | 3 + tools/perf/util/build-id.c | 16 +- tools/perf/util/build-id.h | 3 - tools/perf/util/data-convert-bt.c | 8 +- tools/perf/util/dso.c | 2 +- tools/perf/util/dso.h | 1 + tools/perf/util/event.c | 144 +++++++++-- tools/perf/util/event.h | 1 - tools/perf/util/evlist.c | 209 ++++++++++++---- tools/perf/util/evlist.h | 15 +- tools/perf/util/evsel.h | 15 ++ tools/perf/util/header.c | 68 +++++- tools/perf/util/header.h | 3 + tools/perf/util/hist.c | 117 ++++++--- tools/perf/util/hist.h | 7 +- tools/perf/util/intel-bts.c | 2 +- tools/perf/util/intel-pt.c | 2 +- tools/perf/util/machine.c | 320 +++++++++++++++++++++---- tools/perf/util/machine.h | 17 +- tools/perf/util/map.c | 84 ++++++- tools/perf/util/map.h | 37 ++- tools/perf/util/probe-event.c | 2 +- tools/perf/util/session.c | 394 +++++++++++++++++++++++++++---- tools/perf/util/session.h | 9 +- tools/perf/util/symbol-elf.c | 2 +- tools/perf/util/symbol.c | 7 +- tools/perf/util/thread.c | 205 +++++++++++++++- tools/perf/util/thread.h | 28 ++- tools/perf/util/tool.h | 16 ++ tools/perf/util/unwind-libdw.c | 12 +- tools/perf/util/unwind-libunwind.c | 55 +++-- tools/perf/util/unwind.h | 15 +- 71 files changed, 2868 insertions(+), 388 deletions(-) create mode 100644 tools/perf/tests/thread-comm.c create mode 100644 tools/perf/tests/thread-lookup-time.c create mode 100644 tools/perf/tests/thread-map-time.c create mode 100644 tools/perf/tests/thread-mg-time.c -- 2.6.0