Subject: Re: [PATCH v3 1/n] perf/core: addressing 4x slowdown during per-process profiling of STREAM benchmark on Intel Xeon Phi
To: Mark Rutland
Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
 Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov,
 David Carrillo-Cisneros, Stephane Eranian, linux-kernel@vger.kernel.org
References: <09226446-39b9-9bd2-d60f-b9bb947987c5@linux.intel.com> <20170615195618.GA8807@leverpostej>
From: Alexey Budankov
Organization: Intel Corp.
Message-ID: <07a76338-4c71-569a-d36e-7d6bcd10bd74@linux.intel.com>
Date: Fri, 16 Jun 2017 01:10:10 +0300
In-Reply-To: <20170615195618.GA8807@leverpostej>

Hi,

On 15.06.2017 22:56, Mark Rutland wrote:
> Hi,
>
> On Thu, Jun 15, 2017 at 08:41:42PM +0300, Alexey Budankov wrote:
>> This series of patches continues v2 and addresses the comments captured
>> so far.
>
> As a general note, please keep any change log stuff out of the commit
> message, and move that just before the diff. It generally doesn't make
> sense for that to go in the commit message.

Accepted.

>
>> Specifically this patch replaces the pinned_groups and flexible_groups
>> lists of perf_event_context with red-black, cpu-indexed trees, avoiding
>> data structure duplication and introducing the possibility to iterate
>> event groups for a specific CPU only.
>
> If you use --per-thread, I take it the overhead is significantly
> lowered?

Could you please elaborate?

>
> As another general thing, please aim to write the commit message so that
> it'll make sense in N years' time, in the git history. Describe the
> whole problem, and the intended solution.
>
> I had to go diving into mail archives to understand the problem you're
> trying to solve, and I shouldn't have to.
>
> For example, this commit message could be something like:
>
>   perf/core: use rb trees for pinned/flexible groups
>
>   By default, the userspace perf tool opens per-cpu task-bound events
>   when sampling, so for N logical events requested by the user, the tool
>   will open N * NR_CPUS events.
>
>   In the kernel, we mux events with a hrtimer, periodically rotating the
>   event list and trying to schedule each in turn. We skip events whose
>   cpu filter doesn't match. So when we get unlucky, we can walk N *
>   (NR_CPUS - 1) events pointlessly for each hrtimer invocation.
>
>   This has been observed to result in significant overhead when running
>   the STREAM benchmark on 272 core Xeon Phi systems.
>
>   One way to avoid this is to place our events into an rb tree sorted by
>   CPU filter, so that our hrtimer can skip to the current CPU's
>   sub-tree, and ignore everything else.
>
>   As a step towards that, this patch replaces our event lists with rb
>   trees.
>
> ... which hopefully makes sense now, and will in 2, 5, 10 years' time.
>

Accepted. Thanks.

>>
>> Signed-off-by: Alexey Budankov
>
> Have you thrown Vince's perf fuzzer at this?
>
> If you haven't, please do. It can be found in the fuzzer directory of:
>
>   https://github.com/deater/perf_event_tests

Accepted.

>> ---
>>  include/linux/perf_event.h |  34 +++-
>>  kernel/events/core.c       | 386 +++++++++++++++++++++++++++++++++------------
>>  2 files changed, 315 insertions(+), 105 deletions(-)
>
>
>
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index 24a6358..2c1dcf1 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -574,6 +574,27 @@ struct perf_event {
>>  	struct list_head		sibling_list;
>>
>>  	/*
>> +	 * Node on the pinned or flexible tree located at the event context;
>> +	 * the node may be empty in case its event is not directly attached
>> +	 * to the tree but to group_list list of the event directly
>> +	 * attached to the tree;
>> +	 */
>> +	struct rb_node			group_node;
>> +	/*
>> +	 * List keeps groups allocated for the same cpu;
>> +	 * the list may be empty in case its event is not directly
>> +	 * attached to the tree but to group_list list of the event directly
>> +	 * attached to the tree;
>> +	 */
>> +	struct list_head		group_list;
>> +	/*
>> +	 * Entry into the group_list list above;
>> +	 * the entry may be attached to the self group_list list above
>> +	 * in case the event is directly attached to the pinned or
>> +	 * flexible tree;
>> +	 */
>> +	struct list_head		group_list_entry;
>
> We already have group_entry for threading the RB tree. I don't see why
> we need two new list heads.

Ok. For a group leader event, its group_entry may be used instead of the new
group_list_entry.

>
>> +	/*
>>  	 * We need storage to track the entries in perf_pmu_migrate_context; we
>>  	 * cannot use the event_entry because of RCU and we want to keep the
>>  	 * group in tact which avoids us using the other two entries.
>> @@ -741,8 +762,17 @@ struct perf_event_context {
>>  	struct mutex			mutex;
>>
>>  	struct list_head		active_ctx_list;
>> -	struct list_head		pinned_groups;
>> -	struct list_head		flexible_groups;
>> +	/*
>> +	 * RB tree for pinned groups; keeps event's group_node
>> +	 * nodes of attached pinned groups;
>> +	 */
>> +	struct rb_root			pinned_tree;
>> +	/*
>> +	 * RB tree for flexible groups; keeps event's group_node
>> +	 * nodes of attached flexible groups;
>> +	 */
>> +	struct rb_root			flexible_tree;
>
> We didn't need these comments before, and I don't think we need them
> now.
>
> Any reason not to stick with the existing names?
>
> The existing {pinned,flexible}_groups names are fine regardless of
> whether a list or tree is used, and make it clear that the elements are
> the group leaders, not all events.

Yes, that makes sense. Only the type of the data structures needs to change.
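So in the next version the declarations would end up roughly as follows
(sketch of the agreed change only, keeping the existing names and just
switching the type; not the actual v4 hunk):

-	struct list_head		pinned_groups;
-	struct list_head		flexible_groups;
+	struct rb_root			pinned_groups;
+	struct rb_root			flexible_groups;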
>
>> +
>>  	struct list_head		event_list;
>>  	int				nr_events;
>>  	int				nr_active;
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index bc63f8d..a3531f9 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -1458,13 +1458,142 @@ static enum event_type_t get_event_type(struct perf_event *event)
>>  	return event_type;
>>  }
>>
>> -static struct list_head *
>> -ctx_group_list(struct perf_event *event, struct perf_event_context *ctx)
>> +static struct rb_root *
>> +ctx_group_cpu_tree(struct perf_event *event, struct perf_event_context *ctx)
>>  {
>>  	if (event->attr.pinned)
>> -		return &ctx->pinned_groups;
>> +		return &ctx->pinned_tree;
>>  	else
>> -		return &ctx->flexible_groups;
>> +		return &ctx->flexible_tree;
>> +}
>> +
>> +/*
>> + * Insert a group into a tree using event->cpu as a key. If event->cpu node
>> + * is already attached to the tree then the event is added to the attached
>> + * group's group_list list.
>> + */
>> +static void
>> +perf_cpu_tree_insert(struct rb_root *tree, struct perf_event *event)
>> +{
>> +	struct rb_node **node;
>> +	struct rb_node *parent;
>> +
>> +	WARN_ON_ONCE(!tree || !event);
>> +
>> +	node = &tree->rb_node;
>> +	parent = *node;
>
> The first iteration of the loop handles this, so it can go.

If the tree is empty, parent would be left uninitialized, which is harmful.

>
>> +
>> +	while (*node) {
>> +		struct perf_event *node_event = container_of(*node,
>> +				struct perf_event, group_node);
>> +
>> +		parent = *node;
>> +
>> +		if (event->cpu < node_event->cpu) {
>> +			node = &parent->rb_left;
>> +		} else if (event->cpu > node_event->cpu) {
>> +			node = &parent->rb_right;
>> +		} else {
>> +			list_add_tail(&event->group_list_entry,
>> +					&node_event->group_list);
>> +			return;
>
> Why aren't we putting all group leaders into the tree?

Because the tree is indexed by cpu only; if two groups share a cpu, the later
one is linked into the group_list of the group that is directly attached to
the tree via its group_node.

>
> I don't think this is what Peter meant when he suggested a threaded rb
> tree.

Yes, that's right. This implements cpu-indexed rb trees with a linked list
attached to every tree node; each list contains only the event groups
allocated for that same cpu.

>
> My understanding was that every group leader should be in the rb tree,
> and every group leader should be in the same list that threads the whole
> tree. Both the rb tree and list have to be maintained with the same
> order.
>
> Thus, you can walk the rb tree to find the first event in the sub-list
> that you care about, then walk that sub-list.
>
> Note that this means we need to shuffle the list and rb tree to rotate
> events in each sub-tree.

With this design we also rotate a list, but one that contains only the groups
for the appropriate cpu, and the tree itself is not touched.

>
>> +		}
>> +	}
>> +
>> +	list_add_tail(&event->group_list_entry, &event->group_list);
>> +
>> +	rb_link_node(&event->group_node, parent, node);
>> +	rb_insert_color(&event->group_node, tree);
>> +}
>> +
>> +/*
>> + * Delete a group from a tree. If the group is directly attached to the tree
>> + * it also detaches all groups on the group's group_list list.
>> + */
>> +static void
>> +perf_cpu_tree_delete(struct rb_root *tree, struct perf_event *event)
>> +{
>> +	WARN_ON_ONCE(!tree || !event);
>> +
>> +	if (RB_EMPTY_NODE(&event->group_node)) {
>> +		list_del_init(&event->group_list_entry);
>> +	} else {
>> +		struct perf_event *list_event, *list_next;
>> +
>> +		rb_erase(&event->group_node, tree);
>> +		list_for_each_entry_safe(list_event, list_next,
>> +				&event->group_list, group_list_entry)
>> +			list_del_init(&list_event->group_list_entry);
>
> At this point, all the other group leaders with the same filter are gone
> from the rb tree, since we didn't insert any of them into the rb tree in
> place of the leader we deleted.
>
>> +	}
>> +}
>> +
>> +/*
>> + * Find group_list list by a cpu key and call provided callback for every
>> + * group on the list. Iteration stops if the callback returns non zero.
>> + */
>> +
>> +typedef int (*perf_cpu_tree_callback_t)(struct perf_event *, void *);
>> +
>> +static int
>> +perf_cpu_tree_iterate_cpu(struct rb_root *tree, int cpu,
>> +		perf_cpu_tree_callback_t callback, void *data)
>> +{
>> +	int ret = 0;
>> +	struct rb_node *node;
>> +	struct perf_event *event;
>> +
>> +	WARN_ON_ONCE(!tree);
>> +
>> +	node = tree->rb_node;
>> +
>> +	while (node) {
>> +		struct perf_event *node_event = container_of(node,
>> +				struct perf_event, group_node);
>> +
>> +		if (cpu < node_event->cpu) {
>> +			node = node->rb_left;
>> +		} else if (cpu > node_event->cpu) {
>> +			node = node->rb_right;
>> +		} else {
>> +			list_for_each_entry(event, &node_event->group_list,
>> +					group_list_entry) {
>> +				ret = callback(event, data);
>> +				if (ret)
>> +					return ret;
>> +			}
>> +		}
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +/*
>> + * Iterate a tree and call provided callback for every group in the tree.
>> + * Iteration stops if the callback returns non zero.
>> + */
>> +static int
>> +perf_cpu_tree_iterate(struct rb_root *tree,
>> +		perf_cpu_tree_callback_t callback, void *data)
>> +{
>> +	int ret = 0;
>> +	struct rb_node *node;
>> +	struct perf_event *event;
>> +
>> +	WARN_ON_ONCE(!tree);
>> +
>> +	for (node = rb_first(tree); node; node = rb_next(node)) {
>> +		struct perf_event *node_event = container_of(node,
>> +				struct perf_event, group_node);
>> +
>> +		list_for_each_entry(event, &node_event->group_list,
>> +				group_list_entry) {
>> +			ret = callback(event, data);
>> +			if (ret)
>> +				return ret;
>> +		}
>> +	}
>> +
>> +	return 0;
>>  }
>
> If you need to iterate over every event, you can use the list that
> threads the whole tree.
>
>>
>>  /*
>> @@ -1485,12 +1614,12 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
>>  	 * perf_group_detach can, at all times, locate all siblings.
>>  	 */
>>  	if (event->group_leader == event) {
>> -		struct list_head *list;
>> +		struct rb_root *tree;
>>
>>  		event->group_caps = event->event_caps;
>>
>> -		list = ctx_group_list(event, ctx);
>> -		list_add_tail(&event->group_entry, list);
>> +		tree = ctx_group_cpu_tree(event, ctx);
>> +		perf_cpu_tree_insert(tree, event);
>
> Can we wrap this in a helper that finds the sub-tree and performs the
> insert?

Ok. add_to_groups().
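For example, something like this (untested sketch just to illustrate the
helper; the add_to_groups() name is only the placeholder used above):

static void
add_to_groups(struct perf_event *event, struct perf_event_context *ctx)
{
	struct rb_root *tree = ctx_group_cpu_tree(event, ctx);

	perf_cpu_tree_insert(tree, event);
}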
>
>>  	}
>>
>>  	list_update_cgroup_event(event, ctx, true);
>> @@ -1680,8 +1809,12 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
>>
>>  	list_del_rcu(&event->event_entry);
>>
>> -	if (event->group_leader == event)
>> -		list_del_init(&event->group_entry);
>> +	if (event->group_leader == event) {
>> +		struct rb_root *tree;
>> +
>> +		tree = ctx_group_cpu_tree(event, ctx);
>> +		perf_cpu_tree_delete(tree, event);
>
> ... likewise?

Ok. del_from_groups(); a rough sketch is at the end of this mail.

>
> Thanks,
> Mark.
>

Thanks,
Alexey
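P.S. For symmetry with the insert path, the removal helper could mirror it,
roughly like this (again an untested sketch; del_from_groups() is just the
placeholder name from above):

static void
del_from_groups(struct perf_event *event, struct perf_event_context *ctx)
{
	struct rb_root *tree = ctx_group_cpu_tree(event, ctx);

	perf_cpu_tree_delete(tree, event);
}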