Subject: Re: [PATCH v7 0/2] perf/core: addressing 4x slowdown during
 per-process profiling of STREAM benchmark on Intel Xeon Phi
From: Alexey Budankov
Organization: Intel Corp.
To: Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin, Andi Kleen,
 Kan Liang, Dmitri Prokhorov, Valery Cherepennikov, Mark Rutland,
 Stephane Eranian, David Carrillo-Cisneros, linux-kernel
Date: Wed, 23 Aug 2017 11:54:08 +0300
Message-ID: <8d14b630-cbd7-7569-2566-644cd9e3a564@linux.intel.com>
In-Reply-To: <20170822202106.GP32112@worktop.programming.kicks-ass.net>
References: <20170822202106.GP32112@worktop.programming.kicks-ass.net>

On 22.08.2017 23:21, Peter Zijlstra wrote:
> On Fri, Aug 18, 2017 at 08:17:15AM +0300, Alexey Budankov wrote:
>> Hi,
>
> Please don't post new versions in reply to old versions, that gets them
> lost in thread sorted views.

Accepted. Actually, I followed the recommendations in
https://git-scm.com/docs/git-send-email, which say it needs to be like this:

[PATCH 0/2] Here is what I did...
  [PATCH 1/2] Clean up and tests
  [PATCH 2/2] Implementation
  [PATCH v2 0/3] Here is a reroll
    [PATCH v2 1/3] Clean up
    [PATCH v2 2/3] New tests
    [PATCH v2 3/3] Implementation

So I made v7 a descendant of v6. Do you prefer the new version to be at the
same level as the previous versions, like this?

[PATCH 0/2] Here is what I did...
  [PATCH 1/2] Clean up and tests
  [PATCH 2/2] Implementation
[PATCH v2 0/3] Here is a reroll
  [PATCH v2 1/3] Clean up
  [PATCH v2 2/3] New tests
  [PATCH v2 3/3] Implementation

>
>> This patch set v7 moves event groups into rb trees and implements
>> skipping to the current CPU's list on hrtimer interrupt.
>
> Does this depend on your timekeeping rework posted in that v6 thread?

This v7 includes the timekeeping rework, so it is a complete patch set
addressing your earlier concerns. The set of changes became smaller, so we
are going the right way.

I tested v7 over several nights on Xeon Phi under a fuzzer, like this:

for ((i=0;i<1000;i=i+1)); do ./perf_fuzzer; done

and there were no crashes or hangs in the perf code; the machine is still
alive.

> If so, I would have expected to see that as part of these patches, if
> not, I'm confused, because part of the problem was that we currently
> need to update times for events we don't want to schedule etc..

Yes, you are right. We need to update times for those events, and we still
do, but on demand - in the read() syscall. Doing so, we can skip the slow
iteration over the whole bunch of events and get a performance boost. We
can skip updating the times on every timer interrupt and do it only on a
read() call and on thread context switch out, when update_event_times() is
actually called.
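
To make the on-demand accounting idea concrete, here is a minimal
user-space sketch; the struct and function names below are made up for
illustration and are not the actual perf internals. The running time is
derived from a stored enable timestamp only when somebody asks for it, so
nothing has to be touched on every tick:

/*
 * Conceptual user-space sketch of on-demand time accounting.  All names
 * here are made up for illustration; this is not the actual perf code.
 */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

struct sample_event {
	uint64_t total_time;	/* time accumulated on earlier sched-outs */
	uint64_t tstamp;	/* when the event was last (re)enabled */
	int	 running;
};

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/* Called only from the read() path, not from every timer interrupt. */
static uint64_t event_read_time(struct sample_event *e)
{
	uint64_t t = e->total_time;

	if (e->running)
		t += now_ns() - e->tstamp;	/* account lazily, on demand */
	return t;
}

/* Called on thread context switch out; folds the elapsed time in lazily. */
static void event_sched_out(struct sample_event *e)
{
	e->total_time += now_ns() - e->tstamp;
	e->running = 0;
}

int main(void)
{
	struct sample_event e = {
		.total_time = 0, .tstamp = now_ns(), .running = 1,
	};

	/* no per-tick bookkeeping: the value is computed when asked for */
	printf("running time: %llu ns\n",
	       (unsigned long long)event_read_time(&e));
	event_sched_out(&e);
	return 0;
}

The bookkeeping cost then moves from the hrtimer interrupt to the reader
and to the sched-out path.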
>
>> Events allocated for the same CPU are still kept in a linked list
>> of the event directly attached to the tree because it is unclear
>> how to implement fast iteration through events allocated for
>> the same CPU when they are all attached to a tree employing an
>> additional 64-bit index as a secondary tree key.
>
> Finding the CPU subtree and rb_next() wasn't good?

I implemented the approach you suggested (as I understood it), tested it
and got results that are drastically different from what I am getting for
the tree of lists. Specifically, I did the following:

1. kept all groups in the same single tree by employing a 64-bit index in
   addition to the CPU key;
2. implemented a special _less() function and rotation by re-inserting a
   group with an incremented index;
3. replaced the API with a callback in the signature by a
   perf_event_groups_for_each() macro.

Employing all that shrank the total patch size, however I am still
struggling with correctness issues.

Now I have figured out that the events with the same cpu are not always
located under a single subtree root; it depends on the order of insertion.
E.g. with insertion order 01,02,03,14,15,16 we get this:

        02
       /  \
     01    14
          /  \
        03    15
                \
                 16

and it is unclear how to iterate the cpu==0 part of the tree in this case.

Iterating a cpu-specific subtree like this:

#define for_each_group_event(event, group, cpu, pmu, field)	 \
	for (event = rb_entry_safe(group_first(group, cpu, pmu), \
				   typeof(*event), field);	 \
	     event && event->cpu == cpu && event->pmu == pmu;	 \
	     event = rb_entry_safe(rb_next(&event->field),	 \
				   typeof(*event), field))

misses event==03 for the case above, and I guess this is where I lose
samples in my testing.
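
For reference, below is a small self-contained sketch of the iteration
pattern under discussion, assuming the comparator orders the tree by the
full (cpu, index) pair. It is plain user-space C with a trivial unbalanced
BST and made-up names (group_node, group_first_cpu, tree_next), not the
kernel rbtree API; it only models the "find the leftmost node for a cpu,
then take in-order successors until the cpu changes" pattern:

/*
 * User-space sketch of iterating one CPU's events in a tree keyed by the
 * composite (cpu, index) pair.  Plain unbalanced BST with made-up names;
 * this is not the kernel rbtree API, only a model of the lookup pattern.
 */
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

struct group_node {
	int cpu;
	uint64_t index;			/* secondary key used for rotation */
	struct group_node *left, *right, *parent;
};

/* Full composite-key ordering: cpu first, then index. */
static int node_less(const struct group_node *a, const struct group_node *b)
{
	if (a->cpu != b->cpu)
		return a->cpu < b->cpu;
	return a->index < b->index;
}

static struct group_node *tree_insert(struct group_node *root,
				      struct group_node *n)
{
	struct group_node *parent = NULL, **link = &root;

	while (*link) {
		parent = *link;
		link = node_less(n, *link) ? &(*link)->left : &(*link)->right;
	}
	*link = n;			/* attaches to parent, or becomes root */
	n->parent = parent;
	return root;
}

/* In-order successor, the role rb_next() plays in the kernel. */
static struct group_node *tree_next(struct group_node *n)
{
	struct group_node *p;

	if (n->right) {
		for (n = n->right; n->left; n = n->left)
			;
		return n;
	}
	while ((p = n->parent) && n == p->right)
		n = p;
	return p;
}

/* Leftmost node with the given cpu, i.e. (cpu, smallest index). */
static struct group_node *group_first_cpu(struct group_node *root, int cpu)
{
	struct group_node *match = NULL;

	while (root) {
		if (root->cpu < cpu) {
			root = root->right;
		} else {
			if (root->cpu == cpu)
				match = root;	/* remember, keep going left */
			root = root->left;
		}
	}
	return match;
}

int main(void)
{
	/* models the 01,02,03,14,15,16 insertion order from the example */
	static struct group_node nodes[] = {
		{ 0, 1 }, { 0, 2 }, { 0, 3 }, { 1, 4 }, { 1, 5 }, { 1, 6 },
	};
	struct group_node *root = NULL, *n;
	size_t i;

	for (i = 0; i < sizeof(nodes) / sizeof(nodes[0]); i++)
		root = tree_insert(root, &nodes[i]);

	/* visits every cpu==0 node exactly once: indexes 1, 2, 3 */
	for (n = group_first_cpu(root, 0); n && n->cpu == 0; n = tree_next(n))
		printf("cpu=%d index=%llu\n",
		       n->cpu, (unsigned long long)n->index);

	return 0;
}

With the composite ordering, all nodes of one cpu form one contiguous
in-order run, which is what a group_first()/rb_next() style loop relies on.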