From: David Carrillo-Cisneros <davidcc@google.com>
To: Mark Rutland <mark.rutland@arm.com>
Cc: linux-kernel <linux-kernel@vger.kernel.org>,
	"x86@kernel.org" <x86@kernel.org>, Ingo Molnar <mingo@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Andi Kleen <ak@linux.intel.com>, Kan Liang <kan.liang@intel.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Borislav Petkov <bp@suse.de>,
	Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Vikas Shivappa <vikas.shivappa@linux.intel.com>,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	Vince Weaver <vince@deater.net>, Paul Turner <pjt@google.com>,
	Stephane Eranian <eranian@google.com>
Subject: Re: [RFC 2/6] perf/core: add a rb-tree index to inactive_groups
Date: Tue, 10 Jan 2017 12:20:00 -0800	[thread overview]
Message-ID: <CALcN6mhMMLz1j1e5p=xukYp8LbRXM2zsUZMr0zpjcU2G-WMB6g@mail.gmail.com> (raw)
In-Reply-To: <20170110141408.GC19704@leverpostej>

On Tue, Jan 10, 2017 at 6:14 AM, Mark Rutland <mark.rutland@arm.com> wrote:
> On Tue, Jan 10, 2017 at 02:24:58AM -0800, David Carrillo-Cisneros wrote:
>> Add a rb-tree that indexes inactive events by {CPU/cgroup,flexible,stamp}.
>>
>> The original idea by Peter Z. was to sort task events in an rb-tree using
>> {pmu,cpu,timestamp} as key.
>>
>> Having the PMU as part of the key gets complicated for contexts that
>> share PMUs (e.g. software contexts) because all events in a context should
>> rotate together irrespective of their PMU. It's also unclear to me that
>> there is any case where a search by PMU is useful.
>
> Using the PMU as the key is essential for correct operation where
> heterogeneous HW PMUs share a context.
>
> For example, on a big.LITTLE system, big and little CPU PMUs share the
> same context, but their events are mutually incompatible. On big CPUs we
> only want to consider the sub-tree of big events, and on little CPUs we
> only want to consider little events. Hence, we need to be able to search
> by PMU.

I see it now. So, if the PMU were added to the rb-tree key, how would the
generic code know the PMU of the current CPU?

>
> For SW PMUs, pmu::add() should never fail, and regardless of the order
> of the list we should be able to pmu::add() all events. Given that, why
> does the manner in which rotation occurs matter for SW PMUs?
>
>> Another complication is that using ctx->time (or timestamp) implies that
>> groups added during the same context switch may not have unique keys.
>> This increases the complexity of the code that finds all events in the
>> rb-tree that are within a time interval.
>
> Could you elaborate on this? I don't understand what the problem is
> here. If we need uniqueness where {pmu,cpu,runtime} are equal, can't we
> extend the comparison to {pmu,cpu,runtime,event pointer}? That way
> everything we need is already implicit in the event, and we don't need
> perf_event::rbtree_key nor do we need
> perf_event_context::nr_inactive_added.

Yes, we could extend the comparison, but I am trying to keep the key a
u64 to speed things up.

I found it easier to simply create a counter and use it as an equivalent to
(timestamp, unique id). Both ways induce the same order of events.

>
> Using the runtime as part of the key was intended to provide fairer
> scheduling, even when scheduling intervals are not uniform, etc. Not
> using the runtime means we lose that fairness.

A counter that increases every time an event is added to the rbtree
induces the same order as a (timestamp, <some unique id>).
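
The equivalence can be sketched in plain C. This is a toy model, not code
from the patch: fake_event, cmp_pair and cmp_stamp are invented names, and
it assumes the unique ids are handed out in insertion order, as they would
be here:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model, not kernel code: two events inserted during the same
 * context switch share a timestamp, so a unique id is needed as a
 * tie-breaker. A single counter bumped at insertion time encodes
 * both pieces of information at once, provided ids are assigned
 * in insertion order.
 */
struct fake_event {
	uint64_t timestamp;	/* ctx->time at insertion; may collide */
	uint32_t id;		/* unique tie-breaker */
	uint32_t stamp;		/* single insertion counter */
};

static uint32_t next_id, next_stamp;

static void fake_insert(struct fake_event *e, uint64_t now)
{
	e->timestamp = now;
	e->id = next_id++;
	e->stamp = next_stamp++;	/* one counter replaces the pair */
}

/* Compare by (timestamp, id); returns <0, 0, >0 like a comparator. */
static int cmp_pair(const struct fake_event *a, const struct fake_event *b)
{
	if (a->timestamp != b->timestamp)
		return a->timestamp < b->timestamp ? -1 : 1;
	return a->id < b->id ? -1 : (a->id > b->id ? 1 : 0);
}

/* Compare by the single stamp alone. */
static int cmp_stamp(const struct fake_event *a, const struct fake_event *b)
{
	return a->stamp < b->stamp ? -1 : (a->stamp > b->stamp ? 1 : 0);
}
```

Both comparators agree on every pair, so sorting by stamp gives the same
order as sorting by (timestamp, id), with half the key size.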

>
>> Lastly, it is useful to query pinned and flexible events separately since
>> they are scheduled in at different times.
>
> Using this as part of the sort makes sense to me.
>
>> For the reasons above, I created a rb-tree per context with key
>> {CPU,flexible,stamp} for task contexts and {cgroup,flexible,stamp} for
>> CPU contexts.
>>
>> The "flexible" boolean allows querying pinned and flexible events
>> separately.
>> The stamp is given by a non-decreasing counter: ctx->nr_events_added.
>> It increases every time a new inactive event is inserted. That choice of
>> stamp guarantees unique keys for all events and that events of the same
>> type (same {CPU/cgroup,flexible}) have the same order in the rb-tree.
>
> As above, I think we can use the event pointer for this, without having
> to maintain a new counter.

Same as above. I am trying to have a unique 64 bit key. I'd argue that
the new counter only adds a couple of lines of code and 32 bits of storage
but makes the rb-tree key comparison simpler and faster.
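
For reference, the packing described in the patch (CPU in bits 63:33,
flexible flag in bit 32, stamp in bits 31:0) can be sketched outside the
kernel tree; GENMASK_ULL/BIT_ULL are open-coded here, and make_key is a
made-up name standing in for taskctx_rbtree_key() plus an explicit stamp:

```c
#include <assert.h>
#include <stdint.h>

/* Open-coded equivalents of the kernel's GENMASK_ULL/BIT_ULL usage. */
#define RBTREE_KEY_STAMP_WIDTH	32
#define RBTREE_KEY_STAMP_MASK	(((uint64_t)1 << RBTREE_KEY_STAMP_WIDTH) - 1)
#define RBTREE_KEY_FLEXIBLE_MASK	((uint64_t)1 << RBTREE_KEY_STAMP_WIDTH)

/*
 * Pack {CPU, flexible, stamp} into one u64 so that a single integer
 * comparison orders events first by CPU, then pinned before flexible,
 * then by insertion stamp.
 */
static uint64_t make_key(uint32_t cpu, int flexible, uint32_t stamp)
{
	return ((uint64_t)cpu << (RBTREE_KEY_STAMP_WIDTH + 1)) |
	       (flexible ? RBTREE_KEY_FLEXIBLE_MASK : 0) |
	       ((uint64_t)stamp & RBTREE_KEY_STAMP_MASK);
}
```

One u64 compare then replaces a three-field comparison, and all events of
the same {CPU, flexible} class land in one contiguous, stamp-ordered
sub-range of the tree.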

>
>> When events are scheduled in or rotated, all events in the context must be
>> iterated or rotated together, irrespective of the CPU/cgroup. To do that,
>> we add ctx->inactive_groups, a list that "threads" the rb-tree in total
>> ctx->nr_events_added order. Note that this order is the same as timestamp
>> order and ctx->inactive_groups is used for both scheduling and iteration.
>> The rb-tree can be seen as an index over ctx->inactive_groups.
>
> As above, I think we must use runtime as part of the key, and with that
> done, we only need the rb tree, and not the inactive_groups list.

The inactive_groups list makes the rb-tree "threaded" and follows
timestamp (or stamp) order. It makes iterating over events much
faster.
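
A toy sketch of why that list needs no per-insertion search (not kernel
code; toy_ctx/toy_event are invented names): stamps only ever grow, so
keeping the list in stamp order is just a tail append, and a full
ctx-wide walk is a plain list traversal:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy singly-linked stand-in for ctx->inactive_groups. */
struct toy_event {
	uint32_t stamp;
	struct toy_event *next;
};

struct toy_ctx {
	struct toy_event *head, *tail;
	uint32_t next_stamp;	/* models ctx->nr_inactive_added */
};

static void toy_add_inactive(struct toy_ctx *ctx, struct toy_event *e)
{
	e->stamp = ctx->next_stamp++;
	e->next = NULL;
	if (ctx->tail)
		ctx->tail->next = e;
	else
		ctx->head = e;
	ctx->tail = e;	/* list stays stamp-sorted by construction */
}

/* Check the invariant: strictly increasing stamps along the list. */
static int toy_list_is_sorted(const struct toy_ctx *ctx)
{
	const struct toy_event *e;

	for (e = ctx->head; e && e->next; e = e->next)
		if (e->stamp >= e->next->stamp)
			return 0;
	return 1;
}
```

The rb-tree then only has to answer "first node of a given
{CPU/cgroup, flexible} class", from which iteration continues along the
list.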

>
> With runtime as part of the key, rotation is not necessary, since
> insertion will sort the list, and then we can pick all the nodes with
> the shortest runtime. Hence, no special rotation code is necessary -- we
> simply make all events inactive, then reschedule.

Right, that's done later in this series.

>
>> Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
>> ---
>>  include/linux/perf_event.h |  5 +++
>>  kernel/events/core.c       | 82 ++++++++++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 87 insertions(+)
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index 3fa18f05c9b0..fd32ecc37d33 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -564,6 +564,9 @@ struct perf_event {
>>       struct list_head                group_entry;
>>       struct list_head                sibling_list;
>>
>> +     u64                             rbtree_key;
>> +     struct rb_node                  rbtree_node;
>> +
>>       /*
>>        * We need storage to track the entries in perf_pmu_migrate_context; we
>>        * cannot use the event_entry because of RCU and we want to keep the
>> @@ -736,6 +739,8 @@ struct perf_event_context {
>>       struct list_head                pinned_groups;
>>       struct list_head                flexible_groups;
>>
>> +     struct rb_root                  rbtree_root;
>> +     u32                             nr_inactive_added;
>>       struct list_head                active_pinned_groups;
>>       struct list_head                active_flexible_groups;
>>       struct list_head                inactive_groups;
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index b744b5a8dbd0..623d81c0ca93 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -1462,19 +1462,98 @@ ctx_group_list(struct perf_event *event, struct perf_event_context *ctx)
>>               return &ctx->flexible_groups;
>>  }
>>
>> +/*
>> + * The bits of perf_event::rbtree_key represent:
>> + *   - 63:33 a unique identifier for the CPU (if a task context) or the
>> + *     cgroup (if a CPU context).
>> + *   - 32    a boolean that indicates if the event is flexible (vs. pinned).
>> + *   - 31:0  a unique "stamp" that follows the last time the event was
>> + *     scheduled.
>> + * The 64-bit value groups events of the same type (CPU/cgroup + flexible)
>> + * together in the rb-tree.
>> + */
>> +#define RBTREE_KEY_STAMP_WIDTH               32
>> +#define RBTREE_KEY_STAMP_MASK                GENMASK_ULL(RBTREE_KEY_STAMP_WIDTH - 1, 0)
>> +#define RBTREE_KEY_FLEXIBLE_MASK     BIT_ULL(RBTREE_KEY_STAMP_WIDTH)
>> +
>> +static u64 taskctx_rbtree_key(int cpu, bool flexible)
>> +{
>> +     /*
>> +      * Use CPU only. The PMU is never used in schedule in/out and, since
>> +      * some contexts share a PMU, iterating over them would make things
>> +      * complicated. I could not find a case where an ordered iteration
>> +      * over all PMU events in one context is useful.
>> +      */
>> +     return ((u64)cpu << (RBTREE_KEY_STAMP_WIDTH + 1)) |
>> +             (flexible ? RBTREE_KEY_FLEXIBLE_MASK : 0);
>> +}
>> +
>> +static u64 cpuctx_rbtree_key(struct perf_cgroup *cgrp, bool flexible)
>> +{
>> +     u64 k;
>> +
>> +     if (cgrp)
>> +             /* A cheap way to obtain an identifier for a cgroup. Suggestions appreciated. */
>> +             k = (u64)cgrp->css.id << (RBTREE_KEY_STAMP_WIDTH + 1);
>> +     else
>> +             k = GENMASK_ULL(63, RBTREE_KEY_STAMP_WIDTH + 1);
>> +     return k | (flexible ? RBTREE_KEY_FLEXIBLE_MASK : 0);
>> +}
>
> Can we turn these into {task,cpu}ctx_rbtree_cmp() functions which
> compare two events dynamically?
>
> Given that event->task can distinguish the two cases, we could just have
> an event_rbtree_cmp() (with an assert that event_a->task ==
> event_b->task).

I tried to avoid recreating the key every time there is a comparison.
Currently the key is stored into the u64 event->rbtree_key and only recreated
when the stamp changes.

The cmp wrapper makes sense if we go away from the u64 keys and add
more fields to the comparison.

In summary:

  - The stamp (i.e. ctx->nr_inactive_added) induces the same order as the
  timestamp at the time the event is added to the rb-tree, but it guarantees
  uniqueness and is denser (fits in 32 bits).

  - The inactive_groups list threads the rb-tree and is sorted in stamp order
  (same as timestamp order). This is the list used to iterate during
  ctx_sched_{in,out}. The rb-tree only creates an index (or skiplist) over
  this list so that we skip most of the events that do not match the current
  CPU/cgroup (or, optionally, PMU).

  - I see now why the PMU is useful for big.LITTLE. If there is a way for
  generic code to retrieve the PMU of a CPU, I can add it to the key for
  those PMUs with the PERF_PMU_CAP_HETEROGENEOUS_CPUS capability.

>
> Thanks,
> Mark.
