* System/uncore PMUs and unit aggregation
From: Will Deacon @ 2016-11-17 18:17 UTC
  To: linux-arm-kernel

Hi all,

We currently have support for three arm64 system PMUs in flight:

 [Cavium ThunderX] http://lkml.kernel.org/r/cover.1477741719.git.jglauber@cavium.com
 [Hisilicon Hip0x] http://lkml.kernel.org/r/1478151727-20250-1-git-send-email-anurup.m@huawei.com
 [Qualcomm L2] http://lkml.kernel.org/r/1477687813-11412-1-git-send-email-nleeder@codeaurora.org

Each of these has to deal with multiple underlying hardware units in one
way or another. Mark and I recently expressed a desire to expose these
units to userspace as individual PMU instances, since this can allow:

  * Fine-grained control of events from userspace, when you want to see
    individual numbers as opposed to a summed total

  * Potentially ease migration to new SoC revisions, where the units
    are laid out slightly differently

  * Easier handling of cases where the units aren't quite identical

However, this received pushback from all of the patch authors, so there's
clearly a problem with this approach. I'm hoping we can try to resolve
this here.

Speaking to Mark earlier today, we came up with the following rough rules
for drivers that present multiple hardware units as a single PMU:

  1. If the units share some part of the programming interface (e.g. control
     registers or interrupts), then they must be handled by the same PMU.
     Otherwise, they should be treated independently as separate PMU
     instances.

  2. If the units are handled by the same PMU, then care must be taken to
     handle event groups correctly. That is, if the units cannot be started
     and stopped atomically, cross-unit groups must be rejected by the
     driver (see the sketch after this list). Furthermore, any cross-unit
     scheduling constraints must be honoured so that all the units targeted
     by a group can schedule the group concurrently.

  3. Summing the counters across units is only permitted if the units
     can all be started and stopped atomically. Otherwise, the counters
     should be exposed individually. It's up to the driver author to
     decide what makes sense to sum.

  4. Unit topology can optionally be described in sysfs (we should pick
     some standard directory naming here), and then events targeting
     specific units can have the unit identifier extracted from the topology
     encoded in some configN fields.

The million dollar question is: how does that fit in with the drivers I
mentioned at the top? Is this overly restrictive, or have we missed stuff?

We certainly want to allow flexibility in the way in which the drivers
talk to the hardware, but given that these decisions directly affect the
user ABI, some consistent ground rules are required.

For Cavium ThunderX, it's not clear whether or not the individual units
could be expressed as separate PMUs, or whether they're caught by one of
the rules above. The Qualcomm L2 looks like it's doing the right thing
and we can't quite work out what the Hisilicon Hip0x topology looks like,
since the interaction with djtag is confusing.

If the driver authors (on To:) could shed some light on this, then that
would be much appreciated!

Thanks,

Will


* System/uncore PMUs and unit aggregation
From: Leeder, Neil @ 2016-11-18  3:16 UTC
  To: linux-arm-kernel

Thanks for opening up the discussion on this, Will.

For the Qualcomm L2 driver, one objection I had to exposing each unit is 
that there are so many of them - the minimum starting point is a dozen, 
so trying to start 9 counters on each means a perf command line 
specifying 100+ events. Future chips are only likely to increase that.

There is a single CPU node, so from an end-user perspective it would
seem to make sense to also have a single L2 node. perf already has the
ability to create events on multiple units using a cpumask, aggregate
the results, and split them out per unit with perf stat -a -A, so the
user can get that granularity. Exposing separate units would make
userspace duplicate a lot of that functionality. This may rely on each
uncore unit being associated with a CPU, which is the case with L2.
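
For reference, that per-unit CPU association is what the standard uncore
"cpumask" sysfs attribute conveys. A minimal sketch, not the actual
driver code, where l2cache_pmu_cpumask stands in for a mask holding one
representative online CPU per L2 unit:

  static struct cpumask l2cache_pmu_cpumask;

  static ssize_t l2cache_pmu_cpumask_show(struct device *dev,
                                          struct device_attribute *attr,
                                          char *buf)
  {
          /* perf opens the per-unit events on these CPUs and then
           * aggregates or splits the counts in userspace. */
          return cpumap_print_to_pagebuf(true, buf, &l2cache_pmu_cpumask);
  }

  static DEVICE_ATTR(cpumask, S_IRUGO, l2cache_pmu_cpumask_show, NULL);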

I agree with your comments regarding groups and I can see that a 
standard way of representing topology could be useful - in this case, 
which CPUs are within the same L2 cluster. Perhaps a list of cpumasks, 
one per L2 unit.

On 11/17/2016 1:17 PM, Will Deacon wrote:
[...]
>   3. Summing the counters across units is only permitted if the units
>      can all be started and stopped atomically. Otherwise, the counters
>      should be exposed individually. It's up to the driver author to
>      decide what makes sense to sum.

If I understand your point 3 correctly, I'm not sure about the need
to start and stop them all atomically. That seems to be a tighter
requirement than we impose on CPU PMUs. For them, perf stat -a creates
events/groups on each CPU, then starts and stops them sequentially and
sums the results. If that model is acceptable for the CPU to collect
and aggregate counts, that should be the same bar that uncore PMUs need
to reach. In the L2 case, the driver isn't summing the results; it's
still perf doing it, so I may be misinterpreting your comment about
where the summation is permitted.

Thanks,

Neil
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.


* System/uncore PMUs and unit aggregation
From: Anurup M @ 2016-11-18  8:15 UTC
  To: linux-arm-kernel

Thank you, Mark and Will, for initiating this discussion.

On Thursday 17 November 2016 11:47 PM, Will Deacon wrote:
> Speaking to Mark earlier today, we came up with the following rough rules
> for drivers that present multiple hardware units as a single PMU:
>
>    1. If the units share some part of the programming interface (e.g. control
>       registers or interrupts), then they must be handled by the same PMU.
>       Otherwise, they should be treated independently as separate PMU
>       instances.
The Hisilicon Hip0x chips have units like the L3 cache, miscellaneous
nodes (MN), DDR controller, etc., and these units are replicated across
the multiple CPU dies in the chip.

The L3 cache is further divided into banks, each of which has a
separate set of interfaces (control registers, interrupts, etc.).
As per the suggestion, each L3 cache bank would be exposed as an
individual PMU instance. So, for example, on a board using a Hip0x
chip with 2 sockets and 2 CPU dies per socket, a total of 16 L3 cache
PMUs would be exposed.

My doubt here is that each L3 cache PMU has a total of 22 statistics
events. So, if each bank is registered as a separate PMU, won't that
create multiple entries (with the same event names) in the event
listing, one set per L3 cache PMU? Is there a way to avoid this, or is
this acceptable?

Just a thought: we could group them as a single PMU and add a config
parameter to the event listing to identify the L3 cache bank (sub-unit).
For example, the event name would appear as
"hisi_l3c2/read_allocate,bank=?/", the user could count bank 0x01 with
-e "hisi_l3c2/read_allocate,bank=0x01/", and bank=0xff would request
the aggregate count. Does this overcomplicate things? Please share
your comments.
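
For illustration, such a bank field could be exposed through the PMU's
format attributes, something like the sketch below (the bit positions
are purely illustrative):

  PMU_FORMAT_ATTR(event, "config:0-7");
  PMU_FORMAT_ATTR(bank,  "config:8-15");

  static struct attribute *hisi_l3c_pmu_format_attrs[] = {
          &format_attr_event.attr,
          &format_attr_bank.attr,
          NULL,
  };

  static struct attribute_group hisi_l3c_pmu_format_group = {
          .name  = "format",
          .attrs = hisi_l3c_pmu_format_attrs,
  };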

>    2. If the units are handled by the same PMU, then care must be taken to
>       handle event groups correctly. That is, if the units cannot be started
>       and stopped atomically, cross-unit groups must be rejected by the
>       driver. Furthermore, any cross-unit scheduling constraints must be
>       honoured so that all the units targeted by a group can schedule the
>       group concurrently.
>
>    3. Summing the counters across units is only permitted if the units
>       can all be started and stopped atomically. Otherwise, the counters
>       should be exposed individually. It's up to the driver author to
>       decide what makes sense to sum.
>
>    4. Unit topology can optionally be described in sysfs (we should pick
>       some standard directory naming here), and then events targeting
>       specific units can have the unit identifier extracted from the topology
>       encoded in some configN fields.
Can this unit topology and configN method solve the duplicate event
listing issue? Please clarify.
> The million dollar question is: how does that fit in with the drivers I
> mentioned at the top? Is this overly restrictive, or have we missed stuff?
>
> We certainly want to allow flexibility in the way in which the drivers
> talk to the hardware, but given that these decisions directly affect the
> user ABI, some consistent ground rules are required.
>
> For Cavium ThunderX, it's not clear whether or not the individual units
> could be expressed as separate PMUs, or whether they're caught by one of
> the rules above. The Qualcomm L2 looks like it's doing the right thing
> and we can't quite work out what the Hisilicon Hip0x topology looks like,
> since the interaction with djtag is confusing.
The djtag is a component which connects to some other components in
the SoC over a debug bus. The registers in components like the L3
cache, MN, etc. are accessible only via djtag. Please share your
comments about the confusion; we can discuss and clear it up.

Thanks,
Anurup

* System/uncore PMUs and unit aggregation
From: Peter Zijlstra @ 2016-11-18  9:26 UTC
  To: linux-arm-kernel

On Thu, Nov 17, 2016 at 06:17:08PM +0000, Will Deacon wrote:
> Hi all,
> 
> We currently have support for three arm64 system PMUs in flight:
> 
>  [Cavium ThunderX] http://lkml.kernel.org/r/cover.1477741719.git.jglauber@cavium.com
>  [Hisilicon Hip0x] http://lkml.kernel.org/r/1478151727-20250-1-git-send-email-anurup.m@huawei.com
>  [Qualcomm L2] http://lkml.kernel.org/r/1477687813-11412-1-git-send-email-nleeder@codeaurora.org
> 
> Each of these has to deal with multiple underlying hardware units in one
> way or another. Mark and I recently expressed a desire to expose these
> units to userspace as individual PMU instances, since this can allow:
> 
>   * Fine-grained control of events from userspace, when you want to see
>     individual numbers as opposed to a summed total
> 
>   * Potentially ease migration to new SoC revisions, where the units
>     are laid out slightly differently
> 
>   * Easier handling of cases where the units aren't quite identical

This is, I think, similar to the Intel uncore situation. We expose
every single individual PMU independently. The Intel uncore is wide and
varied between parts.

Leaving the rest for Kan, who's doing the Intel uncore bits.

 ~ Peter


* System/uncore PMUs and unit aggregation
From: Jan Glauber @ 2016-11-18 11:10 UTC
  To: linux-arm-kernel

On Thu, Nov 17, 2016 at 06:17:08PM +0000, Will Deacon wrote:
> However, this received pushback from all of the patch authors, so there's
> clearly a problem with this approach. I'm hoping we can try to resolve
> this here.

Good to know. Thanks for addressing this at a higher level.

> Speaking to Mark earlier today, we came up with the following rough rules
> for drivers that present multiple hardware units as a single PMU:
> 
>   1. If the units share some part of the programming interface (e.g. control
>      registers or interrupts), then they must be handled by the same PMU.
>      Otherwise, they should be treated independently as separate PMU
>      instances.

Can you elaborate on why they should be treated independently in the
latter case? What is the problem with going through a list and writing
the control register per unit?

>   2. If the units are handled by the same PMU, then care must be taken to
>      handle event groups correctly. That is, if the units cannot be started
>      and stopped atomically, cross-unit groups must be rejected by the
>      driver. Furthermore, any cross-unit scheduling constraints must be
>      honoured so that all the units targeted by a group can schedule the
>      group concurrently.
> 
>   3. Summing the counters across units is only permitted if the units
>      can all be started and stopped atomically. Otherwise, the counters
>      should be exposed individually. It's up to the driver author to
>      decide what makes sense to sum.

Do you mean started/stopped atomically across units?

>   4. Unit topology can optionally be described in sysfs (we should pick
>      some standard directory naming here), and then events targeting
>      specific units can have the unit identifier extracted from the topology
>      encoded in some configN fields.
> 
> The million dollar question is: how does that fit in with the drivers I
> mentioned at the top? Is this overly restrictive, or have we missed stuff?
> 
> We certainly want to allow flexibility in the way in which the drivers
> talk to the hardware, but given that these decisions directly affect the
> user ABI, some consistent ground rules are required.
> 
> For Cavium ThunderX, it's not clear whether or not the individual units
> could be expressed as separate PMUs, or whether they're caught by one of
> the rules above. The Qualcomm L2 looks like it's doing the right thing
> and we can't quite work out what the Hisilicon Hip0x topology looks like,
> since the interaction with djtag is confusing.

On Cavium ThunderX the current patches add 4 PMU types, which
unfortunately are all handled differently. The L2C-TAD and OCX-TLK have
control registers per unit; the LMC and L2C-CBC don't have control
registers (free-running counters). So rule 1 might be too restrictive.
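
For the free-running counters there is no start/stop control at all;
the read callback just samples the counter and accumulates the delta.
A rough sketch, where pmu_base and counter_offset() are placeholders
for the real register mapping rather than names from the patches:

  static void thunderx_lmc_pmu_read(struct perf_event *event)
  {
          u64 prev, now;

          do {
                  prev = local64_read(&event->hw.prev_count);
                  now  = readq(pmu_base + counter_offset(event));
          } while (local64_cmpxchg(&event->hw.prev_count, prev, now) != prev);

          local64_add(now - prev, &event->count);
  }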

I've not looked into groups; would they allow merging counters from
different PMUs in the kernel?

--Jan


* System/uncore PMUs and unit aggregation
From: Liang, Kan @ 2016-11-18 16:25 UTC
  To: linux-arm-kernel



> This is, I think, similar to the Intel uncore situation. We expose
> every single individual PMU independently. The Intel uncore is wide and
> varied between parts.
>
> Leaving the rest for Kan, who's doing the Intel uncore bits.
>
> > For Cavium ThunderX, it's not clear whether or not the individual
> > units could be expressed as separate PMUs, or whether they're caught
> > by one of the rules above.

Each individual unit should be a separate PMU. They are standalone
systems, although they share the same PCI ID. I think you can use
bus:dev:fn to distinguish among them; that is what we did for Skylake
and KNL.

I noticed that you pick a random CPU for your uncore PMU. In our
implementation, we usually pick the first available CPU.
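
Something like the sketch below, i.e. fold the PCI location into the
PMU name instead of relying on the shared device ID (the names here are
illustrative, not the actual Skylake/KNL code):

  static char *unit_pmu_name(struct pci_dev *pdev)
  {
          /* Distinguish otherwise-identical units by PCI bus:dev.fn. */
          return kasprintf(GFP_KERNEL, "uncore_unit_%02x_%02x_%x",
                           pdev->bus->number, PCI_SLOT(pdev->devfn),
                           PCI_FUNC(pdev->devfn));
  }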

> > The Qualcomm L2 looks like it's doing the right thing and

If there is only one device, it should be OK.
But the variable "num_pmus" makes me very confused.

> > we can't quite work out what the Hisilicon Hip0x
> > topology looks like, since the interaction with djtag is confusing.

Each djtag device has its own PMU; they are named by scl_id, e.g.
hisi_l3c*.


* System/uncore PMUs and unit aggregation
From: Mark Rutland @ 2016-11-23 17:18 UTC
  To: linux-arm-kernel

On Fri, Nov 18, 2016 at 12:10:17PM +0100, Jan Glauber wrote:
> On Thu, Nov 17, 2016 at 06:17:08PM +0000, Will Deacon wrote:

> > Speaking to Mark earlier today, we came up with the following rough rules
> > for drivers that present multiple hardware units as a single PMU:
> > 
> >   1. If the units share some part of the programming interface (e.g. control
> >      registers or interrupts), then they must be handled by the same PMU.
> >      Otherwise, they should be treated independently as separate PMU
> >      instances.
> 
> Can you elaborate why they should be treated independent in the later
> case? What is the problem with going through a list and writing the
> control register per unit?

For one thing, event groups spanning those units cannot be scheduled
atomically (some events would be counting while others were not),
violating group semantics.

> >   3. Summing the counters across units is only permitted if the units
> >      can all be started and stopped atomically. Otherwise, the counters
> >      should be exposed individually. It's up to the driver author to
> >      decide what makes sense to sum.
> 
> Do you mean started/stopped atomically across units?

Yes. If some units are counting while others are not, values can be
skewed, and therefore potentially misleading.

> > For Cavium ThunderX, it's not clear whether or not the individual units
> > could be expressed as separate PMUs, or whether they're caught by one of
> > the rules above. The Qualcomm L2 looks like it's doing the right thing
> > and we can't quite work out what the Hisilicon Hip0x topology looks like,
> > since the interaction with djtag is confusing.
> 
> On Cavium ThunderX the current patches add 4 PMU types, which
> unfortunately are all handled differently. The L2C-TAD and OCX-TLK have
> control registers per unit; the LMC and L2C-CBC don't have control
> registers (free-running counters). So rule 1 might be too restrictive.
> 
> I've not looked into groups; would they allow merging counters from
> different PMUs in the kernel?

No; event groups are strictly single PMU, with the sole exception that
software events may be placed inside a hardware event group (since
there's no start/stop logic required for SW events).
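
In code terms, a driver's group validation loop typically looks
something like this sketch: software events are skipped, and every
other member must sit on the same PMU.

  static bool unit_pmu_group_valid(struct perf_event *event)
  {
          struct perf_event *leader = event->group_leader;
          struct perf_event *sibling;

          if (leader->pmu != event->pmu && !is_software_event(leader))
                  return false;

          list_for_each_entry(sibling, &leader->sibling_list, group_entry) {
                  /* Software events are the one cross-PMU exception. */
                  if (is_software_event(sibling))
                          continue;
                  if (sibling->pmu != event->pmu)
                          return false;
          }

          return true;
  }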

Thanks,
Mark.


* System/uncore PMUs and unit aggregation
From: Will Deacon @ 2017-01-10 18:54 UTC
  To: linux-arm-kernel

Hi Neil, Anurup, Jan,

On Thu, Nov 17, 2016 at 10:16:46PM -0500, Leeder, Neil wrote:
> Thanks for opening up the discussion on this Will.
> 
> For the Qualcomm L2 driver, one objection I had to exposing each unit is
> that there are so many of them - the minimum starting point is a dozen, so
> trying to start 9 counters on each means a perf command line specifying 100+
> events. Future chips are only likely to increase that.
> 
> There is a single CPU node, so from an end-user perspective it would seem to
> make sense to also have a single L2 node. perf already has the ability to
> create events on multiple units using cpumask, aggregate the results, and
> split them out per unit with perf stat -a -A, so the user can get that
> granularity. Exposing separate units would make userspace duplicate a lot of
> that functionality. This may rely on each uncore unit being associated with
> a CPU, which is the case with L2.
> 
> I agree with your comments regarding groups and I can see that a standard
> way of representing topology could be useful - in this case, which CPUs are
> within the same L2 cluster. Perhaps a list of cpumasks, one per L2 unit.

Mark and I had a chat about this earlier today and I think we largely agree
with you. That is, for composite PMUs with a notion of CPU affinity for
their component units, it makes sense to use the event affinity as a means
to address these units, rather than e.g. create separate PMU instances.

However, for PMUs that don't have this notion of affinity, the units should
either be exposed individually or, in the case that there is something like
shared control logic, they should be addressed through the config fields
(e.g. the hisilicon cache with the bank=NN option).
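
In userspace terms, addressing a unit by affinity just means opening
the event on a CPU associated with that unit. A hypothetical example,
where the PMU type would be read from
/sys/bus/event_source/devices/<pmu>/type and 0x2 is a made-up event
encoding:

  #include <linux/perf_event.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static int open_l2_event_on(int cpu, int pmu_type)
  {
          struct perf_event_attr attr = {
                  .type   = pmu_type,
                  .size   = sizeof(attr),
                  .config = 0x2,
          };

          /* pid == -1, cpu >= 0: count that CPU (and hence its unit). */
          return syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
  }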

I think this fits with your driver, so please post an updated version
addressing Mark's unrelated review comments.

> On 11/17/2016 1:17 PM, Will Deacon wrote:
> [...]
> >  3. Summing the counters across units is only permitted if the units
> >     can all be started and stopped atomically. Otherwise, the counters
> >     should be exposed individually. It's up to the driver author to
> >     decide what makes sense to sum.
> 
> If I understand your point 3 correctly, I'm not sure about the need to
> start and stop them all atomically. That seems to be a tighter requirement
> than we impose on CPU PMUs. For them, perf stat -a creates events/groups
> on each CPU, then starts and stops them sequentially and sums the results.
> If that model is acceptable for the CPU to collect and aggregate counts,
> that should be the same bar that uncore PMUs need to reach. In the L2 case,
> the driver isn't summing the results, it's still perf doing it, so I may be
> misinterpreting your comment about where the summation is permitted.

My concern with summation is more that I don't want to expose high-level
"meta" events from the driver which in fact end up being a bunch of
different events in a bunch of different counters that can't be read
atomically. Userspace is free to do that, but the driver shouldn't claim
that it can support the event, if you see what I mean?

Will


* System/uncore PMUs and unit aggregation
From: Will Deacon @ 2017-01-10 18:56 UTC
  To: linux-arm-kernel

On Fri, Nov 18, 2016 at 01:45:23PM +0530, Anurup M wrote:
> Just a thought: we could group them as a single PMU and add a config
> parameter to the event listing to identify the L3 cache bank (sub-unit).
> For example, the event name would appear as
> "hisi_l3c2/read_allocate,bank=?/", the user could count bank 0x01 with
> -e "hisi_l3c2/read_allocate,bank=0x01/", and bank=0xff would request
> the aggregate count.

Adding a bank field to the config looks fine to me. I'm assuming the banks
aren't CPU affine?

Will


* System/uncore PMUs and unit aggregation
From: Leeder, Neil @ 2017-01-11  0:46 UTC
  To: linux-arm-kernel

On 1/10/2017 1:54 PM, Will Deacon wrote:
> Mark and I had a chat about this earlier today and I think we largely agree
> with you. That is, for composite PMUs with a notion of CPU affinity for
> their component units, it makes sense to use the event affinity as a means
> to address these units, rather than e.g. create separate PMU instances.
>
> However, for PMUs that don't have this notion of affinity, the units should
> either be exposed individually or, in the case that there is something like
> shared control logic, they should be addressed through the config fields
> (e.g. the hisilicon cache with the bank=NN option).
>
> I think this fits with your driver, so please post an updated version
> addressing Mark's unrelated review comments.
>
Thanks, Will. I'll post a new patch which covers Mark's comments.

Neil
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.


* System/uncore PMUs and unit aggregation
From: Ganapatrao Kulkarni @ 2017-03-16 11:08 UTC
  To: linux-arm-kernel

Hi Will,

On Thu, Nov 17, 2016 at 11:47 PM, Will Deacon <will.deacon@arm.com> wrote:
> Speaking to Mark earlier today, we came up with the following rough rules
> for drivers that present multiple hardware units as a single PMU:
>
>   1. If the units share some part of the programming interface (e.g. control
>      registers or interrupts), then they must be handled by the same PMU.
>      Otherwise, they should be treated independently as separate PMU
>      instances.

How are we planning to handle the multi-node scenario?
If there are X separate PMUs on a single socket, are we going to list
2X PMUs on a dual socket?
>

Thanks,
Ganapat

* System/uncore PMUs and unit aggregation
From: Will Deacon @ 2017-03-20 12:37 UTC
  To: linux-arm-kernel

On Thu, Mar 16, 2017 at 04:38:28PM +0530, Ganapatrao Kulkarni wrote:
> How are we planning to handle the multi-node scenario? If there are X
> separate PMUs on a single socket, are we going to list 2X PMUs on a
> dual socket?

Sure, why not? Retrofitting multi-node support into a PMU driver sounds
pretty messy to me, and I don't see the downside of exposing these as
separate instances (which is what they are).

Will
