linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [RFC PATCH v2 0/2] x86/resctrl: Start abstraction for a second arch
@ 2020-04-14 18:56 Reinette Chatre
  2020-04-15 12:59 ` James Morse
  0 siblings, 1 reply; 5+ messages in thread
From: Reinette Chatre @ 2020-04-14 18:56 UTC (permalink / raw)
  To: James Morse, x86, LKML
  Cc: Yu, Fenghua, Thomas Gleixner, mingo, bp, hpa, Moger, Babu, Luck, Tony

Hi James,

On 12/31/1969 4:00 PM, James Morse wrote:
> Hi folks,
> 
> These two patches are the tip of the MPAM iceberg.
> 
> Arm have some CPU support for dividing caches into portions, and
> applying bandwidth limits at various points in the SoC. The collective term
> for these features is MPAM: Memory Partitioning and Monitoring.
> 
> MPAM is similar enough to Intel RDT, that it should use the defacto linux
> interface: resctrl. This filesystem currently lives under arch/x86, and is
> tightly coupled to the architecture.
> Ultimately, my plan is to split the existing resctrl code up to have an
> arch<->fs abstraction, then move all the bits out to fs/resctrl. From there
> MPAM can be wired up.
> 
> These two patches are step one: Split the two structs that resctrl uses
> to have an arch<->fs split. These sit on top of the cleanup posted here:
> lore.kernel.org/r/20200214182401.39008-1-james.morse@arm.com
> 
> I'm after strong opinions like "No! struct mbm_state is obviously arch
> specific.". Making the hardware configuration belong to the arch code
> instead of resctrl is so that it can be scaled on arm64, where MPAM
> allows terrifyingly large portion bitmaps for the caches.
> 
> 
> 
> Last time these were posted, the request was for the spec, and to see
> the whole fully assembled iceberg.
> 
> The spec is here:
> https://static.docs.arm.com/ddi0598/ab/DDI0598A_b_MPAM_supp_armv8a.pdf
> 
> For a slightly dated view of the whole tree:
> 1. Don peril sensitive sunglasses
> 2. https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/snapshot/feb
> 
> The tree is generally RFC-quality. It gets more ragged once you get out of
> the x86 code. I anticipate all the arm64 code being rewritten before its
> considered for merging.
> 
> (I haven't reposted the CDP origami as before, as I think that series
> will be clearer if I re-order the patches ... it may even be shorter)
> 
> 
> Does it all work? Almost. Monitor groups are proving to be a problem, I
> can't see a way of getting these working without a user-visible change of
> behaviour.
> MPAMs counters aren't 1:1 with RMIDs, so supporting MBM_* on
> any likely hardware will have to be via something other than resctrl.
> 

Before jumping to the details within the patches of this work ...

Could you please summarize the salient points from the spec, pointing
readers to the spec for more information? A start would be how resctrl
is expected to support MPAM. Could you highlight where does resctrl
interface or assumptions currently fall short? How is the planned MPAM
integration addressing these shortcomings?

Some higher level questions I have after scanning the spec and patches are:

* The spec contains many details of how MPAM supports virtualization. Is
it expected that all of this would be supported with resctrl? For
example, while some registers may be abstracted it seems some interface
may be needed to configure the virt to phy PARTID mappings. Information
about how resctrl is envisioned to support MPAM's virtualization would
help a lot.

* Looking at the commits (1385052cce87a8aed5dc0e96967cedd9e74a17e0 -
"x86/resctrl: Group staged configuration into a separate struct") I
found mention of a change in the schemata. Highlighting any planned
resctrl interface changes would be very helpful.

* Apart from actual interface changes, highlighting planned behavior
changes and motivation for them would also be helpful … for example
force enabling of CDP on all cache levels is a red flag to me.

* I am curious about how the configurability of MPAM will be handled. It
seems as though MPAM is highly configurable, how is this expected to be
handled in resctrl? For example, this message and KNOWN_ISSUES among the
patches mentions an ABI issue that RMID is independent from CLOSID in
RDT but PMG (like RMID) is dependent on PARTID (like CLOSID) in MPAM.
There is a MATCH_PARTID configuration option in MPAM that makes PMG not
depend on PARTID and thus seem to bring closer to RDT. I am surely not
indicating that MPAM should be made to behave like RDT but it does seem
that MPAM is very configurable. Is it the intention to support all ways
in which MPAM can be used and if so is the plan for resctrl so support
making these configuration changes and then support them? For example,
would you want resctrl to support all variations where MATCH_PARTID ==
[0|1] and MATCH_PMG == [0|1]? My intention here is not to delve into
these details in particular, instead I hope to use it as an example of
what I mean when curious about how (if at all) resctrl is envisioned to
support the configurability of MPAM.

It seems to me that MPAM may need more than what is currently available
from resctrl but it is hard for me to digest a 276 page spec and 150
patch series to fully understand what needs to be support and how to do
so. I look forward to learning more about the goals of what needs to be
supported and your vision for resctrl to do so.

Thank you

Reinette

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH v2 0/2] x86/resctrl: Start abstraction for a second arch
  2020-04-14 18:56 [RFC PATCH v2 0/2] x86/resctrl: Start abstraction for a second arch Reinette Chatre
@ 2020-04-15 12:59 ` James Morse
  2020-04-15 19:06   ` Reinette Chatre
  2020-04-17 23:08   ` Reinette Chatre
  0 siblings, 2 replies; 5+ messages in thread
From: James Morse @ 2020-04-15 12:59 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: x86, LKML, Yu, Fenghua, Thomas Gleixner, mingo, bp, hpa, Moger,
	Babu, Luck, Tony

Hi Reinette,

Thanks for taking a look, all great questions!

On 14/04/2020 19:56, Reinette Chatre wrote:
> On 12/31/1969 4:00 PM, James Morse wrote:
>> These two patches are the tip of the MPAM iceberg.
>>
>> Arm have some CPU support for dividing caches into portions, and
>> applying bandwidth limits at various points in the SoC. The collective term
>> for these features is MPAM: Memory Partitioning and Monitoring.
>>
>> MPAM is similar enough to Intel RDT, that it should use the defacto linux
>> interface: resctrl. This filesystem currently lives under arch/x86, and is
>> tightly coupled to the architecture.
>> Ultimately, my plan is to split the existing resctrl code up to have an
>> arch<->fs abstraction, then move all the bits out to fs/resctrl. From there
>> MPAM can be wired up.
>>
>> These two patches are step one: Split the two structs that resctrl uses
>> to have an arch<->fs split. These sit on top of the cleanup posted here:
>> lore.kernel.org/r/20200214182401.39008-1-james.morse@arm.com
>>
>> I'm after strong opinions like "No! struct mbm_state is obviously arch
>> specific.". Making the hardware configuration belong to the arch code
>> instead of resctrl is so that it can be scaled on arm64, where MPAM
>> allows terrifyingly large portion bitmaps for the caches.
>>
>>
>>
>> Last time these were posted, the request was for the spec, and to see
>> the whole fully assembled iceberg.
>>
>> The spec is here:
>> https://static.docs.arm.com/ddi0598/ab/DDI0598A_b_MPAM_supp_armv8a.pdf
>>
>> For a slightly dated view of the whole tree:
>> 1. Don peril sensitive sunglasses
>> 2. https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/snapshot/feb
>>
>> The tree is generally RFC-quality. It gets more ragged once you get out of
>> the x86 code. I anticipate all the arm64 code being rewritten before its
>> considered for merging.
>>
>> (I haven't reposted the CDP origami as before, as I think that series
>> will be clearer if I re-order the patches ... it may even be shorter)
>>
>>
>> Does it all work? Almost. Monitor groups are proving to be a problem, I
>> can't see a way of getting these working without a user-visible change of
>> behaviour.
>> MPAMs counters aren't 1:1 with RMIDs, so supporting MBM_* on
>> any likely hardware will have to be via something other than resctrl.


> Before jumping to the details within the patches of this work ...
> 
> Could you please summarize the salient points from the spec, pointing
> readers to the spec for more information?

Its probably the differences with RDT that are important, so I'll focus on those:

MPAM has has a CPU interface that controls how the CPU tags traffic, and controls that are
scattered through the system, and are accessible via MMIO.

Arm doesn't specify a cache topology, (hence we have funny terms like PoU), so likewise
the controls, and where they are in the address space, are totally up to the implementer.
Linux would have to discover all this from an ACPI-table/DT-node.

MPAM has ~six different types of control, all of which are optional. Cache Portion bitmaps
correspond closely to CAT's behaviour. MBM's behaviour can be approximated with either the
bandwidth bitmap or the bandwidth max/min controls.

MPAM has partid's, which correspond perfectly with closid. It has performance monitors,
but these behave much more like a PMU than RDT's monitors. This is where it gets messy:

RDT has RMIDs as an independent value used for monitoring. There is one monitor per RMID.
MPAM has 'PMG', that subdivide the partid space. There are a number of PMU-like monitors
that can be configured to count for a partid, or a partid-and-pmg. It is very likely that
the number of these counters is much smaller than the number of partid, or partid+pmg.

PMG does not correspond to RMID, even though they look compatible from a distance.
(I can't see a way of fixing this is a compatible way. Supporting it via a different
mechanism is my best bet).

MPAM has something like the CPU interface for tagging traffic in Arm's irqchip and IOMMU.
MPAM has virtualisation support in the CPU, for mapping virtual partids to physical partids.


> A start would be how resctrl is expected to support MPAM.

Exactly as it works on an Intel Xeon today!
User-space should not be able to tell the difference.

I intend to get as much of MPAM going with this constraint as possible. We can then
discuss what needs changing/extending to allow other features to be used. (and what those
features are good for).

Practically this means that systems with MPAM can only use resctrl if they look a bit like
a Xeon. (i.e. bitmap controls on L2 or L3, bandwidth controls on-or-behind L3)


> Could you highlight where does resctrl interface or assumptions currently fall short?
> How is the planned MPAM integration addressing these shortcomings?

(shortcomings -> Design decisions that make sense for resctrl to support RDT.)

The resctrl monitor code implicitly assumes 'this' CPU all over the place. For MPAM any
'this' CPU may be a set of CPUs, which make this code tricky to work with.

tangent: (what! why? ... Reported by more than one of Arm's partners, is that part of
their system is made up of slices, which act as one component when integrated together. I
assume this helps scale the design, or improve the yield. There is one set of MPAM
controls per slice, which means each of the controls needs to be configured the same to
give one behaviour for the component. The MPAM controls may only be accessible from the
local CPUs if this slice thing contains CPUs... Arm's software model does exactly this
with its L3).


resctrl doesn't have the concept of having to allocate a counter for an RMID, because for
RDT these things are 1:1. Because the bandwidth counters are left running, and accessible
via the filesystem, MPAM cannot let resctrl use the bandwidth counters unless there are as
many PMU-like monitors as there are partid+pmg, which isn't likely.

My plan here is to add a 'resctrl_pmu' to perf. (I know there is some history with perf
here). This would allow perf to read the values that are already exposed via resctrl.
For MPAM this should give us the schedule-in/out hooks we need to allocate the PMU-like
counter when its actually being used.

I'd expect this to be done in the core /fs/ code, so that it uses a common interface to
resctrl, and works in exactly the same way with RDT. (no additional arch code means it
works on the next architecture to support resctrl too)


The rdt_resources_all array, and these alloc_enabled, alloc_capable flags are a really
neat trick to implement CDP on RDT. But this looks very much like an array of every SoC
that has been built, which is tricky if the topology is discovered at boot. It also leaves
properties of the user visible schemata file in the hands of the arch code.
I want it to be, very difficult, for Arm to invent new schemata without discussing whether
the interface is abstract enough to support on other architectures.
A good chunk of the MPAM tree is moving all the user-visible CDP behaviour into the code
that moves to /fs. This lets the arch code only deal with L2 or L3, and CDP becomes a
property of the configuration.

(this is probably one of the stranger looking changes. Secondary motivations are to avoid
emulating CDP under arch/arm64, and avoiding the nightmare SoC topology where there are
more MPAM controls to be configured, because of this slicing, in L2 than there are in
L2CODE....)


> Some higher level questions I have after scanning the spec and patches are:
> 
> * The spec contains many details of how MPAM supports virtualization. Is
> it expected that all of this would be supported with resctrl?

For now, that is totally out of scope.


> For
> example, while some registers may be abstracted it seems some interface
> may be needed to configure the virt to phy PARTID mappings. Information
> about how resctrl is envisioned to support MPAM's virtualization would
> help a lot.

As you asked!
The principle here is not to change the user-visible bits of resctrl at all.

We only need to support KVM, and the definition of KVM's virtual machine (memory layout,
number of CPUs etc) comes from its user-space virtual-machine-manager, which is typically
Qemu.
KVM only needs to support the CPU-interface parts of MPAM. The controls, configured via
MMIO could be emulated by the VMM. (this lets it give an accurate model of the machine it
is emulating)
The VMM would need to allocate a number of control groups via resctrl, then pass their
names (or preferably closid) to KVM to set up the mapping. The MPAM hardware performs the
mapping when the guest uses a partid.
When the guest tries to configure the controls, this would trap back to the VMM, as it
does for any emulated device. The VMM can then make the corresponding change in resctrl.

This lets the VMM emulate MPAM controls that the host doesn't have, or control MPAM for
the guest using a different user-space interface.

This has only been discussed vaguely with the Qemu folk, its all subject to change.

I would like to be able to allocate closid in the kernel for KVM guests, as this is a step
towards supporting other in-kernel users. (in particular, the IOMMU).


> * Looking at the commits (1385052cce87a8aed5dc0e96967cedd9e74a17e0 -
> "x86/resctrl: Group staged configuration into a separate struct") I
> found mention of a change in the schemata. Highlighting any planned
> resctrl interface changes would be very helpful.

No changes to the schemata! That would cause variation between RDT and MPAM, and fragment
user-space software.

The commit message is certainly terrible. That patch is part of the chunk that folds the
L3CODE, L3DATA and L3 resources together so that the arch code is only dealing with L3,
(or L2 or MBM). This is so that the CODE/DATA schema behaviour lives in the core /fs/
code, so that its much harder for it to behave differently on systems with MPAM.


> * Apart from actual interface changes, highlighting planned behavior
> changes and motivation for them would also be helpful … for example
> force enabling of CDP on all cache levels is a red flag to me.

Interesting. This is the change that makes the CDP on/off global, instead of per cache.
Its still controlled by user-space. (so nothing is forced).
Do you have systems that support CAT at L3 and L2, but only CDP at L3, not L2?
(I was under the impression the L2 stuff was all Atom, and the L3+MBM was all Xeon).

MPAM's equivalent to CDP is just part of the CPU interface. Its always on.
To support 'CDP on L2 but not L3', (neither of which exist), we'd need to have extra code:
"was I asked to pretend CDP is enabled on this cache".

As CDP affects the way you allocate closid, (that odd/even thing), which is global, it
makes sense that this is either on or off. (doing this let me support CDP without the arch
code doing anything special!)

Existence of hardware that does this would obviously change this.


> * I am curious about how the configurability of MPAM will be handled. It
> seems as though MPAM is highly configurable, how is this expected to be
> handled in resctrl?

ACPI-tables/DT to describe the topology, and then code to 'pick' which MPAM features map
best onto resctrl.
Its likely that there will be systems that have MPAM, but can't use resctrl without
user-visible changes. Discussing user-visible changes can happen at that point.


> For example, this message and KNOWN_ISSUES among the
> patches mentions an ABI issue that RMID is independent from CLOSID in
> RDT but PMG (like RMID) is dependent on PARTID (like CLOSID) in MPAM.

Indeed. Because there is nothing quite like RMID, there is nothing I can expose as
num_rmid. Whatever value I expose there will cause user-space to detect some breakage.


> There is a MATCH_PARTID configuration option in MPAM that makes PMG not
> depend on PARTID and thus seem to bring closer to RDT.

(crumbs, you've dug into this in some detail!)

... Welcome to the wonderful world of Arm specifications:

On page 234 of [0], it describes the CSU controls:
| If MATCH_PMG == 1 and MATCH_PARTID == 0, it is CONSTRAINED-UNPREDICTABLE whether the
| monitor instance:
| * Measures the storage used with matching PMG and with any PARTID.
| * Measures no storage usage, that is, MSMON_CSU.VALUE is zero.
| * Measures the storage used with matching PMG and PARTID, that is, treats
|   MATCH_PARTID as == 1.

Whenever you see something all-caps (CONSTRAINED-UNPREDICTABLE in this case), it means
this is pretty useless to general purpose software. We would need per-platform quirks for
which of these behaviours the implementer built.
"Measures the storage used with matching PMG and with any PARTID." is the one we wanted,
but the spec people made it optional, and its not discoverable.

Curiously, the text describing the bandwidth controls doesn't describe this as
unpredictable... I'll check that isn't an oversight. Being able to do this with the
bandwidth controls doesn't help, as we can't support those unless we have enough monitors
for the free-running files in resctrl.

Even if we could rely on this, we'd still have the problem that MPAM expects PMG to extend
the partid space. There may be very few PMG because they are expected to be used
per-partid. If there are fewer PMG than partid, we are straight back to square-one: 'no
monitor support'.


> I am surely not
> indicating that MPAM should be made to behave like RDT but it does seem
> that MPAM is very configurable.

> Is it the intention to support all ways in which MPAM can be used and
Today, no. Only those that map in a usable way to what resctrl already exposes to user-space.


> if so is the plan for resctrl so support
> making these configuration changes and then support them?

If someone comes up with a use-case that benefits from one of the the extra controls, we
can discuss how that could be abstracted to work on multiple architectures.


> For example,
> would you want resctrl to support all variations where MATCH_PARTID ==
> [0|1] and MATCH_PMG == [0|1]? My intention here is not to delve into
> these details in particular, instead I hope to use it as an example of
> what I mean when curious about how (if at all) resctrl is envisioned to
> support the configurability of MPAM.

I think not-at-all is the answer here. Resctrl's ABI is a ship that has sailed. Where at
all possible I intend to map what MPAM has to what resctrl exposes.

The num_rmid issue has me painted into a corner. No value there reflects the behaviour.
The only option may be not to expose any of the counters via the resctrl filesystem,
instead making them accessible via perf.
I think that would only fly if I can make it work on x86 too.


> It seems to me that MPAM may need more than what is currently available
> from resctrl

Ultimately yes, but the aim here isn't to support all of MPAM.
Its just to support what maps nicely. We can then discuss what to do next.


> but it is hard for me to digest a 276 page spec and 150
> patch series to fully understand what needs to be support and how to do
> so.

Yeah! Me to. But this is what Fenghua asked to see:
https://lore.kernel.org/lkml/20180827142204.GA223688@romley-ivt3.sc.intel.com/

You'll be glad to know I have no intention of posting all that in one go (obvious, but
worth saying).


> I look forward to learning more about the goals of what needs to be
> supported and your vision for resctrl to do so.

Thanks for going through this with such a level of detail. I'm sure we agree on the 'no
user-visible changes' aspect, so the next pieces is where the split between core code and
arch code should be.


Thanks,

James

[0] https://static.docs.arm.com/ddi0598/ba/DDI0598B_a_MPAM_supp_armv8ba.pdf

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH v2 0/2] x86/resctrl: Start abstraction for a second arch
  2020-04-15 12:59 ` James Morse
@ 2020-04-15 19:06   ` Reinette Chatre
  2020-04-17 23:08   ` Reinette Chatre
  1 sibling, 0 replies; 5+ messages in thread
From: Reinette Chatre @ 2020-04-15 19:06 UTC (permalink / raw)
  To: James Morse
  Cc: x86, LKML, Yu, Fenghua, Thomas Gleixner, mingo, bp, hpa, Moger,
	Babu, Luck, Tony

Hi James,

Thank you very much for your thorough response. I do have a lot to
digest from it but would like to at least respond promptly to a question
you included ...

On 4/15/2020 5:59 AM, James Morse wrote:
> On 14/04/2020 19:56, Reinette Chatre wrote:
>> On 12/31/1969 4:00 PM, James Morse wrote:

...

>> * Apart from actual interface changes, highlighting planned behavior
>> changes and motivation for them would also be helpful … for example
>> force enabling of CDP on all cache levels is a red flag to me.
> 
> Interesting. This is the change that makes the CDP on/off global, instead of per cache.

This is the one I referred to and a significant change.

> Its still controlled by user-space. (so nothing is forced).

Right, controlled with the mount option but the behavior is being
changed to apply to both L2 and L3, even if user requests just one of
the two.

Please note that in the documentation it is currently explicitly stated
that: "L2 and L3 CDP are controlled separately"

> Do you have systems that support CAT at L3 and L2, but only CDP at L3, not L2?
> (I was under the impression the L2 stuff was all Atom, and the L3+MBM was all Xeon).

Things are not as clear cut unfortunately. There is a new Atom system
that has a server uncore, thus inheriting some RDT features that have
previously only been seen on servers. L2 CAT/CDP is also moving to
servers in future server products.

You can find more details about RDT features in upcoming systems in
Chapter 9 of
https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf

> 
> MPAM's equivalent to CDP is just part of the CPU interface. Its always on.
> To support 'CDP on L2 but not L3', (neither of which exist), we'd need to have extra code:
> "was I asked to pretend CDP is enabled on this cache".
> 
> As CDP affects the way you allocate closid, (that odd/even thing), which is global, it

The odd/even is just for the CDP enabled resource, not global. It is
thus possible for, for example, the L3, L2CODE, and L2DATA resources to
be enabled. The odd/even is configured by the multiplier cbm_idx_mult
set in the resource configuration and used in cbm_idx(). Perhaps you
mean the CLOSID is global? By enabling these together it would reduce
the number of CLOSIDs that could be used by L3 in this example.

> makes sense that this is either on or off. (doing this let me support CDP without the arch
> code doing anything special!)
> 
> Existence of hardware that does this would obviously change this.
> 

Yes, there are systems that support L2 CAT/CDP and L3 CAT/CDP. CDP is
controlled separately on the different cache levels.

>> It seems to me that MPAM may need more than what is currently available
>> from resctrl
> 
> Ultimately yes, but the aim here isn't to support all of MPAM.
> Its just to support what maps nicely. We can then discuss what to do next.

Thank you for stating this. This is significant and was not clear to me
initially.

Reinette

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH v2 0/2] x86/resctrl: Start abstraction for a second arch
  2020-04-15 12:59 ` James Morse
  2020-04-15 19:06   ` Reinette Chatre
@ 2020-04-17 23:08   ` Reinette Chatre
  1 sibling, 0 replies; 5+ messages in thread
From: Reinette Chatre @ 2020-04-17 23:08 UTC (permalink / raw)
  To: James Morse
  Cc: x86, LKML, Yu, Fenghua, Thomas Gleixner, mingo, bp, hpa, Moger,
	Babu, Luck, Tony

Hi James,

On 4/15/2020 5:59 AM, James Morse wrote:
> On 14/04/2020 19:56, Reinette Chatre wrote:
>> On 12/31/1969 4:00 PM, James Morse wrote:
>>> These two patches are the tip of the MPAM iceberg.
>>>
>>> Arm have some CPU support for dividing caches into portions, and
>>> applying bandwidth limits at various points in the SoC. The collective term
>>> for these features is MPAM: Memory Partitioning and Monitoring.
>>>
>>> MPAM is similar enough to Intel RDT, that it should use the defacto linux
>>> interface: resctrl. This filesystem currently lives under arch/x86, and is
>>> tightly coupled to the architecture.
>>> Ultimately, my plan is to split the existing resctrl code up to have an
>>> arch<->fs abstraction, then move all the bits out to fs/resctrl. From there
>>> MPAM can be wired up.
>>>
>>> These two patches are step one: Split the two structs that resctrl uses
>>> to have an arch<->fs split. These sit on top of the cleanup posted here:
>>> lore.kernel.org/r/20200214182401.39008-1-james.morse@arm.com
>>>
>>> I'm after strong opinions like "No! struct mbm_state is obviously arch
>>> specific.". Making the hardware configuration belong to the arch code
>>> instead of resctrl is so that it can be scaled on arm64, where MPAM
>>> allows terrifyingly large portion bitmaps for the caches.
>>>
>>>
>>>
>>> Last time these were posted, the request was for the spec, and to see
>>> the whole fully assembled iceberg.
>>>
>>> The spec is here:
>>> https://static.docs.arm.com/ddi0598/ab/DDI0598A_b_MPAM_supp_armv8a.pdf
>>>
>>> For a slightly dated view of the whole tree:
>>> 1. Don peril sensitive sunglasses
>>> 2. https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/snapshot/feb
>>>
>>> The tree is generally RFC-quality. It gets more ragged once you get out of
>>> the x86 code. I anticipate all the arm64 code being rewritten before its
>>> considered for merging.
>>>
>>> (I haven't reposted the CDP origami as before, as I think that series
>>> will be clearer if I re-order the patches ... it may even be shorter)
>>>
>>>
>>> Does it all work? Almost. Monitor groups are proving to be a problem, I
>>> can't see a way of getting these working without a user-visible change of
>>> behaviour.
>>> MPAMs counters aren't 1:1 with RMIDs, so supporting MBM_* on
>>> any likely hardware will have to be via something other than resctrl.
> 
> 
>> Before jumping to the details within the patches of this work ...
>>
>> Could you please summarize the salient points from the spec, pointing
>> readers to the spec for more information?
> 
> Its probably the differences with RDT that are important, so I'll focus on those:
> 
> MPAM has has a CPU interface that controls how the CPU tags traffic, and controls that are
> scattered through the system, and are accessible via MMIO.

(... so many more controls than in RDT or PQoS.)

> 
> Arm doesn't specify a cache topology, (hence we have funny terms like PoU), so likewise
> the controls, and where they are in the address space, are totally up to the implementer.
> Linux would have to discover all this from an ACPI-table/DT-node.

ok, I see. I am just starting now to familiarize myself how you were
able to align all these concepts between MPAM and RDT using the
mpam_device and mpam_component.

> 
> MPAM has ~six different types of control, all of which are optional. Cache Portion bitmaps
> correspond closely to CAT's behaviour. MBM's behaviour can be approximated with either the
> bandwidth bitmap or the bandwidth max/min controls.

ok. Not all the partitioning types are quite clear to me
(proportional-stride partitioning specifically), but I can see that
there are similarities. (I am assuming you meant MBA)

> 
> MPAM has partid's, which correspond perfectly with closid.

I am not sure about "perfectly".

If I understand correctly at least there is the caveat that MPAM is like
RDT with CDP always enabled without reducing number of CLOSIDs?

But ...

The assumption of resctrl fs is that a task would belong to a single
resource group, which in turn represents a single CLOSID/PARTID(?). On
the other hand MPAM allows a task to have different PARTIDs for code and
data. It seems that perhaps MPAM will be supported in Linux by having
its PARTIDs behave as RDT with CDP enabled (the odd/even pairing) and
thus perhaps not as flexible as the architecture supports/intends?
(Although I see you later mention that the goal is _not_ for MPAM to
emulate CDP so it is not clear to me how this would be supported).

Would user looking at schemata on resctrl on Arm thus always see the
CODE/DATA prefix on the cache resources? I am trying to create a high
level idea of what a user's interaction with resctrl fs would look like
on an Arm system and how that would be translated on a high level to
what it means for the hardware (allocating new PARTIDs, setting of
PARTIDs, etc.). (small steps)

> It has performance monitors,
> but these behave much more like a PMU than RDT's monitors. This is where it gets messy:
> 
> RDT has RMIDs as an independent value used for monitoring. There is one monitor per RMID.

Currently up to three monitors per RMID if considering a monitor to be
the same as a counter?

> MPAM has 'PMG', that subdivide the partid space. There are a number of PMU-like monitors
> that can be configured to count for a partid, or a partid-and-pmg. It is very likely that
> the number of these counters is much smaller than the number of partid, or partid+pmg.

If I understand correctly the PARTID and PMG are global because that is
the MPAM information that accompanies all memory requests as it
traverses through the system.

In comparison the resource monitors seem to be entirely contained in the
MSC to which they belong ... and an MSC is allowed to have up to 2^16 of
each type.

It seems that it may at least theoretically be possible to have many
more counters than PARTID, or PARTID+PMG.

I see your point about them being more similar to PMU.

> 
> PMG does not correspond to RMID, even though they look compatible from a distance.
> (I can't see a way of fixing this is a compatible way. Supporting it via a different
> mechanism is my best bet).

I've been trying to different ideas also but cannot find a good fit in
resctrl either.

(I seem to have trouble mapping both the partioning and monitoring MPAM
features to resctrl ... apologies as it is taking me some time to catch
up with where you are at this time)

> 
> MPAM has something like the CPU interface for tagging traffic in Arm's irqchip and IOMMU.

I cannot find mention of this specifically in the new spec. Do you
perhaps have other documentation about this?

> MPAM has virtualisation support in the CPU, for mapping virtual partids to physical partids.
> 
> 
>> A start would be how resctrl is expected to support MPAM.
> 
> Exactly as it works on an Intel Xeon today!
> User-space should not be able to tell the difference.
> 
> I intend to get as much of MPAM going with this constraint as possible. We can then
> discuss what needs changing/extending to allow other features to be used. (and what those
> features are good for).
> 
> Practically this means that systems with MPAM can only use resctrl if they look a bit like
> a Xeon. (i.e. bitmap controls on L2 or L3, bandwidth controls on-or-behind L3)
> 

This sounds manageable.

> 
>> Could you highlight where does resctrl interface or assumptions currently fall short?
>> How is the planned MPAM integration addressing these shortcomings?
> 
> (shortcomings -> Design decisions that make sense for resctrl to support RDT.)
> 
> The resctrl monitor code implicitly assumes 'this' CPU all over the place. For MPAM any
> 'this' CPU may be a set of CPUs, which make this code tricky to work with.
> 
> tangent: (what! why? ... Reported by more than one of Arm's partners, is that part of
> their system is made up of slices, which act as one component when integrated together. I
> assume this helps scale the design, or improve the yield. There is one set of MPAM
> controls per slice, which means each of the controls needs to be configured the same to
> give one behaviour for the component. The MPAM controls may only be accessible from the
> local CPUs if this slice thing contains CPUs... Arm's software model does exactly this
> with its L3).

Apologies, the issue is not clear to me. I am familiar with the slicing
of L3, Intel systems do that also. Your concern that preceded this was
that the monitor code implicitly assumes "this" CPU. Could you please
elaborate more on this? The RMIDs have global scope within the package
(L3 cache domain instance) and keeps track of all CPUs associated with
it (rdt_domain->cpu_mask). The user queries each domain separately and
the counter value is read from any CPU associated with the particular
domain. How I see it the monitor code thus tracks which CPUs are
associated with a particular cache instance and would interact with an
appropriate CPU depending on which data/counter the user is requesting.

> resctrl doesn't have the concept of having to allocate a counter for an RMID, because for
> RDT these things are 1:1. Because the bandwidth counters are left running, and accessible
> via the filesystem, MPAM cannot let resctrl use the bandwidth counters unless there are as
> many PMU-like monitors as there are partid+pmg, which isn't likely.
> 
> My plan here is to add a 'resctrl_pmu' to perf. (I know there is some history with perf
> here). This would allow perf to read the values that are already exposed via resctrl.
> For MPAM this should give us the schedule-in/out hooks we need to allocate the PMU-like
> counter when its actually being used.
> 
> I'd expect this to be done in the core /fs/ code, so that it uses a common interface to
> resctrl, and works in exactly the same way with RDT. (no additional arch code means it
> works on the next architecture to support resctrl too)

Are you saying that resctrl fs would provide an interface to the
"resctrl_pmu" addition to perf?

> The rdt_resources_all array, and these alloc_enabled, alloc_capable flags are a really
> neat trick to implement CDP on RDT. But this looks very much like an array of every SoC
> that has been built, which is tricky if the topology is discovered at boot. It also leaves
> properties of the user visible schemata file in the hands of the arch code.
> I want it to be, very difficult, for Arm to invent new schemata without discussing whether
> the interface is abstract enough to support on other architectures.
> A good chunk of the MPAM tree is moving all the user-visible CDP behaviour into the code
> that moves to /fs. This lets the arch code only deal with L2 or L3, and CDP becomes a
> property of the configuration.

I did see the introduction of the new schema list that points to the
resources that appears to separate the user interface and the
architecture differences very well.

> 
> (this is probably one of the stranger looking changes. Secondary motivations are to avoid
> emulating CDP under arch/arm64, and avoiding the nightmare SoC topology where there are
> more MPAM controls to be configured, because of this slicing, in L2 than there are in
> L2CODE....)

... oh ... no to emulate CDP. (I need to understand how the two PARTIDs
of a task are configured ... I will keep digging through the spec and
the patches).

> 
> 
>> Some higher level questions I have after scanning the spec and patches are:
>>
>> * The spec contains many details of how MPAM supports virtualization. Is
>> it expected that all of this would be supported with resctrl?
> 
> For now, that is totally out of scope.
> 
> 
>> For
>> example, while some registers may be abstracted it seems some interface
>> may be needed to configure the virt to phy PARTID mappings. Information
>> about how resctrl is envisioned to support MPAM's virtualization would
>> help a lot.
> 
> As you asked!
> The principle here is not to change the user-visible bits of resctrl at all.
> 
> We only need to support KVM, and the definition of KVM's virtual machine (memory layout,
> number of CPUs etc) comes from its user-space virtual-machine-manager, which is typically
> Qemu.
> KVM only needs to support the CPU-interface parts of MPAM. The controls, configured via
> MMIO could be emulated by the VMM. (this lets it give an accurate model of the machine it
> is emulating)
> The VMM would need to allocate a number of control groups via resctrl, then pass their
> names (or preferably closid) to KVM to set up the mapping. The MPAM hardware performs the
> mapping when the guest uses a partid.
> When the guest tries to configure the controls, this would trap back to the VMM, as it
> does for any emulated device. The VMM can then make the corresponding change in resctrl.
> 
> This lets the VMM emulate MPAM controls that the host doesn't have, or control MPAM for
> the guest using a different user-space interface.
> 
> This has only been discussed vaguely with the Qemu folk, its all subject to change.
> 
> I would like to be able to allocate closid in the kernel for KVM guests, as this is a step
> towards supporting other in-kernel users. (in particular, the IOMMU).


Thank you very much for this insight into what is being considered.

>> I look forward to learning more about the goals of what needs to be
>> supported and your vision for resctrl to do so.
> 
> Thanks for going through this with such a level of detail. I'm sure we agree on the 'no
> user-visible changes' aspect, so the next pieces is where the split between core code and
> arch code should be.

Agreeing on "no user-visible changes" really helps to guide this initial
work. The split between core and arch may be more flexible and may be
done on an as-needed basis?

Reinette

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [RFC PATCH v2 0/2] x86/resctrl: Start abstraction for a second arch
@ 2020-02-14 18:29 James Morse
  0 siblings, 0 replies; 5+ messages in thread
From: James Morse @ 2020-02-14 18:29 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Fenghua Yu, Reinette Chatre, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H . Peter Anvin, Babu Moger, James Morse

Hi folks,

These two patches are the tip of the MPAM iceberg.

Arm have some CPU support for dividing caches into portions, and
applying bandwidth limits at various points in the SoC. The collective term
for these features is MPAM: Memory Partitioning and Monitoring.

MPAM is similar enough to Intel RDT, that it should use the defacto linux
interface: resctrl. This filesystem currently lives under arch/x86, and is
tightly coupled to the architecture.
Ultimately, my plan is to split the existing resctrl code up to have an
arch<->fs abstraction, then move all the bits out to fs/resctrl. From there
MPAM can be wired up.


These two patches are step one: Split the two structs that resctrl uses
to have an arch<->fs split. These sit on top of the cleanup posted here:
lore.kernel.org/r/20200214182401.39008-1-james.morse@arm.com

I'm after strong opinions like "No! struct mbm_state is obviously arch
specific.". Making the hardware configuration belong to the arch code
instead of resctrl is so that it can be scaled on arm64, where MPAM
allows terrifyingly large portion bitmaps for the caches.



Last time these were posted, the request was for the spec, and to see
the whole fully assembled iceberg.

The spec is here:
https://static.docs.arm.com/ddi0598/ab/DDI0598A_b_MPAM_supp_armv8a.pdf

For a slightly dated view of the whole tree:
1. Don peril sensitive sunglasses
2. https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/log/?h=mpam/snapshot/feb

The tree is generally RFC-quality. It gets more ragged once you get out of
the x86 code. I anticipate all the arm64 code being rewritten before its
considered for merging.

(I haven't reposted the CDP origami as before, as I think that series
will be clearer if I re-order the patches ... it may even be shorter)


Does it all work? Almost. Monitor groups are proving to be a problem, I
can't see a way of getting these working without a user-visible change of
behaviour.
MPAMs counters aren't 1:1 with RMIDs, so supporting MBM_* on
any likely hardware will have to be via something other than resctrl.


Thanks,

James Morse (2):
  x86/resctrl: Split struct rdt_resource
  x86/resctrl: Split struct rdt_domain

 arch/x86/kernel/cpu/resctrl/core.c        | 257 +++++++++++++---------
 arch/x86/kernel/cpu/resctrl/ctrlmondata.c |  16 +-
 arch/x86/kernel/cpu/resctrl/internal.h    | 157 +++----------
 arch/x86/kernel/cpu/resctrl/monitor.c     |  29 ++-
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |   4 +-
 arch/x86/kernel/cpu/resctrl/rdtgroup.c    |  77 ++++---
 include/linux/resctrl.h                   | 133 +++++++++++
 7 files changed, 389 insertions(+), 284 deletions(-)

-- 
2.24.1


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2020-04-17 23:08 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-14 18:56 [RFC PATCH v2 0/2] x86/resctrl: Start abstraction for a second arch Reinette Chatre
2020-04-15 12:59 ` James Morse
2020-04-15 19:06   ` Reinette Chatre
2020-04-17 23:08   ` Reinette Chatre
  -- strict thread matches above, loose matches on Subject: below --
2020-02-14 18:29 James Morse

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).