linux-kernel.vger.kernel.org archive mirror
From: James Morse <james.morse@arm.com>
To: Tony Luck <tony.luck@intel.com>
Cc: x86@kernel.org, linux-kernel@vger.kernel.org,
	Fenghua Yu <fenghua.yu@intel.com>,
	Reinette Chatre <reinette.chatre@intel.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	H Peter Anvin <hpa@zytor.com>, Babu Moger <Babu.Moger@amd.com>,
	shameerali.kolothum.thodi@huawei.com,
	D Scott Phillips OS <scott@os.amperecomputing.com>,
	carl@os.amperecomputing.com, lcherian@marvell.com,
	bobo.shaobowang@huawei.com, tan.shaopeng@fujitsu.com,
	xingxin.hx@openanolis.org, baolin.wang@linux.alibaba.com,
	Jamie Iles <quic_jiles@quicinc.com>,
	Xin Hao <xhao@linux.alibaba.com>,
	peternewman@google.com, Drew Fustini <dfustini@baylibre.com>
Subject: Re: [PATCH v3 00/19] x86/resctrl: monitored closid+rmid together, separate arch/fs locking
Date: Thu, 25 May 2023 18:31:54 +0100
Message-ID: <bece178a-0e4e-f73d-92e7-4f603b4211d0@arm.com>
In-Reply-To: <ZGz0as//iEbRpHHs@agluck-desk3>

Hi Tony,

(CC: +Drew)

On 23/05/2023 18:14, Tony Luck wrote:
> Looking at the changes already applied, and those planned to support
> new architectures, new features, and quirks in specific implementations,
> it is clear to me that the original resctrl file system implementation
> did not provide enough flexibility for all the additions that are
> needed.
> 
> So I've begun musing with 20-20 hindsight on how resctrl could have
> provided better abstract building blocks.

Heh, hindsight is 20:20!

My responses below are pretty much entirely about how this looks to user-space, as that is
the bit we can't change.


> The concept of a "resource" structure with a list of domains for
> specific instances of that structure on a platform still seems like
> a good building block.

True, but having platform specific resource types does reduce the effectiveness of a
shared interface. User-space just has to have platform specific knowledge for these.

I think defining resources in terms of other things that are visible to user-space through
sysfs is the best approach. The CAT L3 schema does this.


In terms of the values used, I'd much prefer 'weights' or some other abstraction had been
used, to allow the kernel to pick the hardware configuration itself.
Similarly, properties like isolation between groups should be made explicit, instead of
asking "did user-space mean to set shared bits in those bitmaps?"

This stuff is the reason why resctrl can't support MPAM's 'CMAX', which gives a maximum
capacity limit for a cache, but doesn't implicitly isolate groups.
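
To illustrate the ambiguity, here is a minimal user-space sketch in Python. The helper
names are hypothetical and not part of any resctrl tooling. It checks whether two cache
bitmaps share any bits, and whether a bitmap is contiguous. Nothing in the bitmaps
themselves says whether an overlap was meant as deliberate sharing or was a mistake.

```python
def shares_bits(mask_a: int, mask_b: int) -> bool:
    """True if the two capacity bitmaps overlap, i.e. the groups are not isolated."""
    return (mask_a & mask_b) != 0


def is_contiguous(mask: int) -> bool:
    """True if all set bits in the bitmap form a single run of 1s."""
    if mask == 0:
        return False
    lowest = mask & -mask          # isolate the lowest set bit
    shifted = mask // lowest       # drop the trailing zeros
    # A run of 1s plus one is a power of two, so the AND is zero.
    return (shifted & (shifted + 1)) == 0
```

With these, 0xf0 and 0x0f are isolated from each other, while 0xf0 and 0x3c overlap. The
interface cannot tell which of those two outcomes user-space actually wanted.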


> But sharing those structures across increasingly different implementations
> of the underlying resource is resulting in extra gymnastic efforts to
> make all the new uses co-exist with the old. E.g. the domain structure
> has elements for every type of resource even though each instance is
> linked to just one resource type.

> I had begun this journey with a plan to just allow new features to
> hook into the existing resctrl filesystem with a "driver" registration
> mechanism:
> 
> https://lore.kernel.org/all/20230420220636.53527-1-tony.luck@intel.com/
> 
> But feedback from Reinette that this would be cleaner if drivers created
> new resources, rather than adding a patchwork of callback functions with
> special case "if (type == DRIVER)" sprinkled around made me look into
> a more radical redesign instead of joining in the trend of making the
> smallest set of changes to meet my goals.
> 
> 
> Goals:
> 1) User interfaces for existing resource control features should be
> unchanged.
> 
> 2) Admin interface should have the same capabilities, but interfaces
> may change. E.g. kernel command line and mount options may be replaced
> by choosing which resource modules to load.
> 
> 3) Should be easy to create new modules to handle big differences
> between implementations, or to handle model specific features that
> may not exist in the same form across multiple CPU generations.

One difficulty is knowing that some behaviour is going to be platform specific: it's not until
the next generation is different that you know there was something wrong with the first.

The other difficulty is user-space expecting a resource that turned out to be platform-specific,
or was 'enhanced' in a subsequent version and doesn't behave in the same way.

I suspect we need two sets of resources, those that are abstracted to work in a portable
way between platforms and architectures - and the wild west.
The next trick is moving things between the two!


> Initial notes:
> 
> Core resctrl filesystem functionality will just be:
> 
> 1) Mount/unmount of filesystem. Architecture hook to allocate monitor
> and control IDs for the default group.
> 
> 2) Creation/removal/rename of control and monitor directories (with
> call to architecture specific code to allocate/free the control and monitor
> IDs to attach to the directory).
> 
> 3) Maintaining the "tasks" file with architecture code to update the
> control and monitor IDs in the task structure.
> 
> 4) Maintaining the "cpus" file - similar to "tasks"
> 
> 5) Context switch code to update h/w with control/monitor IDs.
> 
> 6) CPU hotplug interface to build and maintain domain list for each
> registered resource.
> 
> 7) Framework for "schemata" file. Calls to resource specific functions
> to maintain each line in the file.

> 8) Resource registration interface for modules to add new resources
> to the list (and remove them on module unload). Modules may add files
> to the info/ hierarchy, and also to each mon_data/ directory and/or
> to each control/control_mon directory.

I worry that this can lead to architecture specific schema, then each architecture having
a subtly different version. I think it would be good to keep all the user-ABI in one place
so it doesn't get re-invented. I agree it's hard to know what the next platform will look like.
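
For reference, this is roughly what user-space has to do with a schemata line today. It is
a minimal Python sketch, assuming the "<resource>:<domain>=<value>;<domain>=<value>" layout
that x86 resctrl uses. The function name is hypothetical. The values are left as strings
because their meaning (hex bitmap, percentage, and so on) depends on the resource, which is
exactly where per-architecture variants would diverge.

```python
def parse_schemata_line(line: str) -> tuple[str, dict[int, str]]:
    """Split one schemata line into (resource name, {domain id: raw value})."""
    name, _, rest = line.strip().partition(":")
    if not name or not rest:
        raise ValueError(f"not a schemata line: {line!r}")
    domains: dict[int, str] = {}
    for entry in rest.split(";"):
        dom, sep, value = entry.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in {entry!r}")
        # The value is not interpreted here; that is resource specific.
        domains[int(dom)] = value.strip()
    return name.strip(), domains
```

A parser like this stays portable only as long as every architecture keeps the same line
layout and the same value semantics per resource name.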

One difference I can't get my head round is how to handle platforms that use relative
percentages and fractions - and those that take an absolute MB/s value.
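
A small sketch of why the two don't map onto each other. This is Python with hypothetical
names. The peak_mbps parameter is an assumption: converting in either direction needs a
reference peak bandwidth, and that is not something the current interface gives
user-space in a portable way.

```python
def percent_to_mbps(percent: float, peak_mbps: float) -> float:
    """Convert a relative limit into an absolute one, given an assumed peak."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be within 0..100")
    return peak_mbps * percent / 100.0


def mbps_to_percent(mbps: float, peak_mbps: float) -> float:
    """Convert an absolute limit into a relative one, capped at 100%."""
    if peak_mbps <= 0:
        raise ValueError("peak_mbps must be positive")
    return min(100.0, 100.0 * mbps / peak_mbps)
```

The same 10000 MB/s request comes out as 50% on a platform assumed to peak at 20000 MB/s
and 25% on one assumed to peak at 40000 MB/s, so neither form is a drop-in replacement for
the other.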


> 9) Note that the core code starts with an empty list of resources.
> System admins must load modules to add support for each resource they
> want to use.

I think this just moves the problem to modules. 'CAT' would get duplicated by all
architectures. MB is subtly different between them all, but user-space doesn't want to be
concerned with the differences.


> We'd need a bunch of modules to cover existing x86 functionality. E.g.
> an "L3" one for standard L3 cache allocation, an "L3CDP" one to be used
> instead of the plain "L3" one for code/data priority mode by creating
> a separate resource for each of code & data.

CDP may have system wide side-effects. For MPAM, if you enable the emulation of CDP, then
resources that resctrl doesn't believe use CDP have to double-configure and double-count
everything.


> Logically separate mbm_local, mbm_total, and llc_cache_occupancy modules
> (though could combine the mbm ones because they both need a periodic
> counter read to avoid wraparound). "MB" for memory bandwidth allocation.

llc_cache_occupancy isn't a counter, but I'd prefer to bundle the others through perf.
That already has an interface for discovering and configuring events. I understand it was
tried and removed, but I think I've got a handle on making this work.
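
On the wraparound point in the quoted text: the periodic read is needed because the
hardware counter is narrower than the value reported to user-space. A minimal Python
sketch of the delta calculation, with a hypothetical function name and a 24-bit width
picked purely for illustration:

```python
def counter_delta(prev: int, cur: int, width_bits: int) -> int:
    """Units counted between two reads of a free-running counter.

    Only correct if the counter wrapped at most once between the reads,
    which is why something has to read it periodically.
    """
    mask = (1 << width_bits) - 1
    return (cur - prev) & mask


# Software accumulates the deltas into a wider running total.
total = 0
total += counter_delta(0xFFFFF0, 0x000010, 24)   # the counter wrapped once here
```

Whoever owns the counters (resctrl or perf) has to do this accumulation somewhere.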


> The "mba_MBps" mount option would be replaced with a module that does
> both memory bandwidth allocation and monitoring, with a s/w feedback loop.

Keeping purely software features self contained is a great idea.
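
As a rough picture of what such a self-contained module would do, here is a simplified
feedback step in Python. It is not the kernel's algorithm. The function name, step size
and limits are all assumptions made for the example.

```python
def next_throttle(percent: int, measured_mbps: float, target_mbps: float,
                  step: int = 10, min_percent: int = 10,
                  max_percent: int = 100) -> int:
    """One iteration of a software feedback loop.

    Lower the hardware percentage when the measured bandwidth is over the
    target, raise it when under, and clamp to the allowed range.
    """
    if measured_mbps > target_mbps:
        return max(min_percent, percent - step)
    if measured_mbps < target_mbps:
        return min(max_percent, percent + step)
    return percent
```

The loop only needs a monitoring value in and a control value out, which is why it can
sit entirely on top of the other resources.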


> Peter's workaround for the quirks of AMD monitoring could become a
> separate AMD specific module. But minor differences (e.g. contiguous
> cache bitmask Intel requirements) could be handled within a module
> if desired.
> 
> Pseudo-locking would be another case to load a different module to
> set up pseudo-locking and enforce the cache bitmask rules between resctrl
> groups instead of the basic cache allocation one.
> 
> Core resctrl code could handle overlaps between modules that want to
> control the same resource with a "first to load reserves that feature"
> policy.
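
For the record, a "first to load reserves that feature" policy could be as small as the
following Python sketch. The class, the method names and the module names in the usage
note are hypothetical, and this only models the bookkeeping, not module loading.

```python
class FeatureRegistry:
    """Tracks which module currently owns each feature."""

    def __init__(self) -> None:
        self._owners: dict[str, str] = {}

    def claim(self, feature: str, module: str) -> bool:
        """Reserve a feature. Fails if another module got there first."""
        owner = self._owners.setdefault(feature, module)
        return owner == module

    def release(self, feature: str, module: str) -> None:
        """Give a feature back, e.g. on module unload. Only the owner can."""
        if self._owners.get(feature) == module:
            del self._owners[feature]

    def owner(self, feature: str):
        """Return the owning module name, or None if the feature is free."""
        return self._owners.get(feature)
```

Under this policy, if a plain cache allocation module claims "L3" first, a later
pseudo-locking module's claim on "L3" fails until the first one is unloaded.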

> Are there additional ARM specific architectural requirements that this
> approach isn't addressing? Could the core functionality be extended to
> make life easier for ARM?

(We've got RISC-V to consider too - hence adding Drew Fustini [0])

My known issues list is:
* RMIDs.
   These are an independent number space for RDT. For MPAM they are an
   extension of the partid/closid space. There is no value that can be
   exposed to user-space as num_rmid as it depends on how they will be
   used.

* Monitors.
   RDT has one counter per RMID, and they run continuously. MPAM advertises
   how many monitors it has, which is very likely to be fewer than we
   want. This means MPAM can't expose the free-running MBM_* counters
   via the filesystem. These would need exposing via perf.

* Bitmaps.
   MPAM has some bitmaps, but it has other stuff too. Forcing the bitmaps
   to be the user-space interface requires the underlying control to be
   capable of isolation. Ideally user-space would set a priority/cost for
   each rdtgroup, and indicate whether they should be isolated from other
   rdtgroups at the same level.

* Memory bandwidth.
   For MB resources that control bandwidth, x86 provides user-space with
   the cache-id of the cache that implements those bandwidth controls. For
   MPAM there is no expectation that this is being done by a cache; it could
   equally be a memory controller.
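
On the RMID point, one way to picture "an extension of the partid/closid space" is that a
monitoring group is identified by the (closid, rmid) pair rather than by the rmid alone.
The Python sketch below uses a hypothetical index layout, not one taken from any driver,
and rmids_per_closid is an assumed parameter.

```python
def encode_idx(closid: int, rmid: int, rmids_per_closid: int) -> int:
    """Pack a (closid, rmid) pair into one flat index."""
    if not 0 <= rmid < rmids_per_closid:
        raise ValueError("rmid out of range for this closid")
    return closid * rmids_per_closid + rmid


def decode_idx(idx: int, rmids_per_closid: int) -> tuple[int, int]:
    """Recover the (closid, rmid) pair from a flat index."""
    return divmod(idx, rmids_per_closid)
```

With a layout like this the rmid only means something alongside its closid, so the
number of monitoring groups available depends on how the closids get used.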


I'd really like to have these solved as part of a cross-architecture user-space ABI. I'm
not sure platform-specific modules solve the user-space problem.


Otherwise MPAM has additional types of control, which could be applied to any kind of
resource. The oddest is 'PRI', which is just a priority. I've not yet heard of a system
using it, but this could appear at any choke point in the SoC; it may not be on a cache or
memory controller.

The 'types of control' and 'resource' distinction may help in places where Intel/AMD take
wildly different values to configure the same resource. (*cough* MB)


Thanks,

James


[0] lore.kernel.org/r/20230430-riscv-cbqri-rfc-v2-v2-0-8e3725c4a473@baylibre.com
