All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH RFC 0/6] Memory b/w allocation software controller
@ 2018-03-29 22:26 Vikas Shivappa
  2018-03-29 22:26 ` [PATCH 1/6] x86/intel_rdt/mba_sc: Add documentation for MBA " Vikas Shivappa
                   ` (5 more replies)
  0 siblings, 6 replies; 21+ messages in thread
From: Vikas Shivappa @ 2018-03-29 22:26 UTC (permalink / raw)
  To: vikas.shivappa, tony.luck, ravi.v.shankar, fenghua.yu,
	sai.praneeth.prakhya, x86, tglx, hpa
  Cc: linux-kernel, ak, vikas.shivappa

Intel RDT memory bandwidth allocation (MBA) currently uses the resctrl
interface and uses the schemata file in each rdtgroup to specify the max
b/w percentage that is allowed to be used by the "threads" and "cpus" in
the rdtgroup. These values are specified "per package" in each rdtgroup
in the schemata file as below:

$ cat /sys/fs/resctrl/p1/schemata 
    L3:0=7ff;1=7ff
    MB:0=100;1=50

In the above example the MB is the memory bandwidth percentage and "0"
and "1" specify the package/socket ids. The threads in rdtgroup "p1"
would get 100% memory b/w on socket0 and 50% b/w on socket1.

However, Memory bandwidth allocation (MBA) is a core specific mechanism
which means that when the Memory b/w percentage is specified in the
schemata per package it actually is applied on a per core basis via
IA32_MBA_THRTL_MSR interface. This may lead to confusion in scenarios
below:

1. User may not see increase in actual b/w when percentage values are
   increased:

This can occur when aggregate L2 external b/w is more than L3 external
b/w. Consider an SKL SKU with 24 cores on a package and where L2
external b/w is 10GBps (hence aggregate L2 external b/w is 240GBps) and
L3 external b/w is 100GBps. Now a workload with '20 threads, having 50%
b/w, each consuming 5GBps' consumes the max L3 b/w of 100GBps although
the percentage value specified is only 50% << 100%. Hence increasing
the b/w percentage will not yeild any more b/w. This is because
although the L2 external b/w still has capacity, the L3 external b/w
is fully used. Also note that this would be dependent on number of
cores the benchmark is run on.

2. Same b/w percentage may mean different actual b/w depending on # of
   threads:

For the same SKU in #1, a 'single thread, with 10% b/w' and '4 thread,
with 10% b/w' can consume upto 10GBps and 40GBps although they have same
percentage b/w of 10%. This is simply because as threads start using
more cores in an rdtgroup, the actual b/w may increase or vary although
user specified b/w percentage is same.

In order to mitigate this and make the interface more user friendly, we
can let the user specify the max bandwidth per rdtgroup in bytes(or mega
bytes). The kernel underneath would use a software feedback mechanism or
a "Software Controller" which reads the actual b/w using MBM counters
and adjust the memowy bandwidth percentages to ensure the "actual b/w
< user b/w".

The legacy behaviour is default and user can switch to the "MBA software
controller" mode using a mount option 'mba_MB'.

To use the feature mount the file system using mba_MB option:
 
$ mount -t resctrl resctrl [-o cdp[,cdpl2][mba_MB]] /sys/fs/resctrl

We could also use a config option as suggested by Fenghua. This may be
useful in situations where other resources need such options and we dont
have to keep growing the if else in the mount. However it needs enough
isolation when implemented with respect to resetting the values.

If the MBA is specified in MB(megabytes) then user can enter the max b/w
in MB rather than the percentage values. The default when mounted is
max_u32.

$ echo "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata
$ echo "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata

In the above example the tasks in "p1" and "p0" rdtgroup
would use a max b/w of 1024MBps on socket0 and 500MBps on socket1.

Vikas Shivappa (6):
  x86/intel_rdt/mba_sc: Add documentation for MBA software controller
  x86/intel_rdt/mba_sc: Add support to enable/disable via mount option
  x86/intel_rdt/mba_sc: Add initialization support
  x86/intel_rdt/mba_sc: Add schemata support
  x86/intel_rdt/mba_sc: Add counting for MBA software controller
  x86/intel_rdt/mba_sc: Add support to dynamically update the memory b/w

 Documentation/x86/intel_rdt_ui.txt          |  63 +++++++++++++++++
 arch/x86/kernel/cpu/intel_rdt.c             |  50 +++++++++----
 arch/x86/kernel/cpu/intel_rdt.h             |  34 ++++++++-
 arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c |  10 ++-
 arch/x86/kernel/cpu/intel_rdt_monitor.c     | 105 +++++++++++++++++++++++++---
 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c    |  34 ++++++++-
 6 files changed, 268 insertions(+), 28 deletions(-)

-- 
1.9.1

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/6] x86/intel_rdt/mba_sc: Add documentation for MBA software controller
  2018-03-29 22:26 [PATCH RFC 0/6] Memory b/w allocation software controller Vikas Shivappa
@ 2018-03-29 22:26 ` Vikas Shivappa
  2018-04-03  9:46   ` Thomas Gleixner
  2018-03-29 22:26 ` [PATCH 2/6] x86/intel_rdt/mba_sc: Add support to enable/disable via mount option Vikas Shivappa
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 21+ messages in thread
From: Vikas Shivappa @ 2018-03-29 22:26 UTC (permalink / raw)
  To: vikas.shivappa, tony.luck, ravi.v.shankar, fenghua.yu,
	sai.praneeth.prakhya, x86, tglx, hpa
  Cc: linux-kernel, ak, vikas.shivappa

Add documentation about usage which includes the "schemata" format and
use case for MBA software controller.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 Documentation/x86/intel_rdt_ui.txt | 63 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 63 insertions(+)

diff --git a/Documentation/x86/intel_rdt_ui.txt b/Documentation/x86/intel_rdt_ui.txt
index 71c3098..3b9634e 100644
--- a/Documentation/x86/intel_rdt_ui.txt
+++ b/Documentation/x86/intel_rdt_ui.txt
@@ -315,6 +315,60 @@ Memory b/w domain is L3 cache.
 
 	MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...
 
+Memory bandwidth(b/w) in MegaBytes
+----------------------------------
+
+Memory bandwidth is a core specific mechanism which means that when the
+Memory b/w percentage is specified in the schemata per package it
+actually is applied on a per core basis via IA32_MBA_THRTL_MSR
+interface. This may lead to confusion in scenarios below:
+
+1. User may not see increase in actual b/w when percentage values are
+   increased:
+
+This can occur when aggregate L2 external b/w is more than L3 external
+b/w. Consider an SKL SKU with 24 cores on a package and where L2
+external b/w is 10GBps (hence aggregate L2 external b/w is 240GBps) and
+L3 external b/w is 100GBps. Now a workload with '20 threads, having 50%
+b/w, each consuming 5GBps' consumes the max L3 b/w of 100GBps although
+the percentage value specified is only 50% << 100%. Hence increasing
+the b/w percentage will not yeild any more b/w. This is because
+although the L2 external b/w still has capacity, the L3 external b/w
+is fully used. Also note that this would be dependent on number of
+cores the benchmark is run on.
+
+2. Same b/w percentage may mean different actual b/w depending on # of
+   threads:
+
+For the same SKU in #1, a 'single thread, with 10% b/w' and '4 thread,
+with 10% b/w' can consume upto 10GBps and 40GBps although they have same
+percentage b/w of 10%. This is simply because as threads start using
+more cores in an rdtgroup, the actual b/w may increase or vary although
+user specified b/w percentage is same.
+
+In order to mitigate this and make the interface more user friendly, we
+can let the user specify the max bandwidth per rdtgroup in bytes(or mega
+bytes). The kernel underneath would use a software feedback mechanism or
+a "Software Controller" which reads the actual b/w using MBM counters
+and adjust the memowy bandwidth percentages to ensure the "actual b/w
+< user b/w".
+
+The legacy behaviour is default and user can switch to the "MBA software
+controller" mode using a mount option 'mba_MB'.
+
+To use the feature mount the file system using mba_MB option:
+ 
+# mount -t resctrl resctrl [-o cdp[,cdpl2][mba_MB]] /sys/fs/resctrl
+
+The schemata format is below:
+
+Memory b/w Allocation in Megabytes
+----------------------------------
+
+Memory b/w domain is L3 cache.
+
+	MB:<cache_id0>=bw_MB0;<cache_id1>=bw_MB1;...
+
 Reading/writing the schemata file
 ---------------------------------
 Reading the schemata file will show the state of all resources
@@ -358,6 +412,15 @@ allocations can overlap or not. The allocations specifies the maximum
 b/w that the group may be able to use and the system admin can configure
 the b/w accordingly.
 
+If the MBA is specified in MB(megabytes) then user can enter the max b/w in MB
+rather than the percentage values.
+
+# echo "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata
+# echo "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata
+
+In the above example the tasks in "p1" and "p0" on socket 0 would use a max b/w
+of 1024MB where as on socket 1 they would use 500MB.
+
 Example 2
 ---------
 Again two sockets, but this time with a more realistic 20-bit mask.
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/6] x86/intel_rdt/mba_sc: Add support to enable/disable via mount option
  2018-03-29 22:26 [PATCH RFC 0/6] Memory b/w allocation software controller Vikas Shivappa
  2018-03-29 22:26 ` [PATCH 1/6] x86/intel_rdt/mba_sc: Add documentation for MBA " Vikas Shivappa
@ 2018-03-29 22:26 ` Vikas Shivappa
  2018-03-30  9:32   ` Thomas Gleixner
  2018-03-29 22:26 ` [PATCH 3/6] x86/intel_rdt/mba_sc: Add initialization support Vikas Shivappa
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 21+ messages in thread
From: Vikas Shivappa @ 2018-03-29 22:26 UTC (permalink / raw)
  To: vikas.shivappa, tony.luck, ravi.v.shankar, fenghua.yu,
	sai.praneeth.prakhya, x86, tglx, hpa
  Cc: linux-kernel, ak, vikas.shivappa

Specify a new mount option "mba_MB" to enable the user to specify MBA
bandwidth in Megabytes(Software Controller/SC) instead of b/w
percentage:

$mount -t resctrl resctrl [-o cdp[,cdpl2][mba_MB]] /sys/fs/resctrl

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 arch/x86/kernel/cpu/intel_rdt.h          |  5 +++++
 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 31 +++++++++++++++++++++++++++++--
 2 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
index 3fd7a70..3e9bc3f 100644
--- a/arch/x86/kernel/cpu/intel_rdt.h
+++ b/arch/x86/kernel/cpu/intel_rdt.h
@@ -259,6 +259,7 @@ struct rdt_cache {
  * @min_bw:		Minimum memory bandwidth percentage user can request
  * @bw_gran:		Granularity at which the memory bandwidth is allocated
  * @delay_linear:	True if memory B/W delay is in linear scale
+ * @bw_byte:		True if memory B/W is specified in bytes
  * @mb_map:		Mapping of memory B/W percentage to memory B/W delay
  */
 struct rdt_membw {
@@ -266,6 +267,7 @@ struct rdt_membw {
 	u32		min_bw;
 	u32		bw_gran;
 	u32		delay_linear;
+	bool		bw_byte;
 	u32		*mb_map;
 };
 
@@ -295,6 +297,9 @@ static inline bool is_mbm_event(int e)
 		e <= QOS_L3_MBM_LOCAL_EVENT_ID);
 }
 
+#define is_mba_linear() rdt_resources_all[RDT_RESOURCE_MBA].membw.delay_linear
+#define is_mba_MBctrl() rdt_resources_all[RDT_RESOURCE_MBA].membw.bw_byte
+
 /**
  * struct rdt_resource - attributes of an RDT resource
  * @rid:		The index of the resource
diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index fca759d..0707191 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -1041,6 +1041,24 @@ static int set_cache_qos_cfg(int level, bool enable)
 	return 0;
 }
 
+static void __set_mba_byte_ctrl(bool byte_ctrl)
+{
+	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA];
+
+	r->membw.bw_byte = byte_ctrl;
+}
+
+/*
+ * MBA allocation in bytes is only supported if
+ * MBM is supported and MBA is in linear scale.
+*/
+static void set_mba_byte_ctrl(bool byte_ctrl)
+{
+	if ((is_mbm_enabled() && is_mba_linear()) &&
+	    byte_ctrl != is_mba_MBctrl())
+		__set_mba_byte_ctrl(byte_ctrl);
+}
+
 static int cdp_enable(int level, int data_type, int code_type)
 {
 	struct rdt_resource *r_ldata = &rdt_resources_all[data_type];
@@ -1104,7 +1122,7 @@ static void cdp_disable_all(void)
 		cdpl2_disable();
 }
 
-static int parse_rdtgroupfs_options(char *data)
+static int parse_rdtgroupfs_options(char *data, bool *mba_MBctrl)
 {
 	char *token, *o = data;
 	int ret = 0;
@@ -1123,6 +1141,8 @@ static int parse_rdtgroupfs_options(char *data)
 			ret = cdpl2_enable();
 			if (ret)
 				goto out;
+		} else if (!strcmp(token, "mba_MB")) {
+			*mba_MBctrl = true;
 		} else {
 			ret = -EINVAL;
 			goto out;
@@ -1209,6 +1229,7 @@ static struct dentry *rdt_mount(struct file_system_type *fs_type,
 				int flags, const char *unused_dev_name,
 				void *data)
 {
+	bool mba_MBctrl = false;
 	struct rdt_domain *dom;
 	struct rdt_resource *r;
 	struct dentry *dentry;
@@ -1224,7 +1245,7 @@ static struct dentry *rdt_mount(struct file_system_type *fs_type,
 		goto out;
 	}
 
-	ret = parse_rdtgroupfs_options(data);
+	ret = parse_rdtgroupfs_options(data, &mba_MBctrl);
 	if (ret) {
 		dentry = ERR_PTR(ret);
 		goto out_cdp;
@@ -1277,6 +1298,9 @@ static struct dentry *rdt_mount(struct file_system_type *fs_type,
 			mbm_setup_overflow_handler(dom, MBM_OVERFLOW_INTERVAL);
 	}
 
+	if (mba_MBctrl)
+		set_mba_byte_ctrl(true);
+
 	goto out;
 
 out_mondata:
@@ -1445,6 +1469,9 @@ static void rdt_kill_sb(struct super_block *sb)
 	cpus_read_lock();
 	mutex_lock(&rdtgroup_mutex);
 
+	/*Set the control values before the rest of reset*/
+	set_mba_byte_ctrl(false);
+
 	/*Put everything back to default values. */
 	for_each_alloc_enabled_rdt_resource(r)
 		reset_all_ctrls(r);
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 3/6] x86/intel_rdt/mba_sc: Add initialization support
  2018-03-29 22:26 [PATCH RFC 0/6] Memory b/w allocation software controller Vikas Shivappa
  2018-03-29 22:26 ` [PATCH 1/6] x86/intel_rdt/mba_sc: Add documentation for MBA " Vikas Shivappa
  2018-03-29 22:26 ` [PATCH 2/6] x86/intel_rdt/mba_sc: Add support to enable/disable via mount option Vikas Shivappa
@ 2018-03-29 22:26 ` Vikas Shivappa
  2018-04-03  9:52   ` Thomas Gleixner
  2018-03-29 22:26 ` [PATCH 4/6] x86/intel_rdt/mba_sc: Add schemata support Vikas Shivappa
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 21+ messages in thread
From: Vikas Shivappa @ 2018-03-29 22:26 UTC (permalink / raw)
  To: vikas.shivappa, tony.luck, ravi.v.shankar, fenghua.yu,
	sai.praneeth.prakhya, x86, tglx, hpa
  Cc: linux-kernel, ak, vikas.shivappa

When MBA software controller is enabled, we need a per domain storage
for user specified bandwidth in MB and the raw b/w percentage values
which are programmed into the MSR. Add support for these data structures
and initialization.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 arch/x86/kernel/cpu/intel_rdt.c          | 37 +++++++++++++++++++++++---------
 arch/x86/kernel/cpu/intel_rdt.h          |  4 ++++
 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c |  3 +++
 3 files changed, 34 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 2b65601..8a32561 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -35,6 +35,7 @@
 
 #define MAX_MBA_BW	100u
 #define MBA_IS_LINEAR	0x4
+#define MBA_BW_MAX_MB	U32_MAX
 
 /* Mutex to protect rdtgroup access. */
 DEFINE_MUTEX(rdtgroup_mutex);
@@ -431,25 +432,40 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
 	return NULL;
 }
 
+void setup_ctrlval(struct rdt_resource *r, u32 *dc, u32 *dm)
+{
+	int i;
+
+	/*
+	 * Initialize the Control MSRs to having no control.
+	 * For Cache Allocation: Set all bits in cbm
+	 * For Memory Allocation: Set b/w requested to 100%
+	 * and the b/w in MB to U32_MAX
+	 */
+	for (i = 0; i < r->num_closid; i++, dc++, dm++) {
+		*dc = r->membw.bw_byte ? MBA_BW_MAX_MB : r->default_ctrl;
+		*dm = r->default_ctrl;
+	}
+}
+
 static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
 {
 	struct msr_param m;
-	u32 *dc;
-	int i;
+	u32 *dc, *dm;
 
 	dc = kmalloc_array(r->num_closid, sizeof(*d->ctrl_val), GFP_KERNEL);
 	if (!dc)
 		return -ENOMEM;
 
-	d->ctrl_val = dc;
+	dm = kmalloc_array(r->num_closid, sizeof(*d->msr_val), GFP_KERNEL);
+	if (!dm) {
+		kfree (dc);
+		return -ENOMEM;
+	}
 
-	/*
-	 * Initialize the Control MSRs to having no control.
-	 * For Cache Allocation: Set all bits in cbm
-	 * For Memory Allocation: Set b/w requested to 100
-	 */
-	for (i = 0; i < r->num_closid; i++, dc++)
-		*dc = r->default_ctrl;
+	d->ctrl_val = dc;
+	d->msr_val = dm;
+	setup_ctrlval(r, dc, dm);
 
 	m.low = 0;
 	m.high = r->num_closid;
@@ -588,6 +604,7 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 		}
 
 		kfree(d->ctrl_val);
+		kfree(d->msr_val);
 		kfree(d->rmid_busy_llc);
 		kfree(d->mbm_total);
 		kfree(d->mbm_local);
diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
index 3e9bc3f..68c7da0 100644
--- a/arch/x86/kernel/cpu/intel_rdt.h
+++ b/arch/x86/kernel/cpu/intel_rdt.h
@@ -202,6 +202,8 @@ struct mbm_state {
  * @cqm_work_cpu:
  *		worker cpu for CQM h/w counters
  * @ctrl_val:	array of cache or mem ctrl values (indexed by CLOSID)
+ *		When MBA is expressed in MB, this holds number of MegaBytes
+ * @msr_val:	When MBA is expressed in MB, this holds the control MSR value
  * @new_ctrl:	new ctrl value to be loaded
  * @have_new_ctrl: did user provide new_ctrl for this domain
  */
@@ -217,6 +219,7 @@ struct rdt_domain {
 	int			mbm_work_cpu;
 	int			cqm_work_cpu;
 	u32			*ctrl_val;
+	u32			*msr_val;
 	u32			new_ctrl;
 	bool			have_new_ctrl;
 };
@@ -450,6 +453,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_domain *d,
 void mbm_setup_overflow_handler(struct rdt_domain *dom,
 				unsigned long delay_ms);
 void mbm_handle_overflow(struct work_struct *work);
+void setup_ctrlval(struct rdt_resource *r, u32 *dc, u32 *dm);
 void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms);
 void cqm_handle_limbo(struct work_struct *work);
 bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index 0707191..d4e8412 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -1044,8 +1044,11 @@ static int set_cache_qos_cfg(int level, bool enable)
 static void __set_mba_byte_ctrl(bool byte_ctrl)
 {
 	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA];
+	struct rdt_domain *d;
 
 	r->membw.bw_byte = byte_ctrl;
+	list_for_each_entry(d, &r->domains, list)
+		setup_ctrlval(r, d->ctrl_val, d->msr_val);
 }
 
 /*
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 4/6] x86/intel_rdt/mba_sc: Add schemata support
  2018-03-29 22:26 [PATCH RFC 0/6] Memory b/w allocation software controller Vikas Shivappa
                   ` (2 preceding siblings ...)
  2018-03-29 22:26 ` [PATCH 3/6] x86/intel_rdt/mba_sc: Add initialization support Vikas Shivappa
@ 2018-03-29 22:26 ` Vikas Shivappa
  2018-03-29 22:26 ` [PATCH 5/6] x86/intel_rdt/mba_sc: Add counting for MBA software controller Vikas Shivappa
  2018-03-29 22:26 ` [PATCH 6/6] x86/intel_rdt/mba_sc: Add support to dynamically update the memory b/w Vikas Shivappa
  5 siblings, 0 replies; 21+ messages in thread
From: Vikas Shivappa @ 2018-03-29 22:26 UTC (permalink / raw)
  To: vikas.shivappa, tony.luck, ravi.v.shankar, fenghua.yu,
	sai.praneeth.prakhya, x86, tglx, hpa
  Cc: linux-kernel, ak, vikas.shivappa

Currently user specifies the maximum memory bandwidth percentage per
domain in the "schemata" file. When user updates the schemata, kernel
writes the corresponding b/w percentage values to the
IA32_MBA_THRTL_MSR.

When MBA is expressed in MegaBytes(MB), the schemata format is changed
to have the per package memory b/w in MB instead of being specified in
b/w percentage. The b/w percentage values are only <= 100 and the b/w in
MB could be upto U32_MAX. We do not write the MSRs when the schemata is
updated as that is handled separately after the m/w in MB is converted
to b/w in percentage values.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 arch/x86/kernel/cpu/intel_rdt.c             | 10 +++++++++-
 arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c | 10 ++++++++--
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 8a32561..8a12d26 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -179,7 +179,7 @@ struct rdt_resource rdt_resources_all[] = {
 		.msr_update		= mba_wrmsr,
 		.cache_level		= 3,
 		.parse_ctrlval		= parse_bw,
-		.format_str		= "%d=%*d",
+		.format_str		= "%d=%*u",
 		.fflags			= RFTYPE_RES_MB,
 	},
 };
@@ -356,6 +356,14 @@ static u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)
 {
 	unsigned int i;
 
+	/*
+	 * The ctrl_val should not be written when
+	 * MBA is expressed in Megabytes because the Megabyte value
+	 * need to be first converted to delay values that can be
+	 * programmed to the MSR.
+	 */
+	WARN_ON(is_mba_MBctrl());
+
 	/*  Write the delay values for mba. */
 	for (i = m->low; i < m->high; i++)
 		wrmsrl(r->msr_base + i, delay_bw_map(d->ctrl_val[i], r));
diff --git a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c
index 23e1d5c..6372e4f 100644
--- a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c
+++ b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c
@@ -53,7 +53,8 @@ static bool bw_validate(char *buf, unsigned long *data, struct rdt_resource *r)
 		return false;
 	}
 
-	if (bw < r->membw.min_bw || bw > r->default_ctrl) {
+	if ((bw < r->membw.min_bw || bw > r->default_ctrl) &&
+	    !is_mba_MBctrl()) {
 		rdt_last_cmd_printf("MB value %ld out of range [%d,%d]\n", bw,
 				    r->membw.min_bw, r->default_ctrl);
 		return false;
@@ -194,7 +195,12 @@ static int update_domains(struct rdt_resource *r, int closid)
 			d->ctrl_val[closid] = d->new_ctrl;
 		}
 	}
-	if (cpumask_empty(cpu_mask))
+
+	/*
+	 * Avoid writing the control msr with control values when
+	 * MBA is expressed in Megabytes.
+	*/
+	if (cpumask_empty(cpu_mask) || is_mba_MBctrl())
 		goto done;
 	cpu = get_cpu();
 	/* Update CBM on this cpu if it's in cpu_mask. */
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 5/6] x86/intel_rdt/mba_sc: Add counting for MBA software controller
  2018-03-29 22:26 [PATCH RFC 0/6] Memory b/w allocation software controller Vikas Shivappa
                   ` (3 preceding siblings ...)
  2018-03-29 22:26 ` [PATCH 4/6] x86/intel_rdt/mba_sc: Add schemata support Vikas Shivappa
@ 2018-03-29 22:26 ` Vikas Shivappa
  2018-03-29 22:26 ` [PATCH 6/6] x86/intel_rdt/mba_sc: Add support to dynamically update the memory b/w Vikas Shivappa
  5 siblings, 0 replies; 21+ messages in thread
From: Vikas Shivappa @ 2018-03-29 22:26 UTC (permalink / raw)
  To: vikas.shivappa, tony.luck, ravi.v.shankar, fenghua.yu,
	sai.praneeth.prakhya, x86, tglx, hpa
  Cc: linux-kernel, ak, vikas.shivappa

Currently we store the per package "total bytes" for each rdtgroup for
Memory bandwidth management which exposed via
"/sys/fs/resctrl/<rdtgrpx>/mon_data/mon_L3_00/mbm_local_bytes".

The above user interface remains while we also add support to measure
the per package b/w in Megabytes and the delta b/w when the b/w MSR
values change. We do this by taking the time stamp every time a the
counter is read and then keeping a history of b/w. This will be used to
support internal queries for the bandwidth in Megabytes.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 arch/x86/kernel/cpu/intel_rdt.c         |  1 -
 arch/x86/kernel/cpu/intel_rdt.h         | 24 ++++++++++++++++++++--
 arch/x86/kernel/cpu/intel_rdt_monitor.c | 36 +++++++++++++++++++++++++++------
 3 files changed, 52 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 8a12d26..78beb64 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -33,7 +33,6 @@
 #include <asm/intel_rdt_sched.h>
 #include "intel_rdt.h"
 
-#define MAX_MBA_BW	100u
 #define MBA_IS_LINEAR	0x4
 #define MBA_BW_MAX_MB	U32_MAX
 
diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
index 68c7da0..b74619d 100644
--- a/arch/x86/kernel/cpu/intel_rdt.h
+++ b/arch/x86/kernel/cpu/intel_rdt.h
@@ -28,6 +28,18 @@
 
 #define MBM_CNTR_WIDTH			24
 #define MBM_OVERFLOW_INTERVAL		1000
+#define MAX_MBA_BW			100u
+
+/*
+ * This measures a tolerable delta value in MegaBytes between
+ * the expected bandwidth and the actual bandwidth.
+ * This is done so that we dont keep flipping the control
+ * bandwidth to more than and less than the expected bandwidth.
+ *
+ * However note that this is only initial threshold value and
+ * it is adjusted dynamically package wise for each rdtgrp
+ */
+#define MBA_BW_MB_THRSHL		1024
 
 #define RMID_VAL_ERROR			BIT_ULL(63)
 #define RMID_VAL_UNAVAIL		BIT_ULL(62)
@@ -180,10 +192,18 @@ struct rftype {
  * struct mbm_state - status for each MBM counter in each domain
  * @chunks:	Total data moved (multiply by rdt_group.mon_scale to get bytes)
  * @prev_msr	Value of IA32_QM_CTR for this RMID last time we read it
+ * @prev_read_time:The last time counter was read
+ * @prev_bw:	The most recent bandwidth in Megabytes
+ * @delta_bw:	Difference between the current b/w and previous b/w
+ * @threshl_calib:	Indicates when to calculate the delta_bw
  */
 struct mbm_state {
-	u64	chunks;
-	u64	prev_msr;
+	u64			chunks;
+	u64			prev_msr;
+	unsigned long		prev_read_time;
+	u64			prev_bw;
+	u64			delta_bw;
+	bool			thrshl_calib;
 };
 
 /**
diff --git a/arch/x86/kernel/cpu/intel_rdt_monitor.c b/arch/x86/kernel/cpu/intel_rdt_monitor.c
index 681450e..509f338 100644
--- a/arch/x86/kernel/cpu/intel_rdt_monitor.c
+++ b/arch/x86/kernel/cpu/intel_rdt_monitor.c
@@ -225,9 +225,11 @@ void free_rmid(u32 rmid)
 		list_add_tail(&entry->list, &rmid_free_lru);
 }
 
-static int __mon_event_count(u32 rmid, struct rmid_read *rr)
+static int __mon_event_count(u32 rmid, struct rmid_read *rr, struct mbm_state **md)
 {
-	u64 chunks, shift, tval;
+	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3];
+	u64 chunks, shift, tval, cur_bw = 0;
+	unsigned long delta_time, now;
 	struct mbm_state *m;
 
 	tval = __rmid_read(rmid, rr->evtid);
@@ -256,6 +258,9 @@ static int __mon_event_count(u32 rmid, struct rmid_read *rr)
 	if (rr->first) {
 		m->prev_msr = tval;
 		m->chunks = 0;
+		m->prev_read_time = jiffies;
+		m->prev_bw = 0;
+		m->delta_bw = MBA_BW_MB_THRSHL;
 		return 0;
 	}
 
@@ -266,6 +271,24 @@ static int __mon_event_count(u32 rmid, struct rmid_read *rr)
 	m->prev_msr = tval;
 
 	rr->val += m->chunks;
+
+	if(!md)
+		goto out;
+
+	now = jiffies;
+	delta_time = jiffies_to_usecs(now - m->prev_read_time);
+	if (delta_time)
+		cur_bw = (chunks * r->mon_scale) / delta_time;
+
+	if (m->thrshl_calib)
+		m->delta_bw = abs(cur_bw - m->prev_bw);
+	m->thrshl_calib = false;
+	m->prev_bw = cur_bw;
+	m->prev_read_time = now;
+
+	*md = m;
+out:
+
 	return 0;
 }
 
@@ -281,7 +304,7 @@ void mon_event_count(void *info)
 
 	rdtgrp = rr->rgrp;
 
-	if (__mon_event_count(rdtgrp->mon.rmid, rr))
+	if (__mon_event_count(rdtgrp->mon.rmid, rr, NULL))
 		return;
 
 	/*
@@ -291,7 +314,7 @@ void mon_event_count(void *info)
 
 	if (rdtgrp->type == RDTCTRL_GROUP) {
 		list_for_each_entry(entry, head, mon.crdtgrp_list) {
-			if (__mon_event_count(entry->mon.rmid, rr))
+			if (__mon_event_count(entry->mon.rmid, rr, NULL))
 				return;
 		}
 	}
@@ -299,6 +322,7 @@ void mon_event_count(void *info)
 
 static void mbm_update(struct rdt_domain *d, int rmid)
 {
+
 	struct rmid_read rr;
 
 	rr.first = false;
@@ -310,11 +334,11 @@ static void mbm_update(struct rdt_domain *d, int rmid)
 	 */
 	if (is_mbm_total_enabled()) {
 		rr.evtid = QOS_L3_MBM_TOTAL_EVENT_ID;
-		__mon_event_count(rmid, &rr);
+		__mon_event_count(rmid, &rr, NULL);
 	}
 	if (is_mbm_local_enabled()) {
 		rr.evtid = QOS_L3_MBM_LOCAL_EVENT_ID;
-		__mon_event_count(rmid, &rr);
+		__mon_event_count(rmid, &rr, NULL);
 	}
 }
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 6/6] x86/intel_rdt/mba_sc: Add support to dynamically update the memory b/w
  2018-03-29 22:26 [PATCH RFC 0/6] Memory b/w allocation software controller Vikas Shivappa
                   ` (4 preceding siblings ...)
  2018-03-29 22:26 ` [PATCH 5/6] x86/intel_rdt/mba_sc: Add counting for MBA software controller Vikas Shivappa
@ 2018-03-29 22:26 ` Vikas Shivappa
  2018-03-30 21:21   ` kbuild test robot
  2018-03-31  1:37   ` kbuild test robot
  5 siblings, 2 replies; 21+ messages in thread
From: Vikas Shivappa @ 2018-03-29 22:26 UTC (permalink / raw)
  To: vikas.shivappa, tony.luck, ravi.v.shankar, fenghua.yu,
	sai.praneeth.prakhya, x86, tglx, hpa
  Cc: linux-kernel, ak, vikas.shivappa

The software controller uses the existing MBM overflow timer calls (once
per second currently) to read the bandwidth to ensure that always

	"current b/w < user specified max b/w"

Similarly when we see that the current b/w is too low, we also try to
increase the b/w. We use a threshold b/w or a delta b/w which is
calculated dynamically to determine what is too low. OS then uses the
"IA32_MBA_THRTL_MSR" to change the b/w. The change itself is currently
linear and is in the increment of decrement of the "b/w granularity"
specified by the SKU. The values written to the MSR are also cached so
that we donot do a rdmsr for every 1s.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 arch/x86/kernel/cpu/intel_rdt.c         |  2 +-
 arch/x86/kernel/cpu/intel_rdt.h         |  1 +
 arch/x86/kernel/cpu/intel_rdt_monitor.c | 71 +++++++++++++++++++++++++++++++--
 3 files changed, 69 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 78beb64..700e957 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -341,7 +341,7 @@ static int get_cache_id(int cpu, int level)
  * that can be written to QOS_MSRs.
  * There are currently no SKUs which support non linear delay values.
  */
-static u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)
+u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)
 {
 	if (r->membw.delay_linear)
 		return MAX_MBA_BW - bw;
diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
index b74619d..aafbc4b 100644
--- a/arch/x86/kernel/cpu/intel_rdt.h
+++ b/arch/x86/kernel/cpu/intel_rdt.h
@@ -474,6 +474,7 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom,
 				unsigned long delay_ms);
 void mbm_handle_overflow(struct work_struct *work);
 void setup_ctrlval(struct rdt_resource *r, u32 *dc, u32 *dm);
+u32 delay_bw_map(unsigned long bw, struct rdt_resource *r);
 void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms);
 void cqm_handle_limbo(struct work_struct *work);
 bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
diff --git a/arch/x86/kernel/cpu/intel_rdt_monitor.c b/arch/x86/kernel/cpu/intel_rdt_monitor.c
index 509f338..13b6eff 100644
--- a/arch/x86/kernel/cpu/intel_rdt_monitor.c
+++ b/arch/x86/kernel/cpu/intel_rdt_monitor.c
@@ -272,6 +272,10 @@ static int __mon_event_count(u32 rmid, struct rmid_read *rr, struct mbm_state **
 
 	rr->val += m->chunks;
 
+	/*
+	 * We only do the bw calculations for the mbm overflow
+	 * periodic timer calls and for local events only.
+	 */
 	if(!md)
 		goto out;
 
@@ -320,10 +324,61 @@ void mon_event_count(void *info)
 	}
 }
 
-static void mbm_update(struct rdt_domain *d, int rmid)
+/*
+ * Check the current b/w using the MBM counters to always ensure that
+ * current b/w < user specified b/w. If the current b/w is way less than
+ * the user defined b/w (determined by the delta b/w)
+ * then try to increase the b/w
+ */
+static void update_mba_bw(struct rdtgroup *rgrp, struct mbm_state *m)
 {
+	u32 closid, rmid, cur_msr, cur_msr_val, new_msr_val;
+	struct rdt_resource *r_mba;
+	u64 cur_bw, user_bw, thrshl_bw;
+	struct rdt_domain *dom_mba;
+
+	r_mba = &rdt_resources_all[RDT_RESOURCE_MBA];
+	closid = rgrp->closid;
+	rmid = rgrp->mon.rmid;
+
+	dom_mba = get_domain_from_cpu(smp_processor_id(), r_mba);
+	if (!dom_mba) {
+		pr_warn_once("Failure to get domain for MBA update\n");
+		return;
+	}
+
+	cur_bw = m->prev_bw;
+	user_bw = dom_mba->ctrl_val[closid];
+	thrshl_bw = m->delta_bw;
+	cur_msr_val = dom_mba->msr_val[closid];
+	/*
+	 * Scale up/down the b/w linearly.
+	 */
+	if (user_bw < cur_bw && cur_msr_val > r_mba->membw.min_bw) {
+		new_msr_val = cur_msr_val - r_mba->membw.bw_gran;
+	} else if ((cur_bw && user_bw > (cur_bw + thrshl_bw)) &&
+		   cur_msr_val < MAX_MBA_BW) {
+		new_msr_val = cur_msr_val + r_mba->membw.bw_gran;
+	} else {
+		return;
+	}
 
+	cur_msr = r_mba->msr_base + closid;
+	wrmsrl(cur_msr, delay_bw_map(new_msr_val, r_mba));
+	dom_mba->msr_val[closid] = new_msr_val;
+
+	/*
+	 * When the counter is read next time, recaliberate the
+	 * threshold since we changed the MSR value.
+	 */
+	m->thrshl_calib = true;
+}
+
+static void mbm_update(struct rdt_domain *d, struct rdtgroup *rgrp, bool prgrp)
+{
+	int rmid = rgrp->mon.rmid;
 	struct rmid_read rr;
+	struct mbm_state *m;
 
 	rr.first = false;
 	rr.d = d;
@@ -338,7 +393,15 @@ static void mbm_update(struct rdt_domain *d, int rmid)
 	}
 	if (is_mbm_local_enabled()) {
 		rr.evtid = QOS_L3_MBM_LOCAL_EVENT_ID;
-		__mon_event_count(rmid, &rr, NULL);
+		__mon_event_count(rmid, &rr, &m);
+
+		/*
+		 * Call the MBA software controller core function
+		 * only for the control groups and when user has enabled
+		 * the software controller explicitly.
+		 */
+		if (prgrp && is_mba_MBctrl())
+			update_mba_bw(rgrp, m);
 	}
 }
 
@@ -404,11 +467,11 @@ void mbm_handle_overflow(struct work_struct *work)
 		goto out_unlock;
 
 	list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
-		mbm_update(d, prgrp->mon.rmid);
+		mbm_update(d, prgrp, true);
 
 		head = &prgrp->mon.crdtgrp_list;
 		list_for_each_entry(crgrp, head, mon.crdtgrp_list)
-			mbm_update(d, crgrp->mon.rmid);
+			mbm_update(d, crgrp, false);
 	}
 
 	schedule_delayed_work_on(cpu, &d->mbm_over, delay);
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/6] x86/intel_rdt/mba_sc: Add support to enable/disable via mount option
  2018-03-29 22:26 ` [PATCH 2/6] x86/intel_rdt/mba_sc: Add support to enable/disable via mount option Vikas Shivappa
@ 2018-03-30  9:32   ` Thomas Gleixner
  2018-03-30 17:19     ` Shivappa Vikas
  0 siblings, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2018-03-30  9:32 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: vikas.shivappa, tony.luck, ravi.v.shankar, fenghua.yu,
	sai.praneeth.prakhya, x86, hpa, linux-kernel, ak

On Thu, 29 Mar 2018, Vikas Shivappa wrote:

Subject: x86/intel_rdt/mba_sc: Add support to enable/disable via mount option

Huch? From Documentation:

      The ``summary phrase`` in the email's Subject should concisely
      describe the patch which that email contains.

You're introducing somthing new: mba_sc

It's completely unclear what that is and what it means.

     x86/intel_rdt: Add mount option for bandwidth allocation in MB/s

or something like that. 

> Specify a new mount option "mba_MB" to enable the user to specify MBA
> bandwidth in Megabytes(Software Controller/SC) instead of b/w

You cannot specify bandwidth in Megabytes. Bandwidth is a bit-rate and the
units are multiples of bits per second and not Megabytes.

> --- a/arch/x86/kernel/cpu/intel_rdt.h
> +++ b/arch/x86/kernel/cpu/intel_rdt.h
> @@ -259,6 +259,7 @@ struct rdt_cache {
>   * @min_bw:		Minimum memory bandwidth percentage user can request
>   * @bw_gran:		Granularity at which the memory bandwidth is allocated
>   * @delay_linear:	True if memory B/W delay is in linear scale
> + * @bw_byte:		True if memory B/W is specified in bytes

So the mount parameter says Megabytes, but here you say bytes? What?

And bw_byte is a misnomer. bw_bytes if you really mean bytes. bw_mb if it's megabytes.

> +#define is_mba_linear() rdt_resources_all[RDT_RESOURCE_MBA].membw.delay_linear
> +#define is_mba_MBctrl() rdt_resources_all[RDT_RESOURCE_MBA].membw.bw_byte

Please use inlines and no camel case. That's horrible.

> +
>  /**
>   * struct rdt_resource - attributes of an RDT resource
>   * @rid:		The index of the resource
> diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
> index fca759d..0707191 100644
> --- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
> +++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
> @@ -1041,6 +1041,24 @@ static int set_cache_qos_cfg(int level, bool enable)
>  	return 0;
>  }
>  
> +static void __set_mba_byte_ctrl(bool byte_ctrl)
> +{
> +	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA];
> +
> +	r->membw.bw_byte = byte_ctrl;

I don't see the point of this extra function. It has exactly one user.

> +}
> +
> +/*
> + * MBA allocation in bytes is only supported if
> + * MBM is supported and MBA is in linear scale.
> +*/

Hint: checkpatch.pl is not optional

> +static void set_mba_byte_ctrl(bool byte_ctrl)
> +{
> +	if ((is_mbm_enabled() && is_mba_linear()) &&
> +	    byte_ctrl != is_mba_MBctrl())
> +		__set_mba_byte_ctrl(byte_ctrl);

And that user is a small enough function. To avoid indentation you can
simply return when the condition is false.

Also if the user wants to mount with the MB option and it's not supported,
why are you not returning an error code and refuse the mount? That's just
wrong.

> +
>  static int cdp_enable(int level, int data_type, int code_type)
>  {
>  	struct rdt_resource *r_ldata = &rdt_resources_all[data_type];
> @@ -1104,7 +1122,7 @@ static void cdp_disable_all(void)
>  		cdpl2_disable();
>  }
>  
> -static int parse_rdtgroupfs_options(char *data)
> +static int parse_rdtgroupfs_options(char *data, bool *mba_MBctrl)

What?

>  {
>  	char *token, *o = data;
>  	int ret = 0;
> @@ -1123,6 +1141,8 @@ static int parse_rdtgroupfs_options(char *data)
>  			ret = cdpl2_enable();
>  			if (ret)
>  				goto out;
> +		} else if (!strcmp(token, "mba_MB")) {
> +			*mba_MBctrl = true;

That's mindless hackery. Really. What's wrong with setting the flag in the
resource and then add the actual register fiddling right in the

	if (is_mbm_enabled()) {

section in rdt_mount()? That would be too obvious and fit into the existing
code, right?

> +	/*Set the control values before the rest of reset*/

Space after '/*' and before '*/

Aside of that the comment is pretty useless. 'the control values' ??? Which
control values? 

> +	set_mba_byte_ctrl(false);

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/6] x86/intel_rdt/mba_sc: Add support to enable/disable via mount option
  2018-03-30  9:32   ` Thomas Gleixner
@ 2018-03-30 17:19     ` Shivappa Vikas
  0 siblings, 0 replies; 21+ messages in thread
From: Shivappa Vikas @ 2018-03-30 17:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vikas Shivappa, vikas.shivappa, tony.luck, ravi.v.shankar,
	fenghua.yu, sai.praneeth.prakhya, x86, hpa, linux-kernel, ak


Hello Thomas,

On Fri, 30 Mar 2018, Thomas Gleixner wrote:

> On Thu, 29 Mar 2018, Vikas Shivappa wrote:
>
> Subject: x86/intel_rdt/mba_sc: Add support to enable/disable via mount option
>
> Huch? From Documentation:
>
>      The ``summary phrase`` in the email's Subject should concisely
>      describe the patch which that email contains.
>
> You're introducing somthing new: mba_sc
>
> It's completely unclear what that is and what it means.
>
>     x86/intel_rdt: Add mount option for bandwidth allocation in MB/s
>
> or something like that.

would 'Mount option to enable MBA softwarecontroller' be better? Given that I 
have a documentation patch which says what is mba software controller.

>
>> Specify a new mount option "mba_MB" to enable the user to specify MBA
>> bandwidth in Megabytes(Software Controller/SC) instead of b/w
>
> You cannot specify bandwidth in Megabytes. Bandwidth is a bit-rate and the
> units are multiples of bits per second and not Megabytes.
>
>> --- a/arch/x86/kernel/cpu/intel_rdt.h
>> +++ b/arch/x86/kernel/cpu/intel_rdt.h
>> @@ -259,6 +259,7 @@ struct rdt_cache {
>>   * @min_bw:		Minimum memory bandwidth percentage user can request
>>   * @bw_gran:		Granularity at which the memory bandwidth is allocated
>>   * @delay_linear:	True if memory B/W delay is in linear scale
>> + * @bw_byte:		True if memory B/W is specified in bytes
>
> So the mount parameter says Megabytes, but here you say bytes? What?
>
> And bw_byte is a misnomer. bw_bytes if you really mean bytes. bw_mb if it's megabytes.

Will fix the namings. Thanks for pointing it should be MBps everywhere.

>
>> +#define is_mba_linear() rdt_resources_all[RDT_RESOURCE_MBA].membw.delay_linear
>> +#define is_mba_MBctrl() rdt_resources_all[RDT_RESOURCE_MBA].membw.bw_byte
>
> Please use inlines and no camel case. That's horrible.

Will fix..

>
>> +
>>  /**
>>   * struct rdt_resource - attributes of an RDT resource
>>   * @rid:		The index of the resource
>> diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
>> index fca759d..0707191 100644
>> --- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
>> @@ -1041,6 +1041,24 @@ static int set_cache_qos_cfg(int level, bool enable)
>>  	return 0;
>>  }
>>
>> +static void __set_mba_byte_ctrl(bool byte_ctrl)
>> +{
>> +	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA];
>> +
>> +	r->membw.bw_byte = byte_ctrl;
>
> I don't see the point of this extra function. It has exactly one user.
>
>> +}
>> +
>> +/*
>> + * MBA allocation in bytes is only supported if
>> + * MBM is supported and MBA is in linear scale.
>> +*/
>
> Hint: checkpatch.pl is not optional
>
>> +static void set_mba_byte_ctrl(bool byte_ctrl)
>> +{
>> +	if ((is_mbm_enabled() && is_mba_linear()) &&
>> +	    byte_ctrl != is_mba_MBctrl())
>> +		__set_mba_byte_ctrl(byte_ctrl);
>
> And that user is a small enough function. To avoid indentation you can
> simply return when the condition is false.
>
> Also if the user wants to mount with the MB option and it's not supported,
> why are you not returning an error code and refuse the mount? That's just
> wrong.

Will fix. can merge into one function and return error when not available.

>
>> +
>>  static int cdp_enable(int level, int data_type, int code_type)
>>  {
>>  	struct rdt_resource *r_ldata = &rdt_resources_all[data_type];
>> @@ -1104,7 +1122,7 @@ static void cdp_disable_all(void)
>>  		cdpl2_disable();
>>  }
>>
>> -static int parse_rdtgroupfs_options(char *data)
>> +static int parse_rdtgroupfs_options(char *data, bool *mba_MBctrl)
>
> What?
>
>>  {
>>  	char *token, *o = data;
>>  	int ret = 0;
>> @@ -1123,6 +1141,8 @@ static int parse_rdtgroupfs_options(char *data)
>>  			ret = cdpl2_enable();
>>  			if (ret)
>>  				goto out;
>> +		} else if (!strcmp(token, "mba_MB")) {
>> +			*mba_MBctrl = true;
>
> That's mindless hackery. Really. What's wrong with setting the flag in the
> resource and then add the actual register fiddling right in the
>
> 	if (is_mbm_enabled()) {
>
> section in rdt_mount()? That would be too obvious and fit into the existing
> code, right?

Will fix.

>
>> +	/*Set the control values before the rest of reset*/
>
> Space after '/*' and before '*/
>
> Aside of that the comment is pretty useless. 'the control values' ??? Which
> control values?
>

Will fix the comment or remove. Wanted to point here that we reset the control 
values (the delay values that go into the IA32_MBA_THRTL_MSRs) but thats done 
any ways in the reset_all_ctrls call after this, so comment can be removed.

Will fix the checkpatch issues as pointed.

In general wanted to know if this is 
a sane idea to have a software feedback and let the user specify b/w in MBps 
rather than the confusing percentage values. The typical confusing scenarios are 
documented in documentation patch with examples. The use 
can can occur in any rdtgroups 
which are trying to group jobs where different number of threads are active. Say 
if you want to create an rdtgroup with low priority jobs and give them 10% of 
b/w the actual raw b/w in MBps used can vary and increase if more threads are 
spawned (because the new threads spawned belong to the same rdtgroup and each 
thread can use up 10% of the 'per core' memory b/w).

Thanks,
Vikas

>> +	set_mba_byte_ctrl(false);
>
> Thanks,
>
> 	tglx
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 6/6] x86/intel_rdt/mba_sc: Add support to dynamically update the memory b/w
  2018-03-29 22:26 ` [PATCH 6/6] x86/intel_rdt/mba_sc: Add support to dynamically update the memory b/w Vikas Shivappa
@ 2018-03-30 21:21   ` kbuild test robot
  2018-03-31  1:37   ` kbuild test robot
  1 sibling, 0 replies; 21+ messages in thread
From: kbuild test robot @ 2018-03-30 21:21 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: kbuild-all, vikas.shivappa, tony.luck, ravi.v.shankar,
	fenghua.yu, sai.praneeth.prakhya, x86, tglx, hpa, linux-kernel,
	ak, vikas.shivappa

[-- Attachment #1: Type: text/plain, Size: 5733 bytes --]

Hi Vikas,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on v4.16-rc7]
[also build test ERROR on next-20180329]
[cannot apply to tip/x86/core]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Vikas-Shivappa/Memory-b-w-allocation-software-controller/20180331-040536
config: i386-randconfig-a0-201812 (attached as .config)
compiler: gcc-4.9 (Debian 4.9.4-2) 4.9.4
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   arch/x86/kernel/cpu/intel_rdt_monitor.o: In function `__mon_event_count':
>> arch/x86/kernel/cpu/intel_rdt_monitor.c:285: undefined reference to `__udivdi3'

vim +285 arch/x86/kernel/cpu/intel_rdt_monitor.c

edf6fa1c Vikas Shivappa 2017-07-25  227  
2bbfc129 Vikas Shivappa 2018-03-29  228  static int __mon_event_count(u32 rmid, struct rmid_read *rr, struct mbm_state **md)
d89b7379 Vikas Shivappa 2017-07-25  229  {
2bbfc129 Vikas Shivappa 2018-03-29  230  	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3];
2bbfc129 Vikas Shivappa 2018-03-29  231  	u64 chunks, shift, tval, cur_bw = 0;
2bbfc129 Vikas Shivappa 2018-03-29  232  	unsigned long delta_time, now;
9f52425b Tony Luck      2017-07-25  233  	struct mbm_state *m;
d89b7379 Vikas Shivappa 2017-07-25  234  
d89b7379 Vikas Shivappa 2017-07-25  235  	tval = __rmid_read(rmid, rr->evtid);
d89b7379 Vikas Shivappa 2017-07-25  236  	if (tval & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL)) {
d89b7379 Vikas Shivappa 2017-07-25  237  		rr->val = tval;
d89b7379 Vikas Shivappa 2017-07-25  238  		return -EINVAL;
d89b7379 Vikas Shivappa 2017-07-25  239  	}
d89b7379 Vikas Shivappa 2017-07-25  240  	switch (rr->evtid) {
d89b7379 Vikas Shivappa 2017-07-25  241  	case QOS_L3_OCCUP_EVENT_ID:
d89b7379 Vikas Shivappa 2017-07-25  242  		rr->val += tval;
d89b7379 Vikas Shivappa 2017-07-25  243  		return 0;
9f52425b Tony Luck      2017-07-25  244  	case QOS_L3_MBM_TOTAL_EVENT_ID:
9f52425b Tony Luck      2017-07-25  245  		m = &rr->d->mbm_total[rmid];
9f52425b Tony Luck      2017-07-25  246  		break;
9f52425b Tony Luck      2017-07-25  247  	case QOS_L3_MBM_LOCAL_EVENT_ID:
9f52425b Tony Luck      2017-07-25  248  		m = &rr->d->mbm_local[rmid];
9f52425b Tony Luck      2017-07-25  249  		break;
d89b7379 Vikas Shivappa 2017-07-25  250  	default:
d89b7379 Vikas Shivappa 2017-07-25  251  		/*
d89b7379 Vikas Shivappa 2017-07-25  252  		 * Code would never reach here because
d89b7379 Vikas Shivappa 2017-07-25  253  		 * an invalid event id would fail the __rmid_read.
d89b7379 Vikas Shivappa 2017-07-25  254  		 */
d89b7379 Vikas Shivappa 2017-07-25  255  		return -EINVAL;
d89b7379 Vikas Shivappa 2017-07-25  256  	}
a4de1dfd Vikas Shivappa 2017-07-25  257  
a4de1dfd Vikas Shivappa 2017-07-25  258  	if (rr->first) {
a4de1dfd Vikas Shivappa 2017-07-25  259  		m->prev_msr = tval;
a4de1dfd Vikas Shivappa 2017-07-25  260  		m->chunks = 0;
2bbfc129 Vikas Shivappa 2018-03-29  261  		m->prev_read_time = jiffies;
2bbfc129 Vikas Shivappa 2018-03-29  262  		m->prev_bw = 0;
2bbfc129 Vikas Shivappa 2018-03-29  263  		m->delta_bw = MBA_BW_MB_THRSHL;
a4de1dfd Vikas Shivappa 2017-07-25  264  		return 0;
a4de1dfd Vikas Shivappa 2017-07-25  265  	}
a4de1dfd Vikas Shivappa 2017-07-25  266  
9f52425b Tony Luck      2017-07-25  267  	shift = 64 - MBM_CNTR_WIDTH;
9f52425b Tony Luck      2017-07-25  268  	chunks = (tval << shift) - (m->prev_msr << shift);
9f52425b Tony Luck      2017-07-25  269  	chunks >>= shift;
9f52425b Tony Luck      2017-07-25  270  	m->chunks += chunks;
9f52425b Tony Luck      2017-07-25  271  	m->prev_msr = tval;
9f52425b Tony Luck      2017-07-25  272  
9f52425b Tony Luck      2017-07-25  273  	rr->val += m->chunks;
2bbfc129 Vikas Shivappa 2018-03-29  274  
6138a999 Vikas Shivappa 2018-03-29  275  	/*
6138a999 Vikas Shivappa 2018-03-29  276  	 * We only do the bw calculations for the mbm overflow
6138a999 Vikas Shivappa 2018-03-29  277  	 * periodic timer calls and for local events only.
6138a999 Vikas Shivappa 2018-03-29  278  	 */
2bbfc129 Vikas Shivappa 2018-03-29  279  	if(!md)
2bbfc129 Vikas Shivappa 2018-03-29  280  		goto out;
2bbfc129 Vikas Shivappa 2018-03-29  281  
2bbfc129 Vikas Shivappa 2018-03-29  282  	now = jiffies;
2bbfc129 Vikas Shivappa 2018-03-29  283  	delta_time = jiffies_to_usecs(now - m->prev_read_time);
2bbfc129 Vikas Shivappa 2018-03-29  284  	if (delta_time)
2bbfc129 Vikas Shivappa 2018-03-29 @285  		cur_bw = (chunks * r->mon_scale) / delta_time;
2bbfc129 Vikas Shivappa 2018-03-29  286  
2bbfc129 Vikas Shivappa 2018-03-29  287  	if (m->thrshl_calib)
2bbfc129 Vikas Shivappa 2018-03-29  288  		m->delta_bw = abs(cur_bw - m->prev_bw);
2bbfc129 Vikas Shivappa 2018-03-29  289  	m->thrshl_calib = false;
2bbfc129 Vikas Shivappa 2018-03-29  290  	m->prev_bw = cur_bw;
2bbfc129 Vikas Shivappa 2018-03-29  291  	m->prev_read_time = now;
2bbfc129 Vikas Shivappa 2018-03-29  292  
2bbfc129 Vikas Shivappa 2018-03-29  293  	*md = m;
2bbfc129 Vikas Shivappa 2018-03-29  294  out:
2bbfc129 Vikas Shivappa 2018-03-29  295  
9f52425b Tony Luck      2017-07-25  296  	return 0;
d89b7379 Vikas Shivappa 2017-07-25  297  }
d89b7379 Vikas Shivappa 2017-07-25  298  

:::::: The code at line 285 was first introduced by commit
:::::: 2bbfc12978bb70164a0fa01307798973a4e2c80d x86/intel_rdt/mba_sc: Add counting for MBA software controller

:::::: TO: Vikas Shivappa <vikas.shivappa@linux.intel.com>
:::::: CC: 0day robot <fengguang.wu@intel.com>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 30662 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 6/6] x86/intel_rdt/mba_sc: Add support to dynamically update the memory b/w
  2018-03-29 22:26 ` [PATCH 6/6] x86/intel_rdt/mba_sc: Add support to dynamically update the memory b/w Vikas Shivappa
  2018-03-30 21:21   ` kbuild test robot
@ 2018-03-31  1:37   ` kbuild test robot
  1 sibling, 0 replies; 21+ messages in thread
From: kbuild test robot @ 2018-03-31  1:37 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: kbuild-all, vikas.shivappa, tony.luck, ravi.v.shankar,
	fenghua.yu, sai.praneeth.prakhya, x86, tglx, hpa, linux-kernel,
	ak, vikas.shivappa

[-- Attachment #1: Type: text/plain, Size: 923 bytes --]

Hi Vikas,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on v4.16-rc7]
[also build test ERROR on next-20180329]
[cannot apply to tip/x86/core]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Vikas-Shivappa/Memory-b-w-allocation-software-controller/20180331-040536
config: i386-allmodconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   arch/x86/kernel/cpu/intel_rdt_monitor.o: In function `__mon_event_count':
>> intel_rdt_monitor.c:(.text+0x1b8): undefined reference to `__udivdi3'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 63067 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/6] x86/intel_rdt/mba_sc: Add documentation for MBA software controller
  2018-03-29 22:26 ` [PATCH 1/6] x86/intel_rdt/mba_sc: Add documentation for MBA " Vikas Shivappa
@ 2018-04-03  9:46   ` Thomas Gleixner
  2018-04-03 14:29     ` Thomas Gleixner
  2018-04-03 18:45     ` Shivappa Vikas
  0 siblings, 2 replies; 21+ messages in thread
From: Thomas Gleixner @ 2018-04-03  9:46 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: vikas.shivappa, tony.luck, ravi.v.shankar, fenghua.yu,
	sai.praneeth.prakhya, x86, hpa, linux-kernel, ak

On Thu, 29 Mar 2018, Vikas Shivappa wrote:
> +Memory bandwidth(b/w) in MegaBytes
> +----------------------------------
> +
> +Memory bandwidth is a core specific mechanism which means that when the
> +Memory b/w percentage is specified in the schemata per package it
> +actually is applied on a per core basis via IA32_MBA_THRTL_MSR
> +interface. This may lead to confusion in scenarios below:
> +
> +1. User may not see increase in actual b/w when percentage values are
> +   increased:
> +
> +This can occur when aggregate L2 external b/w is more than L3 external
> +b/w. Consider an SKL SKU with 24 cores on a package and where L2
> +external b/w is 10GBps (hence aggregate L2 external b/w is 240GBps) and
> +L3 external b/w is 100GBps. Now a workload with '20 threads, having 50%
> +b/w, each consuming 5GBps' consumes the max L3 b/w of 100GBps although
> +the percentage value specified is only 50% << 100%. Hence increasing
> +the b/w percentage will not yeild any more b/w. This is because
> +although the L2 external b/w still has capacity, the L3 external b/w
> +is fully used. Also note that this would be dependent on number of
> +cores the benchmark is run on.
> +
> +2. Same b/w percentage may mean different actual b/w depending on # of
> +   threads:
> +
> +For the same SKU in #1, a 'single thread, with 10% b/w' and '4 thread,
> +with 10% b/w' can consume upto 10GBps and 40GBps although they have same
> +percentage b/w of 10%. This is simply because as threads start using
> +more cores in an rdtgroup, the actual b/w may increase or vary although
> +user specified b/w percentage is same.
> +
> +In order to mitigate this and make the interface more user friendly, we
> +can let the user specify the max bandwidth per rdtgroup in bytes(or mega
> +bytes). The kernel underneath would use a software feedback mechanism or
> +a "Software Controller" which reads the actual b/w using MBM counters
> +and adjust the memowy bandwidth percentages to ensure the "actual b/w
> +< user b/w".
> +
> +The legacy behaviour is default and user can switch to the "MBA software
> +controller" mode using a mount option 'mba_MB'.

You said above:

> This may lead to confusion in scenarios below:

Reading the blurb after that creates even more confusion than being
helpful.

First of all this information should not be under the section 'Memory
bandwidth in MB/s'.

Also please write bandwidth. The weird acronym b/w (band per width???) is
really not increasing legibility. 

What you really want is a general section about memory bandwidth allocation
where you explain the technical background in purely technical terms w/o
fairy tale mode. Technical descriptions have to be factual and not
'could/may/would'. 

If I decode the above correctly then the current percentage based
implementation was buggy from the very beginning in several ways.

Now the obvious question which is in no way answered by the cover letter is
why the current percentage based implementation cannot be fixed and we need
some feedback driven magic to achieve that. I assume you spent some brain
cycles on that question, so it would be really helpful if you shared that.

If I understand it correctly then the problem is that the throttling
mechanism is per core and affects the L2 external bandwidth.

  Is this really per core? What about hyper threads. Both threads have that
  MSR. How is that working?

The L2 external bandwidth is higher than the L3 external bandwidth.

  Is there any information available from CPUID or whatever source which
  allows us to retrieve the bandwidth ratio or the absolute maximum
  bandwidth per level?

What's also missing from your explanation is how that feedback loop behaves
under different workloads.

  Is this assuming that the involved threads/cpus actually try to utilize
  the bandwidth completely?

  What happens if the threads/cpus are only using a small set because they
  are idle or their computations are mostly cache local and do not need
  external bandwidth? Looking at the implementation I don't see how that is
  taken into account.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/6] x86/intel_rdt/mba_sc: Add initialization support
  2018-03-29 22:26 ` [PATCH 3/6] x86/intel_rdt/mba_sc: Add initialization support Vikas Shivappa
@ 2018-04-03  9:52   ` Thomas Gleixner
  2018-04-03 18:51     ` Shivappa Vikas
  0 siblings, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2018-04-03  9:52 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: vikas.shivappa, tony.luck, ravi.v.shankar, fenghua.yu,
	sai.praneeth.prakhya, x86, hpa, linux-kernel, ak

On Thu, 29 Mar 2018, Vikas Shivappa wrote:
> +void setup_ctrlval(struct rdt_resource *r, u32 *dc, u32 *dm)
> +{
> +	int i;
> +
> +	/*
> +	 * Initialize the Control MSRs to having no control.
> +	 * For Cache Allocation: Set all bits in cbm
> +	 * For Memory Allocation: Set b/w requested to 100%
> +	 * and the b/w in MB to U32_MAX
> +	 */
> +	for (i = 0; i < r->num_closid; i++, dc++, dm++) {
> +		*dc = r->membw.bw_byte ? MBA_BW_MAX_MB : r->default_ctrl;
> +		*dm = r->default_ctrl;

No! Please stop duct taping your stuff into the existing code. So far the
ctrl value was the same as the value which was actually written into the
MSR. With your new mode you have to split that up into the user supplied
value and the value which gets written into the MSR.

So the right thing to do is to separate the user value and the MSR value
first and independent of the mode. Then the new mode falls into place
naturally because r->default_ctrl and r->default_msrval are set up at mount
time with the values which correspond to the mount mode. 

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/6] x86/intel_rdt/mba_sc: Add documentation for MBA software controller
  2018-04-03  9:46   ` Thomas Gleixner
@ 2018-04-03 14:29     ` Thomas Gleixner
  2018-04-03 18:49       ` Shivappa Vikas
  2018-04-03 18:45     ` Shivappa Vikas
  1 sibling, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2018-04-03 14:29 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: vikas.shivappa, tony.luck, ravi.v.shankar, fenghua.yu,
	sai.praneeth.prakhya, x86, hpa, linux-kernel, ak

On Tue, 3 Apr 2018, Thomas Gleixner wrote:
> On Thu, 29 Mar 2018, Vikas Shivappa wrote:
> You said above:
> 
> > This may lead to confusion in scenarios below:
> 
> Reading the blurb after that creates even more confusion than being
> helpful.
> 
> First of all this information should not be under the section 'Memory
> bandwidth in MB/s'.
> 
> Also please write bandwidth. The weird acronym b/w (band per width???) is
> really not increasing legibility. 
> 
> What you really want is a general section about memory bandwidth allocation
> where you explain the technical background in purely technical terms w/o
> fairy tale mode. Technical descriptions have to be factual and not
> 'could/may/would'. 
> 
> If I decode the above correctly then the current percentage based
> implementation was buggy from the very beginning in several ways.
> 
> Now the obvious question which is in no way answered by the cover letter is
> why the current percentage based implementation cannot be fixed and we need
> some feedback driven magic to achieve that. I assume you spent some brain
> cycles on that question, so it would be really helpful if you shared that.
> 
> If I understand it correctly then the problem is that the throttling
> mechanism is per core and affects the L2 external bandwidth.
> 
>   Is this really per core? What about hyper threads. Both threads have that
>   MSR. How is that working?
> 
> The L2 external bandwidth is higher than the L3 external bandwidth.
> 
>   Is there any information available from CPUID or whatever source which
>   allows us to retrieve the bandwidth ratio or the absolute maximum
>   bandwidth per level?
> 
> What's also missing from your explanation is how that feedback loop behaves
> under different workloads.
> 
>   Is this assuming that the involved threads/cpus actually try to utilize
>   the bandwidth completely?
> 
>   What happens if the threads/cpus are only using a small set because they
>   are idle or their computations are mostly cache local and do not need
>   external bandwidth? Looking at the implementation I don't see how that is
>   taken into account.

Forgot to mention the following:

  The proposed new interface has no upper limit. The existing percentage
  based implementation has at least some notion of limit and scale; not
  really helpful either because of the hardware implementation. but I

  How is the poor admin supposed to configure that new thing without
  knowing what the actual hardware limits are in the first place?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/6] x86/intel_rdt/mba_sc: Add documentation for MBA software controller
  2018-04-03  9:46   ` Thomas Gleixner
  2018-04-03 14:29     ` Thomas Gleixner
@ 2018-04-03 18:45     ` Shivappa Vikas
  2018-04-04  9:11       ` Thomas Gleixner
  1 sibling, 1 reply; 21+ messages in thread
From: Shivappa Vikas @ 2018-04-03 18:45 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vikas Shivappa, vikas.shivappa, tony.luck, ravi.v.shankar,
	fenghua.yu, sai.praneeth.prakhya, x86, hpa, linux-kernel, ak



On Tue, 3 Apr 2018, Thomas Gleixner wrote:

> On Thu, 29 Mar 2018, Vikas Shivappa wrote:
>> +Memory bandwidth(b/w) in MegaBytes
>> +----------------------------------
>> +
>> +Memory bandwidth is a core specific mechanism which means that when the
>> +Memory b/w percentage is specified in the schemata per package it
>> +actually is applied on a per core basis via IA32_MBA_THRTL_MSR
>> +interface. This may lead to confusion in scenarios below:
>> +
>> +1. User may not see increase in actual b/w when percentage values are
>> +   increased:
>> +
>> +This can occur when aggregate L2 external b/w is more than L3 external
>> +b/w. Consider an SKL SKU with 24 cores on a package and where L2
>> +external b/w is 10GBps (hence aggregate L2 external b/w is 240GBps) and
>> +L3 external b/w is 100GBps. Now a workload with '20 threads, having 50%
>> +b/w, each consuming 5GBps' consumes the max L3 b/w of 100GBps although
>> +the percentage value specified is only 50% << 100%. Hence increasing
>> +the b/w percentage will not yeild any more b/w. This is because
>> +although the L2 external b/w still has capacity, the L3 external b/w
>> +is fully used. Also note that this would be dependent on number of
>> +cores the benchmark is run on.
>> +
>> +2. Same b/w percentage may mean different actual b/w depending on # of
>> +   threads:
>> +
>> +For the same SKU in #1, a 'single thread, with 10% b/w' and '4 thread,
>> +with 10% b/w' can consume upto 10GBps and 40GBps although they have same
>> +percentage b/w of 10%. This is simply because as threads start using
>> +more cores in an rdtgroup, the actual b/w may increase or vary although
>> +user specified b/w percentage is same.
>> +
>> +In order to mitigate this and make the interface more user friendly, we
>> +can let the user specify the max bandwidth per rdtgroup in bytes(or mega
>> +bytes). The kernel underneath would use a software feedback mechanism or
>> +a "Software Controller" which reads the actual b/w using MBM counters
>> +and adjust the memowy bandwidth percentages to ensure the "actual b/w
>> +< user b/w".
>> +
>> +The legacy behaviour is default and user can switch to the "MBA software
>> +controller" mode using a mount option 'mba_MB'.
>
> You said above:
>
>> This may lead to confusion in scenarios below:
>
> Reading the blurb after that creates even more confusion than being
> helpful.
>
> First of all this information should not be under the section 'Memory
> bandwidth in MB/s'.
>
> Also please write bandwidth. The weird acronym b/w (band per width???) is
> really not increasing legibility.

Ok will fix and add a seperate section.

>
> What you really want is a general section about memory bandwidth allocation
> where you explain the technical background in purely technical terms w/o
> fairy tale mode. Technical descriptions have to be factual and not
> 'could/may/would'.
>
> If I decode the above correctly then the current percentage based
> implementation was buggy from the very beginning in several ways.
>
> Now the obvious question which is in no way answered by the cover letter is
> why the current percentage based implementation cannot be fixed and we need
> some feedback driven magic to achieve that. I assume you spent some brain
> cycles on that question, so it would be really helpful if you shared that.
>
> If I understand it correctly then the problem is that the throttling
> mechanism is per core and affects the L2 external bandwidth.
>
>  Is this really per core? What about hyper threads. Both threads have that
>  MSR. How is that working?

It is per core mechanism. On hyperthreads, it just takes the lowest bandwidth 
among the thread siblings. We have the below to explain the same - i can add 
more description if needed

"The bandwidth throttling is a core specific mechanism on some of Intel
SKUs. Using a high bandwidth and a low bandwidth setting on two threads
sharing a core will result in both threads being throttled to use the
low bandwidth."

>
> The L2 external bandwidth is higher than the L3 external bandwidth.
>
>  Is there any information available from CPUID or whatever source which
>  allows us to retrieve the bandwidth ratio or the absolute maximum
>  bandwidth per level?

There is no information in cpuid on the bandwidth available. Also we have seen 
from our experiments that the increase is not perfectly linear (delta bandwidth 
increase from 30% to 40% may not be same as 70% to 80%). So we currently 
dynamically caliberate this delta for the software controller.

>
> What's also missing from your explanation is how that feedback loop behaves
> under different workloads.
>
>  Is this assuming that the involved threads/cpus actually try to utilize
>  the bandwidth completely?

No, the feedback loop only guarentees that the usage will not exceed what the 
user specifies as max bandwidth. If it is using below the max value it does not 
matter how much less it is using.

>
>  What happens if the threads/cpus are only using a small set because they
>  are idle or their computations are mostly cache local and do not need
>  external bandwidth? Looking at the implementation I don't see how that is
>  taken into account.

The feedback only kicks into action if a rdtgroup uses more bandwidth than the 
max specified by the user. I specified that it is always "ensure the "actual b/w
354 < user b/w" " and can add more explanation on these scenarios.

Also note that we are using the MBM counters for this feedback loop. Now that 
the interface is much more useful because we have the same rdtgroup that is 
being monitored and controlled. (vs. if we had the perf mbm the group of threads 
in resctrl mba and in mbm could be different and would be hard to measure what 
the threads/cpus in the resctrl are using). So the resctrl being used for both 
of these is a requirement for this and we always check that.

Thanks,
Vikas

>
> Thanks,
>
> 	tglx
>
>
>
>
>
>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/6] x86/intel_rdt/mba_sc: Add documentation for MBA software controller
  2018-04-03 14:29     ` Thomas Gleixner
@ 2018-04-03 18:49       ` Shivappa Vikas
  2018-04-04  9:30         ` Thomas Gleixner
  0 siblings, 1 reply; 21+ messages in thread
From: Shivappa Vikas @ 2018-04-03 18:49 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vikas Shivappa, vikas.shivappa, tony.luck, ravi.v.shankar,
	fenghua.yu, sai.praneeth.prakhya, x86, hpa, linux-kernel, ak



On Tue, 3 Apr 2018, Thomas Gleixner wrote:

> On Tue, 3 Apr 2018, Thomas Gleixner wrote:
>> On Thu, 29 Mar 2018, Vikas Shivappa wrote:
>> You said above:
>>
>>> This may lead to confusion in scenarios below:
>>
>> Reading the blurb after that creates even more confusion than being
>> helpful.
>>
>> First of all this information should not be under the section 'Memory
>> bandwidth in MB/s'.
>>
>> Also please write bandwidth. The weird acronym b/w (band per width???) is
>> really not increasing legibility.
>>
>> What you really want is a general section about memory bandwidth allocation
>> where you explain the technical background in purely technical terms w/o
>> fairy tale mode. Technical descriptions have to be factual and not
>> 'could/may/would'.
>>
>> If I decode the above correctly then the current percentage based
>> implementation was buggy from the very beginning in several ways.
>>
>> Now the obvious question which is in no way answered by the cover letter is
>> why the current percentage based implementation cannot be fixed and we need
>> some feedback driven magic to achieve that. I assume you spent some brain
>> cycles on that question, so it would be really helpful if you shared that.
>>
>> If I understand it correctly then the problem is that the throttling
>> mechanism is per core and affects the L2 external bandwidth.
>>
>>   Is this really per core? What about hyper threads. Both threads have that
>>   MSR. How is that working?
>>
>> The L2 external bandwidth is higher than the L3 external bandwidth.
>>
>>   Is there any information available from CPUID or whatever source which
>>   allows us to retrieve the bandwidth ratio or the absolute maximum
>>   bandwidth per level?
>>
>> What's also missing from your explanation is how that feedback loop behaves
>> under different workloads.
>>
>>   Is this assuming that the involved threads/cpus actually try to utilize
>>   the bandwidth completely?
>>
>>   What happens if the threads/cpus are only using a small set because they
>>   are idle or their computations are mostly cache local and do not need
>>   external bandwidth? Looking at the implementation I don't see how that is
>>   taken into account.
>
> Forgot to mention the following:
>
>  The proposed new interface has no upper limit. The existing percentage
>  based implementation has at least some notion of limit and scale; not
>  really helpful either because of the hardware implementation. but I
>
>  How is the poor admin supposed to configure that new thing without
>  knowing what the actual hardware limits are in the first place?

That is true. The default values only put it to a very high bandwidth which 
means user gets to use everything. There seems no other way other than 
caliberating to know the actual max bandwidth in bytes. That could be a better 
value to have as default so admin knows the limit. I will explore if there is 
a way to calculate the same without caliberating.

Thanks,
Vikas

>
> Thanks,
>
> 	tglx
>
>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/6] x86/intel_rdt/mba_sc: Add initialization support
  2018-04-03  9:52   ` Thomas Gleixner
@ 2018-04-03 18:51     ` Shivappa Vikas
  0 siblings, 0 replies; 21+ messages in thread
From: Shivappa Vikas @ 2018-04-03 18:51 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vikas Shivappa, vikas.shivappa, tony.luck, ravi.v.shankar,
	fenghua.yu, sai.praneeth.prakhya, x86, hpa, linux-kernel, ak



On Tue, 3 Apr 2018, Thomas Gleixner wrote:

> On Thu, 29 Mar 2018, Vikas Shivappa wrote:
>> +void setup_ctrlval(struct rdt_resource *r, u32 *dc, u32 *dm)
>> +{
>> +	int i;
>> +
>> +	/*
>> +	 * Initialize the Control MSRs to having no control.
>> +	 * For Cache Allocation: Set all bits in cbm
>> +	 * For Memory Allocation: Set b/w requested to 100%
>> +	 * and the b/w in MB to U32_MAX
>> +	 */
>> +	for (i = 0; i < r->num_closid; i++, dc++, dm++) {
>> +		*dc = r->membw.bw_byte ? MBA_BW_MAX_MB : r->default_ctrl;
>> +		*dm = r->default_ctrl;
>
> No! Please stop duct taping your stuff into the existing code. So far the
> ctrl value was the same as the value which was actually written into the
> MSR. With your new mode you have to split that up into the user supplied
> value and the value which gets written into the MSR.
>
> So the right thing to do is to separate the user value and the MSR value
> first and independent of the mode. Then the new mode falls into place
> naturally because r->default_ctrl and r->default_msrval are set up at mount
> time with the values which correspond to the mount mode.

will fix. I tried both and this implementation assumes what user modifies is 
the control values (because then schemata read and write is easy as user does it 
directly) but agree we can change that.

Thanks,
Vikas

>
> Thanks,
>
> 	tglx
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/6] x86/intel_rdt/mba_sc: Add documentation for MBA software controller
  2018-04-03 18:45     ` Shivappa Vikas
@ 2018-04-04  9:11       ` Thomas Gleixner
  2018-04-04 18:56         ` Shivappa Vikas
  0 siblings, 1 reply; 21+ messages in thread
From: Thomas Gleixner @ 2018-04-04  9:11 UTC (permalink / raw)
  To: Shivappa Vikas
  Cc: Vikas Shivappa, tony.luck, ravi.v.shankar, fenghua.yu,
	sai.praneeth.prakhya, x86, hpa, linux-kernel, ak

On Tue, 3 Apr 2018, Shivappa Vikas wrote:
> On Tue, 3 Apr 2018, Thomas Gleixner wrote:
> > On Thu, 29 Mar 2018, Vikas Shivappa wrote:
> > The L2 external bandwidth is higher than the L3 external bandwidth.
> > 
> >  Is there any information available from CPUID or whatever source which
> >  allows us to retrieve the bandwidth ratio or the absolute maximum
> >  bandwidth per level?
> 
> There is no information in cpuid on the bandwidth available. Also we have seen
> from our experiments that the increase is not perfectly linear (delta
> bandwidth increase from 30% to 40% may not be same as 70% to 80%). So we
> currently dynamically caliberate this delta for the software controller.

I assume you mean: calibrate

Though I don't see anything which looks remotely like calibration.
Calibration means that you determine the exact parameters by observation and
then can use the calibrated values afterwards. But that's not what you are
doing. So please don't claim its calibration.

You observe behaviour which depends on the workload and other
factors. That's not calibration. If you change the MSR by a granularity
value then you calculate the bandwidth delta vs. the previous MSR
value. That only makes sense and works when the application is having the
same memory access patterns accross both observation periods.

And of course, this won't be necessarily linear because if you throttle the
application then it gets less work done per CPU time slice and the
resulting stalls will also have side effects on the requested amount of
memory and therefore distort the measurement. Ditto the other way
around.

There are too many factors influencing this, so claiming that it's
calibration is window dressing at best. Even worse it suggests that it's
something accurate, which subverts your goal of reducing confusion.

Adaptive control might be an acceptable description, though given the
amount of factors which play into that it's still an euphemism for
'heuristic'.

> > What's also missing from your explanation is how that feedback loop behaves
> > under different workloads.
> > 
> >  Is this assuming that the involved threads/cpus actually try to utilize
> >  the bandwidth completely?
> 
> No, the feedback loop only guarentees that the usage will not exceed what the
> user specifies as max bandwidth. If it is using below the max value it does
> not matter how much less it is using.
> > 
> >  What happens if the threads/cpus are only using a small set because they
> >  are idle or their computations are mostly cache local and do not need
> >  external bandwidth? Looking at the implementation I don't see how that is
> >  taken into account.
> 
> The feedback only kicks into action if a rdtgroup uses more bandwidth than the
> max specified by the user. I specified that it is always "ensure the "actual
> b/w
> 354 < user b/w" " and can add more explanation on these scenarios.

Please finally stop to use this horrible 'b/w' thingy. It makes my eyes bleed
everytime.

> Also note that we are using the MBM counters for this feedback loop. Now that
> the interface is much more useful because we have the same rdtgroup that is
> being monitored and controlled. (vs. if we had the perf mbm the group of
> threads in resctrl mba and in mbm could be different and would be hard to
> measure what the threads/cpus in the resctrl are using).

Why does that make me smile?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/6] x86/intel_rdt/mba_sc: Add documentation for MBA software controller
  2018-04-03 18:49       ` Shivappa Vikas
@ 2018-04-04  9:30         ` Thomas Gleixner
  0 siblings, 0 replies; 21+ messages in thread
From: Thomas Gleixner @ 2018-04-04  9:30 UTC (permalink / raw)
  To: Shivappa Vikas
  Cc: Vikas Shivappa, tony.luck, ravi.v.shankar, fenghua.yu,
	sai.praneeth.prakhya, x86, hpa, linux-kernel, ak

On Tue, 3 Apr 2018, Shivappa Vikas wrote:
> On Tue, 3 Apr 2018, Thomas Gleixner wrote:
> >  The proposed new interface has no upper limit. The existing percentage
> >  based implementation has at least some notion of limit and scale; not
> >  really helpful either because of the hardware implementation. but I
> > 
> >  How is the poor admin supposed to configure that new thing without
> >  knowing what the actual hardware limits are in the first place?
> 
> That is true. The default values only put it to a very high bandwidth which
> means user gets to use everything. There seems no other way other than
> caliberating to know the actual max bandwidth in bytes. That could be a better
> value to have as default so admin knows the limit. I will explore if there is
> a way to calculate the same without caliberating.

Right, ideally we get that information from the hardware.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/6] x86/intel_rdt/mba_sc: Add documentation for MBA software controller
  2018-04-04  9:11       ` Thomas Gleixner
@ 2018-04-04 18:56         ` Shivappa Vikas
  0 siblings, 0 replies; 21+ messages in thread
From: Shivappa Vikas @ 2018-04-04 18:56 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Shivappa Vikas, Vikas Shivappa, tony.luck, ravi.v.shankar,
	fenghua.yu, sai.praneeth.prakhya, x86, hpa, linux-kernel, ak



On Wed, 4 Apr 2018, Thomas Gleixner wrote:

> On Tue, 3 Apr 2018, Shivappa Vikas wrote:
> > On Tue, 3 Apr 2018, Thomas Gleixner wrote:
> > > On Thu, 29 Mar 2018, Vikas Shivappa wrote:
> > > The L2 external bandwidth is higher than the L3 external bandwidth.
> > > 
> > >  Is there any information available from CPUID or whatever source which
> > >  allows us to retrieve the bandwidth ratio or the absolute maximum
> > >  bandwidth per level?
> > 
> > There is no information in cpuid on the bandwidth available. Also we have seen
> > from our experiments that the increase is not perfectly linear (delta
> > bandwidth increase from 30% to 40% may not be same as 70% to 80%). So we
> > currently dynamically caliberate this delta for the software controller.
> 
> I assume you mean: calibrate
> 
> Though I don't see anything which looks remotely like calibration.
> Calibration means that you determine the exact parameters by observation and
> then can use the calibrated values afterwards. But that's not what you are
> doing. So please don't claim its calibration.
> 
> You observe behaviour which depends on the workload and other
> factors. That's not calibration. If you change the MSR by a granularity
> value then you calculate the bandwidth delta vs. the previous MSR
> value. That only makes sense and works when the application is having the
> same memory access patterns accross both observation periods.
> 
> And of course, this won't be necessarily linear because if you throttle the
> application then it gets less work done per CPU time slice and the
> resulting stalls will also have side effects on the requested amount of
> memory and therefore distort the measurement. Ditto the other way
> around.
> 
> There are too many factors influencing this, so claiming that it's
> calibration is window dressing at best. Even worse it suggests that it's
> something accurate, which subverts your goal of reducing confusion.
> 
> Adaptive control might be an acceptable description, though given the
> amount of factors which play into that it's still an euphemism for
> 'heuristic'.

Agree we donot really caliberate and the only thing we guarentee is that 
the actual bandwidth in bytes < user specified bandwidth bytes. This is 
what the hardware guarenteed when we specified the values in percentage 
as well but just that it was confusing.

> 
> > > What's also missing from your explanation is how that feedback loop behaves
> > > under different workloads.
> > > 
> > >  Is this assuming that the involved threads/cpus actually try to utilize
> > >  the bandwidth completely?
> > 
> > No, the feedback loop only guarentees that the usage will not exceed what the
> > user specifies as max bandwidth. If it is using below the max value it does
> > not matter how much less it is using.
> > > 
> > >  What happens if the threads/cpus are only using a small set because they
> > >  are idle or their computations are mostly cache local and do not need
> > >  external bandwidth? Looking at the implementation I don't see how that is
> > >  taken into account.
> > 
> > The feedback only kicks into action if a rdtgroup uses more bandwidth than the
> > max specified by the user. I specified that it is always "ensure the "actual
> > b/w
> > 354 < user b/w" " and can add more explanation on these scenarios.
> 
> Please finally stop to use this horrible 'b/w' thingy. It makes my eyes bleed
> everytime.

Will fix - this was a text from already existing documentation.

> 
> > Also note that we are using the MBM counters for this feedback loop. Now that
> > the interface is much more useful because we have the same rdtgroup that is
> > being monitored and controlled. (vs. if we had the perf mbm the group of
> > threads in resctrl mba and in mbm could be different and would be hard to
> > measure what the threads/cpus in the resctrl are using).
> 
> Why does that make me smile?

I know why :) Full credits to you as you had suggested to rewrite the 
cqm/mbm in resctrl which is definitely very good in long term !

Thanks,
Vikas

> 
> Thanks,
> 
> 	tglx
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 4/6] x86/intel_rdt/mba_sc: Add schemata support
  2018-04-20 22:36 [PATCH V2 0/6] Memory bandwidth allocation software controller(mba_sc) Vikas Shivappa
@ 2018-04-20 22:36 ` Vikas Shivappa
  0 siblings, 0 replies; 21+ messages in thread
From: Vikas Shivappa @ 2018-04-20 22:36 UTC (permalink / raw)
  To: vikas.shivappa, tony.luck, ravi.v.shankar, fenghua.yu, x86, tglx, hpa
  Cc: linux-kernel, ak, vikas.shivappa

Currently when user updates the "schemata" with new MBA percentage
values, kernel writes the corresponding bandwidth percentage values to
the IA32_MBA_THRTL_MSR.

When MBA is expressed in MBps, the schemata format is changed to have
the per package memory bandwidth in MBps instead of being specified in
percentage. We do not write the IA32_MBA_THRTL_MSRs when the schemata
is updated as that is handled separately.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 arch/x86/kernel/cpu/intel_rdt.c             |  2 +-
 arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c | 24 +++++++++++++++++++-----
 2 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 07f6b3b..85805d7 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -179,7 +179,7 @@ struct rdt_resource rdt_resources_all[] = {
 		.msr_update		= mba_wrmsr,
 		.cache_level		= 3,
 		.parse_ctrlval		= parse_bw,
-		.format_str		= "%d=%*d",
+		.format_str		= "%d=%*u",
 		.fflags			= RFTYPE_RES_MB,
 	},
 };
diff --git a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c
index 23e1d5c..116d57b 100644
--- a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c
+++ b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c
@@ -53,7 +53,8 @@ static bool bw_validate(char *buf, unsigned long *data, struct rdt_resource *r)
 		return false;
 	}
 
-	if (bw < r->membw.min_bw || bw > r->default_ctrl) {
+	if ((bw < r->membw.min_bw || bw > r->default_ctrl) &&
+	    !is_mba_sc(r)) {
 		rdt_last_cmd_printf("MB value %ld out of range [%d,%d]\n", bw,
 				    r->membw.min_bw, r->default_ctrl);
 		return false;
@@ -179,6 +180,8 @@ static int update_domains(struct rdt_resource *r, int closid)
 	struct msr_param msr_param;
 	cpumask_var_t cpu_mask;
 	struct rdt_domain *d;
+	bool mba_sc;
+	u32 *dc;
 	int cpu;
 
 	if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
@@ -188,13 +191,20 @@ static int update_domains(struct rdt_resource *r, int closid)
 	msr_param.high = msr_param.low + 1;
 	msr_param.res = r;
 
+	mba_sc = is_mba_sc(r);
 	list_for_each_entry(d, &r->domains, list) {
-		if (d->have_new_ctrl && d->new_ctrl != d->ctrl_val[closid]) {
+		dc = !mba_sc ? d->ctrl_val : d->mbps_val;
+		if (d->have_new_ctrl && d->new_ctrl != dc[closid]) {
 			cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
-			d->ctrl_val[closid] = d->new_ctrl;
+			dc[closid] = d->new_ctrl;
 		}
 	}
-	if (cpumask_empty(cpu_mask))
+
+	/*
+	 * Avoid writing the control msr with control values when
+	 * MBA software controller is enabled
+	 */
+	if (cpumask_empty(cpu_mask) || mba_sc)
 		goto done;
 	cpu = get_cpu();
 	/* Update CBM on this cpu if it's in cpu_mask. */
@@ -282,13 +292,17 @@ static void show_doms(struct seq_file *s, struct rdt_resource *r, int closid)
 {
 	struct rdt_domain *dom;
 	bool sep = false;
+	u32 ctrl_val;
 
 	seq_printf(s, "%*s:", max_name_width, r->name);
 	list_for_each_entry(dom, &r->domains, list) {
 		if (sep)
 			seq_puts(s, ";");
+
+		ctrl_val = (!is_mba_sc(r) ? dom->ctrl_val[closid] :
+			    dom->mbps_val[closid]);
 		seq_printf(s, r->format_str, dom->id, max_data_width,
-			   dom->ctrl_val[closid]);
+			   ctrl_val);
 		sep = true;
 	}
 	seq_puts(s, "\n");
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2018-04-20 22:39 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-29 22:26 [PATCH RFC 0/6] Memory b/w allocation software controller Vikas Shivappa
2018-03-29 22:26 ` [PATCH 1/6] x86/intel_rdt/mba_sc: Add documentation for MBA " Vikas Shivappa
2018-04-03  9:46   ` Thomas Gleixner
2018-04-03 14:29     ` Thomas Gleixner
2018-04-03 18:49       ` Shivappa Vikas
2018-04-04  9:30         ` Thomas Gleixner
2018-04-03 18:45     ` Shivappa Vikas
2018-04-04  9:11       ` Thomas Gleixner
2018-04-04 18:56         ` Shivappa Vikas
2018-03-29 22:26 ` [PATCH 2/6] x86/intel_rdt/mba_sc: Add support to enable/disable via mount option Vikas Shivappa
2018-03-30  9:32   ` Thomas Gleixner
2018-03-30 17:19     ` Shivappa Vikas
2018-03-29 22:26 ` [PATCH 3/6] x86/intel_rdt/mba_sc: Add initialization support Vikas Shivappa
2018-04-03  9:52   ` Thomas Gleixner
2018-04-03 18:51     ` Shivappa Vikas
2018-03-29 22:26 ` [PATCH 4/6] x86/intel_rdt/mba_sc: Add schemata support Vikas Shivappa
2018-03-29 22:26 ` [PATCH 5/6] x86/intel_rdt/mba_sc: Add counting for MBA software controller Vikas Shivappa
2018-03-29 22:26 ` [PATCH 6/6] x86/intel_rdt/mba_sc: Add support to dynamically update the memory b/w Vikas Shivappa
2018-03-30 21:21   ` kbuild test robot
2018-03-31  1:37   ` kbuild test robot
2018-04-20 22:36 [PATCH V2 0/6] Memory bandwidth allocation software controller(mba_sc) Vikas Shivappa
2018-04-20 22:36 ` [PATCH 4/6] x86/intel_rdt/mba_sc: Add schemata support Vikas Shivappa

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.