* [RFC PATCH V2 00/22]  Intel(R) Resource Director Technology Cache Pseudo-Locking enabling
@ 2018-02-13 15:46 Reinette Chatre
  2018-02-13 15:46 ` [RFC PATCH V2 01/22] x86/intel_rdt: Documentation for Cache Pseudo-Locking Reinette Chatre
                   ` (22 more replies)
  0 siblings, 23 replies; 65+ messages in thread
From: Reinette Chatre @ 2018-02-13 15:46 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck
  Cc: gavin.hindman, vikas.shivappa, dave.hansen, mingo, hpa, x86,
	linux-kernel, Reinette Chatre, linux-mm, Andrew Morton,
	Mike Kravetz, Michal Hocko, Vlastimil Babka

Adding MM maintainers to v2 to share the new MM change (patch 21/22),
which enables large contiguous regions and was created to support large
Cache Pseudo-Locked regions (patch 22/22). This week the MM team received
another proposal to support large contiguous allocations ("[RFC PATCH 0/3]
Interface for higher order contiguous allocations" at
http://lkml.kernel.org/r/20180212222056.9735-1-mike.kravetz@oracle.com).
I have not yet tested against this new proposal, but it does seem
appropriate and I should be able to rework patch 22 of this series on top
of it if it is accepted instead of what I have in patch 21 of this series.

Changes since v1:
- Enable allocation of contiguous regions larger than what SLAB allocators
  can support. This removes the 4MB Cache Pseudo-Locking limitation
  documented in v1 submission.
  This depends on "mm: drop hotplug lock from lru_add_drain_all",
  now in v4.16-rc1 as 9852a7212324fd25f896932f4f4607ce47b0a22f.
- Convert to debugfs_file_get() and -put() from the now obsolete
  debugfs_use_file_start() and debugfs_use_file_finish() calls.
- Rebase on top of, and take into account, recent L2 CDP enabling.
- Simplify tracing output to print cache hit and miss counts on the same line.

This version is based on x86/cache of tip.git when the HEAD was
(based on v4.15-rc8):

commit 31516de306c0c9235156cdc7acb976ea21f1f646
Author: Fenghua Yu <fenghua.yu@intel.com>
Date:   Wed Dec 20 14:57:24 2017 -0800

    x86/intel_rdt: Add command line parameter to control L2_CDP

Cc: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>

No changes below. It is verbatim from the first submission (except for the
diffstat at the end, which reflects v2).

Dear Maintainers,

Cache Allocation Technology (CAT), part of Intel(R) Resource Director
Technology (Intel(R) RDT), enables a user to specify the amount of cache
space that an application can fill. Cache pseudo-locking builds on the
fact that a CPU can still read and write data pre-allocated outside its
current allocated area on a cache hit. With cache pseudo-locking, data
can be preloaded into a reserved portion of cache that no application can
fill, and from that point on will only serve cache hits. The cache
pseudo-locked memory is made accessible to user space where an application
can map it into its virtual address space and thus have a region of
memory with reduced average read latency.

The cache pseudo-locking approach relies on generation-specific behavior
of processors. It may provide benefits on certain processor generations,
but is not guaranteed to be supported in the future. It is not a guarantee
that data will remain in the cache, nor that data will remain in certain
levels or certain regions of the cache. Rather, cache pseudo-locking
increases the probability that data will remain in a certain level of the
cache by carefully configuring the CAT feature and carefully controlling
application behavior.

Known limitations:
Instructions like INVD, WBINVD, CLFLUSH, etc. can still evict pseudo-locked
memory from the cache. Power management C-states may still shrink or power
off the cache, causing eviction of cache pseudo-locked memory. We utilize
PM QoS to prevent cores associated with cache pseudo-locked regions from
entering deeper C-states from the time the pseudo-locked regions are
created.
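
For illustration only, below is a minimal user-space sketch of the PM QoS
mechanism referred to above: holding the cpu_dma_latency constraint via
/dev/cpu_dma_latency prevents deeper C-states for as long as the file
descriptor stays open. This is not part of the series; the series applies
an equivalent constraint from within the kernel when a pseudo-locked
region is created (patch 20/22).

/*
 * Sketch: request a maximum CPU wakeup latency of max_latency_us through
 * the PM QoS cpu_dma_latency interface. The constraint is held until the
 * returned file descriptor is closed.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static int hold_cpu_latency(int32_t max_latency_us)
{
	int fd = open("/dev/cpu_dma_latency", O_WRONLY);

	if (fd < 0) {
		perror("open /dev/cpu_dma_latency");
		return -1;
	}
	if (write(fd, &max_latency_us, sizeof(max_latency_us)) < 0) {
		perror("write");
		close(fd);
		return -1;
	}
	return fd;	/* keep open to hold the constraint */
}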

Known software limitation:
Cache pseudo-locked regions are currently limited to 4MB, even on
platforms that support larger cache sizes. Work is in progress to
support larger regions.

Graphs visualizing the benefits of cache pseudo-locking on an Intel(R)
NUC NUC6CAYS (it has an Intel(R) Celeron(R) Processor J3455) with the
default 2GB DDR3L-1600 memory are available. In these tests the patches
from this series were applied on the x86/cache branch of tip.git at the
time the HEAD was:

commit 87943db7dfb0c5ee5aa74a9ac06346fadd9695c8 (tip/x86/cache)
Author: Reinette Chatre <reinette.chatre@intel.com>
Date:   Fri Oct 20 02:16:59 2017 -0700
    x86/intel_rdt: Fix potential deadlock during resctrl mount

DISCLAIMER: Tests document performance of components on a particular test,
in specific systems. Differences in hardware, software, or configuration
will affect actual performance. Performance varies depending on system
configuration.

- https://github.com/rchatre/data/blob/master/cache_pseudo_locking/rfc_v1/perfcount.png
The plot above shows the few L2 cache misses possible with cache
pseudo-locking on the Intel(R) NUC in its default configuration. Each
test, repeated 100 times, pseudo-locks the schemata shown and then
measures, from the kernel via precision counters, the number of cache
misses when the memory is accessed afterwards. This test is run on an
idle system as well as on a system with significant noise (generated with
stress-ng) from a neighboring core associated with the same cache. This
plot shows that: (1) the number of cache misses remains consistent
irrespective of the size of the region being pseudo-locked, and (2) the
number of cache misses for a pseudo-locked region remains low when
traversing memory regions ranging in size from 256KB (4096 cache lines)
to 896KB (14336 cache lines).

- https://github.com/rchatre/data/blob/master/cache_pseudo_locking/rfc_v1/userspace_malloc_with_load.png
The plot above shows the read latency experienced by an application
running with the default CAT CLOS after it allocated 256KB of memory with
malloc() (and used mlockall()). In this example the application reads
randomly (to avoid triggering the hardware prefetchers) from its entire
allocated region at 2 second intervals while a noisy neighbor is present.
Each individual access is 32 bytes in size and the latency of each access
is measured using the rdtsc instruction (a sketch of this measurement loop
follows the plot descriptions below). In this visualization we can observe
two groupings of data: the group with lower latency indicating cache hits,
and the group with higher latency indicating cache misses. A significant
portion of the memory reads experience the larger latencies.

- https://github.com/rchatre/data/blob/master/cache_pseudo_locking/rfc_v1/userspace_psl_with_load.png
The plot above shows a test similar to the previous one, but instead of
the application reading from a 256KB malloc() region it reads from a 256KB
pseudo-locked region that was mmap()'ed into its address space. Compared
to the regular malloc() latencies we see a significant improvement in the
latencies experienced.

- https://github.com/rchatre/data/blob/master/cache_pseudo_locking/rfc_v1/userspace_malloc_and_cat_with_load_clos0_fixed.png
Applications that are sensitive to latencies may use existing CAT
technology to isolate the sensitive application. In this plot we show an
application running with a dedicated CAT CLOS double the size (512KB) of
the memory being tested (256KB). A dedicated CLOS with CBM 0x0f is created
and the default CLOS is changed to CBM 0xf0. We see in this plot that even
though the application runs within a dedicated portion of the cache it
still experiences significant latency accessing its memory (when compared
to pseudo-locking).
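
A minimal sketch of the user-space measurement loop used in the plots
above (the buffer size, sample count, and helper names here are
illustrative, not taken from the actual test code). The same loop is used
whether the buffer comes from malloc()/mlockall() or from mmap()ing a
pseudo-locked region:

/*
 * Time individual 32-byte reads at random offsets within a buffer
 * using rdtsc. Illustrative sketch only.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>

#define BUF_SIZE	(256 * 1024)
#define STRIDE		32

static uint64_t read_latency(const volatile char *p)
{
	uint64_t start, stop;
	volatile char sink;
	int i;

	_mm_lfence();
	start = __rdtsc();
	for (i = 0; i < STRIDE; i++)
		sink = p[i];		/* read one 32-byte chunk */
	_mm_lfence();
	stop = __rdtsc();
	(void)sink;
	return stop - start;
}

static void measure(const volatile char *buf, int samples)
{
	size_t off;
	int i;

	for (i = 0; i < samples; i++) {
		/* random, 32-byte aligned offset to defeat the prefetchers */
		off = ((size_t)rand() % (BUF_SIZE / STRIDE)) * STRIDE;
		printf("%llu\n", (unsigned long long)read_latency(buf + off));
	}
}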

Your feedback on this proposal for enabling Cache Pseudo-Locking will be
greatly appreciated.

Regards,

Reinette

Reinette Chatre (22):
  x86/intel_rdt: Documentation for Cache Pseudo-Locking
  x86/intel_rdt: Make useful functions available internally
  x86/intel_rdt: Introduce hooks to create pseudo-locking files
  x86/intel_rdt: Introduce test to determine if closid is in use
  x86/intel_rdt: Print more accurate pseudo-locking availability
  x86/intel_rdt: Create pseudo-locked regions
  x86/intel_rdt: Connect pseudo-locking directory to operations
  x86/intel_rdt: Introduce pseudo-locking resctrl files
  x86/intel_rdt: Discover supported platforms via prefetch disable bits
  x86/intel_rdt: Disable pseudo-locking if CDP enabled
  x86/intel_rdt: Associate pseudo-locked regions with its domain
  x86/intel_rdt: Support CBM checking from value and character buffer
  x86/intel_rdt: Support schemata write - pseudo-locking core
  x86/intel_rdt: Enable testing for pseudo-locked region
  x86/intel_rdt: Prevent new allocations from pseudo-locked regions
  x86/intel_rdt: Create debugfs files for pseudo-locking testing
  x86/intel_rdt: Create character device exposing pseudo-locked region
  x86/intel_rdt: More precise L2 hit/miss measurements
  x86/intel_rdt: Support L3 cache performance event of Broadwell
  x86/intel_rdt: Limit C-states dynamically when pseudo-locking active
  mm/hugetlb: Enable large allocations through gigantic page API
  x86/intel_rdt: Support contiguous memory of all sizes

 Documentation/x86/intel_rdt_ui.txt                |  229 ++-
 arch/x86/Kconfig                                  |   11 +
 arch/x86/kernel/cpu/Makefile                      |    4 +-
 arch/x86/kernel/cpu/intel_rdt.h                   |   24 +
 arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c       |   44 +-
 arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c       | 1894 +++++++++++++++++++++
 arch/x86/kernel/cpu/intel_rdt_pseudo_lock_event.h |   52 +
 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c          |   46 +-
 include/linux/hugetlb.h                           |    2 +
 mm/hugetlb.c                                      |   10 +-
 10 files changed, 2290 insertions(+), 26 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
 create mode 100644 arch/x86/kernel/cpu/intel_rdt_pseudo_lock_event.h

-- 
2.13.6

* [RFC PATCH V2 01/22] x86/intel_rdt: Documentation for Cache Pseudo-Locking
  2018-02-13 15:46 [RFC PATCH V2 00/22] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling Reinette Chatre
@ 2018-02-13 15:46 ` Reinette Chatre
  2018-02-19 20:35   ` Thomas Gleixner
  2018-02-19 21:27   ` Randy Dunlap
  2018-02-13 15:46 ` [RFC PATCH V2 02/22] x86/intel_rdt: Make useful functions available internally Reinette Chatre
                   ` (21 subsequent siblings)
  22 siblings, 2 replies; 65+ messages in thread
From: Reinette Chatre @ 2018-02-13 15:46 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck
  Cc: gavin.hindman, vikas.shivappa, dave.hansen, mingo, hpa, x86,
	linux-kernel, Reinette Chatre

Add description of Cache Pseudo-Locking feature, its interface,
as well as an example of its usage.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 Documentation/x86/intel_rdt_ui.txt | 229 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 228 insertions(+), 1 deletion(-)

diff --git a/Documentation/x86/intel_rdt_ui.txt b/Documentation/x86/intel_rdt_ui.txt
index 756fd76b78a6..bb3d6fe0a3e4 100644
--- a/Documentation/x86/intel_rdt_ui.txt
+++ b/Documentation/x86/intel_rdt_ui.txt
@@ -27,7 +27,10 @@ mount options are:
 L2 and L3 CDP are controlled seperately.
 
 RDT features are orthogonal. A particular system may support only
-monitoring, only control, or both monitoring and control.
+monitoring, only control, or both monitoring and control. Cache
+pseudo-locking is a unique way of using cache control to "pin" or
+"lock" data in the cache. Details can be found in
+"Cache Pseudo-Locking".
 
 The mount succeeds if either of allocation or monitoring is present, but
 only those files and directories supported by the system will be created.
@@ -329,6 +332,149 @@ L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
 L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
 L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
 
+Cache Pseudo-Locking
+--------------------
+CAT enables a user to specify the amount of cache space that an
+application can fill. Cache pseudo-locking builds on the fact that a
+CPU can still read and write data pre-allocated outside its current
+allocated area on a cache hit. With cache pseudo-locking, data can be
+preloaded into a reserved portion of cache that no application can
+fill, and from that point on will only serve cache hits. The cache
+pseudo-locked memory is made accessible to user space where an
+application can map it into its virtual address space and thus have
+a region of memory with reduced average read latency.
+
+Cache pseudo-locking increases the probability that data will remain
+in the cache via carefully configuring the CAT feature and controlling
+application behavior. There is no guarantee that data is placed in
+cache. Instructions like INVD, WBINVD, CLFLUSH, etc. can still evict
+"locked" data from cache. Power management C-states may shrink or
+power off cache. It is thus recommended to limit the processor maximum
+C-state, for example, by setting the processor.max_cstate kernel parameter.
+
+It is required that an application using a pseudo-locked region runs
+with affinity to the cores (or a subset of the cores) associated
+with the cache on which the pseudo-locked region resides. This is
+enforced by the implementation.
+
+Pseudo-locking is accomplished in two stages:
+1) During the first stage the system administrator allocates a portion
+   of cache that should be dedicated to pseudo-locking. At this time an
+   equivalent portion of memory is allocated, loaded into allocated
+   cache portion, and exposed as a character device.
+2) During the second stage a user-space application maps (mmap()) the
+   pseudo-locked memory into its address space.
+
+Cache Pseudo-Locking Interface
+------------------------------
+Platforms supporting cache pseudo-locking will expose a new
+"/sys/fs/restrl/pseudo_lock" directory after successful mount of the
+resctrl filesystem. Initially this directory will contain a single file,
+"avail" that contains the schemata, one line per resource, of cache region
+available for pseudo-locking.
+
+A pseudo-locked region is created by creating a new directory within
+/sys/fs/resctrl/pseudo_lock. On success two new files will appear in
+the directory:
+
+"schemata":
+	Shows the schemata representing the pseudo-locked cache region.
+	User writes schemata of requested locked area to file.
+	Only one id of single resource accepted - can only lock from
+	single cache instance. Writing of schemata to this file will
+	return success on successful pseudo-locked region setup.
+"size":
+	After successful pseudo-locked region setup this read-only file
+	will contain the size in bytes of pseudo-locked region.
+
+Cache Pseudo-Locking Debugging Interface
+----------------------------------------
+The pseudo-locking debugging interface is enabled with
+CONFIG_INTEL_RDT_DEBUGFS and can be found in
+/sys/kernel/debug/resctrl/pseudo_lock.
+
+There is no explicit way for the kernel to test if a provided memory
+location is present in the cache. The pseudo-locking debugging interface uses
+the tracing infrastructure to provide two ways to measure cache residency of
+the pseudo-locked region:
+1) Memory access latency using the pseudo_lock_mem_latency tracepoint. Data
+   from these measurements are best visualized using a hist trigger (see
+   example below). In this test the pseudo-locked region is traversed at
+   a stride of 32 bytes while hardware prefetchers, preemption, and interrupts
+   are disabled. This also provides a substitute visualization of cache
+   hits and misses.
+2) Cache hit and miss measurements using model specific precision counters if
+   available. Depending on the levels of cache on the system the following
+   tracepoints are available: pseudo_lock_l2_hits, pseudo_lock_l2_miss,
+   pseudo_lock_l3_miss, and pseudo_lock_l3_hits. WARNING: triggering this
+   measurement uses from two (for just L2 measurements) to four (for L2 and L3
+   measurements) precision counters on the system. If any other
+   measurements are in progress the counters and their corresponding event
+   registers will be clobbered.
+
+When a pseudo-locked region is created a new debugfs directory is created for
+it in debugfs as /sys/kernel/debug/resctrl/pseudo_lock/<newdir>. A single
+write-only file, measure_trigger, is present in this directory. The
+measurement on the pseudo-locked region depends on the number, 1 or 2,
+written to this debugfs file. Since the measurements are recorded with the
+tracing infrastructure the relevant tracepoints need to be enabled before the
+measurement is triggered.
+
+Example of latency debugging interface:
+In this example a pseudo-locked region named "newlock" was created. Here is
+how we can measure the latency in cycles of reading from this region:
+# :> /sys/kernel/debug/tracing/trace
+# echo 'hist:keys=latency' > /sys/kernel/debug/tracing/events/pseudo_lock/pseudo_lock_mem_latency/trigger
+# echo 1 > /sys/kernel/debug/tracing/events/pseudo_lock/pseudo_lock_mem_latency/enable
+# echo 1 > /sys/kernel/debug/resctrl/pseudo_lock/newlock/measure_trigger
+# echo 0 > /sys/kernel/debug/tracing/events/pseudo_lock/pseudo_lock_mem_latency/enable
+# cat /sys/kernel/debug/tracing/events/pseudo_lock/pseudo_lock_mem_latency/hist
+
+# event histogram
+#
+# trigger info: hist:keys=latency:vals=hitcount:sort=hitcount:size=2048 [active]
+#
+
+{ latency:        456 } hitcount:          1
+{ latency:         50 } hitcount:         83
+{ latency:         36 } hitcount:         96
+{ latency:         44 } hitcount:        174
+{ latency:         48 } hitcount:        195
+{ latency:         46 } hitcount:        262
+{ latency:         42 } hitcount:        693
+{ latency:         40 } hitcount:       3204
+{ latency:         38 } hitcount:       3484
+
+Totals:
+    Hits: 8192
+    Entries: 9
+    Dropped: 0
+
+Example of cache hits/misses debugging:
+In this example a pseudo-locked region named "newlock" was created on the L2
+cache of a platform. Here is how we can obtain details of the cache hits
+and misses using the platform's precision counters.
+
+# :> /sys/kernel/debug/tracing/trace
+# echo 1 > /sys/kernel/debug/tracing/events/pseudo_lock/pseudo_lock_l2_hits/enable
+# echo 1 > /sys/kernel/debug/tracing/events/pseudo_lock/pseudo_lock_l2_miss/enable
+# echo 2 > /sys/kernel/debug/resctrl/pseudo_lock/newlock/measure_trigger
+# echo 0 > /sys/kernel/debug/tracing/events/pseudo_lock/pseudo_lock_l2_hits/enable
+# echo 0 > /sys/kernel/debug/tracing/events/pseudo_lock/pseudo_lock_l2_miss/enable
+# cat /sys/kernel/debug/tracing/trace
+
+# tracer: nop
+#
+#                              _-----=> irqs-off
+#                             / _----=> need-resched
+#                            | / _---=> hardirq/softirq
+#                            || / _--=> preempt-depth
+#                            ||| /     delay
+#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
+#              | |       |   ||||       |         |
+ pseudo_lock_mea-1039  [002] ....  1598.825180: pseudo_lock_l2_hits: L2 hits=4097
+ pseudo_lock_mea-1039  [002] ....  1598.825184: pseudo_lock_l2_miss: L2 miss=2
+
 Examples for RDT allocation usage:
 
 Example 1
@@ -443,6 +589,87 @@ siblings and only the real time threads are scheduled on the cores 4-7.
 
 # echo F0 > p0/cpus
 
+Example of Cache Pseudo-Locking
+-------------------------------
+Lock a portion of the L2 cache from cache id 1 using CBM 0x3. The
+pseudo-locked region is exposed at /dev/pseudo_lock/newlock and can be
+provided to an application as an argument to mmap().
+
+# cd /sys/fs/resctrl/pseudo_lock
+# cat avail
+L2:0=ff;1=ff
+# mkdir newlock
+# cd newlock
+# cat schemata
+L2:uninitialized
+# echo 'L2:1=3' > schemata
+# ls -l /dev/pseudo_lock/newlock
+crw------- 1 root root 244, 0 Mar 30 03:00 /dev/pseudo_lock/newlock
+
+/*
+ * Example code to access one page of pseudo-locked cache region
+ * from user space.
+ */
+#define _GNU_SOURCE
+#include <fcntl.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <sys/mman.h>
+
+/*
+ * It is required that the application runs with affinity to only
+ * cores associated with the pseudo-locked region. Here the cpu
+ * is hardcoded for convenience of example.
+ */
+static int cpuid = 2;
+
+int main(int argc, char *argv[])
+{
+	cpu_set_t cpuset;
+	long page_size;
+	void *mapping;
+	int dev_fd;
+	int ret;
+
+	page_size = sysconf(_SC_PAGESIZE);
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(cpuid, &cpuset);
+	ret = sched_setaffinity(0, sizeof(cpuset), &cpuset);
+	if (ret < 0) {
+		perror("sched_setaffinity");
+		exit(EXIT_FAILURE);
+	}
+
+	dev_fd = open("/dev/pseudo_lock/newlock", O_RDWR);
+	if (dev_fd < 0) {
+		perror("open");
+		exit(EXIT_FAILURE);
+	}
+
+	mapping = mmap(0, page_size, PROT_READ | PROT_WRITE, MAP_SHARED,
+		       dev_fd, 0);
+	if (mapping == MAP_FAILED) {
+		perror("mmap");
+		close(dev_fd);
+		exit(EXIT_FAILURE);
+	}
+
+	/* Application interacts with pseudo-locked memory @mapping */
+
+	ret = munmap(mapping, page_size);
+	if (ret < 0) {
+		perror("munmap");
+		close(dev_fd);
+		exit(EXIT_FAILURE);
+	}
+
+	close(dev_fd);
+	exit(EXIT_SUCCESS);
+}
+
 4) Locking between applications
 
 Certain operations on the resctrl filesystem, composed of read/writes
-- 
2.13.6

* [RFC PATCH V2 02/22] x86/intel_rdt: Make useful functions available internally
  2018-02-13 15:46 [RFC PATCH V2 00/22] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling Reinette Chatre
  2018-02-13 15:46 ` [RFC PATCH V2 01/22] x86/intel_rdt: Documentation for Cache Pseudo-Locking Reinette Chatre
@ 2018-02-13 15:46 ` Reinette Chatre
  2018-02-13 15:46 ` [RFC PATCH V2 03/22] x86/intel_rdt: Introduce hooks to create pseudo-locking files Reinette Chatre
                   ` (20 subsequent siblings)
  22 siblings, 0 replies; 65+ messages in thread
From: Reinette Chatre @ 2018-02-13 15:46 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck
  Cc: gavin.hindman, vikas.shivappa, dave.hansen, mingo, hpa, x86,
	linux-kernel, Reinette Chatre

In preparation for support of pseudo-locking we make some currently
static functions available for sharing amongst all RDT components.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 arch/x86/kernel/cpu/intel_rdt.h             | 5 +++++
 arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c | 2 +-
 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c    | 8 ++++----
 3 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
index 3fd7a70ee04a..df4db23ddd74 100644
--- a/arch/x86/kernel/cpu/intel_rdt.h
+++ b/arch/x86/kernel/cpu/intel_rdt.h
@@ -430,7 +430,12 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
 				char *buf, size_t nbytes, loff_t off);
 int rdtgroup_schemata_show(struct kernfs_open_file *of,
 			   struct seq_file *s, void *v);
+int rdtgroup_add_files(struct kernfs_node *kn, unsigned long fflags);
+int rdtgroup_kn_set_ugid(struct kernfs_node *kn);
 struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
+int closid_alloc(void);
+void closid_free(int closid);
+int update_domains(struct rdt_resource *r, int closid);
 int alloc_rmid(void);
 void free_rmid(u32 rmid);
 int rdt_get_mon_l3_config(struct rdt_resource *r);
diff --git a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c
index 23e1d5c249c6..9e1a455e4d9b 100644
--- a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c
+++ b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c
@@ -174,7 +174,7 @@ static int parse_line(char *line, struct rdt_resource *r)
 	return -EINVAL;
 }
 
-static int update_domains(struct rdt_resource *r, int closid)
+int update_domains(struct rdt_resource *r, int closid)
 {
 	struct msr_param msr_param;
 	cpumask_var_t cpu_mask;
diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index bdab7d2f51af..2a14867a14f7 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -109,7 +109,7 @@ static void closid_init(void)
 	closid_free_map &= ~1;
 }
 
-static int closid_alloc(void)
+int closid_alloc(void)
 {
 	u32 closid = ffs(closid_free_map);
 
@@ -121,13 +121,13 @@ static int closid_alloc(void)
 	return closid;
 }
 
-static void closid_free(int closid)
+void closid_free(int closid)
 {
 	closid_free_map |= 1 << closid;
 }
 
 /* set uid and gid of rdtgroup dirs and files to that of the creator */
-static int rdtgroup_kn_set_ugid(struct kernfs_node *kn)
+int rdtgroup_kn_set_ugid(struct kernfs_node *kn)
 {
 	struct iattr iattr = { .ia_valid = ATTR_UID | ATTR_GID,
 				.ia_uid = current_fsuid(),
@@ -855,7 +855,7 @@ static struct rftype res_common_files[] = {
 	},
 };
 
-static int rdtgroup_add_files(struct kernfs_node *kn, unsigned long fflags)
+int rdtgroup_add_files(struct kernfs_node *kn, unsigned long fflags)
 {
 	struct rftype *rfts, *rft;
 	int ret, len;
-- 
2.13.6

* [RFC PATCH V2 03/22] x86/intel_rdt: Introduce hooks to create pseudo-locking files
  2018-02-13 15:46 [RFC PATCH V2 00/22] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling Reinette Chatre
  2018-02-13 15:46 ` [RFC PATCH V2 01/22] x86/intel_rdt: Documentation for Cache Pseudo-Locking Reinette Chatre
  2018-02-13 15:46 ` [RFC PATCH V2 02/22] x86/intel_rdt: Make useful functions available internally Reinette Chatre
@ 2018-02-13 15:46 ` Reinette Chatre
  2018-02-13 15:46 ` [RFC PATCH V2 04/22] x86/intel_rdt: Introduce test to determine if closid is in use Reinette Chatre
                   ` (19 subsequent siblings)
  22 siblings, 0 replies; 65+ messages in thread
From: Reinette Chatre @ 2018-02-13 15:46 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck
  Cc: gavin.hindman, vikas.shivappa, dave.hansen, mingo, hpa, x86,
	linux-kernel, Reinette Chatre

We create a new file to host pseudo-locking specific code. The first
pieces of this code are the functions that create the initial pseudo_lock
directory with its first file, "avail", which starts out by reporting
zero. This will be expanded in future commits.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 arch/x86/kernel/cpu/Makefile                |   3 +-
 arch/x86/kernel/cpu/intel_rdt.h             |   2 +
 arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 105 ++++++++++++++++++++++++++++
 3 files changed, 109 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c

diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 570e8bb1f386..53022c2413e0 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -35,7 +35,8 @@ obj-$(CONFIG_CPU_SUP_CENTAUR)		+= centaur.o
 obj-$(CONFIG_CPU_SUP_TRANSMETA_32)	+= transmeta.o
 obj-$(CONFIG_CPU_SUP_UMC_32)		+= umc.o
 
-obj-$(CONFIG_INTEL_RDT)	+= intel_rdt.o intel_rdt_rdtgroup.o intel_rdt_monitor.o intel_rdt_ctrlmondata.o
+obj-$(CONFIG_INTEL_RDT)	+= intel_rdt.o intel_rdt_rdtgroup.o intel_rdt_monitor.o
+obj-$(CONFIG_INTEL_RDT)	+= intel_rdt_ctrlmondata.o intel_rdt_pseudo_lock.o
 
 obj-$(CONFIG_X86_MCE)			+= mcheck/
 obj-$(CONFIG_MTRR)			+= mtrr/
diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
index df4db23ddd74..146a8090bb58 100644
--- a/arch/x86/kernel/cpu/intel_rdt.h
+++ b/arch/x86/kernel/cpu/intel_rdt.h
@@ -454,5 +454,7 @@ void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms);
 void cqm_handle_limbo(struct work_struct *work);
 bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
 void __check_limbo(struct rdt_domain *d, bool force_free);
+int rdt_pseudo_lock_fs_init(struct kernfs_node *root);
+void rdt_pseudo_lock_fs_remove(void);
 
 #endif /* _ASM_X86_INTEL_RDT_H */
diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
new file mode 100644
index 000000000000..ad8b97747024
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -0,0 +1,105 @@
+/*
+ * Resource Director Technology(RDT)
+ *
+ * Pseudo-locking support built on top of Cache Allocation Technology (CAT)
+ *
+ * Copyright (C) 2017 Intel Corporation
+ *
+ * Author: Reinette Chatre <reinette.chatre@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
+
+#include <linux/kernfs.h>
+#include <linux/seq_file.h>
+#include <linux/stat.h>
+#include "intel_rdt.h"
+
+static struct kernfs_node *pseudo_lock_kn;
+
+static int pseudo_lock_avail_show(struct seq_file *sf, void *v)
+{
+	seq_puts(sf, "0\n");
+	return 0;
+}
+
+static struct kernfs_ops pseudo_lock_avail_ops = {
+	.seq_show		= pseudo_lock_avail_show,
+};
+
+/**
+ * rdt_pseudo_lock_fs_init - Create and initialize pseudo-locking files
+ * @root: location in kernfs where directory and files should be created
+ *
+ * The pseudo_lock directory and the pseudo-locking related files and
+ * directories will live within the structure created here.
+ *
+ * LOCKING:
+ * rdtgroup_mutex is expected to be held when called
+ *
+ * RETURNS:
+ * 0 on success, -errno on failure
+ */
+int rdt_pseudo_lock_fs_init(struct kernfs_node *root)
+{
+	struct kernfs_node *kn;
+	int ret;
+
+	lockdep_assert_held(&rdtgroup_mutex);
+
+	pseudo_lock_kn = kernfs_create_dir(root, "pseudo_lock",
+					   root->mode, NULL);
+	if (IS_ERR(pseudo_lock_kn))
+		return PTR_ERR(pseudo_lock_kn);
+
+	kn = __kernfs_create_file(pseudo_lock_kn, "avail", 0444,
+				  0, &pseudo_lock_avail_ops,
+				  NULL, NULL, NULL);
+	if (IS_ERR(kn)) {
+		ret = PTR_ERR(kn);
+		goto error;
+	}
+
+	ret = rdtgroup_kn_set_ugid(pseudo_lock_kn);
+	if (ret)
+		goto error;
+
+	kernfs_activate(pseudo_lock_kn);
+
+	ret = 0;
+	goto out;
+
+error:
+	kernfs_remove(pseudo_lock_kn);
+	pseudo_lock_kn = NULL;
+out:
+	return ret;
+}
+
+/**
+ * rdt_pseudo_lock_fs_remove - Remove all pseudo-locking files
+ *
+ * All pseudo-locking related files and directories are removed.
+ *
+ * LOCKING:
+ * rdtgroup_mutex is expected to be held when called
+ *
+ * RETURNS:
+ * none
+ */
+void rdt_pseudo_lock_fs_remove(void)
+{
+	lockdep_assert_held(&rdtgroup_mutex);
+
+	kernfs_remove(pseudo_lock_kn);
+	pseudo_lock_kn = NULL;
+}
-- 
2.13.6

* [RFC PATCH V2 04/22] x86/intel_rdt: Introduce test to determine if closid is in use
  2018-02-13 15:46 [RFC PATCH V2 00/22] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling Reinette Chatre
                   ` (2 preceding siblings ...)
  2018-02-13 15:46 ` [RFC PATCH V2 03/22] x86/intel_rdt: Introduce hooks to create pseudo-locking files Reinette Chatre
@ 2018-02-13 15:46 ` Reinette Chatre
  2018-02-13 15:46 ` [RFC PATCH V2 05/22] x86/intel_rdt: Print more accurate pseudo-locking availability Reinette Chatre
                   ` (18 subsequent siblings)
  22 siblings, 0 replies; 65+ messages in thread
From: Reinette Chatre @ 2018-02-13 15:46 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck
  Cc: gavin.hindman, vikas.shivappa, dave.hansen, mingo, hpa, x86,
	linux-kernel, Reinette Chatre

During CAT feature discovery the capacity bitmasks (CBMs) associated
with all the classes of service are initialized to all ones, even if the
class of service is not in use. Introduce a test that can be used to
determine if a class of service is in use. This test enables code
interested in parsing the CBMs to know whether their values are
meaningful or can be ignored.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 arch/x86/kernel/cpu/intel_rdt.h          | 1 +
 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 6 ++++++
 2 files changed, 7 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
index 146a8090bb58..8f5ded384e19 100644
--- a/arch/x86/kernel/cpu/intel_rdt.h
+++ b/arch/x86/kernel/cpu/intel_rdt.h
@@ -435,6 +435,7 @@ int rdtgroup_kn_set_ugid(struct kernfs_node *kn);
 struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
 int closid_alloc(void);
 void closid_free(int closid);
+bool closid_allocated(unsigned int closid);
 int update_domains(struct rdt_resource *r, int closid);
 int alloc_rmid(void);
 void free_rmid(u32 rmid);
diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index 2a14867a14f7..5698d66b6892 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -126,6 +126,12 @@ void closid_free(int closid)
 	closid_free_map |= 1 << closid;
 }
 
+/* closid_allocated - test if provided closid is in use */
+bool closid_allocated(unsigned int closid)
+{
+	return (closid_free_map & (1 << closid)) == 0;
+}
+
 /* set uid and gid of rdtgroup dirs and files to that of the creator */
 int rdtgroup_kn_set_ugid(struct kernfs_node *kn)
 {
-- 
2.13.6

* [RFC PATCH V2 05/22] x86/intel_rdt: Print more accurate pseudo-locking availability
  2018-02-13 15:46 [RFC PATCH V2 00/22] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling Reinette Chatre
                   ` (3 preceding siblings ...)
  2018-02-13 15:46 ` [RFC PATCH V2 04/22] x86/intel_rdt: Introduce test to determine if closid is in use Reinette Chatre
@ 2018-02-13 15:46 ` Reinette Chatre
  2018-02-13 15:46 ` [RFC PATCH V2 06/22] x86/intel_rdt: Create pseudo-locked regions Reinette Chatre
                   ` (17 subsequent siblings)
  22 siblings, 0 replies; 65+ messages in thread
From: Reinette Chatre @ 2018-02-13 15:46 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck
  Cc: gavin.hindman, vikas.shivappa, dave.hansen, mingo, hpa, x86,
	linux-kernel, Reinette Chatre

A region of cache is considered available for pseudo-locking when:
 * Cache area is in use by default COS.
 * Cache area is NOT in use by any other (other than default) COS.
 * Cache area is not shared with any other entity. Specifically, the
   cache area does not appear in "Bitmask of Shareable Resource with Other
   executing entities" found in EBX during CAT enumeration.
 * Cache area is not currently pseudo-locked.

At this time the first three tests are possible and we update the "avail"
file associated with pseudo-locking to print a more accurate reflection
of pseudo-locking availability.
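
For illustration, a worked example (with made-up CBM values) of the
computation this patch performs per cache domain, mirroring
pseudo_lock_avail_get() in the diff below:

	/*
	 * Start from the default CLOS's CBM, clear the bits of every
	 * other allocated CLOS, then clear the bits reported as
	 * shareable during CAT enumeration.
	 */
	u32 default_cbm = 0xff;	/* ctrl_val[0], the default CLOS */
	u32 clos1_cbm   = 0xf0;	/* the only other allocated CLOS */
	u32 shareable   = 0x01;	/* r->cache.shareable_bits */
	u32 avail       = default_cbm & ~clos1_cbm & ~shareable; /* 0x0e */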

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 62 ++++++++++++++++++++++++++++-
 1 file changed, 61 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index ad8b97747024..a787a103c432 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -26,9 +26,69 @@
 
 static struct kernfs_node *pseudo_lock_kn;
 
+/**
+ * pseudo_lock_avail_get - return bitmask of cache available for locking
+ * @r: resource to which this cache instance belongs
+ * @d: domain representing the cache instance
+ *
+ * Availability for pseudo-locking is determined as follows:
+ * * Cache area is in use by default COS.
+ * * Cache area is NOT in use by any other (other than default) COS.
+ * * Cache area is not shared with any other entity. Specifically, the
+ *   cache area does not appear in "Bitmask of Shareable Resource with Other
+ *   executing entities" found in EBX during CAT enumeration.
+ *
+ * Below is also required to determine availability and will be
+ * added in later:
+ * * Cache area is not currently pseudo-locked.
+ *
+ * LOCKING:
+ * rdtgroup_mutex is expected to be held when called
+ *
+ * RETURNS:
+ * Bitmask representing region of cache that can be locked, zero if nothing
+ * available.
+ */
+static u32 pseudo_lock_avail_get(struct rdt_resource *r, struct rdt_domain *d)
+{
+	u32 avail;
+	int i;
+
+	lockdep_assert_held(&rdtgroup_mutex);
+
+	avail = d->ctrl_val[0];
+	for (i = 1; i < r->num_closid; i++) {
+		if (closid_allocated(i))
+			avail &= ~d->ctrl_val[i];
+	}
+	avail &= ~r->cache.shareable_bits;
+
+	return avail;
+}
+
 static int pseudo_lock_avail_show(struct seq_file *sf, void *v)
 {
-	seq_puts(sf, "0\n");
+	struct rdt_resource *r;
+	struct rdt_domain *d;
+	bool sep;
+
+	mutex_lock(&rdtgroup_mutex);
+
+	for_each_alloc_enabled_rdt_resource(r) {
+		sep = false;
+		seq_printf(sf, "%s:", r->name);
+		list_for_each_entry(d, &r->domains, list) {
+			if (sep)
+				seq_puts(sf, ";");
+			seq_printf(sf, "%d=%x", d->id,
+				   pseudo_lock_avail_get(r, d));
+			sep = true;
+		}
+		seq_puts(sf, "\n");
+	}
+
+	mutex_unlock(&rdtgroup_mutex);
+
 	return 0;
 }
 
-- 
2.13.6

* [RFC PATCH V2 06/22] x86/intel_rdt: Create pseudo-locked regions
  2018-02-13 15:46 [RFC PATCH V2 00/22] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling Reinette Chatre
                   ` (4 preceding siblings ...)
  2018-02-13 15:46 ` [RFC PATCH V2 05/22] x86/intel_rdt: Print more accurate pseudo-locking availability Reinette Chatre
@ 2018-02-13 15:46 ` Reinette Chatre
  2018-02-19 20:57   ` Thomas Gleixner
  2018-02-13 15:46 ` [RFC PATCH V2 07/22] x86/intel_rdt: Connect pseudo-locking directory to operations Reinette Chatre
                   ` (16 subsequent siblings)
  22 siblings, 1 reply; 65+ messages in thread
From: Reinette Chatre @ 2018-02-13 15:46 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck
  Cc: gavin.hindman, vikas.shivappa, dave.hansen, mingo, hpa, x86,
	linux-kernel, Reinette Chatre

System administrator creates/removes pseudo-locked regions by
creating/removing directories in the pseudo-lock subdirectory of the
resctrl filesystem. Here we add directory creation and removal support.

A "pseudo-lock region" is introduced, which represents an
instance of a pseudo-locked cache region. During mkdir a new region is
created but since we do not know which cache it belongs to at that time
we maintain a global pointer to it from where it will be moved to the cache
(rdt_domain) it belongs to after initialization. This implies that
we only support one uninitialized pseudo-locked region at a time.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 arch/x86/kernel/cpu/intel_rdt.h             |   3 +
 arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 220 +++++++++++++++++++++++++++-
 2 files changed, 222 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
index 8f5ded384e19..55f085985072 100644
--- a/arch/x86/kernel/cpu/intel_rdt.h
+++ b/arch/x86/kernel/cpu/intel_rdt.h
@@ -352,6 +352,7 @@ extern struct mutex rdtgroup_mutex;
 extern struct rdt_resource rdt_resources_all[];
 extern struct rdtgroup rdtgroup_default;
 DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
+extern struct kernfs_node *pseudo_lock_kn;
 
 int __init rdtgroup_init(void);
 
@@ -457,5 +458,7 @@ bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
 void __check_limbo(struct rdt_domain *d, bool force_free);
 int rdt_pseudo_lock_fs_init(struct kernfs_node *root);
 void rdt_pseudo_lock_fs_remove(void);
+int rdt_pseudo_lock_mkdir(const char *name, umode_t mode);
+int rdt_pseudo_lock_rmdir(struct kernfs_node *kn);
 
 #endif /* _ASM_X86_INTEL_RDT_H */
diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index a787a103c432..7a22e367b82f 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -20,11 +20,142 @@
 #define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
 
 #include <linux/kernfs.h>
+#include <linux/kref.h>
 #include <linux/seq_file.h>
 #include <linux/stat.h>
+#include <linux/slab.h>
 #include "intel_rdt.h"
 
-static struct kernfs_node *pseudo_lock_kn;
+struct kernfs_node *pseudo_lock_kn;
+
+/*
+ * Protect the pseudo_lock_region access. Since we will link to
+ * pseudo_lock_region from rdt domains rdtgroup_mutex should be obtained
+ * first if needed.
+ */
+static DEFINE_MUTEX(rdt_pseudo_lock_mutex);
+
+/**
+ * struct pseudo_lock_region - pseudo-lock region information
+ * @kn:			kernfs node representing this region in the resctrl
+ *			filesystem
+ * @cbm:		bitmask of the pseudo-locked region
+ * @cpu:		core associated with the cache on which the setup code
+ *			will be run
+ * @minor:		minor number of character device associated with this
+ *			region
+ * @locked:		state indicating if this region has been locked or not
+ * @refcount:		how many are waiting to access this pseudo-lock
+ *			region via kernfs
+ * @deleted:		user requested removal of region via rmdir on kernfs
+ */
+struct pseudo_lock_region {
+	struct kernfs_node	*kn;
+	u32			cbm;
+	int			cpu;
+	unsigned int		minor;
+	bool			locked;
+	struct kref		refcount;
+	bool			deleted;
+};
+
+/*
+ * Only one uninitialized pseudo-locked region can exist at a time. An
+ * uninitialized pseudo-locked region is created when the user creates a
+ * new directory within the pseudo_lock subdirectory of the resctrl
+ * filesystem. The user will initialize the pseudo-locked region by writing
+ * to its schemata file at which point this structure will be moved to the
+ * cache domain it belongs to.
+ */
+static struct pseudo_lock_region *new_plr;
+
+static void __pseudo_lock_region_release(struct pseudo_lock_region *plr)
+{
+	bool is_new_plr = (plr == new_plr);
+
+	WARN_ON(!plr->deleted);
+	if (!plr->deleted)
+		return;
+
+	kfree(plr);
+	if (is_new_plr)
+		new_plr = NULL;
+}
+
+static void pseudo_lock_region_release(struct kref *ref)
+{
+	struct pseudo_lock_region *plr = container_of(ref,
+						      struct pseudo_lock_region,
+						      refcount);
+
+	mutex_lock(&rdt_pseudo_lock_mutex);
+	__pseudo_lock_region_release(plr);
+	mutex_unlock(&rdt_pseudo_lock_mutex);
+}
+
+/**
+ * pseudo_lock_region_kn_lock - Obtain lock to pseudo-lock region kernfs node
+ *
+ * This is called from the kernfs related functions which are called with
+ * an active reference to the kernfs_node that contains a valid pointer to
+ * the pseudo-lock region it represents. We can thus safely take an active
+ * reference to the pseudo-lock region before dropping the reference to the
+ * kernfs_node.
+ *
+ * We need to handle the scenarios where the kernfs directory representing
+ * this pseudo-lock region can be removed while an application still has an
+ * open handle to one of the directory's files and operations on this
+ * handle are attempted.
+ * To support this we allow a file operation to drop its reference to the
+ * kernfs_node so that the removal can proceed, while using the mutex to
+ * ensure these operations on the pseudo-lock region are serialized. At the
+ * time an operation does obtain access to the region it may thus have been
+ * deleted.
+ */
+static struct pseudo_lock_region *pseudo_lock_region_kn_lock(
+						struct kernfs_node *kn)
+{
+	struct pseudo_lock_region *plr = (kernfs_type(kn) == KERNFS_DIR) ?
+						kn->priv : kn->parent->priv;
+
+	WARN_ON(!plr);
+	if (!plr)
+		return NULL;
+
+	kref_get(&plr->refcount);
+	kernfs_break_active_protection(kn);
+
+	mutex_lock(&rdtgroup_mutex);
+	mutex_lock(&rdt_pseudo_lock_mutex);
+
+	if (plr->deleted)
+		return NULL;
+
+	return plr;
+}
+
+/**
+ * pseudo_lock_region_kn_unlock - Release lock to pseudo-lock region kernfs node
+ *
+ * The pseudo-lock region's kernfs_node did not have protection against
+ * removal while the lock was held. Here we do actual cleanup if the region
+ * was removed while the lock was held.
+ */
+static void pseudo_lock_region_kn_unlock(struct kernfs_node *kn)
+{
+	struct pseudo_lock_region *plr = (kernfs_type(kn) == KERNFS_DIR) ?
+						kn->priv : kn->parent->priv;
+
+	WARN_ON(!plr);
+	if (!plr)
+		return;
+
+	mutex_unlock(&rdt_pseudo_lock_mutex);
+	mutex_unlock(&rdtgroup_mutex);
+
+	kernfs_unbreak_active_protection(kn);
+	kref_put(&plr->refcount, pseudo_lock_region_release);
+}
 
 /**
  * pseudo_lock_avail_get - return bitmask of cache available for locking
@@ -96,6 +227,87 @@ static struct kernfs_ops pseudo_lock_avail_ops = {
 	.seq_show		= pseudo_lock_avail_show,
 };
 
+int rdt_pseudo_lock_mkdir(const char *name, umode_t mode)
+{
+	struct pseudo_lock_region *plr;
+	struct kernfs_node *kn;
+	int ret = 0;
+
+	mutex_lock(&rdtgroup_mutex);
+	mutex_lock(&rdt_pseudo_lock_mutex);
+
+	if (new_plr) {
+		ret = -ENOSPC;
+		goto out;
+	}
+
+	plr = kzalloc(sizeof(*plr), GFP_KERNEL);
+	if (!plr) {
+		ret = -ENOSPC;
+		goto out;
+	}
+
+	kn = kernfs_create_dir(pseudo_lock_kn, name, mode, plr);
+	if (IS_ERR(kn)) {
+		ret = PTR_ERR(kn);
+		goto out_free;
+	}
+
+	plr->kn = kn;
+	ret = rdtgroup_kn_set_ugid(kn);
+	if (ret)
+		goto out_remove;
+
+	kref_init(&plr->refcount);
+	kernfs_activate(kn);
+	new_plr = plr;
+	ret = 0;
+	goto out;
+
+out_remove:
+	kernfs_remove(kn);
+out_free:
+	kfree(plr);
+out:
+	mutex_unlock(&rdt_pseudo_lock_mutex);
+	mutex_unlock(&rdtgroup_mutex);
+	return ret;
+}
+
+/*
+ * rdt_pseudo_lock_rmdir - Remove pseudo-lock region
+ *
+ * LOCKING:
+ * Since the pseudo-locked region can be associated with a RDT domain at
+ * removal we take both rdtgroup_mutex and rdt_pseudo_lock_mutex to protect
+ * the rdt_domain access as well as the pseudo_lock_region access.
+ */
+int rdt_pseudo_lock_rmdir(struct kernfs_node *kn)
+{
+	struct kernfs_node *parent_kn = kn->parent;
+	struct pseudo_lock_region *plr;
+	int ret = 0;
+
+	plr = pseudo_lock_region_kn_lock(kn);
+	if (!plr) {
+		ret = -EPERM;
+		goto out;
+	}
+
+	if (parent_kn != pseudo_lock_kn) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	kernfs_remove(kn);
+	plr->deleted = true;
+	kref_put(&plr->refcount, pseudo_lock_region_release);
+
+out:
+	pseudo_lock_region_kn_unlock(kn);
+	return ret;
+}
+
 /**
  * rdt_pseudo_lock_fs_init - Create and initialize pseudo-locking files
  * @root: location in kernfs where directory and files should be created
@@ -159,7 +371,13 @@ int rdt_pseudo_lock_fs_init(struct kernfs_node *root)
 void rdt_pseudo_lock_fs_remove(void)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
+	mutex_lock(&rdt_pseudo_lock_mutex);
 
+	if (new_plr) {
+		new_plr->deleted = true;
+		__pseudo_lock_region_release(new_plr);
+	}
 	kernfs_remove(pseudo_lock_kn);
 	pseudo_lock_kn = NULL;
+	mutex_unlock(&rdt_pseudo_lock_mutex);
 }
-- 
2.13.6

* [RFC PATCH V2 07/22] x86/intel_rdt: Connect pseudo-locking directory to operations
  2018-02-13 15:46 [RFC PATCH V2 00/22] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling Reinette Chatre
                   ` (5 preceding siblings ...)
  2018-02-13 15:46 ` [RFC PATCH V2 06/22] x86/intel_rdt: Create pseudo-locked regions Reinette Chatre
@ 2018-02-13 15:46 ` Reinette Chatre
  2018-02-13 15:46 ` [RFC PATCH V2 08/22] x86/intel_rdt: Introduce pseudo-locking resctrl files Reinette Chatre
                   ` (15 subsequent siblings)
  22 siblings, 0 replies; 65+ messages in thread
From: Reinette Chatre @ 2018-02-13 15:46 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck
  Cc: gavin.hindman, vikas.shivappa, dave.hansen, mingo, hpa, x86,
	linux-kernel, Reinette Chatre

Since pseudo-locking depends on RDT/CAT we hook the initialization of the
pseudo-locking files into that of RDT/CAT. The mkdir/rmdir operations used
to create and remove pseudo-locked regions are now also hooked up.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index 5698d66b6892..24d2def37797 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -1244,13 +1244,19 @@ static struct dentry *rdt_mount(struct file_system_type *fs_type,
 		goto out_cdp;
 	}
 
+	ret = rdt_pseudo_lock_fs_init(rdtgroup_default.kn);
+	if (ret) {
+		dentry = ERR_PTR(ret);
+		goto out_info;
+	}
+
 	if (rdt_mon_capable) {
 		ret = mongroup_create_dir(rdtgroup_default.kn,
 					  NULL, "mon_groups",
 					  &kn_mongrp);
 		if (ret) {
 			dentry = ERR_PTR(ret);
-			goto out_info;
+			goto out_psl;
 		}
 		kernfs_get(kn_mongrp);
 
@@ -1291,6 +1297,8 @@ static struct dentry *rdt_mount(struct file_system_type *fs_type,
 out_mongrp:
 	if (rdt_mon_capable)
 		kernfs_remove(kn_mongrp);
+out_psl:
+	rdt_pseudo_lock_fs_remove();
 out_info:
 	kernfs_remove(kn_info);
 out_cdp:
@@ -1439,6 +1447,7 @@ static void rmdir_all_sub(void)
 	/* Notify online CPUs to update per cpu storage and PQR_ASSOC MSR */
 	update_closid_rmid(cpu_online_mask, &rdtgroup_default);
 
+	rdt_pseudo_lock_fs_remove();
 	kernfs_remove(kn_info);
 	kernfs_remove(kn_mongrp);
 	kernfs_remove(kn_mondata);
@@ -1861,6 +1870,9 @@ static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
 	if (strchr(name, '\n'))
 		return -EINVAL;
 
+	if (parent_kn == pseudo_lock_kn)
+		return rdt_pseudo_lock_mkdir(name, mode);
+
 	/*
 	 * If the parent directory is the root directory and RDT
 	 * allocation is supported, add a control and monitoring
@@ -1970,6 +1982,9 @@ static int rdtgroup_rmdir(struct kernfs_node *kn)
 	cpumask_var_t tmpmask;
 	int ret = 0;
 
+	if (parent_kn == pseudo_lock_kn)
+		return rdt_pseudo_lock_rmdir(kn);
+
 	if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
 		return -ENOMEM;
 
-- 
2.13.6

* [RFC PATCH V2 08/22] x86/intel_rdt: Introduce pseudo-locking resctrl files
  2018-02-13 15:46 [RFC PATCH V2 00/22] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling Reinette Chatre
                   ` (6 preceding siblings ...)
  2018-02-13 15:46 ` [RFC PATCH V2 07/22] x86/intel_rdt: Connect pseudo-locking directory to operations Reinette Chatre
@ 2018-02-13 15:46 ` Reinette Chatre
  2018-02-19 21:01   ` Thomas Gleixner
  2018-02-13 15:46 ` [RFC PATCH V2 09/22] x86/intel_rdt: Discover supported platforms via prefetch disable bits Reinette Chatre
                   ` (14 subsequent siblings)
  22 siblings, 1 reply; 65+ messages in thread
From: Reinette Chatre @ 2018-02-13 15:46 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck
  Cc: gavin.hindman, vikas.shivappa, dave.hansen, mingo, hpa, x86,
	linux-kernel, Reinette Chatre

Each sub-directory within the pseudo-lock directory represents a
pseudo-locked region. Each of these sub-directories now receives the
files that will be used by the user to specify requirements for the
particular region and for the kernel to communicate some details about
the region.

Only reading operations on these files are supported in this commit.
Since writing to these files will trigger the locking of a region, only
reading of data from unlocked regions is supported.

Two files are created:
schemata:
	Print the details of the portion of cache locked. If this has
	not yet been locked all resources will be listed as uninitialized.
size:
	Print the size in bytes of the memory region pseudo-locked to
	the cache. Value is not yet initialized.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 arch/x86/kernel/cpu/intel_rdt.h             |  5 +++
 arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 49 +++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c    | 14 +++++++++
 3 files changed, 68 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
index 55f085985072..060a0976ac00 100644
--- a/arch/x86/kernel/cpu/intel_rdt.h
+++ b/arch/x86/kernel/cpu/intel_rdt.h
@@ -134,6 +134,7 @@ struct rdtgroup {
 #define RFTYPE_CTRL			BIT(RF_CTRLSHIFT)
 #define RFTYPE_MON			BIT(RF_MONSHIFT)
 #define RFTYPE_TOP			BIT(RF_TOPSHIFT)
+#define RF_PSEUDO_LOCK			BIT(7)
 #define RFTYPE_RES_CACHE		BIT(8)
 #define RFTYPE_RES_MB			BIT(9)
 #define RF_CTRL_INFO			(RFTYPE_INFO | RFTYPE_CTRL)
@@ -460,5 +461,9 @@ int rdt_pseudo_lock_fs_init(struct kernfs_node *root);
 void rdt_pseudo_lock_fs_remove(void);
 int rdt_pseudo_lock_mkdir(const char *name, umode_t mode);
 int rdt_pseudo_lock_rmdir(struct kernfs_node *kn);
+int pseudo_lock_schemata_show(struct kernfs_open_file *of,
+			      struct seq_file *seq, void *v);
+int pseudo_lock_size_show(struct kernfs_open_file *of,
+			  struct seq_file *seq, void *v);
 
 #endif /* _ASM_X86_INTEL_RDT_H */
diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index 7a22e367b82f..94bd1b4fbfee 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -40,6 +40,7 @@ static DEFINE_MUTEX(rdt_pseudo_lock_mutex);
  * @kn:			kernfs node representing this region in the resctrl
  *			filesystem
  * @cbm:		bitmask of the pseudo-locked region
+ * @size:		size of pseudo-locked region in bytes
  * @cpu:		core associated with the cache on which the setup code
  *			will be run
  * @minor:		minor number of character device associated with this
@@ -52,6 +53,7 @@ static DEFINE_MUTEX(rdt_pseudo_lock_mutex);
 struct pseudo_lock_region {
 	struct kernfs_node	*kn;
 	u32			cbm;
+	unsigned int		size;
 	int			cpu;
 	unsigned int		minor;
 	bool			locked;
@@ -227,6 +229,49 @@ static struct kernfs_ops pseudo_lock_avail_ops = {
 	.seq_show		= pseudo_lock_avail_show,
 };
 
+int pseudo_lock_schemata_show(struct kernfs_open_file *of,
+			      struct seq_file *seq, void *v)
+{
+	struct pseudo_lock_region *plr;
+	struct rdt_resource *r;
+	int ret = 0;
+
+	plr = pseudo_lock_region_kn_lock(of->kn);
+	if (!plr) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	if (!plr->locked) {
+		for_each_alloc_enabled_rdt_resource(r) {
+			seq_printf(seq, "%s:uninitialized\n", r->name);
+		}
+	}
+
+out:
+	pseudo_lock_region_kn_unlock(of->kn);
+	return ret;
+}
+
+int pseudo_lock_size_show(struct kernfs_open_file *of,
+			  struct seq_file *seq, void *v)
+{
+	struct pseudo_lock_region *plr;
+	int ret = 0;
+
+	plr = pseudo_lock_region_kn_lock(of->kn);
+	if (!plr) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	seq_printf(seq, "%u\n", plr->size);
+
+out:
+	pseudo_lock_region_kn_unlock(of->kn);
+	return ret;
+}
+
 int rdt_pseudo_lock_mkdir(const char *name, umode_t mode)
 {
 	struct pseudo_lock_region *plr;
@@ -258,6 +303,10 @@ int rdt_pseudo_lock_mkdir(const char *name, umode_t mode)
 	if (ret)
 		goto out_remove;
 
+	ret = rdtgroup_add_files(kn, RF_PSEUDO_LOCK);
+	if (ret)
+		goto out_remove;
+
 	kref_init(&plr->refcount);
 	kernfs_activate(kn);
 	new_plr = plr;
diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index 24d2def37797..a7cbaf85ed54 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -859,6 +859,20 @@ static struct rftype res_common_files[] = {
 		.seq_show	= rdtgroup_schemata_show,
 		.fflags		= RF_CTRL_BASE,
 	},
+	{
+		.name		= "schemata",
+		.mode		= 0444,
+		.kf_ops		= &rdtgroup_kf_single_ops,
+		.seq_show	= pseudo_lock_schemata_show,
+		.fflags		= RF_PSEUDO_LOCK,
+	},
+	{
+		.name		= "size",
+		.mode		= 0444,
+		.kf_ops		= &rdtgroup_kf_single_ops,
+		.seq_show	= pseudo_lock_size_show,
+		.fflags		= RF_PSEUDO_LOCK,
+	},
 };
 
 int rdtgroup_add_files(struct kernfs_node *kn, unsigned long fflags)
-- 
2.13.6

* [RFC PATCH V2 09/22] x86/intel_rdt: Discover supported platforms via prefetch disable bits
  2018-02-13 15:46 [RFC PATCH V2 00/22] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling Reinette Chatre
                   ` (7 preceding siblings ...)
  2018-02-13 15:46 ` [RFC PATCH V2 08/22] x86/intel_rdt: Introduce pseudo-locking resctrl files Reinette Chatre
@ 2018-02-13 15:46 ` Reinette Chatre
  2018-02-13 15:46 ` [RFC PATCH V2 10/22] x86/intel_rdt: Disable pseudo-locking if CDP enabled Reinette Chatre
                   ` (13 subsequent siblings)
  22 siblings, 0 replies; 65+ messages in thread
From: Reinette Chatre @ 2018-02-13 15:46 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck
  Cc: gavin.hindman, vikas.shivappa, dave.hansen, mingo, hpa, x86,
	linux-kernel, Reinette Chatre

Knowing the model-specific prefetch disable bits is required to support
cache pseudo-locking because the hardware prefetchers need to be disabled
while kernel memory is pseudo-locked into the cache. We add these bits only
for platforms known to support cache pseudo-locking.

If we have not validated pseudo-locking on a platform that does support
RDT/CAT this should not be seen as a failure of CAT; the pseudo-locking
interface will simply not be set up.
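
For illustration only, a rough sketch of how the returned bits are meant
to be used around the code that loads memory into the locked region. The
helper names below are hypothetical and not part of this patch; the real
usage follows later in the series:

  /* Sketch: bits == 0 means the platform is not validated, do nothing. */
  static void prefetchers_disable_sketch(u64 bits)
  {
          if (bits)
                  wrmsrl(MSR_MISC_FEATURE_CONTROL, bits); /* disable */
  }

  static void prefetchers_restore_sketch(u64 bits)
  {
          if (bits)
                  wrmsrl(MSR_MISC_FEATURE_CONTROL, 0x0);  /* re-enable */
  }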

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 80 +++++++++++++++++++++++++++++
 1 file changed, 80 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index 94bd1b4fbfee..a0c144b5b09b 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -24,8 +24,22 @@
 #include <linux/seq_file.h>
 #include <linux/stat.h>
 #include <linux/slab.h>
+#include <asm/intel-family.h>
 #include "intel_rdt.h"
 
+/*
+ * The MSR_MISC_FEATURE_CONTROL register enables the modification of hardware
+ * prefetcher state. Details about this register can be found in the
+ * platform-specific MSR tables in Intel's SDM.
+ */
+#define MSR_MISC_FEATURE_CONTROL	0x000001a4
+
+/*
+ * The bits needed to disable hardware prefetching vary based on the
+ * platform. During initialization we will discover which bits to use.
+ */
+static u64 prefetch_disable_bits;
+
 struct kernfs_node *pseudo_lock_kn;
 
 /*
@@ -358,6 +372,57 @@ int rdt_pseudo_lock_rmdir(struct kernfs_node *kn)
 }
 
 /**
+ * get_prefetch_disable_bits - prefetch disable bits of supported platforms
+ *
+ * Here we capture the list of platforms that have been validated to support
+ * pseudo-locking. This includes testing to ensure pseudo-locked regions
+ * with low cache miss rates can be created under a variety of load conditions
+ * as well as that these pseudo-locked regions can maintain their low cache
+ * miss rates under a variety of load conditions for significant lengths of time.
+ *
+ * After a platform has been validated to support pseudo-locking its
+ * hardware prefetch disable bits are included here as they are documented
+ * in the SDM.
+ *
+ * RETURNS
+ * If the platform is supported, the bits to disable hardware prefetchers;
+ * 0 if the platform is not supported.
+ */
+static u64 get_prefetch_disable_bits(void)
+{
+	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
+	    boot_cpu_data.x86 != 6)
+		return 0;
+
+	switch (boot_cpu_data.x86_model) {
+	case INTEL_FAM6_BROADWELL_X:
+		/*
+		 * SDM defines bits of MSR_MISC_FEATURE_CONTROL register
+		 * as:
+		 * 0    L2 Hardware Prefetcher Disable (R/W)
+		 * 1    L2 Adjacent Cache Line Prefetcher Disable (R/W)
+		 * 2    DCU Hardware Prefetcher Disable (R/W)
+		 * 3    DCU IP Prefetcher Disable (R/W)
+		 * 63:4 Reserved
+		 */
+		return 0xF;
+	case INTEL_FAM6_ATOM_GOLDMONT:
+	case INTEL_FAM6_ATOM_GEMINI_LAKE:
+		/*
+		 * SDM defines bits of MSR_MISC_FEATURE_CONTROL register
+		 * as:
+		 * 0     L2 Hardware Prefetcher Disable (R/W)
+		 * 1     Reserved
+		 * 2     DCU Hardware Prefetcher Disable (R/W)
+		 * 63:3  Reserved
+		 */
+		return 0x5;
+	}
+
+	return 0;
+}
+
+/**
  * rdt_pseudo_lock_fs_init - Create and initialize pseudo-locking files
  * @root: location in kernfs where directory and files should be created
  *
@@ -377,6 +442,17 @@ int rdt_pseudo_lock_fs_init(struct kernfs_node *root)
 
 	lockdep_assert_held(&rdtgroup_mutex);
 
+	/*
+	 * Not knowing the bits to disable prefetching is not a failure
+	 * that should be propagated since we only return prefetch disable
+	 * bits for platforms on which pseudo-locking has been tested. If
+	 * pseudo-locking has not been tested to work on this platform the
+	 * other RDT features should continue to be available.
+	 */
+	prefetch_disable_bits = get_prefetch_disable_bits();
+	if (prefetch_disable_bits == 0)
+		return 0;
+
 	pseudo_lock_kn = kernfs_create_dir(root, "pseudo_lock",
 					   root->mode, NULL);
 	if (IS_ERR(pseudo_lock_kn))
@@ -420,6 +496,10 @@ int rdt_pseudo_lock_fs_init(struct kernfs_node *root)
 void rdt_pseudo_lock_fs_remove(void)
 {
 	lockdep_assert_held(&rdtgroup_mutex);
+
+	if (!pseudo_lock_kn)
+		return;
+
 	mutex_lock(&rdt_pseudo_lock_mutex);
 
 	if (new_plr) {
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [RFC PATCH V2 10/22] x86/intel_rdt: Disable pseudo-locking if CDP enabled
  2018-02-13 15:46 [RFC PATCH V2 00/22] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling Reinette Chatre
                   ` (8 preceding siblings ...)
  2018-02-13 15:46 ` [RFC PATCH V2 09/22] x86/intel_rdt: Discover supported platforms via prefetch disable bits Reinette Chatre
@ 2018-02-13 15:46 ` Reinette Chatre
  2018-02-13 15:46 ` [RFC PATCH V2 11/22] x86/intel_rdt: Associate pseudo-locked regions with its domain Reinette Chatre
                   ` (12 subsequent siblings)
  22 siblings, 0 replies; 65+ messages in thread
From: Reinette Chatre @ 2018-02-13 15:46 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck
  Cc: gavin.hindman, vikas.shivappa, dave.hansen, mingo, hpa, x86,
	linux-kernel, Reinette Chatre

Pseudo-locking can work when Code and Data Prioritization (CDP) is enabled,
but there are a few additional checks and actions involved. At this time
it is not clear if users would want to use pseudo-locking and CDP at the
same time, so this support is deferred until we understand the usage
better.

Disable pseudo-locking if CDP is enabled. Add the details of things to
keep in mind for anybody considering enabling this support.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 30 +++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index a0c144b5b09b..f6932a7de6e7 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -443,6 +443,36 @@ int rdt_pseudo_lock_fs_init(struct kernfs_node *root)
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	/*
+	 * Pseudo-locking not supported when CDP is enabled.
+	 *
+	 * Some things to consider if you would like to enable this support
+	 * (using L3 CDP as example):
+	 * - When CDP is enabled two separate resources are exposed, L3DATA
+	 *   and L3CODE, but they are actually on the same cache. The
+	 *   implication for pseudo-locking is that if a pseudo-locked
+	 *   region is created on a domain of one resource (e.g. L3CODE),
+	 *   then a pseudo-locked region cannot be created on that same
+	 *   domain of the other resource (e.g. L3DATA). This is because
+	 *   the creation of a pseudo-locked region involves a call to
+	 *   wbinvd that will affect all cache allocations on that
+	 *   particular domain.
+	 * - Considering the above, it may be possible to expose only
+	 *   one of the CDP resources to pseudo-locking and hide the other.
+	 *   For example, we could consider exposing only L3DATA and, since
+	 *   the L3 cache is unified, it would still be possible to place
+	 *   instructions there and execute them.
+	 * - If only one resource is exposed to pseudo-locking we should still
+	 *   keep in mind that availability of a portion of cache for
+	 *   pseudo-locking should take into account both resources. Similarly,
+	 *   if a pseudo-locked region is created in one resource, the portion
+	 *   of cache used by it should be made unavailable to all future
+	 *   allocations from both resources.
+	 */
+	if (rdt_resources_all[RDT_RESOURCE_L3DATA].alloc_enabled ||
+	    rdt_resources_all[RDT_RESOURCE_L2DATA].alloc_enabled)
+		return 0;
+
+	/*
 	 * Not knowing the bits to disable prefetching is not a failure
 	 * that should be propagated since we only return prefetching bits
 	 * for those platforms pseudo-locking has been tested on. If
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [RFC PATCH V2 11/22] x86/intel_rdt: Associate pseudo-locked regions with its domain
  2018-02-13 15:46 [RFC PATCH V2 00/22] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling Reinette Chatre
                   ` (9 preceding siblings ...)
  2018-02-13 15:46 ` [RFC PATCH V2 10/22] x86/intel_rdt: Disable pseudo-locking if CDP enabled Reinette Chatre
@ 2018-02-13 15:46 ` Reinette Chatre
  2018-02-19 21:19   ` Thomas Gleixner
  2018-02-13 15:46 ` [RFC PATCH V2 12/22] x86/intel_rdt: Support CBM checking from value and character buffer Reinette Chatre
                   ` (11 subsequent siblings)
  22 siblings, 1 reply; 65+ messages in thread
From: Reinette Chatre @ 2018-02-13 15:46 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck
  Cc: gavin.hindman, vikas.shivappa, dave.hansen, mingo, hpa, x86,
	linux-kernel, Reinette Chatre

After a pseudo-locked region is locked it needs to be associated with
the RDT domain representing the pseudo-locked cache so that its life
cycle can be managed correctly.

Only a single pseudo-locked region can exist on any cache instance so we
maintain a single pointer to a pseudo-locked region from each RDT
domain.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 arch/x86/kernel/cpu/intel_rdt.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
index 060a0976ac00..f0e020686e99 100644
--- a/arch/x86/kernel/cpu/intel_rdt.h
+++ b/arch/x86/kernel/cpu/intel_rdt.h
@@ -187,6 +187,8 @@ struct mbm_state {
 	u64	prev_msr;
 };
 
+struct pseudo_lock_region;
+
 /**
  * struct rdt_domain - group of cpus sharing an RDT resource
  * @list:	all instances of this resource
@@ -205,6 +207,7 @@ struct mbm_state {
  * @ctrl_val:	array of cache or mem ctrl values (indexed by CLOSID)
  * @new_ctrl:	new ctrl value to be loaded
  * @have_new_ctrl: did user provide new_ctrl for this domain
+ * @plr:	pseudo-locked region associated with this domain
  */
 struct rdt_domain {
 	struct list_head	list;
@@ -220,6 +223,7 @@ struct rdt_domain {
 	u32			*ctrl_val;
 	u32			new_ctrl;
 	bool			have_new_ctrl;
+	struct pseudo_lock_region	*plr;
 };
 
 /**
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [RFC PATCH V2 12/22] x86/intel_rdt: Support CBM checking from value and character buffer
  2018-02-13 15:46 [RFC PATCH V2 00/22] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling Reinette Chatre
                   ` (10 preceding siblings ...)
  2018-02-13 15:46 ` [RFC PATCH V2 11/22] x86/intel_rdt: Associate pseudo-locked regions with its domain Reinette Chatre
@ 2018-02-13 15:46 ` Reinette Chatre
  2018-02-13 15:46 ` [RFC PATCH V2 13/22] x86/intel_rdt: Support schemata write - pseudo-locking core Reinette Chatre
                   ` (10 subsequent siblings)
  22 siblings, 0 replies; 65+ messages in thread
From: Reinette Chatre @ 2018-02-13 15:46 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck
  Cc: gavin.hindman, vikas.shivappa, dave.hansen, mingo, hpa, x86,
	linux-kernel, Reinette Chatre

Validity checking of the capacity bitmask (CBM) is currently only done on
the character buffer when the user writes a new schemata to a resctrl file.

In preparation for supporting CBM checking within other areas of the RDT
code the validity check is split up so that it can be done on a CBM
provided as a character buffer as well as on the value itself.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 arch/x86/kernel/cpu/intel_rdt.h             |  1 +
 arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c | 34 ++++++++++++++++++++---------
 2 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
index f0e020686e99..2c4e13252057 100644
--- a/arch/x86/kernel/cpu/intel_rdt.h
+++ b/arch/x86/kernel/cpu/intel_rdt.h
@@ -442,6 +442,7 @@ struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
 int closid_alloc(void);
 void closid_free(int closid);
 bool closid_allocated(unsigned int closid);
+bool cbm_validate_val(unsigned long val, struct rdt_resource *r);
 int update_domains(struct rdt_resource *r, int closid);
 int alloc_rmid(void);
 void free_rmid(u32 rmid);
diff --git a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c
index 9e1a455e4d9b..4a11aea3ad2c 100644
--- a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c
+++ b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c
@@ -86,17 +86,10 @@ int parse_bw(char *buf, struct rdt_resource *r, struct rdt_domain *d)
  *	are allowed (e.g. FFFFH, 0FF0H, 003CH, etc.).
  * Additionally Haswell requires at least two bits set.
  */
-static bool cbm_validate(char *buf, unsigned long *data, struct rdt_resource *r)
+bool cbm_validate_val(unsigned long val, struct rdt_resource *r)
 {
-	unsigned long first_bit, zero_bit, val;
+	unsigned long first_bit, zero_bit;
 	unsigned int cbm_len = r->cache.cbm_len;
-	int ret;
-
-	ret = kstrtoul(buf, 16, &val);
-	if (ret) {
-		rdt_last_cmd_printf("non-hex character in mask %s\n", buf);
-		return false;
-	}
 
 	if (val == 0 || val > r->default_ctrl) {
 		rdt_last_cmd_puts("mask out of range\n");
@@ -117,11 +110,32 @@ static bool cbm_validate(char *buf, unsigned long *data, struct rdt_resource *r)
 		return false;
 	}
 
-	*data = val;
 	return true;
 }
 
 /*
+ * Validate the CBM provided in a character buffer. If the CBM is valid,
+ * return true and store its numeric representation in the location
+ * pointed to by @data. If the CBM is invalid, return false.
+ */
+static bool cbm_validate(char *buf, unsigned long *data, struct rdt_resource *r)
+{
+	unsigned long val;
+	bool ret;
+
+	if (kstrtoul(buf, 16, &val)) {
+		rdt_last_cmd_printf("non-hex character in mask %s\n", buf);
+		return false;
+	}
+
+	ret = cbm_validate_val(val, r);
+	if (ret)
+		*data = val;
+
+	return ret;
+}
+
+/*
  * Read one cache bit mask (hex). Check that it is valid for the current
  * resource type.
  */
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [RFC PATCH V2 13/22] x86/intel_rdt: Support schemata write - pseudo-locking core
  2018-02-13 15:46 [RFC PATCH V2 00/22] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling Reinette Chatre
                   ` (11 preceding siblings ...)
  2018-02-13 15:46 ` [RFC PATCH V2 12/22] x86/intel_rdt: Support CBM checking from value and character buffer Reinette Chatre
@ 2018-02-13 15:46 ` Reinette Chatre
  2018-02-20 17:15   ` Thomas Gleixner
  2018-02-13 15:46 ` [RFC PATCH V2 14/22] x86/intel_rdt: Enable testing for pseudo-locked region Reinette Chatre
                   ` (9 subsequent siblings)
  22 siblings, 1 reply; 65+ messages in thread
From: Reinette Chatre @ 2018-02-13 15:46 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck
  Cc: gavin.hindman, vikas.shivappa, dave.hansen, mingo, hpa, x86,
	linux-kernel, Reinette Chatre

When a user writes the requested pseudo-locking schemata it triggers
the pseudo-locking of an equally sized block of kernel memory. A successful
return from this schemata write means that the pseudo-locking succeeded.
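
As an illustration, such a request could be made from user space with the
minimal sketch below. The mount point and the region directory name
("newlock") are assumptions for the example only; the RESOURCE:ID=BITMASK
format is documented with the schemata write handler in the patch below.

  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
          /* Assumes resctrl is mounted at /sys/fs/resctrl and a region
           * directory "newlock" was created under pseudo_lock beforehand.
           */
          const char *path = "/sys/fs/resctrl/pseudo_lock/newlock/schemata";
          const char *request = "L2:0=0x3\n";     /* RESOURCE:ID=BITMASK */
          int fd = open(path, O_WRONLY);

          if (fd < 0) {
                  perror("open");
                  return 1;
          }
          if (write(fd, request, strlen(request)) < 0)
                  perror("pseudo-lock request");
          close(fd);
          return 0;
  }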

To support the pseudo-locking we first initialize as much as we can
about the region that will be pseudo-locked. This includes how much
memory the requested bitmask represents, which CPU the requested
region is associated with, and the cache line size of that cache
(so that we know the stride to use for locking). At this point a
contiguous block of memory matching the requested bitmask is allocated.
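
To make that translation concrete, a small stand-alone sketch with made-up
numbers, mirroring what init_from_cache_details() in this patch computes:

  #include <stdio.h>

  int main(void)
  {
          unsigned int cache_size = 1024 * 1024;  /* e.g. a 1MB L2 cache */
          unsigned int cbm_len = 16;              /* bits in a full CBM */
          unsigned int cbm = 0x3;                 /* two bits requested */

          /* 1MB / 16 bits = 64KB per bit, so CBM 0x3 maps to 128KB */
          unsigned int size = cache_size / cbm_len * __builtin_popcount(cbm);

          printf("pseudo-lock region size: %u bytes\n", size);
          return 0;
  }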

After initialization the pseudo-locking is performed. A temporary CAT
allocation is made to reflect the requested bitmask and, with this new
class of service active and interference minimized, the allocated memory
is loaded into the cache. This completes the pseudo-locking of kernel
memory.

As part of the pseudo-locking the pseudo-locked region is moved to
the RDT domain to which it belongs. We thus also need to ensure that
cleanups happen in this area when there is a directory removal or
unmount request.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 arch/x86/kernel/cpu/intel_rdt.h             |   2 +
 arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 573 +++++++++++++++++++++++++++-
 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c    |   3 +-
 3 files changed, 571 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
index 2c4e13252057..85f9ad6de113 100644
--- a/arch/x86/kernel/cpu/intel_rdt.h
+++ b/arch/x86/kernel/cpu/intel_rdt.h
@@ -468,6 +468,8 @@ int rdt_pseudo_lock_mkdir(const char *name, umode_t mode);
 int rdt_pseudo_lock_rmdir(struct kernfs_node *kn);
 int pseudo_lock_schemata_show(struct kernfs_open_file *of,
 			      struct seq_file *seq, void *v);
+ssize_t pseudo_lock_schemata_write(struct kernfs_open_file *of,
+				   char *buf, size_t nbytes, loff_t off);
 int pseudo_lock_size_show(struct kernfs_open_file *of,
 			  struct seq_file *seq, void *v);
 
diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index f6932a7de6e7..1f351b7170ef 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -19,12 +19,18 @@
 
 #define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
 
+#include <linux/cacheinfo.h>
+#include <linux/cpu.h>
+#include <linux/cpumask.h>
 #include <linux/kernfs.h>
 #include <linux/kref.h>
+#include <linux/kthread.h>
 #include <linux/seq_file.h>
 #include <linux/stat.h>
 #include <linux/slab.h>
+#include <asm/cacheflush.h>
 #include <asm/intel-family.h>
+#include <asm/intel_rdt_sched.h>
 #include "intel_rdt.h"
 
 /*
@@ -43,6 +49,20 @@ static u64 prefetch_disable_bits;
 struct kernfs_node *pseudo_lock_kn;
 
 /*
+ * Only one pseudo-locked region can be set up at a time and that is
+ * enforced by taking the rdt_pseudo_lock_mutex when the user writes the
+ * requested schemata to the resctrl file and releasing the mutex on
+ * completion. The thread locking the kernel memory into the cache starts
+ * and completes during this time so we can be sure that only one thread
+ * can run at any time.
+ * The function starting the pseudo-locking thread needs to wait for its
+ * completion and since there can only be one thread we have a global
+ * waitqueue and status variable to support this.
+ */
+static DECLARE_WAIT_QUEUE_HEAD(wq);
+static int thread_done;
+
+/*
  * Protect the pseudo_lock_region access. Since we will link to
  * pseudo_lock_region from rdt domains rdtgroup_mutex should be obtained
  * first if needed.
@@ -53,26 +73,39 @@ static DEFINE_MUTEX(rdt_pseudo_lock_mutex);
  * struct pseudo_lock_region - pseudo-lock region information
  * @kn:			kernfs node representing this region in the resctrl
  *			filesystem
+ * @r:			points back to the rdt_resource to which this
+ *			pseudo-locked region belongs
+ * @d:			points back to the rdt_domain to which this
+ *			pseudo-locked region belongs
  * @cbm:		bitmask of the pseudo-locked region
  * @size:		size of pseudo-locked region in bytes
+ * @line_size:		size of the cache lines
  * @cpu:		core associated with the cache on which the setup code
  *			will be run
+ * @closid:		CAT class of service that will be used temporarily
+ *			to initialize this pseudo-locked region
  * @minor:		minor number of character device associated with this
  *			region
  * @locked:		state indicating if this region has been locked or not
  * @refcount:		how many are waiting to access this pseudo-lock
  *			region via kernfs
  * @deleted:		user requested removal of region via rmdir on kernfs
+ * @kmem:		the kernel memory associated with pseudo-locked region
  */
 struct pseudo_lock_region {
 	struct kernfs_node	*kn;
+	struct rdt_resource	*r;
+	struct rdt_domain	*d;
 	u32			cbm;
 	unsigned int		size;
+	unsigned int		line_size;
 	int			cpu;
+	int			closid;
 	unsigned int		minor;
 	bool			locked;
 	struct kref		refcount;
 	bool			deleted;
+	void			*kmem;
 };
 
 /*
@@ -85,6 +118,55 @@ struct pseudo_lock_region {
  */
 static struct pseudo_lock_region *new_plr;
 
+/*
+ * Helper to write 64bit value to MSR without tracing. Used when
+ * use of the cache should be restricted and use of registers used
+ * for local variables should be avoided.
+ */
+static inline void pseudo_wrmsrl_notrace(unsigned int msr, u64 val)
+{
+	__wrmsr(msr, (u32)(val & 0xffffffffULL), (u32)(val >> 32));
+}
+
+/**
+ * pseudo_lock_clos_set - Program requested class of service
+ * @plr:    pseudo-locked region identifying cache that will have its
+ *          class of service modified
+ * @closid: class of service that should be modified
+ * @bm:     new bitmask for @closid
+ */
+static int pseudo_lock_clos_set(struct pseudo_lock_region *plr,
+				int closid, u32 bm)
+{
+	struct rdt_resource *r;
+	struct rdt_domain *d;
+	int ret;
+
+	for_each_alloc_enabled_rdt_resource(r) {
+		list_for_each_entry(d, &r->domains, list)
+			d->have_new_ctrl = false;
+	}
+
+	r = plr->r;
+	d = plr->d;
+	d->new_ctrl = bm;
+	d->have_new_ctrl = true;
+
+	ret = update_domains(r, closid);
+
+	return ret;
+}
+
+static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
+{
+	plr->size = 0;
+	plr->line_size = 0;
+	kfree(plr->kmem);
+	plr->kmem = NULL;
+	plr->r = NULL;
+	plr->d = NULL;
+}
+
 static void __pseudo_lock_region_release(struct pseudo_lock_region *plr)
 {
 	bool is_new_plr = (plr == new_plr);
@@ -93,6 +175,23 @@ static void __pseudo_lock_region_release(struct pseudo_lock_region *plr)
 	if (!plr->deleted)
 		return;
 
+	if (plr->locked) {
+		plr->d->plr = NULL;
+		/*
+		 * Resource groups come and go. Simply returning this
+		 * pseudo-locked region's bits to the default CLOS may
+		 * result in the default CLOS becoming fragmented, causing
+		 * the setting of its bitmask to fail. Ensure it is valid
+		 * first. If this check does fail we cannot return the bits
+		 * to the default CLOS and userspace intervention would be
+		 * required to ensure portions of the cache do not go
+		 * unused.
+		 */
+		if (cbm_validate_val(plr->d->ctrl_val[0] | plr->cbm, plr->r))
+			pseudo_lock_clos_set(plr, 0,
+					     plr->d->ctrl_val[0] | plr->cbm);
+		pseudo_lock_region_clear(plr);
+	}
 	kfree(plr);
 	if (is_new_plr)
 		new_plr = NULL;
@@ -178,17 +277,17 @@ static void pseudo_lock_region_kn_unlock(struct kernfs_node *kn)
  * @r: resource to which this cache instance belongs
  * @d: domain representing the cache instance
  *
- * Availability for pseudo-locking is determined as follows:
+ * Pseudo-locked regions are set up with wbinvd, limiting us to one region
+ * per cache instance.
+ *
+ * If no other pseudo-locked region is present on this cache instance
+ * the availability for pseudo-locking is determined as follows:
  * * Cache area is in use by default COS.
  * * Cache area is NOT in use by any other (other than default) COS.
  * * Cache area is not shared with any other entity. Specifically, the
  *   cache area does not appear in "Bitmask of Shareable Resource with Other
  *   executing entities" found in EBX during CAT enumeration.
  *
- * Below is also required to determine availability and will be
- * added in later:
- * * Cache area is not currently pseudo-locked.
- *
  * LOCKING:
  * rdtgroup_mutex is expected to be held when called
  *
@@ -203,6 +302,13 @@ static u32 pseudo_lock_avail_get(struct rdt_resource *r, struct rdt_domain *d)
 
 	lockdep_assert_held(&rdtgroup_mutex);
 
+	/*
+	 * Nothing available if a pseudo-locked region already associated
+	 * with this cache instance.
+	 */
+	if (d->plr)
+		return 0;
+
 	avail = d->ctrl_val[0];
 	for (i = 1; i < r->num_closid; i++) {
 		if (closid_allocated(i))
@@ -213,6 +319,34 @@ static u32 pseudo_lock_avail_get(struct rdt_resource *r, struct rdt_domain *d)
 	return avail;
 }
 
+/**
+ * pseudo_lock_space_avail - returns if any space available for pseudo-locking
+ *
+ * Checks all cache instances on system for any regions available for
+ * pseudo-locking.
+ *
+ * LOCKING:
+ * rdtgroup_mutex is expected to be held when called
+ *
+ * RETURNS:
+ * true if any cache instance has space available for pseudo-locking, false
+ * otherwise
+ */
+static bool pseudo_lock_space_avail(void)
+{
+	struct rdt_resource *r;
+	struct rdt_domain *d;
+
+	lockdep_assert_held(&rdtgroup_mutex);
+	for_each_alloc_enabled_rdt_resource(r) {
+		list_for_each_entry(d, &r->domains, list) {
+			if (pseudo_lock_avail_get(r, d) > 0)
+				return true;
+		}
+	}
+	return false;
+}
+
 static int pseudo_lock_avail_show(struct seq_file *sf, void *v)
 {
 	struct rdt_resource *r;
@@ -260,6 +394,9 @@ int pseudo_lock_schemata_show(struct kernfs_open_file *of,
 		for_each_alloc_enabled_rdt_resource(r) {
 			seq_printf(seq, "%s:uninitialized\n", r->name);
 		}
+	} else {
+		seq_printf(seq, "%s:%d=%x\n", plr->r->name,
+			   plr->d->id, plr->cbm);
 	}
 
 out:
@@ -267,6 +404,418 @@ int pseudo_lock_schemata_show(struct kernfs_open_file *of,
 	return ret;
 }
 
+/**
+ * init_from_cache_details - Initialize pseudo-lock region info from cache data
+ *
+ * When a user requests a cache region to be locked the request is provided
+ * as a bitmask. We need to allocate memory of matching size so here we
+ * translate the requested bitmask into how many bytes it represents. This
+ * is done by dividing the total cache size by the CBM length to first
+ * determine how many bytes each bit in the bitmask represents, then
+ * multiplying that by the number of bits set in the requested bitmask.
+ *
+ * Also set the cache line size to know the stride with which data needs to
+ * be accessed to be pseudo-locked.
+ */
+static int init_from_cache_details(struct pseudo_lock_region *plr,
+				   struct rdt_resource *r)
+{
+	struct cpu_cacheinfo *ci = get_cpu_cacheinfo(plr->cpu);
+	unsigned int cbm_len = r->cache.cbm_len;
+	int num_b;
+	int i;
+
+	num_b = bitmap_weight((unsigned long *)&plr->cbm, cbm_len);
+
+	for (i = 0; i < ci->num_leaves; i++) {
+		if (ci->info_list[i].level == r->cache_level) {
+			plr->size = ci->info_list[i].size / cbm_len * num_b;
+			plr->line_size = ci->info_list[i].coherency_line_size;
+			return 0;
+		}
+	}
+
+	return -1;
+}
+
+static int pseudo_lock_region_init(struct pseudo_lock_region *plr,
+				   struct rdt_resource *r,
+				   struct rdt_domain *d)
+{
+	unsigned long b_req = plr->cbm;
+	unsigned long b_avail;
+	int ret;
+
+	b_avail = pseudo_lock_avail_get(r, d);
+
+	if (!bitmap_subset(&b_req, &b_avail, r->cache.cbm_len)) {
+		rdt_last_cmd_puts("requested bitmask not available\n");
+		return -ENOSPC;
+	}
+
+	/*
+	 * Use the first cpu we find that is associated with the
+	 * cache selected.
+	 */
+	plr->cpu = cpumask_first(&d->cpu_mask);
+
+	if (!cpu_online(plr->cpu)) {
+		rdt_last_cmd_printf("cpu %u associated with cache not online\n",
+				    plr->cpu);
+		return -ENODEV;
+	}
+
+	ret = init_from_cache_details(plr, r);
+	if (ret < 0) {
+		rdt_last_cmd_puts("unable to lookup cache details\n");
+		return -ENOSPC;
+	}
+
+	/*
+	 * We do not yet support contiguous regions larger than
+	 * KMALLOC_MAX_SIZE
+	 */
+	if (plr->size > KMALLOC_MAX_SIZE) {
+		rdt_last_cmd_puts("requested region exceeds maximum size\n");
+		return -E2BIG;
+	}
+
+	plr->kmem = kzalloc(plr->size, GFP_KERNEL);
+	if (!plr->kmem) {
+		rdt_last_cmd_puts("unable to allocate memory\n");
+		return -ENOMEM;
+	}
+
+	plr->r = r;
+	plr->d = d;
+
+	return 0;
+}
+
+/**
+ * pseudo_lock_fn - Load kernel memory into cache
+ *
+ * This is the core pseudo-locking function.
+ *
+ * First we ensure that the kernel memory cannot be found in the cache.
+ * Then, while taking care that there will be as little interference as
+ * possible, each cache line of the memory to be loaded is touched while
+ * core is running with class of service set to the bitmask of the
+ * pseudo-locked region. After this is complete no future CAT allocations
+ * will be allowed to overlap with this bitmask.
+ *
+ * Local register variables are utilized to ensure that the memory region
+ * to be locked is the only memory accessed during the critical locking
+ * loop.
+ */
+static int pseudo_lock_fn(void *_plr)
+{
+	struct pseudo_lock_region *plr = _plr;
+	u32 rmid_p, closid_p;
+	unsigned long flags;
+	u64 i;
+#ifdef CONFIG_KASAN
+	/*
+	 * The registers used for local register variables are also used
+	 * when KASAN is active. When KASAN is active we use a regular
+	 * variable to ensure we always use a valid pointer, but the cost
+	 * is that this variable will enter the cache by evicting the
+	 * memory we are trying to lock into the cache. Thus expect a lower
+	 * pseudo-locking success rate when KASAN is active.
+	 */
+	unsigned int line_size;
+	unsigned int size;
+	void *mem_r;
+#else
+	register unsigned int line_size asm("esi");
+	register unsigned int size asm("edi");
+#ifdef CONFIG_X86_64
+	register void *mem_r asm("rbx");
+#else
+	register void *mem_r asm("ebx");
+#endif /* CONFIG_X86_64 */
+#endif /* CONFIG_KASAN */
+
+	/*
+	 * Make sure none of the allocated memory is cached. If it is we
+	 * will get a cache hit in the loop below from outside the
+	 * pseudo-locked region.
+	 * wbinvd (as opposed to clflush/clflushopt) is required to
+	 * increase the likelihood that the allocated cache portion will be
+	 * filled with the associated memory.
+	 */
+	wbinvd();
+
+	preempt_disable();
+	local_irq_save(flags);
+	/*
+	 * Call wrmsr and rdmsr as directly as possible to avoid tracing
+	 * clobbering local register variables or affecting cache accesses.
+	 */
+	__wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
+	closid_p = this_cpu_read(pqr_state.cur_closid);
+	rmid_p = this_cpu_read(pqr_state.cur_rmid);
+	mem_r = plr->kmem;
+	size = plr->size;
+	line_size = plr->line_size;
+	__wrmsr(IA32_PQR_ASSOC, rmid_p, plr->closid);
+	/*
+	 * Cache was flushed earlier. Now access kernel memory to read it
+	 * into cache region associated with just activated plr->closid.
+	 * Loop over data twice:
+	 * - In first loop the cache region is shared with the page walker
+	 *   as it populates the paging structure caches (including TLB).
+	 * - In the second loop the paging structure caches are used and
+	 *   cache region is populated with the memory being referenced.
+	 */
+	for (i = 0; i < size; i += PAGE_SIZE) {
+		asm volatile("mov (%0,%1,1), %%eax\n\t"
+			:
+			: "r" (mem_r), "r" (i)
+			: "%eax", "memory");
+	}
+	for (i = 0; i < size; i += line_size) {
+		asm volatile("mov (%0,%1,1), %%eax\n\t"
+			:
+			: "r" (mem_r), "r" (i)
+			: "%eax", "memory");
+	}
+	__wrmsr(IA32_PQR_ASSOC, rmid_p, closid_p);
+	wrmsr(MSR_MISC_FEATURE_CONTROL, 0x0, 0x0);
+	local_irq_restore(flags);
+	preempt_enable();
+
+	thread_done = 1;
+	wake_up_interruptible(&wq);
+	return 0;
+}
+
+static int pseudo_lock_doit(struct pseudo_lock_region *plr,
+			    struct rdt_resource *r,
+			    struct rdt_domain *d)
+{
+	struct task_struct *thread;
+	int closid;
+	int ret, i;
+
+	/*
+	 * With the usage of wbinvd we can only support one pseudo-locked
+	 * region per domain at this time.
+	 */
+	if (d->plr) {
+		rdt_last_cmd_puts("pseudo-locked region exists on cache\n");
+		return -ENOSPC;
+	}
+
+	ret = pseudo_lock_region_init(plr, r, d);
+	if (ret < 0)
+		return ret;
+
+	closid = closid_alloc();
+	if (closid < 0) {
+		ret = closid;
+		rdt_last_cmd_puts("unable to obtain free closid\n");
+		goto out_region;
+	}
+
+	/*
+	 * Ensure we end with a valid default CLOS. If a pseudo-locked
+	 * region in the middle of the possible bitmasks is selected it will
+	 * split up the default CLOS, which would be a fault for which
+	 * handling is unclear, so we fail back to userspace. Validation will
+	 * also ensure that the default CLOS is not zero, keeping some cache
+	 * available to the rest of the system.
+	 */
+	if (!cbm_validate_val(d->ctrl_val[0] & ~plr->cbm, r)) {
+		ret = -EINVAL;
+		rdt_last_cmd_printf("bm 0x%x causes invalid clos 0 bm 0x%x\n",
+				    plr->cbm, d->ctrl_val[0] & ~plr->cbm);
+		goto out_closid;
+	}
+
+	ret = pseudo_lock_clos_set(plr, 0, d->ctrl_val[0] & ~plr->cbm);
+	if (ret < 0) {
+		rdt_last_cmd_printf("unable to set clos 0 bitmask to 0x%x\n",
+				    d->ctrl_val[0] & ~plr->cbm);
+		goto out_closid;
+	}
+
+	ret = pseudo_lock_clos_set(plr, closid, plr->cbm);
+	if (ret < 0) {
+		rdt_last_cmd_printf("unable to set closid %d bitmask to 0x%x\n",
+				    closid, plr->cbm);
+		goto out_clos_def;
+	}
+
+	plr->closid = closid;
+
+	thread_done = 0;
+
+	thread = kthread_create_on_node(pseudo_lock_fn, plr,
+					cpu_to_node(plr->cpu),
+					"pseudo_lock/%u", plr->cpu);
+	if (IS_ERR(thread)) {
+		ret = PTR_ERR(thread);
+		rdt_last_cmd_printf("locking thread returned error %d\n", ret);
+		/*
+		 * We do not return CBM to newly allocated CLOS here on
+		 * error path since that will result in a CBM of all
+		 * zeroes which is an illegal MSR write.
+		 */
+		goto out_clos_def;
+	}
+
+	kthread_bind(thread, plr->cpu);
+	wake_up_process(thread);
+
+	ret = wait_event_interruptible(wq, thread_done == 1);
+	if (ret < 0) {
+		rdt_last_cmd_puts("locking thread interrupted\n");
+		goto out_clos_def;
+	}
+
+	/*
+	 * The closid will be released soon but its CBM, as well as the CBMs
+	 * of not yet allocated CLOSes stored in the array, will remain.
+	 * Ensure that these CBMs match what is currently the default CLOS,
+	 * which excludes the pseudo-locked region.
+	 */
+	for (i = 1; i < r->num_closid; i++) {
+		if (i == closid || !closid_allocated(i))
+			pseudo_lock_clos_set(plr, i, d->ctrl_val[0]);
+	}
+
+	plr->locked = true;
+	d->plr = plr;
+	new_plr = NULL;
+
+	/*
+	 * We do not return CBM to CLOS here since that will result in a
+	 * CBM of all zeroes which is an illegal MSR write.
+	 */
+	closid_free(closid);
+	ret = 0;
+	goto out;
+
+out_clos_def:
+	pseudo_lock_clos_set(plr, 0, d->ctrl_val[0] | plr->cbm);
+out_closid:
+	closid_free(closid);
+out_region:
+	pseudo_lock_region_clear(plr);
+out:
+	return ret;
+}
+
+/**
+ * pseudo_lock_schemata_write - process user's pseudo-locking request
+ *
+ * User provides a schemata in the format RESOURCE:ID=BITMASK with the
+ * following meaning:
+ * RESOURCE - Name of the RDT resource (rdt_resource->name) that will be
+ *            pseudo-locked.
+ * ID       - id of the particular instance of RESOURCE that will be
+ *            pseudo-locked. This maps to rdt_domain->id.
+ * BITMASK  - The bitmask specifying the region of cache that should be
+ *            pseudo-locked.
+ *
+ * RETURNS:
+ * On success the user's requested region has been pseudo-locked
+ */
+ssize_t pseudo_lock_schemata_write(struct kernfs_open_file *of,
+				   char *buf, size_t nbytes, loff_t off)
+{
+	struct pseudo_lock_region *plr;
+	struct rdt_resource *r;
+	struct rdt_domain *d;
+	char *resname, *dom;
+	bool found = false;
+	int ret = -EINVAL;
+	int dom_id;
+	u32 b_req;
+
+	if (nbytes == 0 || buf[nbytes - 1] != '\n')
+		return -EINVAL;
+
+	cpus_read_lock();
+
+	plr = pseudo_lock_region_kn_lock(of->kn);
+	if (!plr) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	rdt_last_cmd_clear();
+
+	/* Do not lock a region twice. */
+	if (plr->locked) {
+		ret = -EEXIST;
+		rdt_last_cmd_puts("region is already locked\n");
+		goto out;
+	}
+
+	if (plr != new_plr) {
+		rdt_last_cmd_puts("region has already been initialized\n");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	buf[nbytes - 1] = '\0';
+
+	resname = strsep(&buf, ":");
+	if (!buf) {
+		rdt_last_cmd_puts("schemata missing ':'\n");
+		goto out;
+	}
+
+	dom = strsep(&buf, "=");
+	if (!buf) {
+		rdt_last_cmd_puts("schemata missing '='\n");
+		goto out;
+	}
+
+	ret = kstrtoint(dom, 10, &dom_id);
+	if (ret < 0 || dom_id < 0) {
+		rdt_last_cmd_puts("unable to parse cache id\n");
+		goto out;
+	}
+
+	for_each_alloc_enabled_rdt_resource(r) {
+		if (!strcmp(resname, r->name)) {
+			found = true;
+			ret = kstrtou32(buf, 16, &b_req);
+			if (ret) {
+				rdt_last_cmd_puts("unable to parse bitmask\n");
+				goto out;
+			}
+			if (!cbm_validate_val(b_req, r)) {
+				ret = -EINVAL;
+				goto out;
+			}
+			plr->cbm = b_req;
+			list_for_each_entry(d, &r->domains, list) {
+				if (d->id == dom_id) {
+					ret = pseudo_lock_doit(plr, r, d);
+					goto out;
+				}
+			}
+			rdt_last_cmd_puts("no matching cache instance\n");
+			ret = -EINVAL;
+			break;
+		}
+	}
+
+	if (!found) {
+		rdt_last_cmd_puts("invalid resource name\n");
+		ret = -EINVAL;
+	}
+
+out:
+	pseudo_lock_region_kn_unlock(of->kn);
+	cpus_read_unlock();
+	return ret ?: nbytes;
+}
+
 int pseudo_lock_size_show(struct kernfs_open_file *of,
 			  struct seq_file *seq, void *v)
 {
@@ -295,7 +844,7 @@ int rdt_pseudo_lock_mkdir(const char *name, umode_t mode)
 	mutex_lock(&rdtgroup_mutex);
 	mutex_lock(&rdt_pseudo_lock_mutex);
 
-	if (new_plr) {
+	if (new_plr || !pseudo_lock_space_avail()) {
 		ret = -ENOSPC;
 		goto out;
 	}
@@ -525,6 +1074,9 @@ int rdt_pseudo_lock_fs_init(struct kernfs_node *root)
  */
 void rdt_pseudo_lock_fs_remove(void)
 {
+	struct rdt_resource *r;
+	struct rdt_domain *d;
+
 	lockdep_assert_held(&rdtgroup_mutex);
 
 	if (!pseudo_lock_kn)
@@ -536,6 +1088,15 @@ void rdt_pseudo_lock_fs_remove(void)
 		new_plr->deleted = true;
 		__pseudo_lock_region_release(new_plr);
 	}
+
+	for_each_alloc_enabled_rdt_resource(r) {
+		list_for_each_entry(d, &r->domains, list) {
+			if (d->plr) {
+				d->plr->deleted = true;
+				__pseudo_lock_region_release(d->plr);
+			}
+		}
+	}
 	kernfs_remove(pseudo_lock_kn);
 	pseudo_lock_kn = NULL;
 	mutex_unlock(&rdt_pseudo_lock_mutex);
diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index a7cbaf85ed54..5e55cd10ce31 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -861,9 +861,10 @@ static struct rftype res_common_files[] = {
 	},
 	{
 		.name		= "schemata",
-		.mode		= 0444,
+		.mode		= 0644,
 		.kf_ops		= &rdtgroup_kf_single_ops,
 		.seq_show	= pseudo_lock_schemata_show,
+		.write		= pseudo_lock_schemata_write,
 		.fflags		= RF_PSEUDO_LOCK,
 	},
 	{
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [RFC PATCH V2 14/22] x86/intel_rdt: Enable testing for pseudo-locked region
  2018-02-13 15:46 [RFC PATCH V2 00/22] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling Reinette Chatre
                   ` (12 preceding siblings ...)
  2018-02-13 15:46 ` [RFC PATCH V2 13/22] x86/intel_rdt: Support schemata write - pseudo-locking core Reinette Chatre
@ 2018-02-13 15:46 ` Reinette Chatre
  2018-02-13 15:46 ` [RFC PATCH V2 15/22] x86/intel_rdt: Prevent new allocations from pseudo-locked regions Reinette Chatre
                   ` (8 subsequent siblings)
  22 siblings, 0 replies; 65+ messages in thread
From: Reinette Chatre @ 2018-02-13 15:46 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck
  Cc: gavin.hindman, vikas.shivappa, dave.hansen, mingo, hpa, x86,
	linux-kernel, Reinette Chatre

Introduce a new test that can be used to determine if a provided CBM
intersects with an existing pseudo-locked region of a cache domain.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 arch/x86/kernel/cpu/intel_rdt.h             |  1 +
 arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 19 +++++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
index 85f9ad6de113..17b7d14e2e02 100644
--- a/arch/x86/kernel/cpu/intel_rdt.h
+++ b/arch/x86/kernel/cpu/intel_rdt.h
@@ -462,6 +462,7 @@ void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms);
 void cqm_handle_limbo(struct work_struct *work);
 bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
 void __check_limbo(struct rdt_domain *d, bool force_free);
+bool cbm_pseudo_locked(unsigned long cbm, struct rdt_domain *d);
 int rdt_pseudo_lock_fs_init(struct kernfs_node *root);
 void rdt_pseudo_lock_fs_remove(void);
 int rdt_pseudo_lock_mkdir(const char *name, umode_t mode);
diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index 1f351b7170ef..e9ab724432f8 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -273,6 +273,25 @@ static void pseudo_lock_region_kn_unlock(struct kernfs_node *kn)
 }
 
 /**
+ * cbm_pseudo_locked - Test if all or portion of CBM is pseudo-locked
+ * @cbm:	bitmask to be tested
+ * @d:		rdt_domain for which @cbm was provided
+ *
+ * RETURNS:
+ * True if bits from @cbm intersect with what has been pseudo-locked in
+ * rdt_domain @d, false otherwise.
+ */
+bool cbm_pseudo_locked(unsigned long cbm, struct rdt_domain *d)
+{
+	if (d->plr &&
+	    bitmap_intersects(&cbm, (unsigned long *)&d->plr->cbm,
+			      d->plr->r->cache.cbm_len))
+		return true;
+
+	return false;
+}
+
+/**
  * pseudo_lock_avail_get - return bitmask of cache available for locking
  * @r: resource to which this cache instance belongs
  * @d: domain representing the cache instance
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [RFC PATCH V2 15/22] x86/intel_rdt: Prevent new allocations from pseudo-locked regions
  2018-02-13 15:46 [RFC PATCH V2 00/22] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling Reinette Chatre
                   ` (13 preceding siblings ...)
  2018-02-13 15:46 ` [RFC PATCH V2 14/22] x86/intel_rdt: Enable testing for pseudo-locked region Reinette Chatre
@ 2018-02-13 15:46 ` Reinette Chatre
  2018-02-13 15:47 ` [RFC PATCH V2 16/22] x86/intel_rdt: Create debugfs files for pseudo-locking testing Reinette Chatre
                   ` (7 subsequent siblings)
  22 siblings, 0 replies; 65+ messages in thread
From: Reinette Chatre @ 2018-02-13 15:46 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck
  Cc: gavin.hindman, vikas.shivappa, dave.hansen, mingo, hpa, x86,
	linux-kernel, Reinette Chatre

When a user requests a new cache allocation we need to enforce that it
does not intersect with an existing pseudo-locked region. An allocation
whose bitmask intersects with a pseudo-locked region will enable
cache allocations into that region and thus evict pseudo-locked data.
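
As a made-up example of the check this adds: with a pseudo-locked region
occupying CBM 0x3, a requested CBM of 0x6 shares bit 1 and is rejected,
while 0xc does not overlap and is allowed:

  #include <stdio.h>

  int main(void)
  {
          unsigned int locked = 0x3;      /* pseudo-locked portion */
          unsigned int reqs[] = { 0x6, 0xc };
          int i;

          for (i = 0; i < 2; i++)
                  printf("0x%x -> %s\n", reqs[i],
                         (reqs[i] & locked) ? "rejected" : "allowed");
          return 0;
  }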

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c
index 4a11aea3ad2c..4f823d631345 100644
--- a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c
+++ b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c
@@ -136,8 +136,10 @@ static bool cbm_validate(char *buf, unsigned long *data, struct rdt_resource *r)
 }
 
 /*
- * Read one cache bit mask (hex). Check that it is valid for the current
- * resource type.
+ * Read one cache bit mask (hex). Check that it is valid and available for
+ * the current resource type. While CAT allows CBM to overlap amongst
+ * classes of service we do not allow a CBM to overlap with a region that has
+ * been pseudo-locked.
  */
 int parse_cbm(char *buf, struct rdt_resource *r, struct rdt_domain *d)
 {
@@ -150,6 +152,8 @@ int parse_cbm(char *buf, struct rdt_resource *r, struct rdt_domain *d)
 
 	if(!cbm_validate(buf, &data, r))
 		return -EINVAL;
+	if (cbm_pseudo_locked(data, d))
+		return -EINVAL;
 	d->new_ctrl = data;
 	d->have_new_ctrl = true;
 
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [RFC PATCH V2 16/22] x86/intel_rdt: Create debugfs files for pseudo-locking testing
  2018-02-13 15:46 [RFC PATCH V2 00/22] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling Reinette Chatre
                   ` (14 preceding siblings ...)
  2018-02-13 15:46 ` [RFC PATCH V2 15/22] x86/intel_rdt: Prevent new allocations from pseudo-locked regions Reinette Chatre
@ 2018-02-13 15:47 ` Reinette Chatre
  2018-02-13 15:47 ` [RFC PATCH V2 17/22] x86/intel_rdt: Create character device exposing pseudo-locked region Reinette Chatre
                   ` (6 subsequent siblings)
  22 siblings, 0 replies; 65+ messages in thread
From: Reinette Chatre @ 2018-02-13 15:47 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck
  Cc: gavin.hindman, vikas.shivappa, dave.hansen, mingo, hpa, x86,
	linux-kernel, Reinette Chatre

There is no simple yes/no test to determine if pseudo-locking was
successful. In order to test pseudo-locking we expose a debugfs file for
each pseudo-locked region that will record the latency of reading the
pseudo-locked memory at a stride of 32 bytes (hardcoded). These numbers
will give us an idea of whether locking was successful since they will
reflect cache hits and cache misses (hardware prefetching is disabled
during the test).

The new debugfs file "measure_trigger" will, when the
pseudo_lock_mem_latency tracepoint is enabled, record the latency of
accessing each cache line twice.

Kernel tracepoints offer us histograms, which are a simple way to visualize
the memory access latency and immediately see any cache misses. For
example, setting the hist trigger below before triggering the measurement
will display the memory access latency and the number of instances at each
latency:
echo 'hist:keys=latency' > /sys/kernel/debug/tracing/events/pseudo_lock/\
                           pseudo_lock_mem_latency/trigger
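
For completeness, a minimal sketch of triggering a measurement run from
user space once the trigger above is set. The debugfs mount point and the
region name ("newlock") are assumptions for the example:

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
          const char *path =
                  "/sys/kernel/debug/resctrl/pseudo_lock/newlock/measure_trigger";
          int fd = open(path, O_WRONLY);

          if (fd < 0) {
                  perror("open");
                  return 1;
          }
          /* Any value strtobool() treats as true starts the measurement */
          if (write(fd, "1", 1) < 0)
                  perror("measure_trigger");
          close(fd);
          return 0;
  }

The resulting latency distribution can then be read from the hist file of
the pseudo_lock_mem_latency event.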

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 arch/x86/Kconfig                                  |  11 ++
 arch/x86/kernel/cpu/Makefile                      |   1 +
 arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c       | 203 ++++++++++++++++++++++
 arch/x86/kernel/cpu/intel_rdt_pseudo_lock_event.h |  22 +++
 4 files changed, 237 insertions(+)
 create mode 100644 arch/x86/kernel/cpu/intel_rdt_pseudo_lock_event.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 20da391b5f32..640d212cecfd 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -455,6 +455,17 @@ config INTEL_RDT
 
 	  Say N if unsure.
 
+config INTEL_RDT_DEBUGFS
+	bool "Intel RDT debugfs interface"
+	depends on INTEL_RDT
+	select HIST_TRIGGERS
+	select DEBUG_FS
+	---help---
+	  Enable the creation of Intel RDT debugfs files, in support of
+	  debugging and validation of the Intel RDT sub-features that use them.
+
+	  Say N if unsure.
+
 if X86_32
 config X86_EXTENDED_PLATFORM
 	bool "Support for extended (non-PC) x86 platforms"
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 53022c2413e0..9ca7b1625a4a 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -37,6 +37,7 @@ obj-$(CONFIG_CPU_SUP_UMC_32)		+= umc.o
 
 obj-$(CONFIG_INTEL_RDT)	+= intel_rdt.o intel_rdt_rdtgroup.o intel_rdt_monitor.o
 obj-$(CONFIG_INTEL_RDT)	+= intel_rdt_ctrlmondata.o intel_rdt_pseudo_lock.o
+CFLAGS_intel_rdt_pseudo_lock.o = -I$(src)
 
 obj-$(CONFIG_X86_MCE)			+= mcheck/
 obj-$(CONFIG_MTRR)			+= mtrr/
diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index e9ab724432f8..c03413021f45 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -22,6 +22,7 @@
 #include <linux/cacheinfo.h>
 #include <linux/cpu.h>
 #include <linux/cpumask.h>
+#include <linux/debugfs.h>
 #include <linux/kernfs.h>
 #include <linux/kref.h>
 #include <linux/kthread.h>
@@ -33,6 +34,11 @@
 #include <asm/intel_rdt_sched.h>
 #include "intel_rdt.h"
 
+#ifdef CONFIG_INTEL_RDT_DEBUGFS
+#define CREATE_TRACE_POINTS
+#include "intel_rdt_pseudo_lock_event.h"
+#endif
+
 /*
  * MSR_MISC_FEATURE_CONTROL register enables the modification of hardware
  * prefetcher state. Details about this register can be found in the MSR
@@ -69,6 +75,17 @@ static int thread_done;
  */
 static DEFINE_MUTEX(rdt_pseudo_lock_mutex);
 
+#ifdef CONFIG_INTEL_RDT_DEBUGFS
+/*
+ * Pointers to debugfs directories. @debugfs_resctrl points to the top-level
+ * directory named resctrl. This can be moved to a central area when other
+ * RDT components start using it.
+ * @debugfs_pseudo points to the pseudo_lock directory under resctrl.
+ */
+static struct dentry *debugfs_resctrl;
+static struct dentry *debugfs_pseudo;
+#endif
+
 /**
  * struct pseudo_lock_region - pseudo-lock region information
  * @kn:			kernfs node representing this region in the resctrl
@@ -91,6 +108,8 @@ static DEFINE_MUTEX(rdt_pseudo_lock_mutex);
  *			region via kernfs
  * @deleted:		user requested removal of region via rmdir on kernfs
  * @kmem:		the kernel memory associated with pseudo-locked region
+ * @debugfs_dir:	pointer to this region's directory in the debugfs
+ *			filesystem
  */
 struct pseudo_lock_region {
 	struct kernfs_node	*kn;
@@ -106,6 +125,9 @@ struct pseudo_lock_region {
 	struct kref		refcount;
 	bool			deleted;
 	void			*kmem;
+#ifdef CONFIG_INTEL_RDT_DEBUGFS
+	struct dentry		*debugfs_dir;
+#endif
 };
 
 /*
@@ -192,6 +214,9 @@ static void __pseudo_lock_region_release(struct pseudo_lock_region *plr)
 					     plr->d->ctrl_val[0] | plr->cbm);
 		pseudo_lock_region_clear(plr);
 	}
+#ifdef CONFIG_INTEL_RDT_DEBUGFS
+	debugfs_remove_recursive(plr->debugfs_dir);
+#endif
 	kfree(plr);
 	if (is_new_plr)
 		new_plr = NULL;
@@ -291,6 +316,135 @@ bool cbm_pseudo_locked(unsigned long cbm, struct rdt_domain *d)
 	return false;
 }
 
+#ifdef CONFIG_INTEL_RDT_DEBUGFS
+static int measure_cycles_fn(void *_plr)
+{
+	struct pseudo_lock_region *plr = _plr;
+	unsigned long flags;
+	u64 start, end;
+	u64 i;
+#ifdef CONFIG_KASAN
+	/*
+	 * The registers used for local register variables are also used
+	 * when KASAN is active. When KASAN is active we use a regular
+	 * variable to ensure we always use a valid pointer to access memory.
+	 * The cost is that accessing this pointer, which could be in
+	 * cache, will be included in the measurement of memory read latency.
+	 */
+	void *mem_r;
+#else
+#ifdef CONFIG_X86_64
+	register void *mem_r asm("rbx");
+#else
+	register void *mem_r asm("ebx");
+#endif /* CONFIG_X86_64 */
+#endif /* CONFIG_KASAN */
+
+	preempt_disable();
+	local_irq_save(flags);
+	/*
+	 * The wrmsr call may be reordered with the assignment below it.
+	 * Call wrmsr as directly as possible to avoid tracing clobbering
+	 * local register variable used for memory pointer.
+	 */
+	__wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
+	mem_r = plr->kmem;
+	for (i = 0; i < plr->size; i += 32) {
+		start = rdtsc_ordered();
+		asm volatile("mov (%0,%1,1), %%eax\n\t"
+			     :
+			     : "r" (mem_r), "r" (i)
+			     : "%eax", "memory");
+		end = rdtsc_ordered();
+		trace_pseudo_lock_mem_latency((u32)(end - start));
+	}
+	wrmsr(MSR_MISC_FEATURE_CONTROL, 0x0, 0x0);
+	local_irq_restore(flags);
+	preempt_enable();
+	thread_done = 1;
+	wake_up_interruptible(&wq);
+	return 0;
+}
+
+static int pseudo_measure_cycles(struct pseudo_lock_region *plr)
+{
+	struct task_struct *thread;
+	unsigned int cpu;
+	int ret;
+
+	cpus_read_lock();
+	mutex_lock(&rdt_pseudo_lock_mutex);
+
+	if (!plr->locked || plr->deleted) {
+		ret = 0;
+		goto out;
+	}
+
+	thread_done = 0;
+	cpu = cpumask_first(&plr->d->cpu_mask);
+	if (!cpu_online(cpu)) {
+		ret = -ENODEV;
+		goto out;
+	}
+
+	thread = kthread_create_on_node(measure_cycles_fn, plr,
+					cpu_to_node(cpu),
+					"pseudo_lock_measure/%u", cpu);
+	if (IS_ERR(thread)) {
+		ret = PTR_ERR(thread);
+		goto out;
+	}
+	kthread_bind(thread, cpu);
+	wake_up_process(thread);
+
+	ret = wait_event_interruptible(wq, thread_done == 1);
+	if (ret < 0)
+		goto out;
+
+	ret = 0;
+
+out:
+	mutex_unlock(&rdt_pseudo_lock_mutex);
+	cpus_read_unlock();
+	return ret;
+}
+
+static ssize_t pseudo_measure_trigger(struct file *file,
+				      const char __user *user_buf,
+				      size_t count, loff_t *ppos)
+{
+	struct pseudo_lock_region *plr = file->private_data;
+	size_t buf_size;
+	char buf[32];
+	int ret;
+	bool bv;
+
+	buf_size = min(count, (sizeof(buf) - 1));
+	if (copy_from_user(buf, user_buf, buf_size))
+		return -EFAULT;
+
+	buf[buf_size] = '\0';
+	ret = strtobool(buf, &bv);
+	if (ret == 0) {
+		ret = debugfs_file_get(file->f_path.dentry);
+		if (ret == 0 && bv) {
+			ret = pseudo_measure_cycles(plr);
+			if (ret == 0)
+				ret = count;
+		}
+		debugfs_file_put(file->f_path.dentry);
+	}
+
+	return ret;
+}
+
+static const struct file_operations pseudo_measure_fops = {
+	.write = pseudo_measure_trigger,
+	.open = simple_open,
+	.llseek = default_llseek,
+};
+#endif /* CONFIG_INTEL_RDT_DEBUGFS */
+
 /**
  * pseudo_lock_avail_get - return bitmask of cache available for locking
  * @r: resource to which this cache instance belongs
@@ -858,6 +1012,9 @@ int rdt_pseudo_lock_mkdir(const char *name, umode_t mode)
 {
 	struct pseudo_lock_region *plr;
 	struct kernfs_node *kn;
+#ifdef CONFIG_INTEL_RDT_DEBUGFS
+	struct dentry *entry;
+#endif
 	int ret = 0;
 
 	mutex_lock(&rdtgroup_mutex);
@@ -889,12 +1046,32 @@ int rdt_pseudo_lock_mkdir(const char *name, umode_t mode)
 	if (ret)
 		goto out_remove;
 
+#ifdef CONFIG_INTEL_RDT_DEBUGFS
+	plr->debugfs_dir = debugfs_create_dir(plr->kn->name, debugfs_pseudo);
+	if (IS_ERR(plr->debugfs_dir)) {
+		ret = PTR_ERR(plr->debugfs_dir);
+		plr->debugfs_dir = NULL;
+		goto out_remove;
+	}
+
+	entry = debugfs_create_file("measure_trigger", 0200, plr->debugfs_dir,
+				    plr, &pseudo_measure_fops);
+	if (IS_ERR(entry)) {
+		ret = PTR_ERR(entry);
+		goto out_debugfs;
+	}
+#endif
+
 	kref_init(&plr->refcount);
 	kernfs_activate(kn);
 	new_plr = plr;
 	ret = 0;
 	goto out;
 
+#ifdef CONFIG_INTEL_RDT_DEBUGFS
+out_debugfs:
+	debugfs_remove_recursive(plr->debugfs_dir);
+#endif
 out_remove:
 	kernfs_remove(kn);
 out_free:
@@ -990,6 +1167,23 @@ static u64 get_prefetch_disable_bits(void)
 	return 0;
 }
 
+#ifdef CONFIG_INTEL_RDT_DEBUGFS
+static int pseudo_lock_debugfs_create(void)
+{
+	debugfs_resctrl = debugfs_create_dir("resctrl", NULL);
+	if (IS_ERR(debugfs_resctrl))
+		return PTR_ERR(debugfs_resctrl);
+
+	debugfs_pseudo = debugfs_create_dir("pseudo_lock", debugfs_resctrl);
+	if (IS_ERR(debugfs_pseudo)) {
+		debugfs_remove_recursive(debugfs_resctrl);
+		return PTR_ERR(debugfs_pseudo);
+	}
+
+	return 0;
+}
+#endif
+
 /**
  * rdt_pseudo_lock_fs_init - Create and initialize pseudo-locking files
  * @root: location in kernfs where directory and files should be created
@@ -1068,6 +1262,12 @@ int rdt_pseudo_lock_fs_init(struct kernfs_node *root)
 	if (ret)
 		goto error;
 
+#ifdef CONFIG_INTEL_RDT_DEBUGFS
+	ret = pseudo_lock_debugfs_create();
+	if (ret < 0)
+		goto error;
+#endif
+
 	kernfs_activate(pseudo_lock_kn);
 
 	ret = 0;
@@ -1116,6 +1316,9 @@ void rdt_pseudo_lock_fs_remove(void)
 			}
 		}
 	}
+#ifdef CONFIG_INTEL_RDT_DEBUGFS
+	debugfs_remove_recursive(debugfs_resctrl);
+#endif
 	kernfs_remove(pseudo_lock_kn);
 	pseudo_lock_kn = NULL;
 	mutex_unlock(&rdt_pseudo_lock_mutex);
diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock_event.h b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock_event.h
new file mode 100644
index 000000000000..cd74d1a0f592
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock_event.h
@@ -0,0 +1,22 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM pseudo_lock
+
+#if !defined(_TRACE_PSEUDO_LOCK_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_PSEUDO_LOCK_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(pseudo_lock_mem_latency,
+	    TP_PROTO(u32 latency),
+	    TP_ARGS(latency),
+	    TP_STRUCT__entry(__field(u32, latency)),
+	    TP_fast_assign(__entry->latency = latency),
+	    TP_printk("latency=%u", __entry->latency)
+	   );
+
+#endif /* _TRACE_PSEUDO_LOCK_H */
+
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH .
+#define TRACE_INCLUDE_FILE intel_rdt_pseudo_lock_event
+#include <trace/define_trace.h>
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [RFC PATCH V2 17/22] x86/intel_rdt: Create character device exposing pseudo-locked region
  2018-02-13 15:46 [RFC PATCH V2 00/22] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling Reinette Chatre
                   ` (15 preceding siblings ...)
  2018-02-13 15:47 ` [RFC PATCH V2 16/22] x86/intel_rdt: Create debugfs files for pseudo-locking testing Reinette Chatre
@ 2018-02-13 15:47 ` Reinette Chatre
  2018-02-13 15:47 ` [RFC PATCH V2 18/22] x86/intel_rdt: More precise L2 hit/miss measurements Reinette Chatre
                   ` (5 subsequent siblings)
  22 siblings, 0 replies; 65+ messages in thread
From: Reinette Chatre @ 2018-02-13 15:47 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck
  Cc: gavin.hindman, vikas.shivappa, dave.hansen, mingo, hpa, x86,
	linux-kernel, Reinette Chatre

Once a pseudo-locked region has been created it needs to be made
available to user space so that applications can benefit from it.

A character device supporting mmap() is created for each pseudo-locked
region. A user space application can now use the mmap() system call to map
the pseudo-locked region into its virtual address space.
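
For illustration, a minimal sketch of such an application. The device path
and the mapping size are assumptions for the example; the size should not
exceed the pseudo-locked region's size as reported by its "size" file:

  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
          const char *dev = "/dev/pseudo_lock/newlock"; /* assumed node */
          size_t size = 128 * 1024;
          void *mem;
          int fd;

          fd = open(dev, O_RDWR);
          if (fd < 0) {
                  perror("open");
                  return 1;
          }
          mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          if (mem == MAP_FAILED) {
                  perror("mmap");
                  close(fd);
                  return 1;
          }
          /* Accesses to mem are now served from the pseudo-locked region */
          munmap(mem, size);
          close(fd);
          return 0;
  }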

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 267 +++++++++++++++++++++++++++-
 1 file changed, 265 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index c03413021f45..b4923aa4314c 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -26,6 +26,7 @@
 #include <linux/kernfs.h>
 #include <linux/kref.h>
 #include <linux/kthread.h>
+#include <linux/mman.h>
 #include <linux/seq_file.h>
 #include <linux/stat.h>
 #include <linux/slab.h>
@@ -52,6 +53,14 @@
  */
 static u64 prefetch_disable_bits;
 
+/*
+ * Major number assigned to and shared by all devices exposing
+ * pseudo-locked regions.
+ */
+static unsigned int pseudo_lock_major;
+static unsigned long pseudo_lock_minor_avail = GENMASK(MINORBITS, 0);
+static struct class *pseudo_lock_class;
+
 struct kernfs_node *pseudo_lock_kn;
 
 /*
@@ -189,6 +198,15 @@ static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
 	plr->d = NULL;
 }
 
+/**
+ * pseudo_lock_minor_release - Return minor number to available
+ * @minor: The minor number being released
+ */
+static void pseudo_lock_minor_release(unsigned int minor)
+{
+	__set_bit(minor, &pseudo_lock_minor_avail);
+}
+
 static void __pseudo_lock_region_release(struct pseudo_lock_region *plr)
 {
 	bool is_new_plr = (plr == new_plr);
@@ -199,6 +217,9 @@ static void __pseudo_lock_region_release(struct pseudo_lock_region *plr)
 
 	if (plr->locked) {
 		plr->d->plr = NULL;
+		device_destroy(pseudo_lock_class,
+			       MKDEV(pseudo_lock_major, plr->minor));
+		pseudo_lock_minor_release(plr->minor);
 		/*
 		 * Resource groups come and go. Simply returning this
 		 * pseudo-locked region's bits to the default CLOS may
@@ -763,11 +784,74 @@ static int pseudo_lock_fn(void *_plr)
 	return 0;
 }
 
+/**
+ * pseudo_lock_minor_get - Obtain available minor number
+ * @minor: Pointer to where new minor number will be stored
+ *
+ * A bitmask is used to track available minor numbers. Here the next free
+ * minor number is allocated and returned.
+ *
+ * RETURNS:
+ * Zero on success, error on failure.
+ */
+static int pseudo_lock_minor_get(unsigned int *minor)
+{
+	unsigned long first_bit;
+
+	first_bit = find_first_bit(&pseudo_lock_minor_avail, MINORBITS);
+
+	if (first_bit == MINORBITS)
+		return -ENOSPC;
+
+	__clear_bit(first_bit, &pseudo_lock_minor_avail);
+	*minor = first_bit;
+
+	return 0;
+}
+
+/**
+ * region_find_by_minor - Locate a pseudo-lock region by inode minor number
+ * @minor: The minor number of the device representing pseudo-locked region
+ *
+ * When the character device is accessed we need to determine which
+ * pseudo-locked region it belongs to. This is done by matching the minor
+ * number of the device to the pseudo-locked region to which it belongs.
+ *
+ * Minor numbers are assigned at the time a pseudo-locked region is associated
+ * with a cache instance.
+ *
+ * LOCKING:
+ * rdt_pseudo_lock_mutex must be held
+ *
+ * RETURNS:
+ * On success returns pointer to pseudo-locked region, NULL on failure.
+ */
+static struct pseudo_lock_region *region_find_by_minor(unsigned int minor)
+{
+	struct pseudo_lock_region *plr_match = NULL;
+	struct rdt_resource *r;
+	struct rdt_domain *d;
+
+	lockdep_assert_held(&rdt_pseudo_lock_mutex);
+
+	for_each_alloc_enabled_rdt_resource(r) {
+		list_for_each_entry(d, &r->domains, list) {
+			if (d->plr && d->plr->minor == minor) {
+				plr_match = d->plr;
+				break;
+			}
+		}
+	}
+	return plr_match;
+}
+
 static int pseudo_lock_doit(struct pseudo_lock_region *plr,
 			    struct rdt_resource *r,
 			    struct rdt_domain *d)
 {
 	struct task_struct *thread;
+	unsigned int new_minor;
+	struct device *dev;
 	int closid;
 	int ret, i;
 
@@ -858,11 +942,45 @@ static int pseudo_lock_doit(struct pseudo_lock_region *plr,
 			pseudo_lock_clos_set(plr, i, d->ctrl_val[0]);
 	}
 
+	ret = pseudo_lock_minor_get(&new_minor);
+	if (ret < 0) {
+		rdt_last_cmd_puts("unable to obtain a new minor number\n");
+		goto out_clos_def;
+	}
+
 	plr->locked = true;
 	d->plr = plr;
 	new_plr = NULL;
 
 	/*
+	 * Unlock access but do not release the reference. The
+	 * pseudo-locked region will still be here when we return.
+	 * If anything else attempts to access the region while we do not
+	 * have the mutex the region would be considered locked.
+	 *
+	 * We need to release the mutex temporarily to avoid a potential
+	 * deadlock with the mm->mmap_sem semaphore which is obtained in
+	 * the device_create() callpath below as well as before our mmap()
+	 * callback is called.
+	 */
+	mutex_unlock(&rdt_pseudo_lock_mutex);
+
+	dev = device_create(pseudo_lock_class, NULL,
+			    MKDEV(pseudo_lock_major, new_minor),
+			    plr, "%s", plr->kn->name);
+
+	mutex_lock(&rdt_pseudo_lock_mutex);
+
+	if (IS_ERR(dev)) {
+		ret = PTR_ERR(dev);
+		rdt_last_cmd_printf("failed to create character device: %d\n",
+				    ret);
+		goto out_minor;
+	}
+
+	plr->minor = new_minor;
+
+	/*
 	 * We do not return CBM to CLOS here since that will result in a
 	 * CBM of all zeroes which is an illegal MSR write.
 	 */
@@ -870,6 +988,8 @@ static int pseudo_lock_doit(struct pseudo_lock_region *plr,
 	ret = 0;
 	goto out;
 
+out_minor:
+	pseudo_lock_minor_release(new_minor);
 out_clos_def:
 	pseudo_lock_clos_set(plr, 0, d->ctrl_val[0] | plr->cbm);
 out_closid:
@@ -1184,6 +1304,127 @@ static int pseudo_lock_debugfs_create(void)
 }
 #endif
 
+static int pseudo_lock_dev_open(struct inode *inode, struct file *filp)
+{
+	struct pseudo_lock_region *plr;
+
+	mutex_lock(&rdt_pseudo_lock_mutex);
+
+	plr = region_find_by_minor(iminor(inode));
+	if (!plr) {
+		mutex_unlock(&rdt_pseudo_lock_mutex);
+		return -ENODEV;
+	}
+
+	filp->private_data = plr;
+	/* Perform a non-seekable open - llseek is not supported */
+	filp->f_mode &= ~(FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE);
+
+	mutex_unlock(&rdt_pseudo_lock_mutex);
+
+	return 0;
+}
+
+static int pseudo_lock_dev_release(struct inode *inode, struct file *filp)
+{
+	mutex_lock(&rdt_pseudo_lock_mutex);
+	filp->private_data = NULL;
+	mutex_unlock(&rdt_pseudo_lock_mutex);
+	return 0;
+}
+
+static int pseudo_lock_dev_mremap(struct vm_area_struct *area)
+{
+	/* Not supported */
+	return -EINVAL;
+}
+
+static const struct vm_operations_struct pseudo_mmap_ops = {
+	.mremap = pseudo_lock_dev_mremap,
+};
+
+static int pseudo_lock_dev_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	unsigned long vsize = vma->vm_end - vma->vm_start;
+	unsigned long off = vma->vm_pgoff << PAGE_SHIFT;
+	struct pseudo_lock_region *plr;
+	unsigned long physical;
+	unsigned long psize;
+
+	mutex_lock(&rdt_pseudo_lock_mutex);
+
+	plr = file->private_data;
+	WARN_ON(!plr);
+	if (!plr) {
+		mutex_unlock(&rdt_pseudo_lock_mutex);
+		return -ENODEV;
+	}
+
+	/*
+	 * Task is required to run with affinity to the cpus associated
+	 * with the pseudo-locked region. If this is not the case the task
+	 * may be scheduled elsewhere and invalidate entries in the
+	 * pseudo-locked region.
+	 */
+	if (!cpumask_subset(&current->cpus_allowed, &plr->d->cpu_mask)) {
+		mutex_unlock(&rdt_pseudo_lock_mutex);
+		return -EINVAL;
+	}
+
+	physical = __pa(plr->kmem) >> PAGE_SHIFT;
+	psize = plr->size - off;
+
+	if (off > plr->size) {
+		mutex_unlock(&rdt_pseudo_lock_mutex);
+		return -ENOSPC;
+	}
+
+	/*
+	 * Ensure changes are carried directly to the memory being mapped,
+	 * do not allow copy-on-write mapping.
+	 */
+	if (!(vma->vm_flags & VM_SHARED)) {
+		mutex_unlock(&rdt_pseudo_lock_mutex);
+		return -EINVAL;
+	}
+
+	if (vsize > psize) {
+		mutex_unlock(&rdt_pseudo_lock_mutex);
+		return -ENOSPC;
+	}
+
+	memset(plr->kmem + off, 0, vsize);
+
+	if (remap_pfn_range(vma, vma->vm_start, physical + vma->vm_pgoff,
+			    vsize, vma->vm_page_prot)) {
+		mutex_unlock(&rdt_pseudo_lock_mutex);
+		return -EAGAIN;
+	}
+	vma->vm_ops = &pseudo_mmap_ops;
+	mutex_unlock(&rdt_pseudo_lock_mutex);
+	return 0;
+}
+
+static const struct file_operations pseudo_lock_dev_fops = {
+	.owner =	THIS_MODULE,
+	.llseek =	no_llseek,
+	.read =		NULL,
+	.write =	NULL,
+	.open =		pseudo_lock_dev_open,
+	.release =	pseudo_lock_dev_release,
+	.mmap =		pseudo_lock_dev_mmap,
+};
+
+static char *pseudo_lock_devnode(struct device *dev, umode_t *mode)
+{
+	struct pseudo_lock_region *plr;
+
+	plr = dev_get_drvdata(dev);
+	if (mode)
+		*mode = 0600;
+	return kasprintf(GFP_KERNEL, "pseudo_lock/%s", plr->kn->name);
+}
+
 /**
  * rdt_pseudo_lock_fs_init - Create and initialize pseudo-locking files
  * @root: location in kernfs where directory and files should be created
@@ -1245,10 +1486,26 @@ int rdt_pseudo_lock_fs_init(struct kernfs_node *root)
 	if (prefetch_disable_bits == 0)
 		return 0;
 
+	ret = register_chrdev(0, "pseudo_lock", &pseudo_lock_dev_fops);
+	if (ret < 0)
+		return ret;
+
+	pseudo_lock_major = ret;
+
+	pseudo_lock_class = class_create(THIS_MODULE, "pseudo_lock");
+	if (IS_ERR(pseudo_lock_class)) {
+		ret = PTR_ERR(pseudo_lock_class);
+		goto out_char;
+	}
+
+	pseudo_lock_class->devnode = pseudo_lock_devnode;
+
 	pseudo_lock_kn = kernfs_create_dir(root, "pseudo_lock",
 					   root->mode, NULL);
-	if (IS_ERR(pseudo_lock_kn))
-		return PTR_ERR(pseudo_lock_kn);
+	if (IS_ERR(pseudo_lock_kn)) {
+		ret = PTR_ERR(pseudo_lock_kn);
+		goto out_class;
+	}
 
 	kn = __kernfs_create_file(pseudo_lock_kn, "avail", 0444,
 				  0, &pseudo_lock_avail_ops,
@@ -1276,6 +1533,10 @@ int rdt_pseudo_lock_fs_init(struct kernfs_node *root)
 error:
 	kernfs_remove(pseudo_lock_kn);
 	pseudo_lock_kn = NULL;
+out_class:
+	class_destroy(pseudo_lock_class);
+out_char:
+	unregister_chrdev(pseudo_lock_major, "pseudo_lock");
 out:
 	return ret;
 }
@@ -1321,5 +1582,7 @@ void rdt_pseudo_lock_fs_remove(void)
 #endif
 	kernfs_remove(pseudo_lock_kn);
 	pseudo_lock_kn = NULL;
+	class_destroy(pseudo_lock_class);
+	unregister_chrdev(pseudo_lock_major, "pseudo_lock");
 	mutex_unlock(&rdt_pseudo_lock_mutex);
 }
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [RFC PATCH V2 18/22] x86/intel_rdt: More precise L2 hit/miss measurements
  2018-02-13 15:46 [RFC PATCH V2 00/22] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling Reinette Chatre
                   ` (16 preceding siblings ...)
  2018-02-13 15:47 ` [RFC PATCH V2 17/22] x86/intel_rdt: Create character device exposing pseudo-locked region Reinette Chatre
@ 2018-02-13 15:47 ` Reinette Chatre
  2018-02-13 15:47 ` [RFC PATCH V2 19/22] x86/intel_rdt: Support L3 cache performance event of Broadwell Reinette Chatre
                   ` (4 subsequent siblings)
  22 siblings, 0 replies; 65+ messages in thread
From: Reinette Chatre @ 2018-02-13 15:47 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck
  Cc: gavin.hindman, vikas.shivappa, dave.hansen, mingo, hpa, x86,
	linux-kernel, Reinette Chatre

Intel Goldmont processors support non-architectural precise events that
can be used to gain more insight into the success of L2 cache
pseudo-locking on these platforms.

Introduce a new measurement trigger that will enable two precise events,
MEM_LOAD_UOPS_RETIRED.L2_HIT and MEM_LOAD_UOPS_RETIRED.L2_MISS, while
accessing pseudo-locked data. A new tracepoint, pseudo_lock_l2, is
created to make these results visible to the user.
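
A measurement is requested by writing "2" to the measurement trigger
file introduced earlier in this series ("1" keeps the existing latency
histogram measurement). For reference, the IA32_PERFEVTSELx values
programmed by this patch are composed as sketched below; the helper and
macro names are illustrative only and simply restate the encoding used
in the code:

/* Sketch of the event select encoding used for the Goldmont L2 events */
#define PERFEVTSEL_OS	(0x02ULL << 16)	/* count in kernel mode */
#define PERFEVTSEL_INT	(0x10ULL << 16)	/* APIC interrupt enable */
#define PERFEVTSEL_EN	(0x40ULL << 16)	/* enable counter */

static u64 l2_event_bits(u64 umask, u64 event)
{
	return PERFEVTSEL_OS | PERFEVTSEL_INT | PERFEVTSEL_EN |
	       (umask << 8) | event;
}

/* MEM_LOAD_UOPS_RETIRED.L2_HIT:  l2_event_bits(0x02, 0xd1) */
/* MEM_LOAD_UOPS_RETIRED.L2_MISS: l2_event_bits(0x10, 0xd1) */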

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c       | 146 ++++++++++++++++++++--
 arch/x86/kernel/cpu/intel_rdt_pseudo_lock_event.h |  15 +++
 2 files changed, 148 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index b4923aa4314c..34b2de387c3a 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -36,6 +36,7 @@
 #include "intel_rdt.h"
 
 #ifdef CONFIG_INTEL_RDT_DEBUGFS
+#include <asm/perf_event.h>
 #define CREATE_TRACE_POINTS
 #include "intel_rdt_pseudo_lock_event.h"
 #endif
@@ -338,7 +339,7 @@ bool cbm_pseudo_locked(unsigned long cbm, struct rdt_domain *d)
 }
 
 #ifdef CONFIG_INTEL_RDT_DEBUGFS
-static int measure_cycles_fn(void *_plr)
+static int measure_cycles_hist_fn(void *_plr)
 {
 	struct pseudo_lock_region *plr = _plr;
 	unsigned long flags;
@@ -387,11 +388,115 @@ static int measure_cycles_fn(void *_plr)
 	return 0;
 }
 
-static int pseudo_measure_cycles(struct pseudo_lock_region *plr)
+static int measure_cycles_perf_fn(void *_plr)
+{
+	struct pseudo_lock_region *plr = _plr;
+	unsigned long long l2_hits, l2_miss;
+	u64 l2_hit_bits, l2_miss_bits;
+	unsigned long flags;
+	u64 i;
+#ifdef CONFIG_KASAN
+	/*
+	 * The registers used for local register variables are also used
+	 * when KASAN is active. When KASAN is active we use regular variables
+	 * at the cost of including cache access latency to these variables
+	 * in the measurements.
+	 */
+	unsigned int line_size;
+	unsigned int size;
+	void *mem_r;
+#else
+	register unsigned int line_size asm("esi");
+	register unsigned int size asm("edi");
+#ifdef CONFIG_X86_64
+	register void *mem_r asm("rbx");
+#else
+	register void *mem_r asm("ebx");
+#endif /* CONFIG_X86_64 */
+#endif /* CONFIG_KASAN */
+
+	/*
+	 * Non-architectural event for the Goldmont Microarchitecture
+	 * from Intel x86 Architecture Software Developer Manual (SDM):
+	 * MEM_LOAD_UOPS_RETIRED D1H (event number)
+	 * Umask values:
+	 *     L1_HIT   01H
+	 *     L2_HIT   02H
+	 *     L1_MISS  08H
+	 *     L2_MISS  10H
+	 */
+
+	/*
+	 * Start by setting flags for IA32_PERFEVTSELx:
+	 *     OS  (Operating system mode)  0x2
+	 *     INT (APIC interrupt enable)  0x10
+	 *     EN  (Enable counter)         0x40
+	 *
+	 * Then add the Umask value and event number to select performance
+	 * event.
+	 */
+
+	switch (boot_cpu_data.x86_model) {
+	case INTEL_FAM6_ATOM_GOLDMONT:
+	case INTEL_FAM6_ATOM_GEMINI_LAKE:
+		l2_hit_bits = (0x52ULL << 16) | (0x2 << 8) | 0xd1;
+		l2_miss_bits = (0x52ULL << 16) | (0x10 << 8) | 0xd1;
+		break;
+	default:
+		goto out;
+	}
+
+	preempt_disable();
+	local_irq_save(flags);
+	/*
+	 * Call wrmsr directly to prevent the local register variables from
+	 * being overwritten due to reordering of their assignment with
+	 * the wrmsr calls.
+	 */
+	__wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
+	/* Disable events and reset counters */
+	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0, 0x0);
+	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 1, 0x0);
+	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0, 0x0);
+	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0 + 1, 0x0);
+	/* Set and enable the L2 counters */
+	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0, l2_hit_bits);
+	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 1, l2_miss_bits);
+	mem_r = plr->kmem;
+	size = plr->size;
+	line_size = plr->line_size;
+	for (i = 0; i < size; i += line_size) {
+		asm volatile("mov (%0,%1,1), %%eax\n\t"
+			     :
+			     : "r" (mem_r), "r" (i)
+			     : "%eax", "memory");
+	}
+	/*
+	 * Call wrmsr directly (no tracing) to not influence
+	 * the cache access counters as they are disabled.
+	 */
+	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0,
+			      l2_hit_bits & ~(0x40ULL << 16));
+	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 1,
+			      l2_miss_bits & ~(0x40ULL << 16));
+	l2_hits = native_read_pmc(0);
+	l2_miss = native_read_pmc(1);
+	wrmsr(MSR_MISC_FEATURE_CONTROL, 0x0, 0x0);
+	local_irq_restore(flags);
+	preempt_enable();
+	trace_pseudo_lock_l2(l2_hits, l2_miss);
+
+out:
+	thread_done = 1;
+	wake_up_interruptible(&wq);
+	return 0;
+}
+
+static int pseudo_measure_cycles(struct pseudo_lock_region *plr, int sel)
 {
 	struct task_struct *thread;
 	unsigned int cpu;
-	int ret;
+	int ret = -1;
 
 	cpus_read_lock();
 	mutex_lock(&rdt_pseudo_lock_mutex);
@@ -408,9 +513,19 @@ static int pseudo_measure_cycles(struct pseudo_lock_region *plr)
 		goto out;
 	}
 
-	thread = kthread_create_on_node(measure_cycles_fn, plr,
-					cpu_to_node(cpu),
-					"pseudo_lock_measure/%u", cpu);
+	if (sel == 1)
+		thread = kthread_create_on_node(measure_cycles_hist_fn, plr,
+						cpu_to_node(cpu),
+						"pseudo_lock_measure/%u",
+						cpu);
+	else if (sel == 2)
+		thread = kthread_create_on_node(measure_cycles_perf_fn, plr,
+						cpu_to_node(cpu),
+						"pseudo_lock_measure/%u",
+						cpu);
+	else
+		goto out;
+
 	if (IS_ERR(thread)) {
 		ret = PTR_ERR(thread);
 		goto out;
@@ -438,21 +553,23 @@ static ssize_t pseudo_measure_trigger(struct file *file,
 	size_t buf_size;
 	char buf[32];
 	int ret;
-	bool bv;
+	int sel;
 
 	buf_size = min(count, (sizeof(buf) - 1));
 	if (copy_from_user(buf, user_buf, buf_size))
 		return -EFAULT;
 
 	buf[buf_size] = '\0';
-	ret = strtobool(buf, &bv);
+	ret = kstrtoint(buf, 10, &sel);
 	if (ret == 0) {
+		if (sel != 1 && sel != 2)
+			return -EINVAL;
 		ret = debugfs_file_get(file->f_path.dentry);
-		if (ret == 0 && bv) {
-			ret = pseudo_measure_cycles(plr);
-			if (ret == 0)
-				ret = count;
-		}
+		if (unlikely(ret))
+			return ret;
+		ret = pseudo_measure_cycles(plr, sel);
+		if (ret == 0)
+			ret = count;
 		debugfs_file_put(file->f_path.dentry);
 	}
 
@@ -1249,6 +1366,9 @@ int rdt_pseudo_lock_rmdir(struct kernfs_node *kn)
  * hardware prefetch disable bits are included here as they are documented
  * in the SDM.
  *
+ * When adding a platform here also add support for its cache events to
+ * measure_cycles_perf_fn()
+ *
  * RETURNS
  * If platform is supported, the bits to disable hardware prefetchers, 0
  * if platform is not supported.
diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock_event.h b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock_event.h
index cd74d1a0f592..45f6d1e35378 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock_event.h
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock_event.h
@@ -14,6 +14,21 @@ TRACE_EVENT(pseudo_lock_mem_latency,
 	    TP_printk("latency=%u", __entry->latency)
 	   );
 
+TRACE_EVENT(pseudo_lock_l2,
+	    TP_PROTO(u64 l2_hits, u64 l2_miss),
+	    TP_ARGS(l2_hits, l2_miss),
+	    TP_STRUCT__entry(
+			     __field(u64, l2_hits)
+			     __field(u64, l2_miss)
+	    ),
+	    TP_fast_assign(
+			   __entry->l2_hits = l2_hits;
+			   __entry->l2_miss = l2_miss;
+	    ),
+	    TP_printk("hits=%llu miss=%llu",
+		      __entry->l2_hits, __entry->l2_miss)
+	   );
+
 #endif /* _TRACE_PSEUDO_LOCK_H */
 
 #undef TRACE_INCLUDE_PATH
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [RFC PATCH V2 19/22] x86/intel_rdt: Support L3 cache performance event of Broadwell
  2018-02-13 15:46 [RFC PATCH V2 00/22] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling Reinette Chatre
                   ` (17 preceding siblings ...)
  2018-02-13 15:47 ` [RFC PATCH V2 18/22] x86/intel_rdt: More precise L2 hit/miss measurements Reinette Chatre
@ 2018-02-13 15:47 ` Reinette Chatre
  2018-02-13 15:47 ` [RFC PATCH V2 20/22] x86/intel_rdt: Limit C-states dynamically when pseudo-locking active Reinette Chatre
                   ` (3 subsequent siblings)
  22 siblings, 0 replies; 65+ messages in thread
From: Reinette Chatre @ 2018-02-13 15:47 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck
  Cc: gavin.hindman, vikas.shivappa, dave.hansen, mingo, hpa, x86,
	linux-kernel, Reinette Chatre

The Broadwell microarchitecture supports pseudo-locking. Add support
for the L3 cache related performance events of these systems so that we
can measure the success of pseudo-locking.
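
Note that the Broadwell events used here count cache references and
misses rather than hits, so the hit counts reported through the
tracepoints are derived values, roughly as sketched below with example
numbers (illustration only, not a quote from the code):

	l2_hits = l2_references - l2_misses;	/* e.g. 1000 - 12 = 988 */
	l3_hits = l3_references - l3_misses;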

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c       | 56 +++++++++++++++++++++++
 arch/x86/kernel/cpu/intel_rdt_pseudo_lock_event.h | 15 ++++++
 2 files changed, 71 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index 34b2de387c3a..7511c2089d07 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -390,6 +390,8 @@ static int measure_cycles_hist_fn(void *_plr)
 
 static int measure_cycles_perf_fn(void *_plr)
 {
+	unsigned long long l3_hits = 0, l3_miss = 0;
+	u64 l3_hit_bits = 0, l3_miss_bits = 0;
 	struct pseudo_lock_region *plr = _plr;
 	unsigned long long l2_hits, l2_miss;
 	u64 l2_hit_bits, l2_miss_bits;
@@ -424,6 +426,16 @@ static int measure_cycles_perf_fn(void *_plr)
 	 *     L2_HIT   02H
 	 *     L1_MISS  08H
 	 *     L2_MISS  10H
+	 *
+	 * On Broadwell Microarchitecture the MEM_LOAD_UOPS_RETIRED event
+	 * has two "no fix" errata associated with it: BDM35 and BDM100. On
+	 * this platform we use the following events instead:
+	 *  L2_RQSTS 24H (Documented in https://download.01.org/perfmon/BDW/)
+	 *       REFERENCES FFH
+	 *       MISS       3FH
+	 *  LONGEST_LAT_CACHE 2EH (Documented in SDM)
+	 *       REFERENCE 4FH
+	 *       MISS      41H
 	 */
 
 	/*
@@ -442,6 +454,14 @@ static int measure_cycles_perf_fn(void *_plr)
 		l2_hit_bits = (0x52ULL << 16) | (0x2 << 8) | 0xd1;
 		l2_miss_bits = (0x52ULL << 16) | (0x10 << 8) | 0xd1;
 		break;
+	case INTEL_FAM6_BROADWELL_X:
+		/* On BDW the l2_hit_bits count references, not hits */
+		l2_hit_bits = (0x52ULL << 16) | (0xff << 8) | 0x24;
+		l2_miss_bits = (0x52ULL << 16) | (0x3f << 8) | 0x24;
+		/* On BDW the l3_hit_bits count references, not hits */
+		l3_hit_bits = (0x52ULL << 16) | (0x4f << 8) | 0x2e;
+		l3_miss_bits = (0x52ULL << 16) | (0x41 << 8) | 0x2e;
+		break;
 	default:
 		goto out;
 	}
@@ -459,9 +479,21 @@ static int measure_cycles_perf_fn(void *_plr)
 	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 1, 0x0);
 	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0, 0x0);
 	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0 + 1, 0x0);
+	if (l3_hit_bits > 0) {
+		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 2, 0x0);
+		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 3, 0x0);
+		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0 + 2, 0x0);
+		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0 + 3, 0x0);
+	}
 	/* Set and enable the L2 counters */
 	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0, l2_hit_bits);
 	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 1, l2_miss_bits);
+	if (l3_hit_bits > 0) {
+		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 2,
+				      l3_hit_bits);
+		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 3,
+				      l3_miss_bits);
+	}
 	mem_r = plr->kmem;
 	size = plr->size;
 	line_size = plr->line_size;
@@ -479,12 +511,36 @@ static int measure_cycles_perf_fn(void *_plr)
 			      l2_hit_bits & ~(0x40ULL << 16));
 	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 1,
 			      l2_miss_bits & ~(0x40ULL << 16));
+	if (l3_hit_bits > 0) {
+		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 2,
+				      l3_hit_bits & ~(0x40ULL << 16));
+		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 3,
+				      l3_miss_bits & ~(0x40ULL << 16));
+	}
 	l2_hits = native_read_pmc(0);
 	l2_miss = native_read_pmc(1);
+	if (l3_hit_bits > 0) {
+		l3_hits = native_read_pmc(2);
+		l3_miss = native_read_pmc(3);
+	}
 	wrmsr(MSR_MISC_FEATURE_CONTROL, 0x0, 0x0);
 	local_irq_restore(flags);
 	preempt_enable();
+	/*
+	 * On BDW we count references and misses, so the hit counts need
+	 * to be adjusted. Sometimes the miss count slightly exceeds the
+	 * reference count, for example, x references but x + 1 misses.
+	 * To not report invalid hit values in this case we treat the
+	 * misses as equal to the references.
+	 */
+	if (boot_cpu_data.x86_model == INTEL_FAM6_BROADWELL_X)
+		l2_hits -= (l2_miss > l2_hits ? l2_hits : l2_miss);
 	trace_pseudo_lock_l2(l2_hits, l2_miss);
+	if (l3_hit_bits > 0) {
+		if (boot_cpu_data.x86_model == INTEL_FAM6_BROADWELL_X)
+			l3_hits -= (l3_miss > l3_hits ? l3_hits : l3_miss);
+		trace_pseudo_lock_l3(l3_hits, l3_miss);
+	}
 
 out:
 	thread_done = 1;
diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock_event.h b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock_event.h
index 45f6d1e35378..710535ae8235 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock_event.h
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock_event.h
@@ -29,6 +29,21 @@ TRACE_EVENT(pseudo_lock_l2,
 		      __entry->l2_hits, __entry->l2_miss)
 	   );
 
+TRACE_EVENT(pseudo_lock_l3,
+	    TP_PROTO(u64 l3_hits, u64 l3_miss),
+	    TP_ARGS(l3_hits, l3_miss),
+	    TP_STRUCT__entry(
+			     __field(u64, l3_hits)
+			     __field(u64, l3_miss)
+	    ),
+	    TP_fast_assign(
+			   __entry->l3_hits = l3_hits;
+			   __entry->l3_miss = l3_miss;
+	    ),
+	    TP_printk("hits=%llu miss=%llu",
+		      __entry->l3_hits, __entry->l3_miss)
+	   );
+
 #endif /* _TRACE_PSEUDO_LOCK_H */
 
 #undef TRACE_INCLUDE_PATH
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [RFC PATCH V2 20/22] x86/intel_rdt: Limit C-states dynamically when pseudo-locking active
  2018-02-13 15:46 [RFC PATCH V2 00/22] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling Reinette Chatre
                   ` (18 preceding siblings ...)
  2018-02-13 15:47 ` [RFC PATCH V2 19/22] x86/intel_rdt: Support L3 cache performance event of Broadwell Reinette Chatre
@ 2018-02-13 15:47 ` Reinette Chatre
  2018-02-13 15:47 ` [RFC PATCH V2 21/22] mm/hugetlb: Enable large allocations through gigantic page API Reinette Chatre
                   ` (2 subsequent siblings)
  22 siblings, 0 replies; 65+ messages in thread
From: Reinette Chatre @ 2018-02-13 15:47 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck
  Cc: gavin.hindman, vikas.shivappa, dave.hansen, mingo, hpa, x86,
	linux-kernel, Reinette Chatre

Deeper C-states impact cache content by shrinking the cache or by
flushing the entire cache to memory before reducing power to the cache.
Deeper C-states will thus negatively impact pseudo-locked regions.

To avoid impacting pseudo-locked regions we limit C-states on
pseudo-locked region creation so that cores associated with the
pseudo-locked region are prevented from entering deeper C-states.
This is accomplished by requesting a CPU latency target which will
prevent the core from entering C6 across all supported platforms.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 Documentation/x86/intel_rdt_ui.txt          |  4 +-
 arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 89 ++++++++++++++++++++++++++++-
 2 files changed, 88 insertions(+), 5 deletions(-)

diff --git a/Documentation/x86/intel_rdt_ui.txt b/Documentation/x86/intel_rdt_ui.txt
index bb3d6fe0a3e4..755d16ae7db6 100644
--- a/Documentation/x86/intel_rdt_ui.txt
+++ b/Documentation/x86/intel_rdt_ui.txt
@@ -349,8 +349,8 @@ in the cache via carefully configuring the CAT feature and controlling
 application behavior. There is no guarantee that data is placed in
 cache. Instructions like INVD, WBINVD, CLFLUSH, etc. can still evict
 “locked” data from cache. Power management C-states may shrink or
-power off cache. It is thus recommended to limit the processor maximum
-C-state, for example, by setting the processor.max_cstate kernel parameter.
+power off cache. Deeper C-states will automatically be restricted on
+pseudo-locked region creation.
 
 It is required that an application using a pseudo-locked region runs
 with affinity to the cores (or a subset of the cores) associated
diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index 7511c2089d07..90f040166fcd 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -27,6 +27,7 @@
 #include <linux/kref.h>
 #include <linux/kthread.h>
 #include <linux/mman.h>
+#include <linux/pm_qos.h>
 #include <linux/seq_file.h>
 #include <linux/stat.h>
 #include <linux/slab.h>
@@ -120,6 +121,7 @@ static struct dentry *debugfs_pseudo;
  * @kmem:		the kernel memory associated with pseudo-locked region
  * @debugfs_dir:	pointer to this region's directory in the debugfs
  *			filesystem
+ * @pm_reqs:		Power management QoS requests related to this region
  */
 struct pseudo_lock_region {
 	struct kernfs_node	*kn;
@@ -138,6 +140,17 @@ struct pseudo_lock_region {
 #ifdef CONFIG_INTEL_RDT_DEBUGFS
 	struct dentry		*debugfs_dir;
 #endif
+	struct list_head	pm_reqs;
+};
+
+/**
+ * pseudo_lock_pm_req - A power management QoS request list entry
+ * @list:	Entry within the @pm_reqs list for a pseudo-locked region
+ * @req:	PM QoS request
+ */
+struct pseudo_lock_pm_req {
+	struct list_head list;
+	struct dev_pm_qos_request req;
 };
 
 /*
@@ -208,6 +221,66 @@ static void pseudo_lock_minor_release(unsigned int minor)
 	__set_bit(minor, &pseudo_lock_minor_avail);
 }
 
+static void pseudo_lock_cstates_relax(struct pseudo_lock_region *plr)
+{
+	struct pseudo_lock_pm_req *pm_req, *next;
+
+	list_for_each_entry_safe(pm_req, next, &plr->pm_reqs, list) {
+		dev_pm_qos_remove_request(&pm_req->req);
+		list_del(&pm_req->list);
+		kfree(pm_req);
+	}
+}
+
+/**
+ * pseudo_lock_cstates_constrain - Restrict cores from entering C6
+ *
+ * To prevent the cache from being affected by power management we have to
+ * avoid entering C6. We accomplish this by requesting a latency
+ * requirement lower than the lowest C6 exit latency of all supported
+ * platforms as found in the cpuidle state tables in the intel_idle driver.
+ * At this time it is possible to do so with a single latency requirement
+ * for all supported platforms.
+ *
+ * Since we do support Goldmont, which is affected by X86_BUG_MONITOR, we
+ * need to consider the ACPI latencies while keeping in mind that C2 may be
+ * set to map to deeper sleep states. In this case the latency requirement
+ * needs to prevent entering C2 also.
+ */
+static int pseudo_lock_cstates_constrain(struct pseudo_lock_region *plr)
+{
+	struct pseudo_lock_pm_req *pm_req;
+	int cpu;
+	int ret;
+
+	for_each_cpu(cpu, &plr->d->cpu_mask) {
+		pm_req = kzalloc(sizeof(*pm_req), GFP_KERNEL);
+		if (!pm_req) {
+			rdt_last_cmd_puts("fail allocating mem for PM QoS\n");
+			ret = -ENOMEM;
+			goto out_err;
+		}
+		ret = dev_pm_qos_add_request(get_cpu_device(cpu),
+					     &pm_req->req,
+					     DEV_PM_QOS_RESUME_LATENCY,
+					     30);
+		if (ret < 0) {
+			rdt_last_cmd_printf("fail to add latency req cpu%d\n",
+					    cpu);
+			kfree(pm_req);
+			ret = -1;
+			goto out_err;
+		}
+		list_add(&pm_req->list, &plr->pm_reqs);
+	}
+
+	return 0;
+
+out_err:
+	pseudo_lock_cstates_relax(plr);
+	return ret;
+}
+
 static void __pseudo_lock_region_release(struct pseudo_lock_region *plr)
 {
 	bool is_new_plr = (plr == new_plr);
@@ -218,6 +291,7 @@ static void __pseudo_lock_region_release(struct pseudo_lock_region *plr)
 
 	if (plr->locked) {
 		plr->d->plr = NULL;
+		pseudo_lock_cstates_relax(plr);
 		device_destroy(pseudo_lock_class,
 			       MKDEV(pseudo_lock_major, plr->minor));
 		pseudo_lock_minor_release(plr->minor);
@@ -1077,6 +1151,12 @@ static int pseudo_lock_doit(struct pseudo_lock_region *plr,
 		goto out_clos_def;
 	}
 
+	ret = pseudo_lock_cstates_constrain(plr);
+	if (ret < 0) {
+		ret = -EINVAL;
+		goto out_clos_def;
+	}
+
 	plr->closid = closid;
 
 	thread_done = 0;
@@ -1092,7 +1172,7 @@ static int pseudo_lock_doit(struct pseudo_lock_region *plr,
 		 * error path since that will result in a CBM of all
 		 * zeroes which is an illegal MSR write.
 		 */
-		goto out_clos_def;
+		goto out_cstates;
 	}
 
 	kthread_bind(thread, plr->cpu);
@@ -1101,7 +1181,7 @@ static int pseudo_lock_doit(struct pseudo_lock_region *plr,
 	ret = wait_event_interruptible(wq, thread_done == 1);
 	if (ret < 0) {
 		rdt_last_cmd_puts("locking thread interrupted\n");
-		goto out_clos_def;
+		goto out_cstates;
 	}
 
 	/*
@@ -1118,7 +1198,7 @@ static int pseudo_lock_doit(struct pseudo_lock_region *plr,
 	ret = pseudo_lock_minor_get(&new_minor);
 	if (ret < 0) {
 		rdt_last_cmd_puts("unable to obtain a new minor number\n");
-		goto out_clos_def;
+		goto out_cstates;
 	}
 
 	plr->locked = true;
@@ -1163,6 +1243,8 @@ static int pseudo_lock_doit(struct pseudo_lock_region *plr,
 
 out_minor:
 	pseudo_lock_minor_release(new_minor);
+out_cstates:
+	pseudo_lock_cstates_relax(plr);
 out_clos_def:
 	pseudo_lock_clos_set(plr, 0, d->ctrl_val[0] | plr->cbm);
 out_closid:
@@ -1355,6 +1437,7 @@ int rdt_pseudo_lock_mkdir(const char *name, umode_t mode)
 	}
 #endif
 
+	INIT_LIST_HEAD(&plr->pm_reqs);
 	kref_init(&plr->refcount);
 	kernfs_activate(kn);
 	new_plr = plr;
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [RFC PATCH V2 21/22] mm/hugetlb: Enable large allocations through gigantic page API
  2018-02-13 15:46 [RFC PATCH V2 00/22] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling Reinette Chatre
                   ` (19 preceding siblings ...)
  2018-02-13 15:47 ` [RFC PATCH V2 20/22] x86/intel_rdt: Limit C-states dynamically when pseudo-locking active Reinette Chatre
@ 2018-02-13 15:47 ` Reinette Chatre
  2018-02-13 15:47 ` [RFC PATCH V2 22/22] x86/intel_rdt: Support contiguous memory of all sizes Reinette Chatre
  2018-02-14 18:12 ` [RFC PATCH V2 00/22] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling Mike Kravetz
  22 siblings, 0 replies; 65+ messages in thread
From: Reinette Chatre @ 2018-02-13 15:47 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck
  Cc: gavin.hindman, vikas.shivappa, dave.hansen, mingo, hpa, x86,
	linux-kernel, Reinette Chatre, linux-mm, Andrew Morton,
	Mike Kravetz, Michal Hocko, Vlastimil Babka

Memory allocation within the kernel as supported by the SLAB allocators
is limited by the maximum allocatable page order. With the default
maximum page order of 11 (allocation orders 0 through 10) it is not
possible for the SLAB allocators to allocate more than 4MB of
contiguous memory.

Large contiguous allocations are currently possible within the kernel
through the gigantic page support, the creation of which is currently
directed from user space.

Expose the gigantic page support within the kernel to enable memory
allocations that cannot be fulfilled by the SLAB allocators.
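
With these two functions exposed, an in-kernel user can obtain and
release a large physically contiguous region as sketched below. This is
a sketch only: the order (11, i.e. 8MB with 4KB base pages) and node are
examples and error handling is minimal:

#include <linux/gfp.h>
#include <linux/hugetlb.h>
#include <linux/mm.h>

/* Illustrative use of the newly exposed gigantic page API */
static struct page *example_contig_alloc(int nid)
{
	struct page *pg;

	pg = alloc_gigantic_page(nid, 11, GFP_KERNEL | __GFP_ZERO);
	if (!pg)
		return NULL;

	/* ... use page_to_virt(pg) to access the region ... */

	return pg;
}

static void example_contig_free(struct page *pg)
{
	free_gigantic_page(pg, 11);
}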

Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Cc: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>

---
 include/linux/hugetlb.h |  2 ++
 mm/hugetlb.c            | 10 ++++------
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 82a25880714a..8f2125dc8a86 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -349,6 +349,8 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
 				nodemask_t *nmask);
 int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
 			pgoff_t idx);
+struct page *alloc_gigantic_page(int nid, unsigned int order, gfp_t gfp_mask);
+void free_gigantic_page(struct page *page, unsigned int order);
 
 /* arch callback */
 int __init __alloc_bootmem_huge_page(struct hstate *h);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9a334f5fb730..f3f5e4ef3144 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1060,7 +1060,7 @@ static void destroy_compound_gigantic_page(struct page *page,
 	__ClearPageHead(page);
 }
 
-static void free_gigantic_page(struct page *page, unsigned int order)
+void free_gigantic_page(struct page *page, unsigned int order)
 {
 	free_contig_range(page_to_pfn(page), 1 << order);
 }
@@ -1108,17 +1108,15 @@ static bool zone_spans_last_pfn(const struct zone *zone,
 	return zone_spans_pfn(zone, last_pfn);
 }
 
-static struct page *alloc_gigantic_page(int nid, struct hstate *h)
+struct page *alloc_gigantic_page(int nid, unsigned int order, gfp_t gfp_mask)
 {
-	unsigned int order = huge_page_order(h);
 	unsigned long nr_pages = 1 << order;
 	unsigned long ret, pfn, flags;
 	struct zonelist *zonelist;
 	struct zone *zone;
 	struct zoneref *z;
-	gfp_t gfp_mask;
 
-	gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
+	gfp_mask = gfp_mask | __GFP_THISNODE;
 	zonelist = node_zonelist(nid, gfp_mask);
 	for_each_zone_zonelist_nodemask(zone, z, zonelist, gfp_zone(gfp_mask), NULL) {
 		spin_lock_irqsave(&zone->lock, flags);
@@ -1155,7 +1153,7 @@ static struct page *alloc_fresh_gigantic_page_node(struct hstate *h, int nid)
 {
 	struct page *page;
 
-	page = alloc_gigantic_page(nid, h);
+	page = alloc_gigantic_page(nid, huge_page_order(h), htlb_alloc_mask(h));
 	if (page) {
 		prep_compound_gigantic_page(page, huge_page_order(h));
 		prep_new_huge_page(h, page, nid);
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [RFC PATCH V2 22/22] x86/intel_rdt: Support contiguous memory of all sizes
  2018-02-13 15:46 [RFC PATCH V2 00/22] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling Reinette Chatre
                   ` (20 preceding siblings ...)
  2018-02-13 15:47 ` [RFC PATCH V2 21/22] mm/hugetlb: Enable large allocations through gigantic page API Reinette Chatre
@ 2018-02-13 15:47 ` Reinette Chatre
  2018-02-14 18:12 ` [RFC PATCH V2 00/22] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling Mike Kravetz
  22 siblings, 0 replies; 65+ messages in thread
From: Reinette Chatre @ 2018-02-13 15:47 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck
  Cc: gavin.hindman, vikas.shivappa, dave.hansen, mingo, hpa, x86,
	linux-kernel, Reinette Chatre

Through "mm/hugetlb: Enable large allocations through gigantic page
API" we are able to allocate contiguous memory regions larger than what
the SLAB allocators can support.

Use the alloc_gigantic_page/free_gigantic_page API to allocate large
contiguous memory regions in order to support pseudo-locked regions
larger than 4MB.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 89 ++++++++++++++++++++++-------
 1 file changed, 68 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index 90f040166fcd..99918943a98a 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -23,6 +23,7 @@
 #include <linux/cpu.h>
 #include <linux/cpumask.h>
 #include <linux/debugfs.h>
+#include <linux/hugetlb.h>
 #include <linux/kernfs.h>
 #include <linux/kref.h>
 #include <linux/kthread.h>
@@ -136,7 +137,7 @@ struct pseudo_lock_region {
 	bool			locked;
 	struct kref		refcount;
 	bool			deleted;
-	void			*kmem;
+	struct page		*kmem;
 #ifdef CONFIG_INTEL_RDT_DEBUGFS
 	struct dentry		*debugfs_dir;
 #endif
@@ -202,12 +203,69 @@ static int pseudo_lock_clos_set(struct pseudo_lock_region *plr,
 	return ret;
 }
 
+/**
+ * contig_mem_alloc - Allocate contiguous memory for pseudo-locked region
+ * @plr: pseudo-locked region for which memory is requested
+ *
+ * In an effort to ensure best coverage of cache with allocated memory
+ * (fewest conflicting physical addresses) allocate contiguous memory
+ * that will be pseudo-locked. The SLAB allocators are restricted in
+ * the maximum amount of memory they can allocate. If more memory is
+ * required than what can be requested from the SLAB allocators a
+ * gigantic page is requested instead.
+ */
+static int contig_mem_alloc(struct pseudo_lock_region *plr)
+{
+	void *kmem;
+
+	/*
+	 * We should not be allocating from the slab cache - we need whole
+	 * pages.
+	 */
+	if (plr->size < KMALLOC_MAX_CACHE_SIZE) {
+		rdt_last_cmd_puts("requested region smaller than page size\n");
+		return -EINVAL;
+	}
+
+	if (plr->size > KMALLOC_MAX_SIZE) {
+		plr->kmem = alloc_gigantic_page(cpu_to_node(plr->cpu),
+						get_order(plr->size),
+						GFP_KERNEL | __GFP_ZERO);
+		if (!plr->kmem) {
+			rdt_last_cmd_puts("unable to allocate gigantic page\n");
+			return -ENOMEM;
+		}
+	} else {
+		kmem = kzalloc(plr->size, GFP_KERNEL);
+		if (!kmem) {
+			rdt_last_cmd_puts("unable to allocate memory\n");
+			return -ENOMEM;
+		}
+
+		if (!PAGE_ALIGNED(kmem)) {
+			rdt_last_cmd_puts("received unaligned memory\n");
+			kfree(kmem);
+			return -ENOMEM;
+		}
+		plr->kmem = virt_to_page(kmem);
+	}
+	return 0;
+}
+
+static void contig_mem_free(struct pseudo_lock_region *plr)
+{
+	if (plr->size > KMALLOC_MAX_SIZE)
+		free_gigantic_page(plr->kmem, get_order(plr->size));
+	else
+		kfree(page_to_virt(plr->kmem));
+}
+
 static void pseudo_lock_region_clear(struct pseudo_lock_region *plr)
 {
-	plr->size = 0;
 	plr->line_size = 0;
-	kfree(plr->kmem);
+	contig_mem_free(plr);
 	plr->kmem = NULL;
+	plr->size = 0;
 	plr->r = NULL;
 	plr->d = NULL;
 }
@@ -444,7 +502,7 @@ static int measure_cycles_hist_fn(void *_plr)
 	 * local register variable used for memory pointer.
 	 */
 	__wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
-	mem_r = plr->kmem;
+	mem_r = page_to_virt(plr->kmem);
 	for (i = 0; i < plr->size; i += 32) {
 		start = rdtsc_ordered();
 		asm volatile("mov (%0,%1,1), %%eax\n\t"
@@ -568,7 +626,7 @@ static int measure_cycles_perf_fn(void *_plr)
 		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 3,
 				      l3_miss_bits);
 	}
-	mem_r = plr->kmem;
+	mem_r = page_to_virt(plr->kmem);
 	size = plr->size;
 	line_size = plr->line_size;
 	for (i = 0; i < size; i += line_size) {
@@ -912,20 +970,9 @@ static int pseudo_lock_region_init(struct pseudo_lock_region *plr,
 		return -ENOSPC;
 	}
 
-	/*
-	 * We do not yet support contiguous regions larger than
-	 * KMALLOC_MAX_SIZE
-	 */
-	if (plr->size > KMALLOC_MAX_SIZE) {
-		rdt_last_cmd_puts("requested region exceeds maximum size\n");
-		return -E2BIG;
-	}
-
-	plr->kmem = kzalloc(plr->size, GFP_KERNEL);
-	if (!plr->kmem) {
-		rdt_last_cmd_puts("unable to allocate memory\n");
+	ret = contig_mem_alloc(plr);
+	if (ret < 0)
 		return -ENOMEM;
-	}
 
 	plr->r = r;
 	plr->d = d;
@@ -996,7 +1043,7 @@ static int pseudo_lock_fn(void *_plr)
 	__wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
 	closid_p = this_cpu_read(pqr_state.cur_closid);
 	rmid_p = this_cpu_read(pqr_state.cur_rmid);
-	mem_r = plr->kmem;
+	mem_r = page_to_virt(plr->kmem);
 	size = plr->size;
 	line_size = plr->line_size;
 	__wrmsr(IA32_PQR_ASSOC, rmid_p, plr->closid);
@@ -1630,7 +1677,7 @@ static int pseudo_lock_dev_mmap(struct file *file, struct vm_area_struct *vma)
 		return -EINVAL;
 	}
 
-	physical = __pa(plr->kmem) >> PAGE_SHIFT;
+	physical = page_to_phys(plr->kmem) >> PAGE_SHIFT;
 	psize = plr->size - off;
 
 	if (off > plr->size) {
@@ -1652,7 +1699,7 @@ static int pseudo_lock_dev_mmap(struct file *file, struct vm_area_struct *vma)
 		return -ENOSPC;
 	}
 
-	memset(plr->kmem + off, 0, vsize);
+	memset(page_to_virt(plr->kmem) + off, 0, vsize);
 
 	if (remap_pfn_range(vma, vma->vm_start, physical + vma->vm_pgoff,
 			    vsize, vma->vm_page_prot)) {
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH V2 00/22] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling
  2018-02-13 15:46 [RFC PATCH V2 00/22] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling Reinette Chatre
                   ` (21 preceding siblings ...)
  2018-02-13 15:47 ` [RFC PATCH V2 22/22] x86/intel_rdt: Support contiguous memory of all sizes Reinette Chatre
@ 2018-02-14 18:12 ` Mike Kravetz
  2018-02-14 18:31   ` Reinette Chatre
  22 siblings, 1 reply; 65+ messages in thread
From: Mike Kravetz @ 2018-02-14 18:12 UTC (permalink / raw)
  To: Reinette Chatre, tglx, fenghua.yu, tony.luck
  Cc: gavin.hindman, vikas.shivappa, dave.hansen, mingo, hpa, x86,
	linux-kernel, linux-mm, Andrew Morton, Michal Hocko,
	Vlastimil Babka

On 02/13/2018 07:46 AM, Reinette Chatre wrote:
> Adding MM maintainers to v2 to share the new MM change (patch 21/22) that
> enables large contiguous regions that was created to support large Cache
> Pseudo-Locked regions (patch 22/22). This week MM team received another
> proposal to support large contiguous allocations ("[RFC PATCH 0/3]
> Interface for higher order contiguous allocations" at
> http://lkml.kernel.org/r/20180212222056.9735-1-mike.kravetz@oracle.com).
> I have not yet tested with this new proposal but it does seem appropriate
> and I should be able to rework patch 22 from this series on top of that if
> it is accepted instead of what I have in patch 21 of this series.
> 

Well, I certainly would prefer the adoption and use of a more general
purpose interface rather than exposing alloc_gigantic_page().

Both the interface I suggested and alloc_gigantic_page end up calling
alloc_contig_range().  I have not looked at your entire patch series, but
do be aware that in its present form alloc_contig_range will run into
issues if called by two threads simultaneously for the same page range.
Calling alloc_gigantic_page without some form of synchronization will
expose this issue.  Currently this is handled by hugetlb_lock for all
users of alloc_gigantic_page.  If you simply expose alloc_gigantic_page
without any type of synchronization, you may run into issues.  The first
patch in my RFC "mm: make start_isolate_page_range() fail if already
isolated" should handle this situation IF we decide to expose
alloc_gigantic_page (which I do not suggest).
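
For example, if alloc_gigantic_page() were exposed as-is, callers
outside hugetlb would need to provide their own serialization along
these lines (a sketch only; the mutex and wrapper are hypothetical and
only serialize callers of this wrapper against each other, not against
hugetlb itself):

#include <linux/hugetlb.h>
#include <linux/mutex.h>

/* Hypothetical caller-side serialization around the exposed API */
static DEFINE_MUTEX(contig_alloc_mutex);

static struct page *contig_alloc_serialized(int nid, unsigned int order,
					    gfp_t gfp_mask)
{
	struct page *pg;

	mutex_lock(&contig_alloc_mutex);
	pg = alloc_gigantic_page(nid, order, gfp_mask);
	mutex_unlock(&contig_alloc_mutex);

	return pg;
}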

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH V2 00/22] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling
  2018-02-14 18:12 ` [RFC PATCH V2 00/22] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling Mike Kravetz
@ 2018-02-14 18:31   ` Reinette Chatre
  2018-02-15 20:39     ` Reinette Chatre
  0 siblings, 1 reply; 65+ messages in thread
From: Reinette Chatre @ 2018-02-14 18:31 UTC (permalink / raw)
  To: Mike Kravetz, tglx, fenghua.yu, tony.luck
  Cc: gavin.hindman, vikas.shivappa, dave.hansen, mingo, hpa, x86,
	linux-kernel, linux-mm, Andrew Morton, Michal Hocko,
	Vlastimil Babka

Hi Mike,

On 2/14/2018 10:12 AM, Mike Kravetz wrote:
> On 02/13/2018 07:46 AM, Reinette Chatre wrote:
>> Adding MM maintainers to v2 to share the new MM change (patch 21/22) that
>> enables large contiguous regions that was created to support large Cache
>> Pseudo-Locked regions (patch 22/22). This week MM team received another
>> proposal to support large contiguous allocations ("[RFC PATCH 0/3]
>> Interface for higher order contiguous allocations" at
>> http://lkml.kernel.org/r/20180212222056.9735-1-mike.kravetz@oracle.com).
>> I have not yet tested with this new proposal but it does seem appropriate
>> and I should be able to rework patch 22 from this series on top of that if
>> it is accepted instead of what I have in patch 21 of this series.
>>
> 
> Well, I certainly would prefer the adoption and use of a more general
> purpose interface rather than exposing alloc_gigantic_page().
> 
> Both the interface I suggested and alloc_gigantic_page end up calling
> alloc_contig_range().  I have not looked at your entire patch series, but
> do be aware that in its present form alloc_contig_range will run into
> issues if called by two threads simultaneously for the same page range.
> Calling alloc_gigantic_page without some form of synchronization will
> expose this issue.  Currently this is handled by hugetlb_lock for all
> users of alloc_gigantic_page.  If you simply expose alloc_gigantic_page
> without any type of synchronization, you may run into issues.  The first
> patch in my RFC "mm: make start_isolate_page_range() fail if already
> isolated" should handle this situation IF we decide to expose
> alloc_gigantic_page (which I do not suggest).

My work depends on the ability to create large contiguous regions;
creating these large regions is not the goal in itself. Certainly I
would want to use the most appropriate mechanism and I would gladly
modify my work to do so.

I do not insist on using alloc_gigantic_page(). Now that I am aware of
your RFC I started the process to convert to the new
find_alloc_contig_pages(). I did not do so earlier because it was not
available when I prepared this work for submission. I plan to respond to
your RFC when my testing is complete but please give me a few days to do
so. Could you please also cc me if you do send out any new versions?

Thank you very much!

Reinette

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH V2 00/22] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling
  2018-02-14 18:31   ` Reinette Chatre
@ 2018-02-15 20:39     ` Reinette Chatre
  2018-02-15 21:10       ` Mike Kravetz
  0 siblings, 1 reply; 65+ messages in thread
From: Reinette Chatre @ 2018-02-15 20:39 UTC (permalink / raw)
  To: Mike Kravetz, tglx, fenghua.yu, tony.luck
  Cc: gavin.hindman, vikas.shivappa, dave.hansen, mingo, hpa, x86,
	linux-kernel, linux-mm, Andrew Morton, Michal Hocko,
	Vlastimil Babka

On 2/14/2018 10:31 AM, Reinette Chatre wrote:
> On 2/14/2018 10:12 AM, Mike Kravetz wrote:
>> On 02/13/2018 07:46 AM, Reinette Chatre wrote:
>>> Adding MM maintainers to v2 to share the new MM change (patch 21/22) that
>>> enables large contiguous regions that was created to support large Cache
>>> Pseudo-Locked regions (patch 22/22). This week MM team received another
>>> proposal to support large contiguous allocations ("[RFC PATCH 0/3]
>>> Interface for higher order contiguous allocations" at
>>> http://lkml.kernel.org/r/20180212222056.9735-1-mike.kravetz@oracle.com).
>>> I have not yet tested with this new proposal but it does seem appropriate
>>> and I should be able to rework patch 22 from this series on top of that if
>>> it is accepted instead of what I have in patch 21 of this series.
>>>
>>
>> Well, I certainly would prefer the adoption and use of a more general
>> purpose interface rather than exposing alloc_gigantic_page().
>>
>> Both the interface I suggested and alloc_gigantic_page end up calling
>> alloc_contig_range().  I have not looked at your entire patch series, but
>> do be aware that in its present form alloc_contig_range will run into
>> issues if called by two threads simultaneously for the same page range.
>> Calling alloc_gigantic_page without some form of synchronization will
>> expose this issue.  Currently this is handled by hugetlb_lock for all
>> users of alloc_gigantic_page.  If you simply expose alloc_gigantic_page
>> without any type of synchronization, you may run into issues.  The first
>> patch in my RFC "mm: make start_isolate_page_range() fail if already
>> isolated" should handle this situation IF we decide to expose
>> alloc_gigantic_page (which I do not suggest).
> 
> My work depends on the ability to create large contiguous regions,
> creating these large regions is not the goal in itself. Certainly I
> would want to use the most appropriate mechanism and I would gladly
> modify my work to do so.
> 
> I do not insist on using alloc_gigantic_page(). Now that I am aware of
> your RFC I started the process to convert to the new
> find_alloc_contig_pages(). I did not do so earlier because it was not
> available when I prepared this work for submission. I plan to respond to
> your RFC when my testing is complete but please give me a few days to do
> so. Could you please also cc me if you do send out any new versions?

Testing with the new find_alloc_contig_pages() introduced in
"[RFC PATCH 0/3] Interface for higher order contiguous allocations" at
http://lkml.kernel.org/r/20180212222056.9735-1-mike.kravetz@oracle.com
was successful. If this new interface is merged then Cache
Pseudo-Locking can easily be ported to use that instead of what I have
in patch 21/22 (exposing alloc_gigantic_page()) with the following
change to patch 22/22:


diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index 99918943a98a..b5e4ae379352 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -228,9 +228,10 @@ static int contig_mem_alloc(struct pseudo_lock_region *plr)
        }

        if (plr->size > KMALLOC_MAX_SIZE) {
-               plr->kmem = alloc_gigantic_page(cpu_to_node(plr->cpu),
-                                               get_order(plr->size),
-                                               GFP_KERNEL | __GFP_ZERO);
+               plr->kmem = find_alloc_contig_pages(get_order(plr->size),
+                                                   GFP_KERNEL | __GFP_ZERO,
+                                                   cpu_to_node(plr->cpu),
+                                                   NULL);
                if (!plr->kmem) {
                        rdt_last_cmd_puts("unable to allocate gigantic
page\n");
                        return -ENOMEM;
@@ -255,7 +256,7 @@ static int contig_mem_alloc(struct pseudo_lock_region *plr)
 static void contig_mem_free(struct pseudo_lock_region *plr)
 {
        if (plr->size > KMALLOC_MAX_SIZE)
-               free_gigantic_page(plr->kmem, get_order(plr->size));
+               free_contig_pages(plr->kmem, 1 << get_order(plr->size));
        else
                kfree(page_to_virt(plr->kmem));
 }


It does seem as though there will be a new API for large contiguous
allocations, eliminating the need for patch 21 of this series. How large
contiguous regions are allocated is independent of Cache Pseudo-Locking,
though, and the patch series as submitted still stands. I can include the
above snippet in a new version of the series but I am not sure if it is
preferred at this time. Please do let me know, I'd be happy to.

Reinette

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH V2 00/22] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling
  2018-02-15 20:39     ` Reinette Chatre
@ 2018-02-15 21:10       ` Mike Kravetz
  0 siblings, 0 replies; 65+ messages in thread
From: Mike Kravetz @ 2018-02-15 21:10 UTC (permalink / raw)
  To: Reinette Chatre, tglx, fenghua.yu, tony.luck
  Cc: gavin.hindman, vikas.shivappa, dave.hansen, mingo, hpa, x86,
	linux-kernel, linux-mm, Andrew Morton, Michal Hocko,
	Vlastimil Babka

On 02/15/2018 12:39 PM, Reinette Chatre wrote:
> On 2/14/2018 10:31 AM, Reinette Chatre wrote:
>> On 2/14/2018 10:12 AM, Mike Kravetz wrote:
>>> On 02/13/2018 07:46 AM, Reinette Chatre wrote:
>>>> Adding MM maintainers to v2 to share the new MM change (patch 21/22) that
>>>> enables large contiguous regions that was created to support large Cache
>>>> Pseudo-Locked regions (patch 22/22). This week MM team received another
>>>> proposal to support large contiguous allocations ("[RFC PATCH 0/3]
>>>> Interface for higher order contiguous allocations" at
>>>> http://lkml.kernel.org/r/20180212222056.9735-1-mike.kravetz@oracle.com).
>>>> I have not yet tested with this new proposal but it does seem appropriate
>>>> and I should be able to rework patch 22 from this series on top of that if
>>>> it is accepted instead of what I have in patch 21 of this series.
>>>>
>>>
>>> Well, I certainly would prefer the adoption and use of a more general
>>> purpose interface rather than exposing alloc_gigantic_page().
>>>
>>> Both the interface I suggested and alloc_gigantic_page end up calling
>>> alloc_contig_range().  I have not looked at your entire patch series, but
>>> do be aware that in its present form alloc_contig_range will run into
>>> issues if called by two threads simultaneously for the same page range.
>>> Calling alloc_gigantic_page without some form of synchronization will
>>> expose this issue.  Currently this is handled by hugetlb_lock for all
>>> users of alloc_gigantic_page.  If you simply expose alloc_gigantic_page
>>> without any type of synchronization, you may run into issues.  The first
>>> patch in my RFC "mm: make start_isolate_page_range() fail if already
>>> isolated" should handle this situation IF we decide to expose
>>> alloc_gigantic_page (which I do not suggest).
>>
>> My work depends on the ability to create large contiguous regions,
>> creating these large regions is not the goal in itself. Certainly I
>> would want to use the most appropriate mechanism and I would gladly
>> modify my work to do so.
>>
>> I do not insist on using alloc_gigantic_page(). Now that I am aware of
>> your RFC I started the process to convert to the new
>> find_alloc_contig_pages(). I did not do so earlier because it was not
>> available when I prepared this work for submission. I plan to respond to
>> your RFC when my testing is complete but please give me a few days to do
>> so. Could you please also cc me if you do send out any new versions?
> 
> Testing with the new find_alloc_contig_pages() introduced in
> "[RFC PATCH 0/3] Interface for higher order contiguous allocations" at
> http://lkml.kernel.org/r/20180212222056.9735-1-mike.kravetz@oracle.com
> was successful. If this new interface is merged then Cache
> Pseudo-Locking can easily be ported to use that instead of what I have
> in patch 21/22 (exposing alloc_gigantic_page()) with the following
> change to patch 22/22:
> 

Nice.  Thank you for converting and testing with this interface.
-- 
Mike Kravetz

> 
> diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
> index 99918943a98a..b5e4ae379352 100644
> --- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
> +++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
> @@ -228,9 +228,10 @@ static int contig_mem_alloc(struct pseudo_lock_region *plr)
>         }
> 
>         if (plr->size > KMALLOC_MAX_SIZE) {
> -               plr->kmem = alloc_gigantic_page(cpu_to_node(plr->cpu),
> -                                               get_order(plr->size),
> -                                               GFP_KERNEL | __GFP_ZERO);
> +               plr->kmem = find_alloc_contig_pages(get_order(plr->size),
> +                                                   GFP_KERNEL | __GFP_ZERO,
> +                                                   cpu_to_node(plr->cpu),
> +                                                   NULL);
>                 if (!plr->kmem) {
>                         rdt_last_cmd_puts("unable to allocate gigantic
> page\n");
>                         return -ENOMEM;
> @@ -255,7 +256,7 @@ static int contig_mem_alloc(struct
> pseudo_lock_region *plr)
>  static void contig_mem_free(struct pseudo_lock_region *plr)
>  {
>         if (plr->size > KMALLOC_MAX_SIZE)
> -               free_gigantic_page(plr->kmem, get_order(plr->size));
> +               free_contig_pages(plr->kmem, 1 << get_order(plr->size));
>         else
>                 kfree(page_to_virt(plr->kmem));
>  }
> 
> 
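
In plain (non-diff) form the converted call sites amount to the pattern
below. This is only a sketch: the find_alloc_contig_pages() and
free_contig_pages() signatures are taken from the RFC and may still
change, and the wrapper names are hypothetical.

        #include <linux/gfp.h>
        #include <linux/mm.h>
        #include <linux/topology.h>

        /* Allocate a zeroed, physically contiguous region on the node of @cpu. */
        static struct page *plr_contig_alloc(size_t size, int cpu)
        {
                return find_alloc_contig_pages(get_order(size),
                                               GFP_KERNEL | __GFP_ZERO,
                                               cpu_to_node(cpu), NULL);
        }

        /* Release a region obtained from plr_contig_alloc(). */
        static void plr_contig_free(struct page *pg, size_t size)
        {
                free_contig_pages(pg, 1 << get_order(size));
        }
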
> It does seem as though there will be a new API for large contiguous
> allocations, eliminating the need for patch 21 of this series. How large
> contiguous regions are allocated is independent of Cache Pseudo-Locking
> though and the patch series as submitted still stands. I can include the
> above snippet in a new version of the series but I am not sure if it is
> preferred at this time. Please do let me know, I'd be happy to.
> 
> Reinette
> 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH V2 01/22] x86/intel_rdt: Documentation for Cache Pseudo-Locking
  2018-02-13 15:46 ` [RFC PATCH V2 01/22] x86/intel_rdt: Documentation for Cache Pseudo-Locking Reinette Chatre
@ 2018-02-19 20:35   ` Thomas Gleixner
  2018-02-19 22:15     ` Reinette Chatre
  2018-02-19 21:27   ` Randy Dunlap
  1 sibling, 1 reply; 65+ messages in thread
From: Thomas Gleixner @ 2018-02-19 20:35 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: fenghua.yu, tony.luck, gavin.hindman, vikas.shivappa,
	dave.hansen, mingo, hpa, x86, linux-kernel

On Tue, 13 Feb 2018, Reinette Chatre wrote:
> +Cache Pseudo-Locking
> +--------------------
> +CAT enables a user to specify the amount of cache space into which an
> +application can fill. Cache pseudo-locking builds on the fact that a
> +CPU can still read and write data pre-allocated outside its current
> +allocated area on a cache hit. With cache pseudo-locking, data can be
> +preloaded into a reserved portion of cache that no application can
> +fill, and from that point on will only serve cache hits.

This lacks explanation how that preloading works.

> The cache
> +pseudo-locked memory is made accessible to user space where an
> +application can map it into its virtual address space and thus have
> +a region of memory with reduced average read latency.
> +
> +Cache pseudo-locking increases the probability that data will remain
> +in the cache via carefully configuring the CAT feature and controlling
> +application behavior. There is no guarantee that data is placed in
> +cache. Instructions like INVD, WBINVD, CLFLUSH, etc. can still evict
> +“locked” data from cache. Power management C-states may shrink or
> +power off cache. It is thus recommended to limit the processor maximum
> +C-state, for example, by setting the processor.max_cstate kernel parameter.
> +
> +It is required that an application using a pseudo-locked region runs
> +with affinity to the cores (or a subset of the cores) associated
> +with the cache on which the pseudo-locked region resides. This is
> +enforced by the implementation.

Well, you only enforce in pseudo_lock_dev_mmap() that the caller is affine
to the right CPUs. But that's not a guarantee that the task stays there.
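
For reference, the check in question amounts to no more than the sketch
below (names are illustrative, not the code in this series), i.e. a
one-time sanity check at mmap() time which does nothing to keep the task
on those CPUs afterwards:

        #include <linux/cpumask.h>
        #include <linux/sched.h>

        /*
         * Illustrative only: reject the mmap() unless the caller is already
         * affine to (a subset of) the CPUs sharing the cache instance that
         * backs the pseudo-locked region.
         */
        static bool plr_caller_affine(const struct cpumask *plr_cpus)
        {
                return cpumask_subset(&current->cpus_allowed, plr_cpus);
        }
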

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH V2 06/22] x86/intel_rdt: Create pseudo-locked regions
  2018-02-13 15:46 ` [RFC PATCH V2 06/22] x86/intel_rdt: Create pseudo-locked regions Reinette Chatre
@ 2018-02-19 20:57   ` Thomas Gleixner
  2018-02-19 23:02     ` Reinette Chatre
  0 siblings, 1 reply; 65+ messages in thread
From: Thomas Gleixner @ 2018-02-19 20:57 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: fenghua.yu, tony.luck, gavin.hindman, vikas.shivappa,
	dave.hansen, mingo, hpa, x86, linux-kernel

On Tue, 13 Feb 2018, Reinette Chatre wrote:

> System administrator creates/removes pseudo-locked regions by
> creating/removing directories in the pseudo-lock subdirectory of the
> resctrl filesystem. Here we add directory creation and removal support.
> 
> A "pseudo-lock region" is introduced, which represents an
> instance of a pseudo-locked cache region. During mkdir a new region is
> created but since we do not know which cache it belongs to at that time
> we maintain a global pointer to it from where it will be moved to the cache
> (rdt_domain) it belongs to after initialization. This implies that
> we only support one uninitialized pseudo-locked region at a time.

Whats the reason for this restriction? If there are uninitialized
directories, so what?

> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
> ---
>  arch/x86/kernel/cpu/intel_rdt.h             |   3 +
>  arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 220 +++++++++++++++++++++++++++-
>  2 files changed, 222 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
> index 8f5ded384e19..55f085985072 100644
> --- a/arch/x86/kernel/cpu/intel_rdt.h
> +++ b/arch/x86/kernel/cpu/intel_rdt.h
> @@ -352,6 +352,7 @@ extern struct mutex rdtgroup_mutex;
>  extern struct rdt_resource rdt_resources_all[];
>  extern struct rdtgroup rdtgroup_default;
>  DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
> +extern struct kernfs_node *pseudo_lock_kn;
>  
>  int __init rdtgroup_init(void);
>  
> @@ -457,5 +458,7 @@ bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
>  void __check_limbo(struct rdt_domain *d, bool force_free);
>  int rdt_pseudo_lock_fs_init(struct kernfs_node *root);
>  void rdt_pseudo_lock_fs_remove(void);
> +int rdt_pseudo_lock_mkdir(const char *name, umode_t mode);
> +int rdt_pseudo_lock_rmdir(struct kernfs_node *kn);
>  
>  #endif /* _ASM_X86_INTEL_RDT_H */
> diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
> index a787a103c432..7a22e367b82f 100644
> --- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
> +++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
> @@ -20,11 +20,142 @@
>  #define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
>  
>  #include <linux/kernfs.h>
> +#include <linux/kref.h>
>  #include <linux/seq_file.h>
>  #include <linux/stat.h>
> +#include <linux/slab.h>
>  #include "intel_rdt.h"
>  
> -static struct kernfs_node *pseudo_lock_kn;
> +struct kernfs_node *pseudo_lock_kn;
> +
> +/*
> + * Protect the pseudo_lock_region access. Since we will link to
> + * pseudo_lock_region from rdt domains rdtgroup_mutex should be obtained
> + * first if needed.
> + */
> +static DEFINE_MUTEX(rdt_pseudo_lock_mutex);
> +
> +/**
> + * struct pseudo_lock_region - pseudo-lock region information
> + * @kn:			kernfs node representing this region in the resctrl
> + *			filesystem
> + * @cbm:		bitmask of the pseudo-locked region
> + * @cpu:		core associated with the cache on which the setup code
> + *			will be run
> + * @minor:		minor number of character device associated with this
> + *			region
> + * @locked:		state indicating if this region has been locked or not
> + * @refcount:		how many are waiting to access this pseudo-lock
> + *			region via kernfs
> + * @deleted:		user requested removal of region via rmdir on kernfs
> + */
> +struct pseudo_lock_region {
> +	struct kernfs_node	*kn;
> +	u32			cbm;
> +	int			cpu;
> +	unsigned int		minor;
> +	bool			locked;
> +	struct kref		refcount;
> +	bool			deleted;
> +};
> +
> +/*
> + * Only one uninitialized pseudo-locked region can exist at a time. An
> + * uninitialized pseudo-locked region is created when the user creates a
> + * new directory within the pseudo_lock subdirectory of the resctrl
> + * filesystem. The user will initialize the pseudo-locked region by writing
> + * to its schemata file at which point this structure will be moved to the
> + * cache domain it belongs to.
> + */
> +static struct pseudo_lock_region *new_plr;

Why isn't the struct pointer not stored in the corresponding kernfs's node->priv?

> +static void __pseudo_lock_region_release(struct pseudo_lock_region *plr)
> +{
> +	bool is_new_plr = (plr == new_plr);
> +
> +	WARN_ON(!plr->deleted);
> +	if (!plr->deleted)
> +		return;

  	if (WARN_ON(...))
		return;

> +
> +	kfree(plr);
> +	if (is_new_plr)
> +		new_plr = NULL;
> +}
> +
> +static void pseudo_lock_region_release(struct kref *ref)
> +{
> +	struct pseudo_lock_region *plr = container_of(ref,
> +						      struct pseudo_lock_region,
> +						      refcount);

You simply can avoid those line breaks by:

	struct pseudo_lock_region *plr;

	plr = container_of(ref, struct pseudo_lock_region, refcount);

Hmm?

> +	mutex_lock(&rdt_pseudo_lock_mutex);
> +	__pseudo_lock_region_release(plr);
> +	mutex_unlock(&rdt_pseudo_lock_mutex);
> +}
> +
> +/**
> + * pseudo_lock_region_kn_lock - Obtain lock to pseudo-lock region kernfs node
> + *
> + * This is called from the kernfs related functions which are called with
> + * an active reference to the kernfs_node that contains a valid pointer to
> + * the pseudo-lock region it represents. We can thus safely take an active
> + * reference to the pseudo-lock region before dropping the reference to the
> + * kernfs_node.
> + *
> + * We need to handle the scenarios where the kernfs directory representing
> + * this pseudo-lock region can be removed while an application still has an
> + * open handle to one of the directory's files and operations on this
> + * handle are attempted.
> + * To support this we allow a file operation to drop its reference to the
> + * kernfs_node so that the removal can proceed, while using the mutex to
> + * ensure these operations on the pseudo-lock region are serialized. At the
> + * time an operation does obtain access to the region it may thus have been
> + * deleted.
> + */
> +static struct pseudo_lock_region *pseudo_lock_region_kn_lock(
> +						struct kernfs_node *kn)
> +{
> +	struct pseudo_lock_region *plr = (kernfs_type(kn) == KERNFS_DIR) ?
> +						kn->priv : kn->parent->priv;

See above.

> +int rdt_pseudo_lock_mkdir(const char *name, umode_t mode)
> +{
> +	struct pseudo_lock_region *plr;
> +	struct kernfs_node *kn;
> +	int ret = 0;
> +
> +	mutex_lock(&rdtgroup_mutex);
> +	mutex_lock(&rdt_pseudo_lock_mutex);
> +
> +	if (new_plr) {
> +		ret = -ENOSPC;
> +		goto out;
> +	}
> +
> +	plr = kzalloc(sizeof(*plr), GFP_KERNEL);
> +	if (!plr) {
> +		ret = -ENOSPC;

  ENOMEM is the proper error code here.

> +		goto out;
> +	}
> +
> +	kn = kernfs_create_dir(pseudo_lock_kn, name, mode, plr);
> +	if (IS_ERR(kn)) {
> +		ret = PTR_ERR(kn);
> +		goto out_free;
> +	}
> +
> +	plr->kn = kn;
> +	ret = rdtgroup_kn_set_ugid(kn);
> +	if (ret)
> +		goto out_remove;
> +
> +	kref_init(&plr->refcount);
> +	kernfs_activate(kn);
> +	new_plr = plr;
> +	ret = 0;
> +	goto out;
> +
> +out_remove:
> +	kernfs_remove(kn);
> +out_free:
> +	kfree(plr);
> +out:
> +	mutex_unlock(&rdt_pseudo_lock_mutex);
> +	mutex_unlock(&rdtgroup_mutex);
> +	return ret;
> +}
> +
> +/*
> + * rdt_pseudo_lock_rmdir - Remove pseudo-lock region
> + *
> + * LOCKING:
> + * Since the pseudo-locked region can be associated with a RDT domain at
> + * removal we take both rdtgroup_mutex and rdt_pseudo_lock_mutex to protect
> + * the rdt_domain access as well as the pseudo_lock_region access.

Is there a real reason / benefit for having this second mutex?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH V2 08/22] x86/intel_rdt: Introduce pseudo-locking resctrl files
  2018-02-13 15:46 ` [RFC PATCH V2 08/22] x86/intel_rdt: Introduce pseudo-locking resctrl files Reinette Chatre
@ 2018-02-19 21:01   ` Thomas Gleixner
  0 siblings, 0 replies; 65+ messages in thread
From: Thomas Gleixner @ 2018-02-19 21:01 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: fenghua.yu, tony.luck, gavin.hindman, vikas.shivappa,
	dave.hansen, mingo, hpa, x86, linux-kernel

On Tue, 13 Feb 2018, Reinette Chatre wrote:
> diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
> index 7a22e367b82f..94bd1b4fbfee 100644
> --- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
> +++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
> @@ -40,6 +40,7 @@ static DEFINE_MUTEX(rdt_pseudo_lock_mutex);
>   * @kn:			kernfs node representing this region in the resctrl
>   *			filesystem
>   * @cbm:		bitmask of the pseudo-locked region
> + * @size:		size of pseudo-locked region in bytes
>   * @cpu:		core associated with the cache on which the setup code
>   *			will be run
>   * @minor:		minor number of character device associated with this
> @@ -52,6 +53,7 @@ static DEFINE_MUTEX(rdt_pseudo_lock_mutex);
>  struct pseudo_lock_region {
>  	struct kernfs_node	*kn;
>  	u32			cbm;
> +	unsigned int		size;
>  	int			cpu;
>  	unsigned int		minor;
>  	bool			locked;
> @@ -227,6 +229,49 @@ static struct kernfs_ops pseudo_lock_avail_ops = {
>  	.seq_show		= pseudo_lock_avail_show,
>  };
>  
> +int pseudo_lock_schemata_show(struct kernfs_open_file *of,
> +			      struct seq_file *seq, void *v)
> +{
> +	struct pseudo_lock_region *plr;
> +	struct rdt_resource *r;
> +	int ret = 0;
> +
> +	plr = pseudo_lock_region_kn_lock(of->kn);
> +	if (!plr) {
> +		ret = -ENOENT;
> +		goto out;
> +	}
> +
> +	if (!plr->locked) {
> +		for_each_alloc_enabled_rdt_resource(r) {
> +			seq_printf(seq, "%s:uninitialized\n", r->name);
> +		}

Surplus braces around the for_...()
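
IOW the quoted hunk minus the extra braces:

        if (!plr->locked) {
                for_each_alloc_enabled_rdt_resource(r)
                        seq_printf(seq, "%s:uninitialized\n", r->name);
        }
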

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH V2 11/22] x86/intel_rdt: Associate pseudo-locked regions with its domain
  2018-02-13 15:46 ` [RFC PATCH V2 11/22] x86/intel_rdt: Associate pseudo-locked regions with its domain Reinette Chatre
@ 2018-02-19 21:19   ` Thomas Gleixner
  2018-02-19 23:00     ` Reinette Chatre
  0 siblings, 1 reply; 65+ messages in thread
From: Thomas Gleixner @ 2018-02-19 21:19 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: fenghua.yu, tony.luck, gavin.hindman, vikas.shivappa,
	dave.hansen, mingo, hpa, x86, linux-kernel

On Tue, 13 Feb 2018, Reinette Chatre wrote:

> After a pseudo-locked region is locked it needs to be associated with
> the RDT domain representing the pseudo-locked cache so that its life
> cycle can be managed correctly.
> 
> Only a single pseudo-locked region can exist on any cache instance so we
> maintain a single pointer to a pseudo-locked region from each RDT
> domain.

Why is only a single pseudo locked region possible? 

> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
> ---
>  arch/x86/kernel/cpu/intel_rdt.h | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
> index 060a0976ac00..f0e020686e99 100644
> --- a/arch/x86/kernel/cpu/intel_rdt.h
> +++ b/arch/x86/kernel/cpu/intel_rdt.h
> @@ -187,6 +187,8 @@ struct mbm_state {
>  	u64	prev_msr;
>  };
>  
> +struct pseudo_lock_region;
> +
>  /**
>   * struct rdt_domain - group of cpus sharing an RDT resource
>   * @list:	all instances of this resource
> @@ -205,6 +207,7 @@ struct mbm_state {
>   * @ctrl_val:	array of cache or mem ctrl values (indexed by CLOSID)
>   * @new_ctrl:	new ctrl value to be loaded
>   * @have_new_ctrl: did user provide new_ctrl for this domain
> + * @plr:	pseudo-locked region associated with this domain
>   */
>  struct rdt_domain {
>  	struct list_head	list;
> @@ -220,6 +223,7 @@ struct rdt_domain {
>  	u32			*ctrl_val;
>  	u32			new_ctrl;
>  	bool			have_new_ctrl;
> +	struct pseudo_lock_region	*plr;

Please keep the tabular fashion of the struct declaration intact.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH V2 01/22] x86/intel_rdt: Documentation for Cache Pseudo-Locking
  2018-02-13 15:46 ` [RFC PATCH V2 01/22] x86/intel_rdt: Documentation for Cache Pseudo-Locking Reinette Chatre
  2018-02-19 20:35   ` Thomas Gleixner
@ 2018-02-19 21:27   ` Randy Dunlap
  2018-02-19 22:21     ` Reinette Chatre
  1 sibling, 1 reply; 65+ messages in thread
From: Randy Dunlap @ 2018-02-19 21:27 UTC (permalink / raw)
  To: Reinette Chatre, tglx, fenghua.yu, tony.luck
  Cc: gavin.hindman, vikas.shivappa, dave.hansen, mingo, hpa, x86,
	linux-kernel

On 02/13/18 07:46, Reinette Chatre wrote:
> Add description of Cache Pseudo-Locking feature, its interface,
> as well as an example of its usage.
> 
> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
> ---
>  Documentation/x86/intel_rdt_ui.txt | 229 ++++++++++++++++++++++++++++++++++++-
>  1 file changed, 228 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/x86/intel_rdt_ui.txt b/Documentation/x86/intel_rdt_ui.txt
> index 756fd76b78a6..bb3d6fe0a3e4 100644
> --- a/Documentation/x86/intel_rdt_ui.txt
> +++ b/Documentation/x86/intel_rdt_ui.txt

> @@ -329,6 +332,149 @@ L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
>  L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
>  L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
>  
> +Cache Pseudo-Locking
> +--------------------
> +CAT enables a user to specify the amount of cache space into which an

                                                     space that an

> +application can fill. Cache pseudo-locking builds on the fact that a
> +CPU can still read and write data pre-allocated outside its current
> +allocated area on a cache hit. With cache pseudo-locking, data can be
> +preloaded into a reserved portion of cache that no application can
> +fill, and from that point on will only serve cache hits. The cache
> +pseudo-locked memory is made accessible to user space where an
> +application can map it into its virtual address space and thus have
> +a region of memory with reduced average read latency.
> +
> +Cache pseudo-locking increases the probability that data will remain
> +in the cache via carefully configuring the CAT feature and controlling
> +application behavior. There is no guarantee that data is placed in
> +cache. Instructions like INVD, WBINVD, CLFLUSH, etc. can still evict
> +“locked” data from cache. Power management C-states may shrink or
> +power off cache. It is thus recommended to limit the processor maximum
> +C-state, for example, by setting the processor.max_cstate kernel parameter.
> +
> +It is required that an application using a pseudo-locked region runs
> +with affinity to the cores (or a subset of the cores) associated
> +with the cache on which the pseudo-locked region resides. This is
> +enforced by the implementation.
> +
> +Pseudo-locking is accomplished in two stages:
> +1) During the first stage the system administrator allocates a portion
> +   of cache that should be dedicated to pseudo-locking. At this time an
> +   equivalent portion of memory is allocated, loaded into allocated
> +   cache portion, and exposed as a character device.
> +2) During the second stage a user-space application maps (mmap()) the
> +   pseudo-locked memory into its address space.
> +
> +Cache Pseudo-Locking Interface
> +------------------------------
> +Platforms supporting cache pseudo-locking will expose a new
> +"/sys/fs/restrl/pseudo_lock" directory after successful mount of the
> +resctrl filesystem. Initially this directory will contain a single file,
> +"avail" that contains the schemata, one line per resource, of cache region
> +available for pseudo-locking.

uh, sysfs is supposed to be one value per file.

> +A pseudo-locked region is created by creating a new directory within
> +/sys/fs/resctrl/pseudo_lock. On success two new files will appear in
> +the directory:
> +
> +"schemata":
> +	Shows the schemata representing the pseudo-locked cache region.
> +	User writes schemata of requested locked area to file.

	use complete sentences, please. E.g.:

	The user writes the schemata of the requested locked area to the file.

> +	Only one id of single resource accepted - can only lock from

	            of a single resource is accepted -

> +	single cache instance. Writing of schemata to this file will
> +	return success on successful pseudo-locked region setup.
> +"size":
> +	After successful pseudo-locked region setup this read-only file
> +	will contain the size in bytes of pseudo-locked region.


-- 
~Randy

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH V2 01/22] x86/intel_rdt: Documentation for Cache Pseudo-Locking
  2018-02-19 20:35   ` Thomas Gleixner
@ 2018-02-19 22:15     ` Reinette Chatre
  2018-02-19 22:19       ` Thomas Gleixner
  0 siblings, 1 reply; 65+ messages in thread
From: Reinette Chatre @ 2018-02-19 22:15 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: fenghua.yu, tony.luck, gavin.hindman, vikas.shivappa,
	dave.hansen, mingo, hpa, x86, linux-kernel

Hi Thomas,

On 2/19/2018 12:35 PM, Thomas Gleixner wrote:
> On Tue, 13 Feb 2018, Reinette Chatre wrote:
>> +Cache Pseudo-Locking
>> +--------------------
>> +CAT enables a user to specify the amount of cache space into which an
>> +application can fill. Cache pseudo-locking builds on the fact that a
>> +CPU can still read and write data pre-allocated outside its current
>> +allocated area on a cache hit. With cache pseudo-locking, data can be
>> +preloaded into a reserved portion of cache that no application can
>> +fill, and from that point on will only serve cache hits.
> 
> This lacks explanation how that preloading works.

Following this text you quote there is a brief explanation starting with
"Pseudo-locking is accomplished in two stages:" - I'll add more details
to that area.

> 
>> The cache
>> +pseudo-locked memory is made accessible to user space where an
>> +application can map it into its virtual address space and thus have
>> +a region of memory with reduced average read latency.
>> +
>> +Cache pseudo-locking increases the probability that data will remain
>> +in the cache via carefully configuring the CAT feature and controlling
>> +application behavior. There is no guarantee that data is placed in
>> +cache. Instructions like INVD, WBINVD, CLFLUSH, etc. can still evict
>> +“locked” data from cache. Power management C-states may shrink or
>> +power off cache. It is thus recommended to limit the processor maximum
>> +C-state, for example, by setting the processor.max_cstate kernel parameter.
>> +
>> +It is required that an application using a pseudo-locked region runs
>> +with affinity to the cores (or a subset of the cores) associated
>> +with the cache on which the pseudo-locked region resides. This is
>> +enforced by the implementation.
> 
> Well, you only enforce in pseudo_lock_dev_mmap() that the caller is affine
> to the right CPUs. But that's not a guarantee that the task stays there.

It is required that the user space application sets its own affinity to
cores associated with the cache. This is also highlighted in the example
application code (later in this patch) within the comments as well as in
the example usage of sched_setaffinity(). The enforcement done in the
kernel code is a check that the user space application did so, not the
actual affinity management.
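
For illustration, the user space side follows roughly the pattern below
(a simplified sketch; the device path, size and CPU are placeholders,
and the example application in this patch is more complete):

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <sched.h>
        #include <stddef.h>
        #include <sys/mman.h>
        #include <unistd.h>

        /*
         * Pin this task to a CPU that shares the cache backing the
         * pseudo-locked region, then mmap() the region's character device.
         */
        static void *map_pseudo_locked(const char *dev, size_t size, int cpu)
        {
                cpu_set_t cpus;
                void *mem;
                int fd;

                CPU_ZERO(&cpus);
                CPU_SET(cpu, &cpus);
                if (sched_setaffinity(0, sizeof(cpus), &cpus) < 0)
                        return NULL;

                fd = open(dev, O_RDWR);
                if (fd < 0)
                        return NULL;

                mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
                close(fd);

                return mem == MAP_FAILED ? NULL : mem;
        }
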

Reinette

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH V2 01/22] x86/intel_rdt: Documentation for Cache Pseudo-Locking
  2018-02-19 22:15     ` Reinette Chatre
@ 2018-02-19 22:19       ` Thomas Gleixner
  2018-02-19 22:24         ` Reinette Chatre
  0 siblings, 1 reply; 65+ messages in thread
From: Thomas Gleixner @ 2018-02-19 22:19 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: fenghua.yu, tony.luck, gavin.hindman, vikas.shivappa,
	dave.hansen, mingo, hpa, x86, linux-kernel

On Mon, 19 Feb 2018, Reinette Chatre wrote:
> Hi Thomas,
> 
> On 2/19/2018 12:35 PM, Thomas Gleixner wrote:
> > On Tue, 13 Feb 2018, Reinette Chatre wrote:
> >> +Cache Pseudo-Locking
> >> +--------------------
> >> +CAT enables a user to specify the amount of cache space into which an
> >> +application can fill. Cache pseudo-locking builds on the fact that a
> >> +CPU can still read and write data pre-allocated outside its current
> >> +allocated area on a cache hit. With cache pseudo-locking, data can be
> >> +preloaded into a reserved portion of cache that no application can
> >> +fill, and from that point on will only serve cache hits.
> > 
> > This lacks explanation how that preloading works.
> 
> Following this text you quote there is a brief explanation starting with
> "Pseudo-locking is accomplished in two stages:" - I'll add more details
> to that area.
> 
> > 
> >> The cache
> >> +pseudo-locked memory is made accessible to user space where an
> >> +application can map it into its virtual address space and thus have
> >> +a region of memory with reduced average read latency.
> >> +
> >> +Cache pseudo-locking increases the probability that data will remain
> >> +in the cache via carefully configuring the CAT feature and controlling
> >> +application behavior. There is no guarantee that data is placed in
> >> +cache. Instructions like INVD, WBINVD, CLFLUSH, etc. can still evict
> >> +“locked” data from cache. Power management C-states may shrink or
> >> +power off cache. It is thus recommended to limit the processor maximum
> >> +C-state, for example, by setting the processor.max_cstate kernel parameter.
> >> +
> >> +It is required that an application using a pseudo-locked region runs
> >> +with affinity to the cores (or a subset of the cores) associated
> >> +with the cache on which the pseudo-locked region resides. This is
> >> +enforced by the implementation.
> > 
> > Well, you only enforce in pseudo_lock_dev_mmap() that the caller is affine
> > to the right CPUs. But that's not a guarantee that the task stays there.
> 
> It is required that the user space application sets its own affinity to
> cores associated with the cache. This is also highlighted in the example
> application code (later in this patch) within the comments as well as in
> the example usage of sched_setaffinity(). The enforcement done in the
> kernel code is a check that the user space application did so, not the
> actual affinity management.

Right, but your documentation claims it's enforced. There is no enforcement
aside from the initial sanity check.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH V2 01/22] x86/intel_rdt: Documentation for Cache Pseudo-Locking
  2018-02-19 21:27   ` Randy Dunlap
@ 2018-02-19 22:21     ` Reinette Chatre
  0 siblings, 0 replies; 65+ messages in thread
From: Reinette Chatre @ 2018-02-19 22:21 UTC (permalink / raw)
  To: Randy Dunlap, tglx, fenghua.yu, tony.luck
  Cc: gavin.hindman, vikas.shivappa, dave.hansen, mingo, hpa, x86,
	linux-kernel

Hi Randy,

On 2/19/2018 1:27 PM, Randy Dunlap wrote:
> On 02/13/18 07:46, Reinette Chatre wrote:
>> Add description of Cache Pseudo-Locking feature, its interface,
>> as well as an example of its usage.
>>
>> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
>> ---
>>  Documentation/x86/intel_rdt_ui.txt | 229 ++++++++++++++++++++++++++++++++++++-
>>  1 file changed, 228 insertions(+), 1 deletion(-)
>>
>> diff --git a/Documentation/x86/intel_rdt_ui.txt b/Documentation/x86/intel_rdt_ui.txt
>> index 756fd76b78a6..bb3d6fe0a3e4 100644
>> --- a/Documentation/x86/intel_rdt_ui.txt
>> +++ b/Documentation/x86/intel_rdt_ui.txt
> 
>> @@ -329,6 +332,149 @@ L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
>>  L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
>>  L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
>>  
>> +Cache Pseudo-Locking
>> +--------------------
>> +CAT enables a user to specify the amount of cache space into which an
> 
>                                                      space that an

Will fix.

> 
>> +application can fill. Cache pseudo-locking builds on the fact that a
>> +CPU can still read and write data pre-allocated outside its current
>> +allocated area on a cache hit. With cache pseudo-locking, data can be
>> +preloaded into a reserved portion of cache that no application can
>> +fill, and from that point on will only serve cache hits. The cache
>> +pseudo-locked memory is made accessible to user space where an
>> +application can map it into its virtual address space and thus have
>> +a region of memory with reduced average read latency.
>> +
>> +Cache pseudo-locking increases the probability that data will remain
>> +in the cache via carefully configuring the CAT feature and controlling
>> +application behavior. There is no guarantee that data is placed in
>> +cache. Instructions like INVD, WBINVD, CLFLUSH, etc. can still evict
>> +“locked” data from cache. Power management C-states may shrink or
>> +power off cache. It is thus recommended to limit the processor maximum
>> +C-state, for example, by setting the processor.max_cstate kernel parameter.
>> +
>> +It is required that an application using a pseudo-locked region runs
>> +with affinity to the cores (or a subset of the cores) associated
>> +with the cache on which the pseudo-locked region resides. This is
>> +enforced by the implementation.
>> +
>> +Pseudo-locking is accomplished in two stages:
>> +1) During the first stage the system administrator allocates a portion
>> +   of cache that should be dedicated to pseudo-locking. At this time an
>> +   equivalent portion of memory is allocated, loaded into allocated
>> +   cache portion, and exposed as a character device.
>> +2) During the second stage a user-space application maps (mmap()) the
>> +   pseudo-locked memory into its address space.
>> +
>> +Cache Pseudo-Locking Interface
>> +------------------------------
>> +Platforms supporting cache pseudo-locking will expose a new
>> +"/sys/fs/restrl/pseudo_lock" directory after successful mount of the
>> +resctrl filesystem. Initially this directory will contain a single file,
>> +"avail" that contains the schemata, one line per resource, of cache region
>> +available for pseudo-locking.
> 
> uh, sysfs is supposed to be one value per file.

This builds on how the schemata file currently works.

>> +A pseudo-locked region is created by creating a new directory within
>> +/sys/fs/resctrl/pseudo_lock. On success two new files will appear in
>> +the directory:
>> +
>> +"schemata":
>> +	Shows the schemata representing the pseudo-locked cache region.
>> +	User writes schemata of requested locked area to file.
> 
> 	use complete sentences, please. E.g.:
> 
> 	The user writes the schemata of the requested locked area to the file.
> 
>> +	Only one id of single resource accepted - can only lock from
> 
> 	            of a single resource is accepted -
> 

Will fix both.

>> +	single cache instance. Writing of schemata to this file will
>> +	return success on successful pseudo-locked region setup.
>> +"size":
>> +	After successful pseudo-locked region setup this read-only file
>> +	will contain the size in bytes of pseudo-locked region.
> 
> 

Thank you very much for taking a look!

Reinette

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH V2 01/22] x86/intel_rdt: Documentation for Cache Pseudo-Locking
  2018-02-19 22:19       ` Thomas Gleixner
@ 2018-02-19 22:24         ` Reinette Chatre
  0 siblings, 0 replies; 65+ messages in thread
From: Reinette Chatre @ 2018-02-19 22:24 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: fenghua.yu, tony.luck, gavin.hindman, vikas.shivappa,
	dave.hansen, mingo, hpa, x86, linux-kernel

Hi Thomas,

On 2/19/2018 2:19 PM, Thomas Gleixner wrote:
>> It is required that the user space application self sets affinity to
>> cores associated with the cache. This is also highlighted in the example
>> application code (later in this patch) within the comments as well as
>> the example usage of sched_setaffinity(). The enforcement done in the
>> kernel code is done as a check that the user space application did so,
>> no the actual affinity management.
> 
> Right, but your documentation claims it's enforced. There is no enforcement
> aside of the initial sanity check.

I see the confusion. I will fix the documentation to clarify that it is
a sanity check.

Thank you

Reinette

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH V2 11/22] x86/intel_rdt: Associate pseudo-locked regions with its domain
  2018-02-19 21:19   ` Thomas Gleixner
@ 2018-02-19 23:00     ` Reinette Chatre
  2018-02-19 23:19       ` Thomas Gleixner
  0 siblings, 1 reply; 65+ messages in thread
From: Reinette Chatre @ 2018-02-19 23:00 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: fenghua.yu, tony.luck, gavin.hindman, vikas.shivappa,
	dave.hansen, mingo, hpa, x86, linux-kernel

Hi Thomas,

On 2/19/2018 1:19 PM, Thomas Gleixner wrote:
> On Tue, 13 Feb 2018, Reinette Chatre wrote:
> 
>> After a pseudo-locked region is locked it needs to be associated with
>> the RDT domain representing the pseudo-locked cache so that its life
>> cycle can be managed correctly.
>>
>> Only a single pseudo-locked region can exist on any cache instance so we
>> maintain a single pointer to a pseudo-locked region from each RDT
>> domain.
> 
> Why is only a single pseudo locked region possible? 

The setup of a pseudo-locked region requires the usage of wbinvd. If a
second pseudo-locked region is thus attempted it will evict the
pseudo-locked data of the first.

> 
>> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
>> ---
>>  arch/x86/kernel/cpu/intel_rdt.h | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
>> index 060a0976ac00..f0e020686e99 100644
>> --- a/arch/x86/kernel/cpu/intel_rdt.h
>> +++ b/arch/x86/kernel/cpu/intel_rdt.h
>> @@ -187,6 +187,8 @@ struct mbm_state {
>>  	u64	prev_msr;
>>  };
>>  
>> +struct pseudo_lock_region;
>> +
>>  /**
>>   * struct rdt_domain - group of cpus sharing an RDT resource
>>   * @list:	all instances of this resource
>> @@ -205,6 +207,7 @@ struct mbm_state {
>>   * @ctrl_val:	array of cache or mem ctrl values (indexed by CLOSID)
>>   * @new_ctrl:	new ctrl value to be loaded
>>   * @have_new_ctrl: did user provide new_ctrl for this domain
>> + * @plr:	pseudo-locked region associated with this domain
>>   */
>>  struct rdt_domain {
>>  	struct list_head	list;
>> @@ -220,6 +223,7 @@ struct rdt_domain {
>>  	u32			*ctrl_val;
>>  	u32			new_ctrl;
>>  	bool			have_new_ctrl;
>> +	struct pseudo_lock_region	*plr;
> 
> Please keep the tabular fashion of the struct declaration intact.

Will fix.

Reinette

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH V2 06/22] x86/intel_rdt: Create pseudo-locked regions
  2018-02-19 20:57   ` Thomas Gleixner
@ 2018-02-19 23:02     ` Reinette Chatre
  2018-02-19 23:16       ` Thomas Gleixner
  0 siblings, 1 reply; 65+ messages in thread
From: Reinette Chatre @ 2018-02-19 23:02 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: fenghua.yu, tony.luck, gavin.hindman, vikas.shivappa,
	dave.hansen, mingo, hpa, x86, linux-kernel

Hi Thomas,

On 2/19/2018 12:57 PM, Thomas Gleixner wrote:
> On Tue, 13 Feb 2018, Reinette Chatre wrote:
> 
>> System administrator creates/removes pseudo-locked regions by
>> creating/removing directories in the pseudo-lock subdirectory of the
>> resctrl filesystem. Here we add directory creation and removal support.
>>
>> A "pseudo-lock region" is introduced, which represents an
>> instance of a pseudo-locked cache region. During mkdir a new region is
>> created but since we do not know which cache it belongs to at that time
>> we maintain a global pointer to it from where it will be moved to the cache
>> (rdt_domain) it belongs to after initialization. This implies that
>> we only support one uninitialized pseudo-locked region at a time.
> 
> Whats the reason for this restriction? If there are uninitialized
> directories, so what?

I was thinking about a problematic scenario where an application
attempts to create infinite directories. All of these uninitialized
directories need to be kept track of before they are initialized as
pseudo-locked regions. It seemed simpler to require that one
pseudo-locked region is set up at a time.

>> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
>> ---
>>  arch/x86/kernel/cpu/intel_rdt.h             |   3 +
>>  arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 220 +++++++++++++++++++++++++++-
>>  2 files changed, 222 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
>> index 8f5ded384e19..55f085985072 100644
>> --- a/arch/x86/kernel/cpu/intel_rdt.h
>> +++ b/arch/x86/kernel/cpu/intel_rdt.h
>> @@ -352,6 +352,7 @@ extern struct mutex rdtgroup_mutex;
>>  extern struct rdt_resource rdt_resources_all[];
>>  extern struct rdtgroup rdtgroup_default;
>>  DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
>> +extern struct kernfs_node *pseudo_lock_kn;
>>  
>>  int __init rdtgroup_init(void);
>>  
>> @@ -457,5 +458,7 @@ bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
>>  void __check_limbo(struct rdt_domain *d, bool force_free);
>>  int rdt_pseudo_lock_fs_init(struct kernfs_node *root);
>>  void rdt_pseudo_lock_fs_remove(void);
>> +int rdt_pseudo_lock_mkdir(const char *name, umode_t mode);
>> +int rdt_pseudo_lock_rmdir(struct kernfs_node *kn);
>>  
>>  #endif /* _ASM_X86_INTEL_RDT_H */
>> diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
>> index a787a103c432..7a22e367b82f 100644
>> --- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
>> +++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
>> @@ -20,11 +20,142 @@
>>  #define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
>>  
>>  #include <linux/kernfs.h>
>> +#include <linux/kref.h>
>>  #include <linux/seq_file.h>
>>  #include <linux/stat.h>
>> +#include <linux/slab.h>
>>  #include "intel_rdt.h"
>>  
>> -static struct kernfs_node *pseudo_lock_kn;
>> +struct kernfs_node *pseudo_lock_kn;
>> +
>> +/*
>> + * Protect the pseudo_lock_region access. Since we will link to
>> + * pseudo_lock_region from rdt domains rdtgroup_mutex should be obtained
>> + * first if needed.
>> + */
>> +static DEFINE_MUTEX(rdt_pseudo_lock_mutex);
>> +
>> +/**
>> + * struct pseudo_lock_region - pseudo-lock region information
>> + * @kn:			kernfs node representing this region in the resctrl
>> + *			filesystem
>> + * @cbm:		bitmask of the pseudo-locked region
>> + * @cpu:		core associated with the cache on which the setup code
>> + *			will be run
>> + * @minor:		minor number of character device associated with this
>> + *			region
>> + * @locked:		state indicating if this region has been locked or not
>> + * @refcount:		how many are waiting to access this pseudo-lock
>> + *			region via kernfs
>> + * @deleted:		user requested removal of region via rmdir on kernfs
>> + */
>> +struct pseudo_lock_region {
>> +	struct kernfs_node	*kn;
>> +	u32			cbm;
>> +	int			cpu;
>> +	unsigned int		minor;
>> +	bool			locked;
>> +	struct kref		refcount;
>> +	bool			deleted;
>> +};
>> +
>> +/*
>> + * Only one uninitialized pseudo-locked region can exist at a time. An
>> + * uninitialized pseudo-locked region is created when the user creates a
>> + * new directory within the pseudo_lock subdirectory of the resctrl
>> + * filesystem. The user will initialize the pseudo-locked region by writing
>> + * to its schemata file at which point this structure will be moved to the
>> + * cache domain it belongs to.
>> + */
>> +static struct pseudo_lock_region *new_plr;
> 
> Why isn't the struct pointer not stored in the corresponding kernfs's node->priv?

It is. The life cycle of the uninitialized pseudo-locked region is
managed with this pointer though. The initialized pseudo-locked regions
are managed through their pointers within struct rdt_domain to which
they belong. The uninitialized pseudo-locked region is tracked here
until moved to the cache domain to which it belongs.

>> +static void __pseudo_lock_region_release(struct pseudo_lock_region *plr)
>> +{
>> +	bool is_new_plr = (plr == new_plr);
>> +
>> +	WARN_ON(!plr->deleted);
>> +	if (!plr->deleted)
>> +		return;
> 
>   	if (WARN_ON(...))
> 		return;
> 

Will fix.

>> +
>> +	kfree(plr);
>> +	if (is_new_plr)
>> +		new_plr = NULL;
>> +}
>> +
>> +static void pseudo_lock_region_release(struct kref *ref)
>> +{
>> +	struct pseudo_lock_region *plr = container_of(ref,
>> +						      struct pseudo_lock_region,
>> +						      refcount);
> 
> You simply can avoid those line breaks by:
> 
> 	struct pseudo_lock_region *plr;
> 
> 	plr = container_of(ref, struct pseudo_lock_region, refcount);
> 
> Hmm?
> 

Absolutely. Will fix.

>> +	mutex_lock(&rdt_pseudo_lock_mutex);
>> +	__pseudo_lock_region_release(plr);
>> +	mutex_unlock(&rdt_pseudo_lock_mutex);
>> +}
>> +
>> +/**
>> + * pseudo_lock_region_kn_lock - Obtain lock to pseudo-lock region kernfs node
>> + *
>> + * This is called from the kernfs related functions which are called with
>> + * an active reference to the kernfs_node that contains a valid pointer to
>> + * the pseudo-lock region it represents. We can thus safely take an active
>> + * reference to the pseudo-lock region before dropping the reference to the
>> + * kernfs_node.
>> + *
>> + * We need to handle the scenarios where the kernfs directory representing
>> + * this pseudo-lock region can be removed while an application still has an
>> + * open handle to one of the directory's files and operations on this
>> + * handle are attempted.
>> + * To support this we allow a file operation to drop its reference to the
>> + * kernfs_node so that the removal can proceed, while using the mutex to
>> + * ensure these operations on the pseudo-lock region are serialized. At the
>> + * time an operation does obtain access to the region it may thus have been
>> + * deleted.
>> + */
>> +static struct pseudo_lock_region *pseudo_lock_region_kn_lock(
>> +						struct kernfs_node *kn)
>> +{
>> +	struct pseudo_lock_region *plr = (kernfs_type(kn) == KERNFS_DIR) ?
>> +						kn->priv : kn->parent->priv;
> 
> See above.

Will fix.

> 
>> +int rdt_pseudo_lock_mkdir(const char *name, umode_t mode)
>> +{
>> +	struct pseudo_lock_region *plr;
>> +	struct kernfs_node *kn;
>> +	int ret = 0;
>> +
>> +	mutex_lock(&rdtgroup_mutex);
>> +	mutex_lock(&rdt_pseudo_lock_mutex);
>> +
>> +	if (new_plr) {
>> +		ret = -ENOSPC;
>> +		goto out;
>> +	}
>> +
>> +	plr = kzalloc(sizeof(*plr), GFP_KERNEL);
>> +	if (!plr) {
>> +		ret = -ENOSPC;
> 
>   ENOMEM is the proper error code here.

Will fix.

> 
>> +		goto out;
>> +	}
>> +
>> +	kn = kernfs_create_dir(pseudo_lock_kn, name, mode, plr);
>> +	if (IS_ERR(kn)) {
>> +		ret = PTR_ERR(kn);
>> +		goto out_free;
>> +	}
>> +
>> +	plr->kn = kn;
>> +	ret = rdtgroup_kn_set_ugid(kn);
>> +	if (ret)
>> +		goto out_remove;
>> +
>> +	kref_init(&plr->refcount);
>> +	kernfs_activate(kn);
>> +	new_plr = plr;
>> +	ret = 0;
>> +	goto out;
>> +
>> +out_remove:
>> +	kernfs_remove(kn);
>> +out_free:
>> +	kfree(plr);
>> +out:
>> +	mutex_unlock(&rdt_pseudo_lock_mutex);
>> +	mutex_unlock(&rdtgroup_mutex);
>> +	return ret;
>> +}
>> +
>> +/*
>> + * rdt_pseudo_lock_rmdir - Remove pseudo-lock region
>> + *
>> + * LOCKING:
>> + * Since the pseudo-locked region can be associated with a RDT domain at
>> + * removal we take both rdtgroup_mutex and rdt_pseudo_lock_mutex to protect
>> + * the rdt_domain access as well as the pseudo_lock_region access.
> 
> Is there a real reason / benefit for having this second mutex?

Some interactions with the pseudo-locked region are currently done
without the need for the rdtgroup_mutex. For example, interaction with
the character device associated with the pseudo-locked region (the
mmap() call) as well as the debugfs operations.

Reinette

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH V2 06/22] x86/intel_rdt: Create pseudo-locked regions
  2018-02-19 23:02     ` Reinette Chatre
@ 2018-02-19 23:16       ` Thomas Gleixner
  2018-02-20  3:21         ` Reinette Chatre
  0 siblings, 1 reply; 65+ messages in thread
From: Thomas Gleixner @ 2018-02-19 23:16 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: fenghua.yu, tony.luck, gavin.hindman, vikas.shivappa,
	dave.hansen, mingo, hpa, x86, linux-kernel

On Mon, 19 Feb 2018, Reinette Chatre wrote:
> On 2/19/2018 12:57 PM, Thomas Gleixner wrote:
> > On Tue, 13 Feb 2018, Reinette Chatre wrote:
> > 
> >> System administrator creates/removes pseudo-locked regions by
> >> creating/removing directories in the pseudo-lock subdirectory of the
> >> resctrl filesystem. Here we add directory creation and removal support.
> >>
> >> A "pseudo-lock region" is introduced, which represents an
> >> instance of a pseudo-locked cache region. During mkdir a new region is
> >> created but since we do not know which cache it belongs to at that time
> >> we maintain a global pointer to it from where it will be moved to the cache
> >> (rdt_domain) it belongs to after initialization. This implies that
> >> we only support one uninitialized pseudo-locked region at a time.
> > 
> > Whats the reason for this restriction? If there are uninitialized
> > directories, so what?
> 
> I was thinking about a problematic scenario where an application
> attempts to create infinite directories. All of these uninitialized
> directories need to be kept track of before they are initialized as
> pseudo-locked regions. It seemed simpler to require that one
> pseudo-locked region is set up at a time.

If the application is allowed to create directories then it can also create
a dozen unused resource control groups. This is not a Joe User operation so
there is no problem.

> >> +/*
> >> + * rdt_pseudo_lock_rmdir - Remove pseudo-lock region
> >> + *
> >> + * LOCKING:
> >> + * Since the pseudo-locked region can be associated with a RDT domain at
> >> + * removal we take both rdtgroup_mutex and rdt_pseudo_lock_mutex to protect
> >> + * the rdt_domain access as well as the pseudo_lock_region access.
> > 
> > Is there a real reason / benefit for having this second mutex?
> 
> Some interactions with the pseudo-locked region are currently done
> without the need for the rdtgroup_mutex. For example, interaction with
> the character device associated with the pseudo-locked region (the
> mmap() call) as well as the debugfs operations.

Well, yes. But none of those operations are hot path so having the double
locking in lots of the other functions is just extra complexity for no real
value.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH V2 11/22] x86/intel_rdt: Associate pseudo-locked regions with its domain
  2018-02-19 23:00     ` Reinette Chatre
@ 2018-02-19 23:19       ` Thomas Gleixner
  2018-02-20  3:17         ` Reinette Chatre
  0 siblings, 1 reply; 65+ messages in thread
From: Thomas Gleixner @ 2018-02-19 23:19 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: fenghua.yu, tony.luck, gavin.hindman, vikas.shivappa,
	dave.hansen, mingo, hpa, x86, linux-kernel

On Mon, 19 Feb 2018, Reinette Chatre wrote:

> Hi Thomas,
> 
> On 2/19/2018 1:19 PM, Thomas Gleixner wrote:
> > On Tue, 13 Feb 2018, Reinette Chatre wrote:
> > 
> >> After a pseudo-locked region is locked it needs to be associated with
> >> the RDT domain representing the pseudo-locked cache so that its life
> >> cycle can be managed correctly.
> >>
> >> Only a single pseudo-locked region can exist on any cache instance so we
> >> maintain a single pointer to a pseudo-locked region from each RDT
> >> domain.
> > 
> > Why is only a single pseudo locked region possible? 
> 
> The setup of a pseudo-locked region requires the usage of wbinvd. If a
> second pseudo-locked region is thus attempted it will evict the
> pseudo-locked data of the first.

Why does it need wbinvd? wbinvd is a big hammer. What's wrong with clflush?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH V2 11/22] x86/intel_rdt: Associate pseudo-locked regions with its domain
  2018-02-19 23:19       ` Thomas Gleixner
@ 2018-02-20  3:17         ` Reinette Chatre
  2018-02-20 10:00           ` Thomas Gleixner
  0 siblings, 1 reply; 65+ messages in thread
From: Reinette Chatre @ 2018-02-20  3:17 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: fenghua.yu, tony.luck, gavin.hindman, vikas.shivappa,
	dave.hansen, mingo, hpa, x86, linux-kernel

Hi Thomas,

On 2/19/2018 3:19 PM, Thomas Gleixner wrote:
> On Mon, 19 Feb 2018, Reinette Chatre wrote:
>> On 2/19/2018 1:19 PM, Thomas Gleixner wrote:
>>> On Tue, 13 Feb 2018, Reinette Chatre wrote:
>>>
>>>> After a pseudo-locked region is locked it needs to be associated with
>>>> the RDT domain representing the pseudo-locked cache so that its life
>>>> cycle can be managed correctly.
>>>>
>>>> Only a single pseudo-locked region can exist on any cache instance so we
>>>> maintain a single pointer to a pseudo-locked region from each RDT
>>>> domain.
>>>
>>> Why is only a single pseudo locked region possible? 
>>
>> The setup of a pseudo-locked region requires the usage of wbinvd. If a
>> second pseudo-locked region is thus attempted it will evict the
>> pseudo-locked data of the first.
> 
> > Why does it need wbinvd? wbinvd is a big hammer. What's wrong with clflush?

wbinvd is required by this hardware supported feature but limited to the
creation of the pseudo-locked region. An administrator could dedicate a
portion of cache to pseudo-locking and applications using this region
can come and go. The pseudo-locked region lifetime need not be tied to
application lifetime. The pseudo-locked region could be set up once on
boot and remain for lifetime of system.

Even so, understanding that it is a big hammer, I did explore the
alternatives, trying clflush, clflushopt, as well as clwb. Finding them
all to perform poorly(*), I went further to explore whether it is
possible to use these other instructions with some additional supporting
work to make them perform as well as wbinvd. The additional work
included looping over the data more times than is done for wbinvd,
reducing the size of the locked memory relative to the cache size,
leaving unused spacing between the pseudo-locked region and other
regions, and leaving unmapped memory at the end of the pseudo-locked
region.

In addition to the above research from my side I also followed up with
the CPU architects directly to question the usage of these instructions
instead of wbinvd.

In all the testing and questioning I did I was only able to confirm that
wbinvd is required. Its use consistently results in the fewest cache
misses to the created pseudo-locked region.

Reinette

(*) By poorly I mean that accessing the pseudo-locked region created
using these instructions resulted in significant cache misses.
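
To frame what is being discussed: the locking step is, in very rough
outline, the sequence sketched below. This is not the code in the series
and it omits important details (prefetcher interference, RMID handling,
error paths); it is only meant to show where wbinvd sits in the flow.

        #include <linux/compiler.h>
        #include <linux/irqflags.h>
        #include <linux/types.h>
        #include <asm/msr.h>
        #include <asm/processor.h>
        #include <asm/special_insns.h>

        /*
         * Rough sketch only: flush the cache, switch this CPU to the CLOSID
         * whose CBM covers only the reserved ways, then touch every cache
         * line of the buffer so that the resulting fills land in those ways.
         */
        static void pseudo_lock_fill_sketch(void *kmem, unsigned int size, u32 closid)
        {
                unsigned int line = boot_cpu_data.x86_cache_alignment;
                unsigned int i;

                local_irq_disable();
                native_wbinvd();
                wrmsr(MSR_IA32_PQR_ASSOC, 0, closid);

                for (i = 0; i < size; i += line)
                        READ_ONCE(((char *)kmem)[i]);

                wrmsr(MSR_IA32_PQR_ASSOC, 0, 0);
                local_irq_enable();
        }
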

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH V2 06/22] x86/intel_rdt: Create pseudo-locked regions
  2018-02-19 23:16       ` Thomas Gleixner
@ 2018-02-20  3:21         ` Reinette Chatre
  0 siblings, 0 replies; 65+ messages in thread
From: Reinette Chatre @ 2018-02-20  3:21 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: fenghua.yu, tony.luck, gavin.hindman, vikas.shivappa,
	dave.hansen, mingo, hpa, x86, linux-kernel

Hi Thomas,

On 2/19/2018 3:16 PM, Thomas Gleixner wrote:
> On Mon, 19 Feb 2018, Reinette Chatre wrote:
>> On 2/19/2018 12:57 PM, Thomas Gleixner wrote:
>>> On Tue, 13 Feb 2018, Reinette Chatre wrote:
>>>
>>>> System administrator creates/removes pseudo-locked regions by
>>>> creating/removing directories in the pseudo-lock subdirectory of the
>>>> resctrl filesystem. Here we add directory creation and removal support.
>>>>
>>>> A "pseudo-lock region" is introduced, which represents an
>>>> instance of a pseudo-locked cache region. During mkdir a new region is
>>>> created but since we do not know which cache it belongs to at that time
>>>> we maintain a global pointer to it from where it will be moved to the cache
>>>> (rdt_domain) it belongs to after initialization. This implies that
>>>> we only support one uninitialized pseudo-locked region at a time.
>>>
>>> Whats the reason for this restriction? If there are uninitialized
>>> directories, so what?
>>
>> I was thinking about a problematic scenario where an application
>> attempts to create infinite directories. All of these uninitialized
>> directories need to be kept track of before they are initialized as
>> pseudo-locked regions. It seemed simpler to require that one
>> pseudo-locked region is set up at a time.
> 
> If the application is allowed to create directories then it can also create
> a dozen unused resource control groups. This is not a Joe User operation so
> there is no problem.

Thank you for the guidance. I will remove this restriction.

>>>> +/*
>>>> + * rdt_pseudo_lock_rmdir - Remove pseudo-lock region
>>>> + *
>>>> + * LOCKING:
>>>> + * Since the pseudo-locked region can be associated with a RDT domain at
>>>> + * removal we take both rdtgroup_mutex and rdt_pseudo_lock_mutex to protect
>>>> + * the rdt_domain access as well as the pseudo_lock_region access.
>>>
>>> Is there a real reason / benefit for having this second mutex?
>>
>> Some interactions with the pseudo-locked region are currently done
>> without the need for the rdtgroup_mutex. For example, interaction with
>> the character device associated with the pseudo-locked region (the
>> mmap() call) as well as the debugfs operations.
> 
> Well, yes. But none of those operations are hot path so having the double
> locking in lots of the other functions is just extra complexity for no real
> value.

I will revise.

Thank you very much.

Reinette

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH V2 11/22] x86/intel_rdt: Associate pseudo-locked regions with its domain
  2018-02-20  3:17         ` Reinette Chatre
@ 2018-02-20 10:00           ` Thomas Gleixner
  2018-02-20 16:02             ` Reinette Chatre
  0 siblings, 1 reply; 65+ messages in thread
From: Thomas Gleixner @ 2018-02-20 10:00 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: fenghua.yu, tony.luck, gavin.hindman, vikas.shivappa,
	dave.hansen, mingo, hpa, x86, linux-kernel

On Mon, 19 Feb 2018, Reinette Chatre wrote:
> On 2/19/2018 3:19 PM, Thomas Gleixner wrote:
> > On Mon, 19 Feb 2018, Reinette Chatre wrote:
> >> On 2/19/2018 1:19 PM, Thomas Gleixner wrote:
> >>> On Tue, 13 Feb 2018, Reinette Chatre wrote:
> >>>
> >>>> After a pseudo-locked region is locked it needs to be associated with
> >>>> the RDT domain representing the pseudo-locked cache so that its life
> >>>> cycle can be managed correctly.
> >>>>
> >>>> Only a single pseudo-locked region can exist on any cache instance so we
> >>>> maintain a single pointer to a pseudo-locked region from each RDT
> >>>> domain.
> >>>
> >>> Why is only a single pseudo locked region possible? 
> >>
> >> The setup of a pseudo-locked region requires the usage of wbinvd. If a
> >> second pseudo-locked region is thus attempted it will evict the
> >> pseudo-locked data of the first.
> > 
> > Why does it need wbinvd? wbinvd is a big hammer. What's wrong with clflush?
> 
> wbinvd is required by this hardware supported feature but limited to the
> creation of the pseudo-locked region. An administrator could dedicate a
> portion of cache to pseudo-locking and applications using this region
> can come and go. The pseudo-locked region lifetime need not be tied to
> application lifetime. The pseudo-locked region could be set up once on
> boot and remain for lifetime of system.
> 
> Even so, understanding that it is a big hammer, I did explore the
> alternatives, trying clflush, clflushopt, as well as clwb. Finding them
> all to perform poorly(*), I went further to explore whether it is
> possible to use these other instructions with some additional supporting
> work to make them perform as well as wbinvd. The additional work
> included looping over the data more times than is done for wbinvd,
> reducing the size of the locked memory relative to the cache size,
> leaving unused spacing between the pseudo-locked region and other
> regions, and leaving unmapped memory at the end of the pseudo-locked
> region.
> 
> In addition to the above research from my side I also followed up with
> the CPU architects directly to question the usage of these instructions
> instead of wbinvd.

What was their answer? This really wants a proper explanation and not just
experimentation results as it makes absolutely no sense at all.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH V2 11/22] x86/intel_rdt: Associate pseudo-locked regions with its domain
  2018-02-20 10:00           ` Thomas Gleixner
@ 2018-02-20 16:02             ` Reinette Chatre
  2018-02-20 17:18               ` Thomas Gleixner
  0 siblings, 1 reply; 65+ messages in thread
From: Reinette Chatre @ 2018-02-20 16:02 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: fenghua.yu, tony.luck, gavin.hindman, vikas.shivappa,
	dave.hansen, mingo, hpa, x86, linux-kernel

Hi Thomas,

On 2/20/2018 2:00 AM, Thomas Gleixner wrote:
> On Mon, 19 Feb 2018, Reinette Chatre wrote:
>> On 2/19/2018 3:19 PM, Thomas Gleixner wrote:
>>> On Mon, 19 Feb 2018, Reinette Chatre wrote:
>>>> On 2/19/2018 1:19 PM, Thomas Gleixner wrote:
>>>>> On Tue, 13 Feb 2018, Reinette Chatre wrote:
>>>>>
>>>>>> After a pseudo-locked region is locked it needs to be associated with
>>>>>> the RDT domain representing the pseudo-locked cache so that its life
>>>>>> cycle can be managed correctly.
>>>>>>
>>>>>> Only a single pseudo-locked region can exist on any cache instance so we
>>>>>> maintain a single pointer to a pseudo-locked region from each RDT
>>>>>> domain.
>>>>>
>>>>> Why is only a single pseudo locked region possible? 
>>>>
>>>> The setup of a pseudo-locked region requires the usage of wbinvd. If a
>>>> second pseudo-locked region is thus attempted it will evict the
>>>> pseudo-locked data of the first.
>>>
> >>> Why does it need wbinvd? wbinvd is a big hammer. What's wrong with clflush?
>>
>> wbinvd is required by this hardware supported feature but limited to the
>> creation of the pseudo-locked region. An administrator could dedicate a
>> portion of cache to pseudo-locking and applications using this region
>> can come and go. The pseudo-locked region lifetime need not be tied to
>> application lifetime. The pseudo-locked region could be set up once on
>> boot and remain for lifetime of system.
>>
>> Even so, understanding that it is a big hammer, I did explore the
>> alternatives, trying clflush, clflushopt, as well as clwb. Finding them
>> all to perform poorly(*), I went further to explore whether it is
>> possible to use these other instructions with additional supporting work
>> to make them perform as well as wbinvd. The additional work included
>> looping over the data more times than done for wbinvd, reducing the size
>> of memory locked relative to the cache size, adding unused spacing between
>> the pseudo-locked region and other regions, and leaving unmapped memory at
>> the end of the pseudo-locked region.
>>
>> In addition to the above research from my side I also followed up with
>> the CPU architects directly to question the usage of these instructions
>> instead of wbinvd.
> 
> What was their answer? This really wants a proper explanation and not just
> experimentation results as it makes absolutely no sense at all.

I always prefer to provide detailed answers but here I find myself at
the threshold where I may end up sharing information not publicly known.
This cannot be the first time you find yourself in this situation. How
do you prefer to proceed?

Reinette

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH V2 13/22] x86/intel_rdt: Support schemata write - pseudo-locking core
  2018-02-13 15:46 ` [RFC PATCH V2 13/22] x86/intel_rdt: Support schemata write - pseudo-locking core Reinette Chatre
@ 2018-02-20 17:15   ` Thomas Gleixner
  2018-02-20 18:47     ` Reinette Chatre
                       ` (2 more replies)
  0 siblings, 3 replies; 65+ messages in thread
From: Thomas Gleixner @ 2018-02-20 17:15 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: fenghua.yu, tony.luck, gavin.hindman, vikas.shivappa,
	dave.hansen, mingo, hpa, x86, linux-kernel

On Tue, 13 Feb 2018, Reinette Chatre wrote:
>  static void __pseudo_lock_region_release(struct pseudo_lock_region *plr)
>  {
>  	bool is_new_plr = (plr == new_plr);
> @@ -93,6 +175,23 @@ static void __pseudo_lock_region_release(struct pseudo_lock_region *plr)
>  	if (!plr->deleted)
>  		return;
>  
> +	if (plr->locked) {
> +		plr->d->plr = NULL;
> +		/*
> +		 * Resource groups come and go. Simply returning this
> +		 * pseudo-locked region's bits to the default CLOS may
> +		 * result in default CLOS to become fragmented, causing
> +		 * the setting of its bitmask to fail. Ensure it is valid
> +		 * first. If this check does fail we cannot return the bits
> +		 * to the default CLOS and userspace intervention would be
> +		 * required to ensure portions of the cache do not go
> +		 * unused.
> +		 */
> +		if (cbm_validate_val(plr->d->ctrl_val[0] | plr->cbm, plr->r))
> +			pseudo_lock_clos_set(plr, 0,
> +					     plr->d->ctrl_val[0] | plr->cbm);
> +		pseudo_lock_region_clear(plr);
> +	}
>  	kfree(plr);
>  	if (is_new_plr)
>  		new_plr = NULL;

Are you really sure that the life time rules of plr are correct vs. an
application which still has the locked memory mapped? i.e. the following
operation:

1# create_pseudo_lock_region()

					2# start_app()
					   fd = open(/dev/.../lock);
					   ptr = mmap(fd, .....); <- takes a ref on fd
					   close(fd);
					   do_stuff(ptr);

1# rmdir .../lock

					   unmap(ptr);	<- releases fd

I can't see how that is protected. You already have a kref in the PLR, but
it's in no way connected to the file descriptor lifetime.  So the refcount
logic here must be:

create_lock_region()
	plr = alloc_plr();
	take_ref(plr);
	if (!init_plr(plr)) {
		drop_ref(plr);
		...
	}

lockdev_open(filp)
	take_ref(plr);
	filp->private = plr;

rmdir_lock_region()
	...
	drop_ref(plr);

lockdev_release(filp)
	filp->private = NULL;
	drop_ref(plr);
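
A minimal sketch, assuming a plain kref embedded in the pseudo_lock_region, of
how the refcount rules above could look; the structure layout and the
pseudo_lock_free()/lockdev_*() names are illustrative assumptions, not code
from this series:

#include <linux/fs.h>
#include <linux/kref.h>
#include <linux/slab.h>

struct pseudo_lock_region {
        struct kref     refcount;
        /* ... cache domain, CBM and locked memory state ... */
};

static void pseudo_lock_free(struct kref *ref)
{
        struct pseudo_lock_region *plr = container_of(ref,
                        struct pseudo_lock_region, refcount);

        kfree(plr);
}

/* rmdir path: drop the reference taken when the region was created */
static void rmdir_lock_region(struct pseudo_lock_region *plr)
{
        kref_put(&plr->refcount, pseudo_lock_free);
}

/* character device: every open holds its own reference */
static int lockdev_open(struct inode *inode, struct file *filp)
{
        struct pseudo_lock_region *plr = inode->i_private;

        kref_get(&plr->refcount);
        filp->private_data = plr;
        return 0;
}

static int lockdev_release(struct inode *inode, struct file *filp)
{
        struct pseudo_lock_region *plr = filp->private_data;

        filp->private_data = NULL;
        kref_put(&plr->refcount, pseudo_lock_free);
        return 0;
}

Because mmap() pins the file, lockdev_release() only runs after the last
unmap, so the region and its memory survive the rmdir in the sequence above.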

>  /*
> + * Only one pseudo-locked region can be set up at a time and that is
> + * enforced by taking the rdt_pseudo_lock_mutex when the user writes the
> + * requested schemata to the resctrl file and releasing the mutex on
> + * completion. The thread locking the kernel memory into the cache starts
> + * and completes during this time so we can be sure that only one thread
> + * can run at any time.
> + * The functions starting the pseudo-locking thread needs to wait for its
> + * completion and since there can only be one we have a global workqueue
> + * and variable to support this.
> + */
> +static DECLARE_WAIT_QUEUE_HEAD(wq);
> +static int thread_done;

Eew. For one, you really couldn't come up with more generic and less
relatable variable names, right?

That aside, it's just wrong to build code based on current hardware
limitations. The waitqueue and the result code belong into PLR.
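
As an illustration of that suggestion, the completion state could live in the
region itself rather than in file-scope globals; the field and helper names
are assumptions:

#include <linux/wait.h>

struct pseudo_lock_region {
        /* ... existing members ... */
        wait_queue_head_t       lock_thread_wq; /* creator waits here */
        int                     thread_done;    /* result from the locking thread */
};

static void pseudo_lock_init_thread_state(struct pseudo_lock_region *plr)
{
        init_waitqueue_head(&plr->lock_thread_wq);
        plr->thread_done = 0;
}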

> +/**
> + * pseudo_lock_fn - Load kernel memory into cache
> + *
> + * This is the core pseudo-locking function.
> + *
> + * First we ensure that the kernel memory cannot be found in the cache.
> + * Then, while taking care that there will be as little interference as
> + * possible, each cache line of the memory to be loaded is touched while
> + * core is running with class of service set to the bitmask of the
> + * pseudo-locked region. After this is complete no future CAT allocations
> + * will be allowed to overlap with this bitmask.
> + *
> + * Local register variables are utilized to ensure that the memory region
> + * to be locked is the only memory access made during the critical locking
> + * loop.
> + */
> +static int pseudo_lock_fn(void *_plr)
> +{
> +	struct pseudo_lock_region *plr = _plr;
> +	u32 rmid_p, closid_p;
> +	unsigned long flags;
> +	u64 i;
> +#ifdef CONFIG_KASAN
> +	/*
> +	 * The registers used for local register variables are also used
> +	 * when KASAN is active. When KASAN is active we use a regular
> +	 * variable to ensure we always use a valid pointer, but the cost
> +	 * is that this variable will enter the cache through evicting the
> +	 * memory we are trying to lock into the cache. Thus expect lower
> +	 * pseudo-locking success rate when KASAN is active.
> +	 */

I'm not a real fan of this mess. But well, 

> +	unsigned int line_size;
> +	unsigned int size;
> +	void *mem_r;
> +#else
> +	register unsigned int line_size asm("esi");
> +	register unsigned int size asm("edi");
> +#ifdef CONFIG_X86_64
> +	register void *mem_r asm("rbx");
> +#else
> +	register void *mem_r asm("ebx");
> +#endif /* CONFIG_X86_64 */
> +#endif /* CONFIG_KASAN */
> +
> +	/*
> +	 * Make sure none of the allocated memory is cached. If it is we
> +	 * will get a cache hit in below loop from outside of pseudo-locked
> +	 * region.
> +	 * wbinvd (as opposed to clflush/clflushopt) is required to
> +	 * increase likelihood that allocated cache portion will be filled
> +	 * with associated memory

Sigh.

> +	 */
> +	wbinvd();
> +
> +	preempt_disable();
> +	local_irq_save(flags);

preempt_disable() is pointless when you disable interrupts.  And this
really should be local_irq_disable(). This code is always called with
interrupts enabled....
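
The shape being suggested, as a sketch (the helper name and the elided body
are placeholders):

static void pseudo_lock_critical_section(struct pseudo_lock_region *plr)
{
        /*
         * The locking thread always runs with interrupts enabled, so a
         * plain disable/enable pair is enough; disabling interrupts
         * already prevents preemption on this CPU.
         */
        local_irq_disable();

        /* ... prefetcher off, CLOSID switch, priming loop, restore ... */

        local_irq_enable();
}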

> +	/*
> +	 * Call wrmsr and rdmsr as directly as possible to avoid tracing
> +	 * clobbering local register variables or affecting cache accesses.
> +	 */

You probably want to make sure that the code below is in L1 cache already
before the CLOSID is set to the allocation. To do this you want to put the
preload mechanics into a separate ASM function.

Then you run it with size = 1 on some other temporary memory buffer with
the default CLOSID, which has the CBM bits of the lock region excluded.
Then switch to the real CLOSID and run the loop with the real buffer and
the real size.
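
A rough sketch of that two-pass priming, assuming the touch loop sits in its
own non-inlined helper so the first pass also pulls the helper's code into L1;
pseudo_lock_touch(), the plr fields and the RMID handling are illustrative
assumptions:

#include <linux/compiler.h>
#include <asm/msr.h>

static noinline void pseudo_lock_touch(void *mem, unsigned int size,
                                       unsigned int line_size)
{
        unsigned int i;

        /* Touch one byte per cache line to pull the lines into the cache. */
        for (i = 0; i < size; i += line_size)
                (void)READ_ONCE(((unsigned char *)mem)[i]);
}

static void pseudo_lock_prime(struct pseudo_lock_region *plr, void *tmp,
                              unsigned int line_size)
{
        /*
         * Pass 1: run with the default CLOSID (lock region bits excluded)
         * over a throwaway buffer so the helper's code and stack are in
         * the cache before the pseudo-lock CLOSID becomes active.
         */
        pseudo_lock_touch(tmp, line_size, line_size);

        /*
         * Pass 2: switch to the pseudo-lock CLOSID and prime the real
         * buffer. Real code would preserve the current RMID instead of 0.
         */
        __wrmsr(MSR_IA32_PQR_ASSOC, 0, plr->closid);
        pseudo_lock_touch(plr->kmem, plr->size, line_size);
}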

> +	__wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);

This wants an explanation why the prefetcher needs to be disabled.

> +static int pseudo_lock_doit(struct pseudo_lock_region *plr,
> +			    struct rdt_resource *r,
> +			    struct rdt_domain *d)
> +{
> +	struct task_struct *thread;
> +	int closid;
> +	int ret, i;
> +
> +	/*
> +	 * With the usage of wbinvd we can only support one pseudo-locked
> +	 * region per domain at this time.

This really sucks.

> +	 */
> +	if (d->plr) {
> +		rdt_last_cmd_puts("pseudo-locked region exists on cache\n");
> +		return -ENOSPC;

This check is not sufficient for a CPU which has both L2 and L3 allocation
capability. If there is already a L3 locked region and the current call
sets up a L2 locked region then this will not catch it and the following
wbinvd will wipe the L3 locked region ....
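
One way to make the check cover both cache levels, sketched on the assumption
that every rdt_domain carries the d->plr pointer from this series; the helper
itself is illustrative:

static bool pseudo_lock_region_exists(void)
{
        struct rdt_resource *r;
        struct rdt_domain *d;

        /*
         * wbinvd is system wide, so a new pseudo-locked region must be
         * refused if any domain of any allocation capable resource
         * (L2 or L3) already has one.
         */
        for_each_alloc_enabled_rdt_resource(r) {
                list_for_each_entry(d, &r->domains, list) {
                        if (d->plr)
                                return true;
                }
        }
        return false;
}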

> +	}
> +
> +	ret = pseudo_lock_region_init(plr, r, d);
> +	if (ret < 0)
> +		return ret;
> +
> +	closid = closid_alloc();
> +	if (closid < 0) {
> +		ret = closid;
> +		rdt_last_cmd_puts("unable to obtain free closid\n");
> +		goto out_region;
> +	}
> +
> +	/*
> +	 * Ensure we end with a valid default CLOS. If a pseudo-locked
> +	 * region in middle of possible bitmasks is selected it will split
> +	 * up default CLOS which would be a fault and for which handling
> +	 * is unclear so we fail back to userspace. Validation will also
> +	 * ensure that default CLOS is not zero, keeping some cache
> +	 * available to rest of system.
> +	 */
> +	if (!cbm_validate_val(d->ctrl_val[0] & ~plr->cbm, r)) {
> +		ret = -EINVAL;
> +		rdt_last_cmd_printf("bm 0x%x causes invalid clos 0 bm 0x%x\n",
> +				    plr->cbm, d->ctrl_val[0] & ~plr->cbm);
> +		goto out_closid;
> +	}
> +
> +	ret = pseudo_lock_clos_set(plr, 0, d->ctrl_val[0] & ~plr->cbm);

Fiddling with the default CBM is wrong. The lock operation should only
succeed when the bits in that domain are not used by _ANY_ control group
including the default one. This is a reasonable constraint.

> +	if (ret < 0) {
> +		rdt_last_cmd_printf("unable to set clos 0 bitmask to 0x%x\n",
> +				    d->ctrl_val[0] & ~plr->cbm);
> +		goto out_closid;
> +	}
> +
> +	ret = pseudo_lock_clos_set(plr, closid, plr->cbm);
> +	if (ret < 0) {
> +		rdt_last_cmd_printf("unable to set closid %d bitmask to 0x%x\n",
> +				    closid, plr->cbm);
> +		goto out_clos_def;
> +	}
> +
> +	plr->closid = closid;
> +
> +	thread_done = 0;
> +
> +	thread = kthread_create_on_node(pseudo_lock_fn, plr,
> +					cpu_to_node(plr->cpu),
> +					"pseudo_lock/%u", plr->cpu);
> +	if (IS_ERR(thread)) {
> +		ret = PTR_ERR(thread);
> +		rdt_last_cmd_printf("locking thread returned error %d\n", ret);
> +		/*
> +		 * We do not return CBM to newly allocated CLOS here on
> +		 * error path since that will result in a CBM of all
> +		 * zeroes which is an illegal MSR write.

I'm not sure what you are trying to explain here.

If you remove a ctrl group then the CBM bits are not added to anything
either. It's up to the operator to handle that. Why would this be any
different for the pseudo-locking stuff?

> +		 */
> +		goto out_clos_def;
> +	}
> +
> +	kthread_bind(thread, plr->cpu);
> +	wake_up_process(thread);
> +
> +	ret = wait_event_interruptible(wq, thread_done == 1);
> +	if (ret < 0) {
> +		rdt_last_cmd_puts("locking thread interrupted\n");
> +		goto out_clos_def;

This is broken. If the thread does not get on the CPU for whatever reason
and the process which sets up the region is interrupted then this will
leave the thread in runnable state and once it gets on the CPU it will
happily dereference the freed plr struct and fiddle with the freed memory.

You need to make sure that the thread holds a reference on the plr struct,
which prevents freeing. That includes the CLOSID .....
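
A minimal sketch of that shape, reusing the kref and per-region waitqueue
suggested earlier; everything except the kthread API is an assumption:

static int pseudo_lock_doit_sketch(struct pseudo_lock_region *plr)
{
        struct task_struct *thread;
        int ret;

        plr->thread_done = 0;
        /* Reference owned by the locking thread, dropped in pseudo_lock_fn(). */
        kref_get(&plr->refcount);

        thread = kthread_create_on_node(pseudo_lock_fn, plr,
                                        cpu_to_node(plr->cpu),
                                        "pseudo_lock/%u", plr->cpu);
        if (IS_ERR(thread)) {
                kref_put(&plr->refcount, pseudo_lock_free);
                return PTR_ERR(thread);
        }

        kthread_bind(thread, plr->cpu);
        wake_up_process(thread);

        /*
         * Even if this wait is interrupted, the thread's own reference
         * keeps the plr (and its CLOSID) valid until pseudo_lock_fn()
         * has signalled completion and dropped that reference.
         */
        ret = wait_event_interruptible(plr->lock_thread_wq,
                                       plr->thread_done == 1);
        return ret;
}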

> +	}
> +
> +	/*
> +	 * closid will be released soon but its CBM as well as CBM of not
> +	 * yet allocated CLOS as stored in the array will remain. Ensure
> +	 * that CBM will be what is currently the default CLOS, which
> +	 * excludes pseudo-locked region.
> +	 */
> +	for (i = 1; i < r->num_closid; i++) {
> +		if (i == closid || !closid_allocated(i))
> +			pseudo_lock_clos_set(plr, i, d->ctrl_val[0]);
> +	}

This is all magical duct tape. The overall design of this is sideways and
not really well integrated into the existing infrastructure which creates
these kinds of magic warts and lots of duplicated code.

The deeper I read into the patch series the less I like that interface and
the implementation.

Let's look at the existing ctrl/mon groups which are each represented by a
directory already.

 - Adding a 'size' file to the ctrl groups would be a natural extension
   which makes sense for regular cache allocations as well.

 - Adding an 'exclusive' flag would be an interesting feature even for the
   normal use case. Marking a group as exclusive prevents other groups from
   requesting CBM bits which are held by an exclusive allocation.

   I'd suggest to have a file 'mode' for controlling this. The valid values
   would be something like 'shareable' and 'exclusive'.

   When trying to set a group to exclusive mode then the schemata has to be
   checked for overlaps with the other schematas and in case of conflict
   the write fails. Once enabled subsequent writes to the schemata file
   need to be checked for conflicts as well.

   If the exclusive setting is enabled then the CBM bits of that group
   are excluded from being used in other control groups.

Aside of that a file in the info directory which shows the (un)used CBM
bits of all groups is really helpful for controlling all of that (even w/o
pseudo locking). You have this in the 'avail' file, but there is no reason
why this should only be available for pseudo locking enabled systems.

Now for the pseudo locking part.

What you need on top of the above is a new 'mode': 'locked'. That mode
utilizes the 'exclusive' mode rules vs. conflict checking and the
protection against allocating the associated CBM bits in other control
groups.

The setup would be like this:

    mkdir group
    echo '$CONFIG' >group/schemata
    echo 'locked' >group/mode

Setting mode to locked locks down the schemata file along with the
task/cpus/cpus_list files. The task/cpu files need to be empty when
entering locked mode, otherwise the operation fails. I would not even
bother handing back the CLOSID. For simplicity the CLOSID should just stay
associated with the control group until it is destroyed as any other
control group.

Now the remaining thing is the memory allocation and the mmap itself. I
really dislike the preallocation of memory right at setup time. Ideally
that should be an allocation of the application itself, but the horrid
wbinvd stuff kinda prevents that. With that restriction we are more or less
bound to immediate allocation and population.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH V2 11/22] x86/intel_rdt: Associate pseudo-locked regions with its domain
  2018-02-20 16:02             ` Reinette Chatre
@ 2018-02-20 17:18               ` Thomas Gleixner
  0 siblings, 0 replies; 65+ messages in thread
From: Thomas Gleixner @ 2018-02-20 17:18 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: fenghua.yu, tony.luck, gavin.hindman, vikas.shivappa,
	dave.hansen, mingo, hpa, x86, linux-kernel

On Tue, 20 Feb 2018, Reinette Chatre wrote:
> On 2/20/2018 2:00 AM, Thomas Gleixner wrote:
> > On Mon, 19 Feb 2018, Reinette Chatre wrote:
> >> In addition to the above research from my side I also followed up with
> >> the CPU architects directly to question the usage of these instructions
> >> instead of wbinvd.
> > 
> > What was their answer? This really wants a proper explanation and not just
> > experimentation results as it makes absolutely no sense at all.
> 
> I always prefer to provide detailed answers but here I find myself at
> the threshold where I may end up sharing information not publicly known.
> This cannot be the first time you find yourself in this situation. How
> do you prefer to proceed?

Well, if it's secret sauce we'll have to accept it. Though it really does
not improve the confidence in all those mechanisms ....

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH V2 13/22] x86/intel_rdt: Support schemata write - pseudo-locking core
  2018-02-20 17:15   ` Thomas Gleixner
@ 2018-02-20 18:47     ` Reinette Chatre
  2018-02-20 23:21       ` Thomas Gleixner
  2018-02-27  0:34     ` Reinette Chatre
  2018-02-27 21:01     ` Reinette Chatre
  2 siblings, 1 reply; 65+ messages in thread
From: Reinette Chatre @ 2018-02-20 18:47 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: fenghua.yu, tony.luck, gavin.hindman, vikas.shivappa,
	dave.hansen, mingo, hpa, x86, linux-kernel

Hi Thomas,

On 2/20/2018 9:15 AM, Thomas Gleixner wrote:
> On Tue, 13 Feb 2018, Reinette Chatre wrote:
>>  static void __pseudo_lock_region_release(struct pseudo_lock_region *plr)
>>  {
>>  	bool is_new_plr = (plr == new_plr);
>> @@ -93,6 +175,23 @@ static void __pseudo_lock_region_release(struct pseudo_lock_region *plr)
>>  	if (!plr->deleted)
>>  		return;
>>  
>> +	if (plr->locked) {
>> +		plr->d->plr = NULL;
>> +		/*
>> +		 * Resource groups come and go. Simply returning this
>> +		 * pseudo-locked region's bits to the default CLOS may
>> +		 * result in default CLOS to become fragmented, causing
>> +		 * the setting of its bitmask to fail. Ensure it is valid
>> +		 * first. If this check does fail we cannot return the bits
>> +		 * to the default CLOS and userspace intervention would be
>> +		 * required to ensure portions of the cache do not go
>> +		 * unused.
>> +		 */
>> +		if (cbm_validate_val(plr->d->ctrl_val[0] | plr->cbm, plr->r))
>> +			pseudo_lock_clos_set(plr, 0,
>> +					     plr->d->ctrl_val[0] | plr->cbm);
>> +		pseudo_lock_region_clear(plr);
>> +	}
>>  	kfree(plr);
>>  	if (is_new_plr)
>>  		new_plr = NULL;
> 
> Are you really sure that the life time rules of plr are correct vs. an
> application which still has the locked memory mapped? i.e. the following
> operation:

You are correct. I am not preventing an administrator from removing the
pseudo-locked region if it is in use. I will fix that.

> 1# create_pseudo_lock_region()
> 
> 					2# start_app()
> 					   fd = open(/dev/.../lock);
> 					   ptr = mmap(fd, .....); <- takes a ref on fd
> 					   close(fd);
> 					   do_stuff(ptr);
> 
> 1# rmdir .../lock
> 
> 					   unmap(ptr);	<- releases fd
> 
> I can't see how that is protected. You already have a kref in the PLR, but
> it's in no way connected to the file descriptor lifetime.  So the refcount
> logic here must be:
> 
> create_lock_region()
> 	plr = alloc_plr();
> 	take_ref(plr);
> 	if (!init_plr(plr)) {
> 		drop_ref(plr);
> 		...
> 	}
> 
> lockdev_open(filp)
> 	take_ref(plr);
> 	filp->private = plr;
> 
> rmdir_lock_region()
> 	...
> 	drop_ref(plr);
> 
> lockdev_release(filp)
> 	filp->private = NULL;
> 	drop_ref(plr);
> 
>>  /*
>> + * Only one pseudo-locked region can be set up at a time and that is
>> + * enforced by taking the rdt_pseudo_lock_mutex when the user writes the
>> + * requested schemata to the resctrl file and releasing the mutex on
>> + * completion. The thread locking the kernel memory into the cache starts
>> + * and completes during this time so we can be sure that only one thread
>> + * can run at any time.
>> + * The functions starting the pseudo-locking thread needs to wait for its
>> + * completion and since there can only be one we have a global workqueue
>> + * and variable to support this.
>> + */
>> +static DECLARE_WAIT_QUEUE_HEAD(wq);
>> +static int thread_done;
> 
> Eew. For one, you really couldn't come up with more generic and less
> relatable variable names, right?
> 
> That aside, it's just wrong to build code based on current hardware
> limitations. The waitqueue and the result code belong into PLR.

Will do. This also builds on your previous suggestion to not limit the
number of uninitialized pseudo-locked regions.

> 
>> +/**
>> + * pseudo_lock_fn - Load kernel memory into cache
>> + *
>> + * This is the core pseudo-locking function.
>> + *
>> + * First we ensure that the kernel memory cannot be found in the cache.
>> + * Then, while taking care that there will be as little interference as
>> + * possible, each cache line of the memory to be loaded is touched while
>> + * core is running with class of service set to the bitmask of the
>> + * pseudo-locked region. After this is complete no future CAT allocations
>> + * will be allowed to overlap with this bitmask.
>> + *
>> + * Local register variables are utilized to ensure that the memory region
>> + * to be locked is the only memory access made during the critical locking
>> + * loop.
>> + */
>> +static int pseudo_lock_fn(void *_plr)
>> +{
>> +	struct pseudo_lock_region *plr = _plr;
>> +	u32 rmid_p, closid_p;
>> +	unsigned long flags;
>> +	u64 i;
>> +#ifdef CONFIG_KASAN
>> +	/*
>> +	 * The registers used for local register variables are also used
>> +	 * when KASAN is active. When KASAN is active we use a regular
>> +	 * variable to ensure we always use a valid pointer, but the cost
>> +	 * is that this variable will enter the cache through evicting the
>> +	 * memory we are trying to lock into the cache. Thus expect lower
>> +	 * pseudo-locking success rate when KASAN is active.
>> +	 */
> 
> I'm not a real fan of this mess. But well, 
> 
>> +	unsigned int line_size;
>> +	unsigned int size;
>> +	void *mem_r;
>> +#else
>> +	register unsigned int line_size asm("esi");
>> +	register unsigned int size asm("edi");
>> +#ifdef CONFIG_X86_64
>> +	register void *mem_r asm("rbx");
>> +#else
>> +	register void *mem_r asm("ebx");
>> +#endif /* CONFIG_X86_64 */
>> +#endif /* CONFIG_KASAN */
>> +
>> +	/*
>> +	 * Make sure none of the allocated memory is cached. If it is we
>> +	 * will get a cache hit in below loop from outside of pseudo-locked
>> +	 * region.
>> +	 * wbinvd (as opposed to clflush/clflushopt) is required to
>> +	 * increase likelihood that allocated cache portion will be filled
>> +	 * with associated memory
> 
> Sigh.
> 
>> +	 */
>> +	wbinvd();
>> +
>> +	preempt_disable();
>> +	local_irq_save(flags);
> 
> preempt_disable() is pointless when you disable interrupts.  And this
> really should be local_irq_disable(). This code is always called with
> interrupts enabled....
> 
>> +	/*
>> +	 * Call wrmsr and rdmsr as directly as possible to avoid tracing
>> +	 * clobbering local register variables or affecting cache accesses.
>> +	 */
> 
> You probably want to make sure that the code below is in L1 cache already
> before the CLOSID is set to the allocation. To do this you want to put the
> preload mechanics into a separate ASM function.
> 
> Then you run it with size = 1 on some other temporary memory buffer with
> the default CLOSID, which has the CBM bits of the lock region excluded.
> Then switch to the real CLOSID and run the loop with the real buffer and
> the real size.

Thank you for the suggestion. I will experiment how this affects the
pseudo-locked region success.

>> +	__wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
> 
> This wants an explanation why the prefetcher needs to be disabled.
> 
>> +static int pseudo_lock_doit(struct pseudo_lock_region *plr,
>> +			    struct rdt_resource *r,
>> +			    struct rdt_domain *d)
>> +{
>> +	struct task_struct *thread;
>> +	int closid;
>> +	int ret, i;
>> +
>> +	/*
>> +	 * With the usage of wbinvd we can only support one pseudo-locked
>> +	 * region per domain at this time.
> 
> This really sucks.
> 
>> +	 */
>> +	if (d->plr) {
>> +		rdt_last_cmd_puts("pseudo-locked region exists on cache\n");
>> +		return -ENOSPC;
> 
> This check is not sufficient for a CPU which has both L2 and L3 allocation
> capability. If there is already a L3 locked region and the current call
> sets up a L2 locked region then this will not catch it and the following
> wbinvd will wipe the L3 locked region ....
> 
>> +	}
>> +
>> +	ret = pseudo_lock_region_init(plr, r, d);
>> +	if (ret < 0)
>> +		return ret;
>> +
>> +	closid = closid_alloc();
>> +	if (closid < 0) {
>> +		ret = closid;
>> +		rdt_last_cmd_puts("unable to obtain free closid\n");
>> +		goto out_region;
>> +	}
>> +
>> +	/*
>> +	 * Ensure we end with a valid default CLOS. If a pseudo-locked
>> +	 * region in middle of possible bitmasks is selected it will split
>> +	 * up default CLOS which would be a fault and for which handling
>> +	 * is unclear so we fail back to userspace. Validation will also
>> +	 * ensure that default CLOS is not zero, keeping some cache
>> +	 * available to rest of system.
>> +	 */
>> +	if (!cbm_validate_val(d->ctrl_val[0] & ~plr->cbm, r)) {
>> +		ret = -EINVAL;
>> +		rdt_last_cmd_printf("bm 0x%x causes invalid clos 0 bm 0x%x\n",
>> +				    plr->cbm, d->ctrl_val[0] & ~plr->cbm);
>> +		goto out_closid;
>> +	}
>> +
>> +	ret = pseudo_lock_clos_set(plr, 0, d->ctrl_val[0] & ~plr->cbm);
> 
> Fiddling with the default CBM is wrong. The lock operation should only
> succeed when the bits in that domain are not used by _ANY_ control group
> including the default one. This is a reasonable constraint.

This changes one of my original assumptions. I will rework all to adjust
since your later design change suggestions will impact this.

>> +	if (ret < 0) {
>> +		rdt_last_cmd_printf("unable to set clos 0 bitmask to 0x%x\n",
>> +				    d->ctrl_val[0] & ~plr->cbm);
>> +		goto out_closid;
>> +	}
>> +
>> +	ret = pseudo_lock_clos_set(plr, closid, plr->cbm);
>> +	if (ret < 0) {
>> +		rdt_last_cmd_printf("unable to set closid %d bitmask to 0x%x\n",
>> +				    closid, plr->cbm);
>> +		goto out_clos_def;
>> +	}
>> +
>> +	plr->closid = closid;
>> +
>> +	thread_done = 0;
>> +
>> +	thread = kthread_create_on_node(pseudo_lock_fn, plr,
>> +					cpu_to_node(plr->cpu),
>> +					"pseudo_lock/%u", plr->cpu);
>> +	if (IS_ERR(thread)) {
>> +		ret = PTR_ERR(thread);
>> +		rdt_last_cmd_printf("locking thread returned error %d\n", ret);
>> +		/*
>> +		 * We do not return CBM to newly allocated CLOS here on
>> +		 * error path since that will result in a CBM of all
>> +		 * zeroes which is an illegal MSR write.
> 
> I'm not sure what you are trying to explain here.
> 
> If you remove a ctrl group then the CBM bits are not added to anything
> either. It's up to the operator to handle that. Why would this be any
> different for the pseudo-locking stuff?

It is not different, no. On failure the closid is released but the CBM
associated with it remains. Here I attempted to explain why the CBM
remains. This is the same behavior as current CAT. I will remove the
comment since it is just causing confusion.

>> +		 */
>> +		goto out_clos_def;
>> +	}
>> +
>> +	kthread_bind(thread, plr->cpu);
>> +	wake_up_process(thread);
>> +
>> +	ret = wait_event_interruptible(wq, thread_done == 1);
>> +	if (ret < 0) {
>> +		rdt_last_cmd_puts("locking thread interrupted\n");
>> +		goto out_clos_def;
> 
> This is broken. If the thread does not get on the CPU for whatever reason
> and the process which sets up the region is interrupted then this will
> leave the thread in runnable state and once it gets on the CPU it will
> happily dereference the freed plr struct and fiddle with the freed memory.
> 
> You need to make sure that the thread holds a reference on the plr struct,
> which prevents freeing. That includes the CLOSID .....

Thanks for catching this.

> 
>> +	}
>> +
>> +	/*
>> +	 * closid will be released soon but its CBM as well as CBM of not
>> +	 * yet allocated CLOS as stored in the array will remain. Ensure
>> +	 * that CBM will be what is currently the default CLOS, which
>> +	 * excludes pseudo-locked region.
>> +	 */
>> +	for (i = 1; i < r->num_closid; i++) {
>> +		if (i == closid || !closid_allocated(i))
>> +			pseudo_lock_clos_set(plr, i, d->ctrl_val[0]);
>> +	}
> 
> This is all magical duct tape. The overall design of this is sideways and
> not really well integrated into the existing infrastructure which creates
> these kinds of magic warts and lots of duplicated code.
> 
> The deeper I read into the patch series the less I like that interface and
> the implementation.
> 
> Let's look at the existing ctrl/mon groups which are each represented by a
> directory already.
> 
>  - Adding a 'size' file to the ctrl groups would be a natural extension
>    which makes sense for regular cache allocations as well.
> 
>  - Adding an 'exclusive' flag would be an interesting feature even for the
>    normal use case. Marking a group as exclusive prevents other groups from
>    requesting CBM bits which are held by an exclusive allocation.
> 
>    I'd suggest to have a file 'mode' for controlling this. The valid values
>    would be something like 'shareable' and 'exclusive'.
> 
>    When trying to set a group to exclusive mode then the schemata has to be
>    checked for overlaps with the other schematas and in case of conflict
>    the write fails. Once enabled subsequent writes to the schemata file
>    need to be checked for conflicts as well.
> 
>    If the exclusive setting is enabled then the CBM bits of that group
>    are excluded from being used in other control groups.
> 
> Aside of that a file in the info directory which shows the (un)used CBM
> bits of all groups is really helpful for controlling all of that (even w/o
> pseudo locking). You have this in the 'avail' file, but there is no reason
> why this should only be available for pseudo locking enabled systems.
> 
> Now for the pseudo locking part.
> 
> What you need on top of the above is a new 'mode': 'locked'. That mode
> utilizes the 'exclusive' mode rules vs. conflict checking and the
> protection against allocating the associated CBM bits in other control
> groups.
> 
> The setup would be like this:
> 
>     mkdir group
>     echo '$CONFIG' >group/schemata
>     echo 'locked' >group/mode
> 
> Setting mode to locked locks down the schemata file along with the
> task/cpus/cpus_list files. The task/cpu files need to be empty when
> entering locked mode, otherwise the operation fails. I would not even
> bother handing back the CLOSID. For simplicity the CLOSID should just stay
> associated with the control group until it is destroyed as any other
> control group.

Thank you so much for taking the time to do this thorough review and to
make these suggestions. While I am still digesting the details I do
intend to follow all (as well as the ones earlier I did not explicitly
respond to).

Keeping the CLOSID associated with the pseudo-locked region will surely
make the above simpler since CLOSIDs are associated with resource
groups (represented by the directories). I would like to highlight that
on some platforms there are only a few (for example, 4) CLOSIDs
available. Not releasing a CLOSID would thus reduce available CLOSIDs
that are already limited. These platforms do have smaller possible
bitmasks though (for example, 8 possible bits), which may lessen this
concern. I thus just note it as an informational consequence of this
simplification.

> Now the remaining thing is the memory allocation and the mmap itself. I
> really dislike the preallocation of memory right at setup time. Ideally
> that should be an allocation of the application itself, but the horrid
> wbinvd stuff kinda prevents that. With that restriction we are more or less
> bound to immediate allocation and population.

Acknowledged. I am not sure if the current permissions would support
such a dynamic setup though. At this time the system administrator is
the one that sets up the pseudo-locked region and can through
permissions of the character device provide access to these regions to
user space applications.

Reinette

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH V2 13/22] x86/intel_rdt: Support schemata write - pseudo-locking core
  2018-02-20 18:47     ` Reinette Chatre
@ 2018-02-20 23:21       ` Thomas Gleixner
  2018-02-21  1:58         ` Mike Kravetz
  2018-02-21  5:58         ` Reinette Chatre
  0 siblings, 2 replies; 65+ messages in thread
From: Thomas Gleixner @ 2018-02-20 23:21 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: fenghua.yu, tony.luck, gavin.hindman, vikas.shivappa,
	dave.hansen, mingo, hpa, x86, linux-kernel

Reinette,

On Tue, 20 Feb 2018, Reinette Chatre wrote:
> On 2/20/2018 9:15 AM, Thomas Gleixner wrote:
> > On Tue, 13 Feb 2018, Reinette Chatre wrote:
> > 
> > Are you really sure that the life time rules of plr are correct vs. an
> > application which still has the locked memory mapped? i.e. the following
> > operation:
> 
> You are correct. I am not preventing an administrator from removing the
> pseudo-locked region if it is in use. I will fix that.

The removal is fine and you cannot prevent it w/o introducing a mess, but
you have to make sure that the PLR and the mapped memory are not
vanishing. The refcount rules I outlined are exactly doing that.

> Thank you so much for taking the time to do this thorough review and to
> make these suggestions. While I am still digesting the details I do
> intend to follow all (as well as the ones earlier I did not explicitly
> respond to).

Make your mind up and tell me where I'm wrong before you implement the crap
I suggested blindly, as that will just cause the next reviewer (me or
someone else) to tell _you_ that it is crap :)

> Keeping the CLOSID associated with the pseudo-locked region will surely
> make the above simpler since CLOSIDs are associated with resource
> groups (represented by the directories). I would like to highlight that
> on some platforms there are only a few (for example, 4) CLOSIDs
> available. Not releasing a CLOSID would thus reduce available CLOSIDs
> that are already limited. These platforms do have smaller possible
> bitmasks though (for example, 8 possible bits), which may lessen this
> concern. I thus just note it as an informational consequence of this
> simplification.

Yes. If you have 4 CLOSIDs and only 8 CBM bits it really does not matter
much.

> > Now the remaining thing is the memory allocation and the mmap itself. I
> > really dislike the preallocation of memory right at setup time. Ideally
> > that should be an allocation of the application itself, but the horrid
> > wbinvd stuff kinda prevents that. With that restriction we are more or less
> > bound to immediate allocation and population.
> 
> Acknowledged. I am not sure if the current permissions would support
> such a dynamic setup though. At this time the system administrator is
> the one that sets up the pseudo-locked region and can through
> permissions of the character device provide access to these regions to
> user space applications.

You still would need some interface, e.g. character device which allows you
to hand in the pointer to the user allocated memory and do the cache
priming. So you could use the same permission setup for that character
device.

The other problem is that we'd need to have MAP_CONTIG first so you
actually can allocate physically contiguous memory from user space. Mike is
working on that, but it's not available today. The only way to do so today
(with lots of waste) would be MAP_HUGETLB, which might be an acceptable
constraint up to the point where MAP_CONTIG is available.
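
Purely as an illustration of such an interface, a sketch of an ioctl on the
lock character device that accepts a user supplied buffer for priming; the
ioctl number, the request structure and the pseudo_lock_prime_user() helper
are all assumptions:

#include <linux/fs.h>
#include <linux/ioctl.h>
#include <linux/types.h>
#include <linux/uaccess.h>

struct pseudo_lock_prime_req {
        __u64 addr;     /* user virtual address of the buffer */
        __u64 size;     /* size in bytes, must be physically contiguous */
};

#define PSEUDO_LOCK_IOC_PRIME   _IOW('P', 1, struct pseudo_lock_prime_req)

/* Assumed helper: pin the pages, verify contiguity, prime the cache. */
long pseudo_lock_prime_user(struct pseudo_lock_region *plr, u64 addr, u64 size);

static long lockdev_ioctl(struct file *filp, unsigned int cmd,
                          unsigned long arg)
{
        struct pseudo_lock_region *plr = filp->private_data;
        struct pseudo_lock_prime_req req;

        if (cmd != PSEUDO_LOCK_IOC_PRIME)
                return -ENOTTY;
        if (copy_from_user(&req, (void __user *)arg, sizeof(req)))
                return -EFAULT;

        return pseudo_lock_prime_user(plr, req.addr, req.size);
}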

Though this all depends on the ability to remove the wbinvd
requirement. But even if we can remove that we'd still need to be aware
that the cache priming loop which needs to run with interrupts disabled is
expensive as well and can introduce undesired latencies. Needs all some
thought...

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH V2 13/22] x86/intel_rdt: Support schemata write - pseudo-locking core
  2018-02-20 23:21       ` Thomas Gleixner
@ 2018-02-21  1:58         ` Mike Kravetz
  2018-02-21  6:10           ` Reinette Chatre
  2018-02-21  8:34           ` Thomas Gleixner
  2018-02-21  5:58         ` Reinette Chatre
  1 sibling, 2 replies; 65+ messages in thread
From: Mike Kravetz @ 2018-02-21  1:58 UTC (permalink / raw)
  To: Thomas Gleixner, Reinette Chatre
  Cc: fenghua.yu, tony.luck, gavin.hindman, vikas.shivappa,
	dave.hansen, mingo, hpa, x86, linux-kernel

On 02/20/2018 03:21 PM, Thomas Gleixner wrote:
> On Tue, 20 Feb 2018, Reinette Chatre wrote:
>> On 2/20/2018 9:15 AM, Thomas Gleixner wrote:
>>> On Tue, 13 Feb 2018, Reinette Chatre wrote:
>>>
>>> Now the remaining thing is the memory allocation and the mmap itself. I
>>> really dislike the preallocation of memory right at setup time. Ideally
>>> that should be an allocation of the application itself, but the horrid
>>> wbinvd stuff kinda prevents that. With that restriction we are more or less
>>> bound to immediate allocation and population.
>>
>> Acknowledged. I am not sure if the current permissions would support
>> such a dynamic setup though. At this time the system administrator is
>> the one that sets up the pseudo-locked region and can through
>> permissions of the character device provide access to these regions to
>> user space applications.
> 
> You still would need some interface, e.g. character device which allows you
> to hand in the pointer to the user allocated memory and do the cache
> priming. So you could use the same permission setup for that character
> device.
> 
> The other problem is that we'd need to have MAP_CONTIG first so you
>> actually can allocate physically contiguous memory from user space. Mike is
> working on that, but it's not available today. The only way to do so today
> (with lots of waste) would be MAP_HUGETLB, which might be an acceptable
> constraint up to the point where MAP_CONTIG is available.

Just to clarify, there is not any activity on exposing a general purpose
MAP_CONTIG interface to user space.  When initially proposed, MAP_CONTIG
was shot down and the suggestion was to create a new in kernel interface
to make allocation of contiguous pages easier.  The initial use case was
a driver which could use the new internal interface as part of its
mmap() routine to give contiguous regions to user space.

Reinette is using this new interface, but that must be for the "immediate
allocation" case you are trying to move away from.  Sorry, I have not been
following development of this feature.

If you would have to create a device to accept a user buffer, could you
perhaps use the same device to create/hand out a contiguous mapping?
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH V2 13/22] x86/intel_rdt: Support schemata write - pseudo-locking core
  2018-02-20 23:21       ` Thomas Gleixner
  2018-02-21  1:58         ` Mike Kravetz
@ 2018-02-21  5:58         ` Reinette Chatre
  1 sibling, 0 replies; 65+ messages in thread
From: Reinette Chatre @ 2018-02-21  5:58 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: fenghua.yu, tony.luck, gavin.hindman, vikas.shivappa,
	dave.hansen, mingo, hpa, x86, linux-kernel

Hi Thomas,

On 2/20/2018 3:21 PM, Thomas Gleixner wrote:
> On Tue, 20 Feb 2018, Reinette Chatre wrote:
>> On 2/20/2018 9:15 AM, Thomas Gleixner wrote:
>>> On Tue, 13 Feb 2018, Reinette Chatre wrote:
>>>
>>> Are you really sure that the life time rules of plr are correct vs. an
>>> application which still has the locked memory mapped? i.e. the following
>>> operation:
>>
>> You are correct. I am not preventing an administrator from removing the
>> pseudo-locked region if it is in use. I will fix that.
> 
> The removal is fine and you cannot prevent it w/o introducing a mess, but
> you have to make sure that the PLR and the mapped memory are not
> vanishing. The refcount rules I outlined are exactly doing that.

Thank you for catching my misunderstanding. Will do.

>> Thank you so much for taking the time to do this thorough review and to
>> make these suggestions. While I am still digesting the details I do
>> intend to follow all (as well as the ones earlier I did not explicitly
>> respond to).
> 
> Make your mind up and tell me where I'm wrong before you implement the crap
> I suggested blindly, as that will just cause the next reviewer (me or
> someone else) to tell _you_ that it is crap :)

Will do. I need more time to digest your suggestions because your
thorough review provided me plenty to consider.

>> Keeping the CLOSID associated with the pseudo-locked region will surely
>> make the above simpler since CLOSIDs are associated with resource
>> groups (represented by the directories). I would like to highlight that
>> on some platforms there are only a few (for example, 4) CLOSIDs
>> available. Not releasing a CLOSID would thus reduce available CLOSIDs
>> that are already limited. These platforms do have smaller possible
>> bitmasks though (for example, 8 possible bits), which may lessen this
>> concern. I thus just note it as an informational consequence of this
>> simplification.
> 
> Yes. If you have 4 CLOSIDs and only 8 CBM bits it really does not matter
> much.
> 
>>> Now the remaining thing is the memory allocation and the mmap itself. I
>>> really dislike the preallocation of memory right at setup time. Ideally
>>> that should be an allocation of the application itself, but the horrid
>>> wbinvd stuff kinda prevents that. With that restriction we are more or less
>>> bound to immediate allocation and population.
>>
>> Acknowledged. I am not sure if the current permissions would support
>> such a dynamic setup though. At this time the system administrator is
>> the one that sets up the pseudo-locked region and can through
>> permissions of the character device provide access to these regions to
>> user space applications.
> 
> You still would need some interface, e.g. character device which allows you
> to hand in the pointer to the user allocated memory and do the cache
> priming. So you could use the same permission setup for that character
> device.
> 
> The other problem is that we'd need to have MAP_CONTIG first so you
> actually can allocate physically contiguous memory from user space. Mike is
> working on that, but it's not available today. The only way to do so today
> (with lots of waste) would be MAP_HUGETLB, which might be an acceptable
> constraint up to the point where MAP_CONTIG is available.

I recorded this in a pseudo-locking task list as something to consider
if the wbinvd requirement goes away at some point.

> Though this all depends on the ability to remove the wbinvd
> requirement. But even if we can remove that we'd still need to be aware
> that the cache priming loop which needs to run with interrupts disabled is
> expensive as well and can introduce undesired latencies. Needs all some
> thought...

Reinette

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH V2 13/22] x86/intel_rdt: Support schemata write - pseudo-locking core
  2018-02-21  1:58         ` Mike Kravetz
@ 2018-02-21  6:10           ` Reinette Chatre
  2018-02-21  8:34           ` Thomas Gleixner
  1 sibling, 0 replies; 65+ messages in thread
From: Reinette Chatre @ 2018-02-21  6:10 UTC (permalink / raw)
  To: Mike Kravetz, Thomas Gleixner
  Cc: fenghua.yu, tony.luck, gavin.hindman, vikas.shivappa,
	dave.hansen, mingo, hpa, x86, linux-kernel

Hi Mike,

On 2/20/2018 5:58 PM, Mike Kravetz wrote:
> On 02/20/2018 03:21 PM, Thomas Gleixner wrote:
>> On Tue, 20 Feb 2018, Reinette Chatre wrote:
>>> On 2/20/2018 9:15 AM, Thomas Gleixner wrote:
>>>> On Tue, 13 Feb 2018, Reinette Chatre wrote:
>>>>
>>>> Now the remaining thing is the memory allocation and the mmap itself. I
>>>> really dislike the preallocation of memory right at setup time. Ideally
>>>> that should be an allocation of the application itself, but the horrid
>>>> wbinvd stuff kinda prevents that. With that restriction we are more or less
>>>> bound to immediate allocation and population.
>>>
>>> Acknowledged. I am not sure if the current permissions would support
>>> such a dynamic setup though. At this time the system administrator is
>>> the one that sets up the pseudo-locked region and can through
>>> permissions of the character device provide access to these regions to
>>> user space applications.
>>
>> You still would need some interface, e.g. character device which allows you
>> to hand in the pointer to the user allocated memory and do the cache
>> priming. So you could use the same permission setup for that character
>> device.
>>
>> The other problem is that we'd need to have MAP_CONTIG first so you
>> actually can allocate physically contiguous memory from user space. Mike is
>> working on that, but it's not available today. The only way to do so today
>> (with lots of waste) would be MAP_HUGETLB, which might be an acceptable
>> constraint up to the point where MAP_CONTIG is available.
> 
> Just to clarify, there is not any activity on exposing a general purpose
> MAP_CONTIG interface to user space.  When initially proposed, MAP_CONTIG
> was shot down and the suggestion was to create a new in kernel interface
> to make allocation of contiguous pages easier.  The initial use case was
>> a driver which could use the new internal interface as part of its
> mmap() routine to give contiguous regions to user space.
> 
>> Reinette is using this new interface, but that must be for the "immediate
>> allocation" case you are trying to move away from.  Sorry, I have not been
> following development of this feature.
> 
> If you would have to create a device to accept a user buffer, could you
> perhaps use the same device to create/hand out a contiguous mapping?

Thank you very much for keeping an eye on this discussion. I do still
intend to implement the immediate allocation case by using the new
find_alloc_contig_pages()/free_contig_pages().

Reinette

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH V2 13/22] x86/intel_rdt: Support schemata write - pseudo-locking core
  2018-02-21  1:58         ` Mike Kravetz
  2018-02-21  6:10           ` Reinette Chatre
@ 2018-02-21  8:34           ` Thomas Gleixner
  1 sibling, 0 replies; 65+ messages in thread
From: Thomas Gleixner @ 2018-02-21  8:34 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Reinette Chatre, fenghua.yu, tony.luck, gavin.hindman,
	vikas.shivappa, dave.hansen, mingo, hpa, x86, linux-kernel

On Tue, 20 Feb 2018, Mike Kravetz wrote:
> On 02/20/2018 03:21 PM, Thomas Gleixner wrote:
> > The other problem is that we'd need to have MAP_CONTIG first so you
> > actually can allocate physically contiguous memory from user space. Mike is
> > working on that, but it's not available today. The only way to do so today
> > (with lots of waste) would be MAP_HUGETLB, which might be an acceptable
> > constraint up to the point where MAP_CONTIG is available.
> 
> Just to clarify, there is not any activity on exposing a general purpose
> MAP_CONTIG interface to user space.  When initially proposed, MAP_CONTIG
> was shot down and the suggestion was to create a new in kernel interface
> to make allocation of contiguous pages easier.  The initial use case was
> a driver which could use the new internal interface as part of its
> mmap() routine to give contiguous regions to user space.

Thanks for the clarification.

> Reinette is using this new interface, but that must be for the ?immediate
> allocation? case you are trying to move away from.  Sorry, I have not been
> following development of this feature.
> 
> If you would have to create a device to accept a user buffer, could you
> perhaps use the same device to create/hand out a contiguous mapping?

Yes, that's straightforward.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH V2 13/22] x86/intel_rdt: Support schemata write - pseudo-locking core
  2018-02-20 17:15   ` Thomas Gleixner
  2018-02-20 18:47     ` Reinette Chatre
@ 2018-02-27  0:34     ` Reinette Chatre
  2018-02-27 10:36       ` Thomas Gleixner
  2018-02-27 21:01     ` Reinette Chatre
  2 siblings, 1 reply; 65+ messages in thread
From: Reinette Chatre @ 2018-02-27  0:34 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: fenghua.yu, tony.luck, gavin.hindman, vikas.shivappa,
	dave.hansen, mingo, hpa, x86, linux-kernel

Hi Thomas,

On 2/20/2018 9:15 AM, Thomas Gleixner wrote:
> Let's look at the existing ctrl/mon groups which are each represented by a
> directory already.
> 
>  - Adding a 'size' file to the ctrl groups would be a natural extension
>    which makes sense for regular cache allocations as well.
> 
>  - Adding an 'exclusive' flag would be an interesting feature even for the
>    normal use case. Marking a group as exclusive prevents other groups from
>    requesting CBM bits which are held by an exclusive allocation.
> 
>    I'd suggest to have a file 'mode' for controlling this. The valid values
>    would be something like 'shareable' and 'exclusive'.
> 
>    When trying to set a group to exclusive mode then the schemata has to be
>    checked for overlaps with the other schematas and in case of conflict
>    the write fails. Once enabled subsequent writes to the schemata file
>    need to be checked for conflicts as well.
> 
>    If the exclusive setting is enabled then the CBM bits of that group
>    are excluded from being used in other control groups.
> 
> Aside of that a file in the info directory which shows the (un)used CBM
> bits of all groups is really helpful for controlling all of that (even w/o
> pseudo locking). You have this in the 'avail' file, but there is no reason
> why this should only be available for pseudo locking enabled systems.
> 
> Now for the pseudo locking part.
> 
> What you need on top of the above is a new 'mode': 'locked'. That mode
> utilizes the 'exclusive' mode rules vs. conflict checking and the
> protection against allocating the associated CBM bits in other control
> groups.
> 
> The setup would be like this:
> 
>     mkdir group
>     echo '$CONFIG' >group/schemata
>     echo 'locked' >group/mode
> 
> Setting mode to locked locks down the schemata file along with the
> task/cpus/cpus_list files. The task/cpu files need to be empty when
> entering locked mode, otherwise the operation fails. I would not even
> bother handing back the CLOSID. For simplicity the CLOSID should just stay
> associated with the control group until it is destroyed as any other
> control group.

I started looking at how this implementation may look and would like to
confirm with you that your intentions behind the new "exclusive" and
"locked" modes can be maintained. I also have a few questions.

Focusing on CAT a resource group represents a closid across all domains
(cache instances) of all resources (cache layers) on the system. A full
schemata reflecting the active bitmask associated with this closid for
each domain of each resource is maintained. The current implementation
supports partial writes to the schemata, with the assumption that only
the changed values need to be updated, the others remain as is. For the
current implementation this works well since what is shown by schemata
reflects current hardware settings and what is written to schemata will
change current hardware settings. This is done irrespective of any
overlap between bitmasks of different closids (the "shareable" mode).

A change to start us off with could be to initialize the schemata with
all the shareable and unused bits set for all domains when a new
resource group is created.

Moving to "exclusive" mode it appears that, when enabled for a resource
group, all domains of all resources are forced to have an "exclusive"
region associated with this resource group (closid). This is because the
schemata reflects the hardware settings of all resources and their
domains and the hardware does not accept a "zero" bitmask. A user thus
cannot just specify a single region of a particular cache instance as
"exclusive". Does this match your intention wrt "exclusive"?

Moving on to the "locked" mode. We cannot support different
pseudo-locked regions across multiple resources (eg. L2 and L3). In
fact, if we would at some point in the future then a pseudo-locked
region on one resource could implicitly span a second resource.
Additionally, we would like to enable a user to enable a single
pseudo-locked region on a single cache instance.

From the above it follows that "locked" mode cannot just simply build on
top of "exclusive" mode rules (as I expressed them above) since it
cannot enforce a locked region on each domain of each resource.

We would like to support something like (as you also have in your example):

mkdir group
echo "L2:1=0x3" > schemata
echo locked > mode

The above should only pseudo-lock the indicated region and not touch any
other domain. The problem is that the schemata always contain non-zero
bitmasks for all domains so at the time "locked" is written it is not
known which cache region needs to be locked. I am currently unable to
see a simple way to build on top of the current schemata design to
support the "locked" mode as you intended. It does seem as though the
user's intention to create a pseudo-locked region needs to be
communicated before the schemata is written, but from what I understand
this does not seem to be supported by the mode/schemata combination.
Please do correct me where I am wrong.

To continue, when we overcome the above obstacle:
A scenario could be where a single resource group will contain all the
pseudo-locked regions (to avoid wasting closids). It is not clear to me
how to easily support such a usage though since the way writes to the
schemata is done is "changes only". If for example, two pseudo-locked
regions exists:

# mkdir group
# echo "L2:1=0x3" > schemata
# echo locked > mode
# cat schemata
L2:1=0x3
# echo "L2:0=0xf" > schemata
# cat schemata
L2:0=0xf;1=0x3

How can the user remove one of the pseudo-locked regions without
affecting the other? Could we perhaps allow zero bitmask writes when a
region is locked?

Another point I would like to highlight is that when we talked about
keeping the closid associated with the pseudo-locked region I mentioned
that some resources may have few closids (for example, 4). As discussed
this seems ok when there are only 8 bits in the bitmask. What I did not
highlight at that time is that the closids are limited to the smallest
number supported by all resources. So, if this same platform has a
second resource (with more bits in a bitmask) with more closids, they
would also be limited to 4. In this case it does seem removing a closid
from service would have bigger impact.

Reinette

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH V2 13/22] x86/intel_rdt: Support schemata write - pseudo-locking core
  2018-02-27  0:34     ` Reinette Chatre
@ 2018-02-27 10:36       ` Thomas Gleixner
  2018-02-27 15:38         ` Thomas Gleixner
  2018-02-27 19:52         ` Reinette Chatre
  0 siblings, 2 replies; 65+ messages in thread
From: Thomas Gleixner @ 2018-02-27 10:36 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: fenghua.yu, tony.luck, gavin.hindman, vikas.shivappa,
	dave.hansen, mingo, hpa, x86, linux-kernel

Reinette,

On Mon, 26 Feb 2018, Reinette Chatre wrote:
> I started looking at how this implementation may look and would like to
> confirm with you that your intentions behind the new "exclusive" and
> "locked" modes can be maintained. I also have a few questions.

Phew :)

> Focusing on CAT a resource group represents a closid across all domains
> (cache instances) of all resources (cache layers) on the system. A full
> schemata reflecting the active bitmask associated with this closid for
> each domain of each resource is maintained. The current implementation
> supports partial writes to the schemata, with the assumption that only
> the changed values need to be updated, the others remain as is. For the
> current implementation this works well since what is shown by schemata
> reflects current hardware settings and what is written to schemata will
> change current hardware settings. This is done irrespective of any
> overlap between bitmasks of different closids (the "shareable" mode).

Right.

> A change to start us off with could be to initialize the schemata with
> all the shareable and unused bits set for all domains when a new
> resource group is created.

The new resource group initialization is the least of my worries. The
current mode is to use the default group setting, right?

> Moving to "exclusive" mode it appears that, when enabled for a resource
> group, all domains of all resources are forced to have an "exclusive"
> region associated with this resource group (closid). This is because the
> schemata reflects the hardware settings of all resources and their
> domains and the hardware does not accept a "zero" bitmask. A user thus
> cannot just specify a single region of a particular cache instance as
> "exclusive". Does this match your intention wrt "exclusive"?

Interesting question. I really did not think about that yet. 

> Moving on to the "locked" mode. We cannot support different
> pseudo-locked regions across multiple resources (eg. L2 and L3). In
> fact, if we would at some point in the future then a pseudo-locked
> region on one resource could implicitly span a second resource.
> Additionally, we would like to enable a user to enable a single
> pseudo-locked region on a single cache instance.
> 
> From the above it follows that "locked" mode cannot just simply build on
> top of "exclusive" mode rules (as I expressed them above) since it
> cannot enforce a locked region on each domain of each resource.
> 
> We would like to support something like (as you also have in your example):
> 
> mkdir group
> echo "L2:1=0x3" > schemata
> echo locked > mode
> 
> The above should only pseudo-lock the indicated region and not touch any
> other domain. The problem is that the schemata always contain non-zero
> bitmasks for all domains so at the time "locked" is written it is not
> known which cache region needs to be locked. I am currently unable to
> see a simple way to build on top of the current schemata design to
> support the "locked" mode as you intended. It does seem as though the
> user's intention to create a pseudo-locked region needs to be
> communicated before the schemata is written, but from what I understand
> this does not seem to be supported by the mode/schemata combination.
> Please do correct me where I am wrong.

You could make it:

echo locksetup > mode
echo $CONF > schemata
echo locked > mode

Or something like that.

> To continue, when we overcome the above obstacle:
> A scenario could be where a single resource group will contain all the
> pseudo-locked regions (to avoid wasting closids). It is not clear to me
> how to easily support such a usage though since the way writes to the
> schemata is done is "changes only". If for example, two pseudo-locked
> regions exists:
> 
> # mkdir group
> # echo "L2:1=0x3" > schemata
> # echo locked > mode
> # cat schemata
> L2:1=0x3
> # echo "L2:0=0xf" > schemata
> # cat schemata
> L2:0=0xf;1=0x3
> 
> How can the user remove one of the pseudo-locked regions without
> affecting the other? Could we perhaps allow zero bitmask writes when a
> region is locked?

That might work. Though it looks hacky.

> Another point I would like to highlight is that when we talked about
> keeping the closid associated with the pseudo-locked region I mentioned
> that some resources may have few closids (for example, 4). As discussed
> this seems ok when there are only 8 bits in the bitmask. What I did not
> highlight at that time is that the closids are limited to the smallest
> number supported by all resources. So, if this same platform has a
> second resource (with more bits in a bitmask) with more closids, they
> would also be limited to 4. In this case it does seem removing a closid
> from service would have bigger impact.

Is that a real issue or just an academic exercise? Let's assume it's real,
so you could do the following:

mkdir group		<- acquires closid
echo locksetup > mode	<- Creates 'lockarea' file
echo L2:0 > lockarea
echo 'L2:0=0xf' > schemata
echo locked > mode	<- locks down all files, does the lock setup
     	      		   and drops closid

That would solve quite some of the other issues as well. Hmm?

Thanks,

	tglx


* Re: [RFC PATCH V2 13/22] x86/intel_rdt: Support schemata write - pseudo-locking core
  2018-02-27 10:36       ` Thomas Gleixner
@ 2018-02-27 15:38         ` Thomas Gleixner
  2018-02-27 19:52         ` Reinette Chatre
  1 sibling, 0 replies; 65+ messages in thread
From: Thomas Gleixner @ 2018-02-27 15:38 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: fenghua.yu, tony.luck, gavin.hindman, vikas.shivappa,
	dave.hansen, mingo, hpa, x86, linux-kernel

On Tue, 27 Feb 2018, Thomas Gleixner wrote:
> On Mon, 26 Feb 2018, Reinette Chatre wrote:
> > Moving to "exclusive" mode it appears that, when enabled for a resource
> > group, all domains of all resources are forced to have an "exclusive"
> > region associated with this resource group (closid). This is because the
> > schemata reflects the hardware settings of all resources and their
> > domains and the hardware does not accept a "zero" bitmask. A user thus
> > cannot just specify a single region of a particular cache instance as
> > "exclusive". Does this match your intention wrt "exclusive"?
> 
> Interesting question. I really did not think about that yet. 

Actually we could solve that problem similar to the locked one and share
most of the functionality:

mkdir group
echo exclusive > mode
echo L3:0 > restrict

and for locked:

mkdir group
echo locksetup > mode
echo L2:0 > restrict
echo 'L2:0=0xf' > schemata
echo locked > mode

The 'restrict' file (feel free to come up with a better name) is only
available/writeable in exclusive and locksetup mode. In case of exclusive
mode it can contain several domains/resources, but in locked mode it's only
allowed to contain a single domain/resource.

A write to schemata for exclusive or locksetup mode will apply the
exclusiveness restrictions only to the resources/domains selected in the
'restrict' file. 
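
Something like this is what I mean, as a sketch only (names made up, not
meant as the actual implementation):

def validate_schemata_write(restrict, new_cbms, other_cbms):
    # Exclusiveness is only enforced for the domains listed in the
    # 'restrict' file; writes to any other domain are applied as before.
    for dom, cbm in new_cbms.items():        # dom = (resource, domain id)
        if dom not in restrict:
            continue
        for other in other_cbms.get(dom, []):
            if cbm & other:
                return False                 # overlaps another closid
    return True

# Example: only L2 domain 0 is restricted, so the write to L2 domain 1
# is not checked for overlap.
ok = validate_schemata_write({("L2", 0)},
                             {("L2", 0): 0xf, ("L2", 1): 0x3},
                             {("L2", 0): [0xf0], ("L2", 1): [0x3]})
print(ok)   # True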

Thanks,

	tglx


* Re: [RFC PATCH V2 13/22] x86/intel_rdt: Support schemata write - pseudo-locking core
  2018-02-27 10:36       ` Thomas Gleixner
  2018-02-27 15:38         ` Thomas Gleixner
@ 2018-02-27 19:52         ` Reinette Chatre
  2018-02-27 21:33           ` Reinette Chatre
  2018-02-28 18:39           ` Thomas Gleixner
  1 sibling, 2 replies; 65+ messages in thread
From: Reinette Chatre @ 2018-02-27 19:52 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: fenghua.yu, tony.luck, gavin.hindman, vikas.shivappa,
	dave.hansen, mingo, hpa, x86, linux-kernel

Hi Thomas,

On 2/27/2018 2:36 AM, Thomas Gleixner wrote:
> On Mon, 26 Feb 2018, Reinette Chatre wrote:
>> A change to start us off with could be to initialize the schemata with
>> all the shareable and unused bits set for all domains when a new
>> resource group is created.
> 
> The new resource group initialization is the least of my worries. The
> current mode is to use the default group setting, right?

No. When a new group is created a closid is assigned to it. The schemata
it is initialized with is the schemata the previous group with the same
closid had. At the beginning, yes, it is the default, but later you get
something like this:

# mkdir asd
# cat asd/schemata
L2:0=ff;1=ff
# echo 'L2:0=0xf;1=0xfc' > asd/schemata
# cat asd/schemata
L2:0=0f;1=fc
# rmdir asd
# mkdir qwe
# cat qwe/schemata
L2:0=0f;1=fc

The reason why I suggested this initialization is to have the defaults
work on resource group creation. I assume a new resource group would be
created with "shareable" mode so its schemata should not overlap with
any "exclusive" or "locked". Since the bitmasks used by the previous
group with this closid may not be shareable I considered it safer to
initialize with "shareable" mode with known shareable/unused bitmasks. A
potential issue with this idea is that the creation of a group may now
result in the programming of the hardware with these default settings.
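
For illustration, roughly what I mean by initializing with known
shareable/unused bitmasks (a user-space sketch only, not the kernel code;
the helper name and numbers are made up):

def initial_cbm(cbm_len, hw_shareable, other_cbms):
    # Bits a newly created ("shareable" mode) group could start with:
    # the hardware-shareable bits plus every bit not currently used by
    # any other closid in this domain.
    full = (1 << cbm_len) - 1
    in_use = 0
    for cbm in other_cbms:
        in_use |= cbm
    return (hw_shareable | (full & ~in_use)) & full

# Example: 8-bit CBM, bits 0-1 shareable with other agents, one
# existing group owns bits 4-7 -> the new group starts with 0xf.
print(hex(initial_cbm(8, 0x3, [0xf0])))   # 0xf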

>> Moving to "exclusive" mode it appears that, when enabled for a resource
>> group, all domains of all resources are forced to have an "exclusive"
>> region associated with this resource group (closid). This is because the
>> schemata reflects the hardware settings of all resources and their
>> domains and the hardware does not accept a "zero" bitmask. A user thus
>> cannot just specify a single region of a particular cache instance as
>> "exclusive". Does this match your intention wrt "exclusive"?
> 
> Interesting question. I really did not think about that yet. 

I pasted your second email responding to this at the bottom of this email.

>> Moving on to the "locked" mode. We cannot support different
>> pseudo-locked regions across multiple resources (eg. L2 and L3). In
>> fact, if we would at some point in the future then a pseudo-locked
>> region on one resource could implicitly span a second resource.
>> Additionally, we would like to enable a user to enable a single
>> pseudo-locked region on a single cache instance.
>>
>> From the above it follows that "locked" mode cannot just simply build on
>> top of "exclusive" mode rules (as I expressed them above) since it
>> cannot enforce a locked region on each domain of each resource.
>>
>> We would like to support something like (as you also have in your example):
>>
>> mkdir group
>> echo "L2:1=0x3" > schemata
>> echo locked > mode
>>
>> The above should only pseudo-lock the indicated region and not touch any
>> other domain. The problem is that the schemata always contain non-zero
>> bitmasks for all domains so at the time "locked" is written it is not
>> known which cache region needs to be locked. I am currently unable to
>> see a simple way to build on top of the current schemata design to
>> support the "locked" mode as you intended. It does seem as though the
>> user's intention to create a pseudo-locked region needs to be
>> communicated before the schemata is written, but from what I understand
>> this does not seem to be supported by the mode/schemata combination.
>> Please do correct me where I am wrong.
> 
> You could make it:
> 
> echo locksetup > mode
> echo $CONF > schemata
> echo locked > mode
> 
> Or something like that.

Indeed ... the final command may perhaps not be needed? Since the user
expressed intent to create pseudo-locked region by writing "locksetup"
the pseudo-locking can be done when the schemata is written. I think it
would be simpler to act when the schemata is written since we know
exactly at that point which regions should be pseudo-locked. After the
schemata is stored the user's choice is just merged with the larger
schemata representing all resources/domains. We could set mode to
"locked" on success, it can remain as "locksetup" on failure of creating
the pseudo-locked region. We could perhaps also consider a name change
"locksetup" -> "lockrsv" since after the first pseudo-locked region is
created on a domain then all the other domains associated with this
class of service need to have some special state since no task will ever
run on them with that class of service so we would not want their bits
(which will not be zero) to be taken into account when checking for
"shareable" or "exclusive".

This could also support multiple pseudo-locked regions.

For example:
# #Create first pseudo-locked region
# echo locksetup > mode
# echo L2:0=0xf > schemata
# echo $?
0
# cat mode
locked # will be locksetup on failure
# cat schemata
L2:0=0xf #only show pseudo-locked regions
# #Create second pseudo-locked region
# # Not necessary to write "locksetup" again
# echo L2:1=0xf > schemata #will trigger the pseudo-locking of new region
# echo $?
1 # just for example, this could succeed also
# cat mode
locked
# cat schemata
L2:0=0xf

The schemata shown to the user would contain only the pseudo-locked
region(s); if there are none, nothing will be returned.

I'll think about this more, but if we do go the route of releasing
closids as suggested below it may change a lot.

>> To continue, when we overcome the above obstacle:
>> A scenario could be where a single resource group will contain all the
>> pseudo-locked regions (to avoid wasting closids). It is not clear to me
>> how to easily support such a usage though since the way writes to the
>> schemata is done is "changes only". If for example, two pseudo-locked
>> regions exists:
>>
>> # mkdir group
>> # echo "L2:1=0x3" > schemata
>> # echo locked > mode
>> # cat schemata
>> L2:1=0x3
>> # echo "L2:0=0xf" > schemata
>> # cat schemata
>> L2:0=0xf;1=0x3
>>
>> How can the user remove one of the pseudo-locked regions without
>> affecting the other? Could we perhaps allow zero bitmask writes when a
>> region is locked?
> 
> That might work. Though it looks hacky.

Could it work to create another mode?
For example,

# echo lockremove > mode
# echo $SCHEMATATOREMOVE > schemata
# echo $?
0
# cat mode
#locked if more pseudo-locked regions remain, or locksetup/lockrsv if no
pseudo-locked regions remain

> 
>> Another point I would like to highlight is that when we talked about
>> keeping the closid associated with the pseudo-locked region I mentioned
>> that some resources may have few closids (for example, 4). As discussed
>> this seems ok when there are only 8 bits in the bitmask. What I did not
>> highlight at that time is that the closids are limited to the smallest
>> number supported by all resources. So, if this same platform has a
>> second resource (with more bits in a bitmask) with more closids, they
>> would also be limited to 4. In this case it does seem removing a closid
>> from service would have bigger impact.
> 
> Is that a real issue or just an academic exercise?

This is a real issue. The pros and cons of using a global CLOSID across
all resources are documented in the comments preceding:
arch/x86/kernel/cpu/intel_rdt_rdtgroup.c:closid_init()

The issue I mention was foreseen, to quote from there "Our choices on
how to configure each resource become progressively more limited as the
number of resources grows".

> Let's assume its real,
> so you could do the following:
> 
> mkdir group		<- acquires closid
> echo locksetup > mode	<- Creates 'lockarea' file
> echo L2:0 > lockarea
> echo 'L2:0=0xf' > schemata
> echo locked > mode	<- locks down all files, does the lock setup
>      	      		   and drops closid
> 
> That would solve quite some of the other issues as well. Hmm?

At this time the resource group, represented by a resctrl directory, is
tightly associated with the closid. I'll take a closer look at what it
will take to separate them.

Could you please elaborate on the purpose of the "lockarea" file? It
does seem to duplicate the information in the schemata written in the
subsequent line.

If we do go this route then it seems that there would be one
pseudo-locked region per resource group, not multiple ones as I had in
my examples above.

An alternative to the hardware programming on creation of resource group
could also be to reset the bitmasks of the closid to be shareable/unused
bits at the time the closid is released.

> On Tue, 27 Feb 2018, Thomas Gleixner wrote:
>> On Mon, 26 Feb 2018, Reinette Chatre wrote:
>>> Moving to "exclusive" mode it appears that, when enabled for a resource
>>> group, all domains of all resources are forced to have an "exclusive"
>>> region associated with this resource group (closid). This is because the
>>> schemata reflects the hardware settings of all resources and their
>>> domains and the hardware does not accept a "zero" bitmask. A user thus
>>> cannot just specify a single region of a particular cache instance as
>>> "exclusive". Does this match your intention wrt "exclusive"?
>>
>> Interesting question. I really did not think about that yet. 
> 
> Actually we could solve that problem similar to the locked one and share
> most of the functionality:
> 
> mkdir group
> echo exclusive > mode
> echo L3:0 > restrict
> 
> and for locked:
> 
> mkdir group
> echo locksetup > mode
> echo L2:0 > restrict
> echo 'L2:0=0xf' > schemata
> echo locked > mode
> 
> The 'restrict' file (feel free to come up with a better name) is only
> available/writeable in exclusive and locksetup mode. In case of exclusive
> mode it can contain several domains/resources, but in locked mode its only
> allowed to contain a single domain/resource.
> 
> A write to schemata for exclusive or locksetup mode will apply the
> exclusiveness restrictions only to the resources/domains selected in the
> 'restrict' file. 

I think I understand for the exclusive case. Here the introduction of
the restrict file helps. I will run through a few examples to ensure I
understand it. For the pseudo-locking cases I do have the questions and
comments above. Here I likely may be missing something but I'll keep
dissecting how this would work to clear up my understanding.

Thank you very much for your guidance

Reinette


* Re: [RFC PATCH V2 13/22] x86/intel_rdt: Support schemata write - pseudo-locking core
  2018-02-20 17:15   ` Thomas Gleixner
  2018-02-20 18:47     ` Reinette Chatre
  2018-02-27  0:34     ` Reinette Chatre
@ 2018-02-27 21:01     ` Reinette Chatre
  2018-02-28 17:57       ` Thomas Gleixner
  2 siblings, 1 reply; 65+ messages in thread
From: Reinette Chatre @ 2018-02-27 21:01 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: fenghua.yu, tony.luck, gavin.hindman, vikas.shivappa,
	dave.hansen, mingo, hpa, x86, linux-kernel

Hi Thomas,

On 2/20/2018 9:15 AM, Thomas Gleixner wrote:
> Let's look at the existing crtl/mon groups which are each represented by a
> directory already.
> 
>  - Adding a 'size' file to the ctrl groups would be a natural extension
>    which makes sense for regular cache allocations as well.
> 

I would like to clarify how you envision the value of "size" being
computed. A resource group may have several resources associated with
it. Some of these resources may indeed overlap, for example, if there
are L2 and L3 CAT-capable resources on the system. Similarly, when CDP
is enabled, there would be overlap in bitmasks referring to the same
cache locations but treated as different resources. Indeed, in the
future there may be resources that are capable of allocation but are
not caches specifically, and those could also be handled within a
single resource group.

Summarizing all of these cases with a single "size" associated with the
resource group does not seem straightforward to me.

Reinette


* Re: [RFC PATCH V2 13/22] x86/intel_rdt: Support schemata write - pseudo-locking core
  2018-02-27 19:52         ` Reinette Chatre
@ 2018-02-27 21:33           ` Reinette Chatre
  2018-02-28 18:39           ` Thomas Gleixner
  1 sibling, 0 replies; 65+ messages in thread
From: Reinette Chatre @ 2018-02-27 21:33 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: fenghua.yu, tony.luck, gavin.hindman, vikas.shivappa,
	dave.hansen, mingo, hpa, x86, linux-kernel

Hi Thomas,

On 2/27/2018 11:52 AM, Reinette Chatre wrote:
> On 2/27/2018 2:36 AM, Thomas Gleixner wrote:
>> Let's assume its real,
>> so you could do the following:
>>
>> mkdir group		<- acquires closid
>> echo locksetup > mode	<- Creates 'lockarea' file
>> echo L2:0 > lockarea
>> echo 'L2:0=0xf' > schemata
>> echo locked > mode	<- locks down all files, does the lock setup
>>      	      		   and drops closid
>>
>> That would solve quite some of the other issues as well. Hmm?
> 
> At this time the resource group, represented by a resctrl directory, is
> tightly associated with the closid. I'll take a closer look at what it
> will take to separate them.
> 
> Could you please elaborate on the purpose of the "lockarea" file? It
> does seem to duplicate the information in the schemata written in the
> subsequent line.
> 
> If we do go this route then it seems that there would be one
> pseudo-locked region per resource group, not multiple ones as I had in
> my examples above.

Actually, this need not be true. It could be possible for an
administrator to pseudo-lock two regions at once. For example,

mkdir group
echo locksetup > mode
echo 'L2:0=0xf;1=0xf' > schemata

This could have two pseudo-locked regions associated with a single
resource group. This does complicate the usage of the "size" file even
more, though: since the plan was to have a single "size" file associated
with a resource group, it is not intuitive how it should describe
multiple pseudo-locked regions. I added the "size" file originally to
help users of the pseudo-locking interface where a single pseudo-locked
region existed in a directory. All the information users need to compute
the size themselves is available; perhaps I can add pseudo-code to
compute the size from the available information to the documentation?
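
Roughly the pseudo-code I have in mind (a sketch only; it assumes the
cache size is known, for example from
/sys/devices/system/cpu/cpu*/cache/index*/size, and the bitmask width
from the resource's cbm_mask info file):

def region_size(cache_size_bytes, cbm, cbm_len):
    # Size of an allocation is the cache size scaled by the fraction
    # of capacity bitmask bits that are set.
    return cache_size_bytes * bin(cbm).count("1") // cbm_len

# Example: 1 MiB L2 cache, 16-bit capacity bitmask, pseudo-locked
# region uses 0x000f (4 of 16 ways) -> 256 KiB.
print(region_size(1024 * 1024, 0x000f, 16))   # 262144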

Reinette


* Re: [RFC PATCH V2 13/22] x86/intel_rdt: Support schemata write - pseudo-locking core
  2018-02-27 21:01     ` Reinette Chatre
@ 2018-02-28 17:57       ` Thomas Gleixner
  2018-02-28 17:59         ` Thomas Gleixner
  0 siblings, 1 reply; 65+ messages in thread
From: Thomas Gleixner @ 2018-02-28 17:57 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: fenghua.yu, tony.luck, gavin.hindman, vikas.shivappa,
	dave.hansen, mingo, hpa, x86, linux-kernel

On Tue, 27 Feb 2018, Reinette Chatre wrote:
> On 2/20/2018 9:15 AM, Thomas Gleixner wrote:
> > Let's look at the existing crtl/mon groups which are each represented by a
> > directory already.
> > 
> >  - Adding a 'size' file to the ctrl groups would be a natural extension
> >    which makes sense for regular cache allocations as well.
> > 
> 
> I would like to clarify how you envision the value of "size" computed. A
> resource group may have several resources associated with it. Some of
> these resources may indeed overlap, for example, if there is L2 and L3
> CAT capable resources on the system. Similarly when CDP is enabled,
> there would be overlap in bitmasks referring to the same cache locations
> but treated as different resources. Indeed, there may in the future be
> some resources that are capable of allocation but not cache specifically
> that could also be handled within a single resource group.
> 
> Summarizing all of these cases with a single "size" associated with the
> resource group does not seem straightforward to me.

We have the schemata file, which covers everything. So the size file inside a
resource group should show the sizes for each domain/resource as well.

L2:0=128K;1=256K;
L3:0=1M;1=2M;

L3DATA:0=128K
L3CODE:0=128K

or such. That would be consistent with the schemata file. If there are
resources which cannot be expressed in size, like MBA then you simply do
not print them.

At the top level you want to show the inuse areas. I'd go for straight
bitmap display there:

L2:0=00011100;1=11111111;
L3:0=11001100;1=11111111;

If L3 CDP is enabled then you can show:

L3:0=1DCCDC00;1=DDDD00CC;

where:

0 = unused
1 = overlapping C/D
C = code
D = data

Hmm?
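
For illustration only, a sketch of how such a display could be rendered
for one domain, assuming aggregate code and data bitmasks are available
(this is not the actual resctrl code):

def cdp_display(code_mask, data_mask, cbm_len):
    # '0' = unused, '1' = used by both code and data allocations,
    # 'C' = code only, 'D' = data only.  Most significant bit first.
    out = []
    for bit in reversed(range(cbm_len)):
        c = bool(code_mask & (1 << bit))
        d = bool(data_mask & (1 << bit))
        out.append("1" if c and d else "C" if c else "D" if d else "0")
    return "".join(out)

print(cdp_display(0xcc, 0x98, 8))   # "1C0D1C00"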

Thanks,

	tglx


* Re: [RFC PATCH V2 13/22] x86/intel_rdt: Support schemata write - pseudo-locking core
  2018-02-28 17:57       ` Thomas Gleixner
@ 2018-02-28 17:59         ` Thomas Gleixner
  2018-02-28 18:34           ` Reinette Chatre
  0 siblings, 1 reply; 65+ messages in thread
From: Thomas Gleixner @ 2018-02-28 17:59 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: fenghua.yu, tony.luck, gavin.hindman, vikas.shivappa,
	dave.hansen, mingo, hpa, x86, linux-kernel

On Wed, 28 Feb 2018, Thomas Gleixner wrote:
> On Tue, 27 Feb 2018, Reinette Chatre wrote:
> > On 2/20/2018 9:15 AM, Thomas Gleixner wrote:
> > > Let's look at the existing crtl/mon groups which are each represented by a
> > > directory already.
> > > 
> > >  - Adding a 'size' file to the ctrl groups would be a natural extension
> > >    which makes sense for regular cache allocations as well.
> > > 
> > 
> > I would like to clarify how you envision the value of "size" computed. A
> > resource group may have several resources associated with it. Some of
> > these resources may indeed overlap, for example, if there is L2 and L3
> > CAT capable resources on the system. Similarly when CDP is enabled,
> > there would be overlap in bitmasks referring to the same cache locations
> > but treated as different resources. Indeed, there may in the future be
> > some resources that are capable of allocation but not cache specifically
> > that could also be handled within a single resource group.
> > 
> > Summarizing all of these cases with a single "size" associated with the
> > resource group does not seem straightforward to me.
> 
> We have the schemata file, which covers everything. So the size file inside a
> resource group should show the sizes for each domain/resource as well.
> 
> L2:0=128K;1=256K;
> L3:0=1M;1=2M;
> 
> L3DATA:0=128K
> L3CODE:0=128K
> 
> or such. That would be consistent with the schemata file. If there are
> resources which cannot be expressed in size, like MBA then you simply do
> not print them.
> 
> At the top level you want to show the inuse areas. I'd go for straight
> bitmap display there:
> 
> L2:0=00011100;1=11111111;
> L3:0=11001100;1=11111111;
> 
> If L3 CDP is enabled then you can show:
> 
> L3:0=1DCCDC00;1=DDDD00CC;
> 
> where:
> 
> 0 = unused
> 1 = overlapping C/D
> C = code
> D = data

Hit send too early....

For the locked case this would add:

 L = locked

Thanks,

	tglx


* Re: [RFC PATCH V2 13/22] x86/intel_rdt: Support schemata write - pseudo-locking core
  2018-02-28 17:59         ` Thomas Gleixner
@ 2018-02-28 18:34           ` Reinette Chatre
  2018-02-28 18:42             ` Thomas Gleixner
  0 siblings, 1 reply; 65+ messages in thread
From: Reinette Chatre @ 2018-02-28 18:34 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: fenghua.yu, tony.luck, gavin.hindman, vikas.shivappa,
	dave.hansen, mingo, hpa, x86, linux-kernel

Hi Thomas,

On 2/28/2018 9:59 AM, Thomas Gleixner wrote:
> On Wed, 28 Feb 2018, Thomas Gleixner wrote:
>> On Tue, 27 Feb 2018, Reinette Chatre wrote:
>>> On 2/20/2018 9:15 AM, Thomas Gleixner wrote:
>>>> Let's look at the existing crtl/mon groups which are each represented by a
>>>> directory already.
>>>>
>>>>  - Adding a 'size' file to the ctrl groups would be a natural extension
>>>>    which makes sense for regular cache allocations as well.
>>>>
>>>
>>> I would like to clarify how you envision the value of "size" computed. A
>>> resource group may have several resources associated with it. Some of
>>> these resources may indeed overlap, for example, if there is L2 and L3
>>> CAT capable resources on the system. Similarly when CDP is enabled,
>>> there would be overlap in bitmasks referring to the same cache locations
>>> but treated as different resources. Indeed, there may in the future be
>>> some resources that are capable of allocation but not cache specifically
>>> that could also be handled within a single resource group.
>>>
>>> Summarizing all of these cases with a single "size" associated with the
>>> resource group does not seem straightforward to me.
>>
>> We have the schemata file, which covers everything. So the size file inside a
>> resource group should show the sizes for each domain/resource as well.
>>
>> L2:0=128K;1=256K;
>> L3:0=1M;1=2M;
>>
>> L3DATA:0=128K
>> L3CODE:0=128K
>>
>> or such. That would be consistent with the schemata file. If there are
>> resources which cannot be expressed in size, like MBA then you simply do
>> not print them.
>>
>> At the top level you want to show the inuse areas. I'd go for straight
>> bitmap display there:
>>
>> L2:0=00011100;1=11111111;
>> L3:0=11001100;1=11111111;
>>
>> If L3 CDP is enabled then you can show:
>>
>> L3:0=1DCCDC00;1=DDDD00CC;
>>
>> where:
>>
>> 0 = unused
>> 1 = overlapping C/D
>> C = code
>> D = data
> 
> Hit send too early....
> 
> For the locked case this would add:
> 
>  L = locked

I hesitated to do something like this because during the review of this
series there was resistance to using sysfs files for multiple values. I
will proceed with your suggestion, noting that it is tied to the schemata
file.

I do not think we need these special labels for CDP though. From what I
understand when CDP is enabled there will be two new resources in the
info directory. For example,

info/L3DATA/
info/L3CODE/

Each would have its own file(s) noting which bits are in use.

At this time I am also not enabling pseudo-locking when CDP is enabled
so the locked label is not needed.

There is already the file "shareable_bits" in the info directory
associated with each resource. At the moment it only reflects the bits
that could be used by other entities on the system. Considering its name,
and that we are now introducing the idea of "shareable", I was thinking of
adding all "shareable" bits (hardware and software) of this resource to
this file. This still leaves the new "inuse_bits" (or perhaps
"unused_bits") info file that will communicate the "exclusive" and
"locked" bits in addition to what is in "shareable_bits". Between these
two files users should have the information needed to choose regions for
their tasks. These would all use bitmap displays.
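
As an illustration of what I mean, a sketch under the assumption that
each group's mode and bitmask are known (none of this is the
implementation):

def info_bitmasks(hw_shareable, groups):
    # 'shareable_bits': hardware-shareable bits plus bits of groups in
    # "shareable" mode.  'inuse_bits': bits claimed by "exclusive" or
    # "locked" groups.
    shareable = hw_shareable
    inuse = 0
    for mode, cbm in groups:
        if mode == "shareable":
            shareable |= cbm
        elif mode in ("exclusive", "locked"):
            inuse |= cbm
    return shareable, inuse

# Example: hardware shares bits 0-1, a "shareable" group uses 0x3c and
# a "locked" group uses 0xc0.
print([hex(b) for b in info_bitmasks(0x3, [("shareable", 0x3c),
                                           ("locked", 0xc0)])])
# ['0x3f', '0xc0']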

Thank you very much

Reinette


* Re: [RFC PATCH V2 13/22] x86/intel_rdt: Support schemata write - pseudo-locking core
  2018-02-27 19:52         ` Reinette Chatre
  2018-02-27 21:33           ` Reinette Chatre
@ 2018-02-28 18:39           ` Thomas Gleixner
  2018-02-28 19:17             ` Reinette Chatre
  1 sibling, 1 reply; 65+ messages in thread
From: Thomas Gleixner @ 2018-02-28 18:39 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: fenghua.yu, tony.luck, gavin.hindman, vikas.shivappa,
	dave.hansen, mingo, hpa, x86, linux-kernel

Reinette,

On Tue, 27 Feb 2018, Reinette Chatre wrote:
> On 2/27/2018 2:36 AM, Thomas Gleixner wrote:
> > On Mon, 26 Feb 2018, Reinette Chatre wrote:
> >> A change to start us off with could be to initialize the schemata with
> >> all the shareable and unused bits set for all domains when a new
> >> resource group is created.
> > 
> > The new resource group initialization is the least of my worries. The
> > current mode is to use the default group setting, right?
> 
> No. When a new group is created a closid is assigned to it. The schemata
> it is initialized with is the schemata the previous group with the same
> closid had. At the beginning, yes, it is the default, but later you get
> something like this:
> 
> # mkdir asd
> # cat asd/schemata
> L2:0=ff;1=ff
> # echo 'L2:0=0xf;1=0xfc' > asd/schemata
> # cat asd/schemata
> L2:0=0f;1=fc
> # rmdir asd
> # mkdir qwe
> # cat qwe/schemata
> L2:0=0f;1=fc

Ah, was not aware and did not bother to look into the code.

> The reason why I suggested this initialization is to have the defaults
> work on resource group creation. I assume a new resource group would be
> created with "shareable" mode so its schemata should not overlap with
> any "exclusive" or "locked". Since the bitmasks used by the previous
> group with this closid may not be shareable I considered it safer to
> initialize with "shareable" mode with known shareable/unused bitmasks. A
> potential issue with this idea is that the creation of a group may now
> result in the programming of the hardware with settings these defaults.

Yes, setting it to 'default' group bits at creation (ID allocation) time
makes sense.

> >> Moving to "exclusive" mode it appears that, when enabled for a resource
> >> group, all domains of all resources are forced to have an "exclusive"
> >> region associated with this resource group (closid). This is because the
> >> schemata reflects the hardware settings of all resources and their
> >> domains and the hardware does not accept a "zero" bitmask. A user thus
> >> cannot just specify a single region of a particular cache instance as
> >> "exclusive". Does this match your intention wrt "exclusive"?
> > 
> > Interesting question. I really did not think about that yet.

Second thoughts on that: I think for a start we can go the simple route and
just say: exclusive covers all cache levels.

> > You could make it:
> > 
> > echo locksetup > mode
> > echo $CONF > schemata
> > echo locked > mode
> > 
> > Or something like that.
> 
> Indeed ... the final command may perhaps not be needed? Since the user
> expressed intent to create pseudo-locked region by writing "locksetup"
> the pseudo-locking can be done when the schemata is written. I think it
> would be simpler to act when the schemata is written since we know
> exactly at that point which regions should be pseudo-locked. After the
> schemata is stored the user's choice is just merged with the larger
> schemata representing all resources/domains. We could set mode to
> "locked" on success, it can remain as "locksetup" on failure of creating
> the pseudo-locked region. We could perhaps also consider a name change
> "locksetup" -> "lockrsv" since after the first pseudo-locked region is
> created on a domain then all the other domains associated with this
> class of service need to have some special state since no task will ever
> run on them with that class of service so we would not want their bits
> (which will not be zero) to be taken into account when checking for
> "shareable" or "exclusive".

Works for me.

> This could also support multiple pseudo-locked regions.
> For example:
> # #Create first pseudo-locked region
> # echo locksetup > mode
> # echo L2:0=0xf > schemata
> # echo $?
> 0
> # cat mode
> locked # will be locksetup on failure
> # cat schemata
> L2:0=0xf #only show pseudo-locked regions
> # #Create second pseudo-locked region
> # # Not necessary to write "locksetup" again
> # echo L2:1=0xf > schemata #will trigger the pseudo-locking of new region
> # echo $?
> 1 # just for example, this could succeed also
> # cat mode
> locked
> # cat schemata
> L2:0=0xf
> 
> Schemata shown to user would be only the pseudo-locked region(s), unless
> there is none, then nothing will be returned.
> 
> I'll think about this more, but if we do go the route of releasing
> closids as suggested below it may change a lot.

I think dropping the closid makes sense. Once the thing is locked it's done
and nothing can be changed anymore, except removal of course. That also
gives you a 1:1 mapping between resource group and lockdevice.

> This is a real issue. The pros and cons of using a global CLOSID across
> all resources are documented in the comments preceding:
> arch/x86/kernel/cpu/intel_rdt_rdtgroup.c:closid_init()
> 
> The issue I mention was foreseen, to quote from there "Our choices on
> how to configure each resource become progressively more limited as the
> number of resources grows".
> 
> > Let's assume its real,
> > so you could do the following:
> > 
> > mkdir group		<- acquires closid
> > echo locksetup > mode	<- Creates 'lockarea' file
> > echo L2:0 > lockarea
> > echo 'L2:0=0xf' > schemata
> > echo locked > mode	<- locks down all files, does the lock setup
> >      	      		   and drops closid
> > 
> > That would solve quite some of the other issues as well. Hmm?
> 
> At this time the resource group, represented by a resctrl directory, is
> tightly associated with the closid. I'll take a closer look at what it
> will take to separate them.

Shouldn't be that hard.

> Could you please elaborate on the purpose of the "lockarea" file? It
> does seem to duplicate the information in the schemata written in the
> subsequent line.

No. The lockarea or restrict file (as I named it later, but feel free to
come up with something more intuitive) is there to tell which part of the
resource zoo should be made exclusive/locked. That makes the whole
process of writing to the schemata file and validating whether it is
really exclusive way simpler.

> If we do go this route then it seems that there would be one
> pseudo-locked region per resource group, not multiple ones as I had in
> my examples above.

Correct.

> An alternative to the hardware programming on creation of resource group
> could also be to reset the bitmasks of the closid to be shareable/unused
> bits at the time the closid is released.

That does not help because the default/shareable/unused bits can change
between release of a CLOSID and reallocation.

> > Actually we could solve that problem similar to the locked one and share
> > most of the functionality:
> > 
> > mkdir group
> > echo exclusive > mode
> > echo L3:0 > restrict
> > 
> > and for locked:
> > 
> > mkdir group
> > echo locksetup > mode
> > echo L2:0 > restrict
> > echo 'L2:0=0xf' > schemata
> > echo locked > mode
> > 
> > The 'restrict' file (feel free to come up with a better name) is only
> > available/writeable in exclusive and locksetup mode. In case of exclusive
> > mode it can contain several domains/resources, but in locked mode its only
> > allowed to contain a single domain/resource.
> > 
> > A write to schemata for exclusive or locksetup mode will apply the
> > exclusiveness restrictions only to the resources/domains selected in the
> > 'restrict' file. 
> 
> I think I understand for the exclusive case. Here the introduction of
> the restrict file helps. I will run through a few examples to ensure I
> understand it. For the pseudo-locking cases I do have the questions and
> comments above. Here I likely may be missing something but I'll keep
> dissecting how this would work to clear up my understanding.

I came up with this under the assumptions:

  1) One locked region per resource group
  2) Drop closid after locking

Then the restrict file makes a lot of sense because it would give a clear
selection of the possible resource to lock.

Thanks,

	tglx


* Re: [RFC PATCH V2 13/22] x86/intel_rdt: Support schemata write - pseudo-locking core
  2018-02-28 18:34           ` Reinette Chatre
@ 2018-02-28 18:42             ` Thomas Gleixner
  0 siblings, 0 replies; 65+ messages in thread
From: Thomas Gleixner @ 2018-02-28 18:42 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: fenghua.yu, tony.luck, gavin.hindman, vikas.shivappa,
	dave.hansen, mingo, hpa, x86, linux-kernel

On Wed, 28 Feb 2018, Reinette Chatre wrote:
> On 2/28/2018 9:59 AM, Thomas Gleixner wrote:
> I hesitated doing something like this because during the review of this
> series there was resistance to using sysfs files for multiple values. I
> will proceed with your suggestion noting that it is tied with schemata file.

It's not sysfs, it's resctrl fs. So we already have multiple lines and
values, e.g. in the schemata file.

> I do not think we need these special labels for CDP though. From what I
> understand when CDP is enabled there will be two new resources in the
> info directory. For example,
> 
> info/L3DATA/
> info/L3CODE/
> 
> Each would have its own file(s) noting which bits are in use.
> 
> At this time I am also not enabling pseudo-locking when CDP is enabled
> so the locked label is not needed.

The locked label is needed for the !CDP case so you can see where the
locked regions are in a bitmap.

> There is already the file "shareable_bits" in the info directory
> associated with each resource. At the moment it only reflects the bits
> that could be used by other entities on the system. Considering its name
> and us now introducing the idea of "shareable" I was thinking of adding
> all "shareable" bits (hardware and software) of this resource to this
> file. This still leaves the new "inuse_bits" (or perhaps "unused_bits")
> info file that will communicate the "exclusive" and "locked" bits in
> addition so what is in "shareable_bits". Between these two files users
> should have information needed to choose regions for their tasks. These
> would all use bitmap displays.

Yeah, that needs some thought, but I happily let you sort out the details.

Thanks,

	tglx


* Re: [RFC PATCH V2 13/22] x86/intel_rdt: Support schemata write - pseudo-locking core
  2018-02-28 18:39           ` Thomas Gleixner
@ 2018-02-28 19:17             ` Reinette Chatre
  2018-02-28 19:40               ` Thomas Gleixner
  0 siblings, 1 reply; 65+ messages in thread
From: Reinette Chatre @ 2018-02-28 19:17 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: fenghua.yu, tony.luck, gavin.hindman, vikas.shivappa,
	dave.hansen, mingo, hpa, x86, linux-kernel

Hi Thomas,

On 2/28/2018 10:39 AM, Thomas Gleixner wrote:
> On Tue, 27 Feb 2018, Reinette Chatre wrote:
>> On 2/27/2018 2:36 AM, Thomas Gleixner wrote:
>>> On Mon, 26 Feb 2018, Reinette Chatre wrote:
>>>> Moving to "exclusive" mode it appears that, when enabled for a resource
>>>> group, all domains of all resources are forced to have an "exclusive"
>>>> region associated with this resource group (closid). This is because the
>>>> schemata reflects the hardware settings of all resources and their
>>>> domains and the hardware does not accept a "zero" bitmask. A user thus
>>>> cannot just specify a single region of a particular cache instance as
>>>> "exclusive". Does this match your intention wrt "exclusive"?
>>>
>>> Interesting question. I really did not think about that yet.
> 
> Second thoughts on that: I think for a start we can go the simple route and
> just say: exclusive covers all cache levels.

Will do ... (will refer back to this later)

> 
>>> You could make it:
>>>
>>> echo locksetup > mode
>>> echo $CONF > schemata
>>> echo locked > mode
>>>
>>> Or something like that.
>>
>> Indeed ... the final command may perhaps not be needed? Since the user
>> expressed intent to create pseudo-locked region by writing "locksetup"
>> the pseudo-locking can be done when the schemata is written. I think it
>> would be simpler to act when the schemata is written since we know
>> exactly at that point which regions should be pseudo-locked. After the
>> schemata is stored the user's choice is just merged with the larger
>> schemata representing all resources/domains. We could set mode to
>> "locked" on success, it can remain as "locksetup" on failure of creating
>> the pseudo-locked region. We could perhaps also consider a name change
>> "locksetup" -> "lockrsv" since after the first pseudo-locked region is
>> created on a domain then all the other domains associated with this
>> class of service need to have some special state since no task will ever
>> run on them with that class of service so we would not want their bits
>> (which will not be zero) to be taken into account when checking for
>> "shareable" or "exclusive".
> 
> Works for me.

A big change in the above is that, now that the closid will be released,
there is no need for "lockrsv" anymore since the closid used for
pseudo-locking can surely be re-used. It would thus change to:

# echo locksetup > mode
# echo $CONF > schemata
# echo $?
0
# cat mode
locked

As before, the writing of the schemata would trigger the
pseudo-locking and there is no lockarea/restrict file.

> I think dropping the closid makes sense. Once the thing is locked it's done
> and nothing can be changed anymore, except removal of course. That also
> gives you a 1:1 mapping between resource group and lockdevice.

Thanks. One pseudo-locked region per resource group does make things
simpler and removes the need for some strange removal tricks;
pseudo-locked region removal can be done with a directory removal. The
pseudo-locked region's character device will be named the same as the
resource group.

> 
>> This is a real issue. The pros and cons of using a global CLOSID across
>> all resources are documented in the comments preceding:
>> arch/x86/kernel/cpu/intel_rdt_rdtgroup.c:closid_init()
>>
>> The issue I mention was foreseen, to quote from there "Our choices on
>> how to configure each resource become progressively more limited as the
>> number of resources grows".
>>
>>> Let's assume its real,
>>> so you could do the following:
>>>
>>> mkdir group		<- acquires closid
>>> echo locksetup > mode	<- Creates 'lockarea' file
>>> echo L2:0 > lockarea
>>> echo 'L2:0=0xf' > schemata
>>> echo locked > mode	<- locks down all files, does the lock setup
>>>      	      		   and drops closid
>>>
>>> That would solve quite some of the other issues as well. Hmm?
>>
>> At this time the resource group, represented by a resctrl directory, is
>> tightly associated with the closid. I'll take a closer look at what it
>> will take to separate them.
> 
> Shouldn't be that hard.
> 
>> Could you please elaborate on the purpose of the "lockarea" file? It
>> does seem to duplicate the information in the schemata written in the
>> subsequent line.
> 
> No. The lockarea or restrict file (as I named it later, but feel free to
> come up with something more intuitive) is there to tell which part of the
> resource zoo should be made exclusive/locked. That makes the whole write to
> schemata file and validate whether this is really exclusive way simpler.

This file (lockarea/restrict) would surely support a flexible exclusive
mode, but if we start with the simple case as you note above then it
does not seem to be needed for exclusive mode since the exclusive mode
would cover all domains of all resources.

# mkdir group
# echo exclusive > group/mode        <- will attempt to mark all bits in the
                                        current schemata as exclusive, will
                                        fail if any are in use by another
                                        closid
# echo $newschemata > group/schemata <- will require that all bits are not
                                        in use by any other closid

This file (lockarea/restrict) also does not seem to be needed for the
locked mode since only one pseudo-locked region is supported per
resource group and that is communicated at the time the schemata is
written. The locking flow above did not use this file.

>> If we do go this route then it seems that there would be one
>> pseudo-locked region per resource group, not multiple ones as I had in
>> my examples above.
> 
> Correct.
> 
>> An alternative to the hardware programming on creation of resource group
>> could also be to reset the bitmasks of the closid to be shareable/unused
>> bits at the time the closid is released.
> 
> That does not help because the default/shareable/unused bits can change
> between release of a CLOSID and reallocation.

Good point, thanks.

>>> Actually we could solve that problem similar to the locked one and share
>>> most of the functionality:
>>>
>>> mkdir group
>>> echo exclusive > mode
>>> echo L3:0 > restrict
>>>
>>> and for locked:
>>>
>>> mkdir group
>>> echo locksetup > mode
>>> echo L2:0 > restrict
>>> echo 'L2:0=0xf' > schemata
>>> echo locked > mode
>>>
>>> The 'restrict' file (feel free to come up with a better name) is only
>>> available/writeable in exclusive and locksetup mode. In case of exclusive
>>> mode it can contain several domains/resources, but in locked mode its only
>>> allowed to contain a single domain/resource.
>>>
>>> A write to schemata for exclusive or locksetup mode will apply the
>>> exclusiveness restrictions only to the resources/domains selected in the
>>> 'restrict' file. 
>>
>> I think I understand for the exclusive case. Here the introduction of
>> the restrict file helps. I will run through a few examples to ensure I
>> understand it. For the pseudo-locking cases I do have the questions and
>> comments above. Here I likely may be missing something but I'll keep
>> dissecting how this would work to clear up my understanding.
> 
> I came up with this under the assumptions:
> 
>   1) One locked region per resource group
>   2) Drop closid after locking

I am also now working under these assumptions ...

> Then the restrict file makes a lot of sense because it would give a clear
> selection of the possible resource to lock.

... but I am still stuck on why this restrict file is needed at this
time. Surely it would be needed if later we add the more flexible
exclusive mode, but I do not understand how it helps the locked mode.

Reinette


* Re: [RFC PATCH V2 13/22] x86/intel_rdt: Support schemata write - pseudo-locking core
  2018-02-28 19:17             ` Reinette Chatre
@ 2018-02-28 19:40               ` Thomas Gleixner
  0 siblings, 0 replies; 65+ messages in thread
From: Thomas Gleixner @ 2018-02-28 19:40 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: fenghua.yu, tony.luck, gavin.hindman, vikas.shivappa,
	dave.hansen, mingo, hpa, x86, linux-kernel

On Wed, 28 Feb 2018, Reinette Chatre wrote:
> On 2/28/2018 10:39 AM, Thomas Gleixner wrote:
> > I came up with this under the assumptions:
> > 
> >   1) One locked region per resource group
> >   2) Drop closid after locking
> 
> I am also now working under these assumptions ...
> 
> > Then the restrict file makes a lot of sense because it would give a clear
> > selection of the possible resource to lock.
> 
> ... but I am still stuck on why this restrict file is needed at this
> time. Surely it would be needed if later we add the more flexible
> exclusive mode, but I do not understand how it helps the locked mode.

You're right. Brainfart on my side. With that scheme it's really only
useful for a flexible exclusive mode, which would be nice to have but is
not a prerequisite for now.

Thanks,

	tglx


end of thread

Thread overview: 65+ messages
2018-02-13 15:46 [RFC PATCH V2 00/22] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling Reinette Chatre
2018-02-13 15:46 ` [RFC PATCH V2 01/22] x86/intel_rdt: Documentation for Cache Pseudo-Locking Reinette Chatre
2018-02-19 20:35   ` Thomas Gleixner
2018-02-19 22:15     ` Reinette Chatre
2018-02-19 22:19       ` Thomas Gleixner
2018-02-19 22:24         ` Reinette Chatre
2018-02-19 21:27   ` Randy Dunlap
2018-02-19 22:21     ` Reinette Chatre
2018-02-13 15:46 ` [RFC PATCH V2 02/22] x86/intel_rdt: Make useful functions available internally Reinette Chatre
2018-02-13 15:46 ` [RFC PATCH V2 03/22] x86/intel_rdt: Introduce hooks to create pseudo-locking files Reinette Chatre
2018-02-13 15:46 ` [RFC PATCH V2 04/22] x86/intel_rdt: Introduce test to determine if closid is in use Reinette Chatre
2018-02-13 15:46 ` [RFC PATCH V2 05/22] x86/intel_rdt: Print more accurate pseudo-locking availability Reinette Chatre
2018-02-13 15:46 ` [RFC PATCH V2 06/22] x86/intel_rdt: Create pseudo-locked regions Reinette Chatre
2018-02-19 20:57   ` Thomas Gleixner
2018-02-19 23:02     ` Reinette Chatre
2018-02-19 23:16       ` Thomas Gleixner
2018-02-20  3:21         ` Reinette Chatre
2018-02-13 15:46 ` [RFC PATCH V2 07/22] x86/intel_rdt: Connect pseudo-locking directory to operations Reinette Chatre
2018-02-13 15:46 ` [RFC PATCH V2 08/22] x86/intel_rdt: Introduce pseudo-locking resctrl files Reinette Chatre
2018-02-19 21:01   ` Thomas Gleixner
2018-02-13 15:46 ` [RFC PATCH V2 09/22] x86/intel_rdt: Discover supported platforms via prefetch disable bits Reinette Chatre
2018-02-13 15:46 ` [RFC PATCH V2 10/22] x86/intel_rdt: Disable pseudo-locking if CDP enabled Reinette Chatre
2018-02-13 15:46 ` [RFC PATCH V2 11/22] x86/intel_rdt: Associate pseudo-locked regions with its domain Reinette Chatre
2018-02-19 21:19   ` Thomas Gleixner
2018-02-19 23:00     ` Reinette Chatre
2018-02-19 23:19       ` Thomas Gleixner
2018-02-20  3:17         ` Reinette Chatre
2018-02-20 10:00           ` Thomas Gleixner
2018-02-20 16:02             ` Reinette Chatre
2018-02-20 17:18               ` Thomas Gleixner
2018-02-13 15:46 ` [RFC PATCH V2 12/22] x86/intel_rdt: Support CBM checking from value and character buffer Reinette Chatre
2018-02-13 15:46 ` [RFC PATCH V2 13/22] x86/intel_rdt: Support schemata write - pseudo-locking core Reinette Chatre
2018-02-20 17:15   ` Thomas Gleixner
2018-02-20 18:47     ` Reinette Chatre
2018-02-20 23:21       ` Thomas Gleixner
2018-02-21  1:58         ` Mike Kravetz
2018-02-21  6:10           ` Reinette Chatre
2018-02-21  8:34           ` Thomas Gleixner
2018-02-21  5:58         ` Reinette Chatre
2018-02-27  0:34     ` Reinette Chatre
2018-02-27 10:36       ` Thomas Gleixner
2018-02-27 15:38         ` Thomas Gleixner
2018-02-27 19:52         ` Reinette Chatre
2018-02-27 21:33           ` Reinette Chatre
2018-02-28 18:39           ` Thomas Gleixner
2018-02-28 19:17             ` Reinette Chatre
2018-02-28 19:40               ` Thomas Gleixner
2018-02-27 21:01     ` Reinette Chatre
2018-02-28 17:57       ` Thomas Gleixner
2018-02-28 17:59         ` Thomas Gleixner
2018-02-28 18:34           ` Reinette Chatre
2018-02-28 18:42             ` Thomas Gleixner
2018-02-13 15:46 ` [RFC PATCH V2 14/22] x86/intel_rdt: Enable testing for pseudo-locked region Reinette Chatre
2018-02-13 15:46 ` [RFC PATCH V2 15/22] x86/intel_rdt: Prevent new allocations from pseudo-locked regions Reinette Chatre
2018-02-13 15:47 ` [RFC PATCH V2 16/22] x86/intel_rdt: Create debugfs files for pseudo-locking testing Reinette Chatre
2018-02-13 15:47 ` [RFC PATCH V2 17/22] x86/intel_rdt: Create character device exposing pseudo-locked region Reinette Chatre
2018-02-13 15:47 ` [RFC PATCH V2 18/22] x86/intel_rdt: More precise L2 hit/miss measurements Reinette Chatre
2018-02-13 15:47 ` [RFC PATCH V2 19/22] x86/intel_rdt: Support L3 cache performance event of Broadwell Reinette Chatre
2018-02-13 15:47 ` [RFC PATCH V2 20/22] x86/intel_rdt: Limit C-states dynamically when pseudo-locking active Reinette Chatre
2018-02-13 15:47 ` [RFC PATCH V2 21/22] mm/hugetlb: Enable large allocations through gigantic page API Reinette Chatre
2018-02-13 15:47 ` [RFC PATCH V2 22/22] x86/intel_rdt: Support contiguous memory of all sizes Reinette Chatre
2018-02-14 18:12 ` [RFC PATCH V2 00/22] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling Mike Kravetz
2018-02-14 18:31   ` Reinette Chatre
2018-02-15 20:39     ` Reinette Chatre
2018-02-15 21:10       ` Mike Kravetz
