From: Reinette Chatre <reinette.chatre@intel.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: fenghua.yu@intel.com, tony.luck@intel.com,
gavin.hindman@intel.com, vikas.shivappa@linux.intel.com,
dave.hansen@intel.com, mingo@redhat.com, hpa@zytor.com,
x86@kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH V2 13/22] x86/intel_rdt: Support schemata write - pseudo-locking core
Date: Tue, 20 Feb 2018 10:47:51 -0800 [thread overview]
Message-ID: <6c960fc0-820e-757c-2770-d770647e63d6@intel.com> (raw)
In-Reply-To: <alpine.DEB.2.21.1802200059240.1853@nanos.tec.linutronix.de>
Hi Thomas,
On 2/20/2018 9:15 AM, Thomas Gleixner wrote:
> On Tue, 13 Feb 2018, Reinette Chatre wrote:
>> static void __pseudo_lock_region_release(struct pseudo_lock_region *plr)
>> {
>> bool is_new_plr = (plr == new_plr);
>> @@ -93,6 +175,23 @@ static void __pseudo_lock_region_release(struct pseudo_lock_region *plr)
>> if (!plr->deleted)
>> return;
>>
>> + if (plr->locked) {
>> + plr->d->plr = NULL;
>> + /*
>> + * Resource groups come and go. Simply returning this
>> + * pseudo-locked region's bits to the default CLOS may
>> + * result in default CLOS to become fragmented, causing
>> + * the setting of its bitmask to fail. Ensure it is valid
>> + * first. If this check does fail we cannot return the bits
>> + * to the default CLOS and userspace intervention would be
>> + * required to ensure portions of the cache do not go
>> + * unused.
>> + */
>> + if (cbm_validate_val(plr->d->ctrl_val[0] | plr->cbm, plr->r))
>> + pseudo_lock_clos_set(plr, 0,
>> + plr->d->ctrl_val[0] | plr->cbm);
>> + pseudo_lock_region_clear(plr);
>> + }
>> kfree(plr);
>> if (is_new_plr)
>> new_plr = NULL;
>
> Are you really sure that the life time rules of plr are correct vs. an
> application which still has the locked memory mapped? i.e. the following
> operation:
You are correct. I am not preventing an administrator from removing the
pseudo-locked region if it is in use. I will fix that.
> 1# create_pseudo_lock_region()
>
> 2# start_app()
> fd = open(/dev/.../lock);
> ptr = mmap(fd, .....); <- takes a ref on fd
> close(fd);
> do_stuff(ptr);
>
> 1# rmdir .../lock
>
> unmap(ptr); <- releases fd
>
> I can't see how that is protected. You already have a kref in the PLR, but
> it's in no way connected to the file descriptor lifetime. So the refcount
> logic here must be:
>
> create_lock_region()
> plr = alloc_plr();
> take_ref(plr);
> if (!init_plr(plr)) {
> drop_ref(plr);
> ...
> }
>
> lockdev_open(filp)
> take_ref(plr);
> filp->private = plr;
>
> rmdir_lock_region()
> ...
> drop_ref(plr);
>
> lockdev_relese(filp)
> filp->private = NULL;
> drop_ref(plr);
>
>> /*
>> + * Only one pseudo-locked region can be set up at a time and that is
>> + * enforced by taking the rdt_pseudo_lock_mutex when the user writes the
>> + * requested schemata to the resctrl file and releasing the mutex on
>> + * completion. The thread locking the kernel memory into the cache starts
>> + * and completes during this time so we can be sure that only one thread
>> + * can run at any time.
>> + * The functions starting the pseudo-locking thread needs to wait for its
>> + * completion and since there can only be one we have a global workqueue
>> + * and variable to support this.
>> + */
>> +static DECLARE_WAIT_QUEUE_HEAD(wq);
>> +static int thread_done;
>
> Eew. For one, you really couldn't come up with more generic and less
> relatable variable names, right?
>
> That aside, its just wrong to build code based on current hardware
> limitations. The waitqueue and the result code belong into PLR.
Will do. This also builds on your previous suggestion to not limit the
number of uninitialized pseudo-locked regions.
>
>> +/**
>> + * pseudo_lock_fn - Load kernel memory into cache
>> + *
>> + * This is the core pseudo-locking function.
>> + *
>> + * First we ensure that the kernel memory cannot be found in the cache.
>> + * Then, while taking care that there will be as little interference as
>> + * possible, each cache line of the memory to be loaded is touched while
>> + * core is running with class of service set to the bitmask of the
>> + * pseudo-locked region. After this is complete no future CAT allocations
>> + * will be allowed to overlap with this bitmask.
>> + *
>> + * Local register variables are utilized to ensure that the memory region
>> + * to be locked is the only memory access made during the critical locking
>> + * loop.
>> + */
>> +static int pseudo_lock_fn(void *_plr)
>> +{
>> + struct pseudo_lock_region *plr = _plr;
>> + u32 rmid_p, closid_p;
>> + unsigned long flags;
>> + u64 i;
>> +#ifdef CONFIG_KASAN
>> + /*
>> + * The registers used for local register variables are also used
>> + * when KASAN is active. When KASAN is active we use a regular
>> + * variable to ensure we always use a valid pointer, but the cost
>> + * is that this variable will enter the cache through evicting the
>> + * memory we are trying to lock into the cache. Thus expect lower
>> + * pseudo-locking success rate when KASAN is active.
>> + */
>
> I'm not a real fan of this mess. But well,
>
>> + unsigned int line_size;
>> + unsigned int size;
>> + void *mem_r;
>> +#else
>> + register unsigned int line_size asm("esi");
>> + register unsigned int size asm("edi");
>> +#ifdef CONFIG_X86_64
>> + register void *mem_r asm("rbx");
>> +#else
>> + register void *mem_r asm("ebx");
>> +#endif /* CONFIG_X86_64 */
>> +#endif /* CONFIG_KASAN */
>> +
>> + /*
>> + * Make sure none of the allocated memory is cached. If it is we
>> + * will get a cache hit in below loop from outside of pseudo-locked
>> + * region.
>> + * wbinvd (as opposed to clflush/clflushopt) is required to
>> + * increase likelihood that allocated cache portion will be filled
>> + * with associated memory
>
> Sigh.
>
>> + */
>> + wbinvd();
>> +
>> + preempt_disable();
>> + local_irq_save(flags);
>
> preempt_disable() is pointless when you disable interrupts. And this
> really should be local_irq_disable(). This code is always called with
> interrupts enabled....
>
>> + /*
>> + * Call wrmsr and rdmsr as directly as possible to avoid tracing
>> + * clobbering local register variables or affecting cache accesses.
>> + */
>
> You probably want to make sure that the code below is in L1 cache already
> before the CLOSID is set to the allocation. To do this you want to put the
> preload mechanics into a separate ASM function.
>
> Then you run it with size = 1 on some other temporary memory buffer with
> the default CLOSID, which has the CBM bits of the lock region excluded,
> Then switch to the real CLOSID and run the loop with the real buffer and
> the real size.
Thank you for the suggestion. I will experiment how this affects the
pseudo-locked region success.
>> + __wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
>
> This wants an explanation why the prefetcher needs to be disabled.
>
>> +static int pseudo_lock_doit(struct pseudo_lock_region *plr,
>> + struct rdt_resource *r,
>> + struct rdt_domain *d)
>> +{
>> + struct task_struct *thread;
>> + int closid;
>> + int ret, i;
>> +
>> + /*
>> + * With the usage of wbinvd we can only support one pseudo-locked
>> + * region per domain at this time.
>
> This really sucks.
>
>> + */
>> + if (d->plr) {
>> + rdt_last_cmd_puts("pseudo-locked region exists on cache\n");
>> + return -ENOSPC;
>
> This check is not sufficient for a CPU which has both L2 and L3 allocation
> capability. If there is already a L3 locked region and the current call
> sets up a L2 locked region then this will not catch it and the following
> wbinvd will wipe the L3 locked region ....
>
>> + }
>> +
>> + ret = pseudo_lock_region_init(plr, r, d);
>> + if (ret < 0)
>> + return ret;
>> +
>> + closid = closid_alloc();
>> + if (closid < 0) {
>> + ret = closid;
>> + rdt_last_cmd_puts("unable to obtain free closid\n");
>> + goto out_region;
>> + }
>> +
>> + /*
>> + * Ensure we end with a valid default CLOS. If a pseudo-locked
>> + * region in middle of possible bitmasks is selected it will split
>> + * up default CLOS which would be a fault and for which handling
>> + * is unclear so we fail back to userspace. Validation will also
>> + * ensure that default CLOS is not zero, keeping some cache
>> + * available to rest of system.
>> + */
>> + if (!cbm_validate_val(d->ctrl_val[0] & ~plr->cbm, r)) {
>> + ret = -EINVAL;
>> + rdt_last_cmd_printf("bm 0x%x causes invalid clos 0 bm 0x%x\n",
>> + plr->cbm, d->ctrl_val[0] & ~plr->cbm);
>> + goto out_closid;
>> + }
>> +
>> + ret = pseudo_lock_clos_set(plr, 0, d->ctrl_val[0] & ~plr->cbm);
>
> Fiddling with the default CBM is wrong. The lock operation should only
> succeed when the bits in that domain are not used by _ANY_ control group
> including the default one. This is a reasonable constraint.
This changes one of my original assumptions. I will rework all to adjust
since your later design change suggestions will impact this.
>> + if (ret < 0) {
>> + rdt_last_cmd_printf("unable to set clos 0 bitmask to 0x%x\n",
>> + d->ctrl_val[0] & ~plr->cbm);
>> + goto out_closid;
>> + }
>> +
>> + ret = pseudo_lock_clos_set(plr, closid, plr->cbm);
>> + if (ret < 0) {
>> + rdt_last_cmd_printf("unable to set closid %d bitmask to 0x%x\n",
>> + closid, plr->cbm);
>> + goto out_clos_def;
>> + }
>> +
>> + plr->closid = closid;
>> +
>> + thread_done = 0;
>> +
>> + thread = kthread_create_on_node(pseudo_lock_fn, plr,
>> + cpu_to_node(plr->cpu),
>> + "pseudo_lock/%u", plr->cpu);
>> + if (IS_ERR(thread)) {
>> + ret = PTR_ERR(thread);
>> + rdt_last_cmd_printf("locking thread returned error %d\n", ret);
>> + /*
>> + * We do not return CBM to newly allocated CLOS here on
>> + * error path since that will result in a CBM of all
>> + * zeroes which is an illegal MSR write.
>
> I'm not sure what you are trying to explain here.
>
> If you remove a ctrl group then the CBM bits are not added to anything
> either. It's up to the operator to handle that. Why would this be any
> different for the pseudo-locking stuff?
It is not different, no. On failure the closid is released but the CBM
associated with it remains. Here I attempted to explain why the CBM
remains. This is the same behavior as current CAT. I will remove the
comment since it is just causing confusion.
>> + */
>> + goto out_clos_def;
>> + }
>> +
>> + kthread_bind(thread, plr->cpu);
>> + wake_up_process(thread);
>> +
>> + ret = wait_event_interruptible(wq, thread_done == 1);
>> + if (ret < 0) {
>> + rdt_last_cmd_puts("locking thread interrupted\n");
>> + goto out_clos_def;
>
> This is broken. If the thread does not get on the CPU for whatever reason
> and the process which sets up the region is interrupted then this will
> leave the thread in runnable state and once it gets on the CPU it will
> happily derefence the freed plr struct and fiddle with the freed memory.
>
> You need to make sure that the thread holds a reference on the plr struct,
> which prevents freeing. That includes the CLOSID .....
Thanks for catching this.
>
>> + }
>> +
>> + /*
>> + * closid will be released soon but its CBM as well as CBM of not
>> + * yet allocated CLOS as stored in the array will remain. Ensure
>> + * that CBM will be what is currently the default CLOS, which
>> + * excludes pseudo-locked region.
>> + */
>> + for (i = 1; i < r->num_closid; i++) {
>> + if (i == closid || !closid_allocated(i))
>> + pseudo_lock_clos_set(plr, i, d->ctrl_val[0]);
>> + }
>
> This is all magical duct tape. The overall design of this is sideways and
> not really well integrated into the existing infrastructure which creates
> these kinds of magic warts and lots of duplicated code.
>
> The deeper I read into the patch series the less I like that interface and
> the implementation.
>
> Let's look at the existing crtl/mon groups which are each represented by a
> directory already.
>
> - Adding a 'size' file to the ctrl groups would be a natural extension
> which makes sense for regular cache allocations as well.
>
> - Adding a 'exclusive' flag would be an interesting feature even for the
> normal use case. Marking a group as exclusive prevents other groups to
> request CBM bits which are held by a exclusive allocation.
>
> I'd suggest to have a file 'mode' for controlling this. The valid values
> would be something like 'shareable' and 'exclusive'.
>
> When trying to set a group to exclusive mode then the schemata has to be
> checked for overlaps with the other schematas and in case of conflict
> the write fails. Once enabled subsequent writes to the schemata file
> need to be checked for conflicts as well.
>
> If the exclusive setting is enabled then the CBM bits of that group
> are excluded from being used in other control groups.
>
> Aside of that a file in the info directory which shows the (un)used CBM
> bits of all groups is really helpful for controlling all of that (even w/o
> pseudo locking). You have this in the 'avail' file, but there is no reason
> why this should only be available for pseudo locking enabled systems.
>
> Now for the pseudo locking part.
>
> What you need on top of the above is a new 'mode': 'locked'. That mode
> utilizes the 'exclusive' mode rules vs. conflict checking and the
> protection against allocating the associated CBM bits in other control
> groups.
>
> The setup would be like this:
>
> mkdir group
> echo '$CONFIG' >group/schemata
> echo 'locked' >group/mode
>
> Setting mode to locked locks down the schemata file along with the
> task/cpus/cpus_list files. The task/cpu files need to be empty when
> entering locked mode, otherwise the operation fails. I'd even would not
> bother handing back the CLOSID. For simplicity the CLOSID should just stay
> associated with the control group until it is destroyed as any other
> control group.
Thank you so much for taking the time to do this thorough review and to
make these suggestions. While I am still digesting the details I do
intend to follow all (as well as the ones earlier I did not explicitly
respond to).
Keeping the CLOSID associated with the pseudo-locked region will surely
make the above simpler since CLOSID's are association with resource
groups (represented by the directories). I would like to highlight that
on some platforms there are only a few (for example, 4) CLOSIDs
available. Not releasing a CLOSID would thus reduce available CLOSIDs
that are already limited. These platforms do have smaller possible
bitmasks though (for example, 8 possible bits), which may make light of
this concern. I thus just add it as informational to the consequence of
this simplification.
> Now the remaining thing is the memory allocation and the mmap itself. I
> really dislike the preallocation of memory right at setup time. Ideally
> that should be an allocation of the application itself, but the horrid
> wbinvd stuff kinda prevents that. With that restriction we are more or less
> bound to immediate allocation and population.
Acknowledged. I am not sure if the current permissions would support
such a dynamic setup though. At this time the system administrator is
the one that sets up the pseudo-locked region and can through
permissions of the character device provide access to these regions to
user space applications.
Reinette
next prev parent reply other threads:[~2018-02-20 18:47 UTC|newest]
Thread overview: 65+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-02-13 15:46 [RFC PATCH V2 00/22] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling Reinette Chatre
2018-02-13 15:46 ` [RFC PATCH V2 01/22] x86/intel_rdt: Documentation for Cache Pseudo-Locking Reinette Chatre
2018-02-19 20:35 ` Thomas Gleixner
2018-02-19 22:15 ` Reinette Chatre
2018-02-19 22:19 ` Thomas Gleixner
2018-02-19 22:24 ` Reinette Chatre
2018-02-19 21:27 ` Randy Dunlap
2018-02-19 22:21 ` Reinette Chatre
2018-02-13 15:46 ` [RFC PATCH V2 02/22] x86/intel_rdt: Make useful functions available internally Reinette Chatre
2018-02-13 15:46 ` [RFC PATCH V2 03/22] x86/intel_rdt: Introduce hooks to create pseudo-locking files Reinette Chatre
2018-02-13 15:46 ` [RFC PATCH V2 04/22] x86/intel_rdt: Introduce test to determine if closid is in use Reinette Chatre
2018-02-13 15:46 ` [RFC PATCH V2 05/22] x86/intel_rdt: Print more accurate pseudo-locking availability Reinette Chatre
2018-02-13 15:46 ` [RFC PATCH V2 06/22] x86/intel_rdt: Create pseudo-locked regions Reinette Chatre
2018-02-19 20:57 ` Thomas Gleixner
2018-02-19 23:02 ` Reinette Chatre
2018-02-19 23:16 ` Thomas Gleixner
2018-02-20 3:21 ` Reinette Chatre
2018-02-13 15:46 ` [RFC PATCH V2 07/22] x86/intel_rdt: Connect pseudo-locking directory to operations Reinette Chatre
2018-02-13 15:46 ` [RFC PATCH V2 08/22] x86/intel_rdt: Introduce pseudo-locking resctrl files Reinette Chatre
2018-02-19 21:01 ` Thomas Gleixner
2018-02-13 15:46 ` [RFC PATCH V2 09/22] x86/intel_rdt: Discover supported platforms via prefetch disable bits Reinette Chatre
2018-02-13 15:46 ` [RFC PATCH V2 10/22] x86/intel_rdt: Disable pseudo-locking if CDP enabled Reinette Chatre
2018-02-13 15:46 ` [RFC PATCH V2 11/22] x86/intel_rdt: Associate pseudo-locked regions with its domain Reinette Chatre
2018-02-19 21:19 ` Thomas Gleixner
2018-02-19 23:00 ` Reinette Chatre
2018-02-19 23:19 ` Thomas Gleixner
2018-02-20 3:17 ` Reinette Chatre
2018-02-20 10:00 ` Thomas Gleixner
2018-02-20 16:02 ` Reinette Chatre
2018-02-20 17:18 ` Thomas Gleixner
2018-02-13 15:46 ` [RFC PATCH V2 12/22] x86/intel_rdt: Support CBM checking from value and character buffer Reinette Chatre
2018-02-13 15:46 ` [RFC PATCH V2 13/22] x86/intel_rdt: Support schemata write - pseudo-locking core Reinette Chatre
2018-02-20 17:15 ` Thomas Gleixner
2018-02-20 18:47 ` Reinette Chatre [this message]
2018-02-20 23:21 ` Thomas Gleixner
2018-02-21 1:58 ` Mike Kravetz
2018-02-21 6:10 ` Reinette Chatre
2018-02-21 8:34 ` Thomas Gleixner
2018-02-21 5:58 ` Reinette Chatre
2018-02-27 0:34 ` Reinette Chatre
2018-02-27 10:36 ` Thomas Gleixner
2018-02-27 15:38 ` Thomas Gleixner
2018-02-27 19:52 ` Reinette Chatre
2018-02-27 21:33 ` Reinette Chatre
2018-02-28 18:39 ` Thomas Gleixner
2018-02-28 19:17 ` Reinette Chatre
2018-02-28 19:40 ` Thomas Gleixner
2018-02-27 21:01 ` Reinette Chatre
2018-02-28 17:57 ` Thomas Gleixner
2018-02-28 17:59 ` Thomas Gleixner
2018-02-28 18:34 ` Reinette Chatre
2018-02-28 18:42 ` Thomas Gleixner
2018-02-13 15:46 ` [RFC PATCH V2 14/22] x86/intel_rdt: Enable testing for pseudo-locked region Reinette Chatre
2018-02-13 15:46 ` [RFC PATCH V2 15/22] x86/intel_rdt: Prevent new allocations from pseudo-locked regions Reinette Chatre
2018-02-13 15:47 ` [RFC PATCH V2 16/22] x86/intel_rdt: Create debugfs files for pseudo-locking testing Reinette Chatre
2018-02-13 15:47 ` [RFC PATCH V2 17/22] x86/intel_rdt: Create character device exposing pseudo-locked region Reinette Chatre
2018-02-13 15:47 ` [RFC PATCH V2 18/22] x86/intel_rdt: More precise L2 hit/miss measurements Reinette Chatre
2018-02-13 15:47 ` [RFC PATCH V2 19/22] x86/intel_rdt: Support L3 cache performance event of Broadwell Reinette Chatre
2018-02-13 15:47 ` [RFC PATCH V2 20/22] x86/intel_rdt: Limit C-states dynamically when pseudo-locking active Reinette Chatre
2018-02-13 15:47 ` [RFC PATCH V2 21/22] mm/hugetlb: Enable large allocations through gigantic page API Reinette Chatre
2018-02-13 15:47 ` [RFC PATCH V2 22/22] x86/intel_rdt: Support contiguous memory of all sizes Reinette Chatre
2018-02-14 18:12 ` [RFC PATCH V2 00/22] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling Mike Kravetz
2018-02-14 18:31 ` Reinette Chatre
2018-02-15 20:39 ` Reinette Chatre
2018-02-15 21:10 ` Mike Kravetz
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=6c960fc0-820e-757c-2770-d770647e63d6@intel.com \
--to=reinette.chatre@intel.com \
--cc=dave.hansen@intel.com \
--cc=fenghua.yu@intel.com \
--cc=gavin.hindman@intel.com \
--cc=hpa@zytor.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@redhat.com \
--cc=tglx@linutronix.de \
--cc=tony.luck@intel.com \
--cc=vikas.shivappa@linux.intel.com \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).