All of lore.kernel.org
 help / color / mirror / Atom feed
From: Reinette Chatre <reinette.chatre@intel.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: fenghua.yu@intel.com, tony.luck@intel.com,
	gavin.hindman@intel.com, vikas.shivappa@linux.intel.com,
	dave.hansen@intel.com, mingo@redhat.com, hpa@zytor.com,
	x86@kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH V2 13/22] x86/intel_rdt: Support schemata write - pseudo-locking core
Date: Tue, 20 Feb 2018 10:47:51 -0800	[thread overview]
Message-ID: <6c960fc0-820e-757c-2770-d770647e63d6@intel.com> (raw)
In-Reply-To: <alpine.DEB.2.21.1802200059240.1853@nanos.tec.linutronix.de>

Hi Thomas,

On 2/20/2018 9:15 AM, Thomas Gleixner wrote:
> On Tue, 13 Feb 2018, Reinette Chatre wrote:
>>  static void __pseudo_lock_region_release(struct pseudo_lock_region *plr)
>>  {
>>  	bool is_new_plr = (plr == new_plr);
>> @@ -93,6 +175,23 @@ static void __pseudo_lock_region_release(struct pseudo_lock_region *plr)
>>  	if (!plr->deleted)
>>  		return;
>>  
>> +	if (plr->locked) {
>> +		plr->d->plr = NULL;
>> +		/*
>> +		 * Resource groups come and go. Simply returning this
>> +		 * pseudo-locked region's bits to the default CLOS may
>> +		 * result in default CLOS to become fragmented, causing
>> +		 * the setting of its bitmask to fail. Ensure it is valid
>> +		 * first. If this check does fail we cannot return the bits
>> +		 * to the default CLOS and userspace intervention would be
>> +		 * required to ensure portions of the cache do not go
>> +		 * unused.
>> +		 */
>> +		if (cbm_validate_val(plr->d->ctrl_val[0] | plr->cbm, plr->r))
>> +			pseudo_lock_clos_set(plr, 0,
>> +					     plr->d->ctrl_val[0] | plr->cbm);
>> +		pseudo_lock_region_clear(plr);
>> +	}
>>  	kfree(plr);
>>  	if (is_new_plr)
>>  		new_plr = NULL;
> 
> Are you really sure that the life time rules of plr are correct vs. an
> application which still has the locked memory mapped? i.e. the following
> operation:

You are correct. I am not preventing an administrator from removing the
pseudo-locked region if it is in use. I will fix that.

> 1# create_pseudo_lock_region()
> 
> 					2# start_app()
> 					   fd = open(/dev/.../lock);
> 					   ptr = mmap(fd, .....); <- takes a ref on fd
> 					   close(fd);
> 					   do_stuff(ptr);
> 
> 1# rmdir .../lock
> 
> 					   unmap(ptr);	<- releases fd
> 
> I can't see how that is protected. You already have a kref in the PLR, but
> it's in no way connected to the file descriptor lifetime.  So the refcount
> logic here must be:
> 
> create_lock_region()
> 	plr = alloc_plr();
> 	take_ref(plr);
> 	if (!init_plr(plr)) {
> 		drop_ref(plr);
> 		...
> 	}
> 
> lockdev_open(filp)
> 	take_ref(plr);
> 	filp->private = plr;
> 
> rmdir_lock_region()
> 	...
> 	drop_ref(plr);
> 
> lockdev_relese(filp)
> 	filp->private = NULL;
> 	drop_ref(plr);
> 
>>  /*
>> + * Only one pseudo-locked region can be set up at a time and that is
>> + * enforced by taking the rdt_pseudo_lock_mutex when the user writes the
>> + * requested schemata to the resctrl file and releasing the mutex on
>> + * completion. The thread locking the kernel memory into the cache starts
>> + * and completes during this time so we can be sure that only one thread
>> + * can run at any time.
>> + * The functions starting the pseudo-locking thread needs to wait for its
>> + * completion and since there can only be one we have a global workqueue
>> + * and variable to support this.
>> + */
>> +static DECLARE_WAIT_QUEUE_HEAD(wq);
>> +static int thread_done;
> 
> Eew. For one, you really couldn't come up with more generic and less
> relatable variable names, right?
> 
> That aside, its just wrong to build code based on current hardware
> limitations. The waitqueue and the result code belong into PLR.

Will do. This also builds on your previous suggestion to not limit the
number of uninitialized pseudo-locked regions.

> 
>> +/**
>> + * pseudo_lock_fn - Load kernel memory into cache
>> + *
>> + * This is the core pseudo-locking function.
>> + *
>> + * First we ensure that the kernel memory cannot be found in the cache.
>> + * Then, while taking care that there will be as little interference as
>> + * possible, each cache line of the memory to be loaded is touched while
>> + * core is running with class of service set to the bitmask of the
>> + * pseudo-locked region. After this is complete no future CAT allocations
>> + * will be allowed to overlap with this bitmask.
>> + *
>> + * Local register variables are utilized to ensure that the memory region
>> + * to be locked is the only memory access made during the critical locking
>> + * loop.
>> + */
>> +static int pseudo_lock_fn(void *_plr)
>> +{
>> +	struct pseudo_lock_region *plr = _plr;
>> +	u32 rmid_p, closid_p;
>> +	unsigned long flags;
>> +	u64 i;
>> +#ifdef CONFIG_KASAN
>> +	/*
>> +	 * The registers used for local register variables are also used
>> +	 * when KASAN is active. When KASAN is active we use a regular
>> +	 * variable to ensure we always use a valid pointer, but the cost
>> +	 * is that this variable will enter the cache through evicting the
>> +	 * memory we are trying to lock into the cache. Thus expect lower
>> +	 * pseudo-locking success rate when KASAN is active.
>> +	 */
> 
> I'm not a real fan of this mess. But well, 
> 
>> +	unsigned int line_size;
>> +	unsigned int size;
>> +	void *mem_r;
>> +#else
>> +	register unsigned int line_size asm("esi");
>> +	register unsigned int size asm("edi");
>> +#ifdef CONFIG_X86_64
>> +	register void *mem_r asm("rbx");
>> +#else
>> +	register void *mem_r asm("ebx");
>> +#endif /* CONFIG_X86_64 */
>> +#endif /* CONFIG_KASAN */
>> +
>> +	/*
>> +	 * Make sure none of the allocated memory is cached. If it is we
>> +	 * will get a cache hit in below loop from outside of pseudo-locked
>> +	 * region.
>> +	 * wbinvd (as opposed to clflush/clflushopt) is required to
>> +	 * increase likelihood that allocated cache portion will be filled
>> +	 * with associated memory
> 
> Sigh.
> 
>> +	 */
>> +	wbinvd();
>> +
>> +	preempt_disable();
>> +	local_irq_save(flags);
> 
> preempt_disable() is pointless when you disable interrupts.  And this
> really should be local_irq_disable(). This code is always called with
> interrupts enabled....
> 
>> +	/*
>> +	 * Call wrmsr and rdmsr as directly as possible to avoid tracing
>> +	 * clobbering local register variables or affecting cache accesses.
>> +	 */
> 
> You probably want to make sure that the code below is in L1 cache already
> before the CLOSID is set to the allocation. To do this you want to put the
> preload mechanics into a separate ASM function.
> 
> Then you run it with size = 1 on some other temporary memory buffer with
> the default CLOSID, which has the CBM bits of the lock region excluded,
> Then switch to the real CLOSID and run the loop with the real buffer and
> the real size.

Thank you for the suggestion. I will experiment how this affects the
pseudo-locked region success.

>> +	__wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
> 
> This wants an explanation why the prefetcher needs to be disabled.
> 
>> +static int pseudo_lock_doit(struct pseudo_lock_region *plr,
>> +			    struct rdt_resource *r,
>> +			    struct rdt_domain *d)
>> +{
>> +	struct task_struct *thread;
>> +	int closid;
>> +	int ret, i;
>> +
>> +	/*
>> +	 * With the usage of wbinvd we can only support one pseudo-locked
>> +	 * region per domain at this time.
> 
> This really sucks.
> 
>> +	 */
>> +	if (d->plr) {
>> +		rdt_last_cmd_puts("pseudo-locked region exists on cache\n");
>> +		return -ENOSPC;
> 
> This check is not sufficient for a CPU which has both L2 and L3 allocation
> capability. If there is already a L3 locked region and the current call
> sets up a L2 locked region then this will not catch it and the following
> wbinvd will wipe the L3 locked region ....
> 
>> +	}
>> +
>> +	ret = pseudo_lock_region_init(plr, r, d);
>> +	if (ret < 0)
>> +		return ret;
>> +
>> +	closid = closid_alloc();
>> +	if (closid < 0) {
>> +		ret = closid;
>> +		rdt_last_cmd_puts("unable to obtain free closid\n");
>> +		goto out_region;
>> +	}
>> +
>> +	/*
>> +	 * Ensure we end with a valid default CLOS. If a pseudo-locked
>> +	 * region in middle of possible bitmasks is selected it will split
>> +	 * up default CLOS which would be a fault and for which handling
>> +	 * is unclear so we fail back to userspace. Validation will also
>> +	 * ensure that default CLOS is not zero, keeping some cache
>> +	 * available to rest of system.
>> +	 */
>> +	if (!cbm_validate_val(d->ctrl_val[0] & ~plr->cbm, r)) {
>> +		ret = -EINVAL;
>> +		rdt_last_cmd_printf("bm 0x%x causes invalid clos 0 bm 0x%x\n",
>> +				    plr->cbm, d->ctrl_val[0] & ~plr->cbm);
>> +		goto out_closid;
>> +	}
>> +
>> +	ret = pseudo_lock_clos_set(plr, 0, d->ctrl_val[0] & ~plr->cbm);
> 
> Fiddling with the default CBM is wrong. The lock operation should only
> succeed when the bits in that domain are not used by _ANY_ control group
> including the default one. This is a reasonable constraint.

This changes one of my original assumptions. I will rework all to adjust
since your later design change suggestions will impact this.

>> +	if (ret < 0) {
>> +		rdt_last_cmd_printf("unable to set clos 0 bitmask to 0x%x\n",
>> +				    d->ctrl_val[0] & ~plr->cbm);
>> +		goto out_closid;
>> +	}
>> +
>> +	ret = pseudo_lock_clos_set(plr, closid, plr->cbm);
>> +	if (ret < 0) {
>> +		rdt_last_cmd_printf("unable to set closid %d bitmask to 0x%x\n",
>> +				    closid, plr->cbm);
>> +		goto out_clos_def;
>> +	}
>> +
>> +	plr->closid = closid;
>> +
>> +	thread_done = 0;
>> +
>> +	thread = kthread_create_on_node(pseudo_lock_fn, plr,
>> +					cpu_to_node(plr->cpu),
>> +					"pseudo_lock/%u", plr->cpu);
>> +	if (IS_ERR(thread)) {
>> +		ret = PTR_ERR(thread);
>> +		rdt_last_cmd_printf("locking thread returned error %d\n", ret);
>> +		/*
>> +		 * We do not return CBM to newly allocated CLOS here on
>> +		 * error path since that will result in a CBM of all
>> +		 * zeroes which is an illegal MSR write.
> 
> I'm not sure what you are trying to explain here.
> 
> If you remove a ctrl group then the CBM bits are not added to anything
> either. It's up to the operator to handle that. Why would this be any
> different for the pseudo-locking stuff?

It is not different, no. On failure the closid is released but the CBM
associated with it remains. Here I attempted to explain why the CBM
remains. This is the same behavior as current CAT. I will remove the
comment since it is just causing confusion.

>> +		 */
>> +		goto out_clos_def;
>> +	}
>> +
>> +	kthread_bind(thread, plr->cpu);
>> +	wake_up_process(thread);
>> +
>> +	ret = wait_event_interruptible(wq, thread_done == 1);
>> +	if (ret < 0) {
>> +		rdt_last_cmd_puts("locking thread interrupted\n");
>> +		goto out_clos_def;
> 
> This is broken. If the thread does not get on the CPU for whatever reason
> and the process which sets up the region is interrupted then this will
> leave the thread in runnable state and once it gets on the CPU it will
> happily derefence the freed plr struct and fiddle with the freed memory.
> 
> You need to make sure that the thread holds a reference on the plr struct,
> which prevents freeing. That includes the CLOSID .....

Thanks for catching this.

> 
>> +	}
>> +
>> +	/*
>> +	 * closid will be released soon but its CBM as well as CBM of not
>> +	 * yet allocated CLOS as stored in the array will remain. Ensure
>> +	 * that CBM will be what is currently the default CLOS, which
>> +	 * excludes pseudo-locked region.
>> +	 */
>> +	for (i = 1; i < r->num_closid; i++) {
>> +		if (i == closid || !closid_allocated(i))
>> +			pseudo_lock_clos_set(plr, i, d->ctrl_val[0]);
>> +	}
> 
> This is all magical duct tape. The overall design of this is sideways and
> not really well integrated into the existing infrastructure which creates
> these kinds of magic warts and lots of duplicated code.
> 
> The deeper I read into the patch series the less I like that interface and
> the implementation.
> 
> Let's look at the existing crtl/mon groups which are each represented by a
> directory already.
> 
>  - Adding a 'size' file to the ctrl groups would be a natural extension
>    which makes sense for regular cache allocations as well.
> 
>  - Adding a 'exclusive' flag would be an interesting feature even for the
>    normal use case. Marking a group as exclusive prevents other groups to
>    request CBM bits which are held by a exclusive allocation.
> 
>    I'd suggest to have a file 'mode' for controlling this. The valid values
>    would be something like 'shareable' and 'exclusive'.
> 
>    When trying to set a group to exclusive mode then the schemata has to be
>    checked for overlaps with the other schematas and in case of conflict
>    the write fails. Once enabled subsequent writes to the schemata file
>    need to be checked for conflicts as well.
> 
>    If the exclusive setting is enabled then the CBM bits of that group
>    are excluded from being used in other control groups.
> 
> Aside of that a file in the info directory which shows the (un)used CBM
> bits of all groups is really helpful for controlling all of that (even w/o
> pseudo locking). You have this in the 'avail' file, but there is no reason
> why this should only be available for pseudo locking enabled systems.
> 
> Now for the pseudo locking part.
> 
> What you need on top of the above is a new 'mode': 'locked'. That mode
> utilizes the 'exclusive' mode rules vs. conflict checking and the
> protection against allocating the associated CBM bits in other control
> groups.
> 
> The setup would be like this:
> 
>     mkdir group
>     echo '$CONFIG' >group/schemata
>     echo 'locked' >group/mode
> 
> Setting mode to locked locks down the schemata file along with the
> task/cpus/cpus_list files. The task/cpu files need to be empty when
> entering locked mode, otherwise the operation fails. I'd even would not
> bother handing back the CLOSID. For simplicity the CLOSID should just stay
> associated with the control group until it is destroyed as any other
> control group.

Thank you so much for taking the time to do this thorough review and to
make these suggestions. While I am still digesting the details I do
intend to follow all (as well as the ones earlier I did not explicitly
respond to).

Keeping the CLOSID associated with the pseudo-locked region will surely
make the above simpler since CLOSID's are association with resource
groups (represented by the directories). I would like to highlight that
on some platforms there are only a few (for example, 4) CLOSIDs
available. Not releasing a CLOSID would thus reduce available CLOSIDs
that are already limited. These platforms do have smaller possible
bitmasks though (for example, 8 possible bits), which may make light of
this concern. I thus just add it as informational to the consequence of
this simplification.

> Now the remaining thing is the memory allocation and the mmap itself. I
> really dislike the preallocation of memory right at setup time. Ideally
> that should be an allocation of the application itself, but the horrid
> wbinvd stuff kinda prevents that. With that restriction we are more or less
> bound to immediate allocation and population.

Acknowledged. I am not sure if the current permissions would support
such a dynamic setup though. At this time the system administrator is
the one that sets up the pseudo-locked region and can through
permissions of the character device provide access to these regions to
user space applications.

Reinette

  reply	other threads:[~2018-02-20 18:47 UTC|newest]

Thread overview: 71+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-02-13 15:46 [RFC PATCH V2 00/22] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling Reinette Chatre
2018-02-13 15:46 ` Reinette Chatre
2018-02-13 15:46 ` [RFC PATCH V2 01/22] x86/intel_rdt: Documentation for Cache Pseudo-Locking Reinette Chatre
2018-02-19 20:35   ` Thomas Gleixner
2018-02-19 22:15     ` Reinette Chatre
2018-02-19 22:19       ` Thomas Gleixner
2018-02-19 22:24         ` Reinette Chatre
2018-02-19 21:27   ` Randy Dunlap
2018-02-19 22:21     ` Reinette Chatre
2018-02-13 15:46 ` [RFC PATCH V2 02/22] x86/intel_rdt: Make useful functions available internally Reinette Chatre
2018-02-13 15:46 ` [RFC PATCH V2 03/22] x86/intel_rdt: Introduce hooks to create pseudo-locking files Reinette Chatre
2018-02-13 15:46 ` [RFC PATCH V2 04/22] x86/intel_rdt: Introduce test to determine if closid is in use Reinette Chatre
2018-02-13 15:46 ` [RFC PATCH V2 05/22] x86/intel_rdt: Print more accurate pseudo-locking availability Reinette Chatre
2018-02-13 15:46 ` [RFC PATCH V2 06/22] x86/intel_rdt: Create pseudo-locked regions Reinette Chatre
2018-02-19 20:57   ` Thomas Gleixner
2018-02-19 23:02     ` Reinette Chatre
2018-02-19 23:16       ` Thomas Gleixner
2018-02-20  3:21         ` Reinette Chatre
2018-02-13 15:46 ` [RFC PATCH V2 07/22] x86/intel_rdt: Connect pseudo-locking directory to operations Reinette Chatre
2018-02-13 15:46 ` [RFC PATCH V2 08/22] x86/intel_rdt: Introduce pseudo-locking resctrl files Reinette Chatre
2018-02-19 21:01   ` Thomas Gleixner
2018-02-13 15:46 ` [RFC PATCH V2 09/22] x86/intel_rdt: Discover supported platforms via prefetch disable bits Reinette Chatre
2018-02-13 15:46 ` [RFC PATCH V2 10/22] x86/intel_rdt: Disable pseudo-locking if CDP enabled Reinette Chatre
2018-02-13 15:46 ` [RFC PATCH V2 11/22] x86/intel_rdt: Associate pseudo-locked regions with its domain Reinette Chatre
2018-02-19 21:19   ` Thomas Gleixner
2018-02-19 23:00     ` Reinette Chatre
2018-02-19 23:19       ` Thomas Gleixner
2018-02-20  3:17         ` Reinette Chatre
2018-02-20 10:00           ` Thomas Gleixner
2018-02-20 16:02             ` Reinette Chatre
2018-02-20 17:18               ` Thomas Gleixner
2018-02-13 15:46 ` [RFC PATCH V2 12/22] x86/intel_rdt: Support CBM checking from value and character buffer Reinette Chatre
2018-02-13 15:46 ` [RFC PATCH V2 13/22] x86/intel_rdt: Support schemata write - pseudo-locking core Reinette Chatre
2018-02-20 17:15   ` Thomas Gleixner
2018-02-20 18:47     ` Reinette Chatre [this message]
2018-02-20 23:21       ` Thomas Gleixner
2018-02-21  1:58         ` Mike Kravetz
2018-02-21  6:10           ` Reinette Chatre
2018-02-21  8:34           ` Thomas Gleixner
2018-02-21  5:58         ` Reinette Chatre
2018-02-27  0:34     ` Reinette Chatre
2018-02-27 10:36       ` Thomas Gleixner
2018-02-27 15:38         ` Thomas Gleixner
2018-02-27 19:52         ` Reinette Chatre
2018-02-27 21:33           ` Reinette Chatre
2018-02-28 18:39           ` Thomas Gleixner
2018-02-28 19:17             ` Reinette Chatre
2018-02-28 19:40               ` Thomas Gleixner
2018-02-27 21:01     ` Reinette Chatre
2018-02-28 17:57       ` Thomas Gleixner
2018-02-28 17:59         ` Thomas Gleixner
2018-02-28 18:34           ` Reinette Chatre
2018-02-28 18:42             ` Thomas Gleixner
2018-02-13 15:46 ` [RFC PATCH V2 14/22] x86/intel_rdt: Enable testing for pseudo-locked region Reinette Chatre
2018-02-13 15:46 ` [RFC PATCH V2 15/22] x86/intel_rdt: Prevent new allocations from pseudo-locked regions Reinette Chatre
2018-02-13 15:47 ` [RFC PATCH V2 16/22] x86/intel_rdt: Create debugfs files for pseudo-locking testing Reinette Chatre
2018-02-13 15:47 ` [RFC PATCH V2 17/22] x86/intel_rdt: Create character device exposing pseudo-locked region Reinette Chatre
2018-02-13 15:47 ` [RFC PATCH V2 18/22] x86/intel_rdt: More precise L2 hit/miss measurements Reinette Chatre
2018-02-13 15:47 ` [RFC PATCH V2 19/22] x86/intel_rdt: Support L3 cache performance event of Broadwell Reinette Chatre
2018-02-13 15:47 ` [RFC PATCH V2 20/22] x86/intel_rdt: Limit C-states dynamically when pseudo-locking active Reinette Chatre
2018-02-13 15:47 ` [RFC PATCH V2 21/22] mm/hugetlb: Enable large allocations through gigantic page API Reinette Chatre
2018-02-13 15:47   ` Reinette Chatre
2018-02-13 15:47 ` [RFC PATCH V2 22/22] x86/intel_rdt: Support contiguous memory of all sizes Reinette Chatre
2018-02-14 18:12 ` [RFC PATCH V2 00/22] Intel(R) Resource Director Technology Cache Pseudo-Locking enabling Mike Kravetz
2018-02-14 18:12   ` Mike Kravetz
2018-02-14 18:31   ` Reinette Chatre
2018-02-14 18:31     ` Reinette Chatre
2018-02-15 20:39     ` Reinette Chatre
2018-02-15 20:39       ` Reinette Chatre
2018-02-15 21:10       ` Mike Kravetz
2018-02-15 21:10         ` Mike Kravetz

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=6c960fc0-820e-757c-2770-d770647e63d6@intel.com \
    --to=reinette.chatre@intel.com \
    --cc=dave.hansen@intel.com \
    --cc=fenghua.yu@intel.com \
    --cc=gavin.hindman@intel.com \
    --cc=hpa@zytor.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=tglx@linutronix.de \
    --cc=tony.luck@intel.com \
    --cc=vikas.shivappa@linux.intel.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.