From: Shivappa Vikas <vikas.shivappa@intel.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>,
	x86@kernel.org, linux-kernel@vger.kernel.org, hpa@zytor.com,
	peterz@infradead.org, ravi.v.shankar@intel.com,
	vikas.shivappa@intel.com, tony.luck@intel.com,
	fenghua.yu@intel.com, andi.kleen@intel.com
Subject: Re: [PATCH 08/21] x86/intel_rdt/cqm: Add RMID(Resource monitoring ID) management
Date: Tue, 11 Jul 2017 16:54:50 -0700 (PDT)	[thread overview]
Message-ID: <alpine.DEB.2.10.1707111645080.6175@vshiva-Udesk> (raw)
In-Reply-To: <alpine.DEB.2.20.1707030954330.2188@nanos>



On Mon, 3 Jul 2017, Thomas Gleixner wrote:

> On Sun, 2 Jul 2017, Thomas Gleixner wrote:
>> Thinking a bit more about that limbo mechanics.
>>
>> In case that a RMID was never used on a particular package, the state check
>> forces an IPI on all packages unconditionally. That's suboptimal at least.
>>
>> We know on which package a given RMID was used, so we could restrict the
>> checks to exactly these packages, but I'm not sure it's worth the
>> trouble. We might at least document that and explain why this is
>> implemented in that way.
>
> Second thoughts on that. The allocation logic is:
>
>> +       if (list_empty(&rmid_free_lru)) {
>> +               ret = try_freeing_limbo_rmid();
>> +               if (list_empty(&rmid_free_lru))
>> +                       return ret ? -ENOSPC : -EBUSY;
>> +       }
>> +
>> +       entry = list_first_entry(&rmid_free_lru,
>> +                                struct rmid_entry, list);
>> +       list_del(&entry->list);
>> +
>> +       return entry->rmid;
>
> That means, the free list is used as the primary source. One of my boxes
> has 143 RMIDs. So it only takes 142 mkdir/rmdir invocations to move all
> RMIDs to the limbo list. On the next mkdir invocation the allocation goes
> into the limbo path and the SMP function call has to walk the list with 142
> entries on ALL online domains whether they used the RMID or not!

Would it be better if we do this in the MBM 1s overflow timer delayed_work? That 
is not in interrupt context. So we do a periodic flush of the limbo list, and 
then mkdir fails with -EBUSY if list_empty(&free_list) && 
!list_empty(&limbo_list).

To improve on that, we could also include the optimization Tony suggested: skip 
the checks for RMIDs which have already been found to be below the threshold 
(however, that needs a domain mask like I mention below, but maybe we can just 
check the list here).

>
> That's bad enough already and the number of RMIDs will not become smaller;
> it doubled from HSW to BDW ...
>
> The HPC and RT folks will love you for that - NOT!
>
> So this needs to be solved differently.
>
> Let's have a look at the context switch path first. That's the most
> sensitive part of it.
>
> 	if (static_branch_likely(&rdt_mon_enable_key)) {
> 		if (current->rmid)
> 			newstate.rmid = current->rmid;
> 	}
>
> That's optimized for the !monitoring case. So we can really penalize the
> per task monitoring case.
>
> 	if (static_branch_likely(&rdt_mon_enable_key)) {
> 		if (unlikely(current->rmid)) {
> 			newstate.rmid = current->rmid;
> 			__set_bit(newstate.rmid, this_cpu_ptr(rmid_bitmap));
> 		}
> 	}
>
> Now in rmid_free() we can collect that information:
>
> 	cpumask_clear(&tmpmask);
> 	cpumask_clear(rmid_entry->mask);
>
> 	cpus_read_lock();
> 	for_each_online_cpu(cpu) {
> 		if (test_and_clear_bit(rmid, per_cpu_ptr(rmid_bitmap, cpu)))
> 			cpumask_set_cpu(cpu, &tmpmask);
> 	}
>
> 	for_each_domain(d, resource) {
> 		cpu = cpumask_any_and(&d->cpu_mask, &tmpmask);
> 		if (cpu < nr_cpu_ids)
> 			cpumask_set_cpu(cpu, rmid_entry->mask);

When this cpu goes offline, rmid_entry->mask needs an update. Otherwise the 
work function would skip the entry because

              if (!cpumask_test_cpu(cpu, rme->mask))

evaluates true, since the work may have been moved to a different cpu.

So do we really need a package mask? Or rather a per-domain mask? For that we 
don't know the maximum domain number (which is why we use a list..).

> 	}
>
> 	list_add(&rmid_entry->list, &limbo_list);
>
> 	for_each_cpu(cpu, rmid_entry->mask)
> 		schedule_delayed_work_on(cpu, rmid_work);
> 	cpus_read_unlock();
>
> The work function:
>
>    	bool resched = false;
>
>    	list_for_each_entry(rme, limbo_list,...) {
> 		if (!cpumask_test_cpu(cpu, rme->mask))
> 			continue;
>
> 		if (!rmid_is_reusable(rme)) {
> 			resched = true;
> 			continue;
> 		}
>
> 		cpumask_clear_cpu(cpu, rme->mask);
> 		if (!cpumask_empty(rme->mask))
> 			continue;
>
> 		/* Ready for reuse */
> 		list_del(&rme->list);
> 		list_add(&rme->list, &free_list);
> 	}
>
> The alloc function then becomes:
>
> 	if (list_empty(&free_list))
> 		return list_empty(&limbo_list) ? -ENOSPC : -EBUSY;
>
> The switch_to() covers the task rmids. The per cpu default rmids can be
> marked at the point where they are installed on a CPU in the per cpu
> rmid_bitmap. The free path is the same for per task and per cpu.
>
> Another thing which needs some thought is the CPU hotplug code. We need to
> make sure that pending work which is scheduled on an outgoing CPU is moved
> in the offline callback to a still online CPU of the same domain and not
> moved to some random CPU by the workqueue hotplug code.
>
> There is another subtle issue. Assume a RMID is freed. The limbo stuff is
> scheduled on all domains which have online CPUs.
>
> Now the last CPU of a domain goes offline before the threshold for clearing
> the domain CPU bit in the rme->mask is reached.
>
> So we have two options here:
>
>   1) Clear the bit unconditionally when the last CPU of a domain goes
>      offline.
>
>   2) Arm a timer which clears the bit after a grace period
>
> #1 The RMID might become available for reuse right away because all other
>   domains have not used it or have cleared their bits already.
>
>   If one of the CPUs of that domain comes online again and is associated
>   to that reused RMID again, then the counter content might still contain
>   leftovers from the previous usage.
>
> #2 Prevents #1 but has its own issues vs. serialization and coordination
>   with CPU hotplug.
>
> I'd say we go for #1 as the simplest solution, document it and if really
> the need arises revisit it later.
>
> Thanks,
>
> 	tglx
>
