Re: [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support

From: Marcelo Tosatti <mtosatti@redhat.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: "Yu, Fenghua" <fenghua.yu@intel.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	H Peter Anvin <hpa@zytor.com>, Ingo Molnar <mingo@redhat.com>,
	linux-kernel <linux-kernel@vger.kernel.org>, x86 <x86@kernel.org>,
	Vikas Shivappa <vikas.shivappa@linux.intel.com>
Subject: Re: [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support
Date: Fri, 16 Oct 2015 17:24:42 -0300
Message-ID: <20151016202439.GA27055@amt.cnet>
In-Reply-To: <20151016094452.GO3816@twins.programming.kicks-ass.net>

On Fri, Oct 16, 2015 at 11:44:52AM +0200, Peter Zijlstra wrote:
> On Thu, Oct 15, 2015 at 09:17:16PM -0300, Marcelo Tosatti wrote:
> > On Thu, Oct 15, 2015 at 01:37:02PM +0200, Peter Zijlstra wrote:
> > > On Tue, Oct 13, 2015 at 07:40:58PM -0300, Marcelo Tosatti wrote:
> > > > How can you fix the issue of sockets with different reserved cache
> > > > regions with hw in the cgroup interface?
> > > 
> > > No idea what you're referring to. But IOCTLs blow.
> > 
> > Tejun brought up syscalls. Syscalls seem too generic.
> > So ioctls were chosen instead.
> > 
> > It is necessary to perform the following operations:
> > 
> > 1) create cache reservation (params = size, type).
> 
> mkdir
> 
> > 2) delete cache reservation.
> 
> rmdir
> 
> > 3) attach cache reservation (params = cache reservation id, pid).
> > 4) detach cache reservation (params = cache reservation id, pid).
> 
> echo $pid > tasks
> 
> > Can it be done via cgroups? If so, works for me.
> 
> Trivially.

Fine. 
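
For concreteness, a minimal userspace sketch of how the four operations
map onto a cgroup-style directory tree. Only mkdir, rmdir and writing a
pid to "tasks" come from the reply above; the helper names, the root
path argument and the error convention are assumptions made for
illustration, and detach is assumed to work the cgroup way, by moving
the pid back to the root group's tasks file.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* All names below are hypothetical; 'root' is wherever the hierarchy is mounted. */

/* 1) create cache reservation: mkdir <root>/<name> */
int resv_create(const char *root, const char *name)
{
        char path[4096];
        int n;

        assert(root && name);
        n = snprintf(path, sizeof(path), "%s/%s", root, name);
        if (n < 0 || (size_t)n >= sizeof(path))
                return -ENAMETOOLONG;
        return mkdir(path, 0755) == 0 ? 0 : -errno;
}

/* 2) delete cache reservation: rmdir <root>/<name> */
int resv_delete(const char *root, const char *name)
{
        char path[4096];
        int n;

        assert(root && name);
        n = snprintf(path, sizeof(path), "%s/%s", root, name);
        if (n < 0 || (size_t)n >= sizeof(path))
                return -ENAMETOOLONG;
        return rmdir(path) == 0 ? 0 : -errno;
}

/* 3) attach: the equivalent of "echo $pid > <root>/<name>/tasks" */
int resv_attach(const char *root, const char *name, pid_t pid)
{
        char path[4096];
        FILE *f;
        int n, rc = 0;

        assert(root && name);
        n = snprintf(path, sizeof(path), "%s/%s/tasks", root, name);
        if (n < 0 || (size_t)n >= sizeof(path))
                return -ENAMETOOLONG;
        f = fopen(path, "w");
        if (!f)
                return -errno;
        if (fprintf(f, "%ld\n", (long)pid) < 0)
                rc = -EIO;
        if (fclose(f) != 0 && rc == 0)
                rc = -errno;
        return rc;
}

/* 4) detach: assumed to mean moving the pid back to the root group */
int resv_detach(const char *root, pid_t pid)
{
        return resv_attach(root, ".", pid);
}
```

Note that nothing here coordinates two users doing mkdir/rmdir on the
same name at the same time, which is the locking question below.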

Tejun brought up the problem of locking: how do you coordinate locking
between different users?  (on the mkdir / rmdir scenario above).

> 
> > A list of problems with the cgroup interface has been written,
> > in the thread... and we found another problem.
> 
> Which was endless and tiresome so I stopped reading.
> 
> > List of problems with cgroup interface:
> > 
> > 1) Global IPI on CBM <---> task change does not scale.
> > 
> > /*
> >  * cbm_update_all() - Update the cache bit mask for all packages.
> >  */
> > static inline void cbm_update_all(u32 closid)
> > {
> >        on_each_cpu_mask(&rdt_cpumask, cbm_cpu_update, (void *)closid, 1);
> > }
> 
> There is no way around that, the moment you view the CBM as a global
> resource; ie. a CBM is configured the same on all sockets; you need to
> do this for a task using that CBM might run on any CPU at any time.
> 
> This is not because of the cgroup interface at all. This is because you
> want CBMs to be the same machine wide.

You don't, for two reasons:

1) Item 6 below.
2) Item 7 below.

Please follow on with the discussion (just scroll down and read and
reply inline: item 6 and machine wide CBMs are not incompatible
because...).

> The only way to actually change that is to _be_ a cgroup and co-mount
> with cpusets and be incestuous and look at the cpusets state and
> discover disjoint groups.
> 
> > 2) Syscall interface specification is in kbytes, not
> > cache ways (which is what must be recorded by the OS
> > to allow migration of the OS between different
> > hardware systems).
> 
> Meh, that again is nothing fundamental. The cgroup interface could do
> bytes just the same.

Yes.

> > 3) Compilers are able to configure cache optimally for
> > given ranges of code inside applications, easily,
> > if desired.
> 
> Yeah, so? Every SKU has a different cache size, so once you're down to
> that level you're pretty hard set in your configuration and it really
> doesn't matter if you give bytes or ways, you _KNOW_ what your
> configuration will be.

That item has nothing to do with whether the cache is specified in
bytes or in ways.

> > 4) Problem-2: The decision to allocate cache is tied to application
> > initialization / destruction, and application initialization is
> > essentially random from the POV of the system (the events which trigger
> > the execution of the application are not visible from the system).
> > 
> > Think of a server running two different servers: one database
> > with requests that are received with poisson distribution, average 30
> > requests per hour, and every request takes 1 minute.
> > 
> > One httpd server with nearly constant load.
> > 
> > Without cache reservations, database requests take 2 minutes.
> > That is not acceptable for the database clients.
> > But with cache reservation, database requests take 1 minute.
> > 
> > You want to maximize performance of httpd and database requests.
> > What do you do? You allow the database server to perform cache
> > reservation once a request comes in, and to undo the reservation
> > once the request is finished.
> 
> > It's impossible to perform this with a centralized interface.
> 
> Not so; just a wee bit more fragile than desired. But, this is a
> pre-existing problem with cgroups and needs to be solved, not using
> cgroups because of this is silly.
> 
> Every cgroup that can work on tasks suffers this and arguably a few
> more.
> 
> > 5) Modify scenario 2 above as follows: each database request
> > is handled by two newly created threads, and they share a certain
> > percentage of data cache, and a certain percentage of code cache.
> > 
> > So the dispatcher thread, on arrival of request, has to:
> > 
> >         - create data cache reservation = tcrid-A.
> >         - create code cache reservation = tcrid-B.
> >         - create thread-1.
> >         - assign tcrid-A and B to thread-1.
> >         - create thread-2.
> >         - assign tcrid-A and B to thread-2.
> > 
> > 6) Create reservations in such a way that the sum is larger than
> > total amount of cache, and CPU pinning (example from Karen Noel):
> > 
> > VM-1 on socket-1 with 80% of reservation.
> > VM-2 on socket-2 with 80% of reservation.
> > VM-1 pinned to socket-1.
> > VM-2 pinned to socket-2.
> > 
> > Cgroups interface attempts to set a cache mask globally. This is the
> > problem the "expand" proposal solves:
> > https://lkml.org/lkml/2015/7/29/682
> 
> That email is unparsable.

Look at item 6. If you create reservations in such a way that the sum
is larger than the total amount of cache, "cosid0", which is the
"unconstrained set of tasks" (ie: the rest of the system), has 0 bytes
of L3 cache to reclaim from.
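
To put numbers on it, a small sketch of the arithmetic. The 20-way
cache, the struct and the function name are assumptions for
illustration; "80% of reservation" is taken as 16 of 20 ways, and in
the per-socket case a reservation is assumed to be enforced only on the
socket its VM is pinned to.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one reservation. */
struct resv {
        unsigned int ways;      /* cache ways reserved */
        unsigned int socket;    /* socket the owner is pinned to */
};

/*
 * Ways left for cosid0 (the unconstrained tasks) on 'socket'.
 * global == true:  every reservation is enforced on every socket.
 * global == false: a reservation is enforced only on its own socket.
 */
unsigned int cosid0_ways(unsigned int total_ways, const struct resv *r,
                         size_t n, unsigned int socket, bool global)
{
        unsigned int used = 0;
        size_t i;

        for (i = 0; i < n; i++)
                if (global || r[i].socket == socket)
                        used += r[i].ways;

        /* Overcommit leaves nothing to reclaim from. */
        return used >= total_ways ? 0 : total_ways - used;
}
```

With VM-1 and VM-2 each holding 16 ways, the global case leaves cosid0
with 0 ways on both sockets, while the per-socket case leaves 4 ways on
each.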

> But the only way to sanely do so is to closely
> intertwine oneself with cpusets; doing that with anything other than
> another cgroup controller is absolutely full on insane.

void __intel_rdt_sched_in(void)
{
        struct task_struct *task = current;
        unsigned int cpu = smp_processor_id();
        unsigned int this_socket = topology_physical_package_id(cpu);
        unsigned int start, end;

        /*
         * The CBM bitmask for a particular task is enforced
         * on sched-in to a given processor, and only for the
         * range (cbm_start_bit,cbm_end_bit) which the
         * tcr_list (COSid) owns.
         * This way we allow COSid0 (global task pool) to use
         * reserved L3 cache on sockets where the tasks that
         * reserve the cache have not been scheduled.
         *
         * Since reading the MSRs is slow, it is necessary to
         * cache the MSR CBM map on each socket.
         *
         */

        if (test_bit(this_socket,
                     task->tcrlist->synced_to_socket) == 0) {

Makes sense?
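
The fragment above stops at the test_bit() check. A self-contained
userspace sketch of the lazy-sync idea behind it follows. This is not
the patch: the fake per-socket register, the struct layouts and the
function names are assumptions, and only the synced_to_socket bitmap
and the "enforce on sched-in" rule come from the code above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_SOCKETS 8

/* Hypothetical stand-in for a socket's CBM MSR; counts writes. */
struct fake_socket {
        uint32_t cbm;
        unsigned int writes;
};

struct tcr_list {
        uint32_t cbm;                   /* mask this reservation owns */
        uint32_t synced_to_socket;      /* bit n set: socket n is current */
};

/*
 * Called when a task using 't' is scheduled in on 'socket'.
 * Returns true if the (slow) register write was needed.
 */
bool sched_in(struct tcr_list *t, struct fake_socket *socks,
              unsigned int socket)
{
        assert(socket < MAX_SOCKETS);

        if (t->synced_to_socket & (1u << socket))
                return false;           /* fast path: already enforced here */

        socks[socket].cbm = t->cbm;     /* models the MSR write */
        socks[socket].writes++;
        t->synced_to_socket |= 1u << socket;
        return true;
}

/* Changing the mask only clears the bitmap; no cross-CPU call is made. */
void tcr_set_cbm(struct tcr_list *t, uint32_t cbm)
{
        t->cbm = cbm;
        t->synced_to_socket = 0;
}
```

A socket the task never runs on is never written, so its cache stays
available to COSid0.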

> 
> > 7) Consider two sockets with different regions of L3 cache
> > shared with HW:
> > 
> > - CPUID.(EAX=10H, ECX=1):EBX[31:0] reports a bit mask. Each set bit
> > within the length of the CBM indicates the corresponding unit of the
> > L3 allocation may be used by other entities in the platform (e.g. an
> > integrated graphics engine or hardware units outside the processor
> > core and have direct access to L3). Each cleared bit within the
> > length of the CBM indicates the corresponding allocation unit can be
> > configured to implement a priority-based allocation scheme chosen by
> > an OS/VMM without interference with other hardware agents in the
> > system. Bits outside the length of the CBM are reserved.
> > 
> > You want the kernel to maintain different bitmasks in the CBM:
> > 
> >         socket1 [range-A]
> >         socket2 [range-B]
> > 
> > And the kernel will automatically switch from range A to range B
> > when the thread switches sockets.
> 
> This is firmly in the insane range of things.. not going to happen full
> stop.

Are you saying that hardware will guarantee that the reserved region is
the same for all sockets? I asked Vikas and he said this is not the case.

> If a thread can freely schedule between two CPUs its configuration on
> those two CPUs had better bloody be the same.

It's just the (start,end) of the CBM which changes, so on
__intel_rdt_sched_in you do:

                struct per_socket_data *psd = get_socket_data(this_socket);
                struct cache_layout *layout = psd->layout;

                /* Pick the range this reservation owns for this socket's layout. */
                start = task->tcrlist->psd[layout->id].cbm_start;
                end = task->tcrlist->psd[layout->id].cbm_end;
                sync_to_msr(task->tcrlist, start, end);

Please clarify what you mean.
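
To make the (start,end) part concrete, a standalone sketch of turning a
per-layout range into the contiguous mask that would be written. The
struct shapes and function names are assumptions, NR_LAYOUTS is
arbitrary, and cbm_end is assumed to be an inclusive bit position.

```c
#include <assert.h>
#include <stdint.h>

#define NR_LAYOUTS 2

/* Hypothetical: inclusive bit range owned within one cache layout. */
struct cbm_range {
        unsigned int cbm_start;
        unsigned int cbm_end;
};

struct tcr_list {
        struct cbm_range psd[NR_LAYOUTS];       /* one range per layout id */
};

/* Build a contiguous mask covering bits start..end (inclusive). */
uint32_t range_to_cbm(unsigned int start, unsigned int end)
{
        uint32_t width, ones;

        assert(start <= end && end < 32);
        width = end - start + 1;
        /* Shifting a 32-bit value by 32 is undefined, so special-case it. */
        ones = (width == 32) ? UINT32_MAX : ((1u << width) - 1);
        return ones << start;
}

/* Same reservation, different mask, depending on the socket's layout. */
uint32_t cbm_for_layout(const struct tcr_list *t, unsigned int layout_id)
{
        assert(layout_id < NR_LAYOUTS);
        return range_to_cbm(t->psd[layout_id].cbm_start,
                            t->psd[layout_id].cbm_end);
}
```

For example, a reservation with range 0..3 on layout 0 and 4..7 on
layout 1 yields 0x0f on one socket and 0xf0 on the other, with the same
number of ways in both cases.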