From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752363AbbJPBpV (ORCPT <rfc822;w@1wt.eu>);
	Thu, 15 Oct 2015 21:45:21 -0400
Received: from mx1.redhat.com ([209.132.183.28]:51591 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751277AbbJPBpT (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Thu, 15 Oct 2015 21:45:19 -0400
Date: Thu, 15 Oct 2015 21:17:16 -0300
From: Marcelo Tosatti <mtosatti@redhat.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: "Yu, Fenghua" <fenghua.yu@intel.com>, Thomas Gleixner <tglx@linutronix.de>,
        H Peter Anvin <hpa@zytor.com>, Ingo Molnar <mingo@redhat.com>,
        linux-kernel <linux-kernel@vger.kernel.org>, x86 <x86@kernel.org>,
        Vikas Shivappa <vikas.shivappa@linux.intel.com>
Subject: Re: [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support
Message-ID: <20151016001715.GB31794@amt.cnet>
References: <1443766185-61618-1-git-send-email-fenghua.yu@intel.com>
 <alpine.DEB.2.11.1510112145100.6097@nanos>
 <3E5A0FA7E9CA944F9D5414FEC6C712205DE5C9EE@ORSMSX106.amr.corp.intel.com>
 <20151013224058.GA19373@amt.cnet>
 <20151015113702.GM3816@twins.programming.kicks-ass.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <20151015113702.GM3816@twins.programming.kicks-ass.net>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Oct 15, 2015 at 01:37:02PM +0200, Peter Zijlstra wrote:
> On Tue, Oct 13, 2015 at 07:40:58PM -0300, Marcelo Tosatti wrote:
> > How can you fix the issue of sockets with different reserved cache
> > regions with hw in the cgroup interface?
> 
> No idea what you're referring to. But IOCTLs blow.

Tejun brought up syscalls. Syscalls seem too generic.
So ioctls were chosen instead.

It is necessary to perform the following operations:

1) create cache reservation (params = size, type).
2) delete cache reservation.
3) attach cache reservation (params = cache reservation id, pid).
4) detach cache reservation (params = cache reservation id, pid).
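
To make the four operations above concrete, here is a minimal userspace
sketch in C. It only models the bookkeeping (a fixed table of
reservations and the pids attached to each). The names tcr_create(),
tcr_delete(), tcr_attach(), tcr_detach(), enum tcr_type and the table
limits are all hypothetical; none of them come from a posted patch or an
existing kernel API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* All names below are hypothetical, for illustration only. */
enum tcr_type { TCR_TYPE_DATA, TCR_TYPE_CODE, TCR_TYPE_BOTH };

#define TCR_MAX       8   /* max reservations in this toy table */
#define TCR_MAX_TASKS 4   /* max pids attached to one reservation */

struct tcr {
	int in_use;
	uint32_t kbytes;          /* requested size */
	enum tcr_type type;
	int pids[TCR_MAX_TASKS];  /* attached tasks */
	int npids;
};

static struct tcr tcr_table[TCR_MAX];

static int tcr_valid(int id)
{
	return id >= 0 && id < TCR_MAX && tcr_table[id].in_use;
}

/* 1) create: returns a reservation id >= 0, or -1 if the table is full. */
int tcr_create(uint32_t kbytes, enum tcr_type type)
{
	assert(kbytes > 0);
	for (int id = 0; id < TCR_MAX; id++) {
		if (!tcr_table[id].in_use) {
			memset(&tcr_table[id], 0, sizeof(tcr_table[id]));
			tcr_table[id].in_use = 1;
			tcr_table[id].kbytes = kbytes;
			tcr_table[id].type = type;
			return id;
		}
	}
	return -1;
}

/* 2) delete: returns 0 on success, -1 for an unknown id. */
int tcr_delete(int id)
{
	if (!tcr_valid(id))
		return -1;
	tcr_table[id].in_use = 0;
	return 0;
}

/* 3) attach: idempotent; returns -1 for an unknown id or a full slot list. */
int tcr_attach(int id, int pid)
{
	if (!tcr_valid(id))
		return -1;
	struct tcr *r = &tcr_table[id];
	for (int i = 0; i < r->npids; i++)
		if (r->pids[i] == pid)
			return 0;
	if (r->npids == TCR_MAX_TASKS)
		return -1;
	r->pids[r->npids++] = pid;
	return 0;
}

/* 4) detach: returns -1 if the pid was not attached. */
int tcr_detach(int id, int pid)
{
	if (!tcr_valid(id))
		return -1;
	struct tcr *r = &tcr_table[id];
	for (int i = 0; i < r->npids; i++) {
		if (r->pids[i] == pid) {
			r->pids[i] = r->pids[--r->npids];
			return 0;
		}
	}
	return -1;
}
```

Whatever transport carries these calls (ioctl, syscall, or cgroup files),
the point is that each operation names one reservation and one task.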

Can it be done via cgroups? If so, works for me.

A list of problems with the cgroup interface has already been posted
in this thread, and we have since found another one.


List of problems with cgroup interface:

1) Global IPI on CBM <---> task change does not scale.

/*
 * cbm_update_all() - Update the cache bit mask for all packages.
 */
static inline void cbm_update_all(u32 closid)
{
       on_each_cpu_mask(&rdt_cpumask, cbm_cpu_update, (void *)closid, 1);
}

Consider a machine with 32 sockets.

2) Syscall interface specification is in kbytes, not
cache ways (which is what must be recorded by the OS
to allow migration of the OS between different
hardware systems).
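
A small sketch of why a size in kbytes travels between machines better
than a count of ways, assuming the reading that kbytes is the unit the OS
should record. The helper name kbytes_to_ways() and the cache geometries
used in the note below are made up for illustration; a real
implementation would take the geometry from CPUID.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Round a request in kbytes up to whole cache ways on one machine.
 * cache_kb and nr_ways describe the L3 of that machine (hypothetical
 * inputs here). The result is clamped to nr_ways.
 */
uint32_t kbytes_to_ways(uint32_t req_kb, uint32_t cache_kb, uint32_t nr_ways)
{
	assert(nr_ways > 0 && cache_kb >= nr_ways);

	uint32_t way_kb = cache_kb / nr_ways;           /* size of one way */
	uint32_t ways = (req_kb + way_kb - 1) / way_kb; /* round up */

	return ways > nr_ways ? nr_ways : ways;
}
```

With these example numbers, a 2500 KB request needs 3 ways on a 20 MB,
20-way L3 but only 1 way on a 30 MB, 12-way L3. A recorded "3 ways" would
mean a different amount of cache after moving to the second machine.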

3) Compilers are able to configure cache optimally for
given ranges of code inside applications, easily,
if desired.

4) Problem-2: The decision to allocate cache is tied to application
initialization / destruction, and application initialization is
essentially random from the POV of the system (the events which trigger
the execution of the application are not visible from the system).

Think of a machine running two different servers: one database
whose requests arrive with a Poisson distribution, averaging 30
requests per hour, where every request takes 1 minute.

One httpd server with nearly constant load.

Without cache reservations, database requests take 2 minutes.
That is not acceptable for the database clients.
But with cache reservation, database requests take 1 minute.

You want to maximize performance of both httpd and database requests.
What do you do? You allow the database server to perform cache
reservation once a request comes in, and to undo the reservation
once the request is finished.

It's impossible to perform this with a centralized interface.

5) Modify the scenario in Problem-2 above as follows: each database
request is handled by two newly created threads, and they share a
certain percentage of data cache, and a certain percentage of code cache.

So the dispatcher thread, on arrival of request, has to:

        - create data cache reservation = tcrid-A.
        - create code cache reservation = tcrid-B.
        - create thread-1.
        - assign tcrid-A and B to thread-1.
        - create thread-2.
        - assign tcrid-A and B to thread-2.
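
The dispatcher sequence above could look roughly like this. It is a toy
model only: tcr_create(), thread_create(), tcr_attach_data() and
tcr_attach_code() are hypothetical stubs that record state in an array,
standing in for whatever the real interface and thread library would
provide.

```c
#include <assert.h>

#define MAX_THREADS 4

/* Per-thread record of which reservations it is attached to. */
struct thread_state {
	int data_tcrid;
	int code_tcrid;
};

static struct thread_state threads[MAX_THREADS];
static int nr_threads;
static int next_tcrid;

/* Stub: hand out a fresh reservation id. */
static int tcr_create(void)
{
	return next_tcrid++;
}

/* Stub: "create" a thread with no reservations, return its index. */
static int thread_create(void)
{
	assert(nr_threads < MAX_THREADS);
	threads[nr_threads].data_tcrid = -1;
	threads[nr_threads].code_tcrid = -1;
	return nr_threads++;
}

static void tcr_attach_data(int tcrid, int tid)
{
	threads[tid].data_tcrid = tcrid;
}

static void tcr_attach_code(int tcrid, int tid)
{
	threads[tid].code_tcrid = tcrid;
}

/* Dispatcher path for one incoming request. */
void handle_request(int tids[2])
{
	int tcrid_a = tcr_create();	/* data cache reservation */
	int tcrid_b = tcr_create();	/* code cache reservation */

	for (int i = 0; i < 2; i++) {
		tids[i] = thread_create();
		tcr_attach_data(tcrid_a, tids[i]);
		tcr_attach_code(tcrid_b, tids[i]);
	}
}
```

Both worker threads end up sharing the same two reservations, and all of
it is driven by the application at request time, not by an administrator.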

6) Create reservations in such a way that the sum is larger than the
total amount of cache, combined with CPU pinning (example from Karen Noel):

VM-1 on socket-1 with 80% of reservation.
VM-2 on socket-2 with 80% of reservation.
VM-1 pinned to socket-1.
VM-2 pinned to socket-2.

The cgroups interface attempts to set a cache mask globally. This is the
problem the "expand" proposal solves:
https://lkml.org/lkml/2015/7/29/682
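
To illustrate the VM example, here is a sketch comparing a global
admission check with a per-socket one. This is not the algorithm from
the proposal linked above, just a minimal model of the accounting;
struct resv, fits_global() and fits_per_socket() are hypothetical names,
and bit 0 / bit 1 of socket_mask stand for socket-1 / socket-2.

```c
#include <assert.h>

#define NR_SOCKETS 2

struct resv {
	unsigned int pct;         /* share of one socket's L3, in percent */
	unsigned int socket_mask; /* bit n set: tasks may run on socket n */
};

/* Global accounting: every reservation is charged to every socket. */
int fits_global(const struct resv *r, int n)
{
	unsigned int sum = 0;

	for (int i = 0; i < n; i++) {
		assert(r[i].pct <= 100);
		sum += r[i].pct;
	}
	return sum <= 100;
}

/* Per-socket accounting: charge only the sockets the tasks can run on. */
int fits_per_socket(const struct resv *r, int n)
{
	for (int s = 0; s < NR_SOCKETS; s++) {
		unsigned int sum = 0;

		for (int i = 0; i < n; i++) {
			assert(r[i].pct <= 100);
			if (r[i].socket_mask & (1u << s))
				sum += r[i].pct;
		}
		if (sum > 100)
			return 0;
	}
	return 1;
}
```

With VM-1 at 80% pinned to socket-1 and VM-2 at 80% pinned to socket-2,
the global check rejects the pair (160%) while the per-socket check
accepts it (80% on each socket).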

7) Consider two sockets with different regions of L3 cache
shared with HW:

  CPUID.(EAX=10H, ECX=1):EBX[31:0] reports a bit mask. Each set bit
  within the length of the CBM indicates the corresponding unit of the
  L3 allocation may be used by other entities in the platform (e.g. an
  integrated graphics engine or hardware units outside the processor
  core and have direct access to L3). Each cleared bit within the length
  of the CBM indicates the corresponding allocation unit can be
  configured to implement a priority-based allocation scheme chosen by
  an OS/VMM without interference with other hardware agents in the
  system. Bits outside the length of the CBM are reserved.

You want the kernel to maintain different bitmasks in the CBM:

        socket1 [range-A]
        socket2 [range-B]

And the kernel will automatically switch from range A to range B
when the thread switches sockets.
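
A sketch of how a per-socket mask could be chosen for one reservation,
assuming CBMs must be a contiguous run of set bits. pick_cbm() is a
hypothetical helper, and the hw_shared values in the note below are
invented examples of what CPUID.(EAX=10H, ECX=1):EBX might report on two
different sockets.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the lowest contiguous mask of `ways` bits inside [0, cbm_len)
 * that does not overlap the HW-shared bits, or 0 if no such mask exists.
 */
uint32_t pick_cbm(uint32_t cbm_len, uint32_t hw_shared, uint32_t ways)
{
	assert(cbm_len >= 1 && cbm_len <= 32);

	if (ways == 0 || ways > cbm_len)
		return 0;

	/* `ways` low bits set; avoid the undefined 1u << 32. */
	uint32_t run = (ways == 32) ? UINT32_MAX : ((1u << ways) - 1);

	for (uint32_t shift = 0; shift + ways <= cbm_len; shift++) {
		uint32_t mask = run << shift;

		if ((mask & hw_shared) == 0)
			return mask;
	}
	return 0;
}
```

For a 4-way reservation with cbm_len = 20, a socket sharing its two
lowest ways with HW (hw_shared = 0x3) gets 0x3c, while a socket sharing
its two highest ways (hw_shared = 0xc0000) gets 0xf. The kernel would
keep one such mask per socket and load the matching one when the thread
moves.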

---------------------

Problems 6, 7 and 2 are fatal for us. If you can fix them in the cgroup
interface, we can use it (please understand these problems, you seem to 
ignore them for some reason).

Problems 1, 4 and 5 seem to come from Tejun.

Problem 3 could be a possibility.