Date: Mon, 3 Aug 2015 11:22:22 -0700 (PDT)
From: Vikas Shivappa
To: Marcelo Tosatti
cc: Martin Kletzander, Vikas Shivappa, "Auld, Will", Vikas Shivappa,
    "linux-kernel@vger.kernel.org", "x86@kernel.org", "hpa@zytor.com",
    "tglx@linutronix.de", "mingo@kernel.org", "tj@kernel.org",
    "peterz@infradead.org", "Fleming, Matt", "Williamson, Glenn P",
    "Juvva, Kanaka D"
Subject: Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide
In-Reply-To: <20150803151307.GA8228@amt.cnet>
References: <1435789270-27010-1-git-send-email-vikas.shivappa@linux.intel.com>
 <1435789270-27010-4-git-send-email-vikas.shivappa@linux.intel.com>
 <20150728231516.GA16204@amt.cnet>
 <96EC5A4F3149B74492D2D9B9B1602C27461EB932@ORSMSX105.amr.corp.intel.com>
 <20150729193208.GC3201@amt.cnet>
 <20150730200812.GA10832@amt.cnet>
 <20150802154807.GA19188@wheatley>
 <20150803151307.GA8228@amt.cnet>

Hello Marcelo/Martin,

Like I mentioned, let me modify the documentation to better explain the
usage. Things like updating each package's bitmask are already in the
patches. Let's discuss offline and come up with a well-defined proposal for
any change, and then fold that into the next series. We seem to be just
looping over the same items.

Thanks,
Vikas

On Mon, 3 Aug 2015, Marcelo Tosatti wrote:

> On Sun, Aug 02, 2015 at 05:48:07PM +0200, Martin Kletzander wrote:
>> On Thu, Jul 30, 2015 at 05:08:13PM -0300, Marcelo Tosatti wrote:
>>> On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote:
>>>>
>>>> Marcelo,
>>>>
>>>> On Wed, 29 Jul 2015, Marcelo Tosatti wrote:
>>>>>
>>>>> How about this:
>>>>>
>>>>> desiredclos (closid  p1  p2  p3  p4)
>>>>>                 1     1   0   0   0
>>>>>                 2     0   0   0   1
>>>>>                 3     0   1   1   0
>>>>
>>>> #1 Currently in the rdt cgroup, the root cgroup always has all the
>>>> bits set and can't be changed (the cgroup hierarchy would by default
>>>> make it have all bits set, since all the children need to have a
>>>> subset of the root's bitmask). So if the user creates a cgroup and
>>>> does not put any task in it, the tasks in the root cgroup could still
>>>> be using that part of the cache. That's the reason I say we can't
>>>> have really 'exclusive' masks.
>>>>
>>>> Or in other words - there is always a desired clos (0) which has all
>>>> parts set and acts like a default pool.
>>>>
>>>> Also the parts can overlap. Please apply this to all the comments
>>>> below, as it changes the way they work.
>>>>
>>>>>
>>>>> p means part.
>>>>
>>>> I am assuming p = (a contiguous cache capacity bit mask)
>>>
>>> Yes.
>>>
>>>>> closid 1 is an exclusive cgroup.
>>>>> closid 2 is a "cache hog" class.
>>>>> closid 3 is the "default closid".
>>>>>
>>>>> Desiredclos is what the user has specified.
>>>>>
>>>>> Transition 1: desiredclos --> effectiveclos
>>>>> Clean all bits of unused closids
>>>>> (that must be updated whenever a
>>>>> closid's cgroup goes from empty->nonempty
>>>>> and vice-versa).
>>>>>
>>>>> effectiveclos (closid  p1  p2  p3  p4)
>>>>>                   1     0   0   0   0
>>>>>                   2     0   0   0   1
>>>>>                   3     0   1   1   0
>>>>>
>>>>> Transition 2: effectiveclos --> expandedclos
>>>>>
>>>>> expandedclos (closid  p1  p2  p3  p4)
>>>>>                  1     0   0   0   0
>>>>>                  2     0   0   0   1
>>>>>                  3     1   1   1   0
>>>>>
>>>>> Then you have a different inplacecos for each
>>>>> CPU (see pseudo-code below), updated on the
>>>>> following events:
>>>>>
>>>>> - task migration to new pCPU
>>>>> - task creation
>>>>>
>>>>>	id = smp_processor_id();
>>>>>	for (part = desiredclos.p1; ...; part++)
>>>>>		/* if my cosid is set and any other
>>>>>		   cosid is clear, for the part,
>>>>>		   synchronize desiredclos --> inplacecos */
>>>>>		if (part[mycosid] == 1 &&
>>>>>		    part[any_othercosid] == 0)
>>>>>			wrmsr(part, desiredclos);
>>>>>
>>>>
>>>> Currently the root cgroup would have all the bits set, which will act
>>>> like a default cgroup where all the otherwise unused parts (assuming
>>>> they are a set of contiguous cache capacity bits) will be used.
>>>>
>>>> Otherwise the question is, in the expandedclos - who decides to
>>>> expand the closx parts to include some of the unused parts? Could
>>>> that just always be the default root?
>>>
>>> Right, so the problem is that for certain closids you might never want
>>> to expand (because doing so would cause data to be cached in a
>>> cache way which might have a high eviction rate in the future).
>>> See the example from Will.
>>>
>>> But for the default cache (that is, "unclassified applications")
>>> I suppose it is beneficial to expand in most cases, that is,
>>> use the maximum amount of cache irrespective of eviction rate, which
>>> is the behaviour that exists now without CAT.
>>>
>>> So perhaps a new flag "expand=y/n" can be added to the cgroup
>>> directories... What do you say?
>>>
>>> Userspace representation of CAT
>>> -------------------------------
>>>
>>> Usage model:
>>> 1) measure application performance without L3 cache reservation.
>>> 2) measure application performance with L3 cache reservation and
>>>    X number of cache ways until the desired performance is attained.
>>>
>>> Requirements:
>>> 1) Persistency of the CLOS configuration across hardware. On migration
>>>    of the operating system or application between different hardware
>>>    systems we'd like the following to be maintained:
>>>    - exclusive number of bytes (*) reserved to a certain CLOSid.
>>>    - shared number of bytes (*) reserved between a certain group
>>>      of CLOSids.
>>>
>>>    For both code and data, rounded down or up in cache way size.
>>>
>>> 2) Reasoning:
>>>    Different CBM masks on different hardware platforms might be
>>>    necessary to specify the same CLOS configuration, in terms of
>>>    exclusive number of bytes and shared number of bytes
>>>    (cache-way-rounded number of bytes). For example, due to L3
>>>    allocation by other hardware entities in certain parts of the cache
>>>    it might be necessary to relocate the CBM mask to achieve the same
>>>    CLOS configuration.
>>>
>>> 3) Proposed format:
>>>
>>
>> A few questions from a random listener; I apologise if some of them are
>> in the wrong place due to me missing some information from past threads.
>>
>> I'm not sure whether the following proposal for the format is the
>> internal structure or what's going to be in cgroups. If this is a
>> user-visible interface, I think it could be a little less detailed.
>
> User visible interface.
> The idea is to have userspace code that performs
>
>    [ user visible specification ] ----> [ cbm bitmasks on the present
>                                           hardware platform ]
>
> In systemd, probably (or whatever is between the user and the cgroup
> interface).
>
>>> sharedregionK.exclusive  - Number of exclusive cache bytes reserved for
>>>                            the shared region.
>>> sharedregionK.excl_data  - Number of exclusive cache data bytes reserved
>>>                            for the shared region.
>>> sharedregionK.excl_code  - Number of exclusive cache code bytes reserved
>>>                            for the shared region.
>>> sharedregionK.round_down - Round down to cache way bytes from the
>>>                            respective number specification (default is
>>>                            round up).
>>> sharedregionK.expand     - y/n - Expand the shared region to more cache
>>>                            ways when available (default n).
>>>
>>> cgroupN.exclusive - Number of exclusive L3 cache bytes reserved
>>>                     for the cgroup.
>>> cgroupN.excl_data - Number of exclusive L3 data cache bytes reserved
>>>                     for the cgroup.
>>> cgroupN.excl_code - Number of exclusive L3 code cache bytes reserved
>>>                     for the cgroup.
>>
>> By exclusive, you mean that it's exclusive to the tasks in this
>> cgroup?
>
> Correct.
>
>> The thing is that we must differentiate between limiting some
>> process from hogging the memory (like example 2 below) and making
>> some part of the cache exclusive for a particular application
>> (example 1 below).
>
> AFAICS there is no difference, because both require exclusive cache
> access: the hog wants exclusive access because any other user of its
> cachelines will be penalized; the high performance application wants
> exclusive cache access because any other user of its cachelines will
> penalize it.
>
> Where do you see the need to differentiate?
>
>> I just hope we won't need to add something similar to 'isolcpus=' just
>> so we can make sure none of the tasks in the root cgroup can spoil the
>> part of the cache we need to have exclusive.
>>
>> I'm not sure creating a new subgroup and moving all the tasks there
>> would work. It certainly is not possible with other cgroups, like the
>> cpuset cgroup mentioned beforehand.
>
> Why not? It should be possible to place all tasks in a given cgroup
> (trying to set up systemd to do that now...).
>
>> I also don't quite fully understand how the co-mounting with the
>> cpuset cgroup should work, but that's not design-related.
>
> Neither do I.
>
>> One more question: how does this work on systems with multiple L3
>> caches (e.g. large NUMA node systems)? I'm guessing that if the process
>> is running only on some CPUs, the wrmsr() will be called on those
>> particular CPU(s), right?
>
> Not in the current patchset; that has to be fixed...
>
>>> cgroupN.round_down - Round down to cache way bytes from the respective
>>>                      number specification (default is round up).
>>> cgroupN.expand     - y/n - Expand the shared region to more cache ways
>>>                      when available (default n).
>>> cgroupN.shared = { sharedregion1, sharedregion2, ... } (list of shared
>>>                  regions)
>>>
>>> Example 1:
>>> One application with 2M exclusive cache, and two applications
>>> with 1M exclusive each, sharing an expansive shared region of 1M.
>>>
>>> cgroup1.exclusive = 2M
>>>
>>> sharedregion1.exclusive = 1M
>>> sharedregion1.expand = Y
>>>
>>> cgroup2.exclusive = 1M
>>> cgroup2.shared = sharedregion1
>>>
>>> cgroup3.exclusive = 1M
>>> cgroup3.shared = sharedregion1
>>>
>>> Example 2:
>>> 3 high performance applications running, one of which is a cache hog
>>> with no cache locality.
>>>
>>> cgroup1.exclusive = 8M
>>> cgroup2.exclusive = 8M
>>>
>>> cgroup3.exclusive = 512K
>>> cgroup3.round_down = Y
>>>
>>> In all cases the default cgroup (which requires no explicit
>>> specification) is expansive and uses the remaining cache
>>> ways, including the ways shared by other hardware entities.
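
To make the desiredclos -> effectiveclos -> expandedclos transitions quoted
above easier to follow, here is a small standalone C model of the two steps,
using the 3-closid/4-part sizes from Marcelo's example. It is only an
illustration: the array names, the fixed sizes and the "which cgroups are
non-empty" input are assumptions of this sketch, not code from the patch set.

/*
 * Standalone model of the closid table transitions described above.
 * Sizes, names and the non-empty-cgroup input are illustrative only.
 */
#include <stdbool.h>
#include <stdio.h>

#define NCLOS 3	/* closids 1..3 from the example, stored as indices 0..2 */
#define NPART 4	/* cache capacity "parts" p1..p4 */

/* desiredclos from the example: closid 1 exclusive, 2 cache hog, 3 default */
static const bool desired[NCLOS][NPART] = {
	{ 1, 0, 0, 0 },
	{ 0, 0, 0, 1 },
	{ 0, 1, 1, 0 },
};

/* Transition 1: clear every bit of a closid whose cgroup is empty. */
static void make_effective(bool eff[NCLOS][NPART], const bool nonempty[NCLOS])
{
	for (int c = 0; c < NCLOS; c++)
		for (int p = 0; p < NPART; p++)
			eff[c][p] = nonempty[c] && desired[c][p];
}

/* Transition 2: hand every part that no closid claims to the default closid. */
static void make_expanded(bool exp[NCLOS][NPART], bool eff[NCLOS][NPART],
			  int def)
{
	for (int c = 0; c < NCLOS; c++)
		for (int p = 0; p < NPART; p++)
			exp[c][p] = eff[c][p];

	for (int p = 0; p < NPART; p++) {
		bool used = false;

		for (int c = 0; c < NCLOS; c++)
			used |= eff[c][p];
		if (!used)
			exp[def][p] = true;
	}
}

int main(void)
{
	/* closid 1's cgroup is empty; closids 2 and 3 have tasks. */
	const bool nonempty[NCLOS] = { false, true, true };
	bool eff[NCLOS][NPART], exp[NCLOS][NPART];

	make_effective(eff, nonempty);
	make_expanded(exp, eff, 2);	/* closid 3 (index 2) is the default */

	for (int c = 0; c < NCLOS; c++) {
		printf("closid %d:", c + 1);
		for (int p = 0; p < NPART; p++)
			printf(" %d", exp[c][p]);
		printf("\n");
	}
	return 0;
}

Running it reproduces the expandedclos table from the thread: closid 3, the
default, picks up the part left unused by the empty closid 1 cgroup.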
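
The "[ user visible specification ] ----> [ cbm bitmasks ]" translation
Marcelo describes could look roughly like the sketch below: a byte request is
converted into a number of cache ways for the L3 at hand (honouring
round_down) and then packed into a contiguous capacity bitmask. The struct,
the assumed 16-way/16MB L3 and the helper names are hypothetical; they are
not the interface proposed in the thread or implemented in the patches.

/*
 * Sketch of the byte-specification -> CBM translation; the struct, the
 * assumed cache geometry and the helper names are hypothetical.
 */
#include <stdint.h>
#include <stdio.h>

/* Description of the L3 on the current package (values are assumptions). */
struct l3_info {
	uint64_t size_bytes;	/* total L3 size */
	unsigned int num_ways;	/* number of capacity bitmask (CBM) bits */
};

/*
 * Convert a byte request into a number of cache ways.
 * Default is to round up; round_down models cgroupN.round_down = Y.
 */
static unsigned int bytes_to_ways(const struct l3_info *l3,
				  uint64_t bytes, int round_down)
{
	uint64_t way_bytes = l3->size_bytes / l3->num_ways;
	uint64_t ways = bytes / way_bytes;

	if (!round_down && (bytes % way_bytes))
		ways++;
	return (unsigned int)ways;
}

/* Build a contiguous CBM of 'ways' bits starting at bit 'first_way'. */
static uint64_t make_cbm(unsigned int first_way, unsigned int ways)
{
	return ((1ULL << ways) - 1) << first_way;
}

int main(void)
{
	/* Assume a 16-way, 16MB L3, i.e. 1MB per way. */
	struct l3_info l3 = { .size_bytes = 16ULL << 20, .num_ways = 16 };

	/* cgroup1.exclusive = 2M from Example 1 -> 2 ways. */
	unsigned int w1 = bytes_to_ways(&l3, 2ULL << 20, 0);

	/* A 1.5M request: 2 ways rounded up, 1 way with round_down = Y. */
	unsigned int up = bytes_to_ways(&l3, 1536ULL << 10, 0);
	unsigned int down = bytes_to_ways(&l3, 1536ULL << 10, 1);

	printf("2M exclusive -> %u ways, CBM 0x%llx\n",
	       w1, (unsigned long long)make_cbm(0, w1));
	printf("1.5M -> %u ways (round up), %u ways (round down)\n", up, down);
	return 0;
}

A real translator would also have to choose the starting way so that
exclusive regions do not overlap, and repeat the computation for every
L3/package, which is what makes the byte-based specification portable across
platforms.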