From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id C06A7C4332F
	for <linux-mm@archiver.kernel.org>; Tue,  8 Nov 2022 23:10:37 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id C703E8E0001; Tue,  8 Nov 2022 18:10:36 -0500 (EST)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id C1FBD6B0072; Tue,  8 Nov 2022 18:10:36 -0500 (EST)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id AE8258E0001; Tue,  8 Nov 2022 18:10:36 -0500 (EST)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10])
	by kanga.kvack.org (Postfix) with ESMTP id 9EA356B0071
	for <linux-mm@kvack.org>; Tue,  8 Nov 2022 18:10:36 -0500 (EST)
Received: from smtpin17.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay06.hostedemail.com (Postfix) with ESMTP id 79BA4ABA4F
	for <linux-mm@kvack.org>; Tue,  8 Nov 2022 23:10:36 +0000 (UTC)
X-FDA: 80111821272.17.C6BA25E
Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124])
	by imf21.hostedemail.com (Postfix) with ESMTP id 0F7AD1C0004
	for <linux-mm@kvack.org>; Tue,  8 Nov 2022 23:10:27 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1667948972;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=slTemHQCcB3hkLvb1RiLXsoYjitssgvv6xxcDrshU7o=;
	b=GLgqSWpz1nFxIdGWuxAS0oaUmv4afttlqnHIP+SKm724yeyQxiYjke5dGgGBOmo6HK/UkK
	CAHWzvH3upvxOQCRUsSiwQnB4hVQb/M6nEDrtBgvNei3dqgktTm6c+RrLqVrFeJ1DrtQy2
	gUDDnJeaCTVlNj4jTVdbgWzMf7ymQOU=
Received: from mail-qk1-f198.google.com (mail-qk1-f198.google.com
 [209.85.222.198]) by relay.mimecast.com with ESMTP with STARTTLS
 (version=TLSv1.3, cipher=TLS_AES_128_GCM_SHA256) id
 us-mta-501-CnPPnjYINAGZizLS5jaMqw-1; Tue, 08 Nov 2022 18:09:31 -0500
X-MC-Unique: CnPPnjYINAGZizLS5jaMqw-1
Received: by mail-qk1-f198.google.com with SMTP id o13-20020a05620a2a0d00b006cf9085682dso14092225qkp.7
        for <linux-mm@kvack.org>; Tue, 08 Nov 2022 15:09:31 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=mime-version:user-agent:content-transfer-encoding:references
         :in-reply-to:date:cc:to:from:subject:message-id:x-gm-message-state
         :from:to:cc:subject:date:message-id:reply-to;
        bh=slTemHQCcB3hkLvb1RiLXsoYjitssgvv6xxcDrshU7o=;
        b=b2R+W8muC8ZRDIQAn4+4D7AoSy2KxzRGPji72pB7qLNChUocsrumTswaXJF8ke4Fqd
         3PYXiMg1qBsV1q8eQADpzT/VhmvWApDgf70vggC3X4MP6zGRzdQKRFbC3Zx5K05ORklv
         5xGGLDzcNxbI877n8wSPZPekJ7SH4yHFaAPspr6k7bhpMQl8REXzxXBy46E8frfSSSZC
         mbe9uF4PY752M8sYD/V2e5XLdfxdNvHJU1m104bf42g5Uc5xjLSBf4uUkk4Tm8DUFDdD
         hRKHAPV8wPvxd4OmqZa85TyhEMxoByMUJT4XeoetrG4dpKYZso4PY2rZUXghwSrF6s6u
         shgA==
X-Gm-Message-State: ACrzQf3kZhYX1e51htwrbi44Nj++bEaZp0G/R5u9OOR9rYRrJtsVHukR
	7gCZqMMLoeMW+GxKISG14ROpFH2igacTY4SCmYgZf4s3LSzGM1cGYBRkIM3tC/uWD0j+54mtvM5
	DTxII4xapIBs=
X-Received: by 2002:a0c:cb0f:0:b0:4bb:6099:68c2 with SMTP id o15-20020a0ccb0f000000b004bb609968c2mr53287883qvk.131.1667948970687;
        Tue, 08 Nov 2022 15:09:30 -0800 (PST)
X-Google-Smtp-Source: AMsMyM6ac0haSlPxTGVuFjAdem5E3oO8nZEIzcZh70nKeHtu80aFA0S1jxX9ip2s47AXxzFtbR4Y4Q==
X-Received: by 2002:a0c:cb0f:0:b0:4bb:6099:68c2 with SMTP id o15-20020a0ccb0f000000b004bb609968c2mr53287852qvk.131.1667948970386;
        Tue, 08 Nov 2022 15:09:30 -0800 (PST)
Received: from ?IPv6:2804:1b3:a802:4b05:8a1d:4f87:5d1a:2c54? ([2804:1b3:a802:4b05:8a1d:4f87:5d1a:2c54])
        by smtp.gmail.com with ESMTPSA id y22-20020a05620a44d600b006ec62032d3dsm10597420qkp.30.2022.11.08.15.09.26
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Tue, 08 Nov 2022 15:09:29 -0800 (PST)
Message-ID: <4a4a6c73f3776d65f70f7ca92eb26fc90ed3d51a.camel@redhat.com>
Subject: Re: [PATCH v1 0/3] Avoid scheduling cache draining to isolated cpus
From: Leonardo =?ISO-8859-1?Q?Br=E1s?= <leobras@redhat.com>
To: Michal Hocko <mhocko@suse.com>
Cc: Ingo Molnar <mingo@redhat.com>, Peter Zijlstra <peterz@infradead.org>, 
 Juri Lelli <juri.lelli@redhat.com>, Vincent Guittot
 <vincent.guittot@linaro.org>, Dietmar Eggemann <dietmar.eggemann@arm.com>,
 Steven Rostedt <rostedt@goodmis.org>, Ben Segall <bsegall@google.com>, Mel
 Gorman <mgorman@suse.de>, Daniel Bristot de Oliveira <bristot@redhat.com>,
 Valentin Schneider <vschneid@redhat.com>, Johannes Weiner
 <hannes@cmpxchg.org>, Roman Gushchin <roman.gushchin@linux.dev>, Shakeel
 Butt <shakeelb@google.com>, Muchun Song <songmuchun@bytedance.com>, Andrew
 Morton <akpm@linux-foundation.org>, Frederic Weisbecker
 <frederic@kernel.org>, Phil Auld <pauld@redhat.com>, Marcelo Tosatti
 <mtosatti@redhat.com>,  linux-kernel@vger.kernel.org,
 cgroups@vger.kernel.org, linux-mm@kvack.org
Date: Tue, 08 Nov 2022 20:09:25 -0300
In-Reply-To: <Y2i9h+TRdX9EOs0T@dhcp22.suse.cz>
References: <20221102020243.522358-1-leobras@redhat.com>
	 <Y2IwHVdgAJ6wfOVH@dhcp22.suse.cz>
	 <07810c49ef326b26c971008fb03adf9dc533a178.camel@redhat.com>
	 <Y2Pe45LHANFxxD7B@dhcp22.suse.cz>
	 <0183b60e79cda3a0f992d14b4db5a818cd096e33.camel@redhat.com>
	 <Y2TQLavnLVd4qHMT@dhcp22.suse.cz>
	 <3c4ae3bb70d92340d9aaaa1856928476641a8533.camel@redhat.com>
	 <Y2i9h+TRdX9EOs0T@dhcp22.suse.cz>
User-Agent: Evolution 3.46.1
MIME-Version: 1.0
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1667949029;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=slTemHQCcB3hkLvb1RiLXsoYjitssgvv6xxcDrshU7o=;
	b=shUeg3XdrBnWxHo1UB+38taUr7EBp4O9SBrMo6H99l5Dwem8Z4iPryhMSxfOhR1d8TXJa6
	4Kq2aq9Md8Av4UC2amg0xW8gvNsYvMuBTQWRTDPne667H+ymPn/Vv7lQL8Sck29nJ6NAUu
	tzbhnlfuphrW4VdDWvOuchbOcOd1aGI=
ARC-Authentication-Results: i=1;
	imf21.hostedemail.com;
	dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=GLgqSWpz;
	spf=pass (imf21.hostedemail.com: domain of leobras@redhat.com designates 170.10.129.124 as permitted sender) smtp.mailfrom=leobras@redhat.com;
	dmarc=pass (policy=none) header.from=redhat.com
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1667949029; a=rsa-sha256;
	cv=none;
	b=iL0VkwhjVVmNYBG5esdxqVdni1u7BZH9VxQDciN3wivbHGW8iOR8pRmnjYcPMLXiDiVCJl
	wuoP5UVobuTNqKPiG2OFHr+DxHobKWMD2t5uJ/0CdA7RLVx9kHXIxGNXK1pCe7OOIkU06n
	QAZEEBVgo3VRHAG1V3oT/9WSX4Y2ECQ=
X-Rspamd-Queue-Id: 0F7AD1C0004
X-Rspam-User: 
X-Rspamd-Server: rspam08
Authentication-Results: imf21.hostedemail.com;
	dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=GLgqSWpz;
	spf=pass (imf21.hostedemail.com: domain of leobras@redhat.com designates 170.10.129.124 as permitted sender) smtp.mailfrom=leobras@redhat.com;
	dmarc=pass (policy=none) header.from=redhat.com
X-Stat-Signature: k57fumfbcx6hx3shzy9f9jrtgsdgbguo
X-HE-Tag: 1667949027-585828
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Mon, 2022-11-07 at 09:10 +0100, Michal Hocko wrote:
> On Fri 04-11-22 22:45:58, Leonardo Br=C3=A1s wrote:
> > On Fri, 2022-11-04 at 09:41 +0100, Michal Hocko wrote:
> > > On Thu 03-11-22 13:53:41, Leonardo Br=C3=A1s wrote:
> > > > On Thu, 2022-11-03 at 16:31 +0100, Michal Hocko wrote:
> > > > > On Thu 03-11-22 11:59:20, Leonardo Br=C3=A1s wrote:
> > > [...]
> > > > > > I understand there will be a locking cost being paid in the iso=
lated CPUs when:
> > > > > > a) The isolated CPU is requesting the stock drain,
> > > > > > b) When the isolated CPUs do a syscall and end up using the pro=
tected structure
> > > > > > the first time after a remote drain.
> > > > >=20
> > > > > And anytime the charging path (consume_stock resp. refill_stock)
> > > > > contends with the remote draining which is out of control of the =
RT
> > > > > task. It is true that the RT kernel will turn that spin lock into=
 a
> > > > > sleeping RT lock and that could help with potential priority inve=
rsions
> > > > > but still quite costly thing I would expect.
> > > > >=20
> > > > > > Both (a) and (b) should happen during a syscall, and IIUC the a=
 rt workload
> > > > > > should not expect the syscalls to be have a predictable time, s=
o it should be
> > > > > > fine.
> > > > >=20
> > > > > Now I am not sure I understand. If you do not consider charging p=
ath to
> > > > > be RT sensitive then why is this needed in the first place? What =
else
> > > > > would be populating the pcp cache on the isolated cpu? IRQs?
> > > >=20
> > > > I am mostly trying to deal with drain_all_stock() calling schedule_=
work_on() at
> > > > isolated_cpus. Since the scheduled drain_local_stock() will be comp=
eting for cpu
> > > > time with the RT workload, we can have preemption of the RT workloa=
d, which is a
> > > > problem for meeting the deadlines.
> > >=20
> > > Yes, this is understood. But it is not really clear to me why would a=
ny
> > > draining be necessary for such an isolated CPU if no workload other t=
han
> > > the RT (which pressumably doesn't charge any memory?) is running on t=
hat
> > > CPU? Is that the RT task during the initialization phase that leaves
> > > that cache behind or something else?
> >=20
> > (I am new to this part of the code, so please correct me when I miss so=
mething.)
> >=20
> > IIUC,=C2=A0if a process belongs to a control group with memory control,=
 the 'charge'
> > will happen when a memory page starts getting used by it.
>=20
> Yes, very broadly speaking.
>=20
> > So, if we assume a RT load in a isolated CPU will not charge any memory=
, we are
> > assuming it will never be part of a memory-controlled cgroup.
>=20
> If the memory cgroup controler is enabled then each user space process
> is a part of some memcg. If there is no specific memcg assigned then it
> will be a root cgroup and that is skipped during most charges except for
> kmem.

Oh, it makes sense.=20
Thanks for helping me understand that!=20

>=20
> > I mean, can we just assume this?=20
> >=20
> > If I got that right, would not that be considered a limitation? like
> > "If you don't want your workload to be interrupted by perCPU cache drai=
ning,
> > don't put it in a cgroup with memory control".
>=20
> We definitely do not want userspace make any assumptions on internal
> implementation details like caches.

Perfect, that was my expectation.=20

>=20
> > > Sorry for being so focused on this
> > > but I would like to understand on whether this is avoidable by a
> > > different startup scheme or it really needs to be addressed in some w=
ay.
> >=20
> > No worries, I am in fact happy you are giving it this much attention :)
> >=20
> > I also understand this is a considerable change in the locking strategy=
, and
> > avoiding that is the first thing that should be tried.
> >=20
> > >=20
> > > > One way I thought to solve that was introducing a remote drain, whi=
ch would
> > > > require a different strategy for locking, since not all accesses to=
 the pcp
> > > > caches would happen on a local CPU.=20
> > >=20
> > > Yeah, I am not supper happy about additional spin lock TBH. One
> > > potential way to go would be to completely avoid pcp cache for isolat=
ed
> > > CPUs.=C2=A0That would have some performance impact of course but on t=
he other
> > > hand it would give a more predictable behavior for those CPUs which
> > > sounds like a reasonable compromise to me. What do you think?
> >=20
> > You mean not having a perCPU stock, then?=20
> > So consume_stock() for isolated CPUs would always return false, causing
> > try_charge_memcg() always walking the slow path?
>=20
> Exactly.
>=20
> > IIUC, both my proposal and yours would degrade performance only when we=
 use
> > isolated CPUs + memcg. Is that correct?
>=20
> Yes, with a notable difference that with your spin lock option there is
> still a chance that the remote draining could influence the isolated CPU
> workload throug that said spinlock. If there is no pcp cache for that
> cpu being used then there is no potential interaction at all.

I see.=20
But the slow path is slow for some reason, right?
Does not it make use of any locks also? So on normal operation there could =
be a
potentially larger impact than a spinlock, even though there would be no
scheduled draining.

>=20
> > If so, it looks like the impact would be even bigger without perCPU sto=
ck ,
> > compared to introducing a spinlock.
> >=20
> > Unless, we are counting to this case where a remote CPU is draining an =
isolated
> > CPU, and the isolated CPU faults a page, and has to wait for the spinlo=
ck to be
> > released in the remote CPU. Well, this seems possible to happen, but I =
would
> > have to analyze how often would it happen, and how much would it impact=
 the
> > deadlines. I *guess* most of the RT workload's memory pages are pre-fau=
lted
> > before its starts, so it can avoid the faulting latency, but I need to =
confirm
> > that.
>=20
> Yes, that is a general practice and the reason why I was asking how real
> of a problem that is in practice.=C2=A0

I remember this was one common factor on deadlines being missed in the work=
load
analyzed. Need to redo the test to be sure.

> It is true true that appart from user
> space memory which can be under full control of the userspace there are
> kernel allocations which can be done on behalf of the process and those
> could be charged to memcg as well. So I can imagine the pcp cache could
> be populated even if the process is not faulting anything in during RT
> sensitive phase.

Humm, I think I will apply the change and do a comparative testing with
upstream. This should bring good comparison results.

>=20
> > On the other hand, compared to how it works now now, this should be a m=
ore
> > controllable way of introducing latency than a scheduled cache drain.
> >=20
> > Your suggestion on no-stocks/caches in isolated CPUs would be great for
> > predictability, but I am almost sure the cost in overall performance wo=
uld not
> > be fine.
>=20
> It is hard to estimate the overhead without measuring that. Do you think
> you can give it a try? If the performance is not really acceptable
> (which I would be really surprised) then we can think of a more complex
> solution.

Sure, I can try that.
Do you suggest any specific workload that happens to stress the percpu cach=
e
usage, with usual drains and so? Maybe I will also try with synthetic worlo=
ads
also.

>=20
> > With the possibility of prefaulting pages, do you see any scenario that=
 would
> > introduce some undesirable latency in the workload?
>=20
> My primary concern would be spin lock contention which is hard to
> predict with something like remote draining.

It makes sense. I will do some testing and come out with results for that.

Thanks for reviewing!
Leo