From: Michal Hocko <mhocko@suse.com>
To: "Leonardo Brás" <leobras@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	Valentin Schneider <vschneid@redhat.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Shakeel Butt <shakeelb@google.com>,
	Muchun Song <songmuchun@bytedance.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Frederic Weisbecker <frederic@kernel.org>,
	Phil Auld <pauld@redhat.com>,
	Marcelo Tosatti <mtosatti@redhat.com>,
	linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
	linux-mm@kvack.org
Subject: Re: [PATCH v1 0/3] Avoid scheduling cache draining to isolated cpus
Date: Wed, 9 Nov 2022 09:05:28 +0100	[thread overview]
Message-ID: <Y2tfSAgt/lBVcdvf@dhcp22.suse.cz> (raw)
In-Reply-To: <4a4a6c73f3776d65f70f7ca92eb26fc90ed3d51a.camel@redhat.com>

On Tue 08-11-22 20:09:25, Leonardo Brás wrote:
[...]
> > Yes, with a notable difference that with your spin lock option there is
> > still a chance that the remote draining could influence the isolated CPU
> > workload through that spinlock. If there is no pcp cache for that
> > cpu being used then there is no potential interaction at all.
> 
> I see. 
> But the slow path is slow for some reason, right?
> Doesn't it make use of any locks as well? So in normal operation there
> could be a potentially larger impact than with a spinlock, even though
> there would be no scheduled draining.

Well, for the regular (try_charge) path that is essentially
page_counter_try_charge(), which boils down to an atomic_long_add_return()
on the memcg counter plus all parents up the hierarchy, and the high memory
limit evaluation (essentially two atomic_read()s for the memcg plus all
parents up the hierarchy). That is not a whole lot - especially when the
memcg hierarchy is not very deep.
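
Roughly, the shape of that walk is as in the sketch below (a simplified
illustration, not the actual mm/page_counter.c code; the sketch_* names
are made up and the high limit check, which is just a couple of
atomic_read()s per level, is left out):

/*
 * Simplified sketch of the hierarchical charge: one atomic RMW per
 * ancestor for every charge, unwound again with atomics on failure.
 */
struct page_counter_sketch {
	atomic_long_t usage;
	unsigned long max;
	struct page_counter_sketch *parent;
};

static bool sketch_try_charge(struct page_counter_sketch *counter,
			      unsigned long nr_pages,
			      struct page_counter_sketch **fail)
{
	struct page_counter_sketch *c;

	for (c = counter; c; c = c->parent) {
		long new = atomic_long_add_return(nr_pages, &c->usage);

		if (new > c->max) {
			/* unwind everything charged so far, including c */
			atomic_long_sub(nr_pages, &c->usage);
			for (; counter != c; counter = counter->parent)
				atomic_long_sub(nr_pages, &counter->usage);
			*fail = c;
			return false;
		}
	}
	return true;
}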

The per-cpu batch amortizes those per-hierarchy updates as well as the
atomic operations and the cache line bouncing on updates.
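
Very roughly, the fast path of that per-cpu stock looks like the sketch
below (a heavy simplification of the memcg_stock handling in
mm/memcontrol.c - the real code uses a local_lock and tracks more state;
the *_sketch names are made up):

/*
 * Per-cpu stock fast path: if the cached memcg matches and enough
 * pre-charged pages are left, the charge is satisfied locally and the
 * shared page counters are not touched at all.
 */
struct memcg_stock_sketch {
	struct mem_cgroup *cached;
	unsigned int nr_pages;
};
static DEFINE_PER_CPU(struct memcg_stock_sketch, memcg_stock_sketch);

static bool sketch_consume_stock(struct mem_cgroup *memcg,
				 unsigned int nr_pages)
{
	struct memcg_stock_sketch *stock;
	unsigned long flags;
	bool ret = false;

	local_irq_save(flags);
	stock = this_cpu_ptr(&memcg_stock_sketch);
	if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
		stock->nr_pages -= nr_pages;	/* no atomics, no bouncing */
		ret = true;
	}
	local_irq_restore(flags);
	return ret;
}

Only when this fails does the slow path above run, charging a whole
batch at once and stashing the surplus in the stock.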

On the other hand, a spinlock would do unconditional atomic updates as
well, and even much more on CONFIG_PREEMPT_RT. A plus is that the update
will be mostly local, so cache line bouncing shouldn't be terrible -
unless somebody heavily triggers pcp cache draining, but that shouldn't
be all that common (e.g. when a memcg hits its limit).
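
For comparison, a hypothetical sketch of the spinlocked variant being
discussed (again made-up names, not the actual patch): each CPU's stock
is protected by a spinlock_t so that a housekeeping CPU can drain it
remotely, and the local fast path pays a lock/unlock pair on every
charge even when nobody is draining:

struct memcg_stock_locked_sketch {
	spinlock_t lock;	/* becomes a sleeping rtmutex on PREEMPT_RT */
	struct mem_cgroup *cached;
	unsigned int nr_pages;
};
/* per-cpu spin_lock_init() at boot is elided here */
static DEFINE_PER_CPU(struct memcg_stock_locked_sketch, stock_locked);

/* A housekeeping CPU flushing some other CPU's stock remotely. */
static void sketch_drain_remote_stock(int cpu)
{
	struct memcg_stock_locked_sketch *stock = per_cpu_ptr(&stock_locked, cpu);
	unsigned long flags;

	spin_lock_irqsave(&stock->lock, flags);
	if (stock->cached && stock->nr_pages) {
		/* return stock->nr_pages to stock->cached's page counters */
		stock->nr_pages = 0;
		stock->cached = NULL;
	}
	spin_unlock_irqrestore(&stock->lock, flags);
}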

All that being said, I am still not convinced that the pcp cache bypass
for isolated CPUs would make a dramatic difference - especially in the
context of workloads that tend to run on isolated CPUs and rarely enter
the kernel.
 
> > It is true that apart from user space memory, which can be under full
> > control of the userspace, there are kernel allocations which can be
> > done on behalf of the process, and those could be charged to the memcg
> > as well. So I can imagine the pcp cache could be populated even if the
> > process is not faulting anything in during the RT sensitive phase.
> 
> Humm, I think I will apply the change and do comparative testing against
> upstream. That should give good comparison results.

That would be certainly appreciated!
 
> > > On the other hand, compared to how it works now, this should be a more
> > > controllable way of introducing latency than a scheduled cache drain.
> > > 
> > > Your suggestion of no stocks/caches on isolated CPUs would be great for
> > > predictability, but I am almost sure the cost in overall performance
> > > would not be acceptable.
> > 
> > It is hard to estimate the overhead without measuring it. Do you think
> > you can give it a try? If the performance is not really acceptable
> > (which would really surprise me) then we can think of a more complex
> > solution.
> 
> Sure, I can try that.
> Do you suggest any specific workload that happens to stress the percpu
> cache usage, with the usual drains and so on? Maybe I will also try with
> synthetic workloads.

I really think you want to test it with an isolcpus-aware workload.
Artificial benchmarks are not all that useful in this context.
-- 
Michal Hocko
SUSE Labs

Thread overview:
2022-11-02  2:02 [PATCH v1 0/3] Avoid scheduling cache draining to isolated cpus Leonardo Bras
2022-11-02  2:02 ` [PATCH v1 1/3] sched/isolation: Add housekeepíng_any_cpu_from() Leonardo Bras
2022-11-02  2:02 ` [PATCH v1 2/3] mm/memcontrol: Change stock_lock type from local_lock_t to spinlock_t Leonardo Bras
2022-11-02  2:02 ` [PATCH v1 3/3] mm/memcontrol: Add drain_remote_stock(), avoid drain_stock on isolated cpus Leonardo Bras
2022-11-02  8:53 ` [PATCH v1 0/3] Avoid scheduling cache draining to " Michal Hocko
2022-11-03 14:59   ` Leonardo Brás
2022-11-03 15:31     ` Michal Hocko
2022-11-03 16:53       ` Leonardo Brás
2022-11-04  8:41         ` Michal Hocko
2022-11-05  1:45           ` Leonardo Brás
2022-11-07  8:10             ` Michal Hocko
2022-11-08 23:09               ` Leonardo Brás
2022-11-09  8:05                 ` Michal Hocko [this message]
2023-01-25  7:44                   ` Leonardo Brás