Re: [PATCH v7 00/13] fold per-CPU vmstats remotely

From: Marcelo Tosatti <mtosatti@redhat.com>
To: Michal Hocko <mhocko@suse.com>
Cc: Frederic Weisbecker <frederic@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Christoph Lameter <cl@linux.com>,
	Aaron Tomlin <atomlin@atomlin.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Russell King <linux@armlinux.org.uk>,
	Huacai Chen <chenhuacai@kernel.org>,
	Heiko Carstens <hca@linux.ibm.com>,
	x86@kernel.org, Vlastimil Babka <vbabka@suse.cz>
Subject: Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
Date: Thu, 27 Apr 2023 11:59:48 -0300	[thread overview]
Message-ID: <ZEqN5J3b8m8QxNEP@tpad> (raw)
In-Reply-To: <ZEoy2aYpGJ4/wOK7@dhcp22.suse.cz>

On Thu, Apr 27, 2023 at 10:31:21AM +0200, Michal Hocko wrote:
> On Wed 26-04-23 11:34:00, Marcelo Tosatti wrote:
> > On Thu, Apr 20, 2023 at 10:45:20AM -0300, Marcelo Tosatti wrote:
> [...]
> > > There are additional details that were not mentioned. When we think
> > > of flushing caches, or disabling per-CPU caches, this means that the
> > > isolated application loses the benefit of those caches (which means you
> > > are turning a "general purpose" programming environment into
> > > potentially slower environment for applications to execute).
> 
> I do not really buy this argument! Nothing is really free and somebody
> has to pay for the overhead. 

About the overhead, modern processors perform "cache locking":

https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce-261.html

Which means that as long as memory is completly contained in a cacheline
(for write-back memory), then it is not necessary to perform the LOCK 
operation on the bus.
Multiple experiments have confirmed this is the case.

This is the case with per-CPU vmstat memory as well.

> You want highly specialized workload to
> enjoy all the performance while having high demand on latency yet the
> overhead has to pay everybody else.

Yes, the overhead is that code should avoid interrupting CPUs.

Your argument is that: "Avoiding the interruptions adds unsurmountable 
complexity therefore those workloads should not be supported" ?

It seems to me this argument can be used against any new piece of code 
of functionality that is added to the kernel, isnt it ?

The same argument could be used to reject (at the time) new additions 
such as RCU (because systems with large number of processors are a
a highly specialized workload), memory hotplug (same thing), PCI hotplug 
(same thing).

> > https://www.uwsg.indiana.edu/hypermail/linux/kernel/2012.0/06823.html
> 
> This is just talking about who benefits from isolation and I do not
> think there is any dispute in that regard. I haven't questioned that. My
> main argument was that those really need to be special and careful to
> achieve their goal 

I see.

> and Thomas says a very similar thing. I do not see
> any objection to an explicit programming model to achieve that goal.

Yes, but it seems to me that the best possible (and most widely
applicable) solution is to avoid any explicit programming if possible.

> > > (yes, of course, one has to be mindful of which system calls can be
> > > used, for example the execution time of system calls which take locks will
> > > depend on whether, and how many, users of those locks there are at a
> > > given moment).
> 
> This is simply not maintainble state. Once you enter the kernel you
> cannot really expect your _ultra low_ latency expectations.

Whether or not its OK to perform system calls is up to the application:
What matters is the latency expectation from the outside world [1]
VS how long it takes to execute a set of instructions.

I can give two concrete example:

1) Cyclictest use sys_nanosleep(). It makes sense
to abstract the details of HLT'ing to the operating system.

A whole class of programs (which must handle periodic tasks)
will sleep via the kernel.

2) The HPC example from Thomas, where:

"
 1     read_data_set() <- involving syscalls/OS obviously
 2     compute_set()   <- let me alone
 3     save_data_set() <- involving syscalls/OS obviously

       repeat the above...

then it's at his discretion to decide to inflict a particular isolation
set on the task which is obviously ineffective while doing #1 and #3 but
might provide the so desired 0.9% boost for compute_set() which
dominates the judgement."

It seems the operating system is capable of providing an isolation free
environment for the application without explicit knowledge from it
(other than taskset).

So all the above are good reasons to try and avoid an explicit
programming interface (again, i did write an explicit programming
interface, and seen in practice its downsides). I will assume you now
understand and agree that the additional complexity added to the kernel
is worthwhile (since its not that huge complexity to perform particular
work remotely, with appropriate locking and/or usage of lockless
algorithms, these things have been around and used in the kernel for a
while now). 

As for the next topic...

> [...]
> > > So it seems to me (unless there are points that show otherwise, which
> > > would indicate that explicit userspace interfaces are preferred) _not_
> > > requiring userspace changes is a superior solution.
> > > 
> > > Perhaps the complexity should be judged for individual cases 
> > > of interruptions, and if a given interruption-free conversion 
> > > is seen as too complex, then a "disable feature which makes use of per-CPU
> > > caches" style solution can be made (and then userspace has to
> > > explicitly request for that per-CPU feature to be disabled).
> > > 
> > > But i don't see that this patchset introduces unmanageable complexity,
> > > neither: 
> 
> As I've tried to explain, I disagree about the approach you are taking.
> You are fixing your problem at a wrong layer. You really need to address
> the fundamental issue and that is that you do not want housekeeping done
> on isolated cpu(s) while your workload is running there.

OK, that is a problem. But the fact is that there are interfaces to request
work to be performed on remote CPUs (usually on per-CPU data), and those
must be addressed one-by-one. We (as in the community) are looking into
ways to address multiple classes of interruptions at once, for example:

https://lpc.events/event/16/contributions/1218/
https://www.spinics.net/lists/linux-s390/msg57118.html

"The current CPU isolation is a best effort approach and I agree that for
more strict isolation modes we need to be able to enforce that and hunt
down offenders and think about them one by one."

But we can't block IPIs or requests for work to be executed remotely.

> vmstat updates are just one of schedule_on_cpu users who could disturb
> your workload.  

Yes. But its one case which we see right now. So our approach so far has
been to monitor the workload and remove individual interruptions that
are observed.

But as long as you have code that is doing:

	queue_work_on(CPU, &this_work);
	flush_work(&this_work);

There is nothing one can do about it other than:

	1) Return an error to avoid the interruption
	   (which is what we are trying here:
	   https://lpc.events/event/16/contributions/1218/).
	   However this might not be suitable for all cases
	   (because you rely on the functionality).
	2) Convert it to avoid the remote work somehow.
	3) Don't queue the work (which might result in incorrect
	   system operation).

Again, we are trying to widen the number of callsites that can be 
handled with certain approaches (see the above URLs).

> We do not want to chase every single one and keep
> doing that for ever as new callers of that API are added. See the
> point? 

Yes, agree with this point. Possible solutions:

1) Change the APIs so that any new users that attempt
to use the APIs are encouraged to avoid executing code on
isolated CPUs (or have to handle the errors).

/**
 * queue_work_on - queue work on specific cpu
 * @cpu: CPU number to execute work on
 * @wq: workqueue to use
 * @work: work to queue
 *
 * We queue the work to a specific CPU, the caller must ensure it
 * can't go away.  Callers that fail to ensure that the specified
 * CPU cannot go away will execute on a randomly chosen CPU.
 *
 * Return: %false if @work was already on a queue, %true otherwise.
 */
bool queue_work_on(int cpu, struct workqueue_struct *wq,
                   struct work_struct *work)
{

Perhaps _fail variants of APIs to queue remote work 
(https://lore.kernel.org/lkml/20220908192859.546633738@redhat.com/T/#mc25ddea62ff095dba61d244fbfdca1f61221c915)
plus
a checkpatch.pl error in case of addition of queue_work_on would 
be helpful?

2) Change the patch acceptance criteria to avoid introducing 
queue_work_on's to isolated CPUs.

If you have additional suggestions or ideas, they are welcome.

> "Fixing" vmstat will not make your isolated workload more
> reliable. You really need a more generic solution rather than a quick
> hack.

It does as long as no one else executes queue_work_on() :-)

Yes, i understand and agree this is weak and fragile.

> Also vmstat already has a concept of silencing - i.e. quiet_vmstat. IIRC
> this is used by NOHZ. I do not remember any details but if anything this
> is something I would have a look into.

Its not sufficient since what it does is only flushing the per-CPU
vmstats when entering idle. There has been work on a patchset
(https://lkml.org/lkml/2022/9/24/306) to improve the infrastructure, to
call quiet_vmstat on return to userspace when nohz_full is used, but
honestly the remote flushing is superior: it potentially avoids many
unecessary flushes (which can happen if the application performs a lot
of system calls).

> There is close to 0 benefit to teaching remote stat flushing. As I've
> said stats are only for debugging purposes and imprecise values
> shouldn't matter. So this just adds a complexity without any actual real
> benefit.

Well there is the need to request a synchronization of the events from
all CPUs, right? For example:

int vmstat_refresh(struct ctl_table *table, int write,
                   void *buffer, size_t *lenp, loff_t *ppos)
{
        long val;
        int err;
        int i;

        /*
         * The regular update, every sysctl_stat_interval, may come later
         * than expected: leaving a significant amount in per_cpu buckets.
         * This is particularly misleading when checking a quantity of HUGE
         * pages, immediately after running a test.  /proc/sys/vm/stat_refresh,
         * which can equally be echo'ed to or cat'ted from (by root),
         * can be used to update the stats just before reading them.
         *
         * Oh, and since global_zone_page_state() etc. are so careful to hide
         * transiently negative values, report an error here if any of
         * the stats is negative, so we know to go looking for imbalance.
         */
        err = schedule_on_each_cpu(refresh_vm_stats);

So if you let a remote CPU dirty its own cache, then at somepoint you
need to flush back that cache.

Are you suggesting to disable synchronization of the per-CPU
vmstats for a given CPU (if isolated, for example), and then any
callsites which require proper values must loop over all CPUs
when requesting uptodate statistics?

Works for me and the addresses the requests from users 
(since it removes the interruption to isolated CPUs).