Re: [PATCH 0/2] mm/page_alloc: Remote per-cpu lists drain support

From: Mel Gorman <mgorman@suse.de>
To: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, frederic@kernel.org, tglx@linutronix.de,
	mtosatti@redhat.com, linux-rt-users@vger.kernel.org,
	vbabka@suse.cz, cl@linux.com, paulmck@kernel.org,
	willy@infradead.org
Subject: Re: [PATCH 0/2] mm/page_alloc: Remote per-cpu lists drain support
Date: Fri, 25 Mar 2022 10:48:00 +0000
Message-ID: <20220325104800.GI4363@suse.de>
In-Reply-To: <3c24840e8378c69224974f321ec5c06a36a33dd3.camel@redhat.com>

On Thu, Mar 24, 2022 at 07:59:56PM +0100, Nicolas Saenz Julienne wrote:
> Hi Mel,
> 
> On Thu, 2022-03-03 at 11:45 +0000, Mel Gorman wrote:
> > For unrelated reasons I looked at using llist to avoid locks entirely. It
> > turns out it's not possible and needs a lock. We know "local_locks to
> > per-cpu spinlocks" took a large penalty so I considered alternatives on
> > how a lock could be used.  I found it's possible to both remote drain
> > the lists and avoid the disable/enable of IRQs entirely as long as a
> > preempting IRQ is willing to take the zone lock instead (should be very
> > rare). The IRQ part is a bit hairy though as softirqs are also a problem
> > and preempt-rt needs different rules and the llist has to sort PCP
> > refills which might be a loss in total. However, the remote draining may
> > still be interesting. The full series is at
> > https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git/ mm-pcpllist-v1r2
> > 
> > It's still waiting on tests to complete and not all the changelogs are
> > complete which is why it's not posted.
> > 
> > This is a comparison of vanilla vs "local_locks to per-cpu spinlocks"
> > versus the git series up to "mm/page_alloc: Remotely drain per-cpu lists"
> > for the page faulting microbench I originally complained about.  The test
> > machine is a 2-socket CascadeLake machine.
> > 
> > pft timings
> >                                  5.17.0-rc5             5.17.0-rc5             5.17.0-rc5
> >                                     vanilla    mm-remotedrain-v2r1       mm-pcpdrain-v1r1
> > Amean     elapsed-1        32.54 (   0.00%)       33.08 *  -1.66%*       32.82 *  -0.86%*
> > Amean     elapsed-4         8.66 (   0.00%)        9.24 *  -6.72%*        8.69 *  -0.38%*
> > Amean     elapsed-7         5.02 (   0.00%)        5.43 *  -8.16%*        5.05 *  -0.55%*
> > Amean     elapsed-12        3.07 (   0.00%)        3.38 * -10.00%*        3.09 *  -0.72%*
> > Amean     elapsed-21        2.36 (   0.00%)        2.38 *  -0.89%*        2.19 *   7.39%*
> > Amean     elapsed-30        1.75 (   0.00%)        1.87 *  -6.50%*        1.62 *   7.59%*
> > Amean     elapsed-48        1.71 (   0.00%)        2.00 * -17.32%*        1.71 (  -0.08%)
> > Amean     elapsed-79        1.56 (   0.00%)        1.62 *  -3.84%*        1.56 (  -0.02%)
> > Amean     elapsed-80        1.57 (   0.00%)        1.65 *  -5.31%*        1.57 (  -0.04%)
> > 
> > Note the local_lock conversion took a 1-17% penalty while the git tree
> > takes a negligible penalty while still allowing remote drains. It might
> > have some potential while being less complex than the RCU approach.
> 
> I finally got some time to look at this and made some progress:
> 
> First, I believe your 'mm-remotedrain-v2r1' results are wrong/inflated due to a
> bug in my series. Essentially, all 'this_cpu_ptr()' calls should've been
> 'raw_cpu_ptr()' and your build, which I bet enables CONFIG_DEBUG_PREEMPT,

I no longer have the logs, but it could have, and I didn't check dmesg
at the time to see whether there were warnings in it. I really should
add something to parse that log and automatically report any unexpected
warnings, oopses or prove-locking warnings if enabled.

> wasted time throwing warnings about per-cpu variable usage with preemption
> enabled, making the overall performance look worse than it actually is. My
> build didn't enable it, which made me miss this whole issue. I'm sorry for the
> noise and time wasted on such a silly thing. Note that the local_lock to
> spin_lock conversion can handle the preemption alright, it is part of the
> design[1].
> 
> As for your idea of not disabling interrupts in the hot paths, it seems to
> close the performance gap created by the lock conversion. That said, I'm not
> sure I understand why you find the need to keep the local_locks around: not
> only do they cause problems for RT systems, but IIUC they aren't really protecting
> anything other than the 'this_cpu_ptr()' usage (which isn't really needed).

The local lock was preserved because something has to stabilise the per-cpu
pointer against preemption and migration.  On !RT, that's a preempt_disable
and on RT it's a spinlock, both of which prevent migration. The secondary
goal of using the local lock was to allow some allocations to be done
without disabling IRQs at all, with the penalty that an IRQ arriving at
the wrong time will have to allocate directly from the buddy lists, which
should be rare.
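
As a rough illustration of that fallback, here is a userspace sketch. It is
not the code from the series and not kernel code: the structures, the
atomic_flag standing in for the per-cpu list lock, and every *_sim name are
assumptions made up for the example. It only shows the shape of the idea:
try to take the per-cpu list without disabling IRQs, and if the list is
busy because we interrupted its owner, allocate from the buddy lists.

```c
/* Userspace sketch only, not kernel code; every name here is hypothetical. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct pcp_sim {
	atomic_flag busy;	/* stands in for the per-cpu list lock */
	int count;		/* pages cached on the per-cpu list */
};

struct zone_sim {
	int free_pages;		/* pages held in the buddy lists */
	int buddy_allocs;	/* how often the slow path was taken */
};

/* Returns true if the page came from the pcp list, false if from buddy. */
static bool alloc_page_sim(struct pcp_sim *pcp, struct zone_sim *zone)
{
	/* Trylock instead of disabling IRQs around the list update. */
	if (!atomic_flag_test_and_set(&pcp->busy)) {
		bool from_pcp = pcp->count > 0;

		if (from_pcp)
			pcp->count--;
		atomic_flag_clear(&pcp->busy);
		if (from_pcp)
			return true;
	}

	/* List busy (we interrupted its owner) or empty: use the buddy lists. */
	assert(zone->free_pages > 0);
	zone->free_pages--;
	zone->buddy_allocs++;
	return false;
}

/* Models an IRQ that fires while task context holds the pcp list. */
static bool alloc_from_simulated_irq(struct pcp_sim *pcp, struct zone_sim *zone)
{
	bool got_pcp;

	atomic_flag_test_and_set(&pcp->busy);	/* task context takes the list */
	got_pcp = alloc_page_sim(pcp, zone);	/* the "IRQ" arrives here */
	atomic_flag_clear(&pcp->busy);		/* task context finishes */
	return got_pcp;
}
```

In this model a normal call is served from the per-cpu list, while a call
made through alloc_from_simulated_irq() leaves the list untouched and takes
a page from the zone instead.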

> I
> rewrote your patch on top of my lock conversion series and I'm in the process
> of testing it on multiple systems[2].
> 
> Let me know what you think.
> Thanks!
> 
> [1] It follows this pattern:
> 
> 	struct per_cpu_pages *pcp;
> 
> 	pcp = raw_cpu_ptr(page_zone(page)->per_cpu_pageset);
> 	// <- Migration here is OK: spin_lock protects vs eventual pcplist
> 	// access from local CPU as long as all list access happens through the
> 	// pcp pointer.
> 	spin_lock(&pcp->lock);
> 	do_stuff_with_pcp_lists(pcp);
> 	spin_unlock(&pcp->lock);
> 

And this was the part I am concerned with. We are accessing a PCP
structure that is not necessarily the one belonging to the CPU we
are currently running on. This type of pattern is warned about in
Documentation/locking/locktypes.rst

---8<---
A typical scenario is protection of per-CPU variables in thread context::

  struct foo *p = get_cpu_ptr(&var1);

  spin_lock(&p->lock);
  p->count += this_cpu_read(var2);

This is correct code on a non-PREEMPT_RT kernel, but on a PREEMPT_RT kernel
this breaks. The PREEMPT_RT-specific change of spinlock_t semantics does
not allow to acquire p->lock because get_cpu_ptr() implicitly disables
preemption. The following substitution works on both kernels::
---8<---

Now, we don't explicitly have this pattern because there isn't an
obvious this_cpu_read(), for example, but it can accidentally happen for
counting. __count_zid_vm_events -> __count_vm_events -> raw_cpu_add is
an example, although a harmless one.

Any of the mod_page_state ones are more problematic though, because we
lock one PCP but potentially update the per-cpu stats of another CPU,
one whose PCP we have not locked, and those counters must be accurate.
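
To make the concern concrete, here is a minimal userspace sketch. It is not
kernel code and all names are hypothetical (pcp_sim, vm_stat_diff_sim,
current_cpu_sim and the helpers are invented stand-ins). It assumes a stats
helper that always updates the counters of whichever CPU the task is
running on, and models a migration between taking the pcp pointer and
taking its lock.

```c
/* Userspace sketch only, not kernel code; every name here is hypothetical. */
#include <assert.h>

#define NR_CPUS_SIM 2

struct pcp_sim {
	int locked;	/* stands in for pcp->lock */
	int count;	/* pages on this CPU's list */
};

static struct pcp_sim pcp_sim[NR_CPUS_SIM];
static int vm_stat_diff_sim[NR_CPUS_SIM];	/* per-cpu counter deltas */
static int current_cpu_sim;			/* CPU the task runs on now */

/* Like raw_cpu_ptr(): pointer to the current CPU's data, nothing pinned. */
static struct pcp_sim *raw_pcp_ptr_sim(void)
{
	return &pcp_sim[current_cpu_sim];
}

/* Stand-in for a stats helper: always updates the *current* CPU's delta. */
static void mod_state_sim(int delta)
{
	vm_stat_diff_sim[current_cpu_sim] += delta;
}

/*
 * Frees one page. migrate_to >= 0 models a migration after the pointer
 * was taken but before the lock is acquired.
 * Returns 1 if the locked pcp belongs to the CPU we ended up on, else 0.
 */
static int free_page_sim(int migrate_to)
{
	struct pcp_sim *pcp = raw_pcp_ptr_sim();

	if (migrate_to >= 0)
		current_cpu_sim = migrate_to;	/* preempted and migrated */

	pcp->locked = 1;	/* lock belongs to the *old* CPU's pcp */
	pcp->count++;
	mod_state_sim(1);	/* delta lands on the *new* CPU, unlocked */
	pcp->locked = 0;

	return pcp == &pcp_sim[current_cpu_sim];
}
```

Calling free_page_sim(-1) keeps the list and the counter on the same CPU.
Calling free_page_sim(1) while on CPU 0 grows CPU 0's list under CPU 0's
lock but bumps CPU 1's counter, which nothing in this model protects.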

It *might* still be safe, but it's subtle. It could easily be broken by
accident in the future, and it would be hard to detect because it would
show up as very slow corruption of VM counters like NR_FREE_PAGES that
must be accurate.

-- 
Mel Gorman
SUSE Labs