Re: [PATCH 0/2] mm/page_alloc: Remote per-cpu lists drain support

From: Nicolas Saenz Julienne <nsaenzju@redhat.com>
To: Mel Gorman <mgorman@suse.de>
Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, frederic@kernel.org, tglx@linutronix.de,
	mtosatti@redhat.com, linux-rt-users@vger.kernel.org,
	vbabka@suse.cz, cl@linux.com, paulmck@kernel.org,
	willy@infradead.org
Subject: Re: [PATCH 0/2] mm/page_alloc: Remote per-cpu lists drain support
Date: Thu, 24 Mar 2022 19:59:56 +0100
Message-ID: <3c24840e8378c69224974f321ec5c06a36a33dd3.camel@redhat.com>
In-Reply-To: <20220303114550.GE4363@suse.de>

Hi Mel,

On Thu, 2022-03-03 at 11:45 +0000, Mel Gorman wrote:
> For unrelated reasons I looked at using llist to avoid locks entirely. It
> turns out it's not possible and needs a lock. We know "local_locks to
> per-cpu spinlocks" took a large penalty so I considered alternatives on
> how a lock could be used.  I found it's possible to both remote drain
> the lists and avoid the disable/enable of IRQs entirely as long as a
> preempting IRQ is willing to take the zone lock instead (should be very
> rare). The IRQ part is a bit hairy though as softirqs are also a problem
> and preempt-rt needs different rules and the llist has to sort PCP
> refills which might be a loss in total. However, the remote draining may
> still be interesting. The full series is at
> https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git/ mm-pcpllist-v1r2
> 
> It's still waiting on tests to complete and not all the changelogs are
> complete which is why it's not posted.
> 
> This is a comparison of vanilla vs "local_locks to per-cpu spinlocks"
> versus the git series up to "mm/page_alloc: Remotely drain per-cpu lists"
> for the page faulting microbench I originally complained about.  The test
> machine is a 2-socket CascadeLake machine.
> 
> pft timings
>                                  5.17.0-rc5             5.17.0-rc5             5.17.0-rc5
>                                     vanilla    mm-remotedrain-v2r1       mm-pcpdrain-v1r1
> Amean     elapsed-1        32.54 (   0.00%)       33.08 *  -1.66%*       32.82 *  -0.86%*
> Amean     elapsed-4         8.66 (   0.00%)        9.24 *  -6.72%*        8.69 *  -0.38%*
> Amean     elapsed-7         5.02 (   0.00%)        5.43 *  -8.16%*        5.05 *  -0.55%*
> Amean     elapsed-12        3.07 (   0.00%)        3.38 * -10.00%*        3.09 *  -0.72%*
> Amean     elapsed-21        2.36 (   0.00%)        2.38 *  -0.89%*        2.19 *   7.39%*
> Amean     elapsed-30        1.75 (   0.00%)        1.87 *  -6.50%*        1.62 *   7.59%*
> Amean     elapsed-48        1.71 (   0.00%)        2.00 * -17.32%*        1.71 (  -0.08%)
> Amean     elapsed-79        1.56 (   0.00%)        1.62 *  -3.84%*        1.56 (  -0.02%)
> Amean     elapsed-80        1.57 (   0.00%)        1.65 *  -5.31%*        1.57 (  -0.04%)
> 
> Note the local_lock conversion took a 1-17% penalty while the git tree
> takes a negligible penalty while still allowing remote drains. It might
> have some potential while being less complex than the RCU approach.

I finally got some time to look at this and made some progress:

First, I believe your 'mm-remotedrain-v2r1' results are wrong/inflated due to a
bug in my series. Essentially, all 'this_cpu_ptr()' calls should've been
'raw_cpu_ptr()', and your build, which I bet enables CONFIG_DEBUG_PREEMPT,
wasted time throwing warnings about per-cpu variable usage with preemption
enabled, making the overall performance look worse than it actually is. My
build didn't enable it, which made me miss this whole issue. I'm sorry for the
noise and time wasted on such a silly thing. Note that the local_lock to
spin_lock conversion can handle preemption just fine; it is part of the
design[1].

As for your idea of not disabling interrupts in the hot paths, it seems to
close the performance gap created by the lock conversion. That said, I'm not
sure I understand why you need to keep the local_locks around: not only do
they cause problems for RT systems, but IIUC they aren't really protecting
anything other than the 'this_cpu_ptr()' usage (which isn't really needed). I
rewrote your patch on top of my lock conversion series and I'm in the process
of testing it on multiple systems[2].

Let me know what you think.
Thanks!

[1] It follows this pattern:

	struct per_cpu_pages *pcp;

	pcp = raw_cpu_ptr(page_zone(page)->per_cpu_pageset);
	// <- Migration here is OK: spin_lock protects vs eventual pcplist
	// access from local CPU as long as all list access happens through the
	// pcp pointer.
	spin_lock(&pcp->lock);
	do_stuff_with_pcp_lists(pcp);
	spin_unlock(&pcp->lock);
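
For anyone who wants to poke at the idea outside the kernel, below is a
minimal userspace sketch of the same pattern. It is only a model under stated
assumptions, not code from either series: pthread mutexes stand in for the
per-cpu spinlocks, a plain counter stands in for the pcplists, and every name
in it (pcp_model, drain_one, pcp_model_run, ...) is made up for illustration.
Each "local" thread grabs its pointer once and then only touches the data
under the lock, while a "remote" drainer empties all lists through that same
lock.

```c
/* Userspace model only; none of these names exist in the kernel. */
#include <assert.h>
#include <pthread.h>

#define MODEL_NR_CPUS	4
#define MODEL_PAGES	10000

struct pcp_model {
	pthread_mutex_t lock;	/* stands in for the spinlock in pcp->lock */
	long count;		/* stands in for the pcplist page count */
};

static struct pcp_model pcps[MODEL_NR_CPUS];

/* Empty one list; callable from any thread, i.e. a "remote drain". */
static long drain_one(struct pcp_model *pcp)
{
	long n;

	pthread_mutex_lock(&pcp->lock);
	n = pcp->count;
	pcp->count = 0;
	pthread_mutex_unlock(&pcp->lock);
	return n;
}

/* "Local" path: take the pointer once, then only touch it under its lock. */
static void *local_worker(void *arg)
{
	struct pcp_model *pcp = arg;	/* plays the raw_cpu_ptr() result */

	for (long i = 0; i < MODEL_PAGES; i++) {
		/* Being moved to another CPU here is harmless in this model:
		 * the lock, not CPU affinity, serializes access to *pcp. */
		pthread_mutex_lock(&pcp->lock);
		pcp->count++;
		pthread_mutex_unlock(&pcp->lock);
	}
	return NULL;
}

/* Remote drainer: sweeps every list while the workers may still be running. */
static void *remote_drainer(void *arg)
{
	long *total = arg;

	for (int pass = 0; pass < 100; pass++)
		for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
			*total += drain_one(&pcps[cpu]);
	return NULL;
}

/* Returns the number of pages drained overall. */
long pcp_model_run(void)
{
	pthread_t workers[MODEL_NR_CPUS], drainer;
	long total = 0;
	int rc;

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		pthread_mutex_init(&pcps[cpu].lock, NULL);
		pcps[cpu].count = 0;
		rc = pthread_create(&workers[cpu], NULL, local_worker,
				    &pcps[cpu]);
		assert(rc == 0);
	}
	rc = pthread_create(&drainer, NULL, remote_drainer, &total);
	assert(rc == 0);
	(void)rc;

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		pthread_join(workers[cpu], NULL);
	pthread_join(drainer, NULL);

	/* Final sweep picks up whatever the drainer did not get to. */
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		total += drain_one(&pcps[cpu]);
		pthread_mutex_destroy(&pcps[cpu].lock);
	}
	return total;
}
```

Calling pcp_model_run() from a small main() should account for every page
exactly once, whichever thread ends up draining it. The model says nothing
about IRQs, softirqs or PREEMPT_RT, which is where the real difficulty lies.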

[2] See:

  git://git.kernel.org/pub/scm/linux/kernel/git/nsaenz/linux-rpi.git pcpdrain-sl-v3r1

-- 
Nicolás Sáenz