Re: [PATCH 5/7] mm/page_alloc: Protect PCP lists with a spinlock

From: Vlastimil Babka <vbabka@suse.cz>
To: Mel Gorman <mgorman@techsingularity.net>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: Nicolas Saenz Julienne <nsaenzju@redhat.com>,
	Marcelo Tosatti <mtosatti@redhat.com>,
	Michal Hocko <mhocko@kernel.org>, Hugh Dickins <hughd@google.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>
Subject: Re: [PATCH 5/7] mm/page_alloc: Protect PCP lists with a spinlock
Date: Thu, 16 Jun 2022 17:59:34 +0200	[thread overview]
Message-ID: <af8f5c7d-5f7f-17ad-835e-1fdea2d8dda8@suse.cz> (raw)
In-Reply-To: <20220613125622.18628-6-mgorman@techsingularity.net>

On 6/13/22 14:56, Mel Gorman wrote:
> Currently the PCP lists are protected by using local_lock_irqsave to
> prevent migration and IRQ reentrancy but this is inconvenient.  Remote
> draining of the lists is impossible and a workqueue is required and every
> task allocation/free must disable then enable interrupts which is
> expensive.
> 
> As preparation for dealing with both of those problems, protect the lists
> with a spinlock.  The IRQ-unsafe version of the lock is used because IRQs
> are already disabled by local_lock_irqsave.  spin_trylock is used in
> preparation for a time when local_lock could be used instead of
> lock_lock_irqsave.
> 
> The per_cpu_pages still fits within the same number of cache lines after
> this patch relative to before the series.
> 
> struct per_cpu_pages {
>         spinlock_t                 lock;                 /*     0     4 */
>         int                        count;                /*     4     4 */
>         int                        high;                 /*     8     4 */
>         int                        batch;                /*    12     4 */
>         short int                  free_factor;          /*    16     2 */
>         short int                  expire;               /*    18     2 */
> 
>         /* XXX 4 bytes hole, try to pack */
> 
>         struct list_head           lists[13];            /*    24   208 */
> 
>         /* size: 256, cachelines: 4, members: 7 */
>         /* sum members: 228, holes: 1, sum holes: 4 */
>         /* padding: 24 */
> } __attribute__((__aligned__(64)));
> 
> There is overhead in the fast path due to acquiring the spinlock even
> though the spinlock is per-cpu and uncontended in the common case.  Page
> Fault Test (PFT) running on a 1-socket reported the following results on a
> 1 socket machine.
> 
>                                      5.18.0-rc1               5.18.0-rc1
>                                         vanilla         mm-pcpdrain-v2r1
> Hmean     faults/sec-1   886331.5718 (   0.00%)   885462.7479 (  -0.10%)
> Hmean     faults/sec-3  2337706.1583 (   0.00%)  2332130.4909 *  -0.24%*
> Hmean     faults/sec-5  2851594.2897 (   0.00%)  2844123.9307 (  -0.26%)
> Hmean     faults/sec-7  3543251.5507 (   0.00%)  3516889.0442 *  -0.74%*
> Hmean     faults/sec-8  3947098.0024 (   0.00%)  3916162.8476 *  -0.78%*
> Stddev    faults/sec-1     2302.9105 (   0.00%)     2065.0845 (  10.33%)
> Stddev    faults/sec-3     7275.2442 (   0.00%)     6033.2620 (  17.07%)
> Stddev    faults/sec-5    24726.0328 (   0.00%)    12525.1026 (  49.34%)
> Stddev    faults/sec-7     9974.2542 (   0.00%)     9543.9627 (   4.31%)
> Stddev    faults/sec-8     9468.0191 (   0.00%)     7958.2607 (  15.95%)
> CoeffVar  faults/sec-1        0.2598 (   0.00%)        0.2332 (  10.24%)
> CoeffVar  faults/sec-3        0.3112 (   0.00%)        0.2587 (  16.87%)
> CoeffVar  faults/sec-5        0.8670 (   0.00%)        0.4404 (  49.21%)
> CoeffVar  faults/sec-7        0.2815 (   0.00%)        0.2714 (   3.60%)
> CoeffVar  faults/sec-8        0.2399 (   0.00%)        0.2032 (  15.28%)
> 
> There is a small hit in the number of faults per second but given that the
> results are more stable, it's borderline noise.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Tested-by: Minchan Kim <minchan@kernel.org>
> Acked-by: Minchan Kim <minchan@kernel.org>
> Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

With comments:

> -static void free_unref_page_commit(struct page *page, int migratetype,
> -				   unsigned int order)
> +/* Returns true if the page was committed to the per-cpu list. */
> +static bool free_unref_page_commit(struct page *page, int migratetype,
> +				   unsigned int order, bool locked)

'bool locked' schemes are frowned upon. Although this does some preparatory
work outside of the locked section...

>  {
>  	struct zone *zone = page_zone(page);
>  	struct per_cpu_pages *pcp;
>  	int high;
>  	int pindex;
>  	bool free_high;
> +	unsigned long __maybe_unused UP_flags;
>  
>  	__count_vm_event(PGFREE);
>  	pcp = this_cpu_ptr(zone->per_cpu_pageset);
>  	pindex = order_to_pindex(migratetype, order);

... maybe it would be simpler to just do the locking always outside? We
aren't expecting any contention anyway except if draining is in progress,
and thus making the lock hold time a bunch of instructions longer won't make
any difference?

That said I don't know yet how patches 6 and 7 change this...

[...]

> @@ -3465,10 +3510,19 @@ void free_unref_page(struct page *page, unsigned int order)
>  void free_unref_page_list(struct list_head *list)
>  {
>  	struct page *page, *next;
> +	struct per_cpu_pages *pcp;
> +	struct zone *locked_zone;
>  	unsigned long flags;
>  	int batch_count = 0;
>  	int migratetype;
>  
> +	/*
> +	 * An empty list is possible. Check early so that the later
> +	 * lru_to_page() does not potentially read garbage.
> +	 */ 
> +	if (list_empty(list))
> +		return;

There's another check below and list_for_each_entry_safe() is fine with an
empty list, so this one seems unnecessary?

>  	/* Prepare pages for freeing */
>  	list_for_each_entry_safe(page, next, list, lru) {
>  		unsigned long pfn = page_to_pfn(page);
> @@ -3489,8 +3543,31 @@ void free_unref_page_list(struct list_head *list)
>  		}
>  	}
>  
> +	/*
> +	 * Preparation could have drained the list due to failing to prepare
> +	 * or all pages are being isolated.
> +	 */
> +	if (list_empty(list))
> +		return;
> +
>  	local_lock_irqsave(&pagesets.lock, flags);
> +
> +	page = lru_to_page(list);