From: "Yin, Fengwei" <fengwei.yin@intel.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: <linux-mm@kvack.org>, <akpm@linux-foundation.org>,
	<kirill@shutemov.name>, <yuzhao@google.com>,
	<ryan.roberts@arm.com>, <ying.huang@intel.com>
Subject: Re: [PATCH v3 2/2] lru: allow large batched add large folio to lru list
Date: Mon, 15 May 2023 10:14:51 +0800	[thread overview]
Message-ID: <c078f73e-ffb4-e6d2-425e-8803c0243092@intel.com> (raw)
In-Reply-To: <8d4f938e-4f0a-bb97-3890-910b5838d6f5@intel.com>

Hi Matthew,

On 5/5/2023 1:51 PM, Yin, Fengwei wrote:
> Hi Matthew,
> 
> On 4/30/2023 6:35 AM, Matthew Wilcox wrote:
>> On Sat, Apr 29, 2023 at 04:27:59PM +0800, Yin Fengwei wrote:
>>> @@ -22,6 +23,7 @@ struct address_space;
>>>  struct pagevec {
>>>  	unsigned char nr;
>>>  	bool percpu_pvec_drained;
>>> +	unsigned short nr_pages;
>>
>> I still don't like storing nr_pages in the pagevec/folio_batch.
>>
> 
> What about a change like the following?
Soft ping.


Regards
Yin, Fengwei

> 
> diff --git a/mm/swap.c b/mm/swap.c
> index 57cb01b042f6..5e7e9c0734ab 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -228,8 +228,10 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
>  static void folio_batch_add_and_move(struct folio_batch *fbatch,
>                 struct folio *folio, move_fn_t move_fn)
>  {
> -       if (folio_batch_add(fbatch, folio) && !folio_test_large(folio) &&
> -           !lru_cache_disabled())
> +       int nr_pages = folio_nr_pages(folio);
> +
> +       if (folio_batch_add(fbatch, folio) && !lru_cache_disabled() &&
> +           (!folio_test_large(folio) || (nr_pages <= (PAGEVEC_SIZE + 1))))
>                 return;
>         folio_batch_move_lru(fbatch, move_fn);
>  }
> 
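> To make the effect of the new condition concrete, here is a minimal
> userspace sketch (not kernel code; PAGEVEC_SIZE, PAGE_SIZE and the
> helper below are simplified stand-ins for the real definitions). It
> prints, per folio order, whether folio_batch_add_and_move() would keep
> the folio in the per-CPU batch or drain the batch immediately:
> 
> /* Build with: gcc -o batch_sketch batch_sketch.c */
> #include <stdbool.h>
> #include <stdio.h>
> 
> #define PAGEVEC_SIZE	15
> #define PAGE_SIZE	4096
> 
> /*
>  * Mirror of the proposed condition: keep batching unless the folio is
>  * a large folio with more than PAGEVEC_SIZE + 1 pages (the "batch is
>  * full" and lru_cache_disabled() cases are ignored here).
>  */
> static bool keeps_batching(unsigned int order)
> {
> 	unsigned int nr_pages = 1u << order;
> 
> 	return order == 0 || nr_pages <= PAGEVEC_SIZE + 1;
> }
> 
> int main(void)
> {
> 	for (unsigned int order = 0; order <= 9; order++) {
> 		unsigned int nr_pages = 1u << order;
> 
> 		printf("order %u (%4u KB): %s\n", order,
> 		       nr_pages * PAGE_SIZE / 1024,
> 		       keeps_batching(order) ?
> 		       "kept in per-CPU batch" :
> 		       "batch drained immediately (takes lru lock)");
> 	}
> 	return 0;
> }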
> 
> I tested the lru lock contention with different folio sizes using
> will-it-scale, with the deferred queue lock contention already mitigated:
>   - If the large folio size is 16K (order 2), the lru lock takes 64.31% of cpu runtime
>   - If the large folio size is 64K (order 4), the lru lock takes 24.24% of cpu runtime
> This matches our expectation: the larger the folio, the lower the lru
> lock contention.
> 
> It's acceptable to skip the batched operation for folios that are large
> enough. PAGEVEC_SIZE + 1 is chosen here for the following reasons (see
> the sketch after this list):
>   - acceptable max memory size per batch: 15 x 16 x 4096 = 983040 bytes
>   - folios larger than that will not use the batched operation, but
>     their lru lock contention is already low anyway.
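> 
> As a rough illustration of that trade-off (assuming 4 KB base pages and
> PAGEVEC_SIZE = 15, both simplified stand-ins for the real definitions),
> this small program prints the worst-case memory covered by one full
> per-CPU batch for a few candidate per-folio page limits; the chosen
> limit of 16 pages gives the 983040 bytes quoted above:
> 
> #include <stdio.h>
> 
> #define PAGEVEC_SIZE	15
> #define PAGE_SIZE	4096UL
> 
> int main(void)
> {
> 	/* Candidate limits on pages per batched folio. */
> 	unsigned int limits[] = { 1, 4, 16, 64 };
> 
> 	for (unsigned int i = 0; i < sizeof(limits) / sizeof(limits[0]); i++) {
> 		unsigned long bytes = PAGEVEC_SIZE * limits[i] * PAGE_SIZE;
> 
> 		printf("up to %2u pages per folio: %8lu bytes per full batch\n",
> 		       limits[i], bytes);
> 	}
> 	return 0;
> }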
> 
> 
> I collected lru contention data when running will-it-scale.page_fault1:
> 
> folio with order 2:
>   Without the change:
>   -   64.31%     0.23%  page_fault1_pro  [kernel.kallsyms]           [k] folio_lruvec_lock_irqsave
>      + 64.07% folio_lruvec_lock_irqsave
> 
>   With the change:
>   -   21.55%     0.21%  page_fault1_pro  [kernel.kallsyms]           [k] folio_lruvec_lock_irqsave
>      + 21.34% folio_lruvec_lock_irqsave
> 
> folio with order 4:
>   Without the change:
>   -   24.24%     0.15%  page_fault1_pro  [kernel.kallsyms]           [k] folio_lruvec_lock_irqsave
>      + 24.09% folio_lruvec_lock_irqsave
> 
>   With the change:
>   -   2.20%     0.09%  page_fault1_pro  [kernel.kallsyms]            [k] folio_lruvec_lock_irqsave
>      + 2.11% folio_lruvec_lock_irqsave
> 
> folio with order 5:
>   -   8.21%     0.16%  page_fault1_pro  [kernel.kallsyms]  [k] folio_lruvec_lock_irqsave
>      + 8.05% folio_lruvec_lock_irqsave
> 
> 
> Regards
> Yin, Fengwei
> 


Thread overview: 15+ messages
2023-04-29  8:27 [PATCH v3 0/2] Reduce lock contention related with large folio Yin Fengwei
2023-04-29  8:27 ` [PATCH v3 1/2] THP: avoid lock when check whether THP is in deferred list Yin Fengwei
2023-05-04 11:48   ` kirill
2023-05-05  1:09     ` Yin, Fengwei
2023-05-29  2:58     ` Yin Fengwei
2023-05-05  0:52   ` Huang, Ying
2023-05-05  1:09     ` Yin, Fengwei
2023-04-29  8:27 ` [PATCH v3 2/2] lru: allow large batched add large folio to lru list Yin Fengwei
2023-04-29 22:35   ` Matthew Wilcox
2023-05-01  5:52     ` Yin, Fengwei
2023-05-05  5:51     ` Yin, Fengwei
2023-05-15  2:14       ` Yin, Fengwei [this message]
2023-06-20  3:22   ` Matthew Wilcox
2023-06-20  4:39     ` Yin Fengwei
2023-06-20  8:01     ` Yin Fengwei
