Re: [PATCH/RFC] mm: add and use batched version of __tlb_remove_table()

* Re: [PATCH/RFC] mm: add and use batched version of __tlb_remove_table()
       [not found] <20211217081909.596413-1-nikita.yushchenko@virtuozzo.com>
@ 2021-12-17 18:26 ` Dave Hansen
  2021-12-18 14:31   ` Nikita Yushchenko
  2021-12-17 18:39 ` Sam Ravnborg
  2021-12-18  0:37 ` Peter Zijlstra
  2 siblings, 1 reply; 8+ messages in thread
From: Dave Hansen @ 2021-12-17 18:26 UTC (permalink / raw)
  To: Nikita Yushchenko, Will Deacon, Aneesh Kumar K.V, Andrew Morton,
	Nick Piggin, Peter Zijlstra, Catalin Marinas, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, David S. Miller,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Arnd Bergmann
  Cc: x86, linux-kernel, linux-arch, linux-mm, linuxppc-dev,
	linux-s390, sparclinux, kernel

On 12/17/21 12:19 AM, Nikita Yushchenko wrote:
> When batched page table freeing via struct mmu_table_batch is used, the
> final freeing in __tlb_remove_table_free() executes a loop, calling
> arch hook __tlb_remove_table() to free each table individually.
> 
> Shift that loop down to archs. This allows archs to optimize it, by
> freeing multiple tables in a single release_pages() call. This is
> faster than individual put_page() calls, especially with memcg
> accounting enabled.

Could we quantify "faster"?  There's a non-trivial amount of code being
added here and it would be nice to back it up with some cold-hard numbers.

> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -95,11 +95,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_
>  
>  static void __tlb_remove_table_free(struct mmu_table_batch *batch)
>  {
> -	int i;
> -
> -	for (i = 0; i < batch->nr; i++)
> -		__tlb_remove_table(batch->tables[i]);
> -
> +	__tlb_remove_tables(batch->tables, batch->nr);
>  	free_page((unsigned long)batch);
>  }

This leaves a single call-site for __tlb_remove_table():

> static void tlb_remove_table_one(void *table)
> {
>         tlb_remove_table_sync_one();
>         __tlb_remove_table(table);
> }

Is that worth it, or could it just be:

	__tlb_remove_tables(&table, 1);

?

> -void free_pages_and_swap_cache(struct page **pages, int nr)
> +static void __free_pages_and_swap_cache(struct page **pages, int nr,
> +		bool do_lru)
>  {
> -	struct page **pagep = pages;
>  	int i;
>  
> -	lru_add_drain();
> +	if (do_lru)
> +		lru_add_drain();
>  	for (i = 0; i < nr; i++)
> -		free_swap_cache(pagep[i]);
> -	release_pages(pagep, nr);
> +		free_swap_cache(pages[i]);
> +	release_pages(pages, nr);
> +}
> +
> +void free_pages_and_swap_cache(struct page **pages, int nr)
> +{
> +	__free_pages_and_swap_cache(pages, nr, true);
> +}
> +
> +void free_pages_and_swap_cache_nolru(struct page **pages, int nr)
> +{
> +	__free_pages_and_swap_cache(pages, nr, false);
>  }

This went unmentioned in the changelog.  But, it seems like there's a
specific optimization here.  In the exiting code,
free_pages_and_swap_cache() is wasteful if no page in pages[] is on the
LRU.  It doesn't need the lru_add_drain().

Any code that knows it is freeing all non-LRU pages can call
free_pages_and_swap_cache_nolru() which should perform better than
free_pages_and_swap_cache().

Should we add this to the for loop in __free_pages_and_swap_cache()?

	for (i = 0; i < nr; i++) {
		if (!do_lru)
			VM_WARN_ON_ONCE_PAGE(PageLRU(pagep[i]),
					     pagep[i]);
		free_swap_cache(...);
	}

But, even more than that, do all the architectures even need the
free_swap_cache()?  PageSwapCache() will always be false on x86, which
makes the loop kinda silly.  x86 could, for instance, just do:

static inline void __tlb_remove_tables(void **tables, int nr)
{
	release_pages((struct page **)tables, nr);
}

I _think_ this will work everywhere that has whole pages as page tables.
 Taking that one step further, what if we only had one generic:

static inline void tlb_remove_tables(void **tables, int nr)
{
	int i;

#ifdef ARCH_PAGE_TABLES_ARE_FULL_PAGE
	release_pages((struct page **)tables, nr);
#else
	arch_tlb_remove_tables(tables, i);
#endif
}

Architectures that set ARCH_PAGE_TABLES_ARE_FULL_PAGE (or whatever)
don't need to implement __tlb_remove_table() at all *and* can do
release_pages() directly.

This avoids all the  confusion with the swap cache and LRU naming.

^ permalink raw reply	[flat|nested] 8+ messages in thread