Re: [PATCH v2] Increase page and bit waitqueue hash size

From: Ingo Molnar <mingo@kernel.org>
To: Nicholas Piggin <npiggin@gmail.com>
Cc: linux-kernel@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	linux-mm@kvack.org, Anton Blanchard <anton@ozlabs.org>
Subject: Re: [PATCH v2] Increase page and bit waitqueue hash size
Date: Wed, 17 Mar 2021 09:38:30 +0100
Message-ID: <20210317083830.GC3881262@gmail.com>
In-Reply-To: <20210317075427.587806-1-npiggin@gmail.com>

* Nicholas Piggin <npiggin@gmail.com> wrote:

> The page waitqueue hash is a bit small (256 entries) on very big systems. A
> 16 socket 1536 thread POWER9 system was found to encounter hash collisions
> and excessive time in waitqueue locking at times. This was intermittent and
> hard to reproduce easily with the setup we had (very little real IO
> capacity). The theory is that sometimes (depending on allocation luck)
> important pages would happen to collide a lot in the hash, slowing down page
> locking, causing the problem to snowball.
> 
> A small test case was made where threads would write and fsync different
> pages, generating just a small amount of contention across many pages.
> 
> Increasing page waitqueue hash size to 262144 entries increased throughput
> by 182% while also reducing standard deviation 3x. perf before the increase:
> 
>   36.23%  [k] _raw_spin_lock_irqsave                -      -
>               |
>               |--34.60%--wake_up_page_bit
>               |          0
>               |          iomap_write_end.isra.38
>               |          iomap_write_actor
>               |          iomap_apply
>               |          iomap_file_buffered_write
>               |          xfs_file_buffered_aio_write
>               |          new_sync_write
> 
>   17.93%  [k] native_queued_spin_lock_slowpath      -      -
>               |
>               |--16.74%--_raw_spin_lock_irqsave
>               |          |
>               |           --16.44%--wake_up_page_bit
>               |                     iomap_write_end.isra.38
>               |                     iomap_write_actor
>               |                     iomap_apply
>               |                     iomap_file_buffered_write
>               |                     xfs_file_buffered_aio_write
> 
> This patch uses alloc_large_system_hash to allocate a bigger system hash
> that scales somewhat with memory size. The bit/var wait-queue is also
> changed to keep code matching, albeit with a smaller scale factor.
> 
> A very small CONFIG_BASE_SMALL option is also added because these are two
> of the biggest static objects in the image on very small systems.
> 
> This hash could be made per-node, which may help reduce remote accesses
> on well localised workloads, but that adds some complexity with indexing
> and hotplug, so until we get a less artificial workload to test with,
> keep it simple.
> 
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
>  kernel/sched/wait_bit.c | 30 +++++++++++++++++++++++-------
>  mm/filemap.c            | 24 +++++++++++++++++++++---
>  2 files changed, 44 insertions(+), 10 deletions(-)
> 
> diff --git a/kernel/sched/wait_bit.c b/kernel/sched/wait_bit.c
> index 02ce292b9bc0..dba73dec17c4 100644
> --- a/kernel/sched/wait_bit.c
> +++ b/kernel/sched/wait_bit.c
> @@ -2,19 +2,24 @@
>  /*
>   * The implementation of the wait_bit*() and related waiting APIs:
>   */
> +#include <linux/memblock.h>
>  #include "sched.h"
>  
> -#define WAIT_TABLE_BITS 8
> -#define WAIT_TABLE_SIZE (1 << WAIT_TABLE_BITS)

Ugh, 256 entries is almost embarrassingly small indeed.

I've put your patch into sched/core, unless Andrew objects.

> -	for (i = 0; i < WAIT_TABLE_SIZE; i++)
> +	if (!CONFIG_BASE_SMALL) {
> +		bit_wait_table = alloc_large_system_hash("bit waitqueue hash",
> +							sizeof(wait_queue_head_t),
> +							0,
> +							22,
> +							0,
> +							&bit_wait_table_bits,
> +							NULL,
> +							0,
> +							0);
> +	}
> +	for (i = 0; i < BIT_WAIT_TABLE_SIZE; i++)
>  		init_waitqueue_head(bit_wait_table + i);

Meta suggestion: maybe the CONFIG_BASE_SMALL ugliness could be folded 
into alloc_large_system_hash() itself?
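
For a feel of what the scale argument of 22 in the quoted call works out 
to, here is a rough userspace sketch. It assumes the table gets about one 
bucket per (1 << scale) bytes of memory, rounded up to a power of two. 
The model_*() names are made up for illustration, and the real 
alloc_large_system_hash() applies further limits that this ignores:

```c
#include <assert.h>
#include <stdint.h>

/* Round n up to the next power of two (hypothetical helper, n >= 1). */
static inline uint64_t model_roundup_pow2(uint64_t n)
{
	uint64_t p = 1;

	assert(n >= 1 && n <= (UINT64_C(1) << 63));
	while (p < n)
		p <<= 1;
	return p;
}

/*
 * Simplified model, not the kernel code: one bucket per (1 << scale)
 * bytes of memory, at least one bucket, rounded up to a power of two.
 */
static inline uint64_t model_hash_entries(uint64_t mem_bytes, unsigned int scale)
{
	uint64_t n;

	assert(scale < 64);
	n = mem_bytes >> scale;
	if (n == 0)
		n = 1;
	return model_roundup_pow2(n);
}
```

Under those assumptions a 1 GiB machine would land back at 256 buckets, 
and the table only grows on systems larger than that.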

> --- a/mm/filemap.c
> +++ b/mm/filemap.c

>  static wait_queue_head_t *page_waitqueue(struct page *page)
>  {
> -	return &page_wait_table[hash_ptr(page, PAGE_WAIT_TABLE_BITS)];
> +	return &page_wait_table[hash_ptr(page, page_wait_table_bits)];
>  }

I'm wondering whether you've tried to make this NUMA aware through 
page->node?

Seems like another useful step when having a global hash ...
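
Something along these lines, as a userspace sketch only. The layout (one 
power-of-two sub-table per node, laid out back to back) and the names 
model_hash_ptr() and numa_wait_index() are hypothetical. The hash is 
modeled on the kernel's 64-bit multiplicative hash, and the node id is 
passed in explicitly here rather than looked up from the page:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit golden-ratio multiplier, used here as a userspace stand-in. */
#define MODEL_GOLDEN_RATIO_64 0x61C8864680B583EBull

/* Hash a pointer down to 'bits' bits by keeping the top bits of the product. */
static inline unsigned int model_hash_ptr(const void *ptr, unsigned int bits)
{
	assert(bits >= 1 && bits <= 32);
	return (unsigned int)(((uint64_t)(uintptr_t)ptr * MODEL_GOLDEN_RATIO_64)
			      >> (64 - bits));
}

/*
 * Hypothetical layout: nr_nodes sub-tables of (1 << bits_per_node)
 * buckets each. The node id picks the sub-table, the hash picks the
 * bucket inside it, so pages on one node never share buckets with
 * pages on another.
 */
static inline size_t numa_wait_index(const void *page, unsigned int nid,
				     unsigned int nr_nodes,
				     unsigned int bits_per_node)
{
	assert(nid < nr_nodes);
	return ((size_t)nid << bits_per_node) +
	       model_hash_ptr(page, bits_per_node);
}
```

This leaves out the hotplug and indexing complexity you mention in the 
changelog, so it is only meant to show the shape of the lookup.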

Thanks,

	Ingo