linux-kernel.vger.kernel.org archive mirror
From: Ingo Molnar <mingo@kernel.org>
To: Nicholas Piggin <npiggin@gmail.com>
Cc: linux-kernel@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	linux-mm@kvack.org, Anton Blanchard <anton@ozlabs.org>
Subject: Re: [PATCH v2] Increase page and bit waitqueue hash size
Date: Wed, 17 Mar 2021 09:38:30 +0100
Message-ID: <20210317083830.GC3881262@gmail.com>
In-Reply-To: <20210317075427.587806-1-npiggin@gmail.com>


* Nicholas Piggin <npiggin@gmail.com> wrote:

> The page waitqueue hash is a bit small (256 entries) on very big systems. A
> 16 socket 1536 thread POWER9 system was found to encounter hash collisions
> and excessive time in waitqueue locking at times. This was intermittent and
> hard to reproduce easily with the setup we had (very little real IO
> capacity). The theory is that sometimes (depending on allocation luck)
> important pages would happen to collide a lot in the hash, slowing down page
> locking, causing the problem to snowball.
> 
> A small test case was made where threads would write and fsync different
> pages, generating just a small amount of contention across many pages.
> 
> Increasing page waitqueue hash size to 262144 entries increased throughput
> by 182% while also reducing standard deviation 3x. perf before the increase:
> 
>   36.23%  [k] _raw_spin_lock_irqsave                -      -
>               |
>               |--34.60%--wake_up_page_bit
>               |          0
>               |          iomap_write_end.isra.38
>               |          iomap_write_actor
>               |          iomap_apply
>               |          iomap_file_buffered_write
>               |          xfs_file_buffered_aio_write
>               |          new_sync_write
> 
>   17.93%  [k] native_queued_spin_lock_slowpath      -      -
>               |
>               |--16.74%--_raw_spin_lock_irqsave
>               |          |
>               |           --16.44%--wake_up_page_bit
>               |                     iomap_write_end.isra.38
>               |                     iomap_write_actor
>               |                     iomap_apply
>               |                     iomap_file_buffered_write
>               |                     xfs_file_buffered_aio_write
> 
> This patch uses alloc_large_system_hash to allocate a bigger system hash
> that scales somewhat with memory size. The bit/var wait-queue is also
> changed to keep code matching, albeit with a smaller scale factor.
> 
> A very small CONFIG_BASE_SMALL option is also added because these are two
> of the biggest static objects in the image on very small systems.
> 
> This hash could be made per-node, which may help reduce remote accesses
> on well localised workloads, but that adds some complexity with indexing
> and hotplug, so until we get a less artificial workload to test with,
> keep it simple.
> 
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
>  kernel/sched/wait_bit.c | 30 +++++++++++++++++++++++-------
>  mm/filemap.c            | 24 +++++++++++++++++++++---
>  2 files changed, 44 insertions(+), 10 deletions(-)
> 
> diff --git a/kernel/sched/wait_bit.c b/kernel/sched/wait_bit.c
> index 02ce292b9bc0..dba73dec17c4 100644
> --- a/kernel/sched/wait_bit.c
> +++ b/kernel/sched/wait_bit.c
> @@ -2,19 +2,24 @@
>  /*
>   * The implementation of the wait_bit*() and related waiting APIs:
>   */
> +#include <linux/memblock.h>
>  #include "sched.h"
>  
> -#define WAIT_TABLE_BITS 8
> -#define WAIT_TABLE_SIZE (1 << WAIT_TABLE_BITS)

Ugh, 256 entries is almost embarrassingly small indeed.

I've put your patch into sched/core, unless Andrew objects.
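
To put a number on why 256 entries hurts on a machine of that size, here is
a rough userspace sketch (not kernel code). It hashes fake struct page
addresses with the same multiplicative scheme as the kernel's hash_64() and
reports the worst bucket load. The base address, the 64-byte stride and the
helper names hash_u64() and max_bucket_load() are assumptions made for
illustration only.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Same multiplier as the kernel's GOLDEN_RATIO_64 (include/linux/hash.h). */
#define GOLDEN_RATIO_64 0x61C8864680B583EBull

/* Userspace stand-in for hash_64(): keep the top `bits` bits of the product. */
static unsigned int hash_u64(uint64_t val, unsigned int bits)
{
	assert(bits >= 1 && bits <= 32);
	return (unsigned int)((val * GOLDEN_RATIO_64) >> (64 - bits));
}

/*
 * Hash `nr` fake struct-page addresses into a table of 2^bits buckets and
 * return the largest number of addresses that share one bucket.
 * The base address and the 64-byte stride are illustrative assumptions.
 */
static unsigned int max_bucket_load(unsigned int nr, unsigned int bits)
{
	size_t size = (size_t)1 << bits;
	unsigned int *load = calloc(size, sizeof(*load));
	unsigned int i, max = 0;

	assert(load);
	for (i = 0; i < nr; i++) {
		uint64_t addr = 0xffffea0000000000ull + (uint64_t)i * 64;
		unsigned int b = hash_u64(addr, bits);

		if (++load[b] > max)
			max = load[b];
	}
	free(load);
	return max;
}
```

If each of the 1536 threads waits on a different page, max_bucket_load(1536, 8)
cannot be lower than 6 (1536 / 256), however good the hash is. Calling it with
18 bits shows how the same pages spread over 262144 buckets.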

> -	for (i = 0; i < WAIT_TABLE_SIZE; i++)
> +	if (!CONFIG_BASE_SMALL) {
> +		bit_wait_table = alloc_large_system_hash("bit waitqueue hash",
> +							sizeof(wait_queue_head_t),
> +							0,
> +							22,
> +							0,
> +							&bit_wait_table_bits,
> +							NULL,
> +							0,
> +							0);
> +	}
> +	for (i = 0; i < BIT_WAIT_TABLE_SIZE; i++)
>  		init_waitqueue_head(bit_wait_table + i);


Meta suggestion: maybe the CONFIG_BASE_SMALL ugliness could be folded 
into alloc_large_system_hash() itself?
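
A userspace sketch of what that folding might look like. It is only a model:
alloc_hash_model(), its base_small argument and the 2^3 fallback size are
hypothetical, and calloc() stands in for the real boot-time allocator. The
point is that the caller asks for a size and gets back whatever the helper
decided, with no branch at the call site.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical table size (2^3 entries) for a CONFIG_BASE_SMALL-style build. */
#define MODEL_SMALL_BITS 3

/*
 * Hypothetical wrapper, modelled in userspace with calloc(): the small-system
 * decision is made here, so callers carry no "if (!base_small)" branch.
 * On success the chosen size is written to *bits_out.
 */
static void *alloc_hash_model(size_t bucket_size, unsigned int want_bits,
			      int base_small, unsigned int *bits_out)
{
	unsigned int bits = base_small ? MODEL_SMALL_BITS : want_bits;
	void *table;

	assert(bucket_size > 0 && bits < 31);
	table = calloc((size_t)1 << bits, bucket_size);
	if (table)
		*bits_out = bits;
	return table;
}
```

With this shape, both the bit waitqueue and the page waitqueue call sites
would reduce to one allocation call followed by the init loop.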

> --- a/mm/filemap.c
> +++ b/mm/filemap.c

>  static wait_queue_head_t *page_waitqueue(struct page *page)
>  {
> -	return &page_wait_table[hash_ptr(page, PAGE_WAIT_TABLE_BITS)];
> +	return &page_wait_table[hash_ptr(page, page_wait_table_bits)];
>  }

I'm wondering whether you've tried to make this NUMA aware through 
page->node?

Seems like another useful step when having a global hash ...
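
Roughly what I have in mind, as a userspace model rather than a patch: one
table per node, selected first by node id and then by the usual pointer hash.
In the kernel the node id would come from page_to_nid(page). Everything else
here (struct wq_model, MODEL_MAX_NODES, the function names) is made up for
illustration, and hotplug is ignored entirely.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define GOLDEN_RATIO_64 0x61C8864680B583EBull
#define MODEL_MAX_NODES 4	/* hypothetical node count */

struct wq_model { int nr_waiters; };	/* stand-in for wait_queue_head_t */

struct node_wait_tables {
	struct wq_model *table[MODEL_MAX_NODES];
	unsigned int bits;	/* each node table has 2^bits entries */
};

/* Allocate one zeroed table per node; returns 0 on success, -1 on failure. */
static int node_wait_tables_init(struct node_wait_tables *t, unsigned int bits)
{
	int nid;

	assert(bits >= 1 && bits <= 20);
	t->bits = bits;
	for (nid = 0; nid < MODEL_MAX_NODES; nid++) {
		t->table[nid] = calloc((size_t)1 << bits, sizeof(struct wq_model));
		if (!t->table[nid]) {
			while (nid-- > 0)
				free(t->table[nid]);
			return -1;
		}
	}
	return 0;
}

/* Pick the node's table, then index it by the multiplicative pointer hash. */
static struct wq_model *page_waitqueue_model(struct node_wait_tables *t,
					     int nid, const void *page)
{
	uint64_t h = (uint64_t)(uintptr_t)page * GOLDEN_RATIO_64;

	assert(nid >= 0 && nid < MODEL_MAX_NODES);
	return &t->table[nid][h >> (64 - t->bits)];
}
```

Waiters and wakers for a page then only touch memory on that page's node.
The cost is the extra indirection plus the hotplug handling you mention.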

Thanks,

	Ingo
