From: Ingo Molnar <mingo@kernel.org>
To: Nicholas Piggin <npiggin@gmail.com>
Cc: linux-kernel@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	linux-mm@kvack.org, Anton Blanchard <anton@ozlabs.org>
Subject: Re: [PATCH v2] Increase page and bit waitqueue hash size
Date: Wed, 17 Mar 2021 09:38:30 +0100
Message-ID: <20210317083830.GC3881262@gmail.com>
In-Reply-To: <20210317075427.587806-1-npiggin@gmail.com>


* Nicholas Piggin <npiggin@gmail.com> wrote:

> The page waitqueue hash is a bit small (256 entries) on very big systems. A
> 16 socket 1536 thread POWER9 system was found to encounter hash collisions
> and excessive time in waitqueue locking at times. This was intermittent and
> hard to reproduce easily with the setup we had (very little real IO
> capacity). The theory is that sometimes (depending on allocation luck)
> important pages would happen to collide a lot in the hash, slowing down page
> locking, causing the problem to snowball.
> 
> A small test case was made where threads would write and fsync different
> pages, generating just a small amount of contention across many pages.
> 
> Increasing page waitqueue hash size to 262144 entries increased throughput
> by 182% while also reducing standard deviation 3x. perf before the increase:
> 
>   36.23%  [k] _raw_spin_lock_irqsave                -      -
>               |
>               |--34.60%--wake_up_page_bit
>               |          0
>               |          iomap_write_end.isra.38
>               |          iomap_write_actor
>               |          iomap_apply
>               |          iomap_file_buffered_write
>               |          xfs_file_buffered_aio_write
>               |          new_sync_write
> 
>   17.93%  [k] native_queued_spin_lock_slowpath      -      -
>               |
>               |--16.74%--_raw_spin_lock_irqsave
>               |          |
>               |           --16.44%--wake_up_page_bit
>               |                     iomap_write_end.isra.38
>               |                     iomap_write_actor
>               |                     iomap_apply
>               |                     iomap_file_buffered_write
>               |                     xfs_file_buffered_aio_write
> 
> This patch uses alloc_large_system_hash to allocate a bigger system hash
> that scales somewhat with memory size. The bit/var wait-queue is also
> changed to keep code matching, albeit with a smaller scale factor.
> 
> A very small CONFIG_BASE_SMALL option is also added because these are two
> of the biggest static objects in the image on very small systems.
> 
> This hash could be made per-node, which may help reduce remote accesses
> on well localised workloads, but that adds some complexity with indexing
> and hotplug, so until we get a less artificial workload to test with,
> keep it simple.
> 
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
>  kernel/sched/wait_bit.c | 30 +++++++++++++++++++++++-------
>  mm/filemap.c            | 24 +++++++++++++++++++++---
>  2 files changed, 44 insertions(+), 10 deletions(-)
> 
> diff --git a/kernel/sched/wait_bit.c b/kernel/sched/wait_bit.c
> index 02ce292b9bc0..dba73dec17c4 100644
> --- a/kernel/sched/wait_bit.c
> +++ b/kernel/sched/wait_bit.c
> @@ -2,19 +2,24 @@
>  /*
>   * The implementation of the wait_bit*() and related waiting APIs:
>   */
> +#include <linux/memblock.h>
>  #include "sched.h"
>  
> -#define WAIT_TABLE_BITS 8
> -#define WAIT_TABLE_SIZE (1 << WAIT_TABLE_BITS)

Ugh, 256 entries is almost embarrassingly small indeed.
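
To put a rough number on "small", here is a back-of-the-envelope userspace
sketch (not kernel code). It assumes one waiter per hardware thread, each on
a distinct page, and a perfectly uniform hash; the function names are made up
for illustration. Under those assumptions about five in six of 1536 waiters
land in a bucket that is already occupied with 256 entries, versus only a
handful with 262144 entries:

```c
#include <assert.h>

/*
 * Expected number of non-empty buckets when n keys are hashed uniformly
 * and independently into m buckets: m * (1 - (1 - 1/m)^n).
 */
static double expected_used_buckets(unsigned long n, unsigned long m)
{
	double p_empty = 1.0;	/* probability that a given bucket stays empty */
	unsigned long i;

	assert(m > 0);
	for (i = 0; i < n; i++)
		p_empty *= 1.0 - 1.0 / (double)m;

	return (double)m * (1.0 - p_empty);
}

/* Expected number of keys that land in an already occupied bucket. */
static double expected_collisions(unsigned long n, unsigned long m)
{
	return (double)n - expected_used_buckets(n, m);
}
```

Calling expected_collisions(1536, 256) and expected_collisions(1536, 262144)
compares the old and new table sizes. Real page addresses are not uniformly
distributed, so treat this as an order-of-magnitude estimate only.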

I've put your patch into sched/core, unless Andrew objects.

> -	for (i = 0; i < WAIT_TABLE_SIZE; i++)
> +	if (!CONFIG_BASE_SMALL) {
> +		bit_wait_table = alloc_large_system_hash("bit waitqueue hash",
> +							sizeof(wait_queue_head_t),
> +							0,
> +							22,
> +							0,
> +							&bit_wait_table_bits,
> +							NULL,
> +							0,
> +							0);
> +	}
> +	for (i = 0; i < BIT_WAIT_TABLE_SIZE; i++)
>  		init_waitqueue_head(bit_wait_table + i);


Meta suggestion: maybe the CONFIG_BASE_SMALL ugliness could be folded 
into alloc_large_system_hash() itself?
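
Something along these lines is what I have in mind, as a userspace mock-up
only. SKETCH_BASE_SMALL, sketch_alloc_hash() and struct wait_head are all
hypothetical stand-ins, calloc() replaces the boot-time allocator, and none of
this is the real alloc_large_system_hash() interface. The point is just that
the caller passes its static fallback table and no longer needs its own
conditional:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for CONFIG_BASE_SMALL; not the real Kconfig symbol. */
#ifndef SKETCH_BASE_SMALL
#define SKETCH_BASE_SMALL 0
#endif

struct wait_head { int lock; };	/* placeholder for wait_queue_head_t */

#define SMALL_BITS 3
static struct wait_head small_table[1u << SMALL_BITS];

/*
 * Hypothetical helper: in "small" builds hand back the caller's static table,
 * otherwise allocate 2^want_bits zeroed entries. Returns NULL if the
 * allocation fails. *bits_out receives the table size in bits either way.
 */
static void *sketch_alloc_hash(size_t entry_size, unsigned int want_bits,
			       void *small_fallback, unsigned int small_bits,
			       unsigned int *bits_out)
{
	assert(bits_out != NULL);

	if (SKETCH_BASE_SMALL) {
		*bits_out = small_bits;
		return small_fallback;
	}

	*bits_out = want_bits;
	return calloc((size_t)1 << want_bits, entry_size);
}
```

A caller would then do a single
sketch_alloc_hash(sizeof(struct wait_head), 10, small_table, SMALL_BITS, &bits)
and loop over 1 << bits entries to initialize them, whichever table it got.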

> --- a/mm/filemap.c
> +++ b/mm/filemap.c

>  static wait_queue_head_t *page_waitqueue(struct page *page)
>  {
> -	return &page_wait_table[hash_ptr(page, PAGE_WAIT_TABLE_BITS)];
> +	return &page_wait_table[hash_ptr(page, page_wait_table_bits)];
>  }

I'm wondering whether you've tried to make this NUMA aware through 
page->node?

Seems like another useful step when having a global hash ...
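
Roughly like this, as a userspace illustration and not a proposal for the
actual layout. The sketch_* names, the fixed node count and the per-node table
size are assumptions, the node id is stored directly in a toy page struct, and
the hash is a multiplicative hash in the style of the kernel's hash_64(). A
real version would also have to deal with the indexing and hotplug
complications you mention:

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NODES		4	/* assumed fixed node count */
#define SKETCH_BITS_PER_NODE	6	/* 64 wait queues per node */

struct sketch_waitq { int dummy; };	/* placeholder for wait_queue_head_t */
struct sketch_page { int node; };	/* toy page that only carries its node id */

static struct sketch_waitq
	sketch_tables[SKETCH_NODES][1u << SKETCH_BITS_PER_NODE];

/* Multiplicative pointer hash: keep the top 'bits' bits of the product. */
static unsigned int sketch_hash_ptr(const void *p, unsigned int bits)
{
	uint64_t v = (uint64_t)(uintptr_t)p;

	assert(bits >= 1 && bits <= 32);
	return (unsigned int)((v * 0x61C8864680B583EBull) >> (64 - bits));
}

/* Pick the node-local table first, then hash within that table. */
static struct sketch_waitq *sketch_page_waitqueue(const struct sketch_page *page)
{
	assert(page->node >= 0 && page->node < SKETCH_NODES);
	return &sketch_tables[page->node]
			     [sketch_hash_ptr(page, SKETCH_BITS_PER_NODE)];
}
```

With this layout waiters on pages of one node never share a wait queue head
with pages of another node, so the lock cache line stays node-local for well
localised workloads.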

Thanks,

	Ingo
