linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Nicholas Piggin <npiggin@gmail.com>
To: linux-kernel@vger.kernel.org,
	Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Anton Blanchard <anton@ozlabs.org>,
	linux-mm@kvack.org, Ingo Molnar <mingo@kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH v2] Increase page and bit waitqueue hash size
Date: Wed, 17 Mar 2021 20:44:41 +1000	[thread overview]
Message-ID: <1615977781.fijkhk7ep5.astroid@bobo.none> (raw)
In-Reply-To: <89cb49c0-2736-dd4c-f401-4b88c22cc11f@rasmusvillemoes.dk>

Excerpts from Rasmus Villemoes's message of March 17, 2021 8:12 pm:
> On 17/03/2021 08.54, Nicholas Piggin wrote:
> 
>> +#if CONFIG_BASE_SMALL
>> +static const unsigned int page_wait_table_bits = 4;
>>  static wait_queue_head_t page_wait_table[PAGE_WAIT_TABLE_SIZE] __cacheline_aligned;
> 
>>  
>> +	if (!CONFIG_BASE_SMALL) {
>> +		page_wait_table = alloc_large_system_hash("page waitqueue hash",
>> +							sizeof(wait_queue_head_t),
>> +							0,
> 
> So, how does the compiler not scream at you for assigning to an array,
> even if it's inside an if (0)?
> 

Argh, because I didn't test small. Sorry I had the BASE_SMALL setting in 
another patch and thought it would be a good idea to mash them together. 
In hindsight probably not even if it did build.

Thanks,
Nick

--
[PATCH v3] Increase page and bit waitqueue hash size

The page waitqueue hash is a bit small (256 entries) on very big systems. A
16 socket 1536 thread POWER9 system was found to encounter hash collisions
and excessive time in waitqueue locking at times. This was intermittent and
hard to reproduce easily with the setup we had (very little real IO
capacity). The theory is that sometimes (depending on allocation luck)
important pages would happen to collide a lot in the hash, slowing down page
locking, causing the problem to snowball.

An small test case was made where threads would write and fsync different
pages, generating just a small amount of contention across many pages.

Increasing page waitqueue hash size to 262144 entries increased throughput
by 182% while also reducing standard deviation 3x. perf before the increase:

  36.23%  [k] _raw_spin_lock_irqsave                -      -
              |
              |--34.60%--wake_up_page_bit
              |          0
              |          iomap_write_end.isra.38
              |          iomap_write_actor
              |          iomap_apply
              |          iomap_file_buffered_write
              |          xfs_file_buffered_aio_write
              |          new_sync_write

  17.93%  [k] native_queued_spin_lock_slowpath      -      -
              |
              |--16.74%--_raw_spin_lock_irqsave
              |          |
              |           --16.44%--wake_up_page_bit
              |                     iomap_write_end.isra.38
              |                     iomap_write_actor
              |                     iomap_apply
              |                     iomap_file_buffered_write
              |                     xfs_file_buffered_aio_write

This patch uses alloc_large_system_hash to allocate a bigger system hash
that scales somewhat with memory size. The bit/var wait-queue is also
changed to keep code matching, albiet with a smaller scale factor.

This hash could be made per-node, which may help reduce remote accesses
on well localised workloads, but that adds some complexity with indexing
and hotplug, so until we get a less artificial workload to test with,
keep it simple.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 kernel/sched/wait_bit.c | 25 ++++++++++++++++++-------
 mm/filemap.c            | 16 ++++++++++++++--
 2 files changed, 32 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/wait_bit.c b/kernel/sched/wait_bit.c
index 02ce292b9bc0..3cc5fa552516 100644
--- a/kernel/sched/wait_bit.c
+++ b/kernel/sched/wait_bit.c
@@ -2,19 +2,20 @@
 /*
  * The implementation of the wait_bit*() and related waiting APIs:
  */
+#include <linux/memblock.h>
 #include "sched.h"
 
-#define WAIT_TABLE_BITS 8
-#define WAIT_TABLE_SIZE (1 << WAIT_TABLE_BITS)
-
-static wait_queue_head_t bit_wait_table[WAIT_TABLE_SIZE] __cacheline_aligned;
+#define BIT_WAIT_TABLE_SIZE (1 << BIT_WAIT_TABLE_BITS)
+#define BIT_WAIT_TABLE_BITS bit_wait_table_bits
+static unsigned int bit_wait_table_bits __ro_after_init;
+static wait_queue_head_t *bit_wait_table __ro_after_init;
 
 wait_queue_head_t *bit_waitqueue(void *word, int bit)
 {
 	const int shift = BITS_PER_LONG == 32 ? 5 : 6;
 	unsigned long val = (unsigned long)word << shift | bit;
 
-	return bit_wait_table + hash_long(val, WAIT_TABLE_BITS);
+	return bit_wait_table + hash_long(val, BIT_WAIT_TABLE_BITS);
 }
 EXPORT_SYMBOL(bit_waitqueue);
 
@@ -152,7 +153,7 @@ EXPORT_SYMBOL(wake_up_bit);
 
 wait_queue_head_t *__var_waitqueue(void *p)
 {
-	return bit_wait_table + hash_ptr(p, WAIT_TABLE_BITS);
+	return bit_wait_table + hash_ptr(p, BIT_WAIT_TABLE_BITS);
 }
 EXPORT_SYMBOL(__var_waitqueue);
 
@@ -246,6 +247,16 @@ void __init wait_bit_init(void)
 {
 	int i;
 
-	for (i = 0; i < WAIT_TABLE_SIZE; i++)
+	bit_wait_table = alloc_large_system_hash("bit waitqueue hash",
+						sizeof(wait_queue_head_t),
+						0,
+						22,
+						0,
+						&bit_wait_table_bits,
+						NULL,
+						0,
+						0);
+
+	for (i = 0; i < BIT_WAIT_TABLE_SIZE; i++)
 		init_waitqueue_head(bit_wait_table + i);
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index 43700480d897..df5a3ef4031b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -34,6 +34,7 @@
 #include <linux/security.h>
 #include <linux/cpuset.h>
 #include <linux/hugetlb.h>
+#include <linux/memblock.h>
 #include <linux/memcontrol.h>
 #include <linux/cleancache.h>
 #include <linux/shmem_fs.h>
@@ -990,9 +991,10 @@ EXPORT_SYMBOL(__page_cache_alloc);
  * at a cost of "thundering herd" phenomena during rare hash
  * collisions.
  */
-#define PAGE_WAIT_TABLE_BITS 8
 #define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)
-static wait_queue_head_t page_wait_table[PAGE_WAIT_TABLE_SIZE] __cacheline_aligned;
+#define PAGE_WAIT_TABLE_BITS page_wait_table_bits
+static unsigned int page_wait_table_bits __ro_after_init;
+static wait_queue_head_t *page_wait_table __ro_after_init;
 
 static wait_queue_head_t *page_waitqueue(struct page *page)
 {
@@ -1003,6 +1005,16 @@ void __init pagecache_init(void)
 {
 	int i;
 
+	page_wait_table = alloc_large_system_hash("page waitqueue hash",
+						sizeof(wait_queue_head_t),
+						0,
+						21,
+						0,
+						&page_wait_table_bits,
+						NULL,
+						0,
+						0);
+
 	for (i = 0; i < PAGE_WAIT_TABLE_SIZE; i++)
 		init_waitqueue_head(&page_wait_table[i]);
 
-- 
2.23.0

> Rasmus
> 

  reply	other threads:[~2021-03-17 10:45 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-17  7:54 [PATCH v2] Increase page and bit waitqueue hash size Nicholas Piggin
2021-03-17  8:38 ` Ingo Molnar
2021-03-17 10:02   ` Nicholas Piggin
2021-03-17 10:12 ` Rasmus Villemoes
2021-03-17 10:44   ` Nicholas Piggin [this message]
2021-03-17 19:26     ` Linus Torvalds
2021-03-17 22:22       ` Nicholas Piggin
2021-03-17 23:13         ` Linus Torvalds
2021-03-17 11:25 ` kernel test robot
2021-03-17 11:30 ` kernel test robot
2021-03-17 12:38 ` [tip: sched/core] sched/wait_bit, mm/filemap: " tip-bot2 for Nicholas Piggin
2021-03-17 15:16   ` Thomas Gleixner
2021-03-17 19:54     ` Ingo Molnar

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1615977781.fijkhk7ep5.astroid@bobo.none \
    --to=npiggin@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=anton@ozlabs.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux@rasmusvillemoes.dk \
    --cc=mingo@kernel.org \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).