From: Nicholas Piggin <npiggin@gmail.com>
To: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	agruenba@redhat.com, rpeterso@redhat.com,
	mgorman@techsingularity.net, peterz@infradead.org,
	luto@kernel.org, swhiteho@redhat.com,
	torvalds@linux-foundation.org
Subject: Re: [RFC][PATCH] make global bitlock waitqueues per-node
Date: Wed, 21 Dec 2016 22:30:56 +1000
Message-ID: <20161221223056.17c37dd6@roar.ozlabs.ibm.com>
In-Reply-To: <20161219225826.F8CB356F@viggo.jf.intel.com>

On Mon, 19 Dec 2016 14:58:26 -0800
Dave Hansen <dave.hansen@linux.intel.com> wrote:

> I saw a 4.8->4.9 regression (details below) that I attributed to:
> 
> 	9dcb8b685f mm: remove per-zone hashtable of bitlock waitqueues
> 
> That commit took the bitlock waitqueues from being dynamically-allocated
> per-zone to being statically allocated and global.  As suggested by
> Linus, this makes them per-node, but keeps them statically-allocated.
> 
> It leaves us with more waitqueues than the global approach, inherently
> scales it up as we gain nodes, and avoids generating code for
> page_zone() which was evidently quite ugly.  The patch is pretty darn
> tiny too.
> 
> This turns what was a ~40% 4.8->4.9 regression into a 17% gain over
> what 4.8 did.  That gain is a _bit_ surprising, but not entirely
> unexpected since we now get much simpler code from no page_zone() and a
> fixed-size array for which we don't have to follow a pointer (and get to
> do power-of-2 math).
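
To make the code-generation point concrete: with a statically sized table,
WAIT_TABLE_BITS is a compile-time constant, so the lookup is pure power-of-2
hash math into a fixed array. A rough before/after sketch follows (the
"before" shape only illustrates the per-zone scheme, it is not the exact
4.8 source):

	/* per-zone (4.8-style, illustrative only): pointer chase through a
	 * dynamically sized table whose hash width is loaded at runtime */
	zone = page_zone(page);
	wq = &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];

	/* per-node (this patch): static array, constant hash width */
	wq = &NODE_DATA(nid)->wait_table[hash_long(val, WAIT_TABLE_BITS)];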
> 
> This boots in a small VM and on a multi-node NUMA system, but has not
> been tested widely.  It definitely needs to be validated to make sure
> it properly initializes the new structure in the various node hotplug
> cases before anybody goes applying it.
> 
> Original report below (sent privately by accident).
> 
>  include/linux/mmzone.h |    5 +++++
>  kernel/sched/core.c    |   16 ----------------
>  mm/filemap.c           |   22 +++++++++++++++++++++-
>  mm/page_alloc.c        |    5 +++++
>  4 files changed, 31 insertions(+), 17 deletions(-)
> 
> ---
> 
> I'm seeing a 4.8->4.9 regression:
> 
> 		https://www.sr71.net/~dave/intel/2016-12-19-page_fault_3.processes.png
> 
> This is on a 160-thread 8-socket system.  The workload is a bunch of
> processes faulting in pages to separate MAP_SHARED mappings:
> 
> 	https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault3.c
> 
> The smoking gun in the profiles is that __wake_up_bit() (called via
> unlock_page()) goes from ~1% to 4% in the profiles.
> 
> The workload is pretty maniacally touching random parts of the
> waitqueue, and missing the cache heavily.  Now that it is shared, I
> suspect we're transferring cachelines across node boundaries in a way
> that we were not with the per-zone waitqueues.
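
For reference, the path behind those profile hits is roughly the following
(paraphrasing the 4.9-era mm/filemap.c, not quoted verbatim). Every
unlock_page() hashes into the shared table and reads a waitqueue head,
whether or not anyone is actually waiting:

	void unlock_page(struct page *page)
	{
		page = compound_head(page);
		clear_bit_unlock(PG_locked, &page->flags);
		smp_mb__after_atomic();
		/* page_waitqueue() hashes into the global bit_wait_table;
		 * __wake_up_bit() then reads that bucket's head via
		 * waitqueue_active() before deciding whether to wake anyone. */
		__wake_up_bit(page_waitqueue(page), &page->flags, PG_locked);
	}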
> 
> This is *only* showing up with MAP_SHARED pages, not with anonymous
> pages.  I think we do a lock_page()/unlock_page() pair in
> do_shared_fault(), which we avoid in the anonymous case.  Reverting:
> 
> 	9dcb8b685f mm: remove per-zone hashtable of bitlock waitqueues
> 
> restores things to 4.8 behavior.  The fact that automated testing didn't
> catch this probably means that it's pretty rare to find hardware that
> actually shows the problem, so I don't think it's worth reverting
> anything in mainline.

I've been doing a bit of testing, and I don't know why you're seeing
this.

I don't think I've been able to trigger any actual page lock contention,
so nothing gets put on the waitqueue that could really bounce cache
lines around, as far as I can see. Yes, there are some loads to check
the waitqueue, but those should be cached read-shared among CPUs, so I
wouldn't have thought they cause interconnect traffic.

After testing my PageWaiters patch, I'm maybe seeing a 2% improvement on
this workload (although powerpc doesn't do so well on this one due to
virtual memory management overheads -- maybe x86 will show a bit more).

But the point is that after this, there should be no waitqueue activity
at all. I haven't chased up a system with a lot of IOPS attached, which
would be needed to realistically test the contended cases (I thought
that was needed to show an improvement from per-node tables, but your
test suggests not...).

Thanks,
Nick


> 
> But, the commit says:
> 
> >     As part of that earlier discussion, we had a much better solution for
> >     the NUMA scalability issue - by just making the page lock have a
> >     separate contention bit, the waitqueue doesn't even have to be looked at
> >     for the normal case.  
> 
> So, maybe we should do that moving forward since we at least found one
> case that's pretty adversely affected.
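
The separate-contention-bit idea referenced above amounts to letting the
uncontended unlock skip the waitqueue hash entirely. A conceptual sketch
only, not the eventual implementation -- PageWaiters() here stands in for
whatever page-flag test gets used, with lock_page() sleepers setting that
flag before queueing themselves:

	void unlock_page(struct page *page)
	{
		page = compound_head(page);
		clear_bit_unlock(PG_locked, &page->flags);
		smp_mb__after_atomic();
		/* only touch the hash table if a waiter flagged itself */
		if (PageWaiters(page))
			wake_up_page(page, PG_locked);
	}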
> 
> Cc: Andreas Gruenbacher <agruenba@redhat.com>
> Cc: Bob Peterson <rpeterso@redhat.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Steven Whitehouse <swhiteho@redhat.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> 
> ---
> 
>  b/include/linux/mmzone.h |    5 +++++
>  b/kernel/sched/core.c    |   16 ----------------
>  b/mm/filemap.c           |   22 +++++++++++++++++++++-
>  b/mm/page_alloc.c        |    5 +++++
>  4 files changed, 31 insertions(+), 17 deletions(-)
> 
> diff -puN include/linux/mmzone.h~static-per-zone-waitqueue include/linux/mmzone.h
> --- a/include/linux/mmzone.h~static-per-zone-waitqueue	2016-12-19 11:35:12.210823059 -0800
> +++ b/include/linux/mmzone.h	2016-12-19 13:11:53.335170271 -0800
> @@ -27,6 +27,9 @@
>  #endif
>  #define MAX_ORDER_NR_PAGES (1 << (MAX_ORDER - 1))
>  
> +#define WAIT_TABLE_BITS 8
> +#define WAIT_TABLE_SIZE (1 << WAIT_TABLE_BITS)
> +
>  /*
>   * PAGE_ALLOC_COSTLY_ORDER is the order at which allocations are deemed
>   * costly to service.  That is between allocation orders which should
> @@ -662,6 +665,8 @@ typedef struct pglist_data {
>  	unsigned long		min_slab_pages;
>  #endif /* CONFIG_NUMA */
>  
> +	wait_queue_head_t	wait_table[WAIT_TABLE_SIZE];
> +
>  	/* Write-intensive fields used by page reclaim */
>  	ZONE_PADDING(_pad1_)
>  	spinlock_t		lru_lock;
> diff -puN kernel/sched/core.c~static-per-zone-waitqueue kernel/sched/core.c
> --- a/kernel/sched/core.c~static-per-zone-waitqueue	2016-12-19 11:35:12.212823149 -0800
> +++ b/kernel/sched/core.c	2016-12-19 11:35:12.225823738 -0800
> @@ -7509,27 +7509,11 @@ static struct kmem_cache *task_group_cac
>  DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
>  DECLARE_PER_CPU(cpumask_var_t, select_idle_mask);
>  
> -#define WAIT_TABLE_BITS 8
> -#define WAIT_TABLE_SIZE (1 << WAIT_TABLE_BITS)
> -static wait_queue_head_t bit_wait_table[WAIT_TABLE_SIZE] __cacheline_aligned;
> -
> -wait_queue_head_t *bit_waitqueue(void *word, int bit)
> -{
> -	const int shift = BITS_PER_LONG == 32 ? 5 : 6;
> -	unsigned long val = (unsigned long)word << shift | bit;
> -
> -	return bit_wait_table + hash_long(val, WAIT_TABLE_BITS);
> -}
> -EXPORT_SYMBOL(bit_waitqueue);
> -
>  void __init sched_init(void)
>  {
>  	int i, j;
>  	unsigned long alloc_size = 0, ptr;
>  
> -	for (i = 0; i < WAIT_TABLE_SIZE; i++)
> -		init_waitqueue_head(bit_wait_table + i);
> -
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  	alloc_size += 2 * nr_cpu_ids * sizeof(void **);
>  #endif
> diff -puN mm/filemap.c~static-per-zone-waitqueue mm/filemap.c
> --- a/mm/filemap.c~static-per-zone-waitqueue	2016-12-19 11:35:12.215823285 -0800
> +++ b/mm/filemap.c	2016-12-19 14:51:23.881814379 -0800
> @@ -779,6 +779,26 @@ EXPORT_SYMBOL(__page_cache_alloc);
>  #endif
>  
>  /*
> + * We need 'nid' because page_waitqueue() needs to get the waitqueue
> + * for memory where virt_to_page() does not work, like highmem.
> + */
> +static wait_queue_head_t *__bit_waitqueue(void *word, int bit, int nid)
> +{
> +	const int shift = BITS_PER_LONG == 32 ? 5 : 6;
> +	unsigned long val = (unsigned long)word << shift | bit;
> +
> +	return &NODE_DATA(nid)->wait_table[hash_long(val, WAIT_TABLE_BITS)];
> +}
> +
> +wait_queue_head_t *bit_waitqueue(void *word, int bit)
> +{
> +	const int __maybe_unused nid = page_to_nid(virt_to_page(word));
> +
> +	return __bit_waitqueue(word, bit, nid);
> +}
> +EXPORT_SYMBOL(bit_waitqueue);
> +
> +/*
>   * In order to wait for pages to become available there must be
>   * waitqueues associated with pages. By using a hash table of
>   * waitqueues where the bucket discipline is to maintain all
> @@ -790,7 +810,7 @@ EXPORT_SYMBOL(__page_cache_alloc);
>   */
>  wait_queue_head_t *page_waitqueue(struct page *page)
>  {
> -	return bit_waitqueue(page, 0);
> +	return __bit_waitqueue(page, 0, page_to_nid(page));
>  }
>  EXPORT_SYMBOL(page_waitqueue);
>  
> diff -puN mm/page_alloc.c~static-per-zone-waitqueue mm/page_alloc.c
> --- a/mm/page_alloc.c~static-per-zone-waitqueue	2016-12-19 11:35:12.219823466 -0800
> +++ b/mm/page_alloc.c	2016-12-19 13:10:56.587613213 -0800
> @@ -5872,6 +5872,7 @@ void __paginginit free_area_init_node(in
>  	pg_data_t *pgdat = NODE_DATA(nid);
>  	unsigned long start_pfn = 0;
>  	unsigned long end_pfn = 0;
> +	int i;
>  
>  	/* pg_data_t should be reset to zero when it's allocated */
>  	WARN_ON(pgdat->nr_zones || pgdat->kswapd_classzone_idx);
> @@ -5892,6 +5893,10 @@ void __paginginit free_area_init_node(in
>  				  zones_size, zholes_size);
>  
>  	alloc_node_mem_map(pgdat);
> +
> +	/* per-node page waitqueue initialization: */
> +	for (i = 0; i < WAIT_TABLE_SIZE; i++)
> +		init_waitqueue_head(&pgdat->wait_table[i]);
>  #ifdef CONFIG_FLAT_NODE_MEM_MAP
>  	printk(KERN_DEBUG "free_area_init_node: node %d, pgdat %08lx, node_mem_map %08lx\n",
>  		nid, (unsigned long)pgdat,
> _
> 


Thread overview: 20+ messages
2016-12-19 22:58 [RFC][PATCH] make global bitlock waitqueues per-node Dave Hansen
     [not found] ` <CA+55aFwK6JdSy9v_BkNYWNdfK82sYA1h3qCSAJQ0T45cOxeXmQ@mail.gmail.com>
2016-12-20  0:20   ` Dave Hansen
2016-12-20  2:31     ` Nicholas Piggin
2016-12-20 12:58       ` Mel Gorman
2016-12-20 13:21         ` Nicholas Piggin
2016-12-20 17:31     ` Linus Torvalds
2016-12-20 18:02       ` Linus Torvalds
2016-12-21  8:09         ` Peter Zijlstra
2016-12-21  8:32           ` Peter Zijlstra
2016-12-21 18:02             ` Linus Torvalds
2016-12-21 18:33               ` Nicholas Piggin
2016-12-21 19:01                 ` Nicholas Piggin
2016-12-21 19:50                   ` Linus Torvalds
2016-12-22  2:07                     ` Nicholas Piggin
2016-12-22 19:28               ` Hugh Dickins
2016-12-21 10:26           ` Nicholas Piggin
2016-12-20  2:26 ` Nicholas Piggin
2016-12-21 12:30 ` Nicholas Piggin [this message]
2016-12-21 18:12   ` Linus Torvalds
2016-12-21 18:40     ` Nicholas Piggin
