linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Allison Henderson <allison.henderson@oracle.com>
To: Dave Chinner <david@fromorbit.com>, linux-xfs@vger.kernel.org
Subject: Re: [PATCH 7/7] xfs: reduce kvmalloc overhead for CIL shadow buffers
Date: Fri, 3 Sep 2021 14:55:52 -0700	[thread overview]
Message-ID: <60b698a6-6ed6-e733-a201-0d630900cccd@oracle.com> (raw)
In-Reply-To: <20210902095927.911100-8-david@fromorbit.com>



On 9/2/21 2:59 AM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Oh, let me count the ways that the kvmalloc API sucks dog eggs.
> 
> The problem is when we are logging lots of large objects, we hit
> kvmalloc really damn hard with costly order allocations, and
> behaviour utterly sucks:
> 
>       - 49.73% xlog_cil_commit
> 	 - 31.62% kvmalloc_node
> 	    - 29.96% __kmalloc_node
> 	       - 29.38% kmalloc_large_node
> 		  - 29.33% __alloc_pages
> 		     - 24.33% __alloc_pages_slowpath.constprop.0
> 			- 18.35% __alloc_pages_direct_compact
> 			   - 17.39% try_to_compact_pages
> 			      - compact_zone_order
> 				 - 15.26% compact_zone
> 				      5.29% __pageblock_pfn_to_page
> 				      3.71% PageHuge
> 				    - 1.44% isolate_migratepages_block
> 					 0.71% set_pfnblock_flags_mask
> 				   1.11% get_pfnblock_flags_mask
> 			   - 0.81% get_page_from_freelist
> 			      - 0.59% _raw_spin_lock_irqsave
> 				 - do_raw_spin_lock
> 				      __pv_queued_spin_lock_slowpath
> 			- 3.24% try_to_free_pages
> 			   - 3.14% shrink_node
> 			      - 2.94% shrink_slab.constprop.0
> 				 - 0.89% super_cache_count
> 				    - 0.66% xfs_fs_nr_cached_objects
> 				       - 0.65% xfs_reclaim_inodes_count
> 					    0.55% xfs_perag_get_tag
> 				   0.58% kfree_rcu_shrink_count
> 			- 2.09% get_page_from_freelist
> 			   - 1.03% _raw_spin_lock_irqsave
> 			      - do_raw_spin_lock
> 				   __pv_queued_spin_lock_slowpath
> 		     - 4.88% get_page_from_freelist
> 			- 3.66% _raw_spin_lock_irqsave
> 			   - do_raw_spin_lock
> 				__pv_queued_spin_lock_slowpath
> 	    - 1.63% __vmalloc_node
> 	       - __vmalloc_node_range
> 		  - 1.10% __alloc_pages_bulk
> 		     - 0.93% __alloc_pages
> 			- 0.92% get_page_from_freelist
> 			   - 0.89% rmqueue_bulk
> 			      - 0.69% _raw_spin_lock
> 				 - do_raw_spin_lock
> 				      __pv_queued_spin_lock_slowpath
> 	   13.73% memcpy_erms
> 	 - 2.22% kvfree
> 
> On this workload, that's almost a dozen CPUs all trying to compact
> and reclaim memory inside kvmalloc_node at the same time. Yet it is
> regularly falling back to vmalloc despite all that compaction, page
> and shrinker reclaim that direct reclaim is doing. Copying all the
> metadata is taking far less CPU time than allocating the storage!
> 
> Direct reclaim should be considered extremely harmful.
> 
> This is a high frequency, high throughput, CPU usage and latency
> sensitive allocation. We've got memory there, and we're using
> kvmalloc to allow memory allocation to avoid doing lots of work to
> try to do contiguous allocations.
> 
> Except it still does *lots of costly work* that is unnecessary.
> 
> Worse: the only way to avoid the slowpath page allocation trying to
> do compaction on costly allocations is to turn off direct reclaim
> (i.e. remove __GFP_RECLAIM_DIRECT from the gfp flags).
> 
> Unfortunately, the stupid kvmalloc API then says "oh, this isn't a
> GFP_KERNEL allocation context, so you only get kmalloc!". This
> cuts off the vmalloc fallback, and this leads to almost instant OOM
> problems which ends up in filesystems deadlocks, shutdowns and/or
> kernel crashes.
> 
> I want some basic kvmalloc behaviour:
> 
> - kmalloc for a contiguous range with fail fast semantics - no
>    compaction direct reclaim if the allocation enters the slow path.
> - run normal vmalloc (i.e. GFP_KERNEL) if kmalloc fails
> 
> The really, really stupid part about this is these kvmalloc() calls
> are run under memalloc_nofs task context, so all the allocations are
> always reduced to GFP_NOFS regardless of the fact that kvmalloc
> requires GFP_KERNEL to be passed in. IOWs, we're already telling
> kvmalloc to behave differently to the gfp flags we pass in, but it
> still won't allow vmalloc to be run with anything other than
> GFP_KERNEL.
> 
> So, this patch open codes the kvmalloc() in the commit path to have
> the above described behaviour. The result is we more than halve the
> CPU time spend doing kvmalloc() in this path and transaction commits
> with 64kB objects in them more than doubles. i.e. we get ~5x
> reduction in CPU usage per costly-sized kvmalloc() invocation and
> the profile looks like this:
> 
>    - 37.60% xlog_cil_commit
> 	16.01% memcpy_erms
>        - 8.45% __kmalloc
> 	 - 8.04% kmalloc_order_trace
> 	    - 8.03% kmalloc_order
> 	       - 7.93% alloc_pages
> 		  - 7.90% __alloc_pages
> 		     - 4.05% __alloc_pages_slowpath.constprop.0
> 			- 2.18% get_page_from_freelist
> 			- 1.77% wake_all_kswapds
> ....
> 				    - __wake_up_common_lock
> 				       - 0.94% _raw_spin_lock_irqsave
> 		     - 3.72% get_page_from_freelist
> 			- 2.43% _raw_spin_lock_irqsave
>        - 5.72% vmalloc
> 	 - 5.72% __vmalloc_node_range
> 	    - 4.81% __get_vm_area_node.constprop.0
> 	       - 3.26% alloc_vmap_area
> 		  - 2.52% _raw_spin_lock
> 	       - 1.46% _raw_spin_lock
> 	      0.56% __alloc_pages_bulk
>        - 4.66% kvfree
> 	 - 3.25% vfree
> 	    - __vfree
> 	       - 3.23% __vunmap
> 		  - 1.95% remove_vm_area
> 		     - 1.06% free_vmap_area_noflush
> 			- 0.82% _raw_spin_lock
> 		     - 0.68% _raw_spin_lock
> 		  - 0.92% _raw_spin_lock
> 	 - 1.40% kfree
> 	    - 1.36% __free_pages
> 	       - 1.35% __free_pages_ok
> 		  - 1.02% _raw_spin_lock_irqsave
> 
> It's worth noting that over 50% of the CPU time spent allocating
> these shadow buffers is now spent on spinlocks. So the shadow buffer
> allocation overhead is greatly reduced by getting rid of direct
> reclaim from kmalloc, and could probably be made even less costly if
> vmalloc() didn't use global spinlocks to protect it's structures.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
Ok, thanks for all the explaining.  I will try out your set with mine, 
and do a few perfs myself.

Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
> ---
>   fs/xfs/xfs_log_cil.c | 46 +++++++++++++++++++++++++++++++++-----------
>   1 file changed, 35 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index fff68aad254e..81ebf03bfa5c 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -185,6 +185,39 @@ xlog_cil_iovec_space(
>   			sizeof(uint64_t));
>   }
>   
> +/*
> + * shadow buffers can be large, so we need to use kvmalloc() here to ensure
> + * success. Unfortunately, kvmalloc() only allows GFP_KERNEL contexts to fall
> + * back to vmalloc, so we can't actually do anything useful with gfp flags to
> + * control the kmalloc() behaviour within kvmalloc(). Hence kmalloc() will do
> + * direct reclaim and compaction in the slow path, both of which are
> + * horrendously expensive. We just want kmalloc to fail fast and fall back to
> + * vmalloc if it can't get somethign straight away from the free lists or buddy
> + * allocator. Hence we have to open code kvmalloc outselves here.
> + *
> + * Also, we are in memalloc_nofs_save task context here, so despite the use of
> + * GFP_KERNEL here, we are actually going to be doing GFP_NOFS allocations. This
> + * is actually the only way to make vmalloc() do GFP_NOFS allocations, so lets
> + * just all pretend this is a GFP_KERNEL context operation....
> + */
> +static inline void *
> +xlog_cil_kvmalloc(
> +	size_t		size)
> +{
> +	gfp_t		flags = GFP_KERNEL;
> +	void		*p;
> +
> +	flags &= ~__GFP_DIRECT_RECLAIM;
> +	flags |= __GFP_NOWARN | __GFP_NORETRY;
> +	do {
> +		p = kmalloc(buf_size, flags);
> +		if (!p)
> +			p = vmalloc(buf_size);
> +	} while (!p);
> +
> +	return p;
> +}
> +
>   /*
>    * Allocate or pin log vector buffers for CIL insertion.
>    *
> @@ -293,25 +326,16 @@ xlog_cil_alloc_shadow_bufs(
>   		 */
>   		if (!lip->li_lv_shadow ||
>   		    buf_size > lip->li_lv_shadow->lv_size) {
> -
>   			/*
>   			 * We free and allocate here as a realloc would copy
> -			 * unnecessary data. We don't use kmem_zalloc() for the
> +			 * unnecessary data. We don't use kvzalloc() for the
>   			 * same reason - we don't need to zero the data area in
>   			 * the buffer, only the log vector header and the iovec
>   			 * storage.
>   			 */
>   			kmem_free(lip->li_lv_shadow);
> +			lv = xlog_cil_kvmalloc(buf_size);
>   
> -			/*
> -			 * We are in transaction context, which means this
> -			 * allocation will pick up GFP_NOFS from the
> -			 * memalloc_nofs_save/restore context the transaction
> -			 * holds. This means we can use GFP_KERNEL here so the
> -			 * generic kvmalloc() code will run vmalloc on
> -			 * contiguous page allocation failure as we require.
> -			 */
> -			lv = kvmalloc(buf_size, GFP_KERNEL);
>   			memset(lv, 0, xlog_cil_iovec_space(niovecs));
>   
>   			INIT_LIST_HEAD(&lv->lv_list);
> 

  reply	other threads:[~2021-09-03 21:55 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-09-02  9:59 [RFC PATCH 0/7] xfs: intent item whiteouts Dave Chinner
2021-09-02  9:59 ` [PATCH 1/7] xfs: add log item flags to indicate intents Dave Chinner
2021-09-03 21:08   ` Allison Henderson
2021-09-02  9:59 ` [PATCH 2/7] xfs: tag transactions that contain intent done items Dave Chinner
2021-09-03 21:09   ` Allison Henderson
2021-09-02  9:59 ` [PATCH 3/7] xfs: factor a move some code in xfs_log_cil.c Dave Chinner
2021-09-03 21:09   ` Allison Henderson
2021-09-02  9:59 ` [PATCH 4/7] xfs: add log item method to return related intents Dave Chinner
2021-09-03 21:09   ` Allison Henderson
2021-09-02  9:59 ` [PATCH 5/7] xfs: whiteouts release intents that are not in the AIL Dave Chinner
2021-09-03 21:09   ` Allison Henderson
2021-09-02  9:59 ` [PATCH 6/7] [RFC] xfs: intent item whiteouts Dave Chinner
2021-09-03 21:09   ` Allison Henderson
2021-09-02  9:59 ` [PATCH 7/7] xfs: reduce kvmalloc overhead for CIL shadow buffers Dave Chinner
2021-09-03 21:55   ` Allison Henderson [this message]
2021-09-09 11:37 ` [RFC PATCH 0/7] xfs: intent item whiteouts Christoph Hellwig
2021-09-09 21:21   ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=60b698a6-6ed6-e733-a201-0d630900cccd@oracle.com \
    --to=allison.henderson@oracle.com \
    --cc=david@fromorbit.com \
    --cc=linux-xfs@vger.kernel.org \
    --subject='Re: [PATCH 7/7] xfs: reduce kvmalloc overhead for CIL shadow buffers' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).