All of lore.kernel.org
 help / color / mirror / Atom feed
From: Alli <allison.henderson@oracle.com>
To: Dave Chinner <david@fromorbit.com>, linux-xfs@vger.kernel.org
Subject: Re: [PATCH 19/18] xfs: can't use kmem_zalloc() for attribute buffers
Date: Tue, 10 May 2022 17:54:15 -0700	[thread overview]
Message-ID: <f2fdf0afdec6e1dcef75f5d7b51bb6db8f062208.camel@oracle.com> (raw)
In-Reply-To: <20220510222716.GW1098723@dread.disaster.area>

On Wed, 2022-05-11 at 08:27 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Because when running fsmark workloads with 64kB xattrs, heap
> allocation of >64kB buffers for the attr item name/value buffer
> will fail and deadlock.
> 
> ....
>  XFS: fs_mark(8414) possible memory allocation deadlock size 65768 in
> kmem_alloc (mode:0x2d40)
>  XFS: fs_mark(8417) possible memory allocation deadlock size 65768 in
> kmem_alloc (mode:0x2d40)
>  XFS: fs_mark(8409) possible memory allocation deadlock size 65768 in
> kmem_alloc (mode:0x2d40)
>  XFS: fs_mark(8428) possible memory allocation deadlock size 65768 in
> kmem_alloc (mode:0x2d40)
>  XFS: fs_mark(8430) possible memory allocation deadlock size 65768 in
> kmem_alloc (mode:0x2d40)
>  XFS: fs_mark(8437) possible memory allocation deadlock size 65768 in
> kmem_alloc (mode:0x2d40)
>  XFS: fs_mark(8433) possible memory allocation deadlock size 65768 in
> kmem_alloc (mode:0x2d40)
>  XFS: fs_mark(8406) possible memory allocation deadlock size 65768 in
> kmem_alloc (mode:0x2d40)
>  XFS: fs_mark(8412) possible memory allocation deadlock size 65768 in
> kmem_alloc (mode:0x2d40)
>  XFS: fs_mark(8432) possible memory allocation deadlock size 65768 in
> kmem_alloc (mode:0x2d40)
>  XFS: fs_mark(8424) possible memory allocation deadlock size 65768 in
> kmem_alloc (mode:0x2d40)
> ....
> 
> I'd use kvmalloc(), but if we are doing 15,000 64kB xattr creates a
> second, the attempt to use kmalloc() in kvmalloc() results in a huge
> amount of direct reclaim work that is guaranteed to fail occurs
> before it falls back to vmalloc:
> 
> - 48.19% xfs_attr_create_intent
>   - 46.89% xfs_attri_init
>      - kvmalloc_node
> 	- 46.04% __kmalloc_node
> 	   - kmalloc_large_node
> 	      - 45.99% __alloc_pages
> 		 - 39.39% __alloc_pages_slowpath.constprop.0
> 		    - 38.89% __alloc_pages_direct_compact
> 		       - 38.71% try_to_compact_pages
> 			  - compact_zone_order
> 			  - compact_zone
> 			     - 21.09% isolate_migratepages_block
> 				  10.31% PageHuge
> 				  5.82% set_pfnblock_flags_mask
> 				  0.86% get_pfnblock_flags_mask
> 			     - 4.48% __reset_isolation_suitable
> 				  4.44% __reset_isolation_pfn
> 			     - 3.56% __pageblock_pfn_to_page
> 				  1.33% pfn_to_online_page
> 			       2.83% get_pfnblock_flags_mask
> 			     - 0.87% migrate_pages
> 				  0.86% compaction_alloc
> 			       0.84% find_suitable_fallback
> 		 - 6.60% get_page_from_freelist
> 		      4.99% clear_page_erms
> 		    - 1.19% _raw_spin_lock_irqsave
> 		       - do_raw_spin_lock
> 			    __pv_queued_spin_lock_slowpath
> 	- 0.86% __vmalloc_node_range
> 	     0.65% __alloc_pages_bulk
> 
> So lift xlog_cil_kvmalloc(), rename it to xlog_kvmalloc() and use
> that instead because it has sane fail-fast behaviour for the
> embedded kmalloc attempt. It also provides __GFP_NOFAIL guarantees
> that kvmalloc() won't do, either....
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
This looks ok to me, didnt run across any test failures
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

> ---
>  fs/xfs/xfs_attr_item.c | 35 +++++++++++++++--------------------
>  fs/xfs/xfs_log_cil.c   | 35 +----------------------------------
>  fs/xfs/xfs_log_priv.h  | 34 ++++++++++++++++++++++++++++++++++
>  3 files changed, 50 insertions(+), 54 deletions(-)
> 
> diff --git a/fs/xfs/xfs_attr_item.c b/fs/xfs/xfs_attr_item.c
> index 56f678c965b7..e8ac88d9fd14 100644
> --- a/fs/xfs/xfs_attr_item.c
> +++ b/fs/xfs/xfs_attr_item.c
> @@ -44,7 +44,7 @@ xfs_attri_item_free(
>  	struct xfs_attri_log_item	*attrip)
>  {
>  	kmem_free(attrip->attri_item.li_lv_shadow);
> -	kmem_free(attrip);
> +	kvfree(attrip);
>  }
>  
>  /*
> @@ -119,11 +119,11 @@ xfs_attri_item_format(
>  			sizeof(struct xfs_attri_log_format));
>  	xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_ATTR_NAME,
>  			attrip->attri_name,
> -			xlog_calc_iovec_len(attrip->attri_name_len));
> +			attrip->attri_name_len);
>  	if (attrip->attri_value_len > 0)
>  		xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_ATTR_VALUE,
>  				attrip->attri_value,
> -				xlog_calc_iovec_len(attrip-
> >attri_value_len));
> +				attrip->attri_value_len);
>  }
>  
>  /*
> @@ -163,26 +163,21 @@ xfs_attri_init(
>  
>  {
>  	struct xfs_attri_log_item	*attrip;
> -	uint32_t			name_vec_len = 0;
> -	uint32_t			value_vec_len = 0;
> -	uint32_t			buffer_size;
> -
> -	if (name_len)
> -		name_vec_len = xlog_calc_iovec_len(name_len);
> -	if (value_len)
> -		value_vec_len = xlog_calc_iovec_len(value_len);
> -
> -	buffer_size = name_vec_len + value_vec_len;
> +	uint32_t			buffer_size = name_len + value_len;
>  
>  	if (buffer_size) {
> -		attrip = kmem_zalloc(sizeof(struct xfs_attri_log_item)
> +
> -				    buffer_size, KM_NOFS);
> -		if (attrip == NULL)
> -			return NULL;
> +		/*
> +		 * This could be over 64kB in length, so we have to use
> +		 * kvmalloc() for this. But kvmalloc() utterly sucks,
> so we
> +		 * use own version.
> +		 */
> +		attrip = xlog_kvmalloc(sizeof(struct
> xfs_attri_log_item) +
> +					buffer_size);
>  	} else {
> -		attrip = kmem_cache_zalloc(xfs_attri_cache,
> -					  GFP_NOFS | __GFP_NOFAIL);
> +		attrip = kmem_cache_alloc(xfs_attri_cache,
> +					GFP_NOFS | __GFP_NOFAIL);
>  	}
> +	memset(attrip, 0, sizeof(struct xfs_attri_log_item));
>  
>  	attrip->attri_name_len = name_len;
>  	if (name_len)
> @@ -195,7 +190,7 @@ xfs_attri_init(
>  	if (value_len)
>  		attrip->attri_value = ((char *)attrip) +
>  				sizeof(struct xfs_attri_log_item) +
> -				name_vec_len;
> +				name_len;
>  	else
>  		attrip->attri_value = NULL;
>  
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 42ace9b091d8..b4023693b89f 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -219,39 +219,6 @@ xlog_cil_iovec_space(
>  			sizeof(uint64_t));
>  }
>  
> -/*
> - * shadow buffers can be large, so we need to use kvmalloc() here to
> ensure
> - * success. Unfortunately, kvmalloc() only allows GFP_KERNEL
> contexts to fall
> - * back to vmalloc, so we can't actually do anything useful with gfp
> flags to
> - * control the kmalloc() behaviour within kvmalloc(). Hence
> kmalloc() will do
> - * direct reclaim and compaction in the slow path, both of which are
> - * horrendously expensive. We just want kmalloc to fail fast and
> fall back to
> - * vmalloc if it can't get somethign straight away from the free
> lists or buddy
> - * allocator. Hence we have to open code kvmalloc outselves here.
> - *
> - * Also, we are in memalloc_nofs_save task context here, so despite
> the use of
> - * GFP_KERNEL here, we are actually going to be doing GFP_NOFS
> allocations. This
> - * is actually the only way to make vmalloc() do GFP_NOFS
> allocations, so lets
> - * just all pretend this is a GFP_KERNEL context operation....
> - */
> -static inline void *
> -xlog_cil_kvmalloc(
> -	size_t		buf_size)
> -{
> -	gfp_t		flags = GFP_KERNEL;
> -	void		*p;
> -
> -	flags &= ~__GFP_DIRECT_RECLAIM;
> -	flags |= __GFP_NOWARN | __GFP_NORETRY;
> -	do {
> -		p = kmalloc(buf_size, flags);
> -		if (!p)
> -			p = vmalloc(buf_size);
> -	} while (!p);
> -
> -	return p;
> -}
> -
>  /*
>   * Allocate or pin log vector buffers for CIL insertion.
>   *
> @@ -368,7 +335,7 @@ xlog_cil_alloc_shadow_bufs(
>  			 * storage.
>  			 */
>  			kmem_free(lip->li_lv_shadow);
> -			lv = xlog_cil_kvmalloc(buf_size);
> +			lv = xlog_kvmalloc(buf_size);
>  
>  			memset(lv, 0, xlog_cil_iovec_space(niovecs));
>  
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 4aa95b68450a..46f989641eda 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -679,4 +679,38 @@ xlog_valid_lsn(
>   */
>  void xlog_cil_pcp_dead(struct xlog *log, unsigned int cpu);
>  
> +/*
> + * Log vector and shadow buffers can be large, so we need to use
> kvmalloc() here
> + * to ensure success. Unfortunately, kvmalloc() only allows
> GFP_KERNEL contexts
> + * to fall back to vmalloc, so we can't actually do anything useful
> with gfp
> + * flags to control the kmalloc() behaviour within kvmalloc(). Hence
> kmalloc()
> + * will do direct reclaim and compaction in the slow path, both of
> which are
> + * horrendously expensive. We just want kmalloc to fail fast and
> fall back to
> + * vmalloc if it can't get somethign straight away from the free
> lists or
> + * buddy allocator. Hence we have to open code kvmalloc outselves
> here.
> + *
> + * This assumes that the caller uses memalloc_nofs_save task context
> here, so
> + * despite the use of GFP_KERNEL here, we are going to be doing
> GFP_NOFS
> + * allocations. This is actually the only way to make vmalloc() do
> GFP_NOFS
> + * allocations, so lets just all pretend this is a GFP_KERNEL
> context
> + * operation....
> + */
> +static inline void *
> +xlog_kvmalloc(
> +	size_t		buf_size)
> +{
> +	gfp_t		flags = GFP_KERNEL;
> +	void		*p;
> +
> +	flags &= ~__GFP_DIRECT_RECLAIM;
> +	flags |= __GFP_NOWARN | __GFP_NORETRY;
> +	do {
> +		p = kmalloc(buf_size, flags);
> +		if (!p)
> +			p = vmalloc(buf_size);
> +	} while (!p);
> +
> +	return p;
> +}
> +
>  #endif	/* __XFS_LOG_PRIV_H__ */


      parent reply	other threads:[~2022-05-11  0:54 UTC|newest]

Thread overview: 56+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-05-09  0:41 [PATCH 00/18 V4] XFS: LARP state machine and recovery rework Dave Chinner
2022-05-09  0:41 ` [PATCH 01/18] xfs: avoid empty xattr transaction when attrs are inline Dave Chinner
2022-05-09 16:43   ` Darrick J. Wong
2022-05-09  0:41 ` [PATCH 02/18] xfs: initialise attrd item to zero Dave Chinner
2022-05-09 16:43   ` Darrick J. Wong
2022-05-09  0:41 ` [PATCH 03/18] xfs: make xattri_leaf_bp more useful Dave Chinner
2022-05-10 22:58   ` Darrick J. Wong
2022-05-09  0:41 ` [PATCH 04/18] xfs: rework deferred attribute operation setup Dave Chinner
2022-05-10 23:04   ` Darrick J. Wong
2022-05-11  0:57     ` Dave Chinner
2022-05-09  0:41 ` [PATCH 05/18] xfs: separate out initial attr_set states Dave Chinner
2022-05-10 23:12   ` Darrick J. Wong
2022-05-11  1:06     ` Dave Chinner
2022-05-11  1:08       ` Darrick J. Wong
2022-05-11  1:38         ` Dave Chinner
2022-05-11  8:35           ` Dave Chinner
2022-05-11 15:39             ` Darrick J. Wong
2022-05-12  0:57               ` Dave Chinner
2022-05-09  0:41 ` [PATCH 06/18] xfs: kill XFS_DAC_LEAF_ADDNAME_INIT Dave Chinner
2022-05-10 23:15   ` Darrick J. Wong
2022-05-09  0:41 ` [PATCH 07/18] xfs: consolidate leaf/node states in xfs_attr_set_iter Dave Chinner
2022-05-10 23:20   ` Darrick J. Wong
2022-05-11  1:09     ` Dave Chinner
2022-05-09  0:41 ` [PATCH 08/18] xfs: split remote attr setting out from replace path Dave Chinner
2022-05-10 23:22   ` Darrick J. Wong
2022-05-09  0:41 ` [PATCH 09/18] xfs: XFS_DAS_LEAF_REPLACE state only needed if !LARP Dave Chinner
2022-05-10 23:24   ` Darrick J. Wong
2022-05-09  0:41 ` [PATCH 10/18] xfs: remote xattr removal in xfs_attr_set_iter() is conditional Dave Chinner
2022-05-10 23:26   ` Darrick J. Wong
2022-05-09  0:41 ` [PATCH 11/18] xfs: clean up final attr removal in xfs_attr_set_iter Dave Chinner
2022-05-10 23:29   ` Darrick J. Wong
2022-05-09  0:41 ` [PATCH 12/18] xfs: xfs_attr_set_iter() does not need to return EAGAIN Dave Chinner
2022-05-10 23:30   ` Darrick J. Wong
2022-05-09  0:41 ` [PATCH 13/18] xfs: introduce attr remove initial states into xfs_attr_set_iter Dave Chinner
2022-05-10 23:37   ` Darrick J. Wong
2022-05-10 23:40     ` Darrick J. Wong
2022-05-09  0:41 ` [PATCH 14/18] xfs: switch attr remove to xfs_attri_set_iter Dave Chinner
2022-05-10 23:40   ` Darrick J. Wong
2022-05-09  0:41 ` [PATCH 15/18] xfs: remove xfs_attri_remove_iter Dave Chinner
2022-05-10 23:42   ` Darrick J. Wong
2022-05-09  0:41 ` [PATCH 16/18] xfs: use XFS_DA_OP flags in deferred attr ops Dave Chinner
2022-05-10 22:20   ` [PATCH 16/18 v2] " Dave Chinner
2022-05-10 23:47     ` Darrick J. Wong
2022-05-10 23:49     ` Alli
2022-05-09  0:41 ` [PATCH 17/18] xfs: ATTR_REPLACE algorithm with LARP enabled needs rework Dave Chinner
2022-05-10 22:31   ` Alli
2022-05-10 23:53   ` Darrick J. Wong
2022-05-11  1:14     ` Dave Chinner
2022-05-09  0:41 ` [PATCH 18/18] xfs: detect empty attr leaf blocks in xfs_attr3_leaf_verify Dave Chinner
2022-05-10 22:31   ` Alli
2022-05-10 23:54   ` Darrick J. Wong
2022-05-10 22:27 ` [PATCH 19/18] xfs: can't use kmem_zalloc() for attribute buffers Dave Chinner
2022-05-10 23:59   ` Darrick J. Wong
2022-05-11  0:54     ` Dave Chinner
2022-05-11  1:10       ` Darrick J. Wong
2022-05-11  0:54   ` Alli [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=f2fdf0afdec6e1dcef75f5d7b51bb6db8f062208.camel@oracle.com \
    --to=allison.henderson@oracle.com \
    --cc=david@fromorbit.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.