linux-kernel.vger.kernel.org archive mirror
From: Dave Chinner <david@fromorbit.com>
To: "Yin, Fengwei" <fengwei.yin@intel.com>
Cc: kernel test robot <oliver.sang@intel.com>,
	Dave Chinner <dchinner@redhat.com>,
	"Darrick J. Wong" <djwong@kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Linux Memory Management List <linux-mm@kvack.org>,
	linux-xfs@vger.kernel.org, lkp@lists.01.org, lkp@intel.com
Subject: Re: [LKP] Re: [xfs] 016a23388c: stress-ng.xattr.ops_per_sec 58.4% improvement
Date: Wed, 3 Aug 2022 11:02:46 +1000
Message-ID: <20220803010246.GB3600936@dread.disaster.area>
In-Reply-To: <9d4e37e2-417e-7a2f-a187-5c7b680c6777@intel.com>

On Tue, Aug 02, 2022 at 02:36:00PM +0800, Yin, Fengwei wrote:
> Hi Dave,
> 
> On 7/22/2022 6:01 AM, Dave Chinner wrote:
> > A huge amount of spinlock contention in the xlog_commit_cil() path
> went away. The commit identified doesn't remove/change any
> spinlocks; it actually adds more overhead to the critical section of
> the above spinlock in preparation for removing said spinlocks.
> > 
> > That removal happens in the next commit in that series - c0fb4765c508 ("xfs:
> > convert CIL to unordered per cpu lists") - so I'd be expecting a
> > bisect to demonstrate that the spinlock contention goes away with
> > the commit that removed the spinlocks (as it does in all the testing
> > of this I've done over the past 2 years), not the commit this bisect
> > identified. Hence I think the bisect went wrong somewhere...
> 
> We did some investigation and got:
> 
> commit:
>   df7a4a2134b0a ("xfs: convert CIL busy extents to per-cpu")
>   016a23388cdcb ("xfs: Add order IDs to log items in CIL")
>   c0fb4765c5086 ("xfs: convert CIL to unordered per cpu lists")
> 
> df7a4a2134b0a201 016a23388cdcb2740deb1379dc4 c0fb4765c5086cfd00f1158f5f4
> ---------------- --------------------------- ---------------------------
>          %stddev     %change         %stddev     %change         %stddev
>              \          |                \          |                \
>      62.07            +0.0%      62.09            -0.0%      62.06        stress-ng.time.elapsed_time
>      62.07            +0.0%      62.09            -0.0%      62.06        stress-ng.time.elapsed_time.max
>       2237            +0.0%       2237            +0.0%       2237        stress-ng.time.file_system_inputs
>       1842 ±  4%     +16.9%       2152 ±  3%     +17.6%       2166        stress-ng.time.involuntary_context_switches
>     551.00            -0.3%     549.10            -0.3%     549.40        stress-ng.time.major_page_faults
>       6376            -1.1%       6305 ±  2%      +0.6%       6416        stress-ng.time.maximum_resident_set_size
>       9704            -0.3%       9676            -0.1%       9691        stress-ng.time.minor_page_faults
>       4096            +0.0%       4096            +0.0%       4096        stress-ng.time.page_size
>     841.90            -2.4%     821.70            -2.4%     821.90        stress-ng.time.percent_of_cpu_this_job_got
>     512.83            -3.4%     495.24            -3.6%     494.18        stress-ng.time.system_time
>      10.05 ±  8%     +52.3%      15.30 ±  3%     +61.1%      16.19 ±  2%  stress-ng.time.user_time
>       2325 ± 16%     +66.5%       3873 ±  7%     +70.3%       3962 ±  6%  stress-ng.time.voluntary_context_switches
>       1544 ±  4%     +54.4%       2385           +63.9%       2531        stress-ng.xattr.ops
> 
> Yes, commit c0fb4765c5086 ("xfs: convert CIL to unordered per cpu lists")
> also brings a performance gain. But most of the gain (54.4%) comes from
> commit 016a23388cdcb ("xfs: Add order IDs to log items in CIL").
> 
> 
> Based on commit 016a23388cdcb, with the following change applied:
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 6bc540898e3a..7c6c91a0a12d 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -659,9 +659,14 @@ xlog_cil_insert_items(
>                         continue;
>  
>                 lip->li_order_id = order;
> -               if (!list_empty(&lip->li_cil))
> -                       continue;
> -               list_add_tail(&lip->li_cil, &cil->xc_cil);
> +
> +               /*
> +                * Only move the item if it isn't already at the tail. This is
> +                * to prevent a transient list_empty() state when reinserting
> +                * an item that is already the only item in the CIL.
> +                */
> +               if (!list_is_last(&lip->li_cil, &cil->xc_cil))
> +                       list_move_tail(&lip->li_cil, &cil->xc_cil);
>         }
> 
> performance drops back to the same level as commit df7a4a2134b0a
> ("xfs: convert CIL busy extents to per-cpu").

Thanks for looking into this.

This looks like a case of the workload being right at the threshold
of catastrophic cache line contention breakdown. i.e. a single extra
exclusive cache miss inside the spin lock critical region is
sufficient to overload the memory bus bandwidth, and so the cacheline
contention on the lock goes from nothing to "all the spare CPU time
is spent contending on the spin lock cache line".
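
To make that concrete, here's a rough sketch of the two variants of
the relog path inside that critical region (simplified illustration
only, not the exact kernel code; the lines themselves are lifted from
the diff above):

	/*
	 * 016a23388cdcb: an item that is already on the CIL is left
	 * where it is. The only write is to lip->li_order_id, so
	 * relogging an item doesn't dirty any of the shared list
	 * cachelines while the lock is held.
	 */
	lip->li_order_id = order;
	if (!list_empty(&lip->li_cil))
		continue;
	list_add_tail(&lip->li_cil, &cil->xc_cil);

	/*
	 * Your test change: list_move_tail() rewrites the item's own
	 * prev/next pointers, its old neighbours' pointers and the
	 * list head, which means extra exclusive cacheline misses
	 * inside the spinlock hold time.
	 */
	lip->li_order_id = order;
	if (!list_is_last(&lip->li_cil, &cil->xc_cil))
		list_move_tail(&lip->li_cil, &cil->xc_cil);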

IOWs, the lock contention problem doesn't go away with this commit;
it just falls back under the critical threshold at which the cache
coherency protocol runs out of memory bus bandwidth bouncing dirty
cachelines between the CPU cores, and so spinlock overhead goes from
a linear cost to an exponential one.

That avoiding a single cacheline miss makes all the spinlock
contention disappear is just pure luck - it'll come back as soon as
other work running on the machine consumes enough memory bandwidth to
push this back over the edge. Hence the real fix for the spinlock
contention problem is still the patch I pointed out that removes the
spinlocks altogether...
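
That patch gets rid of the shared list and lock entirely by giving
each CPU its own local CIL list. Conceptually the insert path ends up
looking something like this - a sketch only, with approximate
structure and field names; see c0fb4765c508 for the real code:

	/*
	 * Each CPU appends newly logged items to its own per-cpu list,
	 * so the insert fast path no longer serialises on a global
	 * spinlock or bounces a shared list head between cores.
	 */
	cilpcp = get_cpu_ptr(cil->xc_pcp);
	lip->li_order_id = order;
	if (list_empty(&lip->li_cil))
		list_add_tail(&lip->li_cil, &cilpcp->log_items);
	put_cpu_ptr(cil->xc_pcp);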

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


Thread overview: 5+ messages
2022-07-21  2:30 [xfs] 016a23388c: stress-ng.xattr.ops_per_sec 58.4% improvement kernel test robot
2022-07-21 22:01 ` Dave Chinner
2022-08-02  6:36   ` [LKP] " Yin, Fengwei
2022-08-03  1:02     ` Dave Chinner [this message]
2022-08-03 14:11       ` Yin, Fengwei
