Date: Mon, 20 Jun 2022 09:40:11 +1000
From: Dave Chinner
To: "Williams, Dan J"
Cc: "linux-xfs@vger.kernel.org", "jack@suse.cz", "djwong@kernel.org",
	"nvdimm@lists.linux.dev", "Gomatam, Sravani"
Subject: Re: [PATCH 8/8] xfs: drop async cache flushes from CIL commits.
Message-ID: <20220619234011.GK227878@dread.disaster.area>
References: <20220330011048.1311625-1-david@fromorbit.com>
	<20220330011048.1311625-9-david@fromorbit.com>
	<2820766805073c176e1a65a61fad2ef8ad0f9766.camel@intel.com>
In-Reply-To: <2820766805073c176e1a65a61fad2ef8ad0f9766.camel@intel.com>

On Thu, Jun 16, 2022 at 10:23:10PM +0000, Williams, Dan J wrote:
> On Wed, 2022-03-30 at 12:10 +1100, Dave Chinner wrote:
> > From: Dave Chinner
> > 
> > Jan Kara reported a performance regression in dbench that he
> > bisected down to commit bad77c375e8d ("xfs: CIL checkpoint
> > flushes caches unconditionally").
> > 
> > Whilst developing the journal flush/fua optimisations this cache was
> > part of, it appeared to make a significant difference to
> > performance. However, now that this patchset has settled and all the
> > correctness issues are fixed, there does not appear to be any
> > significant performance benefit to asynchronous cache flushes.
> > 
> > In fact, the opposite is true on some storage types and workloads,
> > where additional cache flushes that can occur from fsync heavy
> > workloads have measurable and significant impact on overall
> > throughput.
> > 
> > Local dbench testing shows little difference on dbench runs with
> > sync vs async cache flushes on either fast or slow SSD storage, and
> > no difference in streaming concurrent async transaction workloads
> > like fs-mark.
> > 
> > Fast NVME storage.
> > 
> > From `dbench -t 30`, CIL scale:
> > 
> > clients         async                   sync
> >                 BW      Latency         BW      Latency
> > 1                935.18   0.855          915.64   0.903
> > 8               2404.51   6.873         2341.77   6.511
> > 16              3003.42   6.460         2931.57   6.529
> > 32              3697.23   7.939         3596.28   7.894
> > 128             7237.43  15.495         7217.74  11.588
> > 512             5079.24  90.587         5167.08  95.822
> > 
> > fsmark, 32 threads, create w/ 64 byte xattr w/32k logbsize
> > 
> >         create          chown           unlink
> > async   1m41s           1m16s           2m03s
> > sync    1m40s           1m19s           1m54s
> > 
> > Slower SATA SSD storage:
> > 
> > From `dbench -t 30`, CIL scale:
> > 
> > clients         async                   sync
> >                 BW      Latency         BW      Latency
> > 1                 78.59  15.792           83.78  10.729
> > 8                367.88  92.067          404.63  59.943
> > 16               564.51  72.524          602.71  76.089
> > 32               831.66 105.984          870.26 110.482
> > 128             1659.76 102.969         1624.73  91.356
> > 512             2135.91 223.054         2603.07 161.160
> > 
> > fsmark, 16 threads, create w/32k logbsize
> > 
> >         create          unlink
> > async   5m06s           4m15s
> > sync    5m00s           4m22s
> > 
> > And on Jan's test machine:
> > 
> >                    5.18-rc8-vanilla       5.18-rc8-patched
> > Amean     1        71.22 (   0.00%)       64.94 *   8.81%*
> > Amean     2        93.03 (   0.00%)       84.80 *   8.85%*
> > Amean     4       150.54 (   0.00%)      137.51 *   8.66%*
> > Amean     8       252.53 (   0.00%)      242.24 *   4.08%*
> > Amean     16      454.13 (   0.00%)      439.08 *   3.31%*
> > Amean     32      835.24 (   0.00%)      829.74 *   0.66%*
> > Amean     64     1740.59 (   0.00%)     1686.73 *   3.09%*
> > 
> > Performance and cache flush behaviour is restored to pre-regression
> > levels.
> > 
> > As such, we can now consider the async cache flush mechanism an
> > unnecessary exercise in premature optimisation and hence we can
> > now remove it and the infrastructure it requires completely.
> 
> It turns out this regresses umount latency on DAX filesystems. Sravani
> reached out to me after noticing that v5.18 takes up to 10 minutes to
> complete umount. On a whim I guessed this patch and upon revert of
> commit 919edbadebe1 ("xfs: drop async cache flushes from CIL commits")
> on top of v5.18 umount time goes back down to v5.17 levels.

That doesn't change the fact we are issuing cache flushes from the
log checkpoint code - it just changes how we issue them.

We removed the explicit blkdev_issue_flush_async() call from the
cache path and went back to the old way of doing things (attaching
it directly to the first IO of a journal checkpoint) when it became
clear the async flush was causing performance regressions on storage
with really slow cache flush semantics by causing too many extra
cache flushes to be issued.

As I just captured from /dev/pmem0 on 5.19-rc2 when a fsync was
issued after a 1MB write:

259,1    1      513    41.695930264  4615  Q FWFSM 8388688 + 8 [xfs_io]
259,1    1      514    41.695934668  4615  C FWFSM 8388688 + 8 [0]

You can see that the journal IO was issued as:

REQ_PREFLUSH | REQ_OP_WRITE | REQ_FUA | REQ_SYNC | REQ_META

As it was a single IO journal checkpoint - that's where the REQ_FUA
came from.
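
In case the blkparse RWBS column isn't familiar: it is just those
request flags spelled out a letter at a time - a leading F is
REQ_PREFLUSH, then the operation, then trailing F/S/M for
FUA/SYNC/META. Here's a rough stand-alone C decoder for the patterns
in these traces - purely illustrative, the helper name is made up and
this is not kernel code:

/*
 * Illustrative only: decode the blkparse RWBS strings seen in the
 * traces in this mail. Leading F = REQ_PREFLUSH, then the operation,
 * then trailing F/S/M = REQ_FUA/REQ_SYNC/REQ_META.
 */
#include <stdio.h>

static void decode_rwbs(const char *rwbs)
{
	const char *sep = "";

	printf("%-5s =", rwbs);
	if (*rwbs == 'F') {		/* leading F: flush the cache first */
		printf(" REQ_PREFLUSH");
		sep = " |";
		rwbs++;
	}
	for (; *rwbs; rwbs++) {
		const char *name;

		switch (*rwbs) {
		case 'W': name = "REQ_OP_WRITE"; break;
		case 'R': name = "REQ_OP_READ"; break;
		case 'F': name = "REQ_FUA"; break;	/* trailing F: force unit access */
		case 'S': name = "REQ_SYNC"; break;
		case 'M': name = "REQ_META"; break;
		default:  name = "?"; break;
		}
		printf("%s %s", sep, name);
		sep = " |";
	}
	printf("\n");
}

int main(void)
{
	decode_rwbs("FWFSM");	/* single-IO checkpoint / commit record */
	decode_rwbs("FWSM");	/* first IO of a multi-IO checkpoint (below) */
	decode_rwbs("WSM");	/* the journal writes in between */
	return 0;
}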
So if we create 5000 zero length files instead to create a large,
multi-IO checkpoint, we get this pattern from the journal:

259,1    2      103    90.614842783   101  Q FWSM 8388720 + 64 [kworker/u33:1]
259,1    2      104    90.614928169   101  C FWSM 8388720 + 64 [0]
259,1    2      105    90.615012360   101  Q  WSM 8388784 + 64 [kworker/u33:1]
259,1    2      106    90.615028941   101  C  WSM 8388784 + 64 [0]
259,1    2      107    90.615102783   101  Q  WSM 8388848 + 64 [kworker/u33:1]
259,1    2      108    90.615118677   101  C  WSM 8388848 + 64 [0]
.....
259,1    2      211    90.619375834   101  Q  WSM 8392176 + 64 [kworker/u33:1]
259,1    2      212    90.619391042   101  C  WSM 8392176 + 64 [0]
259,1    2      213    90.619415946   134  Q FWFSM 8392240 + 16 [kworker/2:1]
259,1    2      214    90.619420274   134  C FWFSM 8392240 + 16 [0]

And you can see that the first IO has REQ_PREFLUSH set, and the last
IO (the commit record) has both REQ_PREFLUSH and REQ_FUA set.

IOWs, the filesystem is issuing IO with exactly the semantics it
requires from the block device, and expecting the block device to be
flushing caches entirely on the first IO of a checkpoint.

> Perf confirms that all of that CPU time is being spent in
> arch_wb_cache_pmem(). It likely means that rather than amortizing that
> same latency periodically throughout the workload run, it is all being
> delayed until umount.

For completeness, this is what the umount IO looks like:

259,1    5        1    98.680129260 10166  Q FWFSM 8392256 + 8 [umount]
259,1    5        2    98.680135797 10166  C FWFSM 8392256 + 8 [0]
259,1    3      429    98.680341063  4977  Q   WM 0 + 8 [xfsaild/pmem0]
259,1    3      430    98.680362599  4977  C   WM 0 + 8 [0]
259,1    5        3    98.680616201 10166  Q FWFSM 8392264 + 8 [umount]
259,1    5        4    98.680619218 10166  C FWFSM 8392264 + 8 [0]
259,1    3      431    98.680767121  4977  Q   WM 0 + 8 [xfsaild/pmem0]
259,1    3      432    98.680770938  4977  C   WM 0 + 8 [0]
259,1    5        5    98.680836733 10166  Q FWFSM 8392272 + 8 [umount]
259,1    5        6    98.680839560 10166  C FWFSM 8392272 + 8 [0]
259,1   12        7    98.683546633 10166  Q  FWS [umount]
259,1   12        8    98.683551424 10166  C  FWS 0 [0]

You can see 3 journal writes there with REQ_PREFLUSH set before XFS
calls blkdev_issue_flush() (FWS of zero bytes) in xfs_free_buftarg()
just before tearing down DAX state and freeing the buftarg.

Which one of these cache flush operations is taking 10 minutes to
complete? That will tell us a lot more about what is going on...

> I assume this latency would also show up without DAX if page-cache is
> now allowed to continue growing, or is there some other signal that
> triggers async flushes in that case?

I think you've misunderstood what the "async" part of "async cache
flushes" actually did. It was an internal journal write optimisation
introduced in bad77c375e8d ("xfs: CIL checkpoint flushes caches
unconditionally") in 5.14 that didn't work out and was reverted in
5.18. It didn't change the cache flushing semantics of the journal,
just reverted to the same REQ_PREFLUSH behaviour we had for a decade
leading up to the "async flush" change in the 5.14 kernel.

To me, this smells of a pmem block device cache flush issue, not a
filesystem problem...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
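
PS: if you want a quick way to rule the raw device cache flush in or
out, timing it directly from userspace is a cheap first test - an
fsync() on an open block device fd goes through blkdev_issue_flush(),
i.e. the same zero length preflush as the FWS line in the umount
trace above. This is only a rough sketch - the program and the
/dev/pmem0 default are illustrative, not something from this thread:

/*
 * Rough, illustrative sketch: time a bare block device cache flush.
 * fsync() on a block device fd reaches the driver's flush path via
 * blkdev_issue_flush(). Device path is just an example; run as root.
 */
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *dev = argc > 1 ? argv[1] : "/dev/pmem0";
	struct timespec t0, t1;
	int fd = open(dev, O_RDONLY);

	if (fd < 0) {
		perror(dev);
		return 1;
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	if (fsync(fd) < 0)	/* block device fsync -> blkdev_issue_flush() */
		perror("fsync");
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("cache flush on %s took %.3f ms\n", dev,
	       (t1.tv_sec - t0.tv_sec) * 1e3 +
	       (t1.tv_nsec - t0.tv_nsec) / 1e6);

	close(fd);
	return 0;
}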