From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-3.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4DFB4C2D0A8 for ; Mon, 28 Sep 2020 23:34:49 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 0A0D12076A for ; Mon, 28 Sep 2020 23:34:49 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726338AbgI1Xes (ORCPT ); Mon, 28 Sep 2020 19:34:48 -0400 Received: from mail105.syd.optusnet.com.au ([211.29.132.249]:56913 "EHLO mail105.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726867AbgI1Xes (ORCPT ); Mon, 28 Sep 2020 19:34:48 -0400 Received: from dread.disaster.area (pa49-195-191-192.pa.nsw.optusnet.com.au [49.195.191.192]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 7CB923AAEDB; Tue, 29 Sep 2020 09:09:11 +1000 (AEST) Received: from dave by dread.disaster.area with local (Exim 4.92.3) (envelope-from ) id 1kN2Gc-0007AH-9z; Tue, 29 Sep 2020 09:09:10 +1000 Date: Tue, 29 Sep 2020 09:09:10 +1000 From: Dave Chinner To: Brian Foster Cc: "Darrick J. Wong" , linux-xfs@vger.kernel.org Subject: Re: [PATCH 3/3] xfs: periodically relog deferred intent items Message-ID: <20200928230910.GH14422@dread.disaster.area> References: <160083917978.1401135.9502772939838940679.stgit@magnolia> <160083919968.1401135.1020138085396332868.stgit@magnolia> <20200927230823.GA14422@dread.disaster.area> <20200928151627.GA4883@bfoster> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20200928151627.GA4883@bfoster> X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=YKPhNiOx c=1 sm=1 tr=0 cx=a_idp_d a=vvDRHhr1aDYKXl+H6jx2TA==:117 a=vvDRHhr1aDYKXl+H6jx2TA==:17 a=kj9zAlcOel0A:10 a=reM5J-MqmosA:10 a=eJfxgxciAAAA:8 a=7-415B0cAAAA:8 a=pqlqNcGz_VZVua4l6aUA:9 a=NytZxH15jsp84EeW:21 a=EHmayZst4U5jbfoL:21 a=CjuIK1q_8ugA:10 a=xM9caqqi1sUkTy8OJ5Uh:22 a=biEYGPWJfzWAr4FL6Ov7:22 Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org On Mon, Sep 28, 2020 at 11:16:27AM -0400, Brian Foster wrote: > On Mon, Sep 28, 2020 at 09:08:23AM +1000, Dave Chinner wrote: > > On Tue, Sep 22, 2020 at 10:33:19PM -0700, Darrick J. Wong wrote: > > > +xfs_defer_relog( > > > + struct xfs_trans **tpp, > > > + struct list_head *dfops) > > > +{ > > > + struct xfs_defer_pending *dfp; > > > + xfs_lsn_t threshold_lsn; > > > + > > > + ASSERT((*tpp)->t_flags & XFS_TRANS_PERM_LOG_RES); > > > + > > > + /* > > > + * Figure out where we need the tail to be in order to maintain the > > > + * minimum required free space in the log. > > > + */ > > > + threshold_lsn = xlog_grant_push_threshold((*tpp)->t_mountp->m_log, 0); > > > + if (threshold_lsn == NULLCOMMITLSN) > > > + return 0; > > > > This smells of premature optimisation. > > > > When we are in a tail-pushing scenario (i.e. any sort of > > sustained metadata workload) this will always return true, and so we > > will relog every intent that isn't in the current checkpoint every > > time this is called. Under light load, we don't care if we add a > > little bit of relogging overhead as the CIL slowly flushes/pushes - > > it will have neglible impact on performance because there is little > > load on the journal. > > > > That seems like an overly broad and not necessarily correct assumption. > The threshold above is tail relative and hardcoded (with the zero param) > to the default AIL pushing threshold, which is 25% of the log. If we The pushing threshold is 75% of the log space has been reserved by the grant head, not that there's only 25% of the log free. The reservation is typically much, much larger than the change that is actually written to the log, and so we can be tail pushing even when the log is largely empty and the CIL hasn't reached it's threshold for pushing. Think about it - if we have a log with a 256kB log stripe unit, a itruncate reservation is somewhere around 1.5MB in size (700kB * 2), and if we have reflink/rmap enabled, it's closer to 6MB in size. Yet a typical modification will not use any of the 512kB*8 space reserved for the LSU (because CIL!) and might only use 10kB of new log space in the CIL because that's all that was newly dirtied. That means we reserve 6MB of log space for each itruncate transaction in flight, (directory ops are almost as heavyweight) so it only takes 16 transactions in flight to consume 100MB of reserve grant head space. If we've got a log smaller than 130MB in this situation, we are into tail pushing -even if the physical log is empty-. IOWs, tail pushing is not triggered by how much physical log space has been consumed - it is triggered by reservation pressure. And we can trivially drive the system into reservation pressure with concurrent workloads > assume sustained tail pushing conditions, a committed intent doesn't > trigger relog again until it falls within that threshold in the on-disk > log. The current CIL (nonblocking) size threshold is half that at 12.5%. Or smaller. For small logs it is 12.5% of the log size. For larger logs, it caps at 32MB (8 * (256k << 4)). > So relative to a given rolling transaction processing an intent chain, > there's a decent number of full CIL checkpoints that can occur before > any of those intents need to relog again (if at all), regardless of log > size. Your logic is inverted because reservation pressure does not reflect CIL pressure. Log space consumption and reservation pressure were decoupled by the in-memory relogging that the CIL enables, hence given a specific intent chain, the likely probability is that it will not span more than a single CIL context. i.e. it will complete rolling entirely within a single CIL commit. As the log gets larger and the CIL increases in size, the probability that an intent chain completes within a single CIL commit goes up (e.g. 6MB reservation vs 32MB CIL push threshold). The lighter the load, the more likely it is that a transaction chain will complete within a single CIL checkpoint, even on small logs. > Without that logic, we're changing behavior to relog the entire chain in > every CIL checkpoint. That's a fairly significant change in behavior > with less predictable breakdown conditions. Do we know how long a single > chain is going to be? Yes: the typical worst case chain length is defined by the transaction reservation log count. That is supposed to cover > %99 of typical log space requirements for that operation. For any chain that is longer than this, we end up in the slow regrant path that reserves a unit of log space at a time, and these are the cases where the transaction must be careful about not pinning the tail of the log. i.e. long chains are exactly the case where we want to relog the intents in every new CIL context. This is no different to the way we relog the inode in -every transaction roll- whether it is modified or not so that we can guarantee it doesn't pin the tail of the log when we need to regrant space on commit.... > Do we know how many CPUs are concurrently > processing "long chain" operations? Yes: as many as the grant head reservations will allow. That's always been the bound maximum transaction concurrency we can support. > Do we know whether an external > workload is causing even more frequent checkpoints than governed by the > CIL size limit? Irrelevant, because checkpoint frequency affects the relogging efficiency of -everything- that running transactions. i.e. the impact is not isolated to relogging a few small intents, it affects relogging of all metadata, from AG headers to btree blocks, directories and inodes. IOWs, the overhead of relogging intents will be lost in the overhead of completely relogging entire dirty metadata blocks that the transactions also touch, even though they may have only newly dirtied a few bytes in each buffer. > The examples in the commit logs in this series refer to > chains of hundreds of intents with hundreds of transaction rolls. Truncating a file with 10 million extents will roll the transaction 10 million times. Having a long intent chain is no different from a "don't pin the tail of the log" perspective than truncating a badly fragmented file. Both need to relog their outstanding items that could pin the tail of the log as frequently as they commit... This is a basic principle of permanent transactions: they must guarantee forwards progress by ensuring they can't pin the tail of the log at any time. That is, if you hold an item that *may* pin the tail of the log when you need to regrant log space, you need to move those items to the head of the log -before- you attempt to regrant log space. We do not do that with intents, and conditional relogging doesn't provide that guarantee, either. Relogging the item once the item is no longer at the head of the log provides that forwards progress guarantee. I don't want to slap a band-aid over the problem in the name of "performance". Correctness comes first, then if we find a performance problem we'll address that without breaking correctness. > Those types of scenarios are hard enough to reason about, > particularly when we consider boxes with hundreds of CPUs, so I'm > somewhat skeptical of us accurately predicting > performance/behavior over an implementation that bounds processing > more explicitly. ISTM that these are all essentially undefined > contributing factors. I don't find these things hard to reason about, nor do I consider them undefined factors. This is largely because I've been dealing with journalling and AIL issues at this scale for the best part of 15 years. Everything I think about in this area starts with the requirement "this needs to scale to 10,000+ concurrent transactions across thousands of CPUs". The reason for that can be found in ancient history, such as: commit 249a8c1124653fa90f3a3afff869095a31bc229f Author: David Chinner Date: Tue Feb 5 12:13:32 2008 +1100 [XFS] Move AIL pushing into it's own thread When many hundreds to thousands of threads all try to do simultaneous transactions and the log is in a tail-pushing situation (i.e. full), we can get multiple threads walking the AIL list and contending on the AIL lock. .... commit 28f1bc4aabea316d01ca7a59c38d619bc17e3550 Author: Dave Chinner Date: Fri Feb 15 15:19:57 2008 +0000 Prevent AIL lock contention during transaction completion When hundreds of processors attempt to commit transactions at the same time, they can contend on the AIL lock when updating the tail LSN held in the in-core log structure. .... The workload that lead to the lockless grant head reservation code, the single threaded, cursor based AIL traversals, the atomic tail accounting, etc was a MPI workload on a 2048p machine doing an MPI synchronised close on ~6 fds per CPU on all CPUs at once. The lockless log code reduced this part of the workload from 4 hours to 4 seconds.... Servers these days are still small compared to the SGI machines that XFS ran on in the mid-2000s.... > > However, when we are under heavy load the code will now be reading > > the grant head and log position accounting variables during every > > commit, hence greatly increasing the number and temporal > > distribution of accesses to the hotest cachelines in the log. We > > currently never access these cache lines during commit unless the > > unit reservation has run out and we have to regrant physical log > > space for the transaction to continue (i.e. we are into slow path > > commit code). IOWs, this is like causing far more than double the > > number of accesses to the grant head, the log tail, the > > last_sync_lsn, etc, all of which is unnecessary exactly when we care > > about minimising contention on the log space accounting variables... > > > > If we're under sustained tail pushing pressure, blocking transactions > have already hit this cacheline as well, FWIW. No, tail pushing != blocking. If the AIL pushing is keeping up with the targets that are being set (as frequently happens with large logs and fast storage), then we rarely run out of log space and we do not block new reservations waiting for reservation space. > But if we're concerned about cacheline access in the commit path > specifically, I'm sure we could come up with any number of optimizations > to address that directly. E.g., we could sample the threshold > occasionally instead of on every roll, wait until the current > transaction has consumed all logres units (where we'll be hitting that > line anyways), shift state tracking over to xfsaild via setting a flag > on log items with an ->iop_relog() handler set that happen to pin the > tail, etc. etc. That said, I do think any such added complexity should > be justified with some accompanying profiling data (and I like the idea > of new stats counters mentioned in the separate mail). Yes, there's lots of ways the pushing threshold could be done differently, but to focus on that misses the point that intents violate forwards progress guarantees rolling transactions are supposed to provide the log. I simply made the "access overhead" argument as a supporting point that this change is also likely to have unintended side effects... Cheers, Dave. -- Dave Chinner david@fromorbit.com