From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.2 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_SANE_1 autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2E9CBC43331 for ; Fri, 6 Sep 2019 00:02:24 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 4754F206BB for ; Fri, 6 Sep 2019 00:02:24 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2389975AbfIFACX (ORCPT ); Thu, 5 Sep 2019 20:02:23 -0400 Received: from mail104.syd.optusnet.com.au ([211.29.132.246]:34649 "EHLO mail104.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2389335AbfIFACX (ORCPT ); Thu, 5 Sep 2019 20:02:23 -0400 Received: from dread.disaster.area (pa49-181-255-194.pa.nsw.optusnet.com.au [49.181.255.194]) by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id EFDA743DCAE; Fri, 6 Sep 2019 10:02:06 +1000 (AEST) Received: from dave by dread.disaster.area with local (Exim 4.92) (envelope-from ) id 1i61hV-0003Yr-7L; Fri, 06 Sep 2019 10:02:05 +1000 Date: Fri, 6 Sep 2019 10:02:05 +1000 From: Dave Chinner To: Brian Foster Cc: linux-xfs@vger.kernel.org Subject: Re: [PATCH 7/7] xfs: push the grant head when the log head moves forward Message-ID: <20190906000205.GL1119@dread.disaster.area> References: <20190904042451.9314-1-david@fromorbit.com> <20190904042451.9314-8-david@fromorbit.com> <20190904193442.GA52970@bfoster> <20190904225056.GL1119@dread.disaster.area> <20190905162533.GA59149@bfoster> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20190905162533.GA59149@bfoster> User-Agent: Mutt/1.10.1 (2018-07-13) X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=D+Q3ErZj c=1 sm=1 tr=0 a=YO9NNpcXwc8z/SaoS+iAiA==:117 a=YO9NNpcXwc8z/SaoS+iAiA==:17 a=jpOVt7BSZ2e4Z31A5e1TngXxSK0=:19 a=kj9zAlcOel0A:10 a=J70Eh1EUuV4A:10 a=20KFwNOVAAAA:8 a=7-415B0cAAAA:8 a=nV-I2cFLb9qsVq0Z6zsA:9 a=YG-dbgyQE0bxXHgz:21 a=lZceffnXvbFfOs1u:21 a=CjuIK1q_8ugA:10 a=biEYGPWJfzWAr4FL6Ov7:22 Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org On Thu, Sep 05, 2019 at 12:25:33PM -0400, Brian Foster wrote: > On Thu, Sep 05, 2019 at 08:50:56AM +1000, Dave Chinner wrote: > > On Wed, Sep 04, 2019 at 03:34:42PM -0400, Brian Foster wrote: > > > On Wed, Sep 04, 2019 at 02:24:51PM +1000, Dave Chinner wrote: > > > > From: Dave Chinner > > > > +/* > > > > + * Completion of a iclog IO does not imply that a transaction has completed, as > > > > + * transactions can be large enough to span many iclogs. We cannot change the > > > > + * tail of the log half way through a transaction as this may be the only > > > > + * transaction in the log and moving the tail to point to the middle of it > > > > + * will prevent recovery from finding the start of the transaction. Hence we > > > > + * should only update the last_sync_lsn if this iclog contains transaction > > > > + * completion callbacks on it. > > > > + * > > > > + * We have to do this before we drop the icloglock to ensure we are the only one > > > > + * that can update it. > > > > + * > > > > + * If we are moving the last_sync_lsn forwards, we also need to ensure we kick > > > > + * the reservation grant head pushing. This is due to the fact that the push > > > > + * target is bound by the current last_sync_lsn value. Hence if we have a large > > > > + * amount of log space bound up in this committing transaction then the > > > > + * last_sync_lsn value may be the limiting factor preventing tail pushing from > > > > + * freeing space in the log. Hence once we've updated the last_sync_lsn we > > > > + * should push the AIL to ensure the push target (and hence the grant head) is > > > > + * no longer bound by the old log head location and can move forwards and make > > > > + * progress again. > > > > + */ > > > > +static void > > > > +xlog_state_set_callback( > > > > + struct xlog *log, > > > > + struct xlog_in_core *iclog, > > > > + xfs_lsn_t header_lsn) > > > > +{ > > > > + iclog->ic_state = XLOG_STATE_CALLBACK; > > > > + > > > > + ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn), header_lsn) <= 0); > > > > + > > > > + if (list_empty_careful(&iclog->ic_callbacks)) > > > > + return; > > > > + > > > > + atomic64_set(&log->l_last_sync_lsn, header_lsn); > > > > + xlog_grant_push_ail(log, 0); > > > > + > > > > > > Nit: extra whitespace line above. > > > > Fixed. > > > > > This still seems racy to me, FWIW. What if the AIL is empty (i.e. the > > > push is skipped)? > > > > If the AIL is empty, then it's a no-op because pushing on the AIL is > > not going to make more log space become free. Besides, that's not > > the problem being solved here - reservation wakeups on first insert > > into the AIL are already handled by xfs_trans_ail_update_bulk() and > > hence the first patch in the series. This patch is addressing the > > Nothing currently wakes up reservation waiters on first AIL insertion. Nor should it be necessary - it's the removal from the AIL that frees up log space, not insertion. The update operation is a remove followed by an insert - the remove part of that operation is what may free up log space, not the insert. Hence if we need to wake the log reservation waiters on first AIL insert to fix a bug, we haven't found the underlying problem is preventing log space from being freed... > > > i.e. the AIL is for moving the tail of the log - this code moves the > > head of the log. But both impact on the AIL push target (it is based on > > the distance between the head and tail), so we need > > to update the push target just in case this commit does not move > > the tail... > > > > > What if xfsaild completes this push before the > > > associated log items land in the AIL or we race with xfsaild emptying > > > the AIL? Why not just reuse/update the existing grant head wake up logic > > > in the iclog callback itself? E.g., something like the following > > > (untested): > > > > > And the raciness concerns..? AFAICT this still opens a race window where > the AIL can idle on the push target before AIL insertion. I don't know what race you see - if the AIL completes a push before we insert new objects at the head from the current commit, then it does not matter one bit because the items are being inserted at the log head, not the log tail where the pushing occurs at. If we are inserting objects into the AIL within the push target window, then there is something else very wrong going on, because when the log is out of space the push target should be nowhere near the LSN we are inserting inew objects into the AIL at. (i.e. they should be 3/4s of the log apart...) > > So, from #1 we see that unconditional wakeups are not necessary in > > the scenario you pose, and from #2 it's not a viable solution even > > if it was required. > > > > However, #1 indicates other problems if a xfs_log_space_wake() call > > is necessary in this case. No reservations space and an empty AIL > > implies that the CIL pins the entire log - a pending commit that > > hasn't finished flushing and the current context that is > > aggregating. This implies we've violated a much more important rule > > of the on-disk log format: finding the head and tail of the log > > requires no individual commit be larger than 50% of the log. > > > > I described this exact problem days ago in the original thread. There's > no need to rehash it here. FWIW, I can reproduce much worse than 50% log > consumption aggregated outside of the AIL with the current code and it > doesn't depend on a nonpreemptible kernel (though the workqueue fix > looks legit to me). I'm not rehashing anything intentionally - I'm responding to the questions you are asking me directly in this thread. Maybe I am going over something you've already mentioned in a previous thread, and maybe that hasn't occurred to me because you didn't reference it and the similarites didn't occur to me because I've spend more time looking at the code trying to understand how this "impossible situation" was occurring than reading mailing list discussions. I've been certain that we were seeing was some fundamental rule was being violated to cause this "log full, AIL empty", but I couldn't work out exactly what it was. I was even questioning whether I understood the basic operation of the log because there was no way I could see that CIL would not push during log recovery until the log was full. I said this to Darrick yesterday morning on #xfs: [5/9/19 12:56] there's something bothering me about this log head update thing and I can't put my finger on what it is.... It wasn't until Chandan's trace showed me the CPU hold-off problem with the CIL workqueue. A couple of hours later, after I'd seen Chandan's trace: [5/9/19 14:26] oooohhhh [5/9/19 14:27] this isn't a premeptible kernel, is it? And that was the thing that I couldn't put my finger on - I couldn't work out how a CIL push was being delayed so long on a multi-cpu system with lots of idle CPU that we had a completely empty AIL when we ran out of reservation space. IOWs, I didn't know the right question to ask until I saw the answer in front of me. I've never seen a "CIL checkpoint too large" issue manifiest in the real world, but it's been there since delayed logging was introduced. I knew about this issue right from the start, but it was largely a theoretical concern because workqueue scheduling preempts userspace and so is mostly only ever delayed by the number of transactions in a single syscall. And for large, ongoing transactions like a truncate, it will yield the moment we have to pull in metadata from disk. What's new in recent kernels is the in-core inode unlinked processing mechanisms have changed the way both the syscall and log recovery mechanisms work (merged in 5.1, IIRC), and it looks like it no longer blocks in log recovery like it used to. Given Christoph first reported this generic/530 issue in May there's a fair correlation indicating that the two are linked. i.e. we changed the unlinked inode processing in a way that the kernel can now runs tens of thousands of unlink transactions without yeilding the CPU. That violated the "CIL push work will run within a few transactions of the background push occurring" mechanism the workqueue provided us with and that, fundamentally, is the underlying issue here. It's not a CIL vs empty AIL vs log reservation exhaustion race condition - that's just an observable symptom. To that end, I have been prototyping patches to fix this exact problem as part of the non-blocking inode reclaim series. I've been looking at this because the CIL pins so much memory on large logs and I wanted to put an upper bound on it that wasn't measured in GBs of RAM. Hence I'm planning to pull these out into a separate series now as it's clear that non-preemptible kernels and workqueues do not play well together and that the more we use workqueues for async processing, the more we introduce a potential real-world vector for CIL overruns... Cheers, Dave. -- Dave Chinner david@fromorbit.com