Date: Sun, 16 Oct 2016 09:34:54 +1100
From: Dave Chinner <david@fromorbit.com>
To: Chris Mason
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim
Message-ID: <20161015223454.GS23194@dastard>
In-Reply-To: <06aade22-b29e-f55e-7f00-39154f220aa6@fb.com>
References: <06aade22-b29e-f55e-7f00-39154f220aa6@fb.com>
List-Id: xfs

On Fri, Oct 14, 2016 at 08:27:24AM -0400, Chris Mason wrote:
> Hi Dave,
>
> This is part of a series of patches we're growing to fix a perf
> regression on a few straggler tiers that are still on v3.10. In this
> case, hadoop had to switch back to v3.10 because v4.x is as much as
> 15% slower on recent kernels.
>
> Between v3.10 and v4.x, kswapd is less effective overall. This leads
> more and more procs to get bogged down in direct reclaim using
> SYNC_WAIT in xfs_reclaim_inodes_ag().
>
> Since slab shrinking happens very early in direct reclaim, we've seen
> systems with 130GB of RAM where hundreds of procs are stuck on the xfs
> slab shrinker fighting to walk a slab 900MB in size. They'd have
> better luck moving on to the page cache instead.

We've already scanned the page cache for direct reclaim by the time we
get to running the shrinkers. Indeed, the amount of work the shrinkers
do is directly controlled by the amount of work done scanning the page
cache beforehand....

> Also, we're going into direct reclaim much more often than we should
> because kswapd is getting stuck on XFS inode locks and writeback.

Where and what locks, exactly?
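[That coupling can be illustrated with a toy model, loosely based on the
scan accounting in the kernel's shrink_slab(); the constants and names
here are simplifications, not the real kernel code:]

```python
def shrinker_scan_count(nr_pages_scanned, lru_pages, freeable, seeks=2):
    """Objects a shrinker is asked to scan in one reclaim pass.

    Loosely modelled on shrink_slab(): the slab scan count is scaled
    by the fraction of the page cache LRU that was just scanned, so
    light page cache pressure means light shrinker pressure.
    """
    if nr_pages_scanned == 0:
        return 0
    delta = (4 * nr_pages_scanned) // seeks
    delta = delta * freeable // (lru_pages + 1)
    return min(delta, freeable)

# Light vs heavy page cache scanning against a ~4GB LRU (in 4KB pages)
# and ~100k reclaimable inodes:
light = shrinker_scan_count(nr_pages_scanned=32, lru_pages=1 << 20,
                            freeable=100_000)
heavy = shrinker_scan_count(nr_pages_scanned=1 << 18, lru_pages=1 << 20,
                            freeable=100_000)
# Heavy page cache scanning drives orders of magnitude more shrinker
# work than light scanning does.
```

[i.e. in this model, as in vmscan, the shrinkers only get asked to do a
lot of work when a lot of page cache scanning has already happened.]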
> Dropping the SYNC_WAIT means that kswapd can move on to other things
> and let the async worker threads get kicked to work on the inodes.

Correct me if I'm wrong, but we introduced this shrinker behaviour long
before 3.10. I think the /explicit/ SYNC_WAIT was added some time
around 3.0 when I added the background async reclaim, but IIRC the
synchronous reclaim behaviour predates that by quite a bit. IOWs, XFS
shrinkers have the same blocking behaviour in 3.10 as they do in
current 4.x kernels.

Hence if you're getting problems with excessive blocking during reclaim
on more recent 4.x kernels, then it's more likely there is a change in
memory reclaim balance in the vmscan code that drives the shrinkers -
that has definitely changed between 3.10 and 4.x, but the XFS shrinker
behaviour has not.

What XFS is doing is not wrong - the synchronous behaviour is the
primary memory reclaim feedback mechanism that prevents reclaim from
trashing the working set of clean inodes when under memory pressure.
It's also the choke point where we prevent lots of concurrent threads
from trying to do reclaim at once, contending on locks and inodes and
causing catastrophic IO breakdown, because such reclaim results in
random IO patterns for inode writeback instead of nice clean ascending
offset ordered IO.

This happens to be exactly the situation you are describing - hundreds
of processes all trying to run inode reclaim concurrently, which, if
we don't block, will rapidly trash the clean inode cache, leaving only
the dirty inodes around. Hence removing the SYNC_WAIT simply breaks
the overload prevention and throttling feedback mechanisms the XFS
inode shrinker uses to maintain performance under severe memory
pressure.

> We're still working on the series, and this is only compile tested on
> current Linus git. I'm working out some better simulations for the
> hadoop workload to stuff into Mel's tests.
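[The working set trashing described above can be seen in a toy model.
This is pure illustration, not XFS code: reclaimers that never wait on
dirty inodes can only ever evict clean ones, so non-blocking passes
strip the clean cache and then stop making progress entirely:]

```python
def reclaim_pass(cache, blocking):
    """One reclaim pass over an inode cache: {inode: 'clean'|'dirty'}."""
    freed = 0
    for inode, state in list(cache.items()):
        if state == "clean":
            del cache[inode]            # clean: evict immediately
            freed += 1
        elif blocking:
            del cache[inode]            # SYNC_WAIT-style reclaim: wait
            freed += 1                  # for writeback, then evict
        # non-blocking reclaim skips dirty inodes and moves on
    return freed

# 1000 cached inodes, one quarter dirty:
cache = {i: ("dirty" if i % 4 == 0 else "clean") for i in range(1000)}
first = reclaim_pass(cache, blocking=False)   # evicts all 750 clean inodes
second = reclaim_pass(cache, blocking=False)  # nothing left it can free
# first == 750, second == 0: the clean working set is gone, the 250
# dirty inodes remain, and further reclaim makes no progress at all.
```

[With blocking=True the passes throttle on writeback instead, which is
exactly the feedback the SYNC_WAIT provides.]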
> Numbers from prod take roughly 3 days to stabilize, so I haven't
> isolated this patch from the rest of the series.
>
> On unpatched v4.x, our base allocation stall rate goes up to as much
> as 200-300/sec, averaging 70/sec. The series I'm finalizing gets that
> number down to < 1/sec.

That doesn't mean you've 'fixed' the problem. AFAICT you're just
disconnecting memory allocation from the rate at which inode reclaim
can clean and reclaim inodes. Yes, direct reclaim stalls will go away,
but that doesn't speed up inode reclaim, nor does it do anything to
prevent random inode writeback patterns.

With the amount of memory your machines have, removing SYNC_WAIT will
simply hide the problem by leaving the inodes dirty until journal
pressure forces them to be cleaned, at which point the background
reclaimer can free them. In general, that doesn't work particularly
well, because journals can index more dirty objects than can be easily
cached in memory, and that's why we have to drive IO and throttle
direct reclaim.

> Omar Sandoval did some digging and found you added the SYNC_WAIT in
> response to a workload I sent ages ago.

That commit (which I had to go find because you helpfully didn't quote
it) was a7b339f1b869 ("xfs: introduce background inode reclaim work").
It introduced asynchronous background reclaim work, and so made the
reclaim function able to handle both async and sync reclaim. To
maintain the direct reclaim throttling behaviour of the shrinker, that
function now needed to be told to be synchronous, hence the addition of
the SYNC_WAIT. We didn't introduce sync reclaim with this commit in
2.6.39(!); we've had that behaviour since well before that. Hence if
the analysis performed stopped at this point in history, it was flawed.

> I tried to make this OOM with fsmark creating empty files, and it has
> been soaking in memory constrained workloads in production for almost
> two weeks.
That workload reclaims mostly from the background worker because:

a) the majority of allocations are GFP_NOFS, and so the inode shrinker
   doesn't run very often;

b) inode allocation is sequential, so inode writeback optimises into
   large sequential writes, not slow random writes;

c) it generates significant journal pressure, and so drives inode
   writeback through tail-pushing the journal; and

d) sequential, ascending order inode allocation and the inode writeback
   optimisations mean reclaim rarely comes across dirty inodes, because
   it reclaims inodes in the same order they are allocated and cleaned.

I'm sure there's a solution to whatever problem is occurring at FB, but
you're going to need to start by describing the problems and
documenting the analysis so we can understand what the problem is
before we start discussing potential solutions. It seems that the
problem is excessive direct reclaim concurrency, with XFS throttling
that concurrency back to the maximum the underlying filesystem can
sanely optimise for, but nobody has explained why that is happening on
4.x and not 3.10.

i.e. we need to clearly understand the root cause of the problem before
we discuss potential solutions. Changes to long standing (and
unchanged) behaviours might end up being the fix, but it's not
something we'll do without first understanding the problem or exploring
other potential solutions...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
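[On point a) above: a toy model of why GFP_NOFS allocations rarely
drive filesystem shrinkers - vmscan only invokes them when the
allocation context allows recursion into the filesystem. The flag
names mirror the kernel's, but the values and structure here are
illustrative assumptions, not the real implementation:]

```python
__GFP_FS = 0x80          # illustrative value, not the kernel's
GFP_KERNEL = __GFP_FS    # the real GFP_KERNEL sets __GFP_FS (among others)
GFP_NOFS = 0x00          # __GFP_FS cleared: no fs recursion allowed

def shrink_slab(gfp_mask, shrinkers):
    """Run each shrinker unless it needs fs context this caller lacks."""
    ran = []
    for name, needs_fs in shrinkers:
        if needs_fs and not (gfp_mask & __GFP_FS):
            continue     # recursing into the fs here could deadlock
        ran.append(name)
    return ran

shrinkers = [("xfs-inode", True), ("dentry", True), ("tcp-metrics", False)]
# GFP_KERNEL allocations can run the filesystem shrinkers; GFP_NOFS
# allocations skip them, so a GFP_NOFS-heavy workload leans almost
# entirely on the background reclaim worker instead.
```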