From: Chris Mason
To: Dave Chinner, linux-xfs@vger.kernel.org
Subject: [PATCH RFC] xfs: drop SYNC_WAIT from xfs_reclaim_inodes_ag during slab reclaim
Date: Fri, 14 Oct 2016 08:27:24 -0400
Message-ID: <06aade22-b29e-f55e-7f00-39154f220aa6@fb.com>
List-Id: xfs

Hi Dave,

This is part of a series of patches we're growing to fix a perf
regression on a few straggler tiers that are still on v3.10.  In this
case, hadoop had to switch back to v3.10 because the workload is as
much as 15% slower on recent v4.x kernels.

Between v3.10 and v4.x, kswapd is less effective overall.  This leads
more and more procs to get bogged down in direct reclaim, blocked on
the SYNC_WAIT in xfs_reclaim_inodes_ag().  Since slab shrinking happens
very early in direct reclaim, we've seen systems with 130GB of RAM
where hundreds of procs are stuck on the XFS slab shrinker, fighting to
walk a slab that is only 900MB in size.  They'd have better luck moving
on to the page cache instead.

Also, we're going into direct reclaim much more often than we should
because kswapd is getting stuck on XFS inode locks and writeback.
Dropping the SYNC_WAIT means that kswapd can move on to other things
and let the async worker threads get kicked to work on the inodes.

We're still working on the series, and this patch is only compile
tested against current Linus git.  I'm working out some better
simulations of the hadoop workload to stuff into Mel's tests.  Numbers
from prod take roughly 3 days to stabilize, so I haven't isolated this
patch from the rest of the series.  On unpatched v4.x, our allocation
stall rate goes as high as 200-300/sec and averages around 70/sec; the
series I'm finalizing gets that number down to < 1/sec.

Omar Sandoval did some digging and found you added the SYNC_WAIT in
response to a workload I sent ages ago.  I tried to make this OOM with
fsmark creating empty files, and it has been soaking in
memory-constrained workloads in production for almost two weeks.

Signed-off-by: Chris Mason
---
 fs/xfs/xfs_icache.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index bf2d607..63938fb 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1195,7 +1195,7 @@ xfs_reclaim_inodes_nr(
 	xfs_reclaim_work_queue(mp);
 	xfs_ail_push_all(mp->m_ail);
 
-	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan);
+	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
 }
 
 /*
-- 
2.9.3
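
For reference, the async side that xfs_reclaim_inodes_nr() is already
kicking via xfs_reclaim_work_queue() looks roughly like the sketch
below; once kswapd stops passing SYNC_WAIT, this background worker
(plus the AIL push) is what grinds through the reclaimable inodes.
This is paraphrased from memory of current xfs_icache.c, not copied
from the tree this patch is against, so treat it as a sketch:

	/*
	 * Background inode reclaim: do a non-blocking pass over the
	 * reclaimable inodes, then requeue ourselves via
	 * xfs_reclaim_work_queue() if any remain tagged for reclaim.
	 */
	void
	xfs_reclaim_worker(
		struct work_struct	*work)
	{
		struct xfs_mount *mp = container_of(to_delayed_work(work),
						struct xfs_mount, m_reclaim_work);

		xfs_reclaim_inodes(mp, SYNC_TRYLOCK);
		xfs_reclaim_work_queue(mp);
	}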