Date: Fri, 11 Nov 2016 10:02:00 +1100
From: Dave Chinner
To: Lucas Stach
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 0/2] XFS buffer cache scalability improvements
Message-ID: <20161110230200.GI28922@dastard>
In-Reply-To: <20161018212116.GC23194@dastard>
References: <1476821653-2595-1-git-send-email-dev@lynxeye.de>
 <20161018212116.GC23194@dastard>

Hi Lucas,

On Wed, Oct 19, 2016 at 08:21:16AM +1100, Dave Chinner wrote:
> On Tue, Oct 18, 2016 at 10:14:11PM +0200, Lucas Stach wrote:
> > The second patch is logical follow up. The rhashtable cache index is protected by
> > RCU and does not need any additional locking. By switching the buffer cache entries
> > over to RCU freeing the buffer cache can be operated in a completely lock-free
> > manner. This should help scalability in the long run.
>
> Yup, that's another reason I'd considered rhashtables :P
>
> However, this is where it gets hairy. The buffer lifecycle is
> intricate, subtle, and has a history of nasty bugs that just never
> seem to go away. This change will require a lot of verification
> work to ensure things like the LRU manipulations haven't been
> compromised by the removal of this lock...
.....
> It's a performance modification - any performance/profile numbers
> that show the improvement?

Here's why detailed performance measurement for changes like this is
important: RCU freeing and lockless lookups are not a win for the XFS
buffer cache.

I fixed the code to be RCU safe (use bp->b_lock, XFS_BSTATE_FREE and
memory barriers) and verified that it worked OK (no regressions over
a nightly xfstests cycle across all my test machines), so this morning
I've run performance tests against it. I've also modified the
rhashtable patch with all my review comments:

[dchinner: reduce minimum hash size to an acceptable size for large
	   filesystems with many AGs with no active use.]
[dchinner: remove stale rbtree asserts.]
[dchinner: use xfs_buf_map for compare function argument.]
[dchinner: make functions static.]
[dchinner: remove redundant comments.]

The results show that RCU freeing significantly slows down my fsmark
file creation benchmark (https://lkml.org/lkml/2016/3/31/1161) that
hammers the buffer cache - it is not uncommon for profiles to show
10-11% CPU usage in _xfs_buf_find() with the rbtree implementation.

These tests were run on 4.9-rc4+for-next:

			files/s		  wall time	sys CPU
rbtree:			220910+/-1.9e+04  4m21.639s	48m17.206s
rhashtable:		227518+/-1.9e+04  4m22.179s	45m41.639s
with RCU freeing:	198181+/-3e+04	  4m45.286s	51m36.842s

So we can see that rbtree->rhashtable reduces system time and increases
the create rate a little, but not significantly, and the overall runtime
is pretty much unchanged. However, adding RCU lookup/freeing to the
rhashtable shows a significant degradation: a 10% decrease in create
rate, a 50% increase in create rate stddev, and a 10% increase in system
time. Not good.

The reason for this change is quite obvious from my monitoring: there is
a significant change in the memory usage footprint and memory reclaim
overhead.
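For reference, the RCU-safe variant I tested works roughly along these
lines. This is only a minimal sketch, not the actual patch - helper and
field names such as xfs_buf_find_rcu(), pag_buf_hash, xfs_buf_hash_params,
b_rhash_head, b_rcu and xfs_buf_cache are placeholders for illustration -
but it shows the lookup revalidation under bp->b_lock/XFS_BSTATE_FREE and
the deferred free that matter for the discussion below:

#include <linux/rhashtable.h>
#include <linux/rcupdate.h>
#include <linux/spinlock.h>
#include <linux/slab.h>

/*
 * Lookup side: the hash walk runs under rcu_read_lock(), so it can race
 * with a concurrent free. Revalidate under bp->b_lock before taking a
 * hold reference; if the buffer is already marked free, the caller has
 * to retry the lookup.
 */
static struct xfs_buf *
xfs_buf_find_rcu(
	struct xfs_perag	*pag,
	struct xfs_buf_map	*map)
{
	struct xfs_buf		*bp;

	rcu_read_lock();
	bp = rhashtable_lookup(&pag->pag_buf_hash, map, xfs_buf_hash_params);
	if (bp) {
		spin_lock(&bp->b_lock);
		if (bp->b_state & XFS_BSTATE_FREE) {
			/* raced with a free - not a valid match */
			spin_unlock(&bp->b_lock);
			bp = NULL;
		} else {
			atomic_inc(&bp->b_hold);
			spin_unlock(&bp->b_lock);
		}
	}
	rcu_read_unlock();
	return bp;
}

static void
xfs_buf_free_rcu_callback(
	struct rcu_head		*head)
{
	struct xfs_buf		*bp = container_of(head, struct xfs_buf, b_rcu);

	/* slab cache assumed; name illustrative */
	kmem_cache_free(xfs_buf_cache, bp);
}

/*
 * Free side: mark the buffer dead under b_lock so concurrent lookups
 * reject it, remove it from the hash, then defer the actual memory
 * release until after the current RCU grace period.
 */
static void
xfs_buf_free_rcu(
	struct xfs_buf		*bp)
{
	spin_lock(&bp->b_lock);
	bp->b_state |= XFS_BSTATE_FREE;
	spin_unlock(&bp->b_lock);

	rhashtable_remove_fast(&bp->b_pag->pag_buf_hash, &bp->b_rhash_head,
			       xfs_buf_hash_params);

	/* memory is not released until the grace period ends */
	call_rcu(&bp->b_rcu, xfs_buf_free_rcu_callback);
}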
RCU freeing delays the freeing of the buffers, which means the buffer
cache shrinker is not actually freeing memory when demand occurs.
Instead, freeing is delayed to the end of the RCU grace period and hence
does not relieve pressure quickly. Memory reclaim therefore transfers
that pressure to other caches, increasing reclaim scanning work and
allocation latency. The result is higher reclaim CPU usage, a very
different memory usage profile over the course of the test and,
ultimately, lower performance.

Further, because we normally cycle through buffers so fast, RCU freeing
means that we are no longer finding hot buffers in the slab cache during
allocation. The idea of the slab cache is that on heavily cycled slabs
the objects being allocated are the ones that were just freed and so are
still hot in the CPU caches. When we switch to RCU freeing, this no
longer happens because freeing only ever happens in sparse, periodic
batches. Hence we end up allocating cache-cold buffers and so take more
cache misses when first allocating new buffers.

IOWs, the loss of performance due to RCU freeing is not made up by the
anticipated reduction in the overhead of uncontended locks on lookup.
This is mainly because there is no measurable lookup locking overhead
now that rhashtables are used.

rbtree CPU profile:

   -    9.39%  _xfs_buf_find
           0.92%  xfs_perag_get
      -    0.90%  xfs_buf_trylock
              0.71%  down_trylock

rhashtable:

   -    2.62%  _xfs_buf_find
      -    1.12%  xfs_perag_get
         +    0.58%  radix_tree_lookup
      -    0.91%  xfs_buf_trylock
              0.82%  down_trylock

rhashtable+RCU:

   -    2.31%  _xfs_buf_find
           0.91%  xfs_perag_get
      -    0.83%  xfs_buf_trylock
              0.75%  down_trylock

So with the rhashtable change in place, we've already removed the cause
of the pag_buf_lock contention (the rbtree pointer chasing), so there
just isn't any overhead that using RCU can optimise away. Hence there
are no gains to amortise the efficiency losses that RCU freeing
introduces, and as a result using RCU is slower than traditional
locking techniques.

I'll keep testing the rhashtable code - it looks solid enough at this
point to consider it for the 4.10 cycle.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com