From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 04/26] xfs: Improve metadata buffer reclaim accountability
Date: Thu, 31 Oct 2019 14:05:51 -0700
Message-ID: <20191031210551.GK15221@magnolia>
In-Reply-To: <20191031205049.GS4614@dread.disaster.area>

On Fri, Nov 01, 2019 at 07:50:49AM +1100, Dave Chinner wrote:
> On Wed, Oct 30, 2019 at 08:06:58PM -0700, Darrick J. Wong wrote:
> > On Thu, Oct 31, 2019 at 08:43:35AM +1100, Dave Chinner wrote:
> > > On Wed, Oct 30, 2019 at 10:25:17AM -0700, Darrick J. Wong wrote:
> > > > On Wed, Oct 09, 2019 at 02:21:02PM +1100, Dave Chinner wrote:
> > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > >
> > > > > The buffer cache shrinker frees more than just the xfs_buf slab
> > > > > objects - it also frees the pages attached to the buffers. Make sure
> > > > > the memory reclaim code accounts for this memory being freed
> > > > > correctly, similar to how the inode shrinker accounts for pages
> > > > > freed from the page cache due to mapping invalidation.
> > > > >
> > > > > We also need to make sure that the mm subsystem knows these are
> > > > > reclaimable objects. We provide the memory reclaim subsystem with
> > > > > a shrinker to reclaim xfs_bufs, so we should really mark the slab
> > > > > that way.
> > > > >
> > > > > We also have a lot of xfs_bufs in a busy system, spread them around
> > > > > like we do inodes.
> > > > >
> > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > > ---
> > > > >  fs/xfs/xfs_buf.c | 6 +++++-
> > > > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > > > index e484f6bead53..45b470f55ad7 100644
> > > > > --- a/fs/xfs/xfs_buf.c
> > > > > +++ b/fs/xfs/xfs_buf.c
> > > > > @@ -324,6 +324,9 @@ xfs_buf_free(
> > > > >
> > > > >  			__free_page(page);
> > > > >  		}
> > > > > +		if (current->reclaim_state)
> > > > > +			current->reclaim_state->reclaimed_slab +=
> > > > > +							bp->b_page_count;
> > > >
> > > > Hmm, ok, I see how ZONE_RECLAIM and reclaimed_slab fit together.
> > > >
> > > > >  	} else if (bp->b_flags & _XBF_KMEM)
> > > > >  		kmem_free(bp->b_addr);
> > > > >  	_xfs_buf_free_pages(bp);
> > > > > @@ -2064,7 +2067,8 @@ int __init
> > > > >  xfs_buf_init(void)
> > > > >  {
> > > > >  	xfs_buf_zone = kmem_zone_init_flags(sizeof(xfs_buf_t), "xfs_buf",
> > > > > -						KM_ZONE_HWALIGN, NULL);
> > > > > +			KM_ZONE_HWALIGN | KM_ZONE_SPREAD | KM_ZONE_RECLAIM,
> > > >
> > > > I guess I'm fine with ZONE_SPREAD too, insofar as it only seems to apply
> > > > to a particular "use another node" memory policy when slab is in use.
> > > > Was that your intent?
> > >
> > > It's more documentation than anything - that we shouldn't be piling
> > > these structures all on to one node because that can have severe
> > > issues with NUMA memory reclaim algorithms. i.e. the xfs-buf
> > > shrinker sets SHRINKER_NUMA_AWARE, so memory pressure on a single
> > > node can reclaim all the xfs-bufs on that node without touching any
> > > other node.
> > >
> > > That means, for example, if we instantiate all the AG header buffers
> > > on a single node (e.g. like we do at mount time) then memory
> > > pressure on that one node will generate IO stalls across the entire
> > > filesystem as other nodes doing work have to repopulate the buffer
> > > cache for any allocation or freeing of space/inodes..
> > >
> > > IOWs, for large NUMA systems using cpusets this cache should be
> > > spread around all of memory, especially as it has NUMA aware
> > > reclaim. For everyone else, it's just documentation that improper
> > > cgroup or NUMA memory policy could cause you all sorts of problems
> > > with this cache.
> > >
> > > It's worth noting that SLAB_MEM_SPREAD is used almost exclusively in
> > > filesystems for inode caches largely because, at the time (~2006),
> > > the only reclaimable cache that could grow to any size large enough
> > > to cause problems was the inode cache. It's been cargo-culted ever
> > > since, whether it is needed or not (e.g. ceph).
> > >
> > > In the case of the xfs_bufs, I've been running workloads recently
> > > that cache several million xfs_bufs and only a handful of inodes
> > > rather than the other way around. If we spread inodes because
> > > caching millions on a single node can cause problems on large NUMA
> > > machines, then we also need to spread xfs_bufs...
> >
> > Hmm, could we capture this as a comment somewhere?
>
> Sure, but where? We're planning on getting rid of the KM_ZONE flags
> in the near future, and most of this is specific to the impacts on
> XFS. I could put it in xfs_super.c above where we initialise all the
> slabs, I guess. Probably a separate patch, though....

Sounds like a reasonable place (to me) to record the fact that we want
inodes and metadata buffers not to end up concentrating on a single
node.

At least until we start having NUMA systems with a separate "IO node" in
which to confine all the IO threads and whatnot <shudder>. :P

--D

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com