From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Fri, 13 Jul 2018 10:36:14 +1000 From: Dave Chinner To: James Bottomley Cc: Linus Torvalds , Matthew Wilcox , Waiman Long , Michal Hocko , Al Viro , Jonathan Corbet , "Luis R. Rodriguez" , Kees Cook , Linux Kernel Mailing List , linux-fsdevel , linux-mm , "open list:DOCUMENTATION" , Jan Kara , Paul McKenney , Andrew Morton , Ingo Molnar , Miklos Szeredi , Larry Woodman , "Wangkai (Kevin,C)" Subject: Re: [PATCH v6 0/7] fs/dcache: Track & limit # of negative dentries Message-ID: <20180713003614.GW2234@dastard> References: <9f24c043-1fca-ee86-d609-873a7a8f7a64@redhat.com> <1531330947.3260.13.camel@HansenPartnership.com> <18c5cbfe-403b-bb2b-1d11-19d324ec6234@redhat.com> <1531336913.3260.18.camel@HansenPartnership.com> <4d49a270-23c9-529f-f544-65508b6b53cc@redhat.com> <1531411494.18255.6.camel@HansenPartnership.com> <20180712164932.GA3475@bombadil.infradead.org> <1531416080.18255.8.camel@HansenPartnership.com> <1531425435.18255.17.camel@HansenPartnership.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1531425435.18255.17.camel@HansenPartnership.com> Sender: owner-linux-mm@kvack.org List-ID: On Thu, Jul 12, 2018 at 12:57:15PM -0700, James Bottomley wrote: > What surprises me most about this behaviour is the steadiness of the > page cache ... I would have thought we'd have shrunk it somewhat given > the intense call on the dcache. Oh, good, the page cache vs superblock shrinker balancing still protects the working set of each cache the way it's supposed to under heavy single cache pressure. :) Keep in mind that the amount of work slab cache shrinkers perform is directly proportional to the amount of page cache reclaim that is performed and the size of the slab cache being reclaimed. IOWs, under a "single cache pressure" workload we should be directing reclaim work to the huge cache creating the pressure and do very little reclaim from other caches.... [ What follows from here is conjecture, but is based on what I've seen in the past 10+ years on systems with large numbers of negative dentries and fragmented dentry/inode caches. ] However, this only reaches steady state if the reclaim rate can keep ahead of the allocation rate. This single threaded micro-workload won't result in an internally fragmented dentry slab cache, so reclaim is going to be as efficient as possible and have the CPU to keep up with the allocation rate. i.e. Bulk negative dentry reclaim is cheap, in LRU order, and frees slab pages quickly and efficiently in large batches so steady state is easily reached. Problems arise when the slab *page* reclaim rate drops below allocation rate. i.e when you have short term (negative) dentries mixed into the same slab pages as long term stable dentries. This causes the dentry cache to fragment internally - reclaim hits the negative dentries and creates large numbers of partial pages - and so reclaim of negative dentries will fail to free memory. Creating new negative dentries then fills these partial pages first, and so the alloc/reclaim cycles on negative dentries only ever produce partial pages and never free slab cache pages. IOWs, the cost of reclaim slab *pages* goes way up despite the fact that the cost of reclaiming individual dentries has remained the same. That's the underlying problem here - the cost of reclaiming dentries is constant but the cost of reclaiming *slab pages* is not. It is not uncommon to have to trash 90% of the dentry or inode caches to reduce internal fragmentation down to the point where pages start to get freed and the per-slab-page reclaim cost reduces to be less than the allocation cost. Then we see the system return to normal steady state behaviour. In situations where lots of negative dentries are created by production workloads, that "90%" of the cache that needs to be reclaimed to fix the internal fragmentation issue is all negative dentries and just enough of the real dentries to be freeing quantities of partial pages in the slab. Hence negative dentries are seen as the problem because they make up the vast majority of the dentries that get reclaimed when the problem goes away. By limiting the number of negative dentries in this case, internal slab fragmentation is reduced such that reclaim cost never gets out of control. While it appears to "fix" the symptoms, it doesn't address the underlying problem. It is a partial solution at best but at worst it's another opaque knob that nobody knows how or when to tune. Very few microbenchmarks expose this internal slab fragmentation problem because they either don't run long enough, don't create memory pressure, or don't have access patterns that mix long and short term slab objects together in a way that causes slab fragmentation. Run some cold cache directory traversals (git status?) at the same time you are creating negative dentries so you create pinned partial pages in the slab cache and see how the behaviour changes.... Cheers, Dave. -- Dave Chinner david@fromorbit.com