On Fri, Apr 29 2016, Steven Whitehouse wrote: > Hi, > > On 29/04/16 06:35, NeilBrown wrote: >> If we could similarly move evict() into kswapd (and I believe we can) >> then most file systems would do nothing in reclaim context that >> interferes with allocation context. > evict() is an issue, but moving it into kswapd would be a potential > problem for GFS2. We already have a memory allocation issue when > recovery is taking place and memory is short. The code path is as follows: > > 1. Inode is scheduled for eviction (which requires deallocation) > 2. The glock is required in order to perform the deallocation, which > implies getting a DLM lock > 3. Another node in the cluster fails, so needs recovery > 4. When the DLM lock is requested, it gets blocked until recovery is > complete (for the failed node) > 5. Recovery is performed using a userland fencing utility > 6. Fencing requires memory and then blocks on the eviction > 7. Deadlock (Fencing waiting on memory alloc, memory alloc waiting on > DLM lock, DLM lock waiting on fencing) You even have user-space in the loop there - impressive! You can't really pass GFP_NOFS to a user-space thread, can you :-? > > It doesn't happen often, but we've been looking at the best place to > break that cycle, and one of the things we've been wondering is whether > we could avoid deallocation evictions from memory related contexts, or > at least make it async somehow. I think "async" is definitely the answer and I think evict()/evict_inode() is the best place to focus attention. I can see now (thanks) that just moving the evict() call to kswapd isn't really a solution as it will just serve to block kswapd and so lots of other freeing of memory won't happen. I'm now imagining giving ->evict_inode() a "don't sleep" flag and allowing it to return -EAGAIN. In that case evict would queue the inode to kswapd (or maybe another thread) for periodic retry. The flag would only get set when prune_icache_sb() calls dispose_list() to call evict(). Other uses (e.g. unmount, iput) would still be synchronous. How difficult would it be to change gfs's evict_inode() to optionally never block? For this to work we would need to add a way for deactivate_locked_super() to wait for all the async evictions to complete. Currently prune_icache_sb() is called under s_umount. If we moved part of the eviction out of that lock some other synchronization would be needed. Possibly a per-superblock list of "inodes being evicted" would suffice. Thanks, NeilBrown > > The issue is that it is not possible to know in advance whether an > eviction will result in mearly writing things back to disk (because the > inode is being dropped from cache, but still resides on disk) which is > easy to do, or whether it requires a full deallocation (n_link==0) which > may require significant resources and time, > > Steve.