On Wed, May 04 2016, Dave Chinner wrote:

> FWIW, I don't think making evict() non-blocking is going to be worth
> the effort here. Making memory reclaim wait on a priority ordered
> queue while asynchronous reclaim threads run reclaim as efficiently
> as possible and wakes waiters as it frees the memory the waiters
> require is a model that has been proven to work in the past, and
> this appears to me to be the model you are advocating for. I agree
> that direct reclaim needs to die and be replaced with something
> entirely more predictable, controllable and less prone to deadlock
> contexts - you just need to convince the mm developers that it will
> perform and scale better than what we have now.
>
> In the mean time, having a slightly more fine grained GFP_NOFS
> equivalent context will allow us to avoid the worst of the current
> GFP_NOFS problems with very little extra code.

You have painted two pictures here. The first is an ideal which does
look a lot like the sort of outcome I was aiming for, but is more than
a small step away.

The second is a band-aid which would take us in exactly the wrong
direction. It takes an interface which people apparently find hard to
use (or easy to misuse) - the setting of __GFP_FS - and makes it more
complex. Certainly it would be more powerful, but I think it would
also be more misused.

So I ask myself: can we take some small steps towards 'A' and thereby
enable at least the functionality enabled by 'B'?

A core design principle for me is to enable filesystems to take
control of their own destiny. They should have the information
available to make the decisions they need to make, and the opportunity
to carry them out.

All the places where direct reclaim currently calls into filesystems
carry the 'gfp' flags so the filesystem can decide what to do, with
one exception: evict_inode. So my first proposal would be to rectify
that.

 - redefine .nr_cached_objects and .free_cached_objects so that, if
   they are defined, they are responsible for s_dentry_lru and
   s_inode_lru. e.g. super_cache_count *either* calls
   ->nr_cached_objects *or* makes two calls to list_lru_shrink_count.
   This would require exporting prune_dcache_sb and prune_icache_sb
   but otherwise should be a fairly straightforward change.
   If nr_cached_objects were defined, super_cache_scan would no longer
   abort without __GFP_FS - that test would be left to the filesystem.

 - Now any filesystem that wants to can stash its super_block pointer
   in current->journal_info while doing memory allocations, and abort
   any reclaim attempts (release_page, shrinker, nr_cached_objects) if
   and only if current->journal_info == "my superblock".
   This can be done without the core mm code knowing any more than it
   already does.

 - A more sophisticated filesystem might import much of the code for
   prune_icache_sb() - either by copy/paste or by exporting some vfs
   internals - and then store an inode pointer in
   current->journal_info and only abort reclaim which touches that
   inode.

 - If a filesystem happens to know that it will never block in any of
   these reclaim calls, it can always allow prune_dcache_sb to run,
   and never needs to use GFP_NOFS. I think NFS might be close to
   being able to do this as it flushes everything on last-close. But
   that is something that NFS developers can care about (or not) quite
   independently from mm people.

 - Maybe some fs developer will try to enable free_cached_objects to
   do as much work as possible for every inode, but never deadlock.
   It could do its own fs-specific deadlock detection, or could queue
   work to a work queue and wait a limited time for it. Or something.
   If some filesystem developer comes up with something that works
   really well, developers of other filesystems might copy it - or not
   as they choose.

Maybe ->journal_info isn't perfect for this. It is currently only
safe for reclaim code to compare it against a known value.
It is not safe to dereference it to see if it points to a known value.
That could possibly be cleaned up, or another task_struct field could
be provided for filesystems to track their state. Or do you find a
task_struct field unacceptable, and is there some reason that an
explicitly passed cookie is superior?

My key point is that we shouldn't try to plumb some new abstraction
through the MM code so there is a new pattern for all filesystems to
follow. Rather the mm/vfs should get out of the filesystems' way as
much as possible and let them innovate independently.

Thanks for your time,
NeilBrown
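
For illustration only, here is a minimal userspace sketch of the
journal_info idea described in the list above. It is not kernel code
and not a patch: struct super_block and struct task are cut-down
stand-ins, and current_task, fs_alloc_enter(), fs_alloc_exit(),
fs_reclaim_allowed() and fs_free_cached_objects() are hypothetical
names invented for this example. It assumes the filesystem sets the
marker around its own allocations and that reclaim only ever compares
the pointer, never dereferences it.

```c
#include <stdbool.h>
#include <stddef.h>

/* Cut-down stand-ins for kernel types; NOT the real definitions. */
struct super_block {
	const char *name;
	long cached;		/* objects this fs could free */
};

struct task {
	void *journal_info;	/* mirrors task_struct::journal_info */
};

static struct task this_task;
static struct task *current_task = &this_task;	/* stand-in for 'current' */

/*
 * Hypothetical helper: mark that this task is allocating on behalf of
 * 'sb'. Returns the previous value so that calls can nest.
 */
static void *fs_alloc_enter(struct super_block *sb)
{
	void *old = current_task->journal_info;

	current_task->journal_info = sb;
	return old;
}

/* Hypothetical helper: restore the marker saved by fs_alloc_enter(). */
static void fs_alloc_exit(void *old)
{
	current_task->journal_info = old;
}

/*
 * Reclaim into 'sb' is refused if and only if the marker is exactly
 * this superblock. Only a pointer comparison is made; the marker is
 * never dereferenced.
 */
static bool fs_reclaim_allowed(const struct super_block *sb)
{
	return current_task->journal_info != (const void *)sb;
}

/*
 * Hypothetical ->free_cached_objects: the filesystem, not the mm core,
 * decides whether it is safe to reclaim in this context.
 */
static long fs_free_cached_objects(struct super_block *sb, long nr_to_scan)
{
	long freed;

	if (!fs_reclaim_allowed(sb))
		return 0;

	freed = nr_to_scan < sb->cached ? nr_to_scan : sb->cached;
	sb->cached -= freed;
	return freed;
}
```

With two superblocks 'a' and 'b', a task that has called
fs_alloc_enter(&a) gets nothing back from
fs_free_cached_objects(&a, ...), while reclaim from 'b' proceeds as
normal. That is the finer-grained behaviour a blanket GFP_NOFS cannot
express, and the caller's gfp flags play no part in it.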