On Fri, Apr 29 2016, Michal Hocko wrote: > > One think I have learned is that shrinkers can be really complex and > getting rid of GFP_NOFS will be really hard so I would really like to > start the easiest way possible and remove the direct usage and replace > it by scope one which would at least _explain_ why it is needed. I think > this is a reasonable _first_ step and a large step ahead because we have > a good chance to get rid of a large number of those which were used > "just because I wasn't sure and this should be safe, right?". I wouldn't > be surprised if we end up with a very small number of both scope and > direct usage in the end. Yes, shrinkers can be complex. About two of them are. We could fix lots and lots of call sites, or fix two shrinkers. OK, that's a bit unfair as fixing one of the shrinkers involves changing many ->evict_inode() functions. But that would be a very focused change. I think your proposal is little more than re-arranging deck chairs on the titanic. Yes, it might give everybody a better view of the iceberg but the iceberg is still there and in reality we can already see it. The main iceberg is evict_inode. It appears in both examples given so far: xfs and gfs. There are other little icebergs but they won't last long after evict_inode is dealt with. One particular problem with your process-context idea is that it isn't inherited across threads. Steve Whitehouse's example in gfs shows how allocation dependencies can even cross into user space. A more localized one that I have seen is that NFSv4 sometimes needs to start up a state-management thread (particularly if the server restarted). It uses kthread_run(), which doesn't actually create the thread but asks kthreadd to do it. If NFS writeout is waiting for state management it would need to make sure that kthreadd runs in allocation context to avoid deadlock. I feel that I've forgotten some important detail here and this might have been fixed somehow, but the point still stands that the allocation context can cross from thread to thread and can effectively become anything and everything. It is OK to wait for memory to be freed. It is not OK to wait for any particular piece of memory to be freed because you don't always know who is waiting for you, or who you really are waiting on to free that memory. Whenever trying to free memory I think you need to do best-effort without blocking. > > I would also like to revisit generic inode/dentry shrinker and see > whether it could be more __GFP_FS friendly. As you say many FS might > even not depend on some FS internal locks so pushing GFP_FS check down > the layers might make a lot of sense and allow to clean some [id]cache > even for __GFP_FS context. I think the only part of prune_dcache_sb() that might need care is iput() which boils down to evict(). The final unlink for NFS silly-rename might happen in there too (in d_iput). shrinking the dcache seems rather late to be performing that unlink though, so I've probably missed some key detail. If we find a way for evict(), when called from the shrinker, to be non-blocking, and generally require all shrinkers to be non-blocking, then many of these allocation problems evaporate. Thanks, NeilBrown