* [PATCH 0/2] GFP_NOFAIL cleanups @ 2016-12-01 15:25 Michal Hocko 2016-12-01 15:25 ` [PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath Michal Hocko 2016-12-01 15:25 ` [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically Michal Hocko 0 siblings, 2 replies; 23+ messages in thread From: Michal Hocko @ 2016-12-01 15:25 UTC (permalink / raw) To: Andrew Morton Cc: Vlastimil Babka, Tetsuo Handa, Johannes Weiner, Mel Gorman, David Rientjes, linux-mm, LKML Hi, I have posted this as an RFC previously [1]; as there was no fundamental disagreement, I would now like to ask for inclusion. Tetsuo has noticed [2] that recent changes have changed GFP_NOFAIL semantics for costly order requests. I believe that the primary reason why this happened is that our GFP_NOFAIL checks are too scattered and it is really easy to forget about adding one. That's why I am proposing patch 1, which consolidates all the nofail handling in a single place. This should make this code easier to maintain. Patch 2 on top is a further attempt to make GFP_NOFAIL semantics less surprising. As things stand currently, GFP_NOFAIL overrides the oom killer prevention code, which is both subtle and not really needed. Patch 2 has more details about the issues this might cause. I would consider both patches more a cleanup than anything else. Any feedback is highly appreciated. [1] http://lkml.kernel.org/r/20161123064925.9716-1-mhocko@kernel.org [2] http://lkml.kernel.org/r/1479387004-5998-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath 2016-12-01 15:25 [PATCH 0/2] GFP_NOFAIL cleanups Michal Hocko @ 2016-12-01 15:25 ` Michal Hocko 2016-12-01 15:25 ` [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically Michal Hocko 1 sibling, 0 replies; 23+ messages in thread From: Michal Hocko @ 2016-12-01 15:25 UTC (permalink / raw) To: Andrew Morton Cc: Vlastimil Babka, Tetsuo Handa, Johannes Weiner, Mel Gorman, David Rientjes, linux-mm, LKML, Michal Hocko From: Michal Hocko <mhocko@suse.com> Tetsuo Handa has pointed out that 0a0337e0d1d1 ("mm, oom: rework oom detection") has subtly changed the semantics for costly high order requests with __GFP_NOFAIL and without __GFP_REPEAT, and those can fail right now. My code inspection didn't reveal any such users in the tree but it is true that this might lead to unexpected allocation failures and subsequent OOPs. __alloc_pages_slowpath wrt. GFP_NOFAIL is hard to follow currently. There are a few special cases but we are lacking a catch-all place to be sure we will not miss any case where the non-failing allocation might fail. This patch reorganizes the code a bit and puts all those special cases under the nopage label, which is the generic go-to-fail path. Non-failing allocations are retried, while those that cannot retry, like non-sleeping allocations, go to the failure point directly. This should make the code flow much easier to follow and make it less error prone for future changes. While we are at it, we have to move the stall check up to catch potentially looping non-failing allocations. 
Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> --- mm/page_alloc.c | 68 ++++++++++++++++++++++++++++++++++----------------------- 1 file changed, 41 insertions(+), 27 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 0fbfead6aa7d..76c0b6bb0baf 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3627,32 +3627,23 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, goto got_pg; /* Caller is not willing to reclaim, we can't balance anything */ - if (!can_direct_reclaim) { - /* - * All existing users of the __GFP_NOFAIL are blockable, so warn - * of any new users that actually allow this type of allocation - * to fail. - */ - WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL); + if (!can_direct_reclaim) goto nopage; + + /* Make sure we know about allocations which stall for too long */ + if (time_after(jiffies, alloc_start + stall_timeout)) { + warn_alloc(gfp_mask, + "page allocation stalls for %ums, order:%u", + jiffies_to_msecs(jiffies-alloc_start), order); + stall_timeout += 10 * HZ; } /* Avoid recursion of direct reclaim */ - if (current->flags & PF_MEMALLOC) { - /* - * __GFP_NOFAIL request from this context is rather bizarre - * because we cannot reclaim anything and only can loop waiting - * for somebody to do a work for us. 
- */ - if (WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) { - cond_resched(); - goto retry; - } + if (current->flags & PF_MEMALLOC) goto nopage; - } /* Avoid allocations with no watermarks from looping endlessly */ - if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL)) + if (test_thread_flag(TIF_MEMDIE)) goto nopage; @@ -3679,14 +3670,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT)) goto nopage; - /* Make sure we know about allocations which stall for too long */ - if (time_after(jiffies, alloc_start + stall_timeout)) { - warn_alloc(gfp_mask, - "page allocation stalls for %ums, order:%u", - jiffies_to_msecs(jiffies-alloc_start), order); - stall_timeout += 10 * HZ; - } - if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags, did_some_progress > 0, &no_progress_loops)) goto retry; @@ -3715,6 +3698,37 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, } nopage: + /* + * Make sure that __GFP_NOFAIL request doesn't leak out and make sure + * we always retry + */ + if (gfp_mask & __GFP_NOFAIL) { + /* + * All existing users of the __GFP_NOFAIL are blockable, so warn + * of any new users that actually require GFP_NOWAIT + */ + if (WARN_ON_ONCE(!can_direct_reclaim)) + goto fail; + + /* + * PF_MEMALLOC request from this context is rather bizarre + * because we cannot reclaim anything and only can loop waiting + * for somebody to do a work for us + */ + WARN_ON_ONCE(current->flags & PF_MEMALLOC); + + /* + * non failing costly orders are a hard requirement which we + * are not prepared for much so let's warn about these users + * so that we can identify them and convert them to something + * else. + */ + WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER); + + cond_resched(); + goto retry; + } +fail: warn_alloc(gfp_mask, "page allocation failure: order:%u", order); got_pg: -- 2.10.2 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically 2016-12-01 15:25 [PATCH 0/2] GFP_NOFAIL cleanups Michal Hocko 2016-12-01 15:25 ` [PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath Michal Hocko @ 2016-12-01 15:25 ` Michal Hocko 2016-12-02 7:23 ` Vlastimil Babka 2016-12-05 13:45 ` Tetsuo Handa 1 sibling, 2 replies; 23+ messages in thread From: Michal Hocko @ 2016-12-01 15:25 UTC (permalink / raw) To: Andrew Morton Cc: Vlastimil Babka, Tetsuo Handa, Johannes Weiner, Mel Gorman, David Rientjes, linux-mm, LKML, Michal Hocko From: Michal Hocko <mhocko@suse.com> __alloc_pages_may_oom makes sure to skip the OOM killer depending on the allocation request. This includes lowmem requests, costly high order requests and others. For a long time __GFP_NOFAIL acted as an override for all those rules. This is not documented and it can be quite surprising as well. E.g. GFP_NOFS requests are not invoking the OOM killer but GFP_NOFS|__GFP_NOFAIL does. So if we try to convert some of the existing open coded loops around the allocator to a nofail request (and we have done that in the past) then such a change would have a non-trivial side effect which is not obvious. Note that the primary motivation for skipping the OOM killer is to prevent its premature invocation. The exception has been added by 82553a937f12 ("oom: invoke oom killer for __GFP_NOFAIL"). The changelog points out that the oom killer has to be invoked, otherwise the request would be looping forever. But this argument is rather weak because the OOM killer doesn't really guarantee any forward progress for those exceptional cases - e.g. 
it will hardly help to form a costly order page - I believe we certainly do not want to kill all processes and eventually panic the system just because there is a nasty driver asking for an order-9 page with __GFP_NOFAIL without realizing all the consequences - it is much better for this request to loop forever than to cause massive system disruption. Lowmem is also highly unlikely to be freed during the OOM killer, and a GFP_NOFS request could trigger while there is still a lot of memory pinned by filesystems. This patch simply removes the __GFP_NOFAIL special case in order to have clearer semantics without surprising side effects. Instead we allow nofail requests to access memory reserves to move forward in both cases: when the OOM killer is invoked and when it should be suppressed. The __alloc_pages_nowmark helper has been introduced for that purpose. Signed-off-by: Michal Hocko <mhocko@suse.com> --- mm/oom_kill.c | 2 +- mm/page_alloc.c | 95 +++++++++++++++++++++++++++++++++++---------------------- 2 files changed, 59 insertions(+), 38 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index ec9f11d4f094..12a6fce85f61 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -1013,7 +1013,7 @@ bool out_of_memory(struct oom_control *oc) * make sure exclude 0 mask - all other users should have at least * ___GFP_DIRECT_RECLAIM to get here. */ - if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL))) + if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS)) return true; /* diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 76c0b6bb0baf..7102641147c4 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3044,6 +3044,25 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...) 
} static inline struct page * +__alloc_pages_nowmark(gfp_t gfp_mask, unsigned int order, + const struct alloc_context *ac) +{ + struct page *page; + + page = get_page_from_freelist(gfp_mask, order, + ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac); + /* + * fallback to ignore cpuset restriction if our nodes + * are depleted + */ + if (!page) + page = get_page_from_freelist(gfp_mask, order, + ALLOC_NO_WATERMARKS, ac); + + return page; +} + +static inline struct page * __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, const struct alloc_context *ac, unsigned long *did_some_progress) { @@ -3078,47 +3097,41 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, if (page) goto out; - if (!(gfp_mask & __GFP_NOFAIL)) { - /* Coredumps can quickly deplete all memory reserves */ - if (current->flags & PF_DUMPCORE) - goto out; - /* The OOM killer will not help higher order allocs */ - if (order > PAGE_ALLOC_COSTLY_ORDER) - goto out; - /* The OOM killer does not needlessly kill tasks for lowmem */ - if (ac->high_zoneidx < ZONE_NORMAL) - goto out; - if (pm_suspended_storage()) - goto out; - /* - * XXX: GFP_NOFS allocations should rather fail than rely on - * other request to make a forward progress. - * We are in an unfortunate situation where out_of_memory cannot - * do much for this context but let's try it to at least get - * access to memory reserved if the current task is killed (see - * out_of_memory). Once filesystems are ready to handle allocation - * failures more gracefully we should just bail out here. 
- */ + /* Coredumps can quickly deplete all memory reserves */ + if (current->flags & PF_DUMPCORE) + goto out; + /* The OOM killer will not help higher order allocs */ + if (order > PAGE_ALLOC_COSTLY_ORDER) + goto out; + /* The OOM killer does not needlessly kill tasks for lowmem */ + if (ac->high_zoneidx < ZONE_NORMAL) + goto out; + if (pm_suspended_storage()) + goto out; + /* + * XXX: GFP_NOFS allocations should rather fail than rely on + * other request to make a forward progress. + * We are in an unfortunate situation where out_of_memory cannot + * do much for this context but let's try it to at least get + * access to memory reserved if the current task is killed (see + * out_of_memory). Once filesystems are ready to handle allocation + * failures more gracefully we should just bail out here. + */ + + /* The OOM killer may not free memory on a specific node */ + if (gfp_mask & __GFP_THISNODE) + goto out; - /* The OOM killer may not free memory on a specific node */ - if (gfp_mask & __GFP_THISNODE) - goto out; - } /* Exhausted what can be done so it's blamo time */ - if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) { + if (out_of_memory(&oc)) { *did_some_progress = 1; - if (gfp_mask & __GFP_NOFAIL) { - page = get_page_from_freelist(gfp_mask, order, - ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac); - /* - * fallback to ignore cpuset restriction if our nodes - * are depleted - */ - if (!page) - page = get_page_from_freelist(gfp_mask, order, - ALLOC_NO_WATERMARKS, ac); - } + /* + * Help non-failing allocations by giving them access to memory + * reserves + */ + if (gfp_mask & __GFP_NOFAIL) + page = __alloc_pages_nowmark(gfp_mask, order, ac); } out: mutex_unlock(&oom_lock); @@ -3725,6 +3738,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, */ WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER); + /* + * Help non-failing allocations by giving them access to memory + * reserves + */ + page = __alloc_pages_nowmark(gfp_mask, order, ac); + if (page) + 
goto got_pg; + cond_resched(); goto retry; } -- 2.10.2 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically 2016-12-01 15:25 ` [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically Michal Hocko @ 2016-12-02 7:23 ` Vlastimil Babka 2016-12-05 13:45 ` Tetsuo Handa 1 sibling, 0 replies; 23+ messages in thread From: Vlastimil Babka @ 2016-12-02 7:23 UTC (permalink / raw) To: Michal Hocko, Andrew Morton Cc: Tetsuo Handa, Johannes Weiner, Mel Gorman, David Rientjes, linux-mm, LKML, Michal Hocko On 12/01/2016 04:25 PM, Michal Hocko wrote: > From: Michal Hocko <mhocko@suse.com> > > __alloc_pages_may_oom makes sure to skip the OOM killer depending on > the allocation request. This includes lowmem requests, costly high > order requests and others. For a long time __GFP_NOFAIL acted as an > override for all those rules. This is not documented and it can be quite > surprising as well. E.g. GFP_NOFS requests are not invoking the OOM > killer but GFP_NOFS|__GFP_NOFAIL does. So if we try to convert some of > the existing open coded loops around the allocator to a nofail request (and we > have done that in the past) then such a change would have a non-trivial > side effect which is not obvious. Note that the primary motivation for > skipping the OOM killer is to prevent its premature invocation. > > The exception has been added by 82553a937f12 ("oom: invoke oom killer > for __GFP_NOFAIL"). The changelog points out that the oom killer has to > be invoked, otherwise the request would be looping forever. But this > argument is rather weak because the OOM killer doesn't really guarantee > any forward progress for those exceptional cases - e.g. 
it will hardly > help to form a costly order page - I believe we certainly do not want to kill > all processes and eventually panic the system just because there is a > nasty driver asking for an order-9 page with __GFP_NOFAIL without realizing all > the consequences - it is much better for this request to loop forever > than to cause massive system disruption. Lowmem is also highly unlikely to be > freed during the OOM killer, and a GFP_NOFS request could trigger while there > is still a lot of memory pinned by filesystems. > > This patch simply removes the __GFP_NOFAIL special case in order to have > clearer semantics without surprising side effects. Instead we allow > nofail requests to access memory reserves to move forward in both > cases: when the OOM killer is invoked and when it should be suppressed. > The __alloc_pages_nowmark helper has been introduced for that purpose. > > Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically 2016-12-01 15:25 ` [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically Michal Hocko 2016-12-02 7:23 ` Vlastimil Babka @ 2016-12-05 13:45 ` Tetsuo Handa 2016-12-05 14:10 ` Michal Hocko 1 sibling, 1 reply; 23+ messages in thread From: Tetsuo Handa @ 2016-12-05 13:45 UTC (permalink / raw) To: mhocko, akpm Cc: vbabka, hannes, mgorman, rientjes, linux-mm, linux-kernel, mhocko Michal Hocko wrote: > __alloc_pages_may_oom makes sure to skip the OOM killer depending on > the allocation request. This includes lowmem requests, costly high > order requests and others. For a long time __GFP_NOFAIL acted as an > override for all those rules. This is not documented and it can be quite > surprising as well. E.g. GFP_NOFS requests are not invoking the OOM > killer but GFP_NOFS|__GFP_NOFAIL does. So if we try to convert some of > the existing open coded loops around the allocator to a nofail request (and we > have done that in the past) then such a change would have a non-trivial > side effect which is not obvious. Note that the primary motivation for > skipping the OOM killer is to prevent its premature invocation. > > The exception has been added by 82553a937f12 ("oom: invoke oom killer > for __GFP_NOFAIL"). The changelog points out that the oom killer has to > be invoked, otherwise the request would be looping forever. But this > argument is rather weak because the OOM killer doesn't really guarantee > any forward progress for those exceptional cases - e.g. 
it will hardly > help to form a costly order page - I believe we certainly do not want to kill > all processes and eventually panic the system just because there is a > nasty driver asking for an order-9 page with __GFP_NOFAIL without realizing all > the consequences - it is much better for this request to loop forever > than to cause massive system disruption. Lowmem is also highly unlikely to be > freed during the OOM killer, and a GFP_NOFS request could trigger while there > is still a lot of memory pinned by filesystems. I disagree. I believe that panic caused by the OOM killer is much, much better than a locked-up system. I hate to add new locations that can lock up inside the page allocator. This is __GFP_NOFAIL and reclaim has failed. Requiring the administrator to go in front of the console and press SysRq-f until the locked-up situation gets resolved is silly. If there is a nasty driver asking for an order-9 page with __GFP_NOFAIL, fix that driver. > > This patch simply removes the __GFP_NOFAIL special case in order to have > clearer semantics without surprising side effects. Instead we allow > nofail requests to access memory reserves to move forward in both > cases: when the OOM killer is invoked and when it should be suppressed. > The __alloc_pages_nowmark helper has been introduced for that purpose. > > Signed-off-by: Michal Hocko <mhocko@suse.com> ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically 2016-12-05 13:45 ` Tetsuo Handa @ 2016-12-05 14:10 ` Michal Hocko 2016-12-06 8:27 ` Michal Hocko 2016-12-06 10:38 ` Tetsuo Handa 0 siblings, 2 replies; 23+ messages in thread From: Michal Hocko @ 2016-12-05 14:10 UTC (permalink / raw) To: Tetsuo Handa Cc: akpm, vbabka, hannes, mgorman, rientjes, linux-mm, linux-kernel On Mon 05-12-16 22:45:19, Tetsuo Handa wrote: > Michal Hocko wrote: > > __alloc_pages_may_oom makes sure to skip the OOM killer depending on > > the allocation request. This includes lowmem requests, costly high > > order requests and others. For a long time __GFP_NOFAIL acted as an > > override for all those rules. This is not documented and it can be quite > > surprising as well. E.g. GFP_NOFS requests are not invoking the OOM > > killer but GFP_NOFS|__GFP_NOFAIL does. So if we try to convert some of > > the existing open coded loops around the allocator to a nofail request (and we > > have done that in the past) then such a change would have a non-trivial > > side effect which is not obvious. Note that the primary motivation for > > skipping the OOM killer is to prevent its premature invocation. > > > > The exception has been added by 82553a937f12 ("oom: invoke oom killer > > for __GFP_NOFAIL"). The changelog points out that the oom killer has to > > be invoked, otherwise the request would be looping forever. But this > > argument is rather weak because the OOM killer doesn't really guarantee > > any forward progress for those exceptional cases - e.g. 
it will hardly > > help to form a costly order page - I believe we certainly do not want to kill > > all processes and eventually panic the system just because there is a > > nasty driver asking for an order-9 page with __GFP_NOFAIL without realizing all > > the consequences - it is much better for this request to loop forever > > than to cause massive system disruption. Lowmem is also highly unlikely to be > > freed during the OOM killer, and a GFP_NOFS request could trigger while there > > is still a lot of memory pinned by filesystems. > > I disagree. I believe that panic caused by the OOM killer is much, much better > than a locked-up system. I hate to add new locations that can lock up inside > the page allocator. This is __GFP_NOFAIL and reclaim has failed. As a matter of fact any __GFP_NOFAIL can lock up inside the allocator. Full stop. There is no guaranteed way to make forward progress with the current page allocator implementation. So we are somewhere in the middle between premature and pointless system disruption (GFP_NOFS with lots of metadata or lowmem request) where the OOM killer might not even help and potential lockup which is inevitable with the current design. Dunno about you but I would rather go with the first option. To be honest I really fail to understand your line of argumentation. We have this do { cond_resched(); } (page = alloc_page(GFP_NOFS)); vs. page = alloc_page(GFP_NOFS | __GFP_NOFAIL); the first one doesn't invoke the OOM killer while the latter does. This discrepancy just cannot make any sense... The same is true for alloc_page(GFP_DMA) vs alloc_page(GFP_DMA|__GFP_NOFAIL) Now we can discuss whether it is a _good_ idea to not invoke the OOM killer for those exceptions but whatever we do, __GFP_NOFAIL is not a way to give such a subtle side effect. Or do you disagree even with that? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically 2016-12-05 14:10 ` Michal Hocko @ 2016-12-06 8:27 ` Michal Hocko 2016-12-06 10:38 ` Tetsuo Handa 1 sibling, 0 replies; 23+ messages in thread From: Michal Hocko @ 2016-12-06 8:27 UTC (permalink / raw) To: Tetsuo Handa Cc: akpm, vbabka, hannes, mgorman, rientjes, linux-mm, linux-kernel On Mon 05-12-16 15:10:09, Michal Hocko wrote: [...] > So we are somewhere in the middle between premature and pointless > system disruption (GFP_NOFS with lots of metadata or lowmem request) > where the OOM killer might not even help and potential lockup which is > inevitable with the current design. Dunno about you but I would rather > go with the first option. To be honest I really fail to understand your > line of argumentation. We have this > do { > cond_resched(); > } (page = alloc_page(GFP_NOFS)); This should have been while (!(page = alloc_page(GFP_NOFS))) of course... -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically 2016-12-05 14:10 ` Michal Hocko 2016-12-06 8:27 ` Michal Hocko @ 2016-12-06 10:38 ` Tetsuo Handa 2016-12-06 11:03 ` Vlastimil Babka 2016-12-06 19:22 ` Michal Hocko 1 sibling, 2 replies; 23+ messages in thread From: Tetsuo Handa @ 2016-12-06 10:38 UTC (permalink / raw) To: mhocko; +Cc: akpm, vbabka, hannes, mgorman, rientjes, linux-mm, linux-kernel Michal Hocko wrote: > On Mon 05-12-16 22:45:19, Tetsuo Handa wrote: > > Michal Hocko wrote: > > > __alloc_pages_may_oom makes sure to skip the OOM killer depending on > > > the allocation request. This includes lowmem requests, costly high > > > order requests and others. For a long time __GFP_NOFAIL acted as an > > > override for all those rules. This is not documented and it can be quite > > > surprising as well. E.g. GFP_NOFS requests are not invoking the OOM > > > killer but GFP_NOFS|__GFP_NOFAIL does. So if we try to convert some of > > > the existing open coded loops around the allocator to a nofail request (and we > > > have done that in the past) then such a change would have a non-trivial > > > side effect which is not obvious. Note that the primary motivation for > > > skipping the OOM killer is to prevent its premature invocation. > > > > > > The exception has been added by 82553a937f12 ("oom: invoke oom killer > > > for __GFP_NOFAIL"). The changelog points out that the oom killer has to > > > be invoked, otherwise the request would be looping forever. But this > > > argument is rather weak because the OOM killer doesn't really guarantee > > > any forward progress for those exceptional cases - e.g. 
it will hardly > > > help to form a costly order page - I believe we certainly do not want to kill > > > all processes and eventually panic the system just because there is a > > > nasty driver asking for an order-9 page with __GFP_NOFAIL without realizing all > > > the consequences - it is much better for this request to loop forever > > > than to cause massive system disruption. Lowmem is also highly unlikely to be > > > freed during the OOM killer, and a GFP_NOFS request could trigger while there > > > is still a lot of memory pinned by filesystems. > > > > I disagree. I believe that panic caused by the OOM killer is much, much better > > than a locked-up system. I hate to add new locations that can lock up inside > > the page allocator. This is __GFP_NOFAIL and reclaim has failed. > > As a matter of fact any __GFP_NOFAIL can lock up inside the allocator. You are trying to increase possible locations of lockups by changing the default behavior of __GFP_NOFAIL. > Full stop. There is no guaranteed way to make forward progress with > the current page allocator implementation. Then, will you accept kmallocwd until the page allocator implementation can provide a guaranteed way to make forward progress? > > So we are somewhere in the middle between premature and pointless > system disruption (GFP_NOFS with lots of metadata or lowmem request) > where the OOM killer might not even help and potential lockup which is > inevitable with the current design. Dunno about you but I would rather > go with the first option. To be honest I really fail to understand your > line of argumentation. We have this > do { > cond_resched(); > } while (!(page = alloc_page(GFP_NOFS))); > vs. > page = alloc_page(GFP_NOFS | __GFP_NOFAIL); > > the first one doesn't invoke the OOM killer while the latter does. This > discrepancy just cannot make any sense... 
The same is true for > > alloc_page(GFP_DMA) vs alloc_page(GFP_DMA|__GFP_NOFAIL) > > Now we can discuss whether it is a _good_ idea to not invoke the OOM killer > for those exceptions but whatever we do, __GFP_NOFAIL is not a way to > give such a subtle side effect. Or do you disagree even with that? "[PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath" silently changes __GFP_NOFAIL vs. __GFP_NORETRY priority. Currently, __GFP_NORETRY is stronger than __GFP_NOFAIL; __GFP_NOFAIL allocation requests fail without invoking the OOM killer when both __GFP_NORETRY and __GFP_NOFAIL are given. With [PATCH 1/2], __GFP_NOFAIL becomes stronger than __GFP_NORETRY; __GFP_NOFAIL allocation requests will loop forever without invoking the OOM killer when both __GFP_NORETRY and __GFP_NOFAIL are given. Those callers which prefer lockup over panic can specify both __GFP_NORETRY and __GFP_NOFAIL. You are trying to change the behavior of __GFP_NOFAIL without asking whether existing __GFP_NOFAIL users want to invoke the OOM killer. And the story is not specific to existing __GFP_NOFAIL users; it applies to existing GFP_NOFS users as well. Quoting from http://lkml.kernel.org/r/20161125131806.GB24353@dhcp22.suse.cz : > > Will you look at http://marc.info/?t=120716967100004&r=1&w=2 which lead to > > commit a02fe13297af26c1 ("selinux: prevent rentry into the FS") and commit > > 869ab5147e1eead8 ("SELinux: more GFP_NOFS fixups to prevent selinux from > > re-entering the fs code") ? My understanding is that mkdir() system call > > caused memory allocation for inode creation and that memory allocation > > caused memory reclaim which had to be !__GFP_FS. > > I will have a look later, thanks for the points. What is your answer to this problem? For those who prefer panic over lockup, please provide a means to invoke the OOM killer (e.g. __GFP_WANT_OOM_KILLER). 
If you provide __GFP_WANT_OOM_KILLER, you can change -#define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS) +#define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_WANT_OOM_KILLER) in gfp.h and add __GFP_WANT_OOM_KILLER to any existing __GFP_NOFAIL/GFP_NOFS users who prefer panic over lockup. Then, I think I'm fine with - if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL))) - return true; + if (oc->gfp_mask && !(oc->gfp_mask & __GFP_WANT_OOM_KILLER)) + return false; in out_of_memory(). As you know, very few people respond to discussions regarding almost-OOM situations. Changing the default behavior without asking existing users is annoying. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically 2016-12-06 10:38 ` Tetsuo Handa @ 2016-12-06 11:03 ` Vlastimil Babka 2016-12-06 19:25 ` Michal Hocko 2016-12-06 19:22 ` Michal Hocko 1 sibling, 1 reply; 23+ messages in thread From: Vlastimil Babka @ 2016-12-06 11:03 UTC (permalink / raw) To: Tetsuo Handa, mhocko Cc: akpm, hannes, mgorman, rientjes, linux-mm, linux-kernel On 12/06/2016 11:38 AM, Tetsuo Handa wrote: >> >> So we are somewhere in the middle between premature and pointless >> system disruption (GFP_NOFS with lots of metadata or lowmem request) >> where the OOM killer might not even help and potential lockup which is >> inevitable with the current design. Dunno about you but I would rather >> go with the first option. To be honest I really fail to understand your >> line of argumentation. We have this >> do { >> cond_resched(); >> } while (!(page = alloc_page(GFP_NOFS))); >> vs. >> page = alloc_page(GFP_NOFS | __GFP_NOFAIL); >> >> the first one doesn't invoke the OOM killer while the latter does. This >> discrepancy just cannot make any sense... The same is true for >> >> alloc_page(GFP_DMA) vs alloc_page(GFP_DMA|__GFP_NOFAIL) >> >> Now we can discuss whether it is a _good_ idea to not invoke the OOM killer >> for those exceptions but whatever we do, __GFP_NOFAIL is not a way to >> give such a subtle side effect. Or do you disagree even with that? > > "[PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath" > silently changes __GFP_NOFAIL vs. __GFP_NORETRY priority. I guess that wasn't intended? > Currently, __GFP_NORETRY is stronger than __GFP_NOFAIL; __GFP_NOFAIL > allocation requests fail without invoking the OOM killer when both > __GFP_NORETRY and __GFP_NOFAIL are given. > > With [PATCH 1/2], __GFP_NOFAIL becomes stronger than __GFP_NORETRY; > __GFP_NOFAIL allocation requests will loop forever without invoking > the OOM killer when both __GFP_NORETRY and __GFP_NOFAIL are given. 
Does such a combination of flags make sense? Should we warn about it, or even silently remove __GFP_NORETRY in such a case? > Those callers which prefer lockup over panic can specify both > __GFP_NORETRY and __GFP_NOFAIL. What lockup exactly, if __GFP_NORETRY leads to failure? ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically 2016-12-06 11:03 ` Vlastimil Babka @ 2016-12-06 19:25 ` Michal Hocko 0 siblings, 0 replies; 23+ messages in thread From: Michal Hocko @ 2016-12-06 19:25 UTC (permalink / raw) To: Vlastimil Babka Cc: Tetsuo Handa, akpm, hannes, mgorman, rientjes, linux-mm, linux-kernel On Tue 06-12-16 12:03:02, Vlastimil Babka wrote: > On 12/06/2016 11:38 AM, Tetsuo Handa wrote: > >> > >> So we are somewhere in the middle between premature and pointless > >> system disruption (GFP_NOFS with lots of metadata or lowmem request) > >> where the OOM killer might not even help and potential lockup which is > >> inevitable with the current design. Dunno about you but I would rather > >> go with the first option. To be honest I really fail to understand your > >> line of argumentation. We have this > >> do { > >> cond_resched(); > >> } while (!(page = alloc_page(GFP_NOFS))); > >> vs. > >> page = alloc_page(GFP_NOFS | __GFP_NOFAIL); > >> > >> the first one doesn't invoke the OOM killer while the latter does. This > >> discrepancy just cannot make any sense... The same is true for > >> > >> alloc_page(GFP_DMA) vs alloc_page(GFP_DMA|__GFP_NOFAIL) > >> > >> Now we can discuss whether it is a _good_ idea to not invoke the OOM killer > >> for those exceptions but whatever we do, __GFP_NOFAIL is not a way to > >> give such a subtle side effect. Or do you disagree even with that? > > > > "[PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath" > > silently changes __GFP_NOFAIL vs. __GFP_NORETRY priority. > > I guess that wasn't intended? I didn't even think about that possibility because it just doesn't make any sense. > > Currently, __GFP_NORETRY is stronger than __GFP_NOFAIL; __GFP_NOFAIL > > allocation requests fail without invoking the OOM killer when both > > __GFP_NORETRY and __GFP_NOFAIL are given. 
> > > > With [PATCH 1/2], __GFP_NOFAIL becomes stronger than __GFP_NORETRY; > > __GFP_NOFAIL allocation requests will loop forever without invoking > > the OOM killer when both __GFP_NORETRY and __GFP_NOFAIL are given. > > Does such a combination of flags make sense? Should we warn about it, or > even silently remove __GFP_NORETRY in such a case? No, this combination doesn't make any sense. I seriously doubt we should even care about it, and simply following the stronger requirement makes more sense from a semantic point of view. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] mm, oom: do not enfore OOM killer for __GFP_NOFAIL automatically 2016-12-06 10:38 ` Tetsuo Handa 2016-12-06 11:03 ` Vlastimil Babka @ 2016-12-06 19:22 ` Michal Hocko 2016-12-08 12:53 ` Tetsuo Handa 1 sibling, 1 reply; 23+ messages in thread From: Michal Hocko @ 2016-12-06 19:22 UTC (permalink / raw) To: Tetsuo Handa Cc: akpm, vbabka, hannes, mgorman, rientjes, linux-mm, linux-kernel On Tue 06-12-16 19:38:38, Tetsuo Handa wrote: > Michal Hocko wrote: > > On Mon 05-12-16 22:45:19, Tetsuo Handa wrote: > > > Michal Hocko wrote: > > > > __alloc_pages_may_oom makes sure to skip the OOM killer depending on > > > > the allocation request. This includes lowmem requests, costly high > > > > order requests and others. For a long time __GFP_NOFAIL acted as an > > > > override for all those rules. This is not documented and it can be quite > > > > surprising as well. E.g. GFP_NOFS requests are not invoking the OOM > > > > killer but GFP_NOFS|__GFP_NOFAIL does so if we try to convert some of > > > > the existing open coded loops around allocator to nofail request (and we > > > > have done that in the past) then such a change would have a non trivial > > > > side effect which is not obvious. Note that the primary motivation for > > > > skipping the OOM killer is to prevent from pre-mature invocation. > > > > > > > > The exception has been added by 82553a937f12 ("oom: invoke oom killer > > > > for __GFP_NOFAIL"). The changelog points out that the oom killer has to > > > > be invoked otherwise the request would be looping for ever. But this > > > > argument is rather weak because the OOM killer doesn't really guarantee > > > > any forward progress for those exceptional cases - e.g. 
it will hardly > > > > help to form costly order - I believe we certainly do not want to kill > > > > all processes and eventually panic the system just because there is a > > > > nasty driver asking for order-9 page with GFP_NOFAIL not realizing all > > > > the consequences - it is much better this request would loop for ever > > > > than the massive system disruption, lowmem is also highly unlikely to be > > > > freed during OOM killer and GFP_NOFS request could trigger while there > > > > is still a lot of memory pinned by filesystems. > > > > > > I disagree. I believe that panic caused by OOM killer is much much better > > > than a locked up system. I hate to add new locations that can lockup inside > > > page allocator. This is __GFP_NOFAIL and reclaim has failed. > > > > As a matter of fact any __GFP_NOFAIL can lockup inside the allocator. > > You are trying to increase possible locations of lockups by changing > default behavior of __GFP_NOFAIL. I disagree. I have tried to explain that it is much more important to have fewer silent side effects than optimize for the very worst case. I simply do not see __GFP_NOFAIL lockups so common to even care or tweak their semantic in a weird way. It seems you prefer to optimize for the absolute worst case and even for that case you cannot offer anything better than randomly OOM killing random processes until the system somehow resurrects or panics. I consider this a very bad design. So let's agree to disagree here. > > Full stop. There is no guaranteed way to make a forward progress with > > the current page allocator implementation. > > Then, will you accept kmallocwd until page allocator implementation > can provide a guaranteed way to make a forward progress? No, I find your kmallocwd too complex for the advantage it provides. 
> > So we are somewhere in the middle between premature and pointless > > system disruption (GFP_NOFS with lots of metadata or a lowmem request) > > where the OOM killer even might not help and a potential lockup which is > > inevitable with the current design. Dunno about you but I would rather > > go with the first option. To be honest I really fail to understand your > > line of argumentation. We have this > > do { > > cond_resched(); > > } while (!(page = alloc_page(GFP_NOFS))); > > vs. > > page = alloc_page(GFP_NOFS | __GFP_NOFAIL); > > > > the first one doesn't invoke the OOM killer while the latter does. This > > discrepancy just cannot make any sense... The same is true for > > > > alloc_page(GFP_DMA) vs alloc_page(GFP_DMA|__GFP_NOFAIL) > > > > Now we can discuss whether it is a _good_ idea to not invoke the OOM killer > > for those exceptions but whatever we do __GFP_NOFAIL is not a way to > > give such a subtle side effect. Or do you disagree even with that? > > "[PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath" > silently changes __GFP_NOFAIL vs. __GFP_NORETRY priority. > > Currently, __GFP_NORETRY is stronger than __GFP_NOFAIL; __GFP_NOFAIL > allocation requests fail without invoking the OOM killer when both > __GFP_NORETRY and __GFP_NOFAIL are given. Sigh... __GFP_NORETRY | __GFP_NOFAIL _doesn't_ make _any_ sense whatsoever. > With [PATCH 1/2], __GFP_NOFAIL becomes stronger than __GFP_NORETRY; > __GFP_NOFAIL allocation requests will loop forever without invoking > the OOM killer when both __GFP_NORETRY and __GFP_NOFAIL are given. So what? Strictly speaking __GFP_NOFAIL should always be stronger but I really fail to see why we should even consider __GFP_NORETRY in that context. I definitely do not want to complicate the page fault path for a nonsense combination of flags. > Those callers which prefer lockup over panic can specify both > __GFP_NORETRY and __GFP_NOFAIL. No! This combination just doesn't make any sense.
In the same way, __GFP_REPEAT | GFP_NOWAIT or __GFP_REPEAT | __GFP_NORETRY make no sense as well. Please use common sense! > You are trying to change the behavior of > __GFP_NOFAIL without asking whether existing __GFP_NOFAIL users > want to invoke the OOM killer. Invoking or not invoking the oom killer is the page allocator's internal business. No code outside of the MM is to talk about those decisions. The fact that we provide a lightweight allocation mode which doesn't invoke the OOM killer is a mere implementation detail. > And the story is not specific to existing __GFP_NOFAIL users; > it applies to existing GFP_NOFS users as well. > > Quoting from http://lkml.kernel.org/r/20161125131806.GB24353@dhcp22.suse.cz : > > > Will you look at http://marc.info/?t=120716967100004&r=1&w=2 which led to > > > commit a02fe13297af26c1 ("selinux: prevent rentry into the FS") and commit > > > 869ab5147e1eead8 ("SELinux: more GFP_NOFS fixups to prevent selinux from > > > re-entering the fs code") ? My understanding is that the mkdir() system call > > > caused a memory allocation for inode creation and that memory allocation > > > caused memory reclaim which had to be !__GFP_FS. > > > > I will have a look later, thanks for the pointers. > > What is your answer to this problem? For those who prefer panic over lockup, > please provide a means to invoke the OOM killer (e.g. __GFP_WANT_OOM_KILLER). Please stop shifting the discussion off the scope of the discussed patch. This just distracts from the main point. I really think that at least patch 1 is a good cleanup and the second one makes a lot of sense as well from the semantic point of view. Now you constantly push to the extreme with very strong statements without any actual data point. This is not really helpful! If you believe that my argumentation is incorrect then you are free to nak the patch with your reasoning. But please stop this nitpicking on a nonsense combination of flags. Thanks!
-- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] mm, oom: do not enfore OOM killer for __GFP_NOFAIL automatically 2016-12-06 19:22 ` Michal Hocko @ 2016-12-08 12:53 ` Tetsuo Handa 2016-12-08 13:47 ` Michal Hocko 0 siblings, 1 reply; 23+ messages in thread From: Tetsuo Handa @ 2016-12-08 12:53 UTC (permalink / raw) To: mhocko; +Cc: akpm, vbabka, hannes, mgorman, rientjes, linux-mm, linux-kernel Michal Hocko wrote: > On Tue 06-12-16 19:38:38, Tetsuo Handa wrote: > > You are trying to increase possible locations of lockups by changing > > default behavior of __GFP_NOFAIL. > > I disagree. I have tried to explain that it is much more important to > have fewer silent side effects than optimize for the very worst case. I > simply do not see __GFP_NOFAIL lockups so common to even care or tweak > their semantic in a weird way. It seems you prefer to optimize for the > absolute worst case and even for that case you cannot offer anything > better than randomly OOM killing random processes until the system > somehow resurrects or panics. I consider this a very bad design. So > let's agree to disagree here. You think that invoking the OOM killer with __GFP_NOFAIL is worse than locking up with __GFP_NOFAIL. But I think that locking up with __GFP_NOFAIL is worse than invoking the OOM killer with __GFP_NOFAIL. If we could agree with calling __alloc_pages_nowmark() before out_of_memory() if __GFP_NOFAIL is given, we can avoid locking up while minimizing possibility of invoking the OOM killer... I suggest "when you change something, ask users who are affected by your change" because patch 2 has values-based conflict. > > "[PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath" > > silently changes __GFP_NOFAIL vs. __GFP_NORETRY priority. > > > > Currently, __GFP_NORETRY is stronger than __GFP_NOFAIL; __GFP_NOFAIL > > allocation requests fail without invoking the OOM killer when both > > __GFP_NORETRY and __GFP_NOFAIL are given. > > Sigh... 
__GFP_NORETRY | __GFP_NOFAIL _doesn't_ make _any_ sense whatsoever. > > > With [PATCH 1/2], __GFP_NOFAIL becomes stronger than __GFP_NORETRY; > > __GFP_NOFAIL allocation requests will loop forever without invoking > > the OOM killer when both __GFP_NORETRY and __GFP_NOFAIL are given. > > So what? Strictly speaking __GFP_NOFAIL should always be stronger but I > really fail to see why we should even consider __GFP_NORETRY in that > context. I definitely do not want to complicate the page fault path for > a nonsense combination of flags. > > > Those callers which prefer lockup over panic can specify both > > > __GFP_NORETRY and __GFP_NOFAIL. > > > > No! This combination just doesn't make any sense. In the same way, > > __GFP_REPEAT | GFP_NOWAIT or __GFP_REPEAT | __GFP_NORETRY make no sense > > as well. Please use common sense! I wonder why I'm accused so much. I mentioned that patch 2 might be garbage because patch 1 alone unexpectedly provided a means to retry forever without invoking the OOM killer. You are not describing that fact in the description. You are not describing what combinations are valid and which flag is the stronger requirement in gfp.h (e.g. __GFP_NOFAIL vs. __GFP_NORETRY). > Invoking or not invoking the oom killer is the page allocator's internal > business. No code outside of the MM is to talk about those decisions. > The fact that we provide a lightweight allocation mode which doesn't > invoke the OOM killer is a mere implementation detail. __GFP_NOFAIL allocation requests for e.g. fs writeback are considered code inside the MM because they are operations for reclaiming memory. Such __GFP_NOFAIL allocation requests should be given a chance to choose which one (possibility of lockup by not invoking the OOM killer or possibility of panic by invoking the OOM killer) they prefer. Therefore, > If you believe that my argumentation is incorrect then you are free to > nak the patch with your reasoning.
But please stop this nitpicking on > a nonsense combination of flags. Nacked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> on patch 2 unless "you explain these patches to __GFP_NOFAIL users and provide a means to invoke the OOM killer if someone chose the possibility of panic" or "you accept kmallocwd". > > > Full stop. There is no guaranteed way to make a forward progress with > > > the current page allocator implementation. > > > > Then, will you accept kmallocwd until page allocator implementation > > can provide a guaranteed way to make a forward progress? > > No, I find your kmallocwd too complex for the advantage it provides. My kmallocwd provides us two advantages. One is that, if the cause of a lockup is not a memory allocation request, kmallocwd gives us proof that you are doing well. You seriously tend to ignore corner cases with your wish that the absolute worst case won't happen; that makes me seriously explore corner cases as a secure coder for proactive protection; that irritates you further. You say that I constantly push to the extreme with very strong statements without any actual data point, but I say that you constantly reject without any proof just because you have never heard of them. The reality is that we can hardly expect people to have the knowledge/skills for reporting corner cases. The other is that a synchronous mechanism (like warn_alloc()) is prone to corner cases. We won't be able to catch all corner cases before people are trapped by them. If the cause of a lockup is a memory allocation request, kmallocwd gives us a trigger to take action. This keeps me away from exploring corner cases which you think are unlikely to happen. This helps you to choose whatever semantic/logic you prefer. Since I'm not a __GFP_NOFAIL user, I don't care as long as lockups are detected and reported using a catch-all approach (i.e. an asynchronous mechanism).
Instead of cruelly rejecting kmallocwd with "too complex", will you explain why you feel it is complex (as a reply to the kmallocwd thread)? I had my goals for writing it as such, but there would be room for reducing complexity if you explain the details. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] mm, oom: do not enfore OOM killer for __GFP_NOFAIL automatically 2016-12-08 12:53 ` Tetsuo Handa @ 2016-12-08 13:47 ` Michal Hocko 2016-12-11 11:23 ` Tetsuo Handa 0 siblings, 1 reply; 23+ messages in thread From: Michal Hocko @ 2016-12-08 13:47 UTC (permalink / raw) To: Tetsuo Handa Cc: akpm, vbabka, hannes, mgorman, rientjes, linux-mm, linux-kernel On Thu 08-12-16 21:53:44, Tetsuo Handa wrote: > Michal Hocko wrote: > > On Tue 06-12-16 19:38:38, Tetsuo Handa wrote: > > > You are trying to increase possible locations of lockups by changing > > > default behavior of __GFP_NOFAIL. > > > > I disagree. I have tried to explain that it is much more important to > > have fewer silent side effects than optimize for the very worst case. I > > simply do not see __GFP_NOFAIL lockups so common to even care or tweak > > their semantic in a weird way. It seems you prefer to optimize for the > > absolute worst case and even for that case you cannot offer anything > > better than randomly OOM killing random processes until the system > > somehow resurrects or panics. I consider this a very bad design. So > > let's agree to disagree here. > > You think that invoking the OOM killer with __GFP_NOFAIL is worse than > locking up with __GFP_NOFAIL. Yes and I have explained why. > But I think that locking up with __GFP_NOFAIL > is worse than invoking the OOM killer with __GFP_NOFAIL. Without any actual arguments based just on handwaving. > If we could agree > with calling __alloc_pages_nowmark() before out_of_memory() if __GFP_NOFAIL > is given, we can avoid locking up while minimizing possibility of invoking > the OOM killer... I do not understand. We do __alloc_pages_nowmark even when oom is called for GFP_NOFAIL. > I suggest "when you change something, ask users who are affected by > your change" because patch 2 has values-based conflict. > [...] > > > Those callers which prefer lockup over panic can specify both > > > __GFP_NORETRY and __GFP_NOFAIL. > > > > No! 
This combination just doesn't make any sense. In the same way, > > __GFP_REPEAT | GFP_NOWAIT or __GFP_REPEAT | __GFP_NORETRY make no sense > > as well. Please use common sense! > > I wonder why I'm accused so much. I mentioned that patch 2 might be > garbage because patch 1 alone unexpectedly provided a means to retry forever > without invoking the OOM killer. Which is the whole point of the patch, and the changelog is vocal about that. It even explains why it is desirable to not override decisions when the oom killer is not invoked. Please reread that and object if the argument is not correct. > You are not describing that fact in the > description. You are not describing what combinations are valid and > which flag is the stronger requirement in gfp.h (e.g. __GFP_NOFAIL vs. > __GFP_NORETRY). Sigh... I really fail to see why I should describe an impossible gfp mask combination which is _not_ used in the kernel. Please stop this strawman, I am really tired of it. > > Invoking or not invoking the oom killer is the page allocator's internal > > business. No code outside of the MM is to talk about those decisions. > > The fact that we provide a lightweight allocation mode which doesn't > > invoke the OOM killer is a mere implementation detail. > > __GFP_NOFAIL allocation requests for e.g. fs writeback are considered > code inside the MM because they are operations for reclaiming memory. > Such __GFP_NOFAIL allocation requests should be given a chance to choose > which one (possibility of lockup by not invoking the OOM killer or > possibility of panic by invoking the OOM killer) they prefer. Please be more specific. How and why should they choose that? Which allocation are we talking about, and why do you believe that the current implementation with access to memory reserves is not sufficient? > Therefore, > > > If you believe that my argumentation is incorrect then you are free to > > nak the patch with your reasoning.
But please stop this nitpicking on > > a nonsense combination of flags. > > Nacked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > > on patch 2 unless "you explain these patches to __GFP_NOFAIL users and > provide a means to invoke the OOM killer if someone chose the possibility of > panic" I believe that the changelog contains my reasoning and so far I haven't heard any _argument_ from you why they are wrong. You just managed to nitpick on an impossible and pointless gfp_mask combination and some handwaving on possible lockups without any backing arguments. This is not something I would consider as a basis for a serious nack. So if you really hate this patch then do please start being reasonable and put some real arguments into your review without any off topics and/or strawman arguments without any relevance. > or "you accept kmallocwd". Are you serious? Are you really suggesting that your patch has to be accepted in order to have this one in? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] mm, oom: do not enfore OOM killer for __GFP_NOFAIL automatically 2016-12-08 13:47 ` Michal Hocko @ 2016-12-11 11:23 ` Tetsuo Handa 2016-12-11 13:53 ` Tetsuo Handa 2016-12-12 8:48 ` Michal Hocko 0 siblings, 2 replies; 23+ messages in thread From: Tetsuo Handa @ 2016-12-11 11:23 UTC (permalink / raw) To: mhocko; +Cc: akpm, vbabka, hannes, mgorman, rientjes, linux-mm, linux-kernel Michal Hocko wrote: > On Thu 08-12-16 21:53:44, Tetsuo Handa wrote: > > If we could agree > > with calling __alloc_pages_nowmark() before out_of_memory() if __GFP_NOFAIL > > is given, we can avoid locking up while minimizing possibility of invoking > > the OOM killer... > > I do not understand. We do __alloc_pages_nowmark even when oom is called > for GFP_NOFAIL. Where is that? I can find __alloc_pages_nowmark() after out_of_memory() if __GFP_NOFAIL is given, but I can't find __alloc_pages_nowmark() before out_of_memory() if __GFP_NOFAIL is given. What I mean is below patch folded into "[PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath".

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3057,6 +3057,25 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 }
 
 static inline struct page *
+__alloc_pages_nowmark(gfp_t gfp_mask, unsigned int order,
+		const struct alloc_context *ac)
+{
+	struct page *page;
+
+	page = get_page_from_freelist(gfp_mask, order,
+			ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
+	/*
+	 * fallback to ignore cpuset restriction if our nodes
+	 * are depleted
+	 */
+	if (!page)
+		page = get_page_from_freelist(gfp_mask, order,
+				ALLOC_NO_WATERMARKS, ac);
+
+	return page;
+}
+
+static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	const struct alloc_context *ac, unsigned long *did_some_progress)
 {
@@ -3118,21 +3137,8 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 		goto out;
 	}
 	/* Exhausted what can be done so it's blamo time */
-	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
+	if (out_of_memory(&oc))
 		*did_some_progress = 1;
-
-		if (gfp_mask & __GFP_NOFAIL) {
-			page = get_page_from_freelist(gfp_mask, order,
-					ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
-			/*
-			 * fallback to ignore cpuset restriction if our nodes
-			 * are depleted
-			 */
-			if (!page)
-				page = get_page_from_freelist(gfp_mask, order,
-						ALLOC_NO_WATERMARKS, ac);
-		}
-	}
 out:
 	mutex_unlock(&oom_lock);
 	return page;
@@ -3738,6 +3744,19 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	 */
 	WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
 
+	/*
+	 * Help non-failing allocations by giving them access to memory
+	 * reserves
+	 */
+	page = __alloc_pages_nowmark(gfp_mask, order, ac);
+	if (page)
+		goto got_pg;
+
+	/* Memory reserves did not help. Start killing things. */
+	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
+	if (page)
+		goto got_pg;
+
 	cond_resched();
 	goto retry;
 }

> > Therefore, > > > > > If you believe that my argumentation is incorrect then you are free to > > > nak the patch with your reasoning. But please stop this nitpicking on > > > a nonsense combination of flags. > > > > Nacked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > > > > on patch 2 unless "you explain these patches to __GFP_NOFAIL users and > > provide a means to invoke the OOM killer if someone chose the possibility of > > panic" > > I believe that the changelog contains my reasoning and so far I > haven't heard any _argument_ from you why they are wrong. You just > managed to nitpick on an impossible and pointless gfp_mask combination > and some handwaving on possible lockups without any backing arguments. > This is not something I would consider as a basis for a serious nack.
So > if you really hate this patch then do please start being reasonable and > put some real arguments into your review without any off topics and/or > strawman arguments without any relevance. > Are you aware that I'm not objecting to "change __GFP_NOFAIL not to invoke the OOM killer". What I'm objecting is that you are trying to change !__GFP_FS && !__GFP_NOFAIL allocation requests without offering transition plan like below.

--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1013,7 +1013,7 @@ bool out_of_memory(struct oom_control *oc)
 	 * make sure exclude 0 mask - all other users should have at least
 	 * ___GFP_DIRECT_RECLAIM to get here.
 	 */
-	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
+	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_WANT_OOM_KILLER)))
 		return true;
 
 	/*

If you do want to push

--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1013,7 +1013,7 @@ bool out_of_memory(struct oom_control *oc)
 	 * make sure exclude 0 mask - all other users should have at least
 	 * ___GFP_DIRECT_RECLAIM to get here.
 	 */
-	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
+	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
 		return true;
 
 	/*

change, ask existing __GFP_NOFAIL users with explanations like below I believe that __GFP_NOFAIL should not imply invocation of the OOM killer. Therefore, I want to change __GFP_NOFAIL not to invoke the OOM killer. But since currently the OOM killer is not invoked unless either __GFP_FS or __GFP_NOFAIL is specified, changing __GFP_NOFAIL not to invoke the OOM killer exposes e.g. GFP_NOFS | __GFP_NOFAIL users to a risk of livelocking by not invoking the OOM killer. Although I can't prove that this change never causes a livelock, I don't want to provide an alternative flag like __GFP_WANT_OOM_KILLER. Therefore, all existing __GFP_NOFAIL users must agree to accept the risk introduced by this change. and confirm that all existing __GFP_NOFAIL users are willing to accept the risk of livelocking by not invoking the OOM killer.
Unless you do this procedure, I continue: Nacked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > > or "you accept kmallocwd". > > Are you serious? Are you really suggesting that your patch has to be > accepted in order to have this one in? I'm surprised you took my kmallocwd as a condition for applying this patch, for I was assuming that you would choose the procedure above, which is super straightforward. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] mm, oom: do not enfore OOM killer for __GFP_NOFAIL automatically 2016-12-11 11:23 ` Tetsuo Handa @ 2016-12-11 13:53 ` Tetsuo Handa 2016-12-12 8:52 ` Michal Hocko 0 siblings, 1 reply; 23+ messages in thread From: Tetsuo Handa @ 2016-12-11 13:53 UTC (permalink / raw) To: mhocko; +Cc: akpm, vbabka, hannes, mgorman, rientjes, linux-mm, linux-kernel Tetsuo Handa wrote: > Michal Hocko wrote: > > On Thu 08-12-16 21:53:44, Tetsuo Handa wrote: > > > If we could agree > > > with calling __alloc_pages_nowmark() before out_of_memory() if __GFP_NOFAIL > > > is given, we can avoid locking up while minimizing possibility of invoking > > > the OOM killer... > > > > I do not understand. We do __alloc_pages_nowmark even when oom is called > > for GFP_NOFAIL. > > Where is that? I can find __alloc_pages_nowmark() after out_of_memory() > if __GFP_NOFAIL is given, but I can't find __alloc_pages_nowmark() before > out_of_memory() if __GFP_NOFAIL is given. > > What I mean is below patch folded into > "[PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath". > Oops, I wrongly implemented "__alloc_pages_nowmark() before out_of_memory() if __GFP_NOFAIL is given." case. Updated version is shown below.

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3116,23 +3116,27 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...)
 		/* The OOM killer may not free memory on a specific node */
 		if (gfp_mask & __GFP_THISNODE)
 			goto out;
+	} else {
+		/*
+		 * Help non-failing allocations by giving them access to memory
+		 * reserves
+		 */
+		page = get_page_from_freelist(gfp_mask, order,
+				ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
+		/*
+		 * fallback to ignore cpuset restriction if our nodes
+		 * are depleted
+		 */
+		if (!page)
+			page = get_page_from_freelist(gfp_mask, order,
+					ALLOC_NO_WATERMARKS, ac);
+		if (page)
+			goto out;
 	}
+
 	/* Exhausted what can be done so it's blamo time */
-	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
+	if (out_of_memory(&oc))
 		*did_some_progress = 1;
-
-		if (gfp_mask & __GFP_NOFAIL) {
-			page = get_page_from_freelist(gfp_mask, order,
-					ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac);
-			/*
-			 * fallback to ignore cpuset restriction if our nodes
-			 * are depleted
-			 */
-			if (!page)
-				page = get_page_from_freelist(gfp_mask, order,
-						ALLOC_NO_WATERMARKS, ac);
-		}
-	}
 out:
 	mutex_unlock(&oom_lock);
 	return page;
@@ -3738,6 +3742,11 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	 */
 	WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
 
+	/* Try memory reserves and then start killing things. */
+	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
+	if (page)
+		goto got_pg;
+
 	cond_resched();
 	goto retry;
 }

I'm calling __alloc_pages_may_oom() from the nopage: label because we can reach it without having called __alloc_pages_may_oom(), for PATCH 1/2 is not about stopping the enforcement of the OOM killer for __GFP_NOFAIL. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] mm, oom: do not enfore OOM killer for __GFP_NOFAIL automatically 2016-12-11 13:53 ` Tetsuo Handa @ 2016-12-12 8:52 ` Michal Hocko 0 siblings, 0 replies; 23+ messages in thread From: Michal Hocko @ 2016-12-12 8:52 UTC (permalink / raw) To: Tetsuo Handa Cc: akpm, vbabka, hannes, mgorman, rientjes, linux-mm, linux-kernel On Sun 11-12-16 22:53:55, Tetsuo Handa wrote: > Tetsuo Handa wrote: > > Michal Hocko wrote: > > > On Thu 08-12-16 21:53:44, Tetsuo Handa wrote: > > > > If we could agree > > > > with calling __alloc_pages_nowmark() before out_of_memory() if __GFP_NOFAIL > > > > is given, we can avoid locking up while minimizing possibility of invoking > > > > the OOM killer... > > > > > > I do not understand. We do __alloc_pages_nowmark even when oom is called > > > for GFP_NOFAIL. > > > > Where is that? I can find __alloc_pages_nowmark() after out_of_memory() > > if __GFP_NOFAIL is given, but I can't find __alloc_pages_nowmark() before > > out_of_memory() if __GFP_NOFAIL is given. > > > > What I mean is below patch folded into > > "[PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath". > > > Oops, I wrongly implemented "__alloc_pages_nowmark() before out_of_memory() if > __GFP_NOFAIL is given." case. Updated version is shown below. If you want to introduce such a change then make sure to justify it properly in the changelog. I will not comment on this change here because I believe it is not directly needed for neither of the two patches. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] mm, oom: do not enfore OOM killer for __GFP_NOFAIL automatically 2016-12-11 11:23 ` Tetsuo Handa 2016-12-11 13:53 ` Tetsuo Handa @ 2016-12-12 8:48 ` Michal Hocko 2016-12-14 10:34 ` Michal Hocko 1 sibling, 1 reply; 23+ messages in thread From: Michal Hocko @ 2016-12-12 8:48 UTC (permalink / raw) To: Tetsuo Handa Cc: akpm, vbabka, hannes, mgorman, rientjes, linux-mm, linux-kernel On Sun 11-12-16 20:23:47, Tetsuo Handa wrote: > Michal Hocko wrote: > > On Thu 08-12-16 21:53:44, Tetsuo Handa wrote: [...] > > > > If you believe that my argumentation is incorrect then you are free to > > > > nak the patch with your reasoning. But please stop this nitpicking on > > > > a nonsense combination of flags. > > > > > > Nacked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > > > > > > on patch 2 unless "you explain these patches to __GFP_NOFAIL users and > > > provide a means to invoke the OOM killer if someone chose the possibility of > > > panic" > > > > I believe that the changelog contains my reasoning and so far I > > haven't heard any _argument_ from you why they are wrong. You just > > managed to nitpick on an impossible and pointless gfp_mask combination > > and some handwaving on possible lockups without any backing arguments. > > This is not something I would consider as a basis for a serious nack. So > > if you really hate this patch then do please start being reasonable and > > put some real arguments into your review without any off topics and/or > > strawman arguments without any relevance. > > > > Are you aware that I'm not objecting to "change __GFP_NOFAIL not to invoke > the OOM killer". What I'm objecting is that you are trying to change > !__GFP_FS && !__GFP_NOFAIL allocation requests without offering transition > plan like below.
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1013,7 +1013,7 @@ bool out_of_memory(struct oom_control *oc)
>  	 * make sure exclude 0 mask - all other users should have at least
>  	 * ___GFP_DIRECT_RECLAIM to get here.
>  	 */
> -	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
> +	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_WANT_OOM_KILLER)))
>  		return true;

I have already asked that but let me ask again. _Who_ would use this flag and not risk the premature OOM killer invocation? [...] > I believe that __GFP_NOFAIL should not imply invocation of the OOM killer. > Therefore, I want to change __GFP_NOFAIL not to invoke the OOM killer. > But since currently the OOM killer is not invoked unless either __GFP_FS or > __GFP_NOFAIL is specified, changing __GFP_NOFAIL not to invoke the OOM > killer exposes e.g. GFP_NOFS | __GFP_NOFAIL users to a risk of livelocking > by not invoking the OOM killer. Although I can't prove that this change > never causes a livelock, I don't want to provide an alternative flag like > __GFP_WANT_OOM_KILLER. Therefore, all existing __GFP_NOFAIL users must > agree to accept the risk introduced by this change. I think you are seriously misled here. First of all, I have gone through GFP_NOFS | GFP_NOFAIL users and _none_ of them have added the nofail flag to enforce the OOM killer. Those users just want to express that an allocation failure is simply not acceptable. Most of them were simply conversions from the open-coded do { } while (!(page = alloc_page(GFP_NOFS))); loops. Which _does_ not invoke the OOM killer. And that is the most important point here. Why is the above open-coded (and as you say lockup-prone) loop OK while the GFP_NOFAIL variant should behave any differently? > and confirm that all existing __GFP_NOFAIL users are willing to accept > the risk of livelocking by not invoking the OOM killer.
> 
> Unless you do this procedure, I continue:
> 
> Nacked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

I was hoping for some actual arguments but I am afraid this is circling
in a loop. You are still handwaving with theoretical lockups without any
actual proof they are real. While I am not saying the risk is not there,
I also say that there are other aspects to consider:
	- lockups will happen only if there are no other GFP_FS requests
	  which trigger the OOM, which is quite unlikely in most
	  situations
	- triggering oom for GFP_NOFS | GFP_NOFAIL has a non-negligible
	  risk of pre-mature OOM killer invocation for the same reason
	  we do not trigger oom for GFP_NOFS. Even worse, metadata heavy
	  workloads are much harder to contain so this might be used as
	  a DoS vector.
	- one of the primary points of GFP_NOFAIL's existence is to
	  prevent open coding endless loops in the code because the page
	  allocator can handle most situations more gracefully (e.g.
	  grant access to memory reserves). Having a completely
	  different OOM killer behavior is both confusing and encourages
	  abuse. If we have users who definitely need to control the OOM
	  behavior then we should add a gfp flag for them. But that
	  needs a strong use case, and we should consider whether there
	  are other options to go around it.

I can add the above to the changelog if you think this is helpful but I
still maintain my position that your "this might cause lockups
theoretically" is unfounded and not justified to block the patch. I will
of course retract this patch if you can demonstrate the issue is real or
that any of my argumentation in the changelog is not correct.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread
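The disagreement is easier to see when the two policies are written side by side. Below is a userspace sketch condensing the gate in `out_of_memory()`; the flag values are illustrative placeholders, not the kernel's real gfp bits (those live in include/linux/gfp.h):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the gfp bits discussed in this thread. */
#define __GFP_FS		0x1u
#define __GFP_NOFAIL		0x2u
#define __GFP_DIRECT_RECLAIM	0x4u
#define GFP_NOFS		__GFP_DIRECT_RECLAIM	/* no __GFP_FS bit */

/* Pre-patch gate in out_of_memory(): a sleeping allocation may invoke
 * the OOM killer when it carries __GFP_FS or __GFP_NOFAIL (a zero mask
 * means an explicit trigger such as sysrq and is always allowed). */
static bool oom_invoked_before(unsigned int gfp_mask)
{
	return !gfp_mask || (gfp_mask & (__GFP_FS | __GFP_NOFAIL));
}

/* Post-patch gate: __GFP_NOFAIL no longer overrides the GFP_NOFS rule,
 * so the nofail request behaves like the open-coded retry loop it
 * typically replaced. */
static bool oom_invoked_after(unsigned int gfp_mask)
{
	return !gfp_mask || (gfp_mask & __GFP_FS);
}
```

With the pre-patch predicate, GFP_NOFS|__GFP_NOFAIL invokes the OOM killer even though plain GFP_NOFS never does; after the change both behave the same way.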
* Re: [PATCH 2/2] mm, oom: do not enfore OOM killer for __GFP_NOFAIL automatically 2016-12-12 8:48 ` Michal Hocko @ 2016-12-14 10:34 ` Michal Hocko 0 siblings, 0 replies; 23+ messages in thread From: Michal Hocko @ 2016-12-14 10:34 UTC (permalink / raw) To: Tetsuo Handa Cc: akpm, vbabka, hannes, mgorman, rientjes, linux-mm, linux-kernel On Mon 12-12-16 09:48:37, Michal Hocko wrote: > On Sun 11-12-16 20:23:47, Tetsuo Handa wrote: [...] > > I believe that __GFP_NOFAIL should not imply invocation of the OOM killer. > > Therefore, I want to change __GFP_NOFAIL not to invoke the OOM killer. > > But since currently the OOM killer is not invoked unless either __GFP_FS or > > __GFP_NOFAIL is specified, changing __GFP_NOFAIL not to invoke the OOM > > killer introduces e.g. GFP_NOFS | __GFP_NOFAIL users a risk of livelocking > > by not invoking the OOM killer. Although I can't prove that this change > > never causes livelock, I don't want to provide an alternative flag like > > __GFP_WANT_OOM_KILLER. Therefore, all existing __GFP_NOFAIL users must > > agree with accepting the risk introduced by this change. > > I think you are seriously misled here. First of all, I have gone through > GFP_NOFS | GFP_NOFAIL users and _none_ of them have added the nofail > flag to enforce the OOM killer. Those users just want to express that an > allocation failure is simply not acceptable. Most of them were simply > conversions from the open-conded > do { } while (! (page = page_alloc(GFP_NOFS)); > loops. Which _does_ not invoke the OOM killer. And that is the most > importatnt point here. Why the above open coded (and as you say lockup > prone) loop is OK while GFP_NOFAIL varian should behave any differently? > > > and confirm that all existing __GFP_NOFAIL users are willing to accept > > the risk of livelocking by not invoking the OOM killer. 
> > > > Unless you do this procedure, I continue: > > > > Nacked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > > I was hoping for some actual arguments but I am afraid this is circling > in a loop. You are still handwaving with theoretical lockups without any > actual proof they are real. While I am not saying the risk is not there > I also say that there are other aspects to consider > - lockups will happen only if there are no other GFP_FS requests > which trigger the OOM which is quite unlikely in most > situations > - triggering oom for GFP_NOFS | GFP_NOFAIL has a non negligible > risk of pre-mature OOM killer invocation for the same reason > we do not trigger oom for GFP_NOFS. Even worse metadata heavy > workloads are much harder to contain so this might be used as > a DoS vector. > - one of the primary point of GFP_NOFAIL existence is to prevent > from open coding endless loops in the code because the page > allocator can handle most situations more gracefully (e.g. > grant access to memory reserves). Having a completely > different OOM killer behavior is both confusing and encourages > abuse. If we have users who definitely need to control the OOM > behavior then we should add a gfp flag for them. But this > needs a strong use case and consider whether there are other > options to go around that. > > I can add the above to the changelog if you think this is helpful but I > still maintain my position that your "this might cause lockups > theoretically" is unfounded and not justified to block the patch. I will > of course retract this patch if you can demonstrate the issue is real or > that any of my argumentation in the changelog is not correct. I was thinking about this some more and realized that there is a different risk which this patch would introduce and have to be considered. Heavy GFP_NOFS | __GFP_NOFAIL users might actually deplete memory reserves. 
This was less of a problem with the current code because we invoke the
oom killer and so at least _some_ memory might be freed. I will think
about it some more but I guess I will just allow partial access to the
reserves in the no-oom case.

I will post patch 1 in the meantime because I believe it is a reasonable
cleanup.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: OOM: Better, but still there on 4.9 @ 2016-12-16 7:39 Michal Hocko 2016-12-16 15:58 ` OOM: Better, but still there on Michal Hocko 0 siblings, 1 reply; 23+ messages in thread From: Michal Hocko @ 2016-12-16 7:39 UTC (permalink / raw) To: Nils Holland Cc: linux-kernel, linux-mm, Chris Mason, David Sterba, linux-btrfs [CC linux-mm and btrfs guys] On Thu 15-12-16 23:57:04, Nils Holland wrote: [...] > Of course, none of this are workloads that are new / special in any > way - prior to 4.8, I never experienced any issues doing the exact > same things. > > Dec 15 19:02:16 teela kernel: kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0 > Dec 15 19:02:18 teela kernel: kworker/u4:5 cpuset=/ mems_allowed=0 > Dec 15 19:02:18 teela kernel: CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2 > Dec 15 19:02:18 teela kernel: Hardware name: Hewlett-Packard Compaq 15 Notebook PC/21F7, BIOS F.22 08/06/2014 > Dec 15 19:02:18 teela kernel: Workqueue: writeback wb_workfn (flush-btrfs-1) > Dec 15 19:02:18 teela kernel: eff0b604 c142bcce eff0b734 00000000 eff0b634 c1163332 00000000 00000292 > Dec 15 19:02:18 teela kernel: eff0b634 c1431876 eff0b638 e7fb0b00 e7fa2900 e7fa2900 c1b58785 eff0b734 > Dec 15 19:02:18 teela kernel: eff0b678 c110795f c1043895 eff0b664 c11075c7 00000007 00000000 00000000 > Dec 15 19:02:18 teela kernel: Call Trace: > Dec 15 19:02:18 teela kernel: [<c142bcce>] dump_stack+0x47/0x69 > Dec 15 19:02:18 teela kernel: [<c1163332>] dump_header+0x60/0x178 > Dec 15 19:02:18 teela kernel: [<c1431876>] ? ___ratelimit+0x86/0xe0 > Dec 15 19:02:18 teela kernel: [<c110795f>] oom_kill_process+0x20f/0x3d0 > Dec 15 19:02:18 teela kernel: [<c1043895>] ? has_capability_noaudit+0x15/0x20 > Dec 15 19:02:18 teela kernel: [<c11075c7>] ? 
oom_badness.part.13+0xb7/0x130 > Dec 15 19:02:18 teela kernel: [<c1107df9>] out_of_memory+0xd9/0x260 > Dec 15 19:02:18 teela kernel: [<c110ba0b>] __alloc_pages_nodemask+0xbfb/0xc80 > Dec 15 19:02:18 teela kernel: [<c110414d>] pagecache_get_page+0xad/0x270 > Dec 15 19:02:18 teela kernel: [<c13664a6>] alloc_extent_buffer+0x116/0x3e0 > Dec 15 19:02:18 teela kernel: [<c1334a2e>] btrfs_find_create_tree_block+0xe/0x10 > Dec 15 19:02:18 teela kernel: [<c132a57f>] btrfs_alloc_tree_block+0x1ef/0x5f0 > Dec 15 19:02:18 teela kernel: [<c130f7c3>] __btrfs_cow_block+0x143/0x5f0 > Dec 15 19:02:18 teela kernel: [<c130fe1a>] btrfs_cow_block+0x13a/0x220 > Dec 15 19:02:18 teela kernel: [<c13132f1>] btrfs_search_slot+0x1d1/0x870 > Dec 15 19:02:18 teela kernel: [<c132fcdd>] btrfs_lookup_file_extent+0x4d/0x60 > Dec 15 19:02:18 teela kernel: [<c1354fe6>] __btrfs_drop_extents+0x176/0x1070 > Dec 15 19:02:18 teela kernel: [<c1150377>] ? kmem_cache_alloc+0xb7/0x190 > Dec 15 19:02:18 teela kernel: [<c133dbb5>] ? start_transaction+0x65/0x4b0 > Dec 15 19:02:18 teela kernel: [<c1150597>] ? __kmalloc+0x147/0x1e0 > Dec 15 19:02:18 teela kernel: [<c1345005>] cow_file_range_inline+0x215/0x6b0 > Dec 15 19:02:18 teela kernel: [<c13459fc>] cow_file_range.isra.49+0x55c/0x6d0 > Dec 15 19:02:18 teela kernel: [<c1361795>] ? lock_extent_bits+0x75/0x1e0 > Dec 15 19:02:18 teela kernel: [<c1346d51>] run_delalloc_range+0x441/0x470 > Dec 15 19:02:18 teela kernel: [<c13626e4>] writepage_delalloc.isra.47+0x144/0x1e0 > Dec 15 19:02:18 teela kernel: [<c1364548>] __extent_writepage+0xd8/0x2b0 > Dec 15 19:02:18 teela kernel: [<c1365c4c>] extent_writepages+0x25c/0x380 > Dec 15 19:02:18 teela kernel: [<c1342cd0>] ? 
btrfs_real_readdir+0x610/0x610 > Dec 15 19:02:18 teela kernel: [<c133ff0f>] btrfs_writepages+0x1f/0x30 > Dec 15 19:02:18 teela kernel: [<c110ff85>] do_writepages+0x15/0x40 > Dec 15 19:02:18 teela kernel: [<c1190a95>] __writeback_single_inode+0x35/0x2f0 > Dec 15 19:02:18 teela kernel: [<c119112e>] writeback_sb_inodes+0x16e/0x340 > Dec 15 19:02:18 teela kernel: [<c119145a>] wb_writeback+0xaa/0x280 > Dec 15 19:02:18 teela kernel: [<c1191de8>] wb_workfn+0xd8/0x3e0 > Dec 15 19:02:18 teela kernel: [<c104fd34>] process_one_work+0x114/0x3e0 > Dec 15 19:02:18 teela kernel: [<c1050b4f>] worker_thread+0x2f/0x4b0 > Dec 15 19:02:18 teela kernel: [<c1050b20>] ? create_worker+0x180/0x180 > Dec 15 19:02:18 teela kernel: [<c10552e7>] kthread+0x97/0xb0 > Dec 15 19:02:18 teela kernel: [<c1055250>] ? __kthread_parkme+0x60/0x60 > Dec 15 19:02:18 teela kernel: [<c19b5cb7>] ret_from_fork+0x1b/0x28 > Dec 15 19:02:18 teela kernel: Mem-Info: > Dec 15 19:02:18 teela kernel: active_anon:58685 inactive_anon:90 isolated_anon:0 > active_file:274324 inactive_file:281962 isolated_file:0 OK, so there is still some anonymous memory that could be swapped out and quite a lot of page cache. This might be harder to reclaim because the allocation is a GFP_NOFS request which is limited in its reclaim capabilities. It might be possible that those pagecache pages are pinned in some way by the the filesystem. > unevictable:0 dirty:649 writeback:0 unstable:0 > slab_reclaimable:40662 slab_unreclaimable:17754 > mapped:7382 shmem:202 pagetables:351 bounce:0 > free:206736 free_pcp:332 free_cma:0 > Dec 15 19:02:18 teela kernel: Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? 
no
> Dec 15 19:02:18 teela kernel: DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB writepending:96kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> Dec 15 19:02:18 teela kernel: lowmem_reserve[]: 0 813 3474 3474
> Dec 15 19:02:18 teela kernel: Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB

And this shows that there is no anonymous memory in the lowmem zone.
Note that this request cannot use the highmem zone so no swap out would
help. So if we are not able to reclaim those pages on the file LRU then
we are out of luck.

> Dec 15 19:02:18 teela kernel: lowmem_reserve[]: 0 0 21292 21292
> Dec 15 19:02:18 teela kernel: HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB

That being said, the OOM killer invocation is clearly pointless and
pre-mature. We normally do not invoke it for GFP_NOFS requests exactly
for these reasons. But this is GFP_NOFS|__GFP_NOFAIL which behaves
differently. I am about to change that but my last attempt [1] has to
be rethought.
Now another thing is that the __GFP_NOFAIL which has this nasty side
effect was introduced by me in d1b5c5671d01 ("btrfs: Prevent from early
transaction abort") in 4.3, so I am quite surprised that this has shown
up only in 4.8. Anyway, there might be some other changes in btrfs which
could make it more subtle.

I believe the right way to go around this is to pursue what I've started
in [1]. I will try to prepare something for testing today for you. Stay
tuned. But I would be really happy if somebody from the btrfs camp could
check the NOFS aspect of this allocation. We have already seen
allocation stalls from this path quite recently [2].

[1] http://lkml.kernel.org/r/20161201152517.27698-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/20161214101743.GA25578@dhcp22.suse.cz
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread
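Part of the analysis above is why 781MB of free highmem cannot help: the reported request lacks __GFP_HIGHMEM, so only the low zones are eligible no matter how much highmem is free. A simplified userspace sketch of that zone selection (the flag value is a placeholder, and the real gfp_zone() also handles DMA, DMA32 and MOVABLE):

```c
#include <assert.h>

/* Simplified zone model for a 32-bit highmem machine like the one in
 * the report above. */
enum zone_type { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM };

#define __GFP_HIGHMEM 0x2u	/* illustrative placeholder value */

/* Kernel-internal allocations must stay permanently mapped, so without
 * __GFP_HIGHMEM the request is confined to the low zones. */
static enum zone_type gfp_zone(unsigned int gfp_mask)
{
	return (gfp_mask & __GFP_HIGHMEM) ? ZONE_HIGHMEM : ZONE_NORMAL;
}
```

A GFP_NOFS mask carries no __GFP_HIGHMEM bit, which is why the Normal zone dump above is the one that matters for this allocation.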
* Re: OOM: Better, but still there on 2016-12-16 7:39 OOM: Better, but still there on 4.9 Michal Hocko @ 2016-12-16 15:58 ` Michal Hocko 2016-12-16 15:58 ` [PATCH 2/2] mm, oom: do not enfore OOM killer for __GFP_NOFAIL automatically Michal Hocko 0 siblings, 1 reply; 23+ messages in thread From: Michal Hocko @ 2016-12-16 15:58 UTC (permalink / raw) To: Nils Holland Cc: linux-kernel, linux-mm, Chris Mason, David Sterba, linux-btrfs On Fri 16-12-16 08:39:41, Michal Hocko wrote: [...] > That being said, the OOM killer invocation is clearly pointless and > pre-mature. We normally do not invoke it normally for GFP_NOFS requests > exactly for these reasons. But this is GFP_NOFS|__GFP_NOFAIL which > behaves differently. I am about to change that but my last attempt [1] > has to be rethought. > > Now another thing is that the __GFP_NOFAIL which has this nasty side > effect has been introduced by me d1b5c5671d01 ("btrfs: Prevent from > early transaction abort") in 4.3 so I am quite surprised that this has > shown up only in 4.8. Anyway there might be some other changes in the > btrfs which could make it more subtle. > > I believe the right way to go around this is to pursue what I've started > in [1]. I will try to prepare something for testing today for you. Stay > tuned. But I would be really happy if somebody from the btrfs camp could > check the NOFS aspect of this allocation. We have already seen > allocation stalls from this path quite recently Could you try to run with the two following patches? ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 2/2] mm, oom: do not enfore OOM killer for __GFP_NOFAIL automatically 2016-12-16 15:58 ` OOM: Better, but still there on Michal Hocko @ 2016-12-16 15:58 ` Michal Hocko 2016-12-16 17:31 ` Johannes Weiner 0 siblings, 1 reply; 23+ messages in thread From: Michal Hocko @ 2016-12-16 15:58 UTC (permalink / raw) To: Nils Holland Cc: linux-kernel, linux-mm, Chris Mason, David Sterba, linux-btrfs, Michal Hocko From: Michal Hocko <mhocko@suse.com>

__alloc_pages_may_oom makes sure to skip the OOM killer depending on
the allocation request. This includes lowmem requests, costly high
order requests and others. For a long time __GFP_NOFAIL acted as an
override for all those rules. This is not documented and it can be
quite surprising as well. E.g. GFP_NOFS requests are not invoking the
OOM killer but GFP_NOFS|__GFP_NOFAIL does. So if we try to convert some
of the existing open coded loops around the allocator to a nofail
request (and we have done that in the past) then such a change would
have a non-trivial side effect which is not obvious.

Note that the primary motivation for skipping the OOM killer is to
prevent its pre-mature invocation. The exception has been added by
82553a937f12 ("oom: invoke oom killer for __GFP_NOFAIL"). The changelog
points out that the oom killer has to be invoked, otherwise the request
would loop for ever. But this argument is rather weak because the OOM
killer doesn't really guarantee any forward progress for those
exceptional cases:
	- it will hardly help to form a costly order page, while it can
	  in turn result in a system panic because no oom killable task
	  is left in the end - I believe we certainly do not want to put
	  the system down just because there is a nasty driver asking
	  for an order-9 page with GFP_NOFAIL not realizing all the
	  consequences.
It is much better this request would loop for ever than the massive system disruption - lowmem is also highly unlikely to be freed during OOM killer - GFP_NOFS request could trigger while there is still a lot of memory pinned by filesystems. The pre-mature OOM killer is a real issue as reported by Nils Holland kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0 kworker/u4:5 cpuset=/ mems_allowed=0 CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2 Hardware name: Hewlett-Packard Compaq 15 Notebook PC/21F7, BIOS F.22 08/06/2014 Workqueue: writeback wb_workfn (flush-btrfs-1) eff0b604 c142bcce eff0b734 00000000 eff0b634 c1163332 00000000 00000292 eff0b634 c1431876 eff0b638 e7fb0b00 e7fa2900 e7fa2900 c1b58785 eff0b734 eff0b678 c110795f c1043895 eff0b664 c11075c7 00000007 00000000 00000000 Call Trace: [<c142bcce>] dump_stack+0x47/0x69 [<c1163332>] dump_header+0x60/0x178 [<c1431876>] ? ___ratelimit+0x86/0xe0 [<c110795f>] oom_kill_process+0x20f/0x3d0 [<c1043895>] ? has_capability_noaudit+0x15/0x20 [<c11075c7>] ? oom_badness.part.13+0xb7/0x130 [<c1107df9>] out_of_memory+0xd9/0x260 [<c110ba0b>] __alloc_pages_nodemask+0xbfb/0xc80 [<c110414d>] pagecache_get_page+0xad/0x270 [<c13664a6>] alloc_extent_buffer+0x116/0x3e0 [<c1334a2e>] btrfs_find_create_tree_block+0xe/0x10 [...] 
Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
lowmem_reserve[]: 0 0 21292 21292
HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB

this is a GFP_NOFS|__GFP_NOFAIL request which invokes the OOM killer
because there is clearly nothing reclaimable in the zone Normal, while
there is a lot of page cache which is most probably pinned by the fs
but GFP_NOFS cannot reclaim it.

This patch simply removes the __GFP_NOFAIL special case in order to
have a clearer semantic without surprising side effects. Instead we do
allow nofail requests to access memory reserves to move forward in both
cases: when the OOM killer is invoked and when it should be suppressed.
In the latter case we are more careful and only allow partial access
because we do not want to risk depleting the whole reserves. There are
users doing GFP_NOFS|__GFP_NOFAIL heavily (e.g. __getblk_gfp ->
grow_dev_page).

Introduce the __alloc_pages_cpuset_fallback helper which allows
bypassing allocation constraints for the given gfp mask while it
enforces cpusets whenever possible.
Reported-by: Nils Holland <nholland@tisys.org> Signed-off-by: Michal Hocko <mhocko@suse.com> --- mm/oom_kill.c | 2 +- mm/page_alloc.c | 97 ++++++++++++++++++++++++++++++++++++--------------------- 2 files changed, 62 insertions(+), 37 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index ec9f11d4f094..12a6fce85f61 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -1013,7 +1013,7 @@ bool out_of_memory(struct oom_control *oc) * make sure exclude 0 mask - all other users should have at least * ___GFP_DIRECT_RECLAIM to get here. */ - if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL))) + if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS)) return true; /* diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 095e2fa286de..d6bc3e4f1a0c 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3057,6 +3057,26 @@ void warn_alloc(gfp_t gfp_mask, const char *fmt, ...) } static inline struct page * +__alloc_pages_cpuset_fallback(gfp_t gfp_mask, unsigned int order, + unsigned int alloc_flags, + const struct alloc_context *ac) +{ + struct page *page; + + page = get_page_from_freelist(gfp_mask, order, + alloc_flags|ALLOC_CPUSET, ac); + /* + * fallback to ignore cpuset restriction if our nodes + * are depleted + */ + if (!page) + page = get_page_from_freelist(gfp_mask, order, + alloc_flags, ac); + + return page; +} + +static inline struct page * __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, const struct alloc_context *ac, unsigned long *did_some_progress) { @@ -3091,47 +3111,42 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, if (page) goto out; - if (!(gfp_mask & __GFP_NOFAIL)) { - /* Coredumps can quickly deplete all memory reserves */ - if (current->flags & PF_DUMPCORE) - goto out; - /* The OOM killer will not help higher order allocs */ - if (order > PAGE_ALLOC_COSTLY_ORDER) - goto out; - /* The OOM killer does not needlessly kill tasks for lowmem */ - if (ac->high_zoneidx < ZONE_NORMAL) - goto out; - if (pm_suspended_storage()) - goto out; - 
/* - * XXX: GFP_NOFS allocations should rather fail than rely on - * other request to make a forward progress. - * We are in an unfortunate situation where out_of_memory cannot - * do much for this context but let's try it to at least get - * access to memory reserved if the current task is killed (see - * out_of_memory). Once filesystems are ready to handle allocation - * failures more gracefully we should just bail out here. - */ + /* Coredumps can quickly deplete all memory reserves */ + if (current->flags & PF_DUMPCORE) + goto out; + /* The OOM killer will not help higher order allocs */ + if (order > PAGE_ALLOC_COSTLY_ORDER) + goto out; + /* The OOM killer does not needlessly kill tasks for lowmem */ + if (ac->high_zoneidx < ZONE_NORMAL) + goto out; + if (pm_suspended_storage()) + goto out; + /* + * XXX: GFP_NOFS allocations should rather fail than rely on + * other request to make a forward progress. + * We are in an unfortunate situation where out_of_memory cannot + * do much for this context but let's try it to at least get + * access to memory reserved if the current task is killed (see + * out_of_memory). Once filesystems are ready to handle allocation + * failures more gracefully we should just bail out here. 
+ */ + + /* The OOM killer may not free memory on a specific node */ + if (gfp_mask & __GFP_THISNODE) + goto out; - /* The OOM killer may not free memory on a specific node */ - if (gfp_mask & __GFP_THISNODE) - goto out; - } /* Exhausted what can be done so it's blamo time */ - if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) { + if (out_of_memory(&oc)) { *did_some_progress = 1; - if (gfp_mask & __GFP_NOFAIL) { - page = get_page_from_freelist(gfp_mask, order, - ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac); - /* - * fallback to ignore cpuset restriction if our nodes - * are depleted - */ - if (!page) - page = get_page_from_freelist(gfp_mask, order, + /* + * Help non-failing allocations by giving them access to memory + * reserves + */ + if (gfp_mask & __GFP_NOFAIL) + page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_NO_WATERMARKS, ac); - } } out: mutex_unlock(&oom_lock); @@ -3737,6 +3752,16 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, */ WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER); + /* + * Help non-failing allocations by giving them access to memory + * reserves but do not use ALLOC_NO_WATERMARKS because this + * could deplete whole memory reserves which would just make + * the situation worse + */ + page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac); + if (page) + goto got_pg; + cond_resched(); goto retry; } -- 2.10.2 ^ permalink raw reply related [flat|nested] 23+ messages in thread
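The cpuset fallback the patch introduces can be modeled in isolation. Below is a toy userspace sketch; the node model and function names are invented for illustration, while the kernel version walks real zonelists via get_page_from_freelist():

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 4

/* Toy per-node allocator standing in for get_page_from_freelist():
 * returns the node the page came from, or -1 on failure. */
static int toy_get_page(const int free_pages[], const bool allowed[],
			bool enforce_cpuset)
{
	for (int nid = 0; nid < MAX_NODES; nid++) {
		if (enforce_cpuset && !allowed[nid])
			continue;
		if (free_pages[nid] > 0)
			return nid;
	}
	return -1;
}

/* Mirrors the logic of the __alloc_pages_cpuset_fallback() helper from
 * the patch: honor the cpuset first and ignore it only when the allowed
 * nodes are depleted, so a nofail request cannot be trapped forever by
 * an exhausted cpuset. */
static int cpuset_fallback(const int free_pages[], const bool allowed[])
{
	int nid = toy_get_page(free_pages, allowed, true);
	if (nid < 0)
		nid = toy_get_page(free_pages, allowed, false);
	return nid;
}
```

The two-pass order matters: cpuset placement is still respected whenever the allowed nodes can satisfy the request, and the constraint is relaxed only as a last resort.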
* Re: [PATCH 2/2] mm, oom: do not enfore OOM killer for __GFP_NOFAIL automatically 2016-12-16 15:58 ` [PATCH 2/2] mm, oom: do not enfore OOM killer for __GFP_NOFAIL automatically Michal Hocko @ 2016-12-16 17:31 ` Johannes Weiner 2016-12-16 22:12 ` Michal Hocko 0 siblings, 1 reply; 23+ messages in thread From: Johannes Weiner @ 2016-12-16 17:31 UTC (permalink / raw) To: Michal Hocko Cc: Nils Holland, linux-kernel, linux-mm, Chris Mason, David Sterba, linux-btrfs, Michal Hocko On Fri, Dec 16, 2016 at 04:58:08PM +0100, Michal Hocko wrote: > @@ -1013,7 +1013,7 @@ bool out_of_memory(struct oom_control *oc) > * make sure exclude 0 mask - all other users should have at least > * ___GFP_DIRECT_RECLAIM to get here. > */ > - if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL))) > + if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS)) > return true; This makes sense, we should go back to what we had here. Because it's not that the reported OOMs are premature - there is genuinely no more memory reclaimable from the allocating context - but that this class of allocations should never invoke the OOM killer in the first place. > @@ -3737,6 +3752,16 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > */ > WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER); > > + /* > + * Help non-failing allocations by giving them access to memory > + * reserves but do not use ALLOC_NO_WATERMARKS because this > + * could deplete whole memory reserves which would just make > + * the situation worse > + */ > + page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac); > + if (page) > + goto got_pg; > + But this should be a separate patch, IMO. Do we observe GFP_NOFS lockups when we don't do this? Don't we risk premature exhaustion of the memory reserves, and it's better to wait for other reclaimers to make some progress instead? Should we give reserve access to all GFP_NOFS allocations, or just the ones from a reclaim/cleaning context? 
All that should go into the changelog of a separate allocation booster patch, I think. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] mm, oom: do not enfore OOM killer for __GFP_NOFAIL automatically 2016-12-16 17:31 ` Johannes Weiner @ 2016-12-16 22:12 ` Michal Hocko 2016-12-17 11:17 ` Tetsuo Handa 0 siblings, 1 reply; 23+ messages in thread From: Michal Hocko @ 2016-12-16 22:12 UTC (permalink / raw) To: Johannes Weiner Cc: Nils Holland, linux-kernel, linux-mm, Chris Mason, David Sterba, linux-btrfs

On Fri 16-12-16 12:31:51, Johannes Weiner wrote:
> On Fri, Dec 16, 2016 at 04:58:08PM +0100, Michal Hocko wrote:
> > @@ -1013,7 +1013,7 @@ bool out_of_memory(struct oom_control *oc)
> >  	 * make sure exclude 0 mask - all other users should have at least
> >  	 * ___GFP_DIRECT_RECLAIM to get here.
> >  	 */
> > -	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
> > +	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
> >  		return true;
> 
> This makes sense, we should go back to what we had here. Because it's
> not that the reported OOMs are premature - there is genuinely no more
> memory reclaimable from the allocating context - but that this class
> of allocations should never invoke the OOM killer in the first place.

Agreed, at least not with the current implementation. If we had proper
accounting where we know that the memory pinned by the fs is not really
there, then we could invoke the oom killer and be safe.

> > @@ -3737,6 +3752,16 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  	 */
> >  	WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
> > 
> > +	/*
> > +	 * Help non-failing allocations by giving them access to memory
> > +	 * reserves but do not use ALLOC_NO_WATERMARKS because this
> > +	 * could deplete whole memory reserves which would just make
> > +	 * the situation worse
> > +	 */
> > +	page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac);
> > +	if (page)
> > +		goto got_pg;
> > +
> 
> But this should be a separate patch, IMO.
> 
> Do we observe GFP_NOFS lockups when we don't do this?
This is hard to tell, but considering users like grow_dev_page we can
get stuck with very slow progress, I believe. Those allocations could
see some help.

> Don't we risk premature exhaustion of the memory reserves, and it's
> better to wait for other reclaimers to make some progress instead?

Waiting for other reclaimers would be preferable but we should at least
give these some priority, which is what ALLOC_HARDER should help with.

> Should we give reserve access to all GFP_NOFS allocations, or just the
> ones from a reclaim/cleaning context?

I would focus only on those which are important enough. Which are those
is a harder question. But certainly those with GFP_NOFAIL are important
enough.

> All that should go into the changelog of a separate allocation booster
> patch, I think.

The reason I did both in the same patch is to address the concern about
potential lockups when NOFS|NOFAIL cannot make any progress. I've
chosen ALLOC_HARDER to give a minimum portion of the reserves so that
we do not risk blocking out other high priority users, while still
helping at least a bit and preventing starvation when other reclaimers
are faster to consume the reclaimed memory.

I can extend the changelog of course, but I believe that having both
changes together makes some sense. NOFS|NOFAIL allocations are not all
that rare and sometimes we really depend on them making further
progress.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread
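The distinction between ALLOC_HARDER and ALLOC_NO_WATERMARKS discussed above boils down to how far a request may dip below the min watermark. A sketch of that tiering (the kernel's __zone_watermark_ok() lowers min by a quarter for ALLOC_HARDER; the comparison form here is simplified for illustration):

```c
#include <assert.h>
#include <stdbool.h>

enum alloc_prio { ALLOC_NORMAL, ALLOC_HARDER, ALLOC_NO_WATERMARKS };

/* Illustrative watermark check: a normal request must stay above the
 * min watermark, ALLOC_HARDER may dip a quarter of the way into the
 * reserves, and ALLOC_NO_WATERMARKS may drain them completely. */
static bool watermark_ok(long free_pages, long min_wmark,
			 enum alloc_prio prio)
{
	switch (prio) {
	case ALLOC_NO_WATERMARKS:
		return free_pages > 0;
	case ALLOC_HARDER:
		return free_pages > min_wmark - min_wmark / 4;
	default:
		return free_pages > min_wmark;
	}
}
```

This is why the patch uses ALLOC_HARDER for the no-oom nofail retry: the request gets priority over normal allocations but a bounded slice of the reserves remains for other high priority users.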
* Re: [PATCH 2/2] mm, oom: do not enfore OOM killer for __GFP_NOFAIL automatically 2016-12-16 22:12 ` Michal Hocko @ 2016-12-17 11:17 ` Tetsuo Handa 2016-12-18 16:37 ` Michal Hocko 0 siblings, 1 reply; 23+ messages in thread From: Tetsuo Handa @ 2016-12-17 11:17 UTC (permalink / raw) To: Michal Hocko, Johannes Weiner Cc: Nils Holland, linux-kernel, linux-mm, Chris Mason, David Sterba, linux-btrfs Michal Hocko wrote: > On Fri 16-12-16 12:31:51, Johannes Weiner wrote: >>> @@ -3737,6 +3752,16 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, >>> */ >>> WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER); >>> >>> + /* >>> + * Help non-failing allocations by giving them access to memory >>> + * reserves but do not use ALLOC_NO_WATERMARKS because this >>> + * could deplete whole memory reserves which would just make >>> + * the situation worse >>> + */ >>> + page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac); >>> + if (page) >>> + goto got_pg; >>> + >> >> But this should be a separate patch, IMO. >> >> Do we observe GFP_NOFS lockups when we don't do this? > > this is hard to tell but considering users like grow_dev_page we can get > stuck with a very slow progress I believe. Those allocations could see > some help. > >> Don't we risk >> premature exhaustion of the memory reserves, and it's better to wait >> for other reclaimers to make some progress instead? > > waiting for other reclaimers would be preferable but we should at least > give these some priority, which is what ALLOC_HARDER should help with. > >> Should we give >> reserve access to all GFP_NOFS allocations, or just the ones from a >> reclaim/cleaning context? > > I would focus only for those which are important enough. Which are those > is a harder question. But certainly those with GFP_NOFAIL are important > enough. > >> All that should go into the changelog of a separate allocation booster >> patch, I think. 
>
> The reason I did both in the same patch is to address the concern about
> potential lockups when NOFS|NOFAIL cannot make any progress. I've chosen
> ALLOC_HARDER to give the minimum portion of the reserves so that we do
> not risk other high priority users to be blocked out but still help a
> bit at least and prevent from starvation when other reclaimers are
> faster to consume the reclaimed memory.
>
> I can extend the changelog of course but I believe that having both
> changes together makes some sense. NOFS|NOFAIL allocations are not all
> that rare and sometimes we really depend on them making a further
> progress.
>

I feel that allowing access to memory reserves based on __GFP_NOFAIL might
not make sense. My understanding is that the actual I/O operations
triggered by I/O requests from filesystem code are processed by other
threads. Even if we grant access to memory reserves for GFP_NOFS |
__GFP_NOFAIL allocations by fs code, I think that it is possible that
memory allocations by the underlying bio code fail to make further
progress unless memory reserves are granted as well.

Below is a typical trace which I observe under an OOM-lockup situation
(though this trace is from an OOM stress test using XFS).

----------------------------------------
[ 1845.187246] MemAlloc: kworker/2:1(14498) flags=0x4208060 switches=323636 seq=48 gfp=0x2400000(GFP_NOIO) order=0 delay=430400 uninterruptible
[ 1845.187248] kworker/2:1 D12712 14498 2 0x00000080
[ 1845.187251] Workqueue: events_freezable_power_ disk_events_workfn
[ 1845.187252] Call Trace:
[ 1845.187253] ? __schedule+0x23f/0xba0
[ 1845.187254] schedule+0x38/0x90
[ 1845.187255] schedule_timeout+0x205/0x4a0
[ 1845.187256] ? del_timer_sync+0xd0/0xd0
[ 1845.187257] schedule_timeout_uninterruptible+0x25/0x30
[ 1845.187258] __alloc_pages_nodemask+0x1035/0x10e0
[ 1845.187259] ? alloc_request_struct+0x14/0x20
[ 1845.187261] alloc_pages_current+0x96/0x1b0
[ 1845.187262] ? bio_alloc_bioset+0x20f/0x2e0
[ 1845.187264] bio_copy_kern+0xc4/0x180
[ 1845.187265] blk_rq_map_kern+0x6f/0x120
[ 1845.187268] __scsi_execute.isra.23+0x12f/0x160
[ 1845.187270] scsi_execute_req_flags+0x8f/0x100
[ 1845.187271] sr_check_events+0xba/0x2b0 [sr_mod]
[ 1845.187274] cdrom_check_events+0x13/0x30 [cdrom]
[ 1845.187275] sr_block_check_events+0x25/0x30 [sr_mod]
[ 1845.187276] disk_check_events+0x5b/0x150
[ 1845.187277] disk_events_workfn+0x17/0x20
[ 1845.187278] process_one_work+0x1fc/0x750
[ 1845.187279] ? process_one_work+0x167/0x750
[ 1845.187279] worker_thread+0x126/0x4a0
[ 1845.187280] kthread+0x10a/0x140
[ 1845.187281] ? process_one_work+0x750/0x750
[ 1845.187282] ? kthread_create_on_node+0x60/0x60
[ 1845.187283] ret_from_fork+0x2a/0x40
----------------------------------------

I think that this GFP_NOIO allocation request needs to consume more
memory reserves than a GFP_NOFS allocation request to make progress. Do
we want to add __GFP_NOFAIL to this GFP_NOIO allocation request in order
to allow access to memory reserves as well as to GFP_NOFS | __GFP_NOFAIL
allocation requests?

^ permalink raw reply	[flat|nested] 23+ messages in thread
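For reference, the gfp=0x2400000(GFP_NOIO) mask in the trace above is the pair of reclaim bits alone; GFP_NOFS and GFP_KERNEL are progressively less restricted supersets of it. A userspace mirror for illustration only: the ___GFP_IO/___GFP_FS values are taken from v4.9 include/linux/gfp.h, and __GFP_RECLAIM is defined here by its combined value as it appears in the trace:

```c
#include <assert.h>

/*
 * Userspace mirror for illustration only.  __GFP_RECLAIM is the pair of
 * reclaim bits whose combined value appears in the trace above
 * (gfp=0x2400000 == GFP_NOIO); ___GFP_IO/___GFP_FS are the v4.9 values.
 */
#define ___GFP_IO	0x40u
#define ___GFP_FS	0x80u
#define __GFP_RECLAIM	0x2400000u

#define GFP_NOIO	(__GFP_RECLAIM)				/* no IO, no FS */
#define GFP_NOFS	(__GFP_RECLAIM | ___GFP_IO)		/* IO ok, no FS */
#define GFP_KERNEL	(__GFP_RECLAIM | ___GFP_IO | ___GFP_FS)	/* both ok */

/* GFP_NOIO allocations are the most restricted of the three levels. */
static int is_subset(unsigned int sub, unsigned int super)
{
	return (sub & super) == sub;
}
```

This is why the bio-layer GFP_NOIO allocation in the trace has even fewer reclaim options than the GFP_NOFS | __GFP_NOFAIL request from fs code that triggered it.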
* Re: [PATCH 2/2] mm, oom: do not enfore OOM killer for __GFP_NOFAIL automatically
  2016-12-17 11:17           ` Tetsuo Handa
@ 2016-12-18 16:37             ` Michal Hocko
  0 siblings, 0 replies; 23+ messages in thread
From: Michal Hocko @ 2016-12-18 16:37 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Johannes Weiner, Nils Holland, linux-kernel, linux-mm,
	Chris Mason, David Sterba, linux-btrfs

On Sat 17-12-16 20:17:07, Tetsuo Handa wrote:
[...]
> I feel that allowing access to memory reserves based on __GFP_NOFAIL might not
> make sense. My understanding is that actual I/O operation triggered by I/O
> requests by filesystem code are processed by other threads. Even if we grant
> access to memory reserves to GFP_NOFS | __GFP_NOFAIL allocations by fs code,
> I think that it is possible that memory allocations by underlying bio code
> fails to make a further progress unless memory reserves are granted as well.

The IO layer should rely on mempools to guarantee forward progress.

> Below is a typical trace which I observe under OOM lockuped situation (though
> this trace is from an OOM stress test using XFS).
>
> ----------------------------------------
> [ 1845.187246] MemAlloc: kworker/2:1(14498) flags=0x4208060 switches=323636 seq=48 gfp=0x2400000(GFP_NOIO) order=0 delay=430400 uninterruptible
> [ 1845.187248] kworker/2:1 D12712 14498 2 0x00000080
> [ 1845.187251] Workqueue: events_freezable_power_ disk_events_workfn
> [ 1845.187252] Call Trace:
> [ 1845.187253] ? __schedule+0x23f/0xba0
> [ 1845.187254] schedule+0x38/0x90
> [ 1845.187255] schedule_timeout+0x205/0x4a0
> [ 1845.187256] ? del_timer_sync+0xd0/0xd0
> [ 1845.187257] schedule_timeout_uninterruptible+0x25/0x30
> [ 1845.187258] __alloc_pages_nodemask+0x1035/0x10e0
> [ 1845.187259] ? alloc_request_struct+0x14/0x20
> [ 1845.187261] alloc_pages_current+0x96/0x1b0
> [ 1845.187262] ? bio_alloc_bioset+0x20f/0x2e0
> [ 1845.187264] bio_copy_kern+0xc4/0x180
> [ 1845.187265] blk_rq_map_kern+0x6f/0x120
> [ 1845.187268] __scsi_execute.isra.23+0x12f/0x160
> [ 1845.187270] scsi_execute_req_flags+0x8f/0x100
> [ 1845.187271] sr_check_events+0xba/0x2b0 [sr_mod]
> [ 1845.187274] cdrom_check_events+0x13/0x30 [cdrom]
> [ 1845.187275] sr_block_check_events+0x25/0x30 [sr_mod]
> [ 1845.187276] disk_check_events+0x5b/0x150
> [ 1845.187277] disk_events_workfn+0x17/0x20
> [ 1845.187278] process_one_work+0x1fc/0x750
> [ 1845.187279] ? process_one_work+0x167/0x750
> [ 1845.187279] worker_thread+0x126/0x4a0
> [ 1845.187280] kthread+0x10a/0x140
> [ 1845.187281] ? process_one_work+0x750/0x750
> [ 1845.187282] ? kthread_create_on_node+0x60/0x60
> [ 1845.187283] ret_from_fork+0x2a/0x40
> ----------------------------------------
>
> I think that this GFP_NOIO allocation request needs to consume more memory reserves
> than GFP_NOFS allocation request to make progress.

AFAIU, this is an allocation path which doesn't block forward progress
of regular IO. It is merely a check whether there is a new medium in the
CDROM (aka regular polling of the device). I really fail to see any
reason why this one should get any access to memory reserves at all. I
actually do not see any reason why it should be NOIO in the first place
but I am not familiar with this code much so there might be some reasons
for that. The fact that it might stall under heavy memory pressure is
sad but who actually cares?

> Do we want to add __GFP_NOFAIL to this GFP_NOIO allocation request
> in order to allow access to memory reserves as well as GFP_NOFS |
> __GFP_NOFAIL allocation request?

Why?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread
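The mempool guarantee Michal refers to works by preallocating a reserve of objects per pool, so an IO-path allocation can always fall back to the reserve when the page allocator fails; bio_alloc_bioset() in the trace is such a mempool-backed allocation. A minimal userspace analogue of the idea (this is not the kernel's mempool API; `alloc_fails` is a stand-in for allocator failure under OOM):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_MIN_NR 4	/* objects preallocated to guarantee progress */

struct pool {
	void *reserve[POOL_MIN_NR];
	int nr;
	size_t obj_size;
};

/* Set to 1 to model the general allocator failing under OOM. */
static int alloc_fails;

static void pool_init(struct pool *p, size_t obj_size)
{
	p->nr = 0;
	p->obj_size = obj_size;
	while (p->nr < POOL_MIN_NR)
		p->reserve[p->nr++] = malloc(obj_size);
}

static void *pool_alloc(struct pool *p)
{
	void *obj = alloc_fails ? NULL : malloc(p->obj_size);

	if (!obj && p->nr > 0)
		obj = p->reserve[--p->nr];	/* dip into the reserve */
	return obj;	/* NULL only once the reserve is drained too */
}

static void pool_free(struct pool *p, void *obj)
{
	if (p->nr < POOL_MIN_NR)
		p->reserve[p->nr++] = obj;	/* refill the reserve first */
	else
		free(obj);
}
```

As long as every object is eventually freed back, some allocation always succeeds regardless of allocator state, which is the forward-progress guarantee the IO layer relies on even when its GFP_NOIO page allocations fail.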
end of thread, other threads:[~2016-12-18 16:37 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-01 15:25 [PATCH 0/2] GFP_NOFAIL cleanups Michal Hocko
2016-12-01 15:25 ` [PATCH 1/2] mm: consolidate GFP_NOFAIL checks in the allocator slowpath Michal Hocko
2016-12-01 15:25 ` [PATCH 2/2] mm, oom: do not enfore OOM killer for __GFP_NOFAIL automatically Michal Hocko
2016-12-02  7:23   ` Vlastimil Babka
2016-12-05 13:45   ` Tetsuo Handa
2016-12-05 14:10     ` Michal Hocko
2016-12-06  8:27       ` Michal Hocko
2016-12-06 10:38         ` Tetsuo Handa
2016-12-06 11:03           ` Vlastimil Babka
2016-12-06 19:25             ` Michal Hocko
2016-12-06 19:22           ` Michal Hocko
2016-12-08 12:53             ` Tetsuo Handa
2016-12-08 13:47               ` Michal Hocko
2016-12-11 11:23                 ` Tetsuo Handa
2016-12-11 13:53                   ` Tetsuo Handa
2016-12-12  8:52                     ` Michal Hocko
2016-12-12  8:48                   ` Michal Hocko
2016-12-14 10:34                     ` Michal Hocko
2016-12-16  7:39 OOM: Better, but still there on 4.9 Michal Hocko
2016-12-16 15:58 ` OOM: Better, but still there on Michal Hocko
2016-12-16 15:58   ` [PATCH 2/2] mm, oom: do not enfore OOM killer for __GFP_NOFAIL automatically Michal Hocko
2016-12-16 17:31     ` Johannes Weiner
2016-12-16 22:12       ` Michal Hocko
2016-12-17 11:17         ` Tetsuo Handa
2016-12-18 16:37           ` Michal Hocko