* [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable. @ 2017-06-01 11:43 Tetsuo Handa 2017-06-01 11:59 ` Michal Hocko 0 siblings, 1 reply; 43+ messages in thread From: Tetsuo Handa @ 2017-06-01 11:43 UTC (permalink / raw) To: akpm, linux-mm Cc: Tetsuo Handa, Cong Wang, Dave Hansen, Johannes Weiner, Mel Gorman, Michal Hocko, Vlastimil Babka Cong Wang has reported a lockup when running LTP memcg_stress test [1]. Although the cause of allocation stall is unknown (no OOM killer messages found within partially written output), that test put much stress with uncontrolled concurrent warn_alloc() calls enough to trigger NMI watchdog on soft lockup at cpu_relax() loop in dump_stack(). (Immaterial lines removed.) ---------- [16213.477525] Call Trace: [16213.487314] [<ffffffff86395ab7>] dump_stack+0x4d/0x66 [16213.497723] [<ffffffff8619a6c6>] warn_alloc+0x116/0x130 [16213.505627] NMI watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [cleanup:7598] [16213.505713] CPU: 5 PID: 7598 Comm: cleanup Not tainted 4.9.23.el7.twitter.x86_64 #1 [16213.505714] Hardware name: Dell Inc. PowerEdge C6220/03C9JJ, BIOS 2.2.3 11/07/2013 [16213.505722] RIP: 0010:[<ffffffff86395a93>] [<ffffffff86395a93>] dump_stack+0x29/0x66 [16213.505795] Call Trace: [16213.505799] [<ffffffff8619a6c6>] warn_alloc+0x116/0x130 [16213.505804] [<ffffffff8619b0bf>] __alloc_pages_slowpath+0x96f/0xbd0 [16213.505807] [<ffffffff8619b4f6>] __alloc_pages_nodemask+0x1d6/0x230 [16213.505810] [<ffffffff861e61d5>] alloc_pages_current+0x95/0x140 [16213.505814] [<ffffffff8619114a>] __page_cache_alloc+0xca/0xe0 [16213.505822] [<ffffffff86194312>] filemap_fault+0x312/0x4d0 [16213.505826] [<ffffffff862a8196>] ext4_filemap_fault+0x36/0x50 [16213.505828] [<ffffffff861c2c71>] __do_fault+0x71/0x130 [16213.505830] [<ffffffff861c6cec>] handle_mm_fault+0xebc/0x13a0 [16213.505832] [<ffffffff863b434d>] ? list_del+0xd/0x30 [16213.505833] [<ffffffff86257e58>] ? 
ep_poll+0x308/0x320 [16213.505835] [<ffffffff860509e4>] __do_page_fault+0x254/0x4a0 [16213.505837] [<ffffffff86050c50>] do_page_fault+0x20/0x70 [16213.505839] [<ffffffff86700aa2>] page_fault+0x22/0x30 [16213.505877] Code: 5d c3 55 83 c9 ff 48 89 e5 41 54 53 9c 5b fa 65 8b 15 4a 47 c7 79 89 c8 f0 0f b1 15 48 a3 92 00 83 f8 ff 74 0a 39 c2 74 0b 53 9d <f3> 90 eb dd 45 31 e4 eb 06 41 bc 01 00 00 00 48 c7 c7 41 1a a2 [16214.250659] NMI watchdog: BUG: soft lockup - CPU#17 stuck for 22s! [scribed:3905] [16214.250765] CPU: 17 PID: 3905 Comm: scribed Tainted: G L 4.9.23.el7.twitter.x86_64 #1 [16214.250767] Hardware name: Dell Inc. PowerEdge C6220/03C9JJ, BIOS 2.2.3 11/07/2013 [16214.250776] RIP: 0010:[<ffffffff86395a93>] [<ffffffff86395a93>] dump_stack+0x29/0x66 [16214.250840] Call Trace: [16214.250843] [<ffffffff8619a6c6>] warn_alloc+0x116/0x130 [16214.250846] [<ffffffff8619b0bf>] __alloc_pages_slowpath+0x96f/0xbd0 [16214.250849] [<ffffffff8619b4f6>] __alloc_pages_nodemask+0x1d6/0x230 [16214.250853] [<ffffffff861e61d5>] alloc_pages_current+0x95/0x140 [16214.250855] [<ffffffff8619114a>] __page_cache_alloc+0xca/0xe0 [16214.250857] [<ffffffff86194312>] filemap_fault+0x312/0x4d0 [16214.250859] [<ffffffff862a8196>] ext4_filemap_fault+0x36/0x50 [16214.250860] [<ffffffff861c2c71>] __do_fault+0x71/0x130 [16214.250863] [<ffffffff861c6cec>] handle_mm_fault+0xebc/0x13a0 [16214.250865] [<ffffffff860c25c1>] ? 
pick_next_task_fair+0x471/0x4a0 [16214.250869] [<ffffffff860509e4>] __do_page_fault+0x254/0x4a0 [16214.250871] [<ffffffff86050c50>] do_page_fault+0x20/0x70 [16214.250873] [<ffffffff86700aa2>] page_fault+0x22/0x30 [16214.250918] Code: 5d c3 55 83 c9 ff 48 89 e5 41 54 53 9c 5b fa 65 8b 15 4a 47 c7 79 89 c8 f0 0f b1 15 48 a3 92 00 83 f8 ff 74 0a 39 c2 74 0b 53 9d <f3> 90 eb dd 45 31 e4 eb 06 41 bc 01 00 00 00 48 c7 c7 41 1a a2 [16215.157526] [<ffffffff8619b0bf>] __alloc_pages_slowpath+0x96f/0xbd0 [16215.177523] [<ffffffff8619b4f6>] __alloc_pages_nodemask+0x1d6/0x230 [16215.197540] [<ffffffff861e61d5>] alloc_pages_current+0x95/0x140 [16215.217331] [<ffffffff8619114a>] __page_cache_alloc+0xca/0xe0 [16215.237374] [<ffffffff86194312>] filemap_fault+0x312/0x4d0 [16215.257136] [<ffffffff862a8196>] ext4_filemap_fault+0x36/0x50 [16215.276950] [<ffffffff861c2c71>] __do_fault+0x71/0x130 [16215.287555] [<ffffffff861c6cec>] handle_mm_fault+0xebc/0x13a0 [16215.307538] [<ffffffff860509e4>] __do_page_fault+0x254/0x4a0 [16215.327165] [<ffffffff86050c50>] do_page_fault+0x20/0x70 [16215.346964] [<ffffffff86700aa2>] page_fault+0x22/0x30 ---------- Both the printk() flooding problem caused by uncontrolled concurrent warn_alloc() calls and the OOM killer being unable to send SIGKILL problem caused by preemption were already pointed out by me [2], but these problems are left unfixed because Michal does not like serialization from allocation path because he is worrying about unexpected side effects [3]. But it's time to at least workaround this problem because the former problem is now reported by other than me. We have a choice (spinlock or mutex or [4]) here for serializing warn_alloc() messages. If we can assume that that NMI context never calls __alloc_pages_nodemask() without __GFP_NOWARN, we could use a spinlock so that all memory allocation stall/failure messages do not get mixed. 
But since the lockup above was triggered by busy waiting at dump_stack(), we might still hit lockups if we do busy waiting at warn_alloc(). Thus, spinlock is not a safe choice. If we can give up serializing !__GFP_DIRECT_RECLAIM memory allocation failure messages, we could use a mutex. Although avoid mixing both memory allocation stall/failure messages and the OOM killer messages would be nice, oom_lock mutex should not be used for this purpose, for waiting for oom_lock mutex at warn_alloc() can prevent anybody from calling out_of_memory() from __alloc_pages_may_oom() (i.e. after all lockups). Therefore, this patch uses a mutex dedicated for warn_alloc() like suggested in [3]. I believe that the safest fix is [4] which offloads reporting of allocation stalls to the khungtaskd kernel thread so that we can eliminate a lot of noisy duplicates, avoid ratelimit filtering which randomly drops important traces needed for debugging, and allow triggering more useful actions. [1] http://lkml.kernel.org/r/CAM_iQpWuPVGc2ky8M-9yukECtS+zKjiDasNymX7rMcBjBFyM_A@mail.gmail.com [2] http://lkml.kernel.org/r/1481020439-5867-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp [3] http://lkml.kernel.org/r/20161212125535.GA3185@dhcp22.suse.cz [4] http://lkml.kernel.org/r/1495331504-12480-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Reported-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Suggested-by: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> --- mm/page_alloc.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 682ecac..4f98ff6 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3185,10 +3185,13 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...) 
 	va_list args;
 	static DEFINE_RATELIMIT_STATE(nopage_rs, DEFAULT_RATELIMIT_INTERVAL,
 				      DEFAULT_RATELIMIT_BURST);
+	static DEFINE_MUTEX(warn_alloc_lock);
 
 	if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs))
 		return;
 
+	if (gfp_mask & __GFP_DIRECT_RECLAIM)
+		mutex_lock(&warn_alloc_lock);
 	pr_warn("%s: ", current->comm);
 
 	va_start(args, fmt);
@@ -3207,6 +3210,8 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 
 	dump_stack();
 	warn_alloc_show_mem(gfp_mask, nodemask);
+	if (gfp_mask & __GFP_DIRECT_RECLAIM)
+		mutex_unlock(&warn_alloc_lock);
 }
 
 static inline struct page *
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 43+ messages in thread
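The serialization scheme in the patch above can be sketched as a userspace analogue. This is only an illustration of the same idea (take the lock only when the caller is allowed to sleep, which the patch keys on __GFP_DIRECT_RECLAIM); the names `warn_lock`, `warn_alloc_demo` and `run_demo` are made up for the sketch, not kernel symbols:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace analogue of the patch: serialize the "report"
 * section with a mutex, but only for callers that may sleep (the kernel
 * patch keys this on __GFP_DIRECT_RECLAIM so atomic contexts never block). */
static pthread_mutex_t warn_lock = PTHREAD_MUTEX_INITIALIZER;
static int reports;

static void warn_alloc_demo(bool can_sleep)
{
	if (can_sleep)
		pthread_mutex_lock(&warn_lock);
	/* In the kernel this region would print the banner, dump_stack()
	 * and show_mem(); the mutex keeps concurrent reports from being
	 * interleaved in the log. */
	reports++;
	if (can_sleep)
		pthread_mutex_unlock(&warn_lock);
}

static void *worker(void *arg)
{
	(void)arg;
	warn_alloc_demo(true);
	return NULL;
}

/* Run nthreads concurrent "stall reports"; returns how many completed. */
static int run_demo(int nthreads)
{
	pthread_t t[64];
	int i;

	reports = 0;
	if (nthreads > 64)
		nthreads = 64;
	for (i = 0; i < nthreads; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (i = 0; i < nthreads; i++)
		pthread_join(t[i], NULL);
	return reports;
}
```

With `can_sleep` false the report path is left unserialized, matching the patch's decision never to sleep for !__GFP_DIRECT_RECLAIM allocations.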
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable. 2017-06-01 11:43 [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable Tetsuo Handa @ 2017-06-01 11:59 ` Michal Hocko 2017-06-01 13:11 ` Tetsuo Handa 0 siblings, 1 reply; 43+ messages in thread From: Michal Hocko @ 2017-06-01 11:59 UTC (permalink / raw) To: Tetsuo Handa Cc: akpm, linux-mm, Cong Wang, Dave Hansen, Johannes Weiner, Mel Gorman, Vlastimil Babka On Thu 01-06-17 20:43:47, Tetsuo Handa wrote: > Cong Wang has reported a lockup when running LTP memcg_stress test [1]. This seems to be on an old and not pristine kernel. Does it happen also on the vanilla up-to-date kernel? [...] > Therefore, this patch uses a mutex dedicated for warn_alloc() like > suggested in [3]. As I've said previously. We have rate limiting and if that doesn't work out well, let's tune it. The lock should be the last resort to go with. We already throttle show_mem, maybe we can throttle dump_stack as well, although it sounds a bit strange that this adds so much to the picture. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable. 2017-06-01 11:59 ` Michal Hocko @ 2017-06-01 13:11 ` Tetsuo Handa 2017-06-01 13:28 ` Michal Hocko 0 siblings, 1 reply; 43+ messages in thread From: Tetsuo Handa @ 2017-06-01 13:11 UTC (permalink / raw) To: mhocko Cc: akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka Michal Hocko wrote: > On Thu 01-06-17 20:43:47, Tetsuo Handa wrote: > > Cong Wang has reported a lockup when running LTP memcg_stress test [1]. > > This seems to be on an old and not pristine kernel. Does it happen also > on the vanilla up-to-date kernel? 4.9 is not an old kernel! It might be close to the kernel version which enterprise distributions would choose for their next long term supported version. And please stop saying "can you reproduce your problem with latest linux-next (or at least latest linux)?" Not everybody can use the vanilla up-to-date kernel! What I'm pushing via kmallocwd patch is to prepare for overlooked problems so that enterprise distributors can collect information and identify what changes are needed to be backported. As long as you ignore problems not happened with latest linux-next (or at least latest linux), enterprise distribution users can do nothing. > > [...] > > Therefore, this patch uses a mutex dedicated for warn_alloc() like > > suggested in [3]. > > As I've said previously. We have rate limiting and if that doesn't work > out well, let's tune it. The lock should be the last resort to go with. > We already throttle show_mem, maybe we can throttle dump_stack as well, > although it sounds a bit strange that this adds so much to the picture. Ratelimiting never works well. It randomly drops information which is useful for debugging. Uncontrolled concurrent dump_stack() causes lockups. And restricting dump_stack() drops more information. What we should do is to yield CPU time to operations which might do useful things (let threads not doing memory allocation; e.g. 
let printk kernel threads flush the pending buffer, let console drivers write the output to consoles, let watchdog kernel threads report what is happening). When a memory allocation request is stalling, serialization via waiting for a lock does help. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable. 2017-06-01 13:11 ` Tetsuo Handa @ 2017-06-01 13:28 ` Michal Hocko 2017-06-01 22:10 ` Andrew Morton 0 siblings, 1 reply; 43+ messages in thread From: Michal Hocko @ 2017-06-01 13:28 UTC (permalink / raw) To: Tetsuo Handa Cc: akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka On Thu 01-06-17 22:11:13, Tetsuo Handa wrote: > Michal Hocko wrote: > > On Thu 01-06-17 20:43:47, Tetsuo Handa wrote: > > > Cong Wang has reported a lockup when running LTP memcg_stress test [1]. > > > > This seems to be on an old and not pristine kernel. Does it happen also > > on the vanilla up-to-date kernel? > > 4.9 is not an old kernel! It might be close to the kernel version which > enterprise distributions would choose for their next long term supported > version. > > And please stop saying "can you reproduce your problem with latest > linux-next (or at least latest linux)?" Not everybody can use the vanilla > up-to-date kernel! The changelog mentioned that the source of stalls is not clear so this might be out-of-tree patches doing something wrong and dump_stack showing up just because it is called often. This wouldn't be the first time I have seen something like that. I am not really keen on adding heavy lifting for something that is not clearly debugged and based on hand waving and speculations. > What I'm pushing via kmallocwd patch is to prepare for overlooked problems > so that enterprise distributors can collect information and identify what > changes are needed to be backported. > > As long as you ignore problems not happened with latest linux-next (or > at least latest linux), enterprise distribution users can do nothing. > > > > > [...] > > > Therefore, this patch uses a mutex dedicated for warn_alloc() like > > > suggested in [3]. > > > > As I've said previously. We have rate limiting and if that doesn't work > > out well, let's tune it. The lock should be the last resort to go with. 
> > We already throttle show_mem, maybe we can throttle dump_stack as well, > > although it sounds a bit strange that this adds so much to the picture. > > Ratelimiting never works well. It randomly drops information which is > useful for debugging. Uncontrolled concurrent dump_stack() causes lockups. > And restricting dump_stack() drops more information. As long as the dump_stack can be a source of the stalls, which I am not so sure about, then we should rate limit it. > What we should do is to yield CPU time to operations which might do useful > things (let threads not doing memory allocation; e.g. let printk kernel > threads to flush pending buffer, let console drivers write the output to > consoles, let watchdog kernel threads report what is happening). yes we call that preemptive kernel... > When memory allocation request is stalling, serialization via waiting > for a lock does help. Which will mean that those unlucky ones which stall will stall even more because they will wait on a lock with potentially many others. While this certainly is a throttling mechanism it is also a big hammer. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 43+ messages in thread
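For reference, the interval/burst policy behind the `__ratelimit()` call being discussed can be approximated in userspace. This is a simplified sketch, not the kernel implementation; `struct ratelimit` and `ratelimit_ok` are invented names, and the kernel defaults (DEFAULT_RATELIMIT_INTERVAL of 5*HZ jiffies, DEFAULT_RATELIMIT_BURST of 10) correspond here to a 5-second window allowing 10 messages:

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Simplified userspace sketch of the kernel's __ratelimit() policy:
 * allow up to `burst` events per `interval` seconds and suppress the
 * rest.  (The kernel version also counts suppressed callbacks and
 * measures time in jiffies; names here are invented for illustration.) */
struct ratelimit {
	time_t interval;	/* window length in seconds */
	int burst;		/* events allowed per window */
	time_t begin;		/* start of the current window */
	int printed;		/* events allowed so far in this window */
};

static bool ratelimit_ok(struct ratelimit *rs, time_t now)
{
	if (now - rs->begin >= rs->interval) {
		/* Window expired: start a new one. */
		rs->begin = now;
		rs->printed = 0;
	}
	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;	/* within the burst: emit the message */
	}
	return false;		/* throttled: caller stays silent */
}
```

With these default-like parameters an unbounded stream of warn_alloc() calls emits at most 10 reports every 5 seconds, which is also the crux of Tetsuo's objection: the suppressed reports may be exactly the interesting ones.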
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable. 2017-06-01 13:28 ` Michal Hocko @ 2017-06-01 22:10 ` Andrew Morton 2017-06-02 7:18 ` Michal Hocko 0 siblings, 1 reply; 43+ messages in thread From: Andrew Morton @ 2017-06-01 22:10 UTC (permalink / raw) To: Michal Hocko Cc: Tetsuo Handa, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka On Thu, 1 Jun 2017 15:28:08 +0200 Michal Hocko <mhocko@suse.com> wrote: > On Thu 01-06-17 22:11:13, Tetsuo Handa wrote: > > Michal Hocko wrote: > > > On Thu 01-06-17 20:43:47, Tetsuo Handa wrote: > > > > Cong Wang has reported a lockup when running LTP memcg_stress test [1]. > > > > > > This seems to be on an old and not pristine kernel. Does it happen also > > > on the vanilla up-to-date kernel? > > > > 4.9 is not an old kernel! It might be close to the kernel version which > > enterprise distributions would choose for their next long term supported > > version. > > > > And please stop saying "can you reproduce your problem with latest > > linux-next (or at least latest linux)?" Not everybody can use the vanilla > > up-to-date kernel! > > The changelog mentioned that the source of stalls is not clear so this > might be out-of-tree patches doing something wrong and dump_stack > showing up just because it is called often. This wouldn't be the first > time I have seen something like that. I am not really keen on adding > heavy lifting for something that is not clearly debugged and based on > hand waving and speculations. I'm thinking we should serialize warn_alloc anyway, to prevent the output from concurrent calls getting all jumbled together? I'm not sure I buy the "this isn't a mainline kernel" thing. warn_alloc() obviously isn't very robust, but we'd prefer that it be robust to peculiar situations, wild-n-wacky kernel patches, etc. It's a low-level thing and it should Just Work. 
I do think ratelimiting will be OK - if the kernel is producing such a vast stream of warn_alloc() output then nobody is going to be reading it all anyway. Probably just the first one is enough for operators to understand what's going wrong. So... I think both. ratelimit *and* serialize. Perhaps a simple but suitable way of doing that is simply to disallow concurrent warn_allocs:

	/* comment goes here */
	if (test_and_set_bit(0, &foo))
		return;
	...
	clear_bit(0, &foo);

or whatever? (And if we do decide to go with "mm,page_alloc: Serialize warn_alloc() if schedulable", please do add code comments explaining what's going on) ^ permalink raw reply [flat|nested] 43+ messages in thread
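Andrew's test_and_set_bit() sketch above can be modeled in portable C11 with an `atomic_flag`. This is a hypothetical userspace rendering (the kernel would use test_and_set_bit()/clear_bit() on an `unsigned long`; `warn_in_progress` and `try_warn` are invented names), showing the non-blocking "skip if a report is already in progress" behaviour:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model of the suggested gate: the first caller wins the flag
 * and emits the report; concurrent callers simply return without
 * blocking, so nobody can get stuck waiting in the warning path. */
static atomic_flag warn_in_progress = ATOMIC_FLAG_INIT;

static bool try_warn(void)
{
	if (atomic_flag_test_and_set(&warn_in_progress))
		return false;	/* someone else is reporting: skip */
	/* ... the allocation stall/failure report would go here ... */
	atomic_flag_clear(&warn_in_progress);
	return true;
}
```

Unlike the mutex in the posted patch, this drops concurrent reports instead of queueing them, trading completeness of the log for guaranteed progress.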
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable. 2017-06-01 22:10 ` Andrew Morton @ 2017-06-02 7:18 ` Michal Hocko 2017-06-02 11:13 ` Tetsuo Handa ` (3 more replies) 0 siblings, 4 replies; 43+ messages in thread From: Michal Hocko @ 2017-06-02 7:18 UTC (permalink / raw) To: Andrew Morton Cc: Tetsuo Handa, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka On Thu 01-06-17 15:10:22, Andrew Morton wrote: > On Thu, 1 Jun 2017 15:28:08 +0200 Michal Hocko <mhocko@suse.com> wrote: > > > On Thu 01-06-17 22:11:13, Tetsuo Handa wrote: > > > Michal Hocko wrote: > > > > On Thu 01-06-17 20:43:47, Tetsuo Handa wrote: > > > > > Cong Wang has reported a lockup when running LTP memcg_stress test [1]. > > > > > > > > This seems to be on an old and not pristine kernel. Does it happen also > > > > on the vanilla up-to-date kernel? > > > > > > 4.9 is not an old kernel! It might be close to the kernel version which > > > enterprise distributions would choose for their next long term supported > > > version. > > > > > > And please stop saying "can you reproduce your problem with latest > > > linux-next (or at least latest linux)?" Not everybody can use the vanilla > > > up-to-date kernel! > > > > The changelog mentioned that the source of stalls is not clear so this > > might be out-of-tree patches doing something wrong and dump_stack > > showing up just because it is called often. This wouldn't be the first > > time I have seen something like that. I am not really keen on adding > > heavy lifting for something that is not clearly debugged and based on > > hand waving and speculations. > > I'm thinking we should serialize warn_alloc anyway, to prevent the > output from concurrent calls getting all jumbled together? dump_stack already serializes concurrent calls. > I'm not sure I buy the "this isn't a mainline kernel" thing. The changelog doesn't really explain what is going on and only speculates that the excessive warn_alloc is the cause. 
The kernel is 4.9.23.el7.twitter.x86_64 which I suspect contains a lot of stuff on top of 4.9. So I would really _like_ to see whether this is reproducible with the upstream kernel. Especially when this is an LTP test. > warn_alloc() obviously isn't very robust, but we'd prefer that it be > robust to peculiar situations, wild-n-wacky kernel patches, etc. It's > a low-level thing and it should Just Work. Yes, I would agree, and if we have evidence that warn_alloc is really the problem then I am all for fixing it. There is no such evidence yet. Note that dump_stack serialization might be unfair because there is no queuing. Is it possible that this is the problem? If yes, we should rather fix that, because that is arguably an even more low-level routine than warn_alloc. That being said, I strongly believe that this patch is not properly justified, nor the issue fully understood, and as such I disagree with adding a new lock on those grounds. Until the above is resolved Nacked-by: Michal Hocko <mhocko@suse.com> -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable. 2017-06-02 7:18 ` Michal Hocko @ 2017-06-02 11:13 ` Tetsuo Handa 2017-06-02 12:15 ` Michal Hocko 2017-06-02 21:57 ` Cong Wang 2017-06-02 16:59 ` Cong Wang ` (2 subsequent siblings) 3 siblings, 2 replies; 43+ messages in thread From: Tetsuo Handa @ 2017-06-02 11:13 UTC (permalink / raw) To: mhocko, akpm Cc: linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka Michal Hocko wrote: > On Thu 01-06-17 22:11:13, Tetsuo Handa wrote: >> Michal Hocko wrote: >>> On Thu 01-06-17 20:43:47, Tetsuo Handa wrote: >>>> Cong Wang has reported a lockup when running LTP memcg_stress test [1]. >>> >>> This seems to be on an old and not pristine kernel. Does it happen also >>> on the vanilla up-to-date kernel? >> >> 4.9 is not an old kernel! It might be close to the kernel version which >> enterprise distributions would choose for their next long term supported >> version. >> >> And please stop saying "can you reproduce your problem with latest >> linux-next (or at least latest linux)?" Not everybody can use the vanilla >> up-to-date kernel! > > The changelog mentioned that the source of stalls is not clear so this > might be out-of-tree patches doing something wrong and dump_stack > showing up just because it is called often. This wouldn't be the first > time I have seen something like that. I am not really keen on adding > heavy lifting for something that is not clearly debugged and based on > hand waving and speculations. You are asking users to prove that the problem is indeed in the MM subsystem, but you are thinking that kmallocwd which helps users to check whether the problem is indeed in the MM subsystem is not worth merging into mainline. As a result, we have to try things based on what you think handwaving and speculations. This is a catch-22. If you don't want handwaving/speculations, please please do provide a mechanism for checking (a) and (b) shown later. 
> >> What I'm pushing via kmallocwd patch is to prepare for overlooked problems >> so that enterprise distributors can collect information and identify what >> changes are needed to be backported. >> >> As long as you ignore problems not happened with latest linux-next (or >> at least latest linux), enterprise distribution users can do nothing. >> >>> >>> [...] >>>> Therefore, this patch uses a mutex dedicated for warn_alloc() like >>>> suggested in [3]. >>> >>> As I've said previously. We have rate limiting and if that doesn't work >>> out well, let's tune it. The lock should be the last resort to go with. >>> We already throttle show_mem, maybe we can throttle dump_stack as well, >>> although it sounds a bit strange that this adds so much to the picture. >> >> Ratelimiting never works well. It randomly drops information which is >> useful for debugging. Uncontrolled concurrent dump_stack() causes lockups. >> And restricting dump_stack() drops more information. > > As long as the dump_stack can be a source of the stalls, which I am not > so sure about, then we should rate limit it. I'm to some degree suspecting (a) shown later, but I can't prove it due to lack of a mechanism for debugging. Looking at timestamps of output, the delay between lines is varying; sometimes multiple lines are printed within one microsecond, sometimes two lines are printed with ten or twenty milliseconds. This fluctuation state resembles what I can observe when I put the system under overstressed condition (i.e. out of CPU time for making progress). But again, I can't prove it because I'm also suspecting (b) shown later due to use of memcg and multiple nodes. Since the server used in that report is Dell Inc. PowerEdge C6220/03C9JJ, I estimate that the total CPUs installed is 12 cores * 2 slots = 24 CPUs. (I can confirm that at least 21 CPUs are recognized from "CPU: 20" output.) 
Since Cong was trying to run memcg stress test with 150 memcg groups, I estimate that there were 150 threads running. This means that the system might have been put under memory pressure where total number of threads looping inside the page allocator dominates total number of available CPUs. Since Cong assigned 0.5GB memory limit on each memcg group on a server which has 64GB of memory, I estimate that the system might experience non-memcg OOM due to 150 * 0.5G > 64G. Then, this situation resembles my testcase where I put the system under memory pressure where total number of threads looping inside the page allocator (about one hundred or so) dominates total number of available CPUs (i.e. 4) on a system with only 2GB or 4GB RAM (and no swap partition). What I can observe is that all CPU times are spent for pointless busy loop inside __alloc_pages_slowpath() and makes the OOM killer unable to send SIGKILL, and the system stalls to the level where printk() flooding happens due to uncontrolled concurrent warn_alloc(). And your response is always "Your system is already DOSed. I don't make changes for such system at all." and the situation remained unchanged until this moment (i.e. Cong's report). It is possible that (a) Cong is unintentionally making the system under DOS like my non-memcg testcase while Cong just wanted to test memcg but it is also possible that (b) Cong is reporting an unnoticed bug in the MM subsystem as well as (c) Cong is reporting a bug which does not exist in the latest linux-next kernel and you are suspecting only (c) without providing a mechanism for checking (a) and (b). kmallocwd helps users to check (a) and (b) whereas printk() flooding due to uncontrolled concurrent warn_alloc() prevents users from checking (a) and (b). This is really bad. > >> What we should do is to yield CPU time to operations which might do useful >> things (let threads not doing memory allocation; e.g. 
let printk kernel >> threads to flush pending buffer, let console drivers write the output to >> consoles, let watchdog kernel threads report what is happening). > > yes we call that preemptive kernel... > And the page allocator is not preemptive. It does not yield enough CPU time for other threads to do potentially useful things, allowing (a) to happen. >> When memory allocation request is stalling, serialization via waiting >> for a lock does help. > > Which will mean that those unlucky ones which stall will stall even more > because they will wait on a lock with potentially many others. While > this certainly is a throttling mechanism it is also a big hammer. According to my testing, the cause of stalls with flooding of printk() from warn_alloc() is exactly the lack of enough CPU time because the page allocator continues busy looping when memory allocation is stalling. Andrew Morton wrote: > I'm thinking we should serialize warn_alloc anyway, to prevent the > output from concurrent calls getting all jumbled together? Yes. According to my testing, serializing warn_alloc() can not yield enough CPU time because warn_alloc() is called only once per 10 seconds. Serializing - if (!mutex_trylock(&oom_lock)) { + if (mutex_lock_killable(&oom_lock)) { in __alloc_pages_may_oom() can yield enough CPU time to solve the stalls. > > I'm not sure I buy the "this isn't a mainline kernel" thing. > warn_alloc() obviously isn't very robust, but we'd prefer that it be > robust to peculiar situations, wild-n-wacky kernel patches, etc. It's > a low-level thing and it should Just Work. Michal is clear that he won't make warn_alloc() robust, please see http://lkml.kernel.org/r/20161019115525.GH7517@dhcp22.suse.cz . And since kmallocwd is stalling, I'm trying to make warn_alloc() robust. And I'm trying to yield CPU time as well as avoid getting all messages jumbled by serialization using a lock, and Michal does not like locking. We are deadlocked. 
:-(

>
> I do think ratelimiting will be OK - if the kernel is producing such a
> vast stream of warn_alloc() output then nobody is going to be reading
> it all anyway. Probably just the first one is enough for operators to
> understand what's going wrong.

I don't think ratelimiting is OK. Printing only the current thread's traces helps little, because the memory allocation stall problem interacts with other threads. Without reporting all potentially relevant threads, we won't know more than "something went wrong". That's where kmallocwd becomes useful.

>
> So... I think both. ratelimit *and* serialize. Perhaps a simple but
> suitable way of doing that is simply to disallow concurrent warn_allocs:
>
> 	/* comment goes here */
> 	if (test_and_set_bit(0, &foo))
> 		return;
> 	...
> 	clear_bit(0, &foo);
>
> or whatever?
>
> (And if we do decide to go with "mm,page_alloc: Serialize warn_alloc()
> if schedulable", please do add code comments explaining what's going on)

I think "no ratelimit" is appropriate for __GFP_DIRECT_RECLAIM allocation requests (unless we do decide to go with state tracking like kmallocwd does).

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3187,11 +3187,13 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 					  DEFAULT_RATELIMIT_BURST);
 	static DEFINE_MUTEX(warn_alloc_lock);
 
-	if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs))
+	if (gfp_mask & __GFP_NOWARN)
 		return;
 
 	if (gfp_mask & __GFP_DIRECT_RECLAIM)
 		mutex_lock(&warn_alloc_lock);
+	else if (!__ratelimit(&nopage_rs))
+		return;
 
 	pr_warn("%s: ", current->comm);
 
 	va_start(args, fmt);

Michal Hocko wrote:
> On Thu 01-06-17 15:10:22, Andrew Morton wrote:
> > I'm thinking we should serialize warn_alloc anyway, to prevent the
> > output from concurrent calls getting all jumbled together?
>
> dump_stack already serializes concurrent calls.
>
> > I'm not sure I buy the "this isn't a mainline kernel" thing.
> The changelog doesn't really explain what is going on and only
> speculates that the excessive warn_alloc is the cause. The kernel is
> 4.9.23.el7.twitter.x86_64 which I suspect contains a lot of stuff on top
> of 4.9. So I would really _like_ to see whether this is reproducible
> with the upstream kernel. Especially when this is a LTP test.

You are misunderstanding. This patch is intended to help users understand what was going on, because flooded/ratelimited/truncated/dropped logs prevent users from knowing what was going on. By not allowing users to know what is going on, you are not allowing them to explain what is going on.

> > warn_alloc() obviously isn't very robust, but we'd prefer that it be
> > robust to peculiar situations, wild-n-wacky kernel patches, etc. It's
> > a low-level thing and it should Just Work.
>
> Yes I would agree and if we have an evidence that warn_alloc is really
> the problem then I am all for fixing it. There is no such evidence yet.

I don't have access to a system with 24 CPUs. Thus, I used a system with 4 CPUs / 2GB RAM running a linux-next-20170602 kernel.

Testcase:
----------------------------------------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <sys/prctl.h>

int main(int argc, char *argv[])
{
	struct sched_param sp = { 0 };
	cpu_set_t cpu = { { 1 } };
	static int pipe_fd[2] = { EOF, EOF };
	char *buf = NULL;
	unsigned long size = 0;
	unsigned int i;
	int fd;

	pipe(pipe_fd);
	signal(SIGCLD, SIG_IGN);
	if (fork() == 0) {
		prctl(PR_SET_NAME, (unsigned long) "first-victim", 0, 0, 0);
		while (1)
			pause();
	}
	close(pipe_fd[1]);
	sched_setaffinity(0, sizeof(cpu), &cpu);
	prctl(PR_SET_NAME, (unsigned long) "normal-priority", 0, 0, 0);
	for (i = 0; i < 1024; i++)
		if (fork() == 0) {
			char c;

			/* Wait until the first-victim is OOM-killed. */
			read(pipe_fd[0], &c, 1);
			/* Try to consume as much CPU time as possible. */
			while(1);
			_exit(0);
		}
	close(pipe_fd[0]);
	fd = open("/dev/zero", O_RDONLY);
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);

		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	sched_setscheduler(0, SCHED_IDLE, &sp);
	prctl(PR_SET_NAME, (unsigned long) "idle-priority", 0, 0, 0);
	read(fd, buf, size); /* Will cause OOM due to overcommit */
	kill(-1, SIGKILL);
	return 0; /* Not reached. */
}
----------------------------------------

Results from http://I-love.SAKURA.ne.jp/tmp/serial-20170602-1.txt.xz :
----------------------------------------
[   46.896272] idle-priority invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null), order=0, oom_score_adj=0
(...snipped...)
[   47.102375] Out of memory: Kill process 2210 (idle-priority) score 698 or sacrifice child
[   47.109067] Killed process 2211 (first-victim) total-vm:4168kB, anon-rss:80kB, file-rss:0kB, shmem-rss:0kB
[   56.906104] postgres: page allocation stalls for 10003ms, order:0, mode:0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null)
[   56.911263] systemd-journal: page allocation stalls for 10005ms, order:0, mode:0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null)
(...snipped...)
[  282.940538] pickup: page allocation stalls for 150003ms, order:0, mode:0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null)
(...snipped...)
[  283.203997] sysrq: SysRq : Kill All Tasks
----------------------------------------

This result shows that the OOM killer could not be invoked again because userspace processes continued wasting CPU time after an OOM victim was selected. Yes, I know you will shout that "wasting CPU time in userspace in order to disturb the OOM killer is unfair". OK, here is another result.
Testcase:
----------------------------------------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/mman.h>
#include <sys/prctl.h>

static void setproctitle(const char *name)
{
	prctl(PR_SET_NAME, (unsigned long) name, 0, 0, 0);
}

static cpu_set_t set = { { 1 } };
struct sched_param sp = { };
//static char filename[64] = { };
static char buffer[4096] = { };

static int file_io(void *unused)
{
	const int fd = open("/tmp/file", O_WRONLY | O_CREAT | O_APPEND, 0600);

	setproctitle("mmap-file");
	sched_setscheduler(0, SCHED_IDLE, &sp);
	while (write(fd, buffer, 1) > 0)
		sched_yield();
	close(fd);
	return 0;
}

/* down_read(&mm->mmap_sem) requester which is chosen as an OOM victim. */
static int reader(void *pid)
{
	int fd;
	char filename[32] = { };

	snprintf(filename, sizeof(filename), "/proc/%u/stat", *(pid_t *)pid);
	fd = open(filename, O_RDONLY);
	setproctitle("mmap-read");
	sched_setaffinity(0, sizeof(set), &set);
	//sched_setscheduler(0, SCHED_IDLE, &sp);
	while (pread(fd, buffer, sizeof(buffer), 0) > 0);
	close(fd);
	return 0;
}

/* down_write(&mm->mmap_sem) requester which is chosen as an OOM victim. */
static int writer(void *unused)
{
	const int fd = open("/proc/self/exe", O_RDONLY);

	setproctitle("mmap-write");
	sched_setaffinity(0, sizeof(set), &set);
	//sched_setscheduler(0, SCHED_IDLE, &sp);
	while (1) {
		void *ptr = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
		//if (ptr && ptr != (void *) -1)
		munmap(ptr, 4096);
	}
	return 0;
}

int main(int argc, char *argv[])
{
#define MAX_PARALLEL 10
#define MAX_CONTENTION 10
	int i;
	int j;
	pid_t pids[MAX_PARALLEL];

	for (i = 0; i < MAX_PARALLEL; i++) {
		pids[i] = fork();
		if (pids[i] == EOF)
			break;
		if (pids[i] == 0) {
			sleep(1);
			//for (i = 0; i < 2; i++)
			//	clone(file_io, malloc(1024) + 1024, CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL);
			for (j = 0; j < MAX_CONTENTION; j++) {
				char *cp = malloc(4096);

				if (!cp)
					break;
				clone(writer, cp + 4096, CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL);
			}
			pause();
			_exit(0);
		}
	}
	for (i = 0; i < MAX_PARALLEL; i++) {
		if (pids[i] == EOF)
			break;
		for (j = 0; j < MAX_CONTENTION; j++) {
			char *cp = malloc(4096);

			if (!cp)
				break;
			clone(reader, cp + 4096, CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, &pids[i]);
		}
	}
	{ /* A dummy process for invoking the OOM killer. */
		char *buf = NULL;
		unsigned long i;
		unsigned long size = 0;

		setproctitle("mmap-mem");
		for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
			char *cp = realloc(buf, size);

			if (!cp) {
				size >>= 1;
				break;
			}
			buf = cp;
		}
		sleep(2);
		//sched_setscheduler(0, SCHED_IDLE, &sp);
		fprintf(stderr, "Eating memory\n");
		for (i = 0; i < size; i += 4096)
			buf[i] = '\0'; /* Will cause OOM due to overcommit */
		fprintf(stderr, "Exiting\n");
		return 0;
	}
}
----------------------------------------

Results from http://I-love.SAKURA.ne.jp/tmp/serial-20170602-2.txt.xz :
----------------------------------------
[ 123.771523] mmap-write invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null), order=0, oom_score_adj=0 [ 124.808940] mmap-write cpuset=/ mems_allowed=0 [ 124.811595] CPU: 0 PID: 2852 Comm: mmap-write Not tainted 4.12.0-rc3-next-20170602 #99 [ 124.815842] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013 [ 124.821171] Call Trace: [ 124.823106] dump_stack+0x86/0xcf [ 124.825336] dump_header+0x97/0x26d [ 124.827668] ? trace_hardirqs_on+0xd/0x10 [ 124.830222] oom_kill_process+0x203/0x470 [ 124.832778] out_of_memory+0x138/0x580 [ 124.835223] __alloc_pages_slowpath+0x1100/0x11f0 [ 124.838085] __alloc_pages_nodemask+0x308/0x3c0 [ 124.840850] alloc_pages_current+0x6a/0xe0 [ 124.843332] __page_cache_alloc+0x119/0x150 [ 124.845723] filemap_fault+0x3dc/0x950 [ 124.847932] ? debug_lockdep_rcu_enabled+0x1d/0x20 [ 124.850683] ? xfs_filemap_fault+0x5b/0x180 [xfs] [ 124.853427] ? 
down_read_nested+0x73/0xb0 [ 124.855792] xfs_filemap_fault+0x63/0x180 [xfs] [ 124.858327] __do_fault+0x1e/0x140 [ 124.860383] __handle_mm_fault+0xb2c/0x1090 [ 124.862760] handle_mm_fault+0x190/0x350 [ 124.865161] __do_page_fault+0x266/0x520 [ 124.867409] do_page_fault+0x30/0x80 [ 124.869846] page_fault+0x28/0x30 [ 124.871803] RIP: 0033:0x7fb997682dca [ 124.873875] RSP: 002b:0000000000777fe8 EFLAGS: 00010246 [ 124.876601] RAX: 00007fb997b6e000 RBX: 0000000000000000 RCX: 00007fb997682dca [ 124.880077] RDX: 0000000000000001 RSI: 0000000000001000 RDI: 0000000000000000 [ 124.883551] RBP: 0000000000001000 R08: 0000000000000003 R09: 0000000000000000 [ 124.886933] R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000002 [ 124.890336] R13: 0000000000000000 R14: 0000000000000003 R15: 0000000000000001 [ 124.893853] Mem-Info: [ 126.408131] mmap-read: page allocation stalls for 10005ms, order:0, mode:0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null) [ 126.408137] mmap-read cpuset=/ mems_allowed=0 (...snipped...) [ 350.182442] mmap-read: page allocation stalls for 230016ms, order:0, mode:0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null) [ 350.182446] mmap-read cpuset=/ mems_allowed=0 (...snipped...) [ 350.836483] sysrq: SysRq : Kill All Tasks (...snipped...) [ 389.308085] active_anon:1146 inactive_anon:2777 isolated_anon:0 [ 389.308085] active_file:479 inactive_file:508 isolated_file:0 [ 389.308085] unevictable:0 dirty:0 writeback:0 unstable:0 [ 389.308085] slab_reclaimable:9536 slab_unreclaimable:15265 [ 389.308085] mapped:629 shmem:3535 pagetables:34 bounce:0 [ 389.308085] free:356689 free_pcp:596 free_cma:0 [ 389.308089] Node 0 active_anon:4584kB inactive_anon:11108kB active_file:1916kB inactive_file:2032kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:2516kB dirty:0kB writeback:0kB shmem:14140kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? 
no [ 389.308090] Node 0 DMA free:15872kB min:440kB low:548kB high:656kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 389.308093] lowmem_reserve[]: 0 1561 1561 1561 [ 389.308097] Node 0 DMA32 free:1410884kB min:44612kB low:55764kB high:66916kB active_anon:4584kB inactive_anon:11108kB active_file:1916kB inactive_file:2032kB unevictable:0kB writepending:0kB present:2080640kB managed:1599404kB mlocked:0kB slab_reclaimable:38144kB slab_unreclaimable:61028kB kernel_stack:3808kB pagetables:136kB bounce:0kB free_pcp:2384kB local_pcp:696kB free_cma:0kB [ 389.308099] lowmem_reserve[]: 0 0 0 0 [ 389.308103] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15872kB [ 389.308121] Node 0 DMA32: 975*4kB (UME) 1199*8kB (UME) 1031*16kB (UME) 973*32kB (UME) 564*64kB (UME) 303*128kB (UME) 160*256kB (UME) 70*512kB (UME) 40*1024kB (UME) 21*2048kB (M) 272*4096kB (M) = 1410884kB [ 389.308142] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB [ 389.308143] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 389.308143] 4522 total pagecache pages [ 389.308147] 0 pages in swap cache [ 389.308148] Swap cache stats: add 0, delete 0, find 0/0 [ 389.308149] Free swap = 0kB [ 389.308149] Total swap = 0kB [ 389.308150] 524157 pages RAM [ 389.308151] 0 pages HighMem/MovableOnly [ 389.308152] 120330 pages reserved [ 389.308153] 0 pages cma reserved [ 389.308153] 0 pages hwpoisoned [ 389.308155] Out of memory: Kill process 2649 (mmap-mem) score 783 or sacrifice child ---------------------------------------- This result shows that the OOM killer was not able to send SIGKILL until I gave up waiting and pressed SysRq-i because 
__alloc_pages_slowpath() continued wasting CPU time after the OOM killer tried to start printing memory information. We can avoid this case if we wait for oom_lock at __alloc_pages_may_oom().

> Note that dump_stack serialization might be unfair because there is no
> queuing. Is it possible that this is the problem? If yes we should
> rather fix that because that is arguably even more low-level routine than
> warn_alloc.

Wasting all CPU time in __alloc_pages_slowpath(), to the point of preventing the OOM killer from sending SIGKILL, is also unfair. And you refuse to queue allocating threads as oom_lock waiters at __alloc_pages_may_oom().

> That being said. I strongly believe that this patch is not properly
> justified, issue fully understood and as such a disagree with adding a
> new lock on those grounds.
>
> Until the above is resolved
> Nacked-by: Michal Hocko <mhocko@suse.com>

So, you cannot agree with inspecting synchronously so that users can know what is happening. Then we have no choice other than to use state tracking and inspect asynchronously like kmallocwd does. Andrew, I would be happy if we can at least inspect asynchronously.

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable. 2017-06-02 11:13 ` Tetsuo Handa @ 2017-06-02 12:15 ` Michal Hocko 2017-06-02 17:13 ` Tetsuo Handa 2017-06-02 21:57 ` Cong Wang 1 sibling, 1 reply; 43+ messages in thread From: Michal Hocko @ 2017-06-02 12:15 UTC (permalink / raw) To: Tetsuo Handa Cc: akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka On Fri 02-06-17 20:13:32, Tetsuo Handa wrote: > Michal Hocko wrote: > > On Thu 01-06-17 22:11:13, Tetsuo Handa wrote: > >> Michal Hocko wrote: > >>> On Thu 01-06-17 20:43:47, Tetsuo Handa wrote: > >>>> Cong Wang has reported a lockup when running LTP memcg_stress test [1]. > >>> > >>> This seems to be on an old and not pristine kernel. Does it happen also > >>> on the vanilla up-to-date kernel? > >> > >> 4.9 is not an old kernel! It might be close to the kernel version which > >> enterprise distributions would choose for their next long term supported > >> version. > >> > >> And please stop saying "can you reproduce your problem with latest > >> linux-next (or at least latest linux)?" Not everybody can use the vanilla > >> up-to-date kernel! > > > > The changelog mentioned that the source of stalls is not clear so this > > might be out-of-tree patches doing something wrong and dump_stack > > showing up just because it is called often. This wouldn't be the first > > time I have seen something like that. I am not really keen on adding > > heavy lifting for something that is not clearly debugged and based on > > hand waving and speculations. > > You are asking users to prove that the problem is indeed in the MM subsystem, > but you are thinking that kmallocwd which helps users to check whether the > problem is indeed in the MM subsystem is not worth merging into mainline. > As a result, we have to try things based on what you think handwaving and > speculations. This is a catch-22. 
If you don't want handwaving/speculations,
> please please do provide a mechanism for checking (a) and (b) shown later.

configure watchdog to bug on soft lockup, take a crash dump, see what is going on there and you can draw a better picture of what is going on here. Seriously I am fed up with all the "let's do the async thing because it would tell much more" side discussions. You are trying to fix a soft lockup which alone is not a deadly condition. If the system is overwhelmed it can happen and if that is the case then we should care whether it gets resolved or it is a permanent livelock situation. If yes then we need to isolate which path is not preempting and why and place the cond_resched there. The page allocator contains preemption points, if we are lacking some for some pathological paths let's add them. For some reason you seem to be focused only on the warn_alloc path, though, while the real issue might be somewhere completely else.

--
Michal Hocko
SUSE Labs
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable. 2017-06-02 12:15 ` Michal Hocko @ 2017-06-02 17:13 ` Tetsuo Handa 0 siblings, 0 replies; 43+ messages in thread From: Tetsuo Handa @ 2017-06-02 17:13 UTC (permalink / raw) To: mhocko Cc: akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka Michal Hocko wrote: > On Fri 02-06-17 20:13:32, Tetsuo Handa wrote: > > Michal Hocko wrote: > > > On Thu 01-06-17 22:11:13, Tetsuo Handa wrote: > > >> Michal Hocko wrote: > > >>> On Thu 01-06-17 20:43:47, Tetsuo Handa wrote: > > >>>> Cong Wang has reported a lockup when running LTP memcg_stress test [1]. > > >>> > > >>> This seems to be on an old and not pristine kernel. Does it happen also > > >>> on the vanilla up-to-date kernel? > > >> > > >> 4.9 is not an old kernel! It might be close to the kernel version which > > >> enterprise distributions would choose for their next long term supported > > >> version. > > >> > > >> And please stop saying "can you reproduce your problem with latest > > >> linux-next (or at least latest linux)?" Not everybody can use the vanilla > > >> up-to-date kernel! > > > > > > The changelog mentioned that the source of stalls is not clear so this > > > might be out-of-tree patches doing something wrong and dump_stack > > > showing up just because it is called often. This wouldn't be the first > > > time I have seen something like that. I am not really keen on adding > > > heavy lifting for something that is not clearly debugged and based on > > > hand waving and speculations. > > > > You are asking users to prove that the problem is indeed in the MM subsystem, > > but you are thinking that kmallocwd which helps users to check whether the > > problem is indeed in the MM subsystem is not worth merging into mainline. > > As a result, we have to try things based on what you think handwaving and > > speculations. This is a catch-22. 
If you don't want handwaving/speculations,
> > please please do provide a mechanism for checking (a) and (b) shown later.
>
> configure watchdog to bug on soft lockup, take a crash dump, see what
> is going on there and you can draw a better picture of what is going on
> here. Seriously I am fed up with all the "let's do the async thing
> because it would tell much more" side discussions. You are trying to fix
> a soft lockup which alone is not a deadly condition.

Wrong and unacceptable. I'm trying to collect information on not only the cause of the allocation stall reported by Cong Wang but also of any allocation stalls (e.g. the infinite too_many_isolated() trap from shrink_inactive_list(), and the infinite mempool_alloc() loop case). I was indeed surprised to learn how severe race conditions on a system with many CPUs can be, at http://lkml.kernel.org/r/201409192053.IHJ35462.JLOMOSOFFVtQFH@I-love.SAKURA.ne.jp . Therefore, as you suspect, it is possible that dump_stack() serialization is unfair on some hardware, especially when a lot of allocating threads are calling warn_alloc(). But if you read Cong's report, you will find that allocation stalls began before the first detection of soft lockup due to dump_stack() serialization. This suggests that allocation stalls had already begun before the system entered a state where a lot of allocating threads started calling warn_alloc(). Then, I think it is natural to suspect that the first culprit is the MM subsystem. If you believe that you can debug and explain why the OOM killer was not invoked in Cong's case without comparing how the situation changes over time (i.e. with a crash dump only), please explain it in that thread.

> If the system is
> overwhelmed it can happen and if that is the case then we should care
> whether it gets resolved or it is a permanent livelock situation.

What I'm trying to check is exactly whether the stall gets resolved or the stall is a permanent livelock situation, by allowing users to take multiple snapshots of the memory image and compare how the situation changes over time. Taking a crash dump via kernel panic cannot allow users to take multiple snapshots of the memory image.

> If yes
> then we need to isolate which path is not preempting and why and place
> the cond_resched there. The page allocator contains preemption points,
> if we are lacking some for some pathological paths let's add them.

What I'm talking about is not the lack of cond_resched(). As far as we know, cond_resched() is inserted wherever necessary. What I'm talking about is that cond_resched() cannot yield enough CPU time. Looping with cond_resched() calls might yield only e.g. 1% of CPU time compared to looping without them. But if e.g. 100 threads are looping with only cond_resched(), a thread waiting for wakeup from schedule_timeout_killable(1) can be delayed a lot. If that thread has idle priority, the delay can become too long to wait for.

> For
> some reason you seem to be focused only on the warn_alloc path, though,
> while the real issue might be somewhere completely else.

Not only because uncontrolled concurrent warn_alloc() causes flooding of printk() and prevents users from knowing what is going on, but also because warn_alloc() cannot be called when we hit e.g. the infinite too_many_isolated() loop case or the infinite mempool_alloc() loop case, which after all prevents users from knowing what is going on. You are not providing users with a means to skip warn_alloc() when the flood of printk() output is just junk (i.e. the real issue is somewhere completely else), or to call warn_alloc() when all printk() messages are needed (i.e. the issue is in the MM subsystem). The current implementation of warn_alloc() can act as a launcher of corrupted spam messages. 
Without comparing how the situation changes over time, by reliably saving all printk() messages that might be relevant, we can't judge whether warn_alloc() flooding is just a collateral phenomenon of a real issue somewhere completely else.
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable. 2017-06-02 11:13 ` Tetsuo Handa 2017-06-02 12:15 ` Michal Hocko @ 2017-06-02 21:57 ` Cong Wang 2017-06-04 8:58 ` Tetsuo Handa 1 sibling, 1 reply; 43+ messages in thread From: Cong Wang @ 2017-06-02 21:57 UTC (permalink / raw) To: Tetsuo Handa Cc: Michal Hocko, Andrew Morton, linux-mm, dave.hansen, Johannes Weiner, Mel Gorman, vbabka

On Fri, Jun 2, 2017 at 4:13 AM, Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote:
> Since the server used in that report is Dell Inc. PowerEdge C6220/03C9JJ,
> I estimate that the total CPUs installed is 12 cores * 2 slots = 24 CPUs.
> (I can confirm that at least 21 CPUs are recognized from "CPU: 20" output.)

Here is the lscpu output in case it is useful for you to debug:

$ lscpu -ae
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
0   0    0      0    0:0:0:0       yes
1   0    0      1    1:1:1:0       yes
2   0    0      2    2:2:2:0       yes
3   0    0      3    3:3:3:0       yes
4   0    0      4    4:4:4:0       yes
5   0    0      5    5:5:5:0       yes
6   1    1      6    6:6:6:1       yes
7   1    1      7    7:7:7:1       yes
8   1    1      8    8:8:8:1       yes
9   1    1      9    9:9:9:1       yes
10  1    1      10   10:10:10:1    yes
11  1    1      11   11:11:11:1    yes
12  0    0      0    0:0:0:0       yes
13  0    0      1    1:1:1:0       yes
14  0    0      2    2:2:2:0       yes
15  0    0      3    3:3:3:0       yes
16  0    0      4    4:4:4:0       yes
17  0    0      5    5:5:5:0       yes
18  1    1      6    6:6:6:1       yes
19  1    1      7    7:7:7:1       yes
20  1    1      8    8:8:8:1       yes
21  1    1      9    9:9:9:1       yes
22  1    1      10   10:10:10:1    yes
23  1    1      11   11:11:11:1    yes

> Since Cong was trying to run memcg stress test with 150 memcg groups, I
> estimate that there were 150 threads running. This means that the system
> might have been put under memory pressure where total number of threads
> looping inside the page allocator dominates total number of available CPUs.
> Since Cong assigned 0.5GB memory limit on each memcg group on a server
> which has 64GB of memory, I estimate that the system might experience
> non-memcg OOM due to 150 * 0.5G > 64G.

Just FYI: it is not us who picks those numbers, they are in the LTP test memcg_stress_test.sh. 
> > Then, this situation resembles my testcase where I put the system under > memory pressure where total number of threads looping inside the page > allocator (about one hundred or so) dominates total number of available > CPUs (i.e. 4) on a system with only 2GB or 4GB RAM (and no swap partition). > What I can observe is that all CPU times are spent for pointless busy loop > inside __alloc_pages_slowpath() and makes the OOM killer unable to send > SIGKILL, and the system stalls to the level where printk() flooding happens > due to uncontrolled concurrent warn_alloc(). > > And your response is always "Your system is already DOSed. I don't make > changes for such system at all." and the situation remained unchanged > until this moment (i.e. Cong's report). It is possible that > > (a) Cong is unintentionally making the system under DOS like > my non-memcg testcase while Cong just wanted to test memcg > > but it is also possible that > > (b) Cong is reporting an unnoticed bug in the MM subsystem > I suppose so when I report the warning, unless commit 63f53dea0c9866e93802d50a230c460a024 is a false alarm. ;) If I understand that commit correctly, it warns that we spend too much time on retrying and make no progress on the mm allocator slow path, which clearly indicates some problem. But I thought it is obvious we should OOM instead of hanging somewhere in this situation? (My mm knowledge is very limited.) > as well as > > (c) Cong is reporting a bug which does not exist in the latest > linux-next kernel > As I already mentioned in my original report, I know there are at least two similar warnings reported before: https://lkml.org/lkml/2016/12/13/529 https://bugzilla.kernel.org/show_bug.cgi?id=192981 I don't see any fix, nor I see they are similar to mine. > and you are suspecting only (c) without providing a mechanism for > checking (a) and (b). 
kmallocwd helps users to check (a) and (b) > whereas printk() flooding due to uncontrolled concurrent warn_alloc() > prevents users from checking (a) and (b). This is really bad. > >> >>> What we should do is to yield CPU time to operations which might do useful >>> things (let threads not doing memory allocation; e.g. let printk kernel >>> threads to flush pending buffer, let console drivers write the output to >>> consoles, let watchdog kernel threads report what is happening). >> >> yes we call that preemptive kernel... >> > > And the page allocator is not preemptive. It does not yield enough CPU > time for other threads to do potentially useful things, allowing (a) to > happen. > >>> When memory allocation request is stalling, serialization via waiting >>> for a lock does help. >> >> Which will mean that those unlucky ones which stall will stall even more >> because they will wait on a lock with potentially many others. While >> this certainly is a throttling mechanism it is also a big hammer. > > According to my testing, the cause of stalls with flooding of printk() from > warn_alloc() is exactly the lack of enough CPU time because the page > allocator continues busy looping when memory allocation is stalling. > In the retry loop, warn_alloc() is only called after stall is detected, not before, therefore waiting on the mutex does not contribute to at least the first stall. > > > Andrew Morton wrote: >> I'm thinking we should serialize warn_alloc anyway, to prevent the >> output from concurrent calls getting all jumbled together? > > Yes. According to my testing, serializing warn_alloc() can not yield > enough CPU time because warn_alloc() is called only once per 10 seconds. > Serializing > > - if (!mutex_trylock(&oom_lock)) { > + if (mutex_lock_killable(&oom_lock)) { > > in __alloc_pages_may_oom() can yield enough CPU time to solve the stalls. > For this point, I am with you, it would be helpful to serialize them in case we mix different warnings in dmesg. 
But you probably need to adjust the timestamps in case waiting on the mutex contributes to the stall too? [...] > This result shows that the OOM killer was not able to send SIGKILL until > I gave up waiting and pressed SysRq-i because __alloc_pages_slowpath() continued > wasting CPU time after the OOM killer tried to start printing memory information. > We can avoid this case if we wait for oom_lock at __alloc_pages_may_oom(). > Note, in my case OOM killer was probably not even invoked although the log I captured is a complete one... Thanks for looking into it!

^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable. 2017-06-02 21:57 ` Cong Wang @ 2017-06-04 8:58 ` Tetsuo Handa 2017-06-04 15:05 ` Michal Hocko 2017-06-05 18:25 ` Cong Wang 0 siblings, 2 replies; 43+ messages in thread From: Tetsuo Handa @ 2017-06-04 8:58 UTC (permalink / raw) To: xiyou.wangcong Cc: mhocko, akpm, linux-mm, dave.hansen, hannes, mgorman, vbabka Cong Wang wrote: > Just FYI: it is not us who picks those numbers, they are in the LTP test > memcg_stress_test.sh. I see. No problem as long as 0.5GB per a memcg and 150 memcg groups parts are correct. > > but it is also possible that > > > > (b) Cong is reporting an unnoticed bug in the MM subsystem > > > > I suppose so when I report the warning, unless commit > 63f53dea0c9866e93802d50a230c460a024 is a false alarm. ;) 63f53dea0c9866e9 is really a half-baked troublesome alarm. :-( It has failed to alarm for more than 30+ minutes (won't be fixed), it is failing to alarm forever (not yet fixed), it has failed to alarm __GFP_NOWARN allocations (fixed in 4.12-rc1), it is still failing to print all possibly relevant threads (not yet fixed) etc. I can't expect that users find more than "something went wrong" from this alarm. You might have found a memcg related MM problem, but you are a victim who failed to find more than "the kernel got soft lockup due to parallel dump_stack() calls by this alarm". > > If I understand that commit correctly, it warns that we spend too > much time on retrying and make no progress on the mm allocator > slow path, which clearly indicates some problem. Yes, but it is far from reliable, and it won't become reliable. > > But I thought it is obvious we should OOM instead of hanging > somewhere in this situation? (My mm knowledge is very limited.) Yes, I think that your system should have invoked the OOM killer, though my MM knowledge is very limited as well. 
> > > > as well as > > > > (c) Cong is reporting a bug which does not exist in the latest > > linux-next kernel > > > > As I already mentioned in my original report, I know there are at least > two similar warnings reported before: > > https://lkml.org/lkml/2016/12/13/529 > https://bugzilla.kernel.org/show_bug.cgi?id=192981 > > I don't see any fix, nor I see they are similar to mine. No means for analyzing, no plan for fixing the problems. > >>> When memory allocation request is stalling, serialization via waiting > >>> for a lock does help. > >> > >> Which will mean that those unlucky ones which stall will stall even more > >> because they will wait on a lock with potentially many others. While > >> this certainly is a throttling mechanism it is also a big hammer. > > > > According to my testing, the cause of stalls with flooding of printk() from > > warn_alloc() is exactly the lack of enough CPU time because the page > > allocator continues busy looping when memory allocation is stalling. > > > > In the retry loop, warn_alloc() is only called after stall is detected, not > before, therefore waiting on the mutex does not contribute to at least > the first stall. Right, but waiting on the mutex inside warn_alloc() cannot yield enough CPU time for allowing log_buf readers to write to consoles. > > Andrew Morton wrote: > >> I'm thinking we should serialize warn_alloc anyway, to prevent the > >> output from concurrent calls getting all jumbled together? > > > > Yes. According to my testing, serializing warn_alloc() can not yield > > enough CPU time because warn_alloc() is called only once per 10 seconds. > > Serializing > > > > - if (!mutex_trylock(&oom_lock)) { > > + if (mutex_lock_killable(&oom_lock)) { > > > > in __alloc_pages_may_oom() can yield enough CPU time to solve the stalls. > > > > For this point, I am with you, it would be helpful to serialize them in > case we mix different warnings in dmesg. 
But you probably need to adjust > the timestamps in case waiting on the mutex contributes to the stall too? As long as Michal refuses serialization, we won't get helpful output. You can retry with my kmallocwd patch shown bottom. An example output is at http://I-love.SAKURA.ne.jp/tmp/sample-serial.log . Of course, kmallocwd can gather only basic information. You might need to gather more information by e.g. enabling tracepoints after analyzing basic information. > > [...] > > > This result shows that the OOM killer was not able to send SIGKILL until > > I gave up waiting and pressed SysRq-i because __alloc_pages_slowpath() continued > > wasting CPU time after the OOM killer tried to start printing memory information. > > We can avoid this case if we wait for oom_lock at __alloc_pages_may_oom(). > > > > Note, in my case OOM killer was probably not even invoked although > the log I captured is a complete one... Since you said The log I sent is partial, but that is already all what we captured, I can't find more in kern.log due to log rotation. you meant "the log I captured is an incomplete one", don't you? Under memory pressure, we can't expect printk() output (as well as tracepoint output) to be flushed to log files on a storage because writing to a file involves memory allocation. Therefore, you can try serial console or netconsole for saving printk() output more reliably than log files. Some hardware uses horribly slow / unreliable serial device. In that case netconsole would be better. If you use netconsole, you can use a utility I wrote for receiving netconsole messages available at https://osdn.net/projects/akari/scm/svn/tree/head/branches/udplogger/ . This utility can concatenate pr_cont() output heavily used in MM subsystem with one timestamp per a line. Below is kmallocwd patch backpoated for 4.9.30 kernel from http://lkml.kernel.org/r/1495331504-12480-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp . 
Documentation/malloc-watchdog.txt part is stripped in order to reduce lines. include/linux/gfp.h | 8 + include/linux/oom.h | 4 include/linux/sched.h | 19 ++++ kernel/fork.c | 4 kernel/hung_task.c | 216 +++++++++++++++++++++++++++++++++++++++++++++++++- kernel/sysctl.c | 10 ++ lib/Kconfig.debug | 24 +++++ mm/mempool.c | 9 +- mm/oom_kill.c | 3 mm/page_alloc.c | 82 ++++++++++++++++++ 10 files changed, 377 insertions(+), 2 deletions(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index f8041f9de..cd4253f 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -460,6 +460,14 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask, return __alloc_pages_node(nid, gfp_mask, order); } +#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK +extern void start_memalloc_timer(const gfp_t gfp_mask, const int order); +extern void stop_memalloc_timer(const gfp_t gfp_mask); +#else +#define start_memalloc_timer(gfp_mask, order) do { } while (0) +#define stop_memalloc_timer(gfp_mask) do { } while (0) +#endif + #ifdef CONFIG_NUMA extern struct page *alloc_pages_current(gfp_t gfp_mask, unsigned order); diff --git a/include/linux/oom.h b/include/linux/oom.h index b4e36e9..8487d1b 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -79,8 +79,12 @@ extern unsigned long oom_badness(struct task_struct *p, extern struct task_struct *find_lock_task_mm(struct task_struct *p); +extern unsigned int out_of_memory_count; +extern bool memalloc_maybe_stalling(void); + /* sysctls */ extern int sysctl_oom_dump_tasks; extern int sysctl_oom_kill_allocating_task; extern int sysctl_panic_on_oom; +extern unsigned long sysctl_memalloc_task_warning_secs; #endif /* _INCLUDE_LINUX_OOM_H */ diff --git a/include/linux/sched.h b/include/linux/sched.h index f425eb3..5d48ecb 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1472,6 +1472,22 @@ struct tlbflush_unmap_batch { bool writable; }; +struct memalloc_info { + /* Is current thread doing (nested) memory allocation? 
*/ + u8 in_flight; + /* Watchdog kernel thread is about to report this task? */ + bool report; + /* Index used for memalloc_in_flight[] counter. */ + u8 idx; + /* For progress monitoring. */ + unsigned int sequence; + /* Started time in jiffies as of in_flight == 1. */ + unsigned long start; + /* Requested order and gfp flags as of in_flight == 1. */ + unsigned int order; + gfp_t gfp; +}; + struct task_struct { #ifdef CONFIG_THREAD_INFO_IN_TASK /* @@ -1960,6 +1976,9 @@ struct task_struct { /* A live task holds one reference. */ atomic_t stack_refcount; #endif +#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK + struct memalloc_info memalloc; +#endif /* CPU-specific state of this task */ struct thread_struct thread; /* diff --git a/kernel/fork.c b/kernel/fork.c index 59faac4..8c2aef2 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1624,6 +1624,10 @@ static __latent_entropy struct task_struct *copy_process( p->sequential_io_avg = 0; #endif +#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK + p->memalloc.sequence = 0; +#endif + /* Perform scheduler related setup. Assign this task to a CPU. 
*/ retval = sched_fork(clone_flags, p); if (retval) diff --git a/kernel/hung_task.c b/kernel/hung_task.c index 2b59c82..b6ce9a3 100644 --- a/kernel/hung_task.c +++ b/kernel/hung_task.c @@ -16,6 +16,8 @@ #include <linux/export.h> #include <linux/sysctl.h> #include <linux/utsname.h> +#include <linux/oom.h> +#include <linux/console.h> #include <trace/events/sched.h> /* @@ -141,6 +143,8 @@ static bool rcu_lock_break(struct task_struct *g, struct task_struct *t) get_task_struct(g); get_task_struct(t); rcu_read_unlock(); + if (console_trylock()) + console_unlock(); cond_resched(); rcu_read_lock(); can_cont = pid_alive(g) && pid_alive(t); @@ -193,6 +197,200 @@ static long hung_timeout_jiffies(unsigned long last_checked, MAX_SCHEDULE_TIMEOUT; } +#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK +/* + * Zero means infinite timeout - no checking done: + */ +unsigned long __read_mostly sysctl_memalloc_task_warning_secs = + CONFIG_DEFAULT_MEMALLOC_TASK_TIMEOUT; + +/* Filled by is_stalling_task(), used by only khungtaskd kernel thread. */ +static struct memalloc_info memalloc; + +/** + * is_stalling_task - Check and copy a task's memalloc variable. + * + * @task: A task to check. + * @expire: Timeout in jiffies. + * + * Returns true if a task is stalling, false otherwise. + */ +static bool is_stalling_task(const struct task_struct *task, + const unsigned long expire) +{ + const struct memalloc_info *m = &task->memalloc; + + if (likely(!m->in_flight || !time_after_eq(expire, m->start))) + return false; + /* + * start_memalloc_timer() guarantees that ->in_flight is updated after + * ->start is stored. + */ + smp_rmb(); + memalloc.sequence = m->sequence; + memalloc.start = m->start; + memalloc.order = m->order; + memalloc.gfp = m->gfp; + return time_after_eq(expire, memalloc.start); +} + +/* + * check_memalloc_stalling_tasks - Check for memory allocation stalls. + * + * @timeout: Timeout in jiffies. + * + * Returns number of stalling tasks. 
+ * + * This function is marked as "noinline" in order to allow inserting dynamic + * probes (e.g. printing more information as needed using SystemTap, calling + * panic() if this function returned non 0 value). + */ +static noinline int check_memalloc_stalling_tasks(unsigned long timeout) +{ + enum { + MEMALLOC_TYPE_STALLING, /* Report as stalling task. */ + MEMALLOC_TYPE_DYING, /* Report as dying task. */ + MEMALLOC_TYPE_EXITING, /* Report as exiting task.*/ + MEMALLOC_TYPE_OOM_VICTIM, /* Report as OOM victim. */ + MEMALLOC_TYPE_UNCONDITIONAL, /* Report unconditionally. */ + }; + char buf[256]; + struct task_struct *g, *p; + unsigned long now; + unsigned long expire; + unsigned int sigkill_pending = 0; + unsigned int exiting_tasks = 0; + unsigned int memdie_pending = 0; + unsigned int stalling_tasks = 0; + + cond_resched(); + now = jiffies; + /* + * Report tasks that stalled for more than half of timeout duration + * because such tasks might be correlated with tasks that already + * stalled for full timeout duration. + */ + expire = now - timeout * (HZ / 2); + /* Count stalling tasks, dying and victim tasks. */ + rcu_read_lock(); + for_each_process_thread(g, p) { + bool report = false; + + if (test_tsk_thread_flag(p, TIF_MEMDIE)) { + report = true; + memdie_pending++; + } + if (fatal_signal_pending(p)) { + report = true; + sigkill_pending++; + } + if ((p->flags & PF_EXITING) && p->state != TASK_DEAD) { + report = true; + exiting_tasks++; + } + if (is_stalling_task(p, expire)) { + report = true; + stalling_tasks++; + } + if (p->flags & PF_KSWAPD) + report = true; + p->memalloc.report = report; + } + rcu_read_unlock(); + if (!stalling_tasks) + return 0; + cond_resched(); + /* Report stalling tasks, dying and victim tasks. 
*/ + pr_warn("MemAlloc-Info: stalling=%u dying=%u exiting=%u victim=%u oom_count=%u\n", + stalling_tasks, sigkill_pending, exiting_tasks, memdie_pending, + out_of_memory_count); + cond_resched(); + sigkill_pending = 0; + exiting_tasks = 0; + memdie_pending = 0; + stalling_tasks = 0; + rcu_read_lock(); + restart_report: + for_each_process_thread(g, p) { + u8 type; + + if (likely(!p->memalloc.report)) + continue; + p->memalloc.report = false; + /* Recheck in case state changed meanwhile. */ + type = 0; + if (test_tsk_thread_flag(p, TIF_MEMDIE)) { + type |= (1 << MEMALLOC_TYPE_OOM_VICTIM); + memdie_pending++; + } + if (fatal_signal_pending(p)) { + type |= (1 << MEMALLOC_TYPE_DYING); + sigkill_pending++; + } + if ((p->flags & PF_EXITING) && p->state != TASK_DEAD) { + type |= (1 << MEMALLOC_TYPE_EXITING); + exiting_tasks++; + } + if (is_stalling_task(p, expire)) { + type |= (1 << MEMALLOC_TYPE_STALLING); + stalling_tasks++; + snprintf(buf, sizeof(buf), + " seq=%u gfp=0x%x(%pGg) order=%u delay=%lu", + memalloc.sequence, memalloc.gfp, + &memalloc.gfp, + memalloc.order, now - memalloc.start); + } else { + buf[0] = '\0'; + } + if (p->flags & PF_KSWAPD) + type |= (1 << MEMALLOC_TYPE_UNCONDITIONAL); + if (unlikely(!type)) + continue; + /* + * Victim tasks get pending SIGKILL removed before arriving at + * do_exit(). Therefore, print " exiting" instead for " dying". + */ + pr_warn("MemAlloc: %s(%u) flags=0x%x switches=%lu%s%s%s%s%s\n", + p->comm, p->pid, p->flags, p->nvcsw + p->nivcsw, buf, + (p->state & TASK_UNINTERRUPTIBLE) ? + " uninterruptible" : "", + (type & (1 << MEMALLOC_TYPE_EXITING)) ? + " exiting" : "", + (type & (1 << MEMALLOC_TYPE_DYING)) ? " dying" : "", + (type & (1 << MEMALLOC_TYPE_OOM_VICTIM)) ? + " victim" : ""); + sched_show_task(p); + /* + * Since there could be thousands of tasks to report, we always + * call cond_resched() after each report, in order to avoid RCU + * stalls. 
+ * + * Since not yet reported tasks are marked as + * p->memalloc.report == T, this loop can restart even if + * "g" or "p" went away. + * + * TODO: Try to wait for a while (e.g. sleep until usage of + * printk() buffer becomes less than 75%) in order to avoid + * dropping messages. + */ + if (!rcu_lock_break(g, p)) + goto restart_report; + } + rcu_read_unlock(); + cond_resched(); + /* Show memory information. (SysRq-m) */ + show_mem(0); + /* Show workqueue state. */ + show_workqueue_state(); + /* Show lock information. (SysRq-d) */ + debug_show_all_locks(); + pr_warn("MemAlloc-Info: stalling=%u dying=%u exiting=%u victim=%u oom_count=%u\n", + stalling_tasks, sigkill_pending, exiting_tasks, memdie_pending, + out_of_memory_count); + return stalling_tasks; +} +#endif /* CONFIG_DETECT_MEMALLOC_STALL_TASK */ + /* * Process updating of timeout sysctl */ @@ -227,12 +425,28 @@ void reset_hung_task_detector(void) static int watchdog(void *dummy) { unsigned long hung_last_checked = jiffies; +#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK + unsigned long stall_last_checked = hung_last_checked; +#endif set_user_nice(current, 0); for ( ; ; ) { unsigned long timeout = sysctl_hung_task_timeout_secs; long t = hung_timeout_jiffies(hung_last_checked, timeout); +#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK + unsigned long timeout2 = sysctl_memalloc_task_warning_secs; + long t2 = hung_timeout_jiffies(stall_last_checked, timeout2); + + if (t2 <= 0) { + if (memalloc_maybe_stalling()) + check_memalloc_stalling_tasks(timeout2); + stall_last_checked = jiffies; + continue; + } +#else + long t2 = t; +#endif if (t <= 0) { if (!atomic_xchg(&reset_hung_task, 0)) @@ -240,7 +454,7 @@ static int watchdog(void *dummy) hung_last_checked = jiffies; continue; } - schedule_timeout_interruptible(t); + schedule_timeout_interruptible(min(t, t2)); } return 0; diff --git a/kernel/sysctl.c b/kernel/sysctl.c index c1095cd..d8ee12a 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1083,6 +1083,16 @@ static int 
sysrq_sysctl_handler(struct ctl_table *table, int write, .proc_handler = proc_dointvec_minmax, .extra1 = &neg_one, }, +#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK + { + .procname = "memalloc_task_warning_secs", + .data = &sysctl_memalloc_task_warning_secs, + .maxlen = sizeof(unsigned long), + .mode = 0644, + .proc_handler = proc_dohung_task_timeout_secs, + .extra2 = &hung_task_timeout_max, + }, +#endif #endif #ifdef CONFIG_RT_MUTEXES { diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index a6c8db1..54e8da0 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -880,6 +880,30 @@ config WQ_WATCHDOG state. This can be configured through kernel parameter "workqueue.watchdog_thresh" and its sysfs counterpart. +config DETECT_MEMALLOC_STALL_TASK + bool "Detect tasks stalling inside memory allocator" + default n + depends on DETECT_HUNG_TASK + help + This option emits warning messages and traces when memory + allocation requests are stalling, in order to catch unexplained + hangups/reboots caused by memory allocation stalls. + +config DEFAULT_MEMALLOC_TASK_TIMEOUT + int "Default timeout for stalling task detection (in seconds)" + depends on DETECT_MEMALLOC_STALL_TASK + default 60 + help + This option controls the default timeout (in seconds) used + to determine when a task has become non-responsive and should + be considered stalling inside memory allocator. + + It can be adjusted at runtime via the kernel.memalloc_task_warning_secs + sysctl or by writing a value to + /proc/sys/kernel/memalloc_task_warning_secs. + + A timeout of 0 disables the check. The default is 60 seconds. 
+ endmenu # "Debug lockups and hangs" config PANIC_ON_OOPS diff --git a/mm/mempool.c b/mm/mempool.c index 47a659d..8b449af 100644 --- a/mm/mempool.c +++ b/mm/mempool.c @@ -324,11 +324,14 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask) gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO); + start_memalloc_timer(gfp_temp, -1); repeat_alloc: element = pool->alloc(gfp_temp, pool->pool_data); - if (likely(element != NULL)) + if (likely(element != NULL)) { + stop_memalloc_timer(gfp_temp); return element; + } spin_lock_irqsave(&pool->lock, flags); if (likely(pool->curr_nr)) { @@ -341,6 +344,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask) * for debugging. */ kmemleak_update_trace(element); + stop_memalloc_timer(gfp_temp); return element; } @@ -350,13 +354,16 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask) */ if (gfp_temp != gfp_mask) { spin_unlock_irqrestore(&pool->lock, flags); + stop_memalloc_timer(gfp_temp); gfp_temp = gfp_mask; + start_memalloc_timer(gfp_temp, -1); goto repeat_alloc; } /* We must not sleep if !__GFP_DIRECT_RECLAIM */ if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) { spin_unlock_irqrestore(&pool->lock, flags); + stop_memalloc_timer(gfp_temp); return NULL; } diff --git a/mm/oom_kill.c b/mm/oom_kill.c index ec9f11d..ab46d06 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -44,6 +44,8 @@ #define CREATE_TRACE_POINTS #include <trace/events/oom.h> +unsigned int out_of_memory_count; + int sysctl_panic_on_oom; int sysctl_oom_kill_allocating_task; int sysctl_oom_dump_tasks = 1; @@ -986,6 +988,7 @@ bool out_of_memory(struct oom_control *oc) unsigned long freed = 0; enum oom_constraint constraint = CONSTRAINT_NONE; + out_of_memory_count++; if (oom_killer_disabled) return false; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 5b06fb3..de24961 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3507,8 +3507,10 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask) enum compact_result compact_result; int compaction_retries; int no_progress_loops; 
+#ifndef CONFIG_DETECT_MEMALLOC_STALL_TASK unsigned long alloc_start = jiffies; unsigned int stall_timeout = 10 * HZ; +#endif unsigned int cpuset_mems_cookie; /* @@ -3682,6 +3684,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask) if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT)) goto nopage; +#ifndef CONFIG_DETECT_MEMALLOC_STALL_TASK /* Make sure we know about allocations which stall for too long */ if (time_after(jiffies, alloc_start + stall_timeout)) { warn_alloc(gfp_mask, @@ -3689,6 +3692,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask) jiffies_to_msecs(jiffies-alloc_start), order); stall_timeout += 10 * HZ; } +#endif if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags, did_some_progress > 0, &no_progress_loops)) @@ -3741,6 +3745,76 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask) return page; } +#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK + +static DEFINE_PER_CPU_ALIGNED(int, memalloc_in_flight[2]); +static u8 memalloc_active_index; /* Either 0 or 1. */ + +/* Called periodically with sysctl_memalloc_task_warning_secs interval. */ +bool memalloc_maybe_stalling(void) +{ + int cpu; + int sum = 0; + const u8 idx = memalloc_active_index ^ 1; + + for_each_online_cpu(cpu) + sum += per_cpu(memalloc_in_flight[idx], cpu); + if (sum) + return true; + memalloc_active_index ^= 1; + return false; +} + +void start_memalloc_timer(const gfp_t gfp_mask, const int order) +{ + struct memalloc_info *m = ¤t->memalloc; + + /* We don't check for stalls for !__GFP_DIRECT_RECLAIM allocations. */ + if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) + return; + /* Record the beginning of memory allocation request. */ + if (!m->in_flight) { + m->sequence++; + m->start = jiffies; + m->order = order; + m->gfp = gfp_mask; + m->idx = memalloc_active_index; + /* + * is_stalling_task() depends on ->in_flight being updated + * after ->start is stored. 
+ */ + smp_wmb(); + this_cpu_inc(memalloc_in_flight[m->idx]); + } + m->in_flight++; +} + +void stop_memalloc_timer(const gfp_t gfp_mask) +{ + struct memalloc_info *m = ¤t->memalloc; + + if ((gfp_mask & __GFP_DIRECT_RECLAIM) && !--m->in_flight) + this_cpu_dec(memalloc_in_flight[m->idx]); +} + +static void memalloc_counter_fold(int cpu) +{ + int counter; + u8 idx; + + for (idx = 0; idx < 2; idx++) { + counter = per_cpu(memalloc_in_flight[idx], cpu); + if (!counter) + continue; + this_cpu_add(memalloc_in_flight[idx], counter); + per_cpu(memalloc_in_flight[idx], cpu) = 0; + } +} + +#else +#define memalloc_counter_fold(cpu) do { } while (0) +#endif + /* * This is the 'heart' of the zoned buddy allocator. */ @@ -3825,7 +3899,9 @@ struct page * if (unlikely(ac.nodemask != nodemask)) ac.nodemask = nodemask; + start_memalloc_timer(alloc_mask, order); page = __alloc_pages_slowpath(alloc_mask, order, &ac); + stop_memalloc_timer(alloc_mask); out: if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page && @@ -6551,6 +6627,12 @@ static int page_alloc_cpu_notify(struct notifier_block *self, * race with what we are doing. */ cpu_vm_stats_fold(cpu); + + /* + * Zero the in-flight counters of the dead processor so that + * memalloc_maybe_stalling() needs to check only online processors. + */ + memalloc_counter_fold(cpu); } return NOTIFY_OK; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
2017-06-04 8:58 ` Tetsuo Handa @ 2017-06-04 15:05 ` Michal Hocko 2017-06-04 21:43 ` Tetsuo Handa 2017-06-05 18:25 ` Cong Wang
1 sibling, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2017-06-04 15:05 UTC (permalink / raw)
To: Tetsuo Handa
Cc: xiyou.wangcong, akpm, linux-mm, dave.hansen, hannes, mgorman, vbabka

On Sun 04-06-17 17:58:49, Tetsuo Handa wrote:
[...]
> > As I already mentioned in my original report, I know there are at least
> > two similar warnings reported before:
> >
> > https://lkml.org/lkml/2016/12/13/529
> > https://bugzilla.kernel.org/show_bug.cgi?id=192981
> >
> > I don't see any fix, nor I see they are similar to mine.
>
> No means for analyzing, no plan for fixing the problems.

Stop this bullshit Tetsuo! Seriously, you are getting over the line! Nobody said we do not care. In order to do something about that we need to get further and relevant information. The first and the most important one is whether this is reproducible with the _clean_ vanilla kernel. If yes, then reproduction steps, including the system-dependent ones, would help us as well. If we know that, we can start building a more comprehensive picture of what is going on. Unlike you, I do not want to jump to a "this must be print" conclusion. But stop these unjustified claims.
-- 
Michal Hocko
SUSE Labs
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
2017-06-04 15:05 ` Michal Hocko @ 2017-06-04 21:43 ` Tetsuo Handa 2017-06-05 5:37 ` Michal Hocko
0 siblings, 1 reply; 43+ messages in thread
From: Tetsuo Handa @ 2017-06-04 21:43 UTC (permalink / raw)
To: mhocko
Cc: xiyou.wangcong, akpm, linux-mm, dave.hansen, hannes, mgorman, vbabka

Michal Hocko wrote:
> On Sun 04-06-17 17:58:49, Tetsuo Handa wrote:
> [...]
> > > As I already mentioned in my original report, I know there are at least
> > > two similar warnings reported before:
> > >
> > > https://lkml.org/lkml/2016/12/13/529
> > > https://bugzilla.kernel.org/show_bug.cgi?id=192981
> > >
> > > I don't see any fix, nor I see they are similar to mine.
> >
> > No means for analyzing, no plan for fixing the problems.
>
> Stop this bullshit Tetsuo! Seriously, you are getting over the line!
> Nobody said we do not care. In order to do something about that we need
> to get further and relevant information.

What I'm asking for is the method for getting further and relevant information. And I got no positive feedback, nor any usable alternatives.

> The first and the most
> important one is whether this is reproducible with the _clean_ vanilla
> kernel.

At this point, distribution kernel users won't get any help from the community, nor will distribution kernel users be able to help the community. Even more, you are asking whether this is reproducible with the clean _latest_ (linux-next.git or at least linux.git) vanilla kernel. Therefore, only very few kernel developers can get involved in this problem, for not everybody is good at establishing environments / steps for reproducing this problem. It makes getting feedback even more difficult.

According to your LSFMM session ( https://lwn.net/Articles/718212/ ), you are worried about running out of reviewers. But it seems to me that your orientation keeps the gap between developers and users wide; only experienced developers like you will know almost everything, while all others will know almost nothing.
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
2017-06-04 21:43 ` Tetsuo Handa @ 2017-06-05 5:37 ` Michal Hocko 2017-06-05 18:15 ` Cong Wang
0 siblings, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2017-06-05 5:37 UTC (permalink / raw)
To: Tetsuo Handa
Cc: xiyou.wangcong, akpm, linux-mm, dave.hansen, hannes, mgorman, vbabka

On Mon 05-06-17 06:43:04, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > The first and the most
> > important one is whether this is reproducible with the _clean_ vanilla
> > kernel.
>
> At this point, distribution kernel users won't get any help from the community,
> nor will distribution kernel users be able to help the community.

Running a distribution kernel carries the risk that obscure bugs (like this one) will be asked to be reproduced on the vanilla kernel. I work to support a distribution kernel as well, and I can tell you that I always do my best to reproduce, or at least pinpoint, the issue before reporting it upstream. People working on the upstream kernel are quite busy, and _demanding_ support for something that should come from their vendor is a bit too much.

> Even more, you are asking whether this is reproducible with the clean
> _latest_ (linux-next.git or at least linux.git) vanilla kernel. Therefore,
> only very few kernel developers can get involved in this problem, for not everybody
> is good at establishing environments / steps for reproducing this problem.
> It makes getting feedback even more difficult.

Come on. This is clearly an artificial test executed pretty much intentionally with a known setup.

> According to your LSFMM session ( https://lwn.net/Articles/718212/ ),
> you are worried about running out of reviewers. But it seems to me that your
> orientation keeps the gap between developers and users wide; only
> experienced developers like you will know almost everything, while all others
> will know almost nothing.

I am really tired of your constant accusations. I think I have a pretty good track record in trying to help with as many reported bugs as possible. You keep hammering on one particular part of the kernel and consider it the alpha and omega of everything. I disagree, because I would rather concentrate on something that actually matters in day-to-day user workloads. I have already tried to explain that to you several times, but you just seem to ignore that, so I will not waste any more of my time...
-- 
Michal Hocko
SUSE Labs
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
2017-06-05 5:37 ` Michal Hocko @ 2017-06-05 18:15 ` Cong Wang 2017-06-06 9:17 ` Michal Hocko
0 siblings, 1 reply; 43+ messages in thread
From: Cong Wang @ 2017-06-05 18:15 UTC (permalink / raw)
To: Michal Hocko
Cc: Tetsuo Handa, Andrew Morton, linux-mm, dave.hansen, Johannes Weiner, Mel Gorman, vbabka

On Sun, Jun 4, 2017 at 10:37 PM, Michal Hocko <mhocko@suse.com> wrote:
> Running a distribution kernel carries the risk that obscure bugs (like this
> one) will be asked to be reproduced on the vanilla kernel. I work to
> support a distribution kernel as well, and I can tell you that I always
> do my best to reproduce, or at least pinpoint, the issue before
> reporting it upstream. People working on the upstream kernel are quite
> busy, and _demanding_ support for something that should come from their
> vendor is a bit too much.

I understand that. As I already explained, our kernel has _zero_ code that is not in upstream; it is just 4.9.23 plus some non-mm backports from latest. So my question is: is there any fix you believe is relevant that is in linux-next but not in 4.9.23? We definitely can try to backport it too. I have checked the changelog since 4.9 and don't find anything obviously relevant.

Meanwhile, I will try to run this LTP test repeatedly to see if there is any luck.

Thanks.
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
2017-06-05 18:15 ` Cong Wang @ 2017-06-06 9:17 ` Michal Hocko
0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2017-06-06 9:17 UTC (permalink / raw)
To: Cong Wang
Cc: Tetsuo Handa, Andrew Morton, linux-mm, dave.hansen, Johannes Weiner, Mel Gorman, vbabka

On Mon 05-06-17 11:15:19, Cong Wang wrote:
> On Sun, Jun 4, 2017 at 10:37 PM, Michal Hocko <mhocko@suse.com> wrote:
> > Running a distribution kernel carries the risk that obscure bugs (like this
> > one) will be asked to be reproduced on the vanilla kernel. I work to
> > support a distribution kernel as well, and I can tell you that I always
> > do my best to reproduce, or at least pinpoint, the issue before
> > reporting it upstream. People working on the upstream kernel are quite
> > busy, and _demanding_ support for something that should come from their
> > vendor is a bit too much.
>
> I understand that. As I already explained, our kernel has _zero_ code that
> is not in upstream; it is just 4.9.23 plus some non-mm backports from latest.

Yes, I understand that. And as this is quite an obscure bug, I think it would be safer to either run with a clean 4.9 stable or the current Linus tree. The latter would be preferable because there are some changes in the mm proper which might or might not be related (e.g. aa187507ef8bb, which throttles a rather talkative warn_alloc_show_mem).

> So my question is: is there any fix you believe is relevant that is in linux-next
> but not in 4.9.23? We definitely can try to backport it too. I have checked
> the changelog since 4.9 and don't find anything obviously relevant.
>
> Meanwhile, I will try to run this LTP test repeatedly to see if there is any luck.

Thanks!
-- 
Michal Hocko
SUSE Labs
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
2017-06-04 8:58 ` Tetsuo Handa 2017-06-04 15:05 ` Michal Hocko @ 2017-06-05 18:25 ` Cong Wang 2017-06-22 10:35 ` Tetsuo Handa
1 sibling, 1 reply; 43+ messages in thread
From: Cong Wang @ 2017-06-05 18:25 UTC (permalink / raw)
To: Tetsuo Handa
Cc: Michal Hocko, Andrew Morton, linux-mm, dave.hansen, Johannes Weiner, Mel Gorman, vbabka

On Sun, Jun 4, 2017 at 1:58 AM, Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote:
> You can retry with my kmallocwd patch shown at the bottom. An example output is
> at http://I-love.SAKURA.ne.jp/tmp/sample-serial.log .
>
> Of course, kmallocwd can gather only basic information. You might need to
> gather more information by e.g. enabling tracepoints after analyzing the basic
> information.

Sure, since it is a debugging patch we definitely can try it.

> Since you said
>
>     The log I sent is partial, but that is already all what we captured,
>     I can't find more in kern.log due to log rotation.
>
> you meant "the log I captured is an incomplete one", didn't you?

Right, sorry for my stupid typo.

> Below is the kmallocwd patch backported for the 4.9.30 kernel from
> http://lkml.kernel.org/r/1495331504-12480-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp .
> The Documentation/malloc-watchdog.txt part is stripped in order to reduce lines.

Will do. But can't guarantee that we can reproduce it. ;)

Thanks a lot!
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
2017-06-05 18:25 ` Cong Wang @ 2017-06-22 10:35 ` Tetsuo Handa 2017-06-22 22:53 ` Cong Wang
0 siblings, 1 reply; 43+ messages in thread
From: Tetsuo Handa @ 2017-06-22 10:35 UTC (permalink / raw)
To: xiyou.wangcong
Cc: mhocko, akpm, linux-mm, dave.hansen, hannes, mgorman, vbabka

Cong Wang wrote:
> On Sun, Jun 4, 2017 at 1:58 AM, Tetsuo Handa
> <penguin-kernel@i-love.sakura.ne.jp> wrote:
> > You can retry with my kmallocwd patch shown at the bottom. An example output is
> > at http://I-love.SAKURA.ne.jp/tmp/sample-serial.log .
> >
> > Of course, kmallocwd can gather only basic information. You might need to
> > gather more information by e.g. enabling tracepoints after analyzing the basic
> > information.
>
> Sure, since it is a debugging patch we definitely can try it.
>
> > Below is the kmallocwd patch backported for the 4.9.30 kernel from
> > http://lkml.kernel.org/r/1495331504-12480-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp .
> > The Documentation/malloc-watchdog.txt part is stripped in order to reduce lines.
>
> Will do. But can't guarantee that we can reproduce it. ;)

Did you get a chance to try reproducing it?
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
2017-06-22 10:35 ` Tetsuo Handa @ 2017-06-22 22:53 ` Cong Wang
0 siblings, 0 replies; 43+ messages in thread
From: Cong Wang @ 2017-06-22 22:53 UTC (permalink / raw)
To: Tetsuo Handa
Cc: Michal Hocko, Andrew Morton, linux-mm, dave.hansen, Johannes Weiner, Mel Gorman, vbabka

On Thu, Jun 22, 2017 at 3:35 AM, Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote:
> Cong Wang wrote:
>> On Sun, Jun 4, 2017 at 1:58 AM, Tetsuo Handa
>> <penguin-kernel@i-love.sakura.ne.jp> wrote:
>> > You can retry with my kmallocwd patch shown at the bottom. An example output is
>> > at http://I-love.SAKURA.ne.jp/tmp/sample-serial.log .
>> >
>> > Of course, kmallocwd can gather only basic information. You might need to
>> > gather more information by e.g. enabling tracepoints after analyzing the basic
>> > information.
>>
>> Sure, since it is a debugging patch we definitely can try it.
>>
>> > Below is the kmallocwd patch backported for the 4.9.30 kernel from
>> > http://lkml.kernel.org/r/1495331504-12480-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp .
>> > The Documentation/malloc-watchdog.txt part is stripped in order to reduce lines.
>>
>> Will do. But can't guarantee that we can reproduce it. ;)
>
> Did you get a chance to try reproducing it?

Not yet. I plan to apply your patch to our next kernel release, but that hasn't happened yet. ;) I will let you know if I have any update.

Thanks.
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable. 2017-06-02 7:18 ` Michal Hocko 2017-06-02 11:13 ` Tetsuo Handa @ 2017-06-02 16:59 ` Cong Wang 2017-06-02 19:59 ` Andrew Morton 2017-07-08 4:59 ` Tetsuo Handa 3 siblings, 0 replies; 43+ messages in thread From: Cong Wang @ 2017-06-02 16:59 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, Tetsuo Handa, linux-mm, dave.hansen, Johannes Weiner, Mel Gorman, vbabka

On Fri, Jun 2, 2017 at 12:18 AM, Michal Hocko <mhocko@suse.com> wrote:
> The changelog doesn't really explain what is going on and only
> speculates that the excessive warn_alloc is the cause. The kernel is
> 4.9.23.el7.twitter.x86_64 which I suspect contains a lot of stuff on top
> of 4.9. So I would really _like_ to see whether this is reproducible
> with the upstream kernel. Especially when this is a LTP test.

Just FYI: our kernel 4.9.23.el7.twitter.x86_64 is almost the same as the upstream 4.9.23 release, with just a _few_ non-mm patches backported.

We do have test machines for testing non-stable kernels, but it is slightly harder to build an upstream kernel on them; I mean, not as convenient as a kernel rpm...
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable. 2017-06-02 7:18 ` Michal Hocko 2017-06-02 11:13 ` Tetsuo Handa 2017-06-02 16:59 ` Cong Wang @ 2017-06-02 19:59 ` Andrew Morton 2017-06-03 2:57 ` Tetsuo Handa 2017-06-03 7:32 ` Michal Hocko 2017-07-08 4:59 ` Tetsuo Handa 3 siblings, 2 replies; 43+ messages in thread From: Andrew Morton @ 2017-06-02 19:59 UTC (permalink / raw) To: Michal Hocko Cc: Tetsuo Handa, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka On Fri, 2 Jun 2017 09:18:18 +0200 Michal Hocko <mhocko@suse.com> wrote: > On Thu 01-06-17 15:10:22, Andrew Morton wrote: > > On Thu, 1 Jun 2017 15:28:08 +0200 Michal Hocko <mhocko@suse.com> wrote: > > > > > On Thu 01-06-17 22:11:13, Tetsuo Handa wrote: > > > > Michal Hocko wrote: > > > > > On Thu 01-06-17 20:43:47, Tetsuo Handa wrote: > > > > > > Cong Wang has reported a lockup when running LTP memcg_stress test [1]. > > > > > > > > > > This seems to be on an old and not pristine kernel. Does it happen also > > > > > on the vanilla up-to-date kernel? > > > > > > > > 4.9 is not an old kernel! It might be close to the kernel version which > > > > enterprise distributions would choose for their next long term supported > > > > version. > > > > > > > > And please stop saying "can you reproduce your problem with latest > > > > linux-next (or at least latest linux)?" Not everybody can use the vanilla > > > > up-to-date kernel! > > > > > > The changelog mentioned that the source of stalls is not clear so this > > > might be out-of-tree patches doing something wrong and dump_stack > > > showing up just because it is called often. This wouldn't be the first > > > time I have seen something like that. I am not really keen on adding > > > heavy lifting for something that is not clearly debugged and based on > > > hand waving and speculations. > > > > I'm thinking we should serialize warn_alloc anyway, to prevent the > > output from concurrent calls getting all jumbled together? 
>
> dump_stack already serializes concurrent calls.

Sure. But warn_alloc() doesn't.
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable. 2017-06-02 19:59 ` Andrew Morton @ 2017-06-03 2:57 ` Tetsuo Handa 2017-06-03 7:32 ` Michal Hocko 1 sibling, 0 replies; 43+ messages in thread From: Tetsuo Handa @ 2017-06-03 2:57 UTC (permalink / raw) To: akpm, mhocko Cc: linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka, sergey.senozhatsky, pmladek, akataria, syeh, thellstrom, charmainel, brianp, sbhatewara (Adding printk() and VMware folks.) Andrew Morton wrote: > On Fri, 2 Jun 2017 09:18:18 +0200 Michal Hocko <mhocko@suse.com> wrote: > > > On Thu 01-06-17 15:10:22, Andrew Morton wrote: > > > On Thu, 1 Jun 2017 15:28:08 +0200 Michal Hocko <mhocko@suse.com> wrote: > > > > > > > On Thu 01-06-17 22:11:13, Tetsuo Handa wrote: > > > > > Michal Hocko wrote: > > > > > > On Thu 01-06-17 20:43:47, Tetsuo Handa wrote: > > > > > > > Cong Wang has reported a lockup when running LTP memcg_stress test [1]. > > > > > > > > > > > > This seems to be on an old and not pristine kernel. Does it happen also > > > > > > on the vanilla up-to-date kernel? > > > > > > > > > > 4.9 is not an old kernel! It might be close to the kernel version which > > > > > enterprise distributions would choose for their next long term supported > > > > > version. > > > > > > > > > > And please stop saying "can you reproduce your problem with latest > > > > > linux-next (or at least latest linux)?" Not everybody can use the vanilla > > > > > up-to-date kernel! > > > > > > > > The changelog mentioned that the source of stalls is not clear so this > > > > might be out-of-tree patches doing something wrong and dump_stack > > > > showing up just because it is called often. This wouldn't be the first > > > > time I have seen something like that. I am not really keen on adding > > > > heavy lifting for something that is not clearly debugged and based on > > > > hand waving and speculations. 
> > > > > > I'm thinking we should serialize warn_alloc anyway, to prevent the > > > output from concurrent calls getting all jumbled together? > > > > dump_stack already serializes concurrent calls. I don't think offloading serialization to dump_stack() is a polite behavior when the caller can do serialization. Not only it wastes a lot of CPU time but also just passes the DOS stress through to lower layers. printk() needs CPU time to write to console. If printk() flooding starts, I can observe that (1) timestamp in printk() output does not get incremented timely (2) printk() output is not written to virtual serial console timely which makes me give up observing whether situation is changing over time. Results from http://I-love.SAKURA.ne.jp/tmp/serial-20170602-2.txt.xz : ---------------------------------------- [ 123.771523] mmap-write invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null), order=0, oom_score_adj=0 [ 124.808940] mmap-write cpuset=/ mems_allowed=0 [ 124.811595] CPU: 0 PID: 2852 Comm: mmap-write Not tainted 4.12.0-rc3-next-20170602 #99 [ 124.815842] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013 [ 124.821171] Call Trace: [ 124.823106] dump_stack+0x86/0xcf [ 124.825336] dump_header+0x97/0x26d [ 124.827668] ? trace_hardirqs_on+0xd/0x10 [ 124.830222] oom_kill_process+0x203/0x470 [ 124.832778] out_of_memory+0x138/0x580 [ 124.835223] __alloc_pages_slowpath+0x1100/0x11f0 [ 124.838085] __alloc_pages_nodemask+0x308/0x3c0 [ 124.840850] alloc_pages_current+0x6a/0xe0 [ 124.843332] __page_cache_alloc+0x119/0x150 [ 124.845723] filemap_fault+0x3dc/0x950 [ 124.847932] ? debug_lockdep_rcu_enabled+0x1d/0x20 [ 124.850683] ? xfs_filemap_fault+0x5b/0x180 [xfs] [ 124.853427] ? 
down_read_nested+0x73/0xb0 [ 124.855792] xfs_filemap_fault+0x63/0x180 [xfs] [ 124.858327] __do_fault+0x1e/0x140 [ 124.860383] __handle_mm_fault+0xb2c/0x1090 [ 124.862760] handle_mm_fault+0x190/0x350 [ 124.865161] __do_page_fault+0x266/0x520 [ 124.867409] do_page_fault+0x30/0x80 [ 124.869846] page_fault+0x28/0x30 [ 124.871803] RIP: 0033:0x7fb997682dca [ 124.873875] RSP: 002b:0000000000777fe8 EFLAGS: 00010246 [ 124.876601] RAX: 00007fb997b6e000 RBX: 0000000000000000 RCX: 00007fb997682dca [ 124.880077] RDX: 0000000000000001 RSI: 0000000000001000 RDI: 0000000000000000 [ 124.883551] RBP: 0000000000001000 R08: 0000000000000003 R09: 0000000000000000 [ 124.886933] R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000002 [ 124.890336] R13: 0000000000000000 R14: 0000000000000003 R15: 0000000000000001 [ 124.893853] Mem-Info: [ 126.408131] mmap-read: page allocation stalls for 10005ms, order:0, mode:0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null) [ 126.408137] mmap-read cpuset=/ mems_allowed=0 (...snipped...) [ 350.182442] mmap-read: page allocation stalls for 230016ms, order:0, mode:0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null) [ 350.182446] mmap-read cpuset=/ mems_allowed=0 [ 350.182450] CPU: 0 PID: 2749 Comm: mmap-read Not tainted 4.12.0-rc3-next-20170602 #99 [ 350.182451] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013 [ 350.182451] Call Trace: [ 350.182455] dump_stack+0x86/0xcf [ 350.182457] warn_alloc+0x114/0x1c0 [ 350.182466] __alloc_pages_slowpath+0xbb8/0x11f0 [ 350.182481] __alloc_pages_nodemask+0x308/0x3c0 [ 350.182491] alloc_pages_current+0x6a/0xe0 [ 350.182495] __page_cache_alloc+0x119/0x150 [ 350.182498] filemap_fault+0x3dc/0x950 [ 350.182501] ? debug_lockdep_rcu_enabled+0x1d/0x20 [ 350.182525] ? xfs_filemap_fault+0x5b/0x180 [xfs] [ 350.182528] ? 
down_read_nested+0x73/0xb0 [ 350.182550] xfs_filemap_fault+0x63/0x180 [xfs] [ 350.182554] __do_fault+0x1e/0x140 [ 350.182557] __handle_mm_fault+0xb2c/0x1090 [ 350.182566] handle_mm_fault+0x190/0x350 [ 350.182569] __do_page_fault+0x266/0x520 [ 350.182575] do_page_fault+0x30/0x80 [ 350.182578] page_fault+0x28/0x30 [ 350.182580] RIP: 0033:0x400bfb [ 350.182581] RSP: 002b:00000000007cf570 EFLAGS: 00010207 [ 350.182582] RAX: 000000000000010d RBX: 0000000000000003 RCX: 00007fb997678443 [ 350.182583] RDX: 0000000000001000 RSI: 00000000006021a0 RDI: 0000000000000003 [ 350.182583] RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000000a [ 350.182584] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffd42a62438 [ 350.182585] R13: 00007ffd42a62540 R14: 0000000000000000 R15: 0000000000000000 [ 350.836483] sysrq: SysRq : Kill All Tasks [ 350.876112] cleanupd (2180) used greatest stack depth: 10240 bytes left [ 351.219055] audit: type=1131 audit(1496388867.600:97): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-journald comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed' [ 351.219060] audit: type=1131 audit(1496388867.603:98): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-udevd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed' [ 351.219064] audit: type=1131 audit(1496388867.607:99): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=auditd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed' [ 351.229083] audit: type=1131 audit(1496388867.610:100): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=abrtd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed' [ 351.229092] audit: type=1131 audit(1496388867.613:101): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=dbus comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? 
res=failed' [ 351.229095] audit: type=1131 audit(1496388867.617:102): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=polkit comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed' [ 351.229636] audit: type=1131 audit(1496388867.621:103): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=avahi-daemon comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed' [ 351.233058] audit: type=1131 audit(1496388867.624:104): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=irqbalance comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed' [ 351.236596] audit: type=1131 audit(1496388867.628:105): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=atd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed' [ 351.239144] audit: type=1131 audit(1496388867.630:106): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=crond comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed' [ 389.308085] active_anon:1146 inactive_anon:2777 isolated_anon:0 [ 389.308085] active_file:479 inactive_file:508 isolated_file:0 [ 389.308085] unevictable:0 dirty:0 writeback:0 unstable:0 [ 389.308085] slab_reclaimable:9536 slab_unreclaimable:15265 [ 389.308085] mapped:629 shmem:3535 pagetables:34 bounce:0 [ 389.308085] free:356689 free_pcp:596 free_cma:0 [ 389.308089] Node 0 active_anon:4584kB inactive_anon:11108kB active_file:1916kB inactive_file:2032kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:2516kB dirty:0kB writeback:0kB shmem:14140kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? 
no [ 389.308090] Node 0 DMA free:15872kB min:440kB low:548kB high:656kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 389.308093] lowmem_reserve[]: 0 1561 1561 1561 [ 389.308097] Node 0 DMA32 free:1410884kB min:44612kB low:55764kB high:66916kB active_anon:4584kB inactive_anon:11108kB active_file:1916kB inactive_file:2032kB unevictable:0kB writepending:0kB present:2080640kB managed:1599404kB mlocked:0kB slab_reclaimable:38144kB slab_unreclaimable:61028kB kernel_stack:3808kB pagetables:136kB bounce:0kB free_pcp:2384kB local_pcp:696kB free_cma:0kB [ 389.308099] lowmem_reserve[]: 0 0 0 0 [ 389.308103] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15872kB [ 389.308121] Node 0 DMA32: 975*4kB (UME) 1199*8kB (UME) 1031*16kB (UME) 973*32kB (UME) 564*64kB (UME) 303*128kB (UME) 160*256kB (UME) 70*512kB (UME) 40*1024kB (UME) 21*2048kB (M) 272*4096kB (M) = 1410884kB [ 389.308142] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB [ 389.308143] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 389.308143] 4522 total pagecache pages [ 389.308147] 0 pages in swap cache [ 389.308148] Swap cache stats: add 0, delete 0, find 0/0 [ 389.308149] Free swap = 0kB [ 389.308149] Total swap = 0kB [ 389.308150] 524157 pages RAM [ 389.308151] 0 pages HighMem/MovableOnly [ 389.308152] 120330 pages reserved [ 389.308153] 0 pages cma reserved [ 389.308153] 0 pages hwpoisoned [ 389.308155] Out of memory: Kill process 2649 (mmap-mem) score 783 or sacrifice child ---------------------------------------- Notice the timestamp jump between [ 351.239144] and [ 389.308085]. There was no such silence between the two lines. 
Rather, the timestamp increments clearly started lagging after the printk() flooding began (e.g. for a pair of lines A and B whose timestamp delta is only a few microseconds, line B is written to the console only after many seconds of silence following line A), which makes me give up observing whether the situation is changing over time. And the timestamp delay in the log (about 39 seconds in the example above) recovers (i.e. the timestamp reflects actual uptime) only after I press SysRq-i in order to allow the rest of the system to use CPU time.

I don't know why such delays occur (a problem in printk() or in VMware?), but what I can say is that threads spinning inside __alloc_pages_slowpath() are putting too much stress on printk() and the rest of the system. Now, a similar CPU-time-wasting situation is reported by Cong as a soft lockup on physical hardware. You say "Your system is already DOSed. You should configure your system not to allow giving such stress to the MM subsystem." and I say back "printk() is already DOSed. You should configure the MM subsystem not to allow giving such stress to printk()."

>
> Sure. But warn_alloc() doesn't.
>

I agree with Andrew. Callers of dump_stack() must, if they are in a schedulable context, apply sensible flow control so that lower layers are not overwhelmed.
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable. 2017-06-02 19:59 ` Andrew Morton 2017-06-03 2:57 ` Tetsuo Handa @ 2017-06-03 7:32 ` Michal Hocko 2017-06-03 8:36 ` Tetsuo Handa 2017-06-03 13:21 ` Tetsuo Handa 1 sibling, 2 replies; 43+ messages in thread From: Michal Hocko @ 2017-06-03 7:32 UTC (permalink / raw) To: Andrew Morton Cc: Tetsuo Handa, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka On Fri 02-06-17 12:59:44, Andrew Morton wrote: > On Fri, 2 Jun 2017 09:18:18 +0200 Michal Hocko <mhocko@suse.com> wrote: > > > On Thu 01-06-17 15:10:22, Andrew Morton wrote: > > > On Thu, 1 Jun 2017 15:28:08 +0200 Michal Hocko <mhocko@suse.com> wrote: > > > > > > > On Thu 01-06-17 22:11:13, Tetsuo Handa wrote: > > > > > Michal Hocko wrote: > > > > > > On Thu 01-06-17 20:43:47, Tetsuo Handa wrote: > > > > > > > Cong Wang has reported a lockup when running LTP memcg_stress test [1]. > > > > > > > > > > > > This seems to be on an old and not pristine kernel. Does it happen also > > > > > > on the vanilla up-to-date kernel? > > > > > > > > > > 4.9 is not an old kernel! It might be close to the kernel version which > > > > > enterprise distributions would choose for their next long term supported > > > > > version. > > > > > > > > > > And please stop saying "can you reproduce your problem with latest > > > > > linux-next (or at least latest linux)?" Not everybody can use the vanilla > > > > > up-to-date kernel! > > > > > > > > The changelog mentioned that the source of stalls is not clear so this > > > > might be out-of-tree patches doing something wrong and dump_stack > > > > showing up just because it is called often. This wouldn't be the first > > > > time I have seen something like that. I am not really keen on adding > > > > heavy lifting for something that is not clearly debugged and based on > > > > hand waving and speculations. 
> > >
> > > I'm thinking we should serialize warn_alloc anyway, to prevent the
> > > output from concurrent calls getting all jumbled together?
> >
> > dump_stack already serializes concurrent calls.
>
> Sure. But warn_alloc() doesn't.

I really do not see why that would be much better, really. warn_alloc is more or less one line + dump_stack + warn_alloc_show_mem. A single line shouldn't be a big deal, even though it is actually a continuation line. dump_stack already contains its own synchronization, and the meminfo stuff is ratelimited to one per second. So why exactly do we want to put yet another lock on top? Just to stick them together? Well, is this worth a new lock dependency between memory allocation and the whole printk stack or dump_stack? Maybe yes, but this needs much deeper consideration.

Tetsuo is arguing that the locking will throttle warn_alloc callers and that can help other processes to move on. I would call it papering over a real issue which might be somewhere else, and that is why I push back so hard. The initial report is far from complete, and seeing 30+ second stalls without any indication that this is just a repeating stall after 10s and 20s suggests that we got stuck somewhere in the reclaim path.

Moreover, let's assume that the unfair locking in dump_stack has caused the stall. How would a warn_alloc lock help when there are other sources of dump_stack all over the kernel?

Seriously, this whole discussion is based on hand waving. Like for any other patches, the real issue should be debugged, explained and discussed based on known facts, not speculations. As things stand now, my NACK still holds. I am not going to waste my time repeating the same points all over again.
--
Michal Hocko
SUSE Labs
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable. 2017-06-03 7:32 ` Michal Hocko @ 2017-06-03 8:36 ` Tetsuo Handa 2017-06-05 7:10 ` Sergey Senozhatsky 2017-06-03 13:21 ` Tetsuo Handa 1 sibling, 1 reply; 43+ messages in thread From: Tetsuo Handa @ 2017-06-03 8:36 UTC (permalink / raw) To: mhocko, akpm Cc: linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka, sergey.senozhatsky, pmladek Michal Hocko wrote: > On Fri 02-06-17 12:59:44, Andrew Morton wrote: > > On Fri, 2 Jun 2017 09:18:18 +0200 Michal Hocko <mhocko@suse.com> wrote: > > > > > On Thu 01-06-17 15:10:22, Andrew Morton wrote: > > > > On Thu, 1 Jun 2017 15:28:08 +0200 Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > On Thu 01-06-17 22:11:13, Tetsuo Handa wrote: > > > > > > Michal Hocko wrote: > > > > > > > On Thu 01-06-17 20:43:47, Tetsuo Handa wrote: > > > > > > > > Cong Wang has reported a lockup when running LTP memcg_stress test [1]. > > > > > > > > > > > > > > This seems to be on an old and not pristine kernel. Does it happen also > > > > > > > on the vanilla up-to-date kernel? > > > > > > > > > > > > 4.9 is not an old kernel! It might be close to the kernel version which > > > > > > enterprise distributions would choose for their next long term supported > > > > > > version. > > > > > > > > > > > > And please stop saying "can you reproduce your problem with latest > > > > > > linux-next (or at least latest linux)?" Not everybody can use the vanilla > > > > > > up-to-date kernel! > > > > > > > > > > The changelog mentioned that the source of stalls is not clear so this > > > > > might be out-of-tree patches doing something wrong and dump_stack > > > > > showing up just because it is called often. This wouldn't be the first > > > > > time I have seen something like that. I am not really keen on adding > > > > > heavy lifting for something that is not clearly debugged and based on > > > > > hand waving and speculations. 
> > > > > > > > I'm thinking we should serialize warn_alloc anyway, to prevent the > > > > output from concurrent calls getting all jumbled together? > > > > > > dump_stack already serializes concurrent calls. > > > > Sure. But warn_alloc() doesn't. > > I really do not see why that would be much better, really. warn_alloc is > more or less one line + dump_stack + warn_alloc_show_mem. Single line > shouldn't be a big deal even though this is a continuation line > actually. dump_stack already contains its own synchronization and the > meminfo stuff is ratelimited to one per second. So why do we exactly > wantt to put yet another lock on top? Just to stick them together? Well > is this worth a new lock dependency between memory allocation and the > whole printk stack or dump_stack? Maybe yes but this needs a much deeper > consideration. You are completely ignoring the fact that writing to consoles needs CPU time. My proposal is intended for not only grouping relevant lines together but also giving logbuf readers (currently a thread which is inside console_unlock(), which might be offloaded to a dedicated kernel thread in near future) CPU time for writing to consoles. > > Tetsuo is arguing that the locking will throttle warn_alloc callers and > that can help other processes to move on. I would call it papering over > a real issue which might be somewhere else and that is why I push back so > hard. The initial report is far from complete and seeing 30+ seconds > stalls without any indication that this is just a repeating stall after > 10s and 20s suggests that we got stuck somewhere in the reclaim path. That timestamp jump is caused by the fact that log_buf writers are consuming more CPU times than log_buf readers can consume. If I leave that situation more, printk() just starts printing "** %u printk messages dropped ** " line. 
There is nothing more to reclaim, allocating threads are looping with only cond_resched() and schedule_timeout_uninterruptible(1) (which effectively becomes a no-op when there are many other threads doing the same thing), the logbuf reader cannot get enough CPU time, and the OOM killer keeps oom_lock held (notice that this timestamp jump is between the "invoked oom-killer: " line and the "Out of memory: Kill process " line), which prevents reclaiming memory.

>
> Moreover let's assume that the unfair locking in dump_stack has caused
> the stall. How would a warn_alloc lock help when there are other
> sources of dump_stack all over the kernel?

__alloc_pages_slowpath() is insane as a caller of dump_stack(). Basically, __alloc_pages_slowpath() allows doing

	while (1) {
		cond_resched();
		dump_stack();
	}

because all stalling threads can call warn_alloc(). Even though we ratelimit dump_stack() with both the time_after() test and the __ratelimit() test, like

	while (1) {
		cond_resched();
		if (time_after(jiffies, alloc_start + stall_timeout)) {
			if (!(gfp_mask & __GFP_NOWARN) && __ratelimit(&nopage_rs)) {
				dump_stack();
			}
			stall_timeout += 10 * HZ;
		}
	}

the ratelimited threads are still doing

	while (1) {
		cond_resched();
	}

which still obviously remains the source of starving CPU time for writing to consoles. This problem won't be solved even if the logbuf reader is offloaded to a kernel thread dedicated to printk().

>
> Seriously, this whole discussion is based on hand waving. Like for
> any other patches, the real issue should be debugged, explained and
> discussed based on known facts, not speculations. As things stand now,
> my NACK still holds. I am not going to waste my time repeating same
> points all over again.

It is not hand waving. Doing unconstrained printk() loops (with cond_resched() only) inside the kernel is seriously broken. We have to be careful not to allow CPU time consumption by logbuf writers (e.g. warn_alloc() from __alloc_pages_slowpath()) because the logbuf reader needs CPU time.
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable. 2017-06-03 8:36 ` Tetsuo Handa @ 2017-06-05 7:10 ` Sergey Senozhatsky 2017-06-05 9:36 ` Sergey Senozhatsky 0 siblings, 1 reply; 43+ messages in thread From: Sergey Senozhatsky @ 2017-06-05 7:10 UTC (permalink / raw) To: Tetsuo Handa Cc: mhocko, akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka, sergey.senozhatsky, pmladek

Hello,

On (06/03/17 17:36), Tetsuo Handa wrote:
[..]
> > Tetsuo is arguing that the locking will throttle warn_alloc callers and
> > that can help other processes to move on. I would call it papering over
> > a real issue which might be somewhere else and that is why I push back so
> > hard. The initial report is far from complete and seeing 30+ seconds
> > stalls without any indication that this is just a repeating stall after
> > 10s and 20s suggests that we got stuck somewhere in the reclaim path.
>
> That timestamp jump is caused by the fact that log_buf writers are consuming
> more CPU times than log_buf readers can consume. If I leave that situation
> more, printk() just starts printing "** %u printk messages dropped ** " line.

hhmm... sorry, not sure I see how printk() would affect timer ticks. unless you do printing from timer IRQs, or always in deferred printk() mode, which runs from timer IRQ... timestamps are assigned at the moment we add a new message to the logbuf, not when we print it. so slow serial console really should not affect it. unless I'm missing something.

	vprintk_emit()
	{
		logbuf_lock_irqsave(flags);
		log_output(facility, level, lflags, dict, dictlen, text, text_len);
		{
			log_store()
			{
				msg->ts_nsec = local_clock();
				^^^^^^^^^^^^^
			}
		}
		logbuf_unlock_irqrestore(flags);

		if (console_trylock())
			console_unlock();
	}

I don't think vprintk_emit() was spinning on logbuf_lock_irqsave(), you would have seen spinlock lockup reports otherwise.
in console_unlock() logbuf lock is acquired only to pick the first pending messages and, basically, do memcpy() to a static buffer. we don't call "slow console drivers" with the logbuf lock taken. so other CPUs are free/welcome to append new messages to the logbuf in the meantime (and read accurate local_clock()).

so if you see spikes in messages' timestamps it's most likely because there was something between printk() calls that kept the CPU busy.

/* or you had a ton of printk calls from other CPUs with noisy loglevels
   that were suppressed later in console_unlock(), see later. */

... well, serial consoles can be slow, sure.

> Notice the timestamp jump between [ 351.239144] and [ 389.308085].

do you have a restrictive console loglevel and a ton of messages that were simply filtered out by console loglevel check? we still store those messages to the logbuf (for dmesg, etc.) and process them in console_unlock(), but don't print to the serial console. so, in other words, logbuf:

	timestamp T0   message M0   -- visible loglevel
	timestamp T1   message M1   -- suppressed loglevel
	....
	timestamp T100 message M101 -- suppressed loglevel
	timestamp T101 message M102 -- visible loglevel

on the serial console you'll see

	T0   M0
	T101 M102

which might look like a spike in timestamps (while there weren't any). just a thought.

	-ss
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable. 2017-06-05 7:10 ` Sergey Senozhatsky @ 2017-06-05 9:36 ` Sergey Senozhatsky 2017-06-05 15:02 ` Tetsuo Handa 0 siblings, 1 reply; 43+ messages in thread From: Sergey Senozhatsky @ 2017-06-05 9:36 UTC (permalink / raw) To: Tetsuo Handa Cc: mhocko, akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka, sergey.senozhatsky, pmladek, Sergey Senozhatsky On (06/05/17 16:10), Sergey Senozhatsky wrote: [..] > > Notice the timestamp jump between [ 351.239144] and [ 389.308085]. > > do you have a restrictive console loglevel and a ton of messages that > were simply filtered out by console loglevel check? we still store those > messages to the logbuf (for dmesg, etc.) and process them in console_unlock(), > but don't print to the serial console. so, in other words, > > logbuf: > > timestamp T0 message M0 -- visible loglevel > timestamp T1 message M1 -- suppressed loglevel > .... > timestamp T100 message M101 -- suppressed loglevel > timestamp T101 message M102 -- visible loglevel > > on the serial console you'll see > > T0 M0 > T101 M102 > > which might look like a spike in timestamps (while there weren't any). > just a thought. does it make any difference if you disable preemption in console_unlock()? something like below... just curious... 
---

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index a1aecf44ab07..25fe408cb994 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2204,6 +2204,8 @@ void console_unlock(void)
 		return;
 	}

+	preempt_disable();
+
 	for (;;) {
 		struct printk_log *msg;
 		size_t ext_len = 0;
@@ -2260,9 +2262,6 @@ void console_unlock(void)
 		call_console_drivers(ext_text, ext_len, text, len);
 		start_critical_timings();
 		printk_safe_exit_irqrestore(flags);
-
-		if (do_cond_resched)
-			cond_resched();
 	}
 	console_locked = 0;
@@ -2274,6 +2273,8 @@ void console_unlock(void)

 	up_console_sem();

+	preempt_enable();
+
 	/*
 	 * Someone could have filled up the buffer again, so re-check if there's
 	 * something to flush. In case we cannot trylock the console_sem again,
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable. 2017-06-05 9:36 ` Sergey Senozhatsky @ 2017-06-05 15:02 ` Tetsuo Handa 0 siblings, 0 replies; 43+ messages in thread From: Tetsuo Handa @ 2017-06-05 15:02 UTC (permalink / raw) To: sergey.senozhatsky.work Cc: mhocko, akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka, sergey.senozhatsky, pmladek

Thank you for the explanation, Sergey.

Sergey Senozhatsky wrote:
> Hello,
>
> On (06/03/17 17:36), Tetsuo Handa wrote:
> [..]
> > > Tetsuo is arguing that the locking will throttle warn_alloc callers and
> > > that can help other processes to move on. I would call it papering over
> > > a real issue which might be somewhere else and that is why I push back so
> > > hard. The initial report is far from complete and seeing 30+ seconds
> > > stalls without any indication that this is just a repeating stall after
> > > 10s and 20s suggests that we got stuck somewhere in the reclaim path.
> >
> > That timestamp jump is caused by the fact that log_buf writers are consuming
> > more CPU times than log_buf readers can consume. If I leave that situation
> > more, printk() just starts printing "** %u printk messages dropped ** " line.
>
> hhmm... sorry, not sure I see how printk() would affect timer ticks. unless
> you do printing from timer IRQs, or always in deferred printk() mode, which
> runs from timer IRQ... timestamps are assigned at the moment we add a new
> message to the logbuf, not when we print it. so slow serial console really
> should not affect it. unless I'm missing something.

All printk() calls are from warn_alloc(). I retested using a stopwatch, and confirmed that the console is printing pending output at full speed during the timestamp jump. Thus, it seems that this timestamp jump was caused simply because the log_buf reader had been busy, and the OOM killer processing resumed after all pending output had been consumed by the log_buf reader.
> I don't think vprintk_emit() was spinning on logbuf_lock_irqsave(),
> you would have seen spinlock lockup reports otherwise. in console_unlock()
> logbuf lock is acquired only to pick the first pending messages and,
> basically, do memcpy() to a static buffer. we don't call "slow console
> drivers" with the logbuf lock taken. so other CPUs are free/welcome to
> append new messages to the logbuf in the meantime (and read accurate
> local_clock()).

Yes. The local_clock() value seems to be correct.

> so if you see spikes in messages' timestamps it's most likely because
> there was something between printk() calls that kept the CPU busy.

Flooding of warn_alloc() from __alloc_pages_slowpath() kept the log_buf reader busy enough to block resuming processing of the OOM killer. If warn_alloc() refrained from flooding, the log_buf reader would be able to consume pending output more quickly, and we would observe neither the slowdown nor the timestamp jump.

> does it make any difference if you disable preemption in console_unlock()?
> something like below... just curious...

Yes, this change reduces stalls a lot. But I don't think changing the printk() side is the correct fix for this problem.

> ---
>
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index a1aecf44ab07..25fe408cb994 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -2204,6 +2204,8 @@ void console_unlock(void)
>  		return;
>  	}
>  
> +	preempt_disable();
> +
>  	for (;;) {
>  		struct printk_log *msg;
>  		size_t ext_len = 0;
> @@ -2260,9 +2262,6 @@ void console_unlock(void)
>  		call_console_drivers(ext_text, ext_len, text, len);
>  		start_critical_timings();
>  		printk_safe_exit_irqrestore(flags);
> -
> -		if (do_cond_resched)
> -			cond_resched();
>  	}
>  	console_locked = 0;
>
> @@ -2274,6 +2273,8 @@ void console_unlock(void)
>
>  	up_console_sem();
>
> +	preempt_enable();
> +
>  	/*
>  	 * Someone could have filled up the buffer again, so re-check if there's
>  	 * something to flush.
>  	 * In case we cannot trylock the console_sem again,

This change is a subset of enclosing the whole oom_kill_process() sequence in preempt_disable()/preempt_enable(), which was already rejected.

Regarding the problem of the OOM killer being preempted by console_unlock() from printk(), it will be mitigated by offloading to the printk kernel thread. But offloading solves only the case where the OOM killer is preempted by console_unlock(). The OOM killer can still be preempted by schedule_timeout_killable(1). Also, Cong Wang's case was (presumably) one where the OOM killer could not be invoked.

When the OOM killer can make forward progress, accelerating log_buf readers instead of throttling log_buf writers might make sense. But when the OOM killer cannot make forward progress, we need to make sure that the log_buf writer (i.e. warn_alloc()) is slower than the log_buf reader (i.e. the printk kernel thread).

So, coming back to warn_alloc(). I don't think the current ratelimit choice is enough to give the log_buf reader enough CPU time. Users won't be happy with randomly filtered, otherwise flooded/mixed warn_alloc() output.
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable. 2017-06-03 7:32 ` Michal Hocko 2017-06-03 8:36 ` Tetsuo Handa @ 2017-06-03 13:21 ` Tetsuo Handa 1 sibling, 0 replies; 43+ messages in thread From: Tetsuo Handa @ 2017-06-03 13:21 UTC (permalink / raw) To: mhocko, akpm Cc: linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka, sergey.senozhatsky, pmladek

Michal Hocko wrote:
> I really do not see why that would be much better, really. warn_alloc is
> more or less one line + dump_stack + warn_alloc_show_mem. Single line
> shouldn't be a big deal even though this is a continuation line
> actually. dump_stack already contains its own synchronization and the
> meminfo stuff is ratelimited to one per second. So why do we exactly
> want to put yet another lock on top? Just to stick them together? Well
> is this worth a new lock dependency between memory allocation and the
> whole printk stack or dump_stack? Maybe yes but this needs a much deeper
> consideration.

If you don't want to add a new lock dependency between memory allocation and the whole printk stack or dump_stack, I'm glad to reuse the existing lock dependency between memory allocation and the whole printk stack or dump_stack, namely the oom_lock mutex. Although oom_lock is used for serializing invocation of the OOM killer so that we don't send SIGKILL more than needed, the things done by the OOM killer inside dump_header() with oom_lock held include what warn_alloc() does (i.e. call printk(), dump_stack() and show_mem()). If we reuse oom_lock for serializing warn_alloc() calls, we can also solve the problem that the OOM killer cannot send SIGKILL because it is preempted by other threads spinning inside __alloc_pages_slowpath() waiting for the OOM killer to make progress. Although you said we should not abuse an unrelated lock at http://lkml.kernel.org/r/20161212125535.GA3185@dhcp22.suse.cz , out_of_memory() and warn_alloc() are closely related because they are called when the system cannot allocate memory.
We need to allow users to reliably save printk() output, and serialization for yielding enough CPU time (by reusing the oom_lock mutex or adding a warn_alloc_lock mutex) is one of the prerequisites for saving printk() output. "The fact that it doesn't throttle properly means that we should tune its parameters." does not work here. We need to avoid wasting CPU time on the effectively "while (1) cond_resched();" situation in __alloc_pages_slowpath().
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable. 2017-06-02 7:18 ` Michal Hocko ` (2 preceding siblings ...) 2017-06-02 19:59 ` Andrew Morton @ 2017-07-08 4:59 ` Tetsuo Handa 2017-07-10 13:21 ` Michal Hocko 3 siblings, 1 reply; 43+ messages in thread From: Tetsuo Handa @ 2017-07-08 4:59 UTC (permalink / raw) To: mhocko, akpm Cc: linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka Michal Hocko wrote: > On Thu 01-06-17 15:10:22, Andrew Morton wrote: > > On Thu, 1 Jun 2017 15:28:08 +0200 Michal Hocko <mhocko@suse.com> wrote: > > > > > On Thu 01-06-17 22:11:13, Tetsuo Handa wrote: > > > > Michal Hocko wrote: > > > > > On Thu 01-06-17 20:43:47, Tetsuo Handa wrote: > > > > > > Cong Wang has reported a lockup when running LTP memcg_stress test [1]. > > > > > > > > > > This seems to be on an old and not pristine kernel. Does it happen also > > > > > on the vanilla up-to-date kernel? > > > > > > > > 4.9 is not an old kernel! It might be close to the kernel version which > > > > enterprise distributions would choose for their next long term supported > > > > version. > > > > > > > > And please stop saying "can you reproduce your problem with latest > > > > linux-next (or at least latest linux)?" Not everybody can use the vanilla > > > > up-to-date kernel! > > > > > > The changelog mentioned that the source of stalls is not clear so this > > > might be out-of-tree patches doing something wrong and dump_stack > > > showing up just because it is called often. This wouldn't be the first > > > time I have seen something like that. I am not really keen on adding > > > heavy lifting for something that is not clearly debugged and based on > > > hand waving and speculations. > > > > I'm thinking we should serialize warn_alloc anyway, to prevent the > > output from concurrent calls getting all jumbled together? > > dump_stack already serializes concurrent calls. > > > I'm not sure I buy the "this isn't a mainline kernel" thing. 
> The changelog doesn't really explain what is going on and only
> speculates that the excessive warn_alloc is the cause. The kernel is
> 4.9.23.el7.twitter.x86_64 which I suspect contains a lot of stuff on top
> of 4.9. So I would really _like_ to see whether this is reproducible
> with the upstream kernel. Especially when this is an LTP test.
>
> > warn_alloc() obviously isn't very robust, but we'd prefer that it be
> > robust to peculiar situations, wild-n-wacky kernel patches, etc. It's
> > a low-level thing and it should Just Work.
>
> Yes I would agree and if we have evidence that warn_alloc is really
> the problem then I am all for fixing it. There is no such evidence yet.
> Note that dump_stack serialization might be unfair because there is no
> queuing. Is it possible that this is the problem? If yes we should
> rather fix that because that is arguably an even more low-level routine
> than warn_alloc.
>
> That being said. I strongly believe that this patch is not properly
> justified, the issue not fully understood, and as such I disagree with
> adding a new lock on those grounds.
>
> Until the above is resolved
> Nacked-by: Michal Hocko <mhocko@suse.com>

Apart from what happened in Cong Wang's case, I'm really bothered by jumbled messages caused by concurrent warn_alloc() calls. My test case is using linux-next without any out-of-tree patches. Thus, adding a new lock on those grounds should be acceptable.

Quoting from http://lkml.kernel.org/r/20170705081956.GA14538@dhcp22.suse.cz :
Michal Hocko wrote:
> On Sat 01-07-17 20:43:56, Tetsuo Handa wrote:
> > You are rejecting serialization under OOM without giving a chance to test
> > side effects of serialization under OOM at linux-next.git. I call such attitude
> > "speculation" which you never accept.
>
> No I am rejecting abusing the lock for purpose it is not aimed for.

Then why is adding a new lock (not oom_lock but warn_alloc_lock) not acceptable?
Since warn_alloc_lock is aimed at preventing messages from warn_alloc() getting jumbled, there should be no reason for you to reject this lock.

If you don't like locks, can you instead accept the one below?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 80e4adb..3ac382c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3900,9 +3900,14 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 
 	/* Make sure we know about allocations which stall for too long */
 	if (time_after(jiffies, alloc_start + stall_timeout)) {
+		static bool wait;
+
+		while (cmpxchg(&wait, false, true))
+			schedule_timeout_uninterruptible(1);
 		warn_alloc(gfp_mask & ~__GFP_NOWARN, ac->nodemask,
 			"page allocation stalls for %ums, order:%u",
 			jiffies_to_msecs(jiffies-alloc_start), order);
+		wait = false;
 		stall_timeout += 10 * HZ;
 	}
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable. 2017-07-08 4:59 ` Tetsuo Handa @ 2017-07-10 13:21 ` Michal Hocko 2017-07-10 13:54 ` Tetsuo Handa 0 siblings, 1 reply; 43+ messages in thread From: Michal Hocko @ 2017-07-10 13:21 UTC (permalink / raw) To: Tetsuo Handa Cc: akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka On Sat 08-07-17 13:59:54, Tetsuo Handa wrote: [...] > Quoting from http://lkml.kernel.org/r/20170705081956.GA14538@dhcp22.suse.cz : > Michal Hocko wrote: > > On Sat 01-07-17 20:43:56, Tetsuo Handa wrote: > > > You are rejecting serialization under OOM without giving a chance to test > > > side effects of serialization under OOM at linux-next.git. I call such attitude > > > "speculation" which you never accept. > > > > No I am rejecting abusing the lock for purpose it is not aimed for. > > Then, why adding a new lock (not oom_lock but warn_alloc_lock) is not acceptable? > Since warn_alloc_lock is aimed for avoiding messages by warn_alloc() getting > jumbled, there should be no reason you reject this lock. > > If you don't like locks, can you instead accept below one? No, seriously! Just think about what you are proposing. You are stalling and now you will stall _random_ tasks even more. Some of them for unbound amount of time because of inherent unfairness of cmpxchg. If there is a _real_ problem it should be debugged and fixed. If this is a limitation of what printk can handle then we should think how to throttle it even more (e.g. does it make much sense to dump_stack when it hasn't changed since the last time?). If this is about dump_stack taking too long then we should look into it but we definitely should add a more on top. 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 80e4adb..3ac382c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3900,9 +3900,14 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  
>  	/* Make sure we know about allocations which stall for too long */
>  	if (time_after(jiffies, alloc_start + stall_timeout)) {
> +		static bool wait;
> +
> +		while (cmpxchg(&wait, false, true))
> +			schedule_timeout_uninterruptible(1);
>  		warn_alloc(gfp_mask & ~__GFP_NOWARN, ac->nodemask,
>  			"page allocation stalls for %ums, order:%u",
>  			jiffies_to_msecs(jiffies-alloc_start), order);
> +		wait = false;
>  		stall_timeout += 10 * HZ;
>  	}

--
Michal Hocko
SUSE Labs
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable. 2017-07-10 13:21 ` Michal Hocko @ 2017-07-10 13:54 ` Tetsuo Handa 2017-07-10 14:14 ` Michal Hocko 0 siblings, 1 reply; 43+ messages in thread From: Tetsuo Handa @ 2017-07-10 13:54 UTC (permalink / raw) To: mhocko Cc: akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka Michal Hocko wrote: > On Sat 08-07-17 13:59:54, Tetsuo Handa wrote: > [...] > > Quoting from http://lkml.kernel.org/r/20170705081956.GA14538@dhcp22.suse.cz : > > Michal Hocko wrote: > > > On Sat 01-07-17 20:43:56, Tetsuo Handa wrote: > > > > You are rejecting serialization under OOM without giving a chance to test > > > > side effects of serialization under OOM at linux-next.git. I call such attitude > > > > "speculation" which you never accept. > > > > > > No I am rejecting abusing the lock for purpose it is not aimed for. > > > > Then, why adding a new lock (not oom_lock but warn_alloc_lock) is not acceptable? > > Since warn_alloc_lock is aimed for avoiding messages by warn_alloc() getting > > jumbled, there should be no reason you reject this lock. > > > > If you don't like locks, can you instead accept below one? > > No, seriously! Just think about what you are proposing. You are stalling > and now you will stall _random_ tasks even more. Some of them for > unbound amount of time because of inherent unfairness of cmpxchg. The cause of stall when oom_lock is already held is that threads which failed to hold oom_lock continue almost busy looping; schedule_timeout_uninterruptible(1) is not sufficient when there are multiple threads doing the same thing, for direct reclaim/compaction consumes a lot of CPU time. What makes this situation worse is, since warn_alloc() periodically appends to printk() buffer, the thread inside the OOM killer with oom_lock held can stall forever due to cond_resched() from console_unlock() from printk(). Below change significantly reduces possibility of falling into printk() v.s. 
oom_lock lockup problem, because the thread inside the OOM killer with oom_lock held no longer blocks inside printk(). Though there still remains the possibility of sleeping for an unexpectedly long time at schedule_timeout_killable(1) with the oom_lock held.

--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1051,8 +1051,10 @@ bool out_of_memory(struct oom_control *oc)
 		panic("Out of memory and no killable processes...\n");
 	}
 	if (oc->chosen && oc->chosen != (void *)-1UL) {
+		preempt_disable();
 		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
 				 "Memory cgroup out of memory");
+		preempt_enable_no_resched();
 		/*
 		 * Give the killed process a good chance to exit before trying
 		 * to allocate memory again.

I wish we could agree with applying this patch until printk-kthread can work reliably...

> If there is a _real_ problem it should be debugged and fixed. If this
> is a limitation of what printk can handle then we should think how to
> throttle it even more (e.g. does it make much sense to dump_stack when
> it hasn't changed since the last time?). If this is about dump_stack
> taking too long then we should look into it but we definitely should add
> a more on top.

The real problem is the lack of CPU time for reclaiming memory when allocating threads fail to hold oom_lock. And you are refusing to allow allocating threads to give CPU time to the thread holding oom_lock.
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable. 2017-07-10 13:54 ` Tetsuo Handa @ 2017-07-10 14:14 ` Michal Hocko 2017-07-11 13:10 ` Tetsuo Handa 0 siblings, 1 reply; 43+ messages in thread From: Michal Hocko @ 2017-07-10 14:14 UTC (permalink / raw) To: Tetsuo Handa Cc: akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka On Mon 10-07-17 22:54:37, Tetsuo Handa wrote: > Michal Hocko wrote: > > On Sat 08-07-17 13:59:54, Tetsuo Handa wrote: > > [...] > > > Quoting from http://lkml.kernel.org/r/20170705081956.GA14538@dhcp22.suse.cz : > > > Michal Hocko wrote: > > > > On Sat 01-07-17 20:43:56, Tetsuo Handa wrote: > > > > > You are rejecting serialization under OOM without giving a chance to test > > > > > side effects of serialization under OOM at linux-next.git. I call such attitude > > > > > "speculation" which you never accept. > > > > > > > > No I am rejecting abusing the lock for purpose it is not aimed for. > > > > > > Then, why adding a new lock (not oom_lock but warn_alloc_lock) is not acceptable? > > > Since warn_alloc_lock is aimed for avoiding messages by warn_alloc() getting > > > jumbled, there should be no reason you reject this lock. > > > > > > If you don't like locks, can you instead accept below one? > > > > No, seriously! Just think about what you are proposing. You are stalling > > and now you will stall _random_ tasks even more. Some of them for > > unbound amount of time because of inherent unfairness of cmpxchg. > > The cause of stall when oom_lock is already held is that threads which failed to > hold oom_lock continue almost busy looping; schedule_timeout_uninterruptible(1) is > not sufficient when there are multiple threads doing the same thing, for direct > reclaim/compaction consumes a lot of CPU time. 
>
> What makes this situation worse is, since warn_alloc() periodically appends to
> printk() buffer, the thread inside the OOM killer with oom_lock held can stall
> forever due to cond_resched() from console_unlock() from printk().

warn_alloc is just yet-another-user of printk. We might have many others...

> Below change significantly reduces possibility of falling into printk() v.s. oom_lock
> lockup problem, for the thread inside the OOM killer with oom_lock held no longer
> blocks inside printk(). Though there still remains possibility of sleeping for
> unexpectedly long at schedule_timeout_killable(1) with the oom_lock held.

This just papers over the real problem.

> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1051,8 +1051,10 @@ bool out_of_memory(struct oom_control *oc)
>  		panic("Out of memory and no killable processes...\n");
>  	}
>  	if (oc->chosen && oc->chosen != (void *)-1UL) {
> +		preempt_disable();
>  		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
>  				"Memory cgroup out of memory");
> +		preempt_enable_no_resched();
>  		/*
>  		 * Give the killed process a good chance to exit before trying
>  		 * to allocate memory again.
>
> I wish we could agree with applying this patch until printk-kthread can
> work reliably...

And now you have introduced soft lockups most probably because oom_kill_process can take some time... Or maybe even sleeping-while-atomic warnings if some code path needs to sleep for whatever reason. The real fix is to make sure that printk doesn't take an arbitrary amount of time.

You are trying to hammer this particular path but you should realize that as long as printk can take an unbound amount of time then there are many other land mines which need fixing. It is simply not feasible to go after each and every one of them and try to tweak them around. So please stop proposing these random hacks and rather try to work with the printk guys to find a solution for this long-term printk limitation. The OOM killer is a good usecase to give this a priority.
--
Michal Hocko
SUSE Labs
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable. 2017-07-10 14:14 ` Michal Hocko @ 2017-07-11 13:10 ` Tetsuo Handa 2017-07-11 13:49 ` Michal Hocko 0 siblings, 1 reply; 43+ messages in thread From: Tetsuo Handa @ 2017-07-11 13:10 UTC (permalink / raw) To: mhocko Cc: akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka, sergey.senozhatsky.work, pmladek Michal Hocko wrote: > On Mon 10-07-17 22:54:37, Tetsuo Handa wrote: > > Michal Hocko wrote: > > > On Sat 08-07-17 13:59:54, Tetsuo Handa wrote: > > > [...] > > > > Quoting from http://lkml.kernel.org/r/20170705081956.GA14538@dhcp22.suse.cz : > > > > Michal Hocko wrote: > > > > > On Sat 01-07-17 20:43:56, Tetsuo Handa wrote: > > > > > > You are rejecting serialization under OOM without giving a chance to test > > > > > > side effects of serialization under OOM at linux-next.git. I call such attitude > > > > > > "speculation" which you never accept. > > > > > > > > > > No I am rejecting abusing the lock for purpose it is not aimed for. > > > > > > > > Then, why adding a new lock (not oom_lock but warn_alloc_lock) is not acceptable? > > > > Since warn_alloc_lock is aimed for avoiding messages by warn_alloc() getting > > > > jumbled, there should be no reason you reject this lock. > > > > > > > > If you don't like locks, can you instead accept below one? > > > > > > No, seriously! Just think about what you are proposing. You are stalling > > > and now you will stall _random_ tasks even more. Some of them for > > > unbound amount of time because of inherent unfairness of cmpxchg. > > > > The cause of stall when oom_lock is already held is that threads which failed to > > hold oom_lock continue almost busy looping; schedule_timeout_uninterruptible(1) is > > not sufficient when there are multiple threads doing the same thing, for direct > > reclaim/compaction consumes a lot of CPU time. 
> > > > What makes this situation worse is, since warn_alloc() periodically appends to > > printk() buffer, the thread inside the OOM killer with oom_lock held can stall > > forever due to cond_resched() from console_unlock() from printk(). > > warn_alloc is just yet-another-user of printk. We might have many > others... warn_alloc() is different from other users of printk() that printk() is called as long as oom_lock is already held by somebody else processing console_unlock(). > > > Below change significantly reduces possibility of falling into printk() v.s. oom_lock > > lockup problem, for the thread inside the OOM killer with oom_lock held no longer > > blocks inside printk(). Though there still remains possibility of sleeping for > > unexpectedly long at schedule_timeout_killable(1) with the oom_lock held. > > This just papers over the real problem. > > > --- a/mm/oom_kill.c > > +++ b/mm/oom_kill.c > > @@ -1051,8 +1051,10 @@ bool out_of_memory(struct oom_control *oc) > > panic("Out of memory and no killable processes...\n"); > > } > > if (oc->chosen && oc->chosen != (void *)-1UL) { > > + preempt_disable(); > > oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" : > > "Memory cgroup out of memory"); > > + preempt_enable_no_resched(); > > /* > > * Give the killed process a good chance to exit before trying > > * to allocate memory again. > > > > I wish we could agree with applying this patch until printk-kthread can > > work reliably... > > And now you have introduced soft lockups most probably because > oom_kill_process can take some time... Or maybe even sleeping while > atomic warnings if some code path needs to sleep for whatever reason. > The real fix is make sure that printk doesn't take an arbitrary amount of > time. The OOM killer is not permitted to wait for __GFP_DIRECT_RECLAIM allocations directly/indirectly (because it will cause recursion deadlock). 
Thus, even if some code path needs to sleep for some reason, that code path is not permitted to wait for __GFP_DIRECT_RECLAIM allocations directly/indirectly. Anyway, I can propose scattering preempt_disable()/preempt_enable_no_resched() around printk() rather than whole oom_kill_process(). You will just reject it as you have rejected in the past. Using a reproducer and a patch to disable warn_alloc() at http://lkml.kernel.org/r/201707082230.ECB51545.JtFFFVHOOSMLOQ@I-love.SAKURA.ne.jp , you will find that calling cond_resched() (from console_unlock() from printk()) can cause a delay of nearly one minute, and it can cause a delay of nearly 5 minutes to complete one out_of_memory() call. Notice that the reproducer is using not so insane number of threads (only 10 children). (It will become a DoS if 100 children or 1024 children.) ---------------------------------------- [ 589.570344] idle-priority invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null), order=0, oom_score_adj=0 [ 602.792064] idle-priority cpuset=/ mems_allowed=0 [ 602.794185] CPU: 0 PID: 9833 Comm: idle-priority Not tainted 4.12.0-next-20170711+ #628 [ 602.796870] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015 [ 602.800102] Call Trace: [ 602.801685] dump_stack+0x67/0x9e [ 602.803444] dump_header+0x9d/0x3fa [ 602.805227] ? trace_hardirqs_on+0xd/0x10 [ 602.807106] oom_kill_process+0x226/0x650 [ 602.809002] out_of_memory+0x136/0x560 [ 602.810822] ? out_of_memory+0x206/0x560 [ 602.812688] __alloc_pages_nodemask+0xcd2/0xe50 [ 602.814676] alloc_pages_vma+0x76/0x1a0 [ 602.816519] __handle_mm_fault+0xdff/0x1180 [ 602.818416] ? native_sched_clock+0x36/0xa0 [ 602.820322] handle_mm_fault+0x186/0x360 [ 602.822170] ? 
handle_mm_fault+0x44/0x360 [ 602.824009] __do_page_fault+0x1da/0x510 [ 602.825808] do_page_fault+0x21/0x70 [ 602.827501] page_fault+0x22/0x30 [ 602.829121] RIP: 0033:0x4008b8 [ 602.830657] RSP: 002b:00007ffc7f1b2070 EFLAGS: 00010206 [ 602.832602] RAX: 00000000d2d29000 RBX: 0000000100000000 RCX: 00007f2a8e4debd0 [ 602.834911] RDX: 0000000000000000 RSI: 0000000000400ae0 RDI: 0000000000000004 [ 602.837212] RBP: 00007f288e5ea010 R08: 0000000000000000 R09: 0000000000021000 [ 602.839481] R10: 00007ffc7f1b1df0 R11: 0000000000000246 R12: 0000000000000006 [ 602.841712] R13: 00007f288e5ea010 R14: 0000000000000000 R15: 0000000000000000 [ 602.843964] Mem-Info: [ 660.641313] active_anon:871821 inactive_anon:4422 isolated_anon:0 [ 660.641313] active_file:0 inactive_file:1 isolated_file:0 [ 660.641313] unevictable:0 dirty:0 writeback:0 unstable:0 [ 660.641313] slab_reclaimable:0 slab_unreclaimable:0 [ 660.641313] mapped:555 shmem:6257 pagetables:3184 bounce:0 [ 660.641313] free:21377 free_pcp:188 free_cma:0 [ 660.735954] Node 0 active_anon:3487284kB inactive_anon:17688kB active_file:0kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:2220kB dirty:0kB writeback:0kB shmem:25028kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 3198976kB writeback_tmp:0kB unstable:0kB all_unreclaimable? 
yes [ 681.835092] Node 0 DMA free:14780kB min:288kB low:360kB high:432kB active_anon:1092kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 702.374011] lowmem_reserve[]: 0 2688 3624 3624 [ 707.583447] Node 0 DMA32 free:53556kB min:49908kB low:62384kB high:74860kB active_anon:2698264kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129216kB managed:2752884kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 729.520428] lowmem_reserve[]: 0 0 936 936 [ 734.700444] Node 0 Normal free:17204kB min:17384kB low:21728kB high:26072kB active_anon:787928kB inactive_anon:17688kB active_file:0kB inactive_file:4kB unevictable:0kB writepending:0kB present:1048576kB managed:958868kB mlocked:0kB kernel_stack:3072kB pagetables:12736kB bounce:0kB free_pcp:752kB local_pcp:120kB free_cma:0kB [ 760.567618] lowmem_reserve[]: 0 0 0 0 [ 765.855466] Node 0 DMA: 1*4kB (M) 1*8kB (M) 1*16kB (M) 1*32kB (M) 2*64kB (U) 2*128kB (UM) 2*256kB (UM) 1*512kB (M) 1*1024kB (U) 0*2048kB 3*4096kB (M) = 14780kB [ 780.375573] Node 0 DMA32: 7*4kB (UM) 5*8kB (UM) 7*16kB (UM) 6*32kB (U) 5*64kB (UM) 1*128kB (M) 4*256kB (M) 5*512kB (UM) 2*1024kB (M) 1*2048kB (U) 11*4096kB (UM) = 53556kB [ 795.846001] Node 0 Normal: 103*4kB (UM) 21*8kB (UM) 90*16kB (UME) 101*32kB (ME) 43*64kB (UME) 16*128kB (UE) 6*256kB (E) 1*512kB (U) 5*1024kB (UM) 0*2048kB 0*4096kB = 17220kB [ 812.472579] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB [ 821.375279] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 830.344875] 6258 total pagecache pages [ 836.012842] 0 pages in swap cache [ 841.786993] Swap cache stats: add 0, delete 0, find 0/0 [ 848.324422] Free swap = 0kB [ 853.617092] Total swap = 0kB [ 858.757287] 
1048445 pages RAM [ 864.079999] 0 pages HighMem/MovableOnly [ 869.764325] 116531 pages reserved [ 875.162601] 0 pages hwpoisoned [ 880.350179] Out of memory: Kill process 9833 (idle-priority) score 928 or sacrifice child [ 888.751868] Killed process 9835 (normal-priority) total-vm:4360kB, anon-rss:96kB, file-rss:0kB, shmem-rss:0kB ---------------------------------------- > > You are trying to hammer this particular path but you should realize > that as long as printk can take an unbound amount of time then there are > many other land mines which need fixing. It is simply not feasible to go > after each and ever one of them and try to tweak them around. So please > stop proposing these random hacks and rather try to work with prink guys > to find solution for this long term printk limitation. OOM killer is a > good usecase to give this a priority. Whatever approach we use for printk() not to take an unbound amount of time (e.g. just enqueue to log_buf using a per-thread flag), we might still take an unbound amount of time if we allow cond_resched() (or whatever sleep some code path might need to use) with the oom_lock held. After all, the OOM killer is ignoring the scheduling priority problem regardless of the printk() lockup problem. I have no objection to making sure that printk() doesn't take an arbitrary amount of time. But the real fix is to make sure that out_of_memory() doesn't take an arbitrary amount of time (i.e. don't allow cond_resched() etc. at all) unless there is cooperation from other allocating threads which failed to hold oom_lock. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable. 2017-07-11 13:10 ` Tetsuo Handa @ 2017-07-11 13:49 ` Michal Hocko 2017-07-11 14:58 ` Petr Mladek 2017-07-11 22:06 ` Tetsuo Handa 0 siblings, 2 replies; 43+ messages in thread From: Michal Hocko @ 2017-07-11 13:49 UTC (permalink / raw) To: Tetsuo Handa Cc: akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka, sergey.senozhatsky.work, pmladek On Tue 11-07-17 22:10:36, Tetsuo Handa wrote: > Michal Hocko wrote: > > On Mon 10-07-17 22:54:37, Tetsuo Handa wrote: > > > Michal Hocko wrote: > > > > On Sat 08-07-17 13:59:54, Tetsuo Handa wrote: > > > > [...] > > > > > Quoting from http://lkml.kernel.org/r/20170705081956.GA14538@dhcp22.suse.cz : > > > > > Michal Hocko wrote: > > > > > > On Sat 01-07-17 20:43:56, Tetsuo Handa wrote: > > > > > > > You are rejecting serialization under OOM without giving a chance to test > > > > > > > side effects of serialization under OOM at linux-next.git. I call such attitude > > > > > > > "speculation" which you never accept. > > > > > > > > > > > > No I am rejecting abusing the lock for purpose it is not aimed for. > > > > > > > > > > Then, why adding a new lock (not oom_lock but warn_alloc_lock) is not acceptable? > > > > > Since warn_alloc_lock is aimed for avoiding messages by warn_alloc() getting > > > > > jumbled, there should be no reason you reject this lock. > > > > > > > > > > If you don't like locks, can you instead accept below one? > > > > > > > > No, seriously! Just think about what you are proposing. You are stalling > > > > and now you will stall _random_ tasks even more. Some of them for > > > > unbound amount of time because of inherent unfairness of cmpxchg. 
> > > > > > The cause of stall when oom_lock is already held is that threads which failed to > > > hold oom_lock continue almost busy looping; schedule_timeout_uninterruptible(1) is > > > not sufficient when there are multiple threads doing the same thing, for direct > > > reclaim/compaction consumes a lot of CPU time. > > > > > > What makes this situation worse is, since warn_alloc() periodically appends to > > > printk() buffer, the thread inside the OOM killer with oom_lock held can stall > > > forever due to cond_resched() from console_unlock() from printk(). > > > > warn_alloc is just yet-another-user of printk. We might have many > > others... > > warn_alloc() is different from other users of printk() that printk() is called > as long as oom_lock is already held by somebody else processing console_unlock(). So what exactly prevents any other caller of printk interfering while the oom is ongoing? > > > > > > Below change significantly reduces possibility of falling into printk() v.s. oom_lock > > > lockup problem, for the thread inside the OOM killer with oom_lock held no longer > > > blocks inside printk(). Though there still remains possibility of sleeping for > > > unexpectedly long at schedule_timeout_killable(1) with the oom_lock held. > > > > This just papers over the real problem. > > > > > --- a/mm/oom_kill.c > > > +++ b/mm/oom_kill.c > > > @@ -1051,8 +1051,10 @@ bool out_of_memory(struct oom_control *oc) > > > panic("Out of memory and no killable processes...\n"); > > > } > > > if (oc->chosen && oc->chosen != (void *)-1UL) { > > > + preempt_disable(); > > > oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" : > > > "Memory cgroup out of memory"); > > > + preempt_enable_no_resched(); > > > /* > > > * Give the killed process a good chance to exit before trying > > > * to allocate memory again. > > > > > > I wish we could agree with applying this patch until printk-kthread can > > > work reliably... 
> > > > And now you have introduced soft lockups most probably because > > oom_kill_process can take some time... Or maybe even sleeping while > > atomic warnings if some code path needs to sleep for whatever reason. > > The real fix is make sure that printk doesn't take an arbitrary amount of > > time. > > The OOM killer is not permitted to wait for __GFP_DIRECT_RECLAIM allocations > directly/indirectly (because it will cause recursion deadlock). Thus, even if > some code path needs to sleep for some reason, that code path is not permitted to > wait for __GFP_DIRECT_RECLAIM allocations directly/indirectly. Anyway, I can > propose scattering preempt_disable()/preempt_enable_no_resched() around printk() > rather than whole oom_kill_process(). You will just reject it as you have rejected > in the past. because you are trying to address a problem at a wrong layer. If there is absolutely no way around it and printk is unfixable then we really need a printk variant which will make sure that no excessive waiting will be involved. Then we can replace all printk in the oom path with this special printk. [...] > > You are trying to hammer this particular path but you should realize > > that as long as printk can take an unbound amount of time then there are > > many other land mines which need fixing. It is simply not feasible to go > > after each and ever one of them and try to tweak them around. So please > > stop proposing these random hacks and rather try to work with prink guys > > to find solution for this long term printk limitation. OOM killer is a > > good usecase to give this a priority. > > Whatever approach we use for printk() not to take unbound amount of time > (e.g. just enqueue to log_buf using per a thread flag), we might still take > unbound amount of time if we allow cond_sched() (or whatever sleep some > code path might need to use) with the oom_lock held. 
After all, the OOM killer > is ignoring scheduling priority problem regardless of printk() lockup problem. > > I don't have objection about making sure that printk() doesn't take an arbitrary > amount of time. But the real fix is make sure that out_of_memory() doesn't take > an arbitrary amount of time (i.e. don't allow cond_resched() etc. at all) unless > there is cooperation from other allocating threads which failed to hold oom_lock. As I've said out_of_memory is an expensive operation and as such it has to be preemptible. Addressing this would require quite some work. -- Michal Hocko SUSE Labs
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable. 2017-07-11 13:49 ` Michal Hocko @ 2017-07-11 14:58 ` Petr Mladek 2017-07-11 22:06 ` Tetsuo Handa 1 sibling, 0 replies; 43+ messages in thread From: Petr Mladek @ 2017-07-11 14:58 UTC (permalink / raw) To: Michal Hocko Cc: Tetsuo Handa, akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka, sergey.senozhatsky.work On Tue 2017-07-11 15:49:00, Michal Hocko wrote: > On Tue 11-07-17 22:10:36, Tetsuo Handa wrote: > > Michal Hocko wrote: > > > On Mon 10-07-17 22:54:37, Tetsuo Handa wrote: > > > > What makes this situation worse is, since warn_alloc() periodically appends to > > > > printk() buffer, the thread inside the OOM killer with oom_lock held can stall > > > > forever due to cond_resched() from console_unlock() from printk(). > > > > > > warn_alloc is just yet-another-user of printk. We might have many > > > others... > because you are trying to address a problem at a wrong layer. If there > is absolutely no way around it and printk is unfixable then we really > need a printk variant which will make sure that no excessive waiting > will be involved. Then we can replace all printk in the oom path with > this special printk. The last theory about printk offloading suggests that printk() should always try to push some messages to the console when the console lock is available. Otherwise, the messages might not appear at all because the offloading is never 100% reliable, especially when the system is in trouble. In any case, this live-lock is another reason to risk the printk offload at some stage. Of course, we could make the throttling more aggressive. But it is another complex problem. Only printk() knows how much it is stressed and how much throttling is needed. On the other hand, it might be hard to know what information is repeating and who needs to be throttled. 
It is an open question whether it should be solved by providing more printk_throttle() variants or by some magic inside normal printk(). Best Regards, Petr
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable. 2017-07-11 13:49 ` Michal Hocko 2017-07-11 14:58 ` Petr Mladek @ 2017-07-11 22:06 ` Tetsuo Handa 2017-07-12 8:54 ` Michal Hocko 1 sibling, 1 reply; 43+ messages in thread From: Tetsuo Handa @ 2017-07-11 22:06 UTC (permalink / raw) To: mhocko Cc: akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka, sergey.senozhatsky.work, pmladek Michal Hocko wrote: > On Tue 11-07-17 22:10:36, Tetsuo Handa wrote: > > Michal Hocko wrote: > > > On Mon 10-07-17 22:54:37, Tetsuo Handa wrote: > > > > Michal Hocko wrote: > > > > > On Sat 08-07-17 13:59:54, Tetsuo Handa wrote: > > > > > [...] > > > > > > Quoting from http://lkml.kernel.org/r/20170705081956.GA14538@dhcp22.suse.cz : > > > > > > Michal Hocko wrote: > > > > > > > On Sat 01-07-17 20:43:56, Tetsuo Handa wrote: > > > > > > > > You are rejecting serialization under OOM without giving a chance to test > > > > > > > > side effects of serialization under OOM at linux-next.git. I call such attitude > > > > > > > > "speculation" which you never accept. > > > > > > > > > > > > > > No I am rejecting abusing the lock for purpose it is not aimed for. > > > > > > > > > > > > Then, why adding a new lock (not oom_lock but warn_alloc_lock) is not acceptable? > > > > > > Since warn_alloc_lock is aimed for avoiding messages by warn_alloc() getting > > > > > > jumbled, there should be no reason you reject this lock. > > > > > > > > > > > > If you don't like locks, can you instead accept below one? > > > > > > > > > > No, seriously! Just think about what you are proposing. You are stalling > > > > > and now you will stall _random_ tasks even more. Some of them for > > > > > unbound amount of time because of inherent unfairness of cmpxchg. 
> > > > > > > > The cause of stall when oom_lock is already held is that threads which failed to > > > > hold oom_lock continue almost busy looping; schedule_timeout_uninterruptible(1) is > > > > not sufficient when there are multiple threads doing the same thing, for direct > > > > reclaim/compaction consumes a lot of CPU time. > > > > > > > > What makes this situation worse is, since warn_alloc() periodically appends to > > > > printk() buffer, the thread inside the OOM killer with oom_lock held can stall > > > > forever due to cond_resched() from console_unlock() from printk(). > > > > > > warn_alloc is just yet-another-user of printk. We might have many > > > others... > > > > warn_alloc() is different from other users of printk() that printk() is called > > as long as oom_lock is already held by somebody else processing console_unlock(). > > So what exactly prevents any other caller of printk interfering while > the oom is ongoing? Other callers of printk() are not doing silly things like "while(1) printk();". They don't call printk() until something completes (e.g. some operation returned an error code) or they do throttling. Only watchdog calls printk() without waiting for something to complete (because watchdog is there in order to warn that something might be wrong). But watchdog is calling printk() carefully not to cause flooding (e.g. khungtaskd sleeps enough) and not to cause lockups (e.g. khungtaskd calls rcu_lock_break()). As far as I can observe, only warn_alloc() for watchdog trivially causes flooding and lockups. > > > > > > > > > > Below change significantly reduces possibility of falling into printk() v.s. oom_lock > > > > lockup problem, for the thread inside the OOM killer with oom_lock held no longer > > > > blocks inside printk(). Though there still remains possibility of sleeping for > > > > unexpectedly long at schedule_timeout_killable(1) with the oom_lock held. > > > > > > This just papers over the real problem. 
> > > > > > > --- a/mm/oom_kill.c > > > > +++ b/mm/oom_kill.c > > > > @@ -1051,8 +1051,10 @@ bool out_of_memory(struct oom_control *oc) > > > > panic("Out of memory and no killable processes...\n"); > > > > } > > > > if (oc->chosen && oc->chosen != (void *)-1UL) { > > > > + preempt_disable(); > > > > oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" : > > > > "Memory cgroup out of memory"); > > > > + preempt_enable_no_resched(); > > > > /* > > > > * Give the killed process a good chance to exit before trying > > > > * to allocate memory again. > > > > > > > > I wish we could agree with applying this patch until printk-kthread can > > > > work reliably... > > > > > > And now you have introduced soft lockups most probably because > > > oom_kill_process can take some time... Or maybe even sleeping while > > > atomic warnings if some code path needs to sleep for whatever reason. > > > The real fix is make sure that printk doesn't take an arbitrary amount of > > > time. > > > > The OOM killer is not permitted to wait for __GFP_DIRECT_RECLAIM allocations > > directly/indirectly (because it will cause recursion deadlock). Thus, even if > > some code path needs to sleep for some reason, that code path is not permitted to > > wait for __GFP_DIRECT_RECLAIM allocations directly/indirectly. Anyway, I can > > propose scattering preempt_disable()/preempt_enable_no_resched() around printk() > > rather than whole oom_kill_process(). You will just reject it as you have rejected > > in the past. > > because you are trying to address a problem at a wrong layer. If there > is absolutely no way around it and printk is unfixable then we really > need a printk variant which will make sure that no excessive waiting > will be involved. Then we can replace all printk in the oom path with > this special printk. Writing data faster than readers can read is wrong, especially when writers deprive readers of CPU time to read. > > [...] 
> > > > You are trying to hammer this particular path but you should realize > > > that as long as printk can take an unbound amount of time then there are > > > many other land mines which need fixing. It is simply not feasible to go > > > after each and ever one of them and try to tweak them around. So please > > > stop proposing these random hacks and rather try to work with prink guys > > > to find solution for this long term printk limitation. OOM killer is a > > > good usecase to give this a priority. > > > > Whatever approach we use for printk() not to take unbound amount of time > > (e.g. just enqueue to log_buf using per a thread flag), we might still take > > unbound amount of time if we allow cond_sched() (or whatever sleep some > > code path might need to use) with the oom_lock held. After all, the OOM killer > > is ignoring scheduling priority problem regardless of printk() lockup problem. > > > > I don't have objection about making sure that printk() doesn't take an arbitrary > > amount of time. But the real fix is make sure that out_of_memory() doesn't take > > an arbitrary amount of time (i.e. don't allow cond_resched() etc. at all) unless > > there is cooperation from other allocating threads which failed to hold oom_lock. > > As I've said out_of_memory is an expensive operation and as such it has > to be preemptible. Addressing this would require quite some work. But calling out_of_memory() with SCHED_IDLE priority makes overall allocations far more expensive. If you want to keep out_of_memory() preemptible, you should make sure that out_of_memory() is executed with !SCHED_IDLE priority. Offloading to a dedicated kernel thread like oom_reaper will do it.
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable. 2017-07-11 22:06 ` Tetsuo Handa @ 2017-07-12 8:54 ` Michal Hocko 2017-07-12 12:23 ` Tetsuo Handa 0 siblings, 1 reply; 43+ messages in thread From: Michal Hocko @ 2017-07-12 8:54 UTC (permalink / raw) To: Tetsuo Handa Cc: akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka, sergey.senozhatsky.work, pmladek On Wed 12-07-17 07:06:11, Tetsuo Handa wrote: > Michal Hocko wrote: > > On Tue 11-07-17 22:10:36, Tetsuo Handa wrote: > > > Michal Hocko wrote: [...] > > > > warn_alloc is just yet-another-user of printk. We might have many > > > > others... > > > > > > warn_alloc() is different from other users of printk() that printk() is called > > > as long as oom_lock is already held by somebody else processing console_unlock(). > > > > So what exactly prevents any other caller of printk interfering while > > the oom is ongoing? > > Other callers of printk() are not doing silly things like "while(1) printk();". They can still print a lot. There have been reports of one printk source pushing an unrelated context to print way too much. > They don't call printk() until something completes (e.g. some operation returned > an error code) or they do throttling. Only watchdog calls printk() without waiting > for something to complete (because watchdog is there in order to warn that something > might be wrong). But watchdog is calling printk() carefully not to cause flooding > (e.g. khungtaskd sleeps enough) and not to cause lockups (e.g. khungtaskd calls > rcu_lock_break()). Look at hard/soft lockup detector and how it can cause flood of printks. > As far as I can observe, only warn_alloc() for watchdog trivially > causes flooding and lockups. warn_alloc prints a single line + dump_stack for each stalling allocation and show_mem once per second. That doesn't sound overly crazy to me. 
Sure we can have many stalling tasks under certain conditions (most of them quite unrealistic) and then we can print a lot. I do not see an easy way out of it without losing information about stalls and I guess we want to know about them otherwise we will have much harder time to debug stalls. Sure we can tune this a bit and e.g. do not dump stacks of tasks which have already printed their backtrace as it couldn't have changed. But this doesn't change anything in principle. [...] > > > The OOM killer is not permitted to wait for __GFP_DIRECT_RECLAIM allocations > > > directly/indirectly (because it will cause recursion deadlock). Thus, even if > > > some code path needs to sleep for some reason, that code path is not permitted to > > > wait for __GFP_DIRECT_RECLAIM allocations directly/indirectly. Anyway, I can > > > propose scattering preempt_disable()/preempt_enable_no_resched() around printk() > > > rather than whole oom_kill_process(). You will just reject it as you have rejected > > > in the past. > > > > because you are trying to address a problem at a wrong layer. If there > > is absolutely no way around it and printk is unfixable then we really > > need a printk variant which will make sure that no excessive waiting > > will be involved. Then we can replace all printk in the oom path with > > this special printk. > > Writing data faster than readers can read is wrong, especially when > writers deprive readers of CPU time to read. Yes this is not good but only printk knows the congestion. [...] > > As I've said out_of_memory is an expensive operation and as such it has > > to be preemptible. Addressing this would require quite some work. > > But calling out_of_memory() with SCHED_IDLE priority makes overall allocations > far more expensive. If you want to keep out_of_memory() preemptible, you should > make sure that out_of_memory() is executed with !SCHED_IDLE priority. Offloading to > a dedicated kernel thread like oom_reaper will do it. 
You do realize that the whole page allocator is not priority aware and a low priority task can starve a higher priority task already in the reclaim path. Is this ideal? Absolutely not, but let's be realistic: this has never been a priority and it would require a lot of heavy lifting. The OOM path is the coldest path in the whole allocation stack, and focusing solely on it while claiming something takes a minute or two longer is just not going to attract a lot of attention. -- Michal Hocko SUSE Labs
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable. 2017-07-12 8:54 ` Michal Hocko @ 2017-07-12 12:23 ` Tetsuo Handa 2017-07-12 12:41 ` Michal Hocko 0 siblings, 1 reply; 43+ messages in thread From: Tetsuo Handa @ 2017-07-12 12:23 UTC (permalink / raw) To: mhocko Cc: akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka, sergey.senozhatsky.work, pmladek Michal Hocko wrote: > On Wed 12-07-17 07:06:11, Tetsuo Handa wrote: > > Michal Hocko wrote: > > > On Tue 11-07-17 22:10:36, Tetsuo Handa wrote: > > > > Michal Hocko wrote: > [...] > > > > > warn_alloc is just yet-another-user of printk. We might have many > > > > > others... > > > > > > > > warn_alloc() is different from other users of printk() that printk() is called > > > > as long as oom_lock is already held by somebody else processing console_unlock(). > > > > > > So what exactly prevents any other caller of printk interfering while > > > the oom is ongoing? > > > > Other callers of printk() are not doing silly things like "while(1) printk();". > > They can still print a lot. There have been reports of one printk source > pushing an unrelated context to print way too much. Which source is that? Legitimate printk() users might do for (i = 0; i < 1000; i++) printk(); but they do not do while (1) for (i = 0; i < 1000; i++) printk(); . > > > They don't call printk() until something completes (e.g. some operation returned > > an error code) or they do throttling. Only watchdog calls printk() without waiting > > for something to complete (because watchdog is there in order to warn that something > > might be wrong). But watchdog is calling printk() carefully not to cause flooding > > (e.g. khungtaskd sleeps enough) and not to cause lockups (e.g. khungtaskd calls > > rcu_lock_break()). > > Look at hard/soft lockup detector and how it can cause flood of printks. Lockup detector is legitimate because it is there to warn that somebody is continuously consuming CPU time. 
Lockup detector might do for (i = 0; i < 1000; i++) printk(); but does not do while (1) for (i = 0; i < 1000; i++) printk(); because lockup detector waits for enough interval. while (1) { for (i = 0; i < 1000; i++) printk(); schedule_timeout_killable(HZ * 60); } > > > As far as I can observe, only warn_alloc() for watchdog trivially > > causes flooding and lockups. > > warn_alloc prints a single line + dump_stack for each stalling allocation and > show_mem once per second. That doesn't sound overly crazy to me. > Sure we can have many stalling tasks under certain conditions (most of > them quite unrealistic) and then we can print a lot. I do not see an > easy way out of it without losing information about stalls and I guess > we want to know about them otherwise we will have much harder time to > debug stalls. Printing just one line per every second can lead to lockup, for the condition to escape the "for (;;)" loop in console_unlock() is if (console_seq == log_next_seq) break; when cond_resched() in that loop slept for more than one second due to SCHED_IDLE priority. Currently preempt_disable()/preempt_enable_no_resched() (or equivalent) is the only available countermeasure for minimizing interference like for (i = 0; i < 1000; i++) printk(); . If prink() allows per printk context (shown below) flag which allows printk() users to force printk() not to try to print immediately (i.e. declare that use deferred printing (maybe offloaded to the printk-kthread)), lockups by cond_resched() from console_unlock() from printk() from out_of_memory() will be avoided. ---------- static unsigned long printk_context(void) { /* * Assume that we can use lower 2 bits for flags, as with * __mutex_owner() does. */ unsigned long context = (unsigned long) current; /* Both bits set means processing NMI context. */ if (in_nmi()) context |= 3; /* Only next-LSB set means processing hard IRQ context. */ else if (in_irq()) context |= 2; /* Only LSB set means processing soft IRQ context. 
*/ else if (in_serving_softirq()) context |= 1; /* * Neither bits set means processing task context, * though still might be non sleepable context. */ return context; } ---------- Of course given that such flag is introduced and you accept setting/clearing such flag inside out_of_memory()... > > Sure we can tune this a bit and e.g. do not dump stacks of tasks which > have already printed their backtrace as it couldn't have changed. But > this doesn't change anything in principle. > > [...] > > > > The OOM killer is not permitted to wait for __GFP_DIRECT_RECLAIM allocations > > > > directly/indirectly (because it will cause recursion deadlock). Thus, even if > > > > some code path needs to sleep for some reason, that code path is not permitted to > > > > wait for __GFP_DIRECT_RECLAIM allocations directly/indirectly. Anyway, I can > > > > propose scattering preempt_disable()/preempt_enable_no_resched() around printk() > > > > rather than whole oom_kill_process(). You will just reject it as you have rejected > > > > in the past. > > > > > > because you are trying to address a problem at a wrong layer. If there > > > is absolutely no way around it and printk is unfixable then we really > > > need a printk variant which will make sure that no excessive waiting > > > will be involved. Then we can replace all printk in the oom path with > > > this special printk. > > > > Writing data faster than readers can read is wrong, especially when > > writers deprive readers of CPU time to read. > > Yes this is not good but only printk knows the congestion. warn_alloc() callers can wait for oom_lock, for somebody which is holding oom_lock should be making progress for warn_alloc() callers. /* * Acquire the oom lock. If that fails, somebody else is * making progress for us. */ What I'm pointing out is that this comment becomes false if there are multiple threads doing the same thing. 
---------- diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 80e4adb..5cd845a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3259,11 +3259,21 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...) *did_some_progress = 0; /* - * Acquire the oom lock. If that fails, somebody else is - * making progress for us. + * Acquire the oom lock. If that fails, retry after giving the owner of + * the oom lock a chance to make progress, by taking a short sleep. + * But when retrying, skip direct reclaim/compaction in order to give + * the owner of the oom lock as much CPU time as possible, for the + * owner of the oom lock might be a SCHED_IDLE priority thread. + * When 10+ !SCHED_IDLE priority threads do direct reclaim/compaction + * on a CPU which the owner of the oom lock (a SCHED_IDLE priority + * thread) can run, it is possible that all CPU time yielded by the + * short sleep is wasted for direct reclaim/compaction and the owner of + * the oom lock fails to make progress for 60+ seconds due to lack of + * CPU time (i.e. falls into an OOM livelock situation where the OOM + * killer cannot send SIGKILL). 
*/ if (!mutex_trylock(&oom_lock)) { - *did_some_progress = 1; + *did_some_progress = -1; schedule_timeout_uninterruptible(1); return NULL; } @@ -3770,6 +3780,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask) unsigned long alloc_start = jiffies; unsigned int stall_timeout = 10 * HZ; unsigned int cpuset_mems_cookie; + bool skip_direct_reclaim = false; /* * In the slowpath, we sanity check order to avoid ever trying to @@ -3906,6 +3917,9 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask) stall_timeout += 10 * HZ; } + if (skip_direct_reclaim) + goto wait_for_oom_killer; + /* Avoid recursion of direct reclaim */ if (current->flags & PF_MEMALLOC) goto nopage; @@ -3955,6 +3969,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask) goto retry_cpuset; /* Reclaim has failed us, start killing things */ +wait_for_oom_killer: page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress); if (page) goto got_pg; @@ -3968,6 +3983,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask) /* Retry as long as the OOM killer is making progress */ if (did_some_progress) { no_progress_loops = 0; + skip_direct_reclaim = did_some_progress == -1; goto retry; } ---------- > > [...] > > > As I've said out_of_memory is an expensive operation and as such it has > > > to be preemptible. Addressing this would require quite some work. > > > > But calling out_of_memory() with SCHED_IDLE priority makes overall allocations > > far more expensive. If you want to keep out_of_memory() preemptible, you should > > make sure that out_of_memory() is executed with !SCHED_IDLE priority. Offloading to > > a dedicated kernel thread like oom_reaper will do it. > > You do realize that the whole page allocator is not priority aware and > a low priority task can starve a higher priority task already in the > reclaim path. Is this ideal? Absolutely no but let's be realistic, this > has never been a priority and it would require a lot of heavy lifting. Then why do you complain No, seriously! 
Just think about what you are proposing. You are stalling and now you will stall _random_ tasks even more. ? We cannot make the whole page allocator priority aware within a few years. Then, avoiding the possibility of lockups for now makes sense until you fix it. > The OOM is the most cold path in the whole allocation stack and focusing > solely on it while claiming something take a minute or two longer is > just not going to attract a lot of attention. Who said "a minute or two"? We can make the stalls effectively "forever", because the delay caused by while (1) cond_resched(); interference in __alloc_pages_slowpath() (this is what I demonstrated by disabling warn_alloc()) gets longer as the number of threads doing that busy loop increases.
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable. 2017-07-12 12:23 ` Tetsuo Handa @ 2017-07-12 12:41 ` Michal Hocko 2017-07-14 12:30 ` Tetsuo Handa 0 siblings, 1 reply; 43+ messages in thread From: Michal Hocko @ 2017-07-12 12:41 UTC (permalink / raw) To: Tetsuo Handa Cc: akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka, sergey.senozhatsky.work, pmladek On Wed 12-07-17 21:23:05, Tetsuo Handa wrote: > Michal Hocko wrote: > > On Wed 12-07-17 07:06:11, Tetsuo Handa wrote: > > > Michal Hocko wrote: > > > > On Tue 11-07-17 22:10:36, Tetsuo Handa wrote: > > > > > Michal Hocko wrote: > > [...] > > > > > > warn_alloc is just yet-another-user of printk. We might have many > > > > > > others... > > > > > > > > > > warn_alloc() is different from other users of printk() that printk() is called > > > > > as long as oom_lock is already held by somebody else processing console_unlock(). > > > > > > > > So what exactly prevents any other caller of printk interfering while > > > > the oom is ongoing? > > > > > > Other callers of printk() are not doing silly things like "while(1) printk();". > > > > They can still print a lot. There have been reports of one printk source > > pushing an unrelated context to print way too much. > > Which source is that? > > Legitimate printk() users might do > > for (i = 0; i < 1000; i++) > printk(); > > but they do not do > > while (1) > for (i = 0; i < 1000; i++) > printk(); > > . > > > > > > They don't call printk() until something completes (e.g. some operation returned > > > an error code) or they do throttling. Only watchdog calls printk() without waiting > > > for something to complete (because watchdog is there in order to warn that something > > > might be wrong). But watchdog is calling printk() carefully not to cause flooding > > > (e.g. khungtaskd sleeps enough) and not to cause lockups (e.g. khungtaskd calls > > > rcu_lock_break()). 
> > > > Look at hard/soft lockup detector and how it can cause flood of printks. > > Lockup detector is legitimate because it is there to warn that somebody is > continuously consuming CPU time. Lockup detector might do

Sigh. What I've tried to convey is that the lockup detector can print _a lot_ (just consider a large machine with hundreds of CPUs and trying to dump stack trace on each of them....) and that might mimic a herd of printks from allocation stalls... [...]

> > warn_alloc prints a single line + dump_stack for each stalling allocation and > > show_mem once per second. That doesn't sound overly crazy to me. > > Sure we can have many stalling tasks under certain conditions (most of > > them quite unrealistic) and then we can print a lot. I do not see an > > easy way out of it without losing information about stalls and I guess > > we want to know about them otherwise we will have much harder time to > > debug stalls. > > Printing just one line per every second can lead to lockup, for > the condition to escape the "for (;;)" loop in console_unlock() is > > if (console_seq == log_next_seq) > break;

Then something is really broken in that condition, don't you think? Peter has already mentioned that offloading to a different context seems like the way to go here.

> when cond_resched() in that loop slept for more than one second due to > SCHED_IDLE priority. > > Currently preempt_disable()/preempt_enable_no_resched() (or equivalent) > is the only available countermeasure for minimizing interference like > > for (i = 0; i < 1000; i++) > printk(); > > . If printk() allows per printk context (shown below) flag which allows printk() > users to force printk() not to try to print immediately (i.e. declare that > use deferred printing (maybe offloaded to the printk-kthread)), lockups by > cond_resched() from console_unlock() from printk() from out_of_memory() will be avoided.
As I've said earlier, if there is no other way to make printk work without all these nasty side effects then I would be OK to add printk context specific calls into the oom killer. Removing the rest because this is again getting largely tangential. The primary problem you are seeing is that we stumble over printk here. Unless I can see a sound argument this is not the case it doesn't make any sense to discuss allocator changes. [...] -- Michal Hocko SUSE Labs
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable. 2017-07-12 12:41 ` Michal Hocko @ 2017-07-14 12:30 ` Tetsuo Handa 2017-07-14 12:48 ` Michal Hocko 0 siblings, 1 reply; 43+ messages in thread From: Tetsuo Handa @ 2017-07-14 12:30 UTC (permalink / raw) To: mhocko Cc: akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka, sergey.senozhatsky.work, pmladek Michal Hocko wrote: > On Wed 12-07-17 21:23:05, Tetsuo Handa wrote: > > Michal Hocko wrote: > > > On Wed 12-07-17 07:06:11, Tetsuo Handa wrote: > > > > They don't call printk() until something completes (e.g. some operation returned > > > > an error code) or they do throttling. Only watchdog calls printk() without waiting > > > > for something to complete (because watchdog is there in order to warn that something > > > > might be wrong). But watchdog is calling printk() carefully not to cause flooding > > > > (e.g. khungtaskd sleeps enough) and not to cause lockups (e.g. khungtaskd calls > > > > rcu_lock_break()). > > > > > > Look at hard/soft lockup detector and how it can cause flood of printks. > > > > Lockup detector is legitimate because it is there to warn that somebody is > > continuously consuming CPU time. Lockup detector might do > > Sigh. What I've tried to convey is that the lockup detector can print _a > lot_ (just consider a large machine with hundreds of CPUs and trying to > dump stack trace on each of them....) and that might mimic a herd of > printks from allocation stalls... So, you are brave enough to offload writing to consoles to a schedulable context when the system is experiencing hard/soft lockups (which means that there is no guarantee that printk kthread can start writing to consoles from a schedulable context). Good luck. Watchdog messages from interrupt context are urgent. But OOM killer messages and warn_alloc() (for stall reporting) messages are both from schedulable context and these messages can be offloaded to printk kthread. 
I'm not objecting offloading OOM killer messages and warn_alloc() messages. > [...] > > > warn_alloc prints a single line + dump_stack for each stalling allocation and > > > show_mem once per second. That doesn't sound overly crazy to me. > > > Sure we can have many stalling tasks under certain conditions (most of > > > them quite unrealistic) and then we can print a lot. I do not see an > > > easy way out of it without losing information about stalls and I guess > > > we want to know about them otherwise we will have much harder time to > > > debug stalls. > > > > Printing just one line per every second can lead to lockup, for > > the condition to escape the "for (;;)" loop in console_unlock() is > > > > if (console_seq == log_next_seq) > > break; > > Then something is really broken in that condition, don't you think? > Peter has already mentioned that offloading to a different context seems > like the way to go here. > > > when cond_resched() in that loop slept for more than one second due to > > SCHED_IDLE priority. > > > > Currently preempt_disable()/preempt_enable_no_resched() (or equivalent) > > is the only available countermeasure for minimizing interference like > > > > for (i = 0; i < 1000; i++) > > printk(); > > > > . If prink() allows per printk context (shown below) flag which allows printk() > > users to force printk() not to try to print immediately (i.e. declare that > > use deferred printing (maybe offloaded to the printk-kthread)), lockups by > > cond_resched() from console_unlock() from printk() from out_of_memory() will be > > avoided. > > As I've said earlier, if there is no other way to make printk work without all > these nasty side effected then I would be OK to add a printk context > specific calls into the oom killer. > > Removing the rest because this is again getting largely tangent. The > primary problem you are seeing is that we stumble over printk here. 
> Unless I can see a sound argument this is not the case it doesn't make > any sense to discuss allocator changes.

You are still ignoring my point. I agree that we stumble over printk(), but printk() is only one of the locations where we stumble. Look at schedule_timeout_killable(1) in out_of_memory(), which is called with oom_lock still held. I'm reporting that even if printk() is offloaded to a printk kernel thread, scheduling priority can make schedule_timeout_killable(1) sleep for more than 12 minutes (when it is intended to sleep for only one millisecond). (I gave up waiting and pressed SysRq-i. I can't imagine how long it would have continued sleeping inside schedule_timeout_killable(1) with oom_lock held.) Without cooperation from other allocating threads which failed to hold oom_lock, it is dangerous to keep out_of_memory() in a preemptible/schedulable context.

---------- Reproducer start ----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sched.h>
#include <signal.h>
#include <sys/prctl.h>

int main(int argc, char *argv[])
{
	struct sched_param sp = { };
	cpu_set_t cpu = { { 1 } };
	static int pipe_fd[2] = { EOF, EOF };
	unsigned long size = 65536;
	char *buf = malloc(size);
	unsigned long i = 0;

	signal(SIGCLD, SIG_IGN);
	sched_setaffinity(0, sizeof(cpu), &cpu);
	prctl(PR_SET_NAME, (unsigned long) "normal-priority", 0, 0, 0);
	pipe(pipe_fd);
	for (i = 0; i < 1024; i++)
		if (fork() == 0) {
			if (i)
				close(pipe_fd[1]);
			/* Wait until first child gets SIGKILL. */
			read(pipe_fd[0], &i, 1);
			/* Join the direct reclaim storm. */
			for (i = 0; i < size; i += 4096)
				buf[i] = 0;
			_exit(0);
		}
	for (size = 512; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	sched_setscheduler(0, SCHED_IDLE, &sp);
	prctl(PR_SET_NAME, (unsigned long) "idle-priority", 0, 0, 0);
	close(pipe_fd[1]);
	for (i = 0; i < size; i += 4096)
		buf[i] = 0; /* Will cause OOM due to overcommit */
	kill(-1, SIGKILL);
	return 0; /* Not reached. */
}
---------- Reproducer end ----------

---------- Simple printk() offloading patch start ----------
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8337e2d..529fc36 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -651,6 +651,7 @@ struct task_struct {
 	/* disallow userland-initiated cgroup migration */
 	unsigned no_cgroup_migration:1;
 #endif
+	unsigned offload_printk:1;

 	unsigned long atomic_flags; /* Flags requiring atomic access. */

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index fc47863..06afcd7 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -48,6 +48,7 @@
 #include <linux/sched/clock.h>
 #include <linux/sched/debug.h>
 #include <linux/sched/task_stack.h>
+#include <linux/kthread.h>
 #include <linux/uaccess.h>
 #include <asm/sections.h>

@@ -2120,6 +2121,20 @@ static int have_callable_console(void)
 	return 0;
 }

+static atomic_t printk_reaper_pending = ATOMIC_INIT(0);
+static DECLARE_WAIT_QUEUE_HEAD(printk_reaper_wait);
+
+static int printk_reaper(void *unused)
+{
+	while (true) {
+		wait_event(printk_reaper_wait, atomic_read(&printk_reaper_pending));
+		console_lock();
+		console_unlock();
+		atomic_dec(&printk_reaper_pending);
+	}
+	return 0;
+}
+
 /*
  * Can we actually use the console at this time on this cpu?
  *
@@ -2129,6 +2144,11 @@ static int have_callable_console(void)
  */
 static inline int can_use_console(void)
 {
+	if (current->offload_printk && in_task()) {
+		atomic_inc(&printk_reaper_pending);
+		wake_up(&printk_reaper_wait);
+		return 0;
+	}
 	return cpu_online(raw_smp_processor_id()) || have_callable_console();
 }
@@ -2678,6 +2698,7 @@ static int __init printk_late_init(void)
 	ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "printk:online",
 					console_cpu_notify, NULL);
 	WARN_ON(ret < 0);
+	kthread_run(printk_reaper, NULL, "printk_reaper");
 	return 0;
 }
 late_initcall(printk_late_init);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 80e4adb..66356f2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3754,6 +3754,19 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	return false;
 }

+static void memalloc_print(unsigned long unused);
+static atomic_t memalloc_in_flight = ATOMIC_INIT(0);
+static DEFINE_TIMER(memalloc_timer, memalloc_print, 0, 0);
+static void memalloc_print(unsigned long unused)
+{
+	const int in_flight = atomic_read(&memalloc_in_flight);
+
+	if (in_flight < 10)
+		return;
+	pr_warn("MemAlloc: %d in-flight.\n", in_flight);
+	mod_timer(&memalloc_timer, jiffies + HZ);
+}
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 						struct alloc_context *ac)
@@ -3770,6 +3783,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	unsigned long alloc_start = jiffies;
 	unsigned int stall_timeout = 10 * HZ;
 	unsigned int cpuset_mems_cookie;
+	const bool offload_printk = current->offload_printk;

 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3790,6 +3804,9 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 				(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
 		gfp_mask &= ~__GFP_ATOMIC;

+	if (can_direct_reclaim)
+		atomic_inc(&memalloc_in_flight);
+	current->offload_printk = 1;
 retry_cpuset:
 	compaction_retries = 0;
 	no_progress_loops = 0;
@@ -3898,6 +3915,8 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	if (!can_direct_reclaim)
 		goto nopage;

+	if (!timer_pending(&memalloc_timer))
+		mod_timer(&memalloc_timer, jiffies + HZ);
 	/* Make sure we know about allocations which stall for too long */
 	if (time_after(jiffies, alloc_start + stall_timeout)) {
 		warn_alloc(gfp_mask & ~__GFP_NOWARN, ac->nodemask,
@@ -4020,6 +4039,9 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 		warn_alloc(gfp_mask, ac->nodemask,
 				"page allocation failure: order:%u", order);
 got_pg:
+	current->offload_printk = offload_printk;
+	if (can_direct_reclaim)
+		atomic_dec(&memalloc_in_flight);
 	return page;
 }
---------- Simple printk() offloading patch end ----------

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20170714.txt.xz . It is pointless to complain about 927 threads doing allocations. The primary problem is scheduling priority rather than printk(). I would be able to reproduce this problem with far fewer threads if I were permitted to mix in realtime priority threads; I put my reproducer under the constraint of not using root privileges.

---------- [ 856.066292] idle-priority invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null), order=0, oom_score_adj=0 [ 856.066298] idle-priority cpuset=/ mems_allowed=0 [ 856.066306] CPU: 0 PID: 6183 Comm: idle-priority Not tainted 4.12.0-next-20170714+ #632 [ 856.066307] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015 [ 856.066317] Call Trace: [ 856.066325] dump_stack+0x67/0x9e [ 856.066331] dump_header+0x9d/0x3fa [ 856.066337] ? trace_hardirqs_on+0xd/0x10 [ 856.066343] oom_kill_process+0x226/0x650 [ 856.066350] out_of_memory+0x136/0x560 [ 856.066351] ? out_of_memory+0x206/0x560 [ 856.066357] __alloc_pages_nodemask+0xec1/0xf50 [ 856.066375] alloc_pages_vma+0x76/0x1a0 [ 856.066381] __handle_mm_fault+0xddd/0x1160 [ 856.066384] ? native_sched_clock+0x36/0xa0 [ 856.066395] handle_mm_fault+0x186/0x360 [ 856.066396] ?
handle_mm_fault+0x44/0x360 [ 856.066401] __do_page_fault+0x1da/0x510 [ 856.066409] do_page_fault+0x21/0x70 [ 856.066413] page_fault+0x22/0x30 [ 856.066416] RIP: 0033:0x4008c0 [ 856.066417] RSP: 002b:00007ffc83235ba0 EFLAGS: 00010206 [ 856.066419] RAX: 00000000c7b4e000 RBX: 0000000100000000 RCX: 00007f1b4da57bd0 [ 856.066420] RDX: 0000000000000000 RSI: 0000000000400ae0 RDI: 0000000000000004 [ 856.066420] RBP: 00007f194db63010 R08: 0000000000000000 R09: 0000000000021000 [ 856.066421] R10: 00007ffc83235920 R11: 0000000000000246 R12: 0000000000000006 [ 856.066422] R13: 00007f194db63010 R14: 0000000000000000 R15: 0000000000000000 [ 856.066433] Mem-Info: [ 856.066437] active_anon:843412 inactive_anon:3676 isolated_anon:0 [ 856.066437] active_file:0 inactive_file:209 isolated_file:0 [ 856.066437] unevictable:0 dirty:0 writeback:0 unstable:0 [ 856.066437] slab_reclaimable:0 slab_unreclaimable:0 [ 856.066437] mapped:1877 shmem:4208 pagetables:10553 bounce:0 [ 856.066437] free:21334 free_pcp:0 free_cma:0 [ 856.066439] Node 0 active_anon:3373648kB inactive_anon:14704kB active_file:0kB inactive_file:836kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:7508kB dirty:0kB writeback:0kB shmem:16832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 2762752kB writeback_tmp:0kB unstable:0kB all_unreclaimable? 
no [ 856.066484] Node 0 DMA free:14776kB min:288kB low:360kB high:432kB active_anon:1092kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 856.066487] lowmem_reserve[]: 0 2688 3624 3624 [ 856.066492] Node 0 DMA32 free:53464kB min:49908kB low:62384kB high:74860kB active_anon:2696692kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129216kB managed:2752816kB mlocked:0kB kernel_stack:48kB pagetables:164kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 856.066495] lowmem_reserve[]: 0 0 936 936 [ 856.066499] Node 0 Normal free:17096kB min:17384kB low:21728kB high:26072kB active_anon:675816kB inactive_anon:14704kB active_file:0kB inactive_file:944kB unevictable:0kB writepending:0kB present:1048576kB managed:958868kB mlocked:0kB kernel_stack:17632kB pagetables:42044kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 856.066502] lowmem_reserve[]: 0 0 0 0 [ 856.066506] Node 0 DMA: 2*4kB (UM) 2*8kB (UM) 2*16kB (UM) 2*32kB (UM) 1*64kB (U) 2*128kB (UM) 2*256kB (UM) 1*512kB (M) 1*1024kB (U) 0*2048kB 3*4096kB (ME) = 14776kB [ 856.066525] Node 0 DMA32: 22*4kB (UM) 70*8kB (UM) 73*16kB (UM) 78*32kB (UM) 62*64kB (UM) 41*128kB (UM) 38*256kB (UM) 15*512kB (U) 20*1024kB (UM) 1*2048kB (U) 0*4096kB = 53464kB [ 856.066553] Node 0 Normal: 89*4kB (UMH) 96*8kB (UH) 84*16kB (UMEH) 99*32kB (UMEH) 35*64kB (UMEH) 34*128kB (UMEH) 21*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 17604kB [ 856.066569] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB [ 856.066570] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 856.066571] 4450 total pagecache pages [ 856.066573] 0 pages in swap cache [ 856.066574] Swap cache stats: add 0, delete 0, find 0/0 [ 856.066575] Free swap = 0kB [ 856.066576] Total swap = 0kB [ 
856.066577] 1048445 pages RAM [ 856.066578] 0 pages HighMem/MovableOnly [ 856.066579] 116548 pages reserved [ 856.066580] 0 pages hwpoisoned [ 856.066581] Out of memory: Kill process 6183 (idle-priority) score 879 or sacrifice child [ 856.068245] Killed process 6301 (normal-priority) total-vm:4360kB, anon-rss:92kB, file-rss:0kB, shmem-rss:0kB [ 856.928087] MemAlloc: 916 in-flight. [ 857.952174] MemAlloc: 921 in-flight. [ 858.976265] MemAlloc: 923 in-flight. [ 859.417641] warn_alloc: 364 callbacks suppressed (...snipped...) [ 1155.865901] normal-priority: page allocation stalls for 300124ms, order:0, mode:0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null) [ 1155.865906] normal-priority cpuset=/ mems_allowed=0 [ 1155.865910] CPU: 0 PID: 7193 Comm: normal-priority Not tainted 4.12.0-next-20170714+ #632 [ 1155.865911] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015 [ 1155.865911] Call Trace: [ 1155.865915] dump_stack+0x67/0x9e [ 1155.865918] warn_alloc+0x10f/0x1b0 [ 1155.865924] ? wake_all_kswapds+0x56/0x96 [ 1155.865929] __alloc_pages_nodemask+0xb2c/0xf50 [ 1155.865945] alloc_pages_vma+0x76/0x1a0 [ 1155.865950] __handle_mm_fault+0xddd/0x1160 [ 1155.865952] ? native_sched_clock+0x36/0xa0 [ 1155.865962] handle_mm_fault+0x186/0x360 [ 1155.865964] ? handle_mm_fault+0x44/0x360 [ 1155.865968] __do_page_fault+0x1da/0x510 [ 1155.865975] do_page_fault+0x21/0x70 [ 1155.865979] page_fault+0x22/0x30 (...snipped...) [ 1175.931414] normal-priority: page allocation stalls for 320133ms, order:0, mode:0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null) [ 1175.931420] normal-priority cpuset=/ mems_allowed=0 [ 1175.931425] CPU: 0 PID: 6697 Comm: normal-priority Not tainted 4.12.0-next-20170714+ #632 [ 1175.931426] Hardware name: VMware, Inc. 
VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015 [ 1175.931427] Call Trace: [ 1175.931432] dump_stack+0x67/0x9e [ 1175.931436] warn_alloc+0x10f/0x1b0 [ 1175.931442] ? wake_all_kswapds+0x56/0x96 [ 1175.931447] __alloc_pages_nodemask+0xb2c/0xf50 [ 1175.931465] alloc_pages_current+0x65/0xb0 [ 1175.931469] __page_cache_alloc+0x10b/0x140 [ 1175.931474] filemap_fault+0x3df/0x6a0 [ 1175.931475] ? filemap_fault+0x2ab/0x6a0 [ 1175.931482] xfs_filemap_fault+0x34/0x50 [ 1175.931485] __do_fault+0x19/0x120 [ 1175.931489] __handle_mm_fault+0xa5c/0x1160 [ 1175.931492] ? native_sched_clock+0x36/0xa0 [ 1175.931502] handle_mm_fault+0x186/0x360 [ 1175.931504] ? handle_mm_fault+0x44/0x360 [ 1175.931508] __do_page_fault+0x1da/0x510 [ 1175.931516] do_page_fault+0x21/0x70 [ 1175.931520] page_fault+0x22/0x30 (...snipped...) [ 1494.107705] idle-priority R running task 13272 6183 2806 0x00000080 [ 1494.109560] Call Trace: [ 1494.110510] __schedule+0x256/0x8e0 [ 1494.111659] ? _raw_spin_unlock_irqrestore+0x31/0x50 [ 1494.113078] schedule+0x38/0x90 [ 1494.114204] schedule_timeout+0x19a/0x330 [ 1494.115424] ? call_timer_fn+0x120/0x120 [ 1494.116639] ? oom_kill_process+0x50c/0x650 [ 1494.117909] schedule_timeout_killable+0x25/0x30 [ 1494.119259] out_of_memory+0x140/0x560 [ 1494.120485] ? out_of_memory+0x206/0x560 [ 1494.121693] __alloc_pages_nodemask+0xec1/0xf50 [ 1494.123058] alloc_pages_vma+0x76/0x1a0 [ 1494.124306] __handle_mm_fault+0xddd/0x1160 [ 1494.125572] ? native_sched_clock+0x36/0xa0 [ 1494.126848] handle_mm_fault+0x186/0x360 [ 1494.128103] ? handle_mm_fault+0x44/0x360 [ 1494.129284] __do_page_fault+0x1da/0x510 [ 1494.130575] do_page_fault+0x21/0x70 [ 1494.131738] page_fault+0x22/0x30 (...snipped...) [ 1571.807967] MemAlloc: 927 in-flight. [ 1572.832053] MemAlloc: 927 in-flight. [ 1573.856064] MemAlloc: 927 in-flight. [ 1574.880238] MemAlloc: 927 in-flight. [ 1575.904575] MemAlloc: 927 in-flight. 
[ 1576.060103] warn_alloc: 462 callbacks suppressed [ 1576.060139] normal-priority: page allocation stalls for 720318ms, order:0, mode:0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null) [ 1576.060149] normal-priority cpuset=/ mems_allowed=0 [ 1576.060158] CPU: 0 PID: 7193 Comm: normal-priority Not tainted 4.12.0-next-20170714+ #632 [ 1576.060160] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015 [ 1576.060162] Call Trace: [ 1576.060170] dump_stack+0x67/0x9e [ 1576.060177] warn_alloc+0x10f/0x1b0 [ 1576.060184] ? wake_all_kswapds+0x56/0x96 [ 1576.060189] __alloc_pages_nodemask+0xb2c/0xf50 [ 1576.060207] alloc_pages_vma+0x76/0x1a0 [ 1576.060214] __handle_mm_fault+0xddd/0x1160 [ 1576.060218] ? native_sched_clock+0x36/0xa0 [ 1576.060229] handle_mm_fault+0x186/0x360 [ 1576.060231] ? handle_mm_fault+0x44/0x360 [ 1576.060236] __do_page_fault+0x1da/0x510 [ 1576.060244] do_page_fault+0x21/0x70 [ 1576.060249] page_fault+0x22/0x30 [ 1576.060252] RIP: 0033:0x400923 [ 1576.060254] RSP: 002b:00007ffc83235ba0 EFLAGS: 00010287 [ 1576.060257] RAX: 0000000000003000 RBX: 0000000000000000 RCX: 00007f1b4da57c30 [ 1576.060259] RDX: 0000000000000001 RSI: 00007ffc83235bb0 RDI: 0000000000000003 [ 1576.060260] RBP: 0000000000000000 R08: 00007f1b4df40740 R09: 0000000000000000 [ 1576.060262] R10: 00007ffc83235920 R11: 0000000000000246 R12: 0000000000400954 [ 1576.060263] R13: 000000000212d010 R14: 0000000000000000 R15: 0000000000000000 [ 1576.060275] warn_alloc_show_mem: 9 callbacks suppressed [ 1576.060277] Mem-Info: [ 1576.060281] active_anon:843659 inactive_anon:3676 isolated_anon:0 [ 1576.060281] active_file:0 inactive_file:0 isolated_file:0 [ 1576.060281] unevictable:0 dirty:0 writeback:0 unstable:0 [ 1576.060281] slab_reclaimable:0 slab_unreclaimable:0 [ 1576.060281] mapped:1853 shmem:4208 pagetables:10537 bounce:0 [ 1576.060281] free:21425 free_pcp:60 free_cma:0 [ 1576.060284] Node 0 active_anon:3374636kB 
inactive_anon:14704kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:7412kB dirty:0kB writeback:0kB shmem:16832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 2762752kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes [ 1576.060286] Node 0 DMA free:14776kB min:288kB low:360kB high:432kB active_anon:1092kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 1576.060290] lowmem_reserve[]: 0 2688 3624 3624 [ 1576.060299] Node 0 DMA32 free:53544kB min:49908kB low:62384kB high:74860kB active_anon:2696692kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129216kB managed:2752816kB mlocked:0kB kernel_stack:32kB pagetables:164kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 1576.060303] lowmem_reserve[]: 0 0 936 936 [ 1576.060311] Node 0 Normal free:17380kB min:17384kB low:21728kB high:26072kB active_anon:676852kB inactive_anon:14704kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:1048576kB managed:958868kB mlocked:0kB kernel_stack:17584kB pagetables:41980kB bounce:0kB free_pcp:240kB local_pcp:0kB free_cma:0kB [ 1576.060315] lowmem_reserve[]: 0 0 0 0 [ 1576.060323] Node 0 DMA: 2*4kB (UM) 2*8kB (UM) 2*16kB (UM) 2*32kB (UM) 1*64kB (U) 2*128kB (UM) 2*256kB (UM) 1*512kB (M) 1*1024kB (U) 0*2048kB 3*4096kB (ME) = 14776kB [ 1576.060357] Node 0 DMA32: 22*4kB (UM) 70*8kB (UM) 72*16kB (UM) 77*32kB (UM) 62*64kB (UM) 40*128kB (UM) 39*256kB (UM) 15*512kB (U) 20*1024kB (UM) 1*2048kB (U) 0*4096kB = 53544kB [ 1576.060391] Node 0 Normal: 85*4kB (UMH) 128*8kB (UMH) 97*16kB (UMEH) 92*32kB (UMEH) 32*64kB (UEH) 32*128kB (UMEH) 21*256kB (UM) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 17380kB [ 1576.060421] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB [ 1576.060423] Node 
0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 1576.060425] 4210 total pagecache pages [ 1576.060428] 0 pages in swap cache [ 1576.060430] Swap cache stats: add 0, delete 0, find 0/0 [ 1576.060432] Free swap = 0kB [ 1576.060433] Total swap = 0kB [ 1576.060435] 1048445 pages RAM [ 1576.060437] 0 pages HighMem/MovableOnly [ 1576.060438] 116548 pages reserved [ 1576.060440] 0 pages hwpoisoned (...snipped...) [ 1576.168438] normal-priority: page allocation stalls for 720370ms, order:0, mode:0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null) [ 1576.168443] normal-priority cpuset=/ mems_allowed=0 [ 1576.168473] CPU: 0 PID: 6697 Comm: normal-priority Not tainted 4.12.0-next-20170714+ #632 [ 1576.168474] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015 [ 1576.168475] Call Trace: [ 1576.168479] dump_stack+0x67/0x9e [ 1576.168483] warn_alloc+0x10f/0x1b0 [ 1576.168489] ? wake_all_kswapds+0x56/0x96 [ 1576.168493] __alloc_pages_nodemask+0xb2c/0xf50 [ 1576.168511] alloc_pages_current+0x65/0xb0 [ 1576.168515] __page_cache_alloc+0x10b/0x140 [ 1576.168519] filemap_fault+0x3df/0x6a0 [ 1576.168521] ? filemap_fault+0x2ab/0x6a0 [ 1576.168528] xfs_filemap_fault+0x34/0x50 [ 1576.168531] __do_fault+0x19/0x120 [ 1576.168535] __handle_mm_fault+0xa5c/0x1160 [ 1576.168538] ? native_sched_clock+0x36/0xa0 [ 1576.168549] handle_mm_fault+0x186/0x360 [ 1576.168550] ? 
handle_mm_fault+0x44/0x360 [ 1576.168554] __do_page_fault+0x1da/0x510 [ 1576.168562] do_page_fault+0x21/0x70 [ 1576.168566] page_fault+0x22/0x30 [ 1576.168568] RIP: 0033:0x7f1b4dd39ced [ 1576.168569] RSP: 002b:00007ffc832358e0 EFLAGS: 00010202 [ 1576.168571] RAX: 0000000000000204 RBX: 00007f1b4d972d18 RCX: 00000000000003f3 [ 1576.168572] RDX: 00007f1b4d971280 RSI: 0000000000000001 RDI: 00007f1b4df4f658 [ 1576.168573] RBP: 0000000000000001 R08: 000000000000003e R09: 00000000003bbe1a [ 1576.168574] R10: 00007f1b4df524a8 R11: 00007ffc83235a20 R12: 0000000000000003 [ 1576.168575] R13: 000000000eef86be R14: 00007f1b4df4fb50 R15: 0000000000000000 [ 1576.928411] MemAlloc: 927 in-flight. [ 1577.952567] MemAlloc: 927 in-flight. [ 1578.976001] MemAlloc: 927 in-flight. [ 1580.000154] MemAlloc: 927 in-flight. [ 1580.003790] sysrq: SysRq : Kill All Tasks ----------
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable. 2017-07-14 12:30 ` Tetsuo Handa @ 2017-07-14 12:48 ` Michal Hocko 2017-08-09 6:14 ` Tetsuo Handa 0 siblings, 1 reply; 43+ messages in thread From: Michal Hocko @ 2017-07-14 12:48 UTC (permalink / raw) To: Tetsuo Handa Cc: akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman, vbabka, sergey.senozhatsky.work, pmladek On Fri 14-07-17 21:30:54, Tetsuo Handa wrote: > Michal Hocko wrote: [...] > > As I've said earlier, if there is no other way to make printk work without all > > these nasty side effected then I would be OK to add a printk context > > specific calls into the oom killer. > > > > Removing the rest because this is again getting largely tangent. The > > primary problem you are seeing is that we stumble over printk here. > > Unless I can see a sound argument this is not the case it doesn't make > > any sense to discuss allocator changes. > > You are still ignoring my point. I agree that we stumble over printk(), but > printk() is nothing but one of locations we stumble. I am not ignoring it. You just mix too many things together to have a meaningful conversation... > Look at schedule_timeout_killable(1) in out_of_memory() which is called with > oom_lock still held. I'm reporting that even printk() is offloaded to printk > kernel thread, scheduling priority can make schedule_timeout_killable(1) sleep > for more than 12 minutes (which is intended to sleep for only one millisecond). > (I gave up waiting and pressed SysRq-i. I can't imagine how long it would have > continued sleeping inside schedule_timeout_killable(1) with oom_lock held.) > > Without cooperation from other allocating threads which failed to hold oom_lock, > it is dangerous to keep out_of_memory() preemptible/schedulable context. I have already tried to explain that this is something that the whole reclaim path suffers from the priority inversions problem because it has never been designed to handle that. 
You are just poking to one particular path of the reclaim stack and missing the whole forest for a tree. How the heck is this any different from a reclaim path stumbling over a lock down inside the filesystem and stalling basically everybody from making a reasonable progress? Is this a problem? Of course it is, theoretically. In practice not all that much to go and reimplement the whole stack. At least I haven't seen any real life reports complaining about this. -- Michal Hocko SUSE Labs
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-07-14 12:48 ` Michal Hocko
@ 2017-08-09  6:14   ` Tetsuo Handa
  2017-08-09 13:01     ` Tetsuo Handa
  0 siblings, 1 reply; 43+ messages in thread
From: Tetsuo Handa @ 2017-08-09 6:14 UTC (permalink / raw)
To: mhocko
Cc: akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman,
    vbabka, sergey.senozhatsky.work, pmladek

Michal Hocko wrote:
> On Fri 14-07-17 21:30:54, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> [...]
> > > As I've said earlier, if there is no other way to make printk work without all
> > > these nasty side effected then I would be OK to add a printk context
> > > specific calls into the oom killer.
> > >
> > > Removing the rest because this is again getting largely tangent. The
> > > primary problem you are seeing is that we stumble over printk here.
> > > Unless I can see a sound argument this is not the case it doesn't make
> > > any sense to discuss allocator changes.
> >
> > You are still ignoring my point. I agree that we stumble over printk(), but
> > printk() is nothing but one of locations we stumble.
>
> I am not ignoring it. You just mix too many things together to have a
> meaningful conversation...
>
> > Look at schedule_timeout_killable(1) in out_of_memory() which is called with
> > oom_lock still held. I'm reporting that even printk() is offloaded to printk
> > kernel thread, scheduling priority can make schedule_timeout_killable(1) sleep
> > for more than 12 minutes (which is intended to sleep for only one millisecond).
> > (I gave up waiting and pressed SysRq-i. I can't imagine how long it would have
> > continued sleeping inside schedule_timeout_killable(1) with oom_lock held.)
> >
> > Without cooperation from other allocating threads which failed to hold oom_lock,
> > it is dangerous to keep out_of_memory() preemptible/schedulable context.
>
> I have already tried to explain that this is something that the whole
> reclaim path suffers from the priority inversions problem because it has
> never been designed to handle that. You are just poking to one
> particular path of the reclaim stack and missing the whole forest for a
> tree. How the hack is this any different from a reclaim path stumbling
> over a lock down inside the filesystem and stalling basically everybody
> from making a reasonable progress? Is this a problem? Of course it is,
> theoretically. In practice not all that much to go and reimplement the
> whole stack. At least I haven't seen any real life reports complaining
> about this.

I'm failing to test your "mm, oom: fix oom_reaper fallouts" patches using
http://lkml.kernel.org/r/201708072228.FAJ09347.tOOVOFFQJSHMFL@I-love.SAKURA.ne.jp
because the test fails to invoke the OOM killer for an unknown reason. I
analyzed it using kmallocwd and confirmed that two dozen concurrent allocating
threads are sufficient for hitting this warn_alloc() vs. printk() lockup.

Since printk offloading is not yet available, serialization is the only way we
can mitigate this problem for now. How much longer do we have to wait?

----------
[  645.993827] MemAlloc-Info: stalling=18 dying=0 exiting=0 victim=0 oom_count=29
(...snipped...)
[  645.996694] MemAlloc: vmtoolsd(2221) flags=0x400100 switches=5607 seq=3740 gfp=0x14200ca(GFP_HIGHUSER_MOVABLE) order=0 delay=7541
[  645.996695] vmtoolsd        R  running task    11960  2221      1 0x00000080
[  645.996699] Call Trace:
[  645.996708]  ? console_unlock+0x373/0x4a0
[  645.996709]  ? vprintk_emit+0x211/0x2f0
[  645.996714]  ? vprintk_emit+0x21a/0x2f0
[  645.996720]  ? vprintk_default+0x1a/0x20
[  645.996722]  ? vprintk_func+0x22/0x60
[  645.996724]  ? printk+0x53/0x6a
[  645.996731]  ? dump_stack_print_info+0xab/0xb0
[  645.996736]  ? dump_stack+0x5e/0x9e
[  645.996739]  ? dump_header+0x9d/0x3fa
[  645.996744]  ? trace_hardirqs_on+0xd/0x10
[  645.996751]  ? oom_kill_process+0x226/0x650
[  645.996757]  ? out_of_memory+0x13d/0x570
[  645.996758]  ? out_of_memory+0x20d/0x570
[  645.996763]  ? __alloc_pages_nodemask+0xbc8/0xed0
[  645.996780]  ? alloc_pages_current+0x65/0xb0
[  645.996784]  ? __page_cache_alloc+0x10b/0x140
[  645.996789]  ? filemap_fault+0x3df/0x6a0
[  645.996790]  ? filemap_fault+0x2ab/0x6a0
[  645.996797]  ? xfs_filemap_fault+0x34/0x50
[  645.996799]  ? __do_fault+0x19/0x120
[  645.996803]  ? __handle_mm_fault+0xa99/0x1260
[  645.996814]  ? handle_mm_fault+0x1b2/0x350
[  645.996816]  ? handle_mm_fault+0x46/0x350
[  645.996820]  ? __do_page_fault+0x1da/0x510
[  645.996828]  ? do_page_fault+0x21/0x70
[  645.996832]  ? page_fault+0x22/0x30
(...snipped...)
[  645.998748] MemAlloc-Info: stalling=18 dying=0 exiting=0 victim=0 oom_count=29
(...snipped...)
[ 1472.484590] MemAlloc-Info: stalling=25 dying=0 exiting=0 victim=0 oom_count=29
(...snipped...)
[ 1472.487341] MemAlloc: vmtoolsd(2221) flags=0x400100 switches=5607 seq=3740 gfp=0x14200ca(GFP_HIGHUSER_MOVABLE) order=0 delay=834032
[ 1472.487342] vmtoolsd        R  running task    11960  2221      1 0x00000080
[ 1472.487346] Call Trace:
[ 1472.487353]  ? console_unlock+0x373/0x4a0
[ 1472.487355]  ? vprintk_emit+0x211/0x2f0
[ 1472.487360]  ? vprintk_emit+0x21a/0x2f0
[ 1472.487367]  ? vprintk_default+0x1a/0x20
[ 1472.487369]  ? vprintk_func+0x22/0x60
[ 1472.487370]  ? printk+0x53/0x6a
[ 1472.487377]  ? dump_stack_print_info+0xab/0xb0
[ 1472.487381]  ? dump_stack+0x5e/0x9e
[ 1472.487384]  ? dump_header+0x9d/0x3fa
[ 1472.487389]  ? trace_hardirqs_on+0xd/0x10
[ 1472.487396]  ? oom_kill_process+0x226/0x650
[ 1472.487402]  ? out_of_memory+0x13d/0x570
[ 1472.487403]  ? out_of_memory+0x20d/0x570
[ 1472.487408]  ? __alloc_pages_nodemask+0xbc8/0xed0
[ 1472.487426]  ? alloc_pages_current+0x65/0xb0
[ 1472.487429]  ? __page_cache_alloc+0x10b/0x140
[ 1472.487434]  ? filemap_fault+0x3df/0x6a0
[ 1472.487435]  ? filemap_fault+0x2ab/0x6a0
[ 1472.487441]  ? xfs_filemap_fault+0x34/0x50
[ 1472.487444]  ? __do_fault+0x19/0x120
[ 1472.487447]  ? __handle_mm_fault+0xa99/0x1260
[ 1472.487459]  ? handle_mm_fault+0x1b2/0x350
[ 1472.487460]  ? handle_mm_fault+0x46/0x350
[ 1472.487465]  ? __do_page_fault+0x1da/0x510
[ 1472.487472]  ? do_page_fault+0x21/0x70
[ 1472.487476]  ? page_fault+0x22/0x30
(...snipped...)
[ 1472.489975] MemAlloc-Info: stalling=25 dying=0 exiting=0 victim=0 oom_count=29
----------
* Re: [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable.
  2017-08-09  6:14 ` Tetsuo Handa
@ 2017-08-09 13:01   ` Tetsuo Handa
  0 siblings, 0 replies; 43+ messages in thread
From: Tetsuo Handa @ 2017-08-09 13:01 UTC (permalink / raw)
To: mhocko
Cc: akpm, linux-mm, xiyou.wangcong, dave.hansen, hannes, mgorman,
    vbabka, sergey.senozhatsky.work, pmladek, penguin-kernel

Tetsuo Handa wrote:
> I'm failing to test your "mm, oom: fix oom_reaper fallouts" patches using
> http://lkml.kernel.org/r/201708072228.FAJ09347.tOOVOFFQJSHMFL@I-love.SAKURA.ne.jp
> because it fails to invoke the OOM killer for unknown reason. I analyzed it using
> kmallocwd and confirmed that two dozens of concurrent allocating threads is
> sufficient for hitting this warn_alloc() v.s. printk() lockup.
> Since printk offloading is not yet available, serialization is the the only choice
> we can mitigate this problem for now. How long will we have to waste more?

The explanation above is not accurate. As part of testing your "mm, oom: fix
oom_reaper fallouts" patches, I'm trying various stress patterns, including
changing the number of threads. For some unknown reason (though it is not
caused by your patches), it sometimes takes a very long time (on the order of
minutes) to invoke the OOM killer, regardless of whether warn_alloc() is
printed periodically or not printed at all.

To find out which stage is taking so long, I'm using kmallocwd with a 1 second
timeout, because SysRq-t etc. are useless for tracking how long threads have
been waiting at a specific location: they carry no in-flight allocation
information, yet take many seconds while producing too much noise (including
idle threads that are simply sleeping). And because of kmallocwd's 1 second
timeout setting (which I want in order to find out at which stage allocating
threads are waiting for so long), printk() from out_of_memory() with oom_lock
held was trapped by kmallocwd's periodic printk(), and that caused the lockup.
So this is not a direct warn_alloc() vs.
printk() lockup, but what kmallocwd tries to warn about is similar to
warn_alloc(). I can't automate testing of the "mm, oom: fix oom_reaper
fallouts" patches because I sometimes need to use SysRq-f etc. to get the
test unstuck.

>
> ----------
> [  645.993827] MemAlloc-Info: stalling=18 dying=0 exiting=0 victim=0 oom_count=29
> (...snipped...)
> [  645.996694] MemAlloc: vmtoolsd(2221) flags=0x400100 switches=5607 seq=3740 gfp=0x14200ca(GFP_HIGHUSER_MOVABLE) order=0 delay=7541
> [  645.996695] vmtoolsd        R  running task    11960  2221      1 0x00000080
> [  645.996699] Call Trace:
> [  645.996708]  ? console_unlock+0x373/0x4a0
> [  645.996709]  ? vprintk_emit+0x211/0x2f0
> [  645.996714]  ? vprintk_emit+0x21a/0x2f0
> [  645.996720]  ? vprintk_default+0x1a/0x20
> [  645.996722]  ? vprintk_func+0x22/0x60
> [  645.996724]  ? printk+0x53/0x6a
> [  645.996731]  ? dump_stack_print_info+0xab/0xb0
> [  645.996736]  ? dump_stack+0x5e/0x9e
> [  645.996739]  ? dump_header+0x9d/0x3fa
> [  645.996744]  ? trace_hardirqs_on+0xd/0x10
> [  645.996751]  ? oom_kill_process+0x226/0x650
> [  645.996757]  ? out_of_memory+0x13d/0x570
> [  645.996758]  ? out_of_memory+0x20d/0x570
> [  645.996763]  ? __alloc_pages_nodemask+0xbc8/0xed0
> [  645.996780]  ? alloc_pages_current+0x65/0xb0
> [  645.996784]  ? __page_cache_alloc+0x10b/0x140
> [  645.996789]  ? filemap_fault+0x3df/0x6a0
> [  645.996790]  ? filemap_fault+0x2ab/0x6a0
> [  645.996797]  ? xfs_filemap_fault+0x34/0x50
> [  645.996799]  ? __do_fault+0x19/0x120
> [  645.996803]  ? __handle_mm_fault+0xa99/0x1260
> [  645.996814]  ? handle_mm_fault+0x1b2/0x350
> [  645.996816]  ? handle_mm_fault+0x46/0x350
> [  645.996820]  ? __do_page_fault+0x1da/0x510
> [  645.996828]  ? do_page_fault+0x21/0x70
> [  645.996832]  ? page_fault+0x22/0x30
> (...snipped...)
> [  645.998748] MemAlloc-Info: stalling=18 dying=0 exiting=0 victim=0 oom_count=29
> (...snipped...)
> [ 1472.484590] MemAlloc-Info: stalling=25 dying=0 exiting=0 victim=0 oom_count=29
> (...snipped...)
> [ 1472.487341] MemAlloc: vmtoolsd(2221) flags=0x400100 switches=5607 seq=3740 gfp=0x14200ca(GFP_HIGHUSER_MOVABLE) order=0 delay=834032
> [ 1472.487342] vmtoolsd        R  running task    11960  2221      1 0x00000080
> [ 1472.487346] Call Trace:
> [ 1472.487353]  ? console_unlock+0x373/0x4a0
> [ 1472.487355]  ? vprintk_emit+0x211/0x2f0
> [ 1472.487360]  ? vprintk_emit+0x21a/0x2f0
> [ 1472.487367]  ? vprintk_default+0x1a/0x20
> [ 1472.487369]  ? vprintk_func+0x22/0x60
> [ 1472.487370]  ? printk+0x53/0x6a
> [ 1472.487377]  ? dump_stack_print_info+0xab/0xb0
> [ 1472.487381]  ? dump_stack+0x5e/0x9e
> [ 1472.487384]  ? dump_header+0x9d/0x3fa
> [ 1472.487389]  ? trace_hardirqs_on+0xd/0x10
> [ 1472.487396]  ? oom_kill_process+0x226/0x650
> [ 1472.487402]  ? out_of_memory+0x13d/0x570
> [ 1472.487403]  ? out_of_memory+0x20d/0x570
> [ 1472.487408]  ? __alloc_pages_nodemask+0xbc8/0xed0
> [ 1472.487426]  ? alloc_pages_current+0x65/0xb0
> [ 1472.487429]  ? __page_cache_alloc+0x10b/0x140
> [ 1472.487434]  ? filemap_fault+0x3df/0x6a0
> [ 1472.487435]  ? filemap_fault+0x2ab/0x6a0
> [ 1472.487441]  ? xfs_filemap_fault+0x34/0x50
> [ 1472.487444]  ? __do_fault+0x19/0x120
> [ 1472.487447]  ? __handle_mm_fault+0xa99/0x1260
> [ 1472.487459]  ? handle_mm_fault+0x1b2/0x350
> [ 1472.487460]  ? handle_mm_fault+0x46/0x350
> [ 1472.487465]  ? __do_page_fault+0x1da/0x510
> [ 1472.487472]  ? do_page_fault+0x21/0x70
> [ 1472.487476]  ? page_fault+0x22/0x30
> (...snipped...)
> [ 1472.489975] MemAlloc-Info: stalling=25 dying=0 exiting=0 victim=0 oom_count=29
> ----------
end of thread, other threads:[~2017-08-09 13:02 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-01 11:43 [PATCH] mm,page_alloc: Serialize warn_alloc() if schedulable Tetsuo Handa
2017-06-01 11:59 ` Michal Hocko
2017-06-01 13:11 ` Tetsuo Handa
2017-06-01 13:28 ` Michal Hocko
2017-06-01 22:10 ` Andrew Morton
2017-06-02  7:18 ` Michal Hocko
2017-06-02 11:13 ` Tetsuo Handa
2017-06-02 12:15 ` Michal Hocko
2017-06-02 17:13 ` Tetsuo Handa
2017-06-02 21:57 ` Cong Wang
2017-06-04  8:58 ` Tetsuo Handa
2017-06-04 15:05 ` Michal Hocko
2017-06-04 21:43 ` Tetsuo Handa
2017-06-05  5:37 ` Michal Hocko
2017-06-05 18:15 ` Cong Wang
2017-06-06  9:17 ` Michal Hocko
2017-06-05 18:25 ` Cong Wang
2017-06-22 10:35 ` Tetsuo Handa
2017-06-22 22:53 ` Cong Wang
2017-06-02 16:59 ` Cong Wang
2017-06-02 19:59 ` Andrew Morton
2017-06-03  2:57 ` Tetsuo Handa
2017-06-03  7:32 ` Michal Hocko
2017-06-03  8:36 ` Tetsuo Handa
2017-06-05  7:10 ` Sergey Senozhatsky
2017-06-05  9:36 ` Sergey Senozhatsky
2017-06-05 15:02 ` Tetsuo Handa
2017-06-03 13:21 ` Tetsuo Handa
2017-07-08  4:59 ` Tetsuo Handa
2017-07-10 13:21 ` Michal Hocko
2017-07-10 13:54 ` Tetsuo Handa
2017-07-10 14:14 ` Michal Hocko
2017-07-11 13:10 ` Tetsuo Handa
2017-07-11 13:49 ` Michal Hocko
2017-07-11 14:58 ` Petr Mladek
2017-07-11 22:06 ` Tetsuo Handa
2017-07-12  8:54 ` Michal Hocko
2017-07-12 12:23 ` Tetsuo Handa
2017-07-12 12:41 ` Michal Hocko
2017-07-14 12:30 ` Tetsuo Handa
2017-07-14 12:48 ` Michal Hocko
2017-08-09  6:14 ` Tetsuo Handa
2017-08-09 13:01 ` Tetsuo Handa