* Softlockup during memory allocation
From: Nikolay Borisov
To: Michal Hocko
Cc: Johannes Weiner, Linux MM, Paul E. McKenney
Date: 2016-11-01 8:12 UTC

Hello,

I got the following rcu_sched/soft lockup on a server:

[7056389.638502] INFO: rcu_sched self-detected stall on CPU
[7056389.638509] 21-...: (20994 ticks this GP) idle=9ef/140000000000001/0 softirq=3256767558/3256767596 fqs=6843
[7056389.638510] (t=21001 jiffies g=656551647 c=656551646 q=469247)
[7056389.638513] Task dump for CPU 21:
[7056389.638515] hive_exec R running task 0 4413 31126 0x0000000a
[7056389.638518] ffffffff81c40280 ffff883fff323dc0 ffffffff8107f53d 0000000000000015
[7056389.638520] ffffffff81c40280 ffff883fff323dd8 ffffffff81081ce9 0000000000000016
[7056389.638522] ffff883fff323e08 ffffffff810aa34b ffff883fff335d00 ffffffff81c40280
[7056389.638524] Call Trace:
[7056389.638525] <IRQ> [<ffffffff8107f53d>] sched_show_task+0xbd/0x120
[7056389.638535] [<ffffffff81081ce9>] dump_cpu_task+0x39/0x40
[7056389.638539] [<ffffffff810aa34b>] rcu_dump_cpu_stacks+0x8b/0xc0
[7056389.638541] [<ffffffff810ade76>] rcu_check_callbacks+0x4d6/0x7b0
[7056389.638547] [<ffffffff810c0a50>] ? tick_init_highres+0x20/0x20
[7056389.638549] [<ffffffff810b28b9>] update_process_times+0x39/0x60
[7056389.638551] [<ffffffff810c0a9d>] tick_sched_timer+0x4d/0x180
[7056389.638553] [<ffffffff810c0a50>] ? tick_init_highres+0x20/0x20
[7056389.638554] [<ffffffff810b3197>] __hrtimer_run_queues+0xe7/0x260
[7056389.638556] [<ffffffff810b3718>] hrtimer_interrupt+0xa8/0x1a0
[7056389.638561] [<ffffffff81034608>] local_apic_timer_interrupt+0x38/0x60
[7056389.638565] [<ffffffff81616a7d>] smp_apic_timer_interrupt+0x3d/0x50
[7056389.638568] [<ffffffff816151e9>] apic_timer_interrupt+0x89/0x90
[7056389.638569] <EOI> [<ffffffff8113fb4a>] ? shrink_zone+0x28a/0x2a0
[7056389.638575] [<ffffffff8113fcd4>] do_try_to_free_pages+0x174/0x460
[7056389.638579] [<ffffffff81308415>] ? find_next_bit+0x15/0x20
[7056389.638581] [<ffffffff811401e7>] try_to_free_mem_cgroup_pages+0xa7/0x170
[7056389.638585] [<ffffffff8118d0ef>] try_charge+0x18f/0x650
[7056389.638588] [<ffffffff81087486>] ? update_curr+0x66/0x180
[7056389.638591] [<ffffffff811915ed>] mem_cgroup_try_charge+0x7d/0x1c0
[7056389.638595] [<ffffffff81128cf2>] __add_to_page_cache_locked+0x42/0x230
[7056389.638596] [<ffffffff81128f28>] add_to_page_cache_lru+0x28/0x80
[7056389.638600] [<ffffffff812719b2>] ext4_mpage_readpages+0x172/0x820
[7056389.638603] [<ffffffff81172d02>] ? alloc_pages_current+0x92/0x120
[7056389.638608] [<ffffffff81228ce6>] ext4_readpages+0x36/0x40
[7056389.638611] [<ffffffff811374e0>] __do_page_cache_readahead+0x180/0x210
[7056389.638613] [<ffffffff8112b1f0>] filemap_fault+0x370/0x400
[7056389.638615] [<ffffffff81231716>] ext4_filemap_fault+0x36/0x50
[7056389.638618] [<ffffffff8115629f>] __do_fault+0x3f/0xd0
[7056389.638620] [<ffffffff8115a205>] handle_mm_fault+0x1245/0x19c0
[7056389.638622] [<ffffffff810b2f1a>] ? hrtimer_try_to_cancel+0x1a/0x110
[7056389.638626] [<ffffffff810775a2>] ? __might_sleep+0x52/0xb0
[7056389.638628] [<ffffffff810b3bdd>] ? hrtimer_nanosleep+0xbd/0x1a0
[7056389.638631] [<ffffffff810430bb>] __do_page_fault+0x1ab/0x410
[7056389.638632] [<ffffffff8104332c>] do_page_fault+0xc/0x10
[7056389.638634] [<ffffffff81616172>] page_fault+0x22/0x30

Here is the stack of the same process when taken with 'bt' in crash:

 #0 [ffff8820d5fb3598] __schedule at ffffffff8160fa9a
 #1 [ffff8820d5fb35e0] preempt_schedule_common at ffffffff8161043f
 #2 [ffff8820d5fb35f8] _cond_resched at ffffffff8161047c
 #3 [ffff8820d5fb3608] shrink_page_list at ffffffff8113dd94
 #4 [ffff8820d5fb36e8] shrink_inactive_list at ffffffff8113eab1
 #5 [ffff8820d5fb37a8] shrink_lruvec at ffffffff8113f710
 #6 [ffff8820d5fb38a8] shrink_zone at ffffffff8113f99c
 #7 [ffff8820d5fb3920] do_try_to_free_pages at ffffffff8113fcd4
 #8 [ffff8820d5fb39a0] try_to_free_mem_cgroup_pages at ffffffff811401e7
 #9 [ffff8820d5fb3a10] try_charge at ffffffff8118d0ef
#10 [ffff8820d5fb3ab0] mem_cgroup_try_charge at ffffffff811915ed
#11 [ffff8820d5fb3af0] __add_to_page_cache_locked at ffffffff81128cf2
#12 [ffff8820d5fb3b48] add_to_page_cache_lru at ffffffff81128f28
#13 [ffff8820d5fb3b70] ext4_mpage_readpages at ffffffff812719b2
#14 [ffff8820d5fb3c78] ext4_readpages at ffffffff81228ce6
#15 [ffff8820d5fb3c88] __do_page_cache_readahead at ffffffff811374e0
#16 [ffff8820d5fb3d30] filemap_fault at ffffffff8112b1f0
#17 [ffff8820d5fb3d88] ext4_filemap_fault at ffffffff81231716
#18 [ffff8820d5fb3db0] __do_fault at ffffffff8115629f
#19 [ffff8820d5fb3e10] handle_mm_fault at ffffffff8115a205
#20 [ffff8820d5fb3ee8] __do_page_fault at ffffffff810430bb
#21 [ffff8820d5fb3f40] do_page_fault at ffffffff8104332c
#22 [ffff8820d5fb3f50] page_fault at ffffffff81616172

And then multiple softlockups such as:

[7056427.875860] Call Trace:
[7056427.875866] [<ffffffff8113f92e>] shrink_zone+0x6e/0x2a0
[7056427.875869] [<ffffffff8113fcd4>] do_try_to_free_pages+0x174/0x460
[7056427.875873] [<ffffffff81308415>] ? find_next_bit+0x15/0x20
[7056427.875875] [<ffffffff811401e7>] try_to_free_mem_cgroup_pages+0xa7/0x170
[7056427.875878] [<ffffffff8118d0ef>] try_charge+0x18f/0x650
[7056427.875883] [<ffffffff81087486>] ? update_curr+0x66/0x180
[7056427.875885] [<ffffffff811915ed>] mem_cgroup_try_charge+0x7d/0x1c0
[7056427.875889] [<ffffffff81128cf2>] __add_to_page_cache_locked+0x42/0x230
[7056427.875891] [<ffffffff81128f28>] add_to_page_cache_lru+0x28/0x80
[7056427.875894] [<ffffffff812719b2>] ext4_mpage_readpages+0x172/0x820
[7056427.875898] [<ffffffff81172d02>] ? alloc_pages_current+0x92/0x120
[7056427.875903] [<ffffffff81228ce6>] ext4_readpages+0x36/0x40
[7056427.875905] [<ffffffff811374e0>] __do_page_cache_readahead+0x180/0x210
[7056427.875907] [<ffffffff8112b1f0>] filemap_fault+0x370/0x400
[7056427.875909] [<ffffffff81231716>] ext4_filemap_fault+0x36/0x50
[7056427.875912] [<ffffffff8115629f>] __do_fault+0x3f/0xd0
[7056427.875915] [<ffffffff8115a205>] handle_mm_fault+0x1245/0x19c0
[7056427.875916] [<ffffffff81190a67>] ? mem_cgroup_oom_synchronize+0x2c7/0x360
[7056427.875920] [<ffffffff810430bb>] __do_page_fault+0x1ab/0x410
[7056427.875921] [<ffffffff8104332c>] do_page_fault+0xc/0x10
[7056427.875924] [<ffffffff81616172>] page_fault+0x22/0x30

So what happens is that the file system needs memory and kicks in direct page reclaim; however, it seems that due to excessive reclaim activity it locks up in the do { shrink_zones(); } while () loop in do_try_to_free_pages.
However, when I get a backtrace from the process, I see that while going through the (in)active lists the processes have correctly called cond_resched():

 #0 [ffff881fe60f3598] __schedule at ffffffff8160fa9a
 #1 [ffff881fe60f35e0] preempt_schedule_common at ffffffff8161043f
 #2 [ffff881fe60f35f8] _cond_resched at ffffffff8161047c
 #3 [ffff881fe60f3608] shrink_page_list at ffffffff8113dd94
 #4 [ffff881fe60f36e8] shrink_inactive_list at ffffffff8113eab1
 #5 [ffff881fe60f37a8] shrink_lruvec at ffffffff8113f710
 #6 [ffff881fe60f38a8] shrink_zone at ffffffff8113f99c
 #7 [ffff881fe60f3920] do_try_to_free_pages at ffffffff8113fcd4
 #8 [ffff881fe60f39a0] try_to_free_mem_cgroup_pages at ffffffff811401e7
 #9 [ffff881fe60f3a10] try_charge at ffffffff8118d0ef
#10 [ffff881fe60f3ab0] mem_cgroup_try_charge at ffffffff811915ed
#11 [ffff881fe60f3af0] __add_to_page_cache_locked at ffffffff81128cf2
#12 [ffff881fe60f3b48] add_to_page_cache_lru at ffffffff81128f28
#13 [ffff881fe60f3b70] ext4_mpage_readpages at ffffffff812719b2
#14 [ffff881fe60f3c78] ext4_readpages at ffffffff81228ce6
#15 [ffff881fe60f3c88] __do_page_cache_readahead at ffffffff811374e0
#16 [ffff881fe60f3d30] filemap_fault at ffffffff8112b1f0
#17 [ffff881fe60f3d88] ext4_filemap_fault at ffffffff81231716
#18 [ffff881fe60f3db0] __do_fault at ffffffff8115629f
#19 [ffff881fe60f3e10] handle_mm_fault at ffffffff8115a205
#20 [ffff881fe60f3ee8] __do_page_fault at ffffffff810430bb
#21 [ffff881fe60f3f40] do_page_fault at ffffffff8104332c
#22 [ffff881fe60f3f50] page_fault at ffffffff81616172

Given the state of the page lists, there are 30 million and 17 million pages on node 0 and node 1 respectively that direct reclaim should go through. Naturally this takes time, especially on a busy box; however, the reclaim code does seem to have resched points, so I'm perplexed as to why those softlockups materialized.
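
For orientation, the loop in question has roughly the following shape in 4.4-era mm/vmscan.c. This is a simplified paraphrase to show the structure, not verbatim kernel code; vmpressure accounting, writeback throttling and the compaction checks are elided:

/* Simplified paraphrase of 4.4-era do_try_to_free_pages() -- structure
 * only, not verbatim kernel code. */
static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
                                          struct scan_control *sc)
{
        unsigned long total_scanned = 0;

        do {
                sc->nr_scanned = 0;
                shrink_zones(zonelist, sc);     /* walks the zones, calling shrink_zone() */
                total_scanned += sc->nr_scanned;

                /* Done once enough pages have been reclaimed... */
                if (sc->nr_reclaimed >= sc->nr_to_reclaim)
                        break;

                /*
                 * ...otherwise raise the scan aggressiveness and go
                 * around again.  Note there is no cond_resched() at
                 * this level; the scheduling points are further down,
                 * e.g. in shrink_page_list(), which is where the crash
                 * backtrace above shows the task parked.
                 */
        } while (--sc->priority >= 0);

        return sc->nr_reclaimed;
}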
Here's the state of the memory:

               PAGES        TOTAL      PERCENTAGE
 TOTAL MEM  65945937     251.6 GB         ----
      FREE   1197261       4.6 GB    1% of TOTAL MEM
      USED  64748676       247 GB   98% of TOTAL MEM
    SHARED  16603507      63.3 GB   25% of TOTAL MEM
   BUFFERS   3508833      13.4 GB    5% of TOTAL MEM
    CACHED  25814634      98.5 GB   39% of TOTAL MEM
      SLAB   7521229      28.7 GB   11% of TOTAL MEM

TOTAL SWAP    262143      1024 MB         ----
 SWAP USED    262143      1024 MB  100% of TOTAL SWAP
 SWAP FREE         0            0    0% of TOTAL SWAP

COMMIT LIMIT  33235111    126.8 GB        ----
   COMMITTED  65273273      249 GB  196% of TOTAL LIMIT

As well as the zone statistics:

NODE: 0  ZONE: 2  ADDR: ffff88207fffcf00  NAME: "Normal"
  SIZE: 33030144  MIN/LOW/HIGH: 22209/27761/33313
  VM_STAT:
        NR_FREE_PAGES: 62436
        NR_ALLOC_BATCH: 2024
        NR_INACTIVE_ANON: 8177867
        NR_ACTIVE_ANON: 5407176
        NR_INACTIVE_FILE: 5804642
        NR_ACTIVE_FILE: 9694170
        NR_UNEVICTABLE: 50013
        NR_MLOCK: 59860
        NR_ANON_PAGES: 13276046
        NR_FILE_MAPPED: 969231
        NR_FILE_PAGES: 15858085
        NR_FILE_DIRTY: 683
        NR_WRITEBACK: 530
        NR_SLAB_RECLAIMABLE: 2688882
        NR_SLAB_UNRECLAIMABLE: 255070
        NR_PAGETABLE: 182007
        NR_KERNEL_STACK: 8419
        NR_UNSTABLE_NFS: 0
        NR_BOUNCE: 0
        NR_VMSCAN_WRITE: 1129513
        NR_VMSCAN_IMMEDIATE: 39497899
        NR_WRITEBACK_TEMP: 0
        NR_ISOLATED_ANON: 0
        NR_ISOLATED_FILE: 462
        NR_SHMEM: 331386
        NR_DIRTIED: 6868276352
        NR_WRITTEN: 5816499568
        NR_PAGES_SCANNED: -490
        NUMA_HIT: 922019911612
        NUMA_MISS: 2935289654
        NUMA_FOREIGN: 1903827196
        NUMA_INTERLEAVE_HIT: 57290
        NUMA_LOCAL: 922017951068
        NUMA_OTHER: 2937250198
        WORKINGSET_REFAULT: 6998116360
        WORKINGSET_ACTIVATE: 6033595269
        WORKINGSET_NODERECLAIM: 2300965
        NR_ANON_TRANSPARENT_HUGEPAGES: 0
        NR_FREE_CMA_PAGES: 0

NODE: 1  ZONE: 2  ADDR: ffff88407fff9f00  NAME: "Normal"
  SIZE: 33554432  MIN/LOW/HIGH: 22567/28208/33850
  VM_STAT:
        NR_FREE_PAGES: 1003922
        NR_ALLOC_BATCH: 4572
        NR_INACTIVE_ANON: 7092366
        NR_ACTIVE_ANON: 6898921
        NR_INACTIVE_FILE: 4880696
        NR_ACTIVE_FILE: 8185594
        NR_UNEVICTABLE: 5311
        NR_MLOCK: 25509
        NR_ANON_PAGES: 13644139
        NR_FILE_MAPPED: 790292
        NR_FILE_PAGES: 13418055
        NR_FILE_DIRTY: 2081
        NR_WRITEBACK: 944
        NR_SLAB_RECLAIMABLE: 3948975
        NR_SLAB_UNRECLAIMABLE: 546053
        NR_PAGETABLE: 207960
        NR_KERNEL_STACK: 10382
        NR_UNSTABLE_NFS: 0
        NR_BOUNCE: 0
        NR_VMSCAN_WRITE: 213029
        NR_VMSCAN_IMMEDIATE: 28902492
        NR_WRITEBACK_TEMP: 0
        NR_ISOLATED_ANON: 0
        NR_ISOLATED_FILE: 23
        NR_SHMEM: 327804
        NR_DIRTIED: 12275571618
        NR_WRITTEN: 11397580462
        NR_PAGES_SCANNED: -787
        NUMA_HIT: 798927158945
        NUMA_MISS: 1903827196
        NUMA_FOREIGN: 2938097677
        NUMA_INTERLEAVE_HIT: 57726
        NUMA_LOCAL: 798925933393
        NUMA_OTHER: 1905052748
        WORKINGSET_REFAULT: 3461465775
        WORKINGSET_ACTIVATE: 2724000507
        WORKINGSET_NODERECLAIM: 4756016
        NR_ANON_TRANSPARENT_HUGEPAGES: 70
        NR_FREE_CMA_PAGES: 0

In addition to that, I believe there is something wrong with the NR_PAGES_SCANNED stats, since they are negative. I haven't looked into the code to see how this value is synchronized and whether it can temporarily go negative.
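
On the negative NR_PAGES_SCANNED values specifically: each zone counter in kernels of this vintage is a global atomic plus small per-CPU deltas, and crash reads only the raw global atomic, which may lag the true value. A paraphrased sketch of the two readers from 4.4-era include/linux/vmstat.h (structure only, not verbatim):

/* Paraphrased from 4.4-era include/linux/vmstat.h -- not verbatim. */
static inline unsigned long zone_page_state(struct zone *zone,
                                            enum zone_stat_item item)
{
        /* Cheap read: the global atomic only, per-CPU deltas not folded in. */
        long x = atomic_long_read(&zone->vm_stat[item]);
#ifdef CONFIG_SMP
        if (x < 0)
                x = 0;  /* in-kernel readers clamp; crash sees the raw value */
#endif
        return x;
}

static inline unsigned long zone_page_state_snapshot(struct zone *zone,
                                                     enum zone_stat_item item)
{
        long x = atomic_long_read(&zone->vm_stat[item]);
#ifdef CONFIG_SMP
        int cpu;

        /* Accurate read: fold in every CPU's pending delta. */
        for_each_online_cpu(cpu)
                x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];

        if (x < 0)
                x = 0;
#endif
        return x;
}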
* Re: Softlockup during memory allocation
From: Nikolay Borisov
To: Michal Hocko
Cc: Johannes Weiner, Linux MM, Paul E. McKenney
Date: 2016-11-01 8:16 UTC

On 11/01/2016 10:12 AM, Nikolay Borisov wrote:
> Hello,

I forgot to mention that this is on kernel 4.4.14.

> I got the following rcu_sched/soft lockup on a server:
>
> [snip -- full quote of the original report]
* Re: Softlockup during memory allocation
From: Vlastimil Babka
To: Nikolay Borisov, Michal Hocko
Cc: Johannes Weiner, Linux MM, Paul E. McKenney
Date: 2016-11-02 19:00 UTC

On 11/01/2016 09:12 AM, Nikolay Borisov wrote:
> In addition to that, I believe there is something wrong
> with the NR_PAGES_SCANNED stats, since they are negative.
> I haven't looked into the code to see how this value is
> synchronized and whether it can temporarily go negative.

This is because there's a shared counter and percpu diffs, and crash only looks at the shared counter.
* Re: Softlockup during memory allocation
From: Hugh Dickins
To: Vlastimil Babka
Cc: Mel Gorman, Nikolay Borisov, Michal Hocko, Johannes Weiner, Linux MM, Paul E. McKenney
Date: 2016-11-04 3:46 UTC

On Wed, 2 Nov 2016, Vlastimil Babka wrote:
> On 11/01/2016 09:12 AM, Nikolay Borisov wrote:
> > In addition to that, I believe there is something wrong
> > with the NR_PAGES_SCANNED stats, since they are negative.
> > I haven't looked into the code to see how this value is
> > synchronized and whether it can temporarily go negative.
>
> This is because there's a shared counter and percpu diffs, and crash
> only looks at the shared counter.

Actually no, as I found when adding vmstat_refresh(). Coincidentally, I spent some of last weekend trying to understand why, then wrote a long comment about it (we thought it might be responsible for a hang). Let me share that comment; but I was writing about an earlier release, so some of the "zone"s have become "node"s since then - and I'm not sure whether my comment is comprehensible to anyone but the writer!

This comment attempts to explain the NR_PAGES_SCANNED underflow. I doubt it's crucial to the bug in question: it's unsightly, and it may double the margin of error involved in using per-cpu accounting, but I don't think it makes a crucial difference to the approximation already inherent there. If that approximation is a problem, we could consider reverting commit 0d5d823ab4e "mm: move zone->pages_scanned into a vmstat counter", which introduced it; or perhaps we could reconsider the stat_threshold calculation on small zones (if that would make a difference - I've not gone so far as to think about the hang itself).

The NR_PAGES_SCANNED underflow does not come from any race: it comes from the way free_pcppages_bulk() and free_one_page() attempt to reset the counter to 0 by __mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned) with a "fuzzy" nr_scanned obtained from zone_page_state(zone, NR_PAGES_SCANNED). Normally __mod_zone_page_state() is used to adjust a counter by a certain well-known amount, but here it is being used with an approximate amount - the cheaply-visible approximate total, before correction by per-cpu diffs (and getting that total from zone_page_state_snapshot() instead would defeat most of the optimization of using per-cpu here, if not regress it to worse than the global per-zone counter used before).

The problem starts on an occasion when nr_scanned there is perhaps perfectly correct, but (factoring in the current cpu diff) is less than the stat_threshold: __mod_zone_page_state() then updates the cpu diff and doesn't touch the globally visible counter - but the negative diff implies that the globally visible counter is larger than the correct value. Then later that too-large value is fed back into __mod_zone_page_state() as if it were correct, tending towards underflow of the full counter. (Or the same could all happen in reverse, with the "reset to 0" leaving a positive residue in the full counter.)

This would be more serious (unbounded) without the periodic refresh_cpu_vm_stats(), which every second folds the per-cpu diffs back into the cheaply-visible totals. When the diffs are 0, free_pcppages_bulk() and free_one_page() will properly reset the total to 0, even if the value it had before was incorrect (negative or positive).
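
To make the pattern concrete, here is a paraphrased sketch of the updater and the "reset" call site from the 4.4-era sources (structure only, not verbatim kernel code):

/* Paraphrased from 4.4-era mm/vmstat.c -- structure only, not verbatim. */
void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
                           long delta)
{
        struct per_cpu_pageset __percpu *pcp = zone->pageset;
        s8 __percpu *p = pcp->vm_stat_diff + item;
        long x = delta + __this_cpu_read(*p);
        long t = __this_cpu_read(pcp->stat_threshold);

        if (unlikely(x > t || x < -t)) {
                zone_page_state_add(x, zone, item);     /* fold into the global counter */
                x = 0;
        }
        __this_cpu_write(*p, x);        /* small residue stays in the per-cpu diff */
}

/* Paraphrase of the "reset to 0" in free_pcppages_bulk()/free_one_page():
 * the delta fed in is the approximate global total, not a well-known
 * amount. */
        nr_scanned = zone_page_state(zone, NR_PAGES_SCANNED);
        if (nr_scanned)
                __mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);

If -nr_scanned lands below stat_threshold, it stays in a negative per-cpu diff and the globally visible counter remains too large; the next "reset" then feeds that inflated value back in, which is exactly the drift described above.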
I can eliminate the negative NR_PAGES_SCANNED reports by a one-line change to __mod_zone_page_state(), to stop ever putting a negative value into the per-cpu diff for NR_PAGES_SCANNED; or by copying most of __mod_zone_page_state() to a separate __reset_zone_page_state(), called only on NR_PAGES_SCANNED, again avoiding the negative diffs. But as I said in the first paragraph, I doubt the underflow is worse in effect than the approximation already inherent in the per-cpu counting here.

Hugh
* Re: Softlockup during memory allocation
From: Nikolay Borisov
To: Michal Hocko
Cc: Johannes Weiner, Linux MM, Paul E. McKenney
Date: 2016-11-04 12:18 UTC

On 11/01/2016 10:12 AM, Nikolay Borisov wrote:
> Hello,
>
> I got the following rcu_sched/soft lockup on a server:
>
> [snip -- full quote of the original report]

So after further debugging, the following seems to be a better representation of what happened.

The softlockup for a task is as follows:

[7056427.875372] NMI watchdog: BUG: soft lockup - CPU#17 stuck for 22s! [nginx:47108]
[7056427.875832] CPU: 17 PID: 47108 Comm: nginx Tainted: G W O K 4.4.14-clouder5 #2
[7056427.875833] Hardware name: Supermicro PIO-628U-TR4T+-ST031/X10DRU-i+, BIOS 1.0c 03/23/2015
[7056427.875835] task: ffff88208a06d280 ti: ffff88235af70000 task.ti: ffff88235af70000
[7056427.875836] RIP: 0010:[<ffffffff8118dbf3>] [<ffffffff8118dbf3>] mem_cgroup_iter+0x33/0x3e0
[7056427.875844] RSP: 0018:ffff88235af73868 EFLAGS: 00000246
[7056427.875845] RAX: 0000000000000005 RBX: ffff88235af739a8 RCX: ffff88207fffc000
[7056427.875846] RDX: ffff88235af738e0 RSI: 0000000000000000 RDI: ffff8823990d1400
[7056427.875847] RBP: ffff88235af738a0 R08: 0000000000000000 R09: 0000000001d0df03
[7056427.875848] R10: 28f5c28f5c28f5c3 R11: 0000000000000000 R12: ffff8823990d1400
[7056427.875849] R13: ffff8823990d1400 R14: 0000000000000000 R15: ffff88407fffae30
[7056427.875850] FS: 00007f8cdee26720(0000) GS:ffff883fff2a0000(0000) knlGS:0000000000000000
[7056427.875851] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[7056427.875852] CR2: 00007f4d199780e0 CR3: 0000002e0b4e3000 CR4: 00000000001406e0
[7056427.875853] Stack:
[7056427.875854] 00ff8823990d1598 ffff88235af738a0 ffff88235af739a8 ffff8823990d1400
[7056427.875856] 0000000000000000 ffff88207fffc000 ffff88407fffae30 ffff88235af73918
[7056427.875858] ffffffff8113f92e 0000000000000000 0000000000000000 0000000000000000
[7056427.875860] Call Trace:
[7056427.875866] [<ffffffff8113f92e>] shrink_zone+0x6e/0x2a0
[... rest of the call trace identical to the softlockup trace quoted in the original report ...]
[7056427.875924] [<ffffffff81616172>] page_fault+0x22/0x30

ffffffff8118dbf3 is mem_cgroup_iter, called from shrink_zone but in the outer do {} while. However, after crashing the machine via sysrq, the callstack for the same process looks like:

 #0 [ffff88235af73598] __schedule at ffffffff8160fa9a
 #1 [ffff88235af735e0] preempt_schedule_common at ffffffff8161043f
 #2 [ffff88235af735f8] _cond_resched at ffffffff8161047c
 #3 [ffff88235af73608] shrink_page_list at ffffffff8113dd94
 #4 [ffff88235af736e8] shrink_inactive_list at ffffffff8113eab1
 #5 [ffff88235af737a8] shrink_lruvec at ffffffff8113f710
 #6 [ffff88235af738a8] shrink_zone at ffffffff8113f99c
 #7 [ffff88235af73920] do_try_to_free_pages at ffffffff8113fcd4
 #8 [ffff88235af739a0] try_to_free_mem_cgroup_pages at ffffffff811401e7
 #9 [ffff88235af73a10] try_charge at ffffffff8118d0ef
#10 [ffff88235af73ab0] mem_cgroup_try_charge at ffffffff811915ed
#11 [ffff88235af73af0] __add_to_page_cache_locked at ffffffff81128cf2
#12 [ffff88235af73b48] add_to_page_cache_lru at ffffffff81128f28
#13 [ffff88235af73b70] ext4_mpage_readpages at ffffffff812719b2
#14 [ffff88235af73c78] ext4_readpages at ffffffff81228ce6
#15 [ffff88235af73c88] __do_page_cache_readahead at ffffffff811374e0
#16 [ffff88235af73d30] filemap_fault at ffffffff8112b1f0
#17 [ffff88235af73d88] ext4_filemap_fault at ffffffff81231716
#18 [ffff88235af73db0] __do_fault at ffffffff8115629f
#19 [ffff88235af73e10] handle_mm_fault at ffffffff8115a205
#20 [ffff88235af73ee8] __do_page_fault at ffffffff810430bb
#21 [ffff88235af73f40] do_page_fault at ffffffff8104332c
#22 [ffff88235af73f50] page_fault at ffffffff81616172

This is the shrink_lruvec in the inner do {} while.

So clearly the same task has made some progress and has even re-scheduled, due to the cond_resched in shrink_page_list. This means that for the softlockup to be triggered in shrink_zone, the code after the inner while should have taken more than 22 seconds (the timeout for the softlockup watchdog). The functions there are shrink_slab (which seems to have the necessary cond_resched in its callees, e.g. do_shrink_slab) and vmpressure, which is very lightweight.

So it's still puzzling how the softlockup is triggered in shrink_zone, when there is in fact plenty of opportunity to resched while scanning the lists.
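
For reference, 4.4-era shrink_zone() has roughly this shape (paraphrased sketch, locals and bookkeeping elided, not verbatim kernel code); the softlockup RIP corresponds to the mem_cgroup_iter() calls, and the crash backtrace to the inside of shrink_lruvec():

/* Paraphrased from 4.4-era mm/vmscan.c -- structure only, not verbatim. */
static bool shrink_zone(struct zone *zone, struct scan_control *sc,
                        bool is_classzone)
{
        struct mem_cgroup *root = sc->target_mem_cgroup;
        bool reclaimable = false;

        do {
                struct mem_cgroup *memcg;

                /* Outer loop: the softlockup RIP is in mem_cgroup_iter(). */
                memcg = mem_cgroup_iter(root, NULL, &reclaim);
                do {
                        /* Inner loop: the crash backtrace sits in here. */
                        shrink_lruvec(lruvec, swappiness, sc, &lru_pages);
                        if (memcg && is_classzone)
                                shrink_slab(...);   /* cond_resched() in its callees */
                } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));

                /* After the inner loop: only vmpressure() and bookkeeping;
                 * no cond_resched() anywhere in this function itself. */
                vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
                           sc->nr_scanned, sc->nr_reclaimed);
        } while (should_continue_reclaim(zone, sc->nr_reclaimed,
                                         sc->nr_scanned, sc));

        return reclaimable;
}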
* Re: Softlockup during memory allocation
From: Nikolay Borisov <kernel@kyup.com>
To: Michal Hocko
Cc: Linux MM
Date: 2016-11-13 22:02 UTC

Ping on that, Michal, in case you've missed it. This seems like a genuine miss of a cond_resched. Can you at least confirm my analysis, or is it complete bollocks?

On Fri, Nov 4, 2016 at 2:18 PM, Nikolay Borisov <kernel@kyup.com> wrote:
> [snip -- previous message quoted in full]
* Re: Softlockup during memory allocation
From: Michal Hocko
To: Nikolay Borisov
Cc: Linux MM
Date: 2016-11-21 5:31 UTC

Hi,
I am sorry for the late response, but I was offline until this weekend. I will try to get to this email ASAP, but it might take some time.

On Mon 14-11-16 00:02:57, Nikolay Borisov wrote:
> Ping on that, Michal, in case you've missed it. This seems like a
> genuine miss of a cond_resched. Can you at least confirm my analysis
> or is it complete bollocks?

--
Michal Hocko
SUSE Labs
* Re: Softlockup during memory allocation
From: Nikolay Borisov
To: Michal Hocko
Cc: Linux MM
Date: 2016-11-22 8:56 UTC

On 11/21/2016 07:31 AM, Michal Hocko wrote:
> Hi,
> I am sorry for the late response, but I was offline until this weekend.
> I will try to get to this email ASAP, but it might take some time.

No worries. I did some further digging and here is what I got, which I believe is rather strange:

struct scan_control {
  nr_to_reclaim = 32,
  gfp_mask = 37880010,
  order = 0,
  nodemask = 0x0,
  target_mem_cgroup = 0xffff8823990d1400,
  priority = 7,
  may_writepage = 1,
  may_unmap = 1,
  may_swap = 0,
  may_thrash = 1,
  hibernation_mode = 0,
  compaction_ready = 0,
  nr_scanned = 0,
  nr_reclaimed = 0
}

Parsing the gfp_mask of 37880010:

#define ___GFP_HIGHMEM        0x02
#define ___GFP_MOVABLE        0x08
#define ___GFP_IO             0x40
#define ___GFP_FS             0x80
#define ___GFP_HARDWALL       0x20000
#define ___GFP_DIRECT_RECLAIM 0x400000
#define ___GFP_KSWAPD_RECLAIM 0x2000000

And initial_priority is 12 (DEF_PRIORITY). Given that nr_scanned is 0 and priority is 7, this means we've gone 5 times through the do {} while in do_try_to_free_pages. Also, total_scanned seems to be 0. Here is the zone which was being reclaimed:

http://sprunge.us/hQBi

So what's strange is that the softlockup occurred but then the code proceeded (as evident from the subsequent stack traces), yet inspecting the reclaim progress it seems rather sad (no progress at all).
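
As an aside, 37880010 decimal is 0x24200ca, which decomposes into exactly the seven flags listed above. A throwaway userspace helper along these lines reproduces the decode (the helper itself is hypothetical; the bit values are the 4.4-era ___GFP_* definitions quoted in the message above):

#include <stdio.h>

/* Bit values transcribed from 4.4-era include/linux/gfp.h. */
static const struct { unsigned long bit; const char *name; } gfp_flags[] = {
        { 0x02UL,      "___GFP_HIGHMEM" },
        { 0x08UL,      "___GFP_MOVABLE" },
        { 0x40UL,      "___GFP_IO" },
        { 0x80UL,      "___GFP_FS" },
        { 0x20000UL,   "___GFP_HARDWALL" },
        { 0x400000UL,  "___GFP_DIRECT_RECLAIM" },
        { 0x2000000UL, "___GFP_KSWAPD_RECLAIM" },
};

int main(void)
{
        unsigned long mask = 37880010UL;  /* gfp_mask from the scan_control dump */
        size_t i;

        printf("gfp_mask %#lx:\n", mask);
        for (i = 0; i < sizeof(gfp_flags) / sizeof(gfp_flags[0]); i++)
                if (mask & gfp_flags[i].bit)
                        printf("  %s\n", gfp_flags[i].name);
        return 0;
}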
* Re: Softlockup during memory allocation
From: Michal Hocko @ 2016-11-22 14:30 UTC
To: Nikolay Borisov; Cc: Linux MM

On Tue 22-11-16 10:56:51, Nikolay Borisov wrote:
> No worries. I did some further digging and here is what I got, which I
> believe is rather strange:
>
> struct scan_control {
>   nr_to_reclaim = 32,
>   gfp_mask = 37880010,
[...]
>   nr_scanned = 0,
>   nr_reclaimed = 0
> }
>
[...]
> And initial_priority is 12 (DEF_PRIORITY). Given that nr_scanned is 0
> and priority is 7, this means we have gone 5 times through the do {}
> while in do_try_to_free_pages. Also, total_scanned seems to be 0. Here
> is the zone which was being reclaimed:
>
> http://sprunge.us/hQBi

LRUs on that zone seem to be empty from a quick glance. kmem -z in
crash can give you the per-zone counters much more nicely.

> So what's strange is that the softlockup occurred but then the code
> proceeded (as evident from the subsequent stack traces), yet inspecting
> the reclaim progress it seems rather sad (no progress at all).

Unless I have misread the data above, it seems something has either
isolated all LRU pages for some time, or there simply are none while
the reclaim is desperately trying to make some progress. In any case
this sounds like less than a happy system...
--
Michal Hocko
SUSE Labs
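For anyone reproducing this against the same vmcore, that is simply:

	crash> kmem -z

run at the crash(8) prompt, which prints the per-zone VM_STAT counters
for every populated zone.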
* Re: Softlockup during memory allocation
From: Michal Hocko @ 2016-11-22 14:32 UTC
To: Nikolay Borisov; Cc: Linux MM

On Tue 22-11-16 15:30:56, Michal Hocko wrote:
> On Tue 22-11-16 10:56:51, Nikolay Borisov wrote:
[...]
> Unless I have misread the data above, it seems something has either
> isolated all LRU pages for some time, or there simply are none while
> the reclaim is desperately trying to make some progress. In any case
> this sounds like less than a happy system...

Btw. how do you configure the memcgs that the FS workload runs in?
--
Michal Hocko
SUSE Labs
* Re: Softlockup during memory allocation
From: Nikolay Borisov @ 2016-11-22 14:46 UTC
To: Michal Hocko; Cc: Linux MM

On 11/22/2016 04:32 PM, Michal Hocko wrote:
> On Tue 22-11-16 15:30:56, Michal Hocko wrote:
[...]
> Btw. how do you configure the memcgs that the FS workload runs in?

So the hierarchy is on v1 and looks like the following:

/cgroup/LXC/cgroup-where-fs-load-runs

- LXC has all but 5 GB of memory for itself.
- The leaf cgroup has a limit of 2 gigabytes:

memory = {
  count = { counter = 523334 },
  limit = 524288,
  parent = 0xffff881fefa40cb8,
  watermark = 524291,
  failcnt = 0
},
memsw = {
  count = { counter = 524310 },
  limit = 524288,
  parent = 0xffff881fefa40ce0,
  watermark = 524320,
  failcnt = 294061026
},
kmem = {
  count = { counter = 0 },
  limit = 2251799813685247,
  parent = 0xffff881fefa40d08,
  watermark = 0,
  failcnt = 0
},

As you can see, the hierarchy is very shallow.
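To put those page_counter numbers in perspective (the counts are in
pages; assuming 4 KiB pages), the cgroup is sitting within about 4 MiB
of its 2 GiB hard limit, so effectively every new charge has to reclaim
something first -- and the memsw failcnt of ~294 million failed charges
is consistent with permanent pressure at the limit. A trivial check:

	#include <stdio.h>

	int main(void)
	{
		const unsigned long page_size = 4096;	/* assuming 4 KiB pages */
		const unsigned long limit = 524288;	/* pages */
		const unsigned long count = 523334;	/* pages */

		printf("limit: %lu MiB\n", limit * page_size >> 20);	/* 2048 MiB */
		printf("count: %lu MiB\n", count * page_size >> 20);	/* 2044 MiB */
		printf("room:  %lu KiB\n", (limit - count) * page_size >> 10); /* 3816 KiB */
		return 0;
	}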
* Re: Softlockup during memory allocation
From: Nikolay Borisov @ 2016-11-22 14:35 UTC
To: Michal Hocko; Cc: Linux MM

On 11/22/2016 04:30 PM, Michal Hocko wrote:
> On Tue 22-11-16 10:56:51, Nikolay Borisov wrote:
>> No worries. I did some further digging and here is what I got, which I
>> believe is rather strange:
>>
>> struct scan_control {
[...]
>> And initial_priority is 12 (DEF_PRIORITY). Given that nr_scanned is 0
>> and priority is 7, this means we have gone 5 times through the do {}
>> while in do_try_to_free_pages. Also, total_scanned seems to be 0. Here
>> is the zone which was being reclaimed:

It is also very strange that total_scanned is 0.

>>
>> http://sprunge.us/hQBi
>
> LRUs on that zone seem to be empty from a quick glance. kmem -z in
> crash can give you the per-zone counters much more nicely.
So here are the populated zones:

NODE: 0  ZONE: 0  ADDR: ffff88207fffc000  NAME: "DMA"
  SIZE: 4095  PRESENT: 3994  MIN/LOW/HIGH: 2/2/3
  VM_STAT:
    NR_FREE_PAGES: 3626
    NR_ALLOC_BATCH: 1
    NR_INACTIVE_ANON: 0
    NR_ACTIVE_ANON: 0
    NR_INACTIVE_FILE: 0
    NR_ACTIVE_FILE: 0
    NR_UNEVICTABLE: 0
    NR_MLOCK: 0
    NR_ANON_PAGES: 0
    NR_FILE_MAPPED: 0
    NR_FILE_PAGES: 0
    NR_FILE_DIRTY: 0
    NR_WRITEBACK: 0
    NR_SLAB_RECLAIMABLE: 0
    NR_SLAB_UNRECLAIMABLE: 0
    NR_PAGETABLE: 0
    NR_KERNEL_STACK: 0
    NR_UNSTABLE_NFS: 0
    NR_BOUNCE: 0
    NR_VMSCAN_WRITE: 0
    NR_VMSCAN_IMMEDIATE: 0
    NR_WRITEBACK_TEMP: 0
    NR_ISOLATED_ANON: 0
    NR_ISOLATED_FILE: 0
    NR_SHMEM: 0
    NR_DIRTIED: 0
    NR_WRITTEN: 0
    NR_PAGES_SCANNED: 0
    NUMA_HIT: 298251
    NUMA_MISS: 0
    NUMA_FOREIGN: 0
    NUMA_INTERLEAVE_HIT: 0
    NUMA_LOCAL: 264611
    NUMA_OTHER: 33640
    WORKINGSET_REFAULT: 0
    WORKINGSET_ACTIVATE: 0
    WORKINGSET_NODERECLAIM: 0
    NR_ANON_TRANSPARENT_HUGEPAGES: 0
    NR_FREE_CMA_PAGES: 0

NODE: 0  ZONE: 1  ADDR: ffff88207fffc780  NAME: "DMA32"
  SIZE: 1044480  PRESENT: 492819  MIN/LOW/HIGH: 275/343/412
  VM_STAT:
    NR_FREE_PAGES: 127277
    NR_ALLOC_BATCH: 69
    NR_INACTIVE_ANON: 104061
    NR_ACTIVE_ANON: 40297
    NR_INACTIVE_FILE: 19114
    NR_ACTIVE_FILE: 24517
    NR_UNEVICTABLE: 1027
    NR_MLOCK: 1231
    NR_ANON_PAGES: 141688
    NR_FILE_MAPPED: 4619
    NR_FILE_PAGES: 47327
    NR_FILE_DIRTY: 1
    NR_WRITEBACK: 0
    NR_SLAB_RECLAIMABLE: 77185
    NR_SLAB_UNRECLAIMABLE: 5064
    NR_PAGETABLE: 2051
    NR_KERNEL_STACK: 236
    NR_UNSTABLE_NFS: 0
    NR_BOUNCE: 0
    NR_VMSCAN_WRITE: 347044
    NR_VMSCAN_IMMEDIATE: 289451163
    NR_WRITEBACK_TEMP: 0
    NR_ISOLATED_ANON: 0
    NR_ISOLATED_FILE: 0
    NR_SHMEM: 3062
    NR_DIRTIED: 76625942
    NR_WRITTEN: 63608865
    NR_PAGES_SCANNED: -9
    NUMA_HIT: 11857097869
    NUMA_MISS: 2808023
    NUMA_FOREIGN: 0
    NUMA_INTERLEAVE_HIT: 0
    NUMA_LOCAL: 11856373836
    NUMA_OTHER: 3532056
    WORKINGSET_REFAULT: 107056373
    WORKINGSET_ACTIVATE: 88346956
    WORKINGSET_NODERECLAIM: 27254
    NR_ANON_TRANSPARENT_HUGEPAGES: 10
    NR_FREE_CMA_PAGES: 0

NODE: 0  ZONE: 2  ADDR: ffff88207fffcf00  NAME: "Normal"
  SIZE: 33030144  MIN/LOW/HIGH: 22209/27761/33313
  VM_STAT:
    NR_FREE_PAGES: 62436
    NR_ALLOC_BATCH: 2024
    NR_INACTIVE_ANON: 8177867
    NR_ACTIVE_ANON: 5407176
    NR_INACTIVE_FILE: 5804642
    NR_ACTIVE_FILE: 9694170
    NR_UNEVICTABLE: 50013
    NR_MLOCK: 59860
    NR_ANON_PAGES: 13276046
    NR_FILE_MAPPED: 969231
    NR_FILE_PAGES: 15858085
    NR_FILE_DIRTY: 683
    NR_WRITEBACK: 530
    NR_SLAB_RECLAIMABLE: 2688882
    NR_SLAB_UNRECLAIMABLE: 255070
    NR_PAGETABLE: 182007
    NR_KERNEL_STACK: 8419
    NR_UNSTABLE_NFS: 0
    NR_BOUNCE: 0
    NR_VMSCAN_WRITE: 1129513
    NR_VMSCAN_IMMEDIATE: 39497899
    NR_WRITEBACK_TEMP: 0
    NR_ISOLATED_ANON: 0
    NR_ISOLATED_FILE: 462
    NR_SHMEM: 331386
    NR_DIRTIED: 6868276352
    NR_WRITTEN: 5816499568
    NR_PAGES_SCANNED: -490
    NUMA_HIT: 922019911612
    NUMA_MISS: 2935289654
    NUMA_FOREIGN: 1903827196
    NUMA_INTERLEAVE_HIT: 57290
    NUMA_LOCAL: 922017951068
    NUMA_OTHER: 2937250198
    WORKINGSET_REFAULT: 6998116360
    WORKINGSET_ACTIVATE: 6033595269
    WORKINGSET_NODERECLAIM: 2300965
    NR_ANON_TRANSPARENT_HUGEPAGES: 0
    NR_FREE_CMA_PAGES: 0

NODE: 1  ZONE: 2  ADDR: ffff88407fff9f00  NAME: "Normal"
  SIZE: 33554432  MIN/LOW/HIGH: 22567/28208/33850
  VM_STAT:
    NR_FREE_PAGES: 1003922
    NR_ALLOC_BATCH: 4572
    NR_INACTIVE_ANON: 7092366
    NR_ACTIVE_ANON: 6898921
    NR_INACTIVE_FILE: 4880696
    NR_ACTIVE_FILE: 8185594
    NR_UNEVICTABLE: 5311
    NR_MLOCK: 25509
    NR_ANON_PAGES: 13644139
    NR_FILE_MAPPED: 790292
    NR_FILE_PAGES: 13418055
    NR_FILE_DIRTY: 2081
    NR_WRITEBACK: 944
    NR_SLAB_RECLAIMABLE: 3948975
    NR_SLAB_UNRECLAIMABLE: 546053
    NR_PAGETABLE: 207960
    NR_KERNEL_STACK: 10382
    NR_UNSTABLE_NFS: 0
    NR_BOUNCE: 0
    NR_VMSCAN_WRITE: 213029
    NR_VMSCAN_IMMEDIATE: 28902492
    NR_WRITEBACK_TEMP: 0
    NR_ISOLATED_ANON: 0
    NR_ISOLATED_FILE: 23
    NR_SHMEM: 327804
    NR_DIRTIED: 12275571618
    NR_WRITTEN: 11397580462
    NR_PAGES_SCANNED: -787
    NUMA_HIT: 798927158945
    NUMA_MISS: 1903827196
    NUMA_FOREIGN: 2938097677
    NUMA_INTERLEAVE_HIT: 57726
    NUMA_LOCAL: 798925933393
    NUMA_OTHER: 1905052748
    WORKINGSET_REFAULT: 3461465775
    WORKINGSET_ACTIVATE: 2724000507
    WORKINGSET_NODERECLAIM: 4756016
    NR_ANON_TRANSPARENT_HUGEPAGES: 70
    NR_FREE_CMA_PAGES: 0

So looking at those, I see the following things:

1. There aren't that many writeback/dirty pages on the 2 nodes.
2. There aren't that many isolated pages.

Since the system doesn't have swap, the anon pages cannot possibly be
reclaimed. However, this leaves the file pages, of which there are
plenty. Yet still no further progress is made. Given all of this, I'm
not able to map the numbers to any sensible behavior of the reclaim
path.

>> So what's strange is that the softlockup occurred but then the code
>> proceeded (as evident from the subsequent stack traces), yet inspecting
>> the reclaim progress it seems rather sad (no progress at all).
>
> Unless I have misread the data above, it seems something has either
> isolated all LRU pages for some time, or there simply are none while
> the reclaim is desperately trying to make some progress. In any case
> this sounds like less than a happy system...
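One oddity worth flagging in the dumps above: NR_PAGES_SCANNED reads
negative (-9, -490, -787). That is most likely benign per-CPU counter
drift rather than corruption -- the raw atomic can go transiently
negative because per-CPU deltas are folded in lazily, and crash reads
the raw value, whereas in-kernel readers clamp it, roughly like this
(from include/linux/vmstat.h of that era):

	static inline unsigned long zone_page_state(struct zone *zone,
						    enum zone_stat_item item)
	{
		long x = atomic_long_read(&zone->vm_stat[item]);
	#ifdef CONFIG_SMP
		if (x < 0)
			x = 0;	/* hide transient per-CPU drift from readers */
	#endif
		return x;
	}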
* Re: Softlockup during memory allocation
From: Michal Hocko @ 2016-11-22 17:02 UTC
To: Nikolay Borisov; Cc: Linux MM

On Tue 22-11-16 16:35:38, Nikolay Borisov wrote:
> On 11/22/2016 04:30 PM, Michal Hocko wrote:
[...]
>> LRUs on that zone seem to be empty from a quick glance. kmem -z in
>> crash can give you the per-zone counters much more nicely.
>
> So here are the populated zones:
[...]
> NODE: 0  ZONE: 2  ADDR: ffff88207fffcf00  NAME: "Normal"
>   SIZE: 33030144  MIN/LOW/HIGH: 22209/27761/33313
>   VM_STAT:
>     NR_FREE_PAGES: 62436
>     NR_ALLOC_BATCH: 2024
>     NR_INACTIVE_ANON: 8177867
>     NR_ACTIVE_ANON: 5407176
>     NR_INACTIVE_FILE: 5804642
>     NR_ACTIVE_FILE: 9694170

So your LRUs are definitely not empty, contrary to what I had thought.
Having 0 pages scanned is indeed very strange. We do reset
sc->nr_scanned for each priority, but my understanding was that you are
looking at a state where we are somewhere in the middle of
shrink_zones. Moreover, total_scanned should be cumulative.

[...]
> So looking at those, I see the following things:
>
> 1. There aren't that many writeback/dirty pages on the 2 nodes.
> 2. There aren't that many isolated pages.
> Since the system doesn't have swap, the anon pages cannot possibly be
> reclaimed. However, this leaves the file pages, of which there are
> plenty. Yet still no further progress is made. Given all of this, I'm
> not able to map the numbers to any sensible behavior of the reclaim
> path.

Well, file pages might be pinned by the filesystem, but even then the
number of scanned pages shouldn't be zero. So yeah, this doesn't make
much sense to me.
--
Michal Hocko
SUSE Labs
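For reference, the priority loop being described looks roughly like
this (condensed from mm/vmscan.c of that era, not verbatim):

	/* mm/vmscan.c -- condensed sketch of do_try_to_free_pages() */
	static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
						  struct scan_control *sc)
	{
		unsigned long total_scanned = 0;

		do {
			sc->nr_scanned = 0;	/* reset every priority round */
			shrink_zones(zonelist, sc);

			total_scanned += sc->nr_scanned;	/* cumulative */
			if (sc->nr_reclaimed >= sc->nr_to_reclaim)
				break;

			/* ... writeback throttling based on total_scanned ... */
		} while (--sc->priority >= 0);

		return sc->nr_reclaimed;
	}

So nr_scanned == 0 at priority 7 only means the round in flight had
scanned nothing yet; total_scanned == 0 after five full rounds is the
genuinely odd part.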
* Re: Softlockup during memory allocation
From: Nikolay Borisov @ 2016-11-23 7:44 UTC
To: Michal Hocko; Cc: Linux MM

On 11/22/2016 07:02 PM, Michal Hocko wrote:
> On Tue 22-11-16 16:35:38, Nikolay Borisov wrote:
[...]
> So your LRUs are definitely not empty, contrary to what I had thought.
> Having 0 pages scanned is indeed very strange. We do reset
> sc->nr_scanned for each priority, but my understanding was that you are
> looking at a state where we are somewhere in the middle of
> shrink_zones. Moreover, total_scanned should be cumulative.

So the server began acting wonky. People logged on to it and saw the
softlockup as per my initial email. They then initiated a crashdump via
sysrq, since most commands weren't going through (e.g. forking), so
crashing it was a last-resort measure. After that I started looking at
the crashdump and observed that prior to the crash the machine seems to
have locked up, judging from the dmesg logs. However, when I manually
inspect the *current* (current being the time the crash was actually
initiated) state of the processes reported as softlocked up, they seem
to have made progress and are now in

shrink_zone->shrink_lruvec->shrink_inactive_list->_cond_resched->__schedule

And the softlockup was being shown to be in mem_cgroup_iter. So it's a
mystery how this function can softlockup, and why, after the softlockup
apparently got resolved, reclaim is not making any progress.
[...]
>> Since the system doesn't have swap, the anon pages cannot possibly be
>> reclaimed. However, this leaves the file pages, of which there are
>> plenty. Yet still no further progress is made.
>
> Well, file pages might be pinned by the filesystem, but even then the
> number of scanned pages shouldn't be zero. So yeah, this doesn't make
> much sense to me.
* Re: Softlockup during memory allocation
From: Michal Hocko @ 2016-11-23 7:49 UTC
To: Nikolay Borisov; Cc: Linux MM

On Wed 23-11-16 09:44:45, Nikolay Borisov wrote:
> On 11/22/2016 07:02 PM, Michal Hocko wrote:
[...]
> So the server began acting wonky. People logged on to it and saw the
> softlockup as per my initial email. They then initiated a crashdump via
> sysrq, since most commands weren't going through (e.g. forking), so
> crashing it was a last-resort measure. After that I started looking at
> the crashdump and observed that prior to the crash the machine seems to
> have locked up, judging from the dmesg logs. However, when I manually
> inspect the *current* (current being the time the crash was actually
> initiated) state of the processes reported as softlocked up, they seem
> to have made progress and are now in
>
> shrink_zone->shrink_lruvec->shrink_inactive_list->_cond_resched->__schedule

OK, I see.
> And the softlockup was being shown to be in mem_cgroup_iter. So it's a
> mystery how this function can softlockup, and why, after the softlockup
> apparently got resolved, reclaim is not making any progress.

This might be just a coincidence, and the lockup might really mean that
we couldn't isolate (thus scan) any pages at the time the lockup was
detected. mem_cgroup_iter shouldn't itself loop without any bounds and
trigger the lockup on its own. There is a loop around
css_next_descendant_pre, but this should take only a few iterations in
case we are racing with cgroup removal, AFAIR. So to me it sounds more
like a problem with the state of the LRU lists than anything else.
--
Michal Hocko
SUSE Labs
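For reference, the loop being referred to looks roughly like this
(condensed from mm/memcontrol.c:mem_cgroup_iter(); refcounting details
and the per-zone reclaim cookie are omitted):

	/* mm/memcontrol.c -- condensed sketch of the walk in mem_cgroup_iter() */
	for (;;) {
		css = css_next_descendant_pre(css, &root->css);
		if (!css) {
			/* whole subtree walked; reclaim walks wrap to root */
			break;
		}
		/*
		 * Only a racing cgroup removal (a css that is no longer
		 * online) sends us around this loop again, and such a race
		 * should be short-lived.
		 */
		if (css == &root->css || css_tryget_online(css)) {
			memcg = mem_cgroup_from_css(css);
			break;
		}
	}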
* Re: Softlockup during memory allocation
From: Michal Hocko @ 2016-11-23 7:50 UTC
To: Nikolay Borisov; Cc: Linux MM

On Wed 23-11-16 08:49:47, Michal Hocko wrote:
> On Wed 23-11-16 09:44:45, Nikolay Borisov wrote:
[...]
>> However, when I manually inspect the *current* (current being the
>> time the crash was actually initiated) state of the processes
>> reported as softlocked up, they seem to have made progress and are
>> now in
>> shrink_zone->shrink_lruvec->shrink_inactive_list->_cond_resched->__schedule
>
> OK, I see.

One more thing: you might get a better picture if you configure your
system to panic on soft lockup, so that you get the state at the time
of the problem.
--
Michal Hocko
SUSE Labs
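For the archives, that can be toggled at runtime (assuming a crash
kernel is loaded so the panic actually produces a dump):

	sysctl -w kernel.softlockup_panic=1

or set persistently via kernel.softlockup_panic in /etc/sysctl.conf, or
with the softlockup_panic=1 boot parameter.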
* Re: Softlockup during memory allocation
From: Nikolay Borisov @ 2016-11-24 11:45 UTC
To: Michal Hocko; Cc: Linux MM

On 11/23/2016 09:49 AM, Michal Hocko wrote:
> On Wed 23-11-16 09:44:45, Nikolay Borisov wrote:
[...]
> This might be just a coincidence, and the lockup might really mean that
> we couldn't isolate (thus scan) any pages at the time the lockup was
> detected. mem_cgroup_iter shouldn't itself loop without any bounds and
> trigger the lockup on its own. There is a loop around
> css_next_descendant_pre, but this should take only a few iterations in
> case we are racing with cgroup removal, AFAIR. So to me it sounds more
> like a problem with the state of the LRU lists than anything else.

Ok, I think I know what has happened. Inspecting the data structures of
the respective cgroup, here is what its mem_cgroup_per_zone looks like:

zoneinfo[2] = {
  lruvec = {{
    lists = {
      {
        next = 0xffffea004f98c660,
        prev = 0xffffea0063f6b1a0
      },
      {
        next = 0xffffea0004123120,
        prev = 0xffffea002c2e2260
      },
      {
        next = 0xffff8818c37bb360,
        prev = 0xffff8818c37bb360
      },
      {
        next = 0xffff8818c37bb370,
        prev = 0xffff8818c37bb370
      },
      {
        next = 0xffff8818c37bb380,
        prev = 0xffff8818c37bb380
      }
    },
    reclaim_stat = {
      recent_rotated = {172969085, 43319509},
      recent_scanned = {173112994, 185446658}
    },
    zone = 0xffff88207fffcf00
  }},
  lru_size = {159722, 158714, 0, 0, 0},
}

So this means that there are inactive_anon and active_anon pages only -
correct? Since the machine doesn't have any swap, this means the anon
memory has nowhere to go. If I'm interpreting the data correctly, this
explains why reclaim makes no progress. If that's the case, then I have
the following questions:

1. Shouldn't reclaim exit at some point, rather than being stuck in
reclaim without making further progress?
2. It seems rather strange that there are no (INACTIVE|ACTIVE)_FILE
pages - is this possible?
3. Why hasn't the OOM killer been invoked in order to free up some
anonymous memory?
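For readers decoding the dump: lists[] and lru_size[] are indexed by
enum lru_list from include/linux/mmzone.h, so the two populated entries
are the anon lists, and the three list_heads pointing at themselves are
the empty file and unevictable lists:

	#define LRU_BASE   0
	#define LRU_ACTIVE 1
	#define LRU_FILE   2

	enum lru_list {
		LRU_INACTIVE_ANON = LRU_BASE,			/* index 0: 159722 */
		LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,	/* index 1: 158714 */
		LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,	/* index 2: 0 */
		LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE, /* index 3: 0 */
		LRU_UNEVICTABLE,				/* index 4: 0 */
		NR_LRU_LISTS
	};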
* Re: Softlockup during memory allocation
From: Michal Hocko @ 2016-11-24 12:12 UTC
To: Nikolay Borisov; Cc: Linux MM

On Thu 24-11-16 13:45:03, Nikolay Borisov wrote:
[...]
> So this means that there are inactive_anon and active_anon pages only -
> correct?

Yes, at least in this particular zone.

> Since the machine doesn't have any swap, this means the anon
> memory has nowhere to go. If I'm interpreting the data correctly, this
> explains why reclaim makes no progress. If that's the case, then I have
> the following questions:
>
> 1. Shouldn't reclaim exit at some point, rather than being stuck in
> reclaim without making further progress?

Reclaim (try_to_free_mem_cgroup_pages) has to go down all the
priorities before it gets out. We are not doing any pro-active checks
on whether there is anything reclaimable, but that alone shouldn't be
such a big deal, because shrink_node_memcg should simply do nothing, as
get_scan_count will find no pages to scan. So it shouldn't take much
time to realize there is nothing to reclaim and to get back to
try_charge, which retries a few more times and eventually goes OOM. I
do not see how we could trigger RCU stalls here: there shouldn't be any
long RCU critical sections on the way, and there are preemption points
along the way.

> 2. It seems rather strange that there are no (INACTIVE|ACTIVE)_FILE
> pages - is this possible?

All of them might have been reclaimed already as a result of the memory
pressure in the memcg, so that is not all that surprising. But the fact
that you are hitting the limit means that the anonymous pages saturate
your hard limit, so your memcg seems underprovisioned.

> 3. Why hasn't the OOM killer been invoked in order to free up some
> anonymous memory?

It should be, eventually. Maybe there still were some reclaimable pages
in other zones for this memcg.
--
Michal Hocko
SUSE Labs
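The retry logic being described lives in try_charge() in
mm/memcontrol.c and looks roughly like this (a heavily condensed
sketch, not verbatim -- the real function also handles memsw charging,
batching, and OOM-disabled cgroups; MEM_CGROUP_RECLAIM_RETRIES was 5 at
the time):

	/* mm/memcontrol.c -- heavily condensed sketch of try_charge() */
	static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
			      unsigned int nr_pages)
	{
		unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES; /* 5 */
		struct page_counter *counter;

	retry:
		if (page_counter_try_charge(&memcg->memory, nr_pages, &counter))
			return 0;		/* charge fit under the limit */

		/* over the limit: reclaim from the offending hierarchy... */
		try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
					     gfp_mask, may_swap);

		/* ...and retry a bounded number of times before giving up */
		if (nr_retries--)
			goto retry;

		mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(nr_pages));
		return -ENOMEM;
	}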
* Re: Softlockup during memory allocation
From: Nikolay Borisov @ 2016-11-24 13:09 UTC
To: Michal Hocko; Cc: Linux MM

On 11/24/2016 02:12 PM, Michal Hocko wrote:
> On Thu 24-11-16 13:45:03, Nikolay Borisov wrote:
[...]
>> 3. Why hasn't the OOM killer been invoked in order to free up some
>> anonymous memory?
>
> It should be, eventually. Maybe there still were some reclaimable pages
> in other zones for this memcg.

I just checked all the zones for both nodes (the machines have 2 NUMA
nodes), and essentially there are no reclaimable pages - all are
anonymous. So the pertinent question is why processes are sleeping in
the reclaim path when there are no pages to free. I also observed the
same behavior on a different node; this time the priority was 0 and the
code hadn't resorted to OOM. This all seems too strange.
* Re: Softlockup during memory allocation
From: Michal Hocko @ 2016-11-25 9:00 UTC
To: Nikolay Borisov; Cc: Linux MM

On Thu 24-11-16 15:09:38, Nikolay Borisov wrote:
[...]
> I just checked all the zones for both nodes (the machines have 2 NUMA
> nodes), and essentially there are no reclaimable pages - all are
> anonymous. So the pertinent question is why processes are sleeping in
> the reclaim path when there are no pages to free. I also observed the
> same behavior on a different node; this time the priority was 0 and the
> code hadn't resorted to OOM. This all seems too strange.

From my experience, we usually hit the memcg OOM quickly if there are
no reclaimable pages. I do not remember anything big changing recently
in that area. Could you enable the mm_vmscan_memcg_isolate,
mm_vmscan_memcg_reclaim_{begin,end} and mm_vmscan_lru_shrink_inactive
tracepoints to see what is going on during the reclaim? I suspect we
will see some minor reclaim activity there which basically "livelocks"
the OOM from happening, but let's see.
--
Michal Hocko
SUSE Labs
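For completeness, enabling those tracepoints looks something like this,
assuming tracefs/debugfs is mounted at /sys/kernel/debug:

	cd /sys/kernel/debug/tracing
	echo 1 > events/vmscan/mm_vmscan_memcg_isolate/enable
	echo 1 > events/vmscan/mm_vmscan_memcg_reclaim_begin/enable
	echo 1 > events/vmscan/mm_vmscan_memcg_reclaim_end/enable
	echo 1 > events/vmscan/mm_vmscan_lru_shrink_inactive/enable
	cat trace_pipe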
End of thread (newest: 2016-11-25 9:00 UTC)

Thread overview: 20+ messages

2016-11-01  8:12 Softlockup during memory allocation, Nikolay Borisov
2016-11-01  8:16 ` Nikolay Borisov
2016-11-02 19:00 ` Vlastimil Babka
2016-11-04  3:46 ` Hugh Dickins
2016-11-04 12:18 ` Nikolay Borisov
2016-11-13 22:02 ` Nikolay Borisov
2016-11-21  5:31 ` Michal Hocko
2016-11-22  8:56 ` Nikolay Borisov
2016-11-22 14:30 ` Michal Hocko
2016-11-22 14:32 ` Michal Hocko
2016-11-22 14:46 ` Nikolay Borisov
2016-11-22 14:35 ` Nikolay Borisov
2016-11-22 17:02 ` Michal Hocko
2016-11-23  7:44 ` Nikolay Borisov
2016-11-23  7:49 ` Michal Hocko
2016-11-23  7:50 ` Michal Hocko
2016-11-24 11:45 ` Nikolay Borisov
2016-11-24 12:12 ` Michal Hocko
2016-11-24 13:09 ` Nikolay Borisov
2016-11-25  9:00 ` Michal Hocko