* Softlockup during memory allocation
@ 2016-11-01  8:12 Nikolay Borisov
  2016-11-01  8:16 ` Nikolay Borisov
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Nikolay Borisov @ 2016-11-01  8:12 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Johannes Weiner, Linux MM, Paul E. McKenney

Hello, 

I got the following rcu_sched/soft lockup on a server: 

[7056389.638502] INFO: rcu_sched self-detected stall on CPU
[7056389.638509]        21-...: (20994 ticks this GP) idle=9ef/140000000000001/0 softirq=3256767558/3256767596 fqs=6843 
[7056389.638510]         (t=21001 jiffies g=656551647 c=656551646 q=469247)
[7056389.638513] Task dump for CPU 21:
[7056389.638515] hive_exec       R  running task        0  4413  31126 0x0000000a
[7056389.638518]  ffffffff81c40280 ffff883fff323dc0 ffffffff8107f53d 0000000000000015
[7056389.638520]  ffffffff81c40280 ffff883fff323dd8 ffffffff81081ce9 0000000000000016
[7056389.638522]  ffff883fff323e08 ffffffff810aa34b ffff883fff335d00 ffffffff81c40280
[7056389.638524] Call Trace:
[7056389.638525]  <IRQ>  [<ffffffff8107f53d>] sched_show_task+0xbd/0x120
[7056389.638535]  [<ffffffff81081ce9>] dump_cpu_task+0x39/0x40
[7056389.638539]  [<ffffffff810aa34b>] rcu_dump_cpu_stacks+0x8b/0xc0
[7056389.638541]  [<ffffffff810ade76>] rcu_check_callbacks+0x4d6/0x7b0
[7056389.638547]  [<ffffffff810c0a50>] ? tick_init_highres+0x20/0x20
[7056389.638549]  [<ffffffff810b28b9>] update_process_times+0x39/0x60
[7056389.638551]  [<ffffffff810c0a9d>] tick_sched_timer+0x4d/0x180
[7056389.638553]  [<ffffffff810c0a50>] ? tick_init_highres+0x20/0x20
[7056389.638554]  [<ffffffff810b3197>] __hrtimer_run_queues+0xe7/0x260
[7056389.638556]  [<ffffffff810b3718>] hrtimer_interrupt+0xa8/0x1a0
[7056389.638561]  [<ffffffff81034608>] local_apic_timer_interrupt+0x38/0x60
[7056389.638565]  [<ffffffff81616a7d>] smp_apic_timer_interrupt+0x3d/0x50
[7056389.638568]  [<ffffffff816151e9>] apic_timer_interrupt+0x89/0x90
[7056389.638569]  <EOI>  [<ffffffff8113fb4a>] ? shrink_zone+0x28a/0x2a0
[7056389.638575]  [<ffffffff8113fcd4>] do_try_to_free_pages+0x174/0x460
[7056389.638579]  [<ffffffff81308415>] ? find_next_bit+0x15/0x20
[7056389.638581]  [<ffffffff811401e7>] try_to_free_mem_cgroup_pages+0xa7/0x170
[7056389.638585]  [<ffffffff8118d0ef>] try_charge+0x18f/0x650
[7056389.638588]  [<ffffffff81087486>] ? update_curr+0x66/0x180
[7056389.638591]  [<ffffffff811915ed>] mem_cgroup_try_charge+0x7d/0x1c0
[7056389.638595]  [<ffffffff81128cf2>] __add_to_page_cache_locked+0x42/0x230
[7056389.638596]  [<ffffffff81128f28>] add_to_page_cache_lru+0x28/0x80
[7056389.638600]  [<ffffffff812719b2>] ext4_mpage_readpages+0x172/0x820
[7056389.638603]  [<ffffffff81172d02>] ? alloc_pages_current+0x92/0x120
[7056389.638608]  [<ffffffff81228ce6>] ext4_readpages+0x36/0x40
[7056389.638611]  [<ffffffff811374e0>] __do_page_cache_readahead+0x180/0x210
[7056389.638613]  [<ffffffff8112b1f0>] filemap_fault+0x370/0x400
[7056389.638615]  [<ffffffff81231716>] ext4_filemap_fault+0x36/0x50
[7056389.638618]  [<ffffffff8115629f>] __do_fault+0x3f/0xd0
[7056389.638620]  [<ffffffff8115a205>] handle_mm_fault+0x1245/0x19c0
[7056389.638622]  [<ffffffff810b2f1a>] ? hrtimer_try_to_cancel+0x1a/0x110
[7056389.638626]  [<ffffffff810775a2>] ? __might_sleep+0x52/0xb0
[7056389.638628]  [<ffffffff810b3bdd>] ? hrtimer_nanosleep+0xbd/0x1a0
[7056389.638631]  [<ffffffff810430bb>] __do_page_fault+0x1ab/0x410
[7056389.638632]  [<ffffffff8104332c>] do_page_fault+0xc/0x10
[7056389.638634]  [<ffffffff81616172>] page_fault+0x22/0x30

Here is the stack of the same process, taken with 'bt' in crash:

#0 [ffff8820d5fb3598] __schedule at ffffffff8160fa9a
 #1 [ffff8820d5fb35e0] preempt_schedule_common at ffffffff8161043f
 #2 [ffff8820d5fb35f8] _cond_resched at ffffffff8161047c
 #3 [ffff8820d5fb3608] shrink_page_list at ffffffff8113dd94
 #4 [ffff8820d5fb36e8] shrink_inactive_list at ffffffff8113eab1
 #5 [ffff8820d5fb37a8] shrink_lruvec at ffffffff8113f710
 #6 [ffff8820d5fb38a8] shrink_zone at ffffffff8113f99c
 #7 [ffff8820d5fb3920] do_try_to_free_pages at ffffffff8113fcd4
 #8 [ffff8820d5fb39a0] try_to_free_mem_cgroup_pages at ffffffff811401e7
 #9 [ffff8820d5fb3a10] try_charge at ffffffff8118d0ef
#10 [ffff8820d5fb3ab0] mem_cgroup_try_charge at ffffffff811915ed
#11 [ffff8820d5fb3af0] __add_to_page_cache_locked at ffffffff81128cf2
#12 [ffff8820d5fb3b48] add_to_page_cache_lru at ffffffff81128f28
#13 [ffff8820d5fb3b70] ext4_mpage_readpages at ffffffff812719b2
#14 [ffff8820d5fb3c78] ext4_readpages at ffffffff81228ce6
#15 [ffff8820d5fb3c88] __do_page_cache_readahead at ffffffff811374e0
#16 [ffff8820d5fb3d30] filemap_fault at ffffffff8112b1f0
#17 [ffff8820d5fb3d88] ext4_filemap_fault at ffffffff81231716
#18 [ffff8820d5fb3db0] __do_fault at ffffffff8115629f
#19 [ffff8820d5fb3e10] handle_mm_fault at ffffffff8115a205
#20 [ffff8820d5fb3ee8] __do_page_fault at ffffffff810430bb
#21 [ffff8820d5fb3f40] do_page_fault at ffffffff8104332c
#22 [ffff8820d5fb3f50] page_fault at ffffffff81616172


And then multiple softlockups such as:

[7056427.875860] Call Trace:
[7056427.875866]  [<ffffffff8113f92e>] shrink_zone+0x6e/0x2a0
[7056427.875869]  [<ffffffff8113fcd4>] do_try_to_free_pages+0x174/0x460
[7056427.875873]  [<ffffffff81308415>] ? find_next_bit+0x15/0x20
[7056427.875875]  [<ffffffff811401e7>] try_to_free_mem_cgroup_pages+0xa7/0x170
[7056427.875878]  [<ffffffff8118d0ef>] try_charge+0x18f/0x650
[7056427.875883]  [<ffffffff81087486>] ? update_curr+0x66/0x180
[7056427.875885]  [<ffffffff811915ed>] mem_cgroup_try_charge+0x7d/0x1c0
[7056427.875889]  [<ffffffff81128cf2>] __add_to_page_cache_locked+0x42/0x230
[7056427.875891]  [<ffffffff81128f28>] add_to_page_cache_lru+0x28/0x80
[7056427.875894]  [<ffffffff812719b2>] ext4_mpage_readpages+0x172/0x820
[7056427.875898]  [<ffffffff81172d02>] ? alloc_pages_current+0x92/0x120
[7056427.875903]  [<ffffffff81228ce6>] ext4_readpages+0x36/0x40
[7056427.875905]  [<ffffffff811374e0>] __do_page_cache_readahead+0x180/0x210
[7056427.875907]  [<ffffffff8112b1f0>] filemap_fault+0x370/0x400
[7056427.875909]  [<ffffffff81231716>] ext4_filemap_fault+0x36/0x50
[7056427.875912]  [<ffffffff8115629f>] __do_fault+0x3f/0xd0
[7056427.875915]  [<ffffffff8115a205>] handle_mm_fault+0x1245/0x19c0
[7056427.875916]  [<ffffffff81190a67>] ? mem_cgroup_oom_synchronize+0x2c7/0x360
[7056427.875920]  [<ffffffff810430bb>] __do_page_fault+0x1ab/0x410
[7056427.875921]  [<ffffffff8104332c>] do_page_fault+0xc/0x10
[7056427.875924]  [<ffffffff81616172>] page_fault+0x22/0x30

So what happens is that the file system needs memory and kicks in direct page reclaim;
however, it seems that due to excessive reclaim behavior it locks up in the
do { shrink_zones(); } while () loop in do_try_to_free_pages. However, when I get
a backtrace from the process, I see that while going through the (in)active lists
the process has correctly called cond_resched():

 #0 [ffff881fe60f3598] __schedule at ffffffff8160fa9a
 #1 [ffff881fe60f35e0] preempt_schedule_common at ffffffff8161043f
 #2 [ffff881fe60f35f8] _cond_resched at ffffffff8161047c
 #3 [ffff881fe60f3608] shrink_page_list at ffffffff8113dd94
 #4 [ffff881fe60f36e8] shrink_inactive_list at ffffffff8113eab1
 #5 [ffff881fe60f37a8] shrink_lruvec at ffffffff8113f710
 #6 [ffff881fe60f38a8] shrink_zone at ffffffff8113f99c
 #7 [ffff881fe60f3920] do_try_to_free_pages at ffffffff8113fcd4
 #8 [ffff881fe60f39a0] try_to_free_mem_cgroup_pages at ffffffff811401e7
 #9 [ffff881fe60f3a10] try_charge at ffffffff8118d0ef
#10 [ffff881fe60f3ab0] mem_cgroup_try_charge at ffffffff811915ed
#11 [ffff881fe60f3af0] __add_to_page_cache_locked at ffffffff81128cf2
#12 [ffff881fe60f3b48] add_to_page_cache_lru at ffffffff81128f28
#13 [ffff881fe60f3b70] ext4_mpage_readpages at ffffffff812719b2
#14 [ffff881fe60f3c78] ext4_readpages at ffffffff81228ce6
#15 [ffff881fe60f3c88] __do_page_cache_readahead at ffffffff811374e0
#16 [ffff881fe60f3d30] filemap_fault at ffffffff8112b1f0
#17 [ffff881fe60f3d88] ext4_filemap_fault at ffffffff81231716
#18 [ffff881fe60f3db0] __do_fault at ffffffff8115629f
#19 [ffff881fe60f3e10] handle_mm_fault at ffffffff8115a205
#20 [ffff881fe60f3ee8] __do_page_fault at ffffffff810430bb
#21 [ffff881fe60f3f40] do_page_fault at ffffffff8104332c
#22 [ffff881fe60f3f50] page_fault at ffffffff81616172

Given the state of the page lists there are roughly 30 million and
17 million pages on node 0 and node 1 respectively that direct reclaim
should go through. Naturally this takes time, especially on a busy box; however,
the reclaim code seems to have resched points. So I'm perplexed as to
why those softlockups materialized.
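
For reference, the loop I have in mind is the priority loop in do_try_to_free_pages.
Heavily condensed (and paraphrased from the 4.4 sources, so take it as a sketch
rather than the exact code), it looks like this; note that there is no explicit
cond_resched() at this level, the resched points all live further down in
shrink_page_list() and friends:

static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
					  struct scan_control *sc)
{
	int initial_priority = sc->priority;	/* DEF_PRIORITY == 12 */
	unsigned long total_scanned = 0;

retry:
	/* ... delayacct / ALLOCSTALL accounting ... */

	do {
		vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
				sc->priority);
		sc->nr_scanned = 0;
		shrink_zones(zonelist, sc);	/* calls shrink_zone() per zone */

		total_scanned += sc->nr_scanned;
		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
			break;

		if (sc->compaction_ready)
			break;

		/* ... writeback throttling heuristics ... */
	} while (--sc->priority >= 0);

	/* ... may restart at 'retry' with sc->may_thrash set ... */
}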

Here's the state of the memory:

                 PAGES        TOTAL      PERCENTAGE
    TOTAL MEM  65945937     251.6 GB         ----
         FREE  1197261       4.6 GB    1% of TOTAL MEM
         USED  64748676       247 GB   98% of TOTAL MEM
       SHARED  16603507      63.3 GB   25% of TOTAL MEM
      BUFFERS  3508833      13.4 GB    5% of TOTAL MEM
       CACHED  25814634      98.5 GB   39% of TOTAL MEM
         SLAB  7521229      28.7 GB   11% of TOTAL MEM

   TOTAL SWAP   262143      1024 MB         ----
    SWAP USED   262143      1024 MB  100% of TOTAL SWAP
    SWAP FREE        0            0    0% of TOTAL SWAP

 COMMIT LIMIT  33235111     126.8 GB         ----
    COMMITTED  65273273       249 GB  196% of TOTAL LIMIT

As well as the zone statistics: 

NODE: 0  ZONE: 2  ADDR: ffff88207fffcf00  NAME: "Normal"
  SIZE: 33030144  MIN/LOW/HIGH: 22209/27761/33313
  VM_STAT:
                NR_FREE_PAGES: 62436
               NR_ALLOC_BATCH: 2024
             NR_INACTIVE_ANON: 8177867
               NR_ACTIVE_ANON: 5407176
             NR_INACTIVE_FILE: 5804642
               NR_ACTIVE_FILE: 9694170
               NR_UNEVICTABLE: 50013
                     NR_MLOCK: 59860
                NR_ANON_PAGES: 13276046
               NR_FILE_MAPPED: 969231
                NR_FILE_PAGES: 15858085
                NR_FILE_DIRTY: 683
                 NR_WRITEBACK: 530
          NR_SLAB_RECLAIMABLE: 2688882
        NR_SLAB_UNRECLAIMABLE: 255070
                 NR_PAGETABLE: 182007
              NR_KERNEL_STACK: 8419
              NR_UNSTABLE_NFS: 0
                    NR_BOUNCE: 0
              NR_VMSCAN_WRITE: 1129513
          NR_VMSCAN_IMMEDIATE: 39497899
            NR_WRITEBACK_TEMP: 0
             NR_ISOLATED_ANON: 0
             NR_ISOLATED_FILE: 462
                     NR_SHMEM: 331386
                   NR_DIRTIED: 6868276352
                   NR_WRITTEN: 5816499568
             NR_PAGES_SCANNED: -490
                     NUMA_HIT: 922019911612
                    NUMA_MISS: 2935289654
                 NUMA_FOREIGN: 1903827196
          NUMA_INTERLEAVE_HIT: 57290
                   NUMA_LOCAL: 922017951068
                   NUMA_OTHER: 2937250198
           WORKINGSET_REFAULT: 6998116360
          WORKINGSET_ACTIVATE: 6033595269
       WORKINGSET_NODERECLAIM: 2300965
NR_ANON_TRANSPARENT_HUGEPAGES: 0
            NR_FREE_CMA_PAGES: 0

NODE: 1  ZONE: 2  ADDR: ffff88407fff9f00  NAME: "Normal"
  SIZE: 33554432  MIN/LOW/HIGH: 22567/28208/33850
  VM_STAT:
                NR_FREE_PAGES: 1003922
               NR_ALLOC_BATCH: 4572
             NR_INACTIVE_ANON: 7092366
               NR_ACTIVE_ANON: 6898921
             NR_INACTIVE_FILE: 4880696
               NR_ACTIVE_FILE: 8185594
               NR_UNEVICTABLE: 5311
                     NR_MLOCK: 25509
                NR_ANON_PAGES: 13644139
               NR_FILE_MAPPED: 790292
                NR_FILE_PAGES: 13418055
                NR_FILE_DIRTY: 2081
                 NR_WRITEBACK: 944
          NR_SLAB_RECLAIMABLE: 3948975
        NR_SLAB_UNRECLAIMABLE: 546053
                 NR_PAGETABLE: 207960
              NR_KERNEL_STACK: 10382
              NR_UNSTABLE_NFS: 0
                    NR_BOUNCE: 0
              NR_VMSCAN_WRITE: 213029
          NR_VMSCAN_IMMEDIATE: 28902492
            NR_WRITEBACK_TEMP: 0
             NR_ISOLATED_ANON: 0
             NR_ISOLATED_FILE: 23
                     NR_SHMEM: 327804
                   NR_DIRTIED: 12275571618
                   NR_WRITTEN: 11397580462
             NR_PAGES_SCANNED: -787
                     NUMA_HIT: 798927158945
                    NUMA_MISS: 1903827196
                 NUMA_FOREIGN: 2938097677
          NUMA_INTERLEAVE_HIT: 57726
                   NUMA_LOCAL: 798925933393
                   NUMA_OTHER: 1905052748
           WORKINGSET_REFAULT: 3461465775
          WORKINGSET_ACTIVATE: 2724000507
       WORKINGSET_NODERECLAIM: 4756016
NR_ANON_TRANSPARENT_HUGEPAGES: 70
            NR_FREE_CMA_PAGES: 0

In addition to that I believe there is something wrong
with the NR_PAGES_SCANNED stats since they are negative.
I haven't looked into the code to see how this value is being
synchronized and whether there is a possibility of it temporarily going negative.


* Re: Softlockup during memory allocation
  2016-11-01  8:12 Softlockup during memory allocation Nikolay Borisov
@ 2016-11-01  8:16 ` Nikolay Borisov
  2016-11-02 19:00 ` Vlastimil Babka
  2016-11-04 12:18 ` Nikolay Borisov
  2 siblings, 0 replies; 20+ messages in thread
From: Nikolay Borisov @ 2016-11-01  8:16 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Johannes Weiner, Linux MM, Paul E. McKenney



On 11/01/2016 10:12 AM, Nikolay Borisov wrote:
> Hello, 
> 

I forgot to mention that this is on kernel 4.4.14

> [full quote of the original report snipped]


* Re: Softlockup during memory allocation
  2016-11-01  8:12 Softlockup during memory allocation Nikolay Borisov
  2016-11-01  8:16 ` Nikolay Borisov
@ 2016-11-02 19:00 ` Vlastimil Babka
  2016-11-04  3:46   ` Hugh Dickins
  2016-11-04 12:18 ` Nikolay Borisov
  2 siblings, 1 reply; 20+ messages in thread
From: Vlastimil Babka @ 2016-11-02 19:00 UTC (permalink / raw)
  To: Nikolay Borisov, Michal Hocko; +Cc: Johannes Weiner, Linux MM, Paul E. McKenney

On 11/01/2016 09:12 AM, Nikolay Borisov wrote:
> In addition to that I believe there is something wrong
> with the NR_PAGES_SCANNED stats since they are negative.
> I haven't looked into the code to see how this value is being
> synchronized and whether there is a possibility of it temporarily going negative.

This is because there's a shared counter and percpu diffs, and crash
only looks at the shared counter.
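
Roughly speaking (paraphrasing the 4.4-era include/linux/vmstat.h from memory,
so details may be slightly off):

/* Cheap reader: only the shared atomic counter in zone->vm_stat[]. */
static inline unsigned long zone_page_state(struct zone *zone,
					    enum zone_stat_item item)
{
	long x = atomic_long_read(&zone->vm_stat[item]);
#ifdef CONFIG_SMP
	if (x < 0)	/* hide transient negatives from in-kernel users */
		x = 0;
#endif
	return x;
}

/* Accurate reader: shared counter plus all the per-cpu deltas. */
static inline unsigned long zone_page_state_snapshot(struct zone *zone,
						     enum zone_stat_item item)
{
	long x = atomic_long_read(&zone->vm_stat[item]);
#ifdef CONFIG_SMP
	int cpu;

	for_each_online_cpu(cpu)
		x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];

	if (x < 0)
		x = 0;
#endif
	return x;
}

crash simply dumps the raw zone->vm_stat[] entry, so it can show the transient
negative value that the readers above either hide or correct.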


* Re: Softlockup during memory allocation
  2016-11-02 19:00 ` Vlastimil Babka
@ 2016-11-04  3:46   ` Hugh Dickins
  0 siblings, 0 replies; 20+ messages in thread
From: Hugh Dickins @ 2016-11-04  3:46 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Mel Gorman, Nikolay Borisov, Michal Hocko, Johannes Weiner,
	Linux MM, Paul E. McKenney

On Wed, 2 Nov 2016, Vlastimil Babka wrote:
> On 11/01/2016 09:12 AM, Nikolay Borisov wrote:
> > In addition to that I believe there is something wrong
> > with the NR_PAGES_SCANNED stats since they are negative.
> > I haven't looked into the code to see how this value is being
> > synchronized and whether there is a possibility of it temporarily going negative.
> 
> This is because there's a shared counter and percpu diffs, and crash
> only looks at the shared counter.

Actually no, as I found when adding vmstat_refresh().  Coincidentally,
I spent some of last weekend trying to understand why, then wrote a
long comment about it (we thought it might be responsible for a hang).
Let me share that comment; but I was writing about an earlier release,
so some of the "zone"s have become "node"s since then - and I'm not
sure whether my comment is comprehensible to anyone but the writer!

This comment attempts to explain the NR_PAGES_SCANNED underflow.  I doubt
it's crucial to the bug in question: it's unsightly, and it may double the
margin of error involved in using per-cpu accounting, but I don't think it
makes a crucial difference to the approximation already inherent there.
If that approximation is a problem, we could consider reverting the commit
0d5d823ab4e "mm: move zone->pages_scanned into a vmstat counter" which
introduced it; or perhaps we could reconsider the stat_threshold
calculation on small zones (if that would make a difference -
I've not gone so far as to think about the hang itself).

The NR_PAGES_SCANNED underflow does not come from any race: it comes from
the way free_pcppages_bulk() and free_one_page() attempt to reset the
counter to 0 by __mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned)
with a "fuzzy" nr_scanned obtained from zone_page_state(zone,
NR_PAGES_SCANNED).

Normally __mod_zone_page_state() is used to adjust a counter by a certain
well-known amount, but here it is being used with an approximate amount -
the cheaply-visible approximate total, before correction by per-cpu diffs
(and getting that total from zone_page_state_snapshot() instead would
defeat most of the optimization of using per-cpu here, if not regress
it to worse than the global per-zone counter used before).

The problem starts on an occasion when nr_scanned there is perhaps
perfectly correct, but (factoring in the current cpu diff) is less than
the stat_threshold: __mod_zone_page_state() then updates the cpu diff and
doesn't touch the globally visible counter - but the negative diff implies
that the globally visible counter is larger than the correct value.  Then
later that too-large value is fed back into __mod_zone_page_state() as if
it were correct, tending towards underflow of the full counter.  (Or the
same could all happen in reverse, with the "reset to 0" leaving a positive
residue in the full counter.)
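
A made-up two-step illustration (numbers purely for example, assuming a
stat_threshold of 125 and ignoring any concurrent scanning):

  1. Global NR_PAGES_SCANNED = 100 (correct), this cpu's diff = 0.
  2. free_one_page() reads zone_page_state() = 100 and calls
     __mod_zone_page_state(zone, NR_PAGES_SCANNED, -100):
     x = -100 is within the threshold, so only the diff is updated
     (diff = -100); the global counter still reads 100 although the
     true value is now 0.
  3. A later freeing path again reads the stale 100 and subtracts it:
     x = -100 + -100 = -200 exceeds the threshold, so -200 is folded
     into the global counter, which now reads -100 - the underflow.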

This would be more serious (unbounded) without the periodic
refresh_cpu_vm_stats(): which every second folds the per-cpu diffs
back into the cheaply-visible totals.  When the diffs are 0, then
free_pcppages_bulk() and free_one_page() will properly reset the total to
0, even if the value it had before was incorrect (negative or positive).

I can eliminate the negative NR_PAGES_SCANNED reports by a one-line change
to __mod_zone_page_state(), to stop ever putting a negative into the
per-cpu diff for NR_PAGES_SCANNED; or by copying most of
__mod_zone_page_state() to a separate __reset_zone_page_state(), called
only on NR_PAGES_SCANNED, again avoiding the negative diffs.  But as I said
in the first paragraph, I doubt the underflow is worse in effect than the
approximation already inherent in the per-cpu counting here.
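
For concreteness, a sketch of the first variant (the function body is
paraphrased from memory of the 4.4-era __mod_zone_page_state(), and the extra
check is only illustrative - not a tested patch):

void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
			   long delta)
{
	struct per_cpu_pageset __percpu *pcp = zone->pageset;
	s8 __percpu *p = pcp->vm_stat_diff + item;
	long x = delta + __this_cpu_read(*p);
	long t = __this_cpu_read(pcp->stat_threshold);

	/*
	 * Illustrative guard: never leave a negative diff behind for
	 * NR_PAGES_SCANNED, so the "reset to 0" done by the freeing
	 * paths cannot make the global counter look larger than the
	 * correct value.
	 */
	if (unlikely(x > t || x < -t ||
		     (item == NR_PAGES_SCANNED && x < 0))) {
		zone_page_state_add(x, zone, item);
		x = 0;
	}
	__this_cpu_write(*p, x);
}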

Hugh


* Re: Softlockup during memory allocation
  2016-11-01  8:12 Softlockup during memory allocation Nikolay Borisov
  2016-11-01  8:16 ` Nikolay Borisov
  2016-11-02 19:00 ` Vlastimil Babka
@ 2016-11-04 12:18 ` Nikolay Borisov
  2016-11-13 22:02   ` Nikolay Borisov
  2 siblings, 1 reply; 20+ messages in thread
From: Nikolay Borisov @ 2016-11-04 12:18 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Johannes Weiner, Linux MM, Paul E. McKenney



On 11/01/2016 10:12 AM, Nikolay Borisov wrote:
> Hello, 
> 
> [full quote of the original report snipped]

So after further debugging, the following seems to be a better representation of what happened.

The softlockup for a task is as follows:

[7056427.875372] NMI watchdog: BUG: soft lockup - CPU#17 stuck for 22s! [nginx:47108]
[7056427.875832] CPU: 17 PID: 47108 Comm: nginx Tainted: G        W  O  K 4.4.14-clouder5 #2
[7056427.875833] Hardware name: Supermicro PIO-628U-TR4T+-ST031/X10DRU-i+, BIOS 1.0c 03/23/2015
[7056427.875835] task: ffff88208a06d280 ti: ffff88235af70000 task.ti: ffff88235af70000
[7056427.875836] RIP: 0010:[<ffffffff8118dbf3>]  [<ffffffff8118dbf3>] mem_cgroup_iter+0x33/0x3e0
[7056427.875844] RSP: 0018:ffff88235af73868  EFLAGS: 00000246
[7056427.875845] RAX: 0000000000000005 RBX: ffff88235af739a8 RCX: ffff88207fffc000
[7056427.875846] RDX: ffff88235af738e0 RSI: 0000000000000000 RDI: ffff8823990d1400
[7056427.875847] RBP: ffff88235af738a0 R08: 0000000000000000 R09: 0000000001d0df03
[7056427.875848] R10: 28f5c28f5c28f5c3 R11: 0000000000000000 R12: ffff8823990d1400
[7056427.875849] R13: ffff8823990d1400 R14: 0000000000000000 R15: ffff88407fffae30
[7056427.875850] FS:  00007f8cdee26720(0000) GS:ffff883fff2a0000(0000) knlGS:0000000000000000
[7056427.875851] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[7056427.875852] CR2: 00007f4d199780e0 CR3: 0000002e0b4e3000 CR4: 00000000001406e0
[7056427.875853] Stack:
[7056427.875854]  00ff8823990d1598 ffff88235af738a0 ffff88235af739a8 ffff8823990d1400
[7056427.875856]  0000000000000000 ffff88207fffc000 ffff88407fffae30 ffff88235af73918
[7056427.875858]  ffffffff8113f92e 0000000000000000 0000000000000000 0000000000000000
[7056427.875860] Call Trace:
[7056427.875866]  [<ffffffff8113f92e>] shrink_zone+0x6e/0x2a0
[7056427.875869]  [<ffffffff8113fcd4>] do_try_to_free_pages+0x174/0x460
[7056427.875873]  [<ffffffff81308415>] ? find_next_bit+0x15/0x20
[7056427.875875]  [<ffffffff811401e7>] try_to_free_mem_cgroup_pages+0xa7/0x170
[7056427.875878]  [<ffffffff8118d0ef>] try_charge+0x18f/0x650
[7056427.875883]  [<ffffffff81087486>] ? update_curr+0x66/0x180
[7056427.875885]  [<ffffffff811915ed>] mem_cgroup_try_charge+0x7d/0x1c0
[7056427.875889]  [<ffffffff81128cf2>] __add_to_page_cache_locked+0x42/0x230
[7056427.875891]  [<ffffffff81128f28>] add_to_page_cache_lru+0x28/0x80
[7056427.875894]  [<ffffffff812719b2>] ext4_mpage_readpages+0x172/0x820
[7056427.875898]  [<ffffffff81172d02>] ? alloc_pages_current+0x92/0x120
[7056427.875903]  [<ffffffff81228ce6>] ext4_readpages+0x36/0x40
[7056427.875905]  [<ffffffff811374e0>] __do_page_cache_readahead+0x180/0x210
[7056427.875907]  [<ffffffff8112b1f0>] filemap_fault+0x370/0x400
[7056427.875909]  [<ffffffff81231716>] ext4_filemap_fault+0x36/0x50
[7056427.875912]  [<ffffffff8115629f>] __do_fault+0x3f/0xd0
[7056427.875915]  [<ffffffff8115a205>] handle_mm_fault+0x1245/0x19c0
[7056427.875916]  [<ffffffff81190a67>] ? mem_cgroup_oom_synchronize+0x2c7/0x360
[7056427.875920]  [<ffffffff810430bb>] __do_page_fault+0x1ab/0x410
[7056427.875921]  [<ffffffff8104332c>] do_page_fault+0xc/0x10
[7056427.875924]  [<ffffffff81616172>] page_fault+0x22/0x30


ffffffff8118dbf3 is mem_cgroup_iter called from shrink_zone, in the outer do {} while. However,
after crashing the machine via sysrq, the call stack for the same process looks like:

#0 [ffff88235af73598] __schedule at ffffffff8160fa9a
 #1 [ffff88235af735e0] preempt_schedule_common at ffffffff8161043f
 #2 [ffff88235af735f8] _cond_resched at ffffffff8161047c
 #3 [ffff88235af73608] shrink_page_list at ffffffff8113dd94
 #4 [ffff88235af736e8] shrink_inactive_list at ffffffff8113eab1
 #5 [ffff88235af737a8] shrink_lruvec at ffffffff8113f710
 #6 [ffff88235af738a8] shrink_zone at ffffffff8113f99c
 #7 [ffff88235af73920] do_try_to_free_pages at ffffffff8113fcd4
 #8 [ffff88235af739a0] try_to_free_mem_cgroup_pages at ffffffff811401e7
 #9 [ffff88235af73a10] try_charge at ffffffff8118d0ef
#10 [ffff88235af73ab0] mem_cgroup_try_charge at ffffffff811915ed
#11 [ffff88235af73af0] __add_to_page_cache_locked at ffffffff81128cf2
#12 [ffff88235af73b48] add_to_page_cache_lru at ffffffff81128f28
#13 [ffff88235af73b70] ext4_mpage_readpages at ffffffff812719b2
#14 [ffff88235af73c78] ext4_readpages at ffffffff81228ce6
#15 [ffff88235af73c88] __do_page_cache_readahead at ffffffff811374e0
#16 [ffff88235af73d30] filemap_fault at ffffffff8112b1f0
#17 [ffff88235af73d88] ext4_filemap_fault at ffffffff81231716
#18 [ffff88235af73db0] __do_fault at ffffffff8115629f
#19 [ffff88235af73e10] handle_mm_fault at ffffffff8115a205
#20 [ffff88235af73ee8] __do_page_fault at ffffffff810430bb
#21 [ffff88235af73f40] do_page_fault at ffffffff8104332c
#22 [ffff88235af73f50] page_fault at ffffffff81616172

This is the shrink_lruvec call in the inner do {} while loop.

So clearly the same task has made some progress and has even
re-scheduled, due to the cond_resched in shrink_page_list.
This means that, for the softlockup to be triggered in shrink_zone,
the code after the inner while loop should have taken more than
22 seconds (the timeout for the softlockup watchdog). The functions
there are shrink_slab (but this seems to have the necessary
cond_resched in its callees, e.g. do_shrink_slab) and vmpressure,
which is very lightweight.

So it's still puzzling how the softlockup is triggered in shrink_zone when
there is in fact plenty of opportunity to resched while scanning the lists.
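
To make the inner/outer loop distinction explicit, here is the rough shape of
shrink_zone in 4.4 (heavily condensed and paraphrased from memory, so treat it
as a sketch rather than the exact 4.4.14 code):

static bool shrink_zone(struct zone *zone, struct scan_control *sc,
			bool is_classzone)
{
	do {	/* outer loop: retries while should_continue_reclaim() */
		struct mem_cgroup *root = sc->target_mem_cgroup;
		struct mem_cgroup *memcg;

		memcg = mem_cgroup_iter(root, NULL, &reclaim);
		do {	/* inner loop: one memcg at a time */
			/* ... pick lruvec/swappiness for this memcg ... */
			shrink_lruvec(lruvec, swappiness, sc, &lru_pages);

			if (memcg && is_classzone)
				shrink_slab(sc->gfp_mask, zone_to_nid(zone),
					    memcg, sc->nr_scanned - scanned,
					    lru_pages);
			/* ... bail out early for limit reclaim ... */
		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));

		/* after the inner loop: root shrink_slab for global reclaim,
		 * slab reclaim bookkeeping, then vmpressure() - but no
		 * cond_resched() anywhere at this level */
		vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
			   sc->nr_scanned - nr_scanned,
			   sc->nr_reclaimed - nr_reclaimed);

	} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
					 sc->nr_scanned - nr_scanned, sc));

	return reclaimable;
}

The softlockup RIP (mem_cgroup_iter, reached from shrink_zone's outer do/while)
and the crash backtrace (shrink_lruvec inside the inner loop) both fit this picture.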


* Re: Softlockup during memory allocation
  2016-11-04 12:18 ` Nikolay Borisov
@ 2016-11-13 22:02   ` Nikolay Borisov
  2016-11-21  5:31     ` Michal Hocko
  0 siblings, 1 reply; 20+ messages in thread
From: Nikolay Borisov @ 2016-11-13 22:02 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Linux MM

Ping on this, Michal, in case you've missed it. This seems like a
genuine missing cond_resched. Can you at least confirm my analysis,
or is it complete bollocks?

On Fri, Nov 4, 2016 at 2:18 PM, Nikolay Borisov <kernel@kyup.com> wrote:
>
>
> On 11/01/2016 10:12 AM, Nikolay Borisov wrote:
>> Hello,
>>
>> I got the following rcu_sched/soft lockup on a server:
>>
>> [7056389.638502] INFO: rcu_sched self-detected stall on CPU
>> [7056389.638509]        21-...: (20994 ticks this GP) idle=9ef/140000000000001/0 softirq=3256767558/3256767596 fqs=6843
>> [7056389.638510]         (t=21001 jiffies g=656551647 c=656551646 q=469247)
>> [7056389.638513] Task dump for CPU 21:
>> [7056389.638515] hive_exec       R  running task        0  4413  31126 0x0000000a
>> [7056389.638518]  ffffffff81c40280 ffff883fff323dc0 ffffffff8107f53d 0000000000000015
>> [7056389.638520]  ffffffff81c40280 ffff883fff323dd8 ffffffff81081ce9 0000000000000016
>> [7056389.638522]  ffff883fff323e08 ffffffff810aa34b ffff883fff335d00 ffffffff81c40280
>> [7056389.638524] Call Trace:
>> [7056389.638525]  <IRQ>  [<ffffffff8107f53d>] sched_show_task+0xbd/0x120
>> [7056389.638535]  [<ffffffff81081ce9>] dump_cpu_task+0x39/0x40
>> [7056389.638539]  [<ffffffff810aa34b>] rcu_dump_cpu_stacks+0x8b/0xc0
>> [7056389.638541]  [<ffffffff810ade76>] rcu_check_callbacks+0x4d6/0x7b0
>> [7056389.638547]  [<ffffffff810c0a50>] ? tick_init_highres+0x20/0x20
>> [7056389.638549]  [<ffffffff810b28b9>] update_process_times+0x39/0x60
>> [7056389.638551]  [<ffffffff810c0a9d>] tick_sched_timer+0x4d/0x180
>> [7056389.638553]  [<ffffffff810c0a50>] ? tick_init_highres+0x20/0x20
>> [7056389.638554]  [<ffffffff810b3197>] __hrtimer_run_queues+0xe7/0x260
>> [7056389.638556]  [<ffffffff810b3718>] hrtimer_interrupt+0xa8/0x1a0
>> [7056389.638561]  [<ffffffff81034608>] local_apic_timer_interrupt+0x38/0x60
>> [7056389.638565]  [<ffffffff81616a7d>] smp_apic_timer_interrupt+0x3d/0x50
>> [7056389.638568]  [<ffffffff816151e9>] apic_timer_interrupt+0x89/0x90
>> [7056389.638569]  <EOI>  [<ffffffff8113fb4a>] ? shrink_zone+0x28a/0x2a0
>> [7056389.638575]  [<ffffffff8113fcd4>] do_try_to_free_pages+0x174/0x460
>> [7056389.638579]  [<ffffffff81308415>] ? find_next_bit+0x15/0x20
>> [7056389.638581]  [<ffffffff811401e7>] try_to_free_mem_cgroup_pages+0xa7/0x170
>> [7056389.638585]  [<ffffffff8118d0ef>] try_charge+0x18f/0x650
>> [7056389.638588]  [<ffffffff81087486>] ? update_curr+0x66/0x180
>> [7056389.638591]  [<ffffffff811915ed>] mem_cgroup_try_charge+0x7d/0x1c0
>> [7056389.638595]  [<ffffffff81128cf2>] __add_to_page_cache_locked+0x42/0x230
>> [7056389.638596]  [<ffffffff81128f28>] add_to_page_cache_lru+0x28/0x80
>> [7056389.638600]  [<ffffffff812719b2>] ext4_mpage_readpages+0x172/0x820
>> [7056389.638603]  [<ffffffff81172d02>] ? alloc_pages_current+0x92/0x120
>> [7056389.638608]  [<ffffffff81228ce6>] ext4_readpages+0x36/0x40
>> [7056389.638611]  [<ffffffff811374e0>] __do_page_cache_readahead+0x180/0x210
>> [7056389.638613]  [<ffffffff8112b1f0>] filemap_fault+0x370/0x400
>> [7056389.638615]  [<ffffffff81231716>] ext4_filemap_fault+0x36/0x50
>> [7056389.638618]  [<ffffffff8115629f>] __do_fault+0x3f/0xd0
>> [7056389.638620]  [<ffffffff8115a205>] handle_mm_fault+0x1245/0x19c0
>> [7056389.638622]  [<ffffffff810b2f1a>] ? hrtimer_try_to_cancel+0x1a/0x110
>> [7056389.638626]  [<ffffffff810775a2>] ? __might_sleep+0x52/0xb0
>> [7056389.638628]  [<ffffffff810b3bdd>] ? hrtimer_nanosleep+0xbd/0x1a0
>> [7056389.638631]  [<ffffffff810430bb>] __do_page_fault+0x1ab/0x410
>> [7056389.638632]  [<ffffffff8104332c>] do_page_fault+0xc/0x10
>> [7056389.638634]  [<ffffffff81616172>] page_fault+0x22/0x30
>>
>> Here is the stack of the same process when taken with 'bt' in crash:
>>
>> #0 [ffff8820d5fb3598] __schedule at ffffffff8160fa9a
>>  #1 [ffff8820d5fb35e0] preempt_schedule_common at ffffffff8161043f
>>  #2 [ffff8820d5fb35f8] _cond_resched at ffffffff8161047c
>>  #3 [ffff8820d5fb3608] shrink_page_list at ffffffff8113dd94
>>  #4 [ffff8820d5fb36e8] shrink_inactive_list at ffffffff8113eab1
>>  #5 [ffff8820d5fb37a8] shrink_lruvec at ffffffff8113f710
>>  #6 [ffff8820d5fb38a8] shrink_zone at ffffffff8113f99c
>>  #7 [ffff8820d5fb3920] do_try_to_free_pages at ffffffff8113fcd4
>>  #8 [ffff8820d5fb39a0] try_to_free_mem_cgroup_pages at ffffffff811401e7
>>  #9 [ffff8820d5fb3a10] try_charge at ffffffff8118d0ef
>> #10 [ffff8820d5fb3ab0] mem_cgroup_try_charge at ffffffff811915ed
>> #11 [ffff8820d5fb3af0] __add_to_page_cache_locked at ffffffff81128cf2
>> #12 [ffff8820d5fb3b48] add_to_page_cache_lru at ffffffff81128f28
>> #13 [ffff8820d5fb3b70] ext4_mpage_readpages at ffffffff812719b2
>> #14 [ffff8820d5fb3c78] ext4_readpages at ffffffff81228ce6
>> #15 [ffff8820d5fb3c88] __do_page_cache_readahead at ffffffff811374e0
>> #16 [ffff8820d5fb3d30] filemap_fault at ffffffff8112b1f0
>> #17 [ffff8820d5fb3d88] ext4_filemap_fault at ffffffff81231716
>> #18 [ffff8820d5fb3db0] __do_fault at ffffffff8115629f
>> #19 [ffff8820d5fb3e10] handle_mm_fault at ffffffff8115a205
>> #20 [ffff8820d5fb3ee8] __do_page_fault at ffffffff810430bb
>> #21 [ffff8820d5fb3f40] do_page_fault at ffffffff8104332c
>> #22 [ffff8820d5fb3f50] page_fault at ffffffff81616172
>>
>>
>> And then multiple softlockups such as :
>>
>> [7056427.875860] Call Trace:
>> [7056427.875866]  [<ffffffff8113f92e>] shrink_zone+0x6e/0x2a0
>> [7056427.875869]  [<ffffffff8113fcd4>] do_try_to_free_pages+0x174/0x460
>> [7056427.875873]  [<ffffffff81308415>] ? find_next_bit+0x15/0x20
>> [7056427.875875]  [<ffffffff811401e7>] try_to_free_mem_cgroup_pages+0xa7/0x170
>> [7056427.875878]  [<ffffffff8118d0ef>] try_charge+0x18f/0x650
>> [7056427.875883]  [<ffffffff81087486>] ? update_curr+0x66/0x180
>> [7056427.875885]  [<ffffffff811915ed>] mem_cgroup_try_charge+0x7d/0x1c0
>> [7056427.875889]  [<ffffffff81128cf2>] __add_to_page_cache_locked+0x42/0x230
>> [7056427.875891]  [<ffffffff81128f28>] add_to_page_cache_lru+0x28/0x80
>> [7056427.875894]  [<ffffffff812719b2>] ext4_mpage_readpages+0x172/0x820
>> [7056427.875898]  [<ffffffff81172d02>] ? alloc_pages_current+0x92/0x120
>> [7056427.875903]  [<ffffffff81228ce6>] ext4_readpages+0x36/0x40
>> [7056427.875905]  [<ffffffff811374e0>] __do_page_cache_readahead+0x180/0x210
>> [7056427.875907]  [<ffffffff8112b1f0>] filemap_fault+0x370/0x400
>> [7056427.875909]  [<ffffffff81231716>] ext4_filemap_fault+0x36/0x50
>> [7056427.875912]  [<ffffffff8115629f>] __do_fault+0x3f/0xd0
>> [7056427.875915]  [<ffffffff8115a205>] handle_mm_fault+0x1245/0x19c0
>> [7056427.875916]  [<ffffffff81190a67>] ? mem_cgroup_oom_synchronize+0x2c7/0x360
>> [7056427.875920]  [<ffffffff810430bb>] __do_page_fault+0x1ab/0x410
>> [7056427.875921]  [<ffffffff8104332c>] do_page_fault+0xc/0x10
>> [7056427.875924]  [<ffffffff81616172>] page_fault+0x22/0x30
>>
>
> So after further debugging the following seems to be a better representation of what's happened:
>
> So the softlockup for a task is as follows:
>
> [7056427.875372] NMI watchdog: BUG: soft lockup - CPU#17 stuck for 22s! [nginx:47108]
> [7056427.875832] CPU: 17 PID: 47108 Comm: nginx Tainted: G        W  O  K 4.4.14-clouder5 #2
> [7056427.875833] Hardware name: Supermicro PIO-628U-TR4T+-ST031/X10DRU-i+, BIOS 1.0c 03/23/2015
> [7056427.875835] task: ffff88208a06d280 ti: ffff88235af70000 task.ti: ffff88235af70000
> [7056427.875836] RIP: 0010:[<ffffffff8118dbf3>]  [<ffffffff8118dbf3>] mem_cgroup_iter+0x33/0x3e0
> [7056427.875844] RSP: 0018:ffff88235af73868  EFLAGS: 00000246
> [7056427.875845] RAX: 0000000000000005 RBX: ffff88235af739a8 RCX: ffff88207fffc000
> [7056427.875846] RDX: ffff88235af738e0 RSI: 0000000000000000 RDI: ffff8823990d1400
> [7056427.875847] RBP: ffff88235af738a0 R08: 0000000000000000 R09: 0000000001d0df03
> [7056427.875848] R10: 28f5c28f5c28f5c3 R11: 0000000000000000 R12: ffff8823990d1400
> [7056427.875849] R13: ffff8823990d1400 R14: 0000000000000000 R15: ffff88407fffae30
> [7056427.875850] FS:  00007f8cdee26720(0000) GS:ffff883fff2a0000(0000) knlGS:0000000000000000
> [7056427.875851] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [7056427.875852] CR2: 00007f4d199780e0 CR3: 0000002e0b4e3000 CR4: 00000000001406e0
> [7056427.875853] Stack:
> [7056427.875854]  00ff8823990d1598 ffff88235af738a0 ffff88235af739a8 ffff8823990d1400
> [7056427.875856]  0000000000000000 ffff88207fffc000 ffff88407fffae30 ffff88235af73918
> [7056427.875858]  ffffffff8113f92e 0000000000000000 0000000000000000 0000000000000000
> [7056427.875860] Call Trace:
> [7056427.875866]  [<ffffffff8113f92e>] shrink_zone+0x6e/0x2a0
> [7056427.875869]  [<ffffffff8113fcd4>] do_try_to_free_pages+0x174/0x460
> [7056427.875873]  [<ffffffff81308415>] ? find_next_bit+0x15/0x20
> [7056427.875875]  [<ffffffff811401e7>] try_to_free_mem_cgroup_pages+0xa7/0x170
> [7056427.875878]  [<ffffffff8118d0ef>] try_charge+0x18f/0x650
> [7056427.875883]  [<ffffffff81087486>] ? update_curr+0x66/0x180
> [7056427.875885]  [<ffffffff811915ed>] mem_cgroup_try_charge+0x7d/0x1c0
> [7056427.875889]  [<ffffffff81128cf2>] __add_to_page_cache_locked+0x42/0x230
> [7056427.875891]  [<ffffffff81128f28>] add_to_page_cache_lru+0x28/0x80
> [7056427.875894]  [<ffffffff812719b2>] ext4_mpage_readpages+0x172/0x820
> [7056427.875898]  [<ffffffff81172d02>] ? alloc_pages_current+0x92/0x120
> [7056427.875903]  [<ffffffff81228ce6>] ext4_readpages+0x36/0x40
> [7056427.875905]  [<ffffffff811374e0>] __do_page_cache_readahead+0x180/0x210
> [7056427.875907]  [<ffffffff8112b1f0>] filemap_fault+0x370/0x400
> [7056427.875909]  [<ffffffff81231716>] ext4_filemap_fault+0x36/0x50
> [7056427.875912]  [<ffffffff8115629f>] __do_fault+0x3f/0xd0
> [7056427.875915]  [<ffffffff8115a205>] handle_mm_fault+0x1245/0x19c0
> [7056427.875916]  [<ffffffff81190a67>] ? mem_cgroup_oom_synchronize+0x2c7/0x360
> [7056427.875920]  [<ffffffff810430bb>] __do_page_fault+0x1ab/0x410
> [7056427.875921]  [<ffffffff8104332c>] do_page_fault+0xc/0x10
> [7056427.875924]  [<ffffffff81616172>] page_fault+0x22/0x30
>
>
> ffffffff8118dbf3 is mem_cgroup_iter called from shrink_zone but in the outer do {} while. However,
> after crashing the machine via sysrq and looking at the callstack for the same process looks like:
>
> #0 [ffff88235af73598] __schedule at ffffffff8160fa9a
>  #1 [ffff88235af735e0] preempt_schedule_common at ffffffff8161043f
>  #2 [ffff88235af735f8] _cond_resched at ffffffff8161047c
>  #3 [ffff88235af73608] shrink_page_list at ffffffff8113dd94
>  #4 [ffff88235af736e8] shrink_inactive_list at ffffffff8113eab1
>  #5 [ffff88235af737a8] shrink_lruvec at ffffffff8113f710
>  #6 [ffff88235af738a8] shrink_zone at ffffffff8113f99c
>  #7 [ffff88235af73920] do_try_to_free_pages at ffffffff8113fcd4
>  #8 [ffff88235af739a0] try_to_free_mem_cgroup_pages at ffffffff811401e7
>  #9 [ffff88235af73a10] try_charge at ffffffff8118d0ef
> #10 [ffff88235af73ab0] mem_cgroup_try_charge at ffffffff811915ed
> #11 [ffff88235af73af0] __add_to_page_cache_locked at ffffffff81128cf2
> #12 [ffff88235af73b48] add_to_page_cache_lru at ffffffff81128f28
> #13 [ffff88235af73b70] ext4_mpage_readpages at ffffffff812719b2
> #14 [ffff88235af73c78] ext4_readpages at ffffffff81228ce6
> #15 [ffff88235af73c88] __do_page_cache_readahead at ffffffff811374e0
> #16 [ffff88235af73d30] filemap_fault at ffffffff8112b1f0
> #17 [ffff88235af73d88] ext4_filemap_fault at ffffffff81231716
> #18 [ffff88235af73db0] __do_fault at ffffffff8115629f
> #19 [ffff88235af73e10] handle_mm_fault at ffffffff8115a205
> #20 [ffff88235af73ee8] __do_page_fault at ffffffff810430bb
> #21 [ffff88235af73f40] do_page_fault at ffffffff8104332c
> #22 [ffff88235af73f50] page_fault at ffffffff81616172
>
> This is the shrink_lruvec call from the inner do {} while loop.
>
> So clearly the same task has made some progress and has even
> re-scheduled due to the cond_resched in shrink_page_list.
> This means that, for the softlockup to be triggered in
> shrink_zone, the code after the inner while loop must have
> taken more than 22 seconds (the timeout of the softlockup
> watchdog). The functions called there are shrink_slab (which
> seems to have the necessary cond_resched in its callees,
> e.g. do_shrink_slab) and vmpressure, which is very lightweight.
> 
> So it's still puzzling how the softlockup is triggered in shrink_zone, when
> there is in fact plenty of opportunity to reschedule while scanning the lists.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Softlockup during memory allocation
  2016-11-13 22:02   ` Nikolay Borisov
@ 2016-11-21  5:31     ` Michal Hocko
  2016-11-22  8:56       ` Nikolay Borisov
  0 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2016-11-21  5:31 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Linux MM

Hi,
I am sorry for a late response, but I was offline until this weekend. I
will try to get to this email ASAP but it might take some time.

On Mon 14-11-16 00:02:57, Nikolay Borisov wrote:
> Ping on that Michal, in case you've missed it. This seems like a
> genuine miss of a cond_resched. Can you at least confirm my analysis
> or is it complete bollocks?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Softlockup during memory allocation
  2016-11-21  5:31     ` Michal Hocko
@ 2016-11-22  8:56       ` Nikolay Borisov
  2016-11-22 14:30         ` Michal Hocko
  0 siblings, 1 reply; 20+ messages in thread
From: Nikolay Borisov @ 2016-11-22  8:56 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Linux MM



On 11/21/2016 07:31 AM, Michal Hocko wrote:
> Hi,
> I am sorry for a late response, but I was offline until this weekend. I
> will try to get to this email ASAP but it might take some time.

No worries. I did some further digging up and here is what I got, which
I believe is rather strange:

struct scan_control {
  nr_to_reclaim = 32,
  gfp_mask = 37880010,
  order = 0,
  nodemask = 0x0,
  target_mem_cgroup = 0xffff8823990d1400,
  priority = 7,
  may_writepage = 1,
  may_unmap = 1,
  may_swap = 0,
  may_thrash = 1,
  hibernation_mode = 0,
  compaction_ready = 0,
  nr_scanned = 0,
  nr_reclaimed = 0
}

Parsing: 37880010
#define ___GFP_HIGHMEM		0x02
#define ___GFP_MOVABLE		0x08
#define ___GFP_IO		0x40
#define ___GFP_FS		0x80
#define ___GFP_HARDWALL		0x20000
#define ___GFP_DIRECT_RECLAIM	0x400000
#define ___GFP_KSWAPD_RECLAIM	0x2000000

And initial_priority is 12 (DEF_PRIORITY). Given that nr_scanned is 0
and priority is 7 this means we've gone 5 times through the do {} while
in do_try_to_free_pages. Also total_scanned seems to be 0.  Here is the
zone which was being reclaimed :

http://sprunge.us/hQBi

So what's strange is that the softlockup occurred but then the code
proceeded (as evident from the subsequent stack traces), yet inspecting
the reclaim progress it seems rather sad (no progress at all)


> 
> On Mon 14-11-16 00:02:57, Nikolay Borisov wrote:
>> Ping on that Michal, in case you've missed it. This seems like a
>> genuine miss of a cond_resched. Can you at least confirm my analysis
>> or is it complete bollocks?


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Softlockup during memory allocation
  2016-11-22  8:56       ` Nikolay Borisov
@ 2016-11-22 14:30         ` Michal Hocko
  2016-11-22 14:32           ` Michal Hocko
  2016-11-22 14:35           ` Nikolay Borisov
  0 siblings, 2 replies; 20+ messages in thread
From: Michal Hocko @ 2016-11-22 14:30 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Linux MM

On Tue 22-11-16 10:56:51, Nikolay Borisov wrote:
> 
> 
> On 11/21/2016 07:31 AM, Michal Hocko wrote:
> > Hi,
> > I am sorry for a late response, but I was offline until this weekend. I
> > will try to get to this email ASAP but it might take some time.
> 
> No worries. I did some further digging up and here is what I got, which
> I believe is rather strange:
> 
> struct scan_control {
>   nr_to_reclaim = 32,
>   gfp_mask = 37880010,
>   order = 0,
>   nodemask = 0x0,
>   target_mem_cgroup = 0xffff8823990d1400,
>   priority = 7,
>   may_writepage = 1,
>   may_unmap = 1,
>   may_swap = 0,
>   may_thrash = 1,
>   hibernation_mode = 0,
>   compaction_ready = 0,
>   nr_scanned = 0,
>   nr_reclaimed = 0
> }
> 
> Parsing: 37880010
> #define ___GFP_HIGHMEM		0x02
> #define ___GFP_MOVABLE		0x08
> #define ___GFP_IO		0x40
> #define ___GFP_FS		0x80
> #define ___GFP_HARDWALL		0x20000
> #define ___GFP_DIRECT_RECLAIM	0x400000
> #define ___GFP_KSWAPD_RECLAIM	0x2000000
> 
> And initial_priority is 12 (DEF_PRIORITY). Given that nr_scanned is 0
> and priority is 7 this means we've gone 5 times through the do {} while
> in do_try_to_free_pages. Also total_scanned seems to be 0.  Here is the
> zone which was being reclaimed :
> 
> http://sprunge.us/hQBi

LRUs on that zones seem to be empty from a quick glance. kmem -z in the
crash can give you per zone counters much more nicely.

> So what's strange is that the softlockup occurred but then the code
> proceeded (as evident from the subsequent stack traces), yet inspecting
> the reclaim progress it seems rather sad (no progress at all)

Unless I have misread the data above it seems something has either
isolated all LRU pages for some time or there simply are none while the
reclaim is desperately trying to make some progress. In any case this
sounds less than a happy system...
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Softlockup during memory allocation
  2016-11-22 14:30         ` Michal Hocko
@ 2016-11-22 14:32           ` Michal Hocko
  2016-11-22 14:46             ` Nikolay Borisov
  2016-11-22 14:35           ` Nikolay Borisov
  1 sibling, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2016-11-22 14:32 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Linux MM

On Tue 22-11-16 15:30:56, Michal Hocko wrote:
> On Tue 22-11-16 10:56:51, Nikolay Borisov wrote:
> > 
> > 
> > On 11/21/2016 07:31 AM, Michal Hocko wrote:
> > > Hi,
> > > I am sorry for a late response, but I was offline until this weekend. I
> > > will try to get to this email ASAP but it might take some time.
> > 
> > No worries. I did some further digging up and here is what I got, which
> > I believe is rather strange:
> > 
> > struct scan_control {
> >   nr_to_reclaim = 32,
> >   gfp_mask = 37880010,
> >   order = 0,
> >   nodemask = 0x0,
> >   target_mem_cgroup = 0xffff8823990d1400,
> >   priority = 7,
> >   may_writepage = 1,
> >   may_unmap = 1,
> >   may_swap = 0,
> >   may_thrash = 1,
> >   hibernation_mode = 0,
> >   compaction_ready = 0,
> >   nr_scanned = 0,
> >   nr_reclaimed = 0
> > }
> > 
> > Parsing: 37880010
> > #define ___GFP_HIGHMEM		0x02
> > #define ___GFP_MOVABLE		0x08
> > #define ___GFP_IO		0x40
> > #define ___GFP_FS		0x80
> > #define ___GFP_HARDWALL		0x20000
> > #define ___GFP_DIRECT_RECLAIM	0x400000
> > #define ___GFP_KSWAPD_RECLAIM	0x2000000
> > 
> > And initial_priority is 12 (DEF_PRIORITY). Given that nr_scanned is 0
> > and priority is 7 this means we've gone 5 times through the do {} while
> > in do_try_to_free_pages. Also total_scanned seems to be 0.  Here is the
> > zone which was being reclaimed :
> > 
> > http://sprunge.us/hQBi
> 
> LRUs on that zones seem to be empty from a quick glance. kmem -z in the
> crash can give you per zone counters much more nicely.
> 
> > So what's strange is that the softlockup occurred but then the code
> > proceeded (as evident from the subsequent stack traces), yet inspecting
> > the reclaim progress it seems rather sad (no progress at all)
> 
> Unless I have misread the data above it seems something has either
> isolated all LRU pages for some time or there simply are none while the
> reclaim is desperately trying to make some progress. In any case this
> sounds less than a happy system...

Btw. how do you configure memcgs that the FS workload runs in?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Softlockup during memory allocation
  2016-11-22 14:30         ` Michal Hocko
  2016-11-22 14:32           ` Michal Hocko
@ 2016-11-22 14:35           ` Nikolay Borisov
  2016-11-22 17:02             ` Michal Hocko
  1 sibling, 1 reply; 20+ messages in thread
From: Nikolay Borisov @ 2016-11-22 14:35 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Linux MM



On 11/22/2016 04:30 PM, Michal Hocko wrote:
> On Tue 22-11-16 10:56:51, Nikolay Borisov wrote:
>>
>>
>> On 11/21/2016 07:31 AM, Michal Hocko wrote:
>>> Hi,
>>> I am sorry for a late response, but I was offline until this weekend. I
>>> will try to get to this email ASAP but it might take some time.
>>
>> No worries. I did some further digging up and here is what I got, which
>> I believe is rather strange:
>>
>> struct scan_control {
>>   nr_to_reclaim = 32,
>>   gfp_mask = 37880010,
>>   order = 0,
>>   nodemask = 0x0,
>>   target_mem_cgroup = 0xffff8823990d1400,
>>   priority = 7,
>>   may_writepage = 1,
>>   may_unmap = 1,
>>   may_swap = 0,
>>   may_thrash = 1,
>>   hibernation_mode = 0,
>>   compaction_ready = 0,
>>   nr_scanned = 0,
>>   nr_reclaimed = 0
>> }
>>
>> Parsing: 37880010
>> #define ___GFP_HIGHMEM		0x02
>> #define ___GFP_MOVABLE		0x08
>> #define ___GFP_IO		0x40
>> #define ___GFP_FS		0x80
>> #define ___GFP_HARDWALL		0x20000
>> #define ___GFP_DIRECT_RECLAIM	0x400000
>> #define ___GFP_KSWAPD_RECLAIM	0x2000000
>>
>> And initial_priority is 12 (DEF_PRIORITY). Given that nr_scanned is 0
>> and priority is 7 this means we've gone 5 times through the do {} while
>> in do_try_to_free_pages. Also total_scanned seems to be 0.  Here is the
>> zone which was being reclaimed :

This is also very strange that total_scanned is 0.


>>
>> http://sprunge.us/hQBi
> 
> LRUs on that zones seem to be empty from a quick glance. kmem -z in the
> crash can give you per zone counters much more nicely.
> 

So here are the populated zones:

NODE: 0  ZONE: 0  ADDR: ffff88207fffc000  NAME: "DMA"
  SIZE: 4095  PRESENT: 3994  MIN/LOW/HIGH: 2/2/3
  VM_STAT:
                NR_FREE_PAGES: 3626
               NR_ALLOC_BATCH: 1
             NR_INACTIVE_ANON: 0
               NR_ACTIVE_ANON: 0
             NR_INACTIVE_FILE: 0
               NR_ACTIVE_FILE: 0
               NR_UNEVICTABLE: 0
                     NR_MLOCK: 0
                NR_ANON_PAGES: 0
               NR_FILE_MAPPED: 0
                NR_FILE_PAGES: 0
                NR_FILE_DIRTY: 0
                 NR_WRITEBACK: 0
          NR_SLAB_RECLAIMABLE: 0
        NR_SLAB_UNRECLAIMABLE: 0
                 NR_PAGETABLE: 0
              NR_KERNEL_STACK: 0
              NR_UNSTABLE_NFS: 0
                    NR_BOUNCE: 0
              NR_VMSCAN_WRITE: 0
          NR_VMSCAN_IMMEDIATE: 0
            NR_WRITEBACK_TEMP: 0
             NR_ISOLATED_ANON: 0
             NR_ISOLATED_FILE: 0
                     NR_SHMEM: 0
                   NR_DIRTIED: 0
                   NR_WRITTEN: 0
             NR_PAGES_SCANNED: 0
                     NUMA_HIT: 298251
                    NUMA_MISS: 0
                 NUMA_FOREIGN: 0
          NUMA_INTERLEAVE_HIT: 0
                   NUMA_LOCAL: 264611
                   NUMA_OTHER: 33640
           WORKINGSET_REFAULT: 0
          WORKINGSET_ACTIVATE: 0
       WORKINGSET_NODERECLAIM: 0
NR_ANON_TRANSPARENT_HUGEPAGES: 0
            NR_FREE_CMA_PAGES: 0


NODE: 0  ZONE: 1  ADDR: ffff88207fffc780  NAME: "DMA32"
  SIZE: 1044480  PRESENT: 492819  MIN/LOW/HIGH: 275/343/412
  VM_STAT:
                NR_FREE_PAGES: 127277
               NR_ALLOC_BATCH: 69
             NR_INACTIVE_ANON: 104061
               NR_ACTIVE_ANON: 40297
             NR_INACTIVE_FILE: 19114
               NR_ACTIVE_FILE: 24517
               NR_UNEVICTABLE: 1027
                     NR_MLOCK: 1231
                NR_ANON_PAGES: 141688
               NR_FILE_MAPPED: 4619
                NR_FILE_PAGES: 47327
                NR_FILE_DIRTY: 1
                 NR_WRITEBACK: 0
          NR_SLAB_RECLAIMABLE: 77185
        NR_SLAB_UNRECLAIMABLE: 5064
                 NR_PAGETABLE: 2051
              NR_KERNEL_STACK: 236
              NR_UNSTABLE_NFS: 0
                    NR_BOUNCE: 0
              NR_VMSCAN_WRITE: 347044
          NR_VMSCAN_IMMEDIATE: 289451163
            NR_WRITEBACK_TEMP: 0
             NR_ISOLATED_ANON: 0
             NR_ISOLATED_FILE: 0
                     NR_SHMEM: 3062
                   NR_DIRTIED: 76625942
                   NR_WRITTEN: 63608865
             NR_PAGES_SCANNED: -9
                     NUMA_HIT: 11857097869
                    NUMA_MISS: 2808023
                 NUMA_FOREIGN: 0
          NUMA_INTERLEAVE_HIT: 0
                   NUMA_LOCAL: 11856373836
                   NUMA_OTHER: 3532056
           WORKINGSET_REFAULT: 107056373
          WORKINGSET_ACTIVATE: 88346956
       WORKINGSET_NODERECLAIM: 27254
NR_ANON_TRANSPARENT_HUGEPAGES: 10
            NR_FREE_CMA_PAGES: 0


NODE: 0  ZONE: 2  ADDR: ffff88207fffcf00  NAME: "Normal"
  SIZE: 33030144  MIN/LOW/HIGH: 22209/27761/33313
  VM_STAT:
                NR_FREE_PAGES: 62436
               NR_ALLOC_BATCH: 2024
             NR_INACTIVE_ANON: 8177867
               NR_ACTIVE_ANON: 5407176
             NR_INACTIVE_FILE: 5804642
               NR_ACTIVE_FILE: 9694170
               NR_UNEVICTABLE: 50013
                     NR_MLOCK: 59860
                NR_ANON_PAGES: 13276046
               NR_FILE_MAPPED: 969231
                NR_FILE_PAGES: 15858085
                NR_FILE_DIRTY: 683
                 NR_WRITEBACK: 530
          NR_SLAB_RECLAIMABLE: 2688882
        NR_SLAB_UNRECLAIMABLE: 255070
                 NR_PAGETABLE: 182007
              NR_KERNEL_STACK: 8419
              NR_UNSTABLE_NFS: 0
                    NR_BOUNCE: 0
              NR_VMSCAN_WRITE: 1129513
          NR_VMSCAN_IMMEDIATE: 39497899
            NR_WRITEBACK_TEMP: 0
             NR_ISOLATED_ANON: 0
             NR_ISOLATED_FILE: 462
                     NR_SHMEM: 331386
                   NR_DIRTIED: 6868276352
                   NR_WRITTEN: 5816499568
             NR_PAGES_SCANNED: -490
                     NUMA_HIT: 922019911612
                    NUMA_MISS: 2935289654
                 NUMA_FOREIGN: 1903827196
          NUMA_INTERLEAVE_HIT: 57290
                   NUMA_LOCAL: 922017951068
                   NUMA_OTHER: 2937250198
           WORKINGSET_REFAULT: 6998116360
          WORKINGSET_ACTIVATE: 6033595269
       WORKINGSET_NODERECLAIM: 2300965
NR_ANON_TRANSPARENT_HUGEPAGES: 0
            NR_FREE_CMA_PAGES: 0

NODE: 1  ZONE: 2  ADDR: ffff88407fff9f00  NAME: "Normal"
  SIZE: 33554432  MIN/LOW/HIGH: 22567/28208/33850
  VM_STAT:
                NR_FREE_PAGES: 1003922
               NR_ALLOC_BATCH: 4572
             NR_INACTIVE_ANON: 7092366
               NR_ACTIVE_ANON: 6898921
             NR_INACTIVE_FILE: 4880696
               NR_ACTIVE_FILE: 8185594
               NR_UNEVICTABLE: 5311
                     NR_MLOCK: 25509
                NR_ANON_PAGES: 13644139
               NR_FILE_MAPPED: 790292
                NR_FILE_PAGES: 13418055
                NR_FILE_DIRTY: 2081
                 NR_WRITEBACK: 944
          NR_SLAB_RECLAIMABLE: 3948975
        NR_SLAB_UNRECLAIMABLE: 546053
                 NR_PAGETABLE: 207960
              NR_KERNEL_STACK: 10382
              NR_UNSTABLE_NFS: 0
                    NR_BOUNCE: 0
              NR_VMSCAN_WRITE: 213029
          NR_VMSCAN_IMMEDIATE: 28902492
            NR_WRITEBACK_TEMP: 0
             NR_ISOLATED_ANON: 0
             NR_ISOLATED_FILE: 23
                     NR_SHMEM: 327804
                   NR_DIRTIED: 12275571618
                   NR_WRITTEN: 11397580462
             NR_PAGES_SCANNED: -787
                     NUMA_HIT: 798927158945
                    NUMA_MISS: 1903827196
                 NUMA_FOREIGN: 2938097677
          NUMA_INTERLEAVE_HIT: 57726
                   NUMA_LOCAL: 798925933393
                   NUMA_OTHER: 1905052748
           WORKINGSET_REFAULT: 3461465775
          WORKINGSET_ACTIVATE: 2724000507
       WORKINGSET_NODERECLAIM: 4756016
NR_ANON_TRANSPARENT_HUGEPAGES: 70
            NR_FREE_CMA_PAGES: 0

So looking at those I see the following things:

1. There aren't that many writeback/dirty pages on the 2 nodes.
2. There aren't that many isolated pages.

Since the system doesn't have swap, the ANON allocations cannot
possibly be reclaimed. However, this leaves the FILE allocations, of
which there are plenty. Yet still no further progress is made. Given
all of this I'm not able to map the numbers to a sensible behavior of the
reclamation path.
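
(For scale, assuming 4 KiB pages: NR_INACTIVE_FILE + NR_ACTIVE_FILE on the
node 0 "Normal" zone alone is roughly 15.5 million pages, i.e. about 59 GiB
of page cache - though these VM_STAT figures are per-zone totals rather
than counters for the memcg being reclaimed.)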

>> So what's strange is that the softlockup occurred but then the code
>> proceeded (as evident from the subsequent stack traces), yet inspecting
>> the reclaim progress it seems rather sad (no progress at all)
> 
> Unless I have misread the data above it seems something has either
> isolated all LRU pages for some time or there simply are none while the
> reclaim is desperately trying to make some progress. In any case this
> sounds less than a happy system...
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Softlockup during memory allocation
  2016-11-22 14:32           ` Michal Hocko
@ 2016-11-22 14:46             ` Nikolay Borisov
  0 siblings, 0 replies; 20+ messages in thread
From: Nikolay Borisov @ 2016-11-22 14:46 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Linux MM



On 11/22/2016 04:32 PM, Michal Hocko wrote:
> On Tue 22-11-16 15:30:56, Michal Hocko wrote:
>> On Tue 22-11-16 10:56:51, Nikolay Borisov wrote:
>>>
>>>
>>> On 11/21/2016 07:31 AM, Michal Hocko wrote:
>>>> Hi,
>>>> I am sorry for a late response, but I was offline until this weekend. I
>>>> will try to get to this email ASAP but it might take some time.
>>>
>>> No worries. I did some further digging up and here is what I got, which
>>> I believe is rather strange:
>>>
>>> struct scan_control {
>>>   nr_to_reclaim = 32,
>>>   gfp_mask = 37880010,
>>>   order = 0,
>>>   nodemask = 0x0,
>>>   target_mem_cgroup = 0xffff8823990d1400,
>>>   priority = 7,
>>>   may_writepage = 1,
>>>   may_unmap = 1,
>>>   may_swap = 0,
>>>   may_thrash = 1,
>>>   hibernation_mode = 0,
>>>   compaction_ready = 0,
>>>   nr_scanned = 0,
>>>   nr_reclaimed = 0
>>> }
>>>
>>> Parsing: 37880010
>>> #define ___GFP_HIGHMEM		0x02
>>> #define ___GFP_MOVABLE		0x08
>>> #define ___GFP_IO		0x40
>>> #define ___GFP_FS		0x80
>>> #define ___GFP_HARDWALL		0x20000
>>> #define ___GFP_DIRECT_RECLAIM	0x400000
>>> #define ___GFP_KSWAPD_RECLAIM	0x2000000
>>>
>>> And initial_priority is 12 (DEF_PRIORITY). Given that nr_scanned is 0
>>> and priority is 7 this means we've gone 5 times through the do {} while
>>> in do_try_to_free_pages. Also total_scanned seems to be 0.  Here is the
>>> zone which was being reclaimed :
>>>
>>> http://sprunge.us/hQBi
>>
>> LRUs on that zones seem to be empty from a quick glance. kmem -z in the
>> crash can give you per zone counters much more nicely.
>>
>>> So what's strange is that the softlockup occurred but then the code
>>> proceeded (as evident from the subsequent stack traces), yet inspecting
>>> the reclaim progress it seems rather sad (no progress at all)
>>
>> Unless I have misread the data above it seems something has either
>> isolated all LRU pages for some time or there simply are none while the
>> reclaim is desperately trying to make some progress. In any case this
>> sounds less than a happy system...
> 
> Btw. how do you configure memcgs that the FS workload runs in?

So the hierarchy is on v1 and looks like the following:

/cgroup/LXC/cgroup-where-fs-load-runs

- LXC has all but 5gb of memory for itself.
- The leaf cgroup has a limit of 2 gigabytes:

memory = {
    count = {
      counter = 523334
    },
    limit = 524288,
    parent = 0xffff881fefa40cb8,
    watermark = 524291,
    failcnt = 0
  },
  memsw = {
    count = {
      counter = 524310
    },
    limit = 524288,
    parent = 0xffff881fefa40ce0,
    watermark = 524320,
    failcnt = 294061026
  },
  kmem = {
    count = {
      counter = 0
    },
    limit = 2251799813685247,
    parent = 0xffff881fefa40d08,
    watermark = 0,
    failcnt = 0
  },


As you can see the hierarchy is very shallow.
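
(As a sanity check on the numbers, assuming 4 KiB pages: limit = 524288
pages = 2 GiB, which matches the 2 gigabyte limit mentioned above, and
count = 523334 pages is about 2044 MiB, so the group is sitting right at
its limit; the memsw failcnt of ~294 million means charges have been
failing against the memsw limit over and over.)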


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Softlockup during memory allocation
  2016-11-22 14:35           ` Nikolay Borisov
@ 2016-11-22 17:02             ` Michal Hocko
  2016-11-23  7:44               ` Nikolay Borisov
  0 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2016-11-22 17:02 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Linux MM

On Tue 22-11-16 16:35:38, Nikolay Borisov wrote:
> 
> 
> On 11/22/2016 04:30 PM, Michal Hocko wrote:
> > On Tue 22-11-16 10:56:51, Nikolay Borisov wrote:
> >>
> >>
> >> On 11/21/2016 07:31 AM, Michal Hocko wrote:
> >>> Hi,
> >>> I am sorry for a late response, but I was offline until this weekend. I
> >>> will try to get to this email ASAP but it might take some time.
> >>
> >> No worries. I did some further digging up and here is what I got, which
> >> I believe is rather strange:
> >>
> >> struct scan_control {
> >>   nr_to_reclaim = 32,
> >>   gfp_mask = 37880010,
> >>   order = 0,
> >>   nodemask = 0x0,
> >>   target_mem_cgroup = 0xffff8823990d1400,
> >>   priority = 7,
> >>   may_writepage = 1,
> >>   may_unmap = 1,
> >>   may_swap = 0,
> >>   may_thrash = 1,
> >>   hibernation_mode = 0,
> >>   compaction_ready = 0,
> >>   nr_scanned = 0,
> >>   nr_reclaimed = 0
> >> }
> >>
> >> Parsing: 37880010
> >> #define ___GFP_HIGHMEM		0x02
> >> #define ___GFP_MOVABLE		0x08
> >> #define ___GFP_IO		0x40
> >> #define ___GFP_FS		0x80
> >> #define ___GFP_HARDWALL		0x20000
> >> #define ___GFP_DIRECT_RECLAIM	0x400000
> >> #define ___GFP_KSWAPD_RECLAIM	0x2000000
> >>
> >> And initial_priority is 12 (DEF_PRIORITY). Given that nr_scanned is 0
> >> and priority is 7 this means we've gone 5 times through the do {} while
> >> in do_try_to_free_pages. Also total_scanned seems to be 0.  Here is the
> >> zone which was being reclaimed :
> 
> This is also very strange that total_scanned is 0.
> 
> 
> >>
> >> http://sprunge.us/hQBi
> > 
> > LRUs on that zones seem to be empty from a quick glance. kmem -z in the
> > crash can give you per zone counters much more nicely.
> > 
> 
> So here are the populated zones:
[...]
> NODE: 0  ZONE: 2  ADDR: ffff88207fffcf00  NAME: "Normal"
>   SIZE: 33030144  MIN/LOW/HIGH: 22209/27761/33313
>   VM_STAT:
>                 NR_FREE_PAGES: 62436
>                NR_ALLOC_BATCH: 2024
>              NR_INACTIVE_ANON: 8177867
>                NR_ACTIVE_ANON: 5407176
>              NR_INACTIVE_FILE: 5804642
>                NR_ACTIVE_FILE: 9694170

So your LRUs are definitely not empty as I have thought. Having 
0 pages scanned is indeed very strange. We do reset sc->nr_scanned
for each priority but my understanding was that you are looking at a
state where we are somewhere in the middle of shrink_zones. Moreover
total_scanned should be cumulative.

>                NR_UNEVICTABLE: 50013
>                      NR_MLOCK: 59860
>                 NR_ANON_PAGES: 13276046
>                NR_FILE_MAPPED: 969231
>                 NR_FILE_PAGES: 15858085
>                 NR_FILE_DIRTY: 683
>                  NR_WRITEBACK: 530
>           NR_SLAB_RECLAIMABLE: 2688882
>         NR_SLAB_UNRECLAIMABLE: 255070
>                  NR_PAGETABLE: 182007
>               NR_KERNEL_STACK: 8419
>               NR_UNSTABLE_NFS: 0
>                     NR_BOUNCE: 0
>               NR_VMSCAN_WRITE: 1129513
>           NR_VMSCAN_IMMEDIATE: 39497899
>             NR_WRITEBACK_TEMP: 0
>              NR_ISOLATED_ANON: 0
>              NR_ISOLATED_FILE: 462
>                      NR_SHMEM: 331386
>                    NR_DIRTIED: 6868276352
>                    NR_WRITTEN: 5816499568
>              NR_PAGES_SCANNED: -490
>                      NUMA_HIT: 922019911612
>                     NUMA_MISS: 2935289654
>                  NUMA_FOREIGN: 1903827196
>           NUMA_INTERLEAVE_HIT: 57290
>                    NUMA_LOCAL: 922017951068
>                    NUMA_OTHER: 2937250198
>            WORKINGSET_REFAULT: 6998116360
>           WORKINGSET_ACTIVATE: 6033595269
>        WORKINGSET_NODERECLAIM: 2300965
> NR_ANON_TRANSPARENT_HUGEPAGES: 0
>             NR_FREE_CMA_PAGES: 0
[...]
> 
> So looking at those I see the following things:
> 
> 1. There aren't that many writeback/dirty pages on the 2 nodes.
> 2. There aren't that many isolated pages.
> 
> Since the system doesn't have swap, the ANON allocations cannot
> possibly be reclaimed. However, this leaves the FILE allocations, of
> which there are plenty. Yet still no further progress is made. Given
> all of this I'm not able to map the numbers to a sensible behavior of the
> reclamation path.

Well, file pages might be pinned by the filesystem but even then the
number of scanned pages shouldn't be zero. So yeah, this doesn't make
much sense to me.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Softlockup during memory allocation
  2016-11-22 17:02             ` Michal Hocko
@ 2016-11-23  7:44               ` Nikolay Borisov
  2016-11-23  7:49                 ` Michal Hocko
  0 siblings, 1 reply; 20+ messages in thread
From: Nikolay Borisov @ 2016-11-23  7:44 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Linux MM



On 11/22/2016 07:02 PM, Michal Hocko wrote:
> On Tue 22-11-16 16:35:38, Nikolay Borisov wrote:
>>
>>
>> On 11/22/2016 04:30 PM, Michal Hocko wrote:
>>> On Tue 22-11-16 10:56:51, Nikolay Borisov wrote:
>>>>
>>>>
>>>> On 11/21/2016 07:31 AM, Michal Hocko wrote:
>>>>> Hi,
>>>>> I am sorry for a late response, but I was offline until this weekend. I
>>>>> will try to get to this email ASAP but it might take some time.
>>>>
>>>> No worries. I did some further digging up and here is what I got, which
>>>> I believe is rather strange:
>>>>
>>>> struct scan_control {
>>>>   nr_to_reclaim = 32,
>>>>   gfp_mask = 37880010,
>>>>   order = 0,
>>>>   nodemask = 0x0,
>>>>   target_mem_cgroup = 0xffff8823990d1400,
>>>>   priority = 7,
>>>>   may_writepage = 1,
>>>>   may_unmap = 1,
>>>>   may_swap = 0,
>>>>   may_thrash = 1,
>>>>   hibernation_mode = 0,
>>>>   compaction_ready = 0,
>>>>   nr_scanned = 0,
>>>>   nr_reclaimed = 0
>>>> }
>>>>
>>>> Parsing: 37880010
>>>> #define ___GFP_HIGHMEM		0x02
>>>> #define ___GFP_MOVABLE		0x08
>>>> #define ___GFP_IO		0x40
>>>> #define ___GFP_FS		0x80
>>>> #define ___GFP_HARDWALL		0x20000
>>>> #define ___GFP_DIRECT_RECLAIM	0x400000
>>>> #define ___GFP_KSWAPD_RECLAIM	0x2000000
>>>>
>>>> And initial_priority is 12 (DEF_PRIORITY). Given that nr_scanned is 0
>>>> and priority is 7 this means we've gone 5 times through the do {} while
>>>> in do_try_to_free_pages. Also total_scanned seems to be 0.  Here is the
>>>> zone which was being reclaimed :
>>
>> This is also very strange that total_scanned is 0.
>>
>>
>>>>
>>>> http://sprunge.us/hQBi
>>>
>>> LRUs on that zones seem to be empty from a quick glance. kmem -z in the
>>> crash can give you per zone counters much more nicely.
>>>
>>
>> So here are the populated zones:
> [...]
>> NODE: 0  ZONE: 2  ADDR: ffff88207fffcf00  NAME: "Normal"
>>   SIZE: 33030144  MIN/LOW/HIGH: 22209/27761/33313
>>   VM_STAT:
>>                 NR_FREE_PAGES: 62436
>>                NR_ALLOC_BATCH: 2024
>>              NR_INACTIVE_ANON: 8177867
>>                NR_ACTIVE_ANON: 5407176
>>              NR_INACTIVE_FILE: 5804642
>>                NR_ACTIVE_FILE: 9694170
> 
> So your LRUs are definitely not empty as I have thought. Having 
> 0 pages scanned is indeed very strange. We do reset sc->nr_scanned
> for each priority but my understanding was that you are looking at a
> state where we are somewhere in the middle of shrink_zones. Moreover
> total_scanned should be cumulative.

So the server began acting wonky. People logged on to it and saw the
softlockup as per my initial email. They then initiated a crashdump via
sysrq, since most commands (e.g. forking) weren't going through, so
crashing it was a last-resort measure. After that I started looking at the
crashdump and observed that prior to the crash the machine seems to have
locked up, judging from the dmesg logs. However, when I manually inspect
the *current* state (current being at the time the crash was actually
initiated) of the processes reported as softlocked up, they seem to
have made progress and are now in
shrink_zone->shrink_lruvec->shrink_inactive_list->_cond_resched->__schedule

And the softlockup was shown to be in mem_cgroup_iter. So it's a
mystery how this function can softlockup and why, after the softlockup
apparently got resolved, reclaim is not making any progress.



> 
>>                NR_UNEVICTABLE: 50013
>>                      NR_MLOCK: 59860
>>                 NR_ANON_PAGES: 13276046
>>                NR_FILE_MAPPED: 969231
>>                 NR_FILE_PAGES: 15858085
>>                 NR_FILE_DIRTY: 683
>>                  NR_WRITEBACK: 530
>>           NR_SLAB_RECLAIMABLE: 2688882
>>         NR_SLAB_UNRECLAIMABLE: 255070
>>                  NR_PAGETABLE: 182007
>>               NR_KERNEL_STACK: 8419
>>               NR_UNSTABLE_NFS: 0
>>                     NR_BOUNCE: 0
>>               NR_VMSCAN_WRITE: 1129513
>>           NR_VMSCAN_IMMEDIATE: 39497899
>>             NR_WRITEBACK_TEMP: 0
>>              NR_ISOLATED_ANON: 0
>>              NR_ISOLATED_FILE: 462
>>                      NR_SHMEM: 331386
>>                    NR_DIRTIED: 6868276352
>>                    NR_WRITTEN: 5816499568
>>              NR_PAGES_SCANNED: -490
>>                      NUMA_HIT: 922019911612
>>                     NUMA_MISS: 2935289654
>>                  NUMA_FOREIGN: 1903827196
>>           NUMA_INTERLEAVE_HIT: 57290
>>                    NUMA_LOCAL: 922017951068
>>                    NUMA_OTHER: 2937250198
>>            WORKINGSET_REFAULT: 6998116360
>>           WORKINGSET_ACTIVATE: 6033595269
>>        WORKINGSET_NODERECLAIM: 2300965
>> NR_ANON_TRANSPARENT_HUGEPAGES: 0
>>             NR_FREE_CMA_PAGES: 0
> [...]
>>
>> So looking at those I see the following things:
>>
>> 1. There aren't that many writeback/dirty pages on the 2 nodes.
>> 2. There aren't that many isolated pages.
>>
>> Since the system doesn't have swap, the ANON allocations cannot
>> possibly be reclaimed. However, this leaves the FILE allocations, of
>> which there are plenty. Yet still no further progress is made. Given
>> all of this I'm not able to map the numbers to a sensible behavior of the
>> reclamation path.
> 
> Well, file pages might be pinned by the filesystem but even then the
> number of scanned pages shouldn't be zero. So yeah, this doesn't make
> much sense to me.
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Softlockup during memory allocation
  2016-11-23  7:44               ` Nikolay Borisov
@ 2016-11-23  7:49                 ` Michal Hocko
  2016-11-23  7:50                   ` Michal Hocko
  2016-11-24 11:45                   ` Nikolay Borisov
  0 siblings, 2 replies; 20+ messages in thread
From: Michal Hocko @ 2016-11-23  7:49 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Linux MM

On Wed 23-11-16 09:44:45, Nikolay Borisov wrote:
> 
> 
> On 11/22/2016 07:02 PM, Michal Hocko wrote:
> > On Tue 22-11-16 16:35:38, Nikolay Borisov wrote:
> >>
> >>
> >> On 11/22/2016 04:30 PM, Michal Hocko wrote:
> >>> On Tue 22-11-16 10:56:51, Nikolay Borisov wrote:
> >>>>
> >>>>
> >>>> On 11/21/2016 07:31 AM, Michal Hocko wrote:
> >>>>> Hi,
> >>>>> I am sorry for a late response, but I was offline until this weekend. I
> >>>>> will try to get to this email ASAP but it might take some time.
> >>>>
> >>>> No worries. I did some further digging up and here is what I got, which
> >>>> I believe is rather strange:
> >>>>
> >>>> struct scan_control {
> >>>>   nr_to_reclaim = 32,
> >>>>   gfp_mask = 37880010,
> >>>>   order = 0,
> >>>>   nodemask = 0x0,
> >>>>   target_mem_cgroup = 0xffff8823990d1400,
> >>>>   priority = 7,
> >>>>   may_writepage = 1,
> >>>>   may_unmap = 1,
> >>>>   may_swap = 0,
> >>>>   may_thrash = 1,
> >>>>   hibernation_mode = 0,
> >>>>   compaction_ready = 0,
> >>>>   nr_scanned = 0,
> >>>>   nr_reclaimed = 0
> >>>> }
> >>>>
> >>>> Parsing: 37880010
> >>>> #define ___GFP_HIGHMEM		0x02
> >>>> #define ___GFP_MOVABLE		0x08
> >>>> #define ___GFP_IO		0x40
> >>>> #define ___GFP_FS		0x80
> >>>> #define ___GFP_HARDWALL		0x20000
> >>>> #define ___GFP_DIRECT_RECLAIM	0x400000
> >>>> #define ___GFP_KSWAPD_RECLAIM	0x2000000
> >>>>
> >>>> And initial_priority is 12 (DEF_PRIORITY). Given that nr_scanned is 0
> >>>> and priority is 7 this means we've gone 5 times through the do {} while
> >>>> in do_try_to_free_pages. Also total_scanned seems to be 0.  Here is the
> >>>> zone which was being reclaimed :
> >>
> >> This is also very strange that total_scanned is 0.
> >>
> >>
> >>>>
> >>>> http://sprunge.us/hQBi
> >>>
> >>> LRUs on that zones seem to be empty from a quick glance. kmem -z in the
> >>> crash can give you per zone counters much more nicely.
> >>>
> >>
> >> So here are the populated zones:
> > [...]
> >> NODE: 0  ZONE: 2  ADDR: ffff88207fffcf00  NAME: "Normal"
> >>   SIZE: 33030144  MIN/LOW/HIGH: 22209/27761/33313
> >>   VM_STAT:
> >>                 NR_FREE_PAGES: 62436
> >>                NR_ALLOC_BATCH: 2024
> >>              NR_INACTIVE_ANON: 8177867
> >>                NR_ACTIVE_ANON: 5407176
> >>              NR_INACTIVE_FILE: 5804642
> >>                NR_ACTIVE_FILE: 9694170
> > 
> > So your LRUs are definitely not empty as I have thought. Having 
> > 0 pages scanned is indeed very strange. We do reset sc->nr_scanned
> > for each priority but my understanding was that you are looking at a
> > state where we are somewhere in the middle of shrink_zones. Moreover
> > total_scanned should be cumulative.
> 
> So the server began acting wonky. People logged on to it and saw the
> softlockup as per my initial email. They then initiated a crashdump via
> sysrq, since most commands (e.g. forking) weren't going through, so
> crashing it was a last-resort measure. After that I started looking at the
> crashdump and observed that prior to the crash the machine seems to have
> locked up, judging from the dmesg logs. However, when I manually inspect
> the *current* state (current being at the time the crash was actually
> initiated) of the processes reported as softlocked up, they seem to
> have made progress and are now in
> shrink_zone->shrink_lruvec->shrink_inactive_list->_cond_resched->__schedule

OK, I see.

> And the softlockup was shown to be in mem_cgroup_iter. So it's a
> mystery how this function can softlockup and why, after the softlockup
> apparently got resolved, reclaim is not making any progress.

This might be just a coincidence and the lockup might really mean that
we couldn't isolate (thus scan) any pages at the time the lockup was
detected. mem_cgroup_iter shouldn't itself loop without any bounds to
trigger the lockup on its own. There is a loop around
css_next_descendant_pre but this should take only few iterations in case
we are racing with cgroup removal AFAIR. So to me it sounds more like a
problem with the state of LRU lists rather than anything else.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Softlockup during memory allocation
  2016-11-23  7:49                 ` Michal Hocko
@ 2016-11-23  7:50                   ` Michal Hocko
  2016-11-24 11:45                   ` Nikolay Borisov
  1 sibling, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2016-11-23  7:50 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Linux MM

On Wed 23-11-16 08:49:47, Michal Hocko wrote:
> On Wed 23-11-16 09:44:45, Nikolay Borisov wrote:
> > 
> > 
> > On 11/22/2016 07:02 PM, Michal Hocko wrote:
> > > On Tue 22-11-16 16:35:38, Nikolay Borisov wrote:
> > >>
> > >>
> > >> On 11/22/2016 04:30 PM, Michal Hocko wrote:
> > >>> On Tue 22-11-16 10:56:51, Nikolay Borisov wrote:
> > >>>>
> > >>>>
> > >>>> On 11/21/2016 07:31 AM, Michal Hocko wrote:
> > >>>>> Hi,
> > >>>>> I am sorry for a late response, but I was offline until this weekend. I
> > >>>>> will try to get to this email ASAP but it might take some time.
> > >>>>
> > >>>> No worries. I did some further digging up and here is what I got, which
> > >>>> I believe is rather strange:
> > >>>>
> > >>>> struct scan_control {
> > >>>>   nr_to_reclaim = 32,
> > >>>>   gfp_mask = 37880010,
> > >>>>   order = 0,
> > >>>>   nodemask = 0x0,
> > >>>>   target_mem_cgroup = 0xffff8823990d1400,
> > >>>>   priority = 7,
> > >>>>   may_writepage = 1,
> > >>>>   may_unmap = 1,
> > >>>>   may_swap = 0,
> > >>>>   may_thrash = 1,
> > >>>>   hibernation_mode = 0,
> > >>>>   compaction_ready = 0,
> > >>>>   nr_scanned = 0,
> > >>>>   nr_reclaimed = 0
> > >>>> }
> > >>>>
> > >>>> Parsing: 37880010
> > >>>> #define ___GFP_HIGHMEM		0x02
> > >>>> #define ___GFP_MOVABLE		0x08
> > >>>> #define ___GFP_IO		0x40
> > >>>> #define ___GFP_FS		0x80
> > >>>> #define ___GFP_HARDWALL		0x20000
> > >>>> #define ___GFP_DIRECT_RECLAIM	0x400000
> > >>>> #define ___GFP_KSWAPD_RECLAIM	0x2000000
> > >>>>
> > >>>> And initial_priority is 12 (DEF_PRIORITY). Given that nr_scanned is 0
> > >>>> and priority is 7 this means we've gone 5 times through the do {} while
> > >>>> in do_try_to_free_pages. Also total_scanned seems to be 0.  Here is the
> > >>>> zone which was being reclaimed :
> > >>
> > >> This is also very strange that total_scanned is 0.
> > >>
> > >>
> > >>>>
> > >>>> http://sprunge.us/hQBi
> > >>>
> > >>> LRUs on that zones seem to be empty from a quick glance. kmem -z in the
> > >>> crash can give you per zone counters much more nicely.
> > >>>
> > >>
> > >> So here are the populated zones:
> > > [...]
> > >> NODE: 0  ZONE: 2  ADDR: ffff88207fffcf00  NAME: "Normal"
> > >>   SIZE: 33030144  MIN/LOW/HIGH: 22209/27761/33313
> > >>   VM_STAT:
> > >>                 NR_FREE_PAGES: 62436
> > >>                NR_ALLOC_BATCH: 2024
> > >>              NR_INACTIVE_ANON: 8177867
> > >>                NR_ACTIVE_ANON: 5407176
> > >>              NR_INACTIVE_FILE: 5804642
> > >>                NR_ACTIVE_FILE: 9694170
> > > 
> > > So your LRUs are definitely not empty as I have thought. Having 
> > > 0 pages scanned is indeed very strange. We do reset sc->nr_scanned
> > > for each priority but my understanding was that you are looking at a
> > > state where we are somewhere in the middle of shrink_zones. Moreover
> > > total_scanned should be cumulative.
> > 
> > So the server began acting wonky. People logged on to it and saw the
> > softlockup as per my initial email. They then initiated a crashdump via
> > sysrq, since most commands (e.g. forking) weren't going through, so
> > crashing it was a last-resort measure. After that I started looking at the
> > crashdump and observed that prior to the crash the machine seems to have
> > locked up, judging from the dmesg logs. However, when I manually inspect
> > the *current* state (current being at the time the crash was actually
> > initiated) of the processes reported as softlocked up, they seem to
> > have made progress and are now in
> > shrink_zone->shrink_lruvec->shrink_inactive_list->_cond_resched->__schedule
> 
> OK, I see.

One more thing. You might get a better picture if you configure your
system to panic on soft lockup so that you get the state at the time of
the problem.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Softlockup during memory allocation
  2016-11-23  7:49                 ` Michal Hocko
  2016-11-23  7:50                   ` Michal Hocko
@ 2016-11-24 11:45                   ` Nikolay Borisov
  2016-11-24 12:12                     ` Michal Hocko
  1 sibling, 1 reply; 20+ messages in thread
From: Nikolay Borisov @ 2016-11-24 11:45 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Linux MM



On 11/23/2016 09:49 AM, Michal Hocko wrote:
> On Wed 23-11-16 09:44:45, Nikolay Borisov wrote:
>>
>>
>> On 11/22/2016 07:02 PM, Michal Hocko wrote:
>>> On Tue 22-11-16 16:35:38, Nikolay Borisov wrote:
>>>>
>>>>
>>>> On 11/22/2016 04:30 PM, Michal Hocko wrote:
>>>>> On Tue 22-11-16 10:56:51, Nikolay Borisov wrote:
>>>>>>
>>>>>>
>>>>>> On 11/21/2016 07:31 AM, Michal Hocko wrote:
>>>>>>> Hi,
>>>>>>> I am sorry for a late response, but I was offline until this weekend. I
>>>>>>> will try to get to this email ASAP but it might take some time.
>>>>>>
>>>>>> No worries. I did some further digging up and here is what I got, which
>>>>>> I believe is rather strange:
>>>>>>
>>>>>> struct scan_control {
>>>>>>   nr_to_reclaim = 32,
>>>>>>   gfp_mask = 37880010,
>>>>>>   order = 0,
>>>>>>   nodemask = 0x0,
>>>>>>   target_mem_cgroup = 0xffff8823990d1400,
>>>>>>   priority = 7,
>>>>>>   may_writepage = 1,
>>>>>>   may_unmap = 1,
>>>>>>   may_swap = 0,
>>>>>>   may_thrash = 1,
>>>>>>   hibernation_mode = 0,
>>>>>>   compaction_ready = 0,
>>>>>>   nr_scanned = 0,
>>>>>>   nr_reclaimed = 0
>>>>>> }
>>>>>>
>>>>>> Parsing: 37880010
>>>>>> #define ___GFP_HIGHMEM		0x02
>>>>>> #define ___GFP_MOVABLE		0x08
>>>>>> #define ___GFP_IO		0x40
>>>>>> #define ___GFP_FS		0x80
>>>>>> #define ___GFP_HARDWALL		0x20000
>>>>>> #define ___GFP_DIRECT_RECLAIM	0x400000
>>>>>> #define ___GFP_KSWAPD_RECLAIM	0x2000000
>>>>>>
>>>>>> And initial_priority is 12 (DEF_PRIORITY). Given that nr_scanned is 0
>>>>>> and priority is 7 this means we've gone 5 times through the do {} while
>>>>>> in do_try_to_free_pages. Also total_scanned seems to be 0.  Here is the
>>>>>> zone which was being reclaimed :
>>>>
>>>> This is also very strange that total_scanned is 0.
>>>>
>>>>
>>>>>>
>>>>>> http://sprunge.us/hQBi
>>>>>
>>>>> LRUs on that zones seem to be empty from a quick glance. kmem -z in the
>>>>> crash can give you per zone counters much more nicely.
>>>>>
>>>>
>>>> So here are the populated zones:
>>> [...]
>>>> NODE: 0  ZONE: 2  ADDR: ffff88207fffcf00  NAME: "Normal"
>>>>   SIZE: 33030144  MIN/LOW/HIGH: 22209/27761/33313
>>>>   VM_STAT:
>>>>                 NR_FREE_PAGES: 62436
>>>>                NR_ALLOC_BATCH: 2024
>>>>              NR_INACTIVE_ANON: 8177867
>>>>                NR_ACTIVE_ANON: 5407176
>>>>              NR_INACTIVE_FILE: 5804642
>>>>                NR_ACTIVE_FILE: 9694170
>>>
>>> So your LRUs are definitely not empty as I have thought. Having 
>>> 0 pages scanned is indeed very strange. We do reset sc->nr_scanned
>>> for each priority but my understanding was that you are looking at a
>>> state where we are somewhere in the middle of shrink_zones. Moreover
>>> total_scanned should be cumulative.
>>
>> So the server began acting wonky. People logged on to it and saw the
>> softlockup as per my initial email. They then initiated a crashdump via
>> sysrq, since most commands (e.g. forking) weren't going through, so
>> crashing it was a last-resort measure. After that I started looking at the
>> crashdump and observed that prior to the crash the machine seems to have
>> locked up, judging from the dmesg logs. However, when I manually inspect
>> the *current* state (current being at the time the crash was actually
>> initiated) of the processes reported as softlocked up, they seem to
>> have made progress and are now in
>> shrink_zone->shrink_lruvec->shrink_inactive_list->_cond_resched->__schedule
> 
> OK, I see.
> 
>> And the softlockup was shown to be in mem_cgroup_iter. So it's a
>> mystery how this function can softlockup and why, after the softlockup
>> apparently got resolved, reclaim is not making any progress.
> 
> This might be just a coincidence and the lockup might really mean that
> we couldn't isolate (thus scan) any pages at the time the lockup was
> detected. mem_cgroup_iter shouldn't itself loop without any bounds to
> trigger the lockup on its own. There is a loop around
> css_next_descendant_pre but this should take only few iterations in case
> we are racing with cgroup removal AFAIR. So to me it sounds more like a
> problem with the state of LRU lists rather than anything else.
> 

Ok, I think I know what has happened. Inspecting the data structures of
the respective cgroup here is what the mem_cgroup_per_zone looks like:

  zoneinfo[2] =   {
    lruvec = {{
        lists = {
          {
            next = 0xffffea004f98c660,
            prev = 0xffffea0063f6b1a0
          },
          {
            next = 0xffffea0004123120,
            prev = 0xffffea002c2e2260
          },
          {
            next = 0xffff8818c37bb360,
            prev = 0xffff8818c37bb360
          },
          {
            next = 0xffff8818c37bb370,
            prev = 0xffff8818c37bb370
          },
          {
            next = 0xffff8818c37bb380,
            prev = 0xffff8818c37bb380
          }
        },
        reclaim_stat = {
          recent_rotated = {172969085, 43319509},
          recent_scanned = {173112994, 185446658}
        },
        zone = 0xffff88207fffcf00
    }},
    lru_size = {159722, 158714, 0, 0, 0},
    }

So this means that there are inactive_anon and active_anon only -
correct? (The three remaining list_heads point to themselves, i.e. those
LRUs are empty, which matches lru_size = {159722, 158714, 0, 0, 0}.)
Since the machine doesn't have any swap this means the anon memory
has nowhere to go. If I'm interpreting the data correctly then this
explains why reclaim makes no progress. If that's the case then I have
the following questions:

1. Shouldn't reclaim exit at some point rather than being stuck in
reclaim without making further progress.

2. It seems rather strange that there are no (INACTIVE|ACTIVE)_FILE
pages - is this possible?

3. Why hasn't OOM been activated in order to free up some anonymous memory ?


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Softlockup during memory allocation
  2016-11-24 11:45                   ` Nikolay Borisov
@ 2016-11-24 12:12                     ` Michal Hocko
  2016-11-24 13:09                       ` Nikolay Borisov
  0 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2016-11-24 12:12 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Linux MM

On Thu 24-11-16 13:45:03, Nikolay Borisov wrote:
[...]
> Ok, I think I know what has happened. Inspecting the data structures of
> the respective cgroup here is what the mem_cgroup_per_zone looks like:
> 
>   zoneinfo[2] =   {
>     lruvec = {{
>         lists = {
>           {
>             next = 0xffffea004f98c660,
>             prev = 0xffffea0063f6b1a0
>           },
>           {
>             next = 0xffffea0004123120,
>             prev = 0xffffea002c2e2260
>           },
>           {
>             next = 0xffff8818c37bb360,
>             prev = 0xffff8818c37bb360
>           },
>           {
>             next = 0xffff8818c37bb370,
>             prev = 0xffff8818c37bb370
>           },
>           {
>             next = 0xffff8818c37bb380,
>             prev = 0xffff8818c37bb380
>           }
>         },
>         reclaim_stat = {
>           recent_rotated = {172969085, 43319509},
>           recent_scanned = {173112994, 185446658}
>         },
>         zone = 0xffff88207fffcf00
>     }},
>     lru_size = {159722, 158714, 0, 0, 0},
>     }
> 
> So this means that there are inactive_anon and active_anon only -
> correct?

yes. at least in this particular zone.

> Since the machine doesn't have any swap this means anon memory
> has nowhere to go. If I'm interpreting the data correctly then this
> explains why reclaim makes no progress. If that's the case then I have
> the following questions:
> 
> 1. Shouldn't reclaim exit at some point rather than being stuck in
> reclaim without making further progress.

Reclaim (try_to_free_mem_cgroup_pages) has to go down through all the
priorities without a way to get out early. We are not doing any pro-active
checks on whether there is anything reclaimable, but that alone shouldn't
be such a big deal, because shrink_node_memcg should simply do nothing as
get_scan_count will find no pages to scan. So it shouldn't take much
time to realize there is nothing to reclaim and to get back to try_charge,
which retries a few more times and eventually goes OOM. I do not see how
we could trigger RCU stalls here. There shouldn't be any long RCU
critical sections on the way, and there are preemption points on the way.
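
To make that flow concrete, here is a small standalone model of the
behaviour described above (illustrative only - the retry count and helper
names are approximations for this sketch, not taken from the kernel
sources): each charge attempt walks all the priorities, finds nothing to
scan, and after a handful of such cheap passes the charge path should give
up and OOM.

#include <stdio.h>

#define DEF_PRIORITY    12      /* initial_priority seen in the dump */
#define CHARGE_RETRIES  5       /* try_charge retry count; approximate */

/*
 * Stand-in for one shrink_zones() pass: with only anon pages on the LRUs
 * and no swap, get_scan_count() finds nothing, so nothing gets reclaimed.
 */
static unsigned long reclaim_pass(int priority)
{
        (void)priority;
        return 0;
}

int main(void)
{
        int retry;

        for (retry = 0; retry < CHARGE_RETRIES; retry++) {
                unsigned long reclaimed = 0;
                int prio;

                /* do_try_to_free_pages(): walk priorities down to 0 */
                for (prio = DEF_PRIORITY; prio >= 0; prio--)
                        reclaimed += reclaim_pass(prio);

                printf("charge retry %d: reclaimed=%lu\n", retry, reclaimed);
                if (reclaimed)
                        return 0;       /* made progress, charge can succeed */
        }
        printf("no progress after %d retries -> memcg OOM expected\n",
               CHARGE_RETRIES);
        return 0;
}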

> 2. It seems rather strange that there are no (INACTIVE|ACTIVE)_FILE
> pages - is this possible?

All of them might be reclaimed already as a result of the memory
pressure in the memcg. So not all that surprising. But the fact that
you are hitting the limit means that the anonymous pages saturate your
hard limit so your memcg seems underprovisioned.

> 3. Why hasn't OOM been activated in order to free up some anonymous memory ?

It should eventually. Maybe there still were some reclaimable pages in
other zones for this memcg.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Softlockup during memory allocation
  2016-11-24 12:12                     ` Michal Hocko
@ 2016-11-24 13:09                       ` Nikolay Borisov
  2016-11-25  9:00                         ` Michal Hocko
  0 siblings, 1 reply; 20+ messages in thread
From: Nikolay Borisov @ 2016-11-24 13:09 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Linux MM



On 11/24/2016 02:12 PM, Michal Hocko wrote:
> On Thu 24-11-16 13:45:03, Nikolay Borisov wrote:
> [...]
>> Ok, I think I know what has happened. Inspecting the data structures of
>> the respective cgroup here is what the mem_cgroup_per_zone looks like:
>>
>>   zoneinfo[2] =   {
>>     lruvec = {{
>>         lists = {
>>           {
>>             next = 0xffffea004f98c660,
>>             prev = 0xffffea0063f6b1a0
>>           },
>>           {
>>             next = 0xffffea0004123120,
>>             prev = 0xffffea002c2e2260
>>           },
>>           {
>>             next = 0xffff8818c37bb360,
>>             prev = 0xffff8818c37bb360
>>           },
>>           {
>>             next = 0xffff8818c37bb370,
>>             prev = 0xffff8818c37bb370
>>           },
>>           {
>>             next = 0xffff8818c37bb380,
>>             prev = 0xffff8818c37bb380
>>           }
>>         },
>>         reclaim_stat = {
>>           recent_rotated = {172969085, 43319509},
>>           recent_scanned = {173112994, 185446658}
>>         },
>>         zone = 0xffff88207fffcf00
>>     }},
>>     lru_size = {159722, 158714, 0, 0, 0},
>>     }
>>
>> So this means that there are inactive_anon and active_anon only -
>> correct?
> 
> yes. at least in this particular zone.
> 
>> Since the machine doesn't have any swap this means anon memory
>> has nowhere to go. If I'm interpreting the data correctly then this
>> explains why reclaim makes no progress. If that's the case then I have
>> the following questions:
>>
>> 1. Shouldn't reclaim exit at some point rather than being stuck in
>> reclaim without making further progress.
> 
> Reclaim (try_to_free_mem_cgroup_pages) has to go down through all the
> priorities without a way to get out early. We are not doing any pro-active
> checks on whether there is anything reclaimable, but that alone shouldn't
> be such a big deal, because shrink_node_memcg should simply do nothing as
> get_scan_count will find no pages to scan. So it shouldn't take much
> time to realize there is nothing to reclaim and to get back to try_charge,
> which retries a few more times and eventually goes OOM. I do not see how
> we could trigger RCU stalls here. There shouldn't be any long RCU
> critical sections on the way, and there are preemption points on the way.
> 
>> 2. It seems rather strange that there are no (INACTIVE|ACTIVE)_FILE
>> pages - is this possible?
> 
> All of them might be reclaimed already as a result of the memory
> pressure in the memcg. So not all that surprising. But the fact that
> you are hitting the limit means that the anonymous pages saturate your
> hard limit so your memcg seems underprovisioned.
> 
>> 3. Why hasn't OOM been activated in order to free up some anonymous memory ?
> 
> It should eventually. Maybe there still were some reclaimable pages in
> other zones for this memcg.

I just checked all the zones for both nodes (the machines have 2 NUMA
nodes) so essentially there are no reclaimable pages - all are
anonymous. So the pertinent question is why processes are sleeping in
the reclamation path when there are no pages to free. I also observed the
same behavior on a different node, this time the priority was 0 and the
code hasn't resorted to OOM. This seems all too strange.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Softlockup during memory allocation
  2016-11-24 13:09                       ` Nikolay Borisov
@ 2016-11-25  9:00                         ` Michal Hocko
  0 siblings, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2016-11-25  9:00 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Linux MM

On Thu 24-11-16 15:09:38, Nikolay Borisov wrote:
[...]
> I just checked all the zones for both nodes (the machines have 2 NUMA
> nodes) so essentially there are no reclaimable pages - all are
> anonymous. So the pertinent question is why processes are sleeping in
> the reclamation path when there are no pages to free. I also observed the
> same behavior on a different node, this time the priority was 0 and the
> code hasn't resorted to OOM. This seems all too strange.

From my experience we usually hit the memcg OOM quickly if there are no
reclaimable pages. I do not remember anything big changed recently in
that area. Could you enable mm_vmscan_memcg_isolate,
mm_vmscan_memcg_reclaim_{begin,end} and mm_vmscan_lru_shrink_inactive
tracepoints to see what is going on during the reclaim?

I suspect we will see some minor reclaim activity there which will
basically "livelock" the OOM, preventing it from happening, but let's see.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2016-11-25  9:00 UTC | newest]

Thread overview: 20+ messages
2016-11-01  8:12 Softlockup during memory allocation Nikolay Borisov
2016-11-01  8:16 ` Nikolay Borisov
2016-11-02 19:00 ` Vlastimil Babka
2016-11-04  3:46   ` Hugh Dickins
2016-11-04 12:18 ` Nikolay Borisov
2016-11-13 22:02   ` Nikolay Borisov
2016-11-21  5:31     ` Michal Hocko
2016-11-22  8:56       ` Nikolay Borisov
2016-11-22 14:30         ` Michal Hocko
2016-11-22 14:32           ` Michal Hocko
2016-11-22 14:46             ` Nikolay Borisov
2016-11-22 14:35           ` Nikolay Borisov
2016-11-22 17:02             ` Michal Hocko
2016-11-23  7:44               ` Nikolay Borisov
2016-11-23  7:49                 ` Michal Hocko
2016-11-23  7:50                   ` Michal Hocko
2016-11-24 11:45                   ` Nikolay Borisov
2016-11-24 12:12                     ` Michal Hocko
2016-11-24 13:09                       ` Nikolay Borisov
2016-11-25  9:00                         ` Michal Hocko
