* [PATCH] mm,writeback: Don't use ALLOC_NO_WATERMARKS for wb_start_writeback
@ 2016-03-13  5:32 Tetsuo Handa
  2016-03-13 14:22   ` Tetsuo Handa
  0 siblings, 1 reply; 15+ messages in thread
From: Tetsuo Handa @ 2016-03-13  5:32 UTC (permalink / raw)
  To: mhocko, viro, tj; +Cc: linux-mm, linux-fsdevel, Tetsuo Handa

When a writeback operation cannot make forward progress because the memory
allocation requests needed for doing I/O cannot be satisfied (e.g. under an
OOM-livelock situation), we can observe a flood of order-0 page allocation
failure messages caused by complete depletion of the memory reserves.

This is caused by unconditionally allocating "struct wb_writeback_work"
objects using GFP_ATOMIC from a PF_MEMALLOC context.

__alloc_pages_nodemask() {
  __alloc_pages_slowpath() {
    __alloc_pages_direct_reclaim() {
      __perform_reclaim() {
        current->flags |= PF_MEMALLOC;
        try_to_free_pages() {
          do_try_to_free_pages() {
            wakeup_flusher_threads() {
              wb_start_writeback() {
                kzalloc(sizeof(*work), GFP_ATOMIC) {
                  /* ALLOC_NO_WATERMARKS via PF_MEMALLOC */
                }
              }
            }
          }
        }
        current->flags &= ~PF_MEMALLOC;
      }
    }
  }
}
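
For reference, the PF_MEMALLOC -> ALLOC_NO_WATERMARKS connection is made in
the page allocator's slowpath. Below is only a simplified sketch of the
4.5-era gfp_to_alloc_flags() logic in mm/page_alloc.c, paraphrased from
memory rather than quoted, but it shows why a GFP_ATOMIC request issued
while PF_MEMALLOC is set bypasses the watermarks, and why __GFP_NOMEMALLOC
is enough to opt out of that:

	/* simplified; the real code also sets ALLOC_HIGH, ALLOC_HARDER etc. */
	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
		if (gfp_mask & __GFP_MEMALLOC)
			alloc_flags |= ALLOC_NO_WATERMARKS;
		else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
			alloc_flags |= ALLOC_NO_WATERMARKS;
		else if (!in_interrupt() &&
			 ((current->flags & PF_MEMALLOC) ||
			  unlikely(test_thread_flag(TIF_MEMDIE))))
			alloc_flags |= ALLOC_NO_WATERMARKS;
	}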

Since I/O is stalling, endlessly allocating writeback requests will
eventually deplete the memory reserves. Fortunately, since
wb_start_writeback() can fall back to wb_wakeup() when allocating a
"struct wb_writeback_work" fails, we do not need to use
ALLOC_NO_WATERMARKS for wb_start_writeback().
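
The fallback is cheap because wb_wakeup() does not allocate anything; if I
read the 4.5-era fs/fs-writeback.c correctly, it merely kicks the
already-allocated delayed work item (a sketch from memory, details may
differ):

	static void wb_wakeup(struct bdi_writeback *wb)
	{
		spin_lock_bh(&wb->work_lock);
		if (test_bit(WB_registered, &wb->state))
			mod_delayed_work(bdi_wq, &wb->dwork, 0);
		spin_unlock_bh(&wb->work_lock);
	}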

----------
[   59.562581] Mem-Info:
[   59.563935] active_anon:289393 inactive_anon:2093 isolated_anon:29
[   59.563935]  active_file:10838 inactive_file:113013 isolated_file:859
[   59.563935]  unevictable:0 dirty:108531 writeback:5308 unstable:0
[   59.563935]  slab_reclaimable:5526 slab_unreclaimable:7077
[   59.563935]  mapped:9970 shmem:2159 pagetables:2387 bounce:0
[   59.563935]  free:3042 free_pcp:0 free_cma:0
[   59.574558] Node 0 DMA free:6968kB min:44kB low:52kB high:64kB active_anon:6056kB inactive_anon:176kB active_file:712kB inactive_file:744kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:756kB writeback:0kB mapped:736kB shmem:184kB slab_reclaimable:48kB slab_unreclaimable:208kB kernel_stack:160kB pagetables:144kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9708 all_unreclaimable? yes
[   59.585464] lowmem_reserve[]: 0 1732 1732 1732
[   59.587123] Node 0 DMA32 free:5200kB min:5200kB low:6500kB high:7800kB active_anon:1151516kB inactive_anon:8196kB active_file:42640kB inactive_file:451076kB unevictable:0kB isolated(anon):116kB isolated(file):3564kB present:2080640kB managed:1775332kB mlocked:0kB dirty:433368kB writeback:21232kB mapped:39144kB shmem:8452kB slab_reclaimable:22056kB slab_unreclaimable:28100kB kernel_stack:20976kB pagetables:9404kB unstable:0kB bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2701604 all_unreclaimable? no
[   59.599649] lowmem_reserve[]: 0 0 0 0
[   59.601431] Node 0 DMA: 25*4kB (UME) 16*8kB (UME) 3*16kB (UE) 5*32kB (UME) 2*64kB (UM) 2*128kB (ME) 2*256kB (ME) 1*512kB (E) 1*1024kB (E) 2*2048kB (ME) 0*4096kB = 6964kB
[   59.606509] Node 0 DMA32: 925*4kB (UME) 140*8kB (UME) 5*16kB (ME) 5*32kB (M) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 5060kB
[   59.610415] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[   59.612879] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   59.615308] 126847 total pagecache pages
[   59.616921] 0 pages in swap cache
[   59.618475] Swap cache stats: add 0, delete 0, find 0/0
[   59.620268] Free swap  = 0kB
[   59.621650] Total swap = 0kB
[   59.623011] 524157 pages RAM
[   59.624365] 0 pages HighMem/MovableOnly
[   59.625893] 76348 pages reserved
[   59.627506] 0 pages hwpoisoned
[   59.628838] Out of memory: Kill process 4450 (file_io.00) score 998 or sacrifice child
[   59.631071] Killed process 4450 (file_io.00) total-vm:4308kB, anon-rss:100kB, file-rss:1184kB, shmem-rss:0kB
[   61.526353] kthreadd: page allocation failure: order:0, mode:0x2200020
[   61.527976] file_io.00: page allocation failure: order:0, mode:0x2200020
[   61.527978] CPU: 0 PID: 4457 Comm: file_io.00 Not tainted 4.5.0-rc7+ #45
[   61.527979] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[   61.527981]  0000000000000086 000000000005bb2d ffff88006cc5b588 ffffffff812a4d65
[   61.527982]  0000000002200020 0000000000000000 ffff88006cc5b618 ffffffff81106dc7
[   61.527983]  0000000000000000 ffffffffffffffff 00ff880000000000 ffff880000000004
[   61.527983] Call Trace:
[   61.528009]  [<ffffffff812a4d65>] dump_stack+0x4d/0x68
[   61.528012]  [<ffffffff81106dc7>] warn_alloc_failed+0xf7/0x150
[   61.528014]  [<ffffffff81109e3f>] __alloc_pages_nodemask+0x23f/0xa60
[   61.528016]  [<ffffffff81137770>] ? page_check_address_transhuge+0x350/0x350
[   61.528018]  [<ffffffff8111327d>] ? page_evictable+0xd/0x40
[   61.528019]  [<ffffffff8114d927>] alloc_pages_current+0x87/0x110
[   61.528021]  [<ffffffff81155181>] new_slab+0x3a1/0x440
[   61.528023]  [<ffffffff81156fdf>] ___slab_alloc+0x3cf/0x590
[   61.528024]  [<ffffffff811a0999>] ? wb_start_writeback+0x39/0x90
[   61.528027]  [<ffffffff815a7f68>] ? preempt_schedule_common+0x1f/0x37
[   61.528028]  [<ffffffff815a7f9f>] ? preempt_schedule+0x1f/0x30
[   61.528030]  [<ffffffff81001012>] ? ___preempt_schedule+0x12/0x14
[   61.528030]  [<ffffffff811a0999>] ? wb_start_writeback+0x39/0x90
[   61.528032]  [<ffffffff81175536>] __slab_alloc.isra.64+0x18/0x1d
[   61.528033]  [<ffffffff8115778c>] kmem_cache_alloc+0x11c/0x150
[   61.528034]  [<ffffffff811a0999>] wb_start_writeback+0x39/0x90
[   61.528035]  [<ffffffff811a0d9f>] wakeup_flusher_threads+0x7f/0xf0
[   61.528036]  [<ffffffff81115ac9>] do_try_to_free_pages+0x1f9/0x410
[   61.528037]  [<ffffffff81115d74>] try_to_free_pages+0x94/0xc0
[   61.528038]  [<ffffffff8110a166>] __alloc_pages_nodemask+0x566/0xa60
[   61.528040]  [<ffffffff81200878>] ? xfs_bmapi_read+0x208/0x2f0
[   61.528041]  [<ffffffff8114d927>] alloc_pages_current+0x87/0x110
[   61.528042]  [<ffffffff8110092f>] __page_cache_alloc+0xaf/0xc0
[   61.528043]  [<ffffffff811011e8>] pagecache_get_page+0x88/0x260
[   61.528044]  [<ffffffff81101d31>] grab_cache_page_write_begin+0x21/0x40
[   61.528046]  [<ffffffff81222c9f>] xfs_vm_write_begin+0x2f/0xf0
[   61.528047]  [<ffffffff810b14be>] ? current_fs_time+0x1e/0x30
[   61.528048]  [<ffffffff81101eca>] generic_perform_write+0xca/0x1c0
[   61.528050]  [<ffffffff8107c390>] ? wake_up_process+0x10/0x20
[   61.528051]  [<ffffffff8122e01c>] xfs_file_buffered_aio_write+0xcc/0x1f0
[   61.528052]  [<ffffffff81079037>] ? finish_task_switch+0x77/0x280
[   61.528053]  [<ffffffff8122e1c4>] xfs_file_write_iter+0x84/0x140
[   61.528054]  [<ffffffff811777a7>] __vfs_write+0xc7/0x100
[   61.528055]  [<ffffffff811784cd>] vfs_write+0x9d/0x190
[   61.528056]  [<ffffffff810010a1>] ? do_audit_syscall_entry+0x61/0x70
[   61.528057]  [<ffffffff811793c0>] SyS_write+0x50/0xc0
[   61.528059]  [<ffffffff815ab4d7>] entry_SYSCALL_64_fastpath+0x12/0x6a
[   61.528059] Mem-Info:
[   61.528062] active_anon:293335 inactive_anon:2093 isolated_anon:0
[   61.528062]  active_file:10829 inactive_file:110045 isolated_file:32
[   61.528062]  unevictable:0 dirty:109275 writeback:822 unstable:0
[   61.528062]  slab_reclaimable:5489 slab_unreclaimable:10070
[   61.528062]  mapped:9999 shmem:2159 pagetables:2420 bounce:0
[   61.528062]  free:3 free_pcp:0 free_cma:0
[   61.528065] Node 0 DMA free:12kB min:44kB low:52kB high:64kB active_anon:6060kB inactive_anon:176kB active_file:708kB inactive_file:756kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:756kB writeback:0kB mapped:736kB shmem:184kB slab_reclaimable:48kB slab_unreclaimable:7160kB kernel_stack:160kB pagetables:144kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9844 all_unreclaimable? yes
[   61.528066] lowmem_reserve[]: 0 1732 1732 1732
[   61.528068] Node 0 DMA32 free:0kB min:5200kB low:6500kB high:7800kB active_anon:1167280kB inactive_anon:8196kB active_file:42608kB inactive_file:439424kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:2080640kB managed:1775332kB mlocked:0kB dirty:436344kB writeback:3288kB mapped:39260kB shmem:8452kB slab_reclaimable:21908kB slab_unreclaimable:33120kB kernel_stack:20976kB pagetables:9536kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:11073180 all_unreclaimable? yes
[   61.528069] lowmem_reserve[]: 0 0 0 0
[   61.528072] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[   61.528074] Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[   61.528075] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[   61.528075] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   61.528076] 123086 total pagecache pages
[   61.528076] 0 pages in swap cache
[   61.528077] Swap cache stats: add 0, delete 0, find 0/0
[   61.528077] Free swap  = 0kB
[   61.528077] Total swap = 0kB
[   61.528077] 524157 pages RAM
[   61.528078] 0 pages HighMem/MovableOnly
[   61.528078] 76348 pages reserved
[   61.528078] 0 pages hwpoisoned
[   61.528079] SLUB: Unable to allocate memory on node -1 (gfp=0x2088020)
[   61.528080]   cache: kmalloc-64, object size: 64, buffer size: 64, default order: 0, min order: 0
[   61.528080]   node 0: slabs: 3218, objs: 205952, free: 0
[   61.528085] file_io.00: page allocation failure: order:0, mode:0x2200020
[   61.528086] CPU: 0 PID: 4457 Comm: file_io.00 Not tainted 4.5.0-rc7+ #45
----------

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 fs/fs-writeback.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 5c46ed9..d4e13ec 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -929,7 +929,8 @@ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
 	 * This is WB_SYNC_NONE writeback, so if allocation fails just
 	 * wakeup the thread for old dirty data writeback
 	 */
-	work = kzalloc(sizeof(*work), GFP_ATOMIC);
+	work = kzalloc(sizeof(*work),
+		       GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN);
 	if (!work) {
 		trace_writeback_nowork(wb);
 		wb_wakeup(wb);
-- 
1.8.3.1


* Re: [PATCH] mm,writeback: Don't use memory reserves for wb_start_writeback
  2016-03-13  5:32 [PATCH] mm,writeback: Don't use ALLOC_NO_WATERMARKS for wb_start_writeback Tetsuo Handa
@ 2016-03-13 14:22   ` Tetsuo Handa
  0 siblings, 0 replies; 15+ messages in thread
From: Tetsuo Handa @ 2016-03-13 14:22 UTC (permalink / raw)
  To: mhocko, viro, tj; +Cc: linux-mm, linux-fsdevel

Tetsuo Handa wrote:
> Since I/O is stalling, endlessly allocating writeback requests will
> eventually deplete the memory reserves. Fortunately, since
> wb_start_writeback() can fall back to wb_wakeup() when allocating a
> "struct wb_writeback_work" fails, we do not need to use
> ALLOC_NO_WATERMARKS for wb_start_writeback().

Well, maybe we should not use memory reserves at all.

I retested with this patch and the kmallocwd patch applied. While depletion
of memory reserves by wb_start_writeback() no longer occurs, I can still
observe order-0 page allocation failure messages caused by GFP_ATOMIC
allocation requests, because wb_start_writeback() had already consumed the
memory reserves down to the level where GFP_ATOMIC starts failing.
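
(For reference, and as background for the GFP_NOWAIT variant below: in the
4.5-era include/linux/gfp.h the two differ only in __GFP_HIGH/__GFP_ATOMIC,
which is what grants GFP_ATOMIC partial access to the reserves below the
min watermark; quoted from memory, so double-check the exact definitions.)

	#define GFP_ATOMIC	(__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
	#define GFP_NOWAIT	(__GFP_KSWAPD_RECLAIM)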

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20160313.txt.xz .
----------
[   89.794733] swapper/2: page allocation failure: order:0, mode:0x2200020(GFP_NOWAIT|__GFP_HIGH|__GFP_NOTRACK)
[   89.804714] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.5.0-rc7-next-20160311+ #399
[   89.813049] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[   89.819803]  0000000000000086 631ee38347526c6b ffff88007fc83640 ffffffff812b5c97
[   89.823859]  0000000000000000 0000000000000000 ffff88007fc836d0 ffffffff811131b1
[   89.827917]  02200020ffffffff 0000000000000040 fffffffffffffffe 0000000000000000
[   89.831879] Call Trace:
[   89.834214]  <IRQ>  [<ffffffff812b5c97>] dump_stack+0x4f/0x68
[   89.837652]  [<ffffffff811131b1>] warn_alloc_failed+0x101/0x160
[   89.841115]  [<ffffffff81116521>] __alloc_pages_nodemask+0x481/0xd50
[   89.844677]  [<ffffffff8115b977>] alloc_pages_current+0x87/0x110
[   89.848158]  [<ffffffff81163fe0>] new_slab+0x540/0x550
[   89.851355]  [<ffffffff8116624d>] ___slab_alloc+0x46d/0x580
[   89.854856]  [<ffffffffa0317180>] ? __nf_ct_ext_add_length+0x1a0/0x1e0 [nf_conntrack]
[   89.858996]  [<ffffffffa0310e2d>] ? __nf_conntrack_alloc.isra.35+0x5d/0x1b0 [nf_conntrack]
[   89.863170]  [<ffffffff8116611f>] ? ___slab_alloc+0x33f/0x580
[   89.866604]  [<ffffffff8110ff20>] ? mempool_alloc_slab+0x10/0x20
[   89.870141]  [<ffffffff8110ff20>] ? mempool_alloc_slab+0x10/0x20
[   89.873866]  [<ffffffff8118435b>] __slab_alloc.isra.68+0x46/0x55
[   89.877354]  [<ffffffffa0317180>] ? __nf_ct_ext_add_length+0x1a0/0x1e0 [nf_conntrack]
[   89.881370]  [<ffffffffa0317180>] ? __nf_ct_ext_add_length+0x1a0/0x1e0 [nf_conntrack]
[   89.885328]  [<ffffffff81166666>] __kmalloc+0x146/0x190
[   89.888559]  [<ffffffffa0317180>] __nf_ct_ext_add_length+0x1a0/0x1e0 [nf_conntrack]
[   89.892538]  [<ffffffffa0310ebb>] ? __nf_conntrack_alloc.isra.35+0xeb/0x1b0 [nf_conntrack]
[   89.896557]  [<ffffffffa031150c>] nf_conntrack_in+0x56c/0x830 [nf_conntrack]
[   89.900166]  [<ffffffffa0330327>] ipv4_conntrack_in+0x17/0x20 [nf_conntrack_ipv4]
[   89.903902]  [<ffffffff81501848>] nf_iterate+0x58/0x70
[   89.906570]  [<ffffffff815018d6>] nf_hook_slow+0x76/0xd0
[   89.908271]  [<ffffffff8150b0a8>] ip_rcv+0x2f8/0x410
[   89.909803]  [<ffffffff8150a7f0>] ? ip_local_deliver_finish+0x1e0/0x1e0
[   89.911516]  [<ffffffff814cc9f4>] __netif_receive_skb_core+0x354/0x9b0
[   89.913284]  [<ffffffff8153c79f>] ? udp4_gro_receive+0x1ef/0x2a0
[   89.914906]  [<ffffffff81544bd2>] ? inet_gro_receive+0x92/0x230
[   89.916542]  [<ffffffff814cefe3>] __netif_receive_skb+0x13/0x60
[   89.918099]  [<ffffffff814cf0a6>] netif_receive_skb_internal+0x76/0xd0
[   89.919710]  [<ffffffff814cfae8>] napi_gro_receive+0x78/0xc0
[   89.921399]  [<ffffffffa0065e43>] e1000_clean_rx_irq+0x153/0x490 [e1000]
[   89.923038]  [<ffffffffa0063ccf>] e1000_clean+0x25f/0x8b0 [e1000]
[   89.924601]  [<ffffffff8107fb31>] ? check_preempt_curr+0x71/0x90
[   89.926098]  [<ffffffff814d0acb>] net_rx_action+0x14b/0x320
[   89.927636]  [<ffffffff81060cd1>] __do_softirq+0xd1/0x250
[   89.929034]  [<ffffffff810610d4>] irq_exit+0xe4/0x100
[   89.930443]  [<ffffffff8101b2bd>] do_IRQ+0x5d/0xf0
[   89.931769]  [<ffffffff815cc6c9>] common_interrupt+0x89/0x89
[   89.933210]  <EOI>  [<ffffffff81022e0b>] ? default_idle+0xb/0x20
[   89.935050]  [<ffffffff8102340a>] arch_cpu_idle+0xa/0x10
[   89.936480]  [<ffffffff81098625>] default_idle_call+0x25/0x30
[   89.938031]  [<ffffffff81098845>] cpu_startup_entry+0x215/0x2a0
[   89.939517]  [<ffffffff8103924b>] start_secondary+0x14b/0x170
[   89.941103] Mem-Info:
[   89.942132] active_anon:288561 inactive_anon:2093 isolated_anon:0
[   89.942132]  active_file:10819 inactive_file:114585 isolated_file:32
[   89.942132]  unevictable:0 dirty:113936 writeback:679 unstable:0
[   89.942132]  slab_reclaimable:5394 slab_unreclaimable:7802
[   89.942132]  mapped:10005 shmem:2159 pagetables:2415 bounce:0
[   89.942132]  free:2367 free_pcp:98 free_cma:0
[   89.950633] Node 0 DMA free:6952kB min:44kB low:56kB high:68kB active_anon:5308kB inactive_anon:140kB active_file:548kB inactive_file:1452kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:1452kB writeback:0kB mapped:540kB shmem:148kB slab_reclaimable:84kB slab_unreclaimable:544kB kernel_stack:384kB pagetables:56kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:12224 all_unreclaimable? yes
[   89.960099] lowmem_reserve[]: 0 1732 1732 1732
[   89.961626] Node 0 DMA32 free:2516kB min:5200kB low:6972kB high:8744kB active_anon:1148936kB inactive_anon:8232kB active_file:42728kB inactive_file:456888kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:2080640kB managed:1775252kB mlocked:0kB dirty:454292kB writeback:2716kB mapped:39480kB shmem:8488kB slab_reclaimable:21492kB slab_unreclaimable:30664kB kernel_stack:20960kB pagetables:9604kB unstable:0kB bounce:0kB free_pcp:392kB local_pcp:88kB free_cma:0kB writeback_tmp:0kB pages_scanned:7473108 all_unreclaimable? yes
[   89.972153] lowmem_reserve[]: 0 0 0 0
[   89.973640] Node 0 DMA: 18*4kB (UM) 18*8kB (UE) 9*16kB (UE) 4*32kB (UE) 5*64kB (UME) 4*128kB (UME) 4*256kB (UE) 5*512kB (UME) 2*1024kB (UE) 0*2048kB 0*4096kB = 6952kB
[   89.978150] Node 0 DMA32: 281*4kB (UME) 172*8kB (UM) 1*16kB (E) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 2516kB
[   89.981301] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[   89.983498] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   89.985688] 127595 total pagecache pages
[   89.987068] 0 pages in swap cache
[   89.988344] Swap cache stats: add 0, delete 0, find 0/0
[   89.990303] Free swap  = 0kB
[   89.991505] Total swap = 0kB
[   89.992755] 524157 pages RAM
[   89.993945] 0 pages HighMem/MovableOnly
[   89.995279] 76368 pages reserved
[   89.996633] 0 pages hwpoisoned
[   89.997845] SLUB: Unable to allocate memory on node -1, gfp=0x2088020(GFP_ATOMIC|__GFP_ZERO)
[   90.000147]   cache: kmalloc-64, object size: 64, buffer size: 64, default order: 0, min order: 0
[   90.002463]   node 0: slabs: 728, objs: 46592, free: 0
[   90.549072] swapper/2: page allocation failure: order:0, mode:0x2200020(GFP_NOWAIT|__GFP_HIGH|__GFP_NOTRACK)
[   90.559150] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.5.0-rc7-next-20160311+ #399
----------

While getting no messages printed under an OOM-livelock situation is
annoying, and the page allocation failure messages triggered by GFP_ATOMIC
do help us notice that we are in an OOM-livelock situation, we would need
to kill more processes if we allowed wb_start_writeback() to consume half
of the memory reserves.

wb_start_writeback() should not be allowed to consume memory all the way
down to the min: watermark, so that other GFP_NOIO allocations can still
succeed. But there is no gfp flag for that.
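
As a rough illustration (paraphrasing the long-standing
__zone_watermark_ok() logic in mm/page_alloc.c from memory, not quoting
it): GFP_ATOMIC gets ALLOC_HIGH and ALLOC_HARDER and can therefore dig
down to roughly 3/8 of min, i.e. it is allowed to consume a bit more than
half of the reserve below min; PF_MEMALLOC gets ALLOC_NO_WATERMARKS and
can go all the way to zero; a GFP_NOWAIT | __GFP_NOMEMALLOC request gets
neither and fails as soon as free memory reaches min.

	long min = mark;		/* normally the zone's min watermark */

	if (alloc_flags & ALLOC_HIGH)	/* __GFP_HIGH, e.g. GFP_ATOMIC */
		min -= min / 2;
	if (alloc_flags & ALLOC_HARDER)	/* atomic and not __GFP_NOMEMALLOC */
		min -= min / 4;
	/* ALLOC_NO_WATERMARKS callers skip this check entirely */
	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
		return false;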

Anyway, please pick up the patch below if you think GFP_NOWAIT is better
than GFP_ATOMIC.
----------------------------------------
From 5d43acbc5849a63494a732e39374692822145923 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sun, 13 Mar 2016 23:03:05 +0900
Subject: [PATCH] mm,writeback: Don't use memory reserves for
 wb_start_writeback

When a writeback operation cannot make forward progress because the memory
allocation requests needed for doing I/O cannot be satisfied (e.g. under an
OOM-livelock situation), we can observe a flood of order-0 page allocation
failure messages caused by complete depletion of the memory reserves.

This is caused by unconditionally allocating "struct wb_writeback_work"
objects using GFP_ATOMIC from a PF_MEMALLOC context.

__alloc_pages_nodemask() {
  __alloc_pages_slowpath() {
    __alloc_pages_direct_reclaim() {
      __perform_reclaim() {
        current->flags |= PF_MEMALLOC;
        try_to_free_pages() {
          do_try_to_free_pages() {
            wakeup_flusher_threads() {
              wb_start_writeback() {
                kzalloc(sizeof(*work), GFP_ATOMIC) {
                  /* ALLOC_NO_WATERMARKS via PF_MEMALLOC */
                }
              }
            }
          }
        }
        current->flags &= ~PF_MEMALLOC;
      }
    }
  }
}

Since I/O is stalling, endlessly allocating writeback requests will
eventually deplete the memory reserves. Fortunately, since
wb_start_writeback() can fall back to wb_wakeup() when allocating a
"struct wb_writeback_work" fails, we do not need to allow
wb_start_writeback() to use memory reserves.

----------
[   59.562581] Mem-Info:
[   59.563935] active_anon:289393 inactive_anon:2093 isolated_anon:29
[   59.563935]  active_file:10838 inactive_file:113013 isolated_file:859
[   59.563935]  unevictable:0 dirty:108531 writeback:5308 unstable:0
[   59.563935]  slab_reclaimable:5526 slab_unreclaimable:7077
[   59.563935]  mapped:9970 shmem:2159 pagetables:2387 bounce:0
[   59.563935]  free:3042 free_pcp:0 free_cma:0
[   59.574558] Node 0 DMA free:6968kB min:44kB low:52kB high:64kB active_anon:6056kB inactive_anon:176kB active_file:712kB inactive_file:744kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:756kB writeback:0kB mapped:736kB shmem:184kB slab_reclaimable:48kB slab_unreclaimable:208kB kernel_stack:160kB pagetables:144kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9708 all_unreclaimable? yes
[   59.585464] lowmem_reserve[]: 0 1732 1732 1732
[   59.587123] Node 0 DMA32 free:5200kB min:5200kB low:6500kB high:7800kB active_anon:1151516kB inactive_anon:8196kB active_file:42640kB inactive_file:451076kB unevictable:0kB isolated(anon):116kB isolated(file):3564kB present:2080640kB managed:1775332kB mlocked:0kB dirty:433368kB writeback:21232kB mapped:39144kB shmem:8452kB slab_reclaimable:22056kB slab_unreclaimable:28100kB kernel_stack:20976kB pagetables:9404kB unstable:0kB bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2701604 all_unreclaimable? no
[   59.599649] lowmem_reserve[]: 0 0 0 0
[   59.601431] Node 0 DMA: 25*4kB (UME) 16*8kB (UME) 3*16kB (UE) 5*32kB (UME) 2*64kB (UM) 2*128kB (ME) 2*256kB (ME) 1*512kB (E) 1*1024kB (E) 2*2048kB (ME) 0*4096kB = 6964kB
[   59.606509] Node 0 DMA32: 925*4kB (UME) 140*8kB (UME) 5*16kB (ME) 5*32kB (M) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 5060kB
[   59.610415] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[   59.612879] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   59.615308] 126847 total pagecache pages
[   59.616921] 0 pages in swap cache
[   59.618475] Swap cache stats: add 0, delete 0, find 0/0
[   59.620268] Free swap  = 0kB
[   59.621650] Total swap = 0kB
[   59.623011] 524157 pages RAM
[   59.624365] 0 pages HighMem/MovableOnly
[   59.625893] 76348 pages reserved
[   59.627506] 0 pages hwpoisoned
[   59.628838] Out of memory: Kill process 4450 (file_io.00) score 998 or sacrifice child
[   59.631071] Killed process 4450 (file_io.00) total-vm:4308kB, anon-rss:100kB, file-rss:1184kB, shmem-rss:0kB
[   61.526353] kthreadd: page allocation failure: order:0, mode:0x2200020
[   61.527976] file_io.00: page allocation failure: order:0, mode:0x2200020
[   61.527978] CPU: 0 PID: 4457 Comm: file_io.00 Not tainted 4.5.0-rc7+ #45
[   61.527979] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[   61.527981]  0000000000000086 000000000005bb2d ffff88006cc5b588 ffffffff812a4d65
[   61.527982]  0000000002200020 0000000000000000 ffff88006cc5b618 ffffffff81106dc7
[   61.527983]  0000000000000000 ffffffffffffffff 00ff880000000000 ffff880000000004
[   61.527983] Call Trace:
[   61.528009]  [<ffffffff812a4d65>] dump_stack+0x4d/0x68
[   61.528012]  [<ffffffff81106dc7>] warn_alloc_failed+0xf7/0x150
[   61.528014]  [<ffffffff81109e3f>] __alloc_pages_nodemask+0x23f/0xa60
[   61.528016]  [<ffffffff81137770>] ? page_check_address_transhuge+0x350/0x350
[   61.528018]  [<ffffffff8111327d>] ? page_evictable+0xd/0x40
[   61.528019]  [<ffffffff8114d927>] alloc_pages_current+0x87/0x110
[   61.528021]  [<ffffffff81155181>] new_slab+0x3a1/0x440
[   61.528023]  [<ffffffff81156fdf>] ___slab_alloc+0x3cf/0x590
[   61.528024]  [<ffffffff811a0999>] ? wb_start_writeback+0x39/0x90
[   61.528027]  [<ffffffff815a7f68>] ? preempt_schedule_common+0x1f/0x37
[   61.528028]  [<ffffffff815a7f9f>] ? preempt_schedule+0x1f/0x30
[   61.528030]  [<ffffffff81001012>] ? ___preempt_schedule+0x12/0x14
[   61.528030]  [<ffffffff811a0999>] ? wb_start_writeback+0x39/0x90
[   61.528032]  [<ffffffff81175536>] __slab_alloc.isra.64+0x18/0x1d
[   61.528033]  [<ffffffff8115778c>] kmem_cache_alloc+0x11c/0x150
[   61.528034]  [<ffffffff811a0999>] wb_start_writeback+0x39/0x90
[   61.528035]  [<ffffffff811a0d9f>] wakeup_flusher_threads+0x7f/0xf0
[   61.528036]  [<ffffffff81115ac9>] do_try_to_free_pages+0x1f9/0x410
[   61.528037]  [<ffffffff81115d74>] try_to_free_pages+0x94/0xc0
[   61.528038]  [<ffffffff8110a166>] __alloc_pages_nodemask+0x566/0xa60
[   61.528040]  [<ffffffff81200878>] ? xfs_bmapi_read+0x208/0x2f0
[   61.528041]  [<ffffffff8114d927>] alloc_pages_current+0x87/0x110
[   61.528042]  [<ffffffff8110092f>] __page_cache_alloc+0xaf/0xc0
[   61.528043]  [<ffffffff811011e8>] pagecache_get_page+0x88/0x260
[   61.528044]  [<ffffffff81101d31>] grab_cache_page_write_begin+0x21/0x40
[   61.528046]  [<ffffffff81222c9f>] xfs_vm_write_begin+0x2f/0xf0
[   61.528047]  [<ffffffff810b14be>] ? current_fs_time+0x1e/0x30
[   61.528048]  [<ffffffff81101eca>] generic_perform_write+0xca/0x1c0
[   61.528050]  [<ffffffff8107c390>] ? wake_up_process+0x10/0x20
[   61.528051]  [<ffffffff8122e01c>] xfs_file_buffered_aio_write+0xcc/0x1f0
[   61.528052]  [<ffffffff81079037>] ? finish_task_switch+0x77/0x280
[   61.528053]  [<ffffffff8122e1c4>] xfs_file_write_iter+0x84/0x140
[   61.528054]  [<ffffffff811777a7>] __vfs_write+0xc7/0x100
[   61.528055]  [<ffffffff811784cd>] vfs_write+0x9d/0x190
[   61.528056]  [<ffffffff810010a1>] ? do_audit_syscall_entry+0x61/0x70
[   61.528057]  [<ffffffff811793c0>] SyS_write+0x50/0xc0
[   61.528059]  [<ffffffff815ab4d7>] entry_SYSCALL_64_fastpath+0x12/0x6a
[   61.528059] Mem-Info:
[   61.528062] active_anon:293335 inactive_anon:2093 isolated_anon:0
[   61.528062]  active_file:10829 inactive_file:110045 isolated_file:32
[   61.528062]  unevictable:0 dirty:109275 writeback:822 unstable:0
[   61.528062]  slab_reclaimable:5489 slab_unreclaimable:10070
[   61.528062]  mapped:9999 shmem:2159 pagetables:2420 bounce:0
[   61.528062]  free:3 free_pcp:0 free_cma:0
[   61.528065] Node 0 DMA free:12kB min:44kB low:52kB high:64kB active_anon:6060kB inactive_anon:176kB active_file:708kB inactive_file:756kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:756kB writeback:0kB mapped:736kB shmem:184kB slab_reclaimable:48kB slab_unreclaimable:7160kB kernel_stack:160kB pagetables:144kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9844 all_unreclaimable? yes
[   61.528066] lowmem_reserve[]: 0 1732 1732 1732
[   61.528068] Node 0 DMA32 free:0kB min:5200kB low:6500kB high:7800kB active_anon:1167280kB inactive_anon:8196kB active_file:42608kB inactive_file:439424kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:2080640kB managed:1775332kB mlocked:0kB dirty:436344kB writeback:3288kB mapped:39260kB shmem:8452kB slab_reclaimable:21908kB slab_unreclaimable:33120kB kernel_stack:20976kB pagetables:9536kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:11073180 all_unreclaimable? yes
[   61.528069] lowmem_reserve[]: 0 0 0 0
[   61.528072] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[   61.528074] Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[   61.528075] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[   61.528075] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   61.528076] 123086 total pagecache pages
[   61.528076] 0 pages in swap cache
[   61.528077] Swap cache stats: add 0, delete 0, find 0/0
[   61.528077] Free swap  = 0kB
[   61.528077] Total swap = 0kB
[   61.528077] 524157 pages RAM
[   61.528078] 0 pages HighMem/MovableOnly
[   61.528078] 76348 pages reserved
[   61.528078] 0 pages hwpoisoned
[   61.528079] SLUB: Unable to allocate memory on node -1 (gfp=0x2088020)
[   61.528080]   cache: kmalloc-64, object size: 64, buffer size: 64, default order: 0, min order: 0
[   61.528080]   node 0: slabs: 3218, objs: 205952, free: 0
[   61.528085] file_io.00: page allocation failure: order:0, mode:0x2200020
[   61.528086] CPU: 0 PID: 4457 Comm: file_io.00 Not tainted 4.5.0-rc7+ #45
----------

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 fs/fs-writeback.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 5c46ed9..21450c7 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -929,7 +929,8 @@ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
 	 * This is WB_SYNC_NONE writeback, so if allocation fails just
 	 * wakeup the thread for old dirty data writeback
 	 */
-	work = kzalloc(sizeof(*work), GFP_ATOMIC);
+	work = kzalloc(sizeof(*work),
+		       GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
 	if (!work) {
 		trace_writeback_nowork(wb);
 		wb_wakeup(wb);
-- 
1.8.3.1


* Re: [PATCH] mm,writeback: Don't use memory reserves for wb_start_writeback
  2016-03-13 14:22   ` Tetsuo Handa
@ 2016-03-14 16:09   ` Michal Hocko
  2016-03-16 20:46     ` Tejun Heo
  -1 siblings, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2016-03-14 16:09 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: viro, tj, linux-mm, linux-fsdevel

On Sun 13-03-16 23:22:23, Tetsuo Handa wrote:
[...]

I am not familiar with the writeback code, so I might be missing
something essential here, but why are we even queueing more and more
work without checking whether enough work has already been scheduled
or is in progress?

Something as simple as:
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 6915c950e6e8..aa52e23ac280 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -887,7 +887,7 @@ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
 {
 	struct wb_writeback_work *work;
 
-	if (!wb_has_dirty_io(wb))
+	if (!wb_has_dirty_io(wb) || writeback_in_progress(wb))
 		return;
 
 	/*
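
For reference, writeback_in_progress() is cheap; if I remember the 4.5-era
fs/fs-writeback.c correctly it is just a bit test (a sketch, not the
verbatim code), so the check above would merely skip queueing while the
flusher is already running:

	bool writeback_in_progress(struct bdi_writeback *wb)
	{
		return test_bit(WB_writeback_running, &wb->state);
	}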

> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 5c46ed9..21450c7 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -929,7 +929,8 @@ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
>  	 * This is WB_SYNC_NONE writeback, so if allocation fails just
>  	 * wakeup the thread for old dirty data writeback
>  	 */
> -	work = kzalloc(sizeof(*work), GFP_ATOMIC);
> +	work = kzalloc(sizeof(*work),
> +		       GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);

Well, I guess you are right that this doesn't sound like a context
which really needs access to the memory reserves, and GFP_ATOMIC is
being used here for what can be achieved with GFP_NOWAIT. Using
__GFP_NOMEMALLOC would be needed regardless, as you already pointed
out, because this might be called from the page reclaim context. So if
the above simple hack or some other explicit limit cannot be done,
then __GFP_NOMEMALLOC is an absolute minimum.

>  	if (!work) {
>  		trace_writeback_nowork(wb);
>  		wb_wakeup(wb);
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm,writeback: Don't use memory reserves for wb_start_writeback
  2016-03-14 16:09   ` Michal Hocko
@ 2016-03-16 20:46     ` Tejun Heo
  2016-03-18 13:11       ` Jan Kara
  0 siblings, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2016-03-16 20:46 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Tetsuo Handa, viro, linux-mm, linux-fsdevel, Jan Kara

Hello,

(cc'ing Jan)

On Mon, Mar 14, 2016 at 05:09:00PM +0100, Michal Hocko wrote:
> On Sun 13-03-16 23:22:23, Tetsuo Handa wrote:
> [...]
> 
> I am not familiar with the writeback code, so I might be missing
> something essential here, but why are we even queueing more and more
> work without checking whether enough work has already been scheduled
> or is in progress?
>
> Something as simple as:
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 6915c950e6e8..aa52e23ac280 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -887,7 +887,7 @@ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
>  {
>  	struct wb_writeback_work *work;
>  
> -	if (!wb_has_dirty_io(wb))
> +	if (!wb_has_dirty_io(wb) || writeback_in_progress(wb))
>  		return;

I'm not sure this would be safe.  It shouldn't harm correctness as
wb_start_writeback() isn't used in sync case but this might change
flush behavior in various ways.  Dropping GFP_ATOMIC as suggested by
Tetsuo is likely better.

Thanks.

-- 
tejun


* Re: [PATCH] mm,writeback: Don't use memory reserves for wb_start_writeback
  2016-03-16 20:46     ` Tejun Heo
@ 2016-03-18 13:11       ` Jan Kara
  2016-03-18 13:34         ` Michal Hocko
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Kara @ 2016-03-18 13:11 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Tetsuo Handa, viro, linux-mm, linux-fsdevel, Jan Kara

On Wed 16-03-16 13:46:17, Tejun Heo wrote:
> Hello,
> 
> (cc'ing Jan)
> 
> On Mon, Mar 14, 2016 at 05:09:00PM +0100, Michal Hocko wrote:
> > On Sun 13-03-16 23:22:23, Tetsuo Handa wrote:
> > [...]
> > 
> > I am not familiar with the writeback code, so I might be missing
> > something essential here, but why are we even queueing more and more
> > work without checking whether enough work has already been scheduled
> > or is in progress?
> >
> > Something as simple as:
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 6915c950e6e8..aa52e23ac280 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -887,7 +887,7 @@ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
> >  {
> >  	struct wb_writeback_work *work;
> >  
> > -	if (!wb_has_dirty_io(wb))
> > +	if (!wb_has_dirty_io(wb) || writeback_in_progress(wb))
> >  		return;
> 
> I'm not sure this would be safe.  It shouldn't harm correctness as
> wb_start_writeback() isn't used in sync case but this might change
> flush behavior in various ways.  Dropping GFP_ATOMIC as suggested by
> Tetsuo is likely better.

Yes, there can be different requests for different numbers of pages to be
written, and you don't want to discard a request to clean 4000 pages just
because a writeback of 10 pages happens to be running. As Tejun says, this
is not a hard requirement, but in general it would be unexpected for users
of the API...
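
Each queued item really is a separate request; roughly (an abbreviated,
from-memory sketch of struct wb_writeback_work in fs/fs-writeback.c around
4.5, so the field list may not be exact):

	struct wb_writeback_work {
		long nr_pages;			/* how much to write */
		struct super_block *sb;
		enum writeback_sync_modes sync_mode;
		unsigned int tagged_writepages:1;
		unsigned int for_kupdate:1;
		unsigned int range_cyclic:1;
		unsigned int for_background:1;
		unsigned int auto_free:1;	/* free on completion */
		enum wb_reason reason;		/* why writeback was started */
		struct list_head list;		/* pending work list */
		struct wb_completion *done;	/* set if the caller waits */
	};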

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH] mm,writeback: Don't use memory reserves for wb_start_writeback
  2016-03-18 13:11       ` Jan Kara
@ 2016-03-18 13:34         ` Michal Hocko
  0 siblings, 0 replies; 15+ messages in thread
From: Michal Hocko @ 2016-03-18 13:34 UTC (permalink / raw)
  To: Jan Kara; +Cc: Tejun Heo, Tetsuo Handa, viro, linux-mm, linux-fsdevel, Jan Kara

On Fri 18-03-16 14:11:36, Jan Kara wrote:
> On Wed 16-03-16 13:46:17, Tejun Heo wrote:
> > Hello,
> > 
> > (cc'ing Jan)
> > 
> > On Mon, Mar 14, 2016 at 05:09:00PM +0100, Michal Hocko wrote:
> > > On Sun 13-03-16 23:22:23, Tetsuo Handa wrote:
> > > [...]
> > > 
> > > I am not familiar with the writeback code, so I might be missing
> > > something essential here, but why are we even queueing more and more
> > > work without checking whether enough work has already been scheduled
> > > or is in progress?
> > >
> > > Something as simple as:
> > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > index 6915c950e6e8..aa52e23ac280 100644
> > > --- a/fs/fs-writeback.c
> > > +++ b/fs/fs-writeback.c
> > > @@ -887,7 +887,7 @@ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
> > >  {
> > >  	struct wb_writeback_work *work;
> > >  
> > > -	if (!wb_has_dirty_io(wb))
> > > +	if (!wb_has_dirty_io(wb) || writeback_in_progress(wb))
> > >  		return;
> > 
> > I'm not sure this would be safe.  It shouldn't harm correctness as
> > wb_start_writeback() isn't used in sync case but this might change
> > flush behavior in various ways.  Dropping GFP_ATOMIC as suggested by
> > Tetsuo is likely better.
> 
> Yes, there can be different requests for different numbers of pages to be
> written, and you don't want to discard a request to clean 4000 pages just
> because a writeback of 10 pages happens to be running. As Tejun says, this
> is not a hard requirement, but in general it would be unexpected for users
> of the API...

Thanks for the clarification. Then the proper fix is indeed to add
__GFP_NOMEMALLOC.
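
For reference, the allocation would then look roughly like the hunk further
down in this thread:

	/*
	 * GFP_NOWAIT        - never sleep; we are called from reclaim context
	 * __GFP_NOMEMALLOC  - do not dip into the memory reserves even though
	 *                     the caller has PF_MEMALLOC set
	 * __GFP_NOWARN      - a failed allocation is handled by falling back
	 *                     to wb_wakeup(), so do not warn about it
	 */
	work = kzalloc(sizeof(*work),
		       GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
	if (!work) {
		trace_writeback_nowork(wb);
		wb_wakeup(wb);
		return;
	}
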
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] mm,writeback: Don't use memory reserves for wb_start_writeback
  2016-03-13 14:22   ` Tetsuo Handa
@ 2016-03-18 13:42   ` Michal Hocko
  -1 siblings, 0 replies; 15+ messages in thread
From: Michal Hocko @ 2016-03-18 13:42 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: viro, tj, linux-mm, linux-fsdevel

On Sun 13-03-16 23:22:23, Tetsuo Handa wrote:
[...]
> >From 5d43acbc5849a63494a732e39374692822145923 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Sun, 13 Mar 2016 23:03:05 +0900
> Subject: [PATCH] mm,writeback: Don't use memory reserves for
>  wb_start_writeback
> 
> When writeback operation cannot make forward progress because memory
> allocation requests needed for doing I/O cannot be satisfied (e.g.
> under OOM-livelock situation), we can observe flood of order-0 page
> allocation failure messages caused by complete depletion of memory
> reserves.
> 
> This is caused by unconditionally allocating "struct wb_writeback_work"
> objects using GFP_ATOMIC from PF_MEMALLOC context.
> 
> __alloc_pages_nodemask() {
>   __alloc_pages_slowpath() {
>     __alloc_pages_direct_reclaim() {
>       __perform_reclaim() {
>         current->flags |= PF_MEMALLOC;
>         try_to_free_pages() {
>           do_try_to_free_pages() {
>             wakeup_flusher_threads() {
>               wb_start_writeback() {
>                 kzalloc(sizeof(*work), GFP_ATOMIC) {
>                   /* ALLOC_NO_WATERMARKS via PF_MEMALLOC */
>                 }
>               }
>             }
>           }
>         }
>         current->flags &= ~PF_MEMALLOC;
>       }
>     }
>   }
> }
> 
> Since I/O is stalling, allocating writeback requests forever shall deplete
> memory reserves. Fortunately, since wb_start_writeback() can fall back to
> wb_wakeup() when allocating "struct wb_writeback_work" failed, we don't
> need to allow wb_start_writeback() to use memory reserves.
> 
> ----------
> [   59.562581] Mem-Info:
> [   59.563935] active_anon:289393 inactive_anon:2093 isolated_anon:29
> [   59.563935]  active_file:10838 inactive_file:113013 isolated_file:859
> [   59.563935]  unevictable:0 dirty:108531 writeback:5308 unstable:0
> [   59.563935]  slab_reclaimable:5526 slab_unreclaimable:7077
> [   59.563935]  mapped:9970 shmem:2159 pagetables:2387 bounce:0
> [   59.563935]  free:3042 free_pcp:0 free_cma:0
> [   59.574558] Node 0 DMA free:6968kB min:44kB low:52kB high:64kB active_anon:6056kB inactive_anon:176kB active_file:712kB inactive_file:744kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:756kB writeback:0kB mapped:736kB shmem:184kB slab_reclaimable:48kB slab_unreclaimable:208kB kernel_stack:160kB pagetables:144kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9708 all_unreclaimable? yes
> [   59.585464] lowmem_reserve[]: 0 1732 1732 1732
> [   59.587123] Node 0 DMA32 free:5200kB min:5200kB low:6500kB high:7800kB active_anon:1151516kB inactive_anon:8196kB active_file:42640kB inactive_file:451076kB unevictable:0kB isolated(anon):116kB isolated(file):3564kB present:2080640kB managed:1775332kB mlocked:0kB dirty:433368kB writeback:21232kB mapped:39144kB shmem:8452kB slab_reclaimable:22056kB slab_unreclaimable:28100kB kernel_stack:20976kB pagetables:9404kB unstable:0kB bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2701604 all_unreclaimable? no
> [   59.599649] lowmem_reserve[]: 0 0 0 0
> [   59.601431] Node 0 DMA: 25*4kB (UME) 16*8kB (UME) 3*16kB (UE) 5*32kB (UME) 2*64kB (UM) 2*128kB (ME) 2*256kB (ME) 1*512kB (E) 1*1024kB (E) 2*2048kB (ME) 0*4096kB = 6964kB
> [   59.606509] Node 0 DMA32: 925*4kB (UME) 140*8kB (UME) 5*16kB (ME) 5*32kB (M) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 5060kB
> [   59.610415] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
> [   59.612879] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
> [   59.615308] 126847 total pagecache pages
> [   59.616921] 0 pages in swap cache
> [   59.618475] Swap cache stats: add 0, delete 0, find 0/0
> [   59.620268] Free swap  = 0kB
> [   59.621650] Total swap = 0kB
> [   59.623011] 524157 pages RAM
> [   59.624365] 0 pages HighMem/MovableOnly
> [   59.625893] 76348 pages reserved
> [   59.627506] 0 pages hwpoisoned
> [   59.628838] Out of memory: Kill process 4450 (file_io.00) score 998 or sacrifice child
> [   59.631071] Killed process 4450 (file_io.00) total-vm:4308kB, anon-rss:100kB, file-rss:1184kB, shmem-rss:0kB
> [   61.526353] kthreadd: page allocation failure: order:0, mode:0x2200020
> [   61.527976] file_io.00: page allocation failure: order:0, mode:0x2200020
> [   61.527978] CPU: 0 PID: 4457 Comm: file_io.00 Not tainted 4.5.0-rc7+ #45
> [   61.527979] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
> [   61.527981]  0000000000000086 000000000005bb2d ffff88006cc5b588 ffffffff812a4d65
> [   61.527982]  0000000002200020 0000000000000000 ffff88006cc5b618 ffffffff81106dc7
> [   61.527983]  0000000000000000 ffffffffffffffff 00ff880000000000 ffff880000000004
> [   61.527983] Call Trace:
> [   61.528009]  [<ffffffff812a4d65>] dump_stack+0x4d/0x68
> [   61.528012]  [<ffffffff81106dc7>] warn_alloc_failed+0xf7/0x150
> [   61.528014]  [<ffffffff81109e3f>] __alloc_pages_nodemask+0x23f/0xa60
> [   61.528016]  [<ffffffff81137770>] ? page_check_address_transhuge+0x350/0x350
> [   61.528018]  [<ffffffff8111327d>] ? page_evictable+0xd/0x40
> [   61.528019]  [<ffffffff8114d927>] alloc_pages_current+0x87/0x110
> [   61.528021]  [<ffffffff81155181>] new_slab+0x3a1/0x440
> [   61.528023]  [<ffffffff81156fdf>] ___slab_alloc+0x3cf/0x590
> [   61.528024]  [<ffffffff811a0999>] ? wb_start_writeback+0x39/0x90
> [   61.528027]  [<ffffffff815a7f68>] ? preempt_schedule_common+0x1f/0x37
> [   61.528028]  [<ffffffff815a7f9f>] ? preempt_schedule+0x1f/0x30
> [   61.528030]  [<ffffffff81001012>] ? ___preempt_schedule+0x12/0x14
> [   61.528030]  [<ffffffff811a0999>] ? wb_start_writeback+0x39/0x90
> [   61.528032]  [<ffffffff81175536>] __slab_alloc.isra.64+0x18/0x1d
> [   61.528033]  [<ffffffff8115778c>] kmem_cache_alloc+0x11c/0x150
> [   61.528034]  [<ffffffff811a0999>] wb_start_writeback+0x39/0x90
> [   61.528035]  [<ffffffff811a0d9f>] wakeup_flusher_threads+0x7f/0xf0
> [   61.528036]  [<ffffffff81115ac9>] do_try_to_free_pages+0x1f9/0x410
> [   61.528037]  [<ffffffff81115d74>] try_to_free_pages+0x94/0xc0
> [   61.528038]  [<ffffffff8110a166>] __alloc_pages_nodemask+0x566/0xa60
> [   61.528040]  [<ffffffff81200878>] ? xfs_bmapi_read+0x208/0x2f0
> [   61.528041]  [<ffffffff8114d927>] alloc_pages_current+0x87/0x110
> [   61.528042]  [<ffffffff8110092f>] __page_cache_alloc+0xaf/0xc0
> [   61.528043]  [<ffffffff811011e8>] pagecache_get_page+0x88/0x260
> [   61.528044]  [<ffffffff81101d31>] grab_cache_page_write_begin+0x21/0x40
> [   61.528046]  [<ffffffff81222c9f>] xfs_vm_write_begin+0x2f/0xf0
> [   61.528047]  [<ffffffff810b14be>] ? current_fs_time+0x1e/0x30
> [   61.528048]  [<ffffffff81101eca>] generic_perform_write+0xca/0x1c0
> [   61.528050]  [<ffffffff8107c390>] ? wake_up_process+0x10/0x20
> [   61.528051]  [<ffffffff8122e01c>] xfs_file_buffered_aio_write+0xcc/0x1f0
> [   61.528052]  [<ffffffff81079037>] ? finish_task_switch+0x77/0x280
> [   61.528053]  [<ffffffff8122e1c4>] xfs_file_write_iter+0x84/0x140
> [   61.528054]  [<ffffffff811777a7>] __vfs_write+0xc7/0x100
> [   61.528055]  [<ffffffff811784cd>] vfs_write+0x9d/0x190
> [   61.528056]  [<ffffffff810010a1>] ? do_audit_syscall_entry+0x61/0x70
> [   61.528057]  [<ffffffff811793c0>] SyS_write+0x50/0xc0
> [   61.528059]  [<ffffffff815ab4d7>] entry_SYSCALL_64_fastpath+0x12/0x6a
> [   61.528059] Mem-Info:
> [   61.528062] active_anon:293335 inactive_anon:2093 isolated_anon:0
> [   61.528062]  active_file:10829 inactive_file:110045 isolated_file:32
> [   61.528062]  unevictable:0 dirty:109275 writeback:822 unstable:0
> [   61.528062]  slab_reclaimable:5489 slab_unreclaimable:10070
> [   61.528062]  mapped:9999 shmem:2159 pagetables:2420 bounce:0
> [   61.528062]  free:3 free_pcp:0 free_cma:0
> [   61.528065] Node 0 DMA free:12kB min:44kB low:52kB high:64kB active_anon:6060kB inactive_anon:176kB active_file:708kB inactive_file:756kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:756kB writeback:0kB mapped:736kB shmem:184kB slab_reclaimable:48kB slab_unreclaimable:7160kB kernel_stack:160kB pagetables:144kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9844 all_unreclaimable? yes
> [   61.528066] lowmem_reserve[]: 0 1732 1732 1732
> [   61.528068] Node 0 DMA32 free:0kB min:5200kB low:6500kB high:7800kB active_anon:1167280kB inactive_anon:8196kB active_file:42608kB inactive_file:439424kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:2080640kB managed:1775332kB mlocked:0kB dirty:436344kB writeback:3288kB mapped:39260kB shmem:8452kB slab_reclaimable:21908kB slab_unreclaimable:33120kB kernel_stack:20976kB pagetables:9536kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:11073180 all_unreclaimable? yes
> [   61.528069] lowmem_reserve[]: 0 0 0 0
> [   61.528072] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
> [   61.528074] Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
> [   61.528075] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
> [   61.528075] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
> [   61.528076] 123086 total pagecache pages
> [   61.528076] 0 pages in swap cache
> [   61.528077] Swap cache stats: add 0, delete 0, find 0/0
> [   61.528077] Free swap  = 0kB
> [   61.528077] Total swap = 0kB
> [   61.528077] 524157 pages RAM
> [   61.528078] 0 pages HighMem/MovableOnly
> [   61.528078] 76348 pages reserved
> [   61.528078] 0 pages hwpoisoned
> [   61.528079] SLUB: Unable to allocate memory on node -1 (gfp=0x2088020)
> [   61.528080]   cache: kmalloc-64, object size: 64, buffer size: 64, default order: 0, min order: 0
> [   61.528080]   node 0: slabs: 3218, objs: 205952, free: 0
> [   61.528085] file_io.00: page allocation failure: order:0, mode:0x2200020
> [   61.528086] CPU: 0 PID: 4457 Comm: file_io.00 Not tainted 4.5.0-rc7+ #45
> ----------
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  fs/fs-writeback.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 5c46ed9..21450c7 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -929,7 +929,8 @@ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
>  	 * This is WB_SYNC_NONE writeback, so if allocation fails just
>  	 * wakeup the thread for old dirty data writeback
>  	 */
> -	work = kzalloc(sizeof(*work), GFP_ATOMIC);
> +	work = kzalloc(sizeof(*work),
> +		       GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
>  	if (!work) {
>  		trace_writeback_nowork(wb);
>  		wb_wakeup(wb);
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH] mm,writeback: Don't use memory reserves for wb_start_writeback
@ 2016-04-28 13:26 Tetsuo Handa
  0 siblings, 0 replies; 15+ messages in thread
From: Tetsuo Handa @ 2016-04-28 13:26 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, Tetsuo Handa, Jan Kara, Tejun Heo

When a writeback operation cannot make forward progress because the memory
allocation requests needed for doing I/O cannot be satisfied (e.g. under an
OOM-livelock situation), we can observe a flood of order-0 page allocation
failure messages caused by complete depletion of memory reserves.

This is caused by unconditionally allocating "struct wb_writeback_work"
objects using GFP_ATOMIC from PF_MEMALLOC context.

__alloc_pages_nodemask() {
  __alloc_pages_slowpath() {
    __alloc_pages_direct_reclaim() {
      __perform_reclaim() {
        current->flags |= PF_MEMALLOC;
        try_to_free_pages() {
          do_try_to_free_pages() {
            wakeup_flusher_threads() {
              wb_start_writeback() {
                kzalloc(sizeof(*work), GFP_ATOMIC) {
                  /* ALLOC_NO_WATERMARKS via PF_MEMALLOC */
                }
              }
            }
          }
        }
        current->flags &= ~PF_MEMALLOC;
      }
    }
  }
}

Since I/O is stalling, endlessly allocating writeback requests will
eventually deplete the memory reserves. Fortunately, since wb_start_writeback()
can fall back to wb_wakeup() when allocating "struct wb_writeback_work" fails,
we don't need to allow wb_start_writeback() to use memory reserves.

----------
[   59.562581] Mem-Info:
[   59.563935] active_anon:289393 inactive_anon:2093 isolated_anon:29
[   59.563935]  active_file:10838 inactive_file:113013 isolated_file:859
[   59.563935]  unevictable:0 dirty:108531 writeback:5308 unstable:0
[   59.563935]  slab_reclaimable:5526 slab_unreclaimable:7077
[   59.563935]  mapped:9970 shmem:2159 pagetables:2387 bounce:0
[   59.563935]  free:3042 free_pcp:0 free_cma:0
[   59.574558] Node 0 DMA free:6968kB min:44kB low:52kB high:64kB active_anon:6056kB inactive_anon:176kB active_file:712kB inactive_file:744kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:756kB writeback:0kB mapped:736kB shmem:184kB slab_reclaimable:48kB slab_unreclaimable:208kB kernel_stack:160kB pagetables:144kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9708 all_unreclaimable? yes
[   59.585464] lowmem_reserve[]: 0 1732 1732 1732
[   59.587123] Node 0 DMA32 free:5200kB min:5200kB low:6500kB high:7800kB active_anon:1151516kB inactive_anon:8196kB active_file:42640kB inactive_file:451076kB unevictable:0kB isolated(anon):116kB isolated(file):3564kB present:2080640kB managed:1775332kB mlocked:0kB dirty:433368kB writeback:21232kB mapped:39144kB shmem:8452kB slab_reclaimable:22056kB slab_unreclaimable:28100kB kernel_stack:20976kB pagetables:9404kB unstable:0kB bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2701604 all_unreclaimable? no
[   59.599649] lowmem_reserve[]: 0 0 0 0
[   59.601431] Node 0 DMA: 25*4kB (UME) 16*8kB (UME) 3*16kB (UE) 5*32kB (UME) 2*64kB (UM) 2*128kB (ME) 2*256kB (ME) 1*512kB (E) 1*1024kB (E) 2*2048kB (ME) 0*4096kB = 6964kB
[   59.606509] Node 0 DMA32: 925*4kB (UME) 140*8kB (UME) 5*16kB (ME) 5*32kB (M) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 5060kB
[   59.610415] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[   59.612879] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   59.615308] 126847 total pagecache pages
[   59.616921] 0 pages in swap cache
[   59.618475] Swap cache stats: add 0, delete 0, find 0/0
[   59.620268] Free swap  = 0kB
[   59.621650] Total swap = 0kB
[   59.623011] 524157 pages RAM
[   59.624365] 0 pages HighMem/MovableOnly
[   59.625893] 76348 pages reserved
[   59.627506] 0 pages hwpoisoned
[   59.628838] Out of memory: Kill process 4450 (file_io.00) score 998 or sacrifice child
[   59.631071] Killed process 4450 (file_io.00) total-vm:4308kB, anon-rss:100kB, file-rss:1184kB, shmem-rss:0kB
[   61.526353] kthreadd: page allocation failure: order:0, mode:0x2200020
[   61.527976] file_io.00: page allocation failure: order:0, mode:0x2200020
[   61.527978] CPU: 0 PID: 4457 Comm: file_io.00 Not tainted 4.5.0-rc7+ #45
[   61.527979] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[   61.527981]  0000000000000086 000000000005bb2d ffff88006cc5b588 ffffffff812a4d65
[   61.527982]  0000000002200020 0000000000000000 ffff88006cc5b618 ffffffff81106dc7
[   61.527983]  0000000000000000 ffffffffffffffff 00ff880000000000 ffff880000000004
[   61.527983] Call Trace:
[   61.528009]  [<ffffffff812a4d65>] dump_stack+0x4d/0x68
[   61.528012]  [<ffffffff81106dc7>] warn_alloc_failed+0xf7/0x150
[   61.528014]  [<ffffffff81109e3f>] __alloc_pages_nodemask+0x23f/0xa60
[   61.528016]  [<ffffffff81137770>] ? page_check_address_transhuge+0x350/0x350
[   61.528018]  [<ffffffff8111327d>] ? page_evictable+0xd/0x40
[   61.528019]  [<ffffffff8114d927>] alloc_pages_current+0x87/0x110
[   61.528021]  [<ffffffff81155181>] new_slab+0x3a1/0x440
[   61.528023]  [<ffffffff81156fdf>] ___slab_alloc+0x3cf/0x590
[   61.528024]  [<ffffffff811a0999>] ? wb_start_writeback+0x39/0x90
[   61.528027]  [<ffffffff815a7f68>] ? preempt_schedule_common+0x1f/0x37
[   61.528028]  [<ffffffff815a7f9f>] ? preempt_schedule+0x1f/0x30
[   61.528030]  [<ffffffff81001012>] ? ___preempt_schedule+0x12/0x14
[   61.528030]  [<ffffffff811a0999>] ? wb_start_writeback+0x39/0x90
[   61.528032]  [<ffffffff81175536>] __slab_alloc.isra.64+0x18/0x1d
[   61.528033]  [<ffffffff8115778c>] kmem_cache_alloc+0x11c/0x150
[   61.528034]  [<ffffffff811a0999>] wb_start_writeback+0x39/0x90
[   61.528035]  [<ffffffff811a0d9f>] wakeup_flusher_threads+0x7f/0xf0
[   61.528036]  [<ffffffff81115ac9>] do_try_to_free_pages+0x1f9/0x410
[   61.528037]  [<ffffffff81115d74>] try_to_free_pages+0x94/0xc0
[   61.528038]  [<ffffffff8110a166>] __alloc_pages_nodemask+0x566/0xa60
[   61.528040]  [<ffffffff81200878>] ? xfs_bmapi_read+0x208/0x2f0
[   61.528041]  [<ffffffff8114d927>] alloc_pages_current+0x87/0x110
[   61.528042]  [<ffffffff8110092f>] __page_cache_alloc+0xaf/0xc0
[   61.528043]  [<ffffffff811011e8>] pagecache_get_page+0x88/0x260
[   61.528044]  [<ffffffff81101d31>] grab_cache_page_write_begin+0x21/0x40
[   61.528046]  [<ffffffff81222c9f>] xfs_vm_write_begin+0x2f/0xf0
[   61.528047]  [<ffffffff810b14be>] ? current_fs_time+0x1e/0x30
[   61.528048]  [<ffffffff81101eca>] generic_perform_write+0xca/0x1c0
[   61.528050]  [<ffffffff8107c390>] ? wake_up_process+0x10/0x20
[   61.528051]  [<ffffffff8122e01c>] xfs_file_buffered_aio_write+0xcc/0x1f0
[   61.528052]  [<ffffffff81079037>] ? finish_task_switch+0x77/0x280
[   61.528053]  [<ffffffff8122e1c4>] xfs_file_write_iter+0x84/0x140
[   61.528054]  [<ffffffff811777a7>] __vfs_write+0xc7/0x100
[   61.528055]  [<ffffffff811784cd>] vfs_write+0x9d/0x190
[   61.528056]  [<ffffffff810010a1>] ? do_audit_syscall_entry+0x61/0x70
[   61.528057]  [<ffffffff811793c0>] SyS_write+0x50/0xc0
[   61.528059]  [<ffffffff815ab4d7>] entry_SYSCALL_64_fastpath+0x12/0x6a
[   61.528059] Mem-Info:
[   61.528062] active_anon:293335 inactive_anon:2093 isolated_anon:0
[   61.528062]  active_file:10829 inactive_file:110045 isolated_file:32
[   61.528062]  unevictable:0 dirty:109275 writeback:822 unstable:0
[   61.528062]  slab_reclaimable:5489 slab_unreclaimable:10070
[   61.528062]  mapped:9999 shmem:2159 pagetables:2420 bounce:0
[   61.528062]  free:3 free_pcp:0 free_cma:0
[   61.528065] Node 0 DMA free:12kB min:44kB low:52kB high:64kB active_anon:6060kB inactive_anon:176kB active_file:708kB inactive_file:756kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:756kB writeback:0kB mapped:736kB shmem:184kB slab_reclaimable:48kB slab_unreclaimable:7160kB kernel_stack:160kB pagetables:144kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9844 all_unreclaimable? yes
[   61.528066] lowmem_reserve[]: 0 1732 1732 1732
[   61.528068] Node 0 DMA32 free:0kB min:5200kB low:6500kB high:7800kB active_anon:1167280kB inactive_anon:8196kB active_file:42608kB inactive_file:439424kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:2080640kB managed:1775332kB mlocked:0kB dirty:436344kB writeback:3288kB mapped:39260kB shmem:8452kB slab_reclaimable:21908kB slab_unreclaimable:33120kB kernel_stack:20976kB pagetables:9536kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:11073180 all_unreclaimable? yes
[   61.528069] lowmem_reserve[]: 0 0 0 0
[   61.528072] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[   61.528074] Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[   61.528075] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[   61.528075] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   61.528076] 123086 total pagecache pages
[   61.528076] 0 pages in swap cache
[   61.528077] Swap cache stats: add 0, delete 0, find 0/0
[   61.528077] Free swap  = 0kB
[   61.528077] Total swap = 0kB
[   61.528077] 524157 pages RAM
[   61.528078] 0 pages HighMem/MovableOnly
[   61.528078] 76348 pages reserved
[   61.528078] 0 pages hwpoisoned
[   61.528079] SLUB: Unable to allocate memory on node -1 (gfp=0x2088020)
[   61.528080]   cache: kmalloc-64, object size: 64, buffer size: 64, default order: 0, min order: 0
[   61.528080]   node 0: slabs: 3218, objs: 205952, free: 0
[   61.528085] file_io.00: page allocation failure: order:0, mode:0x2200020
[   61.528086] CPU: 0 PID: 4457 Comm: file_io.00 Not tainted 4.5.0-rc7+ #45
----------

Assuming that somebody will find a better solution, let's apply
this patch for now to stop the bleeding, because this problem frequently
prevents me from testing OOM-livelock conditions.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20160318131136.GE7152@quack.suse.cz
---
 fs/fs-writeback.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 592cea5..989a2ce 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -931,7 +931,8 @@ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
 	 * This is WB_SYNC_NONE writeback, so if allocation fails just
 	 * wakeup the thread for old dirty data writeback
 	 */
-	work = kzalloc(sizeof(*work), GFP_ATOMIC);
+	work = kzalloc(sizeof(*work),
+		       GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
 	if (!work) {
 		trace_writeback_nowork(wb);
 		wb_wakeup(wb);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] mm,writeback: Don't use memory reserves for wb_start_writeback
  2016-03-29 16:49     ` Jan Kara
@ 2016-04-04 10:58       ` Tetsuo Handa
  0 siblings, 0 replies; 15+ messages in thread
From: Tetsuo Handa @ 2016-04-04 10:58 UTC (permalink / raw)
  To: jack, mhocko; +Cc: akpm, linux-mm, tj

Hello, Jan.

Assuming that you will find a better solution, can we apply this patch
for now to stop the bleeding?
This problem frequently prevents me from testing OOM-livelock conditions.

Jan Kara wrote:
> On Tue 29-03-16 10:54:35, Michal Hocko wrote:
> > On Thu 24-03-16 14:17:14, Andrew Morton wrote:
> > > On Thu, 24 Mar 2016 23:03:16 +0900 Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote:
> > > 
> > > > Andrew, can you take this patch?
> > > 
> > > Tejun.
> > > 
> > > > ----------------------------------------
> > > > >From 5d43acbc5849a63494a732e39374692822145923 Mon Sep 17 00:00:00 2001
> > > > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > > > Date: Sun, 13 Mar 2016 23:03:05 +0900
> > > > Subject: [PATCH] mm,writeback: Don't use memory reserves for
> > > >  wb_start_writeback
> > > > 
> > > > When writeback operation cannot make forward progress because memory
> > > > allocation requests needed for doing I/O cannot be satisfied (e.g.
> > > > under OOM-livelock situation), we can observe flood of order-0 page
> > > > allocation failure messages caused by complete depletion of memory
> > > > reserves.
> > > > 
> > > > This is caused by unconditionally allocating "struct wb_writeback_work"
> > > > objects using GFP_ATOMIC from PF_MEMALLOC context.
> > > > 
> > > > __alloc_pages_nodemask() {
> > > >   __alloc_pages_slowpath() {
> > > >     __alloc_pages_direct_reclaim() {
> > > >       __perform_reclaim() {
> > > >         current->flags |= PF_MEMALLOC;
> > > >         try_to_free_pages() {
> > > >           do_try_to_free_pages() {
> > > >             wakeup_flusher_threads() {
> > > >               wb_start_writeback() {
> > > >                 kzalloc(sizeof(*work), GFP_ATOMIC) {
> > > >                   /* ALLOC_NO_WATERMARKS via PF_MEMALLOC */
> > > >                 }
> > > >               }
> > > >             }
> > > >           }
> > > >         }
> > > >         current->flags &= ~PF_MEMALLOC;
> > > >       }
> > > >     }
> > > >   }
> > > > }
> > > > 
> > > > Since I/O is stalling, allocating writeback requests forever shall deplete
> > > > memory reserves. Fortunately, since wb_start_writeback() can fall back to
> > > > wb_wakeup() when allocating "struct wb_writeback_work" failed, we don't
> > > > need to allow wb_start_writeback() to use memory reserves.
> > > > 
> > > > ...
> > > >
> > > > --- a/fs/fs-writeback.c
> > > > +++ b/fs/fs-writeback.c
> > > > @@ -929,7 +929,8 @@ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
> > > >  	 * This is WB_SYNC_NONE writeback, so if allocation fails just
> > > >  	 * wakeup the thread for old dirty data writeback
> > > >  	 */
> > > > -	work = kzalloc(sizeof(*work), GFP_ATOMIC);
> > > > +	work = kzalloc(sizeof(*work),
> > > > +		       GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
> > > >  	if (!work) {
> > > >  		trace_writeback_nowork(wb);
> > > >  		wb_wakeup(wb);
> > > 
> > > Oh geeze.  fs/fs-writeback.c has grown waaay too many GFP_ATOMICs :(
> > > 
> > > How does this actually all work?
> > 
> > Jack has explained it a bit
> > http://lkml.kernel.org/r/20160318131136.GE7152@quack.suse.cz
> > 
> > > afaict if we fail this
> > > wb_writeback_work allocation, wb_workfn->wb_do_writeback will later say
> > > "hey, there are no work items!" and will do nothing at all.  Or does
> > > wb_workfn() fall into write-1024-pages-anyway mode and if so, how did
> > > it know how to do that?
> 
> We will end up in wb_do_writeback() which finds there's no work item so it
> falls back to doing default background writeback (i.e., write out until
> number of dirty pages is below background_dirty_limit).
> 
> > > If we had (say) a mempool of wb_writeback_work's (at least for for
> > > wb_start_writeback), would that help anything?  Or would writeback
> > > simply fail shortly afterwards for other reasons?
> 
> Not sure mempools would significantly improve the situation. Writeback code
> is able to deal with the failed allocation so I think the issue remains
> more with writeback code mostly pointlessly exhausting memory reserves with
> atomic allocations.
> 
> I think it is somewhat dumb from do_try_to_free_pages() that it calls
> wakeup_flusher_threads() so often (I guess it can quickly end up asking to
> write more than it is ever sensible to ask). Admittedly it is also dumb from
> the writeback code that it is not able to merge requests for writeback - we
> could easily merge items created by wb_start_writeback() with matching
> 'reason' and 'range_cyclic'.
> 
> I'm not sure how easy it is to fix the first thing, I think improving the
> second one may be worth it and I can have a look at that.
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] mm,writeback: Don't use memory reserves for wb_start_writeback
  2016-03-29  8:54   ` Michal Hocko
@ 2016-03-29 16:49     ` Jan Kara
  2016-04-04 10:58       ` Tetsuo Handa
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Kara @ 2016-03-29 16:49 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Andrew Morton, Tetsuo Handa, linux-mm, Tejun Heo, Jan Kara

On Tue 29-03-16 10:54:35, Michal Hocko wrote:
> On Thu 24-03-16 14:17:14, Andrew Morton wrote:
> > On Thu, 24 Mar 2016 23:03:16 +0900 Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote:
> > 
> > > Andrew, can you take this patch?
> > 
> > Tejun.
> > 
> > > ----------------------------------------
> > > >From 5d43acbc5849a63494a732e39374692822145923 Mon Sep 17 00:00:00 2001
> > > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > > Date: Sun, 13 Mar 2016 23:03:05 +0900
> > > Subject: [PATCH] mm,writeback: Don't use memory reserves for
> > >  wb_start_writeback
> > > 
> > > When writeback operation cannot make forward progress because memory
> > > allocation requests needed for doing I/O cannot be satisfied (e.g.
> > > under OOM-livelock situation), we can observe flood of order-0 page
> > > allocation failure messages caused by complete depletion of memory
> > > reserves.
> > > 
> > > This is caused by unconditionally allocating "struct wb_writeback_work"
> > > objects using GFP_ATOMIC from PF_MEMALLOC context.
> > > 
> > > __alloc_pages_nodemask() {
> > >   __alloc_pages_slowpath() {
> > >     __alloc_pages_direct_reclaim() {
> > >       __perform_reclaim() {
> > >         current->flags |= PF_MEMALLOC;
> > >         try_to_free_pages() {
> > >           do_try_to_free_pages() {
> > >             wakeup_flusher_threads() {
> > >               wb_start_writeback() {
> > >                 kzalloc(sizeof(*work), GFP_ATOMIC) {
> > >                   /* ALLOC_NO_WATERMARKS via PF_MEMALLOC */
> > >                 }
> > >               }
> > >             }
> > >           }
> > >         }
> > >         current->flags &= ~PF_MEMALLOC;
> > >       }
> > >     }
> > >   }
> > > }
> > > 
> > > Since I/O is stalling, allocating writeback requests forever shall deplete
> > > memory reserves. Fortunately, since wb_start_writeback() can fall back to
> > > wb_wakeup() when allocating "struct wb_writeback_work" failed, we don't
> > > need to allow wb_start_writeback() to use memory reserves.
> > > 
> > > ...
> > >
> > > --- a/fs/fs-writeback.c
> > > +++ b/fs/fs-writeback.c
> > > @@ -929,7 +929,8 @@ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
> > >  	 * This is WB_SYNC_NONE writeback, so if allocation fails just
> > >  	 * wakeup the thread for old dirty data writeback
> > >  	 */
> > > -	work = kzalloc(sizeof(*work), GFP_ATOMIC);
> > > +	work = kzalloc(sizeof(*work),
> > > +		       GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
> > >  	if (!work) {
> > >  		trace_writeback_nowork(wb);
> > >  		wb_wakeup(wb);
> > 
> > Oh geeze.  fs/fs-writeback.c has grown waaay too many GFP_ATOMICs :(
> > 
> > How does this actually all work?
> 
> Jack has explained it a bit
> http://lkml.kernel.org/r/20160318131136.GE7152@quack.suse.cz
> 
> > afaict if we fail this
> > wb_writeback_work allocation, wb_workfn->wb_do_writeback will later say
> > "hey, there are no work items!" and will do nothing at all.  Or does
> > wb_workfn() fall into write-1024-pages-anyway mode and if so, how did
> > it know how to do that?

We will end up in wb_do_writeback(), which finds there is no work item and so
falls back to doing default background writeback (i.e., writing out until the
number of dirty pages is below the background dirty limit).
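
Roughly, as a simplified sketch (not the exact fs/fs-writeback.c code; return
values and statistics are omitted):

	static void wb_do_writeback(struct bdi_writeback *wb)
	{
		struct wb_writeback_work *work;

		/* First serve the explicitly queued work items, if any. */
		while ((work = get_next_work_item(wb)) != NULL)
			wb_writeback(wb, work);

		/*
		 * No work item was queued (e.g. the kzalloc() in
		 * wb_start_writeback() failed): still flush old data and
		 * keep writing until the dirty numbers drop below the
		 * background threshold.
		 */
		wb_check_old_data_flush(wb);
		wb_check_background_flush(wb);
	}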

> > If we had (say) a mempool of wb_writeback_work's (at least for for
> > wb_start_writeback), would that help anything?  Or would writeback
> > simply fail shortly afterwards for other reasons?

I'm not sure mempools would significantly improve the situation. The writeback
code is able to deal with the failed allocation, so I think the issue is more
that the writeback code mostly pointlessly exhausts memory reserves with
atomic allocations.

I think it is somewhat dumb of do_try_to_free_pages() to call
wakeup_flusher_threads() so often (I guess it can quickly end up asking to
write more than it is ever sensible to ask for). Admittedly it is also dumb of
the writeback code that it is not able to merge requests for writeback - we
could easily merge items created by wb_start_writeback() with matching
'reason' and 'range_cyclic'.

I'm not sure how easy it is to fix the first thing; I think improving the
second one may be worth it, and I can have a look at that.
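
A rough, hypothetical sketch of that kind of merging (not existing kernel
code, and it glosses over racing with the flusher picking up an item that is
being modified):

	/*
	 * Hypothetical: before allocating a new wb_writeback_work in
	 * wb_start_writeback(), try to fold the request into an already
	 * pending item with the same 'reason' and 'range_cyclic'.
	 */
	static bool wb_try_merge_work(struct bdi_writeback *wb, long nr_pages,
				      bool range_cyclic, enum wb_reason reason)
	{
		struct wb_writeback_work *work;
		bool merged = false;

		spin_lock_bh(&wb->work_lock);
		list_for_each_entry(work, &wb->work_list, list) {
			if (work->reason == reason &&
			    work->range_cyclic == range_cyclic) {
				work->nr_pages += nr_pages;
				merged = true;
				break;
			}
		}
		spin_unlock_bh(&wb->work_lock);

		return merged;
	}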

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] mm,writeback: Don't use memory reserves for wb_start_writeback
  2016-03-24 21:17 ` Andrew Morton
  2016-03-25 11:54   ` Tetsuo Handa
@ 2016-03-29  8:54   ` Michal Hocko
  2016-03-29 16:49     ` Jan Kara
  1 sibling, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2016-03-29  8:54 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Tetsuo Handa, linux-mm, Tejun Heo, Jan Kara

[CCed Jack - Tetsuo, it is preferable to CC the people involved in the
previous discussion, and of course those who acked the patch as well]

On Thu 24-03-16 14:17:14, Andrew Morton wrote:
> On Thu, 24 Mar 2016 23:03:16 +0900 Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote:
> 
> > Andrew, can you take this patch?
> 
> Tejun.
> 
> > ----------------------------------------
> > >From 5d43acbc5849a63494a732e39374692822145923 Mon Sep 17 00:00:00 2001
> > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Date: Sun, 13 Mar 2016 23:03:05 +0900
> > Subject: [PATCH] mm,writeback: Don't use memory reserves for
> >  wb_start_writeback
> > 
> > When writeback operation cannot make forward progress because memory
> > allocation requests needed for doing I/O cannot be satisfied (e.g.
> > under OOM-livelock situation), we can observe flood of order-0 page
> > allocation failure messages caused by complete depletion of memory
> > reserves.
> > 
> > This is caused by unconditionally allocating "struct wb_writeback_work"
> > objects using GFP_ATOMIC from PF_MEMALLOC context.
> > 
> > __alloc_pages_nodemask() {
> >   __alloc_pages_slowpath() {
> >     __alloc_pages_direct_reclaim() {
> >       __perform_reclaim() {
> >         current->flags |= PF_MEMALLOC;
> >         try_to_free_pages() {
> >           do_try_to_free_pages() {
> >             wakeup_flusher_threads() {
> >               wb_start_writeback() {
> >                 kzalloc(sizeof(*work), GFP_ATOMIC) {
> >                   /* ALLOC_NO_WATERMARKS via PF_MEMALLOC */
> >                 }
> >               }
> >             }
> >           }
> >         }
> >         current->flags &= ~PF_MEMALLOC;
> >       }
> >     }
> >   }
> > }
> > 
> > Since I/O is stalling, allocating writeback requests forever shall deplete
> > memory reserves. Fortunately, since wb_start_writeback() can fall back to
> > wb_wakeup() when allocating "struct wb_writeback_work" failed, we don't
> > need to allow wb_start_writeback() to use memory reserves.
> > 
> > ...
> >
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -929,7 +929,8 @@ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
> >  	 * This is WB_SYNC_NONE writeback, so if allocation fails just
> >  	 * wakeup the thread for old dirty data writeback
> >  	 */
> > -	work = kzalloc(sizeof(*work), GFP_ATOMIC);
> > +	work = kzalloc(sizeof(*work),
> > +		       GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
> >  	if (!work) {
> >  		trace_writeback_nowork(wb);
> >  		wb_wakeup(wb);
> 
> Oh geeze.  fs/fs-writeback.c has grown waaay too many GFP_ATOMICs :(
> 
> How does this actually all work?

Jack has explained it a bit
http://lkml.kernel.org/r/20160318131136.GE7152@quack.suse.cz

> afaict if we fail this
> wb_writeback_work allocation, wb_workfn->wb_do_writeback will later say
> "hey, there are no work items!" and will do nothing at all.  Or does
> wb_workfn() fall into write-1024-pages-anyway mode and if so, how did
> it know how to do that?
> 
> If we had (say) a mempool of wb_writeback_work's (at least for for
> wb_start_writeback), would that help anything?  Or would writeback
> simply fail shortly afterwards for other reasons?
> 
> 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] mm,writeback: Don't use memory reserves for wb_start_writeback
  2016-03-24 21:17 ` Andrew Morton
@ 2016-03-25 11:54   ` Tetsuo Handa
  2016-03-29  8:54   ` Michal Hocko
  1 sibling, 0 replies; 15+ messages in thread
From: Tetsuo Handa @ 2016-03-25 11:54 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, tj

Andrew Morton wrote:
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -929,7 +929,8 @@ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
> >  	 * This is WB_SYNC_NONE writeback, so if allocation fails just
> >  	 * wakeup the thread for old dirty data writeback
> >  	 */
> > -	work = kzalloc(sizeof(*work), GFP_ATOMIC);
> > +	work = kzalloc(sizeof(*work),
> > +		       GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
> >  	if (!work) {
> >  		trace_writeback_nowork(wb);
> >  		wb_wakeup(wb);
> 
> Oh geeze.  fs/fs-writeback.c has grown waaay too many GFP_ATOMICs :(
> 
> How does this actually all work?  afaict if we fail this
> wb_writeback_work allocation, wb_workfn->wb_do_writeback will later say
> "hey, there are no work items!" and will do nothing at all.  Or does
> wb_workfn() fall into write-1024-pages-anyway mode and if so, how did
> it know how to do that?
> 
> If we had (say) a mempool of wb_writeback_work's (at least for for
> wb_start_writeback), would that help anything?  Or would writeback
> simply fail shortly afterwards for other reasons?
> 

I tried http://lkml.kernel.org/r/20160318133417.GB30225@dhcp22.suse.cz, which would
reduce the number of wb_writeback_work allocations compared to this patch, and I got
http://lkml.kernel.org/r/201603172035.CJH95337.SOJOFFFHMLOQVt@I-love.SAKURA.ne.jp
where wb_workfn() got stuck after all once we started using memory reserves.

Having a mempool for wb_writeback_work is not sufficient, because there are more
allocations after wb_workfn() is called. All allocations (GFP_NOFS or GFP_NOIO)
needed for doing the writeback operation are expected to be satisfied. If we let
GFP_NOFS and GFP_NOIO allocations fail, rather than selecting the next OOM victim
by calling the OOM killer when the page allocator declares OOM, we will lose data
which was supposed to be flushed asynchronously. Who is happy with buffered writes
which discard data (and cause filesystem errors such as remounting read-only,
followed by killing almost all processes, as with SysRq-i, because userspace
programs are unable to write data to the filesystem) simply because the system was
OOM at that moment? Basically, any allocation (GFP_NOFS or GFP_NOIO) needed for
doing the writeback operation is effectively __GFP_NOFAIL, because failing to flush
data should not occur unless a power failure, kernel panic, kernel oops or hardware
trouble occurs. I hate failing to flush data simply because the system was OOM at
that moment, without selecting the next OOM victim, which would kill far fewer
processes than the resulting filesystem errors would.

I expect this patch merely to stop the bleeding after we have started using
memory reserves. Nothing more. We will still need to solve the OOM-livelock
situation that remains once memory reserves are in use, by killing more
processes via the OOM killer.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] mm,writeback: Don't use memory reserves for wb_start_writeback
  2016-03-24 14:03 Tetsuo Handa
@ 2016-03-24 21:17 ` Andrew Morton
  2016-03-25 11:54   ` Tetsuo Handa
  2016-03-29  8:54   ` Michal Hocko
  0 siblings, 2 replies; 15+ messages in thread
From: Andrew Morton @ 2016-03-24 21:17 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, Tejun Heo

On Thu, 24 Mar 2016 23:03:16 +0900 Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote:

> Andrew, can you take this patch?

Tejun.

> ----------------------------------------
> >From 5d43acbc5849a63494a732e39374692822145923 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Sun, 13 Mar 2016 23:03:05 +0900
> Subject: [PATCH] mm,writeback: Don't use memory reserves for
>  wb_start_writeback
> 
> When writeback operation cannot make forward progress because memory
> allocation requests needed for doing I/O cannot be satisfied (e.g.
> under OOM-livelock situation), we can observe flood of order-0 page
> allocation failure messages caused by complete depletion of memory
> reserves.
> 
> This is caused by unconditionally allocating "struct wb_writeback_work"
> objects using GFP_ATOMIC from PF_MEMALLOC context.
> 
> __alloc_pages_nodemask() {
>   __alloc_pages_slowpath() {
>     __alloc_pages_direct_reclaim() {
>       __perform_reclaim() {
>         current->flags |= PF_MEMALLOC;
>         try_to_free_pages() {
>           do_try_to_free_pages() {
>             wakeup_flusher_threads() {
>               wb_start_writeback() {
>                 kzalloc(sizeof(*work), GFP_ATOMIC) {
>                   /* ALLOC_NO_WATERMARKS via PF_MEMALLOC */
>                 }
>               }
>             }
>           }
>         }
>         current->flags &= ~PF_MEMALLOC;
>       }
>     }
>   }
> }
> 
> Since I/O is stalling, allocating writeback requests forever shall deplete
> memory reserves. Fortunately, since wb_start_writeback() can fall back to
> wb_wakeup() when allocating "struct wb_writeback_work" failed, we don't
> need to allow wb_start_writeback() to use memory reserves.
> 
> ...
>
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -929,7 +929,8 @@ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
>  	 * This is WB_SYNC_NONE writeback, so if allocation fails just
>  	 * wakeup the thread for old dirty data writeback
>  	 */
> -	work = kzalloc(sizeof(*work), GFP_ATOMIC);
> +	work = kzalloc(sizeof(*work),
> +		       GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
>  	if (!work) {
>  		trace_writeback_nowork(wb);
>  		wb_wakeup(wb);

Oh geeze.  fs/fs-writeback.c has grown waaay too many GFP_ATOMICs :(

How does this actually all work?  afaict if we fail this
wb_writeback_work allocation, wb_workfn->wb_do_writeback will later say
"hey, there are no work items!" and will do nothing at all.  Or does
wb_workfn() fall into write-1024-pages-anyway mode and if so, how did
it know how to do that?

If we had (say) a mempool of wb_writeback_work's (at least for for
wb_start_writeback), would that help anything?  Or would writeback
simply fail shortly afterwards for other reasons?
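
For what it's worth, a hypothetical sketch of the mempool variant (the names
and min_nr below are made up for illustration):

	/* Hypothetical: a small dedicated pool of wb_writeback_work objects. */
	static mempool_t *wb_work_pool;

	static int __init wb_work_pool_init(void)
	{
		wb_work_pool = mempool_create_kmalloc_pool(4, /* min_nr */
					sizeof(struct wb_writeback_work));
		return wb_work_pool ? 0 : -ENOMEM;
	}

	/* ... and in wb_start_writeback(): */
	work = mempool_alloc(wb_work_pool, GFP_NOWAIT | __GFP_NOWARN);
	if (work)
		memset(work, 0, sizeof(*work));

Note that mempool_alloc() with a non-sleeping gfp mask can still return NULL
once the pool is drained, so this only delays, rather than removes, the
fallback question.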



^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH] mm,writeback: Don't use memory reserves for wb_start_writeback
@ 2016-03-24 14:03 Tetsuo Handa
  2016-03-24 21:17 ` Andrew Morton
  0 siblings, 1 reply; 15+ messages in thread
From: Tetsuo Handa @ 2016-03-24 14:03 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm

Andrew, can you take this patch?
----------------------------------------
>From 5d43acbc5849a63494a732e39374692822145923 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sun, 13 Mar 2016 23:03:05 +0900
Subject: [PATCH] mm,writeback: Don't use memory reserves for
 wb_start_writeback

When a writeback operation cannot make forward progress because the memory
allocation requests needed for doing I/O cannot be satisfied (e.g. under an
OOM-livelock situation), we can observe a flood of order-0 page allocation
failure messages caused by complete depletion of memory reserves.

This is caused by unconditionally allocating "struct wb_writeback_work"
objects using GFP_ATOMIC from PF_MEMALLOC context.

__alloc_pages_nodemask() {
  __alloc_pages_slowpath() {
    __alloc_pages_direct_reclaim() {
      __perform_reclaim() {
        current->flags |= PF_MEMALLOC;
        try_to_free_pages() {
          do_try_to_free_pages() {
            wakeup_flusher_threads() {
              wb_start_writeback() {
                kzalloc(sizeof(*work), GFP_ATOMIC) {
                  /* ALLOC_NO_WATERMARKS via PF_MEMALLOC */
                }
              }
            }
          }
        }
        current->flags &= ~PF_MEMALLOC;
      }
    }
  }
}

Since I/O is stalling, endlessly allocating writeback requests will
eventually deplete the memory reserves. Fortunately, since wb_start_writeback()
can fall back to wb_wakeup() when allocating "struct wb_writeback_work" fails,
we don't need to allow wb_start_writeback() to use memory reserves.

----------
[   59.562581] Mem-Info:
[   59.563935] active_anon:289393 inactive_anon:2093 isolated_anon:29
[   59.563935]  active_file:10838 inactive_file:113013 isolated_file:859
[   59.563935]  unevictable:0 dirty:108531 writeback:5308 unstable:0
[   59.563935]  slab_reclaimable:5526 slab_unreclaimable:7077
[   59.563935]  mapped:9970 shmem:2159 pagetables:2387 bounce:0
[   59.563935]  free:3042 free_pcp:0 free_cma:0
[   59.574558] Node 0 DMA free:6968kB min:44kB low:52kB high:64kB active_anon:6056kB inactive_anon:176kB active_file:712kB inactive_file:744kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:756kB writeback:0kB mapped:736kB shmem:184kB slab_reclaimable:48kB slab_unreclaimable:208kB kernel_stack:160kB pagetables:144kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9708 all_unreclaimable? yes
[   59.585464] lowmem_reserve[]: 0 1732 1732 1732
[   59.587123] Node 0 DMA32 free:5200kB min:5200kB low:6500kB high:7800kB active_anon:1151516kB inactive_anon:8196kB active_file:42640kB inactive_file:451076kB unevictable:0kB isolated(anon):116kB isolated(file):3564kB present:2080640kB managed:1775332kB mlocked:0kB dirty:433368kB writeback:21232kB mapped:39144kB shmem:8452kB slab_reclaimable:22056kB slab_unreclaimable:28100kB kernel_stack:20976kB pagetables:9404kB unstable:0kB bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2701604 all_unreclaimable? no
[   59.599649] lowmem_reserve[]: 0 0 0 0
[   59.601431] Node 0 DMA: 25*4kB (UME) 16*8kB (UME) 3*16kB (UE) 5*32kB (UME) 2*64kB (UM) 2*128kB (ME) 2*256kB (ME) 1*512kB (E) 1*1024kB (E) 2*2048kB (ME) 0*4096kB = 6964kB
[   59.606509] Node 0 DMA32: 925*4kB (UME) 140*8kB (UME) 5*16kB (ME) 5*32kB (M) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 5060kB
[   59.610415] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[   59.612879] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   59.615308] 126847 total pagecache pages
[   59.616921] 0 pages in swap cache
[   59.618475] Swap cache stats: add 0, delete 0, find 0/0
[   59.620268] Free swap  = 0kB
[   59.621650] Total swap = 0kB
[   59.623011] 524157 pages RAM
[   59.624365] 0 pages HighMem/MovableOnly
[   59.625893] 76348 pages reserved
[   59.627506] 0 pages hwpoisoned
[   59.628838] Out of memory: Kill process 4450 (file_io.00) score 998 or sacrifice child
[   59.631071] Killed process 4450 (file_io.00) total-vm:4308kB, anon-rss:100kB, file-rss:1184kB, shmem-rss:0kB
[   61.526353] kthreadd: page allocation failure: order:0, mode:0x2200020
[   61.527976] file_io.00: page allocation failure: order:0, mode:0x2200020
[   61.527978] CPU: 0 PID: 4457 Comm: file_io.00 Not tainted 4.5.0-rc7+ #45
[   61.527979] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[   61.527981]  0000000000000086 000000000005bb2d ffff88006cc5b588 ffffffff812a4d65
[   61.527982]  0000000002200020 0000000000000000 ffff88006cc5b618 ffffffff81106dc7
[   61.527983]  0000000000000000 ffffffffffffffff 00ff880000000000 ffff880000000004
[   61.527983] Call Trace:
[   61.528009]  [<ffffffff812a4d65>] dump_stack+0x4d/0x68
[   61.528012]  [<ffffffff81106dc7>] warn_alloc_failed+0xf7/0x150
[   61.528014]  [<ffffffff81109e3f>] __alloc_pages_nodemask+0x23f/0xa60
[   61.528016]  [<ffffffff81137770>] ? page_check_address_transhuge+0x350/0x350
[   61.528018]  [<ffffffff8111327d>] ? page_evictable+0xd/0x40
[   61.528019]  [<ffffffff8114d927>] alloc_pages_current+0x87/0x110
[   61.528021]  [<ffffffff81155181>] new_slab+0x3a1/0x440
[   61.528023]  [<ffffffff81156fdf>] ___slab_alloc+0x3cf/0x590
[   61.528024]  [<ffffffff811a0999>] ? wb_start_writeback+0x39/0x90
[   61.528027]  [<ffffffff815a7f68>] ? preempt_schedule_common+0x1f/0x37
[   61.528028]  [<ffffffff815a7f9f>] ? preempt_schedule+0x1f/0x30
[   61.528030]  [<ffffffff81001012>] ? ___preempt_schedule+0x12/0x14
[   61.528030]  [<ffffffff811a0999>] ? wb_start_writeback+0x39/0x90
[   61.528032]  [<ffffffff81175536>] __slab_alloc.isra.64+0x18/0x1d
[   61.528033]  [<ffffffff8115778c>] kmem_cache_alloc+0x11c/0x150
[   61.528034]  [<ffffffff811a0999>] wb_start_writeback+0x39/0x90
[   61.528035]  [<ffffffff811a0d9f>] wakeup_flusher_threads+0x7f/0xf0
[   61.528036]  [<ffffffff81115ac9>] do_try_to_free_pages+0x1f9/0x410
[   61.528037]  [<ffffffff81115d74>] try_to_free_pages+0x94/0xc0
[   61.528038]  [<ffffffff8110a166>] __alloc_pages_nodemask+0x566/0xa60
[   61.528040]  [<ffffffff81200878>] ? xfs_bmapi_read+0x208/0x2f0
[   61.528041]  [<ffffffff8114d927>] alloc_pages_current+0x87/0x110
[   61.528042]  [<ffffffff8110092f>] __page_cache_alloc+0xaf/0xc0
[   61.528043]  [<ffffffff811011e8>] pagecache_get_page+0x88/0x260
[   61.528044]  [<ffffffff81101d31>] grab_cache_page_write_begin+0x21/0x40
[   61.528046]  [<ffffffff81222c9f>] xfs_vm_write_begin+0x2f/0xf0
[   61.528047]  [<ffffffff810b14be>] ? current_fs_time+0x1e/0x30
[   61.528048]  [<ffffffff81101eca>] generic_perform_write+0xca/0x1c0
[   61.528050]  [<ffffffff8107c390>] ? wake_up_process+0x10/0x20
[   61.528051]  [<ffffffff8122e01c>] xfs_file_buffered_aio_write+0xcc/0x1f0
[   61.528052]  [<ffffffff81079037>] ? finish_task_switch+0x77/0x280
[   61.528053]  [<ffffffff8122e1c4>] xfs_file_write_iter+0x84/0x140
[   61.528054]  [<ffffffff811777a7>] __vfs_write+0xc7/0x100
[   61.528055]  [<ffffffff811784cd>] vfs_write+0x9d/0x190
[   61.528056]  [<ffffffff810010a1>] ? do_audit_syscall_entry+0x61/0x70
[   61.528057]  [<ffffffff811793c0>] SyS_write+0x50/0xc0
[   61.528059]  [<ffffffff815ab4d7>] entry_SYSCALL_64_fastpath+0x12/0x6a
[   61.528059] Mem-Info:
[   61.528062] active_anon:293335 inactive_anon:2093 isolated_anon:0
[   61.528062]  active_file:10829 inactive_file:110045 isolated_file:32
[   61.528062]  unevictable:0 dirty:109275 writeback:822 unstable:0
[   61.528062]  slab_reclaimable:5489 slab_unreclaimable:10070
[   61.528062]  mapped:9999 shmem:2159 pagetables:2420 bounce:0
[   61.528062]  free:3 free_pcp:0 free_cma:0
[   61.528065] Node 0 DMA free:12kB min:44kB low:52kB high:64kB active_anon:6060kB inactive_anon:176kB active_file:708kB inactive_file:756kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:756kB writeback:0kB mapped:736kB shmem:184kB slab_reclaimable:48kB slab_unreclaimable:7160kB kernel_stack:160kB pagetables:144kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9844 all_unreclaimable? yes
[   61.528066] lowmem_reserve[]: 0 1732 1732 1732
[   61.528068] Node 0 DMA32 free:0kB min:5200kB low:6500kB high:7800kB active_anon:1167280kB inactive_anon:8196kB active_file:42608kB inactive_file:439424kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:2080640kB managed:1775332kB mlocked:0kB dirty:436344kB writeback:3288kB mapped:39260kB shmem:8452kB slab_reclaimable:21908kB slab_unreclaimable:33120kB kernel_stack:20976kB pagetables:9536kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:11073180 all_unreclaimable? yes
[   61.528069] lowmem_reserve[]: 0 0 0 0
[   61.528072] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[   61.528074] Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[   61.528075] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[   61.528075] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   61.528076] 123086 total pagecache pages
[   61.528076] 0 pages in swap cache
[   61.528077] Swap cache stats: add 0, delete 0, find 0/0
[   61.528077] Free swap  = 0kB
[   61.528077] Total swap = 0kB
[   61.528077] 524157 pages RAM
[   61.528078] 0 pages HighMem/MovableOnly
[   61.528078] 76348 pages reserved
[   61.528078] 0 pages hwpoisoned
[   61.528079] SLUB: Unable to allocate memory on node -1 (gfp=0x2088020)
[   61.528080]   cache: kmalloc-64, object size: 64, buffer size: 64, default order: 0, min order: 0
[   61.528080]   node 0: slabs: 3218, objs: 205952, free: 0
[   61.528085] file_io.00: page allocation failure: order:0, mode:0x2200020
[   61.528086] CPU: 0 PID: 4457 Comm: file_io.00 Not tainted 4.5.0-rc7+ #45
----------

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 fs/fs-writeback.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 5c46ed9..21450c7 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -929,7 +929,8 @@ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
 	 * This is WB_SYNC_NONE writeback, so if allocation fails just
 	 * wakeup the thread for old dirty data writeback
 	 */
-	work = kzalloc(sizeof(*work), GFP_ATOMIC);
+	work = kzalloc(sizeof(*work),
+		       GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
 	if (!work) {
 		trace_writeback_nowork(wb);
 		wb_wakeup(wb);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2016-04-28 13:28 UTC | newest]

Thread overview: 15+ messages
2016-03-13  5:32 [PATCH] mm,writeback: Don't use ALLOC_NO_WATERMARKS for wb_start_writeback Tetsuo Handa
2016-03-13 14:22 ` [PATCH] mm,writeback: Don't use memory reserves " Tetsuo Handa
2016-03-13 14:22   ` Tetsuo Handa
2016-03-14 16:09   ` Michal Hocko
2016-03-16 20:46     ` Tejun Heo
2016-03-18 13:11       ` Jan Kara
2016-03-18 13:34         ` Michal Hocko
2016-03-18 13:42   ` Michal Hocko
2016-03-24 14:03 Tetsuo Handa
2016-03-24 21:17 ` Andrew Morton
2016-03-25 11:54   ` Tetsuo Handa
2016-03-29  8:54   ` Michal Hocko
2016-03-29 16:49     ` Jan Kara
2016-04-04 10:58       ` Tetsuo Handa
2016-04-28 13:26 Tetsuo Handa
