linux-block.vger.kernel.org archive mirror
* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
       [not found] ` <20200501135806.4eebf0b92f84ab60bba3e1e7@linux-foundation.org>
@ 2020-05-18 14:10   ` Naresh Kamboju
  2020-05-19  7:52     ` Michal Hocko
  0 siblings, 1 reply; 48+ messages in thread
From: Naresh Kamboju @ 2020-05-18 14:10 UTC (permalink / raw)
  To: linux-f2fs-devel, linux-ext4, linux-block, Andrew Morton
  Cc: open list, Linux-Next Mailing List, linux-mm, Andreas Dilger,
	Jaegeuk Kim, Theodore Ts'o, Chao Yu, Arnd Bergmann,
	Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Chao Yu,
	lkft-triage

Thanks for looking into this problem.

On Sat, 2 May 2020 at 02:28, Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri, 1 May 2020 18:08:28 +0530 Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
>
> > mkfs -t ext4 invoked oom-killer on i386 kernel running on x86_64 device
> > and started happening on linux-next master branch kernel tag next-20200430
> > and next-20200501. We did not bisect this problem.
>
> It would be wonderful if you could do so, please.  I can't immediately see
> any MM change in this area which might cause this.

We are planning a bisection soon on this problem.

>
> > metadata
> >   git branch: master
> >   git repo: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
> >   git commit: e4a08b64261ab411b15580c369a3b8fbed28bbc1
> >   git describe: next-20200430
> >   make_kernelversion: 5.7.0-rc3
> >   kernel-config:
> > https://builds.tuxbuild.com/1YrE_XUQ6odA52tSBM919w/kernel.config
> >
> > Steps to reproduce: (always reproducible)
>
> Reproducibility helps!
>
> > oom-killer: gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0,
>
> > [   34.793430]  pagecache_get_page+0xae/0x260
>
> > [   34.897923] active_anon:5366 inactive_anon:2172 isolated_anon:0
> > [   34.897923]  active_file:4151 inactive_file:212494 isolated_file:0
> > [   34.897923]  unevictable:0 dirty:16505 writeback:6520 unstable:0
>
> > [ 34.987678] Normal free:3948kB min:7732kB low:8640kB high:9548kB
> > reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
> > active_file:1096kB inactive_file:786400kB unevictable:0kB
> > writepending:65432kB present:884728kB managed:845576kB mlocked:0kB
> > kernel_stack:1112kB pagetables:0kB bounce:0kB free_pcp:2908kB
> > local_pcp:500kB free_cma:0kB
>
> ZONE_NORMAL has a huge amount of clean pagecache stuck on the
> inactive list, not being reclaimed.

FYI,
This issue has already been reported here.
It is now easily reproducible on i386
and ARM BeagleBoard-X15 devices.

mkfs -t ext4 /dev/disk/by-id/ata-SanDisk_SSD_PLUS_120GB_190703A01414
mke2fs 1.43.8 (1-Jan-2018)
Discarding device blocks:     4096/29306880
2625536/29306880
9441280/29306880                 16257024/29306880
23072768/29306880
                                 done
Creating filesystem with 29306880 4k blocks and 7331840 inodes
Filesystem UUID: a838d994-0a1e-403a-88d5-444d75aecc5a
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872
Allocating group tables:   0/895                     done
Writing inode tables:   0/895                     done
Creating journal (131072 blocks): [   31.251333] mkfs.ext4 invoked
oom-killer: gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0,
oom_score_adj=0
[   31.261172] CPU: 0 PID: 397 Comm: mkfs.ext4 Not tainted
5.7.0-rc6-next-20200518 #1
[   31.268771] Hardware name: Generic DRA74X (Flattened Device Tree)
[   31.274904] [<c0411500>] (unwind_backtrace) from [<c040b66c>]
(show_stack+0x10/0x14)
[   31.282685] [<c040b66c>] (show_stack) from [<c08b1b14>]
(dump_stack+0xc4/0xd8)
[   31.289940] [<c08b1b14>] (dump_stack) from [<c0547bf8>]
(dump_header+0x54/0x1ec)
[   31.297367] [<c0547bf8>] (dump_header) from [<c0547008>]
(oom_kill_process+0x18c/0x198)
[   31.305405] [<c0547008>] (oom_kill_process) from [<c0547a0c>]
(out_of_memory+0x250/0x368)
[   31.313619] [<c0547a0c>] (out_of_memory) from [<c0599d80>]
(__alloc_pages_nodemask+0xce8/0x10bc)
[   31.322445] [<c0599d80>] (__alloc_pages_nodemask) from [<c0541bb4>]
(pagecache_get_page+0x128/0x358)
[   31.331619] [<c0541bb4>] (pagecache_get_page) from [<c0543a8c>]
(grab_cache_page_write_begin+0x18/0x2c)
[   31.341054] [<c0543a8c>] (grab_cache_page_write_begin) from
[<c0619fb0>] (block_write_begin+0x20/0xc4)
[   31.350401] [<c0619fb0>] (block_write_begin) from [<c053e718>]
(generic_perform_write+0xb8/0x1d8)
[   31.359312] [<c053e718>] (generic_perform_write) from [<c054496c>]
(__generic_file_write_iter+0x164/0x1ec)
[   31.369007] [<c054496c>] (__generic_file_write_iter) from
[<c061c8a4>] (blkdev_write_iter+0xc8/0x1a4)
[   31.378269] [<c061c8a4>] (blkdev_write_iter) from [<c05d50d0>]
(__vfs_write+0x13c/0x1cc)
[   31.386397] [<c05d50d0>] (__vfs_write) from [<c05d81d4>]
(vfs_write+0xb0/0x1bc)
[   31.393738] [<c05d81d4>] (vfs_write) from [<c05d85e4>]
(ksys_pwrite64+0x60/0x8c)
[   31.401167] [<c05d85e4>] (ksys_pwrite64) from [<c04001a0>]
(ret_fast_syscall+0x0/0x4c)
[   31.409115] Exception stack(0xe810dfa8 to 0xe810dff0)
[   31.414185] dfa0:                   a2000000 0000000d 00000003
b6952008 00400000 00000000
[   31.422395] dfc0: a2000000 0000000d a2000000 000000b5 00400000
0003b768 b6952008 00da2000
[   31.430604] dfe0: 00000064 beb891b8 b6f85108 b6e38f2c
[   31.435809] Mem-Info:
[   31.438098] active_anon:5813 inactive_anon:4129 isolated_anon:0
[   31.438098]  active_file:6080 inactive_file:118548 isolated_file:0
[   31.438098]  unevictable:0 dirty:13674 writeback:7440 unstable:0
[   31.438098]  slab_reclaimable:5651 slab_unreclaimable:4566
[   31.438098]  mapped:5585 shmem:4468 pagetables:182 bounce:0
[   31.438098]  free:347556 free_pcp:608 free_cma:57235
[   31.472362] Node 0 active_anon:23252kB inactive_anon:16516kB
active_file:24320kB inactive_file:474192kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB mapped:22340kB dirty:54696kB
writeback:11196kB shmem:17872kB shmem_thp: 0kB shmem_pmdmapped: 0kB
anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
[   31.500943] DMA free:187396kB min:22528kB low:28160kB high:33792kB
reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
active_file:4736kB inactive_file:431688kB unevictable:0kB
writepending:62020kB present:783360kB managed:668264kB mlocked:0kB
kernel_stack:888kB pagetables:0kB bounce:0kB free_pcp:880kB
local_pcp:216kB free_cma:163840kB
[   31.531339] lowmem_reserve[]: 0 0 1216 0
[   31.535289] HighMem free:1203904kB min:512kB low:11592kB
high:22672kB reserved_highatomic:0KB active_anon:23252kB
inactive_anon:16516kB active_file:19584kB inactive_file:42420kB
unevictable:0kB writepending:0kB present:1310720kB managed:1310720kB
mlocked:0kB kernel_stack:0kB pagetables:728kB bounce:0kB
free_pcp:1584kB local_pcp:1232kB free_cma:65100kB
[   31.566540] lowmem_reserve[]: 0 0 0 0
[   31.570244] DMA: 87*4kB (UME) 53*8kB (UME) 26*16kB (UE) 6*32kB (UM)
1*64kB (E) 1*128kB (U) 5*256kB (ME) 5*512kB (ME) 4*1024kB (ME)
5*2048kB (M) 1*4096kB (M) 20*8192kB (C) = 187684kB
[   31.586520] HighMem: 2*4kB (MC) 1*8kB (C) 1*16kB (M) 5*32kB (UM)
4*64kB (UMC) 2*128kB (UM) 2*256kB (UM) 1*512kB (C) 2*1024kB (MC)
2*2048kB (MC) 2*4096kB (UC) 145*8192kB (MC) = 1203904kB
[   31.603150] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=2048kB
[   31.611637] 129102 total pagecache pages
[   31.615577] 0 pages in swap cache
[   31.618902] Swap cache stats: add 0, delete 0, find 0/0
[   31.624162] Free swap  = 0kB
[   31.627053] Total swap = 0kB
[   31.629955] 523520 pages RAM
[   31.632846] 327680 pages HighMem/MovableOnly
[   31.637128] 28774 pages reserved
[   31.640381] 57344 pages cma reserved
[   31.643971] Tasks state (memory values in pages):
[   31.648691] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes
swapents oom_score_adj name
[   31.657367] [    183]     0   183     7370     1082    36864
0             0 systemd-journal
[   31.666466] [    209]   994   209     3742      326    40960
0             0 systemd-timesyn
[   31.675570] [    217]     0   217     3398      817    32768
0         -1000 systemd-udevd
[   31.684498] [    230]   993   230     1411      737    32768
0             0 systemd-network
[   31.693598] [    231]   992   231     1496      712    32768
0             0 systemd-resolve
[   31.702702] [    236]   996   236     1112      742    24576
0          -900 dbus-daemon
[   31.711454] [    241]     0   241     1895     1045    36864
0             0 haveged
[   31.719857] [    242]     0   242     1362      906    28672
0             0 systemd-logind
[   31.728855] [    243]     0   243    13412     2571    69632
0             0 NetworkManager
[   31.737867] [    244]   995   244     1197      608    28672
0             0 avahi-daemon
[   31.746707] [    245]   995   245     1164       59    28672
0             0 avahi-daemon
[   31.755545] [    246]     0   246      594      332    28672
0             0 atd
[   31.763601] [    248]     0   248      699       99    24576
0             0 syslogd
[   31.772001] [    251]     0   251      699      102    24576
0             0 klogd
[   31.780231] [    252]     0   252      676      365    24576
0             0 crond
[   31.788443] [    254]     0   254     1172      240    32768
0             0 systemd-hostnam
[   31.797547] [    264] 65534   264      605       32    24576
0             0 dnsmasq
[   31.805948] [    265]     0   265      556      357    28672
0             0 agetty
[   31.814262] [    266]     0   266     1131      613    32768
0             0 login
[   31.822492] [    268]   998   268    18201     2629    81920
0             0 polkitd
[   31.830895] [    350]     0   350     1840     1161    32768
0             0 systemd
[   31.839286] [    351]     0   351     2403      473    36864
0             0 (sd-pam)
[   31.847774] [    355]     0   355      827      611    24576
0             0 sh
[   31.855742] [    364]     0   364     7341     1145    53248
0             0 nm-dispatcher
[   31.864667] [    377]     0   377      711      510    28672
0             0 lava-test-runne
[   31.873770] [    387]     0   387      711      138    20480
0             0 lava-test-shell
[   31.882869] [    388]     0   388      711      523    20480
0             0 sh
[   31.890837] [    397]     0   397     1785     1518    36864
0             0 mkfs.ext4
[   31.899397] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),global_oom,task_memcg=/,task=polkitd,pid=268,uid=998
[   31.910012] Out of memory: Killed process 268 (polkitd)
total-vm:72804kB, anon-rss:2948kB, file-rss:7568kB, shmem-rss:0kB,
UID:998 pgtables:80kB oom_score_adj:0
[   31.927948] oom_reaper: reaped process 268 (polkitd), now
anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[   31.937461] mkfs.ext4 invoked oom-killer:
gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0, oom_score_adj=0
[   31.947273] CPU: 1 PID: 397 Comm: mkfs.ext4 Not tainted
5.7.0-rc6-next-20200518 #1
[   31.954871] Hardware name: Generic DRA74X (Flattened Device Tree)
[   31.961000] [<c0411500>] (unwind_backtrace) from [<c040b66c>]
(show_stack+0x10/0x14)
[   31.968778] [<c040b66c>] (show_stack) from [<c08b1b14>]
(dump_stack+0xc4/0xd8)
[   31.976032] [<c08b1b14>] (dump_stack) from [<c0547bf8>]
(dump_header+0x54/0x1ec)
[   31.983458] [<c0547bf8>] (dump_header) from [<c0547008>]
(oom_kill_process+0x18c/0x198)
[   31.991495] [<c0547008>] (oom_kill_process) from [<c0547a0c>]
(out_of_memory+0x250/0x368)
[   31.999706] [<c0547a0c>] (out_of_memory) from [<c0599d80>]
(__alloc_pages_nodemask+0xce8/0x10bc)
[   32.008532] [<c0599d80>] (__alloc_pages_nodemask) from [<c0541bb4>]
(pagecache_get_page+0x128/0x358)
[   32.017704] [<c0541bb4>] (pagecache_get_page) from [<c0543a8c>]
(grab_cache_page_write_begin+0x18/0x2c)
[   32.027138] [<c0543a8c>] (grab_cache_page_write_begin) from
[<c0619fb0>] (block_write_begin+0x20/0xc4)
[   32.036484] [<c0619fb0>] (block_write_begin) from [<c053e718>]
(generic_perform_write+0xb8/0x1d8)
[   32.045395] [<c053e718>] (generic_perform_write) from [<c054496c>]
(__generic_file_write_iter+0x164/0x1ec)
[   32.055090] [<c054496c>] (__generic_file_write_iter) from
[<c061c8a4>] (blkdev_write_iter+0xc8/0x1a4)
[   32.064350] [<c061c8a4>] (blkdev_write_iter) from [<c05d50d0>]
(__vfs_write+0x13c/0x1cc)
[   32.072476] [<c05d50d0>] (__vfs_write) from [<c05d81d4>]
(vfs_write+0xb0/0x1bc)
[   32.079814] [<c05d81d4>] (vfs_write) from [<c05d85e4>]
(ksys_pwrite64+0x60/0x8c)
[   32.087241] [<c05d85e4>] (ksys_pwrite64) from [<c04001a0>]
(ret_fast_syscall+0x0/0x4c)
[   32.095187] Exception stack(0xe810dfa8 to 0xe810dff0)
[   32.100256] dfa0:                   a2000000 0000000d 00000003
b6952008 00400000 00000000
[   32.108466] dfc0: a2000000 0000000d a2000000 000000b5 00400000
0003b768 b6952008 00da2000
[   32.116673] dfe0: 00000064 beb891b8 b6f85108 b6e38f2c
[   32.121786] Mem-Info:
[   32.124070] active_anon:5056 inactive_anon:4129 isolated_anon:0
[   32.124070]  active_file:6289 inactive_file:118790 isolated_file:0
[   32.124070]  unevictable:0 dirty:14118 writeback:6 unstable:0
[   32.124070]  slab_reclaimable:5653 slab_unreclaimable:4209
[   32.124070]  mapped:4839 shmem:4468 pagetables:165 bounce:0
[   32.124070]  free:348249 free_pcp:562 free_cma:57235
[   32.158031] Node 0 active_anon:20224kB inactive_anon:16516kB
active_file:25156kB inactive_file:475160kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB mapped:19356kB dirty:56472kB
writeback:24kB shmem:17872kB shmem_thp: 0kB shmem_pmdmapped: 0kB
anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
[   32.186324] DMA free:186320kB min:22528kB low:28160kB high:33792kB
reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
active_file:4736kB inactive_file:433580kB unevictable:0kB
writepending:56468kB present:783360kB managed:668264kB mlocked:0kB
kernel_stack:888kB pagetables:0kB bounce:0kB free_pcp:420kB
local_pcp:220kB free_cma:163840kB
[   32.216693] lowmem_reserve[]: 0 0 1216 0
[   32.220652] HighMem free:1206676kB min:512kB low:11592kB
high:22672kB reserved_highatomic:0KB active_anon:20224kB
inactive_anon:16516kB active_file:20420kB inactive_file:41584kB
unevictable:0kB writepending:0kB present:1310720kB managed:1310720kB
mlocked:0kB kernel_stack:0kB pagetables:660kB bounce:0kB
free_pcp:1816kB local_pcp:340kB free_cma:65100kB
[   32.251805] lowmem_reserve[]: 0 0 0 0
[   32.255482] DMA: 2*4kB (UM) 3*8kB (UME) 1*16kB (U) 1*32kB (M)
0*64kB 1*128kB (U) 5*256kB (ME) 5*512kB (ME) 4*1024kB (ME) 5*2048kB
(M) 1*4096kB (M) 20*8192kB (C) = 186320kB
[   32.270871] HighMem: 183*4kB (UMC) 65*8kB (UMC) 21*16kB (M) 11*32kB
(UM) 6*64kB (UMC) 3*128kB (UM) 3*256kB (UM) 2*512kB (MC) 2*1024kB (MC)
2*2048kB (MC) 2*4096kB (UC) 145*8192kB (MC) = 1206676kB
[   32.288273] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=2048kB
[   32.296751] 129546 total pagecache pages
[   32.300695] 0 pages in swap cache
[   32.304019] Swap cache stats: add 0, delete 0, find 0/0
[   32.309260] Free swap  = 0kB
[   32.312155] Total swap = 0kB
[   32.315045] 523520 pages RAM
[   32.317932] 327680 pages HighMem/MovableOnly
[   32.322221] 28774 pages reserved
[   32.325457] 57344 pages cma reserved
[   32.329043] Tasks state (memory values in pages):
[   32.333771] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes
swapents oom_score_adj name
[   32.342436] [    183]     0   183     7370     1082    36864
0             0 systemd-journal
[   32.351529] [    209]   994   209     3742      326    40960
0             0 systemd-timesyn
[   32.360620] [    217]     0   217     3398      817    32768
0         -1000 systemd-udevd
[   32.369528] [    230]   993   230     1411      737    32768
0             0 systemd-network
[   32.378620] [    231]   992   231     1496      712    32768
0             0 systemd-resolve
[   32.387713] [    236]   996   236     1112      742    24576
0          -900 dbus-daemon
[   32.396456] [    241]     0   241     1895     1045    36864
0             0 haveged
[   32.404850] [    242]     0   242     1362      906    28672
0             0 systemd-logind
[   32.413852] [    243]     0   243    13412     2571    69632
0             0 NetworkManager
[   32.422858] [    244]   995   244     1197      608    28672
0             0 avahi-daemon
[   32.431687] [    245]   995   245     1164       59    28672
0             0 avahi-daemon
[   32.440518] [    246]     0   246      594      332    28672
0             0 atd
[   32.448553] [    248]     0   248      699       99    24576
0             0 syslogd
[   32.456945] [    251]     0   251      699      102    24576
0             0 klogd
[   32.465171] [    252]     0   252      676      365    24576
0             0 crond
[   32.473390] [    254]     0   254     1172      240    32768
0             0 systemd-hostnam
[   32.482481] [    264] 65534   264      605       32    24576
0             0 dnsmasq
[   32.490876] [    265]     0   265      556      357    28672
0             0 agetty
[   32.499175] [    266]     0   266     1131      613    32768
0             0 login
[   32.507394] [    350]     0   350     1840     1161    32768
0             0 systemd
[   32.515788] [    351]     0   351     2403      473    36864
0             0 (sd-pam)
[   32.524268] [    355]     0   355      827      611    24576
0             0 sh
[   32.532227] [    364]     0   364     7341     1145    53248
0             0 nm-dispatcher
[   32.541142] [    377]     0   377      711      510    28672
0             0 lava-test-runne
[   32.550234] [    387]     0   387      711      138    20480
0             0 lava-test-shell
[   32.559316] [    388]     0   388      711      523    20480
0             0 sh
[   32.567273] [    397]     0   397     1785     1518    36864
0             0 mkfs.ext4

ref:
https://lkft.validation.linaro.org/scheduler/job/1436647#L4261
https://lkft.validation.linaro.org/scheduler/job/1436562#L1247

--
Linaro LKFT
https://lkft.linaro.org


* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-05-18 14:10   ` mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page Naresh Kamboju
@ 2020-05-19  7:52     ` Michal Hocko
  2020-05-19  8:11       ` Arnd Bergmann
  0 siblings, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2020-05-19  7:52 UTC (permalink / raw)
  To: Naresh Kamboju
  Cc: linux-f2fs-devel, linux-ext4, linux-block, Andrew Morton,
	open list, Linux-Next Mailing List, linux-mm, Andreas Dilger,
	Jaegeuk Kim, Theodore Ts'o, Chao Yu, Arnd Bergmann,
	Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Chao Yu,
	lkft-triage

On Mon 18-05-20 19:40:55, Naresh Kamboju wrote:
> Thanks for looking into this problem.
> 
> On Sat, 2 May 2020 at 02:28, Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Fri, 1 May 2020 18:08:28 +0530 Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
> >
> > > mkfs -t ext4 invoked oom-killer on i386 kernel running on x86_64 device
> > > and started happening on linux-next master branch kernel tag next-20200430
> > > and next-20200501. We did not bisect this problem.
[...]
> Creating journal (131072 blocks): [   31.251333] mkfs.ext4 invoked
> oom-killer: gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0,
> oom_score_adj=0
[...]
> [   31.500943] DMA free:187396kB min:22528kB low:28160kB high:33792kB
> reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
> active_file:4736kB inactive_file:431688kB unevictable:0kB
> writepending:62020kB present:783360kB managed:668264kB mlocked:0kB
> kernel_stack:888kB pagetables:0kB bounce:0kB free_pcp:880kB
> local_pcp:216kB free_cma:163840kB

This is really unexpected. You are saying this is a regular i386 and DMA
should be bottom 16MB while yours is 780MB and the rest of the low mem
is in the Normal zone which is completely missing here. How have you got
to that configuration? I have to say I haven't seen anything like that
on i386.

The failing request is GFP_USER so highmem is not really allowed but
free pages are way above watermarks so the allocation should have just
succeeded.

-- 
Michal Hocko
SUSE Labs
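
The remark that a GFP_USER request may not use highmem can be checked against
the gfp_mask values printed in the logs. The following is a minimal sketch,
not kernel code: the bit values are recalled from include/linux/gfp.h around
v5.7 and are assumptions to verify against the tree under test, and
decode_gfp() is a hypothetical helper name.

```python
# Bit values as recalled from include/linux/gfp.h around v5.7.
# They are assumptions; check the header of the kernel under test.
GFP_BITS = {
    "__GFP_HIGHMEM": 0x02,
    "__GFP_IO": 0x40,
    "__GFP_FS": 0x80,
    "__GFP_DIRECT_RECLAIM": 0x400,
    "__GFP_KSWAPD_RECLAIM": 0x800,
    "__GFP_WRITE": 0x1000,
    "__GFP_HARDWALL": 0x100000,
}


def decode_gfp(mask):
    """Return (names of known bits set in mask, leftover bits not in the table)."""
    names = [name for name, bit in GFP_BITS.items() if mask & bit]
    known = 0
    for name in names:
        known |= GFP_BITS[name]
    return names, mask & ~known


# gfp_mask values taken from the mkfs.ext4 and dd OOM reports in this thread.
for mask in (0x101cc0, 0x100cc0):
    names, leftover = decode_gfp(mask)
    print(hex(mask), names, "leftover:", hex(leftover))
    print("  highmem allowed:", "__GFP_HIGHMEM" in names)
```

Under these assumptions neither mask carries __GFP_HIGHMEM, and the two masks
differ only in the __GFP_WRITE bit, which matches the decoded names printed
next to them in the logs.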


* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-05-19  7:52     ` Michal Hocko
@ 2020-05-19  8:11       ` Arnd Bergmann
  2020-05-19  8:45         ` Michal Hocko
  0 siblings, 1 reply; 48+ messages in thread
From: Arnd Bergmann @ 2020-05-19  8:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Naresh Kamboju, Linux F2FS DEV, Mailing List, linux-ext4,
	linux-block, Andrew Morton, open list, Linux-Next Mailing List,
	linux-mm, Andreas Dilger, Jaegeuk Kim, Theodore Ts'o,
	Chao Yu, Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Chao Yu,
	lkft-triage

On Tue, May 19, 2020 at 9:52 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 18-05-20 19:40:55, Naresh Kamboju wrote:
> > Thanks for looking into this problem.
> >
> > On Sat, 2 May 2020 at 02:28, Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > On Fri, 1 May 2020 18:08:28 +0530 Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
> > >
> > > > mkfs -t ext4 invoked oom-killer on i386 kernel running on x86_64 device
> > > > and started happening on linux-next master branch kernel tag next-20200430
> > > > and next-20200501. We did not bisect this problem.
> [...]
> > Creating journal (131072 blocks): [   31.251333] mkfs.ext4 invoked
> > oom-killer: gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0,
> > oom_score_adj=0
> [...]
> > [   31.500943] DMA free:187396kB min:22528kB low:28160kB high:33792kB
> > reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
> > active_file:4736kB inactive_file:431688kB unevictable:0kB
> > writepending:62020kB present:783360kB managed:668264kB mlocked:0kB
> > kernel_stack:888kB pagetables:0kB bounce:0kB free_pcp:880kB
> > local_pcp:216kB free_cma:163840kB
>
> This is really unexpected. You are saying this is a regular i386 and DMA
> should be bottom 16MB while yours is 780MB and the rest of the low mem
> is in the Normal zone which is completely missing here. How have you got
> to that configuration? I have to say I haven't seen anything like that
> on i386.

I think that line comes from an ARM32 BeagleBoard-X15 machine showing
the same symptom. The i386 line from the log file that Naresh linked to at
https://lkft.validation.linaro.org/scheduler/job/1406110#L1223  is less
unusual:

[   34.931663] Node 0 active_anon:21464kB inactive_anon:8688kB
active_file:16604kB inactive_file:849976kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB mapped:25284kB dirty:58952kB
writeback:27772kB shmem:8944kB writeback_tmp:0kB unstable:0kB
all_unreclaimable? yes
[   34.955523] DMA free:3356kB min:68kB low:84kB high:100kB
reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
active_file:0kB inactive_file:11964kB unevictable:0kB
writepending:11980kB present:15964kB managed:15876kB mlocked:0kB
kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB
free_cma:0kB
[   34.983385] lowmem_reserve[]: 0 825 1947 825
[   34.987678] Normal free:3948kB min:7732kB low:8640kB high:9548kB
reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
active_file:1096kB inactive_file:786400kB unevictable:0kB
writepending:65432kB present:884728kB managed:845576kB mlocked:0kB
kernel_stack:1112kB pagetables:0kB bounce:0kB free_pcp:2908kB
local_pcp:500kB free_cma:0kB
[   35.017427] lowmem_reserve[]: 0 0 8980 0
[   35.021362] HighMem free:1049496kB min:512kB low:1748kB high:2984kB
reserved_highatomic:0KB active_anon:21464kB inactive_anon:8688kB
active_file:15508kB inactive_file:51612kB unevictable:0kB
writepending:0kB present:1149540kB managed:1149540kB mlocked:0kB
kernel_stack:0kB pagetables:712kB bounce:0kB free_pcp:1524kB
local_pcp:292kB free_cma:0kB
[   35.051717] lowmem_reserve[]: 0 0 0 0
[   35.055374] DMA: 8*4kB (UE) 1*8kB (E) 1*16kB (E) 0*32kB 0*64kB
0*128kB 1*256kB (E) 0*512kB 1*1024kB (E) 1*2048kB (E) 0*4096kB =
3384kB
[   35.067446] Normal: 27*4kB (U) 23*8kB (U) 12*16kB (UE) 12*32kB (U)
4*64kB (UE) 2*128kB (U) 2*256kB (UE) 1*512kB (E) 0*1024kB 1*2048kB (U)
0*4096kB = 4452kB
[   35.081347] HighMem: 2*4kB (UM) 0*8kB 1*16kB (M) 2*32kB (UM) 1*64kB
(U) 0*128kB 1*256kB (M) 1*512kB (M) 0*1024kB 0*2048kB 256*4096kB (M) =
1049496kB

        Arnd
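
The per-order free lists at the end of the report can be cross-checked against
the totals printed on the same lines. This is a small sketch for doing that
arithmetic; buddy_total_kb() is a hypothetical helper, and the input strings
are copied from the i386 log quoted above.

```python
import re


def buddy_total_kb(text):
    """Sum the count*size terms of a free-list line; return (computed, printed)."""
    flat = " ".join(text.split())  # undo the mail line wrapping
    terms = re.findall(r"(\d+)\*(\d+)kB", flat)
    computed = sum(int(count) * int(size) for count, size in terms)
    printed = int(re.search(r"=\s*(\d+)kB", flat).group(1))
    return computed, printed


NORMAL = """[   35.067446] Normal: 27*4kB (U) 23*8kB (U) 12*16kB (UE) 12*32kB (U)
4*64kB (UE) 2*128kB (U) 2*256kB (UE) 1*512kB (E) 0*1024kB 1*2048kB (U)
0*4096kB = 4452kB"""

HIGHMEM = """[   35.081347] HighMem: 2*4kB (UM) 0*8kB 1*16kB (M) 2*32kB (UM) 1*64kB
(U) 0*128kB 1*256kB (M) 1*512kB (M) 0*1024kB 0*2048kB 256*4096kB (M) =
1049496kB"""

for line in (NORMAL, HIGHMEM):
    computed, printed = buddy_total_kb(line)
    print(computed, printed, computed == printed)
```

Almost all of the free HighMem sits in 256 blocks of 4096kB, while the Normal
zone has only a few megabytes spread over small orders.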


* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-05-19  8:11       ` Arnd Bergmann
@ 2020-05-19  8:45         ` Michal Hocko
  2020-05-20 11:56           ` Naresh Kamboju
  0 siblings, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2020-05-19  8:45 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Naresh Kamboju, Linux F2FS DEV, Mailing List, linux-ext4,
	linux-block, Andrew Morton, open list, Linux-Next Mailing List,
	linux-mm, Andreas Dilger, Jaegeuk Kim, Theodore Ts'o,
	Chao Yu, Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Chao Yu,
	lkft-triage

On Tue 19-05-20 10:11:25, Arnd Bergmann wrote:
> On Tue, May 19, 2020 at 9:52 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Mon 18-05-20 19:40:55, Naresh Kamboju wrote:
> > > Thanks for looking into this problem.
> > >
> > > On Sat, 2 May 2020 at 02:28, Andrew Morton <akpm@linux-foundation.org> wrote:
> > > >
> > > > On Fri, 1 May 2020 18:08:28 +0530 Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
> > > >
> > > > > mkfs -t ext4 invoked oom-killer on i386 kernel running on x86_64 device
> > > > > and started happening on linux-next master branch kernel tag next-20200430
> > > > > and next-20200501. We did not bisect this problem.
> > [...]
> > > Creating journal (131072 blocks): [   31.251333] mkfs.ext4 invoked
> > > oom-killer: gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0,
> > > oom_score_adj=0
> > [...]
> > > [   31.500943] DMA free:187396kB min:22528kB low:28160kB high:33792kB
> > > reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
> > > active_file:4736kB inactive_file:431688kB unevictable:0kB
> > > writepending:62020kB present:783360kB managed:668264kB mlocked:0kB
> > > kernel_stack:888kB pagetables:0kB bounce:0kB free_pcp:880kB
> > > local_pcp:216kB free_cma:163840kB
> >
> > This is really unexpected. You are saying this is a regular i386 and DMA
> > should be bottom 16MB while yours is 780MB and the rest of the low mem
> > is in the Normal zone which is completely missing here. How have you got
> > to that configuration? I have to say I haven't seen anything like that
> > on i386.
> 
> I think that line comes from an ARM32 BeagleBoard-X15 machine showing
> the same symptom. The i386 line from the log file that Naresh linked to at
> https://lkft.validation.linaro.org/scheduler/job/1406110#L1223  is less
> unusual:

OK, that makes more sense! At least for the memory layout.
 
> [   34.931663] Node 0 active_anon:21464kB inactive_anon:8688kB
> active_file:16604kB inactive_file:849976kB unevictable:0kB
> isolated(anon):0kB isolated(file):0kB mapped:25284kB dirty:58952kB
> writeback:27772kB shmem:8944kB writeback_tmp:0kB unstable:0kB
> all_unreclaimable? yes
> [   34.955523] DMA free:3356kB min:68kB low:84kB high:100kB
> reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
> active_file:0kB inactive_file:11964kB unevictable:0kB
> writepending:11980kB present:15964kB managed:15876kB mlocked:0kB
> kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB
> free_cma:0kB
> [   34.983385] lowmem_reserve[]: 0 825 1947 825
> [   34.987678] Normal free:3948kB min:7732kB low:8640kB high:9548kB
> reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
> active_file:1096kB inactive_file:786400kB unevictable:0kB
> writepending:65432kB present:884728kB managed:845576kB mlocked:0kB
> kernel_stack:1112kB pagetables:0kB bounce:0kB free_pcp:2908kB
> local_pcp:500kB free_cma:0kB

The lowmem is really low (way below the min watermark), so even the
memory reserves for high priority and atomic requests are depleted.
There is still 786MB of inactive page cache to be reclaimed. It doesn't
seem to be dirty or under writeback, but it still might be pinned by
the filesystem. I would suggest watching the vmscan reclaim tracepoints
to check why reclaim fails to reclaim anything.
-- 
Michal Hocko
SUSE Labs
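
The watermark comparison made in this thread can be reproduced from the zone
lines of the two reports. What follows is a rough sketch only: it compares
free memory against the printed min/low/high values and ignores
lowmem_reserve and the other details of the kernel's real watermark check.
parse_zone_line() and watermark_level() are hypothetical helper names.

```python
import re


def parse_zone_line(text):
    """Return (zone_name, {field: kB}) for one zone line of an OOM report."""
    flat = " ".join(text.split())                    # undo the mail wrapping
    flat = re.sub(r"^\[\s*\d+\.\d+\]\s*", "", flat)  # drop the dmesg timestamp
    name, rest = flat.split(" ", 1)
    fields = {k: int(v) for k, v in re.findall(r"(\w+):(\d+)[kK]B", rest)}
    return name, fields


def watermark_level(zone):
    """Name the highest watermark that the zone's free memory still exceeds."""
    for level in ("high", "low", "min"):
        if zone["free"] > zone[level]:
            return "above " + level
    return "below min"


I386_NORMAL = """[   34.987678] Normal free:3948kB min:7732kB low:8640kB high:9548kB
reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
active_file:1096kB inactive_file:786400kB unevictable:0kB
writepending:65432kB present:884728kB managed:845576kB mlocked:0kB
kernel_stack:1112kB pagetables:0kB bounce:0kB free_pcp:2908kB
local_pcp:500kB free_cma:0kB"""

ARM_DMA = """[   31.500943] DMA free:187396kB min:22528kB low:28160kB high:33792kB
reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
active_file:4736kB inactive_file:431688kB unevictable:0kB
writepending:62020kB present:783360kB managed:668264kB mlocked:0kB
kernel_stack:888kB pagetables:0kB bounce:0kB free_pcp:880kB
local_pcp:216kB free_cma:163840kB"""

for line in (I386_NORMAL, ARM_DMA):
    name, zone = parse_zone_line(line)
    print(name, watermark_level(zone),
          "inactive_file:", zone["inactive_file"], "kB",
          "writepending:", zone["writepending"], "kB")
```

With these inputs the i386 Normal zone comes out below its min watermark and
the ARM DMA zone above its high watermark, and in both zones the inactive
file cache is several times larger than the pages pending writeback.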


* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-05-19  8:45         ` Michal Hocko
@ 2020-05-20 11:56           ` Naresh Kamboju
  2020-05-20 17:59             ` Naresh Kamboju
  0 siblings, 1 reply; 48+ messages in thread
From: Naresh Kamboju @ 2020-05-20 11:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Arnd Bergmann, Linux F2FS DEV, Mailing List, linux-ext4,
	linux-block, Andrew Morton, open list, Linux-Next Mailing List,
	linux-mm, Andreas Dilger, Jaegeuk Kim, Theodore Ts'o,
	Chao Yu, Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Chao Yu,
	lkft-triage

FYI,

This issue is specific to the 32-bit architectures i386 and arm on the linux-next tree.
According to the test results history, the problem first appeared in next-20200430:
Bad : next-20200430
Good : next-20200429

Steps to reproduce:
dd if=/dev/disk/by-id/ata-SanDisk_SSD_PLUS_120GB_190504A00573 of=/dev/null bs=1M count=2048
or
mkfs -t ext4 /dev/disk/by-id/ata-SanDisk_SSD_PLUS_120GB_190804A00BE5


Problem:
[   38.802375] dd invoked oom-killer: gfp_mask=0x100cc0(GFP_USER),
order=0, oom_score_adj=0

i386 crash log:  https://pastebin.com/Hb8U89vU
arm crash log: https://pastebin.com/BD9t3JTm

On Tue, 19 May 2020 at 14:15, Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 19-05-20 10:11:25, Arnd Bergmann wrote:
> > On Tue, May 19, 2020 at 9:52 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Mon 18-05-20 19:40:55, Naresh Kamboju wrote:
> > > > Thanks for looking into this problem.
> > > >
> > > > On Sat, 2 May 2020 at 02:28, Andrew Morton <akpm@linux-foundation.org> wrote:
> > > > >
> > > > > On Fri, 1 May 2020 18:08:28 +0530 Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
> > > > >
> > > > > > mkfs -t ext4 invoked oom-killer on i386 kernel running on x86_64 device
> > > > > > and started happening on linux-next master branch kernel tag next-20200430
> > > > > > and next-20200501. We did not bisect this problem.
> > > [...]
> > > > Creating journal (131072 blocks): [   31.251333] mkfs.ext4 invoked
> > > > oom-killer: gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0,
> > > > oom_score_adj=0
> > > [...]
> > > > [   31.500943] DMA free:187396kB min:22528kB low:28160kB high:33792kB
> > > > reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
> > > > active_file:4736kB inactive_file:431688kB unevictable:0kB
> > > > writepending:62020kB present:783360kB managed:668264kB mlocked:0kB
> > > > kernel_stack:888kB pagetables:0kB bounce:0kB free_pcp:880kB
> > > > local_pcp:216kB free_cma:163840kB
> > >
> > > This is really unexpected. You are saying this is a regular i386 and DMA
> > > should be bottom 16MB while yours is 780MB and the rest of the low mem
> > > is in the Normal zone which is completely missing here. How have you got
> > > to that configuration? I have to say I haven't seen anything like that
> > > on i386.
> >
> > I think that line comes from an ARM32 BeagleBoard-X15 machine showing
> > the same symptom. The i386 line from the log file that Naresh linked to at
> > https://lkft.validation.linaro.org/scheduler/job/1406110#L1223  is less
> > unusual:
>
> OK, that makes more sense! At least for the memory layout.
>
> > [   34.931663] Node 0 active_anon:21464kB inactive_anon:8688kB
> > active_file:16604kB inactive_file:849976kB unevictable:0kB
> > isolated(anon):0kB isolated(file):0kB mapped:25284kB dirty:58952kB
> > writeback:27772kB shmem:8944kB writeback_tmp:0kB unstable:0kB
> > all_unreclaimable? yes
> > [   34.955523] DMA free:3356kB min:68kB low:84kB high:100kB
> > reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
> > active_file:0kB inactive_file:11964kB unevictable:0kB
> > writepending:11980kB present:15964kB managed:15876kB mlocked:0kB
> > kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB
> > free_cma:0kB
> > [   34.983385] lowmem_reserve[]: 0 825 1947 825
> > [   34.987678] Normal free:3948kB min:7732kB low:8640kB high:9548kB
> > reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
> > active_file:1096kB inactive_file:786400kB unevictable:0kB
> > writepending:65432kB present:884728kB managed:845576kB mlocked:0kB
> > kernel_stack:1112kB pagetables:0kB bounce:0kB free_pcp:2908kB
> > local_pcp:500kB free_cma:0kB
>
> The lowmem is really low (way below the min watermark), so even memory
> reserves for high priority and atomic requests are depleted. There is
> still 786MB of inactive page cache to be reclaimed. It doesn't seem to
> be dirty or under the writeback but it still might be pinned by the
> filesystem. I would suggest watching vmscan reclaim tracepoints and
> check why the reclaim fails to reclaim anything.
> --
> Michal Hocko
> SUSE Labs
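
The watermark comparison described above can be sketched as follows. This is
a simplified illustration only, using the DMA and Normal zone figures from the
i386 log quoted in this message. The struct and function names are
hypothetical, and the kernel's real watermark check also takes
lowmem_reserve, allocation order, and allocation flags into account.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of one zone line from the OOM report (kB). */
struct zone_report {
	const char *name;
	unsigned long free_kb;
	unsigned long min_kb;
	unsigned long inactive_file_kb;
};

/* Figures copied from the i386 log quoted above. */
const struct zone_report dma_zone = { "DMA", 3356, 68, 11964 };
const struct zone_report normal_zone = { "Normal", 3948, 7732, 786400 };

/* True when the zone's free memory has fallen below its min watermark. */
bool zone_below_min(const struct zone_report *z)
{
	assert(z != NULL);
	return z->free_kb < z->min_kb;
}
```

With these figures, zone_below_min(&normal_zone) is true (3948kB free against
a 7732kB min watermark) while zone_below_min(&dma_zone) is false, and the
Normal zone still reports 786400kB of inactive file pages.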

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-05-20 11:56           ` Naresh Kamboju
@ 2020-05-20 17:59             ` Naresh Kamboju
  2020-05-20 19:09               ` Chris Down
  2020-05-21  2:39               ` Yafang Shao
  0 siblings, 2 replies; 48+ messages in thread
From: Naresh Kamboju @ 2020-05-20 17:59 UTC (permalink / raw)
  To: Chris Down, Yafang Shao, Michal Hocko, Anders Roxell
  Cc: Linux F2FS DEV, Mailing List, linux-ext4, linux-block,
	Andrew Morton, open list, Linux-Next Mailing List, linux-mm,
	Arnd Bergmann, Andreas Dilger, Jaegeuk Kim, Theodore Ts'o,
	Chao Yu, Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Chao Yu,
	lkft-triage, Johannes Weiner, Roman Gushchin, cgroups

On Wed, 20 May 2020 at 17:26, Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
>
>
> This issue is specific on 32-bit architectures i386 and arm on linux-next tree.
> As per the test results history this problem started happening from
> Bad : next-20200430
> Good : next-20200429
>
> steps to reproduce:
> dd if=/dev/disk/by-id/ata-SanDisk_SSD_PLUS_120GB_190504A00573
> of=/dev/null bs=1M count=2048
> or
> mkfs -t ext4 /dev/disk/by-id/ata-SanDisk_SSD_PLUS_120GB_190804A00BE5
>
>
> Problem:
> [   38.802375] dd invoked oom-killer: gfp_mask=0x100cc0(GFP_USER),
> order=0, oom_score_adj=0

As part of the investigation into this issue, LKFT teammate Anders Roxell
git-bisected the problem and found the bad commit(s) that caused it.

We reverted the following two patches on next-20200519, reran the
reproduction steps, and confirmed that the mkfs -t ext4 test case now passes
(the oom-killer is no longer invoked).

Revert "mm, memcg: avoid stale protection values when cgroup is above
protection"
    This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.

Revert "mm, memcg: decouple e{low,min} state mutations from protection
checks"
    This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.

The i386 test log shows mkfs -t ext4 passing:
https://lkft.validation.linaro.org/scheduler/job/1443405#L1200

ref:
https://lore.kernel.org/linux-mm/cover.1588092152.git.chris@chrisdown.name/
https://lore.kernel.org/linux-mm/CA+G9fYvzLm7n1BE7AJXd8_49fOgPgWWTiQ7sXkVre_zoERjQKg@mail.gmail.com/T/#t

-- 
Linaro LKFT
https://lkft.linaro.org

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-05-20 17:59             ` Naresh Kamboju
@ 2020-05-20 19:09               ` Chris Down
  2020-05-21  9:22                 ` Naresh Kamboju
  2020-05-21  9:55                 ` Michal Hocko
  2020-05-21  2:39               ` Yafang Shao
  1 sibling, 2 replies; 48+ messages in thread
From: Chris Down @ 2020-05-20 19:09 UTC (permalink / raw)
  To: Naresh Kamboju
  Cc: Yafang Shao, Michal Hocko, Anders Roxell, Linux F2FS DEV,
	Mailing List, linux-ext4, linux-block, Andrew Morton, open list,
	Linux-Next Mailing List, linux-mm, Arnd Bergmann, Andreas Dilger,
	Jaegeuk Kim, Theodore Ts'o, Chao Yu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Chao Yu, lkft-triage,
	Johannes Weiner, Roman Gushchin, cgroups

Hi Naresh,

Naresh Kamboju writes:
>As a part of investigation on this issue LKFT teammate Anders Roxell
>git bisected the problem and found bad commit(s) which caused this problem.
>
>The following two patches have been reverted on next-20200519 and retested the
>reproducible steps and confirmed the test case mkfs -t ext4 got PASS.
>( invoked oom-killer is gone now)
>
>Revert "mm, memcg: avoid stale protection values when cgroup is above
>protection"
>    This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
>
>Revert "mm, memcg: decouple e{low,min} state mutations from protection
>checks"
>    This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.

Thanks Anders and Naresh for tracking this down and reverting.

I'll take a look tomorrow. I don't see anything immediately obviously wrong in 
either of those commits from a (very) cursory glance, but they should only be 
taking effect if protections are set.

Since you have i386 hardware available, and I don't, could you please apply 
only "avoid stale protection" again and check if it only happens with that 
commit, or requires both? That would help narrow down the suspects.

Do you use any memcg protections in these tests?

Thank you!

Chris

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-05-20 17:59             ` Naresh Kamboju
  2020-05-20 19:09               ` Chris Down
@ 2020-05-21  2:39               ` Yafang Shao
  2020-05-21  8:58                 ` Naresh Kamboju
  1 sibling, 1 reply; 48+ messages in thread
From: Yafang Shao @ 2020-05-21  2:39 UTC (permalink / raw)
  To: Naresh Kamboju
  Cc: Chris Down, Michal Hocko, Anders Roxell, Linux F2FS DEV,
	Mailing List, linux-ext4, linux-block, Andrew Morton, open list,
	Linux-Next Mailing List, linux-mm, Arnd Bergmann, Andreas Dilger,
	Jaegeuk Kim, Theodore Ts'o, Chao Yu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Chao Yu, lkft-triage,
	Johannes Weiner, Roman Gushchin, Cgroups

On Thu, May 21, 2020 at 2:00 AM Naresh Kamboju
<naresh.kamboju@linaro.org> wrote:
>
> On Wed, 20 May 2020 at 17:26, Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
> >
> >
> > This issue is specific on 32-bit architectures i386 and arm on linux-next tree.
> > As per the test results history this problem started happening from
> > Bad : next-20200430
> > Good : next-20200429
> >
> > steps to reproduce:
> > dd if=/dev/disk/by-id/ata-SanDisk_SSD_PLUS_120GB_190504A00573
> > of=/dev/null bs=1M count=2048
> > or
> > mkfs -t ext4 /dev/disk/by-id/ata-SanDisk_SSD_PLUS_120GB_190804A00BE5
> >
> >
> > Problem:
> > [   38.802375] dd invoked oom-killer: gfp_mask=0x100cc0(GFP_USER),
> > order=0, oom_score_adj=0
>
> As a part of investigation on this issue LKFT teammate Anders Roxell
> git bisected the problem and found bad commit(s) which caused this problem.
>
> The following two patches have been reverted on next-20200519 and retested the
> reproducible steps and confirmed the test case mkfs -t ext4 got PASS.
> ( invoked oom-killer is gone now)
>
> Revert "mm, memcg: avoid stale protection values when cgroup is above
> protection"
>     This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
>
> Revert "mm, memcg: decouple e{low,min} state mutations from protection
> checks"
>     This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
>

My guess is that we made the same mistake in commit "mm, memcg:
decouple e{low,min} state mutations from protection
checks": it reads a stale memcg protection value in
mem_cgroup_below_low() and mem_cgroup_below_min().

Below is a possible fix:

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7a2c56fc..6591b71 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -391,20 +391,28 @@ static inline unsigned long mem_cgroup_protection(struct mem_cgroup *root,
 void mem_cgroup_calculate_protection(struct mem_cgroup *root,
                                     struct mem_cgroup *memcg);

-static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
+static inline bool mem_cgroup_below_low(struct mem_cgroup *root,
+                                       struct mem_cgroup *memcg)
 {
        if (mem_cgroup_disabled())
                return false;

+       if (root == memcg)
+               return false;
+
        return READ_ONCE(memcg->memory.elow) >=
                page_counter_read(&memcg->memory);
 }

-static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
+static inline bool mem_cgroup_below_min(struct mem_cgroup *root,
+                                       struct mem_cgroup *memcg)
 {
        if (mem_cgroup_disabled())
                return false;

+       if (root == memcg)
+               return false;
+
        return READ_ONCE(memcg->memory.emin) >=
                page_counter_read(&memcg->memory);
 }
@@ -896,12 +904,14 @@ static inline void mem_cgroup_calculate_protection(struct mem_cgroup *root,
 {
 }

-static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
+static inline bool mem_cgroup_below_low(struct mem_cgroup *root,
+                                       struct mem_cgroup *memcg)
 {
        return false;
 }

-static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
+static inline bool mem_cgroup_below_min(struct mem_cgroup *root,
+                                       struct mem_cgroup *memcg)
 {
        return false;
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c71660e..fdcdd88 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2637,13 +2637,13 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)

                mem_cgroup_calculate_protection(target_memcg, memcg);

-               if (mem_cgroup_below_min(memcg)) {
+               if (mem_cgroup_below_min(target_memcg, memcg)) {
                        /*
                         * Hard protection.
                         * If there is no reclaimable memory, OOM.
                         */
                        continue;
-               } else if (mem_cgroup_below_low(memcg)) {
+               } else if (mem_cgroup_below_low(target_memcg, memcg)) {
                        /*
                         * Soft protection.
                         * Respect the protection only as long as





> i386 test log shows mkfs -t ext4 pass
> https://lkft.validation.linaro.org/scheduler/job/1443405#L1200
>
> ref:
> https://lore.kernel.org/linux-mm/cover.1588092152.git.chris@chrisdown.name/
> https://lore.kernel.org/linux-mm/CA+G9fYvzLm7n1BE7AJXd8_49fOgPgWWTiQ7sXkVre_zoERjQKg@mail.gmail.com/T/#t
>
> --
> Linaro LKFT
> https://lkft.linaro.org



--
Thanks
Yafang

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-05-21  2:39               ` Yafang Shao
@ 2020-05-21  8:58                 ` Naresh Kamboju
  2020-05-21  9:47                   ` Yafang Shao
  0 siblings, 1 reply; 48+ messages in thread
From: Naresh Kamboju @ 2020-05-21  8:58 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Chris Down, Michal Hocko, Anders Roxell, Linux F2FS DEV,
	Mailing List, linux-ext4, linux-block, Andrew Morton, open list,
	Linux-Next Mailing List, linux-mm, Arnd Bergmann, Andreas Dilger,
	Jaegeuk Kim, Theodore Ts'o, Chao Yu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Chao Yu, lkft-triage,
	Johannes Weiner, Roman Gushchin, Cgroups

On Thu, 21 May 2020 at 08:10, Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Thu, May 21, 2020 at 2:00 AM Naresh Kamboju
> <naresh.kamboju@linaro.org> wrote:
> >
> > On Wed, 20 May 2020 at 17:26, Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
> > >
> > >
> > > This issue is specific on 32-bit architectures i386 and arm on linux-next tree.
> > > As per the test results history this problem started happening from
> > > mkfs -t ext4 /dev/disk/by-id/ata-SanDisk_SSD_PLUS_120GB_190804A00BE5
> > >
> > >
> > > Problem:
> > > [   38.802375] dd invoked oom-killer: gfp_mask=0x100cc0(GFP_USER),
> > > order=0, oom_score_adj=0
> >
> My guess is that we made the same mistake in commit "mm, memcg:
> decouple e{low,min} state mutations from protection
> checks" that it read a stale memcg protection in
> mem_cgroup_below_low() and mem_cgroup_below_min().
>
> Below is a possible fix:

Sorry, the proposed fix did not work.
I applied your patch on top of the linux-next master branch and retested;
mkfs -t ext4 still invoked the oom-killer.

Test log after applying the patch:
https://lkft.validation.linaro.org/scheduler/job/1443936#L1168


Test log:
+ mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
mke2fs 1.43.8 (1-Jan-2018)
Creating filesystem with 244190646 4k blocks and 61054976 inodes
Filesystem UUID: ab107250-bf18-4357-a06a-67f2bfcc1048
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000, 214990848
Allocating group tables:    0/7453                           done
Writing inode tables:    0/7453                           done
Creating journal (262144 blocks): [   34.423940] mkfs.ext4 invoked
oom-killer: gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0,
oom_score_adj=0
[   34.433694] CPU: 0 PID: 402 Comm: mkfs.ext4 Not tainted
5.7.0-rc6-next-20200519+ #1
[   34.441342] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
2.2 05/23/2018
[   34.448734] Call Trace:
[   34.451196]  dump_stack+0x54/0x76
[   34.454517]  dump_header+0x40/0x1f0
[   34.458008]  ? oom_badness+0x1f/0x120
[   34.461673]  ? ___ratelimit+0x6c/0xe0
[   34.465332]  oom_kill_process+0xc9/0x110
[   34.469255]  out_of_memory+0xd7/0x2f0
[   34.472916]  __alloc_pages_nodemask+0xdd1/0xe90
[   34.477446]  ? set_bh_page+0x33/0x50
[   34.481016]  ? __xa_set_mark+0x4d/0x70
[   34.484762]  pagecache_get_page+0xbe/0x250
[   34.488859]  grab_cache_page_write_begin+0x1a/0x30
[   34.493645]  block_write_begin+0x25/0x90
[   34.497569]  blkdev_write_begin+0x1e/0x20
[   34.501574]  ? bdev_evict_inode+0xc0/0xc0
[   34.505578]  generic_perform_write+0x95/0x190
[   34.509927]  __generic_file_write_iter+0xe0/0x1a0
[   34.514626]  blkdev_write_iter+0xbf/0x1c0
[   34.518630]  __vfs_write+0x122/0x1e0
[   34.522200]  vfs_write+0x8f/0x1b0
[   34.525510]  ksys_pwrite64+0x60/0x80
[   34.529081]  __ia32_sys_ia32_pwrite64+0x16/0x20
[   34.533604]  do_fast_syscall_32+0x66/0x240
[   34.537697]  entry_SYSENTER_32+0xa5/0xf8
[   34.541613] EIP: 0xb7f3c549
[   34.544403] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01
10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f
34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90
8d 76
[   34.563140] EAX: ffffffda EBX: 00000003 ECX: b7830010 EDX: 00400000
[   34.569397] ESI: 38400000 EDI: 00000074 EBP: 07438400 ESP: bff1e650
[   34.575654] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000246
[   34.582453] Mem-Info:
[   34.584732] active_anon:5713 inactive_anon:2169 isolated_anon:0
[   34.584732]  active_file:4040 inactive_file:211204 isolated_file:0
[   34.584732]  unevictable:0 dirty:17270 writeback:6240 unstable:0
[   34.584732]  slab_reclaimable:5856 slab_unreclaimable:3439
[   34.584732]  mapped:6192 shmem:2258 pagetables:178 bounce:0
[   34.584732]  free:265105 free_pcp:1330 free_cma:0
[   34.618483] Node 0 active_anon:22852kB inactive_anon:8676kB
active_file:16160kB inactive_file:844816kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB mapped:24768kB dirty:69080kB
writeback:19628kB shmem:9032kB writeback_tmp:0kB unstable:0kB
all_unreclaimable? yes
[   34.642354] DMA free:3588kB min:68kB low:84kB high:100kB
reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
active_file:0kB inactive_file:11848kB unevictable:0kB
writepending:11856kB present:15964kB managed:15876kB mlocked:0kB
kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB
free_cma:0kB
[   34.670194] lowmem_reserve[]: 0 824 1947 824
[   34.674483] Normal free:4228kB min:3636kB low:4544kB high:5452kB
reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
active_file:1136kB inactive_file:786456kB unevictable:0kB
writepending:68084kB present:884728kB managed:845324kB mlocked:0kB
kernel_stack:1104kB pagetables:0kB bounce:0kB free_pcp:3056kB
local_pcp:388kB free_cma:0kB
[   34.704243] lowmem_reserve[]: 0 0 8980 0
[   34.708189] HighMem free:1053028kB min:512kB low:1748kB high:2984kB
reserved_highatomic:0KB active_anon:22852kB inactive_anon:8676kB
active_file:15024kB inactive_file:46596kB unevictable:0kB
writepending:0kB present:1149544kB managed:1149544kB mlocked:0kB
kernel_stack:0kB pagetables:712kB bounce:0kB free_pcp:2160kB
local_pcp:736kB free_cma:0kB
[   34.738563] lowmem_reserve[]: 0 0 0 0
[   34.742245] DMA: 23*4kB (U) 2*8kB (U) 3*16kB (U) 2*32kB (UE) 2*64kB
(U) 1*128kB (U) 1*256kB (E) 0*512kB 1*1024kB (E) 1*2048kB (E) 0*4096kB
= 3804kB
[   34.755479] Normal: 25*4kB (UM) 27*8kB (UME) 16*16kB (UME) 14*32kB
(UME) 7*64kB (UME) 2*128kB (UM) 1*256kB (E) 1*512kB (E) 0*1024kB
1*2048kB (M) 0*4096kB = 4540kB
[   34.770004] HighMem: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 1*64kB (M)
2*128kB (UM) 2*256kB (UM) 1*512kB (U) 1*1024kB (U) 1*2048kB (U)
256*4096kB (M) = 1053028kB
[   34.784010] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=4096kB
[   34.792466] 217507 total pagecache pages
[   34.796387] 0 pages in swap cache
[   34.799704] Swap cache stats: add 0, delete 0, find 0/0
[   34.804923] Free swap  = 0kB
[   34.807834] Total swap = 0kB
[   34.810738] 512559 pages RAM
[   34.813640] 287386 pages HighMem/MovableOnly
[   34.817931] 9873 pages reserved


- Naresh

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-05-20 19:09               ` Chris Down
@ 2020-05-21  9:22                 ` Naresh Kamboju
  2020-05-21  9:35                   ` Arnd Bergmann
  2020-05-21  9:55                 ` Michal Hocko
  1 sibling, 1 reply; 48+ messages in thread
From: Naresh Kamboju @ 2020-05-21  9:22 UTC (permalink / raw)
  To: Chris Down
  Cc: Yafang Shao, Michal Hocko, Anders Roxell, Linux F2FS DEV,
	Mailing List, linux-ext4, linux-block, Andrew Morton, open list,
	Linux-Next Mailing List, linux-mm, Arnd Bergmann, Andreas Dilger,
	Jaegeuk Kim, Theodore Ts'o, Chao Yu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Chao Yu, lkft-triage,
	Johannes Weiner, Roman Gushchin, Cgroups

On Thu, 21 May 2020 at 00:39, Chris Down <chris@chrisdown.name> wrote:
>
> Hi Naresh,
>
> Naresh Kamboju writes:
> >As a part of investigation on this issue LKFT teammate Anders Roxell
> >git bisected the problem and found bad commit(s) which caused this problem.
> >
> >The following two patches have been reverted on next-20200519 and retested the
> >reproducible steps and confirmed the test case mkfs -t ext4 got PASS.
> >( invoked oom-killer is gone now)
> >
> >Revert "mm, memcg: avoid stale protection values when cgroup is above
> >protection"
> >    This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
> >
> >Revert "mm, memcg: decouple e{low,min} state mutations from protection
> >checks"
> >    This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
>
> Thanks Anders and Naresh for tracking this down and reverting.
>
> I'll take a look tomorrow. I don't see anything immediately obviously wrong in
> either of those commits from a (very) cursory glance, but they should only be
> taking effect if protections are set.
>
> Since you have i386 hardware available, and I don't, could you please apply
> only "avoid stale protection" again and check if it only happens with that
> commit, or requires both? That would help narrow down the suspects.

Not both.
The bad commit is
"mm, memcg: decouple e{low,min} state mutations from protection checks"

>
> Do you use any memcg protections in these tests?
I see three MEMCG config options enabled; please see the kernel config
link below for more details.

CONFIG_MEMCG=y
CONFIG_MEMCG_SWAP=y
CONFIG_MEMCG_KMEM=y

Kernel config link:
https://builds.tuxbuild.com/8lg6WQibcwtQRRtIa0bcFA/kernel.config

- Naresh

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-05-21  9:22                 ` Naresh Kamboju
@ 2020-05-21  9:35                   ` Arnd Bergmann
  0 siblings, 0 replies; 48+ messages in thread
From: Arnd Bergmann @ 2020-05-21  9:35 UTC (permalink / raw)
  To: Naresh Kamboju
  Cc: Chris Down, Yafang Shao, Michal Hocko, Anders Roxell,
	Linux F2FS DEV, Mailing List, linux-ext4, linux-block,
	Andrew Morton, open list, Linux-Next Mailing List, linux-mm,
	Andreas Dilger, Jaegeuk Kim, Theodore Ts'o, Chao Yu,
	Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Chao Yu,
	lkft-triage, Johannes Weiner, Roman Gushchin, Cgroups

On Thu, May 21, 2020 at 11:22 AM Naresh Kamboju
<naresh.kamboju@linaro.org> wrote:
> On Thu, 21 May 2020 at 00:39, Chris Down <chris@chrisdown.name> wrote:
> > Since you have i386 hardware available, and I don't, could you please apply
> > only "avoid stale protection" again and check if it only happens with that
> > commit, or requires both? That would help narrow down the suspects.

Note that Naresh is running an i386 kernel on regular 64-bit hardware that
most people have access to.

> kernel config link,
> https://builds.tuxbuild.com/8lg6WQibcwtQRRtIa0bcFA/kernel.config

Do you know if the same bug shows up running a kernel with that
configuration in qemu? I would expect it to, and that would make
it much easier to reproduce.

I would also not be surprised if it happens on all architectures but only
shows up on the 32-bit arm and x86 machines first because they have
a rather limited amount of lowmem. Maybe booting a 64-bit kernel
with "mem=512M" and then running "dd if=/dev/sda of=/dev/null bs=1M"
will also trigger it. I did not attempt to run this myself.

       Arnd

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-05-21  8:58                 ` Naresh Kamboju
@ 2020-05-21  9:47                   ` Yafang Shao
  0 siblings, 0 replies; 48+ messages in thread
From: Yafang Shao @ 2020-05-21  9:47 UTC (permalink / raw)
  To: Naresh Kamboju
  Cc: Chris Down, Michal Hocko, Anders Roxell, Linux F2FS DEV,
	Mailing List, linux-ext4, linux-block, Andrew Morton, open list,
	Linux-Next Mailing List, linux-mm, Arnd Bergmann, Andreas Dilger,
	Jaegeuk Kim, Theodore Ts'o, Chao Yu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Chao Yu, lkft-triage,
	Johannes Weiner, Roman Gushchin, Cgroups

On Thu, May 21, 2020 at 4:59 PM Naresh Kamboju
<naresh.kamboju@linaro.org> wrote:
>
> On Thu, 21 May 2020 at 08:10, Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Thu, May 21, 2020 at 2:00 AM Naresh Kamboju
> > <naresh.kamboju@linaro.org> wrote:
> > >
> > > On Wed, 20 May 2020 at 17:26, Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
> > > >
> > > >
> > > > This issue is specific on 32-bit architectures i386 and arm on linux-next tree.
> > > > As per the test results history this problem started happening from
> > > > mkfs -t ext4 /dev/disk/by-id/ata-SanDisk_SSD_PLUS_120GB_190804A00BE5
> > > >
> > > >
> > > > Problem:
> > > > [   38.802375] dd invoked oom-killer: gfp_mask=0x100cc0(GFP_USER),
> > > > order=0, oom_score_adj=0
> > >
> > My guess is that we made the same mistake in commit "mm, memcg:
> > decouple e{low,min} state mutations from protection
> > checks" that it read a stale memcg protection in
> > mem_cgroup_below_low() and mem_cgroup_below_min().
> >
> > Below is a possible fix:
>
> Sorry. The proposed fix did not work.
> I have took your patch and applied on top of linux-next master branch and
> tested and mkfs -t ext4 invoked oom-killer.
>
> After patch applied test log link,
> https://lkft.validation.linaro.org/scheduler/job/1443936#L1168
>
>
> test  log,
> + mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
> mke2fs 1.43.8 (1-Jan-2018)
> Creating filesystem with 244190646 4k blocks and 61054976 inodes
> Filesystem UUID: ab107250-bf18-4357-a06a-67f2bfcc1048
> Superblock backups stored on blocks:
> 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
> 102400000, 214990848
> Allocating group tables:    0/7453                           done
> Writing inode tables:    0/7453                           done
> Creating journal (262144 blocks): [   34.423940] mkfs.ext4 invoked
> oom-killer: gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0,
> oom_score_adj=0
> [   34.433694] CPU: 0 PID: 402 Comm: mkfs.ext4 Not tainted
> 5.7.0-rc6-next-20200519+ #1
> [   34.441342] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> 2.2 05/23/2018
> [   34.448734] Call Trace:
> [   34.451196]  dump_stack+0x54/0x76
> [   34.454517]  dump_header+0x40/0x1f0
> [   34.458008]  ? oom_badness+0x1f/0x120
> [   34.461673]  ? ___ratelimit+0x6c/0xe0
> [   34.465332]  oom_kill_process+0xc9/0x110
> [   34.469255]  out_of_memory+0xd7/0x2f0
> [   34.472916]  __alloc_pages_nodemask+0xdd1/0xe90
> [   34.477446]  ? set_bh_page+0x33/0x50
> [   34.481016]  ? __xa_set_mark+0x4d/0x70
> [   34.484762]  pagecache_get_page+0xbe/0x250
> [   34.488859]  grab_cache_page_write_begin+0x1a/0x30
> [   34.493645]  block_write_begin+0x25/0x90
> [   34.497569]  blkdev_write_begin+0x1e/0x20
> [   34.501574]  ? bdev_evict_inode+0xc0/0xc0
> [   34.505578]  generic_perform_write+0x95/0x190
> [   34.509927]  __generic_file_write_iter+0xe0/0x1a0
> [   34.514626]  blkdev_write_iter+0xbf/0x1c0
> [   34.518630]  __vfs_write+0x122/0x1e0
> [   34.522200]  vfs_write+0x8f/0x1b0
> [   34.525510]  ksys_pwrite64+0x60/0x80
> [   34.529081]  __ia32_sys_ia32_pwrite64+0x16/0x20
> [   34.533604]  do_fast_syscall_32+0x66/0x240
> [   34.537697]  entry_SYSENTER_32+0xa5/0xf8
> [   34.541613] EIP: 0xb7f3c549
> [   34.544403] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01
> 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f
> 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90
> 8d 76
> [   34.563140] EAX: ffffffda EBX: 00000003 ECX: b7830010 EDX: 00400000
> [   34.569397] ESI: 38400000 EDI: 00000074 EBP: 07438400 ESP: bff1e650
> [   34.575654] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000246
> [   34.582453] Mem-Info:
> [   34.584732] active_anon:5713 inactive_anon:2169 isolated_anon:0
> [   34.584732]  active_file:4040 inactive_file:211204 isolated_file:0
> [   34.584732]  unevictable:0 dirty:17270 writeback:6240 unstable:0
> [   34.584732]  slab_reclaimable:5856 slab_unreclaimable:3439
> [   34.584732]  mapped:6192 shmem:2258 pagetables:178 bounce:0
> [   34.584732]  free:265105 free_pcp:1330 free_cma:0
> [   34.618483] Node 0 active_anon:22852kB inactive_anon:8676kB
> active_file:16160kB inactive_file:844816kB unevictable:0kB
> isolated(anon):0kB isolated(file):0kB mapped:24768kB dirty:69080kB
> writeback:19628kB shmem:9032kB writeback_tmp:0kB unstable:0kB
> all_unreclaimable? yes
> [   34.642354] DMA free:3588kB min:68kB low:84kB high:100kB
> reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
> active_file:0kB inactive_file:11848kB unevictable:0kB
> writepending:11856kB present:15964kB managed:15876kB mlocked:0kB
> kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB
> free_cma:0kB
> [   34.670194] lowmem_reserve[]: 0 824 1947 824
> [   34.674483] Normal free:4228kB min:3636kB low:4544kB high:5452kB
> reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
> active_file:1136kB inactive_file:786456kB unevictable:0kB
> writepending:68084kB present:884728kB managed:845324kB mlocked:0kB
> kernel_stack:1104kB pagetables:0kB bounce:0kB free_pcp:3056kB
> local_pcp:388kB free_cma:0kB
> [   34.704243] lowmem_reserve[]: 0 0 8980 0
> [   34.708189] HighMem free:1053028kB min:512kB low:1748kB high:2984kB
> reserved_highatomic:0KB active_anon:22852kB inactive_anon:8676kB
> active_file:15024kB inactive_file:46596kB unevictable:0kB
> writepending:0kB present:1149544kB managed:1149544kB mlocked:0kB
> kernel_stack:0kB pagetables:712kB bounce:0kB free_pcp:2160kB
> local_pcp:736kB free_cma:0kB
> [   34.738563] lowmem_reserve[]: 0 0 0 0
> [   34.742245] DMA: 23*4kB (U) 2*8kB (U) 3*16kB (U) 2*32kB (UE) 2*64kB
> (U) 1*128kB (U) 1*256kB (E) 0*512kB 1*1024kB (E) 1*2048kB (E) 0*4096kB
> = 3804kB
> [   34.755479] Normal: 25*4kB (UM) 27*8kB (UME) 16*16kB (UME) 14*32kB
> (UME) 7*64kB (UME) 2*128kB (UM) 1*256kB (E) 1*512kB (E) 0*1024kB
> 1*2048kB (M) 0*4096kB = 4540kB
> [   34.770004] HighMem: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 1*64kB (M)
> 2*128kB (UM) 2*256kB (UM) 1*512kB (U) 1*1024kB (U) 1*2048kB (U)
> 256*4096kB (M) = 1053028kB
> [   34.784010] Node 0 hugepages_total=0 hugepages_free=0
> hugepages_surp=0 hugepages_size=4096kB
> [   34.792466] 217507 total pagecache pages
> [   34.796387] 0 pages in swap cache
> [   34.799704] Swap cache stats: add 0, delete 0, find 0/0
> [   34.804923] Free swap  = 0kB
> [   34.807834] Total swap = 0kB
> [   34.810738] 512559 pages RAM
> [   34.813640] 287386 pages HighMem/MovableOnly
> [   34.817931] 9873 pages reserved
>
>
> - Naresh


Thanks for your work.
I just noticed that this is a system OOM rather than a memcg OOM,
while my patch only addresses the memcg OOM case.

As you have verified, this OOM is caused only by commit "mm, memcg:
decouple e{low,min} state mutations from protection checks".
That commit does introduce the problem of using a stale protection
value, but I haven't yet worked out why it triggers here. The issue can
occur only when memcg {min, low} protection is set, but
unfortunately memcg {min, low} isn't shown in the OOM log.

I would appreciate it if you could check the memcg {min, low} protection
settings. If they are set, I think the workaround below can avoid this
issue.

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 474815a..f6f794a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6380,6 +6380,9 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
        if (mem_cgroup_disabled())
                return;

+       memcg->memory.elow = 0;
+       memcg->memory.emin = 0;
+
        if (!root)
                root = root_mem_cgroup;

But I think the right thing to do now is to revert the bad commit.
The usage of memory.{emin, elow} is very subtle, and we shouldn't
read them here and there at the risk of picking up a stale value.
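
The stale-value concern behind the workaround can be sketched with a small
userspace model. Everything here is an assumption made for illustration:
struct memcg_model, the early_exit flag (standing in for any early return
path in the calculation), and the simplified min(low, usage) formula are not
the kernel's actual implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memcg's low protection state (in pages). */
struct memcg_model {
	unsigned long low;	/* configured protection */
	unsigned long elow;	/* effective protection, cached between passes */
	unsigned long usage;
};

/* Simplified effective protection: the smaller of low and usage. */
static unsigned long effective_low(const struct memcg_model *memcg)
{
	return memcg->low < memcg->usage ? memcg->low : memcg->usage;
}

/* Without a reset, an early return leaves the previous elow in place. */
void calc_protection_no_reset(struct memcg_model *memcg, bool early_exit)
{
	assert(memcg != NULL);
	if (early_exit)
		return;
	memcg->elow = effective_low(memcg);
}

/* With the workaround, elow is cleared before any early return. */
void calc_protection_with_reset(struct memcg_model *memcg, bool early_exit)
{
	assert(memcg != NULL);
	memcg->elow = 0;
	if (early_exit)
		return;
	memcg->elow = effective_low(memcg);
}
```

In this model, a group whose elow was set to 64 on an earlier pass keeps that
value after calc_protection_no_reset(&m, true), whereas
calc_protection_with_reset(&m, true) leaves it at 0, so a later reader cannot
observe the old value.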

-- 
Thanks
Yafang

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-05-20 19:09               ` Chris Down
  2020-05-21  9:22                 ` Naresh Kamboju
@ 2020-05-21  9:55                 ` Michal Hocko
  2020-05-21 10:41                   ` Naresh Kamboju
  2020-05-21 16:34                   ` Michal Hocko
  1 sibling, 2 replies; 48+ messages in thread
From: Michal Hocko @ 2020-05-21  9:55 UTC (permalink / raw)
  To: Naresh Kamboju, Chris Down
  Cc: Yafang Shao, Anders Roxell, Linux F2FS DEV, Mailing List,
	linux-ext4, linux-block, Andrew Morton, open list,
	Linux-Next Mailing List, linux-mm, Arnd Bergmann, Andreas Dilger,
	Jaegeuk Kim, Theodore Ts'o, Chao Yu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Chao Yu, lkft-triage,
	Johannes Weiner, Roman Gushchin, cgroups

On Wed 20-05-20 20:09:06, Chris Down wrote:
> Hi Naresh,
> 
> Naresh Kamboju writes:
> > As a part of investigation on this issue LKFT teammate Anders Roxell
> > git bisected the problem and found bad commit(s) which caused this problem.
> > 
> > The following two patches have been reverted on next-20200519 and retested the
> > reproducible steps and confirmed the test case mkfs -t ext4 got PASS.
> > ( invoked oom-killer is gone now)
> > 
> > Revert "mm, memcg: avoid stale protection values when cgroup is above
> > protection"
> >    This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
> > 
> > Revert "mm, memcg: decouple e{low,min} state mutations from protection
> > checks"
> >    This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
> 
> Thanks Anders and Naresh for tracking this down and reverting.
> 
> I'll take a look tomorrow. I don't see anything immediately obviously wrong
> in either of those commits from a (very) cursory glance, but they should
> only be taking effect if protections are set.

Agreed. If memory.{low,min} is not used then the patch should be
effectively a nop. Btw. do you see the problem when booting with
cgroup_disable=memory kernel command line parameter?

I suspect that something might be initialized for memcg incorrectly and
the patch just makes it more visible for some reason.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-05-21  9:55                 ` Michal Hocko
@ 2020-05-21 10:41                   ` Naresh Kamboju
  2020-05-21 10:58                     ` Michal Hocko
  2020-05-21 16:34                   ` Michal Hocko
  1 sibling, 1 reply; 48+ messages in thread
From: Naresh Kamboju @ 2020-05-21 10:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Chris Down, Yafang Shao, Anders Roxell, Linux F2FS DEV,
	Mailing List, linux-ext4, linux-block, Andrew Morton, open list,
	Linux-Next Mailing List, linux-mm, Arnd Bergmann, Andreas Dilger,
	Jaegeuk Kim, Theodore Ts'o, Chao Yu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Chao Yu, lkft-triage,
	Johannes Weiner, Roman Gushchin, Cgroups

On Thu, 21 May 2020 at 15:25, Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 20-05-20 20:09:06, Chris Down wrote:
> > Hi Naresh,
> >
> > Naresh Kamboju writes:
> > > As a part of investigation on this issue LKFT teammate Anders Roxell
> > > git bisected the problem and found bad commit(s) which caused this problem.
> > >
> > > The following two patches have been reverted on next-20200519 and retested the
> > > reproducible steps and confirmed the test case mkfs -t ext4 got PASS.
> > > ( invoked oom-killer is gone now)
> > >
> > > Revert "mm, memcg: avoid stale protection values when cgroup is above
> > > protection"
> > >    This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
> > >
> > > Revert "mm, memcg: decouple e{low,min} state mutations from protection
> > > checks"
> > >    This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
> >
> > Thanks Anders and Naresh for tracking this down and reverting.
> >
> > I'll take a look tomorrow. I don't see anything immediately obviously wrong
> > in either of those commits from a (very) cursory glance, but they should
> > only be taking effect if protections are set.
>
> Agreed. If memory.{low,min} is not used then the patch should be
> effectively a nop. Btw. do you see the problem when booting with
> cgroup_disable=memory kernel command line parameter?

With the extra kernel command line parameter cgroup_disable=memory,
I now see a different problem.

+ mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
mke2fs 1.43.8 (1-Jan-2018)
Creating filesystem with 244190646 4k blocks and 61054976 inodes
Filesystem UUID: 3bb1a285-2cb4-44b4-b6e8-62548f3ac620
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000, 214990848
Allocating group tables:    0/7453                           done
Writing inode tables:    0/7453                           done
Creating journal (262144 blocks): [   35.502102] BUG: kernel NULL
pointer dereference, address: 000000c8
[   35.508372] #PF: supervisor read access in kernel mode
[   35.513506] #PF: error_code(0x0000) - not-present page
[   35.518638] *pde = 00000000
[   35.521514] Oops: 0000 [#1] SMP
[   35.524652] CPU: 0 PID: 145 Comm: kswapd0 Not tainted
5.7.0-rc6-next-20200519+ #1
[   35.532121] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
2.2 05/23/2018
[   35.539507] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
[   35.544724] Code: 00 00 80 3d 84 b5 e1 cb 00 89 c2 a1 9c a5 f5 cb
75 48 55 89 e5 57 56 53 3e 8d 74 26 00 8b 1d 88 b5 e1 cb 31 f6 eb 27
8d 76 00 <8b> 8a c8 00 00 00 8b ba bc 00 00 00 29 f9 39 c8 0f 4f c1 8b
8a 98
[   35.563461] EAX: 00000000 EBX: f5411000 ECX: 00000000 EDX: 00000000
[   35.569718] ESI: 00000000 EDI: f4e13ea8 EBP: f4e13e10 ESP: f4e13e04
[   35.575976] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010207
[   35.582751] CR0: 80050033 CR2: 000000c8 CR3: 0bef4000 CR4: 003406d0
[   35.589010] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[   35.595266] DR6: fffe0ff0 DR7: 00000400
[   35.599096] Call Trace:
[   35.601544]  shrink_lruvec+0x447/0x630
[   35.605294]  ? newidle_balance.isra.100+0x8e/0x3f0
[   35.610080]  ? pick_next_task_fair+0x3a/0x320
[   35.614437]  ? deactivate_task+0xcf/0x100
[   35.618442]  ? put_prev_entity+0x1a/0xd0
[   35.622359]  ? deactivate_task+0xcf/0x100
[   35.626363]  shrink_node+0x1be/0x640
[   35.629932]  ? shrink_node+0x1be/0x640
[   35.633676]  kswapd+0x32c/0x890
[   35.636815]  ? deactivate_task+0xcf/0x100
[   35.640820]  kthread+0xf1/0x110
[   35.643963]  ? do_try_to_free_pages+0x3b0/0x3b0
[   35.648489]  ? kthread_park+0xa0/0xa0
[   35.652147]  ret_from_fork+0x1c/0x28
[   35.655726] Modules linked in: x86_pkg_temp_thermal
[   35.660605] CR2: 00000000000000c8
[   35.663916] ---[ end trace d85b8564ea55fb0d ]---
[   35.663917] BUG: kernel NULL pointer dereference, address: 000000c8
[   35.663918] #PF: supervisor read access in kernel mode
[   35.668534] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
[   35.674792] #PF: error_code(0x0000) - not-present page
[   35.674792] *pde = 00000000
[   35.679921] Code: 00 00 80 3d 84 b5 e1 cb 00 89 c2 a1 9c a5 f5 cb
75 48 55 89 e5 57 56 53 3e 8d 74 26 00 8b 1d 88 b5 e1 cb 31 f6 eb 27
8d 76 00 <8b> 8a c8 00 00 00 8b ba bc 00 00 00 29 f9 39 c8 0f 4f c1 8b
8a 98
[   35.685140] Oops: 0000 [#2] SMP
[   35.685142] CPU: 2 PID: 391 Comm: mkfs.ext4 Tainted: G      D
    5.7.0-rc6-next-20200519+ #1
[   35.690278] EAX: 00000000 EBX: f5411000 ECX: 00000000 EDX: 00000000
[   35.690279] ESI: 00000000 EDI: f4e13ea8 EBP: f4e13e10 ESP: f4e13e04
[   35.693155] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
2.2 05/23/2018
[   35.693158] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
[   35.711893] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010207
[   35.711894] CR0: 80050033 CR2: 000000c8 CR3: 0bef4000 CR4: 003406d0
[   35.715031] Code: 00 00 80 3d 84 b5 e1 cb 00 89 c2 a1 9c a5 f5 cb
75 48 55 89 e5 57 56 53 3e 8d 74 26 00 8b 1d 88 b5 e1 cb 31 f6 eb 27
8d 76 00 <8b> 8a c8 00 00 00 8b ba bc 00 00 00 29 f9 39 c8 0f 4f c1 8b
8a 98
[   35.724061] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[   35.730317] EAX: 00000000 EBX: f5411000 ECX: 00000000 EDX: 00000000
[   35.730318] ESI: 00000000 EDI: f2d73c14 EBP: f2d73b78 ESP: f2d73b6c
[   35.736576] DR6: fffe0ff0 DR7: 00000400
[   35.803603] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010207
[   35.810380] CR0: 80050033 CR2: 000000c8 CR3: 33241000 CR4: 003406d0
[   35.816636] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[   35.822893] DR6: fffe0ff0 DR7: 00000400
[   35.826725] Call Trace:
[   35.829171]  shrink_lruvec+0x447/0x630
[   35.832921]  ? check_preempt_curr+0x75/0x80
[   35.837100]  shrink_node+0x1be/0x640
[   35.840670]  ? shrink_node+0x1be/0x640
[   35.844412]  do_try_to_free_pages+0xc1/0x3b0
[   35.848677]  try_to_free_pages+0xba/0x1d0
[   35.852683]  __alloc_pages_nodemask+0x573/0xe90
[   35.857232]  ? set_bh_page+0x33/0x50
[   35.860829]  ? xas_load+0xf/0x70
[   35.864050]  ? __xa_set_mark+0x4d/0x70
[   35.867795]  ? find_get_entry+0x47/0x110
[   35.871714]  pagecache_get_page+0xbe/0x250
[   35.875805]  grab_cache_page_write_begin+0x1a/0x30
[   35.880588]  block_write_begin+0x25/0x90
[   35.884504]  blkdev_write_begin+0x1e/0x20
[   35.888507]  ? bdev_evict_inode+0xc0/0xc0
[   35.892513]  generic_perform_write+0x95/0x190
[   35.896863]  __generic_file_write_iter+0xe0/0x1a0
[   35.901562]  blkdev_write_iter+0xbf/0x1c0
[   35.905564]  __vfs_write+0x122/0x1e0
[   35.909136]  vfs_write+0x8f/0x1b0
[   35.912454]  ksys_pwrite64+0x60/0x80
[   35.916024]  __ia32_sys_ia32_pwrite64+0x16/0x20
[   35.920549]  do_fast_syscall_32+0x66/0x240
[   35.924641]  entry_SYSENTER_32+0xa5/0xf8
[   35.928567] EIP: 0xb7f72549
[   35.931357] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01
10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f
34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90
8d 76
[   35.950093] EAX: ffffffda EBX: 00000003 ECX: b7866010 EDX: 00400000
[   35.956351] ESI: 39000000 EDI: 00000074 EBP: 07439000 ESP: bf973700
[   35.962607] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000246
[   35.969384] Modules linked in: x86_pkg_temp_thermal
[   35.974269] CR2: 00000000000000c8
[   35.977582] ---[ end trace d85b8564ea55fb0e ]---
[   35.977583] BUG: kernel NULL pointer dereference, address: 000000c8
[   35.977584] #PF: supervisor read access in kernel mode
[   35.982193] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
[   35.982195] Code: 00 00 80 3d 84 b5 e1 cb 00 89 c2 a1 9c a5 f5 cb
75 48 55 89 e5 57 56 53 3e 8d 74 26 00 8b 1d 88 b5 e1 cb 31 f6 eb 27
8d 76 00 <8b> 8a c8 00 00 00 8b ba bc 00 00 00 29 f9 39 c8 0f 4f c1 8b
8a 98
[   35.988450] #PF: error_code(0x0000) - not-present page
[   35.988451] *pde = 00000000
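
For anyone decoding the oops above: the faulting bytes <8b> 8a c8 00 00 00
are a 32-bit load from 0xc8(%edx), and EDX is 00000000 in the register
dump, so the reported address 000000c8 is consistent with reading a
member at offset 0xc8 through a NULL pointer. A small sketch of that
arithmetic follows. struct fake_memcg, its padding and fault_address()
are made up for illustration; only the 0xc8 offset comes from the log,
and nothing here describes the real struct mem_cgroup layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: only the 0xc8 offset is taken from the oops. */
struct fake_memcg {
	char pad[0xc8];     /* stand-in for whatever precedes the member */
	long field_at_c8;   /* the member the faulting load was reading */
};

/* Compile-time check that the model really places the member at 0xc8. */
static_assert(offsetof(struct fake_memcg, field_at_c8) == 0xc8,
	      "model member must sit at offset 0xc8");

/* Address touched when field_at_c8 is read through the given base. */
static uintptr_t fault_address(uintptr_t base)
{
	return base + offsetof(struct fake_memcg, field_at_c8);
}
```

With a base of 0 the helper yields 0xc8, the value shown in CR2.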

Full test log link:
https://lkft.validation.linaro.org/scheduler/job/1443939#L1170

- Naresh

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-05-21 10:41                   ` Naresh Kamboju
@ 2020-05-21 10:58                     ` Michal Hocko
  2020-05-21 12:24                       ` Hugh Dickins
  0 siblings, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2020-05-21 10:58 UTC (permalink / raw)
  To: Naresh Kamboju
  Cc: Chris Down, Yafang Shao, Anders Roxell, Linux F2FS DEV,
	Mailing List, linux-ext4, linux-block, Andrew Morton, open list,
	Linux-Next Mailing List, linux-mm, Arnd Bergmann, Andreas Dilger,
	Jaegeuk Kim, Theodore Ts'o, Chao Yu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Chao Yu, lkft-triage,
	Johannes Weiner, Roman Gushchin, Cgroups

On Thu 21-05-20 16:11:11, Naresh Kamboju wrote:
> On Thu, 21 May 2020 at 15:25, Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Wed 20-05-20 20:09:06, Chris Down wrote:
> > > Hi Naresh,
> > >
> > > Naresh Kamboju writes:
> > > > As a part of investigation on this issue LKFT teammate Anders Roxell
> > > > git bisected the problem and found bad commit(s) which caused this problem.
> > > >
> > > > The following two patches have been reverted on next-20200519 and retested the
> > > > reproducible steps and confirmed the test case mkfs -t ext4 got PASS.
> > > > ( invoked oom-killer is gone now)
> > > >
> > > > Revert "mm, memcg: avoid stale protection values when cgroup is above
> > > > protection"
> > > >    This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
> > > >
> > > > Revert "mm, memcg: decouple e{low,min} state mutations from protection
> > > > checks"
> > > >    This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
> > >
> > > Thanks Anders and Naresh for tracking this down and reverting.
> > >
> > > I'll take a look tomorrow. I don't see anything immediately obviously wrong
> > > in either of those commits from a (very) cursory glance, but they should
> > > only be taking effect if protections are set.
> >
> > Agreed. If memory.{low,min} is not used then the patch should be
> > effectively a nop. Btw. do you see the problem when booting with
> > cgroup_disable=memory kernel command line parameter?
> 
> With the extra kernel command line parameter cgroup_disable=memory,
> I now see a different problem.
> 
> + mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
> mke2fs 1.43.8 (1-Jan-2018)
> Creating filesystem with 244190646 4k blocks and 61054976 inodes
> Filesystem UUID: 3bb1a285-2cb4-44b4-b6e8-62548f3ac620
> Superblock backups stored on blocks:
> 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
> 102400000, 214990848
> Allocating group tables:    0/7453                           done
> Writing inode tables:    0/7453                           done
> Creating journal (262144 blocks): [   35.502102] BUG: kernel NULL
> pointer dereference, address: 000000c8
> [   35.508372] #PF: supervisor read access in kernel mode
> [   35.513506] #PF: error_code(0x0000) - not-present page
> [   35.518638] *pde = 00000000
> [   35.521514] Oops: 0000 [#1] SMP
> [   35.524652] CPU: 0 PID: 145 Comm: kswapd0 Not tainted
> 5.7.0-rc6-next-20200519+ #1
> [   35.532121] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> 2.2 05/23/2018
> [   35.539507] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60

Could you get faddr2line for this offset?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-05-21 10:58                     ` Michal Hocko
@ 2020-05-21 12:24                       ` Hugh Dickins
  2020-05-21 12:44                         ` Michal Hocko
  0 siblings, 1 reply; 48+ messages in thread
From: Hugh Dickins @ 2020-05-21 12:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Naresh Kamboju, Chris Down, Yafang Shao, Anders Roxell,
	Linux F2FS DEV, Mailing List, linux-ext4, linux-block,
	Andrew Morton, open list, Linux-Next Mailing List, linux-mm,
	Arnd Bergmann, Andreas Dilger, Jaegeuk Kim, Theodore Ts'o,
	Chao Yu, Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, Chao Yu,
	lkft-triage, Johannes Weiner, Roman Gushchin, Cgroups

On Thu, 21 May 2020, Michal Hocko wrote:
> On Thu 21-05-20 16:11:11, Naresh Kamboju wrote:
> > On Thu, 21 May 2020 at 15:25, Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Wed 20-05-20 20:09:06, Chris Down wrote:
> > > > Hi Naresh,
> > > >
> > > > Naresh Kamboju writes:
> > > > > As a part of investigation on this issue LKFT teammate Anders Roxell
> > > > > git bisected the problem and found bad commit(s) which caused this problem.
> > > > >
> > > > > The following two patches have been reverted on next-20200519 and retested the
> > > > > reproducible steps and confirmed the test case mkfs -t ext4 got PASS.
> > > > > ( invoked oom-killer is gone now)
> > > > >
> > > > > Revert "mm, memcg: avoid stale protection values when cgroup is above
> > > > > protection"
> > > > >    This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
> > > > >
> > > > > Revert "mm, memcg: decouple e{low,min} state mutations from protection
> > > > > checks"
> > > > >    This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
> > > >
> > > > Thanks Anders and Naresh for tracking this down and reverting.
> > > >
> > > > I'll take a look tomorrow. I don't see anything immediately obviously wrong
> > > > in either of those commits from a (very) cursory glance, but they should
> > > > only be taking effect if protections are set.
> > >
> > > Agreed. If memory.{low,min} is not used then the patch should be
> > > effectively a nop. Btw. do you see the problem when booting with
> > > cgroup_disable=memory kernel command line parameter?
> > 
> > With the extra kernel command line parameter cgroup_disable=memory,
> > I now see a different problem.
> > 
> > + mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
> > mke2fs 1.43.8 (1-Jan-2018)
> > Creating filesystem with 244190646 4k blocks and 61054976 inodes
> > Filesystem UUID: 3bb1a285-2cb4-44b4-b6e8-62548f3ac620
> > Superblock backups stored on blocks:
> > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> > 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
> > 102400000, 214990848
> > Allocating group tables:    0/7453                           done
> > Writing inode tables:    0/7453                           done
> > Creating journal (262144 blocks): [   35.502102] BUG: kernel NULL
> > pointer dereference, address: 000000c8
> > [   35.508372] #PF: supervisor read access in kernel mode
> > [   35.513506] #PF: error_code(0x0000) - not-present page
> > [   35.518638] *pde = 00000000
> > [   35.521514] Oops: 0000 [#1] SMP
> > [   35.524652] CPU: 0 PID: 145 Comm: kswapd0 Not tainted
> > 5.7.0-rc6-next-20200519+ #1
> > [   35.532121] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> > 2.2 05/23/2018
> > [   35.539507] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
> 
> Could you get faddr2line for this offset?

No need for that, I can help with the "cgroup_disable=memory" crash:
I've been happily running with the fixup below, but haven't got around to
sending it in yet (and wouldn't normally be reading mail at this time!)
because I've been busy chasing a couple of other bugs (not necessarily mm).
Maybe the fix would be better with an explicit mem_cgroup_disabled()
test, or maybe that should be where cgroup_memory_noswap is decided -
up to Johannes.

---

 mm/memcontrol.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- 5.7-rc6-mm1/mm/memcontrol.c	2020-05-20 12:21:56.109693740 -0700
+++ linux/mm/memcontrol.c	2020-05-20 12:26:15.500478753 -0700
@@ -6954,7 +6954,8 @@ long mem_cgroup_get_nr_swap_pages(struct
 {
 	long nr_swap_pages = get_nr_swap_pages();
 
-	if (cgroup_memory_noswap || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
+	if (!memcg || cgroup_memory_noswap ||
+            !cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return nr_swap_pages;
 	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg))
 		nr_swap_pages = min_t(long, nr_swap_pages,

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-05-21 12:24                       ` Hugh Dickins
@ 2020-05-21 12:44                         ` Michal Hocko
  2020-05-21 19:17                           ` Johannes Weiner
  0 siblings, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2020-05-21 12:44 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Naresh Kamboju, Chris Down, Yafang Shao, Anders Roxell,
	Linux F2FS DEV, Mailing List, linux-ext4, linux-block,
	Andrew Morton, open list, Linux-Next Mailing List, linux-mm,
	Arnd Bergmann, Andreas Dilger, Jaegeuk Kim, Theodore Ts'o,
	Chao Yu, Andrea Arcangeli, Matthew Wilcox, Chao Yu, lkft-triage,
	Johannes Weiner, Roman Gushchin, Cgroups

On Thu 21-05-20 05:24:27, Hugh Dickins wrote:
> On Thu, 21 May 2020, Michal Hocko wrote:
> > On Thu 21-05-20 16:11:11, Naresh Kamboju wrote:
> > > On Thu, 21 May 2020 at 15:25, Michal Hocko <mhocko@kernel.org> wrote:
> > > >
> > > > On Wed 20-05-20 20:09:06, Chris Down wrote:
> > > > > Hi Naresh,
> > > > >
> > > > > Naresh Kamboju writes:
> > > > > > As a part of investigation on this issue LKFT teammate Anders Roxell
> > > > > > git bisected the problem and found bad commit(s) which caused this problem.
> > > > > >
> > > > > > The following two patches have been reverted on next-20200519 and retested the
> > > > > > reproducible steps and confirmed the test case mkfs -t ext4 got PASS.
> > > > > > ( invoked oom-killer is gone now)
> > > > > >
> > > > > > Revert "mm, memcg: avoid stale protection values when cgroup is above
> > > > > > protection"
> > > > > >    This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
> > > > > >
> > > > > > Revert "mm, memcg: decouple e{low,min} state mutations from protection
> > > > > > checks"
> > > > > >    This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
> > > > >
> > > > > Thanks Anders and Naresh for tracking this down and reverting.
> > > > >
> > > > > I'll take a look tomorrow. I don't see anything immediately obviously wrong
> > > > > in either of those commits from a (very) cursory glance, but they should
> > > > > only be taking effect if protections are set.
> > > >
> > > > Agreed. If memory.{low,min} is not used then the patch should be
> > > > effectively a nop. Btw. do you see the problem when booting with
> > > > cgroup_disable=memory kernel command line parameter?
> > > 
> > > With the extra kernel command line parameter cgroup_disable=memory,
> > > I now see a different problem.
> > > 
> > > + mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
> > > mke2fs 1.43.8 (1-Jan-2018)
> > > Creating filesystem with 244190646 4k blocks and 61054976 inodes
> > > Filesystem UUID: 3bb1a285-2cb4-44b4-b6e8-62548f3ac620
> > > Superblock backups stored on blocks:
> > > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> > > 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
> > > 102400000, 214990848
> > > Allocating group tables:    0/7453                           done
> > > Writing inode tables:    0/7453                           done
> > > Creating journal (262144 blocks): [   35.502102] BUG: kernel NULL
> > > pointer dereference, address: 000000c8
> > > [   35.508372] #PF: supervisor read access in kernel mode
> > > [   35.513506] #PF: error_code(0x0000) - not-present page
> > > [   35.518638] *pde = 00000000
> > > [   35.521514] Oops: 0000 [#1] SMP
> > > [   35.524652] CPU: 0 PID: 145 Comm: kswapd0 Not tainted
> > > 5.7.0-rc6-next-20200519+ #1
> > > [   35.532121] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> > > 2.2 05/23/2018
> > > [   35.539507] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
> > 
> > Could you get faddr2line for this offset?
> 
> No need for that, I can help with the "cgroup_disable=memory" crash:
> I've been happily running with the fixup below, but haven't got around to
> sending it in yet (and wouldn't normally be reading mail at this time!)
> because I've been busy chasing a couple of other bugs (not necessarily mm).
> Maybe the fix would be better with an explicit mem_cgroup_disabled()
> test, or maybe that should be where cgroup_memory_noswap is decided -
> up to Johannes.

Thanks Hugh. I can see what the problem is now. I was looking at
Linus' tree, where we have different code:

	long nr_swap_pages = get_nr_swap_pages();

        if (!do_swap_account || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
                return nr_swap_pages;

which cannot crash, so I was really wondering what was going on
here. But there are other changes in mmotm which I haven't
reviewed yet. Looking at the next tree now, it is fallout from "mm:
memcontrol: prepare swap controller setup for integration".

The !memcg check is slightly more cryptic than an explicit
mem_cgroup_disabled() check, but I would leave that to Johannes as well.
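
To make the failure mode concrete, here is a simplified userspace model of
the patched function's control flow. It is only a sketch: struct
memcg_model, root_model and model_get_nr_swap_pages() are hypothetical
names, and the "limit minus usage" headroom inside the loop is my
assumption, since the min_t() arguments are cut off in the quoted hunk.
The cgroup_subsys_on_dfl() test is left out of the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct mem_cgroup. */
struct memcg_model {
	struct memcg_model *parent;
	long swap_max;     /* swap limit, in pages (assumed field) */
	long swap_usage;   /* swap currently charged, in pages (assumed field) */
};

static struct memcg_model root_model;   /* plays the role of root_mem_cgroup */

/*
 * Walk from memcg up to the root and clamp the global swap page count by
 * each level's remaining headroom.
 */
static long model_get_nr_swap_pages(const struct memcg_model *memcg,
				    long global_swap_pages, bool noswap)
{
	long nr = global_swap_pages;

	/* The !memcg test corresponds to the guard added by the fixup. */
	if (!memcg || noswap)
		return nr;

	for (; memcg != &root_model; memcg = memcg->parent) {
		assert(memcg != NULL);   /* the chain must end at root_model */
		long room = memcg->swap_max - memcg->swap_usage;
		if (room < nr)
			nr = room;
	}
	return nr;
}
```

Without the !memcg test, a NULL memcg is not equal to the root, so the
loop body runs and reads through the NULL pointer on its first pass.
With the test, the global count is returned unchanged.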

> 
> ---
> 
>  mm/memcontrol.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> --- 5.7-rc6-mm1/mm/memcontrol.c	2020-05-20 12:21:56.109693740 -0700
> +++ linux/mm/memcontrol.c	2020-05-20 12:26:15.500478753 -0700
> @@ -6954,7 +6954,8 @@ long mem_cgroup_get_nr_swap_pages(struct
>  {
>  	long nr_swap_pages = get_nr_swap_pages();
>  
> -	if (cgroup_memory_noswap || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
> +	if (!memcg || cgroup_memory_noswap ||
> +            !cgroup_subsys_on_dfl(memory_cgrp_subsys))
>  		return nr_swap_pages;
>  	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg))
>  		nr_swap_pages = min_t(long, nr_swap_pages,

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-05-21  9:55                 ` Michal Hocko
  2020-05-21 10:41                   ` Naresh Kamboju
@ 2020-05-21 16:34                   ` Michal Hocko
  2020-05-21 19:00                     ` Naresh Kamboju
  2020-06-17 13:37                     ` Naresh Kamboju
  1 sibling, 2 replies; 48+ messages in thread
From: Michal Hocko @ 2020-05-21 16:34 UTC (permalink / raw)
  To: Naresh Kamboju, Chris Down
  Cc: Yafang Shao, Anders Roxell, Linux F2FS DEV, Mailing List,
	linux-ext4, linux-block, Andrew Morton, open list,
	Linux-Next Mailing List, linux-mm, Arnd Bergmann, Andreas Dilger,
	Jaegeuk Kim, Theodore Ts'o, Chao Yu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Chao Yu, lkft-triage,
	Johannes Weiner, Roman Gushchin, cgroups

On Thu 21-05-20 11:55:16, Michal Hocko wrote:
> On Wed 20-05-20 20:09:06, Chris Down wrote:
> > Hi Naresh,
> > 
> > Naresh Kamboju writes:
> > > As a part of investigation on this issue LKFT teammate Anders Roxell
> > > git bisected the problem and found bad commit(s) which caused this problem.
> > > 
> > > The following two patches have been reverted on next-20200519 and retested the
> > > reproducible steps and confirmed the test case mkfs -t ext4 got PASS.
> > > ( invoked oom-killer is gone now)
> > > 
> > > Revert "mm, memcg: avoid stale protection values when cgroup is above
> > > protection"
> > >    This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
> > > 
> > > Revert "mm, memcg: decouple e{low,min} state mutations from protection
> > > checks"
> > >    This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
> > 
> > Thanks Anders and Naresh for tracking this down and reverting.
> > 
> > I'll take a look tomorrow. I don't see anything immediately obviously wrong
> > in either of those commits from a (very) cursory glance, but they should
> > only be taking effect if protections are set.
> 
> Agreed. If memory.{low,min} is not used then the patch should be
> effectively a nop.

I was staring into the code and did not see anything.  Could you give the
following debugging patch a try and see whether it triggers?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index cc555903a332..df2e8df0eb71 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2404,6 +2404,8 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 			 * sc->priority further than desirable.
 			 */
 			scan = max(scan, SWAP_CLUSTER_MAX);
+
+			trace_printk("scan:%lu protection:%lu\n", scan, protection);
 		} else {
 			scan = lruvec_size;
 		}
@@ -2648,6 +2650,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 		mem_cgroup_calculate_protection(target_memcg, memcg);
 
 		if (mem_cgroup_below_min(memcg)) {
+			trace_printk("under min:%lu emin:%lu\n", memcg->memory.min, memcg->memory.emin);
 			/*
 			 * Hard protection.
 			 * If there is no reclaimable memory, OOM.
@@ -2660,6 +2663,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 			 * there is an unprotected supply
 			 * of reclaimable memory from other cgroups.
 			 */
+			trace_printk("under low:%lu elow:%lu\n", memcg->memory.low, memcg->memory.elow);
 			if (!sc->memcg_low_reclaim) {
 				sc->memcg_low_skipped = 1;
 				continue;
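
For readers following along, the branches instrumented above can be
modelled in userspace roughly as below. This is a sketch, not the vmscan
code: enum reclaim_decision, struct scan_model and decide() are
hypothetical, the protection checks are reduced to boolean inputs, and
skipping the cgroup in the "below min" branch is my reading of the "Hard
protection" comment rather than something the hunk shows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum reclaim_decision { SKIP_HARD, SKIP_SOFT, RECLAIM };

/* Hypothetical subset of struct scan_control. */
struct scan_model {
	bool memcg_low_reclaim;   /* true once soft protection is overridden */
	bool memcg_low_skipped;   /* set when a cgroup was skipped for low */
};

/* Decide what reclaim does with one cgroup, given its protection state. */
static enum reclaim_decision decide(struct scan_model *sc,
				    bool below_min, bool below_low)
{
	assert(sc != NULL);

	if (below_min)
		return SKIP_HARD;            /* hard protection: never reclaimed */

	if (below_low && !sc->memcg_low_reclaim) {
		sc->memcg_low_skipped = true;    /* remember for a later retry */
		return SKIP_SOFT;
	}

	return RECLAIM;
}
```

If neither trace_printk in the two protection branches fires, then in
terms of this model every cgroup took the RECLAIM path.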
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-05-21 16:34                   ` Michal Hocko
@ 2020-05-21 19:00                     ` Naresh Kamboju
  2020-05-21 20:53                       ` Naresh Kamboju
  2020-06-17 13:37                     ` Naresh Kamboju
  1 sibling, 1 reply; 48+ messages in thread
From: Naresh Kamboju @ 2020-05-21 19:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Chris Down, Yafang Shao, Anders Roxell, Linux F2FS DEV,
	Mailing List, linux-ext4, linux-block, Andrew Morton, open list,
	Linux-Next Mailing List, linux-mm, Arnd Bergmann, Andreas Dilger,
	Jaegeuk Kim, Theodore Ts'o, Chao Yu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Chao Yu, lkft-triage,
	Johannes Weiner, Roman Gushchin, Cgroups

On Thu, 21 May 2020 at 22:04, Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 21-05-20 11:55:16, Michal Hocko wrote:
> > On Wed 20-05-20 20:09:06, Chris Down wrote:
> > > Hi Naresh,
> > >
> > > Naresh Kamboju writes:
> > > > As a part of investigation on this issue LKFT teammate Anders Roxell
> > > > git bisected the problem and found bad commit(s) which caused this problem.
> > > >
> > > > The following two patches have been reverted on next-20200519 and retested the
> > > > reproducible steps and confirmed the test case mkfs -t ext4 got PASS.
> > > > ( invoked oom-killer is gone now)
> > > >
> > > > Revert "mm, memcg: avoid stale protection values when cgroup is above
> > > > protection"
> > > >    This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
> > > >
> > > > Revert "mm, memcg: decouple e{low,min} state mutations from protection
> > > > checks"
> > > >    This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
> > >
> > > Thanks Anders and Naresh for tracking this down and reverting.
> > >
> > > I'll take a look tomorrow. I don't see anything immediately obviously wrong
> > > in either of those commits from a (very) cursory glance, but they should
> > > only be taking effect if protections are set.
> >
> > Agreed. If memory.{low,min} is not used then the patch should be
> > effectively a nop.
>
> I was staring into the code and did not see anything.  Could you give the
> following debugging patch a try and see whether it triggers?

It seems these code paths were not hit, but I still see the reported problem.
Please find the detailed test log output at [1], and one more test log
with cgroup_disable=memory at [2].

Test log links:
[1] https://pastebin.com/XJU7We1g
[2] https://pastebin.com/BZ0BMUVt

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-05-21 12:44                         ` Michal Hocko
@ 2020-05-21 19:17                           ` Johannes Weiner
  2020-05-21 20:06                             ` Hugh Dickins
  0 siblings, 1 reply; 48+ messages in thread
From: Johannes Weiner @ 2020-05-21 19:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Naresh Kamboju, Chris Down, Yafang Shao,
	Anders Roxell, Linux F2FS DEV, Mailing List, linux-ext4,
	linux-block, Andrew Morton, open list, Linux-Next Mailing List,
	linux-mm, Arnd Bergmann, Andreas Dilger, Jaegeuk Kim,
	Theodore Ts'o, Chao Yu, Andrea Arcangeli, Matthew Wilcox,
	Chao Yu, lkft-triage, Roman Gushchin, Cgroups

On Thu, May 21, 2020 at 02:44:44PM +0200, Michal Hocko wrote:
> On Thu 21-05-20 05:24:27, Hugh Dickins wrote:
> > On Thu, 21 May 2020, Michal Hocko wrote:
> > > On Thu 21-05-20 16:11:11, Naresh Kamboju wrote:
> > > > On Thu, 21 May 2020 at 15:25, Michal Hocko <mhocko@kernel.org> wrote:
> > > > >
> > > > > On Wed 20-05-20 20:09:06, Chris Down wrote:
> > > > > > Hi Naresh,
> > > > > >
> > > > > > Naresh Kamboju writes:
> > > > > > > As a part of investigation on this issue LKFT teammate Anders Roxell
> > > > > > > git bisected the problem and found bad commit(s) which caused this problem.
> > > > > > >
> > > > > > > The following two patches have been reverted on next-20200519 and retested the
> > > > > > > reproducible steps and confirmed the test case mkfs -t ext4 got PASS.
> > > > > > > ( invoked oom-killer is gone now)
> > > > > > >
> > > > > > > Revert "mm, memcg: avoid stale protection values when cgroup is above
> > > > > > > protection"
> > > > > > >    This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
> > > > > > >
> > > > > > > Revert "mm, memcg: decouple e{low,min} state mutations from protection
> > > > > > > checks"
> > > > > > >    This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
> > > > > >
> > > > > > Thanks Anders and Naresh for tracking this down and reverting.
> > > > > >
> > > > > > I'll take a look tomorrow. I don't see anything immediately obviously wrong
> > > > > > in either of those commits from a (very) cursory glance, but they should
> > > > > > only be taking effect if protections are set.
> > > > >
> > > > > Agreed. If memory.{low,min} is not used then the patch should be
> > > > > effectively a nop. Btw. do you see the problem when booting with
> > > > > cgroup_disable=memory kernel command line parameter?
> > > > 
> > > > With extra kernel command line parameters, cgroup_disable=memory
> > > > I have noticed a different problem now.
> > > > 
> > > > + mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
> > > > mke2fs 1.43.8 (1-Jan-2018)
> > > > Creating filesystem with 244190646 4k blocks and 61054976 inodes
> > > > Filesystem UUID: 3bb1a285-2cb4-44b4-b6e8-62548f3ac620
> > > > Superblock backups stored on blocks:
> > > > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> > > > 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
> > > > 102400000, 214990848
> > > > Allocating group tables:    0/7453                           done
> > > > Writing inode tables:    0/7453                           done
> > > > Creating journal (262144 blocks): [   35.502102] BUG: kernel NULL
> > > > pointer dereference, address: 000000c8
> > > > [   35.508372] #PF: supervisor read access in kernel mode
> > > > [   35.513506] #PF: error_code(0x0000) - not-present page
> > > > [   35.518638] *pde = 00000000
> > > > [   35.521514] Oops: 0000 [#1] SMP
> > > > [   35.524652] CPU: 0 PID: 145 Comm: kswapd0 Not tainted
> > > > 5.7.0-rc6-next-20200519+ #1
> > > > [   35.532121] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> > > > 2.2 05/23/2018
> > > > [   35.539507] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
> > > 
> > > Could you get faddr2line for this offset?
> > 
> > No need for that, I can help with the "cgroup_disable=memory" crash:
> > I've been happily running with the fixup below, but haven't got to
> > send it in yet (and wouldn't normally be reading mail at this time!)
> > because of busy chasing a couple of other bugs (not necessarily mm);
> > and maybe the fix would be better with explicit mem_cgroup_disabled()
> > test, or maybe that should be where cgroup_memory_noswap is decided -
> > up to Johannes.
> 
> Thanks Hugh. I can see what the problem is now. I was looking at
> Linus' tree and we have different code there:
> 
> 	long nr_swap_pages = get_nr_swap_pages();
> 
>         if (!do_swap_account || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
>                 return nr_swap_pages;
> 
> which would be impossible to crash so I was really wondering what is
> going on here. But there are other changes in the mmotm which I haven't
> reviewed yet. Looking at the next tree now it is a fallout from "mm:
> memcontrol: prepare swap controller setup for integration".
> 
> A !memcg check is slightly more cryptic than an explicit
> mem_cgroup_disabled() one, but I would just leave it to Johannes as well.

Very much appreciate you guys tracking it down so quickly. Sorry about
the breakage.

I think mem_cgroup_disabled() checks are pretty good markers of public
entry points to the memcg API, so I'd prefer that even if a bit more
verbose. What do you think?

---
From cd373ec232942a9bc43ee5e7d2171352019a58fb Mon Sep 17 00:00:00 2001
From: Hugh Dickins <hughd@google.com>
Date: Thu, 21 May 2020 14:58:36 -0400
Subject: [PATCH] mm: memcontrol: prepare swap controller setup for integration
 fix

Fix crash with cgroup_disable=memory:

> > > > + mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
> > > > mke2fs 1.43.8 (1-Jan-2018)
> > > > Creating filesystem with 244190646 4k blocks and 61054976 inodes
> > > > Filesystem UUID: 3bb1a285-2cb4-44b4-b6e8-62548f3ac620
> > > > Superblock backups stored on blocks:
> > > > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> > > > 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
> > > > 102400000, 214990848
> > > > Allocating group tables:    0/7453                           done
> > > > Writing inode tables:    0/7453                           done
> > > > Creating journal (262144 blocks): [   35.502102] BUG: kernel NULL
> > > > pointer dereference, address: 000000c8
> > > > [   35.508372] #PF: supervisor read access in kernel mode
> > > > [   35.513506] #PF: error_code(0x0000) - not-present page
> > > > [   35.518638] *pde = 00000000
> > > > [   35.521514] Oops: 0000 [#1] SMP
> > > > [   35.524652] CPU: 0 PID: 145 Comm: kswapd0 Not tainted
> > > > 5.7.0-rc6-next-20200519+ #1
> > > > [   35.532121] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> > > > 2.2 05/23/2018
> > > > [   35.539507] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60

do_memsw_account() used to be automatically false when the cgroup
controller was disabled. Now that it's replaced by
cgroup_memory_noswap, for which this isn't true, make the
mem_cgroup_disabled() checks explicit in the swap control API.

[hannes@cmpxchg.org: use mem_cgroup_disabled() in all API functions]
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Debugged-by: Hugh Dickins <hughd@google.com>
Debugged-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 47 +++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 41 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3e000a316b59..850bca380562 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6811,6 +6811,9 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	VM_BUG_ON_PAGE(PageLRU(page), page);
 	VM_BUG_ON_PAGE(page_count(page), page);
 
+	if (mem_cgroup_disabled())
+		return;
+
 	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return;
 
@@ -6876,6 +6879,10 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
 	struct mem_cgroup *memcg;
 	unsigned short oldid;
 
+	if (mem_cgroup_disabled())
+		return 0;
+
+	/* Only cgroup2 has swap.max */
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return 0;
 
@@ -6920,6 +6927,9 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
 	struct mem_cgroup *memcg;
 	unsigned short id;
 
+	if (mem_cgroup_disabled())
+		return;
+
 	id = swap_cgroup_record(entry, 0, nr_pages);
 	rcu_read_lock();
 	memcg = mem_cgroup_from_id(id);
@@ -6940,12 +6950,25 @@ long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg)
 {
 	long nr_swap_pages = get_nr_swap_pages();
 
-	if (cgroup_memory_noswap || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
-		return nr_swap_pages;
+	if (mem_cgroup_disabled())
+		goto out;
+
+	/* Swap control disabled */
+	if (cgroup_memory_noswap)
+		goto out;
+
+	/*
+	 * Only cgroup2 has swap.max, cgroup1 does mem+sw accounting,
+	 * which does not place restrictions specifically on swap.
+	 */
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+		goto out;
+
 	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg))
 		nr_swap_pages = min_t(long, nr_swap_pages,
 				      READ_ONCE(memcg->swap.max) -
 				      page_counter_read(&memcg->swap));
+out:
 	return nr_swap_pages;
 }
 
@@ -6957,18 +6980,30 @@ bool mem_cgroup_swap_full(struct page *page)
 
 	if (vm_swap_full())
 		return true;
-	if (cgroup_memory_noswap || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
-		return false;
+
+	if (mem_cgroup_disabled())
+		goto out;
+
+	/* Swap control disabled */
+	if (cgroup_memory_noswap)
+		goto out;
+
+	/*
+	 * Only cgroup2 has swap.max, cgroup1 does mem+sw accounting,
+	 * which does not place restrictions specifically on swap.
+	 */
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+		goto out;
 
 	memcg = page->mem_cgroup;
 	if (!memcg)
-		return false;
+		goto out;
 
 	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg))
 		if (page_counter_read(&memcg->swap) * 2 >=
 		    READ_ONCE(memcg->swap.max))
 			return true;
-
+out:
 	return false;
 }
 
-- 
2.26.2
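
The hierarchy walk that the patch above guards can be modelled in userspace.
This is only a sketch under stated assumptions: struct memcg_model, root_model,
memcg_disabled_model and model_get_nr_swap_pages() are hypothetical stand-ins
invented for illustration, not kernel types or functions. It assumes, as the
commit message suggests, that callers hand in a NULL memcg when the controller
is disabled, so the guard has to run before the loop touches the pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct mem_cgroup; not the kernel type. */
struct memcg_model {
	struct memcg_model *parent;	/* NULL for the root */
	long swap_max;			/* models memcg->swap.max */
	long swap_usage;		/* models page_counter_read(&memcg->swap) */
};

static struct memcg_model root_model = { NULL, 0, 0 };
static bool memcg_disabled_model;	/* models mem_cgroup_disabled() */

/*
 * Return the smallest swap headroom on the path from memcg up to, but not
 * including, the root, capped by the global number of free swap pages.
 */
static long model_get_nr_swap_pages(struct memcg_model *memcg, long global_free)
{
	long nr = global_free;

	/* Explicit guard: assumed that memcg is NULL when the controller is off. */
	if (memcg_disabled_model)
		return nr;

	assert(memcg != NULL);
	for (; memcg != &root_model; memcg = memcg->parent) {
		long room = memcg->swap_max - memcg->swap_usage;

		if (room < nr)
			nr = room;
	}
	return nr;
}
```

Without the early return, a NULL memcg compared against a non-NULL root would
enter the loop and read through the NULL pointer, which is consistent with the
small faulting address in the oops quoted above.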


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-05-21 19:17                           ` Johannes Weiner
@ 2020-05-21 20:06                             ` Hugh Dickins
  2020-05-21 21:58                               ` Johannes Weiner
  0 siblings, 1 reply; 48+ messages in thread
From: Hugh Dickins @ 2020-05-21 20:06 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Hugh Dickins, Naresh Kamboju, Chris Down,
	Yafang Shao, Anders Roxell, Linux F2FS DEV, Mailing List,
	linux-ext4, linux-block, Andrew Morton, open list,
	Linux-Next Mailing List, linux-mm, Arnd Bergmann, Andreas Dilger,
	Jaegeuk Kim, Theodore Ts'o, Chao Yu, Andrea Arcangeli,
	Matthew Wilcox, Chao Yu, lkft-triage, Roman Gushchin, Cgroups

On Thu, 21 May 2020, Johannes Weiner wrote:
> 
> Very much appreciate you guys tracking it down so quickly. Sorry about
> the breakage.
> 
> I think mem_cgroup_disabled() checks are pretty good markers of public
> entry points to the memcg API, so I'd prefer that even if a bit more
> verbose. What do you think?

An explicit mem_cgroup_disabled() check would be fine, but I must admit,
the patch below is rather too verbose for my own taste.  Your call.

> 
> ---
> From cd373ec232942a9bc43ee5e7d2171352019a58fb Mon Sep 17 00:00:00 2001
> From: Hugh Dickins <hughd@google.com>
> Date: Thu, 21 May 2020 14:58:36 -0400
> Subject: [PATCH] mm: memcontrol: prepare swap controller setup for integration
>  fix
> 
> Fix crash with cgroup_disable=memory:
> 
> > > > > + mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
> > > > > mke2fs 1.43.8 (1-Jan-2018)
> > > > > Creating filesystem with 244190646 4k blocks and 61054976 inodes
> > > > > Filesystem UUID: 3bb1a285-2cb4-44b4-b6e8-62548f3ac620
> > > > > Superblock backups stored on blocks:
> > > > > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> > > > > 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
> > > > > 102400000, 214990848
> > > > > Allocating group tables:    0/7453                           done
> > > > > Writing inode tables:    0/7453                           done
> > > > > Creating journal (262144 blocks): [   35.502102] BUG: kernel NULL
> > > > > pointer dereference, address: 000000c8
> > > > > [   35.508372] #PF: supervisor read access in kernel mode
> > > > > [   35.513506] #PF: error_code(0x0000) - not-present page
> > > > > [   35.518638] *pde = 00000000
> > > > > [   35.521514] Oops: 0000 [#1] SMP
> > > > > [   35.524652] CPU: 0 PID: 145 Comm: kswapd0 Not tainted
> > > > > 5.7.0-rc6-next-20200519+ #1
> > > > > [   35.532121] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> > > > > 2.2 05/23/2018
> > > > > [   35.539507] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
> 
> do_memsw_account() used to be automatically false when the cgroup
> controller was disabled. Now that it's replaced by
> cgroup_memory_noswap, for which this isn't true, make the
> mem_cgroup_disabled() checks explicit in the swap control API.
> 
> [hannes@cmpxchg.org: use mem_cgroup_disabled() in all API functions]
> Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
> Debugged-by: Hugh Dickins <hughd@google.com>
> Debugged-by: Michal Hocko <mhocko@kernel.org>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/memcontrol.c | 47 +++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 41 insertions(+), 6 deletions(-)

I'm certainly not against a mem_cgroup_disabled() check in the only
place that's been observed to need it, as a fixup to merge into your
original patch; but this seems rather an over-reaction - and I'm a
little surprised that setting mem_cgroup_disabled() doesn't just
force cgroup_memory_noswap, saving repetitious checks elsewhere
(perhaps there's a difficulty in that, I haven't looked).

Historically, I think we've added mem_cgroup_disabled() checks
(accessing a cacheline we'd rather avoid) where they're necessary,
rather than at every "interface".

And you seem to be in a very "goto out" mood today - we all have
our "goto out" days, alternating with our "return 0" days :)

Hugh

> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 3e000a316b59..850bca380562 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6811,6 +6811,9 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
>  	VM_BUG_ON_PAGE(PageLRU(page), page);
>  	VM_BUG_ON_PAGE(page_count(page), page);
>  
> +	if (mem_cgroup_disabled())
> +		return;
> +
>  	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
>  		return;
>  
> @@ -6876,6 +6879,10 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
>  	struct mem_cgroup *memcg;
>  	unsigned short oldid;
>  
> +	if (mem_cgroup_disabled())
> +		return 0;
> +
> +	/* Only cgroup2 has swap.max */
>  	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
>  		return 0;
>  
> @@ -6920,6 +6927,9 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
>  	struct mem_cgroup *memcg;
>  	unsigned short id;
>  
> +	if (mem_cgroup_disabled())
> +		return;
> +
>  	id = swap_cgroup_record(entry, 0, nr_pages);
>  	rcu_read_lock();
>  	memcg = mem_cgroup_from_id(id);
> @@ -6940,12 +6950,25 @@ long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg)
>  {
>  	long nr_swap_pages = get_nr_swap_pages();
>  
> -	if (cgroup_memory_noswap || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
> -		return nr_swap_pages;
> +	if (mem_cgroup_disabled())
> +		goto out;
> +
> +	/* Swap control disabled */
> +	if (cgroup_memory_noswap)
> +		goto out;
> +
> +	/*
> +	 * Only cgroup2 has swap.max, cgroup1 does mem+sw accounting,
> +	 * which does not place restrictions specifically on swap.
> +	 */
> +	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> +		goto out;
> +
>  	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg))
>  		nr_swap_pages = min_t(long, nr_swap_pages,
>  				      READ_ONCE(memcg->swap.max) -
>  				      page_counter_read(&memcg->swap));
> +out:
>  	return nr_swap_pages;
>  }
>  
> @@ -6957,18 +6980,30 @@ bool mem_cgroup_swap_full(struct page *page)
>  
>  	if (vm_swap_full())
>  		return true;
> -	if (cgroup_memory_noswap || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
> -		return false;
> +
> +	if (mem_cgroup_disabled())
> +		goto out;
> +
> +	/* Swap control disabled */
> +	if (cgroup_memory_noswap)
> +		goto out;
> +
> +	/*
> +	 * Only cgroup2 has swap.max, cgroup1 does mem+sw accounting,
> +	 * which does not place restrictions specifically on swap.
> +	 */
> +	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> +		goto out;
>  
>  	memcg = page->mem_cgroup;
>  	if (!memcg)
> -		return false;
> +		goto out;
>  
>  	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg))
>  		if (page_counter_read(&memcg->swap) * 2 >=
>  		    READ_ONCE(memcg->swap.max))
>  			return true;
> -
> +out:
>  	return false;
>  }
>  
> -- 
> 2.26.2

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-05-21 19:00                     ` Naresh Kamboju
@ 2020-05-21 20:53                       ` Naresh Kamboju
  2020-05-28 15:03                         ` Michal Hocko
  0 siblings, 1 reply; 48+ messages in thread
From: Naresh Kamboju @ 2020-05-21 20:53 UTC (permalink / raw)
  To: Yafang Shao, Michal Hocko, Chris Down
  Cc: Anders Roxell, Linux F2FS DEV, Mailing List, linux-ext4,
	linux-block, Andrew Morton, open list, Linux-Next Mailing List,
	linux-mm, Arnd Bergmann, Andreas Dilger, Jaegeuk Kim,
	Theodore Ts'o, Chao Yu, Hugh Dickins, Andrea Arcangeli,
	Matthew Wilcox, Chao Yu, lkft-triage, Johannes Weiner,
	Roman Gushchin, Cgroups

My apologies!
According to the test results history, this problem started happening with:
Bad : next-20200430 (still reproducible on next-20200519)
Good : next-20200429

The git tree used for testing is based on the linux-next next-20200430 tag.
Reverting the following three patches fixed the oom-killer problem.

Revert "mm, memcg: avoid stale protection values when cgroup is above
protection"
Revert "mm, memcg: decouple e{low,min} state mutations from protection checks"
Revert "mm-memcg-decouple-elowmin-state-mutations-from-protection-checks-fix"

Ref tree:
https://github.com/roxell/linux/commits/my-next-20200430

Build images:
https://builds.tuxbuild.com/whyTLI1O8s5HiILwpLTLtg/

Test log:
https://lkft.validation.linaro.org/scheduler/job/1444321#L1164

- Naresh

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-05-21 20:06                             ` Hugh Dickins
@ 2020-05-21 21:58                               ` Johannes Weiner
  2020-05-21 23:35                                 ` Hugh Dickins
  2020-05-28 14:59                                 ` Michal Hocko
  0 siblings, 2 replies; 48+ messages in thread
From: Johannes Weiner @ 2020-05-21 21:58 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Michal Hocko, Naresh Kamboju, Chris Down, Yafang Shao,
	Anders Roxell, Linux F2FS DEV, Mailing List, linux-ext4,
	linux-block, Andrew Morton, open list, Linux-Next Mailing List,
	linux-mm, Arnd Bergmann, Andreas Dilger, Jaegeuk Kim,
	Theodore Ts'o, Chao Yu, Andrea Arcangeli, Matthew Wilcox,
	Chao Yu, lkft-triage, Roman Gushchin, Cgroups

On Thu, May 21, 2020 at 01:06:28PM -0700, Hugh Dickins wrote:
> On Thu, 21 May 2020, Johannes Weiner wrote:
> > do_memsw_account() used to be automatically false when the cgroup
> > controller was disabled. Now that it's replaced by
> > cgroup_memory_noswap, for which this isn't true, make the
> > mem_cgroup_disabled() checks explicit in the swap control API.
> > 
> > [hannes@cmpxchg.org: use mem_cgroup_disabled() in all API functions]
> > Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
> > Debugged-by: Hugh Dickins <hughd@google.com>
> > Debugged-by: Michal Hocko <mhocko@kernel.org>
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> >  mm/memcontrol.c | 47 +++++++++++++++++++++++++++++++++++++++++------
> >  1 file changed, 41 insertions(+), 6 deletions(-)
> 
> I'm certainly not against a mem_cgroup_disabled() check in the only
> place that's been observed to need it, as a fixup to merge into your
> original patch; but this seems rather an over-reaction - and I'm a
> little surprised that setting mem_cgroup_disabled() doesn't just
> force cgroup_memory_noswap, saving repetitious checks elsewhere
> (perhaps there's a difficulty in that, I haven't looked).

Fair enough, I changed it to set the flag at initialization time if
mem_cgroup_disabled(). I was never a fan of the old flags, where it
was never clear what was commandline, and what was internal runtime
state - do_swap_account? really_do_swap_account? But I think it's
straight-forward in this case now.

> Historically, I think we've added mem_cgroup_disabled() checks
> (accessing a cacheline we'd rather avoid) where they're necessary,
> rather than at every "interface".

To me that always seemed like bugs waiting to happen. Like this one!

It's a jump label nowadays, so I've been liberal with these to avoid
subtle bugs.

> And you seem to be in a very "goto out" mood today - we all have
> our "goto out" days, alternating with our "return 0" days :)

:-)

But I agree, best to keep this fixup self-contained and defer anything
else to separate cleanup patches.

How about the below? It survives a swaptest with cgroup_disable=memory
for me.

Hugh, I started with your patch, which is why I kept you as the
author, but as the patch now (and arguably the previous one) is
sufficiently different, I dropped that now. I hope that's okay.

---
From d9e7ed15d1c9248a3fd99e35e82437549154dac7 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Thu, 21 May 2020 17:44:25 -0400
Subject: [PATCH] mm: memcontrol: prepare swap controller setup for integration
 fix

Fix crash with cgroup_disable=memory:

> > > > + mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
> > > > mke2fs 1.43.8 (1-Jan-2018)
> > > > Creating filesystem with 244190646 4k blocks and 61054976 inodes
> > > > Filesystem UUID: 3bb1a285-2cb4-44b4-b6e8-62548f3ac620
> > > > Superblock backups stored on blocks:
> > > > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> > > > 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
> > > > 102400000, 214990848
> > > > Allocating group tables:    0/7453                           done
> > > > Writing inode tables:    0/7453                           done
> > > > Creating journal (262144 blocks): [   35.502102] BUG: kernel NULL
> > > > pointer dereference, address: 000000c8
> > > > [   35.508372] #PF: supervisor read access in kernel mode
> > > > [   35.513506] #PF: error_code(0x0000) - not-present page
> > > > [   35.518638] *pde = 00000000
> > > > [   35.521514] Oops: 0000 [#1] SMP
> > > > [   35.524652] CPU: 0 PID: 145 Comm: kswapd0 Not tainted
> > > > 5.7.0-rc6-next-20200519+ #1
> > > > [   35.532121] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> > > > 2.2 05/23/2018
> > > > [   35.539507] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60

Swap accounting used to be implied-disabled when the cgroup controller
was disabled. Restore that for the new cgroup_memory_noswap, so that
we bail out of this function instead of dereferencing a NULL memcg.

Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Debugged-by: Hugh Dickins <hughd@google.com>
Debugged-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3e000a316b59..e3b785d6e771 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7075,7 +7075,11 @@ static struct cftype memsw_files[] = {
 
 static int __init mem_cgroup_swap_init(void)
 {
-	if (mem_cgroup_disabled() || cgroup_memory_noswap)
+	/* No memory control -> no swap control */
+	if (mem_cgroup_disabled())
+		cgroup_memory_noswap = true;
+
+	if (cgroup_memory_noswap)
 		return 0;
 
 	WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys, swap_files));
-- 
2.26.2
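
The design choice in this smaller patch, folding the "controller disabled"
state into the swap flag once at init time, can be sketched in userspace as
follows. All names here (model_memcg_disabled, model_noswap, model_swap_init,
model_swap_control_active) are hypothetical and exist only for illustration;
they mirror the shape of the diff above rather than reproduce kernel code.

```c
#include <stdbool.h>

static bool model_memcg_disabled;	/* models mem_cgroup_disabled() */
static bool model_noswap;		/* models cgroup_memory_noswap */

/*
 * Models mem_cgroup_swap_init(): returns true if the swap control files
 * would be registered, false if swap control stays off.
 */
static bool model_swap_init(void)
{
	/* No memory control -> no swap control */
	if (model_memcg_disabled)
		model_noswap = true;

	if (model_noswap)
		return false;

	return true;
}

/* Models a later hot-path check that only looks at the swap flag. */
static bool model_swap_control_active(void)
{
	return !model_noswap;
}
```

Once init has run, every later test of the single flag also covers the
disabled-controller case, so the individual entry points need no extra check.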


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-05-21 21:58                               ` Johannes Weiner
@ 2020-05-21 23:35                                 ` Hugh Dickins
  2020-05-28 14:59                                 ` Michal Hocko
  1 sibling, 0 replies; 48+ messages in thread
From: Hugh Dickins @ 2020-05-21 23:35 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Hugh Dickins, Michal Hocko, Naresh Kamboju, Chris Down,
	Yafang Shao, Anders Roxell, Linux F2FS DEV, Mailing List,
	linux-ext4, linux-block, Andrew Morton, open list,
	Linux-Next Mailing List, linux-mm, Arnd Bergmann, Andreas Dilger,
	Jaegeuk Kim, Theodore Ts'o, Chao Yu, Andrea Arcangeli,
	Matthew Wilcox, Chao Yu, lkft-triage, Roman Gushchin, Cgroups

On Thu, 21 May 2020, Johannes Weiner wrote:
> On Thu, May 21, 2020 at 01:06:28PM -0700, Hugh Dickins wrote:
> > On Thu, 21 May 2020, Johannes Weiner wrote:
> > > do_memsw_account() used to be automatically false when the cgroup
> > > controller was disabled. Now that it's replaced by
> > > cgroup_memory_noswap, for which this isn't true, make the
> > > mem_cgroup_disabled() checks explicit in the swap control API.
> > > 
> > > [hannes@cmpxchg.org: use mem_cgroup_disabled() in all API functions]
> > > Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
> > > Debugged-by: Hugh Dickins <hughd@google.com>
> > > Debugged-by: Michal Hocko <mhocko@kernel.org>
> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > ---
> > >  mm/memcontrol.c | 47 +++++++++++++++++++++++++++++++++++++++++------
> > >  1 file changed, 41 insertions(+), 6 deletions(-)
> > 
> > I'm certainly not against a mem_cgroup_disabled() check in the only
> > place that's been observed to need it, as a fixup to merge into your
> > original patch; but this seems rather an over-reaction - and I'm a
> > little surprised that setting mem_cgroup_disabled() doesn't just
> > force cgroup_memory_noswap, saving repetitious checks elsewhere
> > (perhaps there's a difficulty in that, I haven't looked).
> 
> Fair enough, I changed it to set the flag at initialization time if
> mem_cgroup_disabled(). I was never a fan of the old flags, where it
> was never clear what was commandline, and what was internal runtime
> state - do_swap_account? really_do_swap_account? But I think it's
> straight-forward in this case now.
> 
> > Historically, I think we've added mem_cgroup_disabled() checks
> > (accessing a cacheline we'd rather avoid) where they're necessary,
> > rather than at every "interface".
> 
> To me that always seemed like bugs waiting to happen. Like this one!
> 
> It's a jump label nowadays, so I've been liberal with these to avoid
> subtle bugs.
> 
> > And you seem to be in a very "goto out" mood today - we all have
> > our "goto out" days, alternating with our "return 0" days :)
> 
> :-)
> 
> But I agree, best to keep this fixup self-contained and defer anything
> else to separate cleanup patches.
> 
> How about the below? It survives a swaptest with cgroup_disable=memory
> for me.

I like this version *a lot*, thank you. I got worried for a bit by
the "#define cgroup_memory_noswap 1" when #ifndef CONFIG_MEMCG_SWAP,
but now realize that fits perfectly.

> 
> Hugh, I started with your patch, which is why I kept you as the
> author, but as the patch now (and arguably the previous one) is
> sufficiently different, I dropped that now. I hope that's okay.

Absolutely okay, these are yours: I was a little uncomfortable to
see me on the From line before, but it also seemed just too petty
to insist that my name be removed.

(By the way, off-topic for this particular issue, but advance warning
that I hope to post a couple of patches to __read_swap_cache_async()
before the end of the day, first being fixup to some of your mods -
I suspect you got it working well enough, and intended to come back
to check a few details later, but never quite got around to that.)

> 
> ---
> From d9e7ed15d1c9248a3fd99e35e82437549154dac7 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Thu, 21 May 2020 17:44:25 -0400
> Subject: [PATCH] mm: memcontrol: prepare swap controller setup for integration
>  fix
> 
> Fix crash with cgroup_disable=memory:
> 
> > > > > + mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
> > > > > mke2fs 1.43.8 (1-Jan-2018)
> > > > > Creating filesystem with 244190646 4k blocks and 61054976 inodes
> > > > > Filesystem UUID: 3bb1a285-2cb4-44b4-b6e8-62548f3ac620
> > > > > Superblock backups stored on blocks:
> > > > > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> > > > > 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
> > > > > 102400000, 214990848
> > > > > Allocating group tables:    0/7453                           done
> > > > > Writing inode tables:    0/7453                           done
> > > > > Creating journal (262144 blocks): [   35.502102] BUG: kernel NULL
> > > > > pointer dereference, address: 000000c8
> > > > > [   35.508372] #PF: supervisor read access in kernel mode
> > > > > [   35.513506] #PF: error_code(0x0000) - not-present page
> > > > > [   35.518638] *pde = 00000000
> > > > > [   35.521514] Oops: 0000 [#1] SMP
> > > > > [   35.524652] CPU: 0 PID: 145 Comm: kswapd0 Not tainted
> > > > > 5.7.0-rc6-next-20200519+ #1
> > > > > [   35.532121] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> > > > > 2.2 05/23/2018
> > > > > [   35.539507] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
> 
> Swap accounting used to be implied-disabled when the cgroup controller
> was disabled. Restore that for the new cgroup_memory_noswap, so that
> we bail out of this function instead of dereferencing a NULL memcg.
> 
> Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
> Debugged-by: Hugh Dickins <hughd@google.com>
> Debugged-by: Michal Hocko <mhocko@kernel.org>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Hugh Dickins <hughd@google.com>

> ---
>  mm/memcontrol.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 3e000a316b59..e3b785d6e771 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -7075,7 +7075,11 @@ static struct cftype memsw_files[] = {
>  
>  static int __init mem_cgroup_swap_init(void)
>  {
> -	if (mem_cgroup_disabled() || cgroup_memory_noswap)
> +	/* No memory control -> no swap control */
> +	if (mem_cgroup_disabled())
> +		cgroup_memory_noswap = true;
> +
> +	if (cgroup_memory_noswap)
>  		return 0;
>  
>  	WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys, swap_files));
> -- 
> 2.26.2

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-05-21 21:58                               ` Johannes Weiner
  2020-05-21 23:35                                 ` Hugh Dickins
@ 2020-05-28 14:59                                 ` Michal Hocko
  1 sibling, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2020-05-28 14:59 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Hugh Dickins, Naresh Kamboju, Chris Down, Yafang Shao,
	Anders Roxell, Linux F2FS DEV, Mailing List, linux-ext4,
	linux-block, Andrew Morton, open list, Linux-Next Mailing List,
	linux-mm, Arnd Bergmann, Andreas Dilger, Jaegeuk Kim,
	Theodore Ts'o, Chao Yu, Andrea Arcangeli, Matthew Wilcox,
	Chao Yu, lkft-triage, Roman Gushchin, Cgroups

[Sorry for the late reply - I was offline for a few days]

On Thu 21-05-20 17:58:55, Johannes Weiner wrote:
> On Thu, May 21, 2020 at 01:06:28PM -0700, Hugh Dickins wrote:
[...]
> From d9e7ed15d1c9248a3fd99e35e82437549154dac7 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Thu, 21 May 2020 17:44:25 -0400
> Subject: [PATCH] mm: memcontrol: prepare swap controller setup for integration
>  fix
> 
> Fix crash with cgroup_disable=memory:
> 
> > > > > + mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
> > > > > mke2fs 1.43.8 (1-Jan-2018)
> > > > > Creating filesystem with 244190646 4k blocks and 61054976 inodes
> > > > > Filesystem UUID: 3bb1a285-2cb4-44b4-b6e8-62548f3ac620
> > > > > Superblock backups stored on blocks:
> > > > > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> > > > > 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
> > > > > 102400000, 214990848
> > > > > Allocating group tables:    0/7453                           done
> > > > > Writing inode tables:    0/7453                           done
> > > > > Creating journal (262144 blocks): [   35.502102] BUG: kernel NULL
> > > > > pointer dereference, address: 000000c8
> > > > > [   35.508372] #PF: supervisor read access in kernel mode
> > > > > [   35.513506] #PF: error_code(0x0000) - not-present page
> > > > > [   35.518638] *pde = 00000000
> > > > > [   35.521514] Oops: 0000 [#1] SMP
> > > > > [   35.524652] CPU: 0 PID: 145 Comm: kswapd0 Not tainted
> > > > > 5.7.0-rc6-next-20200519+ #1
> > > > > [   35.532121] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> > > > > 2.2 05/23/2018
> > > > > [   35.539507] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
> 
> Swap accounting used to be implied-disabled when the cgroup controller
> was disabled. Restore that for the new cgroup_memory_noswap, so that
> we bail out of this function instead of dereferencing a NULL memcg.
> 
> Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
> Debugged-by: Hugh Dickins <hughd@google.com>
> Debugged-by: Michal Hocko <mhocko@kernel.org>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Yes this looks better. I hope to get to your series soon to have the
full picture finally.

> ---
>  mm/memcontrol.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 3e000a316b59..e3b785d6e771 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -7075,7 +7075,11 @@ static struct cftype memsw_files[] = {
>  
>  static int __init mem_cgroup_swap_init(void)
>  {
> -	if (mem_cgroup_disabled() || cgroup_memory_noswap)
> +	/* No memory control -> no swap control */
> +	if (mem_cgroup_disabled())
> +		cgroup_memory_noswap = true;
> +
> +	if (cgroup_memory_noswap)
>  		return 0;
>  
>  	WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys, swap_files));
> -- 
> 2.26.2

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-05-21 20:53                       ` Naresh Kamboju
@ 2020-05-28 15:03                         ` Michal Hocko
  2020-05-28 16:17                           ` Naresh Kamboju
  0 siblings, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2020-05-28 15:03 UTC (permalink / raw)
  To: Naresh Kamboju
  Cc: Yafang Shao, Chris Down, Anders Roxell, Linux F2FS DEV,
	Mailing List, linux-ext4, linux-block, Andrew Morton, open list,
	Linux-Next Mailing List, linux-mm, Arnd Bergmann, Andreas Dilger,
	Jaegeuk Kim, Theodore Ts'o, Chao Yu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Chao Yu, lkft-triage,
	Johannes Weiner, Roman Gushchin, Cgroups

On Fri 22-05-20 02:23:09, Naresh Kamboju wrote:
> My apology !
> As per the test results history this problem started happening from
> Bad : next-20200430 (still reproducible on next-20200519)
> Good : next-20200429
> 
> The git tree / tag used for testing is from linux next-20200430 tag and reverted
> following three patches and oom-killer problem fixed.
> 
> Revert "mm, memcg: avoid stale protection values when cgroup is above
> protection"
> Revert "mm, memcg: decouple e{low,min} state mutations from protection checks"
> Revert "mm-memcg-decouple-elowmin-state-mutations-from-protection-checks-fix"

The discussion has fragmented and I got lost TBH.
In http://lkml.kernel.org/r/CA+G9fYuDWGZx50UpD+WcsDeHX9vi3hpksvBAWbMgRZadb0Pkww@mail.gmail.com
you have said that none of the added tracing output has triggered. Does
this still hold? I still have a hard time understanding how
those three patches could have the observed effects.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-05-28 15:03                         ` Michal Hocko
@ 2020-05-28 16:17                           ` Naresh Kamboju
  2020-05-28 16:41                             ` Chris Down
  0 siblings, 1 reply; 48+ messages in thread
From: Naresh Kamboju @ 2020-05-28 16:17 UTC (permalink / raw)
  To: Michal Hocko, Yafang Shao
  Cc: Chris Down, Anders Roxell, Linux F2FS DEV, Mailing List,
	linux-ext4, linux-block, Andrew Morton, open list,
	Linux-Next Mailing List, linux-mm, Arnd Bergmann, Andreas Dilger,
	Jaegeuk Kim, Theodore Ts'o, Chao Yu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Chao Yu, lkft-triage,
	Johannes Weiner, Roman Gushchin, Cgroups

On Thu, 28 May 2020 at 20:33, Michal Hocko <mhocko@kernel.org> wrote:
>
> On Fri 22-05-20 02:23:09, Naresh Kamboju wrote:
> > My apology !
> > As per the test results history this problem started happening from
> > Bad : next-20200430 (still reproducible on next-20200519)
> > Good : next-20200429
> >
> > The git tree / tag used for testing is from linux next-20200430 tag and reverted
> > following three patches and oom-killer problem fixed.
> >
> > Revert "mm, memcg: avoid stale protection values when cgroup is above
> > protection"
> > Revert "mm, memcg: decouple e{low,min} state mutations from protection checks"
> > Revert "mm-memcg-decouple-elowmin-state-mutations-from-protection-checks-fix"
>
> The discussion has fragmented and I got lost TBH.
> In http://lkml.kernel.org/r/CA+G9fYuDWGZx50UpD+WcsDeHX9vi3hpksvBAWbMgRZadb0Pkww@mail.gmail.com
> you have said that none of the added tracing output has triggered. Does
> this still hold? Because I still have a hard time to understand how
> those three patches could have the observed effects.

This issue was concluded on the other email thread [1].

Yafang wrote on May 22, 2020:

Regarding the root cause, my guess is it makes a similar mistake that
I tried to fix in the previous patch that the direct reclaimer read a
stale protection value.  But I don't think it is worth to add another
fix. The best way is to revert this commit.


[1]  [PATCH v3 2/2] mm, memcg: Decouple e{low,min} state mutations
from protection checks
https://lore.kernel.org/linux-mm/CALOAHbArZ3NsuR3mCnx_kbSF8ktpjhUF2kaaTa7Mb7ocJajsQg@mail.gmail.com/

- Naresh

> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-05-28 16:17                           ` Naresh Kamboju
@ 2020-05-28 16:41                             ` Chris Down
  2020-05-29  1:50                               ` Yafang Shao
  0 siblings, 1 reply; 48+ messages in thread
From: Chris Down @ 2020-05-28 16:41 UTC (permalink / raw)
  To: Naresh Kamboju
  Cc: Michal Hocko, Yafang Shao, Anders Roxell, Linux F2FS DEV,
	Mailing List, linux-ext4, linux-block, Andrew Morton, open list,
	Linux-Next Mailing List, linux-mm, Arnd Bergmann, Andreas Dilger,
	Jaegeuk Kim, Theodore Ts'o, Chao Yu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Chao Yu, lkft-triage,
	Johannes Weiner, Roman Gushchin, Cgroups

Naresh Kamboju writes:
>On Thu, 28 May 2020 at 20:33, Michal Hocko <mhocko@kernel.org> wrote:
>>
>> On Fri 22-05-20 02:23:09, Naresh Kamboju wrote:
>> > My apology !
>> > As per the test results history this problem started happening from
>> > Bad : next-20200430 (still reproducible on next-20200519)
>> > Good : next-20200429
>> >
>> > The git tree / tag used for testing is from linux next-20200430 tag and reverted
>> > following three patches and oom-killer problem fixed.
>> >
>> > Revert "mm, memcg: avoid stale protection values when cgroup is above
>> > protection"
>> > Revert "mm, memcg: decouple e{low,min} state mutations from protection checks"
>> > Revert "mm-memcg-decouple-elowmin-state-mutations-from-protection-checks-fix"
>>
>> The discussion has fragmented and I got lost TBH.
>> In http://lkml.kernel.org/r/CA+G9fYuDWGZx50UpD+WcsDeHX9vi3hpksvBAWbMgRZadb0Pkww@mail.gmail.com
>> you have said that none of the added tracing output has triggered. Does
>> this still hold? Because I still have a hard time to understand how
>> those three patches could have the observed effects.
>
>On the other email thread [1] this issue is concluded.
>
>Yafang wrote on May 22 2020,
>
>Regarding the root cause, my guess is it makes a similar mistake that
>I tried to fix in the previous patch that the direct reclaimer read a
>stale protection value.  But I don't think it is worth to add another
>fix. The best way is to revert this commit.

This isn't a conclusion, just a guess (and one I think is unlikely). For this 
to reliably happen, it implies that the same race happens the same way each 
time.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-05-28 16:41                             ` Chris Down
@ 2020-05-29  1:50                               ` Yafang Shao
  2020-05-29  1:56                                 ` Chris Down
  0 siblings, 1 reply; 48+ messages in thread
From: Yafang Shao @ 2020-05-29  1:50 UTC (permalink / raw)
  To: Chris Down
  Cc: Naresh Kamboju, Michal Hocko, Anders Roxell, Linux F2FS DEV,
	Mailing List, linux-ext4, linux-block, Andrew Morton, open list,
	Linux-Next Mailing List, linux-mm, Arnd Bergmann, Andreas Dilger,
	Jaegeuk Kim, Theodore Ts'o, Chao Yu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Chao Yu, lkft-triage,
	Johannes Weiner, Roman Gushchin, Cgroups

On Fri, May 29, 2020 at 12:41 AM Chris Down <chris@chrisdown.name> wrote:
>
> Naresh Kamboju writes:
> >On Thu, 28 May 2020 at 20:33, Michal Hocko <mhocko@kernel.org> wrote:
> >>
> >> On Fri 22-05-20 02:23:09, Naresh Kamboju wrote:
> >> > My apology !
> >> > As per the test results history this problem started happening from
> >> > Bad : next-20200430 (still reproducible on next-20200519)
> >> > Good : next-20200429
> >> >
> >> > The git tree / tag used for testing is from linux next-20200430 tag and reverted
> >> > following three patches and oom-killer problem fixed.
> >> >
> >> > Revert "mm, memcg: avoid stale protection values when cgroup is above
> >> > protection"
> >> > Revert "mm, memcg: decouple e{low,min} state mutations from protection checks"
> >> > Revert "mm-memcg-decouple-elowmin-state-mutations-from-protection-checks-fix"
> >>
> >> The discussion has fragmented and I got lost TBH.
> >> In http://lkml.kernel.org/r/CA+G9fYuDWGZx50UpD+WcsDeHX9vi3hpksvBAWbMgRZadb0Pkww@mail.gmail.com
> >> you have said that none of the added tracing output has triggered. Does
> >> this still hold? Because I still have a hard time to understand how
> >> those three patches could have the observed effects.
> >
> >On the other email thread [1] this issue is concluded.
> >
> >Yafang wrote on May 22 2020,
> >
> >Regarding the root cause, my guess is it makes a similar mistake that
> >I tried to fix in the previous patch that the direct reclaimer read a
> >stale protection value.  But I don't think it is worth to add another
> >fix. The best way is to revert this commit.
>
> This isn't a conclusion, just a guess (and one I think is unlikely). For this
> to reliably happen, it implies that the same race happens the same way each
> time.


Hi Chris,

If you look at this patch [1] carefully, you will find that it introduces
the same issue that I tried to fix in another patch [2]. Even sadder,
these two patches are in the same patchset. Although this issue isn't
related to the issue found by Naresh, we have to ask ourselves why
we keep making the same mistake.
One possible answer is that we always forget the lifecycle of
memory.emin before we read it. memory.emin doesn't have the same
lifecycle as the memcg; it really has the same lifecycle as
the reclaimer. IOW, once a reclaimer begins, the protection value should
be set to 0; after we traverse the memcg tree we calculate a
protection value for this reclaimer, and finally it disappears after the
reclaimer stops. That is why I strongly suggested adding a new protection
member to scan_control before.

[1]. https://lore.kernel.org/linux-mm/20200505084127.12923-3-laoar.shao@gmail.com/
[2]. https://lore.kernel.org/linux-mm/20200505084127.12923-2-laoar.shao@gmail.com/
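
To illustrate the lifecycle argument above, here is a minimal user-space
sketch, not kernel code. The struct and function names (memcg_model,
reclaimer_model, reclaim_begin, reclaim_set_protection) are hypothetical
and only model the idea: a value stored in the shared memcg can still hold
what an earlier reclaimer computed, while a value stored per reclaimer
starts from 0 on every pass.

```c
#include <assert.h>

/* Hypothetical models; the names are illustrative, not kernel structures. */
struct memcg_model {
	unsigned long emin;		/* shared by every reclaimer */
};

struct reclaimer_model {
	unsigned long protection;	/* private to one reclaim pass */
};

/* A reclaim pass starts with no protection value of its own. */
void reclaim_begin(struct reclaimer_model *rc)
{
	rc->protection = 0;
}

/* Record the protection this pass computed while walking the tree. */
void reclaim_set_protection(struct reclaimer_model *rc, unsigned long value)
{
	rc->protection = value;
}

void check_no_stale_value(void)
{
	struct memcg_model shared = { .emin = 0 };
	struct reclaimer_model a, b;

	/* Reclaimer A computes a protection of 50 pages. */
	reclaim_begin(&a);
	shared.emin = 50;
	reclaim_set_protection(&a, 50);

	/* Reclaimer B starts later and has not computed anything yet. */
	reclaim_begin(&b);
	assert(shared.emin == 50);	/* shared field still holds A's value */
	assert(b.protection == 0);	/* private field starts clean */
	assert(a.protection == 50);	/* A's own value is unaffected */
}
```

In this model a reclaimer that reads shared.emin before recomputing it sees
A's value, whereas b.protection cannot be stale by construction.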

-- 
Thanks
Yafang

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-05-29  1:50                               ` Yafang Shao
@ 2020-05-29  1:56                                 ` Chris Down
  2020-05-29  9:49                                   ` Michal Hocko
  0 siblings, 1 reply; 48+ messages in thread
From: Chris Down @ 2020-05-29  1:56 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Naresh Kamboju, Michal Hocko, Anders Roxell, Linux F2FS DEV,
	Mailing List, linux-ext4, linux-block, Andrew Morton, open list,
	Linux-Next Mailing List, linux-mm, Arnd Bergmann, Andreas Dilger,
	Jaegeuk Kim, Theodore Ts'o, Chao Yu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Chao Yu, lkft-triage,
	Johannes Weiner, Roman Gushchin, Cgroups

Yafang Shao writes:
>Look at this patch[1] carefully you will find that it introduces the
>same issue that I tried to fix in another patch [2]. Even more sad is
>these two patches are in the same patchset. Although this issue isn't
>related with the issue found by Naresh, we have to ask ourselves why
>we always make the same mistake ?
>One possible answer is that we always forget the lifecycle of
>memory.emin before we read it. memory.emin doesn't have the same
>lifecycle with the memcg, while it really has the same lifecycle with
>the reclaimer. IOW, once a reclaimer begins the protection value should
>be set to 0, and after we traverse the memcg tree we calculate a
>protection value for this reclaimer, finally it disappears after the
>reclaimer stops. That is why I highly suggest to add an new protection
>member in scan_control before.

I agree with you that the e{min,low} lifecycle is confusing for everyone -- the
only thing I've not seen is any confirmed correlation with the i386 OOM killer
issue. If you've validated that, I'd like to see the data :-)

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-05-29  1:56                                 ` Chris Down
@ 2020-05-29  9:49                                   ` Michal Hocko
  2020-06-11  9:55                                     ` Michal Hocko
  0 siblings, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2020-05-29  9:49 UTC (permalink / raw)
  To: Chris Down
  Cc: Yafang Shao, Naresh Kamboju, Anders Roxell, Linux F2FS DEV,
	Mailing List, linux-ext4, linux-block, Andrew Morton, open list,
	Linux-Next Mailing List, linux-mm, Arnd Bergmann, Andreas Dilger,
	Jaegeuk Kim, Theodore Ts'o, Chao Yu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Chao Yu, lkft-triage,
	Johannes Weiner, Roman Gushchin, Cgroups

On Fri 29-05-20 02:56:44, Chris Down wrote:
> Yafang Shao writes:
> > Look at this patch[1] carefully you will find that it introduces the
> > same issue that I tried to fix in another patch [2]. Even more sad is
> > these two patches are in the same patchset. Although this issue isn't
> > related with the issue found by Naresh, we have to ask ourselves why
> > we always make the same mistake ?
> > One possible answer is that we always forget the lifecycle of
> > memory.emin before we read it. memory.emin doesn't have the same
> > lifecycle with the memcg, while it really has the same lifecycle with
> > the reclaimer. IOW, once a reclaimer begins the protection value should
> > be set to 0, and after we traverse the memcg tree we calculate a
> > protection value for this reclaimer, finally it disappears after the
> > reclaimer stops. That is why I highly suggest to add an new protection
> > member in scan_control before.
> 
> I agree with you that the e{min,low} lifecycle is confusing for everyone --
> the only thing I've not seen confirmation of is any confirmed correlation
> with the i386 oom killer issue. If you've validated that, I'd like to see
> the data :-)

Agreed. Even if e{low,min} might still have some rough edges I am
completely puzzled how we could end up oom if none of the protection
path triggers which the additional debugging should confirm. Maybe my
debugging patch is incomplete or used incorrectly (maybe it would be
easier to use printk rather than trace_printk?).
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-05-29  9:49                                   ` Michal Hocko
@ 2020-06-11  9:55                                     ` Michal Hocko
  2020-06-12  9:43                                       ` Naresh Kamboju
  0 siblings, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2020-06-11  9:55 UTC (permalink / raw)
  To: Chris Down, Naresh Kamboju
  Cc: Yafang Shao, Anders Roxell, Linux F2FS DEV, Mailing List,
	linux-ext4, linux-block, Andrew Morton, open list,
	Linux-Next Mailing List, linux-mm, Arnd Bergmann, Andreas Dilger,
	Jaegeuk Kim, Theodore Ts'o, Chao Yu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Chao Yu, lkft-triage,
	Johannes Weiner, Roman Gushchin, Cgroups

On Fri 29-05-20 11:49:20, Michal Hocko wrote:
> On Fri 29-05-20 02:56:44, Chris Down wrote:
> > Yafang Shao writes:
> > > Look at this patch[1] carefully you will find that it introduces the
> > > same issue that I tried to fix in another patch [2]. Even more sad is
> > > these two patches are in the same patchset. Although this issue isn't
> > > related with the issue found by Naresh, we have to ask ourselves why
> > > we always make the same mistake ?
> > > One possible answer is that we always forget the lifecycle of
> > > memory.emin before we read it. memory.emin doesn't have the same
> > > lifecycle with the memcg, while it really has the same lifecycle with
> > > the reclaimer. IOW, once a reclaimer begins the protection value should
> > > be set to 0, and after we traverse the memcg tree we calculate a
> > > protection value for this reclaimer, finally it disappears after the
> > > reclaimer stops. That is why I highly suggest to add an new protection
> > > member in scan_control before.
> > 
> > I agree with you that the e{min,low} lifecycle is confusing for everyone --
> > the only thing I've not seen confirmation of is any confirmed correlation
> > with the i386 oom killer issue. If you've validated that, I'd like to see
> > the data :-)
> 
> Agreed. Even if e{low,min} might still have some rough edges I am
> completely puzzled how we could end up oom if none of the protection
> path triggers which the additional debugging should confirm. Maybe my
> debugging patch is incomplete or used incorrectly (maybe it would be
> easier to use printk rather than trace_printk?).

It would be really great if we could move forward. While the fix (which
has been dropped from mmotm) is not super urgent I would really like to
understand how it could hit the observed behavior. Can we double check
that the debugging patch really doesn't trigger (e.g.
s@trace_printk@printk in the first step)? I have checked it again but
do not see any potential code path which would be affected by the patch
yet not trigger any output. But another pair of eyes would be really
great.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-06-11  9:55                                     ` Michal Hocko
@ 2020-06-12  9:43                                       ` Naresh Kamboju
  2020-06-12 12:09                                         ` Michal Hocko
  0 siblings, 1 reply; 48+ messages in thread
From: Naresh Kamboju @ 2020-06-12  9:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Chris Down, Yafang Shao, Anders Roxell, Linux F2FS DEV,
	Mailing List, linux-ext4, linux-block, Andrew Morton, open list,
	Linux-Next Mailing List, linux-mm, Arnd Bergmann, Andreas Dilger,
	Jaegeuk Kim, Theodore Ts'o, Chao Yu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Chao Yu, lkft-triage,
	Johannes Weiner, Roman Gushchin, Cgroups

On Thu, 11 Jun 2020 at 15:25, Michal Hocko <mhocko@kernel.org> wrote:
>
> On Fri 29-05-20 11:49:20, Michal Hocko wrote:
> > On Fri 29-05-20 02:56:44, Chris Down wrote:
> > > Yafang Shao writes:
> > Agreed. Even if e{low,min} might still have some rough edges I am
> > completely puzzled how we could end up oom if none of the protection
> > path triggers which the additional debugging should confirm. Maybe my
> > debugging patch is incomplete or used incorrectly (maybe it would be
> > easier to use printk rather than trace_printk?).
>
> It would be really great if we could move forward. While the fix (which
> has been dropped from mmotm) is not super urgent I would really like to
> understand how it could hit the observed behavior. Can we double check
> that the debugging patch really doesn't trigger (e.g.
> s@trace_printk@printk in the first step)?

Please suggest a way to get more debug information by providing
kernel debug patches and extra kernel configs.

I have applied your debug patch and tested it on top of linux-next 20200612,
but did not find any printk output while running the mkfs -t ext4 /drive test case.


> I have checked it again but
> do not see any potential code path which would be affected by the patch
> yet not trigger any output. But another pair of eyes would be really
> great.


---
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b6d84326bdf2..d13ce7b02de4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2375,6 +2375,8 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
  * sc->priority further than desirable.
  */
  scan = max(scan, SWAP_CLUSTER_MAX);
+
+ trace_printk("scan:%lu protection:%lu\n", scan, protection);
  } else {
  scan = lruvec_size;
  }
@@ -2618,6 +2620,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)

  switch (mem_cgroup_protected(target_memcg, memcg)) {
  case MEMCG_PROT_MIN:
+ trace_printk("under min:%lu emin:%lu\n", memcg->memory.min, memcg->memory.emin);
  /*
  * Hard protection.
  * If there is no reclaimable memory, OOM.
@@ -2630,6 +2633,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
  * there is an unprotected supply
  * of reclaimable memory from other cgroups.
  */
+ trace_printk("under low:%lu elow:%lu\n", memcg->memory.low, memcg->memory.elow);
  if (!sc->memcg_low_reclaim) {
  sc->memcg_low_skipped = 1;
  continue;
-- 
2.23.0

ref:
test output:
https://lkft.validation.linaro.org/scheduler/job/1489767#L1388

Test artifacts link (kernel / modules):
https://builds.tuxbuild.com/5rRNgQqF_wHsSRptdj4A1A/
- Naresh

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-06-12  9:43                                       ` Naresh Kamboju
@ 2020-06-12 12:09                                         ` Michal Hocko
  0 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2020-06-12 12:09 UTC (permalink / raw)
  To: Naresh Kamboju
  Cc: Chris Down, Yafang Shao, Anders Roxell, Linux F2FS DEV,
	Mailing List, linux-ext4, linux-block, Andrew Morton, open list,
	Linux-Next Mailing List, linux-mm, Arnd Bergmann, Andreas Dilger,
	Jaegeuk Kim, Theodore Ts'o, Chao Yu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Chao Yu, lkft-triage,
	Johannes Weiner, Roman Gushchin, Cgroups

On Fri 12-06-20 15:13:22, Naresh Kamboju wrote:
> On Thu, 11 Jun 2020 at 15:25, Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Fri 29-05-20 11:49:20, Michal Hocko wrote:
> > > On Fri 29-05-20 02:56:44, Chris Down wrote:
> > > > Yafang Shao writes:
> > > Agreed. Even if e{low,min} might still have some rough edges I am
> > > completely puzzled how we could end up oom if none of the protection
> > > path triggers which the additional debugging should confirm. Maybe my
> > > debugging patch is incomplete or used incorrectly (maybe it would be
> > > easier to use printk rather than trace_printk?).
> >
> > It would be really great if we could move forward. While the fix (which
> > has been dropped from mmotm) is not super urgent I would really like to
> > understand how it could hit the observed behavior. Can we double check
> > that the debugging patch really doesn't trigger (e.g.
> > s@trace_printk@printk in the first step)?
> 
> Please suggest to me the way to get more debug information
> by providing kernel debug patches and extra kernel configs.
> 
> I have applied your debug patch and tested on top on linux next 20200612
> but did not find any printk output while running mkfs -t ext4 /drive test case.

Have you tried s@trace_printk@printk@ in the patch? AFAIK trace_printk
doesn't dump anything into the printk ring buffer. You would have to
look into the trace ring buffer.
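
If keeping trace_printk is preferred, the output can be checked from user
space by reading the trace file. Below is a minimal sketch, assuming
tracefs is mounted at /sys/kernel/tracing (older setups expose it under
/sys/kernel/debug/tracing). count_matching_lines is a hypothetical helper
written for this illustration and is not part of any kernel tool.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Count the lines read from `stream` that contain `needle`.
 * Lines longer than the buffer are read in chunks, so a very long line
 * could be counted more than once; trace_printk lines are short enough
 * that this does not matter for the sketch.
 */
size_t count_matching_lines(FILE *stream, const char *needle)
{
	char line[1024];
	size_t hits = 0;

	assert(stream != NULL && needle != NULL);

	while (fgets(line, sizeof(line), stream) != NULL) {
		if (strstr(line, needle) != NULL)
			hits++;
	}
	return hits;
}
```

A caller would fopen() the trace file, pass the stream together with a
needle such as "under min:" and treat a zero count as "the debugging
patch did not trigger".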
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-05-21 16:34                   ` Michal Hocko
  2020-05-21 19:00                     ` Naresh Kamboju
@ 2020-06-17 13:37                     ` Naresh Kamboju
  2020-06-17 13:57                       ` Chris Down
  2020-06-17 13:59                       ` Michal Hocko
  1 sibling, 2 replies; 48+ messages in thread
From: Naresh Kamboju @ 2020-06-17 13:37 UTC (permalink / raw)
  To: Michal Hocko, Chris Down, Yafang Shao
  Cc: Anders Roxell, Linux F2FS DEV, Mailing List, linux-ext4,
	linux-block, Andrew Morton, open list, Linux-Next Mailing List,
	linux-mm, Arnd Bergmann, Andreas Dilger, Jaegeuk Kim,
	Theodore Ts'o, Chao Yu, Hugh Dickins, Andrea Arcangeli,
	Matthew Wilcox, Chao Yu, lkft-triage, Johannes Weiner,
	Roman Gushchin, Cgroups

On Thu, 21 May 2020 at 22:04, Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 21-05-20 11:55:16, Michal Hocko wrote:
> > On Wed 20-05-20 20:09:06, Chris Down wrote:
> > > Hi Naresh,
> > >
> > > Naresh Kamboju writes:
> > > > As a part of investigation on this issue LKFT teammate Anders Roxell
> > > > git bisected the problem and found bad commit(s) which caused this problem.
> > > >
> > > > The following two patches have been reverted on next-20200519 and retested the
> > > > reproducible steps and confirmed the test case mkfs -t ext4 got PASS.
> > > > ( invoked oom-killer is gone now)
> > > >
> > > > Revert "mm, memcg: avoid stale protection values when cgroup is above
> > > > protection"
> > > >    This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
> > > >
> > > > Revert "mm, memcg: decouple e{low,min} state mutations from protection
> > > > checks"
> > > >    This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
> > >
> > > Thanks Anders and Naresh for tracking this down and reverting.
> > >
> > > I'll take a look tomorrow. I don't see anything immediately obviously wrong
> > > in either of those commits from a (very) cursory glance, but they should
> > > only be taking effect if protections are set.
> >
> > Agreed. If memory.{low,min} is not used then the patch should be
> > effectively a nop.
>
> I was staring into the code and do not see anything.  Could you give the
> following debugging patch a try and see whether it triggers?
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index cc555903a332..df2e8df0eb71 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2404,6 +2404,8 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>                          * sc->priority further than desirable.
>                          */
>                         scan = max(scan, SWAP_CLUSTER_MAX);
> +
> +                       trace_printk("scan:%lu protection:%lu\n", scan, protection);
>                 } else {
>                         scan = lruvec_size;
>                 }
> @@ -2648,6 +2650,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>                 mem_cgroup_calculate_protection(target_memcg, memcg);
>
>                 if (mem_cgroup_below_min(memcg)) {
> +                       trace_printk("under min:%lu emin:%lu\n", memcg->memory.min, memcg->memory.emin);
>                         /*
>                          * Hard protection.
>                          * If there is no reclaimable memory, OOM.
> @@ -2660,6 +2663,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>                          * there is an unprotected supply
>                          * of reclaimable memory from other cgroups.
>                          */
> +                       trace_printk("under low:%lu elow:%lu\n", memcg->memory.low, memcg->memory.elow);
>                         if (!sc->memcg_low_reclaim) {
>                                 sc->memcg_low_skipped = 1;
>                                 continue;

As you suggested for debugging this problem, I replaced trace_printk with
printk in your patch and applied it on top of the problematic kernel.
Here is the test output and link.

mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8RQK14KF6XF
mke2fs 1.43.8 (1-Jan-2018)
Creating filesystem with 244190646 4k blocks and 61054976 inodes
Filesystem UUID: 7c380766-0ed8-41ba-a0de-3c08e78f1891
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000, 214990848
Allocating group tables:    0/7453 done
Writing inode tables:    0/7453 done
Creating journal (262144 blocks): [   51.544525] under min:0 emin:0
[   51.845304] under min:0 emin:0
[   51.848738] under min:0 emin:0
[   51.858147] under min:0 emin:0
[   51.861333] under min:0 emin:0
[   51.862034] under min:0 emin:0
[   51.862442] under min:0 emin:0
[   51.862763] under min:0 emin:0

Full test log link,
https://lkft.validation.linaro.org/scheduler/job/1497412#L1451

- Naresh

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-06-17 13:37                     ` Naresh Kamboju
@ 2020-06-17 13:57                       ` Chris Down
  2020-06-17 14:11                         ` Michal Hocko
  2020-06-17 13:59                       ` Michal Hocko
  1 sibling, 1 reply; 48+ messages in thread
From: Chris Down @ 2020-06-17 13:57 UTC (permalink / raw)
  To: Naresh Kamboju
  Cc: Michal Hocko, Yafang Shao, Anders Roxell, Linux F2FS DEV,
	Mailing List, linux-ext4, linux-block, Andrew Morton, open list,
	Linux-Next Mailing List, linux-mm, Arnd Bergmann, Andreas Dilger,
	Jaegeuk Kim, Theodore Ts'o, Chao Yu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Chao Yu, lkft-triage,
	Johannes Weiner, Roman Gushchin, Cgroups

Naresh Kamboju writes:
>mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8RQK14KF6XF
>mke2fs 1.43.8 (1-Jan-2018)
>Creating filesystem with 244190646 4k blocks and 61054976 inodes
>Filesystem UUID: 7c380766-0ed8-41ba-a0de-3c08e78f1891
>Superblock backups stored on blocks:
>32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
>4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
>102400000, 214990848
>Allocating group tables:    0/7453 done
>Writing inode tables:    0/7453 done
>Creating journal (262144 blocks): [   51.544525] under min:0 emin:0
>[   51.845304] under min:0 emin:0
>[   51.848738] under min:0 emin:0
>[   51.858147] under min:0 emin:0
>[   51.861333] under min:0 emin:0
>[   51.862034] under min:0 emin:0
>[   51.862442] under min:0 emin:0
>[   51.862763] under min:0 emin:0

Thanks, this helps a lot. Somehow we're entering mem_cgroup_below_min even when 
min/emin is 0 (which should indeed be the case if you haven't set them in the 
hierarchy).

My guess is that page_counter_read(&memcg->memory) is 0, which means 
mem_cgroup_below_min will return 1.

However, I don't know for sure why that should then result in the OOM killer 
coming along. My guess is that since this memcg has 0 pages to scan anyway, we 
enter premature OOM under some conditions. I don't know why we wouldn't have 
hit that with the old version of mem_cgroup_protected that returned 
MEMCG_PROT_* members, though.

Can you please try the patch with the `>=` checks in mem_cgroup_below_min and 
mem_cgroup_below_low changed to `>`? If that fixes it, then that gives a strong 
hint about what's going on here.

Thanks for your help!

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-06-17 13:37                     ` Naresh Kamboju
  2020-06-17 13:57                       ` Chris Down
@ 2020-06-17 13:59                       ` Michal Hocko
  2020-06-17 14:08                         ` Chris Down
  1 sibling, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2020-06-17 13:59 UTC (permalink / raw)
  To: Naresh Kamboju
  Cc: Chris Down, Yafang Shao, Anders Roxell, Linux F2FS DEV,
	Mailing List, linux-ext4, linux-block, Andrew Morton, open list,
	Linux-Next Mailing List, linux-mm, Arnd Bergmann, Andreas Dilger,
	Jaegeuk Kim, Theodore Ts'o, Chao Yu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Chao Yu, lkft-triage,
	Johannes Weiner, Roman Gushchin, Cgroups

On Wed 17-06-20 19:07:20, Naresh Kamboju wrote:
> On Thu, 21 May 2020 at 22:04, Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Thu 21-05-20 11:55:16, Michal Hocko wrote:
> > > On Wed 20-05-20 20:09:06, Chris Down wrote:
> > > > Hi Naresh,
> > > >
> > > > Naresh Kamboju writes:
> > > > > As a part of investigation on this issue LKFT teammate Anders Roxell
> > > > > git bisected the problem and found bad commit(s) which caused this problem.
> > > > >
> > > > > The following two patches have been reverted on next-20200519 and retested the
> > > > > reproducible steps and confirmed the test case mkfs -t ext4 got PASS.
> > > > > ( invoked oom-killer is gone now)
> > > > >
> > > > > Revert "mm, memcg: avoid stale protection values when cgroup is above
> > > > > protection"
> > > > >    This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
> > > > >
> > > > > Revert "mm, memcg: decouple e{low,min} state mutations from protection
> > > > > checks"
> > > > >    This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
> > > >
> > > > Thanks Anders and Naresh for tracking this down and reverting.
> > > >
> > > > I'll take a look tomorrow. I don't see anything immediately obviously wrong
> > > > in either of those commits from a (very) cursory glance, but they should
> > > > only be taking effect if protections are set.
> > >
> > > Agreed. If memory.{low,min} is not used then the patch should be
> > > effectively a nop.
> >
> > I was staring into the code and do not see anything.  Could you give the
> > following debugging patch a try and see whether it triggers?
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index cc555903a332..df2e8df0eb71 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2404,6 +2404,8 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> >                          * sc->priority further than desirable.
> >                          */
> >                         scan = max(scan, SWAP_CLUSTER_MAX);
> > +
> > +                       trace_printk("scan:%lu protection:%lu\n", scan, protection);
> >                 } else {
> >                         scan = lruvec_size;
> >                 }
> > @@ -2648,6 +2650,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> >                 mem_cgroup_calculate_protection(target_memcg, memcg);
> >
> >                 if (mem_cgroup_below_min(memcg)) {
> > +                       trace_printk("under min:%lu emin:%lu\n", memcg->memory.min, memcg->memory.emin);
> >                         /*
> >                          * Hard protection.
> >                          * If there is no reclaimable memory, OOM.
> > @@ -2660,6 +2663,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> >                          * there is an unprotected supply
> >                          * of reclaimable memory from other cgroups.
> >                          */
> > +                       trace_printk("under low:%lu elow:%lu\n", memcg->memory.low, memcg->memory.elow);
> >                         if (!sc->memcg_low_reclaim) {
> >                                 sc->memcg_low_skipped = 1;
> >                                 continue;
> 
> As per your suggestions on debugging this problem,
> trace_printk is replaced with printk and applied to your patch on top of the
> problematic kernel and here is the test output and link.
> 
> mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8RQK14KF6XF
> mke2fs 1.43.8 (1-Jan-2018)
> Creating filesystem with 244190646 4k blocks and 61054976 inodes
> Filesystem UUID: 7c380766-0ed8-41ba-a0de-3c08e78f1891
> Superblock backups stored on blocks:
> 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
> 102400000, 214990848
> Allocating group tables:    0/7453 done
> Writing inode tables:    0/7453 done
> Creating journal (262144 blocks): [   51.544525] under min:0 emin:0
> [   51.845304] under min:0 emin:0
> [   51.848738] under min:0 emin:0
> [   51.858147] under min:0 emin:0
> [   51.861333] under min:0 emin:0
> [   51.862034] under min:0 emin:0
> [   51.862442] under min:0 emin:0
> [   51.862763] under min:0 emin:0
> 
> Full test log link,
> https://lkft.validation.linaro.org/scheduler/job/1497412#L1451

Thanks a lot. So it is clear that mem_cgroup_below_min got confused and
reported a protected cgroup. Both the effective and the real limits are 0,
so there is no garbage in them. The problem is in mem_cgroup_below_* and it
is quite obvious.

We are doing the following:
+static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
+{
+       if (mem_cgroup_disabled())
+               return false;
+
+       return READ_ONCE(memcg->memory.emin) >=
+               page_counter_read(&memcg->memory);
+}

and it makes some sense, except for the root memcg, where we do not
account any memory. Its page counter stays at 0, so emin (0) >= usage (0)
holds and the root memcg looks protected. Adding
if (mem_cgroup_is_root(memcg)) return false; should do the trick. The same
is the case for mem_cgroup_below_low. Could you give it a try please just
to confirm?
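
For illustration, the failing comparison can be modelled in user space. This
is only a sketch under stated assumptions: struct memcg_model and both
function names are invented for this example and are not kernel code; the
fields stand in for mem_cgroup_is_root(), memory.emin and
page_counter_read(&memcg->memory).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mem_cgroup; not the kernel layout. */
struct memcg_model {
	bool is_root;        /* models mem_cgroup_is_root() */
	unsigned long emin;  /* effective memory.min, in pages */
	unsigned long usage; /* models page_counter_read(&memcg->memory) */
};

/* The check as quoted above: no special case for the root memcg. */
static bool below_min_unfixed(const struct memcg_model *memcg)
{
	assert(memcg != NULL);
	return memcg->emin >= memcg->usage;
}

/* Same check with the suggested early return for the root memcg. */
static bool below_min_root_check(const struct memcg_model *memcg)
{
	assert(memcg != NULL);
	if (memcg->is_root)
		return false;
	return memcg->emin >= memcg->usage;
}
```

With emin and usage both 0, as in the "under min:0 emin:0" lines of the log,
the first variant reports the root memcg as protected and the second does
not.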
-- 
Michal Hocko
SUSE Labs

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-06-17 13:59                       ` Michal Hocko
@ 2020-06-17 14:08                         ` Chris Down
  0 siblings, 0 replies; 48+ messages in thread
From: Chris Down @ 2020-06-17 14:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Naresh Kamboju, Yafang Shao, Anders Roxell, Linux F2FS DEV,
	Mailing List, linux-ext4, linux-block, Andrew Morton, open list,
	Linux-Next Mailing List, linux-mm, Arnd Bergmann, Andreas Dilger,
	Jaegeuk Kim, Theodore Ts'o, Chao Yu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Chao Yu, lkft-triage,
	Johannes Weiner, Roman Gushchin, Cgroups

Michal Hocko writes:
>and it makes some sense. Except for the root memcg where we do not
>account any memory. Adding if (mem_cgroup_is_root(memcg)) return false;
>should do the trick. The same is the case for mem_cgroup_below_low.
>Could you give it a try please just to confirm?

Oh, of course :-) This seems more likely than what I proposed, and would be 
great to test.

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-06-17 13:57                       ` Chris Down
@ 2020-06-17 14:11                         ` Michal Hocko
  2020-06-17 15:53                           ` Naresh Kamboju
  0 siblings, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2020-06-17 14:11 UTC (permalink / raw)
  To: Chris Down
  Cc: Naresh Kamboju, Yafang Shao, Anders Roxell, Linux F2FS DEV,
	Mailing List, linux-ext4, linux-block, Andrew Morton, open list,
	Linux-Next Mailing List, linux-mm, Arnd Bergmann, Andreas Dilger,
	Jaegeuk Kim, Theodore Ts'o, Chao Yu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Chao Yu, lkft-triage,
	Johannes Weiner, Roman Gushchin, Cgroups

[Our emails have crossed]

On Wed 17-06-20 14:57:58, Chris Down wrote:
> Naresh Kamboju writes:
> > mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8RQK14KF6XF
> > mke2fs 1.43.8 (1-Jan-2018)
> > Creating filesystem with 244190646 4k blocks and 61054976 inodes
> > Filesystem UUID: 7c380766-0ed8-41ba-a0de-3c08e78f1891
> > Superblock backups stored on blocks:
> > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> > 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
> > 102400000, 214990848
> > Allocating group tables:    0/7453 done
> > Writing inode tables:    0/7453 done
> > Creating journal (262144 blocks): [   51.544525] under min:0 emin:0
> > [   51.845304] under min:0 emin:0
> > [   51.848738] under min:0 emin:0
> > [   51.858147] under min:0 emin:0
> > [   51.861333] under min:0 emin:0
> > [   51.862034] under min:0 emin:0
> > [   51.862442] under min:0 emin:0
> > [   51.862763] under min:0 emin:0
> 
> Thanks, this helps a lot. Somehow we're entering mem_cgroup_below_min even
> when min/emin is 0 (which should indeed be the case if you haven't set them
> in the hierarchy).
> 
> My guess is that page_counter_read(&memcg->memory) is 0, which means
> mem_cgroup_below_min will return 1.

Yes, this is the case, because this is likely the root memcg, which skips
all charges.

> However, I don't know for sure why that should then result in the OOM killer
> coming along. My guess is that since this memcg has 0 pages to scan anyway,
> we enter premature OOM under some conditions. I don't know why we wouldn't
> have hit that with the old version of mem_cgroup_protected that returned
> MEMCG_PROT_* members, though.

Not really. There is likely no other memcg to reclaim from, and assuming
min limit protection for the root memcg will result in no reclaimable
memory and thus the OOM killer.
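
A rough user-space model of that reasoning, as a sketch only: the struct,
its "reclaimable" field and reclaim_pass() are hypothetical names chosen for
this example, and the loop is a simplification of what shrink_node_memcgs()
does with a memcg that reports itself below min.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memcg as seen by reclaim. */
struct memcg_model {
	unsigned long emin;        /* effective memory.min, in pages */
	unsigned long usage;       /* charged pages; 0 for the root memcg */
	unsigned long reclaimable; /* pages actually on its LRU lists */
};

/* The unfixed comparison: emin >= usage. */
static bool below_min(const struct memcg_model *memcg)
{
	return memcg->emin >= memcg->usage;
}

/*
 * Simplified reclaim pass: a memcg that looks hard-protected is skipped,
 * every other memcg contributes its reclaimable pages.
 * Returns the number of pages the pass could make progress on.
 */
static unsigned long reclaim_pass(const struct memcg_model *memcgs, size_t n)
{
	unsigned long reclaimed = 0;

	assert(memcgs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (below_min(&memcgs[i]))
			continue; /* hard protection: skip this memcg */
		reclaimed += memcgs[i].reclaimable;
	}
	return reclaimed;
}
```

If the only entry is a root-like memcg with emin 0 and usage 0, the pass
returns 0 no matter how many pages sit on its LRU lists, which is the "no
reclaimable memory" situation described above.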

> Can you please try the patch with the `>=` checks in mem_cgroup_below_min
> and mem_cgroup_below_low changed to `>`? If that fixes it, then that gives a
> strong hint about what's going on here.

This would work, but I believe an explicit check for the root memcg would
make the reasoning easier to spot.
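
The two options can be put side by side in a small user-space sketch. The
struct and function names are made up for this comparison and are not
kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mem_cgroup; not the kernel layout. */
struct memcg_model {
	bool is_root;
	unsigned long emin;  /* effective memory.min, in pages */
	unsigned long usage; /* charged pages */
};

/* Option A: keep the check generic, but make the comparison strict. */
static bool below_min_strict(const struct memcg_model *memcg)
{
	assert(memcg != NULL);
	return memcg->emin > memcg->usage;
}

/* Option B: keep `>=`, but return early for the root memcg. */
static bool below_min_root_check(const struct memcg_model *memcg)
{
	assert(memcg != NULL);
	if (memcg->is_root)
		return false;
	return memcg->emin >= memcg->usage;
}
```

Both variants report the root memcg (emin 0, usage 0) as unprotected. They
differ only at the boundary: a non-root memcg whose usage equals its emin is
still treated as protected by option B but not by option A.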

-- 
Michal Hocko
SUSE Labs

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-06-17 14:11                         ` Michal Hocko
@ 2020-06-17 15:53                           ` Naresh Kamboju
  2020-06-17 16:06                             ` Michal Hocko
  0 siblings, 1 reply; 48+ messages in thread
From: Naresh Kamboju @ 2020-06-17 15:53 UTC (permalink / raw)
  To: Chris Down, Michal Hocko
  Cc: Yafang Shao, Anders Roxell, Linux F2FS DEV, Mailing List,
	linux-ext4, linux-block, Andrew Morton, open list,
	Linux-Next Mailing List, linux-mm, Arnd Bergmann, Andreas Dilger,
	Jaegeuk Kim, Theodore Ts'o, Chao Yu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Chao Yu, lkft-triage,
	Johannes Weiner, Roman Gushchin, Cgroups

On Wed, 17 Jun 2020 at 19:41, Michal Hocko <mhocko@kernel.org> wrote:
> This would work but I believe an explicit check for the root memcg would
> be easier to spot the reasoning.

Could you please send the debugging or proposed fix patches here?
I am happy to do more testing.

FYI,
Here is my repository for testing.
git: https://github.com/nareshkamboju/linux/tree/printk
branch: printk

- Naresh

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-06-17 15:53                           ` Naresh Kamboju
@ 2020-06-17 16:06                             ` Michal Hocko
  2020-06-17 20:13                               ` Naresh Kamboju
  0 siblings, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2020-06-17 16:06 UTC (permalink / raw)
  To: Naresh Kamboju
  Cc: Chris Down, Yafang Shao, Anders Roxell, Linux F2FS DEV,
	Mailing List, linux-ext4, linux-block, Andrew Morton, open list,
	Linux-Next Mailing List, linux-mm, Arnd Bergmann, Andreas Dilger,
	Jaegeuk Kim, Theodore Ts'o, Chao Yu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Chao Yu, lkft-triage,
	Johannes Weiner, Roman Gushchin, Cgroups

On Wed 17-06-20 21:23:05, Naresh Kamboju wrote:
> May I request you to send debugging or proposed fix patches here.
> I am happy to do more testing.

Sure, here is the diff to test.

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c74a8f2323f1..6b5a31672fbe 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -392,6 +392,13 @@ static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
 	if (mem_cgroup_disabled())
 		return false;
 
+	/*
+	 * Root memcg doesn't account charges and doesn't support
+	 * protection
+	 */
+	if (mem_cgroup_is_root(memcg))
+		return false;
+
 	return READ_ONCE(memcg->memory.elow) >=
 		page_counter_read(&memcg->memory);
 }
@@ -401,6 +408,13 @@ static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
 	if (mem_cgroup_disabled())
 		return false;
 
+	/*
+	 * Root memcg doesn't account charges and doesn't support
+	 * protection
+	 */
+	if (mem_cgroup_is_root(memcg))
+		return false;
+
 	return READ_ONCE(memcg->memory.emin) >=
 		page_counter_read(&memcg->memory);
 }
-- 
Michal Hocko
SUSE Labs

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-06-17 16:06                             ` Michal Hocko
@ 2020-06-17 20:13                               ` Naresh Kamboju
  2020-06-17 21:09                                 ` Chris Down
  0 siblings, 1 reply; 48+ messages in thread
From: Naresh Kamboju @ 2020-06-17 20:13 UTC (permalink / raw)
  To: Michal Hocko, Chris Down
  Cc: Yafang Shao, Anders Roxell, Linux F2FS DEV, Mailing List,
	linux-ext4, linux-block, Andrew Morton, open list,
	Linux-Next Mailing List, linux-mm, Arnd Bergmann, Andreas Dilger,
	Jaegeuk Kim, Theodore Ts'o, Chao Yu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Chao Yu, lkft-triage,
	Johannes Weiner, Roman Gushchin, Cgroups

On Wed, 17 Jun 2020 at 21:36, Michal Hocko <mhocko@kernel.org> wrote:
> Sure, here is the diff to test.
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index c74a8f2323f1..6b5a31672fbe 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -392,6 +392,13 @@ static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
>         if (mem_cgroup_disabled())
>                 return false;
>
> +       /*
> +        * Root memcg doesn't account charges and doesn't support
> +        * protection
> +        */
> +       if (mem_cgroup_is_root(memcg))
> +               return false;
> +
>         return READ_ONCE(memcg->memory.elow) >=
>                 page_counter_read(&memcg->memory);
>  }
> @@ -401,6 +408,13 @@ static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
>         if (mem_cgroup_disabled())
>                 return false;
>
> +       /*
> +        * Root memcg doesn't account charges and doesn't support
> +        * protection
> +        */
> +       if (mem_cgroup_is_root(memcg))
> +               return false;
> +
>         return READ_ONCE(memcg->memory.emin) >=
>                 page_counter_read(&memcg->memory);
>  }


With this patch applied, the reported issue is fixed.

Test log link:
https://lkft.validation.linaro.org/scheduler/job/1505417#L1429

- Naresh

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-06-17 20:13                               ` Naresh Kamboju
@ 2020-06-17 21:09                                 ` Chris Down
  2020-06-18  1:43                                   ` Yafang Shao
  0 siblings, 1 reply; 48+ messages in thread
From: Chris Down @ 2020-06-17 21:09 UTC (permalink / raw)
  To: Naresh Kamboju
  Cc: Michal Hocko, Yafang Shao, Anders Roxell, Linux F2FS DEV,
	Mailing List, linux-ext4, linux-block, Andrew Morton, open list,
	Linux-Next Mailing List, linux-mm, Arnd Bergmann, Andreas Dilger,
	Jaegeuk Kim, Theodore Ts'o, Chao Yu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Chao Yu, lkft-triage,
	Johannes Weiner, Roman Gushchin, Cgroups

Naresh Kamboju writes:
>After this patch applied the reported issue got fixed.

Great! Thank you Naresh and Michal for helping to get to the bottom of this :-)

I'll send out a new version tomorrow with the fixes applied and both of you 
credited in the changelog for the detection and fix.

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-06-17 21:09                                 ` Chris Down
@ 2020-06-18  1:43                                   ` Yafang Shao
  2020-06-18 12:37                                     ` Chris Down
  0 siblings, 1 reply; 48+ messages in thread
From: Yafang Shao @ 2020-06-18  1:43 UTC (permalink / raw)
  To: Chris Down
  Cc: Naresh Kamboju, Michal Hocko, Anders Roxell, Linux F2FS DEV,
	Mailing List, linux-ext4, linux-block, Andrew Morton, open list,
	Linux-Next Mailing List, linux-mm, Arnd Bergmann, Andreas Dilger,
	Jaegeuk Kim, Theodore Ts'o, Chao Yu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Chao Yu, lkft-triage,
	Johannes Weiner, Roman Gushchin, Cgroups

On Thu, Jun 18, 2020 at 5:09 AM Chris Down <chris@chrisdown.name> wrote:
>
> Naresh Kamboju writes:
> >After this patch applied the reported issue got fixed.
>
> Great! Thank you Naresh and Michal for helping to get to the bottom of this :-)
>
> I'll send out a new version tomorrow with the fixes applied and both of you
> credited in the changelog for the detection and fix.

As we have already found that the usage around memory.{emin, elow} has
many limitations, I think memory.{emin, elow} should be used only
internally within the memcg tree. That is, they should be used only to
calculate the protection of a memcg in a specified memcg tree and should
not be exposed to other parts of MM.

-- 
Thanks
Yafang

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-06-18  1:43                                   ` Yafang Shao
@ 2020-06-18 12:37                                     ` Chris Down
  2020-06-18 12:41                                       ` Michal Hocko
  2020-06-18 14:59                                       ` Yafang Shao
  0 siblings, 2 replies; 48+ messages in thread
From: Chris Down @ 2020-06-18 12:37 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Naresh Kamboju, Michal Hocko, Anders Roxell, Linux F2FS DEV,
	Mailing List, linux-ext4, linux-block, Andrew Morton, open list,
	Linux-Next Mailing List, linux-mm, Arnd Bergmann, Andreas Dilger,
	Jaegeuk Kim, Theodore Ts'o, Chao Yu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Chao Yu, lkft-triage,
	Johannes Weiner, Roman Gushchin, Cgroups

Yafang Shao writes:
>On Thu, Jun 18, 2020 at 5:09 AM Chris Down <chris@chrisdown.name> wrote:
>>
>> Naresh Kamboju writes:
>> >After this patch applied the reported issue got fixed.
>>
>> Great! Thank you Naresh and Michal for helping to get to the bottom of this :-)
>>
>> I'll send out a new version tomorrow with the fixes applied and both of you
>> credited in the changelog for the detection and fix.
>
>As we have already found that the usage around memory.{emin, elow} has
>many limitations, I think memory.{emin, elow} should be used for
>memcg-tree internally only, that means they can only be used to
>calculate the protection of a memcg in a specified memcg-tree but
>should not be exposed to other MM parts.

I agree that the current semantics are mentally taxing and we should generally 
avoid exposing the implementation details outside of memcg where possible. Do 
you have a suggested rework? :-)

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-06-18 12:37                                     ` Chris Down
@ 2020-06-18 12:41                                       ` Michal Hocko
  2020-06-18 12:49                                         ` Chris Down
  2020-06-18 14:59                                       ` Yafang Shao
  1 sibling, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2020-06-18 12:41 UTC (permalink / raw)
  To: Chris Down
  Cc: Yafang Shao, Naresh Kamboju, Anders Roxell, Linux F2FS DEV,
	Mailing List, linux-ext4, linux-block, Andrew Morton, open list,
	Linux-Next Mailing List, linux-mm, Arnd Bergmann, Andreas Dilger,
	Jaegeuk Kim, Theodore Ts'o, Chao Yu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Chao Yu, lkft-triage,
	Johannes Weiner, Roman Gushchin, Cgroups

On Thu 18-06-20 13:37:43, Chris Down wrote:
> Yafang Shao writes:
> > On Thu, Jun 18, 2020 at 5:09 AM Chris Down <chris@chrisdown.name> wrote:
> > > 
> > > Naresh Kamboju writes:
> > > >After this patch applied the reported issue got fixed.
> > > 
> > > Great! Thank you Naresh and Michal for helping to get to the bottom of this :-)
> > > 
> > > I'll send out a new version tomorrow with the fixes applied and both of you
> > > credited in the changelog for the detection and fix.
> > 
> > As we have already found that the usage around memory.{emin, elow} has
> > many limitations, I think memory.{emin, elow} should be used for
> > memcg-tree internally only, that means they can only be used to
> > calculate the protection of a memcg in a specified memcg-tree but
> > should not be exposed to other MM parts.
> 
> I agree that the current semantics are mentally taxing and we should
> generally avoid exposing the implementation details outside of memcg where
> possible. Do you have a suggested rework? :-)

I would really prefer to do that work on top of the fixes we (used to)
have in mmotm (with the fixup).
-- 
Michal Hocko
SUSE Labs

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-06-18 12:41                                       ` Michal Hocko
@ 2020-06-18 12:49                                         ` Chris Down
  0 siblings, 0 replies; 48+ messages in thread
From: Chris Down @ 2020-06-18 12:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Yafang Shao, Naresh Kamboju, Anders Roxell, Linux F2FS DEV,
	Mailing List, linux-ext4, linux-block, Andrew Morton, open list,
	Linux-Next Mailing List, linux-mm, Arnd Bergmann, Andreas Dilger,
	Jaegeuk Kim, Theodore Ts'o, Chao Yu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Chao Yu, lkft-triage,
	Johannes Weiner, Roman Gushchin, Cgroups

Michal Hocko writes:
>I would really prefer to do that work on top of the fixes we (used to)
>have in mmotm (with the fixup).

Oh, for sure. We should reintroduce the patches with the fix, and then look at 
longer-term solutions once that's in :-)

* Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page
  2020-06-18 12:37                                     ` Chris Down
  2020-06-18 12:41                                       ` Michal Hocko
@ 2020-06-18 14:59                                       ` Yafang Shao
  1 sibling, 0 replies; 48+ messages in thread
From: Yafang Shao @ 2020-06-18 14:59 UTC (permalink / raw)
  To: Chris Down
  Cc: Naresh Kamboju, Michal Hocko, Anders Roxell, Linux F2FS DEV,
	Mailing List, linux-ext4, linux-block, Andrew Morton, open list,
	Linux-Next Mailing List, linux-mm, Arnd Bergmann, Andreas Dilger,
	Jaegeuk Kim, Theodore Ts'o, Chao Yu, Hugh Dickins,
	Andrea Arcangeli, Matthew Wilcox, Chao Yu, lkft-triage,
	Johannes Weiner, Roman Gushchin, Cgroups

On Thu, Jun 18, 2020 at 8:37 PM Chris Down <chris@chrisdown.name> wrote:
>
> Yafang Shao writes:
> >On Thu, Jun 18, 2020 at 5:09 AM Chris Down <chris@chrisdown.name> wrote:
> >>
> >> Naresh Kamboju writes:
> >> >After this patch applied the reported issue got fixed.
> >>
> >> Great! Thank you Naresh and Michal for helping to get to the bottom of this :-)
> >>
> >> I'll send out a new version tomorrow with the fixes applied and both of you
> >> credited in the changelog for the detection and fix.
> >
> >As we have already found that the usage around memory.{emin, elow} has
> >many limitations, I think memory.{emin, elow} should be used for
> >memcg-tree internally only, that means they can only be used to
> >calculate the protection of a memcg in a specified memcg-tree but
> >should not be exposed to other MM parts.
>
> I agree that the current semantics are mentally taxing and we should generally
> avoid exposing the implementation details outside of memcg where possible. Do
> you have a suggested rework? :-)

My suggestion is to keep mem_cgroup_protected() as-is. In any case, I
think it is bad to scatter references to memory.{emin, elow} here and
there. If we don't have a better idea for now, putting all the references
to memory.{emin, elow} into one wrapper (mem_cgroup_protected()) is the
reasonable solution.
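
As a rough illustration of the single-wrapper idea, here is a user-space
sketch. It is not the kernel's mem_cgroup_protected(): the enum, the struct
and the function name are hypothetical, loosely modelled on the MEMCG_PROT_*
values mentioned earlier in the thread, and the root check is carried over
from the fix tested above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical protection states, one result for callers to switch on. */
enum memcg_prot_model {
	PROT_NONE,
	PROT_LOW,
	PROT_MIN,
};

/* Hypothetical stand-in for struct mem_cgroup; not the kernel layout. */
struct memcg_model {
	bool is_root;
	unsigned long emin;  /* effective memory.min, in pages */
	unsigned long elow;  /* effective memory.low, in pages */
	unsigned long usage; /* charged pages */
};

/*
 * The only place that reads emin/elow: callers see a protection state
 * and never touch the effective values directly.
 */
static enum memcg_prot_model memcg_protected_model(const struct memcg_model *memcg)
{
	assert(memcg != NULL);
	if (memcg->is_root)
		return PROT_NONE; /* root does not account charges */
	if (memcg->emin >= memcg->usage)
		return PROT_MIN;
	if (memcg->elow >= memcg->usage)
		return PROT_LOW;
	return PROT_NONE;
}
```

A reclaim loop written against such a wrapper would branch on the returned
state instead of comparing memory.emin or memory.elow itself.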

-- 
Thanks
Yafang
