* MGLRU premature memcg OOM on slow writes
@ 2024-02-09  2:31 Chris Down
  2024-02-29 17:28 ` Chris Down
  2024-02-29 23:51 ` Axel Rasmussen
  0 siblings, 2 replies; 19+ messages in thread

From: Chris Down @ 2024-02-09  2:31 UTC (permalink / raw)
To: Yu Zhao; +Cc: linux-kernel, linux-mm, cgroups, kernel-team, Johannes Weiner

Hi Yu,

When running with MGLRU I'm encountering premature OOMs when transferring
files to a slow disk. On non-MGLRU setups, writeback flushers are awakened
and get to work. But on MGLRU, one can see OOM killer outputs like the
following when doing an rsync with a memory.max of 32M:

---

% systemd-run --user -t -p MemoryMax=32M -- rsync -rv ... /mnt/usb
Running as unit: run-u640.service
Press ^] three times within 1s to disconnect TTY.
sending incremental file list
...
rsync error: received SIGINT, SIGTERM, or SIGHUP (code 20) at rsync.c(713) [generator=3.2.7]

---

[41368.535735] Memory cgroup out of memory: Killed process 128824 (rsync) total-vm:14008kB, anon-rss:256kB, file-rss:5504kB, shmem-rss:0kB, UID:1000 pgtables:64kB oom_score_adj:200
[41369.847965] rsync invoked oom-killer: gfp_mask=0x408d40(GFP_NOFS|__GFP_NOFAIL|__GFP_ZERO|__GFP_ACCOUNT), order=0, oom_score_adj=200
[41369.847972] CPU: 1 PID: 128826 Comm: rsync Tainted: G S OE 6.7.4-arch1-1 #1 20d30c48b78a04be2046f4b305b40455f0b5b38b
[41369.847975] Hardware name: LENOVO 20WNS23A0G/20WNS23A0G, BIOS N35ET53W (1.53 ) 03/22/2023
[41369.847977] Call Trace:
[41369.847978]  <TASK>
[41369.847980]  dump_stack_lvl+0x47/0x60
[41369.847985]  dump_header+0x45/0x1b0
[41369.847988]  oom_kill_process+0xfa/0x200
[41369.847990]  out_of_memory+0x244/0x590
[41369.847992]  mem_cgroup_out_of_memory+0x134/0x150
[41369.847995]  try_charge_memcg+0x76d/0x870
[41369.847998]  ? try_charge_memcg+0xcd/0x870
[41369.848000]  obj_cgroup_charge+0xb8/0x1b0
[41369.848002]  kmem_cache_alloc+0xaa/0x310
[41369.848005]  ? alloc_buffer_head+0x1e/0x80
[41369.848007]  alloc_buffer_head+0x1e/0x80
[41369.848009]  folio_alloc_buffers+0xab/0x180
[41369.848012]  ? __pfx_fat_get_block+0x10/0x10 [fat 0a109de409393851f8a884f020fb5682aab8dcd1]
[41369.848021]  create_empty_buffers+0x1d/0xb0
[41369.848023]  __block_write_begin_int+0x524/0x600
[41369.848026]  ? __pfx_fat_get_block+0x10/0x10 [fat 0a109de409393851f8a884f020fb5682aab8dcd1]
[41369.848031]  ? __filemap_get_folio+0x168/0x2e0
[41369.848033]  ? __pfx_fat_get_block+0x10/0x10 [fat 0a109de409393851f8a884f020fb5682aab8dcd1]
[41369.848038]  block_write_begin+0x52/0x120
[41369.848040]  fat_write_begin+0x34/0x80 [fat 0a109de409393851f8a884f020fb5682aab8dcd1]
[41369.848046]  ? __pfx_fat_get_block+0x10/0x10 [fat 0a109de409393851f8a884f020fb5682aab8dcd1]
[41369.848051]  generic_perform_write+0xd6/0x240
[41369.848054]  generic_file_write_iter+0x65/0xd0
[41369.848056]  vfs_write+0x23a/0x400
[41369.848060]  ksys_write+0x6f/0xf0
[41369.848063]  do_syscall_64+0x61/0xe0
[41369.848065]  ? do_user_addr_fault+0x304/0x670
[41369.848069]  ? exc_page_fault+0x7f/0x180
[41369.848071]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
[41369.848074] RIP: 0033:0x7965df71a184
[41369.848116] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d c5 3e 0e 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 48 83 ec 28 48 89 54 24 18 48
[41369.848117] RSP: 002b:00007fffee661738 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[41369.848119] RAX: ffffffffffffffda RBX: 0000570f66343bb0 RCX: 00007965df71a184
[41369.848121] RDX: 0000000000040000 RSI: 0000570f66343bb0 RDI: 0000000000000003
[41369.848122] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000570f66343b20
[41369.848122] R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000649
[41369.848123] R13: 0000570f651f8b40 R14: 0000000000008000 R15: 0000570f6633bba0
[41369.848125]  </TASK>
[41369.848126] memory: usage 32768kB, limit 32768kB, failcnt 21239
[41369.848126] swap: usage 2112kB, limit 9007199254740988kB, failcnt 0
[41369.848127] Memory cgroup stats for /user.slice/user-1000.slice/user@1000.service/app.slice/run-u640.service:
[41369.848174] anon 0
[41369.848175] file 26927104
[41369.848176] kernel 6615040
[41369.848176] kernel_stack 32768
[41369.848177] pagetables 122880
[41369.848177] sec_pagetables 0
[41369.848177] percpu 480
[41369.848178] sock 0
[41369.848178] vmalloc 0
[41369.848178] shmem 0
[41369.848179] zswap 312451
[41369.848179] zswapped 1458176
[41369.848179] file_mapped 0
[41369.848180] file_dirty 26923008
[41369.848180] file_writeback 0
[41369.848180] swapcached 12288
[41369.848181] anon_thp 0
[41369.848181] file_thp 0
[41369.848181] shmem_thp 0
[41369.848182] inactive_anon 0
[41369.848182] active_anon 12288
[41369.848182] inactive_file 15908864
[41369.848183] active_file 11014144
[41369.848183] unevictable 0
[41369.848183] slab_reclaimable 5963640
[41369.848184] slab_unreclaimable 89048
[41369.848184] slab 6052688
[41369.848185] workingset_refault_anon 4031
[41369.848185] workingset_refault_file 9236
[41369.848185] workingset_activate_anon 691
[41369.848186] workingset_activate_file 2553
[41369.848186] workingset_restore_anon 691
[41369.848186] workingset_restore_file 0
[41369.848187] workingset_nodereclaim 0
[41369.848187] pgscan 40473
[41369.848187] pgsteal 20881
[41369.848188] pgscan_kswapd 0
[41369.848188] pgscan_direct 40473
[41369.848188] pgscan_khugepaged 0
[41369.848189] pgsteal_kswapd 0
[41369.848189] pgsteal_direct 20881
[41369.848190] pgsteal_khugepaged 0
[41369.848190] pgfault 6019
[41369.848190] pgmajfault 4033
[41369.848191] pgrefill 30578988
[41369.848191] pgactivate 2925
[41369.848191] pgdeactivate 0
[41369.848192] pglazyfree 0
[41369.848192] pglazyfreed 0
[41369.848192] zswpin 1520
[41369.848193] zswpout 1141
[41369.848193] thp_fault_alloc 0
[41369.848193] thp_collapse_alloc 0
[41369.848194] thp_swpout 0
[41369.848194] thp_swpout_fallback 0
[41369.848194] Tasks state (memory values in pages):
[41369.848195] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[41369.848195] [ 128825] 1000 128825 3449 864 65536 192 200 rsync
[41369.848198] [ 128826] 1000 128826 3523 288 57344 288 200 rsync
[41369.848199] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/user.slice/user-1000.slice/user@1000.service/app.slice/run-u640.service,task_memcg=/user.slice/user-1000.slice/user@1000.service/app.slice/run-u640.service,task=rsync,pid=128825,uid=1000
[41369.848207] Memory cgroup out of memory: Killed process 128825 (rsync) total-vm:13796kB, anon-rss:0kB, file-rss:3456kB, shmem-rss:0kB, UID:1000 pgtables:64kB oom_score_adj:200

---

Importantly, note that there appears to be no attempt to write back before
declaring OOM -- file_writeback is 0 when file_dirty is 26923008.

The issue is consistently reproducible (and thanks Johannes for looking at
this with me). On non-MGLRU, flushers are active and are making forward
progress in preventing OOM.
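For scale, a quick arithmetic check of the stat dump above (the two input values are copied verbatim from the OOM report; the percentage is merely derived from them):

```shell
# Values copied from the memcg OOM report above.
limit_bytes=$((32768 * 1024))   # memory: usage 32768kB, limit 32768kB
file_dirty=26923008             # file_dirty (bytes)
file_writeback=0                # file_writeback (bytes)

# How much of memory.max is dirty, not-yet-queued page cache.
awk -v d="$file_dirty" -v l="$limit_bytes" \
    'BEGIN { printf "dirty page cache: %.0f%% of memory.max\n", 100 * d / l }'
```

So roughly 80% of the cgroup's limit is dirty page cache, with zero pages queued for writeback at the moment of the kill.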
This is writing to a slow disk with about 10MiB/s of available write speed,
so the CPU and read speed are far faster than the write speed the disk can
take.

Is this a known problem in MGLRU? If not, could you point me to where MGLRU
tries to handle flusher wakeup on slow I/O? I didn't immediately find it.

Thanks,
Chris

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: MGLRU premature memcg OOM on slow writes
  2024-02-09  2:31 MGLRU premature memcg OOM on slow writes Chris Down
@ 2024-02-29 17:28 ` Chris Down
  2024-02-29 23:51 ` Axel Rasmussen
  1 sibling, 0 replies; 19+ messages in thread

From: Chris Down @ 2024-02-29 17:28 UTC (permalink / raw)
To: Yu Zhao; +Cc: linux-kernel, linux-mm, cgroups, kernel-team, Johannes Weiner

Hi Yu,

Following up since it's been a few weeks since I reported this. If MGLRU
does not handle writeback pressure on slow devices without OOMing, that
seems like a pretty significant problem, so I'd appreciate your opinion on
the issue.

Thanks,
Chris
* MGLRU premature memcg OOM on slow writes
  2024-02-09  2:31 MGLRU premature memcg OOM on slow writes Chris Down
  2024-02-29 17:28 ` Chris Down
@ 2024-02-29 23:51 ` Axel Rasmussen
  2024-03-01  0:30   ` Chris Down
  2024-03-01 11:25   ` Hillf Danton
  1 sibling, 2 replies; 19+ messages in thread

From: Axel Rasmussen @ 2024-02-29 23:51 UTC (permalink / raw)
To: chris; +Cc: cgroups, hannes, kernel-team, linux-kernel, linux-mm, yuzhao

Hi Chris,

A couple of dumb questions. In your test, do you have any of the following
configured / enabled?

/proc/sys/vm/laptop_mode
memory.low
memory.min

Besides that, it looks like the place non-MGLRU reclaim wakes up the
flushers is in shrink_inactive_list() (which calls wakeup_flusher_threads()).
Since MGLRU calls shrink_folio_list() directly (from evict_folios()), I agree
it looks like it simply will not do this.

Yosry pointed out [1], where MGLRU used to call this but stopped doing that.
It makes sense to me at least that doing writeback every time we age is too
aggressive, but doing it in evict_folios() makes some sense to me, basically
to copy the behavior the non-MGLRU path (shrink_inactive_list()) has.

I can send a patch which tries to implement this next week. In the meantime,
Yu, please let me know if what I've said here makes no sense for some
reason. :)

[1]: https://lore.kernel.org/lkml/YzSiWq9UEER5LKup@google.com/
* Re: MGLRU premature memcg OOM on slow writes
  2024-02-29 23:51 ` Axel Rasmussen
@ 2024-03-01  0:30   ` Chris Down
  2024-03-08 19:18     ` Axel Rasmussen
  2024-03-01 11:25   ` Hillf Danton
  1 sibling, 1 reply; 19+ messages in thread

From: Chris Down @ 2024-03-01  0:30 UTC (permalink / raw)
To: Axel Rasmussen
Cc: cgroups, hannes, kernel-team, linux-kernel, linux-mm, yuzhao

Axel Rasmussen writes:
>A couple of dumb questions. In your test, do you have any of the following
>configured / enabled?
>
>/proc/sys/vm/laptop_mode
>memory.low
>memory.min

None of these are enabled. The issue is trivially reproducible by writing to
any slow device with memory.max enabled, but from the code it looks like MGLRU
is also susceptible to this on global reclaim (although it's less likely due to
page diversity).

>Besides that, it looks like the place non-MGLRU reclaim wakes up the
>flushers is in shrink_inactive_list() (which calls wakeup_flusher_threads()).
>Since MGLRU calls shrink_folio_list() directly (from evict_folios()), I agree it
>looks like it simply will not do this.
>
>Yosry pointed out [1], where MGLRU used to call this but stopped doing that. It
>makes sense to me at least that doing writeback every time we age is too
>aggressive, but doing it in evict_folios() makes some sense to me, basically to
>copy the behavior the non-MGLRU path (shrink_inactive_list()) has.

Thanks! We may also need reclaim_throttle(), depending on how you implement it.
Current non-MGLRU behaviour on slow storage is also highly suspect in terms of
(lack of) throttling after moving away from VMSCAN_THROTTLE_WRITEBACK, but one
thing at a time :-)
* Re: MGLRU premature memcg OOM on slow writes
  2024-03-01  0:30   ` Chris Down
@ 2024-03-08 19:18     ` Axel Rasmussen
  2024-03-08 21:22       ` Johannes Weiner
  2024-03-11  9:11       ` Yafang Shao
  0 siblings, 2 replies; 19+ messages in thread

From: Axel Rasmussen @ 2024-03-08 19:18 UTC (permalink / raw)
To: Chris Down; +Cc: cgroups, hannes, kernel-team, linux-kernel, linux-mm, yuzhao

On Thu, Feb 29, 2024 at 4:30 PM Chris Down <chris@chrisdown.name> wrote:
>
> Axel Rasmussen writes:
> >A couple of dumb questions. In your test, do you have any of the following
> >configured / enabled?
> >
> >/proc/sys/vm/laptop_mode
> >memory.low
> >memory.min
>
> None of these are enabled. The issue is trivially reproducible by writing to
> any slow device with memory.max enabled, but from the code it looks like MGLRU
> is also susceptible to this on global reclaim (although it's less likely due to
> page diversity).
>
> >Besides that, it looks like the place non-MGLRU reclaim wakes up the
> >flushers is in shrink_inactive_list() (which calls wakeup_flusher_threads()).
> >Since MGLRU calls shrink_folio_list() directly (from evict_folios()), I agree it
> >looks like it simply will not do this.
> >
> >Yosry pointed out [1], where MGLRU used to call this but stopped doing that. It
> >makes sense to me at least that doing writeback every time we age is too
> >aggressive, but doing it in evict_folios() makes some sense to me, basically to
> >copy the behavior the non-MGLRU path (shrink_inactive_list()) has.
>
> Thanks! We may also need reclaim_throttle(), depending on how you implement it.
> Current non-MGLRU behaviour on slow storage is also highly suspect in terms of
> (lack of) throttling after moving away from VMSCAN_THROTTLE_WRITEBACK, but one
> thing at a time :-)

Hmm, so I have a patch which I think will help with this situation,
but I'm having some trouble reproducing the problem on 6.8-rc7 (so
then I can verify the patch fixes it).
If I understand the issue right, all we should need to do is get a
slow filesystem, and then generate a bunch of dirty file pages on it,
while running in a tightly constrained memcg. To that end, I tried the
following script. But, in reality I seem to get little or no
accumulation of dirty file pages.

I thought maybe fio does something different than rsync which you said
you originally tried, so I also tried rsync (copying /usr/bin into
this loop mount) and didn't run into an OOM situation either.

Maybe some dirty ratio settings need tweaking or something to get the
behavior you see? Or maybe my test has a dumb mistake in it. :)



#!/usr/bin/env bash

echo 0 > /proc/sys/vm/laptop_mode || exit 1
echo y > /sys/kernel/mm/lru_gen/enabled || exit 1

echo "Allocate disk image"
IMAGE_SIZE_MIB=1024
IMAGE_PATH=/tmp/slow.img
dd if=/dev/zero of=$IMAGE_PATH bs=1024k count=$IMAGE_SIZE_MIB || exit 1

echo "Setup loop device"
LOOP_DEV=$(losetup --show --find $IMAGE_PATH) || exit 1
LOOP_BLOCKS=$(blockdev --getsize $LOOP_DEV) || exit 1

echo "Create dm-slow"
DM_NAME=dm-slow
DM_DEV=/dev/mapper/$DM_NAME
echo "0 $LOOP_BLOCKS delay $LOOP_DEV 0 100" | dmsetup create $DM_NAME || exit 1

echo "Create fs"
mkfs.ext4 "$DM_DEV" || exit 1

echo "Mount fs"
MOUNT_PATH="/tmp/$DM_NAME"
mkdir -p "$MOUNT_PATH" || exit 1
mount -t ext4 "$DM_DEV" "$MOUNT_PATH" || exit 1

echo "Generate dirty file pages"
systemd-run --wait --pipe --collect -p MemoryMax=32M \
    fio -name=writes -directory=$MOUNT_PATH -readwrite=randwrite \
    -numjobs=10 -nrfiles=90 -filesize=1048576 \
    -fallocate=posix \
    -blocksize=4k -ioengine=mmap \
    -direct=0 -buffered=1 -fsync=0 -fdatasync=0 -sync=0 \
    -runtime=300 -time_based
* Re: MGLRU premature memcg OOM on slow writes
  2024-03-08 19:18     ` Axel Rasmussen
@ 2024-03-08 21:22       ` Johannes Weiner
  2024-03-11  9:11       ` Yafang Shao
  1 sibling, 0 replies; 19+ messages in thread

From: Johannes Weiner @ 2024-03-08 21:22 UTC (permalink / raw)
To: Axel Rasmussen
Cc: Chris Down, cgroups, kernel-team, linux-kernel, linux-mm, yuzhao

On Fri, Mar 08, 2024 at 11:18:28AM -0800, Axel Rasmussen wrote:
> On Thu, Feb 29, 2024 at 4:30 PM Chris Down <chris@chrisdown.name> wrote:
> >
> > Axel Rasmussen writes:
> > >A couple of dumb questions. In your test, do you have any of the following
> > >configured / enabled?
> > >
> > >/proc/sys/vm/laptop_mode
> > >memory.low
> > >memory.min
> >
> > None of these are enabled. The issue is trivially reproducible by writing to
> > any slow device with memory.max enabled, but from the code it looks like MGLRU
> > is also susceptible to this on global reclaim (although it's less likely due to
> > page diversity).
> >
> > >Besides that, it looks like the place non-MGLRU reclaim wakes up the
> > >flushers is in shrink_inactive_list() (which calls wakeup_flusher_threads()).
> > >Since MGLRU calls shrink_folio_list() directly (from evict_folios()), I agree it
> > >looks like it simply will not do this.
> > >
> > >Yosry pointed out [1], where MGLRU used to call this but stopped doing that. It
> > >makes sense to me at least that doing writeback every time we age is too
> > >aggressive, but doing it in evict_folios() makes some sense to me, basically to
> > >copy the behavior the non-MGLRU path (shrink_inactive_list()) has.
> >
> > Thanks! We may also need reclaim_throttle(), depending on how you implement it.
> > Current non-MGLRU behaviour on slow storage is also highly suspect in terms of
> > (lack of) throttling after moving away from VMSCAN_THROTTLE_WRITEBACK, but one
> > thing at a time :-)
>
> Hmm, so I have a patch which I think will help with this situation,
> but I'm having some trouble reproducing the problem on 6.8-rc7 (so
> then I can verify the patch fixes it).
>
> If I understand the issue right, all we should need to do is get a
> slow filesystem, and then generate a bunch of dirty file pages on it,
> while running in a tightly constrained memcg. To that end, I tried the
> following script. But, in reality I seem to get little or no
> accumulation of dirty file pages.
>
> I thought maybe fio does something different than rsync which you said
> you originally tried, so I also tried rsync (copying /usr/bin into
> this loop mount) and didn't run into an OOM situation either.
>
> Maybe some dirty ratio settings need tweaking or something to get the
> behavior you see? Or maybe my test has a dumb mistake in it. :)
>
>
>
> #!/usr/bin/env bash
>
> echo 0 > /proc/sys/vm/laptop_mode || exit 1
> echo y > /sys/kernel/mm/lru_gen/enabled || exit 1
>
> echo "Allocate disk image"
> IMAGE_SIZE_MIB=1024
> IMAGE_PATH=/tmp/slow.img
> dd if=/dev/zero of=$IMAGE_PATH bs=1024k count=$IMAGE_SIZE_MIB || exit 1
>
> echo "Setup loop device"
> LOOP_DEV=$(losetup --show --find $IMAGE_PATH) || exit 1
> LOOP_BLOCKS=$(blockdev --getsize $LOOP_DEV) || exit 1
>
> echo "Create dm-slow"
> DM_NAME=dm-slow
> DM_DEV=/dev/mapper/$DM_NAME
> echo "0 $LOOP_BLOCKS delay $LOOP_DEV 0 100" | dmsetup create $DM_NAME || exit 1
>
> echo "Create fs"
> mkfs.ext4 "$DM_DEV" || exit 1
>
> echo "Mount fs"
> MOUNT_PATH="/tmp/$DM_NAME"
> mkdir -p "$MOUNT_PATH" || exit 1
> mount -t ext4 "$DM_DEV" "$MOUNT_PATH" || exit 1
>
> echo "Generate dirty file pages"
> systemd-run --wait --pipe --collect -p MemoryMax=32M \
>     fio -name=writes -directory=$MOUNT_PATH -readwrite=randwrite \
>     -numjobs=10 -nrfiles=90 -filesize=1048576 \
>     -fallocate=posix \
>     -blocksize=4k -ioengine=mmap \
>     -direct=0 -buffered=1 -fsync=0 -fdatasync=0 -sync=0 \
>     -runtime=300 -time_based

By doing only the writes in the cgroup, you might just be running into
balance_dirty_pages(), which wakes the flushers and slows the
writing/allocating task before hitting the cg memory limit.

I think the key to what happens in Chris's case is:

1) The cgroup has a certain share of dirty pages, but in aggregate they
   are below the cgroup dirty limit (dirty < mdtc->avail * ratio) such
   that no writeback/dirty throttling is triggered from
   balance_dirty_pages().

2) An unthrottled burst of (non-dirtying) allocations causes reclaim
   demand that suddenly exceeds the reclaimable clean pages on the LRU.

Now you get into a situation where allocation and reclaim rate exceeds
the writeback rate and the only reclaimable pages left on the LRU are
dirty. In this case reclaim needs to wake the flushers and wait for
writeback instead of blowing through the priority cycles and OOMing.

Chris might be causing 2) from the read side of the copy also being in
the cgroup. Especially if he's copying larger files that can saturate
the readahead window and cause bigger allocation bursts. Those readahead
pages are accounted to the cgroup and on the LRU as soon as they're
allocated, but remain locked and unreclaimable until the read IO
finishes.
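To make condition 1) concrete, here is a toy sketch of the comparison (all numbers are made up; the real logic in domain_dirty_limits()/balance_dirty_pages() accounts for far more state, so this only shows the shape of the check):

```shell
# Hypothetical numbers, in pages.
avail=8192   # stand-in for mdtc->avail: reclaimable + free pages in the domain
dirty=1600   # dirty pages currently charged to the cgroup
ratio=20     # stand-in for the effective dirty ratio, in percent

dirty_limit=$((avail * ratio / 100))
if [ "$dirty" -lt "$dirty_limit" ]; then
    echo "dirty ($dirty) < limit ($dirty_limit): no throttling from balance_dirty_pages()"
else
    echo "dirty ($dirty) >= limit ($dirty_limit): writer is throttled"
fi
```

With these numbers the writer stays just under the limit, which is exactly the state in which a burst of non-dirtying allocations can outrun writeback.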
* Re: MGLRU premature memcg OOM on slow writes
  2024-03-08 19:18     ` Axel Rasmussen
  2024-03-08 21:22       ` Johannes Weiner
@ 2024-03-11  9:11       ` Yafang Shao
  2024-03-12 16:44         ` Axel Rasmussen
  1 sibling, 1 reply; 19+ messages in thread

From: Yafang Shao @ 2024-03-11  9:11 UTC (permalink / raw)
To: Axel Rasmussen
Cc: Chris Down, cgroups, hannes, kernel-team, linux-kernel, linux-mm, yuzhao

On Sat, Mar 9, 2024 at 3:19 AM Axel Rasmussen <axelrasmussen@google.com> wrote:
>
> On Thu, Feb 29, 2024 at 4:30 PM Chris Down <chris@chrisdown.name> wrote:
> >
> > Axel Rasmussen writes:
> > >A couple of dumb questions. In your test, do you have any of the following
> > >configured / enabled?
> > >
> > >/proc/sys/vm/laptop_mode
> > >memory.low
> > >memory.min
> >
> > None of these are enabled. The issue is trivially reproducible by writing to
> > any slow device with memory.max enabled, but from the code it looks like MGLRU
> > is also susceptible to this on global reclaim (although it's less likely due to
> > page diversity).
> >
> > >Besides that, it looks like the place non-MGLRU reclaim wakes up the
> > >flushers is in shrink_inactive_list() (which calls wakeup_flusher_threads()).
> > >Since MGLRU calls shrink_folio_list() directly (from evict_folios()), I agree it
> > >looks like it simply will not do this.
> > >
> > >Yosry pointed out [1], where MGLRU used to call this but stopped doing that. It
> > >makes sense to me at least that doing writeback every time we age is too
> > >aggressive, but doing it in evict_folios() makes some sense to me, basically to
> > >copy the behavior the non-MGLRU path (shrink_inactive_list()) has.
> >
> > Thanks! We may also need reclaim_throttle(), depending on how you implement it.
> > Current non-MGLRU behaviour on slow storage is also highly suspect in terms of
> > (lack of) throttling after moving away from VMSCAN_THROTTLE_WRITEBACK, but one
> > thing at a time :-)
>
> Hmm, so I have a patch which I think will help with this situation,
> but I'm having some trouble reproducing the problem on 6.8-rc7 (so
> then I can verify the patch fixes it).

We encountered the same premature OOM issue caused by numerous dirty pages.
The issue disappears after we revert commit 14aa8b2d5c2e ("mm/mglru: don't
sync disk for each aging cycle").

To aid in replicating the issue, we've developed a straightforward
script, which consistently reproduces it, even on the latest kernel.
You can find the script provided below:

```
#!/bin/bash

MEMCG="/sys/fs/cgroup/memory/mglru"
ENABLE=$1

# Avoid waking up the flusher
sysctl -w vm.dirty_background_bytes=$((1024 * 1024 * 1024 * 4))
sysctl -w vm.dirty_bytes=$((1024 * 1024 * 1024 * 4))

if [ ! -d ${MEMCG} ]; then
    mkdir -p ${MEMCG}
fi

echo $$ > ${MEMCG}/cgroup.procs
echo 1g > ${MEMCG}/memory.limit_in_bytes

if [ $ENABLE -eq 0 ]; then
    echo 0 > /sys/kernel/mm/lru_gen/enabled
else
    echo 0x7 > /sys/kernel/mm/lru_gen/enabled
fi

dd if=/dev/zero of=/data0/mglru.test bs=1M count=1023
rm -rf /data0/mglru.test
```

This issue disappears as well after we disable the mglru.

We hope this script proves helpful in identifying and addressing the
root cause. We eagerly await your insights and proposed fixes.

> If I understand the issue right, all we should need to do is get a
> slow filesystem, and then generate a bunch of dirty file pages on it,
> while running in a tightly constrained memcg. To that end, I tried the
> following script. But, in reality I seem to get little or no
> accumulation of dirty file pages.
>
> I thought maybe fio does something different than rsync which you said
> you originally tried, so I also tried rsync (copying /usr/bin into
> this loop mount) and didn't run into an OOM situation either.
>
> Maybe some dirty ratio settings need tweaking or something to get the
> behavior you see? Or maybe my test has a dumb mistake in it. :)

-- 
Regards
Yafang
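To spell out why this reproducer corners memcg reclaim, here is the arithmetic implied by the script's own constants (the percentages are derived values, and the interpretation -- that the global dirty thresholds are raised high enough that balance_dirty_pages() never intervenes -- is an inference from the script's comment):

```shell
# Constants taken from the reproducer script above.
dirty_bytes=$((1024 * 1024 * 1024 * 4))   # vm.dirty_*_bytes: 4G, flusher stays idle
memcg_limit=$((1024 * 1024 * 1024))       # memory.limit_in_bytes: 1g
write_size=$((1023 * 1024 * 1024))        # dd bs=1M count=1023

awk -v w="$write_size" -v l="$memcg_limit" -v t="$dirty_bytes" \
    'BEGIN { printf "dd fills %.0f%% of the memcg, but only %.0f%% of the dirty threshold\n",
             100 * w / l, 100 * w / t }'
```

So nearly the whole memcg fills with dirty page cache while the write stays far below the global thresholds, leaving memcg reclaim alone with pages that can only go away via writeback.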
* Re: MGLRU premature memcg OOM on slow writes
  2024-03-11  9:11       ` Yafang Shao
@ 2024-03-12 16:44         ` Axel Rasmussen
  2024-03-12 20:07           ` Yu Zhao
  0 siblings, 1 reply; 19+ messages in thread

From: Axel Rasmussen @ 2024-03-12 16:44 UTC (permalink / raw)
To: Yafang Shao
Cc: Chris Down, cgroups, hannes, kernel-team, linux-kernel, linux-mm, yuzhao

On Mon, Mar 11, 2024 at 2:11 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Sat, Mar 9, 2024 at 3:19 AM Axel Rasmussen <axelrasmussen@google.com> wrote:
> >
> > On Thu, Feb 29, 2024 at 4:30 PM Chris Down <chris@chrisdown.name> wrote:
> > >
> > > Axel Rasmussen writes:
> > > >A couple of dumb questions. In your test, do you have any of the following
> > > >configured / enabled?
> > > >
> > > >/proc/sys/vm/laptop_mode
> > > >memory.low
> > > >memory.min
> > >
> > > None of these are enabled. The issue is trivially reproducible by writing to
> > > any slow device with memory.max enabled, but from the code it looks like MGLRU
> > > is also susceptible to this on global reclaim (although it's less likely due to
> > > page diversity).
> > >
> > > >Besides that, it looks like the place non-MGLRU reclaim wakes up the
> > > >flushers is in shrink_inactive_list() (which calls wakeup_flusher_threads()).
> > > >Since MGLRU calls shrink_folio_list() directly (from evict_folios()), I agree it
> > > >looks like it simply will not do this.
> > > >
> > > >Yosry pointed out [1], where MGLRU used to call this but stopped doing that. It
> > > >makes sense to me at least that doing writeback every time we age is too
> > > >aggressive, but doing it in evict_folios() makes some sense to me, basically to
> > > >copy the behavior the non-MGLRU path (shrink_inactive_list()) has.
> > >
> > > Thanks! We may also need reclaim_throttle(), depending on how you implement it.
> > > Current non-MGLRU behaviour on slow storage is also highly suspect in terms of
> > > (lack of) throttling after moving away from VMSCAN_THROTTLE_WRITEBACK, but one
> > > thing at a time :-)
> >
> > Hmm, so I have a patch which I think will help with this situation,
> > but I'm having some trouble reproducing the problem on 6.8-rc7 (so
> > then I can verify the patch fixes it).
>
> We encountered the same premature OOM issue caused by numerous dirty pages.
> The issue disappears after we revert the commit 14aa8b2d5c2e
> "mm/mglru: don't sync disk for each aging cycle"
>
> To aid in replicating the issue, we've developed a straightforward
> script, which consistently reproduces it, even on the latest kernel.
> You can find the script provided below:
>
> ```
> #!/bin/bash
>
> MEMCG="/sys/fs/cgroup/memory/mglru"
> ENABLE=$1
>
> # Avoid waking up the flusher
> sysctl -w vm.dirty_background_bytes=$((1024 * 1024 * 1024 * 4))
> sysctl -w vm.dirty_bytes=$((1024 * 1024 * 1024 * 4))
>
> if [ ! -d ${MEMCG} ]; then
>     mkdir -p ${MEMCG}
> fi
>
> echo $$ > ${MEMCG}/cgroup.procs
> echo 1g > ${MEMCG}/memory.limit_in_bytes
>
> if [ $ENABLE -eq 0 ]; then
>     echo 0 > /sys/kernel/mm/lru_gen/enabled
> else
>     echo 0x7 > /sys/kernel/mm/lru_gen/enabled
> fi
>
> dd if=/dev/zero of=/data0/mglru.test bs=1M count=1023
> rm -rf /data0/mglru.test
> ```
>
> This issue disappears as well after we disable the mglru.
>
> We hope this script proves helpful in identifying and addressing the
> root cause. We eagerly await your insights and proposed fixes.

Thanks Yafang, I was able to reproduce the issue using this script.

Perhaps interestingly, I was not able to reproduce it with cgroupv2
memcgs. I know writeback semantics are quite a bit different there, so
perhaps that explains why.

Unfortunately, it also reproduces even with the commit I had in mind
(basically stealing the "if (all isolated pages are unqueued dirty) {
wakeup_flusher_threads(); reclaim_throttle(); }" from
shrink_inactive_list, and adding it to MGLRU's evict_folios()). So
I'll need to spend some more time on this; I'm planning to send
something out for testing next week.

> > If I understand the issue right, all we should need to do is get a
> > slow filesystem, and then generate a bunch of dirty file pages on it,
> > while running in a tightly constrained memcg. To that end, I tried the
> > following script. But, in reality I seem to get little or no
> > accumulation of dirty file pages.
> >
> > I thought maybe fio does something different than rsync which you said
> > you originally tried, so I also tried rsync (copying /usr/bin into
> > this loop mount) and didn't run into an OOM situation either.
> >
> > Maybe some dirty ratio settings need tweaking or something to get the
> > behavior you see? Or maybe my test has a dumb mistake in it. :)
> >
> >
> >
> > #!/usr/bin/env bash
> >
> > echo 0 > /proc/sys/vm/laptop_mode || exit 1
> > echo y > /sys/kernel/mm/lru_gen/enabled || exit 1
> >
> > echo "Allocate disk image"
> > IMAGE_SIZE_MIB=1024
> > IMAGE_PATH=/tmp/slow.img
> > dd if=/dev/zero of=$IMAGE_PATH bs=1024k count=$IMAGE_SIZE_MIB || exit 1
> >
> > echo "Setup loop device"
> > LOOP_DEV=$(losetup --show --find $IMAGE_PATH) || exit 1
> > LOOP_BLOCKS=$(blockdev --getsize $LOOP_DEV) || exit 1
> >
> > echo "Create dm-slow"
> > DM_NAME=dm-slow
> > DM_DEV=/dev/mapper/$DM_NAME
> > echo "0 $LOOP_BLOCKS delay $LOOP_DEV 0 100" | dmsetup create $DM_NAME || exit 1
> >
> > echo "Create fs"
> > mkfs.ext4 "$DM_DEV" || exit 1
> >
> > echo "Mount fs"
> > MOUNT_PATH="/tmp/$DM_NAME"
> > mkdir -p "$MOUNT_PATH" || exit 1
> > mount -t ext4 "$DM_DEV" "$MOUNT_PATH" || exit 1
> >
> > echo "Generate dirty file pages"
> > systemd-run --wait --pipe --collect -p MemoryMax=32M \
> >     fio -name=writes -directory=$MOUNT_PATH -readwrite=randwrite \
> >     -numjobs=10 -nrfiles=90 -filesize=1048576 \
> >     -fallocate=posix \
> >     -blocksize=4k -ioengine=mmap \
> >     -direct=0 -buffered=1 -fsync=0 -fdatasync=0 -sync=0 \
> >     -runtime=300 -time_based
>
> --
> Regards
> Yafang
* Re: MGLRU premature memcg OOM on slow writes 2024-03-12 16:44 ` Axel Rasmussen @ 2024-03-12 20:07 ` Yu Zhao 2024-03-12 20:11 ` Yu Zhao 2024-03-12 21:08 ` Johannes Weiner 0 siblings, 2 replies; 19+ messages in thread From: Yu Zhao @ 2024-03-12 20:07 UTC (permalink / raw) To: Axel Rasmussen Cc: Yafang Shao, Chris Down, cgroups, hannes, kernel-team, linux-kernel, linux-mm On Tue, Mar 12, 2024 at 09:44:19AM -0700, Axel Rasmussen wrote: > On Mon, Mar 11, 2024 at 2:11 AM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > On Sat, Mar 9, 2024 at 3:19 AM Axel Rasmussen <axelrasmussen@google.com> wrote: > > > > > > On Thu, Feb 29, 2024 at 4:30 PM Chris Down <chris@chrisdown.name> wrote: > > > > > > > > Axel Rasmussen writes: > > > > >A couple of dumb questions. In your test, do you have any of the following > > > > >configured / enabled? > > > > > > > > > >/proc/sys/vm/laptop_mode > > > > >memory.low > > > > >memory.min > > > > > > > > None of these are enabled. The issue is trivially reproducible by writing to > > > > any slow device with memory.max enabled, but from the code it looks like MGLRU > > > > is also susceptible to this on global reclaim (although it's less likely due to > > > > page diversity). > > > > > > > > >Besides that, it looks like the place non-MGLRU reclaim wakes up the > > > > >flushers is in shrink_inactive_list() (which calls wakeup_flusher_threads()). > > > > >Since MGLRU calls shrink_folio_list() directly (from evict_folios()), I agree it > > > > >looks like it simply will not do this. > > > > > > > > > >Yosry pointed out [1], where MGLRU used to call this but stopped doing that. It > > > > >makes sense to me at least that doing writeback every time we age is too > > > > >aggressive, but doing it in evict_folios() makes some sense to me, basically to > > > > >copy the behavior the non-MGLRU path (shrink_inactive_list()) has. > > > > > > > > Thanks! We may also need reclaim_throttle(), depending on how you implement it. 
> > > > Current non-MGLRU behaviour on slow storage is also highly suspect in terms of
> > > > (lack of) throttling after moving away from VMSCAN_THROTTLE_WRITEBACK, but one
> > > > thing at a time :-)
> > >
> > > Hmm, so I have a patch which I think will help with this situation,
> > > but I'm having some trouble reproducing the problem on 6.8-rc7 (so
> > > then I can verify the patch fixes it).
> >
> > We encountered the same premature OOM issue caused by numerous dirty pages.
> > The issue disappears after we revert commit 14aa8b2d5c2e
> > ("mm/mglru: don't sync disk for each aging cycle").
> >
> > To aid in replicating the issue, we've developed a straightforward
> > script, which consistently reproduces it, even on the latest kernel.
> > You can find the script below:
> >
> > ```
> > #!/bin/bash
> >
> > MEMCG="/sys/fs/cgroup/memory/mglru"
> > ENABLE=$1
> >
> > # Avoid waking up the flusher
> > sysctl -w vm.dirty_background_bytes=$((1024 * 1024 * 1024 *4))
> > sysctl -w vm.dirty_bytes=$((1024 * 1024 * 1024 *4))
> >
> > if [ ! -d ${MEMCG} ]; then
> >     mkdir -p ${MEMCG}
> > fi
> >
> > echo $$ > ${MEMCG}/cgroup.procs
> > echo 1g > ${MEMCG}/memory.limit_in_bytes
> >
> > if [ $ENABLE -eq 0 ]; then
> >     echo 0 > /sys/kernel/mm/lru_gen/enabled
> > else
> >     echo 0x7 > /sys/kernel/mm/lru_gen/enabled
> > fi
> >
> > dd if=/dev/zero of=/data0/mglru.test bs=1M count=1023
> > rm -rf /data0/mglru.test
> > ```
> >
> > This issue disappears as well after we disable MGLRU.
> >
> > We hope this script proves helpful in identifying and addressing the
> > root cause. We eagerly await your insights and proposed fixes.
>
> Thanks Yafang, I was able to reproduce the issue using this script.
>
> Perhaps interestingly, I was not able to reproduce it with cgroupv2
> memcgs. I know writeback semantics are quite a bit different there, so
> perhaps that explains why.
> Unfortunately, it also reproduces even with the commit I had in mind
> (basically stealing the "if (all isolated pages are unqueued dirty) {
> wakeup_flusher_threads(); reclaim_throttle(); }" from
> shrink_inactive_list, and adding it to MGLRU's evict_folios()). So
> I'll need to spend some more time on this; I'm planning to send
> something out for testing next week.

Hi Chris,

My apologies for not getting back to you sooner. And thanks, everyone, for all the input!

My take is that Chris' premature OOM kills were NOT really due to the flusher not waking up or to missing throttling.

Yes, these two are among the differences between the active/inactive LRU and MGLRU, but their roles, IMO, are not as important as the LRU positions of dirty pages. The active/inactive LRU moves dirty pages all the way to the end of the line (reclaim happens at the front), whereas MGLRU moves them into the middle during direct reclaim. The rationale for MGLRU was that this way those dirty pages would still be counted as "inactive" (or cold).

This theory can be quickly verified by comparing how much nr_vmscan_immediate_reclaim grows:

Before the copy:
    grep nr_vmscan_immediate_reclaim /proc/vmstat

And then after the copy:
    grep nr_vmscan_immediate_reclaim /proc/vmstat

The growth should be trivial for MGLRU and nontrivial for the active/inactive LRU.

If this is indeed the case, I'd appreciate it very much if anyone could try the following (I'll try it myself too later next week).
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4255619a1a31..020f5d98b9a1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4273,10 +4273,13 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 	}
 
 	/* waiting for writeback */
-	if (folio_test_locked(folio) || folio_test_writeback(folio) ||
-	    (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
-		gen = folio_inc_gen(lruvec, folio, true);
-		list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
+	if (folio_test_writeback(folio) || (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
+		DEFINE_MAX_SEQ(lruvec);
+		int old_gen, new_gen = lru_gen_from_seq(max_seq);
+
+		old_gen = folio_update_gen(folio, new_gen);
+		lru_gen_update_size(lruvec, folio, old_gen, new_gen);
+		list_move(&folio->lru, &lrugen->folios[new_gen][type][zone]);
 		return true;
 	}

> > > If I understand the issue right, all we should need to do is get a
> > > slow filesystem, and then generate a bunch of dirty file pages on it,
> > > while running in a tightly constrained memcg. To that end, I tried the
> > > following script. But, in reality I seem to get little or no
> > > accumulation of dirty file pages.
> > >
> > > I thought maybe fio does something different than rsync which you said
> > > you originally tried, so I also tried rsync (copying /usr/bin into
> > > this loop mount) and didn't run into an OOM situation either.
> > >
> > > Maybe some dirty ratio settings need tweaking or something to get the
> > > behavior you see? Or maybe my test has a dumb mistake in it.
:) > > > > > > > > > > > > #!/usr/bin/env bash > > > > > > echo 0 > /proc/sys/vm/laptop_mode || exit 1 > > > echo y > /sys/kernel/mm/lru_gen/enabled || exit 1 > > > > > > echo "Allocate disk image" > > > IMAGE_SIZE_MIB=1024 > > > IMAGE_PATH=/tmp/slow.img > > > dd if=/dev/zero of=$IMAGE_PATH bs=1024k count=$IMAGE_SIZE_MIB || exit 1 > > > > > > echo "Setup loop device" > > > LOOP_DEV=$(losetup --show --find $IMAGE_PATH) || exit 1 > > > LOOP_BLOCKS=$(blockdev --getsize $LOOP_DEV) || exit 1 > > > > > > echo "Create dm-slow" > > > DM_NAME=dm-slow > > > DM_DEV=/dev/mapper/$DM_NAME > > > echo "0 $LOOP_BLOCKS delay $LOOP_DEV 0 100" | dmsetup create $DM_NAME || exit 1 > > > > > > echo "Create fs" > > > mkfs.ext4 "$DM_DEV" || exit 1 > > > > > > echo "Mount fs" > > > MOUNT_PATH="/tmp/$DM_NAME" > > > mkdir -p "$MOUNT_PATH" || exit 1 > > > mount -t ext4 "$DM_DEV" "$MOUNT_PATH" || exit 1 > > > > > > echo "Generate dirty file pages" > > > systemd-run --wait --pipe --collect -p MemoryMax=32M \ > > > fio -name=writes -directory=$MOUNT_PATH -readwrite=randwrite \ > > > -numjobs=10 -nrfiles=90 -filesize=1048576 \ > > > -fallocate=posix \ > > > -blocksize=4k -ioengine=mmap \ > > > -direct=0 -buffered=1 -fsync=0 -fdatasync=0 -sync=0 \ > > > -runtime=300 -time_based ^ permalink raw reply related [flat|nested] 19+ messages in thread
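Yu's verification step above — sampling nr_vmscan_immediate_reclaim before and after the copy — can be wrapped in a small helper. This is an editorial sketch, not part of the thread; the only assumptions beyond the messages above are that /proc/vmstat is readable and that the workload is passed as arguments:

```shell
#!/usr/bin/env bash
# Sketch: report how much a /proc/vmstat counter grows while a workload runs.
# Usage: ./vmstat-delta.sh <workload command...>, e.g. the rsync or dd copy
# from this thread. With no arguments it just samples the counter twice.
counter=nr_vmscan_immediate_reclaim

read_counter() {
    # Print the counter's current value, or nothing if it is absent.
    awk -v k="$counter" '$1 == k { print $2 }' /proc/vmstat 2>/dev/null
}

before=$(read_counter)
"$@"                       # run the copy workload, if one was given
after=$(read_counter)

# A large delta means reclaim kept hitting dirty/writeback pages.
delta=$(( ${after:-0} - ${before:-0} ))
echo "$counter grew by $delta"
```

Per Yu's hypothesis, the delta should stay small under MGLRU and grow substantially under the active/inactive LRU for the same copy.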
* Re: MGLRU premature memcg OOM on slow writes 2024-03-12 20:07 ` Yu Zhao @ 2024-03-12 20:11 ` Yu Zhao 2024-03-13 3:33 ` Yafang Shao 2024-03-12 21:08 ` Johannes Weiner 1 sibling, 1 reply; 19+ messages in thread From: Yu Zhao @ 2024-03-12 20:11 UTC (permalink / raw) To: Axel Rasmussen Cc: Yafang Shao, Chris Down, cgroups, hannes, kernel-team, linux-kernel, linux-mm On Tue, Mar 12, 2024 at 02:07:04PM -0600, Yu Zhao wrote: > On Tue, Mar 12, 2024 at 09:44:19AM -0700, Axel Rasmussen wrote: > > On Mon, Mar 11, 2024 at 2:11 AM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > On Sat, Mar 9, 2024 at 3:19 AM Axel Rasmussen <axelrasmussen@google.com> wrote: > > > > > > > > On Thu, Feb 29, 2024 at 4:30 PM Chris Down <chris@chrisdown.name> wrote: > > > > > > > > > > Axel Rasmussen writes: > > > > > >A couple of dumb questions. In your test, do you have any of the following > > > > > >configured / enabled? > > > > > > > > > > > >/proc/sys/vm/laptop_mode > > > > > >memory.low > > > > > >memory.min > > > > > > > > > > None of these are enabled. The issue is trivially reproducible by writing to > > > > > any slow device with memory.max enabled, but from the code it looks like MGLRU > > > > > is also susceptible to this on global reclaim (although it's less likely due to > > > > > page diversity). > > > > > > > > > > >Besides that, it looks like the place non-MGLRU reclaim wakes up the > > > > > >flushers is in shrink_inactive_list() (which calls wakeup_flusher_threads()). > > > > > >Since MGLRU calls shrink_folio_list() directly (from evict_folios()), I agree it > > > > > >looks like it simply will not do this. > > > > > > > > > > > >Yosry pointed out [1], where MGLRU used to call this but stopped doing that. It > > > > > >makes sense to me at least that doing writeback every time we age is too > > > > > >aggressive, but doing it in evict_folios() makes some sense to me, basically to > > > > > >copy the behavior the non-MGLRU path (shrink_inactive_list()) has. 
> > > > > > > > > > Thanks! We may also need reclaim_throttle(), depending on how you implement it. > > > > > Current non-MGLRU behaviour on slow storage is also highly suspect in terms of > > > > > (lack of) throttling after moving away from VMSCAN_THROTTLE_WRITEBACK, but one > > > > > thing at a time :-) > > > > > > > > > > > > Hmm, so I have a patch which I think will help with this situation, > > > > but I'm having some trouble reproducing the problem on 6.8-rc7 (so > > > > then I can verify the patch fixes it). > > > > > > We encountered the same premature OOM issue caused by numerous dirty pages. > > > The issue disappears after we revert the commit 14aa8b2d5c2e > > > "mm/mglru: don't sync disk for each aging cycle" > > > > > > To aid in replicating the issue, we've developed a straightforward > > > script, which consistently reproduces it, even on the latest kernel. > > > You can find the script provided below: > > > > > > ``` > > > #!/bin/bash > > > > > > MEMCG="/sys/fs/cgroup/memory/mglru" > > > ENABLE=$1 > > > > > > # Avoid waking up the flusher > > > sysctl -w vm.dirty_background_bytes=$((1024 * 1024 * 1024 *4)) > > > sysctl -w vm.dirty_bytes=$((1024 * 1024 * 1024 *4)) > > > > > > if [ ! -d ${MEMCG} ]; then > > > mkdir -p ${MEMCG} > > > fi > > > > > > echo $$ > ${MEMCG}/cgroup.procs > > > echo 1g > ${MEMCG}/memory.limit_in_bytes > > > > > > if [ $ENABLE -eq 0 ]; then > > > echo 0 > /sys/kernel/mm/lru_gen/enabled > > > else > > > echo 0x7 > /sys/kernel/mm/lru_gen/enabled > > > fi > > > > > > dd if=/dev/zero of=/data0/mglru.test bs=1M count=1023 > > > rm -rf /data0/mglru.test > > > ``` > > > > > > This issue disappears as well after we disable the mglru. > > > > > > We hope this script proves helpful in identifying and addressing the > > > root cause. We eagerly await your insights and proposed fixes. > > > > Thanks Yafang, I was able to reproduce the issue using this script. 
> > > > Perhaps interestingly, I was not able to reproduce it with cgroupv2 > > memcgs. I know writeback semantics are quite a bit different there, so > > perhaps that explains why. > > > > Unfortunately, it also reproduces even with the commit I had in mind > > (basically stealing the "if (all isolated pages are unqueued dirty) { > > wakeup_flusher_threads(); reclaim_throttle(); }" from > > shrink_inactive_list, and adding it to MGLRU's evict_folios()). So > > I'll need to spend some more time on this; I'm planning to send > > something out for testing next week. > > Hi Chris, > > My apologies for not getting back to you sooner. > > And thanks everyone for all the input! > > My take is that Chris' premature OOM kills were NOT really due to > the flusher not waking up or missing throttling. > > Yes, these two are among the differences between the active/inactive > LRU and MGLRU, but their roles, IMO, are not as important as the LRU > positions of dirty pages. The active/inactive LRU moves dirty pages > all the way to the end of the line (reclaim happens at the front) > whereas MGLRU moves them into the middle, during direct reclaim. The > rationale for MGLRU was that this way those dirty pages would still > be counted as "inactive" (or cold). > > This theory can be quickly verified by comparing how much > nr_vmscan_immediate_reclaim grows, i.e., > > Before the copy > grep nr_vmscan_immediate_reclaim /proc/vmstat > And then after the copy > grep nr_vmscan_immediate_reclaim /proc/vmstat > > The growth should be trivial for MGLRU and nontrivial for the > active/inactive LRU. > > If this is indeed the case, I'd appreciate very much if anyone could > try the following (I'll try it myself too later next week). 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4255619a1a31..020f5d98b9a1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4273,10 +4273,13 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
>  	}
>  
>  	/* waiting for writeback */
> -	if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> -	    (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> -		gen = folio_inc_gen(lruvec, folio, true);
> -		list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> +	if (folio_test_writeback(folio) || (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> +		DEFINE_MAX_SEQ(lruvec);
> +		int old_gen, new_gen = lru_gen_from_seq(max_seq);
> +
> +		old_gen = folio_update_gen(folio, new_gen);
> +		lru_gen_update_size(lruvec, folio, old_gen, new_gen);
> +		list_move(&folio->lru, &lrugen->folios[new_gen][type][zone]);

Sorry, I missed one line here:

+		folio_set_reclaim(folio);

>  		return true;
>  	}

> > > > If I understand the issue right, all we should need to do is get a
> > > > slow filesystem, and then generate a bunch of dirty file pages on it,
> > > > while running in a tightly constrained memcg. To that end, I tried the
> > > > following script. But, in reality I seem to get little or no
> > > > accumulation of dirty file pages.
> > > >
> > > > I thought maybe fio does something different than rsync which you said
> > > > you originally tried, so I also tried rsync (copying /usr/bin into
> > > > this loop mount) and didn't run into an OOM situation either.
> > > >
> > > > Maybe some dirty ratio settings need tweaking or something to get the
> > > > behavior you see? Or maybe my test has a dumb mistake in it.
:) > > > > > > > > > > > > > > > > #!/usr/bin/env bash > > > > > > > > echo 0 > /proc/sys/vm/laptop_mode || exit 1 > > > > echo y > /sys/kernel/mm/lru_gen/enabled || exit 1 > > > > > > > > echo "Allocate disk image" > > > > IMAGE_SIZE_MIB=1024 > > > > IMAGE_PATH=/tmp/slow.img > > > > dd if=/dev/zero of=$IMAGE_PATH bs=1024k count=$IMAGE_SIZE_MIB || exit 1 > > > > > > > > echo "Setup loop device" > > > > LOOP_DEV=$(losetup --show --find $IMAGE_PATH) || exit 1 > > > > LOOP_BLOCKS=$(blockdev --getsize $LOOP_DEV) || exit 1 > > > > > > > > echo "Create dm-slow" > > > > DM_NAME=dm-slow > > > > DM_DEV=/dev/mapper/$DM_NAME > > > > echo "0 $LOOP_BLOCKS delay $LOOP_DEV 0 100" | dmsetup create $DM_NAME || exit 1 > > > > > > > > echo "Create fs" > > > > mkfs.ext4 "$DM_DEV" || exit 1 > > > > > > > > echo "Mount fs" > > > > MOUNT_PATH="/tmp/$DM_NAME" > > > > mkdir -p "$MOUNT_PATH" || exit 1 > > > > mount -t ext4 "$DM_DEV" "$MOUNT_PATH" || exit 1 > > > > > > > > echo "Generate dirty file pages" > > > > systemd-run --wait --pipe --collect -p MemoryMax=32M \ > > > > fio -name=writes -directory=$MOUNT_PATH -readwrite=randwrite \ > > > > -numjobs=10 -nrfiles=90 -filesize=1048576 \ > > > > -fallocate=posix \ > > > > -blocksize=4k -ioengine=mmap \ > > > > -direct=0 -buffered=1 -fsync=0 -fdatasync=0 -sync=0 \ > > > > -runtime=300 -time_based ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: MGLRU premature memcg OOM on slow writes 2024-03-12 20:11 ` Yu Zhao @ 2024-03-13 3:33 ` Yafang Shao 2024-03-14 22:23 ` Yu Zhao 0 siblings, 1 reply; 19+ messages in thread From: Yafang Shao @ 2024-03-13 3:33 UTC (permalink / raw) To: Yu Zhao Cc: Axel Rasmussen, Chris Down, cgroups, hannes, kernel-team, linux-kernel, linux-mm On Wed, Mar 13, 2024 at 4:11 AM Yu Zhao <yuzhao@google.com> wrote: > > On Tue, Mar 12, 2024 at 02:07:04PM -0600, Yu Zhao wrote: > > On Tue, Mar 12, 2024 at 09:44:19AM -0700, Axel Rasmussen wrote: > > > On Mon, Mar 11, 2024 at 2:11 AM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > > > On Sat, Mar 9, 2024 at 3:19 AM Axel Rasmussen <axelrasmussen@google.com> wrote: > > > > > > > > > > On Thu, Feb 29, 2024 at 4:30 PM Chris Down <chris@chrisdown.name> wrote: > > > > > > > > > > > > Axel Rasmussen writes: > > > > > > >A couple of dumb questions. In your test, do you have any of the following > > > > > > >configured / enabled? > > > > > > > > > > > > > >/proc/sys/vm/laptop_mode > > > > > > >memory.low > > > > > > >memory.min > > > > > > > > > > > > None of these are enabled. The issue is trivially reproducible by writing to > > > > > > any slow device with memory.max enabled, but from the code it looks like MGLRU > > > > > > is also susceptible to this on global reclaim (although it's less likely due to > > > > > > page diversity). > > > > > > > > > > > > >Besides that, it looks like the place non-MGLRU reclaim wakes up the > > > > > > >flushers is in shrink_inactive_list() (which calls wakeup_flusher_threads()). > > > > > > >Since MGLRU calls shrink_folio_list() directly (from evict_folios()), I agree it > > > > > > >looks like it simply will not do this. > > > > > > > > > > > > > >Yosry pointed out [1], where MGLRU used to call this but stopped doing that. 
It > > > > > > >makes sense to me at least that doing writeback every time we age is too > > > > > > >aggressive, but doing it in evict_folios() makes some sense to me, basically to > > > > > > >copy the behavior the non-MGLRU path (shrink_inactive_list()) has. > > > > > > > > > > > > Thanks! We may also need reclaim_throttle(), depending on how you implement it. > > > > > > Current non-MGLRU behaviour on slow storage is also highly suspect in terms of > > > > > > (lack of) throttling after moving away from VMSCAN_THROTTLE_WRITEBACK, but one > > > > > > thing at a time :-) > > > > > > > > > > > > > > > Hmm, so I have a patch which I think will help with this situation, > > > > > but I'm having some trouble reproducing the problem on 6.8-rc7 (so > > > > > then I can verify the patch fixes it). > > > > > > > > We encountered the same premature OOM issue caused by numerous dirty pages. > > > > The issue disappears after we revert the commit 14aa8b2d5c2e > > > > "mm/mglru: don't sync disk for each aging cycle" > > > > > > > > To aid in replicating the issue, we've developed a straightforward > > > > script, which consistently reproduces it, even on the latest kernel. > > > > You can find the script provided below: > > > > > > > > ``` > > > > #!/bin/bash > > > > > > > > MEMCG="/sys/fs/cgroup/memory/mglru" > > > > ENABLE=$1 > > > > > > > > # Avoid waking up the flusher > > > > sysctl -w vm.dirty_background_bytes=$((1024 * 1024 * 1024 *4)) > > > > sysctl -w vm.dirty_bytes=$((1024 * 1024 * 1024 *4)) > > > > > > > > if [ ! 
-d ${MEMCG} ]; then > > > > mkdir -p ${MEMCG} > > > > fi > > > > > > > > echo $$ > ${MEMCG}/cgroup.procs > > > > echo 1g > ${MEMCG}/memory.limit_in_bytes > > > > > > > > if [ $ENABLE -eq 0 ]; then > > > > echo 0 > /sys/kernel/mm/lru_gen/enabled > > > > else > > > > echo 0x7 > /sys/kernel/mm/lru_gen/enabled > > > > fi > > > > > > > > dd if=/dev/zero of=/data0/mglru.test bs=1M count=1023 > > > > rm -rf /data0/mglru.test > > > > ``` > > > > > > > > This issue disappears as well after we disable the mglru. > > > > > > > > We hope this script proves helpful in identifying and addressing the > > > > root cause. We eagerly await your insights and proposed fixes. > > > > > > Thanks Yafang, I was able to reproduce the issue using this script. > > > > > > Perhaps interestingly, I was not able to reproduce it with cgroupv2 > > > memcgs. I know writeback semantics are quite a bit different there, so > > > perhaps that explains why. > > > > > > Unfortunately, it also reproduces even with the commit I had in mind > > > (basically stealing the "if (all isolated pages are unqueued dirty) { > > > wakeup_flusher_threads(); reclaim_throttle(); }" from > > > shrink_inactive_list, and adding it to MGLRU's evict_folios()). So > > > I'll need to spend some more time on this; I'm planning to send > > > something out for testing next week. > > > > Hi Chris, > > > > My apologies for not getting back to you sooner. > > > > And thanks everyone for all the input! > > > > My take is that Chris' premature OOM kills were NOT really due to > > the flusher not waking up or missing throttling. > > > > Yes, these two are among the differences between the active/inactive > > LRU and MGLRU, but their roles, IMO, are not as important as the LRU > > positions of dirty pages. The active/inactive LRU moves dirty pages > > all the way to the end of the line (reclaim happens at the front) > > whereas MGLRU moves them into the middle, during direct reclaim. 
The > > rationale for MGLRU was that this way those dirty pages would still > > be counted as "inactive" (or cold). > > > > This theory can be quickly verified by comparing how much > > nr_vmscan_immediate_reclaim grows, i.e., > > > > Before the copy > > grep nr_vmscan_immediate_reclaim /proc/vmstat > > And then after the copy > > grep nr_vmscan_immediate_reclaim /proc/vmstat > > > > The growth should be trivial for MGLRU and nontrivial for the > > active/inactive LRU. > > > > If this is indeed the case, I'd appreciate very much if anyone could > > try the following (I'll try it myself too later next week). > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index 4255619a1a31..020f5d98b9a1 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -4273,10 +4273,13 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c > > } > > > > /* waiting for writeback */ > > - if (folio_test_locked(folio) || folio_test_writeback(folio) || > > - (type == LRU_GEN_FILE && folio_test_dirty(folio))) { > > - gen = folio_inc_gen(lruvec, folio, true); > > - list_move(&folio->lru, &lrugen->folios[gen][type][zone]); > > + if (folio_test_writeback(folio) || (type == LRU_GEN_FILE && folio_test_dirty(folio))) { > > + DEFINE_MAX_SEQ(lruvec); > > + int old_gen, new_gen = lru_gen_from_seq(max_seq); > > + > > + old_gen = folio_update_gen(folio, new_gen); > > + lru_gen_update_size(lruvec, folio, old_gen, new_gen); > > + list_move(&folio->lru, &lrugen->folios[new_gen][type][zone]); > > Sorry missing one line here: > > + folio_set_reclaim(folio); > > > return true; > > } Hi Yu, I have validated it using the script provided for Axel, but unfortunately, it still triggers an OOM error with your patch applied. 
Here are the results with nr_vmscan_immediate_reclaim:

- non-MGLRU

  $ grep nr_vmscan_immediate_reclaim /proc/vmstat
  nr_vmscan_immediate_reclaim 47411776
  $ ./test.sh 0
  1023+0 records in
  1023+0 records out
  1072693248 bytes (1.1 GB, 1023 MiB) copied, 0.538058 s, 2.0 GB/s
  $ grep nr_vmscan_immediate_reclaim /proc/vmstat
  nr_vmscan_immediate_reclaim 47412544

- MGLRU

  $ grep nr_vmscan_immediate_reclaim /proc/vmstat
  nr_vmscan_immediate_reclaim 47412544
  $ ./test.sh 1
  Killed
  $ grep nr_vmscan_immediate_reclaim /proc/vmstat
  nr_vmscan_immediate_reclaim 115455600

The detailed OOM info is as follows:

[Wed Mar 13 11:16:48 2024] dd invoked oom-killer: gfp_mask=0x101c4a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE), order=3, oom_score_adj=0
[Wed Mar 13 11:16:48 2024] CPU: 12 PID: 6911 Comm: dd Not tainted 6.8.0-rc6+ #24
[Wed Mar 13 11:16:48 2024] Hardware name: Tencent Cloud CVM, BIOS seabios-1.9.1-qemu-project.org 04/01/2014
[Wed Mar 13 11:16:48 2024] Call Trace:
[Wed Mar 13 11:16:48 2024]  <TASK>
[Wed Mar 13 11:16:48 2024]  dump_stack_lvl+0x6e/0x90
[Wed Mar 13 11:16:48 2024]  dump_stack+0x10/0x20
[Wed Mar 13 11:16:48 2024]  dump_header+0x47/0x2d0
[Wed Mar 13 11:16:48 2024]  oom_kill_process+0x101/0x2e0
[Wed Mar 13 11:16:48 2024]  out_of_memory+0xfc/0x430
[Wed Mar 13 11:16:48 2024]  mem_cgroup_out_of_memory+0x13d/0x160
[Wed Mar 13 11:16:48 2024]  try_charge_memcg+0x7be/0x850
[Wed Mar 13 11:16:48 2024]  ? get_mem_cgroup_from_mm+0x5e/0x420
[Wed Mar 13 11:16:48 2024]  ? rcu_read_unlock+0x25/0x70
[Wed Mar 13 11:16:48 2024]  __mem_cgroup_charge+0x49/0x90
[Wed Mar 13 11:16:48 2024]  __filemap_add_folio+0x277/0x450
[Wed Mar 13 11:16:48 2024]  ? __pfx_workingset_update_node+0x10/0x10
[Wed Mar 13 11:16:48 2024]  filemap_add_folio+0x3c/0xa0
[Wed Mar 13 11:16:48 2024]  __filemap_get_folio+0x13d/0x2f0
[Wed Mar 13 11:16:48 2024]  iomap_get_folio+0x4c/0x60
[Wed Mar 13 11:16:48 2024]  iomap_write_begin+0x1bb/0x2e0
[Wed Mar 13 11:16:48 2024]  iomap_write_iter+0xff/0x290
[Wed Mar 13 11:16:48 2024]  iomap_file_buffered_write+0x91/0xf0
[Wed Mar 13 11:16:48 2024]  xfs_file_buffered_write+0x9f/0x2d0 [xfs]
[Wed Mar 13 11:16:48 2024]  ? vfs_write+0x261/0x530
[Wed Mar 13 11:16:48 2024]  ? debug_smp_processor_id+0x17/0x20
[Wed Mar 13 11:16:48 2024]  xfs_file_write_iter+0xe9/0x120 [xfs]
[Wed Mar 13 11:16:48 2024]  vfs_write+0x37d/0x530
[Wed Mar 13 11:16:48 2024]  ksys_write+0x6d/0xf0
[Wed Mar 13 11:16:48 2024]  __x64_sys_write+0x19/0x20
[Wed Mar 13 11:16:48 2024]  do_syscall_64+0x79/0x1a0
[Wed Mar 13 11:16:48 2024]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
[Wed Mar 13 11:16:48 2024] RIP: 0033:0x7f63ea33e927
[Wed Mar 13 11:16:48 2024] Code: 0b 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[Wed Mar 13 11:16:48 2024] RSP: 002b:00007ffc0e874768 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[Wed Mar 13 11:16:48 2024] RAX: ffffffffffffffda RBX: 0000000000100000 RCX: 00007f63ea33e927
[Wed Mar 13 11:16:48 2024] RDX: 0000000000100000 RSI: 00007f63dcafe000 RDI: 0000000000000001
[Wed Mar 13 11:16:48 2024] RBP: 00007f63dcafe000 R08: 00007f63dcafe000 R09: 0000000000000000
[Wed Mar 13 11:16:48 2024] R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000000000
[Wed Mar 13 11:16:48 2024] R13: 0000000000000000 R14: 0000000000000000 R15: 00007f63dcafe000
[Wed Mar 13 11:16:48 2024]  </TASK>
[Wed Mar 13 11:16:48 2024] memory: usage 1048556kB, limit 1048576kB, failcnt 153
[Wed Mar 13 11:16:48 2024] memory+swap: usage 1048556kB, limit 9007199254740988kB, failcnt 0
[Wed Mar 13 11:16:48 2024] kmem: usage 200kB, limit 9007199254740988kB, failcnt 0
[Wed Mar 13 11:16:48 2024] Memory cgroup stats for /mglru:
[Wed Mar 13 11:16:48 2024] cache 1072365568
[Wed Mar 13 11:16:48 2024] rss 1150976
[Wed Mar 13 11:16:48 2024] rss_huge 0
[Wed Mar 13 11:16:48 2024] shmem 0
[Wed Mar 13 11:16:48 2024] mapped_file 0
[Wed Mar 13 11:16:48 2024] dirty 1072365568
[Wed Mar 13 11:16:48 2024] writeback 0
[Wed Mar 13 11:16:48 2024] workingset_refault_anon 0
[Wed Mar 13 11:16:48 2024] workingset_refault_file 0
[Wed Mar 13 11:16:48 2024] swap 0
[Wed Mar 13 11:16:48 2024] swapcached 0
[Wed Mar 13 11:16:48 2024] pgpgin 2783
[Wed Mar 13 11:16:48 2024] pgpgout 1444
[Wed Mar 13 11:16:48 2024] pgfault 885
[Wed Mar 13 11:16:48 2024] pgmajfault 0
[Wed Mar 13 11:16:48 2024] inactive_anon 1146880
[Wed Mar 13 11:16:48 2024] active_anon 4096
[Wed Mar 13 11:16:48 2024] inactive_file 802357248
[Wed Mar 13 11:16:48 2024] active_file 270008320
[Wed Mar 13 11:16:48 2024] unevictable 0
[Wed Mar 13 11:16:48 2024] hierarchical_memory_limit 1073741824
[Wed Mar 13 11:16:48 2024] hierarchical_memsw_limit 9223372036854771712
[Wed Mar 13 11:16:48 2024] total_cache 1072365568
[Wed Mar 13 11:16:48 2024] total_rss 1150976
[Wed Mar 13 11:16:48 2024] total_rss_huge 0
[Wed Mar 13 11:16:48 2024] total_shmem 0
[Wed Mar 13 11:16:48 2024] total_mapped_file 0
[Wed Mar 13 11:16:48 2024] total_dirty 1072365568
[Wed Mar 13 11:16:48 2024] total_writeback 0
[Wed Mar 13 11:16:48 2024] total_workingset_refault_anon 0
[Wed Mar 13 11:16:48 2024] total_workingset_refault_file 0
[Wed Mar 13 11:16:48 2024] total_swap 0
[Wed Mar 13 11:16:48 2024] total_swapcached 0
[Wed Mar 13 11:16:48 2024] total_pgpgin 2783
[Wed Mar 13 11:16:48 2024] total_pgpgout 1444
[Wed Mar 13 11:16:48 2024] total_pgfault 885
[Wed Mar 13 11:16:48 2024] total_pgmajfault 0
[Wed Mar 13 11:16:48 2024] total_inactive_anon 1146880
[Wed Mar 13 11:16:48 2024] total_active_anon 4096
[Wed Mar 13 11:16:48 2024] total_inactive_file 802357248
[Wed Mar 13 11:16:48 2024] total_active_file 270008320
[Wed Mar 13 11:16:48 2024] total_unevictable 0
[Wed Mar 13 11:16:48 2024] Tasks state (memory values in pages):
[Wed Mar 13 11:16:48 2024] [  pid  ]   uid  tgid total_vm      rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
[Wed Mar 13 11:16:48 2024] [   6911]     0  6911    55506      640      256      384         0    73728        0             0 dd
[Wed Mar 13 11:16:48 2024] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0-1,oom_memcg=/mglru,task_memcg=/mglru,task=dd,pid=6911,uid=0

The key information extracted from the OOM info is as follows:

[Wed Mar 13 11:16:48 2024] cache 1072365568
[Wed Mar 13 11:16:48 2024] dirty 1072365568

This information reveals that all file pages are dirty.

As of now, the most effective solution to this issue appears to be reverting commit 14aa8b2d5c2e. Its original intention was to avoid potential SSD wearout, but there is no concrete data on how much it actually affects SSD longevity. If the wearout concern is purely theoretical, reverting the commit seems reasonable.

^ permalink raw reply	[flat|nested] 19+ messages in thread
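Yafang's conclusion that the cgroup's page cache is entirely dirty can be checked mechanically against cgroup-v1 `memory.stat` output. The helper below is an editorial sketch, not from the thread: the function name is made up, while the `cache`/`dirty` field names and the sample values come from the OOM dump above.

```shell
#!/usr/bin/env bash
# Sketch: given cgroup-v1 memory.stat-style "key value" lines on stdin,
# report whether the memcg's page cache is entirely dirty.
all_cache_dirty() {
    awk '$1 == "cache" { c = $2 }
         $1 == "dirty" { d = $2 }
         END { if (c > 0 && d >= c) print "all file pages dirty";
               else print "not all dirty" }'
}

# The values from the OOM dump above: cache == dirty == 1072365568 bytes.
verdict=$(printf 'cache 1072365568\ndirty 1072365568\n' | all_cache_dirty)
echo "$verdict"   # -> all file pages dirty
```

On a live system one would instead run `all_cache_dirty < /sys/fs/cgroup/memory/mglru/memory.stat`, using the memcg path from Yafang's script.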
* Re: MGLRU premature memcg OOM on slow writes 2024-03-13 3:33 ` Yafang Shao @ 2024-03-14 22:23 ` Yu Zhao 2024-03-15 2:38 ` Yafang Shao 0 siblings, 1 reply; 19+ messages in thread From: Yu Zhao @ 2024-03-14 22:23 UTC (permalink / raw) To: Yafang Shao Cc: Axel Rasmussen, Chris Down, cgroups, hannes, kernel-team, linux-kernel, linux-mm On Wed, Mar 13, 2024 at 11:33:21AM +0800, Yafang Shao wrote: > On Wed, Mar 13, 2024 at 4:11 AM Yu Zhao <yuzhao@google.com> wrote: > > > > On Tue, Mar 12, 2024 at 02:07:04PM -0600, Yu Zhao wrote: > > > On Tue, Mar 12, 2024 at 09:44:19AM -0700, Axel Rasmussen wrote: > > > > On Mon, Mar 11, 2024 at 2:11 AM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > > > > > On Sat, Mar 9, 2024 at 3:19 AM Axel Rasmussen <axelrasmussen@google.com> wrote: > > > > > > > > > > > > On Thu, Feb 29, 2024 at 4:30 PM Chris Down <chris@chrisdown.name> wrote: > > > > > > > > > > > > > > Axel Rasmussen writes: > > > > > > > >A couple of dumb questions. In your test, do you have any of the following > > > > > > > >configured / enabled? > > > > > > > > > > > > > > > >/proc/sys/vm/laptop_mode > > > > > > > >memory.low > > > > > > > >memory.min > > > > > > > > > > > > > > None of these are enabled. The issue is trivially reproducible by writing to > > > > > > > any slow device with memory.max enabled, but from the code it looks like MGLRU > > > > > > > is also susceptible to this on global reclaim (although it's less likely due to > > > > > > > page diversity). > > > > > > > > > > > > > > >Besides that, it looks like the place non-MGLRU reclaim wakes up the > > > > > > > >flushers is in shrink_inactive_list() (which calls wakeup_flusher_threads()). > > > > > > > >Since MGLRU calls shrink_folio_list() directly (from evict_folios()), I agree it > > > > > > > >looks like it simply will not do this. > > > > > > > > > > > > > > > >Yosry pointed out [1], where MGLRU used to call this but stopped doing that. 
It > > > > > > > >makes sense to me at least that doing writeback every time we age is too > > > > > > > >aggressive, but doing it in evict_folios() makes some sense to me, basically to > > > > > > > >copy the behavior the non-MGLRU path (shrink_inactive_list()) has. > > > > > > > > > > > > > > Thanks! We may also need reclaim_throttle(), depending on how you implement it. > > > > > > > Current non-MGLRU behaviour on slow storage is also highly suspect in terms of > > > > > > > (lack of) throttling after moving away from VMSCAN_THROTTLE_WRITEBACK, but one > > > > > > > thing at a time :-) > > > > > > > > > > > > > > > > > > Hmm, so I have a patch which I think will help with this situation, > > > > > > but I'm having some trouble reproducing the problem on 6.8-rc7 (so > > > > > > then I can verify the patch fixes it). > > > > > > > > > > We encountered the same premature OOM issue caused by numerous dirty pages. > > > > > The issue disappears after we revert the commit 14aa8b2d5c2e > > > > > "mm/mglru: don't sync disk for each aging cycle" > > > > > > > > > > To aid in replicating the issue, we've developed a straightforward > > > > > script, which consistently reproduces it, even on the latest kernel. > > > > > You can find the script provided below: > > > > > > > > > > ``` > > > > > #!/bin/bash > > > > > > > > > > MEMCG="/sys/fs/cgroup/memory/mglru" > > > > > ENABLE=$1 > > > > > > > > > > # Avoid waking up the flusher > > > > > sysctl -w vm.dirty_background_bytes=$((1024 * 1024 * 1024 *4)) > > > > > sysctl -w vm.dirty_bytes=$((1024 * 1024 * 1024 *4)) > > > > > > > > > > if [ ! 
-d ${MEMCG} ]; then > > > > > mkdir -p ${MEMCG} > > > > > fi > > > > > > > > > > echo $$ > ${MEMCG}/cgroup.procs > > > > > echo 1g > ${MEMCG}/memory.limit_in_bytes > > > > > > > > > > if [ $ENABLE -eq 0 ]; then > > > > > echo 0 > /sys/kernel/mm/lru_gen/enabled > > > > > else > > > > > echo 0x7 > /sys/kernel/mm/lru_gen/enabled > > > > > fi > > > > > > > > > > dd if=/dev/zero of=/data0/mglru.test bs=1M count=1023 > > > > > rm -rf /data0/mglru.test > > > > > ``` > > > > > > > > > > This issue disappears as well after we disable the mglru. > > > > > > > > > > We hope this script proves helpful in identifying and addressing the > > > > > root cause. We eagerly await your insights and proposed fixes. > > > > > > > > Thanks Yafang, I was able to reproduce the issue using this script. > > > > > > > > Perhaps interestingly, I was not able to reproduce it with cgroupv2 > > > > memcgs. I know writeback semantics are quite a bit different there, so > > > > perhaps that explains why. > > > > > > > > Unfortunately, it also reproduces even with the commit I had in mind > > > > (basically stealing the "if (all isolated pages are unqueued dirty) { > > > > wakeup_flusher_threads(); reclaim_throttle(); }" from > > > > shrink_inactive_list, and adding it to MGLRU's evict_folios()). So > > > > I'll need to spend some more time on this; I'm planning to send > > > > something out for testing next week. > > > > > > Hi Chris, > > > > > > My apologies for not getting back to you sooner. > > > > > > And thanks everyone for all the input! > > > > > > My take is that Chris' premature OOM kills were NOT really due to > > > the flusher not waking up or missing throttling. > > > > > > Yes, these two are among the differences between the active/inactive > > > LRU and MGLRU, but their roles, IMO, are not as important as the LRU > > > positions of dirty pages. 
The active/inactive LRU moves dirty pages > > > all the way to the end of the line (reclaim happens at the front) > > > whereas MGLRU moves them into the middle, during direct reclaim. The > > > rationale for MGLRU was that this way those dirty pages would still > > > be counted as "inactive" (or cold). > > > > > > This theory can be quickly verified by comparing how much > > > nr_vmscan_immediate_reclaim grows, i.e., > > > > > > Before the copy > > > grep nr_vmscan_immediate_reclaim /proc/vmstat > > > And then after the copy > > > grep nr_vmscan_immediate_reclaim /proc/vmstat > > > > > > The growth should be trivial for MGLRU and nontrivial for the > > > active/inactive LRU. > > > > > > If this is indeed the case, I'd appreciate very much if anyone could > > > try the following (I'll try it myself too later next week). > > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > > index 4255619a1a31..020f5d98b9a1 100644 > > > --- a/mm/vmscan.c > > > +++ b/mm/vmscan.c > > > @@ -4273,10 +4273,13 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c > > > } > > > > > > /* waiting for writeback */ > > > - if (folio_test_locked(folio) || folio_test_writeback(folio) || > > > - (type == LRU_GEN_FILE && folio_test_dirty(folio))) { > > > - gen = folio_inc_gen(lruvec, folio, true); > > > - list_move(&folio->lru, &lrugen->folios[gen][type][zone]); > > > + if (folio_test_writeback(folio) || (type == LRU_GEN_FILE && folio_test_dirty(folio))) { > > > + DEFINE_MAX_SEQ(lruvec); > > > + int old_gen, new_gen = lru_gen_from_seq(max_seq); > > > + > > > + old_gen = folio_update_gen(folio, new_gen); > > > + lru_gen_update_size(lruvec, folio, old_gen, new_gen); > > > + list_move(&folio->lru, &lrugen->folios[new_gen][type][zone]); > > > > Sorry missing one line here: > > > > + folio_set_reclaim(folio); > > > > > return true; > > > } > > Hi Yu, > > I have validated it using the script provided for Axel, but > unfortunately, it still triggers an OOM error with your 
patch applied. > Here are the results with nr_vmscan_immediate_reclaim: Thanks for debunking it! > - non-MGLRU > $ grep nr_vmscan_immediate_reclaim /proc/vmstat > nr_vmscan_immediate_reclaim 47411776 > > $ ./test.sh 0 > 1023+0 records in > 1023+0 records out > 1072693248 bytes (1.1 GB, 1023 MiB) copied, 0.538058 s, 2.0 GB/s > > $ grep nr_vmscan_immediate_reclaim /proc/vmstat > nr_vmscan_immediate_reclaim 47412544 > > - MGLRU > $ grep nr_vmscan_immediate_reclaim /proc/vmstat > nr_vmscan_immediate_reclaim 47412544 > > $ ./test.sh 1 > Killed > > $ grep nr_vmscan_immediate_reclaim /proc/vmstat > nr_vmscan_immediate_reclaim 115455600 The delta is ~260GB, I'm still thinking how that could happen -- is this reliably reproducible? > The detailed OOM info as follows, > > [Wed Mar 13 11:16:48 2024] dd invoked oom-killer: > gfp_mask=0x101c4a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE), > order=3, oom_score_adj=0 > [Wed Mar 13 11:16:48 2024] CPU: 12 PID: 6911 Comm: dd Not tainted 6.8.0-rc6+ #24 > [Wed Mar 13 11:16:48 2024] Hardware name: Tencent Cloud CVM, BIOS > seabios-1.9.1-qemu-project.org 04/01/2014 > [Wed Mar 13 11:16:48 2024] Call Trace: > [Wed Mar 13 11:16:48 2024] <TASK> > [Wed Mar 13 11:16:48 2024] dump_stack_lvl+0x6e/0x90 > [Wed Mar 13 11:16:48 2024] dump_stack+0x10/0x20 > [Wed Mar 13 11:16:48 2024] dump_header+0x47/0x2d0 > [Wed Mar 13 11:16:48 2024] oom_kill_process+0x101/0x2e0 > [Wed Mar 13 11:16:48 2024] out_of_memory+0xfc/0x430 > [Wed Mar 13 11:16:48 2024] mem_cgroup_out_of_memory+0x13d/0x160 > [Wed Mar 13 11:16:48 2024] try_charge_memcg+0x7be/0x850 > [Wed Mar 13 11:16:48 2024] ? get_mem_cgroup_from_mm+0x5e/0x420 > [Wed Mar 13 11:16:48 2024] ? rcu_read_unlock+0x25/0x70 > [Wed Mar 13 11:16:48 2024] __mem_cgroup_charge+0x49/0x90 > [Wed Mar 13 11:16:48 2024] __filemap_add_folio+0x277/0x450 > [Wed Mar 13 11:16:48 2024] ? 
__pfx_workingset_update_node+0x10/0x10 > [Wed Mar 13 11:16:48 2024] filemap_add_folio+0x3c/0xa0 > [Wed Mar 13 11:16:48 2024] __filemap_get_folio+0x13d/0x2f0 > [Wed Mar 13 11:16:48 2024] iomap_get_folio+0x4c/0x60 > [Wed Mar 13 11:16:48 2024] iomap_write_begin+0x1bb/0x2e0 > [Wed Mar 13 11:16:48 2024] iomap_write_iter+0xff/0x290 > [Wed Mar 13 11:16:48 2024] iomap_file_buffered_write+0x91/0xf0 > [Wed Mar 13 11:16:48 2024] xfs_file_buffered_write+0x9f/0x2d0 [xfs] > [Wed Mar 13 11:16:48 2024] ? vfs_write+0x261/0x530 > [Wed Mar 13 11:16:48 2024] ? debug_smp_processor_id+0x17/0x20 > [Wed Mar 13 11:16:48 2024] xfs_file_write_iter+0xe9/0x120 [xfs] > [Wed Mar 13 11:16:48 2024] vfs_write+0x37d/0x530 > [Wed Mar 13 11:16:48 2024] ksys_write+0x6d/0xf0 > [Wed Mar 13 11:16:48 2024] __x64_sys_write+0x19/0x20 > [Wed Mar 13 11:16:48 2024] do_syscall_64+0x79/0x1a0 > [Wed Mar 13 11:16:48 2024] entry_SYSCALL_64_after_hwframe+0x6e/0x76 > [Wed Mar 13 11:16:48 2024] RIP: 0033:0x7f63ea33e927 > [Wed Mar 13 11:16:48 2024] Code: 0b 00 f7 d8 64 89 02 48 c7 c0 ff ff > ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 > b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 > 24 18 48 89 74 24 > [Wed Mar 13 11:16:48 2024] RSP: 002b:00007ffc0e874768 EFLAGS: 00000246 > ORIG_RAX: 0000000000000001 > [Wed Mar 13 11:16:48 2024] RAX: ffffffffffffffda RBX: 0000000000100000 > RCX: 00007f63ea33e927 > [Wed Mar 13 11:16:48 2024] RDX: 0000000000100000 RSI: 00007f63dcafe000 > RDI: 0000000000000001 > [Wed Mar 13 11:16:48 2024] RBP: 00007f63dcafe000 R08: 00007f63dcafe000 > R09: 0000000000000000 > [Wed Mar 13 11:16:48 2024] R10: 0000000000000022 R11: 0000000000000246 > R12: 0000000000000000 > [Wed Mar 13 11:16:48 2024] R13: 0000000000000000 R14: 0000000000000000 > R15: 00007f63dcafe000 > [Wed Mar 13 11:16:48 2024] </TASK> > [Wed Mar 13 11:16:48 2024] memory: usage 1048556kB, limit 1048576kB, failcnt 153 > [Wed Mar 13 11:16:48 2024] memory+swap: usage 1048556kB, limit I see you 
were actually on cgroup v1 -- this might be a different problem than Chris' since he was on v2. For v1, the throttling is done by commit 81a70c21d9 ("mm/cgroup/reclaim: fix dirty pages throttling on cgroup v1"). IOW, the active/inactive LRU throttles in both v1 and v2 (done in different ways) whereas MGLRU doesn't in either case. > 9007199254740988kB, failcnt 0 > [Wed Mar 13 11:16:48 2024] kmem: usage 200kB, limit > 9007199254740988kB, failcnt 0 > [Wed Mar 13 11:16:48 2024] Memory cgroup stats for /mglru: > [Wed Mar 13 11:16:48 2024] cache 1072365568 > [Wed Mar 13 11:16:48 2024] rss 1150976 > [Wed Mar 13 11:16:48 2024] rss_huge 0 > [Wed Mar 13 11:16:48 2024] shmem 0 > [Wed Mar 13 11:16:48 2024] mapped_file 0 > [Wed Mar 13 11:16:48 2024] dirty 1072365568 > [Wed Mar 13 11:16:48 2024] writeback 0 > [Wed Mar 13 11:16:48 2024] workingset_refault_anon 0 > [Wed Mar 13 11:16:48 2024] workingset_refault_file 0 > [Wed Mar 13 11:16:48 2024] swap 0 > [Wed Mar 13 11:16:48 2024] swapcached 0 > [Wed Mar 13 11:16:48 2024] pgpgin 2783 > [Wed Mar 13 11:16:48 2024] pgpgout 1444 > [Wed Mar 13 11:16:48 2024] pgfault 885 > [Wed Mar 13 11:16:48 2024] pgmajfault 0 > [Wed Mar 13 11:16:48 2024] inactive_anon 1146880 > [Wed Mar 13 11:16:48 2024] active_anon 4096 > [Wed Mar 13 11:16:48 2024] inactive_file 802357248 > [Wed Mar 13 11:16:48 2024] active_file 270008320 > [Wed Mar 13 11:16:48 2024] unevictable 0 > [Wed Mar 13 11:16:48 2024] hierarchical_memory_limit 1073741824 > [Wed Mar 13 11:16:48 2024] hierarchical_memsw_limit 9223372036854771712 > [Wed Mar 13 11:16:48 2024] total_cache 1072365568 > [Wed Mar 13 11:16:48 2024] total_rss 1150976 > [Wed Mar 13 11:16:48 2024] total_rss_huge 0 > [Wed Mar 13 11:16:48 2024] total_shmem 0 > [Wed Mar 13 11:16:48 2024] total_mapped_file 0 > [Wed Mar 13 11:16:48 2024] total_dirty 1072365568 > [Wed Mar 13 11:16:48 2024] total_writeback 0 > [Wed Mar 13 11:16:48 2024] total_workingset_refault_anon 0 > [Wed Mar 13 11:16:48 2024] total_workingset_refault_file 
0 > [Wed Mar 13 11:16:48 2024] total_swap 0 > [Wed Mar 13 11:16:48 2024] total_swapcached 0 > [Wed Mar 13 11:16:48 2024] total_pgpgin 2783 > [Wed Mar 13 11:16:48 2024] total_pgpgout 1444 > [Wed Mar 13 11:16:48 2024] total_pgfault 885 > [Wed Mar 13 11:16:48 2024] total_pgmajfault 0 > [Wed Mar 13 11:16:48 2024] total_inactive_anon 1146880 > [Wed Mar 13 11:16:48 2024] total_active_anon 4096 > [Wed Mar 13 11:16:48 2024] total_inactive_file 802357248 > [Wed Mar 13 11:16:48 2024] total_active_file 270008320 > [Wed Mar 13 11:16:48 2024] total_unevictable 0 > [Wed Mar 13 11:16:48 2024] Tasks state (memory values in pages): > [Wed Mar 13 11:16:48 2024] [ pid ] uid tgid total_vm rss > rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name > [Wed Mar 13 11:16:48 2024] [ 6911] 0 6911 55506 640 > 256 384 0 73728 0 0 dd > [Wed Mar 13 11:16:48 2024] > oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0-1,oom_memcg=/mglru,task_memcg=/mglru,task=dd,pid=6911,uid=0 > > The key information extracted from the OOM info is as follows: > > [Wed Mar 13 11:16:48 2024] cache 1072365568 > [Wed Mar 13 11:16:48 2024] dirty 1072365568 > > This information reveals that all file pages are dirty pages. I'm surprised to see there was 0 pages under writeback: [Wed Mar 13 11:16:48 2024] total_writeback 0 What's your dirty limit? It's unfortunate that the mainline has no per-memcg dirty limit. (We do at Google.) > As of now, it appears that the most effective solution to address this > issue is to revert the commit 14aa8b2d5c2e. Regarding this commit > 14aa8b2d5c2e, its original intention was to eliminate potential SSD > wearout, although there's no concrete data available on how it might > impact SSD longevity. If the concern about SSD wearout is purely > theoretical, it might be reasonable to consider reverting this commit. 
The SSD wearout problem was real -- it wasn't really due to wakeup_flusher_threads() itself; rather, the original MGLRU code called the function improperly. It needs to be called under more restricted conditions so that it doesn't cause the SSD wearout problem again. However, IMO, wakeup_flusher_threads() is just another bandaid trying to work around a more fundamental problem. There is no guarantee that the flusher will target the dirty pages in the memcg under reclaim, right? Do you mind trying the following first to see if we can get around the problem without calling wakeup_flusher_threads()? Thanks! diff --git a/mm/vmscan.c b/mm/vmscan.c index 4255619a1a31..d3cfbd95996d 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -225,7 +225,7 @@ static bool writeback_throttling_sane(struct scan_control *sc) if (cgroup_subsys_on_dfl(memory_cgrp_subsys)) return true; #endif - return false; + return lru_gen_enabled(); } #else static bool cgroup_reclaim(struct scan_control *sc) @@ -4273,8 +4273,10 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c } /* waiting for writeback */ - if (folio_test_locked(folio) || folio_test_writeback(folio) || - (type == LRU_GEN_FILE && folio_test_dirty(folio))) { + if (folio_test_writeback(folio) || (type == LRU_GEN_FILE && folio_test_dirty(folio))) { + sc->nr.dirty += delta; + if (!folio_test_reclaim(folio)) + sc->nr.congested += delta; gen = folio_inc_gen(lruvec, folio, true); list_move(&folio->lru, &lrugen->folios[gen][type][zone]); return true; ^ permalink raw reply related [flat|nested] 19+ messages in thread
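[Editor's note: for anyone reproducing this, the nr_vmscan_immediate_reclaim comparison suggested above is easy to script. The sketch below is illustrative only (the helper name is not from the thread): it parses /proc/vmstat-style text and computes the delta, using the before/after readings from the MGLRU run reported later in this thread.]

```python
def vmscan_immediate_reclaim(vmstat_text: str) -> int:
    """Return the nr_vmscan_immediate_reclaim counter from /proc/vmstat text."""
    for line in vmstat_text.splitlines():
        name, _, value = line.partition(" ")
        if name == "nr_vmscan_immediate_reclaim":
            return int(value)
    raise KeyError("nr_vmscan_immediate_reclaim not found")

# In practice, snapshot open("/proc/vmstat").read() before and after the copy.
# The sample values below are the readings from the MGLRU test in this thread.
before = vmscan_immediate_reclaim("nr_vmscan_immediate_reclaim 47412544")
after = vmscan_immediate_reclaim("nr_vmscan_immediate_reclaim 115455600")

delta_pages = after - before
delta_gib = delta_pages * 4096 / 2**30  # the counter is in 4 KiB pages
print(f"delta: {delta_pages} pages (~{delta_gib:.0f} GiB)")
```

The ~260 GiB figure this yields is the same delta Yu remarks on below.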
* Re: MGLRU premature memcg OOM on slow writes 2024-03-14 22:23 ` Yu Zhao @ 2024-03-15 2:38 ` Yafang Shao 2024-03-15 14:27 ` Johannes Weiner 0 siblings, 1 reply; 19+ messages in thread From: Yafang Shao @ 2024-03-15 2:38 UTC (permalink / raw) To: Yu Zhao Cc: Axel Rasmussen, Chris Down, cgroups, hannes, kernel-team, linux-kernel, linux-mm On Fri, Mar 15, 2024 at 6:23 AM Yu Zhao <yuzhao@google.com> wrote: > > On Wed, Mar 13, 2024 at 11:33:21AM +0800, Yafang Shao wrote: > > On Wed, Mar 13, 2024 at 4:11 AM Yu Zhao <yuzhao@google.com> wrote: > > > > > > On Tue, Mar 12, 2024 at 02:07:04PM -0600, Yu Zhao wrote: > > > > On Tue, Mar 12, 2024 at 09:44:19AM -0700, Axel Rasmussen wrote: > > > > > On Mon, Mar 11, 2024 at 2:11 AM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > > > > > > > On Sat, Mar 9, 2024 at 3:19 AM Axel Rasmussen <axelrasmussen@google.com> wrote: > > > > > > > > > > > > > > On Thu, Feb 29, 2024 at 4:30 PM Chris Down <chris@chrisdown.name> wrote: > > > > > > > > > > > > > > > > Axel Rasmussen writes: > > > > > > > > >A couple of dumb questions. In your test, do you have any of the following > > > > > > > > >configured / enabled? > > > > > > > > > > > > > > > > > >/proc/sys/vm/laptop_mode > > > > > > > > >memory.low > > > > > > > > >memory.min > > > > > > > > > > > > > > > > None of these are enabled. The issue is trivially reproducible by writing to > > > > > > > > any slow device with memory.max enabled, but from the code it looks like MGLRU > > > > > > > > is also susceptible to this on global reclaim (although it's less likely due to > > > > > > > > page diversity). > > > > > > > > > > > > > > > > >Besides that, it looks like the place non-MGLRU reclaim wakes up the > > > > > > > > >flushers is in shrink_inactive_list() (which calls wakeup_flusher_threads()). > > > > > > > > >Since MGLRU calls shrink_folio_list() directly (from evict_folios()), I agree it > > > > > > > > >looks like it simply will not do this. 
> > > > > > > > > > > > > > > > > >Yosry pointed out [1], where MGLRU used to call this but stopped doing that. It > > > > > > > > >makes sense to me at least that doing writeback every time we age is too > > > > > > > > >aggressive, but doing it in evict_folios() makes some sense to me, basically to > > > > > > > > >copy the behavior the non-MGLRU path (shrink_inactive_list()) has. > > > > > > > > > > > > > > > > Thanks! We may also need reclaim_throttle(), depending on how you implement it. > > > > > > > > Current non-MGLRU behaviour on slow storage is also highly suspect in terms of > > > > > > > > (lack of) throttling after moving away from VMSCAN_THROTTLE_WRITEBACK, but one > > > > > > > > thing at a time :-) > > > > > > > > > > > > > > > > > > > > > Hmm, so I have a patch which I think will help with this situation, > > > > > > > but I'm having some trouble reproducing the problem on 6.8-rc7 (so > > > > > > > then I can verify the patch fixes it). > > > > > > > > > > > > We encountered the same premature OOM issue caused by numerous dirty pages. > > > > > > The issue disappears after we revert the commit 14aa8b2d5c2e > > > > > > "mm/mglru: don't sync disk for each aging cycle" > > > > > > > > > > > > To aid in replicating the issue, we've developed a straightforward > > > > > > script, which consistently reproduces it, even on the latest kernel. > > > > > > You can find the script provided below: > > > > > > > > > > > > ``` > > > > > > #!/bin/bash > > > > > > > > > > > > MEMCG="/sys/fs/cgroup/memory/mglru" > > > > > > ENABLE=$1 > > > > > > > > > > > > # Avoid waking up the flusher > > > > > > sysctl -w vm.dirty_background_bytes=$((1024 * 1024 * 1024 *4)) > > > > > > sysctl -w vm.dirty_bytes=$((1024 * 1024 * 1024 *4)) > > > > > > > > > > > > if [ ! 
-d ${MEMCG} ]; then > > > > > > mkdir -p ${MEMCG} > > > > > > fi > > > > > > > > > > > > echo $$ > ${MEMCG}/cgroup.procs > > > > > > echo 1g > ${MEMCG}/memory.limit_in_bytes > > > > > > > > > > > > if [ $ENABLE -eq 0 ]; then > > > > > > echo 0 > /sys/kernel/mm/lru_gen/enabled > > > > > > else > > > > > > echo 0x7 > /sys/kernel/mm/lru_gen/enabled > > > > > > fi > > > > > > > > > > > > dd if=/dev/zero of=/data0/mglru.test bs=1M count=1023 > > > > > > rm -rf /data0/mglru.test > > > > > > ``` > > > > > > > > > > > > This issue disappears as well after we disable the mglru. > > > > > > > > > > > > We hope this script proves helpful in identifying and addressing the > > > > > > root cause. We eagerly await your insights and proposed fixes. > > > > > > > > > > Thanks Yafang, I was able to reproduce the issue using this script. > > > > > > > > > > Perhaps interestingly, I was not able to reproduce it with cgroupv2 > > > > > memcgs. I know writeback semantics are quite a bit different there, so > > > > > perhaps that explains why. > > > > > > > > > > Unfortunately, it also reproduces even with the commit I had in mind > > > > > (basically stealing the "if (all isolated pages are unqueued dirty) { > > > > > wakeup_flusher_threads(); reclaim_throttle(); }" from > > > > > shrink_inactive_list, and adding it to MGLRU's evict_folios()). So > > > > > I'll need to spend some more time on this; I'm planning to send > > > > > something out for testing next week. > > > > > > > > Hi Chris, > > > > > > > > My apologies for not getting back to you sooner. > > > > > > > > And thanks everyone for all the input! > > > > > > > > My take is that Chris' premature OOM kills were NOT really due to > > > > the flusher not waking up or missing throttling. > > > > > > > > Yes, these two are among the differences between the active/inactive > > > > LRU and MGLRU, but their roles, IMO, are not as important as the LRU > > > > positions of dirty pages. 
The active/inactive LRU moves dirty pages > > > > all the way to the end of the line (reclaim happens at the front) > > > > whereas MGLRU moves them into the middle, during direct reclaim. The > > > > rationale for MGLRU was that this way those dirty pages would still > > > > be counted as "inactive" (or cold). > > > > > > > > This theory can be quickly verified by comparing how much > > > > nr_vmscan_immediate_reclaim grows, i.e., > > > > > > > > Before the copy > > > > grep nr_vmscan_immediate_reclaim /proc/vmstat > > > > And then after the copy > > > > grep nr_vmscan_immediate_reclaim /proc/vmstat > > > > > > > > The growth should be trivial for MGLRU and nontrivial for the > > > > active/inactive LRU. > > > > > > > > If this is indeed the case, I'd appreciate very much if anyone could > > > > try the following (I'll try it myself too later next week). > > > > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > > > index 4255619a1a31..020f5d98b9a1 100644 > > > > --- a/mm/vmscan.c > > > > +++ b/mm/vmscan.c > > > > @@ -4273,10 +4273,13 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c > > > > } > > > > > > > > /* waiting for writeback */ > > > > - if (folio_test_locked(folio) || folio_test_writeback(folio) || > > > > - (type == LRU_GEN_FILE && folio_test_dirty(folio))) { > > > > - gen = folio_inc_gen(lruvec, folio, true); > > > > - list_move(&folio->lru, &lrugen->folios[gen][type][zone]); > > > > + if (folio_test_writeback(folio) || (type == LRU_GEN_FILE && folio_test_dirty(folio))) { > > > > + DEFINE_MAX_SEQ(lruvec); > > > > + int old_gen, new_gen = lru_gen_from_seq(max_seq); > > > > + > > > > + old_gen = folio_update_gen(folio, new_gen); > > > > + lru_gen_update_size(lruvec, folio, old_gen, new_gen); > > > > + list_move(&folio->lru, &lrugen->folios[new_gen][type][zone]); > > > > > > Sorry missing one line here: > > > > > > + folio_set_reclaim(folio); > > > > > > > return true; > > > > } > > > > Hi Yu, > > > > I have validated it 
using the script provided for Axel, but > > unfortunately, it still triggers an OOM error with your patch applied. > > Here are the results with nr_vmscan_immediate_reclaim: > > Thanks for debunking it! > > > - non-MGLRU > > $ grep nr_vmscan_immediate_reclaim /proc/vmstat > > nr_vmscan_immediate_reclaim 47411776 > > > > $ ./test.sh 0 > > 1023+0 records in > > 1023+0 records out > > 1072693248 bytes (1.1 GB, 1023 MiB) copied, 0.538058 s, 2.0 GB/s > > > > $ grep nr_vmscan_immediate_reclaim /proc/vmstat > > nr_vmscan_immediate_reclaim 47412544 > > > > - MGLRU > > $ grep nr_vmscan_immediate_reclaim /proc/vmstat > > nr_vmscan_immediate_reclaim 47412544 > > > > $ ./test.sh 1 > > Killed > > > > $ grep nr_vmscan_immediate_reclaim /proc/vmstat > > nr_vmscan_immediate_reclaim 115455600 > > The delta is ~260GB, I'm still thinking how that could happen -- is this reliably reproducible? Yes, it is reliably reproducible on cgroup1 with the script provided as follows: $ ./test.sh 1 > > > The detailed OOM info as follows, > > > > [Wed Mar 13 11:16:48 2024] dd invoked oom-killer: > > gfp_mask=0x101c4a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE), > > order=3, oom_score_adj=0 > > [Wed Mar 13 11:16:48 2024] CPU: 12 PID: 6911 Comm: dd Not tainted 6.8.0-rc6+ #24 > > [Wed Mar 13 11:16:48 2024] Hardware name: Tencent Cloud CVM, BIOS > > seabios-1.9.1-qemu-project.org 04/01/2014 > > [Wed Mar 13 11:16:48 2024] Call Trace: > > [Wed Mar 13 11:16:48 2024] <TASK> > > [Wed Mar 13 11:16:48 2024] dump_stack_lvl+0x6e/0x90 > > [Wed Mar 13 11:16:48 2024] dump_stack+0x10/0x20 > > [Wed Mar 13 11:16:48 2024] dump_header+0x47/0x2d0 > > [Wed Mar 13 11:16:48 2024] oom_kill_process+0x101/0x2e0 > > [Wed Mar 13 11:16:48 2024] out_of_memory+0xfc/0x430 > > [Wed Mar 13 11:16:48 2024] mem_cgroup_out_of_memory+0x13d/0x160 > > [Wed Mar 13 11:16:48 2024] try_charge_memcg+0x7be/0x850 > > [Wed Mar 13 11:16:48 2024] ? get_mem_cgroup_from_mm+0x5e/0x420 > > [Wed Mar 13 11:16:48 2024] ? 
rcu_read_unlock+0x25/0x70 > > [Wed Mar 13 11:16:48 2024] __mem_cgroup_charge+0x49/0x90 > > [Wed Mar 13 11:16:48 2024] __filemap_add_folio+0x277/0x450 > > [Wed Mar 13 11:16:48 2024] ? __pfx_workingset_update_node+0x10/0x10 > > [Wed Mar 13 11:16:48 2024] filemap_add_folio+0x3c/0xa0 > > [Wed Mar 13 11:16:48 2024] __filemap_get_folio+0x13d/0x2f0 > > [Wed Mar 13 11:16:48 2024] iomap_get_folio+0x4c/0x60 > > [Wed Mar 13 11:16:48 2024] iomap_write_begin+0x1bb/0x2e0 > > [Wed Mar 13 11:16:48 2024] iomap_write_iter+0xff/0x290 > > [Wed Mar 13 11:16:48 2024] iomap_file_buffered_write+0x91/0xf0 > > [Wed Mar 13 11:16:48 2024] xfs_file_buffered_write+0x9f/0x2d0 [xfs] > > [Wed Mar 13 11:16:48 2024] ? vfs_write+0x261/0x530 > > [Wed Mar 13 11:16:48 2024] ? debug_smp_processor_id+0x17/0x20 > > [Wed Mar 13 11:16:48 2024] xfs_file_write_iter+0xe9/0x120 [xfs] > > [Wed Mar 13 11:16:48 2024] vfs_write+0x37d/0x530 > > [Wed Mar 13 11:16:48 2024] ksys_write+0x6d/0xf0 > > [Wed Mar 13 11:16:48 2024] __x64_sys_write+0x19/0x20 > > [Wed Mar 13 11:16:48 2024] do_syscall_64+0x79/0x1a0 > > [Wed Mar 13 11:16:48 2024] entry_SYSCALL_64_after_hwframe+0x6e/0x76 > > [Wed Mar 13 11:16:48 2024] RIP: 0033:0x7f63ea33e927 > > [Wed Mar 13 11:16:48 2024] Code: 0b 00 f7 d8 64 89 02 48 c7 c0 ff ff > > ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 > > b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 > > 24 18 48 89 74 24 > > [Wed Mar 13 11:16:48 2024] RSP: 002b:00007ffc0e874768 EFLAGS: 00000246 > > ORIG_RAX: 0000000000000001 > > [Wed Mar 13 11:16:48 2024] RAX: ffffffffffffffda RBX: 0000000000100000 > > RCX: 00007f63ea33e927 > > [Wed Mar 13 11:16:48 2024] RDX: 0000000000100000 RSI: 00007f63dcafe000 > > RDI: 0000000000000001 > > [Wed Mar 13 11:16:48 2024] RBP: 00007f63dcafe000 R08: 00007f63dcafe000 > > R09: 0000000000000000 > > [Wed Mar 13 11:16:48 2024] R10: 0000000000000022 R11: 0000000000000246 > > R12: 0000000000000000 > > [Wed Mar 13 11:16:48 2024] R13: 
0000000000000000 R14: 0000000000000000 > > R15: 00007f63dcafe000 > > [Wed Mar 13 11:16:48 2024] </TASK> > > [Wed Mar 13 11:16:48 2024] memory: usage 1048556kB, limit 1048576kB, failcnt 153 > > [Wed Mar 13 11:16:48 2024] memory+swap: usage 1048556kB, limit > > I see you were actually on cgroup v1 -- this might be a different > problem than Chris' since he was on v2. Right, we are still using cgroup1. They might not be the same issue. > > For v1, the throttling is done by commit 81a70c21d9 > ("mm/cgroup/reclaim: fix dirty pages throttling on cgroup v1"). > IOW, the active/inactive LRU throttles in both v1 and v2 (done > in different ways) whereas MGLRU doesn't in either case. > > > 9007199254740988kB, failcnt 0 > > [Wed Mar 13 11:16:48 2024] kmem: usage 200kB, limit > > 9007199254740988kB, failcnt 0 > > [Wed Mar 13 11:16:48 2024] Memory cgroup stats for /mglru: > > [Wed Mar 13 11:16:48 2024] cache 1072365568 > > [Wed Mar 13 11:16:48 2024] rss 1150976 > > [Wed Mar 13 11:16:48 2024] rss_huge 0 > > [Wed Mar 13 11:16:48 2024] shmem 0 > > [Wed Mar 13 11:16:48 2024] mapped_file 0 > > [Wed Mar 13 11:16:48 2024] dirty 1072365568 > > [Wed Mar 13 11:16:48 2024] writeback 0 > > [Wed Mar 13 11:16:48 2024] workingset_refault_anon 0 > > [Wed Mar 13 11:16:48 2024] workingset_refault_file 0 > > [Wed Mar 13 11:16:48 2024] swap 0 > > [Wed Mar 13 11:16:48 2024] swapcached 0 > > [Wed Mar 13 11:16:48 2024] pgpgin 2783 > > [Wed Mar 13 11:16:48 2024] pgpgout 1444 > > [Wed Mar 13 11:16:48 2024] pgfault 885 > > [Wed Mar 13 11:16:48 2024] pgmajfault 0 > > [Wed Mar 13 11:16:48 2024] inactive_anon 1146880 > > [Wed Mar 13 11:16:48 2024] active_anon 4096 > > [Wed Mar 13 11:16:48 2024] inactive_file 802357248 > > [Wed Mar 13 11:16:48 2024] active_file 270008320 > > [Wed Mar 13 11:16:48 2024] unevictable 0 > > [Wed Mar 13 11:16:48 2024] hierarchical_memory_limit 1073741824 > > [Wed Mar 13 11:16:48 2024] hierarchical_memsw_limit 9223372036854771712 > > [Wed Mar 13 11:16:48 2024] total_cache 
1072365568 > > [Wed Mar 13 11:16:48 2024] total_rss 1150976 > > [Wed Mar 13 11:16:48 2024] total_rss_huge 0 > > [Wed Mar 13 11:16:48 2024] total_shmem 0 > > [Wed Mar 13 11:16:48 2024] total_mapped_file 0 > > [Wed Mar 13 11:16:48 2024] total_dirty 1072365568 > > [Wed Mar 13 11:16:48 2024] total_writeback 0 > > [Wed Mar 13 11:16:48 2024] total_workingset_refault_anon 0 > > [Wed Mar 13 11:16:48 2024] total_workingset_refault_file 0 > > [Wed Mar 13 11:16:48 2024] total_swap 0 > > [Wed Mar 13 11:16:48 2024] total_swapcached 0 > > [Wed Mar 13 11:16:48 2024] total_pgpgin 2783 > > [Wed Mar 13 11:16:48 2024] total_pgpgout 1444 > > [Wed Mar 13 11:16:48 2024] total_pgfault 885 > > [Wed Mar 13 11:16:48 2024] total_pgmajfault 0 > > [Wed Mar 13 11:16:48 2024] total_inactive_anon 1146880 > > [Wed Mar 13 11:16:48 2024] total_active_anon 4096 > > [Wed Mar 13 11:16:48 2024] total_inactive_file 802357248 > > [Wed Mar 13 11:16:48 2024] total_active_file 270008320 > > [Wed Mar 13 11:16:48 2024] total_unevictable 0 > > [Wed Mar 13 11:16:48 2024] Tasks state (memory values in pages): > > [Wed Mar 13 11:16:48 2024] [ pid ] uid tgid total_vm rss > > rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name > > [Wed Mar 13 11:16:48 2024] [ 6911] 0 6911 55506 640 > > 256 384 0 73728 0 0 dd > > [Wed Mar 13 11:16:48 2024] > > oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0-1,oom_memcg=/mglru,task_memcg=/mglru,task=dd,pid=6911,uid=0 > > > > The key information extracted from the OOM info is as follows: > > > > [Wed Mar 13 11:16:48 2024] cache 1072365568 > > [Wed Mar 13 11:16:48 2024] dirty 1072365568 > > > > This information reveals that all file pages are dirty pages. > > I'm surprised to see there was 0 pages under writeback: > [Wed Mar 13 11:16:48 2024] total_writeback 0 > What's your dirty limit? The background dirty threshold is 2G, and the dirty threshold is 4G. 
sysctl -w vm.dirty_background_bytes=$((1024 * 1024 * 1024 * 2)) sysctl -w vm.dirty_bytes=$((1024 * 1024 * 1024 * 4)) > > It's unfortunate that the mainline has no per-memcg dirty limit. (We > do at Google.) Per-memcg dirty limit is a useful feature. We also support it in our local kernel, but we didn't enable it for this test case. It is unclear why the memcg maintainers insist on rejecting the per-memcg dirty limit :( > > > As of now, it appears that the most effective solution to address this > > issue is to revert the commit 14aa8b2d5c2e. Regarding this commit > > 14aa8b2d5c2e, its original intention was to eliminate potential SSD > > wearout, although there's no concrete data available on how it might > > impact SSD longevity. If the concern about SSD wearout is purely > > theoretical, it might be reasonable to consider reverting this commit. > > The SSD wearout problem was real -- it wasn't really due to > wakeup_flusher_threads() itself; rather, the original MGLRU code called > the function improperly. It needs to be called under more restricted > conditions so that it doesn't cause the SSD wearout problem again. > However, IMO, wakeup_flusher_threads() is just another bandaid trying > to work around a more fundamental problem. There is no guarantee that > the flusher will target the dirty pages in the memcg under reclaim, > right? Right, it is a system-wide flusher. > > Do you mind trying the following first to see if we can get around > the problem without calling wakeup_flusher_threads()? I have tried it, but it still triggers the OOM. Below is the information. 
[ 71.713649] dd invoked oom-killer: gfp_mask=0x101c4a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE), order=3, oom_score_adj=0 [ 71.716317] CPU: 60 PID: 7218 Comm: dd Not tainted 6.8.0-rc6+ #26 [ 71.717677] Call Trace: [ 71.717917] <TASK> [ 71.718137] dump_stack_lvl+0x6e/0x90 [ 71.718485] dump_stack+0x10/0x20 [ 71.718799] dump_header+0x47/0x2d0 [ 71.719147] oom_kill_process+0x101/0x2e0 [ 71.719523] out_of_memory+0xfc/0x430 [ 71.719868] mem_cgroup_out_of_memory+0x13d/0x160 [ 71.720322] try_charge_memcg+0x7be/0x850 [ 71.720701] ? get_mem_cgroup_from_mm+0x5e/0x420 [ 71.721137] ? rcu_read_unlock+0x25/0x70 [ 71.721506] __mem_cgroup_charge+0x49/0x90 [ 71.721887] __filemap_add_folio+0x277/0x450 [ 71.722304] ? __pfx_workingset_update_node+0x10/0x10 [ 71.722773] filemap_add_folio+0x3c/0xa0 [ 71.723149] __filemap_get_folio+0x13d/0x2f0 [ 71.723551] iomap_get_folio+0x4c/0x60 [ 71.723911] iomap_write_begin+0x1bb/0x2e0 [ 71.724309] iomap_write_iter+0xff/0x290 [ 71.724683] iomap_file_buffered_write+0x91/0xf0 [ 71.725140] xfs_file_buffered_write+0x9f/0x2d0 [xfs] [ 71.725793] ? vfs_write+0x261/0x530 [ 71.726148] ? 
debug_smp_processor_id+0x17/0x20 [ 71.726574] xfs_file_write_iter+0xe9/0x120 [xfs] [ 71.727161] vfs_write+0x37d/0x530 [ 71.727501] ksys_write+0x6d/0xf0 [ 71.727821] __x64_sys_write+0x19/0x20 [ 71.728181] do_syscall_64+0x79/0x1a0 [ 71.728529] entry_SYSCALL_64_after_hwframe+0x6e/0x76 [ 71.729002] RIP: 0033:0x7fd77053e927 [ 71.729340] Code: 0b 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 [ 71.730988] RSP: 002b:00007fff032b7218 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 71.731664] RAX: ffffffffffffffda RBX: 0000000000100000 RCX: 00007fd77053e927 [ 71.732308] RDX: 0000000000100000 RSI: 00007fd762cfe000 RDI: 0000000000000001 [ 71.732955] RBP: 00007fd762cfe000 R08: 00007fd762cfe000 R09: 0000000000000000 [ 71.733592] R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000000000 [ 71.734237] R13: 0000000000000000 R14: 0000000000000000 R15: 00007fd762cfe000 [ 71.735175] </TASK> [ 71.736115] memory: usage 1048548kB, limit 1048576kB, failcnt 114 [ 71.736123] memory+swap: usage 1048548kB, limit 9007199254740988kB, failcnt 0 [ 71.736127] kmem: usage 184kB, limit 9007199254740988kB, failcnt 0 [ 71.736131] Memory cgroup stats for /mglru: [ 71.736364] cache 1072300032 [ 71.736370] rss 1224704 [ 71.736373] rss_huge 0 [ 71.736376] shmem 0 [ 71.736380] mapped_file 0 [ 71.736383] dirty 1072300032 [ 71.736386] writeback 0 [ 71.736389] workingset_refault_anon 0 [ 71.736393] workingset_refault_file 0 [ 71.736396] swap 0 [ 71.736400] swapcached 0 [ 71.736403] pgpgin 2782 [ 71.736406] pgpgout 1427 [ 71.736410] pgfault 882 [ 71.736414] pgmajfault 0 [ 71.736417] inactive_anon 0 [ 71.736421] active_anon 1220608 [ 71.736424] inactive_file 0 [ 71.736428] active_file 1072300032 [ 71.736431] unevictable 0 [ 71.736435] hierarchical_memory_limit 1073741824 [ 71.736438] hierarchical_memsw_limit 9223372036854771712 [ 71.736442] total_cache 1072300032 
[ 71.736445] total_rss 1224704 [ 71.736448] total_rss_huge 0 [ 71.736451] total_shmem 0 [ 71.736455] total_mapped_file 0 [ 71.736458] total_dirty 1072300032 [ 71.736462] total_writeback 0 [ 71.736465] total_workingset_refault_anon 0 [ 71.736469] total_workingset_refault_file 0 [ 71.736472] total_swap 0 [ 71.736475] total_swapcached 0 [ 71.736478] total_pgpgin 2782 [ 71.736482] total_pgpgout 1427 [ 71.736485] total_pgfault 882 [ 71.736488] total_pgmajfault 0 [ 71.736491] total_inactive_anon 0 [ 71.736494] total_active_anon 1220608 [ 71.736497] total_inactive_file 0 [ 71.736501] total_active_file 1072300032 [ 71.736504] total_unevictable 0 [ 71.736508] Tasks state (memory values in pages): [ 71.736512] [ pid ] uid tgid total_vm rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name [ 71.736522] [ 7215] 0 7215 55663 768 0 768 0 81920 0 0 test.sh [ 71.736586] [ 7218] 0 7218 55506 640 256 384 0 69632 0 0 dd [ 71.736596] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0-1,oom_memcg=/mglru,task_memcg=/mglru,task=test.sh,pid=7215,uid=0 [ 71.736766] Memory cgroup out of memory: Killed process 7215 (test.sh) total-vm:222652kB, anon-rss:0kB, file-rss:3072kB, shmem-rss:0kB, UID:0 pgtables:80kB oom_score_adj:0 And the key information: [ 71.736442] total_cache 1072300032 [ 71.736458] total_dirty 1072300032 [ 71.736462] total_writeback 0 > > Thanks! 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 4255619a1a31..d3cfbd95996d 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -225,7 +225,7 @@ static bool writeback_throttling_sane(struct scan_control *sc) > if (cgroup_subsys_on_dfl(memory_cgrp_subsys)) > return true; > #endif > - return false; > + return lru_gen_enabled(); > } > #else > static bool cgroup_reclaim(struct scan_control *sc) > @@ -4273,8 +4273,10 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c > } > > /* waiting for writeback */ > - if (folio_test_locked(folio) || folio_test_writeback(folio) || > - (type == LRU_GEN_FILE && folio_test_dirty(folio))) { > + if (folio_test_writeback(folio) || (type == LRU_GEN_FILE && folio_test_dirty(folio))) { > + sc->nr.dirty += delta; > + if (!folio_test_reclaim(folio)) > + sc->nr.congested += delta; > gen = folio_inc_gen(lruvec, folio, true); > list_move(&folio->lru, &lrugen->folios[gen][type][zone]); > return true; -- Regards Yafang ^ permalink raw reply [flat|nested] 19+ messages in thread
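The telltale signature in the report above -- total_cache equal to total_dirty with total_writeback at zero -- can be spotted on a live system by reading the cgroup's memory.stat. A minimal sketch (the cgroup path in the usage comment and the v1 "total_*" field names are assumptions taken from the report above):

```shell
# Check a memory.stat file for the "all cache dirty, nothing queued" signature.
check_dirty_signature() {
    awk '
        $1 == "total_cache"     { cache = $2 }
        $1 == "total_dirty"     { dirty = $2 }
        $1 == "total_writeback" { wb = $2 }
        END {
            printf "cache=%d dirty=%d writeback=%d\n", cache, dirty, wb
            if (cache > 0 && dirty == cache && wb == 0)
                print "signature: dirty pages piling up with no writeback in flight"
        }' "$1"
}

# Intended use (hypothetical cgroup path from the report above):
#   check_dirty_signature /sys/fs/cgroup/memory/mglru/memory.stat
```

If the signature line prints while reclaim is failing, the memcg is in the state discussed here: the entire page cache is dirty and no flusher has been woken to clean it.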
* Re: MGLRU premature memcg OOM on slow writes 2024-03-15 2:38 ` Yafang Shao @ 2024-03-15 14:27 ` Johannes Weiner 0 siblings, 0 replies; 19+ messages in thread From: Johannes Weiner @ 2024-03-15 14:27 UTC (permalink / raw) To: Yafang Shao Cc: Yu Zhao, Axel Rasmussen, Chris Down, cgroups, kernel-team, linux-kernel, linux-mm On Fri, Mar 15, 2024 at 10:38:31AM +0800, Yafang Shao wrote: > On Fri, Mar 15, 2024 at 6:23 AM Yu Zhao <yuzhao@google.com> wrote: > > I'm surprised to see there was 0 pages under writeback: > > [Wed Mar 13 11:16:48 2024] total_writeback 0 > > What's your dirty limit? > > The background dirty threshold is 2G, and the dirty threshold is 4G. > > sysctl -w vm.dirty_background_bytes=$((1024 * 1024 * 1024 * 2)) > sysctl -w vm.dirty_bytes=$((1024 * 1024 * 1024 * 4)) > > > > > It's unfortunate that the mainline has no per-memcg dirty limit. (We > > do at Google.) > > Per-memcg dirty limit is a useful feature. We also support it in our > local kernel, but we didn't enable it for this test case. > It is unclear why the memcg maintainers insist on rejecting the > per-memcg dirty limit :( I don't think that assessment is fair. It's just that nobody has seriously proposed it (at least not that I remember) since the cgroup-aware writeback was merged in 2015. We run millions of machines with different workloads, memory sizes, and IO devices, and don't feel the need to tune the settings for the global dirty limits away from the defaults. Cgroups allot those allowances in proportion to observed writeback speed and available memory in the container. We set IO rate and memory limits per container, and it adapts as necessary. If you have an actual usecase, I'm more than willing to hear you out. I'm sure that the other maintainers feel the same. If you're proposing it as a workaround for cgroup1 being architecturally unable to implement proper writeback cache management, then it's a more difficult argument. That's one of the big reasons why cgroup2 exists after all. 
> > > As of now, it appears that the most effective solution to address this
> > > issue is to revert the commit 14aa8b2d5c2e. Regarding this commit
> > > 14aa8b2d5c2e, its original intention was to eliminate potential SSD
> > > wearout, although there's no concrete data available on how it might
> > > impact SSD longevity. If the concern about SSD wearout is purely
> > > theoretical, it might be reasonable to consider reverting this commit.
> >
> > The SSD wearout problem was real -- it wasn't really due to
> > wakeup_flusher_threads() itself; rather, the original MGLRU code called
> > the function improperly. It needs to be called under more restricted
> > conditions so that it doesn't cause the SSD wearout problem again.
> > However, IMO, wakeup_flusher_threads() is just another bandaid trying
> > to work around a more fundamental problem. There is no guarantee that
> > the flusher will target the dirty pages in the memcg under reclaim,
> > right?
>
> Right, it is a system-wide flusher.

Is it possible it was woken up just too frequently? Conventional reclaim wakes it based on actually observed dirty pages off the LRU. I'm not super familiar with MGLRU, but it looks like it woke it on every generational bump? That might indeed be too frequent, and doesn't seem related to the writeback cache state.

We're monitoring write rates quite closely due to wearout concerns as well, especially because we use disk swap too. This is the first time I'm hearing about reclaim-driven wakeups being a concern. (The direct writepage calls were a huge problem. But not waking the flushers.)

Frankly, I don't think the issue is fixable without bringing the wakeup back in some form. Even if you had per-cgroup dirty limits. As soon as you have non-zero dirty pages, you can produce allocation patterns that drive reclaim into them before background writeback kicks in. If reclaim doesn't wake the flushers and waits for writeback, the premature OOM margin is the size of the background limit - 1.
Yes, cgroup1 and cgroup2 react differently to seeing pages under writeback: cgroup1 does wait_on_page_writeback(); cgroup2 samples batches of pages and throttles at a higher level. But both of them need the flushers woken, or there is nothing to wait for. Unless you want to wait for dirty expiration :) ^ permalink raw reply [flat|nested] 19+ messages in thread
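To put rough numbers on the "premature OOM margin" point above, a back-of-envelope sketch using the settings reported earlier in this thread (the 2G background threshold from Yafang's sysctls, and the 32M memory.max from Chris' rsync reproducer):

```shell
# How much dirty data can sit unqueued (below the background threshold)
# relative to the memcg that has to absorb the reclaim pressure?
bg=$((2 * 1024 * 1024 * 1024))    # vm.dirty_background_bytes reported in this thread
memcg_max=$((32 * 1024 * 1024))   # memory.max from the rsync reproducer
echo "unqueued dirty headroom: $((bg / memcg_max))x the memcg limit"
```

Until system-wide dirty pages cross the 2G background threshold, nothing wakes the flushers on its own, so the 32M memcg can fill entirely with dirty cache -- 64 times over -- with zero writeback in flight unless reclaim itself triggers the wakeup.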
* Re: MGLRU premature memcg OOM on slow writes 2024-03-12 20:07 ` Yu Zhao 2024-03-12 20:11 ` Yu Zhao @ 2024-03-12 21:08 ` Johannes Weiner 2024-03-13 2:08 ` Yu Zhao 2024-03-13 10:59 ` Hillf Danton 1 sibling, 2 replies; 19+ messages in thread From: Johannes Weiner @ 2024-03-12 21:08 UTC (permalink / raw) To: Yu Zhao Cc: Axel Rasmussen, Yafang Shao, Chris Down, cgroups, kernel-team, linux-kernel, linux-mm On Tue, Mar 12, 2024 at 02:07:04PM -0600, Yu Zhao wrote: > Yes, these two are among the differences between the active/inactive > LRU and MGLRU, but their roles, IMO, are not as important as the LRU > positions of dirty pages. The active/inactive LRU moves dirty pages > all the way to the end of the line (reclaim happens at the front) > whereas MGLRU moves them into the middle, during direct reclaim. The > rationale for MGLRU was that this way those dirty pages would still > be counted as "inactive" (or cold). Note that activating the page is not a statement on the page's hotness. It's simply to park it away from the scanner. We could as well have moved it to the unevictable list - this is just easier. folio_end_writeback() will call folio_rotate_reclaimable() and move it back to the inactive tail, to make it the very next reclaim target as soon as it's clean. > This theory can be quickly verified by comparing how much > nr_vmscan_immediate_reclaim grows, i.e., > > Before the copy > grep nr_vmscan_immediate_reclaim /proc/vmstat > And then after the copy > grep nr_vmscan_immediate_reclaim /proc/vmstat > > The growth should be trivial for MGLRU and nontrivial for the > active/inactive LRU. > > If this is indeed the case, I'd appreciate very much if anyone could > try the following (I'll try it myself too later next week). 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 4255619a1a31..020f5d98b9a1 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -4273,10 +4273,13 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c > } > > /* waiting for writeback */ > - if (folio_test_locked(folio) || folio_test_writeback(folio) || > - (type == LRU_GEN_FILE && folio_test_dirty(folio))) { > - gen = folio_inc_gen(lruvec, folio, true); > - list_move(&folio->lru, &lrugen->folios[gen][type][zone]); > + if (folio_test_writeback(folio) || (type == LRU_GEN_FILE && folio_test_dirty(folio))) { > + DEFINE_MAX_SEQ(lruvec); > + int old_gen, new_gen = lru_gen_from_seq(max_seq); > + > + old_gen = folio_update_gen(folio, new_gen); > + lru_gen_update_size(lruvec, folio, old_gen, new_gen); > + list_move(&folio->lru, &lrugen->folios[new_gen][type][zone]); > return true; Right, because MGLRU sorts these pages out before calling the scanner, so they never get marked for immediate reclaim. But that also implies they won't get rotated back to the tail when writeback finishes. Doesn't that mean that you now have pages that a) came from the oldest generation and were only deferred due to their writeback state, and b) are now clean and should be reclaimed. But since they're permanently advanced to the next gen, you'll instead reclaim pages that were originally ahead of them, and likely hotter. Isn't that an age inversion? Back to the broader question though: if reclaim demand outstrips clean pages and the only viable candidates are dirty ones (e.g. an allocation spike in the presence of dirty/writeback pages), there only seem to be 3 options: 1) sleep-wait for writeback 2) continue scanning, aka busy-wait for writeback + age inversions 3) find nothing and declare OOM Since you're not doing 1), it must be one of the other two, no? One way or another it has to either pace-match to IO completions, or OOM. ^ permalink raw reply [flat|nested] 19+ messages in thread
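Yu's verification step above (comparing nr_vmscan_immediate_reclaim before and after the copy) can be scripted; a minimal sketch, with the workload left as a placeholder for the copy to the slow device:

```shell
# Print the growth of a /proc/vmstat counter between two snapshots.
vmstat_delta() {  # $1: counter name, $2: before snapshot, $3: after snapshot
    b=$(printf '%s\n' "$2" | awk -v k="$1" '$1 == k { print $2 }')
    a=$(printf '%s\n' "$3" | awk -v k="$1" '$1 == k { print $2 }')
    echo $(( a - b ))
}

# Intended use around the copy to the slow device:
#   before=$(cat /proc/vmstat)
#   rsync -rv ... /mnt/usb        # or the dd reproducer
#   after=$(cat /proc/vmstat)
#   vmstat_delta nr_vmscan_immediate_reclaim "$before" "$after"
```

Per the theory above, the printed delta should stay near zero under MGLRU and grow substantially under the active/inactive LRU.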
* Re: MGLRU premature memcg OOM on slow writes 2024-03-12 21:08 ` Johannes Weiner @ 2024-03-13 2:08 ` Yu Zhao 2024-03-13 3:22 ` Johannes Weiner 2024-03-13 10:59 ` Hillf Danton 1 sibling, 1 reply; 19+ messages in thread From: Yu Zhao @ 2024-03-13 2:08 UTC (permalink / raw) To: Johannes Weiner Cc: Axel Rasmussen, Yafang Shao, Chris Down, cgroups, kernel-team, linux-kernel, linux-mm On Tue, Mar 12, 2024 at 5:08 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Tue, Mar 12, 2024 at 02:07:04PM -0600, Yu Zhao wrote: > > Yes, these two are among the differences between the active/inactive > > LRU and MGLRU, but their roles, IMO, are not as important as the LRU > > positions of dirty pages. The active/inactive LRU moves dirty pages > > all the way to the end of the line (reclaim happens at the front) > > whereas MGLRU moves them into the middle, during direct reclaim. The > > rationale for MGLRU was that this way those dirty pages would still > > be counted as "inactive" (or cold). > > Note that activating the page is not a statement on the page's > hotness. It's simply to park it away from the scanner. We could as > well have moved it to the unevictable list - this is just easier. > > folio_end_writeback() will call folio_rotate_reclaimable() and move it > back to the inactive tail, to make it the very next reclaim target as > soon as it's clean. > > > This theory can be quickly verified by comparing how much > > nr_vmscan_immediate_reclaim grows, i.e., > > > > Before the copy > > grep nr_vmscan_immediate_reclaim /proc/vmstat > > And then after the copy > > grep nr_vmscan_immediate_reclaim /proc/vmstat > > > > The growth should be trivial for MGLRU and nontrivial for the > > active/inactive LRU. > > > > If this is indeed the case, I'd appreciate very much if anyone could > > try the following (I'll try it myself too later next week). 
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index 4255619a1a31..020f5d98b9a1 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -4273,10 +4273,13 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c > > } > > > > /* waiting for writeback */ > > - if (folio_test_locked(folio) || folio_test_writeback(folio) || > > - (type == LRU_GEN_FILE && folio_test_dirty(folio))) { > > - gen = folio_inc_gen(lruvec, folio, true); > > - list_move(&folio->lru, &lrugen->folios[gen][type][zone]); > > + if (folio_test_writeback(folio) || (type == LRU_GEN_FILE && folio_test_dirty(folio))) { > > + DEFINE_MAX_SEQ(lruvec); > > + int old_gen, new_gen = lru_gen_from_seq(max_seq); > > + > > + old_gen = folio_update_gen(folio, new_gen); > > + lru_gen_update_size(lruvec, folio, old_gen, new_gen); > > + list_move(&folio->lru, &lrugen->folios[new_gen][type][zone]); > > return true; > > Right, because MGLRU sorts these pages out before calling the scanner, > so they never get marked for immediate reclaim. > > But that also implies they won't get rotated back to the tail when > writeback finishes. Those dirty pages are marked by PG_reclaim either by folio_inc_gen() { ... if (reclaiming) new_flags |= BIT(PG_reclaim); ... } or [1], which I missed initially. So they should be rotated on writeback finishing up. [1] https://lore.kernel.org/linux-mm/ZfC2612ZYwwxpOmR@google.com/ > Doesn't that mean that you now have pages that > > a) came from the oldest generation and were only deferred due to their > writeback state, and > > b) are now clean and should be reclaimed. But since they're > permanently advanced to the next gen, you'll instead reclaim pages > that were originally ahead of them, and likely hotter. > > Isn't that an age inversion? > > Back to the broader question though: if reclaim demand outstrips clean > pages and the only viable candidates are dirty ones (e.g. 
an > allocation spike in the presence of dirty/writeback pages), there only > seem to be 3 options: > > 1) sleep-wait for writeback > 2) continue scanning, aka busy-wait for writeback + age inversions > 3) find nothing and declare OOM > > Since you're not doing 1), it must be one of the other two, no? One > way or another it has to either pace-match to IO completions, or OOM. Yes, and in this case, 2) is possible but 3) is very likely. MGLRU doesn't do 1) for sure (in the reclaim path of course). I didn't find any throttling on dirty pages for cgroup v2 either in the active/inactive LRU -- I assume Chris was on v2, and hence my take on throttling on dirty pages in the reclaim path not being the key for his case. With the above change, I'm hoping balance_dirty_pages() will wake up the flusher, again for Chris' case, so that MGLRU won't have to call wakeup_flusher_threads(), since it can wake up the flusher too often and in turn cause excessive IOs when considering SSD wearout. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: MGLRU premature memcg OOM on slow writes 2024-03-13 2:08 ` Yu Zhao @ 2024-03-13 3:22 ` Johannes Weiner 0 siblings, 0 replies; 19+ messages in thread From: Johannes Weiner @ 2024-03-13 3:22 UTC (permalink / raw) To: Yu Zhao Cc: Axel Rasmussen, Yafang Shao, Chris Down, cgroups, kernel-team, linux-kernel, linux-mm On Tue, Mar 12, 2024 at 10:08:13PM -0400, Yu Zhao wrote: > On Tue, Mar 12, 2024 at 5:08 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > On Tue, Mar 12, 2024 at 02:07:04PM -0600, Yu Zhao wrote: > > > Yes, these two are among the differences between the active/inactive > > > LRU and MGLRU, but their roles, IMO, are not as important as the LRU > > > positions of dirty pages. The active/inactive LRU moves dirty pages > > > all the way to the end of the line (reclaim happens at the front) > > > whereas MGLRU moves them into the middle, during direct reclaim. The > > > rationale for MGLRU was that this way those dirty pages would still > > > be counted as "inactive" (or cold). > > > > Note that activating the page is not a statement on the page's > > hotness. It's simply to park it away from the scanner. We could as > > well have moved it to the unevictable list - this is just easier. > > > > folio_end_writeback() will call folio_rotate_reclaimable() and move it > > back to the inactive tail, to make it the very next reclaim target as > > soon as it's clean. > > > > > This theory can be quickly verified by comparing how much > > > nr_vmscan_immediate_reclaim grows, i.e., > > > > > > Before the copy > > > grep nr_vmscan_immediate_reclaim /proc/vmstat > > > And then after the copy > > > grep nr_vmscan_immediate_reclaim /proc/vmstat > > > > > > The growth should be trivial for MGLRU and nontrivial for the > > > active/inactive LRU. > > > > > > If this is indeed the case, I'd appreciate very much if anyone could > > > try the following (I'll try it myself too later next week). 
> > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > > index 4255619a1a31..020f5d98b9a1 100644 > > > --- a/mm/vmscan.c > > > +++ b/mm/vmscan.c > > > @@ -4273,10 +4273,13 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c > > > } > > > > > > /* waiting for writeback */ > > > - if (folio_test_locked(folio) || folio_test_writeback(folio) || > > > - (type == LRU_GEN_FILE && folio_test_dirty(folio))) { > > > - gen = folio_inc_gen(lruvec, folio, true); > > > - list_move(&folio->lru, &lrugen->folios[gen][type][zone]); > > > + if (folio_test_writeback(folio) || (type == LRU_GEN_FILE && folio_test_dirty(folio))) { > > > + DEFINE_MAX_SEQ(lruvec); > > > + int old_gen, new_gen = lru_gen_from_seq(max_seq); > > > + > > > + old_gen = folio_update_gen(folio, new_gen); > > > + lru_gen_update_size(lruvec, folio, old_gen, new_gen); > > > + list_move(&folio->lru, &lrugen->folios[new_gen][type][zone]); > > > return true; > > > > Right, because MGLRU sorts these pages out before calling the scanner, > > so they never get marked for immediate reclaim. > > > > But that also implies they won't get rotated back to the tail when > > writeback finishes. > > Those dirty pages are marked by PG_reclaim either by > > folio_inc_gen() > { > ... > if (reclaiming) > new_flags |= BIT(PG_reclaim); > ... > } > > or [1], which I missed initially. So they should be rotated on writeback > finishing up. > > [1] https://lore.kernel.org/linux-mm/ZfC2612ZYwwxpOmR@google.com/ Ah, I missed that! Thanks. > > Doesn't that mean that you now have pages that > > > > a) came from the oldest generation and were only deferred due to their > > writeback state, and > > > > b) are now clean and should be reclaimed. But since they're > > permanently advanced to the next gen, you'll instead reclaim pages > > that were originally ahead of them, and likely hotter. > > > > Isn't that an age inversion? 
> > > > Back to the broader question though: if reclaim demand outstrips clean > > pages and the only viable candidates are dirty ones (e.g. an > > allocation spike in the presence of dirty/writeback pages), there only > > seem to be 3 options: > > > > 1) sleep-wait for writeback > > 2) continue scanning, aka busy-wait for writeback + age inversions > > 3) find nothing and declare OOM > > > > Since you're not doing 1), it must be one of the other two, no? One > > way or another it has to either pace-match to IO completions, or OOM. > > Yes, and in this case, 2) is possible but 3) is very likely. > > MGLRU doesn't do 1) for sure (in the reclaim path of course). I didn't > find any throttling on dirty pages for cgroup v2 either in the > active/inactive LRU -- I assume Chris was on v2, and hence my take on > throttling on dirty pages in the reclaim path not being the key for > his case. It's kind of spread out, but it's there: shrink_folio_list() will bump nr_dirty on dirty pages, and nr_congested if immediate reclaim folios cycle back around. shrink_inactive_list() will wake the flushers if all the dirty pages it encountered are still unqueued. shrink_node() will set LRUVEC_CGROUP_CONGESTED, and then call reclaim_throttle() on it. (As Chris points out, though, the throttle call was not long ago changed from VMSCAN_THROTTLE_WRITEBACK to VMSCAN_THROTTLE_CONGESTED, and appears a bit more fragile now than it used to be. Probably worth following up on this.) ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: MGLRU premature memcg OOM on slow writes 2024-03-12 21:08 ` Johannes Weiner 2024-03-13 2:08 ` Yu Zhao @ 2024-03-13 10:59 ` Hillf Danton 1 sibling, 0 replies; 19+ messages in thread From: Hillf Danton @ 2024-03-13 10:59 UTC (permalink / raw) To: Johannes Weiner Cc: Yu Zhao, Axel Rasmussen, Chris Down, cgroups, linux-kernel, linux-mm On Tue, 12 Mar 2024 17:08:22 -0400 Johannes Weiner <hannes@cmpxchg.org> > > Back to the broader question though: if reclaim demand outstrips clean > pages and the only viable candidates are dirty ones (e.g. an > allocation spike in the presence of dirty/writeback pages), there only > seem to be 3 options: > > 1) sleep-wait for writeback > 2) continue scanning, aka busy-wait for writeback + age inversions > 3) find nothing and declare OOM 4) make dirty ratio match your writeback bandwidth [1] [1] Subject: Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free https://lore.kernel.org/lkml/CA+55aFzNe=3e=cDig+vEzZS5jm2c6apPV4s5NKG4eYL4_jxQjQ@mail.gmail.com/ ^ permalink raw reply [flat|nested] 19+ messages in thread
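Hillf's option 4) can be made concrete with a small heuristic: cap vm.dirty_bytes so that flushing the entire dirty set fits within an acceptable stall at the device's measured write bandwidth. A sketch of that heuristic (the bandwidth and latency figures in the usage comment are illustrative assumptions, not recommendations):

```shell
# Suggest a vm.dirty_bytes value such that flushing the whole dirty set takes
# at most LATENCY_S seconds at the device's sustained write bandwidth.
dirty_budget() {  # $1: bandwidth in MB/s, $2: acceptable flush latency in s
    echo $(( $1 * 1024 * 1024 * $2 ))
}

# e.g. a USB stick sustaining ~10 MB/s, with a 5s worst-case flush:
#   sysctl -w vm.dirty_bytes=$(dirty_budget 10 5)
#   sysctl -w vm.dirty_background_bytes=$(( $(dirty_budget 10 5) / 2 ))
```

This mirrors the advice in the linked thread: on a slow device, the default dirty limits allow far more dirty data to accumulate than the device can drain in a reasonable time, which is exactly the window in which reclaim runs out of clean pages.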
* Re: MGLRU premature memcg OOM on slow writes 2024-02-29 23:51 ` Axel Rasmussen 2024-03-01 0:30 ` Chris Down @ 2024-03-01 11:25 ` Hillf Danton 1 sibling, 0 replies; 19+ messages in thread From: Hillf Danton @ 2024-03-01 11:25 UTC (permalink / raw) To: Axel Rasmussen; +Cc: chris, hannes, linux-kernel, linux-mm, yuzhao On Thu, 29 Feb 2024 15:51:33 -0800 Axel Rasmussen <axelrasmussen@google.com> > > Yosry pointed out [1], where MGLRU used to call this but stopped doing that. It > makes sense to me at least that doing writeback every time we age is too > aggressive, but doing it in evict_folios() makes some sense to me, basically to > copy the behavior the non-MGLRU path (shrink_inactive_list()) has. > > I can send a patch which tries to implement this next week. In the meantime, Yu, Better after working out why flusher failed to do the job, given background writeback and balance_dirty_pages_ratelimited(). If pushing kswapd on the back makes any sense, what prevents you from pushing flusher instead, given they are two different things by define? > please let me know if what I've said here makes no sense for some reason. :) > > [1]: https://lore.kernel.org/lkml/YzSiWq9UEER5LKup@google.com/ ^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2024-03-15 14:27 UTC | newest]

Thread overview: 19+ messages
2024-02-09 2:31 MGLRU premature memcg OOM on slow writes Chris Down
2024-02-29 17:28 ` Chris Down
2024-02-29 23:51 ` Axel Rasmussen
2024-03-01 0:30 ` Chris Down
2024-03-08 19:18 ` Axel Rasmussen
2024-03-08 21:22 ` Johannes Weiner
2024-03-11 9:11 ` Yafang Shao
2024-03-12 16:44 ` Axel Rasmussen
2024-03-12 20:07 ` Yu Zhao
2024-03-12 20:11 ` Yu Zhao
2024-03-13 3:33 ` Yafang Shao
2024-03-14 22:23 ` Yu Zhao
2024-03-15 2:38 ` Yafang Shao
2024-03-15 14:27 ` Johannes Weiner
2024-03-12 21:08 ` Johannes Weiner
2024-03-13 2:08 ` Yu Zhao
2024-03-13 3:22 ` Johannes Weiner
2024-03-13 10:59 ` Hillf Danton
2024-03-01 11:25 ` Hillf Danton