* Regression in workingset_refault latency on 5.15
@ 2022-02-23 13:51 Daniel Dao
  2022-02-23 15:57 ` Shakeel Butt
  2022-02-24  9:22 ` Thorsten Leemhuis
  0 siblings, 2 replies; 38+ messages in thread

From: Daniel Dao @ 2022-02-23 13:51 UTC (permalink / raw)
To: shakeelb
Cc: kernel-team, linux-mm, hannes, guro, feng.tang, mhocko, hdanton, mkoutny, akpm, torvalds

Hi all,

We are observing some regressions in workingset_refault on our newly upgraded
5.15.19 nodes with zram as swap. This manifests in several ways:

1) Regression of workingset_refault duration observed in flamegraphs

We regularly collect flamegraphs for running services on the node. Since the
upgrade to 5.15.19, we see that workingset_refault occupies a more significant
part of the service flamegraph (13%) with the following call trace:

  workingset_refault+0x128
  add_to_page_cache_lru+0x9f
  page_cache_ra_unbounded+0x154
  force_page_cache_ra+0xe2
  filemap_get_pages+0xe9
  filemap_read+0xa4
  xfs_file_buffered_read+0x98
  xfs_file_read_iter+0x6a
  new_sync_read+0x118
  vfs_read+0xf2
  __x64_sys_pread64+0x89
  do_syscall_64+0x3b
  entry_SYSCALL_64_after_hwframe+0x44

2) Regression of userspace performance-sensitive code

We have some performance-sensitive code running in userspace whose runtime is
measured with CLOCK_THREAD_CPUTIME_ID. It looks roughly like:

  now = clock_gettime(CLOCK_THREAD_CPUTIME_ID)
  func()
  elapsed = clock_gettime(CLOCK_THREAD_CPUTIME_ID) - now

Since the 5.15 upgrade, we observe long `elapsed` times in the range of 4-10ms
much more frequently than before. This went away after we disabled swap for the
service using the `memory.swap.max=0` memcg configuration.

The new thing in the 5.15 workingset_refault path seems to be the introduction
of mem_cgroup_flush_stats() by commit 1f828223b7991a228bc2aef837b78737946d44b2
("memcg: flush lruvec stats in the refault").

mem_cgroup_flush_stats() can take quite a long time for us on the standard
systemd cgroupv2 hierarchy (root / system.slice / workload.service):

  sudo /usr/share/bcc/tools/funcslower -m 10 -t mem_cgroup_flush_stats
  Tracing function calls slower than 10 ms... Ctrl+C to quit.
  TIME       COMM        PID      LAT(ms)  RVAL  FUNC
  0.000000   <redacted>  804776   11.50    200   mem_cgroup_flush_stats
  0.343383   <redacted>  647496   10.58    200   mem_cgroup_flush_stats
  0.604309   <redacted>  804776   10.50    200   mem_cgroup_flush_stats
  1.230416   <redacted>  803293   10.01    200   mem_cgroup_flush_stats
  1.248442   <redacted>  646400   11.02    200   mem_cgroup_flush_stats

Given that, could it be that workingset_refault in some unfortunate cases takes
much longer than before, such that it increases the time observed by
CLOCK_THREAD_CPUTIME_ID from userspace, or the overall duration of
workingset_refault observed by perf?

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: Regression in workingset_refault latency on 5.15
  2022-02-23 13:51 Regression in workingset_refault latency on 5.15 Daniel Dao
@ 2022-02-23 15:57 ` Shakeel Butt
  2022-02-23 16:00   ` Shakeel Butt
  2022-02-24  9:22 ` Thorsten Leemhuis
  1 sibling, 1 reply; 38+ messages in thread

From: Shakeel Butt @ 2022-02-23 15:57 UTC (permalink / raw)
To: Daniel Dao, Ivan Babrou
Cc: kernel-team, Linux MM, Johannes Weiner, Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton, Michal Koutný, Andrew Morton, Linus Torvalds

+Ivan Babrou

Hi Daniel,

On Wed, Feb 23, 2022 at 5:51 AM Daniel Dao <dqminh@cloudflare.com> wrote:
>
> Hi all,
>
> We are observing some regressions in workingset_refault on our newly upgraded
> 5.15.19 nodes with zram as swap. This manifests in several ways:
[...]
> could it be that workingset_refault in some unfortunate cases takes much
> longer than before, such that it increases the time observed by
> CLOCK_THREAD_CPUTIME_ID from userspace, or the overall duration of
> workingset_refault observed by perf?

Ivan reported this issue last month and since then we have backported three
patches to 5.15-stable to fix the issue. These got merged into 5.15.18.
However, you are reporting that the issue is still present in 5.15.19.

Can you share a bit more detail on your hardware configuration (number of
CPUs) and, if possible, the flamegraph?

thanks,
Shakeel
* Re: Regression in workingset_refault latency on 5.15
  2022-02-23 15:57 ` Shakeel Butt
@ 2022-02-23 16:00   ` Shakeel Butt
  2022-02-23 17:07     ` Daniel Dao
  0 siblings, 1 reply; 38+ messages in thread

From: Shakeel Butt @ 2022-02-23 16:00 UTC (permalink / raw)
To: Daniel Dao, Ivan Babrou
Cc: kernel-team, Linux MM, Johannes Weiner, Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton, Michal Koutný, Andrew Morton, Linus Torvalds

On Wed, Feb 23, 2022 at 7:57 AM Shakeel Butt <shakeelb@google.com> wrote:
>
[...]
> Can you share a bit more detail on your hardware configuration (number of
> CPUs) and, if possible, the flamegraph?

Also, if you can reproduce the issue, can you try the patch at
https://lore.kernel.org/all/20210929235936.2859271-1-shakeelb@google.com/ ?
* Re: Regression in workingset_refault latency on 5.15
  2022-02-23 16:00 ` Shakeel Butt
@ 2022-02-23 17:07   ` Daniel Dao
  2022-02-23 17:36     ` Shakeel Butt
  0 siblings, 1 reply; 38+ messages in thread

From: Daniel Dao @ 2022-02-23 17:07 UTC (permalink / raw)
To: Shakeel Butt
Cc: Ivan Babrou, kernel-team, Linux MM, Johannes Weiner, Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton, Michal Koutný, Andrew Morton, Linus Torvalds

On Wed, Feb 23, 2022 at 4:00 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> > Can you share a bit more detail on your hardware configuration (num of
> > cpus) and if possible the flamegraph?

We have a mix of 96- and 128-CPU machines. I'm not yet sure if it's possible
to share the flamegraphs. We may have to come back to that later if necessary.

> Also if you can reproduce the issue, can you try the patch at
> https://lore.kernel.org/all/20210929235936.2859271-1-shakeelb@google.com/
> ?

We can give it a try. I also wrote a bpftrace script to get the kernel stack
when we encounter a slow mem_cgroup_flush_stats (with 10ms as the threshold):

  kprobe:mem_cgroup_flush_stats
  {
    @start[tid] = nsecs;
    @stack[tid] = kstack;
  }

  kretprobe:mem_cgroup_flush_stats /@start[tid]/
  {
    $usecs = (nsecs - @start[tid]) / 1000;
    if ($usecs >= 10000) {
      printf("mem_cgroup_flush_stats: %d us\n", $usecs);
      printf("stack: %s\n", @stack[tid]);
    }
    delete(@start[tid]);
    delete(@stack[tid]);
  }

  END
  {
    clear(@start);
    clear(@stack);
  }

Running it on a production node yields output like:

  mem_cgroup_flush_stats: 10697 us
  stack:
    mem_cgroup_flush_stats+1
    workingset_refault+296
    add_to_page_cache_lru+159
    page_cache_ra_unbounded+340
    force_page_cache_ra+226
    filemap_get_pages+233
    filemap_read+164
    xfs_file_buffered_read+152
    xfs_file_read_iter+106
    new_sync_read+277
    vfs_read+242
    __x64_sys_pread64+137
    do_syscall_64+56
    entry_SYSCALL_64_after_hwframe+68

I think the addition of many milliseconds to workingset_refault is too high.
* Re: Regression in workingset_refault latency on 5.15
  2022-02-23 17:07 ` Daniel Dao
@ 2022-02-23 17:36   ` Shakeel Butt
  2022-02-23 19:28     ` Ivan Babrou
  0 siblings, 1 reply; 38+ messages in thread

From: Shakeel Butt @ 2022-02-23 17:36 UTC (permalink / raw)
To: Daniel Dao
Cc: Ivan Babrou, kernel-team, Linux MM, Johannes Weiner, Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton, Michal Koutný, Andrew Morton, Linus Torvalds

On Wed, Feb 23, 2022 at 9:07 AM Daniel Dao <dqminh@cloudflare.com> wrote:
>
> We have a mix of 96- and 128-CPU machines. I'm not yet sure if it's possible
> to share the flamegraphs. We may have to come back to that later if necessary.

Can you please first confirm whether you are running the 5.15.19 kernel? If
that is the case, then it means the machine is seeing (nr_cpus * 32) stat
updates between flushes.

A couple of questions:

1) Did you see any call stack other than the xfs read where an expensive
   mem_cgroup_flush_stats() is showing up?
2) Can you please use similar bpf+kprobe tracing for memcg_rstat_updated()
   (or __mod_memcg_lruvec_state()) to find the source of the frequent stat
   updates?
3) I am still pondering why disabling swap resolves the issue for you. Is
   that only for a workload different from the xfs read?
4) Can you please also trace cgroup_rstat_flush_locked()?
* Re: Regression in workingset_refault latency on 5.15
  2022-02-23 17:36 ` Shakeel Butt
@ 2022-02-23 19:28   ` Ivan Babrou
  2022-02-23 20:28     ` Shakeel Butt
  0 siblings, 1 reply; 38+ messages in thread

From: Ivan Babrou @ 2022-02-23 19:28 UTC (permalink / raw)
To: Shakeel Butt
Cc: Daniel Dao, kernel-team, Linux MM, Johannes Weiner, Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton, Michal Koutný, Andrew Morton, Linus Torvalds

> Can you please first confirm whether you are running the 5.15.19 kernel? If
> that is the case, then it means the machine is seeing (nr_cpus * 32) stat
> updates between flushes.

Yes, we're on 5.15.19.

> 1) Did you see any call stack other than the xfs read where an expensive
>    mem_cgroup_flush_stats() is showing up?

Yes:

  mem_cgroup_flush_stats: 7717 us
  stack:
    mem_cgroup_flush_stats+1
    workingset_refault+296
    __read_swap_cache_async+512
    swapin_readahead+817
    do_swap_page+561
    __handle_mm_fault+2318
    handle_mm_fault+191
    do_user_addr_fault+428
    exc_page_fault+103
    asm_exc_page_fault+30

> 2) Can you please use similar bpf+kprobe tracing for memcg_rstat_updated()
>    (or __mod_memcg_lruvec_state()) to find the source of the frequent stat
>    updates?

memcg_rstat_updated() is static inline, so I traced __mod_memcg_lruvec_state()
instead, with the following:

  bpftrace -e 'kprobe:__mod_memcg_lruvec_state { @stacks[kstack(10)]++ }'

The top stacks are:

  @stacks[
      __mod_memcg_lruvec_state+1
      __mod_lruvec_page_state+96
      clear_page_dirty_for_io+292
      write_cache_pages+323
      iomap_writepages+28
      xfs_vm_writepages+113
      do_writepages+195
      __writeback_single_inode+55
      writeback_sb_inodes+493
      __writeback_inodes_wb+76
  ]: 36429
  @stacks[
      __mod_memcg_lruvec_state+1
      __mod_lruvec_page_state+96
      __test_set_page_writeback+421
      iomap_do_writepage+1241
      write_cache_pages+393
      iomap_writepages+28
      xfs_vm_writepages+113
      do_writepages+195
      __writeback_single_inode+55
      writeback_sb_inodes+493
  ]: 36431
  @stacks[
      __mod_memcg_lruvec_state+1
      mod_objcg_state+421
      memcg_slab_post_alloc_hook+345
      __kmalloc_node_track_caller+386
      __alloc_skb+139
      alloc_skb_with_frags+75
      sock_alloc_send_pskb+495
      unix_stream_sendmsg+599
      sock_sendmsg+94
      sock_write_iter+151
  ]: 39459
  @stacks[
      __mod_memcg_lruvec_state+1
      release_pages+760
      tlb_finish_mmu+89
      zap_page_range+265
      do_madvise.part.0+1844
      __x64_sys_madvise+69
      do_syscall_64+59
      entry_SYSCALL_64_after_hwframe+68
  ]: 54648
  @stacks[
      __mod_memcg_lruvec_state+1
      __mod_lruvec_page_state+96
      page_remove_rmap+85
      unmap_page_range+1413
      zap_page_range+224
      do_madvise.part.0+1844
      __x64_sys_madvise+69
      do_syscall_64+59
      entry_SYSCALL_64_after_hwframe+68
  ]: 54719
  @stacks[
      __mod_memcg_lruvec_state+1
      __pagevec_lru_add+256
      lru_cache_add+82
      __handle_mm_fault+4686
      handle_mm_fault+191
      do_user_addr_fault+428
      exc_page_fault+103
      asm_exc_page_fault+30
  ]: 98103
  @stacks[
      __mod_memcg_lruvec_state+1
      __mod_lruvec_page_state+96
      page_add_new_anon_rmap+78
      __handle_mm_fault+4675
      handle_mm_fault+191
      do_user_addr_fault+428
      exc_page_fault+103
      asm_exc_page_fault+30
  ]: 105071

> 3) I am still pondering why disabling swap resolves the issue for you. Is
>    that only for a workload different from the xfs read?

My understanding is that any block IO (including swap) triggers the new memcg
accounting code. In our process we don't have any IO other than swap, so
disabling swap removes the major (if not the only) vector for triggering this
issue.

> 4) Can you please also trace cgroup_rstat_flush_locked()?

Stacks are:

  bpftrace -e 'kprobe:cgroup_rstat_flush_locked { @stacks[kstack(10)]++ }'
  Attaching 1 probe...
  ^C

  @stacks[
      cgroup_rstat_flush_locked+1
      cgroup_rstat_flush_irqsafe+36
      __mem_cgroup_flush_stats+51
      shrink_node+163
      balance_pgdat+777
      kswapd+508
      kthread+295
      ret_from_fork+34
  ]: 1
  @stacks[
      cgroup_rstat_flush_locked+1
      cgroup_rstat_flush_irqsafe+36
      __mem_cgroup_flush_stats+51
      workingset_refault+296
      __read_swap_cache_async+512
      swapin_readahead+817
      do_swap_page+561
      __handle_mm_fault+2318
      handle_mm_fault+191
      do_user_addr_fault+428
  ]: 5
  @stacks[
      cgroup_rstat_flush_locked+1
      cgroup_rstat_flush_irqsafe+36
      __mem_cgroup_flush_stats+51
      workingset_refault+296
      add_to_page_cache_lru+159
      page_cache_ra_unbounded+340
      filemap_get_pages+569
      filemap_read+164
      xfs_file_buffered_read+152
      xfs_file_read_iter+106
  ]: 18
  @stacks[
      cgroup_rstat_flush_locked+1
      cgroup_rstat_flush_irqsafe+36
      __mem_cgroup_flush_stats+51
      workingset_refault+296
      add_to_page_cache_lru+159
      page_cache_ra_unbounded+340
      filemap_get_pages+233
      filemap_read+164
      xfs_file_buffered_read+152
      xfs_file_read_iter+106
  ]: 31
  @stacks[
      cgroup_rstat_flush_locked+1
      cgroup_rstat_flush_irqsafe+36
      __mem_cgroup_flush_stats+51
      workingset_refault+296
      add_to_page_cache_lru+159
      page_cache_ra_unbounded+340
      force_page_cache_ra+226
      filemap_get_pages+233
      filemap_read+164
      xfs_file_buffered_read+152
  ]: 223

Let me know if you only need the slow ones.

I should mention that there are really two issues:

1. Expensive workingset_refault, which shows up on flamegraphs. We see it for
   our rocksdb-based database, which persists data on xfs (local nvme).
2. Expensive workingset_refault that causes latency hiccups but doesn't show
   up on flamegraphs. We see it in our nginx-based proxy with swap enabled
   (either zram or a regular file on xfs on local nvme).

We solved the latter by disabling swap. I think the proper solution would be
for workingset_refault to be fast enough to be invisible, in line with what
was happening on Linux 5.10.
* Re: Regression in workingset_refault latency on 5.15
  2022-02-23 19:28 ` Ivan Babrou
@ 2022-02-23 20:28   ` Shakeel Butt
  2022-02-23 21:16     ` Ivan Babrou
  0 siblings, 1 reply; 38+ messages in thread

From: Shakeel Butt @ 2022-02-23 20:28 UTC (permalink / raw)
To: Ivan Babrou
Cc: Daniel Dao, kernel-team, Linux MM, Johannes Weiner, Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton, Michal Koutný, Andrew Morton, Linus Torvalds

On Wed, Feb 23, 2022 at 11:28 AM Ivan Babrou <ivan@cloudflare.com> wrote:
>
[...]
> > 2) Can you please use similar bpf+kprobe tracing for memcg_rstat_updated()
> >    (or __mod_memcg_lruvec_state()) to find the source of the frequent stat
> >    updates?
>
> memcg_rstat_updated() is static inline, so I traced
> __mod_memcg_lruvec_state() instead, with the following:
>
> bpftrace -e 'kprobe:__mod_memcg_lruvec_state { @stacks[kstack(10)]++ }'
[...]

Thanks, this is helpful. It seems like most of the stat updates are happening
on anon page faults and, based on the signature, they look like swap refaults.

> > 3) I am still pondering why disabling swap resolves the issue for you. Is
> >    that only for a workload different from the xfs read?
>
> My understanding is that any block IO (including swap) triggers the new memcg
> accounting code. In our process we don't have any IO other than swap, so
> disabling swap removes the major (if not the only) vector for triggering this
> issue.

Now I understand why disabling swap helps your case: the number of stat
updates is reduced drastically, so the rstat flush happens asynchronously
most of the time.

[...]
> I should mention that there are really two issues:
>
> 1. Expensive workingset_refault, which shows up on flamegraphs. We see it for
>    our rocksdb-based database, which persists data on xfs (local nvme).
> 2. Expensive workingset_refault that causes latency hiccups but doesn't show
>    up on flamegraphs. We see it in our nginx-based proxy with swap enabled
>    (either zram or a regular file on xfs on local nvme).
>
> We solved the latter by disabling swap. I think the proper solution would be
> for workingset_refault to be fast enough to be invisible, in line with what
> was happening on Linux 5.10.

Thanks for the info. Is it possible to test
https://lore.kernel.org/all/20210929235936.2859271-1-shakeelb@google.com/ ?

If that patch does not help, then we either have to optimize rstat flushing
or further increase the update buffer, which is nr_cpus * 32.
* Re: Regression in workingset_refault latency on 5.15
  2022-02-23 20:28 ` Shakeel Butt
@ 2022-02-23 21:16   ` Ivan Babrou
  2022-02-24 14:46     ` Daniel Dao
  0 siblings, 1 reply; 38+ messages in thread

From: Ivan Babrou @ 2022-02-23 21:16 UTC (permalink / raw)
To: Shakeel Butt
Cc: Daniel Dao, kernel-team, Linux MM, Johannes Weiner, Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton, Michal Koutný, Andrew Morton, Linus Torvalds

> Thanks for the info. Is it possible to test
> https://lore.kernel.org/all/20210929235936.2859271-1-shakeelb@google.com/
> ?
>
> If that patch does not help, then we either have to optimize rstat flushing
> or further increase the update buffer, which is nr_cpus * 32.

We're working on it and should have some data tomorrow.
* Re: Regression in workingset_refault latency on 5.15
  2022-02-23 21:16 ` Ivan Babrou
@ 2022-02-24 14:46   ` Daniel Dao
  2022-02-24 16:58     ` Shakeel Butt
  0 siblings, 1 reply; 38+ messages in thread

From: Daniel Dao @ 2022-02-24 14:46 UTC (permalink / raw)
To: Ivan Babrou
Cc: Shakeel Butt, kernel-team, Linux MM, Johannes Weiner, Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton, Michal Koutný, Andrew Morton, Linus Torvalds

On Wed, Feb 23, 2022 at 9:16 PM Ivan Babrou <ivan@cloudflare.com> wrote:
>
> > Thanks for the info. Is it possible to test
> > https://lore.kernel.org/all/20210929235936.2859271-1-shakeelb@google.com/

We built the patch on top of 5.15.25 and deployed it with swap-on-zram on some
nodes. Test data follow:

1) Latency of workingset_refault over 60 seconds

  $ sudo /usr/share/bcc/tools/funclatency -d 60 workingset_refault
  Tracing 1 functions for "workingset_refault"... Hit Ctrl-C to end.
       nsecs               : count     distribution
           0 -> 1          : 0        |                                        |
           2 -> 3          : 0        |                                        |
           4 -> 7          : 0        |                                        |
           8 -> 15         : 0        |                                        |
          16 -> 31         : 0        |                                        |
          32 -> 63         : 0        |                                        |
          64 -> 127        : 0        |                                        |
         128 -> 255        : 0        |                                        |
         256 -> 511        : 4643     |**********************                  |
         512 -> 1023       : 4259     |********************                    |
        1024 -> 2047       : 1178     |*****                                   |
        2048 -> 4095       : 4044     |*******************                     |
        4096 -> 8191       : 1804     |********                                |
        8192 -> 16383      : 109      |                                        |
       16384 -> 32767      : 52       |                                        |
       32768 -> 65535      : 45       |                                        |
       65536 -> 131071     : 15       |                                        |
      131072 -> 262143     : 2        |                                        |
      262144 -> 524287     : 0        |                                        |
      524288 -> 1048575    : 7        |                                        |
     1048576 -> 2097151    : 900      |****                                    |
     2097152 -> 4194303    : 8218     |****************************************|
     4194304 -> 8388607    : 967      |****                                    |
     8388608 -> 16777215   : 89       |                                        |

  avg = 1229607 nsecs, total: 32378027676 nsecs, count: 26332

2) Latency of cgroup_rstat_flush_locked over 60 seconds

  $ sudo /usr/share/bcc/tools/funclatency -d 60 cgroup_rstat_flush_locked
  Tracing 1 functions for "cgroup_rstat_flush_locked"... Hit Ctrl-C to end.
       nsecs               : count     distribution
           0 -> 1          : 0        |                                        |
           2 -> 3          : 0        |                                        |
           4 -> 7          : 0        |                                        |
           8 -> 15         : 0        |                                        |
          16 -> 31         : 0        |                                        |
          32 -> 63         : 0        |                                        |
          64 -> 127        : 0        |                                        |
         128 -> 255        : 0        |                                        |
         256 -> 511        : 0        |                                        |
         512 -> 1023       : 0        |                                        |
        1024 -> 2047       : 36       |                                        |
        2048 -> 4095       : 73       |                                        |
        4096 -> 8191       : 151      |                                        |
        8192 -> 16383      : 606      |**                                      |
       16384 -> 32767      : 387      |*                                       |
       32768 -> 65535      : 49       |                                        |
       65536 -> 131071     : 26       |                                        |
      131072 -> 262143     : 6        |                                        |
      262144 -> 524287     : 3        |                                        |
      524288 -> 1048575    : 14       |                                        |
     1048576 -> 2097151    : 2085     |*********                               |
     2097152 -> 4194303    : 8580     |****************************************|
     4194304 -> 8388607    : 772      |***                                     |
     8388608 -> 16777215   : 117      |                                        |
    16777216 -> 33554431   : 1        |                                        |

  avg = 2592894 nsecs, total: 33463898782 nsecs, count: 12906

3) Summary of stack traces when mem_cgroup_flush_stats is over 5ms

  @stackcount[
      mem_cgroup_flush_stats+1
      workingset_refault+296
      __read_swap_cache_async+512
      swap_cluster_readahead+440
      shmem_swapin+164
      shmem_swapin_page+396
      shmem_getpage_gfp+1228
      shmem_fault+104
      __do_fault+55
      __handle_mm_fault+3102
      handle_mm_fault+191
      do_user_addr_fault+428
      exc_page_fault+103
      asm_exc_page_fault+30
  ]: 1
  @stackcount[
      mem_cgroup_flush_stats+1
      workingset_refault+296
      __read_swap_cache_async+512
      swap_cluster_readahead+333
      shmem_swapin+164
      shmem_swapin_page+396
      shmem_getpage_gfp+1228
      shmem_fault+104
      __do_fault+55
      __handle_mm_fault+3649
      handle_mm_fault+191
      do_user_addr_fault+428
      exc_page_fault+103
      asm_exc_page_fault+30
  ]: 1
  @stackcount[
      mem_cgroup_flush_stats+1
      workingset_refault+296
      __read_swap_cache_async+512
      swap_cluster_readahead+333
      shmem_swapin+164
      shmem_swapin_page+396
      shmem_getpage_gfp+1228
      shmem_fault+104
      __do_fault+55
      __handle_mm_fault+3102
      handle_mm_fault+191
      do_user_addr_fault+428
      exc_page_fault+103
      asm_exc_page_fault+30
  ]: 4
  @stackcount[
      mem_cgroup_flush_stats+1
      shrink_node+163
      balance_pgdat+791
      kswapd+508
      kthread+295
      ret_from_fork+34
  ]: 7
  @stackcount[
      mem_cgroup_flush_stats+1
      workingset_refault+296
      add_to_page_cache_lru+159
      page_cache_ra_unbounded+340
      filemap_get_pages+569
      filemap_read+164
      xfs_file_buffered_read+152
      xfs_file_read_iter+106
      new_sync_read+280
      vfs_read+242
      __x64_sys_pread64+137
      __noinstr_text_start+59
      entry_SYSCALL_64_after_hwframe+68
  ]: 10
  @stackcount[
      mem_cgroup_flush_stats+1
      workingset_refault+296
      add_to_page_cache_lru+159
      page_cache_ra_unbounded+340
      filemap_get_pages+233
      filemap_read+164
      xfs_file_buffered_read+152
      xfs_file_read_iter+106
      new_sync_read+280
      vfs_read+242
      __x64_sys_pread64+137
      __noinstr_text_start+59
      entry_SYSCALL_64_after_hwframe+68
  ]: 111
  @stackcount[
      mem_cgroup_flush_stats+1
      workingset_refault+296
      add_to_page_cache_lru+159
      page_cache_ra_unbounded+340
      force_page_cache_ra+226
      filemap_get_pages+233
      filemap_read+164
      xfs_file_buffered_read+152
      xfs_file_read_iter+106
      new_sync_read+280
      vfs_read+242
      __x64_sys_pread64+137
      __noinstr_text_start+59
      entry_SYSCALL_64_after_hwframe+68
  ]: 225

This was collected with the following bpftrace script:

  kprobe:mem_cgroup_flush_stats
  {
    @start[tid] = nsecs;
    @stack[tid] = kstack;
  }

  kretprobe:mem_cgroup_flush_stats /@start[tid]/
  {
    $usecs = (nsecs - @start[tid]) / 1000;
    if ($usecs >= 5000) {
      @stackcount[@stack[tid]]++;
    }
    delete(@start[tid]);
    delete(@stack[tid]);
  }

  interval:s:60
  {
    print(@stackcount);
    exit();
  }

> > If that patch does not help, then we either have to optimize rstat flushing
> > or further increase the update buffer, which is nr_cpus * 32.

I think overall we don't see any significant improvement. The appearance of
expensive mem_cgroup_flush_stats is still there.
* Re: Regression in workingset_refault latency on 5.15
  2022-02-24 14:46 ` Daniel Dao
@ 2022-02-24 16:58   ` Shakeel Butt
  2022-02-24 17:34     ` Daniel Dao
  0 siblings, 1 reply; 38+ messages in thread

From: Shakeel Butt @ 2022-02-24 16:58 UTC (permalink / raw)
To: Daniel Dao
Cc: Ivan Babrou, kernel-team, Linux MM, Johannes Weiner, Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton, Michal Koutný, Andrew Morton, Linus Torvalds

On Thu, Feb 24, 2022 at 02:46:27PM +0000, Daniel Dao wrote:

[...]

> 3) Summary of stack traces when mem_cgroup_flush_stats is over 5ms

Can you please check if flush_memcg_stats_dwork() appears in any stack traces
at all?

[...]

> > If that patch does not help, then we either have to optimize rstat flushing
> > or further increase the update buffer, which is nr_cpus * 32.
>
> I think overall we don't see any significant improvement. The appearance of
> expensive mem_cgroup_flush_stats is still there.

Thanks for testing. At the moment I suspect the async worker is not getting
the CPU. Can you share your CONFIG_HZ setting? Also, can you try the following
patch and see if it helps; otherwise keep halving the delay (i.e. 2HZ -> HZ ->
HZ/2 -> ...) and find at what value the issue you are seeing gets resolved?

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 32ba963ebf2e..198a7baaeeb7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -685,7 +685,7 @@ void mem_cgroup_flush_stats(void)
 static void flush_memcg_stats_dwork(struct work_struct *w)
 {
 	__mem_cgroup_flush_stats();
-	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, 2UL*HZ);
+	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, HZ);
 }
 
 /**
* Re: Regression in workingset_refault latency on 5.15
  2022-02-24 16:58 ` Shakeel Butt
@ 2022-02-24 17:34   ` Daniel Dao
  2022-02-24 18:00     ` Shakeel Butt
  0 siblings, 1 reply; 38+ messages in thread

From: Daniel Dao @ 2022-02-24 17:34 UTC (permalink / raw)
To: Shakeel Butt
Cc: Ivan Babrou, kernel-team, Linux MM, Johannes Weiner, Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton, Michal Koutný, Andrew Morton, Linus Torvalds

On Thu, Feb 24, 2022 at 4:58 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> Can you please check if flush_memcg_stats_dwork() appears in any stack traces
> at all?

Here are the results of probes on flush_memcg_stats_dwork:

  $ sudo /usr/share/bcc/tools/funccount -d 30 flush_memcg_stats_dwork
  Tracing 1 functions for "b'flush_memcg_stats_dwork'"... Hit Ctrl-C to end.
  FUNC                          COUNT
  b'flush_memcg_stats_dwork'    14

  $ sudo /usr/share/bcc/tools/funclatency -d 30 flush_memcg_stats_dwork
  Tracing 1 functions for "flush_memcg_stats_dwork"... Hit Ctrl-C to end.
       nsecs               : count     distribution
           0 -> 1          : 0        |                                        |
           2 -> 3          : 0        |                                        |
           4 -> 7          : 0        |                                        |
           8 -> 15         : 0        |                                        |
          16 -> 31         : 0        |                                        |
          32 -> 63         : 0        |                                        |
          64 -> 127        : 0        |                                        |
         128 -> 255        : 0        |                                        |
         256 -> 511        : 0        |                                        |
         512 -> 1023       : 0        |                                        |
        1024 -> 2047       : 0        |                                        |
        2048 -> 4095       : 0        |                                        |
        4096 -> 8191       : 8        |****************************************|
        8192 -> 16383      : 0        |                                        |
       16384 -> 32767      : 0        |                                        |
       32768 -> 65535      : 0        |                                        |
       65536 -> 131071     : 0        |                                        |
      131072 -> 262143     : 0        |                                        |
      262144 -> 524287     : 0        |                                        |
      524288 -> 1048575    : 0        |                                        |
     1048576 -> 2097151    : 1        |*****                                   |
     2097152 -> 4194303    : 4        |********************                    |
     4194304 -> 8388607    : 2        |**********                              |

  avg = 1725693 nsecs, total: 25885397 nsecs, count: 15

So we trigger the async flush as expected, around every 2 seconds, but it
mostly runs faster than the inline call from workingset_refault. I think on
busy servers with varied workloads that touch swap/page_cache, it's very
likely that most of the cost is in the inline mem_cgroup_flush_stats() of
workingset_refault rather than in the async flush.

> Thanks for testing. At the moment I suspect the async worker is not getting
> the CPU. Can you share your CONFIG_HZ setting? Also, can you try the
> following patch and see if it helps; otherwise keep halving the delay
> (i.e. 2HZ -> HZ -> HZ/2 -> ...) and find at what value the issue you are
> seeing gets resolved?

We have CONFIG_HZ=1000. We can try to increase the frequency of the async
flush, but that seems like not a great bandaid. Is it possible to remove
mem_cgroup_flush_stats() from workingset_refault, or at least scope it down to
some targeted cgroup, so we don't need to flush from the root with a
potentially large set of cgroups to walk?
* Re: Regression in workingset_refault latency on 5.15
  2022-02-24 17:34 ` Daniel Dao
@ 2022-02-24 18:00   ` Shakeel Butt
  2022-02-24 18:52     ` Shakeel Butt
  0 siblings, 1 reply; 38+ messages in thread

From: Shakeel Butt @ 2022-02-24 18:00 UTC (permalink / raw)
To: Daniel Dao
Cc: Ivan Babrou, kernel-team, Linux MM, Johannes Weiner, Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton, Michal Koutný, Andrew Morton, Linus Torvalds

On Thu, Feb 24, 2022 at 9:34 AM Daniel Dao <dqminh@cloudflare.com> wrote:
>
[...]
> We have CONFIG_HZ=1000. We can try to increase the frequency of the async
> flush, but that seems like not a great bandaid. Is it possible to remove
> mem_cgroup_flush_stats() from workingset_refault, or at least scope it down
> to some targeted cgroup, so we don't need to flush from the root with a
> potentially large set of cgroups to walk?

I actually wanted to know what a good rstat flushing frequency would be for
your workload; I am not planning to propose changing the default frequency.

Anyway, I am thinking of introducing mem_cgroup_flush_stats_async(), which
will schedule flush_memcg_stats_dwork() without delay. Let me prepare the
patch based on 5.15-stable for you to test.
* Re: Regression in workingset_refault latency on 5.15 2022-02-24 18:00 ` Shakeel Butt @ 2022-02-24 18:52 ` Shakeel Butt 2022-02-25 10:23 ` Daniel Dao 0 siblings, 1 reply; 38+ messages in thread From: Shakeel Butt @ 2022-02-24 18:52 UTC (permalink / raw) To: Daniel Dao Cc: Ivan Babrou, kernel-team, Linux MM, Johannes Weiner, Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton, Michal Koutný, Andrew Morton, Linus Torvalds On Thu, Feb 24, 2022 at 10:00:55AM -0800, Shakeel Butt wrote: > On Thu, Feb 24, 2022 at 9:34 AM Daniel Dao <dqminh@cloudflare.com> wrote: [...] > Anyways I am thinking of introducing mem_cgroup_flush_stats_async() > which will schedule flush_memcg_stats_dwork() without delay. Let me > prepare the patch based on 5.15-stable for you to test. Can you please try the following patch and let me know the results? diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index d9b8df5ef212..cd732d7e00ca 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1002,6 +1002,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec, } void mem_cgroup_flush_stats(void); +void mem_cgroup_flush_stats_async(void); void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, int val); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 32ba963ebf2e..0f298cbd4763 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -682,6 +682,14 @@ void mem_cgroup_flush_stats(void) __mem_cgroup_flush_stats(); } +void mem_cgroup_flush_stats_async(void) +{ + if (atomic_read(&stats_flush_threshold) > num_online_cpus()) { + mod_delayed_work(system_unbound_wq, &stats_flush_dwork, 0); + atomic_set(&stats_flush_threshold, 0); + } +} + static void flush_memcg_stats_dwork(struct work_struct *w) { __mem_cgroup_flush_stats(); diff --git a/mm/workingset.c b/mm/workingset.c index d5b81e4f4cbe..86d43bfc5c63 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -352,7 +352,7 @@ void workingset_refault(struct page *page, void
*shadow) inc_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file); - mem_cgroup_flush_stats(); + mem_cgroup_flush_stats_async(); /* * Compare the distance to the existing workingset size. We * don't activate pages that couldn't stay resident even if ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: Regression in workingset_refault latency on 5.15 2022-02-24 18:52 ` Shakeel Butt @ 2022-02-25 10:23 ` Daniel Dao 2022-02-25 17:08 ` Ivan Babrou 0 siblings, 1 reply; 38+ messages in thread From: Daniel Dao @ 2022-02-25 10:23 UTC (permalink / raw) To: Shakeel Butt Cc: Ivan Babrou, kernel-team, Linux MM, Johannes Weiner, Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton, Michal Koutný, Andrew Morton, Linus Torvalds On Thu, Feb 24, 2022 at 6:52 PM Shakeel Butt <shakeelb@google.com> wrote: > > On Thu, Feb 24, 2022 at 10:00:55AM -0800, Shakeel Butt wrote: > > On Thu, Feb 24, 2022 at 9:34 AM Daniel Dao <dqminh@cloudflare.com> wrote: > [...] > > Anyways I am thinking of introducing mem_cgroup_flush_stats_async() > > which will schedule flush_memcg_stats_dwork() without delay. Let me > > prepare the patch based on 5.15-stable for you to test. > > > Can you please try the following patch and let me know the results? > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index d9b8df5ef212..cd732d7e00ca 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -1002,6 +1002,7 @@ static inline unsigned long > lruvec_page_state_local(struct lruvec *lruvec, > } > > void mem_cgroup_flush_stats(void); > +void mem_cgroup_flush_stats_async(void); > > void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item > idx, > int val); > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 32ba963ebf2e..0f298cbd4763 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -682,6 +682,14 @@ void mem_cgroup_flush_stats(void) > __mem_cgroup_flush_stats(); > } > > +void mem_cgroup_flush_stats_async(void) > +{ > + if (atomic_read(&stats_flush_threshold) > num_online_cpus()) { > + mod_delayed_work(system_unbound_wq, &stats_flush_dwork, 0); > + atomic_set(&stats_flush_threshold, 0); > + } > +} > + > static void flush_memcg_stats_dwork(struct work_struct *w) > { > __mem_cgroup_flush_stats(); > diff --git a/mm/workingset.c
b/mm/workingset.c > index d5b81e4f4cbe..86d43bfc5c63 100644 > --- a/mm/workingset.c > +++ b/mm/workingset.c > @@ -352,7 +352,7 @@ void workingset_refault(struct page *page, void *shadow) > > inc_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file); > > - mem_cgroup_flush_stats(); > + mem_cgroup_flush_stats_async(); > /* > * Compare the distance to the existing workingset size. We > * don't activate pages that couldn't stay resident even if I think this looks good so far. Some stats: 1) More time spent in flush_memcg_stats_dwork $ sudo /usr/share/bcc/tools/funclatency -d 60 flush_memcg_stats_dwork Tracing 1 functions for "flush_memcg_stats_dwork"... Hit Ctrl-C to end. nsecs : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 0 | | 256 -> 511 : 21 | | 512 -> 1023 : 115 | | 1024 -> 2047 : 47 | | 2048 -> 4095 : 86 | | 4096 -> 8191 : 48 | | 8192 -> 16383 : 2 | | 16384 -> 32767 : 2 | | 32768 -> 65535 : 1 | | 65536 -> 131071 : 52 | | 131072 -> 262143 : 400 |** | 262144 -> 524287 : 2172 |**************** | 524288 -> 1048575 : 4193 |******************************* | 1048576 -> 2097151 : 5357 |****************************************| 2097152 -> 4194303 : 1834 |************* | 4194304 -> 8388607 : 485 |*** | 8388608 -> 16777215 : 3 | | avg = 1346983 nsecs, total: 19959597708 nsecs, count: 14818 2) Less time spent in workingset_refault ~$ sudo /usr/share/bcc/tools/funclatency -d 60 workingset_refault Tracing 1 functions for "workingset_refault"... Hit Ctrl-C to end. 
nsecs : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 0 | | 256 -> 511 : 10318 |****************************************| 512 -> 1023 : 2877 |*********** | 1024 -> 2047 : 1089 |**** | 2048 -> 4095 : 4785 |****************** | 4096 -> 8191 : 4528 |***************** | 8192 -> 16383 : 1270 |**** | 16384 -> 32767 : 82 | | 32768 -> 65535 : 80 | | 65536 -> 131071 : 20 | | 131072 -> 262143 : 2 | | 262144 -> 524287 : 0 | | 524288 -> 1048575 : 1 | | avg = 2705 nsecs, total: 67782139 nsecs, count: 25052 ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Regression in workingset_refault latency on 5.15 2022-02-25 10:23 ` Daniel Dao @ 2022-02-25 17:08 ` Ivan Babrou 2022-02-25 17:22 ` Shakeel Butt 2022-02-25 18:03 ` Michal Koutný 0 siblings, 2 replies; 38+ messages in thread From: Ivan Babrou @ 2022-02-25 17:08 UTC (permalink / raw) To: Daniel Dao Cc: Shakeel Butt, kernel-team, Linux MM, Johannes Weiner, Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton, Michal Koutný, Andrew Morton, Linus Torvalds On Fri, Feb 25, 2022 at 2:23 AM Daniel Dao <dqminh@cloudflare.com> wrote: > I think this looks good so far. I compared a flamegraph before to a flamegraph after (10s @ 99Hz on 96-core CPU evenly loaded to ~75% in both cases). Before: 1.4% spent in workingset_refault. After: 0.5% spent in flush_memcg_stats_dwork. The latter is all in kworkers (as expected), while the former is spread across IO active tasks. This seems like a great first step that should be merged on its own. It would be good to also do something to improve the CPU time spent in delayed work, if possible, as 0.5% of on-CPU time is not a negligible amount. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Regression in workingset_refault latency on 5.15 2022-02-25 17:08 ` Ivan Babrou @ 2022-02-25 17:22 ` Shakeel Butt 2022-02-25 18:03 ` Michal Koutný 1 sibling, 0 replies; 38+ messages in thread From: Shakeel Butt @ 2022-02-25 17:22 UTC (permalink / raw) To: Ivan Babrou Cc: Daniel Dao, kernel-team, Linux MM, Johannes Weiner, Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton, Michal Koutný, Andrew Morton, Linus Torvalds On Fri, Feb 25, 2022 at 9:09 AM Ivan Babrou <ivan@cloudflare.com> wrote: > > On Fri, Feb 25, 2022 at 2:23 AM Daniel Dao <dqminh@cloudflare.com> wrote: > > I think this looks good so far. > > I compared a flamegraph before to a flamegraph after (10s @ 99Hz on > 96-core CPU evenly loaded to ~75% in both cases). > > Before: 1.4% spent in workingset_refault. > After: 0.5% spent in flush_memcg_stats_dwork. > > The latter is all in kworkers (as expected), while the former is > spread across IO active tasks. > > This seems like a great first step that should be merged on its own. > It would be good to also do something to improve the CPU time spent in > delayed work, if possible, as 0.5% of on-CPU time is not a negligible > amount. I will send the official async patch soon and I agree that we need to improve rstat flush infrastructure as well but that would need more thought and time. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Regression in workingset_refault latency on 5.15 2022-02-25 17:08 ` Ivan Babrou 2022-02-25 17:22 ` Shakeel Butt @ 2022-02-25 18:03 ` Michal Koutný 2022-02-25 18:08 ` Ivan Babrou 1 sibling, 1 reply; 38+ messages in thread From: Michal Koutný @ 2022-02-25 18:03 UTC (permalink / raw) To: Ivan Babrou Cc: Daniel Dao, Shakeel Butt, kernel-team, Linux MM, Johannes Weiner, Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton, Andrew Morton, Linus Torvalds On Fri, Feb 25, 2022 at 09:08:50AM -0800, Ivan Babrou <ivan@cloudflare.com> wrote: > This seems like a great first step that should be merged on its own. > It would be good to also do something to improve the CPU time spent in > delayed work, if possible, as 0.5% of on-CPU time is not a negligible > amount. BTW how many levels deep is the affected memory cgroup hierarchy (where the workingset_refault happens)? (Self answer: Probably less than nr_cpus*MEMCG_BATCH, so not relevant.) Thanks, Michal ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Regression in workingset_refault latency on 5.15 2022-02-25 18:03 ` Michal Koutný @ 2022-02-25 18:08 ` Ivan Babrou 2022-02-28 23:09 ` Shakeel Butt 0 siblings, 1 reply; 38+ messages in thread From: Ivan Babrou @ 2022-02-25 18:08 UTC (permalink / raw) To: Michal Koutný Cc: Daniel Dao, Shakeel Butt, kernel-team, Linux MM, Johannes Weiner, Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton, Andrew Morton, Linus Torvalds On Fri, Feb 25, 2022 at 10:03 AM Michal Koutný <mkoutny@suse.com> wrote: > BTW how many levels deep is the affected memory cgroup hierarchy (where > the workingset_refault happens)? (Self answer: Probably less than > nr_cpus*MEMCG_BATCH, so not relevant.) It's /system.slice/thingy.service (a regular systemd unit on cgroupv2), so not very deep. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Regression in workingset_refault latency on 5.15 2022-02-25 18:08 ` Ivan Babrou @ 2022-02-28 23:09 ` Shakeel Butt 2022-02-28 23:34 ` Ivan Babrou 0 siblings, 1 reply; 38+ messages in thread From: Shakeel Butt @ 2022-02-28 23:09 UTC (permalink / raw) To: Ivan Babrou Cc: Michal Koutný, Daniel Dao, kernel-team, Linux MM, Johannes Weiner, Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton, Andrew Morton, Linus Torvalds On Fri, Feb 25, 2022 at 10:08:04AM -0800, Ivan Babrou wrote: > On Fri, Feb 25, 2022 at 10:03 AM Michal Koutný <mkoutny@suse.com> wrote: > > BTW how many levels deep is the affected memory cgroup hierarchy (where > > the workingset_refault happens)? (Self answer: Probably less than > > nr_cpus*MEMCG_BATCH, so not relevant.) > It's /system.slice/thingy.service (a regular systemd unit on > cgroupv2), so not very deep. Hi Ivan & Daniel, Can you please try the following patch but without the async patch and let us know if there is any difference in regression you are observing?
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 0abbd685703b..59014d20149a 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -999,6 +999,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec, } void mem_cgroup_flush_stats(void); +void mem_cgroup_flush_stats_memcg(struct mem_cgroup *memcg); void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, int val); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 36e9f38c919d..ae59bfc8788a 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -642,14 +642,14 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val) } } -static void __mem_cgroup_flush_stats(void) +static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg) { unsigned long flag; if (!spin_trylock_irqsave(&stats_flush_lock, flag)) return; - cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup); + cgroup_rstat_flush_irqsafe(memcg->css.cgroup); atomic_set(&stats_flush_threshold, 0); spin_unlock_irqrestore(&stats_flush_lock, flag); } @@ -657,12 +657,18 @@ static void __mem_cgroup_flush_stats(void) void mem_cgroup_flush_stats(void) { if (atomic_read(&stats_flush_threshold) > num_online_cpus()) - __mem_cgroup_flush_stats(); + __mem_cgroup_flush_stats(root_mem_cgroup); +} + +void mem_cgroup_flush_stats_memcg(struct mem_cgroup *memcg) +{ + if (atomic_read(&stats_flush_threshold) > num_online_cpus()) + __mem_cgroup_flush_stats(memcg); } static void flush_memcg_stats_dwork(struct work_struct *w) { - __mem_cgroup_flush_stats(); + __mem_cgroup_flush_stats(root_mem_cgroup); queue_delayed_work(system_unbound_wq, &stats_flush_dwork, 2UL*HZ); } diff --git a/mm/workingset.c b/mm/workingset.c index 8c03afe1d67c..9b9ebc5e5110 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -354,7 +354,7 @@ void workingset_refault(struct folio *folio, void *shadow) mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr); - mem_cgroup_flush_stats(); + 
mem_cgroup_flush_stats_memcg(eviction_memcg); /* * Compare the distance to the existing workingset size. We * don't activate pages that couldn't stay resident even if ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: Regression in workingset_refault latency on 5.15 2022-02-28 23:09 ` Shakeel Butt @ 2022-02-28 23:34 ` Ivan Babrou 2022-02-28 23:43 ` Shakeel Butt 2022-03-02 0:48 ` Ivan Babrou 0 siblings, 2 replies; 38+ messages in thread From: Ivan Babrou @ 2022-02-28 23:34 UTC (permalink / raw) To: Shakeel Butt Cc: Michal Koutný, Daniel Dao, kernel-team, Linux MM, Johannes Weiner, Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton, Andrew Morton, Linus Torvalds On Mon, Feb 28, 2022 at 3:09 PM Shakeel Butt <shakeelb@google.com> wrote: > Hi Ivan & Daniel, > > Can you please try the following patch but without the async patch and > let us know if there is any difference in regression you are observing? Hey Shakeel, I scheduled some reboots for overnight, one machine in each group: 1. Control on 5.15.25 2. Your previous patch with delayed flush on top of 5.15.25 3. Your current patch with targeted flush on top of 5.15.25 Either Daniel or I should have some numbers tomorrow. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Regression in workingset_refault latency on 5.15 2022-02-28 23:34 ` Ivan Babrou @ 2022-02-28 23:43 ` Shakeel Butt 2022-03-02 0:48 ` Ivan Babrou 1 sibling, 0 replies; 38+ messages in thread From: Shakeel Butt @ 2022-02-28 23:43 UTC (permalink / raw) To: Ivan Babrou Cc: Michal Koutný, Daniel Dao, kernel-team, Linux MM, Johannes Weiner, Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton, Andrew Morton, Linus Torvalds On Mon, Feb 28, 2022 at 3:34 PM Ivan Babrou <ivan@cloudflare.com> wrote: > > On Mon, Feb 28, 2022 at 3:09 PM Shakeel Butt <shakeelb@google.com> wrote: > > Hi Ivan & Daniel, > > > > Can you please try the following patch but without the async patch and > > let us know if there is any difference in regression you are observing? > > Hey Shakeel, > > I scheduled some reboots for overnight, one machine in each group: > > 1. Control on 5.15.25 > 2. Your previous patch with delayed flush on top of 5.15.25 > 3. You current patch with targeted flush on top of 5.15.25 > > Either Daniel or I should have some numbers tomorrow. Thanks a lot Ivan. I am also working on improving mem_cgroup_css_rstat_flush() and cgroup_rstat_flush_locked() which will be orthogonal to the patches you are testing. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Regression in workingset_refault latency on 5.15 2022-02-28 23:34 ` Ivan Babrou 2022-02-28 23:43 ` Shakeel Butt @ 2022-03-02 0:48 ` Ivan Babrou 2022-03-02 2:50 ` Shakeel Butt 2022-03-02 10:08 ` Michal Koutný 1 sibling, 2 replies; 38+ messages in thread From: Ivan Babrou @ 2022-03-02 0:48 UTC (permalink / raw) To: Shakeel Butt Cc: Michal Koutný, Daniel Dao, kernel-team, Linux MM, Johannes Weiner, Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton, Andrew Morton, Linus Torvalds On Mon, Feb 28, 2022 at 3:34 PM Ivan Babrou <ivan@cloudflare.com> wrote: > I scheduled some reboots for overnight, one machine in each group: > > 1. Control on 5.15.25 > 2. Your previous patch with delayed flush on top of 5.15.25 > 3. Your current patch with targeted flush on top of 5.15.25 > > Either Daniel or I should have some numbers tomorrow. Numbers for #2 (previous patch): $ sudo /usr/share/bcc/tools/funclatency -d 60 workingset_refault Tracing 1 functions for "workingset_refault"... Hit Ctrl-C to end. nsecs : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 0 | | 256 -> 511 : 27313 |****************************************| 512 -> 1023 : 11887 |***************** | 1024 -> 2047 : 3718 |***** | 2048 -> 4095 : 7095 |********** | 4096 -> 8191 : 11860 |***************** | 8192 -> 16383 : 4193 |****** | 16384 -> 32767 : 470 | | 32768 -> 65535 : 320 | | 65536 -> 131071 : 107 | | 131072 -> 262143 : 20 | | 262144 -> 524287 : 2 | | avg = 2929 nsecs, total: 196247510 nsecs, count: 66985 Numbers for #3 (the latest patch): $ sudo /usr/share/bcc/tools/funclatency -d 60 workingset_refault Tracing 1 functions for "workingset_refault"... Hit Ctrl-C to end. 
nsecs : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 0 | | 256 -> 511 : 6520 |****************************************| 512 -> 1023 : 3502 |********************* | 1024 -> 2047 : 726 |**** | 2048 -> 4095 : 2206 |************* | 4096 -> 8191 : 1078 |****** | 8192 -> 16383 : 70 | | 16384 -> 32767 : 39 | | 32768 -> 65535 : 28 | | 65536 -> 131071 : 8 | | 131072 -> 262143 : 1 | | 262144 -> 524287 : 11 | | 524288 -> 1048575 : 118 | | 1048576 -> 2097151 : 3722 |********************** | 2097152 -> 4194303 : 5854 |*********************************** | 4194304 -> 8388607 : 570 |*** | 8388608 -> 16777215 : 122 | | avg = 1100789 nsecs, total: 27051906434 nsecs, count: 24575 Looks like you were right that targeted flush is not going to be as good. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Regression in workingset_refault latency on 5.15 2022-03-02 0:48 ` Ivan Babrou @ 2022-03-02 2:50 ` Shakeel Butt 2022-03-02 3:40 ` Ivan Babrou 2022-03-02 11:49 ` Frank Hofmann 2022-03-02 10:08 ` Michal Koutný 1 sibling, 2 replies; 38+ messages in thread From: Shakeel Butt @ 2022-03-02 2:50 UTC (permalink / raw) To: Ivan Babrou Cc: Michal Koutný, Daniel Dao, kernel-team, Linux MM, Johannes Weiner, Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton, Andrew Morton, Linus Torvalds On Tue, Mar 01, 2022 at 04:48:00PM -0800, Ivan Babrou wrote: [...] > Looks like you were right that targeted flush is not going to be as good. Thanks a lot. Can you please try the following patch (independent of other patches) as well? diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index d9b8df5ef212..499f75e066f3 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -75,29 +75,8 @@ enum mem_cgroup_events_target { MEM_CGROUP_NTARGETS, }; -struct memcg_vmstats_percpu { - /* Local (CPU and cgroup) page state & events */ - long state[MEMCG_NR_STAT]; - unsigned long events[NR_VM_EVENT_ITEMS]; - - /* Delta calculation for lockless upward propagation */ - long state_prev[MEMCG_NR_STAT]; - unsigned long events_prev[NR_VM_EVENT_ITEMS]; - - /* Cgroup1: threshold notifications & softlimit tree updates */ - unsigned long nr_page_events; - unsigned long targets[MEM_CGROUP_NTARGETS]; -}; - -struct memcg_vmstats { - /* Aggregated (CPU and subtree) page state & events */ - long state[MEMCG_NR_STAT]; - unsigned long events[NR_VM_EVENT_ITEMS]; - - /* Pending child counts during tree propagation */ - long state_pending[MEMCG_NR_STAT]; - unsigned long events_pending[NR_VM_EVENT_ITEMS]; -}; +struct memcg_vmstats_percpu; +struct memcg_vmstats; struct mem_cgroup_reclaim_iter { struct mem_cgroup *position; @@ -304,7 +283,7 @@ struct mem_cgroup { MEMCG_PADDING(_pad1_); /* memory.stat */ - struct memcg_vmstats vmstats; + struct memcg_vmstats *vmstats; /* 
memory.events */ atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS]; @@ -964,11 +943,7 @@ static inline void mod_memcg_state(struct mem_cgroup *memcg, local_irq_restore(flags); } -static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx) -{ - return READ_ONCE(memcg->vmstats.state[idx]); -} - +unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx); static inline unsigned long lruvec_page_state(struct lruvec *lruvec, enum node_stat_item idx) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 32ba963ebf2e..c65f155c2048 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -688,6 +688,71 @@ static void flush_memcg_stats_dwork(struct work_struct *w) queue_delayed_work(system_unbound_wq, &stats_flush_dwork, 2UL*HZ); } +static const unsigned int memcg_vm_events[] = { + PGPGIN, + PGPGOUT, + PGFAULT, + PGMAJFAULT, + PGREFILL, + PGSCAN_KSWAPD, + PGSCAN_DIRECT, + PGSTEAL_KSWAPD, + PGSTEAL_DIRECT, + PGACTIVATE, + PGDEACTIVATE, + PGLAZYFREE, + PGLAZYFREED, +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + THP_FAULT_ALLOC, + THP_COLLAPSE_ALLOC, +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ +}; +#define NR_MEMCG_EVENTS ARRAY_SIZE(memcg_vm_events) + +static int memcg_events_index[NR_VM_EVENT_ITEMS] __read_mostly; + +static void init_memcg_events(void) +{ + int i; + + for (i = 0; i < NR_MEMCG_EVENTS; ++i) + memcg_events_index[memcg_vm_events[i]] = i + 1; +} + +static int get_memcg_events_index(enum vm_event_item idx) +{ + return memcg_events_index[idx] - 1; +} + +struct memcg_vmstats_percpu { + /* Local (CPU and cgroup) page state & events */ + long state[MEMCG_NR_STAT]; + unsigned long events[NR_MEMCG_EVENTS]; + + /* Delta calculation for lockless upward propagation */ + long state_prev[MEMCG_NR_STAT]; + unsigned long events_prev[NR_MEMCG_EVENTS]; + + /* Cgroup1: threshold notifications & softlimit tree updates */ + unsigned long nr_page_events; + unsigned long targets[MEM_CGROUP_NTARGETS]; +}; + +struct memcg_vmstats { + /* Aggregated (CPU and subtree) page 
state & events */ + long state[MEMCG_NR_STAT]; + unsigned long events[NR_MEMCG_EVENTS]; + + /* Pending child counts during tree propagation */ + long state_pending[MEMCG_NR_STAT]; + unsigned long events_pending[NR_MEMCG_EVENTS]; +}; + +unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx) +{ + return READ_ONCE(memcg->vmstats->state[idx]); +} + /** * __mod_memcg_state - update cgroup memory statistics * @memcg: the memory cgroup @@ -831,25 +896,33 @@ static inline void mod_objcg_mlstate(struct obj_cgroup *objcg, void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx, unsigned long count) { - if (mem_cgroup_disabled()) + int index = get_memcg_events_index(idx); + + if (mem_cgroup_disabled() || index < 0) return; - __this_cpu_add(memcg->vmstats_percpu->events[idx], count); + __this_cpu_add(memcg->vmstats_percpu->events[index], count); memcg_rstat_updated(memcg, count); } static unsigned long memcg_events(struct mem_cgroup *memcg, int event) { - return READ_ONCE(memcg->vmstats.events[event]); + int index = get_memcg_events_index(event); + + if (index < 0) + return 0; + return READ_ONCE(memcg->vmstats->events[index]); } static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event) { long x = 0; int cpu; + int index = get_memcg_events_index(event); - for_each_possible_cpu(cpu) - x += per_cpu(memcg->vmstats_percpu->events[event], cpu); + if (index >= 0) + for_each_possible_cpu(cpu) + x += per_cpu(memcg->vmstats_percpu->events[index], cpu); return x; } @@ -5147,6 +5220,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) for_each_node(node) free_mem_cgroup_per_node_info(memcg, node); + kfree(memcg->vmstats); free_percpu(memcg->vmstats_percpu); kfree(memcg); } @@ -5180,6 +5254,10 @@ static struct mem_cgroup *mem_cgroup_alloc(void) goto fail; } + memcg->vmstats = kzalloc(sizeof(struct memcg_vmstats), GFP_KERNEL); + if (!memcg->vmstats) + goto fail; + memcg->vmstats_percpu = alloc_percpu_gfp(struct memcg_vmstats_percpu, 
GFP_KERNEL_ACCOUNT); if (!memcg->vmstats_percpu) @@ -5248,6 +5326,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) page_counter_init(&memcg->kmem, &parent->kmem); page_counter_init(&memcg->tcpmem, &parent->tcpmem); } else { + init_memcg_events(); page_counter_init(&memcg->memory, NULL); page_counter_init(&memcg->swap, NULL); page_counter_init(&memcg->kmem, NULL); @@ -5400,9 +5479,9 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu) * below us. We're in a per-cpu loop here and this is * a global counter, so the first cycle will get them. */ - delta = memcg->vmstats.state_pending[i]; + delta = memcg->vmstats->state_pending[i]; if (delta) - memcg->vmstats.state_pending[i] = 0; + memcg->vmstats->state_pending[i] = 0; /* Add CPU changes on this level since the last flush */ v = READ_ONCE(statc->state[i]); @@ -5415,15 +5494,15 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu) continue; /* Aggregate counts on this level and propagate upwards */ - memcg->vmstats.state[i] += delta; + memcg->vmstats->state[i] += delta; if (parent) - parent->vmstats.state_pending[i] += delta; + parent->vmstats->state_pending[i] += delta; } - for (i = 0; i < NR_VM_EVENT_ITEMS; i++) { - delta = memcg->vmstats.events_pending[i]; + for (i = 0; i < NR_MEMCG_EVENTS; i++) { + delta = memcg->vmstats->events_pending[i]; if (delta) - memcg->vmstats.events_pending[i] = 0; + memcg->vmstats->events_pending[i] = 0; v = READ_ONCE(statc->events[i]); if (v != statc->events_prev[i]) { @@ -5434,9 +5513,9 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu) if (!delta) continue; - memcg->vmstats.events[i] += delta; + memcg->vmstats->events[i] += delta; if (parent) - parent->vmstats.events_pending[i] += delta; + parent->vmstats->events_pending[i] += delta; } for_each_node_state(nid, N_MEMORY) { -- 2.35.1.574.g5d30c73bfb-goog ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: Regression in workingset_refault latency on 5.15 2022-03-02 2:50 ` Shakeel Butt @ 2022-03-02 3:40 ` Ivan Babrou 2022-03-02 22:33 ` Ivan Babrou 2022-03-02 11:49 ` Frank Hofmann 1 sibling, 1 reply; 38+ messages in thread From: Ivan Babrou @ 2022-03-02 3:40 UTC (permalink / raw) To: Shakeel Butt Cc: Michal Koutný, Daniel Dao, kernel-team, Linux MM, Johannes Weiner, Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton, Andrew Morton, Linus Torvalds On Tue, Mar 1, 2022 at 6:50 PM Shakeel Butt <shakeelb@google.com> wrote: > > On Tue, Mar 01, 2022 at 04:48:00PM -0800, Ivan Babrou wrote: > > [...] > > > Looks like you were right that targeted flush is not going to be as good. > > > Thanks a lot. Can you please try the following patch (independent of other > patches) as well? Sure thing, I'll come back with the numbers tomorrow. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Regression in workingset_refault latency on 5.15 2022-03-02 3:40 ` Ivan Babrou @ 2022-03-02 22:33 ` Ivan Babrou 2022-03-03 2:32 ` Shakeel Butt 2022-03-03 2:35 ` Shakeel Butt 0 siblings, 2 replies; 38+ messages in thread From: Ivan Babrou @ 2022-03-02 22:33 UTC (permalink / raw) To: Shakeel Butt Cc: Michal Koutný, Daniel Dao, kernel-team, Linux MM, Johannes Weiner, Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton, Andrew Morton, Linus Torvalds On Tue, Mar 1, 2022 at 7:40 PM Ivan Babrou <ivan@cloudflare.com> wrote: > > On Tue, Mar 1, 2022 at 6:50 PM Shakeel Butt <shakeelb@google.com> wrote: > > > > On Tue, Mar 01, 2022 at 04:48:00PM -0800, Ivan Babrou wrote: > > > > [...] > > > > > Looks like you were right that targeted flush is not going to be as good. > > > > > > Thanks a lot. Can you please try the following patch (independent of other > > patches) as well? > > Sure thing, I'll come back with the numbers tomorrow. Measurements from different configurations: * Baseline 5.15.25: $ sudo /usr/share/bcc/tools/funclatency -d 60 workingset_refault Tracing 1 functions for "workingset_refault"... Hit Ctrl-C to end. nsecs : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 0 | | 256 -> 511 : 15931 |****************************************| 512 -> 1023 : 7040 |***************** | 1024 -> 2047 : 1566 |*** | 2048 -> 4095 : 3844 |********* | 4096 -> 8191 : 1689 |**** | 8192 -> 16383 : 115 | | 16384 -> 32767 : 54 | | 32768 -> 65535 : 42 | | 65536 -> 131071 : 14 | | 131072 -> 262143 : 4 | | 262144 -> 524287 : 4 | | 524288 -> 1048575 : 92 | | 1048576 -> 2097151 : 2322 |***** | 2097152 -> 4194303 : 8244 |******************** | 4194304 -> 8388607 : 775 |* | 8388608 -> 16777215 : 73 | | avg = 776487 nsecs, total: 32464181849 nsecs, count: 41809 * Delayed flush: $ sudo /usr/share/bcc/tools/funclatency -d 60 workingset_refault Tracing 1 functions for "workingset_refault"... 
Hit Ctrl-C to end. nsecs : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 0 | | 256 -> 511 : 16271 |****************************************| 512 -> 1023 : 8147 |******************** | 1024 -> 2047 : 2743 |****** | 2048 -> 4095 : 3964 |********* | 4096 -> 8191 : 7137 |***************** | 8192 -> 16383 : 4039 |********* | 16384 -> 32767 : 352 | | 32768 -> 65535 : 214 | | 65536 -> 131071 : 80 | | 131072 -> 262143 : 15 | | 262144 -> 524287 : 0 | | 524288 -> 1048575 : 1 | | avg = 3248 nsecs, total: 139586555 nsecs, count: 42963 * Targeted flush: $ sudo /usr/share/bcc/tools/funclatency -d 60 workingset_refault Tracing 1 functions for "workingset_refault"... Hit Ctrl-C to end. nsecs : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 0 | | 256 -> 511 : 25298 |****************************************| 512 -> 1023 : 13334 |********************* | 1024 -> 2047 : 1519 |** | 2048 -> 4095 : 4265 |****** | 4096 -> 8191 : 2339 |*** | 8192 -> 16383 : 162 | | 16384 -> 32767 : 73 | | 32768 -> 65535 : 70 | | 65536 -> 131071 : 23 | | 131072 -> 262143 : 1 | | 262144 -> 524287 : 3 | | 524288 -> 1048575 : 55 | | 1048576 -> 2097151 : 1409 |** | 2097152 -> 4194303 : 8306 |************* | 4194304 -> 8388607 : 1139 |* | 8388608 -> 16777215 : 89 | | avg = 589361 nsecs, total: 34233041412 nsecs, count: 5808 * Your most recent patch: $ sudo /usr/share/bcc/tools/funclatency -d 60 workingset_refault Tracing 1 functions for "workingset_refault"... Hit Ctrl-C to end. 
nsecs : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 0 | | 256 -> 511 : 37051 |****************************************| 512 -> 1023 : 17144 |****************** | 1024 -> 2047 : 1051 |* | 2048 -> 4095 : 2705 |** | 4096 -> 8191 : 1911 |** | 8192 -> 16383 : 184 | | 16384 -> 32767 : 93 | | 32768 -> 65535 : 70 | | 65536 -> 131071 : 30 | | 131072 -> 262143 : 21 | | 262144 -> 524287 : 216 | | 524288 -> 1048575 : 1100 |* | 1048576 -> 2097151 : 6522 |******* | 2097152 -> 4194303 : 5903 |****** | 4194304 -> 8388607 : 467 | | 8388608 -> 16777215 : 19 | | avg = 403261 nsecs, total: 30037729036 nsecs, count: 74487 Looks like there are plenty of longer pauses still. I can also get the numbers from 5.10 for you if you want. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Regression in workingset_refault latency on 5.15 2022-03-02 22:33 ` Ivan Babrou @ 2022-03-03 2:32 ` Shakeel Butt 2022-03-03 2:35 ` Shakeel Butt 1 sibling, 0 replies; 38+ messages in thread From: Shakeel Butt @ 2022-03-03 2:32 UTC (permalink / raw) To: Ivan Babrou Cc: Michal Koutný, Daniel Dao, kernel-team, Linux MM, Johannes Weiner, Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton, Andrew Morton, Linus Torvalds On Wed, Mar 2, 2022 at 2:33 PM Ivan Babrou <ivan@cloudflare.com> wrote: > > On Tue, Mar 1, 2022 at 7:40 PM Ivan Babrou <ivan@cloudflare.com> wrote: > > > > On Tue, Mar 1, 2022 at 6:50 PM Shakeel Butt <shakeelb@google.com> wrote: > > > > > > On Tue, Mar 01, 2022 at 04:48:00PM -0800, Ivan Babrou wrote: > > > > > > [...] > > > > > > > Looks like you were right that targeted flush is not going to be as good. > > > > > > > > > Thanks a lot. Can you please try the following patch (independent of other > > > patches) as well? > > > > Sure thing, I'll come back with the numbers tomorrow. > > Measurements from different configurations: [...] > avg = 403261 nsecs, total: 30037729036 nsecs, count: 74487 > > Looks like there are plenty of longer pauses still. Thanks for running this. > > I can also get the numbers from 5.10 for you if you want. Yes please, that would be good to compare with. Also I am going to send a new patch in a couple of minutes for you to try and you don't need to run the "Targeted flush" and the last one which reduces the memcg_vmstats size. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Regression in workingset_refault latency on 5.15 2022-03-02 22:33 ` Ivan Babrou 2022-03-03 2:32 ` Shakeel Butt @ 2022-03-03 2:35 ` Shakeel Butt 2022-03-04 0:21 ` Ivan Babrou 1 sibling, 1 reply; 38+ messages in thread
From: Shakeel Butt @ 2022-03-03 2:35 UTC (permalink / raw)
To: Ivan Babrou
Cc: Michal Koutný, Daniel Dao, kernel-team, Linux MM, Johannes Weiner, Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton, Andrew Morton, Linus Torvalds

On Wed, Mar 02, 2022 at 02:33:41PM -0800, Ivan Babrou wrote:
> On Tue, Mar 1, 2022 at 7:40 PM Ivan Babrou <ivan@cloudflare.com> wrote:
[...]

Please try the following patch:

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d9b8df5ef212..274e4ecff534 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1002,6 +1002,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 }
 
 void mem_cgroup_flush_stats(void);
+void mem_cgroup_flush_stats_if_late(void);
 
 void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 			      int val);
@@ -1422,6 +1423,10 @@ static inline void mem_cgroup_flush_stats(void)
 {
 }
 
+static inline void mem_cgroup_flush_stats_if_late(void)
+{
+}
+
 static inline void __mod_memcg_lruvec_state(struct lruvec *lruvec,
 					    enum node_stat_item idx, int val)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 32ba963ebf2e..00c924e25ce8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -650,6 +650,9 @@ static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
 static DEFINE_SPINLOCK(stats_flush_lock);
 static DEFINE_PER_CPU(unsigned int, stats_updates);
 static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
+static u64 flush_next_time;
+
+#define FLUSH_TIME (2UL*HZ)
 
 static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 {
@@ -671,6 +674,7 @@ static void __mem_cgroup_flush_stats(void)
 	if (!spin_trylock_irqsave(&stats_flush_lock, flag))
 		return;
 
+	flush_next_time = jiffies_64 + 2*FLUSH_TIME;
 	cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup);
 	atomic_set(&stats_flush_threshold, 0);
 	spin_unlock_irqrestore(&stats_flush_lock, flag);
@@ -682,10 +686,16 @@ void mem_cgroup_flush_stats(void)
 		__mem_cgroup_flush_stats();
 }
 
+void mem_cgroup_flush_stats_if_late(void)
+{
+	if (time_after64(jiffies_64, flush_next_time))
+		mem_cgroup_flush_stats();
+}
+
 static void flush_memcg_stats_dwork(struct work_struct *w)
 {
 	__mem_cgroup_flush_stats();
-	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, 2UL*HZ);
+	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
 }
 
 /**
@@ -4518,7 +4528,7 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
 	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
 	struct mem_cgroup *parent;
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats_if_late();
 
 	*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
 	*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 74296c2d1fed..fd279621c3fc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3041,7 +3041,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	 * Flush the memory cgroup stats, so that we read accurate per-memcg
 	 * lruvec stats for heuristics.
 	 */
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats_if_late();
 
 	memset(&sc->nr, 0, sizeof(sc->nr));
diff --git a/mm/workingset.c b/mm/workingset.c
index d5b81e4f4cbe..7edc52037e38 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -352,7 +352,7 @@ void workingset_refault(struct page *page, void *shadow)
 
 	inc_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file);
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats_if_late();
 	/*
 	 * Compare the distance to the existing workingset size. We
 	 * don't activate pages that couldn't stay resident even if

^ permalink raw reply related	[flat|nested] 38+ messages in thread
* Re: Regression in workingset_refault latency on 5.15 2022-03-03 2:35 ` Shakeel Butt @ 2022-03-04 0:21 ` Ivan Babrou 2022-03-04 1:05 ` Shakeel Butt 0 siblings, 1 reply; 38+ messages in thread
From: Ivan Babrou @ 2022-03-04 0:21 UTC (permalink / raw)
To: Shakeel Butt
Cc: Michal Koutný, Daniel Dao, kernel-team, Linux MM, Johannes Weiner, Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton, Andrew Morton, Linus Torvalds

On Wed, Mar 2, 2022 at 6:35 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Wed, Mar 02, 2022 at 02:33:41PM -0800, Ivan Babrou wrote:
> > On Tue, Mar 1, 2022 at 7:40 PM Ivan Babrou <ivan@cloudflare.com> wrote:
> [...]
>
> Please try the following patch:

* Linux 5.10.101:

$ sudo /usr/share/bcc/tools/funclatency -d 60 workingset_refault
Tracing 1 functions for "workingset_refault"... Hit Ctrl-C to end.

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 5496     |****************************************|
       512 -> 1023       : 3161     |***********************                 |
      1024 -> 2047       : 1059     |*******                                 |
      2048 -> 4095       : 3149     |**********************                  |
      4096 -> 8191       : 2074     |***************                         |
      8192 -> 16383      : 90       |                                        |
     16384 -> 32767      : 43       |                                        |
     32768 -> 65535      : 44       |                                        |
     65536 -> 131071     : 12       |                                        |
    131072 -> 262143     : 3        |                                        |
    262144 -> 524287     : 1        |                                        |

avg = 2144 nsecs, total: 32453571 nsecs, count: 15132

* Linux 5.15.25:

$ sudo /usr/share/bcc/tools/funclatency -d 60 workingset_refault
Tracing 1 functions for "workingset_refault"... Hit Ctrl-C to end.
     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 132411   |****************************************|
       512 -> 1023       : 45547    |*************                           |
      1024 -> 2047       : 5869     |*                                       |
      2048 -> 4095       : 10553    |***                                     |
      4096 -> 8191       : 5297     |*                                       |
      8192 -> 16383      : 455      |                                        |
     16384 -> 32767      : 214      |                                        |
     32768 -> 65535      : 175      |                                        |
     65536 -> 131071     : 50       |                                        |
    131072 -> 262143     : 6        |                                        |
    262144 -> 524287     : 2        |                                        |
    524288 -> 1048575    : 80       |                                        |
   1048576 -> 2097151    : 3228     |                                        |
   2097152 -> 4194303    : 13514    |****                                    |
   4194304 -> 8388607    : 792      |                                        |
   8388608 -> 16777215   : 1        |                                        |

avg = 221955 nsecs, total: 48429632352 nsecs, count: 218195

* Linux 5.15.25 + delayed flush:

$ sudo /usr/share/bcc/tools/funclatency -d 60 workingset_refault
Tracing 1 functions for "workingset_refault"... Hit Ctrl-C to end.

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 103067   |****************************************|
       512 -> 1023       : 39054    |***************                         |
      1024 -> 2047       : 8953     |***                                     |
      2048 -> 4095       : 12986    |*****                                   |
      4096 -> 8191       : 10173    |***                                     |
      8192 -> 16383      : 8090     |***                                     |
     16384 -> 32767      : 890      |                                        |
     32768 -> 65535      : 540      |                                        |
     65536 -> 131071     : 198      |                                        |
    131072 -> 262143     : 50       |                                        |
    262144 -> 524287     : 7        |                                        |

avg = 1837 nsecs, total: 338115740 nsecs, count: 184008

* Linux 5.15.25 + your latest patch with "if_late":

$ sudo /usr/share/bcc/tools/funclatency -d 60 workingset_refault
Tracing 1 functions for "workingset_refault"... Hit Ctrl-C to end.
     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 9241     |****************************************|
       512 -> 1023       : 7041     |******************************          |
      1024 -> 2047       : 1457     |******                                  |
      2048 -> 4095       : 3401     |**************                          |
      4096 -> 8191       : 2207     |*********                               |
      8192 -> 16383      : 130      |                                        |
     16384 -> 32767      : 53       |                                        |
     32768 -> 65535      : 40       |                                        |
     65536 -> 131071     : 16       |                                        |
    131072 -> 262143     : 1        |                                        |
    262144 -> 524287     : 1        |                                        |

avg = 1663 nsecs, total: 39249164 nsecs, count: 23588

The numbers look pretty good.

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: Regression in workingset_refault latency on 5.15 2022-03-04 0:21 ` Ivan Babrou @ 2022-03-04 1:05 ` Shakeel Butt 2022-03-04 1:12 ` Ivan Babrou 0 siblings, 1 reply; 38+ messages in thread
From: Shakeel Butt @ 2022-03-04 1:05 UTC (permalink / raw)
To: Ivan Babrou
Cc: Michal Koutný, Daniel Dao, kernel-team, Linux MM, Johannes Weiner, Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton, Andrew Morton, Linus Torvalds

On Thu, Mar 3, 2022 at 4:21 PM Ivan Babrou <ivan@cloudflare.com> wrote:
>
> On Wed, Mar 2, 2022 at 6:35 PM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > Please try the following patch:
>
> * Linux 5.10.101:
> [...]
> avg = 2144 nsecs, total: 32453571 nsecs, count: 15132
>
> * Linux 5.15.25:
> [...]
> avg = 221955 nsecs, total: 48429632352 nsecs, count: 218195
>
> * Linux 5.15.25 + delayed flush:
> [...]
> avg = 1837 nsecs, total: 338115740 nsecs, count: 184008
>
> * Linux 5.15.25 + your latest patch with "if_late":
> [...]
> avg = 1663 nsecs, total: 39249164 nsecs, count: 23588
>
> The numbers look pretty good.

Thanks a lot Ivan. I will send the signed-off patch for 'if_late' and I would like to have your Tested-by on that.

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: Regression in workingset_refault latency on 5.15 2022-03-04 1:05 ` Shakeel Butt @ 2022-03-04 1:12 ` Ivan Babrou 0 siblings, 0 replies; 38+ messages in thread From: Ivan Babrou @ 2022-03-04 1:12 UTC (permalink / raw) To: Shakeel Butt Cc: Michal Koutný, Daniel Dao, kernel-team, Linux MM, Johannes Weiner, Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton, Andrew Morton, Linus Torvalds On Thu, Mar 3, 2022 at 5:05 PM Shakeel Butt <shakeelb@google.com> wrote: > Thanks a lot Ivan. I will send the signed-off patch for 'if_late' and > I would like to have your Tested-by on that. Thanks, I appreciate your help here. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Regression in workingset_refault latency on 5.15 2022-03-02 2:50 ` Shakeel Butt 2022-03-02 3:40 ` Ivan Babrou @ 2022-03-02 11:49 ` Frank Hofmann 2022-03-02 15:52 ` Shakeel Butt 1 sibling, 1 reply; 38+ messages in thread
From: Frank Hofmann @ 2022-03-02 11:49 UTC (permalink / raw)
To: Shakeel Butt
Cc: Ivan Babrou, Michal Koutný, Daniel Dao, kernel-team, Linux MM, Johannes Weiner, Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton, Andrew Morton, Linus Torvalds

On Wed, Mar 2, 2022 at 2:50 AM 'Shakeel Butt' via kernel-team+notifications <kernel-team@cloudflare.com> wrote:
>
> On Tue, Mar 01, 2022 at 04:48:00PM -0800, Ivan Babrou wrote:
>
> [...]
>
> > Looks like you were right that targeted flush is not going to be as good.
>
> Thanks a lot. Can you please try the following patch (independent of other
> patches) as well?

Hi Shakeel,

maybe it's just me, but the code change below confuses me. I'll comment inline.

>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index d9b8df5ef212..499f75e066f3 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -75,29 +75,8 @@ enum mem_cgroup_events_target {
>  	MEM_CGROUP_NTARGETS,
>  };
>
> -struct memcg_vmstats_percpu {
> -	/* Local (CPU and cgroup) page state & events */
> -	long state[MEMCG_NR_STAT];
> -	unsigned long events[NR_VM_EVENT_ITEMS];
> -
> -	/* Delta calculation for lockless upward propagation */
> -	long state_prev[MEMCG_NR_STAT];
> -	unsigned long events_prev[NR_VM_EVENT_ITEMS];
> -
> -	/* Cgroup1: threshold notifications & softlimit tree updates */
> -	unsigned long nr_page_events;
> -	unsigned long targets[MEM_CGROUP_NTARGETS];
> -};
> -
> -struct memcg_vmstats {
> -	/* Aggregated (CPU and subtree) page state & events */
> -	long state[MEMCG_NR_STAT];
> -	unsigned long events[NR_VM_EVENT_ITEMS];
> -
> -	/* Pending child counts during tree propagation */
> -	long state_pending[MEMCG_NR_STAT];
> -	unsigned long events_pending[NR_VM_EVENT_ITEMS];
> -};
> +struct memcg_vmstats_percpu;
> +struct memcg_vmstats;
>
>  struct mem_cgroup_reclaim_iter {
>  	struct mem_cgroup *position;
> @@ -304,7 +283,7 @@ struct mem_cgroup {
>  	MEMCG_PADDING(_pad1_);
>
>  	/* memory.stat */
> -	struct memcg_vmstats	vmstats;
> +	struct memcg_vmstats	*vmstats;
>
>  	/* memory.events */
>  	atomic_long_t		memory_events[MEMCG_NR_MEMORY_EVENTS];
> @@ -964,11 +943,7 @@ static inline void mod_memcg_state(struct mem_cgroup *memcg,
>  	local_irq_restore(flags);
>  }
>
> -static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
> -{
> -	return READ_ONCE(memcg->vmstats.state[idx]);
> -}
> -
> +unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx);
>  static inline unsigned long lruvec_page_state(struct lruvec *lruvec,
>  					      enum node_stat_item idx)
>  {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 32ba963ebf2e..c65f155c2048 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -688,6 +688,71 @@ static void flush_memcg_stats_dwork(struct work_struct *w)
>  	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, 2UL*HZ);
>  }
>
> +static const unsigned int memcg_vm_events[] = {
> +	PGPGIN,
> +	PGPGOUT,
> +	PGFAULT,
> +	PGMAJFAULT,
> +	PGREFILL,
> +	PGSCAN_KSWAPD,
> +	PGSCAN_DIRECT,
> +	PGSTEAL_KSWAPD,
> +	PGSTEAL_DIRECT,
> +	PGACTIVATE,
> +	PGDEACTIVATE,
> +	PGLAZYFREE,
> +	PGLAZYFREED,
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	THP_FAULT_ALLOC,
> +	THP_COLLAPSE_ALLOC,
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +};
> +#define NR_MEMCG_EVENTS ARRAY_SIZE(memcg_vm_events)
> +
> +static int memcg_events_index[NR_VM_EVENT_ITEMS] __read_mostly;
> +
> +static void init_memcg_events(void)
> +{
> +	int i;
> +
> +	for (i = 0; i < NR_MEMCG_EVENTS; ++i)
> +		memcg_events_index[memcg_vm_events[i]] = i + 1;
> +}
> +
> +static int get_memcg_events_index(enum vm_event_item idx)
> +{
> +	return memcg_events_index[idx] - 1;
> +}

It's this.
The table, memcg_vm_events[], is basically a "truth set" - if the event is in this set, update memcg_events_index[] for it. This code looks like one attempts to re-base C arrays to start-at-one though. It then does that accessor function to translate "unsigned zero" to "signed -1" ... for a truth test.

> +
> +struct memcg_vmstats_percpu {
> +	/* Local (CPU and cgroup) page state & events */
> +	long state[MEMCG_NR_STAT];
> +	unsigned long events[NR_MEMCG_EVENTS];
> +
> +	/* Delta calculation for lockless upward propagation */
> +	long state_prev[MEMCG_NR_STAT];
> +	unsigned long events_prev[NR_MEMCG_EVENTS];
> +
> +	/* Cgroup1: threshold notifications & softlimit tree updates */
> +	unsigned long nr_page_events;
> +	unsigned long targets[MEM_CGROUP_NTARGETS];
> +};
> +
> +struct memcg_vmstats {
> +	/* Aggregated (CPU and subtree) page state & events */
> +	long state[MEMCG_NR_STAT];
> +	unsigned long events[NR_MEMCG_EVENTS];
> +
> +	/* Pending child counts during tree propagation */
> +	long state_pending[MEMCG_NR_STAT];
> +	unsigned long events_pending[NR_MEMCG_EVENTS];
> +};
> +
> +unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
> +{
> +	return READ_ONCE(memcg->vmstats->state[idx]);
> +}
> +
>  /**
>   * __mod_memcg_state - update cgroup memory statistics
>   * @memcg: the memory cgroup
> @@ -831,25 +896,33 @@ static inline void mod_objcg_mlstate(struct obj_cgroup *objcg,
>  void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
>  			  unsigned long count)
>  {
> -	if (mem_cgroup_disabled())
> +	int index = get_memcg_events_index(idx);
> +
> +	if (mem_cgroup_disabled() || index < 0)

And it's these where it's used; basically, this says "if the event is not in the monitored set, return". Likewise in all below.

Why not have this explicit ?
Make something like:

static int memcg_event_exists_percg(int index)
{
	switch (index) {
	case PGPGIN:
	case PGPGOUT:
	case PGFAULT:
	case PGMAJFAULT:
	case PGREFILL:
	case PGSCAN_KSWAPD:
	case PGSCAN_DIRECT:
	case PGSTEAL_KSWAPD:
	case PGSTEAL_DIRECT:
	case PGACTIVATE:
	case PGDEACTIVATE:
	case PGLAZYFREE:
	case PGLAZYFREED:
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	case THP_FAULT_ALLOC:
	case THP_COLLAPSE_ALLOC:
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
		return 1;
	default:
		return 0;
	}
}

and let the compiler choose how to lay this out where-used ?

>  	return;
>
> -	__this_cpu_add(memcg->vmstats_percpu->events[idx], count);
> +	__this_cpu_add(memcg->vmstats_percpu->events[index], count);

[ ... ]

I understand there is a memory saving from the different table types/sizes, while the vm events per memcg is a (small) subset of vm events. How relevant is this to the change, though?

Best regards,
Frank Hofmann

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: Regression in workingset_refault latency on 5.15 2022-03-02 11:49 ` Frank Hofmann @ 2022-03-02 15:52 ` Shakeel Butt 0 siblings, 0 replies; 38+ messages in thread
From: Shakeel Butt @ 2022-03-02 15:52 UTC (permalink / raw)
To: Frank Hofmann
Cc: Ivan Babrou, Michal Koutný, Daniel Dao, kernel-team, Linux MM, Johannes Weiner, Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton, Andrew Morton, Linus Torvalds

Hi Frank,

On Wed, Mar 2, 2022 at 3:50 AM Frank Hofmann <fhofmann@cloudflare.com> wrote:
[...]
>
> Hi Shakeel,
>
> maybe it's just me, but the code change below confuses me. I'll comment inline.
>
[...]
>
> I understand there is a memory saving from the different table
> types/sizes, while the vm events per memcg is a (small) subset of vm
> events. How relevant is this to the change, though?
>

Actually this is the whole reason behind this change, i.e. reduce the size of the events[_prev] arrays from ~100 to 15. I suspect updating the events arrays might account for half the time the rstat flushing is spending.

^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: Regression in workingset_refault latency on 5.15 2022-03-02 0:48 ` Ivan Babrou 2022-03-02 2:50 ` Shakeel Butt @ 2022-03-02 10:08 ` Michal Koutný 2022-03-02 15:53 ` Shakeel Butt 1 sibling, 1 reply; 38+ messages in thread From: Michal Koutný @ 2022-03-02 10:08 UTC (permalink / raw) To: Ivan Babrou Cc: Shakeel Butt, Daniel Dao, kernel-team, Linux MM, Johannes Weiner, Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton, Andrew Morton, Linus Torvalds On Tue, Mar 01, 2022 at 04:48:00PM -0800, Ivan Babrou <ivan@cloudflare.com> wrote: > avg = 1100789 nsecs, total: 27051906434 nsecs, count: 24575 Thank you for checking. > Looks like you were right that targeted flush is not going to be as good. Still lots of work to do at once :-/ BTW what was the "last good" kernel that you use as baseline for comparisons? (I can't find it in the thread and I only assume we're comparing against pre-rstat switch, i.e. last good < v5.13?) Regards, Michal ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Regression in workingset_refault latency on 5.15 2022-03-02 10:08 ` Michal Koutný @ 2022-03-02 15:53 ` Shakeel Butt 2022-03-02 17:28 ` Ivan Babrou 0 siblings, 1 reply; 38+ messages in thread From: Shakeel Butt @ 2022-03-02 15:53 UTC (permalink / raw) To: Michal Koutný Cc: Ivan Babrou, Daniel Dao, kernel-team, Linux MM, Johannes Weiner, Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton, Andrew Morton, Linus Torvalds On Wed, Mar 2, 2022 at 2:08 AM Michal Koutný <mkoutny@suse.com> wrote: > [...] > > BTW what was the "last good" kernel that you use as baseline for > comparisons? (I can't find it in the thread and I only assume we're > comparing against pre-rstat switch, i.e. last good < v5.13?) > I think 5.10 is the last good kernel for them (and yes pre-rstat switch). ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Regression in workingset_refault latency on 5.15 2022-03-02 15:53 ` Shakeel Butt @ 2022-03-02 17:28 ` Ivan Babrou 0 siblings, 0 replies; 38+ messages in thread From: Ivan Babrou @ 2022-03-02 17:28 UTC (permalink / raw) To: Shakeel Butt Cc: Michal Koutný, Daniel Dao, kernel-team, Linux MM, Johannes Weiner, Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton, Andrew Morton, Linus Torvalds On Wed, Mar 2, 2022 at 7:53 AM Shakeel Butt <shakeelb@google.com> wrote: > > On Wed, Mar 2, 2022 at 2:08 AM Michal Koutný <mkoutny@suse.com> wrote: > > > [...] > > > > BTW what was the "last good" kernel that you use as baseline for > > comparisons? (I can't find it in the thread and I only assume we're > > comparing against pre-rstat switch, i.e. last good < v5.13?) > > > > I think 5.10 is the last good kernel for them (and yes pre-rstat switch). Correct, we stick to LTS kernels. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Regression in workingset_refault latency on 5.15 2022-02-23 13:51 Regression in workingset_refault latency on 5.15 Daniel Dao 2022-02-23 15:57 ` Shakeel Butt @ 2022-02-24 9:22 ` Thorsten Leemhuis 2022-04-11 10:17 ` Regression in workingset_refault latency on 5.15 #forregzbot Thorsten Leemhuis 1 sibling, 1 reply; 38+ messages in thread
From: Thorsten Leemhuis @ 2022-02-24 9:22 UTC (permalink / raw)
To: Daniel Dao, shakeelb
Cc: kernel-team, linux-mm, hannes, guro, feng.tang, mhocko, hdanton, mkoutny, akpm, torvalds, regressions

[TLDR: I'm adding the regression report below to regzbot, the Linux kernel regression tracking bot; all text you find below is compiled from a few template paragraphs you might have encountered already from similar mails.]

Hi, this is your Linux kernel regression tracker speaking.

Thanks for the report. CCing the regression mailing list, as it should be in the loop for all regressions, as explained here:
https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html

To be sure this issue doesn't fall through the cracks unnoticed, I'm adding it to regzbot, my Linux kernel regression tracking bot:

#regzbot ^introduced v3.0..v5.15.19
#regzbot ignore-activity

Reminder for developers: when fixing the issue, please add 'Link:' tags pointing to the report (the mail quoted below) using lore.kernel.org/r/, as explained in 'Documentation/process/submitting-patches.rst' and 'Documentation/process/5.Posting.rst'. This allows the bot to connect the report with any patches posted or committed to fix the issue; this again allows the bot to show the current status of regressions and automatically resolve the issue when the fix hits the right tree.

I'm sending this to everyone that got the initial report, to make them aware of the tracking.
I also hope that messages like this motivate people to directly get at least the regression mailing list and ideally even regzbot involved when dealing with regressions, as messages like this wouldn't be needed then.

Don't worry, I'll send further messages wrt this regression just to the lists (with a tag in the subject so people can filter them away), if they are relevant just for regzbot. With a bit of luck no such messages will be needed anyway.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I'm getting a lot of reports on my table. I can only look briefly into most of them and lack knowledge about most of the areas they concern. I thus unfortunately will sometimes get things wrong or miss something important. I hope that's not the case here; if you think it is, don't hesitate to tell me in a public reply, it's in everyone's interest to set the public record straight.

On 23.02.22 14:51, Daniel Dao wrote:
> Hi all,
>
> We are observing some regressions in workingset_refault on our newly upgraded
> 5.15.19 nodes with zram as swap. This manifests in several ways:
>
> 1) Regression of workingset_refault duration observed in flamegraph
>
> We regularly collect flamegraphs for running services on the node. Since upgrade
> to 5.15.19, we see that workingset_refault occupied a more significant part of
> the service flamegraph (13%) with the following call trace
>
> workingset_refault+0x128
> add_to_page_cache_lru+0x9f
> page_cache_ra_unbounded+0x154
> force_page_cache_ra+0xe2
> filemap_get_pages+0xe9
> filemap_read+0xa4
> xfs_file_buffered_read+0x98
> xfs_file_read_iter+0x6a
> new_sync_read+0x118
> vfs_read+0xf2
> __x64_sys_pread64+0x89
> do_syscall_64+0x3b
> entry_SYSCALL_64_after_hwframe+0x44
>
> 2) Regression of userspace performance sensitive code
>
> We have some performance sensitive code running in userspace that has its
> runtime measured by CLOCK_THREAD_CPUTIME_ID.
They look roughly as:
>
> now = clock_gettime(CLOCK_THREAD_CPUTIME_ID)
> func()
> elapsed = clock_gettime(CLOCK_THREAD_CPUTIME_ID) - now
>
> Since 5.15 upgrade, we observed long `elapsed` in the range of 4-10ms much more
> frequently than before. This went away after we disabled swap for the service
> using `memory.swap.max=0` memcg configuration.
>
> The new thing in 5.15 workingset_refault seems to be introduction of
> mem_cgroup_flush_stats() by commit 1f828223b7991a228bc2aef837b78737946d44b2
> (memcg: flush lruvec stats in the refault).
>
> Given that mem_cgroup_flush_stats can take quite a long time for us on the
> standard systemd cgroupv2 hierarchy ( root / system.slice / workload.service )
>
> sudo /usr/share/bcc/tools/funcslower -m 10 -t mem_cgroup_flush_stats
> Tracing function calls slower than 10 ms... Ctrl+C to quit.
> TIME       COMM          PID      LAT(ms)  RVAL  FUNC
> 0.000000   <redacted>    804776   11.50    200   mem_cgroup_flush_stats
> 0.343383   <redacted>    647496   10.58    200   mem_cgroup_flush_stats
> 0.604309   <redacted>    804776   10.50    200   mem_cgroup_flush_stats
> 1.230416   <redacted>    803293   10.01    200   mem_cgroup_flush_stats
> 1.248442   <redacted>    646400   11.02    200   mem_cgroup_flush_stats
>
> could it be possible that workingset_refault in some unfortunate case can take
> much longer than before such that it increases the time observed by
> CLOCK_THREAD_CPUTIME_ID from userspace, or overall duration of
> workingset_refault observed by perf ?
>

--
Additional information about regzbot:

If you want to know more about regzbot, check out its web-interface, the getting started guide, and the reference documentation:
https://linux-regtracking.leemhuis.info/regzbot/
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md

The last two documents will explain how you can interact with regzbot yourself if you want to.
Hint for reporters: when reporting a regression it's in your interest to CC the regression list and tell regzbot about the issue, as that ensures the regression makes it onto the radar of the Linux kernel's regression tracker -- that's in your interest, as it ensures your report won't fall through the cracks unnoticed. Hint for developers: you normally don't need to care about regzbot once it's involved. Fix the issue as you normally would, just remember to include 'Link:' tag in the patch descriptions pointing to all reports about the issue. This has been expected from developers even before regzbot showed up for reasons explained in 'Documentation/process/submitting-patches.rst' and 'Documentation/process/5.Posting.rst'. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Regression in workingset_refault latency on 5.15 #forregzbot 2022-02-24 9:22 ` Thorsten Leemhuis @ 2022-04-11 10:17 ` Thorsten Leemhuis 2022-05-16 12:51 ` Thorsten Leemhuis 0 siblings, 1 reply; 38+ messages in thread
From: Thorsten Leemhuis @ 2022-04-11 10:17 UTC (permalink / raw)
To: regressions

TWIMC: this mail is primarily sent for documentation purposes and for regzbot, my Linux kernel regression tracking bot. These mails usually contain '#forregzbot' in the subject, to make them easy to spot and filter.

#regzbot backburner: patch in mm, but it needs to be tested properly and the issue is not urgent

On 24.02.22 10:22, Thorsten Leemhuis wrote:
> [TLDR: I'm adding the regression report below to regzbot, the Linux
> kernel regression tracking bot; all text you find below is compiled from
> a few template paragraphs you might have encountered already
> from similar mails.]
>
> Hi, this is your Linux kernel regression tracker speaking.
>
> Thanks for the report. CCing the regression mailing list, as it should
> be in the loop for all regressions, as explained here:
> https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html
>
> To be sure this issue doesn't fall through the cracks unnoticed, I'm
> adding it to regzbot, my Linux kernel regression tracking bot:
>
> #regzbot ^introduced v3.0..v5.15.19
> #regzbot ignore-activity
>
> Reminder for developers: when fixing the issue, please add 'Link:'
> tags pointing to the report (the mail quoted below) using
> lore.kernel.org/r/, as explained in
> 'Documentation/process/submitting-patches.rst' and
> 'Documentation/process/5.Posting.rst'. This allows the bot to connect
> the report with any patches posted or committed to fix the issue; this
> again allows the bot to show the current status of regressions and
> automatically resolve the issue when the fix hits the right tree.
>
> I'm sending this to everyone that got the initial report, to make them
> aware of the tracking.
> I also hope that messages like this motivate
> people to directly get at least the regression mailing list and ideally
> even regzbot involved when dealing with regressions, as messages like
> this wouldn't be needed then.
>
> Don't worry, I'll send further messages regarding this regression just
> to the lists (with a tag in the subject so people can filter them away)
> if they are relevant just for regzbot. With a bit of luck no such
> messages will be needed anyway.
>
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>
> P.S.: As the Linux kernel's regression tracker I'm getting a lot of
> reports on my table. I can only look briefly into most of them and lack
> knowledge about most of the areas they concern. I thus unfortunately
> will sometimes get things wrong or miss something important. I hope
> that's not the case here; if you think it is, don't hesitate to tell me
> in a public reply, it's in everyone's interest to set the public record
> straight.
>
> On 23.02.22 14:51, Daniel Dao wrote:
>> Hi all,
>>
>> We are observing some regressions in workingset_refault on our newly
>> upgraded 5.15.19 nodes with zram as swap. This manifests in several ways:
>>
>> 1) Regression of workingset_refault duration observed in flamegraph
>>
>> We regularly collect flamegraphs for running services on the node.
>> Since the upgrade to 5.15.19, we see that workingset_refault occupies a
>> more significant part of the service flamegraph (13%) with the
>> following call trace:
>>
>> workingset_refault+0x128
>> add_to_page_cache_lru+0x9f
>> page_cache_ra_unbounded+0x154
>> force_page_cache_ra+0xe2
>> filemap_get_pages+0xe9
>> filemap_read+0xa4
>> xfs_file_buffered_read+0x98
>> xfs_file_read_iter+0x6a
>> new_sync_read+0x118
>> vfs_read+0xf2
>> __x64_sys_pread64+0x89
>> do_syscall_64+0x3b
>> entry_SYSCALL_64_after_hwframe+0x44
>>
>> 2) Regression of userspace performance-sensitive code
>>
>> We have some performance-sensitive code running in userspace whose
>> runtime is measured by CLOCK_THREAD_CPUTIME_ID. It looks roughly like:
>>
>> now = clock_gettime(CLOCK_THREAD_CPUTIME_ID)
>> func()
>> elapsed = clock_gettime(CLOCK_THREAD_CPUTIME_ID) - now
>>
>> Since the 5.15 upgrade, we observed long `elapsed` values in the range
>> of 4-10ms much more frequently than before. This went away after we
>> disabled swap for the service using the `memory.swap.max=0` memcg
>> configuration.
>>
>> The new thing in the 5.15 workingset_refault seems to be the
>> introduction of mem_cgroup_flush_stats() by commit
>> 1f828223b7991a228bc2aef837b78737946d44b2 ("memcg: flush lruvec stats
>> in the refault").
>>
>> Given that mem_cgroup_flush_stats can take quite a long time for us on
>> the standard systemd cgroupv2 hierarchy (root / system.slice /
>> workload.service):
>>
>> sudo /usr/share/bcc/tools/funcslower -m 10 -t mem_cgroup_flush_stats
>> Tracing function calls slower than 10 ms... Ctrl+C to quit.
>> TIME      COMM         PID      LAT(ms)  RVAL  FUNC
>> 0.000000  <redacted>   804776     11.50   200  mem_cgroup_flush_stats
>> 0.343383  <redacted>   647496     10.58   200  mem_cgroup_flush_stats
>> 0.604309  <redacted>   804776     10.50   200  mem_cgroup_flush_stats
>> 1.230416  <redacted>   803293     10.01   200  mem_cgroup_flush_stats
>> 1.248442  <redacted>   646400     11.02   200  mem_cgroup_flush_stats
>>
>> could it be possible that workingset_refault in some unfortunate case
>> can take much longer than before, such that it increases the time
>> observed by CLOCK_THREAD_CPUTIME_ID from userspace, or the overall
>> duration of workingset_refault observed by perf?
* Re: Regression in workingset_refault latency on 5.15 #forregzbot
  2022-04-11 10:17 ` Regression in workingset_refault latency on 5.15 #forregzbot Thorsten Leemhuis
@ 2022-05-16 12:51 ` Thorsten Leemhuis
  0 siblings, 0 replies; 38+ messages in thread
From: Thorsten Leemhuis @ 2022-05-16 12:51 UTC (permalink / raw)
To: regressions

On 11.04.22 12:17, Thorsten Leemhuis wrote:
> TWIMC: this mail is primarily sent for documentation purposes and for
> regzbot, my Linux kernel regression tracking bot. These mails usually
> contain '#forregzbot' in the subject, to make them easy to spot and filter.
>
> #regzbot backburner: patch in mm, but it needs to be tested properly and
> the issue is not urgent

#regzbot fixed-by: 9b3016154c913b2e7

> [full quote of the tracking boilerplate and of Daniel Dao's original
> report snipped; see the message above]
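[Editor's note] As a reference for the mitigation mentioned in the report (disabling swap for one service via `memory.swap.max=0`), the cgroup-v2 setting can be applied in either of the following ways. This is a sketch: `workload.service` is the placeholder unit name used in the report, and the path assumes the standard systemd hierarchy quoted there.

```shell
# One-off: write directly to the cgroup v2 control file for the service.
echo 0 | sudo tee /sys/fs/cgroup/system.slice/workload.service/memory.swap.max

# Persistent equivalent, letting systemd manage the attribute
# (see systemd.resource-control(5), MemorySwapMax=):
sudo systemctl set-property workload.service MemorySwapMax=0
```

With swap disabled for the cgroup, anonymous pages in that service are no longer swapped to zram, which is consistent with the report's observation that the latency outliers disappeared.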
end of thread, other threads:[~2022-05-16 12:51 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-23 13:51 Regression in workingset_refault latency on 5.15 Daniel Dao
2022-02-23 15:57 ` Shakeel Butt
2022-02-23 16:00 ` Shakeel Butt
2022-02-23 17:07 ` Daniel Dao
2022-02-23 17:36 ` Shakeel Butt
2022-02-23 19:28 ` Ivan Babrou
2022-02-23 20:28 ` Shakeel Butt
2022-02-23 21:16 ` Ivan Babrou
2022-02-24 14:46 ` Daniel Dao
2022-02-24 16:58 ` Shakeel Butt
2022-02-24 17:34 ` Daniel Dao
2022-02-24 18:00 ` Shakeel Butt
2022-02-24 18:52 ` Shakeel Butt
2022-02-25 10:23 ` Daniel Dao
2022-02-25 17:08 ` Ivan Babrou
2022-02-25 17:22 ` Shakeel Butt
2022-02-25 18:03 ` Michal Koutný
2022-02-25 18:08 ` Ivan Babrou
2022-02-28 23:09 ` Shakeel Butt
2022-02-28 23:34 ` Ivan Babrou
2022-02-28 23:43 ` Shakeel Butt
2022-03-02  0:48 ` Ivan Babrou
2022-03-02  2:50 ` Shakeel Butt
2022-03-02  3:40 ` Ivan Babrou
2022-03-02 22:33 ` Ivan Babrou
2022-03-03  2:32 ` Shakeel Butt
2022-03-03  2:35 ` Shakeel Butt
2022-03-04  0:21 ` Ivan Babrou
2022-03-04  1:05 ` Shakeel Butt
2022-03-04  1:12 ` Ivan Babrou
2022-03-02 11:49 ` Frank Hofmann
2022-03-02 15:52 ` Shakeel Butt
2022-03-02 10:08 ` Michal Koutný
2022-03-02 15:53 ` Shakeel Butt
2022-03-02 17:28 ` Ivan Babrou
2022-02-24  9:22 ` Thorsten Leemhuis
2022-04-11 10:17 ` Regression in workingset_refault latency on 5.15 #forregzbot Thorsten Leemhuis
2022-05-16 12:51 ` Thorsten Leemhuis