* Regression in workingset_refault latency on 5.15
@ 2022-02-23 13:51 Daniel Dao
  2022-02-23 15:57 ` Shakeel Butt
  2022-02-24  9:22 ` Thorsten Leemhuis
  0 siblings, 2 replies; 38+ messages in thread
From: Daniel Dao @ 2022-02-23 13:51 UTC (permalink / raw)
  To: shakeelb
  Cc: kernel-team, linux-mm, hannes, guro, feng.tang, mhocko, hdanton,
	mkoutny, akpm, torvalds

Hi all,

We are observing some regressions in workingset_refault on our newly upgraded
5.15.19 nodes with zram as swap. This manifests in several ways:

1) Regression of workingset_refault duration observed in flamegraph

We regularly collect flamegraphs for running services on the node. Since
upgrading to 5.15.19, we see that workingset_refault occupies a more
significant part of the service flamegraph (13%), with the following call
trace:

  workingset_refault+0x128
  add_to_page_cache_lru+0x9f
  page_cache_ra_unbounded+0x154
  force_page_cache_ra+0xe2
  filemap_get_pages+0xe9
  filemap_read+0xa4
  xfs_file_buffered_read+0x98
  xfs_file_read_iter+0x6a
  new_sync_read+0x118
  vfs_read+0xf2
  __x64_sys_pread64+0x89
  do_syscall_64+0x3b
  entry_SYSCALL_64_after_hwframe+0x44

2) Regression of userspace performance sensitive code

We have some performance-sensitive code running in userspace that has its
runtime measured by CLOCK_THREAD_CPUTIME_ID. It looks roughly like:

  now = clock_gettime(CLOCK_THREAD_CPUTIME_ID)
  func()
  elapsed = clock_gettime(CLOCK_THREAD_CPUTIME_ID) - now

Since the 5.15 upgrade, we have observed long `elapsed` values in the range
of 4-10ms much more frequently than before. This went away after we disabled
swap for the service using the `memory.swap.max=0` memcg configuration.

The new thing in the 5.15 workingset_refault seems to be the introduction
of mem_cgroup_flush_stats() by commit
1f828223b7991a228bc2aef837b78737946d44b2 ("memcg: flush lruvec stats in
the refault").

Given that mem_cgroup_flush_stats can take quite a long time for us on the
standard systemd cgroupv2 hierarchy (root / system.slice / workload.service):

  sudo /usr/share/bcc/tools/funcslower -m 10 -t mem_cgroup_flush_stats
  Tracing function calls slower than 10 ms... Ctrl+C to quit.
  TIME       COMM           PID    LAT(ms)             RVAL FUNC
  0.000000   <redacted>     804776   11.50              200 mem_cgroup_flush_stats
  0.343383   <redacted>     647496   10.58              200 mem_cgroup_flush_stats
  0.604309   <redacted>     804776   10.50              200 mem_cgroup_flush_stats
  1.230416   <redacted>     803293   10.01              200 mem_cgroup_flush_stats
  1.248442   <redacted>     646400   11.02              200 mem_cgroup_flush_stats

Could it be that workingset_refault in some unfortunate cases takes much
longer than before, such that it increases the time observed by
CLOCK_THREAD_CPUTIME_ID from userspace, or the overall duration of
workingset_refault observed by perf?



* Re: Regression in workingset_refault latency on 5.15
  2022-02-23 13:51 Regression in workingset_refault latency on 5.15 Daniel Dao
@ 2022-02-23 15:57 ` Shakeel Butt
  2022-02-23 16:00   ` Shakeel Butt
  2022-02-24  9:22 ` Thorsten Leemhuis
  1 sibling, 1 reply; 38+ messages in thread
From: Shakeel Butt @ 2022-02-23 15:57 UTC (permalink / raw)
  To: Daniel Dao, Ivan Babrou
  Cc: kernel-team, Linux MM, Johannes Weiner, Roman Gushchin,
	Feng Tang, Michal Hocko, Hillf Danton, Michal Koutný,
	Andrew Morton, Linus Torvalds

+Ivan Babrou

Hi Daniel,

On Wed, Feb 23, 2022 at 5:51 AM Daniel Dao <dqminh@cloudflare.com> wrote:
>
> Hi all,
>
> We are observing some regressions in workingset_refault on our newly upgraded
> 5.15.19 nodes with zram as swap. This manifests in several ways:
>
> 1) Regression of workingset_refault duration observed in flamegraph
>
> We regularly collect flamegraphs for running services on the node. Since upgrade
> to 5.15.19, we see that workingset_refault occupied a more significant part of
> the service flamegraph (13%) with the following call trace
>
>   workingset_refault+0x128
>   add_to_page_cache_lru+0x9f
>   page_cache_ra_unbounded+0x154
>   force_page_cache_ra+0xe2
>   filemap_get_pages+0xe9
>   filemap_read+0xa4
>   xfs_file_buffered_read+0x98
>   xfs_file_read_iter+0x6a
>   new_sync_read+0x118
>   vfs_read+0xf2
>   __x64_sys_pread64+0x89
>   do_syscall_64+0x3b
>   entry_SYSCALL_64_after_hwframe+0x44
>
> 2) Regression of userspace performance sensitive code
>
> We have some performance-sensitive code running in userspace that has its
> runtime measured by CLOCK_THREAD_CPUTIME_ID. It looks roughly like:
>
>   now = clock_gettime(CLOCK_THREAD_CPUTIME_ID)
>   func()
>   elapsed = clock_gettime(CLOCK_THREAD_CPUTIME_ID) - now
>
> Since 5.15 upgrade, we observed long `elapsed` in the range of 4-10ms much more
> frequently than before.  This went away after we disabled swap for the service
> using `memory.swap.max=0` memcg configuration.
>
> The new thing in the 5.15 workingset_refault seems to be the introduction
> of mem_cgroup_flush_stats() by commit
> 1f828223b7991a228bc2aef837b78737946d44b2 ("memcg: flush lruvec stats in
> the refault").
>
> Given that mem_cgroup_flush_stats can take quite a long time for us on the
> standard systemd cgroupv2 hierarchy (root / system.slice / workload.service):
>
>   sudo /usr/share/bcc/tools/funcslower -m 10 -t mem_cgroup_flush_stats
>   Tracing function calls slower than 10 ms... Ctrl+C to quit.
>   TIME       COMM           PID    LAT(ms)             RVAL FUNC
>   0.000000   <redacted>     804776   11.50              200 mem_cgroup_flush_stats
>   0.343383   <redacted>     647496   10.58              200 mem_cgroup_flush_stats
>   0.604309   <redacted>     804776   10.50              200 mem_cgroup_flush_stats
>   1.230416   <redacted>     803293   10.01              200 mem_cgroup_flush_stats
>   1.248442   <redacted>     646400   11.02              200 mem_cgroup_flush_stats
>
> Could it be that workingset_refault in some unfortunate cases takes much
> longer than before, such that it increases the time observed by
> CLOCK_THREAD_CPUTIME_ID from userspace, or the overall duration of
> workingset_refault observed by perf?

Ivan reported this issue last month and since then we have backported
three patches to 5.15-stable to fix it. Those were merged into 5.15.18.
However, you are reporting that the issue is still present in 5.15.19.

Can you share a bit more detail on your hardware configuration (number of
CPUs) and, if possible, the flamegraph?

thanks,
Shakeel



* Re: Regression in workingset_refault latency on 5.15
  2022-02-23 15:57 ` Shakeel Butt
@ 2022-02-23 16:00   ` Shakeel Butt
  2022-02-23 17:07     ` Daniel Dao
  0 siblings, 1 reply; 38+ messages in thread
From: Shakeel Butt @ 2022-02-23 16:00 UTC (permalink / raw)
  To: Daniel Dao, Ivan Babrou
  Cc: kernel-team, Linux MM, Johannes Weiner, Roman Gushchin,
	Feng Tang, Michal Hocko, Hillf Danton, Michal Koutný,
	Andrew Morton, Linus Torvalds

On Wed, Feb 23, 2022 at 7:57 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> +Ivan Babrou
>
> Hi Daniel,
>
> On Wed, Feb 23, 2022 at 5:51 AM Daniel Dao <dqminh@cloudflare.com> wrote:
> >
> > Hi all,
> >
> > We are observing some regressions in workingset_refault on our newly upgraded
> > 5.15.19 nodes with zram as swap. This manifests in several ways:
> >
> > 1) Regression of workingset_refault duration observed in flamegraph
> >
> > We regularly collect flamegraphs for running services on the node. Since upgrade
> > to 5.15.19, we see that workingset_refault occupied a more significant part of
> > the service flamegraph (13%) with the following call trace
> >
> >   workingset_refault+0x128
> >   add_to_page_cache_lru+0x9f
> >   page_cache_ra_unbounded+0x154
> >   force_page_cache_ra+0xe2
> >   filemap_get_pages+0xe9
> >   filemap_read+0xa4
> >   xfs_file_buffered_read+0x98
> >   xfs_file_read_iter+0x6a
> >   new_sync_read+0x118
> >   vfs_read+0xf2
> >   __x64_sys_pread64+0x89
> >   do_syscall_64+0x3b
> >   entry_SYSCALL_64_after_hwframe+0x44
> >
> > 2) Regression of userspace performance sensitive code
> >
> > We have some performance-sensitive code running in userspace that has its
> > runtime measured by CLOCK_THREAD_CPUTIME_ID. It looks roughly like:
> >
> >   now = clock_gettime(CLOCK_THREAD_CPUTIME_ID)
> >   func()
> >   elapsed = clock_gettime(CLOCK_THREAD_CPUTIME_ID) - now
> >
> > Since 5.15 upgrade, we observed long `elapsed` in the range of 4-10ms much more
> > frequently than before.  This went away after we disabled swap for the service
> > using `memory.swap.max=0` memcg configuration.
> >
> > The new thing in the 5.15 workingset_refault seems to be the introduction
> > of mem_cgroup_flush_stats() by commit
> > 1f828223b7991a228bc2aef837b78737946d44b2 ("memcg: flush lruvec stats in
> > the refault").
> >
> > Given that mem_cgroup_flush_stats can take quite a long time for us on the
> > standard systemd cgroupv2 hierarchy (root / system.slice / workload.service):
> >
> >   sudo /usr/share/bcc/tools/funcslower -m 10 -t mem_cgroup_flush_stats
> >   Tracing function calls slower than 10 ms... Ctrl+C to quit.
> >   TIME       COMM           PID    LAT(ms)             RVAL FUNC
> >   0.000000   <redacted>     804776   11.50              200 mem_cgroup_flush_stats
> >   0.343383   <redacted>     647496   10.58              200 mem_cgroup_flush_stats
> >   0.604309   <redacted>     804776   10.50              200 mem_cgroup_flush_stats
> >   1.230416   <redacted>     803293   10.01              200 mem_cgroup_flush_stats
> >   1.248442   <redacted>     646400   11.02              200 mem_cgroup_flush_stats
> >
> > Could it be that workingset_refault in some unfortunate cases takes much
> > longer than before, such that it increases the time observed by
> > CLOCK_THREAD_CPUTIME_ID from userspace, or the overall duration of
> > workingset_refault observed by perf?
>
> Ivan reported this issue last month and since then we have backported
> three patches to 5.15-stable to fix the issue. These got merged into
> 5.15.18. However you are reporting the issue is still present 5.15.19.
>
> Can you share a bit more detail on your hardware configuration (num of
> cpus) and if possible the flamegraph?
>

Also, if you can reproduce the issue, can you try the patch at
https://lore.kernel.org/all/20210929235936.2859271-1-shakeelb@google.com/?



* Re: Regression in workingset_refault latency on 5.15
  2022-02-23 16:00   ` Shakeel Butt
@ 2022-02-23 17:07     ` Daniel Dao
  2022-02-23 17:36       ` Shakeel Butt
  0 siblings, 1 reply; 38+ messages in thread
From: Daniel Dao @ 2022-02-23 17:07 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Ivan Babrou, kernel-team, Linux MM, Johannes Weiner,
	Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton,
	Michal Koutný,
	Andrew Morton, Linus Torvalds

On Wed, Feb 23, 2022 at 4:00 PM Shakeel Butt <shakeelb@google.com> wrote:

> > Can you share a bit more detail on your hardware configuration (num of
> > cpus) and if possible the flamegraph?
> >

We have a mix of machines with 96 and 128 CPUs. I'm not yet sure if it's
possible to share the flamegraphs. We may have to come back to that later
if necessary.

>
> Also if you can reproduce the issue, can you try the patch at
> https://lore.kernel.org/all/20210929235936.2859271-1-shakeelb@google.com/
> ?

We can give it a try. I also wrote a bpftrace script to get the kernel
stack when we encounter a slow mem_cgroup_flush_stats (with 10ms as the
threshold):

kprobe:mem_cgroup_flush_stats
{
  @start[tid] = nsecs;
  @stack[tid] = kstack;
}

kretprobe:mem_cgroup_flush_stats
/@start[tid]/
{
  $usecs = (nsecs - @start[tid]) / 1000;
  if ($usecs >= 10000) {
    printf("mem_cgroup_flush_stats: %d us\n", $usecs);
    printf("stack: %s\n", @stack[tid]);
  }
  delete(@start[tid]);
  delete(@stack[tid]);
}

END
{
  clear(@start);
  clear(@stack);
}

Running it on a production node yields output like

mem_cgroup_flush_stats: 10697 us
stack:
        mem_cgroup_flush_stats+1
        workingset_refault+296
        add_to_page_cache_lru+159
        page_cache_ra_unbounded+340
        force_page_cache_ra+226
        filemap_get_pages+233
        filemap_read+164
        xfs_file_buffered_read+152
        xfs_file_read_iter+106
        new_sync_read+277
        vfs_read+242
        __x64_sys_pread64+137
        do_syscall_64+56
        entry_SYSCALL_64_after_hwframe+68

I think the addition of many milliseconds to workingset_refault is too high.



* Re: Regression in workingset_refault latency on 5.15
  2022-02-23 17:07     ` Daniel Dao
@ 2022-02-23 17:36       ` Shakeel Butt
  2022-02-23 19:28         ` Ivan Babrou
  0 siblings, 1 reply; 38+ messages in thread
From: Shakeel Butt @ 2022-02-23 17:36 UTC (permalink / raw)
  To: Daniel Dao
  Cc: Ivan Babrou, kernel-team, Linux MM, Johannes Weiner,
	Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton,
	Michal Koutný,
	Andrew Morton, Linus Torvalds

On Wed, Feb 23, 2022 at 9:07 AM Daniel Dao <dqminh@cloudflare.com> wrote:
>
> On Wed, Feb 23, 2022 at 4:00 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> > > Can you share a bit more detail on your hardware configuration (num of
> > > cpus) and if possible the flamegraph?
> > >
>
> We have a mix of 96 and 128 cpus. I'm not yet sure if it's possible to share
> the flamegraphs. We may have to come back to that later if necessary.
>

Can you please first confirm that you are running the 5.15.19 kernel? If
that is the case, then it means the machine is seeing (nr_cpus * 32)
stat updates.

A couple of questions:
1) Did you see any call stack other than the xfs read where the expensive
mem_cgroup_flush_stats() is showing up?
2) Can you please use similar bpf+kprobe tracing for memcg_rstat_updated()
(or __mod_memcg_lruvec_state()) to find the source of the frequent stat
updates?
3) I am still pondering why disabling swap resolves the issue for you.
Is that only for a workload different from the xfs read?
4) Can you please also trace cgroup_rstat_flush_locked()?



* Re: Regression in workingset_refault latency on 5.15
  2022-02-23 17:36       ` Shakeel Butt
@ 2022-02-23 19:28         ` Ivan Babrou
  2022-02-23 20:28           ` Shakeel Butt
  0 siblings, 1 reply; 38+ messages in thread
From: Ivan Babrou @ 2022-02-23 19:28 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Daniel Dao, kernel-team, Linux MM, Johannes Weiner,
	Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton,
	Michal Koutný,
	Andrew Morton, Linus Torvalds

> Can you please first confirm if you are running 5.15.19 kernel? If
> that is the case then it means the machine is seeing (nr_cpus * 32)
> stat updates.

Yes, we're on 5.15.19.

> Couple of questions:
> 1) Did you see any call stack other than xfs read where expensive
> mem_cgroup_flush_stats() is showing up?

Yes:

mem_cgroup_flush_stats: 7717 us
stack:
        mem_cgroup_flush_stats+1
        workingset_refault+296
        __read_swap_cache_async+512
        swapin_readahead+817
        do_swap_page+561
        __handle_mm_fault+2318
        handle_mm_fault+191
        do_user_addr_fault+428
        exc_page_fault+103
        asm_exc_page_fault+30

> 2) Can you please use the similar bpf+kprobe tracing for the
> memcg_rstat_updated() (or __mod_memcg_lruvec_state()) to find the
> source of frequent stat updates.

"memcg_rstat_updated" is "static inline".

With the following:

bpftrace -e 'kprobe:__mod_memcg_lruvec_state { @stacks[kstack(10)]++ }'

The top stacks are:

@stacks[
    __mod_memcg_lruvec_state+1
    __mod_lruvec_page_state+96
    clear_page_dirty_for_io+292
    write_cache_pages+323
    iomap_writepages+28
    xfs_vm_writepages+113
    do_writepages+195
    __writeback_single_inode+55
    writeback_sb_inodes+493
    __writeback_inodes_wb+76
]: 36429
@stacks[
    __mod_memcg_lruvec_state+1
    __mod_lruvec_page_state+96
    __test_set_page_writeback+421
    iomap_do_writepage+1241
    write_cache_pages+393
    iomap_writepages+28
    xfs_vm_writepages+113
    do_writepages+195
    __writeback_single_inode+55
    writeback_sb_inodes+493
]: 36431
@stacks[
    __mod_memcg_lruvec_state+1
    mod_objcg_state+421
    memcg_slab_post_alloc_hook+345
    __kmalloc_node_track_caller+386
    __alloc_skb+139
    alloc_skb_with_frags+75
    sock_alloc_send_pskb+495
    unix_stream_sendmsg+599
    sock_sendmsg+94
    sock_write_iter+151
]: 39459
@stacks[
    __mod_memcg_lruvec_state+1
    release_pages+760
    tlb_finish_mmu+89
    zap_page_range+265
    do_madvise.part.0+1844
    __x64_sys_madvise+69
    do_syscall_64+59
    entry_SYSCALL_64_after_hwframe+68
]: 54648
@stacks[
    __mod_memcg_lruvec_state+1
    __mod_lruvec_page_state+96
    page_remove_rmap+85
    unmap_page_range+1413
    zap_page_range+224
    do_madvise.part.0+1844
    __x64_sys_madvise+69
    do_syscall_64+59
    entry_SYSCALL_64_after_hwframe+68
]: 54719
@stacks[
    __mod_memcg_lruvec_state+1
    __pagevec_lru_add+256
    lru_cache_add+82
    __handle_mm_fault+4686
    handle_mm_fault+191
    do_user_addr_fault+428
    exc_page_fault+103
    asm_exc_page_fault+30
]: 98103
@stacks[
    __mod_memcg_lruvec_state+1
    __mod_lruvec_page_state+96
    page_add_new_anon_rmap+78
    __handle_mm_fault+4675
    handle_mm_fault+191
    do_user_addr_fault+428
    exc_page_fault+103
    asm_exc_page_fault+30
]: 105071

> 3) I am still pondering why disabling swap resolves the issue for you.
> Is that only for a workload different from xfs read?

My understanding is that any block IO (including swap) triggers the new
memcg accounting code. In our process we don't have any IO other than
swap, so disabling swap removes the major (if not the only) vector for
triggering this issue.

> 4) Can you please also trace cgroup_rstat_flush_locked()?

Stacks are:

bpftrace -e 'kprobe:cgroup_rstat_flush_locked { @stacks[kstack(10)]++ }'
Attaching 1 probe...
^C

@stacks[
    cgroup_rstat_flush_locked+1
    cgroup_rstat_flush_irqsafe+36
    __mem_cgroup_flush_stats+51
    shrink_node+163
    balance_pgdat+777
    kswapd+508
    kthread+295
    ret_from_fork+34
]: 1
@stacks[
    cgroup_rstat_flush_locked+1
    cgroup_rstat_flush_irqsafe+36
    __mem_cgroup_flush_stats+51
    workingset_refault+296
    __read_swap_cache_async+512
    swapin_readahead+817
    do_swap_page+561
    __handle_mm_fault+2318
    handle_mm_fault+191
    do_user_addr_fault+428
]: 5
@stacks[
    cgroup_rstat_flush_locked+1
    cgroup_rstat_flush_irqsafe+36
    __mem_cgroup_flush_stats+51
    workingset_refault+296
    add_to_page_cache_lru+159
    page_cache_ra_unbounded+340
    filemap_get_pages+569
    filemap_read+164
    xfs_file_buffered_read+152
    xfs_file_read_iter+106
]: 18
@stacks[
    cgroup_rstat_flush_locked+1
    cgroup_rstat_flush_irqsafe+36
    __mem_cgroup_flush_stats+51
    workingset_refault+296
    add_to_page_cache_lru+159
    page_cache_ra_unbounded+340
    filemap_get_pages+233
    filemap_read+164
    xfs_file_buffered_read+152
    xfs_file_read_iter+106
]: 31
@stacks[
    cgroup_rstat_flush_locked+1
    cgroup_rstat_flush_irqsafe+36
    __mem_cgroup_flush_stats+51
    workingset_refault+296
    add_to_page_cache_lru+159
    page_cache_ra_unbounded+340
    force_page_cache_ra+226
    filemap_get_pages+233
    filemap_read+164
    xfs_file_buffered_read+152
]: 223

Let me know if you only need the slow ones.

I should mention that there are really two issues:

1. Expensive workingset_refault, which shows up on flamegraphs. We see
it for our rocksdb based database, which persists data on xfs (local
nvme).
2. Expensive workingset_refault that causes latency hiccups, but
doesn't show up on flamegraphs. We see it in our nginx based proxy
with swap enabled (either zram or regular file on xfs on local nvme).

We solved the latter by disabling swap. I think the proper solution
would be for workingset_refault to be fast enough to be invisible, in
line with what was happening on Linux 5.10.



* Re: Regression in workingset_refault latency on 5.15
  2022-02-23 19:28         ` Ivan Babrou
@ 2022-02-23 20:28           ` Shakeel Butt
  2022-02-23 21:16             ` Ivan Babrou
  0 siblings, 1 reply; 38+ messages in thread
From: Shakeel Butt @ 2022-02-23 20:28 UTC (permalink / raw)
  To: Ivan Babrou
  Cc: Daniel Dao, kernel-team, Linux MM, Johannes Weiner,
	Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton,
	Michal Koutný,
	Andrew Morton, Linus Torvalds

On Wed, Feb 23, 2022 at 11:28 AM Ivan Babrou <ivan@cloudflare.com> wrote:
>
[...]
> > 2) Can you please use the similar bpf+kprobe tracing for the
> > memcg_rstat_updated() (or __mod_memcg_lruvec_state()) to find the
> > source of frequent stat updates.
>
> "memcg_rstat_updated" is "static inline".
>
> With the following:
>
> bpftrace -e 'kprobe:__mod_memcg_lruvec_state { @stacks[kstack(10)]++ }'
>
[...]

Thanks, this is helpful. It seems like most of the stat updates are
happening on anon page faults and, based on the signatures, they look
like swap refaults.

>
> > 3) I am still pondering why disabling swap resolves the issue for you.
> > Is that only for a workload different from xfs read?
>
> My understanding is that any block IO (including swap) triggers new
> memcg accounting code. In our process we don't have any other IO than
> swap, so disabling swap removes the major (if not only) vector of
> triggering this issue.
>

Now I understand why disabling swap is helping your case: the number of
stat updates would be reduced drastically, and the rstat flush would
happen asynchronously most of the time.

[...]
> I should mention that there are really two issues:
>
> 1. Expensive workingset_refault, which shows up on flamegraphs. We see
> it for our rocksdb based database, which persists data on xfs (local
> nvme).
> 2. Expensive workingset_refault that causes latency hiccups, but
> doesn't show up on flamegraphs. We see it in our nginx based proxy
> with swap enabled (either zram or regular file on xfs on local nvme).
>
> We solved the latter by disabling swap. I think the proper solution
> would be for workingset_refault to be fast enough to be invisible, in
> line with what was happening on Linux 5.10.

Thanks for the info. Is it possible to test
https://lore.kernel.org/all/20210929235936.2859271-1-shakeelb@google.com/?

If that patch does not help, then we either have to optimize rstat
flushing or further increase the update buffer, which is nr_cpus * 32.



* Re: Regression in workingset_refault latency on 5.15
  2022-02-23 20:28           ` Shakeel Butt
@ 2022-02-23 21:16             ` Ivan Babrou
  2022-02-24 14:46               ` Daniel Dao
  0 siblings, 1 reply; 38+ messages in thread
From: Ivan Babrou @ 2022-02-23 21:16 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Daniel Dao, kernel-team, Linux MM, Johannes Weiner,
	Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton,
	Michal Koutný,
	Andrew Morton, Linus Torvalds

> Thanks for the info. Is it possible to test
> https://lore.kernel.org/all/20210929235936.2859271-1-shakeelb@google.com/
> ?
>
> If that patch did not help then we either have to optimize rstat
> flushing or further increase the update buffer which is nr_cpus * 32.

We're working on it and should have some data tomorrow.



* Re: Regression in workingset_refault latency on 5.15
  2022-02-23 13:51 Regression in workingset_refault latency on 5.15 Daniel Dao
  2022-02-23 15:57 ` Shakeel Butt
@ 2022-02-24  9:22 ` Thorsten Leemhuis
  2022-04-11 10:17   ` Regression in workingset_refault latency on 5.15 #forregzbot Thorsten Leemhuis
  1 sibling, 1 reply; 38+ messages in thread
From: Thorsten Leemhuis @ 2022-02-24  9:22 UTC (permalink / raw)
  To: Daniel Dao, shakeelb
  Cc: kernel-team, linux-mm, hannes, guro, feng.tang, mhocko, hdanton,
	mkoutny, akpm, torvalds, regressions

[TLDR: I'm adding the regression report below to regzbot, the Linux
kernel regression tracking bot; all text you find below is compiled from
a few template paragraphs you might have encountered already in similar
mails.]

Hi, this is your Linux kernel regression tracker speaking.

Thanks for the report. CCing the regression mailing list, as it should
be in the loop for all regressions, as explained here:
https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html

To be sure this issue doesn't fall through the cracks unnoticed, I'm
adding it to regzbot, my Linux kernel regression tracking bot:

#regzbot ^introduced v3.0..v5.15.19
#regzbot ignore-activity

Reminder for developers: when fixing the issue, please add a 'Link:'
tag pointing to the report (the mail quoted below) using
lore.kernel.org/r/, as explained in
'Documentation/process/submitting-patches.rst' and
'Documentation/process/5.Posting.rst'. This allows the bot to connect
the report with any patches posted or committed to fix the issue; this
again allows the bot to show the current status of regressions and
automatically resolve the issue when the fix hits the right tree.

I'm sending this to everyone that got the initial report, to make them
aware of the tracking. I also hope that messages like this motivate
people to directly get at least the regression mailing list and ideally
even regzbot involved when dealing with regressions, as messages like
this wouldn't be needed then.

Don't worry, I'll send further messages wrt this regression just to the
lists (with a tag in the subject so people can filter them away) if they
are relevant just for regzbot. With a bit of luck no such messages will
be needed anyway.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I'm getting a lot of
reports on my table. I can only look briefly into most of them and lack
knowledge about most of the areas they concern. I thus unfortunately
will sometimes get things wrong or miss something important. I hope
that's not the case here; if you think it is, don't hesitate to tell me
in a public reply, it's in everyone's interest to set the public record
straight.

On 23.02.22 14:51, Daniel Dao wrote:
> Hi all,
> 
> We are observing some regressions in workingset_refault on our newly upgraded
> 5.15.19 nodes with zram as swap. This manifests in several ways:
> 
> 1) Regression of workingset_refault duration observed in flamegraph
> 
> We regularly collect flamegraphs for running services on the node. Since upgrade
> to 5.15.19, we see that workingset_refault occupied a more significant part of
> the service flamegraph (13%) with the following call trace
> 
>   workingset_refault+0x128
>   add_to_page_cache_lru+0x9f
>   page_cache_ra_unbounded+0x154
>   force_page_cache_ra+0xe2
>   filemap_get_pages+0xe9
>   filemap_read+0xa4
>   xfs_file_buffered_read+0x98
>   xfs_file_read_iter+0x6a
>   new_sync_read+0x118
>   vfs_read+0xf2
>   __x64_sys_pread64+0x89
>   do_syscall_64+0x3b
>   entry_SYSCALL_64_after_hwframe+0x44
> 
> 2) Regression of userspace performance sensitive code
> 
> We have some performance-sensitive code running in userspace that has its
> runtime measured by CLOCK_THREAD_CPUTIME_ID. It looks roughly like:
> 
>   now = clock_gettime(CLOCK_THREAD_CPUTIME_ID)
>   func()
>   elapsed = clock_gettime(CLOCK_THREAD_CPUTIME_ID) - now
> 
> Since 5.15 upgrade, we observed long `elapsed` in the range of 4-10ms much more
> frequently than before.  This went away after we disabled swap for the service
> using `memory.swap.max=0` memcg configuration.
> 
> The new thing in the 5.15 workingset_refault seems to be the introduction
> of mem_cgroup_flush_stats() by commit
> 1f828223b7991a228bc2aef837b78737946d44b2 ("memcg: flush lruvec stats in
> the refault").
> 
> Given that mem_cgroup_flush_stats can take quite a long time for us on the
> standard systemd cgroupv2 hierarchy (root / system.slice / workload.service):
> 
>   sudo /usr/share/bcc/tools/funcslower -m 10 -t mem_cgroup_flush_stats
>   Tracing function calls slower than 10 ms... Ctrl+C to quit.
>   TIME       COMM           PID    LAT(ms)             RVAL FUNC
>   0.000000   <redacted>     804776   11.50              200 mem_cgroup_flush_stats
>   0.343383   <redacted>     647496   10.58              200 mem_cgroup_flush_stats
>   0.604309   <redacted>     804776   10.50              200 mem_cgroup_flush_stats
>   1.230416   <redacted>     803293   10.01              200 mem_cgroup_flush_stats
>   1.248442   <redacted>     646400   11.02              200 mem_cgroup_flush_stats
> 
> Could it be that workingset_refault in some unfortunate cases takes much
> longer than before, such that it increases the time observed by
> CLOCK_THREAD_CPUTIME_ID from userspace, or the overall duration of
> workingset_refault observed by perf?
> 


-- 
Additional information about regzbot:

If you want to know more about regzbot, check out its web-interface, the
getting start guide, and the references documentation:

https://linux-regtracking.leemhuis.info/regzbot/
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md

The last two documents will explain how you can interact with regzbot
yourself if you want to.

Hint for reporters: when reporting a regression it's in your interest to
CC the regression list and tell regzbot about the issue, as that ensures
the regression makes it onto the radar of the Linux kernel's regression
tracker -- that's in your interest, as it ensures your report won't fall
through the cracks unnoticed.

Hint for developers: you normally don't need to care about regzbot once
it's involved. Fix the issue as you normally would, just remember to
include 'Link:' tag in the patch descriptions pointing to all reports
about the issue. This has been expected from developers even before
regzbot showed up for reasons explained in
'Documentation/process/submitting-patches.rst' and
'Documentation/process/5.Posting.rst'.


* Re: Regression in workingset_refault latency on 5.15
  2022-02-23 21:16             ` Ivan Babrou
@ 2022-02-24 14:46               ` Daniel Dao
  2022-02-24 16:58                 ` Shakeel Butt
  0 siblings, 1 reply; 38+ messages in thread
From: Daniel Dao @ 2022-02-24 14:46 UTC (permalink / raw)
  To: Ivan Babrou
  Cc: Shakeel Butt, kernel-team, Linux MM, Johannes Weiner,
	Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton,
	Michal Koutný,
	Andrew Morton, Linus Torvalds

On Wed, Feb 23, 2022 at 9:16 PM Ivan Babrou <ivan@cloudflare.com> wrote:
>
> Thanks for the info. Is it possible to test
> https://lore.kernel.org/all/20210929235936.2859271-1-shakeelb@google.com/

We built the patch on top of 5.15.25 and deployed it with swap-on-zram on some
nodes. Test data follow:

1) Latency of workingset_refault over 60 seconds

$ sudo /usr/share/bcc/tools/funclatency -d 60 workingset_refault
Tracing 1 functions for "workingset_refault"... Hit Ctrl-C to end.

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 4643     |**********************                  |
       512 -> 1023       : 4259     |********************                    |
      1024 -> 2047       : 1178     |*****                                   |
      2048 -> 4095       : 4044     |*******************                     |
      4096 -> 8191       : 1804     |********                                |
      8192 -> 16383      : 109      |                                        |
     16384 -> 32767      : 52       |                                        |
     32768 -> 65535      : 45       |                                        |
     65536 -> 131071     : 15       |                                        |
    131072 -> 262143     : 2        |                                        |
    262144 -> 524287     : 0        |                                        |
    524288 -> 1048575    : 7        |                                        |
   1048576 -> 2097151    : 900      |****                                    |
   2097152 -> 4194303    : 8218     |****************************************|
   4194304 -> 8388607    : 967      |****                                    |
   8388608 -> 16777215   : 89       |                                        |

avg = 1229607 nsecs, total: 32378027676 nsecs, count: 26332

2) Latency of cgroup_rstat_flush_locked over 60 seconds

$ sudo /usr/share/bcc/tools/funclatency -d 60 cgroup_rstat_flush_locked
Tracing 1 functions for "cgroup_rstat_flush_locked"... Hit Ctrl-C to end.

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 36       |                                        |
      2048 -> 4095       : 73       |                                        |
      4096 -> 8191       : 151      |                                        |
      8192 -> 16383      : 606      |**                                      |
     16384 -> 32767      : 387      |*                                       |
     32768 -> 65535      : 49       |                                        |
     65536 -> 131071     : 26       |                                        |
    131072 -> 262143     : 6        |                                        |
    262144 -> 524287     : 3        |                                        |
    524288 -> 1048575    : 14       |                                        |
   1048576 -> 2097151    : 2085     |*********                               |
   2097152 -> 4194303    : 8580     |****************************************|
   4194304 -> 8388607    : 772      |***                                     |
   8388608 -> 16777215   : 117      |                                        |
  16777216 -> 33554431   : 1        |                                        |

avg = 2592894 nsecs, total: 33463898782 nsecs, count: 12906

3) Summary of stack traces when mem_cgroup_flush_stats is over 5ms

@stackcount[
    mem_cgroup_flush_stats+1
    workingset_refault+296
    __read_swap_cache_async+512
    swap_cluster_readahead+440
    shmem_swapin+164
    shmem_swapin_page+396
    shmem_getpage_gfp+1228
    shmem_fault+104
    __do_fault+55
    __handle_mm_fault+3102
    handle_mm_fault+191
    do_user_addr_fault+428
    exc_page_fault+103
    asm_exc_page_fault+30
]: 1
@stackcount[
    mem_cgroup_flush_stats+1
    workingset_refault+296
    __read_swap_cache_async+512
    swap_cluster_readahead+333
    shmem_swapin+164
    shmem_swapin_page+396
    shmem_getpage_gfp+1228
    shmem_fault+104
    __do_fault+55
    __handle_mm_fault+3649
    handle_mm_fault+191
    do_user_addr_fault+428
    exc_page_fault+103
    asm_exc_page_fault+30
]: 1
@stackcount[
    mem_cgroup_flush_stats+1
    workingset_refault+296
    __read_swap_cache_async+512
    swap_cluster_readahead+333
    shmem_swapin+164
    shmem_swapin_page+396
    shmem_getpage_gfp+1228
    shmem_fault+104
    __do_fault+55
    __handle_mm_fault+3102
    handle_mm_fault+191
    do_user_addr_fault+428
    exc_page_fault+103
    asm_exc_page_fault+30
]: 4
@stackcount[
    mem_cgroup_flush_stats+1
    shrink_node+163
    balance_pgdat+791
    kswapd+508
    kthread+295
    ret_from_fork+34
]: 7
@stackcount[
    mem_cgroup_flush_stats+1
    workingset_refault+296
    add_to_page_cache_lru+159
    page_cache_ra_unbounded+340
    filemap_get_pages+569
    filemap_read+164
    xfs_file_buffered_read+152
    xfs_file_read_iter+106
    new_sync_read+280
    vfs_read+242
    __x64_sys_pread64+137
    __noinstr_text_start+59
    entry_SYSCALL_64_after_hwframe+68
]: 10
@stackcount[
    mem_cgroup_flush_stats+1
    workingset_refault+296
    add_to_page_cache_lru+159
    page_cache_ra_unbounded+340
    filemap_get_pages+233
    filemap_read+164
    xfs_file_buffered_read+152
    xfs_file_read_iter+106
    new_sync_read+280
    vfs_read+242
    __x64_sys_pread64+137
    __noinstr_text_start+59
    entry_SYSCALL_64_after_hwframe+68
]: 111
@stackcount[
    mem_cgroup_flush_stats+1
    workingset_refault+296
    add_to_page_cache_lru+159
    page_cache_ra_unbounded+340
    force_page_cache_ra+226
    filemap_get_pages+233
    filemap_read+164
    xfs_file_buffered_read+152
    xfs_file_read_iter+106
    new_sync_read+280
    vfs_read+242
    __x64_sys_pread64+137
    __noinstr_text_start+59
    entry_SYSCALL_64_after_hwframe+68
]: 225

This was collected with the following bpftrace script:

kprobe:mem_cgroup_flush_stats
{
  // record entry timestamp and kernel stack per thread
  @start[tid] = nsecs;
  @stack[tid] = kstack;
}

kretprobe:mem_cgroup_flush_stats
/@start[tid]/
{
  // count the stack only if the call took 5ms (5000us) or longer
  $usecs = (nsecs - @start[tid]) / 1000;
  if ($usecs >= 5000) {
    @stackcount[@stack[tid]]++;
  }
  delete(@start[tid]);
  delete(@stack[tid]);
}

interval:s:60
{
  // dump the per-stack counts after 60 seconds and stop
  print(@stackcount);
  exit();
}

>
> If that patch did not help then we either have to optimize rstat
> flushing or further increase the update buffer which is nr_cpus * 32.

I think that overall we don't see any significant improvement: expensive
mem_cgroup_flush_stats calls still show up.



* Re: Regression in workingset_refault latency on 5.15
  2022-02-24 14:46               ` Daniel Dao
@ 2022-02-24 16:58                 ` Shakeel Butt
  2022-02-24 17:34                   ` Daniel Dao
  0 siblings, 1 reply; 38+ messages in thread
From: Shakeel Butt @ 2022-02-24 16:58 UTC (permalink / raw)
  To: Daniel Dao
  Cc: Ivan Babrou, kernel-team, Linux MM, Johannes Weiner,
	Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton,
	Michal Koutný,
	Andrew Morton, Linus Torvalds

On Thu, Feb 24, 2022 at 02:46:27PM +0000, Daniel Dao wrote:

[...]


> 3) Summary of stack traces when mem_cgroup_flush_stats is over 5ms


Can you please check if flush_memcg_stats_dwork() appears in any stack
traces at all?

[...]

> >
> > If that patch did not help then we either have to optimize rstat
> > flushing or further increase the update buffer which is nr_cpus * 32.

> I think that overall we don't see any significant improvement: expensive
> mem_cgroup_flush_stats calls still show up.


Thanks for testing. At the moment I suspect the async worker is not
getting the CPU. Can you share your CONFIG_HZ setting? Also, can you try
the following patch and see if that helps; otherwise, keep halving the
delay (i.e. 2*HZ -> HZ -> HZ/2 -> ...) and find the value at which the
issue you are seeing gets resolved.


diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 32ba963ebf2e..198a7baaeeb7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -685,7 +685,7 @@ void mem_cgroup_flush_stats(void)
  static void flush_memcg_stats_dwork(struct work_struct *w)
  {
  	__mem_cgroup_flush_stats();
-	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, 2UL*HZ);
+	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, HZ);
  }

  /**



* Re: Regression in workingset_refault latency on 5.15
  2022-02-24 16:58                 ` Shakeel Butt
@ 2022-02-24 17:34                   ` Daniel Dao
  2022-02-24 18:00                     ` Shakeel Butt
  0 siblings, 1 reply; 38+ messages in thread
From: Daniel Dao @ 2022-02-24 17:34 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Ivan Babrou, kernel-team, Linux MM, Johannes Weiner,
	Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton,
	Michal Koutný,
	Andrew Morton, Linus Torvalds

On Thu, Feb 24, 2022 at 4:58 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Thu, Feb 24, 2022 at 02:46:27PM +0000, Daniel Dao wrote:
>
> [...]
>
>
> > 3) Summary of stack traces when mem_cgroup_flush_stats is over 5ms
>
>
> Can you please check if flush_memcg_stats_dwork() appears in any stack
> traces at all?

Here is the result of probes on flush_memcg_stats_dwork:

$ sudo /usr/share/bcc/tools/funccount -d 30 flush_memcg_stats_dwork
Tracing 1 functions for "b'flush_memcg_stats_dwork'"... Hit Ctrl-C to end.

FUNC                                    COUNT
b'flush_memcg_stats_dwork'                 14

$ sudo /usr/share/bcc/tools/funclatency -d 30 flush_memcg_stats_dwork
Tracing 1 functions for "flush_memcg_stats_dwork"... Hit Ctrl-C to end.

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 8        |****************************************|
      8192 -> 16383      : 0        |                                        |
     16384 -> 32767      : 0        |                                        |
     32768 -> 65535      : 0        |                                        |
     65536 -> 131071     : 0        |                                        |
    131072 -> 262143     : 0        |                                        |
    262144 -> 524287     : 0        |                                        |
    524288 -> 1048575    : 0        |                                        |
   1048576 -> 2097151    : 1        |*****                                   |
   2097152 -> 4194303    : 4        |********************                    |
   4194304 -> 8388607    : 2        |**********                              |

avg = 1725693 nsecs, total: 25885397 nsecs, count: 15

So we triggered the async flush as expected, around every 2 seconds, but
those runs mostly complete faster than the inline call from
workingset_refault. I think on busy servers with varied workloads that
touch swap/page cache, it's very likely that most of the cost comes from
the inline mem_cgroup_flush_stats() in workingset_refault rather than
from the async flush.

> Thanks for testing. At the moment I suspect the async worker is not
> getting the CPU. Can you share your CONFIG_HZ setting? Also, can you try
> the following patch and see if that helps; otherwise, keep halving the
> delay (i.e. 2*HZ -> HZ -> HZ/2 -> ...) and find the value at which the
> issue you are seeing gets resolved.

We have CONFIG_HZ=1000. We can try to increase the frequency of the
async flush, but that seems like a poor bandaid. Is it possible to
remove mem_cgroup_flush_stats() from workingset_refault, or at least
scope it down to a targeted cgroup, so we don't need to flush from the
root with a potentially large set of cgroups to walk?



* Re: Regression in workingset_refault latency on 5.15
  2022-02-24 17:34                   ` Daniel Dao
@ 2022-02-24 18:00                     ` Shakeel Butt
  2022-02-24 18:52                       ` Shakeel Butt
  0 siblings, 1 reply; 38+ messages in thread
From: Shakeel Butt @ 2022-02-24 18:00 UTC (permalink / raw)
  To: Daniel Dao
  Cc: Ivan Babrou, kernel-team, Linux MM, Johannes Weiner,
	Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton,
	Michal Koutný,
	Andrew Morton, Linus Torvalds

On Thu, Feb 24, 2022 at 9:34 AM Daniel Dao <dqminh@cloudflare.com> wrote:
>
[...]
> > Thanks for testing. At the moment I suspect the async worker is not
> > getting the CPU. Can you share your CONFIG_HZ setting? Also, can you try
> > the following patch and see if that helps; otherwise, keep halving the
> > delay (i.e. 2*HZ -> HZ -> HZ/2 -> ...) and find the value at which the
> > issue you are seeing gets resolved.
>
> We have CONFIG_HZ=1000. We can try to increase the frequency of the
> async flush, but that seems like a poor bandaid. Is it possible to
> remove mem_cgroup_flush_stats() from workingset_refault, or at least
> scope it down to a targeted cgroup, so we don't need to flush from the
> root with a potentially large set of cgroups to walk?

I actually wanted to know what a good frequency of rstat flushing would
be for your workload. I am not planning to propose changing the default
frequency.

Anyways I am thinking of introducing mem_cgroup_flush_stats_async(),
which will schedule flush_memcg_stats_dwork() without delay. Let me
prepare the patch based on 5.15-stable for you to test.



* Re: Regression in workingset_refault latency on 5.15
  2022-02-24 18:00                     ` Shakeel Butt
@ 2022-02-24 18:52                       ` Shakeel Butt
  2022-02-25 10:23                         ` Daniel Dao
  0 siblings, 1 reply; 38+ messages in thread
From: Shakeel Butt @ 2022-02-24 18:52 UTC (permalink / raw)
  To: Daniel Dao
  Cc: Ivan Babrou, kernel-team, Linux MM, Johannes Weiner,
	Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton,
	Michal Koutný,
	Andrew Morton, Linus Torvalds

On Thu, Feb 24, 2022 at 10:00:55AM -0800, Shakeel Butt wrote:
> On Thu, Feb 24, 2022 at 9:34 AM Daniel Dao <dqminh@cloudflare.com> wrote:
[...]
> Anyways I am thinking of introducing mem_cgroup_flush_stats_async(),
> which will schedule flush_memcg_stats_dwork() without delay. Let me
> prepare the patch based on 5.15-stable for you to test.


Can you please try the following patch and let me know the results?

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d9b8df5ef212..cd732d7e00ca 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1002,6 +1002,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
  }

  void mem_cgroup_flush_stats(void);
+void mem_cgroup_flush_stats_async(void);

  void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
  			      int val);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 32ba963ebf2e..0f298cbd4763 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -682,6 +682,14 @@ void mem_cgroup_flush_stats(void)
  		__mem_cgroup_flush_stats();
  }

+void mem_cgroup_flush_stats_async(void)
+{
+	if (atomic_read(&stats_flush_threshold) > num_online_cpus()) {
+		mod_delayed_work(system_unbound_wq, &stats_flush_dwork, 0);
+		atomic_set(&stats_flush_threshold, 0);
+	}
+}
+
  static void flush_memcg_stats_dwork(struct work_struct *w)
  {
  	__mem_cgroup_flush_stats();
diff --git a/mm/workingset.c b/mm/workingset.c
index d5b81e4f4cbe..86d43bfc5c63 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -352,7 +352,7 @@ void workingset_refault(struct page *page, void *shadow)

  	inc_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file);

-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats_async();
  	/*
  	 * Compare the distance to the existing workingset size. We
  	 * don't activate pages that couldn't stay resident even if



* Re: Regression in workingset_refault latency on 5.15
  2022-02-24 18:52                       ` Shakeel Butt
@ 2022-02-25 10:23                         ` Daniel Dao
  2022-02-25 17:08                           ` Ivan Babrou
  0 siblings, 1 reply; 38+ messages in thread
From: Daniel Dao @ 2022-02-25 10:23 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Ivan Babrou, kernel-team, Linux MM, Johannes Weiner,
	Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton,
	Michal Koutný,
	Andrew Morton, Linus Torvalds

On Thu, Feb 24, 2022 at 6:52 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Thu, Feb 24, 2022 at 10:00:55AM -0800, Shakeel Butt wrote:
> > On Thu, Feb 24, 2022 at 9:34 AM Daniel Dao <dqminh@cloudflare.com> wrote:
> [...]
> > Anyways I am thinking of introducing mem_cgroup_flush_stats_async(),
> > which will schedule flush_memcg_stats_dwork() without delay. Let me
> > prepare the patch based on 5.15-stable for you to test.
>
>
> Can you please try the following patch and let me know the results?
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index d9b8df5ef212..cd732d7e00ca 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1002,6 +1002,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
>   }
>
>   void mem_cgroup_flush_stats(void);
> +void mem_cgroup_flush_stats_async(void);
>
>   void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
>                               int val);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 32ba963ebf2e..0f298cbd4763 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -682,6 +682,14 @@ void mem_cgroup_flush_stats(void)
>                 __mem_cgroup_flush_stats();
>   }
>
> +void mem_cgroup_flush_stats_async(void)
> +{
> +       if (atomic_read(&stats_flush_threshold) > num_online_cpus()) {
> +               mod_delayed_work(system_unbound_wq, &stats_flush_dwork, 0);
> +               atomic_set(&stats_flush_threshold, 0);
> +       }
> +}
> +
>   static void flush_memcg_stats_dwork(struct work_struct *w)
>   {
>         __mem_cgroup_flush_stats();
> diff --git a/mm/workingset.c b/mm/workingset.c
> index d5b81e4f4cbe..86d43bfc5c63 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -352,7 +352,7 @@ void workingset_refault(struct page *page, void *shadow)
>
>         inc_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file);
>
> -       mem_cgroup_flush_stats();
> +       mem_cgroup_flush_stats_async();
>         /*
>          * Compare the distance to the existing workingset size. We
>          * don't activate pages that couldn't stay resident even if

I think this looks good so far. Some stats:

1) More time spent in flush_memcg_stats_dwork

$  sudo /usr/share/bcc/tools/funclatency -d 60 flush_memcg_stats_dwork
Tracing 1 functions for "flush_memcg_stats_dwork"... Hit Ctrl-C to end.

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 21       |                                        |
       512 -> 1023       : 115      |                                        |
      1024 -> 2047       : 47       |                                        |
      2048 -> 4095       : 86       |                                        |
      4096 -> 8191       : 48       |                                        |
      8192 -> 16383      : 2        |                                        |
     16384 -> 32767      : 2        |                                        |
     32768 -> 65535      : 1        |                                        |
     65536 -> 131071     : 52       |                                        |
    131072 -> 262143     : 400      |**                                      |
    262144 -> 524287     : 2172     |****************                        |
    524288 -> 1048575    : 4193     |*******************************         |
   1048576 -> 2097151    : 5357     |****************************************|
   2097152 -> 4194303    : 1834     |*************                           |
   4194304 -> 8388607    : 485      |***                                     |
   8388608 -> 16777215   : 3        |                                        |

avg = 1346983 nsecs, total: 19959597708 nsecs, count: 14818


2) Less time spent in workingset_refault

~$ sudo /usr/share/bcc/tools/funclatency -d 60 workingset_refault
Tracing 1 functions for "workingset_refault"... Hit Ctrl-C to end.

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 10318    |****************************************|
       512 -> 1023       : 2877     |***********                             |
      1024 -> 2047       : 1089     |****                                    |
      2048 -> 4095       : 4785     |******************                      |
      4096 -> 8191       : 4528     |*****************                       |
      8192 -> 16383      : 1270     |****                                    |
     16384 -> 32767      : 82       |                                        |
     32768 -> 65535      : 80       |                                        |
     65536 -> 131071     : 20       |                                        |
    131072 -> 262143     : 2        |                                        |
    262144 -> 524287     : 0        |                                        |
    524288 -> 1048575    : 1        |                                        |

avg = 2705 nsecs, total: 67782139 nsecs, count: 25052



* Re: Regression in workingset_refault latency on 5.15
  2022-02-25 10:23                         ` Daniel Dao
@ 2022-02-25 17:08                           ` Ivan Babrou
  2022-02-25 17:22                             ` Shakeel Butt
  2022-02-25 18:03                             ` Michal Koutný
  0 siblings, 2 replies; 38+ messages in thread
From: Ivan Babrou @ 2022-02-25 17:08 UTC (permalink / raw)
  To: Daniel Dao
  Cc: Shakeel Butt, kernel-team, Linux MM, Johannes Weiner,
	Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton,
	Michal Koutný,
	Andrew Morton, Linus Torvalds

On Fri, Feb 25, 2022 at 2:23 AM Daniel Dao <dqminh@cloudflare.com> wrote:
> I think this looks good so far.

I compared a flamegraph before to a flamegraph after (10s @ 99Hz on a
96-core CPU evenly loaded to ~75% in both cases).

Before: 1.4% spent in workingset_refault.
After: 0.5% spent in flush_memcg_stats_dwork.

The latter is all in kworkers (as expected), while the former is
spread across IO active tasks.

This seems like a great first step that should be merged on its own.
It would be good to also do something to improve the CPU time spent in
delayed work, if possible, as 0.5% of on-CPU time is not a negligible
amount.



* Re: Regression in workingset_refault latency on 5.15
  2022-02-25 17:08                           ` Ivan Babrou
@ 2022-02-25 17:22                             ` Shakeel Butt
  2022-02-25 18:03                             ` Michal Koutný
  1 sibling, 0 replies; 38+ messages in thread
From: Shakeel Butt @ 2022-02-25 17:22 UTC (permalink / raw)
  To: Ivan Babrou
  Cc: Daniel Dao, kernel-team, Linux MM, Johannes Weiner,
	Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton,
	Michal Koutný,
	Andrew Morton, Linus Torvalds

On Fri, Feb 25, 2022 at 9:09 AM Ivan Babrou <ivan@cloudflare.com> wrote:
>
> On Fri, Feb 25, 2022 at 2:23 AM Daniel Dao <dqminh@cloudflare.com> wrote:
> > I think this looks good so far.
>
> I compared a flamegraph before to a flamegraph after (10s @ 99Hz on
> 96-core CPU evenly loaded to ~75% in both cases).
>
> Before: 1.4% spent in workingset_refault.
> After: 0.5% spent in flush_memcg_stats_dwork.
>
> The latter is all in kworkers (as expected), while the former is
> spread across IO active tasks.
>
> This seems like a great first step that should be merged on its own.
> It would be good to also do something to improve the CPU time spent in
> delayed work, if possible, as 0.5% of on-CPU time is not a negligible
> amount.

I will send the official async patch soon, and I agree that we need to
improve the rstat flush infrastructure as well, but that will need more
thought and time.



* Re: Regression in workingset_refault latency on 5.15
  2022-02-25 17:08                           ` Ivan Babrou
  2022-02-25 17:22                             ` Shakeel Butt
@ 2022-02-25 18:03                             ` Michal Koutný
  2022-02-25 18:08                               ` Ivan Babrou
  1 sibling, 1 reply; 38+ messages in thread
From: Michal Koutný @ 2022-02-25 18:03 UTC (permalink / raw)
  To: Ivan Babrou
  Cc: Daniel Dao, Shakeel Butt, kernel-team, Linux MM, Johannes Weiner,
	Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton,
	Andrew Morton, Linus Torvalds

On Fri, Feb 25, 2022 at 09:08:50AM -0800, Ivan Babrou <ivan@cloudflare.com> wrote:
> This seems like a great first step that should be merged on its own.
> It would be good to also do something to improve the CPU time spent in
> delayed work, if possible, as 0.5% of on-CPU time is not a negligible
> amount.

BTW how many levels deep is the affected memory cgroup hierarchy (where
the workingset_refault happens)? (Self answer: Probably less than
nr_cpus*MEMCG_BATCH, so not relevant.)

Thanks,
Michal



* Re: Regression in workingset_refault latency on 5.15
  2022-02-25 18:03                             ` Michal Koutný
@ 2022-02-25 18:08                               ` Ivan Babrou
  2022-02-28 23:09                                 ` Shakeel Butt
  0 siblings, 1 reply; 38+ messages in thread
From: Ivan Babrou @ 2022-02-25 18:08 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Daniel Dao, Shakeel Butt, kernel-team, Linux MM, Johannes Weiner,
	Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton,
	Andrew Morton, Linus Torvalds

On Fri, Feb 25, 2022 at 10:03 AM Michal Koutný <mkoutny@suse.com> wrote:
> BTW how many levels deep is the affected memory cgroup hierarchy (where
> the workingset_refault happens)? (Self answer: Probably less than
> nr_cpus*MEMCG_BATCH, so not relevant.)

It's /system.slice/thingy.service (a regular systemd unit on
cgroupv2), so not very deep.



* Re: Regression in workingset_refault latency on 5.15
  2022-02-25 18:08                               ` Ivan Babrou
@ 2022-02-28 23:09                                 ` Shakeel Butt
  2022-02-28 23:34                                   ` Ivan Babrou
  0 siblings, 1 reply; 38+ messages in thread
From: Shakeel Butt @ 2022-02-28 23:09 UTC (permalink / raw)
  To: Ivan Babrou
  Cc: Michal Koutný,
	Daniel Dao, kernel-team, Linux MM, Johannes Weiner,
	Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton,
	Andrew Morton, Linus Torvalds

On Fri, Feb 25, 2022 at 10:08:04AM -0800, Ivan Babrou wrote:
> On Fri, Feb 25, 2022 at 10:03 AM Michal Koutný <mkoutny@suse.com> wrote:
> > BTW how many levels deep is the affected memory cgroup hierarchy (where
> > the workingset_refault happens)? (Self answer: Probably less than
> > nr_cpus*MEMCG_BATCH, so not relevant.)

> It's /system.slice/thingy.service (a regular systemd unit on
> cgroupv2), so not very deep.


Hi Ivan & Daniel,

Can you please try the following patch, but without the async patch, and
let us know if there is any difference in the regression you are observing?


diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0abbd685703b..59014d20149a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -999,6 +999,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
  }

  void mem_cgroup_flush_stats(void);
+void mem_cgroup_flush_stats_memcg(struct mem_cgroup *memcg);

  void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
  			      int val);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 36e9f38c919d..ae59bfc8788a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -642,14 +642,14 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
  	}
  }

-static void __mem_cgroup_flush_stats(void)
+static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg)
  {
  	unsigned long flag;

  	if (!spin_trylock_irqsave(&stats_flush_lock, flag))
  		return;

-	cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup);
+	cgroup_rstat_flush_irqsafe(memcg->css.cgroup);
  	atomic_set(&stats_flush_threshold, 0);
  	spin_unlock_irqrestore(&stats_flush_lock, flag);
  }
@@ -657,12 +657,18 @@ static void __mem_cgroup_flush_stats(void)
  void mem_cgroup_flush_stats(void)
  {
  	if (atomic_read(&stats_flush_threshold) > num_online_cpus())
-		__mem_cgroup_flush_stats();
+		__mem_cgroup_flush_stats(root_mem_cgroup);
+}
+
+void mem_cgroup_flush_stats_memcg(struct mem_cgroup *memcg)
+{
+	if (atomic_read(&stats_flush_threshold) > num_online_cpus())
+		__mem_cgroup_flush_stats(memcg);
  }

  static void flush_memcg_stats_dwork(struct work_struct *w)
  {
-	__mem_cgroup_flush_stats();
+	__mem_cgroup_flush_stats(root_mem_cgroup);
  	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, 2UL*HZ);
  }

diff --git a/mm/workingset.c b/mm/workingset.c
index 8c03afe1d67c..9b9ebc5e5110 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -354,7 +354,7 @@ void workingset_refault(struct folio *folio, void *shadow)

  	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);

-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats_memcg(eviction_memcg);
  	/*
  	 * Compare the distance to the existing workingset size. We
  	 * don't activate pages that couldn't stay resident even if

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: Regression in workingset_refault latency on 5.15
  2022-02-28 23:09                                 ` Shakeel Butt
@ 2022-02-28 23:34                                   ` Ivan Babrou
  2022-02-28 23:43                                     ` Shakeel Butt
  2022-03-02  0:48                                     ` Ivan Babrou
  0 siblings, 2 replies; 38+ messages in thread
From: Ivan Babrou @ 2022-02-28 23:34 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Koutný,
	Daniel Dao, kernel-team, Linux MM, Johannes Weiner,
	Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton,
	Andrew Morton, Linus Torvalds

On Mon, Feb 28, 2022 at 3:09 PM Shakeel Butt <shakeelb@google.com> wrote:
> Hi Ivan & Daniel,
>
> Can you please try the following patch but without the async patch and
> let us know if there is any difference in regression you are observing?

Hey Shakeel,

I scheduled some reboots for overnight, one machine in each group:

1. Control on 5.15.25
2. Your previous patch with delayed flush on top of 5.15.25
3. Your current patch with targeted flush on top of 5.15.25

Either Daniel or I should have some numbers tomorrow.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Regression in workingset_refault latency on 5.15
  2022-02-28 23:34                                   ` Ivan Babrou
@ 2022-02-28 23:43                                     ` Shakeel Butt
  2022-03-02  0:48                                     ` Ivan Babrou
  1 sibling, 0 replies; 38+ messages in thread
From: Shakeel Butt @ 2022-02-28 23:43 UTC (permalink / raw)
  To: Ivan Babrou
  Cc: Michal Koutný,
	Daniel Dao, kernel-team, Linux MM, Johannes Weiner,
	Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton,
	Andrew Morton, Linus Torvalds

On Mon, Feb 28, 2022 at 3:34 PM Ivan Babrou <ivan@cloudflare.com> wrote:
>
> On Mon, Feb 28, 2022 at 3:09 PM Shakeel Butt <shakeelb@google.com> wrote:
> > Hi Ivan & Daniel,
> >
> > Can you please try the following patch but without the async patch and
> > let us know if there is any difference in regression you are observing?
>
> Hey Shakeel,
>
> I scheduled some reboots for overnight, one machine in each group:
>
> 1. Control on 5.15.25
> 2. Your previous patch with delayed flush on top of 5.15.25
> 3. Your current patch with targeted flush on top of 5.15.25
>
> Either Daniel or I should have some numbers tomorrow.

Thanks a lot Ivan. I am also working on improving
mem_cgroup_css_rstat_flush() and cgroup_rstat_flush_locked() which
will be orthogonal to the patches you are testing.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Regression in workingset_refault latency on 5.15
  2022-02-28 23:34                                   ` Ivan Babrou
  2022-02-28 23:43                                     ` Shakeel Butt
@ 2022-03-02  0:48                                     ` Ivan Babrou
  2022-03-02  2:50                                       ` Shakeel Butt
  2022-03-02 10:08                                       ` Michal Koutný
  1 sibling, 2 replies; 38+ messages in thread
From: Ivan Babrou @ 2022-03-02  0:48 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Koutný,
	Daniel Dao, kernel-team, Linux MM, Johannes Weiner,
	Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton,
	Andrew Morton, Linus Torvalds

On Mon, Feb 28, 2022 at 3:34 PM Ivan Babrou <ivan@cloudflare.com> wrote:
> I scheduled some reboots for overnight, one machine in each group:
>
> 1. Control on 5.15.25
> 2. Your previous patch with delayed flush on top of 5.15.25
> 3. Your current patch with targeted flush on top of 5.15.25
>
> Either Daniel or I should have some numbers tomorrow.

Numbers for #2 (previous patch):

$ sudo /usr/share/bcc/tools/funclatency -d 60 workingset_refault
Tracing 1 functions for "workingset_refault"... Hit Ctrl-C to end.

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 27313    |****************************************|
       512 -> 1023       : 11887    |*****************                       |
      1024 -> 2047       : 3718     |*****                                   |
      2048 -> 4095       : 7095     |**********                              |
      4096 -> 8191       : 11860    |*****************                       |
      8192 -> 16383      : 4193     |******                                  |
     16384 -> 32767      : 470      |                                        |
     32768 -> 65535      : 320      |                                        |
     65536 -> 131071     : 107      |                                        |
    131072 -> 262143     : 20       |                                        |
    262144 -> 524287     : 2        |                                        |

avg = 2929 nsecs, total: 196247510 nsecs, count: 66985

Numbers for #3 (the latest patch):

$ sudo /usr/share/bcc/tools/funclatency -d 60 workingset_refault
Tracing 1 functions for "workingset_refault"... Hit Ctrl-C to end.

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 6520     |****************************************|
       512 -> 1023       : 3502     |*********************                   |
      1024 -> 2047       : 726      |****                                    |
      2048 -> 4095       : 2206     |*************                           |
      4096 -> 8191       : 1078     |******                                  |
      8192 -> 16383      : 70       |                                        |
     16384 -> 32767      : 39       |                                        |
     32768 -> 65535      : 28       |                                        |
     65536 -> 131071     : 8        |                                        |
    131072 -> 262143     : 1        |                                        |
    262144 -> 524287     : 11       |                                        |
    524288 -> 1048575    : 118      |                                        |
   1048576 -> 2097151    : 3722     |**********************                  |
   2097152 -> 4194303    : 5854     |***********************************     |
   4194304 -> 8388607    : 570      |***                                     |
   8388608 -> 16777215   : 122      |                                        |

avg = 1100789 nsecs, total: 27051906434 nsecs, count: 24575

Looks like you were right that targeted flush is not going to be as good.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Regression in workingset_refault latency on 5.15
  2022-03-02  0:48                                     ` Ivan Babrou
@ 2022-03-02  2:50                                       ` Shakeel Butt
  2022-03-02  3:40                                         ` Ivan Babrou
  2022-03-02 11:49                                         ` Frank Hofmann
  2022-03-02 10:08                                       ` Michal Koutný
  1 sibling, 2 replies; 38+ messages in thread
From: Shakeel Butt @ 2022-03-02  2:50 UTC (permalink / raw)
  To: Ivan Babrou
  Cc: Michal Koutný,
	Daniel Dao, kernel-team, Linux MM, Johannes Weiner,
	Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton,
	Andrew Morton, Linus Torvalds

On Tue, Mar 01, 2022 at 04:48:00PM -0800, Ivan Babrou wrote:

[...]

> Looks like you were right that targeted flush is not going to be as good.


Thanks a lot. Can you please try the following patch (independent of other
patches) as well?


diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d9b8df5ef212..499f75e066f3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -75,29 +75,8 @@ enum mem_cgroup_events_target {
  	MEM_CGROUP_NTARGETS,
  };

-struct memcg_vmstats_percpu {
-	/* Local (CPU and cgroup) page state & events */
-	long			state[MEMCG_NR_STAT];
-	unsigned long		events[NR_VM_EVENT_ITEMS];
-
-	/* Delta calculation for lockless upward propagation */
-	long			state_prev[MEMCG_NR_STAT];
-	unsigned long		events_prev[NR_VM_EVENT_ITEMS];
-
-	/* Cgroup1: threshold notifications & softlimit tree updates */
-	unsigned long		nr_page_events;
-	unsigned long		targets[MEM_CGROUP_NTARGETS];
-};
-
-struct memcg_vmstats {
-	/* Aggregated (CPU and subtree) page state & events */
-	long			state[MEMCG_NR_STAT];
-	unsigned long		events[NR_VM_EVENT_ITEMS];
-
-	/* Pending child counts during tree propagation */
-	long			state_pending[MEMCG_NR_STAT];
-	unsigned long		events_pending[NR_VM_EVENT_ITEMS];
-};
+struct memcg_vmstats_percpu;
+struct memcg_vmstats;

  struct mem_cgroup_reclaim_iter {
  	struct mem_cgroup *position;
@@ -304,7 +283,7 @@ struct mem_cgroup {
  	MEMCG_PADDING(_pad1_);

  	/* memory.stat */
-	struct memcg_vmstats	vmstats;
+	struct memcg_vmstats	*vmstats;

  	/* memory.events */
  	atomic_long_t		memory_events[MEMCG_NR_MEMORY_EVENTS];
@@ -964,11 +943,7 @@ static inline void mod_memcg_state(struct mem_cgroup *memcg,
  	local_irq_restore(flags);
  }

-static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
-{
-	return READ_ONCE(memcg->vmstats.state[idx]);
-}
-
+unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx);
  static inline unsigned long lruvec_page_state(struct lruvec *lruvec,
  					      enum node_stat_item idx)
  {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 32ba963ebf2e..c65f155c2048 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -688,6 +688,71 @@ static void flush_memcg_stats_dwork(struct work_struct *w)
  	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, 2UL*HZ);
  }

+static const unsigned int memcg_vm_events[] = {
+	PGPGIN,
+	PGPGOUT,
+	PGFAULT,
+	PGMAJFAULT,
+	PGREFILL,
+	PGSCAN_KSWAPD,
+	PGSCAN_DIRECT,
+	PGSTEAL_KSWAPD,
+	PGSTEAL_DIRECT,
+	PGACTIVATE,
+	PGDEACTIVATE,
+	PGLAZYFREE,
+	PGLAZYFREED,
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	THP_FAULT_ALLOC,
+	THP_COLLAPSE_ALLOC,
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+};
+#define NR_MEMCG_EVENTS ARRAY_SIZE(memcg_vm_events)
+
+static int memcg_events_index[NR_VM_EVENT_ITEMS] __read_mostly;
+
+static void init_memcg_events(void)
+{
+	int i;
+
+	for (i = 0; i < NR_MEMCG_EVENTS; ++i)
+		memcg_events_index[memcg_vm_events[i]] = i + 1;
+}
+
+static int get_memcg_events_index(enum vm_event_item idx)
+{
+	return memcg_events_index[idx] - 1;
+}
+
+struct memcg_vmstats_percpu {
+	/* Local (CPU and cgroup) page state & events */
+	long			state[MEMCG_NR_STAT];
+	unsigned long		events[NR_MEMCG_EVENTS];
+
+	/* Delta calculation for lockless upward propagation */
+	long			state_prev[MEMCG_NR_STAT];
+	unsigned long		events_prev[NR_MEMCG_EVENTS];
+
+	/* Cgroup1: threshold notifications & softlimit tree updates */
+	unsigned long		nr_page_events;
+	unsigned long		targets[MEM_CGROUP_NTARGETS];
+};
+
+struct memcg_vmstats {
+	/* Aggregated (CPU and subtree) page state & events */
+	long			state[MEMCG_NR_STAT];
+	unsigned long		events[NR_MEMCG_EVENTS];
+
+	/* Pending child counts during tree propagation */
+	long			state_pending[MEMCG_NR_STAT];
+	unsigned long		events_pending[NR_MEMCG_EVENTS];
+};
+
+unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
+{
+	return READ_ONCE(memcg->vmstats->state[idx]);
+}
+
  /**
   * __mod_memcg_state - update cgroup memory statistics
   * @memcg: the memory cgroup
@@ -831,25 +896,33 @@ static inline void mod_objcg_mlstate(struct obj_cgroup *objcg,
  void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
  			  unsigned long count)
  {
-	if (mem_cgroup_disabled())
+	int index = get_memcg_events_index(idx);
+
+	if (mem_cgroup_disabled() || index < 0)
  		return;

-	__this_cpu_add(memcg->vmstats_percpu->events[idx], count);
+	__this_cpu_add(memcg->vmstats_percpu->events[index], count);
  	memcg_rstat_updated(memcg, count);
  }

  static unsigned long memcg_events(struct mem_cgroup *memcg, int event)
  {
-	return READ_ONCE(memcg->vmstats.events[event]);
+	int index = get_memcg_events_index(event);
+
+	if (index < 0)
+		return 0;
+	return READ_ONCE(memcg->vmstats->events[index]);
  }

  static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
  {
  	long x = 0;
  	int cpu;
+	int index = get_memcg_events_index(event);

-	for_each_possible_cpu(cpu)
-		x += per_cpu(memcg->vmstats_percpu->events[event], cpu);
+	if (index >= 0)
+		for_each_possible_cpu(cpu)
+			x += per_cpu(memcg->vmstats_percpu->events[index], cpu);
  	return x;
  }

@@ -5147,6 +5220,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)

  	for_each_node(node)
  		free_mem_cgroup_per_node_info(memcg, node);
+	kfree(memcg->vmstats);
  	free_percpu(memcg->vmstats_percpu);
  	kfree(memcg);
  }
@@ -5180,6 +5254,10 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
  		goto fail;
  	}

+	memcg->vmstats = kzalloc(sizeof(struct memcg_vmstats), GFP_KERNEL);
+	if (!memcg->vmstats)
+		goto fail;
+
  	memcg->vmstats_percpu = alloc_percpu_gfp(struct memcg_vmstats_percpu,
  						 GFP_KERNEL_ACCOUNT);
  	if (!memcg->vmstats_percpu)
@@ -5248,6 +5326,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
  		page_counter_init(&memcg->kmem, &parent->kmem);
  		page_counter_init(&memcg->tcpmem, &parent->tcpmem);
  	} else {
+		init_memcg_events();
  		page_counter_init(&memcg->memory, NULL);
  		page_counter_init(&memcg->swap, NULL);
  		page_counter_init(&memcg->kmem, NULL);
@@ -5400,9 +5479,9 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
  		 * below us. We're in a per-cpu loop here and this is
  		 * a global counter, so the first cycle will get them.
  		 */
-		delta = memcg->vmstats.state_pending[i];
+		delta = memcg->vmstats->state_pending[i];
  		if (delta)
-			memcg->vmstats.state_pending[i] = 0;
+			memcg->vmstats->state_pending[i] = 0;

  		/* Add CPU changes on this level since the last flush */
  		v = READ_ONCE(statc->state[i]);
@@ -5415,15 +5494,15 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
  			continue;

  		/* Aggregate counts on this level and propagate upwards */
-		memcg->vmstats.state[i] += delta;
+		memcg->vmstats->state[i] += delta;
  		if (parent)
-			parent->vmstats.state_pending[i] += delta;
+			parent->vmstats->state_pending[i] += delta;
  	}

-	for (i = 0; i < NR_VM_EVENT_ITEMS; i++) {
-		delta = memcg->vmstats.events_pending[i];
+	for (i = 0; i < NR_MEMCG_EVENTS; i++) {
+		delta = memcg->vmstats->events_pending[i];
  		if (delta)
-			memcg->vmstats.events_pending[i] = 0;
+			memcg->vmstats->events_pending[i] = 0;

  		v = READ_ONCE(statc->events[i]);
  		if (v != statc->events_prev[i]) {
@@ -5434,9 +5513,9 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
  		if (!delta)
  			continue;

-		memcg->vmstats.events[i] += delta;
+		memcg->vmstats->events[i] += delta;
  		if (parent)
-			parent->vmstats.events_pending[i] += delta;
+			parent->vmstats->events_pending[i] += delta;
  	}

  	for_each_node_state(nid, N_MEMORY) {
-- 
2.35.1.574.g5d30c73bfb-goog



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: Regression in workingset_refault latency on 5.15
  2022-03-02  2:50                                       ` Shakeel Butt
@ 2022-03-02  3:40                                         ` Ivan Babrou
  2022-03-02 22:33                                           ` Ivan Babrou
  2022-03-02 11:49                                         ` Frank Hofmann
  1 sibling, 1 reply; 38+ messages in thread
From: Ivan Babrou @ 2022-03-02  3:40 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Koutný,
	Daniel Dao, kernel-team, Linux MM, Johannes Weiner,
	Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton,
	Andrew Morton, Linus Torvalds

On Tue, Mar 1, 2022 at 6:50 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Tue, Mar 01, 2022 at 04:48:00PM -0800, Ivan Babrou wrote:
>
> [...]
>
> > Looks like you were right that targeted flush is not going to be as good.
>
>
> Thanks a lot. Can you please try the following patch (independent of other
> patches) as well?

Sure thing, I'll come back with the numbers tomorrow.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Regression in workingset_refault latency on 5.15
  2022-03-02  0:48                                     ` Ivan Babrou
  2022-03-02  2:50                                       ` Shakeel Butt
@ 2022-03-02 10:08                                       ` Michal Koutný
  2022-03-02 15:53                                         ` Shakeel Butt
  1 sibling, 1 reply; 38+ messages in thread
From: Michal Koutný @ 2022-03-02 10:08 UTC (permalink / raw)
  To: Ivan Babrou
  Cc: Shakeel Butt, Daniel Dao, kernel-team, Linux MM, Johannes Weiner,
	Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton,
	Andrew Morton, Linus Torvalds

On Tue, Mar 01, 2022 at 04:48:00PM -0800, Ivan Babrou <ivan@cloudflare.com> wrote:
> avg = 1100789 nsecs, total: 27051906434 nsecs, count: 24575

Thank you for checking.

> Looks like you were right that targeted flush is not going to be as good.

Still lots of work to do at once :-/

BTW what was the "last good" kernel that you used as the baseline for
comparisons? (I can't find it in the thread and I only assume we're
comparing against pre-rstat switch, i.e. last good < v5.13?)

Regards,
Michal


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Regression in workingset_refault latency on 5.15
  2022-03-02  2:50                                       ` Shakeel Butt
  2022-03-02  3:40                                         ` Ivan Babrou
@ 2022-03-02 11:49                                         ` Frank Hofmann
  2022-03-02 15:52                                           ` Shakeel Butt
  1 sibling, 1 reply; 38+ messages in thread
From: Frank Hofmann @ 2022-03-02 11:49 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Ivan Babrou, Michal Koutný,
	Daniel Dao, kernel-team, Linux MM, Johannes Weiner,
	Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton,
	Andrew Morton, Linus Torvalds

On Wed, Mar 2, 2022 at 2:50 AM 'Shakeel Butt' via
kernel-team+notifications <kernel-team@cloudflare.com> wrote:
>
> On Tue, Mar 01, 2022 at 04:48:00PM -0800, Ivan Babrou wrote:
>
> [...]
>
> > Looks like you were right that targeted flush is not going to be as good.
>
>
> Thanks a lot. Can you please try the following patch (independent of other
> patches) as well?

Hi Shakeel,

maybe it's just me, but the code change below confuses me. I'll comment inline.

>
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index d9b8df5ef212..499f75e066f3 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -75,29 +75,8 @@ enum mem_cgroup_events_target {
>         MEM_CGROUP_NTARGETS,
>   };
>
> -struct memcg_vmstats_percpu {
> -       /* Local (CPU and cgroup) page state & events */
> -       long                    state[MEMCG_NR_STAT];
> -       unsigned long           events[NR_VM_EVENT_ITEMS];
> -
> -       /* Delta calculation for lockless upward propagation */
> -       long                    state_prev[MEMCG_NR_STAT];
> -       unsigned long           events_prev[NR_VM_EVENT_ITEMS];
> -
> -       /* Cgroup1: threshold notifications & softlimit tree updates */
> -       unsigned long           nr_page_events;
> -       unsigned long           targets[MEM_CGROUP_NTARGETS];
> -};
> -
> -struct memcg_vmstats {
> -       /* Aggregated (CPU and subtree) page state & events */
> -       long                    state[MEMCG_NR_STAT];
> -       unsigned long           events[NR_VM_EVENT_ITEMS];
> -
> -       /* Pending child counts during tree propagation */
> -       long                    state_pending[MEMCG_NR_STAT];
> -       unsigned long           events_pending[NR_VM_EVENT_ITEMS];
> -};
> +struct memcg_vmstats_percpu;
> +struct memcg_vmstats;
>
>   struct mem_cgroup_reclaim_iter {
>         struct mem_cgroup *position;
> @@ -304,7 +283,7 @@ struct mem_cgroup {
>         MEMCG_PADDING(_pad1_);
>
>         /* memory.stat */
> -       struct memcg_vmstats    vmstats;
> +       struct memcg_vmstats    *vmstats;
>
>         /* memory.events */
>         atomic_long_t           memory_events[MEMCG_NR_MEMORY_EVENTS];
> @@ -964,11 +943,7 @@ static inline void mod_memcg_state(struct mem_cgroup *memcg,
>         local_irq_restore(flags);
>   }
>
> -static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
> -{
> -       return READ_ONCE(memcg->vmstats.state[idx]);
> -}
> -
> +unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx);
>   static inline unsigned long lruvec_page_state(struct lruvec *lruvec,
>                                               enum node_stat_item idx)
>   {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 32ba963ebf2e..c65f155c2048 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -688,6 +688,71 @@ static void flush_memcg_stats_dwork(struct work_struct *w)
>         queue_delayed_work(system_unbound_wq, &stats_flush_dwork, 2UL*HZ);
>   }
>
> +static const unsigned int memcg_vm_events[] = {
> +       PGPGIN,
> +       PGPGOUT,
> +       PGFAULT,
> +       PGMAJFAULT,
> +       PGREFILL,
> +       PGSCAN_KSWAPD,
> +       PGSCAN_DIRECT,
> +       PGSTEAL_KSWAPD,
> +       PGSTEAL_DIRECT,
> +       PGACTIVATE,
> +       PGDEACTIVATE,
> +       PGLAZYFREE,
> +       PGLAZYFREED,
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +       THP_FAULT_ALLOC,
> +       THP_COLLAPSE_ALLOC,
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +};
> +#define NR_MEMCG_EVENTS ARRAY_SIZE(memcg_vm_events)
> +
> +static int memcg_events_index[NR_VM_EVENT_ITEMS] __read_mostly;
> +
> +static void init_memcg_events(void)
> +{
> +       int i;
> +
> +       for (i = 0; i < NR_MEMCG_EVENTS; ++i)
> +               memcg_events_index[memcg_vm_events[i]] = i + 1;
> +}
> +
> +static int get_memcg_events_index(enum vm_event_item idx)
> +{
> +       return memcg_events_index[idx] - 1;
> +}

It's this. The table, memcg_vm_events[], is basically a "truth set": if an
event is in the set, memcg_events_index[] gets an entry for it. But the code
effectively re-bases the C array to start at one, and then needs the accessor
function to translate that "unsigned zero" back into a "signed -1" ... all
for what amounts to a truth test.

> +
> +struct memcg_vmstats_percpu {
> +       /* Local (CPU and cgroup) page state & events */
> +       long                    state[MEMCG_NR_STAT];
> +       unsigned long           events[NR_MEMCG_EVENTS];
> +
> +       /* Delta calculation for lockless upward propagation */
> +       long                    state_prev[MEMCG_NR_STAT];
> +       unsigned long           events_prev[NR_MEMCG_EVENTS];
> +
> +       /* Cgroup1: threshold notifications & softlimit tree updates */
> +       unsigned long           nr_page_events;
> +       unsigned long           targets[MEM_CGROUP_NTARGETS];
> +};
> +
> +struct memcg_vmstats {
> +       /* Aggregated (CPU and subtree) page state & events */
> +       long                    state[MEMCG_NR_STAT];
> +       unsigned long           events[NR_MEMCG_EVENTS];
> +
> +       /* Pending child counts during tree propagation */
> +       long                    state_pending[MEMCG_NR_STAT];
> +       unsigned long           events_pending[NR_MEMCG_EVENTS];
> +};
> +
> +unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
> +{
> +       return READ_ONCE(memcg->vmstats->state[idx]);
> +}
> +
>   /**
>    * __mod_memcg_state - update cgroup memory statistics
>    * @memcg: the memory cgroup
> @@ -831,25 +896,33 @@ static inline void mod_objcg_mlstate(struct obj_cgroup *objcg,
>   void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
>                           unsigned long count)
>   {
> -       if (mem_cgroup_disabled())
> +       int index = get_memcg_events_index(idx);
> +
> +       if (mem_cgroup_disabled() || index < 0)

And these are the places where it's used; basically, each one says "if the
event is not in the monitored set, return". Likewise in all the hunks below.

Why not make this explicit? Something like:

static int memcg_event_exists_percg(int index) {
    switch (index) {
    case PGPGIN:
    case PGPGOUT:
    case PGFAULT:
    case PGMAJFAULT:
    case PGREFILL:
    case PGSCAN_KSWAPD:
    case PGSCAN_DIRECT:
    case PGSTEAL_KSWAPD:
    case PGSTEAL_DIRECT:
    case PGACTIVATE:
    case PGDEACTIVATE:
    case PGLAZYFREE:
    case PGLAZYFREED:
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
    case THP_FAULT_ALLOC:
    case THP_COLLAPSE_ALLOC:
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
        return 1;
    default:
        return 0;
    }
}

and let the compiler choose how to lay this out where-used ?

>                 return;
>
> -       __this_cpu_add(memcg->vmstats_percpu->events[idx], count);
> +       __this_cpu_add(memcg->vmstats_percpu->events[index], count);
[ ... ]

I understand there is a memory saving from the different table
types/sizes, since the vm events tracked per memcg are a (small) subset of
all vm events. How relevant is this to the change, though?


Best regards,
Frank Hofmann


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Regression in workingset_refault latency on 5.15
  2022-03-02 11:49                                         ` Frank Hofmann
@ 2022-03-02 15:52                                           ` Shakeel Butt
  0 siblings, 0 replies; 38+ messages in thread
From: Shakeel Butt @ 2022-03-02 15:52 UTC (permalink / raw)
  To: Frank Hofmann
  Cc: Ivan Babrou, Michal Koutný,
	Daniel Dao, kernel-team, Linux MM, Johannes Weiner,
	Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton,
	Andrew Morton, Linus Torvalds

Hi Frank

On Wed, Mar 2, 2022 at 3:50 AM Frank Hofmann <fhofmann@cloudflare.com> wrote:
[...]
>
> Hi Shakeel,
>
> maybe it's just me, but the codechange below confuses me. I'll comment inline.
>
[...]
>
> I understand there is a memory saving from the different table
> types/sizes, while the vm events per memcg is a (small) subset of vm
> events. How relevant is this to the change, though?
>

Actually this is the whole reason behind this change, i.e. reducing the
size of the events[_prev] arrays from ~100 entries to 15. I suspect
updating the events arrays might account for half the time the rstat
flushing is spending.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Regression in workingset_refault latency on 5.15
  2022-03-02 10:08                                       ` Michal Koutný
@ 2022-03-02 15:53                                         ` Shakeel Butt
  2022-03-02 17:28                                           ` Ivan Babrou
  0 siblings, 1 reply; 38+ messages in thread
From: Shakeel Butt @ 2022-03-02 15:53 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Ivan Babrou, Daniel Dao, kernel-team, Linux MM, Johannes Weiner,
	Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton,
	Andrew Morton, Linus Torvalds

On Wed, Mar 2, 2022 at 2:08 AM Michal Koutný <mkoutny@suse.com> wrote:
>
[...]
>
> BTW what was the "last good" kernel that you use as baseline for
> comparisons? (I can't find it in the thread and I only assume we're
> comparing against pre-rstat switch, i.e. last good < v5.13?)
>

I think 5.10 is the last good kernel for them (and yes pre-rstat switch).


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Regression in workingset_refault latency on 5.15
  2022-03-02 15:53                                         ` Shakeel Butt
@ 2022-03-02 17:28                                           ` Ivan Babrou
  0 siblings, 0 replies; 38+ messages in thread
From: Ivan Babrou @ 2022-03-02 17:28 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Koutný,
	Daniel Dao, kernel-team, Linux MM, Johannes Weiner,
	Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton,
	Andrew Morton, Linus Torvalds

On Wed, Mar 2, 2022 at 7:53 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Wed, Mar 2, 2022 at 2:08 AM Michal Koutný <mkoutny@suse.com> wrote:
> >
> [...]
> >
> > BTW what was the "last good" kernel that you use as baseline for
> > comparisons? (I can't find it in the thread and I only assume we're
> > comparing against pre-rstat switch, i.e. last good < v5.13?)
> >
>
> I think 5.10 is the last good kernel for them (and yes pre-rstat switch).

Correct, we stick to LTS kernels.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Regression in workingset_refault latency on 5.15
  2022-03-02  3:40                                         ` Ivan Babrou
@ 2022-03-02 22:33                                           ` Ivan Babrou
  2022-03-03  2:32                                             ` Shakeel Butt
  2022-03-03  2:35                                             ` Shakeel Butt
  0 siblings, 2 replies; 38+ messages in thread
From: Ivan Babrou @ 2022-03-02 22:33 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Koutný,
	Daniel Dao, kernel-team, Linux MM, Johannes Weiner,
	Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton,
	Andrew Morton, Linus Torvalds

On Tue, Mar 1, 2022 at 7:40 PM Ivan Babrou <ivan@cloudflare.com> wrote:
>
> On Tue, Mar 1, 2022 at 6:50 PM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > On Tue, Mar 01, 2022 at 04:48:00PM -0800, Ivan Babrou wrote:
> >
> > [...]
> >
> > > Looks like you were right that targeted flush is not going to be as good.
> >
> >
> > Thanks a lot. Can you please try the following patch (independent of other
> > patches) as well?
>
> Sure thing, I'll come back with the numbers tomorrow.

Measurements from different configurations:

* Baseline 5.15.25:

$ sudo /usr/share/bcc/tools/funclatency -d 60 workingset_refault
Tracing 1 functions for "workingset_refault"... Hit Ctrl-C to end.

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 15931    |****************************************|
       512 -> 1023       : 7040     |*****************                       |
      1024 -> 2047       : 1566     |***                                     |
      2048 -> 4095       : 3844     |*********                               |
      4096 -> 8191       : 1689     |****                                    |
      8192 -> 16383      : 115      |                                        |
     16384 -> 32767      : 54       |                                        |
     32768 -> 65535      : 42       |                                        |
     65536 -> 131071     : 14       |                                        |
    131072 -> 262143     : 4        |                                        |
    262144 -> 524287     : 4        |                                        |
    524288 -> 1048575    : 92       |                                        |
   1048576 -> 2097151    : 2322     |*****                                   |
   2097152 -> 4194303    : 8244     |********************                    |
   4194304 -> 8388607    : 775      |*                                       |
   8388608 -> 16777215   : 73       |                                        |

avg = 776487 nsecs, total: 32464181849 nsecs, count: 41809

* Delayed flush:

$ sudo /usr/share/bcc/tools/funclatency -d 60 workingset_refault
Tracing 1 functions for "workingset_refault"... Hit Ctrl-C to end.

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 16271    |****************************************|
       512 -> 1023       : 8147     |********************                    |
      1024 -> 2047       : 2743     |******                                  |
      2048 -> 4095       : 3964     |*********                               |
      4096 -> 8191       : 7137     |*****************                       |
      8192 -> 16383      : 4039     |*********                               |
     16384 -> 32767      : 352      |                                        |
     32768 -> 65535      : 214      |                                        |
     65536 -> 131071     : 80       |                                        |
    131072 -> 262143     : 15       |                                        |
    262144 -> 524287     : 0        |                                        |
    524288 -> 1048575    : 1        |                                        |

avg = 3248 nsecs, total: 139586555 nsecs, count: 42963

* Targeted flush:

$ sudo /usr/share/bcc/tools/funclatency -d 60 workingset_refault
Tracing 1 functions for "workingset_refault"... Hit Ctrl-C to end.

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 25298    |****************************************|
       512 -> 1023       : 13334    |*********************                   |
      1024 -> 2047       : 1519     |**                                      |
      2048 -> 4095       : 4265     |******                                  |
      4096 -> 8191       : 2339     |***                                     |
      8192 -> 16383      : 162      |                                        |
     16384 -> 32767      : 73       |                                        |
     32768 -> 65535      : 70       |                                        |
     65536 -> 131071     : 23       |                                        |
    131072 -> 262143     : 1        |                                        |
    262144 -> 524287     : 3        |                                        |
    524288 -> 1048575    : 55       |                                        |
   1048576 -> 2097151    : 1409     |**                                      |
   2097152 -> 4194303    : 8306     |*************                           |
   4194304 -> 8388607    : 1139     |*                                       |
   8388608 -> 16777215   : 89       |                                        |

avg = 589361 nsecs, total: 34233041412 nsecs, count: 5808

* Your most recent patch:

$ sudo /usr/share/bcc/tools/funclatency -d 60 workingset_refault
Tracing 1 functions for "workingset_refault"... Hit Ctrl-C to end.

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 37051    |****************************************|
       512 -> 1023       : 17144    |******************                      |
      1024 -> 2047       : 1051     |*                                       |
      2048 -> 4095       : 2705     |**                                      |
      4096 -> 8191       : 1911     |**                                      |
      8192 -> 16383      : 184      |                                        |
     16384 -> 32767      : 93       |                                        |
     32768 -> 65535      : 70       |                                        |
     65536 -> 131071     : 30       |                                        |
    131072 -> 262143     : 21       |                                        |
    262144 -> 524287     : 216      |                                        |
    524288 -> 1048575    : 1100     |*                                       |
   1048576 -> 2097151    : 6522     |*******                                 |
   2097152 -> 4194303    : 5903     |******                                  |
   4194304 -> 8388607    : 467      |                                        |
   8388608 -> 16777215   : 19       |                                        |

avg = 403261 nsecs, total: 30037729036 nsecs, count: 74487

Looks like there are plenty of longer pauses still.

I can also get the numbers from 5.10 for you if you want.
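(Aside, for anyone reading the histograms above: funclatency buckets each sample by its power-of-two range, so a 300 ns call lands in the "256 -> 511" row. A minimal userspace sketch of that bucketing — the helper names are mine for illustration, not funclatency's:)

```c
#include <stdint.h>

/*
 * Power-of-two bucketing as used by BCC's funclatency histograms:
 * a sample of n nsecs falls in the row [2^k, 2^(k+1)-1] where
 * k = floor(log2(n)).  Helper names are illustrative only.
 */
static unsigned log2_slot(uint64_t nsecs)
{
	unsigned slot = 0;

	while (nsecs >>= 1)
		slot++;
	return slot;
}

/* Lower bound of the histogram row a sample falls into. */
static uint64_t bucket_low(uint64_t nsecs)
{
	return nsecs ? (uint64_t)1 << log2_slot(nsecs) : 0;
}
```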


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Regression in workingset_refault latency on 5.15
  2022-03-02 22:33                                           ` Ivan Babrou
@ 2022-03-03  2:32                                             ` Shakeel Butt
  2022-03-03  2:35                                             ` Shakeel Butt
  1 sibling, 0 replies; 38+ messages in thread
From: Shakeel Butt @ 2022-03-03  2:32 UTC (permalink / raw)
  To: Ivan Babrou
  Cc: Michal Koutný,
	Daniel Dao, kernel-team, Linux MM, Johannes Weiner,
	Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton,
	Andrew Morton, Linus Torvalds

On Wed, Mar 2, 2022 at 2:33 PM Ivan Babrou <ivan@cloudflare.com> wrote:
>
> On Tue, Mar 1, 2022 at 7:40 PM Ivan Babrou <ivan@cloudflare.com> wrote:
> >
> > On Tue, Mar 1, 2022 at 6:50 PM Shakeel Butt <shakeelb@google.com> wrote:
> > >
> > > On Tue, Mar 01, 2022 at 04:48:00PM -0800, Ivan Babrou wrote:
> > >
> > > [...]
> > >
> > > > Looks like you were right that targeted flush is not going to be as good.
> > >
> > >
> > > Thanks a lot. Can you please try the following patch (independent of other
> > > patches) as well?
> >
> > Sure thing, I'll come back with the numbers tomorrow.
>
> Measurements from different configurations:
[...]
> avg = 403261 nsecs, total: 30037729036 nsecs, count: 74487
>
> Looks like there are plenty of longer pauses still.

Thanks for running this.

>
> I can also get the numbers from 5.10 for you if you want.

Yes please, that would be good to compare with. Also, I am going to
send a new patch in a couple of minutes for you to try; you don't need
to run the "Targeted flush" one or the last one, which reduces the
memcg_vmstats size.



* Re: Regression in workingset_refault latency on 5.15
  2022-03-02 22:33                                           ` Ivan Babrou
  2022-03-03  2:32                                             ` Shakeel Butt
@ 2022-03-03  2:35                                             ` Shakeel Butt
  2022-03-04  0:21                                               ` Ivan Babrou
  1 sibling, 1 reply; 38+ messages in thread
From: Shakeel Butt @ 2022-03-03  2:35 UTC (permalink / raw)
  To: Ivan Babrou
  Cc: Michal Koutný,
	Daniel Dao, kernel-team, Linux MM, Johannes Weiner,
	Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton,
	Andrew Morton, Linus Torvalds

On Wed, Mar 02, 2022 at 02:33:41PM -0800, Ivan Babrou wrote:
> On Tue, Mar 1, 2022 at 7:40 PM Ivan Babrou <ivan@cloudflare.com> wrote:
[...]

Please try the following patch:


diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d9b8df5ef212..274e4ecff534 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1002,6 +1002,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
  }

  void mem_cgroup_flush_stats(void);
+void mem_cgroup_flush_stats_if_late(void);

  void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
  			      int val);
@@ -1422,6 +1423,10 @@ static inline void mem_cgroup_flush_stats(void)
  {
  }

+static inline void mem_cgroup_flush_stats_if_late(void)
+{
+}
+
  static inline void __mod_memcg_lruvec_state(struct lruvec *lruvec,
  					    enum node_stat_item idx, int val)
  {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 32ba963ebf2e..00c924e25ce8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -650,6 +650,9 @@ static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
  static DEFINE_SPINLOCK(stats_flush_lock);
  static DEFINE_PER_CPU(unsigned int, stats_updates);
  static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
+static u64 flush_next_time;
+
+#define FLUSH_TIME (2UL*HZ)

  static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
  {
@@ -671,6 +674,7 @@ static void __mem_cgroup_flush_stats(void)
  	if (!spin_trylock_irqsave(&stats_flush_lock, flag))
  		return;

+	flush_next_time = jiffies_64 + 2*FLUSH_TIME;
  	cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup);
  	atomic_set(&stats_flush_threshold, 0);
  	spin_unlock_irqrestore(&stats_flush_lock, flag);
@@ -682,10 +686,16 @@ void mem_cgroup_flush_stats(void)
  		__mem_cgroup_flush_stats();
  }

+void mem_cgroup_flush_stats_if_late(void)
+{
+	if (time_after64(jiffies_64, flush_next_time))
+		mem_cgroup_flush_stats();
+}
+
  static void flush_memcg_stats_dwork(struct work_struct *w)
  {
  	__mem_cgroup_flush_stats();
-	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, 2UL*HZ);
+	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
  }

  /**
@@ -4518,7 +4528,7 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
  	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
  	struct mem_cgroup *parent;

-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats_if_late();

  	*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
  	*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 74296c2d1fed..fd279621c3fc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3041,7 +3041,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
  	 * Flush the memory cgroup stats, so that we read accurate per-memcg
  	 * lruvec stats for heuristics.
  	 */
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats_if_late();

  	memset(&sc->nr, 0, sizeof(sc->nr));

diff --git a/mm/workingset.c b/mm/workingset.c
index d5b81e4f4cbe..7edc52037e38 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -352,7 +352,7 @@ void workingset_refault(struct page *page, void *shadow)

  	inc_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file);

-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats_if_late();
  	/*
  	 * Compare the distance to the existing workingset size. We
  	 * don't activate pages that couldn't stay resident even if

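(For readers skimming the diff: the core of the change is a deadline check — synchronous callers only flush when the periodic 2-second worker looks overdue, so the common refault path skips the expensive rstat flush entirely. A rough userspace sketch of that pattern, with a mocked jiffies clock and illustrative names rather than the kernel's:)

```c
#include <stdint.h>

/*
 * Userspace sketch of the ratelimiting in the patch above.  The
 * periodic flusher pushes the deadline out two flush periods; hot
 * paths only flush if that deadline has already passed, i.e. the
 * worker has fallen behind.  "jiffies" is mocked here.
 */
#define HZ 100
#define FLUSH_TIME (2UL * HZ)

static uint64_t jiffies;          /* mock clock, advanced by the caller */
static uint64_t flush_next_time;
static unsigned flush_count;      /* how many real flushes happened */

static void do_flush(void)
{
	/* Expensive stats aggregation elided; just move the deadline. */
	flush_next_time = jiffies + 2 * FLUSH_TIME;
	flush_count++;
}

/* The hot path: skip the flush unless the deadline has passed. */
static void flush_if_late(void)
{
	if (jiffies > flush_next_time)
		do_flush();
}
```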

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: Regression in workingset_refault latency on 5.15
  2022-03-03  2:35                                             ` Shakeel Butt
@ 2022-03-04  0:21                                               ` Ivan Babrou
  2022-03-04  1:05                                                 ` Shakeel Butt
  0 siblings, 1 reply; 38+ messages in thread
From: Ivan Babrou @ 2022-03-04  0:21 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Koutný,
	Daniel Dao, kernel-team, Linux MM, Johannes Weiner,
	Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton,
	Andrew Morton, Linus Torvalds

On Wed, Mar 2, 2022 at 6:35 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Wed, Mar 02, 2022 at 02:33:41PM -0800, Ivan Babrou wrote:
> > On Tue, Mar 1, 2022 at 7:40 PM Ivan Babrou <ivan@cloudflare.com> wrote:
> [...]
>
> Please try the following patch:

* Linux 5.10.101:

$ sudo /usr/share/bcc/tools/funclatency -d 60 workingset_refault
Tracing 1 functions for "workingset_refault"... Hit Ctrl-C to end.

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 5496     |****************************************|
       512 -> 1023       : 3161     |***********************                 |
      1024 -> 2047       : 1059     |*******                                 |
      2048 -> 4095       : 3149     |**********************                  |
      4096 -> 8191       : 2074     |***************                         |
      8192 -> 16383      : 90       |                                        |
     16384 -> 32767      : 43       |                                        |
     32768 -> 65535      : 44       |                                        |
     65536 -> 131071     : 12       |                                        |
    131072 -> 262143     : 3        |                                        |
    262144 -> 524287     : 1        |                                        |

avg = 2144 nsecs, total: 32453571 nsecs, count: 15132

* Linux 5.15.25:

$ sudo /usr/share/bcc/tools/funclatency -d 60 workingset_refault
Tracing 1 functions for "workingset_refault"... Hit Ctrl-C to end.

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 132411   |****************************************|
       512 -> 1023       : 45547    |*************                           |
      1024 -> 2047       : 5869     |*                                       |
      2048 -> 4095       : 10553    |***                                     |
      4096 -> 8191       : 5297     |*                                       |
      8192 -> 16383      : 455      |                                        |
     16384 -> 32767      : 214      |                                        |
     32768 -> 65535      : 175      |                                        |
     65536 -> 131071     : 50       |                                        |
    131072 -> 262143     : 6        |                                        |
    262144 -> 524287     : 2        |                                        |
    524288 -> 1048575    : 80       |                                        |
   1048576 -> 2097151    : 3228     |                                        |
   2097152 -> 4194303    : 13514    |****                                    |
   4194304 -> 8388607    : 792      |                                        |
   8388608 -> 16777215   : 1        |                                        |

avg = 221955 nsecs, total: 48429632352 nsecs, count: 218195

* Linux 5.15.25 + delayed flush:

$ sudo /usr/share/bcc/tools/funclatency -d 60 workingset_refault
Tracing 1 functions for "workingset_refault"... Hit Ctrl-C to end.

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 103067   |****************************************|
       512 -> 1023       : 39054    |***************                         |
      1024 -> 2047       : 8953     |***                                     |
      2048 -> 4095       : 12986    |*****                                   |
      4096 -> 8191       : 10173    |***                                     |
      8192 -> 16383      : 8090     |***                                     |
     16384 -> 32767      : 890      |                                        |
     32768 -> 65535      : 540      |                                        |
     65536 -> 131071     : 198      |                                        |
    131072 -> 262143     : 50       |                                        |
    262144 -> 524287     : 7        |                                        |

avg = 1837 nsecs, total: 338115740 nsecs, count: 184008

* Linux 5.15.25 + your latest patch with "if_late":

$ sudo /usr/share/bcc/tools/funclatency -d 60 workingset_refault
Tracing 1 functions for "workingset_refault"... Hit Ctrl-C to end.

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 9241     |****************************************|
       512 -> 1023       : 7041     |******************************          |
      1024 -> 2047       : 1457     |******                                  |
      2048 -> 4095       : 3401     |**************                          |
      4096 -> 8191       : 2207     |*********                               |
      8192 -> 16383      : 130      |                                        |
     16384 -> 32767      : 53       |                                        |
     32768 -> 65535      : 40       |                                        |
     65536 -> 131071     : 16       |                                        |
    131072 -> 262143     : 1        |                                        |
    262144 -> 524287     : 1        |                                        |

avg = 1663 nsecs, total: 39249164 nsecs, count: 23588

The numbers look pretty good.
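(As a sanity check on the summary lines: funclatency's "avg" is simply total nsecs divided by call count, in integer nanoseconds. Reproducing it for the runs above — numbers copied from the histograms; the helper is illustrative, not part of funclatency:)

```c
#include <stdint.h>

/*
 * funclatency prints avg = total / count (integer division).
 * Reproduced here to cross-check the reported averages.
 */
static uint64_t avg_nsecs(uint64_t total, uint64_t count)
{
	return total / count;
}
```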



* Re: Regression in workingset_refault latency on 5.15
  2022-03-04  0:21                                               ` Ivan Babrou
@ 2022-03-04  1:05                                                 ` Shakeel Butt
  2022-03-04  1:12                                                   ` Ivan Babrou
  0 siblings, 1 reply; 38+ messages in thread
From: Shakeel Butt @ 2022-03-04  1:05 UTC (permalink / raw)
  To: Ivan Babrou
  Cc: Michal Koutný,
	Daniel Dao, kernel-team, Linux MM, Johannes Weiner,
	Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton,
	Andrew Morton, Linus Torvalds

On Thu, Mar 3, 2022 at 4:21 PM Ivan Babrou <ivan@cloudflare.com> wrote:
>
> On Wed, Mar 2, 2022 at 6:35 PM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > On Wed, Mar 02, 2022 at 02:33:41PM -0800, Ivan Babrou wrote:
> > > On Tue, Mar 1, 2022 at 7:40 PM Ivan Babrou <ivan@cloudflare.com> wrote:
> > [...]
> >
> > Please try the following patch:
>
> [...]
>
> * Linux 5.15.25 + your latest patch with "if_late":
>
> [...]
> avg = 1663 nsecs, total: 39249164 nsecs, count: 23588
>
> The numbers look pretty good.

Thanks a lot Ivan. I will send the signed-off patch for 'if_late' and
I would like to have your Tested-by on that.



* Re: Regression in workingset_refault latency on 5.15
  2022-03-04  1:05                                                 ` Shakeel Butt
@ 2022-03-04  1:12                                                   ` Ivan Babrou
  0 siblings, 0 replies; 38+ messages in thread
From: Ivan Babrou @ 2022-03-04  1:12 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Koutný,
	Daniel Dao, kernel-team, Linux MM, Johannes Weiner,
	Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton,
	Andrew Morton, Linus Torvalds

On Thu, Mar 3, 2022 at 5:05 PM Shakeel Butt <shakeelb@google.com> wrote:
> Thanks a lot Ivan. I will send the signed-off patch for 'if_late' and
> I would like to have your Tested-by on that.

Thanks, I appreciate your help here.



* Re: Regression in workingset_refault latency on 5.15 #forregzbot
  2022-02-24  9:22 ` Thorsten Leemhuis
@ 2022-04-11 10:17   ` Thorsten Leemhuis
  2022-05-16 12:51     ` Thorsten Leemhuis
  0 siblings, 1 reply; 38+ messages in thread
From: Thorsten Leemhuis @ 2022-04-11 10:17 UTC (permalink / raw)
  To: regressions

TWIMC: this mail is primarily sent for documentation purposes and for
regzbot, my Linux kernel regression tracking bot. These mails usually
contain '#forregzbot' in the subject, to make them easy to spot and filter.

#regzbot backburner: patch in mm, but it needs to be tested properly and
the issue is not urgent

On 24.02.22 10:22, Thorsten Leemhuis wrote:
> [TLDR: I'm adding the regression report below to regzbot, the Linux
> kernel regression tracking bot; all text you find below is compiled from
> a few template paragraphs you might have encountered already in similar
> mails.]
> 
> Hi, this is your Linux kernel regression tracker speaking.
> 
> Thanks for the report. CCing the regression mailing list, as it should
> be in the loop for all regressions, as explained here:
> https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html
> 
> To be sure this issue doesn't fall through the cracks unnoticed, I'm
> adding it to regzbot, my Linux kernel regression tracking bot:
> 
> #regzbot ^introduced v3.0..v5.15.19
> #regzbot ignore-activity
> 
> Reminder for developers: when fixing the issue, please add a 'Link:'
> tag pointing to the report (the mail quoted below) using
> lore.kernel.org/r/, as explained in
> 'Documentation/process/submitting-patches.rst' and
> 'Documentation/process/5.Posting.rst'. This allows the bot to connect
> the report with any patches posted or committed to fix the issue; this
> again allows the bot to show the current status of regressions and
> automatically resolve the issue when the fix hits the right tree.
> 
> I'm sending this to everyone that got the initial report, to make them
> aware of the tracking. I also hope that messages like this motivate
> people to directly get at least the regression mailing list and ideally
> even regzbot involved when dealing with regressions, as messages like
> this wouldn't be needed then.
> 
> Don't worry, I'll send further messages wrt this regression just to
> the lists (with a tag in the subject so people can filter them away), if
> they are relevant just for regzbot. With a bit of luck no such messages
> will be needed anyway.
> 
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> 
> P.S.: As the Linux kernel's regression tracker I'm getting a lot of
> reports on my table. I can only look briefly into most of them and lack
> knowledge about most of the areas they concern. I thus unfortunately
> will sometimes get things wrong or miss something important. I hope
> that's not the case here; if you think it is, don't hesitate to tell me
> in a public reply, it's in everyone's interest to set the public record
> straight.
> 
> On 23.02.22 14:51, Daniel Dao wrote:
>> Hi all,
>>
>> We are observing some regressions in workingset_refault on our newly upgraded
>> 5.15.19 nodes with zram as swap. This manifests in several ways:
>>
>> 1) Regression of workingset_refault duration observed in flamegraph
>>
>> We regularly collect flamegraphs for running services on the node. Since upgrade
>> to 5.15.19, we see that workingset_refault occupied a more significant part of
>> the service flamegraph (13%) with the following call trace
>>
>>   workingset_refault+0x128
>>   add_to_page_cache_lru+0x9f
>>   page_cache_ra_unbounded+0x154
>>   force_page_cache_ra+0xe2
>>   filemap_get_pages+0xe9
>>   filemap_read+0xa4
>>   xfs_file_buffered_read+0x98
>>   xfs_file_read_iter+0x6a
>>   new_sync_read+0x118
>>   vfs_read+0xf2
>>   __x64_sys_pread64+0x89
>>   do_syscall_64+0x3b
>>   entry_SYSCALL_64_after_hwframe+0x44
>>
>> 2) Regression of userspace performance sensitive code
>>
>> We have some performance sensitive code running in userspace that has its
>> runtime measured by CLOCK_THREAD_CPUTIME_ID. It looks roughly like:
>>
>>   now = clock_gettime(CLOCK_THREAD_CPUTIME_ID)
>>   func()
>>   elapsed = clock_gettime(CLOCK_THREAD_CPUTIME_ID) - now
>>
>> Since the 5.15 upgrade, we observed long `elapsed` values in the range of 4-10ms
>> much more frequently than before. This went away after we disabled swap for the
>> service using the `memory.swap.max=0` memcg configuration.
>>
>> The new thing in 5.15 workingset_refault seems to be the introduction of
>> mem_cgroup_flush_stats() by commit 1f828223b7991a228bc2aef837b78737946d44b2
>> (memcg: flush lruvec stats in the refault).
>>
>> Given that mem_cgroup_flush_stats can take quite a long time for us on the
>> standard systemd cgroupv2 hierarchy (root / system.slice / workload.service)
>>
>>   sudo /usr/share/bcc/tools/funcslower -m 10 -t mem_cgroup_flush_stats
>>   Tracing function calls slower than 10 ms... Ctrl+C to quit.
>>   TIME       COMM           PID    LAT(ms)             RVAL FUNC
>>   0.000000   <redacted>     804776   11.50              200 mem_cgroup_flush_stats
>>   0.343383   <redacted>     647496   10.58              200 mem_cgroup_flush_stats
>>   0.604309   <redacted>     804776   10.50              200 mem_cgroup_flush_stats
>>   1.230416   <redacted>     803293   10.01              200 mem_cgroup_flush_stats
>>   1.248442   <redacted>     646400   11.02              200 mem_cgroup_flush_stats
>>
>> could it be possible that workingset_refault in some unfortunate cases can take
>> much longer than before, such that it increases the time observed by
>> CLOCK_THREAD_CPUTIME_ID from userspace, or the overall duration of
>> workingset_refault observed by perf?
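[Editor's note: the measurement pattern in the quoted report can be sketched as a
small standalone program. The original userspace code is not shown in the thread,
so the helper name and workload below are hypothetical; Python is used here only
for illustration, via its binding for clock_gettime(CLOCK_THREAD_CPUTIME_ID).]

```python
import time

def measure_thread_cpu(func):
    # CLOCK_THREAD_CPUTIME_ID counts CPU time consumed by the calling thread
    # only, so time spent in-kernel on the fault path (e.g. a slow
    # workingset_refault during a page-cache read) inflates the measurement
    # even though no wall-clock sleep is involved.
    start = time.clock_gettime(time.CLOCK_THREAD_CPUTIME_ID)
    func()
    return time.clock_gettime(time.CLOCK_THREAD_CPUTIME_ID) - start

# Hypothetical stand-in for the report's func(): any CPU-bound work.
elapsed = measure_thread_cpu(lambda: sum(range(100_000)))
```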
>>
> 
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Regression in workingset_refault latency on 5.15 #forregzbot
  2022-04-11 10:17   ` Regression in workingset_refault latency on 5.15 #forregzbot Thorsten Leemhuis
@ 2022-05-16 12:51     ` Thorsten Leemhuis
  0 siblings, 0 replies; 38+ messages in thread
From: Thorsten Leemhuis @ 2022-05-16 12:51 UTC (permalink / raw)
  To: regressions



On 11.04.22 12:17, Thorsten Leemhuis wrote:
> TWIMC: this mail is primarily sent for documentation purposes and for
> regzbot, my Linux kernel regression tracking bot. These mails usually
> contain '#forregzbot' in the subject, to make them easy to spot and filter.
> 
> #regzbot backburner: patch in mm, but it needs to be tested properly and
> the issue is not urgent

#regzbot fixed-by: 9b3016154c913b2e7

> On 24.02.22 10:22, Thorsten Leemhuis wrote:
>> [TLDR: I'm adding the regression report below to regzbot, the Linux
>> kernel regression tracking bot; all text you find below is compiled from
>> a few template paragraphs you might have encountered already
>> from similar mails.]
>>
>> Hi, this is your Linux kernel regression tracker speaking.
>>
>> Thanks for the report. CCing the regression mailing list, as it should
>> be in the loop for all regressions, as explained here:
>> https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html
>>
>> To be sure this issue doesn't fall through the cracks unnoticed, I'm
>> adding it to regzbot, my Linux kernel regression tracking bot:
>>
>> #regzbot ^introduced v3.0..v5.15.19
>> #regzbot ignore-activity
>>
>> Reminder for developers: when fixing the issue, please add a 'Link:'
>> tag pointing to the report (the mail quoted below) using
>> lore.kernel.org/r/, as explained in
>> 'Documentation/process/submitting-patches.rst' and
>> 'Documentation/process/5.Posting.rst'. This allows the bot to connect
>> the report with any patches posted or committed to fix the issue; this
>> again allows the bot to show the current status of regressions and
>> automatically resolve the issue when the fix hits the right tree.
>>
>> I'm sending this to everyone that got the initial report, to make them
>> aware of the tracking. I also hope that messages like this motivate
>> people to directly get at least the regression mailing list and ideally
>> even regzbot involved when dealing with regressions, as messages like
>> this wouldn't be needed then.
>>
>> Don't worry, I'll send further messages wrt this regression just to
>> the lists (with a tag in the subject so people can filter them away), if
>> they are relevant just for regzbot. With a bit of luck no such messages
>> will be needed anyway.
>>
>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>
>> P.S.: As the Linux kernel's regression tracker I'm getting a lot of
>> reports on my table. I can only look briefly into most of them and lack
>> knowledge about most of the areas they concern. I thus unfortunately
>> will sometimes get things wrong or miss something important. I hope
>> that's not the case here; if you think it is, don't hesitate to tell me
>> in a public reply, it's in everyone's interest to set the public record
>> straight.
>>
>> On 23.02.22 14:51, Daniel Dao wrote:
>>> Hi all,
>>>
>>> We are observing some regressions in workingset_refault on our newly upgraded
>>> 5.15.19 nodes with zram as swap. This manifests in several ways:
>>>
>>> 1) Regression of workingset_refault duration observed in flamegraph
>>>
>>> We regularly collect flamegraphs for running services on the node. Since upgrade
>>> to 5.15.19, we see that workingset_refault occupied a more significant part of
>>> the service flamegraph (13%) with the following call trace
>>>
>>>   workingset_refault+0x128
>>>   add_to_page_cache_lru+0x9f
>>>   page_cache_ra_unbounded+0x154
>>>   force_page_cache_ra+0xe2
>>>   filemap_get_pages+0xe9
>>>   filemap_read+0xa4
>>>   xfs_file_buffered_read+0x98
>>>   xfs_file_read_iter+0x6a
>>>   new_sync_read+0x118
>>>   vfs_read+0xf2
>>>   __x64_sys_pread64+0x89
>>>   do_syscall_64+0x3b
>>>   entry_SYSCALL_64_after_hwframe+0x44
>>>
>>> 2) Regression of userspace performance sensitive code
>>>
>>> We have some performance sensitive code running in userspace that has its
>>> runtime measured by CLOCK_THREAD_CPUTIME_ID. It looks roughly like:
>>>
>>>   now = clock_gettime(CLOCK_THREAD_CPUTIME_ID)
>>>   func()
>>>   elapsed = clock_gettime(CLOCK_THREAD_CPUTIME_ID) - now
>>>
>>> Since the 5.15 upgrade, we observed long `elapsed` values in the range of 4-10ms
>>> much more frequently than before. This went away after we disabled swap for the
>>> service using the `memory.swap.max=0` memcg configuration.
>>>
>>> The new thing in 5.15 workingset_refault seems to be the introduction of
>>> mem_cgroup_flush_stats() by commit 1f828223b7991a228bc2aef837b78737946d44b2
>>> (memcg: flush lruvec stats in the refault).
>>>
>>> Given that mem_cgroup_flush_stats can take quite a long time for us on the
>>> standard systemd cgroupv2 hierarchy (root / system.slice / workload.service)
>>>
>>>   sudo /usr/share/bcc/tools/funcslower -m 10 -t mem_cgroup_flush_stats
>>>   Tracing function calls slower than 10 ms... Ctrl+C to quit.
>>>   TIME       COMM           PID    LAT(ms)             RVAL FUNC
>>>   0.000000   <redacted>     804776   11.50              200 mem_cgroup_flush_stats
>>>   0.343383   <redacted>     647496   10.58              200 mem_cgroup_flush_stats
>>>   0.604309   <redacted>     804776   10.50              200 mem_cgroup_flush_stats
>>>   1.230416   <redacted>     803293   10.01              200 mem_cgroup_flush_stats
>>>   1.248442   <redacted>     646400   11.02              200 mem_cgroup_flush_stats
>>>
>>> could it be possible that workingset_refault in some unfortunate cases can take
>>> much longer than before, such that it increases the time observed by
>>> CLOCK_THREAD_CPUTIME_ID from userspace, or the overall duration of
>>> workingset_refault observed by perf?
>>>
>>
>>

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2022-05-16 12:51 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-23 13:51 Regression in workingset_refault latency on 5.15 Daniel Dao
2022-02-23 15:57 ` Shakeel Butt
2022-02-23 16:00   ` Shakeel Butt
2022-02-23 17:07     ` Daniel Dao
2022-02-23 17:36       ` Shakeel Butt
2022-02-23 19:28         ` Ivan Babrou
2022-02-23 20:28           ` Shakeel Butt
2022-02-23 21:16             ` Ivan Babrou
2022-02-24 14:46               ` Daniel Dao
2022-02-24 16:58                 ` Shakeel Butt
2022-02-24 17:34                   ` Daniel Dao
2022-02-24 18:00                     ` Shakeel Butt
2022-02-24 18:52                       ` Shakeel Butt
2022-02-25 10:23                         ` Daniel Dao
2022-02-25 17:08                           ` Ivan Babrou
2022-02-25 17:22                             ` Shakeel Butt
2022-02-25 18:03                             ` Michal Koutný
2022-02-25 18:08                               ` Ivan Babrou
2022-02-28 23:09                                 ` Shakeel Butt
2022-02-28 23:34                                   ` Ivan Babrou
2022-02-28 23:43                                     ` Shakeel Butt
2022-03-02  0:48                                     ` Ivan Babrou
2022-03-02  2:50                                       ` Shakeel Butt
2022-03-02  3:40                                         ` Ivan Babrou
2022-03-02 22:33                                           ` Ivan Babrou
2022-03-03  2:32                                             ` Shakeel Butt
2022-03-03  2:35                                             ` Shakeel Butt
2022-03-04  0:21                                               ` Ivan Babrou
2022-03-04  1:05                                                 ` Shakeel Butt
2022-03-04  1:12                                                   ` Ivan Babrou
2022-03-02 11:49                                         ` Frank Hofmann
2022-03-02 15:52                                           ` Shakeel Butt
2022-03-02 10:08                                       ` Michal Koutný
2022-03-02 15:53                                         ` Shakeel Butt
2022-03-02 17:28                                           ` Ivan Babrou
2022-02-24  9:22 ` Thorsten Leemhuis
2022-04-11 10:17   ` Regression in workingset_refault latency on 5.15 #forregzbot Thorsten Leemhuis
2022-05-16 12:51     ` Thorsten Leemhuis

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.