* high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
@ 2023-11-08 14:35 Jaroslav Pulchart
  2023-11-08 18:47 ` Yu Zhao
  0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2023-11-08 14:35 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm

Hello,

I would like to report an unpleasant behavior of multi-gen LRU, with
strange swap in/out usage, on my Dell 7525 two-socket AMD 74F3 system
(16 NUMA domains).

The symptoms of my issue are:

/A/ if multi-gen LRU is enabled
1/ [kswapd3] is consuming 100% CPU

    top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34, 18.26, 15.01
    Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
    %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,  0.4 si,  0.0 st
    MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
    MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
    ...
        765 root      20   0       0      0      0 R  98.3   0.0  34969:04 kswapd3
    ...
2/ swap space usage is low, about ~4MB out of the 8GB of swap on zram
(also observed with an on-disk swap device, where it caused IO latency
issues due to some kind of locking)
3/ swap in/out is huge and symmetrical, ~12MB/s in and ~12MB/s out


/B/ if multi-gen LRU is disabled
1/ [kswapd3] is consuming 3%-10% CPU
    top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05, 17.77, 14.77
    Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
    %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,  0.4 si,  0.0 st
    MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
    MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
    ...
       765 root      20   0       0      0      0 S   3.6   0.0  34966:46 [kswapd3]
    ...
2/ swap space usage is low (4MB)
3/ swap in/out is huge and symmetrical, ~500kB/s in and ~500kB/s out

Both situations are wrong as they use swap in/out extensively;
however, the multi-gen LRU situation is 10 times worse.
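
For reference, a minimal sketch of how the two configurations above
can be switched at runtime and how the swap rates can be observed,
assuming the standard lru_gen sysfs interface and vmstat; the exact
commands are illustrative, not necessarily the ones used here:

    cat /sys/kernel/mm/lru_gen/enabled        # non-zero mask (e.g. 0x0007) means MGLRU is enabled
    echo n > /sys/kernel/mm/lru_gen/enabled   # case /B/: disable multi-gen LRU
    echo y > /sys/kernel/mm/lru_gen/enabled   # case /A/: enable it again
    vmstat 1                                  # watch the si/so columns (swap in/out per second)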

The perf record of case /A/
-  100.00%     0.00%  kswapd3  [kernel.kallsyms]  [k] kswapd
   - kswapd
      - 99.88% balance_pgdat
         - 99.84% shrink_node
            - 99.78% shrink_many
               - 61.66% shrink_one
                  - 55.32% try_to_shrink_lruvec
                     - 49.80% try_to_inc_max_seq.constprop.0
                        - 49.53% walk_mm
                           - 49.46% walk_page_range
                              - 49.32% __walk_page_range
                                 - walk_pgd_range
                                    - walk_p4d_range
                                    - walk_pud_range
                                       - 49.02% walk_pmd_range
                                          - 45.94% get_next_vma
                                             - 30.08% mas_find
                                                - 29.33% mas_walk
                                                     26.83% mtree_range_walk
                                               2.86% should_skip_vma
                                               0.58% mas_next_slot
                                            1.25% walk_pmd_range_locked.isra.0
                     - 5.46% evict_folios
                        - 3.41% shrink_folio_list
                           - 1.15% pageout
                              - swap_writepage
                                 - 1.12% swap_writepage_bdev_sync
                                    - 1.01% submit_bio_wait
                                       - 1.00% __submit_bio_noacct
                                          - __submit_bio
                                             - zram_bio_write
                                                - 0.96% zram_write_page
                                                   - 0.82% lzorle_compress
                                                      - lzogeneric1x_1_compress
                                                           0.73% lzo1x_1_do_compress
                             0.68% __remove_mapping
                        - 1.02% isolate_folios
                           - scan_folios
                                0.65% isolate_folio.isra.0
                          0.55% move_folios_to_lru
                  - 5.43% lruvec_is_sizable
                     - 0.93% get_swappiness
                          mem_cgroup_get_nr_swap_pages
               - 32.07% lru_gen_rotate_memcg
                  - 3.23% _raw_spin_lock_irqsave
                       2.32% native_queued_spin_lock_slowpath
                    1.91% get_random_u8
               - 0.94% _raw_spin_unlock_irqrestore
                  - asm_sysvec_apic_timer_interrupt
                     - sysvec_apic_timer_interrupt
                        - 0.69% __sysvec_apic_timer_interrupt
                           - hrtimer_interrupt
                              - 0.65% __hrtimer_run_queues
                                 - 0.63% tick_sched_timer
                                    - 0.62% tick_sched_handle
                                       - update_process_times
                                           0.51% scheduler_tick

The perf record of case /B/
-  100.00%     0.00%  kswapd3  [kernel.kallsyms]  [k] kswapd
   - kswapd
      - 99.66% balance_pgdat
         - 90.96% shrink_node
            - 75.69% shrink_node_memcgs
               - 25.73% shrink_lruvec
                  - 18.74% get_scan_count
                       2.76% mem_cgroup_get_nr_swap_pages
                  - 2.50% blk_finish_plug
                     - __blk_flush_plug
                          blk_mq_flush_plug_list
                    1.02% shrink_inactive_list
                    1.01% inactive_is_low
               - 17.33% shrink_slab_memcg
                  - 4.02% do_shrink_slab
                     - 1.57% nfs4_xattr_entry_count
                        - list_lru_count_one
                             0.56% __rcu_read_unlock
                     - 0.79% super_cache_count
                          list_lru_count_one
                     - 0.68% nfs4_xattr_cache_count
                        - list_lru_count_one
                             xa_load
                    3.12% _find_next_bit
                    1.87% __radix_tree_lookup
                    0.67% up_read
                    0.67% down_read_trylock
               - 16.34% mem_cgroup_iter
                    0.57% __rcu_read_lock
                    0.54% __rcu_read_unlock
               - 9.36% shrink_slab
                  - do_shrink_slab
                     - 2.37% super_cache_count
                          1.04% list_lru_count_one
                       2.14% count_shadow_nodes
                       1.71% kfree_rcu_shrink_count
                 1.24% vmpressure
            - 15.27% prepare_scan_count
               - 15.04% do_flush_stats
                  - 14.93% cgroup_rstat_flush
                     - cgroup_rstat_flush_locked
                          13.20% mem_cgroup_css_rstat_flush
                          0.78% __blkcg_rstat_flush.isra.0
         - 5.87% shrink_active_list
              2.16% __count_memcg_events
              1.64% _raw_spin_lock_irq
              0.94% isolate_lru_folios
           2.24% mem_cgroup_iter


Could I ask for any suggestions on how to avoid this kswapd
utilization pattern? There is free RAM in each NUMA node compared to
the few MB used in swap:
    NUMA stats:
    NUMA nodes:    0     1     2     3     4     5     6     7     8     9    10    11    12    13    14    15
    MemTotal:  65048 65486 65486 65486 65486 65486 65486 65469 65486 65486 65486 65486 65486 65486 65486 65424
    MemFree:     468   601  1200   302   548  1879  2321  2478  1967  2239  1453  2417  2623  2833  2530  2269
The swap in/out usage does not make sense to me, nor does the CPU
utilization by multi-gen LRU.
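
The per-node numbers above can be gathered with something like the
following, assuming numactl/numastat are installed; shown only as a
sketch:

    numactl --hardware                              # node sizes and free memory per NUMA node
    numastat -m | grep -E 'MemTotal|MemFree'        # per-node MemTotal/MemFree in MB
    cat /sys/devices/system/node/node3/meminfo      # raw meminfo of the busiest node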

Many thanks and best regards,
-- 
Jaroslav Pulchart



* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-11-08 14:35 high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU Jaroslav Pulchart
@ 2023-11-08 18:47 ` Yu Zhao
  2023-11-08 20:04   ` Jaroslav Pulchart
  0 siblings, 1 reply; 30+ messages in thread
From: Yu Zhao @ 2023-11-08 18:47 UTC (permalink / raw)
  To: Jaroslav Pulchart; +Cc: linux-mm, akpm

Hi Jaroslav,

On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
<jaroslav.pulchart@gooddata.com> wrote:
>
> Hello,
>
> I would like to report to you an unpleasant behavior of multi-gen LRU
> with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> system (16numa domains).

Kernel version please?

> Symptoms of my issue are
>
> /A/ if mult-gen LRU is enabled
> 1/ [kswapd3] is consuming 100% CPU

Just thinking out loud: kswapd3 means the fourth node was under memory pressure.

>     top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34,
> 18.26, 15.01
>     Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
>     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,
> 0.4 si,  0.0 st
>     MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
>     MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
>     ...
>         765 root      20   0       0      0      0 R  98.3   0.0
> 34969:04 kswapd3
>     ...
> 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> observed with swap disk as well and cause IO latency issues due to
> some kind of locking)
> 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
>
>
> /B/ if mult-gen LRU is disabled
> 1/ [kswapd3] is consuming 3%-10% CPU
>     top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05,
> 17.77, 14.77
>     Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
>     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,
> 0.4 si,  0.0 st
>     MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
>     MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
>     ...
>        765 root      20   0       0      0      0 S   3.6   0.0
> 34966:46 [kswapd3]
>     ...
> 2/ swap space usage is low (4MB)
> 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
>
> Both situations are wrong as they are using swap in/out extensively,
> however the multi-gen LRU situation is 10times worse.

From the stats below, node 3 had the lowest free memory. So I think in
both cases, the reclaim activities were as expected.

> Could I ask for any suggestions on how to avoid the kswapd utilization
> pattern?

The easiest way is to disable NUMA domain so that there would be only
two nodes with 8x more memory. IOW, you have fewer pools but each pool
has more memory and therefore they are less likely to become empty.
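
As a sketch, the NUMA layout can be confirmed before and after
changing the BIOS NUMA-per-socket setting, assuming numactl is
installed:

    numactl --hardware      # lists nodes, their sizes and free memory
    lscpu | grep -i numa    # NUMA node count and CPU-to-node mapping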

> There is a free RAM in each numa node for the few MB used in
> swap:
>     NUMA stats:
>     NUMA nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
>     MemTotal: 65048 65486 65486 65486 65486 65486 65486 65469 65486
> 65486 65486 65486 65486 65486 65486 65424
>     MemFree: 468 601 1200 302 548 1879 2321 2478 1967 2239 1453 2417
> 2623 2833 2530 2269
> the in/out usage does not make sense for me nor the CPU utilization by
> multi-gen LRU.

My questions:
1. Were there any OOM kills with either case?
2. Was THP enabled?
MGLRU might have spent the extra CPU cycles just to avoid OOM kills or
produce more THPs.

If disabling the NUMA domain isn't an option, I'd recommend:
1. Try the latest kernel (6.6.1) if you haven't.
2. Disable THP if it was enabled, to verify whether it has an impact.

Thanks.



* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-11-08 18:47 ` Yu Zhao
@ 2023-11-08 20:04   ` Jaroslav Pulchart
  2023-11-08 22:09     ` Yu Zhao
  0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2023-11-08 20:04 UTC (permalink / raw)
  To: Yu Zhao; +Cc: linux-mm, akpm, Igor Raits, Daniel Secik

>
> Hi Jaroslav,

Hi Yu Zhao

thanks for the response, see answers inline:

>
> On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> <jaroslav.pulchart@gooddata.com> wrote:
> >
> > Hello,
> >
> > I would like to report to you an unpleasant behavior of multi-gen LRU
> > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > system (16numa domains).
>
> Kernel version please?

6.5.y, but we saw it earlier as well; it has been under investigation
since 23rd May (on 6.4.y and maybe even 6.3.y).

>
> > Symptoms of my issue are
> >
> > /A/ if mult-gen LRU is enabled
> > 1/ [kswapd3] is consuming 100% CPU
>
> Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
>
> >     top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34,
> > 18.26, 15.01
> >     Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
> >     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,
> > 0.4 si,  0.0 st
> >     MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
> >     MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
> >     ...
> >         765 root      20   0       0      0      0 R  98.3   0.0
> > 34969:04 kswapd3
> >     ...
> > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > observed with swap disk as well and cause IO latency issues due to
> > some kind of locking)
> > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> >
> >
> > /B/ if mult-gen LRU is disabled
> > 1/ [kswapd3] is consuming 3%-10% CPU
> >     top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05,
> > 17.77, 14.77
> >     Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
> >     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,
> > 0.4 si,  0.0 st
> >     MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
> >     MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
> >     ...
> >        765 root      20   0       0      0      0 S   3.6   0.0
> > 34966:46 [kswapd3]
> >     ...
> > 2/ swap space usage is low (4MB)
> > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> >
> > Both situations are wrong as they are using swap in/out extensively,
> > however the multi-gen LRU situation is 10times worse.
>
> From the stats below, node 3 had the lowest free memory. So I think in
> both cases, the reclaim activities were as expected.

I do not see a reason for the memory pressure and reclaims. This node
has the lowest free memory of all nodes (~302MB free), that is true;
however, the swap space usage is just 4MB (and still going in and
out). So what can be the reason for that behaviour?

The workers/applications are running in pre-allocated HugePages and
the rest is used by a small set of system services and device
drivers. It is static and not growing. The issue persists even when I
stop the system services and free that memory.

>
> > Could I ask for any suggestions on how to avoid the kswapd utilization
> > pattern?
>
> The easiest way is to disable NUMA domain so that there would be only
> two nodes with 8x more memory. IOW, you have fewer pools but each pool
> has more memory and therefore they are less likely to become empty.
>
> > There is a free RAM in each numa node for the few MB used in
> > swap:
> >     NUMA stats:
> >     NUMA nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> >     MemTotal: 65048 65486 65486 65486 65486 65486 65486 65469 65486
> > 65486 65486 65486 65486 65486 65486 65424
> >     MemFree: 468 601 1200 302 548 1879 2321 2478 1967 2239 1453 2417
> > 2623 2833 2530 2269
> > the in/out usage does not make sense for me nor the CPU utilization by
> > multi-gen LRU.
>
> My questions:
> 1. Were there any OOM kills with either case?

There is no OOM. Neither the memory usage nor the swap space usage is
growing; swap stays at a few MB.

> 2. Was THP enabled?

Both situations occur, with THP enabled as well as disabled.

> MGLRU might have spent the extra CPU cycles just to void OOM kills or
> produce more THPs.
>
> If disabling the NUMA domain isn't an option, I'd recommend:

Disabling NUMA is not an option. However, we are now testing a setup
with 1GB less in HugePages on each NUMA node.

> 1. Try the latest kernel (6.6.1) if you haven't.

Not yet; 6.6.1 was only released today.

> 2. Disable THP if it was enabled, to verify whether it has an impact.

I tried disabling THP, without any effect.
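
For completeness, a sketch of how THP is typically checked and
toggled, assuming the usual sysfs knob (not necessarily the exact
commands used here):

    cat /sys/kernel/mm/transparent_hugepage/enabled            # e.g. [always] madvise never
    echo never > /sys/kernel/mm/transparent_hugepage/enabled   # disable THP at runtime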

>
> Thanks.



* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-11-08 20:04   ` Jaroslav Pulchart
@ 2023-11-08 22:09     ` Yu Zhao
  2023-11-09  6:39       ` Jaroslav Pulchart
  0 siblings, 1 reply; 30+ messages in thread
From: Yu Zhao @ 2023-11-08 22:09 UTC (permalink / raw)
  To: Jaroslav Pulchart
  Cc: linux-mm, akpm, Igor Raits, Daniel Secik, Charan Teja Kalla

[-- Attachment #1: Type: text/plain, Size: 5724 bytes --]

On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
<jaroslav.pulchart@gooddata.com> wrote:
>
> >
> > Hi Jaroslav,
>
> Hi Yu Zhao
>
> thanks for response, see answers inline:
>
> >
> > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > <jaroslav.pulchart@gooddata.com> wrote:
> > >
> > > Hello,
> > >
> > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > system (16numa domains).
> >
> > Kernel version please?
>
> 6.5.y, but we saw it sooner as it is in investigation from 23th May
> (6.4.y and maybe even the 6.3.y).

v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
for you if you run into other problems with v6.6.

> > > Symptoms of my issue are
> > >
> > > /A/ if mult-gen LRU is enabled
> > > 1/ [kswapd3] is consuming 100% CPU
> >
> > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> >
> > >     top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34,
> > > 18.26, 15.01
> > >     Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
> > >     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,
> > > 0.4 si,  0.0 st
> > >     MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
> > >     MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
> > >     ...
> > >         765 root      20   0       0      0      0 R  98.3   0.0
> > > 34969:04 kswapd3
> > >     ...
> > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > > observed with swap disk as well and cause IO latency issues due to
> > > some kind of locking)
> > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > >
> > >
> > > /B/ if mult-gen LRU is disabled
> > > 1/ [kswapd3] is consuming 3%-10% CPU
> > >     top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05,
> > > 17.77, 14.77
> > >     Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
> > >     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,
> > > 0.4 si,  0.0 st
> > >     MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
> > >     MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
> > >     ...
> > >        765 root      20   0       0      0      0 S   3.6   0.0
> > > 34966:46 [kswapd3]
> > >     ...
> > > 2/ swap space usage is low (4MB)
> > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > >
> > > Both situations are wrong as they are using swap in/out extensively,
> > > however the multi-gen LRU situation is 10times worse.
> >
> > From the stats below, node 3 had the lowest free memory. So I think in
> > both cases, the reclaim activities were as expected.
>
> I do not see a reason for the memory pressure and reclaims. This node
> has the lowest free memory of all nodes (~302MB free) that is true,
> however the swap space usage is just 4MB (still going in and out). So
> what can be the reason for that behaviour?

The best analogy is that refuel (reclaim) happens before the tank
becomes empty, and it happens even sooner when there is a long road
ahead (high order allocations).
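
To make the analogy concrete: kswapd is woken when the free pages of a
zone drop below its low watermark and keeps reclaiming until the high
watermark is reached again. A sketch of how the watermarks of the
problematic node can be inspected, assuming the standard
/proc/zoneinfo layout:

    grep -A 8 'Node 3, zone' /proc/zoneinfo    # shows pages free / min / low / high for node 3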

> The workers/application is running in pre-allocated HugePages and the
> rest is used for a small set of system services and drivers of
> devices. It is static and not growing. The issue persists when I stop
> the system services and free the memory.

Yes, this helps. Also could you attach /proc/buddyinfo from the moment
you hit the problem?

> > > Could I ask for any suggestions on how to avoid the kswapd utilization
> > > pattern?
> >
> > The easiest way is to disable NUMA domain so that there would be only
> > two nodes with 8x more memory. IOW, you have fewer pools but each pool
> > has more memory and therefore they are less likely to become empty.
> >
> > > There is a free RAM in each numa node for the few MB used in
> > > swap:
> > >     NUMA stats:
> > >     NUMA nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> > >     MemTotal: 65048 65486 65486 65486 65486 65486 65486 65469 65486
> > > 65486 65486 65486 65486 65486 65486 65424
> > >     MemFree: 468 601 1200 302 548 1879 2321 2478 1967 2239 1453 2417
> > > 2623 2833 2530 2269
> > > the in/out usage does not make sense for me nor the CPU utilization by
> > > multi-gen LRU.
> >
> > My questions:
> > 1. Were there any OOM kills with either case?
>
> There is no OOM. The memory usage is not growing nor the swap space
> usage, it is still a few MB there.
>
> > 2. Was THP enabled?
>
> Both situations with enabled and with disabled THP.

My suspicion is that you packed node 3 too perfectly :) And that
might have triggered a known but currently low-priority problem in
MGLRU. I'm attaching a patch for v6.6 and hoping you could verify it
for me, in case v6.6 by itself still has the problem?

> > MGLRU might have spent the extra CPU cycles just to void OOM kills or
> > produce more THPs.
> >
> > If disabling the NUMA domain isn't an option, I'd recommend:
>
> Disabling numa is not an option. However we are now testing a setup
> with -1GB in HugePages per each numa.
>
> > 1. Try the latest kernel (6.6.1) if you haven't.
>
> Not yet, the 6.6.1 was released today.
>
> > 2. Disable THP if it was enabled, to verify whether it has an impact.
>
> I try disabling THP without any effect.

Gotcha. Please try the patch with MGLRU and let me know. Thanks!

(Also CCing Charan @ Qualcomm, who initially reported the problem
that led to the attached patch.)

[-- Attachment #2: 0001-mm-mglru-curb-kswapd-overshooting-high-wmarks.patch --]
[-- Type: application/octet-stream, Size: 3209 bytes --]

From a188169d26b2d40fe0a91393761cf2292984545c Mon Sep 17 00:00:00 2001
From: Yu Zhao <yuzhao@google.com>
Date: Wed, 8 Nov 2023 14:56:58 -0700
Subject: [PATCH] mm/mglru: curb kswapd overshooting high wmarks

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 mm/vmscan.c | 40 +++++++++++++++++++++++++++++++++-------
 1 file changed, 33 insertions(+), 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6f13394b112e..dc0bd2cc27e0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5341,20 +5341,47 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool
 	return try_to_inc_max_seq(lruvec, max_seq, sc, can_swap, false) ? -1 : 0;
 }
 
-static unsigned long get_nr_to_reclaim(struct scan_control *sc)
+static unsigned long get_nr_to_reclaim(struct lruvec *lruvec, struct scan_control *sc)
 {
+	int i;
+	unsigned long nr_to_reclaim;
+
 	/* don't abort memcg reclaim to ensure fairness */
 	if (!root_reclaim(sc))
 		return -1;
 
-	return max(sc->nr_to_reclaim, compact_gap(sc->order));
+	nr_to_reclaim = max(sc->nr_to_reclaim, compact_gap(sc->order));
+	if (sc->nr_reclaimed >= nr_to_reclaim)
+		return 0;
+
+	/* don't abort direct reclaim to avoid premature OOM */
+	if (!current_is_kswapd())
+		return nr_to_reclaim;
+
+	/* abort only if all eligible zones are balanced */
+	for (i = 0; i <= sc->reclaim_idx; i++) {
+		unsigned long wmark;
+		struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
+
+		if (!managed_zone(zone))
+			continue;
+
+		if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
+			wmark = wmark_pages(zone, WMARK_PROMO);
+		else
+			wmark = high_wmark_pages(zone);
+
+		if (!zone_watermark_ok_safe(zone, sc->order, wmark, sc->reclaim_idx))
+			return nr_to_reclaim;
+	}
+
+	return i > sc->reclaim_idx ? 0 : nr_to_reclaim;
 }
 
 static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 {
 	long nr_to_scan;
 	unsigned long scanned = 0;
-	unsigned long nr_to_reclaim = get_nr_to_reclaim(sc);
 	int swappiness = get_swappiness(lruvec, sc);
 
 	/* clean file folios are more likely to exist */
@@ -5376,7 +5403,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 		if (scanned >= nr_to_scan)
 			break;
 
-		if (sc->nr_reclaimed >= nr_to_reclaim)
+		if (sc->nr_reclaimed >= get_nr_to_reclaim(lruvec, sc))
 			break;
 
 		cond_resched();
@@ -5437,7 +5464,6 @@ static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
 	struct lru_gen_folio *lrugen;
 	struct mem_cgroup *memcg;
 	const struct hlist_nulls_node *pos;
-	unsigned long nr_to_reclaim = get_nr_to_reclaim(sc);
 
 	bin = first_bin = get_random_u32_below(MEMCG_NR_BINS);
 restart:
@@ -5470,7 +5496,7 @@ static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
 
 		rcu_read_lock();
 
-		if (sc->nr_reclaimed >= nr_to_reclaim)
+		if (sc->nr_reclaimed >= get_nr_to_reclaim(lruvec, sc))
 			break;
 	}
 
@@ -5481,7 +5507,7 @@ static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
 
 	mem_cgroup_put(memcg);
 
-	if (sc->nr_reclaimed >= nr_to_reclaim)
+	if (!is_a_nulls(pos))
 		return;
 
 	/* restart if raced with lru_gen_rotate_memcg() */
-- 
2.42.0.869.gea05f2083d-goog
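
A typical way to apply the attached patch on a v6.6 tree, sketched
under the assumption that the attachment is saved under the file name
shown above:

    git checkout v6.6
    git am 0001-mm-mglru-curb-kswapd-overshooting-high-wmarks.patch
    # then rebuild, install and boot the patched kernel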



* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-11-08 22:09     ` Yu Zhao
@ 2023-11-09  6:39       ` Jaroslav Pulchart
  2023-11-09  6:48         ` Yu Zhao
  0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2023-11-09  6:39 UTC (permalink / raw)
  To: Yu Zhao; +Cc: linux-mm, akpm, Igor Raits, Daniel Secik, Charan Teja Kalla

>
> On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> <jaroslav.pulchart@gooddata.com> wrote:
> >
> > >
> > > Hi Jaroslav,
> >
> > Hi Yu Zhao
> >
> > thanks for response, see answers inline:
> >
> > >
> > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > <jaroslav.pulchart@gooddata.com> wrote:
> > > >
> > > > Hello,
> > > >
> > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > > system (16numa domains).
> > >
> > > Kernel version please?
> >
> > 6.5.y, but we saw it sooner as it is in investigation from 23th May
> > (6.4.y and maybe even the 6.3.y).
>
> v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> for you if you run into other problems with v6.6.
>

I will give it a try using 6.6.y. If it works, we can switch to
6.6.y instead of backporting the fixes to 6.5.y.

> > > > Symptoms of my issue are
> > > >
> > > > /A/ if mult-gen LRU is enabled
> > > > 1/ [kswapd3] is consuming 100% CPU
> > >
> > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > >
> > > >     top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34,
> > > > 18.26, 15.01
> > > >     Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
> > > >     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,
> > > > 0.4 si,  0.0 st
> > > >     MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
> > > >     MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
> > > >     ...
> > > >         765 root      20   0       0      0      0 R  98.3   0.0
> > > > 34969:04 kswapd3
> > > >     ...
> > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > > > observed with swap disk as well and cause IO latency issues due to
> > > > some kind of locking)
> > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > > >
> > > >
> > > > /B/ if mult-gen LRU is disabled
> > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > >     top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05,
> > > > 17.77, 14.77
> > > >     Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
> > > >     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,
> > > > 0.4 si,  0.0 st
> > > >     MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
> > > >     MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
> > > >     ...
> > > >        765 root      20   0       0      0      0 S   3.6   0.0
> > > > 34966:46 [kswapd3]
> > > >     ...
> > > > 2/ swap space usage is low (4MB)
> > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > > >
> > > > Both situations are wrong as they are using swap in/out extensively,
> > > > however the multi-gen LRU situation is 10times worse.
> > >
> > > From the stats below, node 3 had the lowest free memory. So I think in
> > > both cases, the reclaim activities were as expected.
> >
> > I do not see a reason for the memory pressure and reclaims. This node
> > has the lowest free memory of all nodes (~302MB free) that is true,
> > however the swap space usage is just 4MB (still going in and out). So
> > what can be the reason for that behaviour?
>
> The best analogy is that refuel (reclaim) happens before the tank
> becomes empty, and it happens even sooner when there is a long road
> ahead (high order allocations).
>
> > The workers/application is running in pre-allocated HugePages and the
> > rest is used for a small set of system services and drivers of
> > devices. It is static and not growing. The issue persists when I stop
> > the system services and free the memory.
>
> Yes, this helps.
>  Also could you attach /proc/buddyinfo from the moment
> you hit the problem?
>

I can. The problem is continuous: it is doing swap in/out 100% of the
time, consuming 100% of a CPU and blocking IO.

The output of /proc/buddyinfo is:

# cat /proc/buddyinfo
Node 0, zone      DMA      7      2      2      1      1      2      1      1      1      2      1
Node 0, zone    DMA32   4567   3395   1357    846    439    190     93     61     43     23      4
Node 0, zone   Normal     19    190    140    129    136     75     66     41      9      1      5
Node 1, zone   Normal    194   1210   2080   1800    715    255    111     56     42     36     55
Node 2, zone   Normal    204    768   3766   3394   1742    468    185    194    238     47     74
Node 3, zone   Normal   1622   2137   1058    846    388    208     97     44     14     42     10
Node 4, zone   Normal    282    705    623    274    184     90     63     41     11      1     28
Node 5, zone   Normal    505    620   6180   3706   1724   1083    592    410    417    168     70
Node 6, zone   Normal   1120    357   3314   3437   2264    872    606    209    215    123    265
Node 7, zone   Normal    365   5499  12035   7486   3845   1743    635    243    309    292     78
Node 8, zone   Normal    248    740   2280   1094   1225   2087    846    308    192     65     55
Node 9, zone   Normal    356    763   1625    944    740   1920   1174    696    217    235    111
Node 10, zone   Normal    727   1479   7002   6114   2487   1084    407    269    157     78     16
Node 11, zone   Normal    189   3287   9141   5039   2560   1183   1247    693    506    252      8
Node 12, zone   Normal    142    378   1317    466   1512   1568    646    359    248    264    228
Node 13, zone   Normal    444   1977   3173   2625   2105   1493    931    600    369    266    230
Node 14, zone   Normal    376    221    120    360   2721   2378   1521    826    442    204     59
Node 15, zone   Normal   1210    966    922   2046   4128   2904   1518    744    352    102     58
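
In /proc/buddyinfo each column is the number of free blocks of order
0..10, i.e. 2^order contiguous base pages. A rough sketch, assuming
4KiB base pages, to turn the node 3 rows into approximate free memory:

    awk '/^Node 3,/ { for (i = 5; i <= NF; i++) s += $i * 2^(i-5) * 4096 }
         END { printf "node 3 free: %.0f MiB\n", s / (1024*1024) }' /proc/buddyinfo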


> > > > Could I ask for any suggestions on how to avoid the kswapd utilization
> > > > pattern?
> > >
> > > The easiest way is to disable NUMA domain so that there would be only
> > > two nodes with 8x more memory. IOW, you have fewer pools but each pool
> > > has more memory and therefore they are less likely to become empty.
> > >
> > > > There is a free RAM in each numa node for the few MB used in
> > > > swap:
> > > >     NUMA stats:
> > > >     NUMA nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> > > >     MemTotal: 65048 65486 65486 65486 65486 65486 65486 65469 65486
> > > > 65486 65486 65486 65486 65486 65486 65424
> > > >     MemFree: 468 601 1200 302 548 1879 2321 2478 1967 2239 1453 2417
> > > > 2623 2833 2530 2269
> > > > the in/out usage does not make sense for me nor the CPU utilization by
> > > > multi-gen LRU.
> > >
> > > My questions:
> > > 1. Were there any OOM kills with either case?
> >
> > There is no OOM. The memory usage is not growing nor the swap space
> > usage, it is still a few MB there.
> >
> > > 2. Was THP enabled?
> >
> > Both situations with enabled and with disabled THP.
>
> My suspicion is that you packed the node 3 too perfectly :) And that
> might have triggered a known but currently a low priority problem in
> MGLRU. I'm attaching a patch for v6.6 and hoping you could verify it
> for me in case v6.6 by itself still has the problem?
>

I would not focus just on node 3; we had issues on different servers
with node 0 and node 2 in parallel, but mostly it is node 3.

What our setup looks like:
* each node has 64GB of RAM,
* 61GB of it is in 1GB HugePages,
* the remaining 3GB is used by the host system

There are KVM VMs running with vCPUs pinned to the NUMA domains and
using the HugePages (the topology is exposed to the VMs, no
overcommit, no shared CPUs); the qemu-kvm threads are pinned to the
same NUMA domain as their vCPUs. System services are not pinned. I'm
not sure why node 3 is used the most, as the VMs are balanced and the
host's system services can move between domains.
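
As a sketch, per-node 1GB HugePage reservations of this kind are
usually done through the standard sysfs interface (the node number and
count below are illustrative; 1GB pages generally have to be reserved
early, e.g. at boot via hugepagesz=1G hugepages=..., before memory
fragments):

    echo 61 > /sys/devices/system/node/node3/hugepages/hugepages-1048576kB/nr_hugepages   # reserve 61 x 1GB pages on node 3
    cat /sys/devices/system/node/node3/hugepages/hugepages-1048576kB/nr_hugepages         # verify the reservation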

> > > MGLRU might have spent the extra CPU cycles just to void OOM kills or
> > > produce more THPs.
> > >
> > > If disabling the NUMA domain isn't an option, I'd recommend:
> >
> > Disabling numa is not an option. However we are now testing a setup
> > with -1GB in HugePages per each numa.
> >
> > > 1. Try the latest kernel (6.6.1) if you haven't.
> >
> > Not yet, the 6.6.1 was released today.
> >
> > > 2. Disable THP if it was enabled, to verify whether it has an impact.
> >
> > I try disabling THP without any effect.
>
> Gochat. Please try the patch with MGLRU and let me know. Thanks!
>
> (Also CC Charan @ Qualcomm who initially reported the problem that
> ended up with the attached patch.)

I can try it. Will let you know.



* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-11-09  6:39       ` Jaroslav Pulchart
@ 2023-11-09  6:48         ` Yu Zhao
  2023-11-09 10:58           ` Jaroslav Pulchart
  0 siblings, 1 reply; 30+ messages in thread
From: Yu Zhao @ 2023-11-09  6:48 UTC (permalink / raw)
  To: Jaroslav Pulchart
  Cc: linux-mm, akpm, Igor Raits, Daniel Secik, Charan Teja Kalla

On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
<jaroslav.pulchart@gooddata.com> wrote:
>
> >
> > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> > <jaroslav.pulchart@gooddata.com> wrote:
> > >
> > > >
> > > > Hi Jaroslav,
> > >
> > > Hi Yu Zhao
> > >
> > > thanks for response, see answers inline:
> > >
> > > >
> > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > > > system (16numa domains).
> > > >
> > > > Kernel version please?
> > >
> > > 6.5.y, but we saw it sooner as it is in investigation from 23th May
> > > (6.4.y and maybe even the 6.3.y).
> >
> > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> > for you if you run into other problems with v6.6.
> >
>
> I will give it a try using 6.6.y. When it will work we can switch to
> 6.6.y instead of backporting the stuff to 6.5.y.
>
> > > > > Symptoms of my issue are
> > > > >
> > > > > /A/ if mult-gen LRU is enabled
> > > > > 1/ [kswapd3] is consuming 100% CPU
> > > >
> > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > > >
> > > > >     top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34,
> > > > > 18.26, 15.01
> > > > >     Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
> > > > >     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,
> > > > > 0.4 si,  0.0 st
> > > > >     MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
> > > > >     MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
> > > > >     ...
> > > > >         765 root      20   0       0      0      0 R  98.3   0.0
> > > > > 34969:04 kswapd3
> > > > >     ...
> > > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > > > > observed with swap disk as well and cause IO latency issues due to
> > > > > some kind of locking)
> > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > > > >
> > > > >
> > > > > /B/ if mult-gen LRU is disabled
> > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > >     top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05,
> > > > > 17.77, 14.77
> > > > >     Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
> > > > >     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,
> > > > > 0.4 si,  0.0 st
> > > > >     MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
> > > > >     MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
> > > > >     ...
> > > > >        765 root      20   0       0      0      0 S   3.6   0.0
> > > > > 34966:46 [kswapd3]
> > > > >     ...
> > > > > 2/ swap space usage is low (4MB)
> > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > > > >
> > > > > Both situations are wrong as they are using swap in/out extensively,
> > > > > however the multi-gen LRU situation is 10times worse.
> > > >
> > > > From the stats below, node 3 had the lowest free memory. So I think in
> > > > both cases, the reclaim activities were as expected.
> > >
> > > I do not see a reason for the memory pressure and reclaims. This node
> > > has the lowest free memory of all nodes (~302MB free) that is true,
> > > however the swap space usage is just 4MB (still going in and out). So
> > > what can be the reason for that behaviour?
> >
> > The best analogy is that refuel (reclaim) happens before the tank
> > becomes empty, and it happens even sooner when there is a long road
> > ahead (high order allocations).
> >
> > > The workers/application is running in pre-allocated HugePages and the
> > > rest is used for a small set of system services and drivers of
> > > devices. It is static and not growing. The issue persists when I stop
> > > the system services and free the memory.
> >
> > Yes, this helps.
> >  Also could you attach /proc/buddyinfo from the moment
> > you hit the problem?
> >
>
> I can. The problem is continuous, it is 100% of time continuously
> doing in/out and consuming 100% of CPU and locking IO.
>
> The output of /proc/buddyinfo is:
>
> # cat /proc/buddyinfo
> Node 0, zone      DMA      7      2      2      1      1      2      1
>      1      1      2      1
> Node 0, zone    DMA32   4567   3395   1357    846    439    190     93
>     61     43     23      4
> Node 0, zone   Normal     19    190    140    129    136     75     66
>     41      9      1      5
> Node 1, zone   Normal    194   1210   2080   1800    715    255    111
>     56     42     36     55
> Node 2, zone   Normal    204    768   3766   3394   1742    468    185
>    194    238     47     74
> Node 3, zone   Normal   1622   2137   1058    846    388    208     97
>     44     14     42     10

Again, thinking out loud: there is only one zone on node 3, i.e., the
Normal zone, and this rules out the problem fixed in v6.6 by commit
669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
reclaim").

> Node 4, zone   Normal    282    705    623    274    184     90     63
>     41     11      1     28
> Node 5, zone   Normal    505    620   6180   3706   1724   1083    592
>    410    417    168     70
> Node 6, zone   Normal   1120    357   3314   3437   2264    872    606
>    209    215    123    265
> Node 7, zone   Normal    365   5499  12035   7486   3845   1743    635
>    243    309    292     78
> Node 8, zone   Normal    248    740   2280   1094   1225   2087    846
>    308    192     65     55
> Node 9, zone   Normal    356    763   1625    944    740   1920   1174
>    696    217    235    111
> Node 10, zone   Normal    727   1479   7002   6114   2487   1084
> 407    269    157     78     16
> Node 11, zone   Normal    189   3287   9141   5039   2560   1183
> 1247    693    506    252      8
> Node 12, zone   Normal    142    378   1317    466   1512   1568
> 646    359    248    264    228
> Node 13, zone   Normal    444   1977   3173   2625   2105   1493
> 931    600    369    266    230
> Node 14, zone   Normal    376    221    120    360   2721   2378
> 1521    826    442    204     59
> Node 15, zone   Normal   1210    966    922   2046   4128   2904
> 1518    744    352    102     58
>
>
> > > > > Could I ask for any suggestions on how to avoid the kswapd utilization
> > > > > pattern?
> > > >
> > > > The easiest way is to disable NUMA domain so that there would be only
> > > > two nodes with 8x more memory. IOW, you have fewer pools but each pool
> > > > has more memory and therefore they are less likely to become empty.
> > > >
> > > > > There is a free RAM in each numa node for the few MB used in
> > > > > swap:
> > > > >     NUMA stats:
> > > > >     NUMA nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> > > > >     MemTotal: 65048 65486 65486 65486 65486 65486 65486 65469 65486
> > > > > 65486 65486 65486 65486 65486 65486 65424
> > > > >     MemFree: 468 601 1200 302 548 1879 2321 2478 1967 2239 1453 2417
> > > > > 2623 2833 2530 2269
> > > > > the in/out usage does not make sense for me nor the CPU utilization by
> > > > > multi-gen LRU.
> > > >
> > > > My questions:
> > > > 1. Were there any OOM kills with either case?
> > >
> > > There is no OOM. The memory usage is not growing nor the swap space
> > > usage, it is still a few MB there.
> > >
> > > > 2. Was THP enabled?
> > >
> > > Both situations with enabled and with disabled THP.
> >
> > My suspicion is that you packed the node 3 too perfectly :) And that
> > might have triggered a known but currently a low priority problem in
> > MGLRU. I'm attaching a patch for v6.6 and hoping you could verify it
> > for me in case v6.6 by itself still has the problem?
> >
>
> I would not focus just to node3, we had issues on different servers
> with node0 and node2 both in parallel, but mostly it is the node3.
>
> How our setup looks like:
> * each node has 64GB of RAM,
> * 61GB from it is in 1GB Huge Pages,
> * rest 3GB is used by host system
>
> There are running kvm VMs vCPUs pinned to the NUMA domains and using
> the Huge Pages (topology is exposed to VMs, no-overcommit, no-shared
> cpus), the qemu-kvm threads are pinned to the same numa domain as the
> vCPUs. System services are not pinned, I'm not sure why the node3 is
> used at most as the vms are balanced and the host's system services
> can move between domains.
>
> > > > MGLRU might have spent the extra CPU cycles just to void OOM kills or
> > > > produce more THPs.
> > > >
> > > > If disabling the NUMA domain isn't an option, I'd recommend:
> > >
> > > Disabling numa is not an option. However we are now testing a setup
> > > with -1GB in HugePages per each numa.
> > >
> > > > 1. Try the latest kernel (6.6.1) if you haven't.
> > >
> > > Not yet, the 6.6.1 was released today.
> > >
> > > > 2. Disable THP if it was enabled, to verify whether it has an impact.
> > >
> > > I try disabling THP without any effect.
> >
> > Gochat. Please try the patch with MGLRU and let me know. Thanks!
> >
> > (Also CC Charan @ Qualcomm who initially reported the problem that
> > ended up with the attached patch.)
>
> I can try it. Will let you know.

Great, thanks!



* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-11-09  6:48         ` Yu Zhao
@ 2023-11-09 10:58           ` Jaroslav Pulchart
  2023-11-10  1:31             ` Yu Zhao
  0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2023-11-09 10:58 UTC (permalink / raw)
  To: Yu Zhao; +Cc: linux-mm, akpm, Igor Raits, Daniel Secik, Charan Teja Kalla

>
> On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
> <jaroslav.pulchart@gooddata.com> wrote:
> >
> > >
> > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> > > <jaroslav.pulchart@gooddata.com> wrote:
> > > >
> > > > >
> > > > > Hi Jaroslav,
> > > >
> > > > Hi Yu Zhao
> > > >
> > > > thanks for response, see answers inline:
> > > >
> > > > >
> > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > > > > system (16numa domains).
> > > > >
> > > > > Kernel version please?
> > > >
> > > > 6.5.y, but we saw it sooner as it is in investigation from 23th May
> > > > (6.4.y and maybe even the 6.3.y).
> > >
> > > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> > > for you if you run into other problems with v6.6.
> > >
> >
> > I will give it a try using 6.6.y. When it will work we can switch to
> > 6.6.y instead of backporting the stuff to 6.5.y.
> >
> > > > > > Symptoms of my issue are
> > > > > >
> > > > > > /A/ if mult-gen LRU is enabled
> > > > > > 1/ [kswapd3] is consuming 100% CPU
> > > > >
> > > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > > > >
> > > > > >     top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34,
> > > > > > 18.26, 15.01
> > > > > >     Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
> > > > > >     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,
> > > > > > 0.4 si,  0.0 st
> > > > > >     MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
> > > > > >     MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
> > > > > >     ...
> > > > > >         765 root      20   0       0      0      0 R  98.3   0.0
> > > > > > 34969:04 kswapd3
> > > > > >     ...
> > > > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > > > > > observed with swap disk as well and cause IO latency issues due to
> > > > > > some kind of locking)
> > > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > > > > >
> > > > > >
> > > > > > /B/ if mult-gen LRU is disabled
> > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > > >     top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05,
> > > > > > 17.77, 14.77
> > > > > >     Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
> > > > > >     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,
> > > > > > 0.4 si,  0.0 st
> > > > > >     MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
> > > > > >     MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
> > > > > >     ...
> > > > > >        765 root      20   0       0      0      0 S   3.6   0.0
> > > > > > 34966:46 [kswapd3]
> > > > > >     ...
> > > > > > 2/ swap space usage is low (4MB)
> > > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > > > > >
> > > > > > Both situations are wrong as they are using swap in/out extensively,
> > > > > > however the multi-gen LRU situation is 10times worse.
> > > > >
> > > > > From the stats below, node 3 had the lowest free memory. So I think in
> > > > > both cases, the reclaim activities were as expected.
> > > >
> > > > I do not see a reason for the memory pressure and reclaims. This node
> > > > has the lowest free memory of all nodes (~302MB free) that is true,
> > > > however the swap space usage is just 4MB (still going in and out). So
> > > > what can be the reason for that behaviour?
> > >
> > > The best analogy is that refuel (reclaim) happens before the tank
> > > becomes empty, and it happens even sooner when there is a long road
> > > ahead (high order allocations).
> > >
> > > > The workers/application is running in pre-allocated HugePages and the
> > > > rest is used for a small set of system services and drivers of
> > > > devices. It is static and not growing. The issue persists when I stop
> > > > the system services and free the memory.
> > >
> > > Yes, this helps.
> > >  Also could you attach /proc/buddyinfo from the moment
> > > you hit the problem?
> > >
> >
> > I can. The problem is continuous, it is 100% of time continuously
> > doing in/out and consuming 100% of CPU and locking IO.
> >
> > The output of /proc/buddyinfo is:
> >
> > # cat /proc/buddyinfo
> > Node 0, zone      DMA      7      2      2      1      1      2      1
> >      1      1      2      1
> > Node 0, zone    DMA32   4567   3395   1357    846    439    190     93
> >     61     43     23      4
> > Node 0, zone   Normal     19    190    140    129    136     75     66
> >     41      9      1      5
> > Node 1, zone   Normal    194   1210   2080   1800    715    255    111
> >     56     42     36     55
> > Node 2, zone   Normal    204    768   3766   3394   1742    468    185
> >    194    238     47     74
> > Node 3, zone   Normal   1622   2137   1058    846    388    208     97
> >     44     14     42     10
>
> Again, thinking out loud: there is only one zone on node 3, i.e., the
> normal zone, and this excludes the problem commit
> 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
> reclaim") fixed in v6.6.

I built vanilla 6.6.1 and did a first quick test (spin up and destroy
VMs only). This test does not always trigger the continuous kswapd3
swap in/out usage, but it does exercise it, and it looks like there is
a change:

 I can see non-continuous kswapd usage (15s and more) with 6.5.y:
 # ps ax | grep [k]swapd
    753 ?        S      0:00 [kswapd0]
    754 ?        S      0:00 [kswapd1]
    755 ?        S      0:00 [kswapd2]
    756 ?        S      0:15 [kswapd3]    <<<<<<<<<
    757 ?        S      0:00 [kswapd4]
    758 ?        S      0:00 [kswapd5]
    759 ?        S      0:00 [kswapd6]
    760 ?        S      0:00 [kswapd7]
    761 ?        S      0:00 [kswapd8]
    762 ?        S      0:00 [kswapd9]
    763 ?        S      0:00 [kswapd10]
    764 ?        S      0:00 [kswapd11]
    765 ?        S      0:00 [kswapd12]
    766 ?        S      0:00 [kswapd13]
    767 ?        S      0:00 [kswapd14]
    768 ?        S      0:00 [kswapd15]

and no kswapd usage with 6.6.1, which looks like a promising path:

# ps ax | grep [k]swapd
    808 ?        S      0:00 [kswapd0]
    809 ?        S      0:00 [kswapd1]
    810 ?        S      0:00 [kswapd2]
    811 ?        S      0:00 [kswapd3]    <<<< nice
    812 ?        S      0:00 [kswapd4]
    813 ?        S      0:00 [kswapd5]
    814 ?        S      0:00 [kswapd6]
    815 ?        S      0:00 [kswapd7]
    816 ?        S      0:00 [kswapd8]
    817 ?        S      0:00 [kswapd9]
    818 ?        S      0:00 [kswapd10]
    819 ?        S      0:00 [kswapd11]
    820 ?        S      0:00 [kswapd12]
    821 ?        S      0:00 [kswapd13]
    822 ?        S      0:00 [kswapd14]
    823 ?        S      0:00 [kswapd15]

I will install 6.6.1 on the server which is doing some work and
observe it later today.
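
A sketch of how the per-kswapd CPU usage and the swap traffic can be
watched over time on the loaded server, assuming sysstat's pidstat is
available:

    pidstat -p "$(pgrep -d, kswapd)" 5   # CPU usage of every kswapd thread, every 5 seconds
    vmstat 5                             # si/so columns show swap in/out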


>
> > Node 4, zone   Normal    282    705    623    274    184     90     63
> >     41     11      1     28
> > Node 5, zone   Normal    505    620   6180   3706   1724   1083    592
> >    410    417    168     70
> > Node 6, zone   Normal   1120    357   3314   3437   2264    872    606
> >    209    215    123    265
> > Node 7, zone   Normal    365   5499  12035   7486   3845   1743    635
> >    243    309    292     78
> > Node 8, zone   Normal    248    740   2280   1094   1225   2087    846
> >    308    192     65     55
> > Node 9, zone   Normal    356    763   1625    944    740   1920   1174
> >    696    217    235    111
> > Node 10, zone   Normal    727   1479   7002   6114   2487   1084
> > 407    269    157     78     16
> > Node 11, zone   Normal    189   3287   9141   5039   2560   1183
> > 1247    693    506    252      8
> > Node 12, zone   Normal    142    378   1317    466   1512   1568
> > 646    359    248    264    228
> > Node 13, zone   Normal    444   1977   3173   2625   2105   1493
> > 931    600    369    266    230
> > Node 14, zone   Normal    376    221    120    360   2721   2378
> > 1521    826    442    204     59
> > Node 15, zone   Normal   1210    966    922   2046   4128   2904
> > 1518    744    352    102     58
> >
> >
> > > > > > Could I ask for any suggestions on how to avoid the kswapd utilization
> > > > > > pattern?
> > > > >
> > > > > The easiest way is to disable NUMA domain so that there would be only
> > > > > two nodes with 8x more memory. IOW, you have fewer pools but each pool
> > > > > has more memory and therefore they are less likely to become empty.
> > > > >
> > > > > > There is a free RAM in each numa node for the few MB used in
> > > > > > swap:
> > > > > >     NUMA stats:
> > > > > >     NUMA nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> > > > > >     MemTotal: 65048 65486 65486 65486 65486 65486 65486 65469 65486
> > > > > > 65486 65486 65486 65486 65486 65486 65424
> > > > > >     MemFree: 468 601 1200 302 548 1879 2321 2478 1967 2239 1453 2417
> > > > > > 2623 2833 2530 2269
> > > > > > the in/out usage does not make sense for me nor the CPU utilization by
> > > > > > multi-gen LRU.
> > > > >
> > > > > My questions:
> > > > > 1. Were there any OOM kills with either case?
> > > >
> > > > There is no OOM. The memory usage is not growing nor the swap space
> > > > usage, it is still a few MB there.
> > > >
> > > > > 2. Was THP enabled?
> > > >
> > > > Both situations with enabled and with disabled THP.
> > >
> > > My suspicion is that you packed the node 3 too perfectly :) And that
> > > might have triggered a known but currently a low priority problem in
> > > MGLRU. I'm attaching a patch for v6.6 and hoping you could verify it
> > > for me in case v6.6 by itself still has the problem?
> > >
> >
> > I would not focus just to node3, we had issues on different servers
> > with node0 and node2 both in parallel, but mostly it is the node3.
> >
> > How our setup looks like:
> > * each node has 64GB of RAM,
> > * 61GB from it is in 1GB Huge Pages,
> > * rest 3GB is used by host system
> >
> > There are running kvm VMs vCPUs pinned to the NUMA domains and using
> > the Huge Pages (topology is exposed to VMs, no-overcommit, no-shared
> > cpus), the qemu-kvm threads are pinned to the same numa domain as the
> > vCPUs. System services are not pinned, I'm not sure why the node3 is
> > used at most as the vms are balanced and the host's system services
> > can move between domains.
> >
> > > > > MGLRU might have spent the extra CPU cycles just to void OOM kills or
> > > > > produce more THPs.
> > > > >
> > > > > If disabling the NUMA domain isn't an option, I'd recommend:
> > > >
> > > > Disabling numa is not an option. However we are now testing a setup
> > > > with -1GB in HugePages per each numa.
> > > >
> > > > > 1. Try the latest kernel (6.6.1) if you haven't.
> > > >
> > > > Not yet, the 6.6.1 was released today.
> > > >
> > > > > 2. Disable THP if it was enabled, to verify whether it has an impact.
> > > >
> > > > I try disabling THP without any effect.
> > >
> > > Gochat. Please try the patch with MGLRU and let me know. Thanks!
> > >
> > > (Also CC Charan @ Qualcomm who initially reported the problem that
> > > ended up with the attached patch.)
> >
> > I can try it. Will let you know.
>
> Great, thanks!


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-11-09 10:58           ` Jaroslav Pulchart
@ 2023-11-10  1:31             ` Yu Zhao
       [not found]               ` <CAK8fFZ5xUe=JMOxUWgQ-0aqWMXuZYF2EtPOoZQqr89sjrL+zTw@mail.gmail.com>
  0 siblings, 1 reply; 30+ messages in thread
From: Yu Zhao @ 2023-11-10  1:31 UTC (permalink / raw)
  To: Jaroslav Pulchart
  Cc: linux-mm, akpm, Igor Raits, Daniel Secik, Charan Teja Kalla

On Thu, Nov 9, 2023 at 3:58 AM Jaroslav Pulchart
<jaroslav.pulchart@gooddata.com> wrote:
>
> >
> > On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
> > <jaroslav.pulchart@gooddata.com> wrote:
> > >
> > > >
> > > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > >
> > > > > >
> > > > > > Hi Jaroslav,
> > > > >
> > > > > Hi Yu Zhao
> > > > >
> > > > > thanks for response, see answers inline:
> > > > >
> > > > > >
> > > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > > > > > system (16numa domains).
> > > > > >
> > > > > > Kernel version please?
> > > > >
> > > > > 6.5.y, but we saw it sooner as it is in investigation from 23th May
> > > > > (6.4.y and maybe even the 6.3.y).
> > > >
> > > > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> > > > for you if you run into other problems with v6.6.
> > > >
> > >
> > > I will give it a try using 6.6.y. When it will work we can switch to
> > > 6.6.y instead of backporting the stuff to 6.5.y.
> > >
> > > > > > > Symptoms of my issue are
> > > > > > >
> > > > > > > /A/ if mult-gen LRU is enabled
> > > > > > > 1/ [kswapd3] is consuming 100% CPU
> > > > > >
> > > > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > > > > >
> > > > > > >     top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34,
> > > > > > > 18.26, 15.01
> > > > > > >     Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
> > > > > > >     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,
> > > > > > > 0.4 si,  0.0 st
> > > > > > >     MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
> > > > > > >     MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
> > > > > > >     ...
> > > > > > >         765 root      20   0       0      0      0 R  98.3   0.0
> > > > > > > 34969:04 kswapd3
> > > > > > >     ...
> > > > > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > > > > > > observed with swap disk as well and cause IO latency issues due to
> > > > > > > some kind of locking)
> > > > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > > > > > >
> > > > > > >
> > > > > > > /B/ if mult-gen LRU is disabled
> > > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > > > >     top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05,
> > > > > > > 17.77, 14.77
> > > > > > >     Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
> > > > > > >     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,
> > > > > > > 0.4 si,  0.0 st
> > > > > > >     MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
> > > > > > >     MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
> > > > > > >     ...
> > > > > > >        765 root      20   0       0      0      0 S   3.6   0.0
> > > > > > > 34966:46 [kswapd3]
> > > > > > >     ...
> > > > > > > 2/ swap space usage is low (4MB)
> > > > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > > > > > >
> > > > > > > Both situations are wrong as they are using swap in/out extensively,
> > > > > > > however the multi-gen LRU situation is 10times worse.
> > > > > >
> > > > > > From the stats below, node 3 had the lowest free memory. So I think in
> > > > > > both cases, the reclaim activities were as expected.
> > > > >
> > > > > I do not see a reason for the memory pressure and reclaims. This node
> > > > > has the lowest free memory of all nodes (~302MB free) that is true,
> > > > > however the swap space usage is just 4MB (still going in and out). So
> > > > > what can be the reason for that behaviour?
> > > >
> > > > The best analogy is that refuel (reclaim) happens before the tank
> > > > becomes empty, and it happens even sooner when there is a long road
> > > > ahead (high order allocations).
> > > >
> > > > > The workers/application is running in pre-allocated HugePages and the
> > > > > rest is used for a small set of system services and drivers of
> > > > > devices. It is static and not growing. The issue persists when I stop
> > > > > the system services and free the memory.
> > > >
> > > > Yes, this helps.
> > > >  Also could you attach /proc/buddyinfo from the moment
> > > > you hit the problem?
> > > >
> > >
> > > I can. The problem is continuous, it is 100% of time continuously
> > > doing in/out and consuming 100% of CPU and locking IO.
> > >
> > > The output of /proc/buddyinfo is:
> > >
> > > # cat /proc/buddyinfo
> > > Node 0, zone      DMA      7      2      2      1      1      2      1
> > >      1      1      2      1
> > > Node 0, zone    DMA32   4567   3395   1357    846    439    190     93
> > >     61     43     23      4
> > > Node 0, zone   Normal     19    190    140    129    136     75     66
> > >     41      9      1      5
> > > Node 1, zone   Normal    194   1210   2080   1800    715    255    111
> > >     56     42     36     55
> > > Node 2, zone   Normal    204    768   3766   3394   1742    468    185
> > >    194    238     47     74
> > > Node 3, zone   Normal   1622   2137   1058    846    388    208     97
> > >     44     14     42     10
> >
> > Again, thinking out loud: there is only one zone on node 3, i.e., the
> > normal zone, and this excludes the problem commit
> > 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
> > reclaim") fixed in v6.6.
>
> I built vanila 6.6.1 and did the first fast test - spin up and destroy
> VMs only - This test does not always trigger the kswapd3 continuous
> swap in/out  usage but it uses it and it  looks like there is a
> change:
>
>  I can see kswapd non-continous (15s and more) usage with 6.5.y
>  # ps ax | grep [k]swapd
>     753 ?        S      0:00 [kswapd0]
>     754 ?        S      0:00 [kswapd1]
>     755 ?        S      0:00 [kswapd2]
>     756 ?        S      0:15 [kswapd3]    <<<<<<<<<
>     757 ?        S      0:00 [kswapd4]
>     758 ?        S      0:00 [kswapd5]
>     759 ?        S      0:00 [kswapd6]
>     760 ?        S      0:00 [kswapd7]
>     761 ?        S      0:00 [kswapd8]
>     762 ?        S      0:00 [kswapd9]
>     763 ?        S      0:00 [kswapd10]
>     764 ?        S      0:00 [kswapd11]
>     765 ?        S      0:00 [kswapd12]
>     766 ?        S      0:00 [kswapd13]
>     767 ?        S      0:00 [kswapd14]
>     768 ?        S      0:00 [kswapd15]
>
> and none kswapd usage with 6.6.1, that looks to be promising path
>
> # ps ax | grep [k]swapd
>     808 ?        S      0:00 [kswapd0]
>     809 ?        S      0:00 [kswapd1]
>     810 ?        S      0:00 [kswapd2]
>     811 ?        S      0:00 [kswapd3]    <<<< nice
>     812 ?        S      0:00 [kswapd4]
>     813 ?        S      0:00 [kswapd5]
>     814 ?        S      0:00 [kswapd6]
>     815 ?        S      0:00 [kswapd7]
>     816 ?        S      0:00 [kswapd8]
>     817 ?        S      0:00 [kswapd9]
>     818 ?        S      0:00 [kswapd10]
>     819 ?        S      0:00 [kswapd11]
>     820 ?        S      0:00 [kswapd12]
>     821 ?        S      0:00 [kswapd13]
>     822 ?        S      0:00 [kswapd14]
>     823 ?        S      0:00 [kswapd15]
>
> I will install the 6.6.1 on the server which is doing some work and
> observe it later today.

Thanks. Fingers crossed.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
       [not found]               ` <CAK8fFZ5xUe=JMOxUWgQ-0aqWMXuZYF2EtPOoZQqr89sjrL+zTw@mail.gmail.com>
@ 2023-11-13 20:09                 ` Yu Zhao
  2023-11-14  7:29                   ` Jaroslav Pulchart
  0 siblings, 1 reply; 30+ messages in thread
From: Yu Zhao @ 2023-11-13 20:09 UTC (permalink / raw)
  To: Jaroslav Pulchart
  Cc: linux-mm, akpm, Igor Raits, Daniel Secik, Charan Teja Kalla

[-- Attachment #1: Type: text/plain, Size: 9618 bytes --]

On Mon, Nov 13, 2023 at 1:36 AM Jaroslav Pulchart
<jaroslav.pulchart@gooddata.com> wrote:
>
> >
> > On Thu, Nov 9, 2023 at 3:58 AM Jaroslav Pulchart
> > <jaroslav.pulchart@gooddata.com> wrote:
> > >
> > > >
> > > > On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
> > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > >
> > > > > >
> > > > > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > >
> > > > > > > >
> > > > > > > > Hi Jaroslav,
> > > > > > >
> > > > > > > Hi Yu Zhao
> > > > > > >
> > > > > > > thanks for response, see answers inline:
> > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > >
> > > > > > > > > Hello,
> > > > > > > > >
> > > > > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > > > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > > > > > > > system (16numa domains).
> > > > > > > >
> > > > > > > > Kernel version please?
> > > > > > >
> > > > > > > 6.5.y, but we saw it sooner as it is in investigation from 23th May
> > > > > > > (6.4.y and maybe even the 6.3.y).
> > > > > >
> > > > > > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> > > > > > for you if you run into other problems with v6.6.
> > > > > >
> > > > >
> > > > > I will give it a try using 6.6.y. When it will work we can switch to
> > > > > 6.6.y instead of backporting the stuff to 6.5.y.
> > > > >
> > > > > > > > > Symptoms of my issue are
> > > > > > > > >
> > > > > > > > > /A/ if mult-gen LRU is enabled
> > > > > > > > > 1/ [kswapd3] is consuming 100% CPU
> > > > > > > >
> > > > > > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > > > > > > >
> > > > > > > > >     top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34,
> > > > > > > > > 18.26, 15.01
> > > > > > > > >     Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
> > > > > > > > >     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,
> > > > > > > > > 0.4 si,  0.0 st
> > > > > > > > >     MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
> > > > > > > > >     MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
> > > > > > > > >     ...
> > > > > > > > >         765 root      20   0       0      0      0 R  98.3   0.0
> > > > > > > > > 34969:04 kswapd3
> > > > > > > > >     ...
> > > > > > > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > > > > > > > > observed with swap disk as well and cause IO latency issues due to
> > > > > > > > > some kind of locking)
> > > > > > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > /B/ if mult-gen LRU is disabled
> > > > > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > > > > > >     top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05,
> > > > > > > > > 17.77, 14.77
> > > > > > > > >     Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
> > > > > > > > >     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,
> > > > > > > > > 0.4 si,  0.0 st
> > > > > > > > >     MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
> > > > > > > > >     MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
> > > > > > > > >     ...
> > > > > > > > >        765 root      20   0       0      0      0 S   3.6   0.0
> > > > > > > > > 34966:46 [kswapd3]
> > > > > > > > >     ...
> > > > > > > > > 2/ swap space usage is low (4MB)
> > > > > > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > > > > > > > >
> > > > > > > > > Both situations are wrong as they are using swap in/out extensively,
> > > > > > > > > however the multi-gen LRU situation is 10times worse.
> > > > > > > >
> > > > > > > > From the stats below, node 3 had the lowest free memory. So I think in
> > > > > > > > both cases, the reclaim activities were as expected.
> > > > > > >
> > > > > > > I do not see a reason for the memory pressure and reclaims. This node
> > > > > > > has the lowest free memory of all nodes (~302MB free) that is true,
> > > > > > > however the swap space usage is just 4MB (still going in and out). So
> > > > > > > what can be the reason for that behaviour?
> > > > > >
> > > > > > The best analogy is that refuel (reclaim) happens before the tank
> > > > > > becomes empty, and it happens even sooner when there is a long road
> > > > > > ahead (high order allocations).
> > > > > >
> > > > > > > The workers/application is running in pre-allocated HugePages and the
> > > > > > > rest is used for a small set of system services and drivers of
> > > > > > > devices. It is static and not growing. The issue persists when I stop
> > > > > > > the system services and free the memory.
> > > > > >
> > > > > > Yes, this helps.
> > > > > >  Also could you attach /proc/buddyinfo from the moment
> > > > > > you hit the problem?
> > > > > >
> > > > >
> > > > > I can. The problem is continuous, it is 100% of time continuously
> > > > > doing in/out and consuming 100% of CPU and locking IO.
> > > > >
> > > > > The output of /proc/buddyinfo is:
> > > > >
> > > > > # cat /proc/buddyinfo
> > > > > Node 0, zone      DMA      7      2      2      1      1      2      1
> > > > >      1      1      2      1
> > > > > Node 0, zone    DMA32   4567   3395   1357    846    439    190     93
> > > > >     61     43     23      4
> > > > > Node 0, zone   Normal     19    190    140    129    136     75     66
> > > > >     41      9      1      5
> > > > > Node 1, zone   Normal    194   1210   2080   1800    715    255    111
> > > > >     56     42     36     55
> > > > > Node 2, zone   Normal    204    768   3766   3394   1742    468    185
> > > > >    194    238     47     74
> > > > > Node 3, zone   Normal   1622   2137   1058    846    388    208     97
> > > > >     44     14     42     10
> > > >
> > > > Again, thinking out loud: there is only one zone on node 3, i.e., the
> > > > normal zone, and this excludes the problem commit
> > > > 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
> > > > reclaim") fixed in v6.6.
> > >
> > > I built vanila 6.6.1 and did the first fast test - spin up and destroy
> > > VMs only - This test does not always trigger the kswapd3 continuous
> > > swap in/out  usage but it uses it and it  looks like there is a
> > > change:
> > >
> > >  I can see kswapd non-continous (15s and more) usage with 6.5.y
> > >  # ps ax | grep [k]swapd
> > >     753 ?        S      0:00 [kswapd0]
> > >     754 ?        S      0:00 [kswapd1]
> > >     755 ?        S      0:00 [kswapd2]
> > >     756 ?        S      0:15 [kswapd3]    <<<<<<<<<
> > >     757 ?        S      0:00 [kswapd4]
> > >     758 ?        S      0:00 [kswapd5]
> > >     759 ?        S      0:00 [kswapd6]
> > >     760 ?        S      0:00 [kswapd7]
> > >     761 ?        S      0:00 [kswapd8]
> > >     762 ?        S      0:00 [kswapd9]
> > >     763 ?        S      0:00 [kswapd10]
> > >     764 ?        S      0:00 [kswapd11]
> > >     765 ?        S      0:00 [kswapd12]
> > >     766 ?        S      0:00 [kswapd13]
> > >     767 ?        S      0:00 [kswapd14]
> > >     768 ?        S      0:00 [kswapd15]
> > >
> > > and none kswapd usage with 6.6.1, that looks to be promising path
> > >
> > > # ps ax | grep [k]swapd
> > >     808 ?        S      0:00 [kswapd0]
> > >     809 ?        S      0:00 [kswapd1]
> > >     810 ?        S      0:00 [kswapd2]
> > >     811 ?        S      0:00 [kswapd3]    <<<< nice
> > >     812 ?        S      0:00 [kswapd4]
> > >     813 ?        S      0:00 [kswapd5]
> > >     814 ?        S      0:00 [kswapd6]
> > >     815 ?        S      0:00 [kswapd7]
> > >     816 ?        S      0:00 [kswapd8]
> > >     817 ?        S      0:00 [kswapd9]
> > >     818 ?        S      0:00 [kswapd10]
> > >     819 ?        S      0:00 [kswapd11]
> > >     820 ?        S      0:00 [kswapd12]
> > >     821 ?        S      0:00 [kswapd13]
> > >     822 ?        S      0:00 [kswapd14]
> > >     823 ?        S      0:00 [kswapd15]
> > >
> > > I will install the 6.6.1 on the server which is doing some work and
> > > observe it later today.
> >
> > Thanks. Fingers crossed.
>
> The 6.6.y was deployed and used from 9th Nov 3PM CEST. So far so good.
> The node 3 has 163MiB free of memory and I see
> just a few in/out swap usage sometimes (which is expected) and minimal
> kswapd3 process usage for almost 4days.

Thanks for the update!

Just to confirm:
1. MGLRU was enabled, and
2. The v6.6 you deployed did NOT have the patch I attached earlier.
Are both correct?

If so, I'd very much appreciate it if you could try the attached patch
on top of v6.5 and see if it helps. My suspicion is that the problem is
compaction-related, i.e., kswapd was woken up by high-order allocations
but didn't properly stop. But what causes the behavior difference on
v6.5 between MGLRU and the active/inactive LRU still puzzles me -- the
problem might be somehow masked rather than fixed on v6.6.
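
In case it helps to narrow this down later: the wakeups (and the
allocation order behind them) can be seen with the vmscan tracepoints,
roughly like this, assuming tracefs is mounted at /sys/kernel/tracing:

  echo 1 > /sys/kernel/tracing/events/vmscan/mm_vmscan_wakeup_kswapd/enable
  echo 1 > /sys/kernel/tracing/events/vmscan/mm_vmscan_kswapd_wake/enable
  cat /sys/kernel/tracing/trace_pipe | grep 'nid=3'

mm_vmscan_wakeup_kswapd fires when an allocation asks kswapd to wake up
(it logs the gfp flags and order); mm_vmscan_kswapd_wake fires when
kswapd itself wakes and logs the order it reclaims for. No need to do
this now - just a possible follow-up if the patch doesn't help.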

For any other problems that you suspect might be related to MGLRU,
please let me know and I'd be happy to look into them as well.

[-- Attachment #2: mglru-v6.5.patch --]
[-- Type: application/x-patch, Size: 2989 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-11-13 20:09                 ` Yu Zhao
@ 2023-11-14  7:29                   ` Jaroslav Pulchart
  2023-11-14  7:47                     ` Yu Zhao
  0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2023-11-14  7:29 UTC (permalink / raw)
  To: Yu Zhao; +Cc: linux-mm, akpm, Igor Raits, Daniel Secik, Charan Teja Kalla

>
> On Mon, Nov 13, 2023 at 1:36 AM Jaroslav Pulchart
> <jaroslav.pulchart@gooddata.com> wrote:
> >
> > >
> > > On Thu, Nov 9, 2023 at 3:58 AM Jaroslav Pulchart
> > > <jaroslav.pulchart@gooddata.com> wrote:
> > > >
> > > > >
> > > > > On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
> > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > >
> > > > > > >
> > > > > > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Hi Jaroslav,
> > > > > > > >
> > > > > > > > Hi Yu Zhao
> > > > > > > >
> > > > > > > > thanks for response, see answers inline:
> > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Hello,
> > > > > > > > > >
> > > > > > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > > > > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > > > > > > > > system (16numa domains).
> > > > > > > > >
> > > > > > > > > Kernel version please?
> > > > > > > >
> > > > > > > > 6.5.y, but we saw it sooner as it is in investigation from 23th May
> > > > > > > > (6.4.y and maybe even the 6.3.y).
> > > > > > >
> > > > > > > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> > > > > > > for you if you run into other problems with v6.6.
> > > > > > >
> > > > > >
> > > > > > I will give it a try using 6.6.y. When it will work we can switch to
> > > > > > 6.6.y instead of backporting the stuff to 6.5.y.
> > > > > >
> > > > > > > > > > Symptoms of my issue are
> > > > > > > > > >
> > > > > > > > > > /A/ if mult-gen LRU is enabled
> > > > > > > > > > 1/ [kswapd3] is consuming 100% CPU
> > > > > > > > >
> > > > > > > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > > > > > > > >
> > > > > > > > > >     top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34,
> > > > > > > > > > 18.26, 15.01
> > > > > > > > > >     Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
> > > > > > > > > >     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,
> > > > > > > > > > 0.4 si,  0.0 st
> > > > > > > > > >     MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
> > > > > > > > > >     MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
> > > > > > > > > >     ...
> > > > > > > > > >         765 root      20   0       0      0      0 R  98.3   0.0
> > > > > > > > > > 34969:04 kswapd3
> > > > > > > > > >     ...
> > > > > > > > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > > > > > > > > > observed with swap disk as well and cause IO latency issues due to
> > > > > > > > > > some kind of locking)
> > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > /B/ if mult-gen LRU is disabled
> > > > > > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > > > > > > >     top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05,
> > > > > > > > > > 17.77, 14.77
> > > > > > > > > >     Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
> > > > > > > > > >     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,
> > > > > > > > > > 0.4 si,  0.0 st
> > > > > > > > > >     MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
> > > > > > > > > >     MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
> > > > > > > > > >     ...
> > > > > > > > > >        765 root      20   0       0      0      0 S   3.6   0.0
> > > > > > > > > > 34966:46 [kswapd3]
> > > > > > > > > >     ...
> > > > > > > > > > 2/ swap space usage is low (4MB)
> > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > > > > > > > > >
> > > > > > > > > > Both situations are wrong as they are using swap in/out extensively,
> > > > > > > > > > however the multi-gen LRU situation is 10times worse.
> > > > > > > > >
> > > > > > > > > From the stats below, node 3 had the lowest free memory. So I think in
> > > > > > > > > both cases, the reclaim activities were as expected.
> > > > > > > >
> > > > > > > > I do not see a reason for the memory pressure and reclaims. This node
> > > > > > > > has the lowest free memory of all nodes (~302MB free) that is true,
> > > > > > > > however the swap space usage is just 4MB (still going in and out). So
> > > > > > > > what can be the reason for that behaviour?
> > > > > > >
> > > > > > > The best analogy is that refuel (reclaim) happens before the tank
> > > > > > > becomes empty, and it happens even sooner when there is a long road
> > > > > > > ahead (high order allocations).
> > > > > > >
> > > > > > > > The workers/application is running in pre-allocated HugePages and the
> > > > > > > > rest is used for a small set of system services and drivers of
> > > > > > > > devices. It is static and not growing. The issue persists when I stop
> > > > > > > > the system services and free the memory.
> > > > > > >
> > > > > > > Yes, this helps.
> > > > > > >  Also could you attach /proc/buddyinfo from the moment
> > > > > > > you hit the problem?
> > > > > > >
> > > > > >
> > > > > > I can. The problem is continuous, it is 100% of time continuously
> > > > > > doing in/out and consuming 100% of CPU and locking IO.
> > > > > >
> > > > > > The output of /proc/buddyinfo is:
> > > > > >
> > > > > > # cat /proc/buddyinfo
> > > > > > Node 0, zone      DMA      7      2      2      1      1      2      1
> > > > > >      1      1      2      1
> > > > > > Node 0, zone    DMA32   4567   3395   1357    846    439    190     93
> > > > > >     61     43     23      4
> > > > > > Node 0, zone   Normal     19    190    140    129    136     75     66
> > > > > >     41      9      1      5
> > > > > > Node 1, zone   Normal    194   1210   2080   1800    715    255    111
> > > > > >     56     42     36     55
> > > > > > Node 2, zone   Normal    204    768   3766   3394   1742    468    185
> > > > > >    194    238     47     74
> > > > > > Node 3, zone   Normal   1622   2137   1058    846    388    208     97
> > > > > >     44     14     42     10
> > > > >
> > > > > Again, thinking out loud: there is only one zone on node 3, i.e., the
> > > > > normal zone, and this excludes the problem commit
> > > > > 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
> > > > > reclaim") fixed in v6.6.
> > > >
> > > > I built vanila 6.6.1 and did the first fast test - spin up and destroy
> > > > VMs only - This test does not always trigger the kswapd3 continuous
> > > > swap in/out  usage but it uses it and it  looks like there is a
> > > > change:
> > > >
> > > >  I can see kswapd non-continous (15s and more) usage with 6.5.y
> > > >  # ps ax | grep [k]swapd
> > > >     753 ?        S      0:00 [kswapd0]
> > > >     754 ?        S      0:00 [kswapd1]
> > > >     755 ?        S      0:00 [kswapd2]
> > > >     756 ?        S      0:15 [kswapd3]    <<<<<<<<<
> > > >     757 ?        S      0:00 [kswapd4]
> > > >     758 ?        S      0:00 [kswapd5]
> > > >     759 ?        S      0:00 [kswapd6]
> > > >     760 ?        S      0:00 [kswapd7]
> > > >     761 ?        S      0:00 [kswapd8]
> > > >     762 ?        S      0:00 [kswapd9]
> > > >     763 ?        S      0:00 [kswapd10]
> > > >     764 ?        S      0:00 [kswapd11]
> > > >     765 ?        S      0:00 [kswapd12]
> > > >     766 ?        S      0:00 [kswapd13]
> > > >     767 ?        S      0:00 [kswapd14]
> > > >     768 ?        S      0:00 [kswapd15]
> > > >
> > > > and none kswapd usage with 6.6.1, that looks to be promising path
> > > >
> > > > # ps ax | grep [k]swapd
> > > >     808 ?        S      0:00 [kswapd0]
> > > >     809 ?        S      0:00 [kswapd1]
> > > >     810 ?        S      0:00 [kswapd2]
> > > >     811 ?        S      0:00 [kswapd3]    <<<< nice
> > > >     812 ?        S      0:00 [kswapd4]
> > > >     813 ?        S      0:00 [kswapd5]
> > > >     814 ?        S      0:00 [kswapd6]
> > > >     815 ?        S      0:00 [kswapd7]
> > > >     816 ?        S      0:00 [kswapd8]
> > > >     817 ?        S      0:00 [kswapd9]
> > > >     818 ?        S      0:00 [kswapd10]
> > > >     819 ?        S      0:00 [kswapd11]
> > > >     820 ?        S      0:00 [kswapd12]
> > > >     821 ?        S      0:00 [kswapd13]
> > > >     822 ?        S      0:00 [kswapd14]
> > > >     823 ?        S      0:00 [kswapd15]
> > > >
> > > > I will install the 6.6.1 on the server which is doing some work and
> > > > observe it later today.
> > >
> > > Thanks. Fingers crossed.
> >
> > The 6.6.y was deployed and used from 9th Nov 3PM CEST. So far so good.
> > The node 3 has 163MiB free of memory and I see
> > just a few in/out swap usage sometimes (which is expected) and minimal
> > kswapd3 process usage for almost 4days.
>
> Thanks for the update!
>
> Just to confirm:
> 1. MGLRU was enabled, and

Yes, MGLRU is enabled

> 2. The v6.6 deployed did NOT have the patch I attached earlier.

Vanilla 6.6, attached patch NOT applied.
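
For completeness, "enabled" here means the lru_gen sysfs knob:

  cat /sys/kernel/mm/lru_gen/enabled        # 0x0007 = fully enabled, 0x0000 = disabled
  echo n >/sys/kernel/mm/lru_gen/enabled    # disable MGLRU
  echo y >/sys/kernel/mm/lru_gen/enabled    # re-enable MGLRU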

> Are both correct?
>
> If so, I'd very appreciate it if you could try the attached patch on
> top of v6.5 and see if it helps. My suspicion is that the problem is
> compaction related, i.e., kswapd was woken up by high order
> allocations but didn't properly stop. But what causes the behavior

Sure, I can try it. I will keep you informed about the progress.

> difference on v6.5 between MGLRU and the active/inactive LRU still
> puzzles me --the problem might be somehow masked rather than fixed on
> v6.6.

I'm not sure how I can help with the issue. Any suggestions on what to
change/try?

>
> For any other problems that you suspect might be related to MGLRU,
> please let me know and I'd be happy to look into them as well.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-11-14  7:29                   ` Jaroslav Pulchart
@ 2023-11-14  7:47                     ` Yu Zhao
  2023-11-20  8:41                       ` Jaroslav Pulchart
  0 siblings, 1 reply; 30+ messages in thread
From: Yu Zhao @ 2023-11-14  7:47 UTC (permalink / raw)
  To: Jaroslav Pulchart
  Cc: linux-mm, akpm, Igor Raits, Daniel Secik, Charan Teja Kalla

On Tue, Nov 14, 2023 at 12:30 AM Jaroslav Pulchart
<jaroslav.pulchart@gooddata.com> wrote:
>
> >
> > On Mon, Nov 13, 2023 at 1:36 AM Jaroslav Pulchart
> > <jaroslav.pulchart@gooddata.com> wrote:
> > >
> > > >
> > > > On Thu, Nov 9, 2023 at 3:58 AM Jaroslav Pulchart
> > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > >
> > > > > >
> > > > > > On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
> > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Hi Jaroslav,
> > > > > > > > >
> > > > > > > > > Hi Yu Zhao
> > > > > > > > >
> > > > > > > > > thanks for response, see answers inline:
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hello,
> > > > > > > > > > >
> > > > > > > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > > > > > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > > > > > > > > > system (16numa domains).
> > > > > > > > > >
> > > > > > > > > > Kernel version please?
> > > > > > > > >
> > > > > > > > > 6.5.y, but we saw it sooner as it is in investigation from 23th May
> > > > > > > > > (6.4.y and maybe even the 6.3.y).
> > > > > > > >
> > > > > > > > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> > > > > > > > for you if you run into other problems with v6.6.
> > > > > > > >
> > > > > > >
> > > > > > > I will give it a try using 6.6.y. When it will work we can switch to
> > > > > > > 6.6.y instead of backporting the stuff to 6.5.y.
> > > > > > >
> > > > > > > > > > > Symptoms of my issue are
> > > > > > > > > > >
> > > > > > > > > > > /A/ if mult-gen LRU is enabled
> > > > > > > > > > > 1/ [kswapd3] is consuming 100% CPU
> > > > > > > > > >
> > > > > > > > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > > > > > > > > >
> > > > > > > > > > >     top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34,
> > > > > > > > > > > 18.26, 15.01
> > > > > > > > > > >     Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
> > > > > > > > > > >     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,
> > > > > > > > > > > 0.4 si,  0.0 st
> > > > > > > > > > >     MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
> > > > > > > > > > >     MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
> > > > > > > > > > >     ...
> > > > > > > > > > >         765 root      20   0       0      0      0 R  98.3   0.0
> > > > > > > > > > > 34969:04 kswapd3
> > > > > > > > > > >     ...
> > > > > > > > > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > > > > > > > > > > observed with swap disk as well and cause IO latency issues due to
> > > > > > > > > > > some kind of locking)
> > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > /B/ if mult-gen LRU is disabled
> > > > > > > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > > > > > > > >     top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05,
> > > > > > > > > > > 17.77, 14.77
> > > > > > > > > > >     Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
> > > > > > > > > > >     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,
> > > > > > > > > > > 0.4 si,  0.0 st
> > > > > > > > > > >     MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
> > > > > > > > > > >     MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
> > > > > > > > > > >     ...
> > > > > > > > > > >        765 root      20   0       0      0      0 S   3.6   0.0
> > > > > > > > > > > 34966:46 [kswapd3]
> > > > > > > > > > >     ...
> > > > > > > > > > > 2/ swap space usage is low (4MB)
> > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > > > > > > > > > >
> > > > > > > > > > > Both situations are wrong as they are using swap in/out extensively,
> > > > > > > > > > > however the multi-gen LRU situation is 10times worse.
> > > > > > > > > >
> > > > > > > > > > From the stats below, node 3 had the lowest free memory. So I think in
> > > > > > > > > > both cases, the reclaim activities were as expected.
> > > > > > > > >
> > > > > > > > > I do not see a reason for the memory pressure and reclaims. This node
> > > > > > > > > has the lowest free memory of all nodes (~302MB free) that is true,
> > > > > > > > > however the swap space usage is just 4MB (still going in and out). So
> > > > > > > > > what can be the reason for that behaviour?
> > > > > > > >
> > > > > > > > The best analogy is that refuel (reclaim) happens before the tank
> > > > > > > > becomes empty, and it happens even sooner when there is a long road
> > > > > > > > ahead (high order allocations).
> > > > > > > >
> > > > > > > > > The workers/application is running in pre-allocated HugePages and the
> > > > > > > > > rest is used for a small set of system services and drivers of
> > > > > > > > > devices. It is static and not growing. The issue persists when I stop
> > > > > > > > > the system services and free the memory.
> > > > > > > >
> > > > > > > > Yes, this helps.
> > > > > > > >  Also could you attach /proc/buddyinfo from the moment
> > > > > > > > you hit the problem?
> > > > > > > >
> > > > > > >
> > > > > > > I can. The problem is continuous, it is 100% of time continuously
> > > > > > > doing in/out and consuming 100% of CPU and locking IO.
> > > > > > >
> > > > > > > The output of /proc/buddyinfo is:
> > > > > > >
> > > > > > > # cat /proc/buddyinfo
> > > > > > > Node 0, zone      DMA      7      2      2      1      1      2      1
> > > > > > >      1      1      2      1
> > > > > > > Node 0, zone    DMA32   4567   3395   1357    846    439    190     93
> > > > > > >     61     43     23      4
> > > > > > > Node 0, zone   Normal     19    190    140    129    136     75     66
> > > > > > >     41      9      1      5
> > > > > > > Node 1, zone   Normal    194   1210   2080   1800    715    255    111
> > > > > > >     56     42     36     55
> > > > > > > Node 2, zone   Normal    204    768   3766   3394   1742    468    185
> > > > > > >    194    238     47     74
> > > > > > > Node 3, zone   Normal   1622   2137   1058    846    388    208     97
> > > > > > >     44     14     42     10
> > > > > >
> > > > > > Again, thinking out loud: there is only one zone on node 3, i.e., the
> > > > > > normal zone, and this excludes the problem commit
> > > > > > 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
> > > > > > reclaim") fixed in v6.6.
> > > > >
> > > > > I built vanila 6.6.1 and did the first fast test - spin up and destroy
> > > > > VMs only - This test does not always trigger the kswapd3 continuous
> > > > > swap in/out  usage but it uses it and it  looks like there is a
> > > > > change:
> > > > >
> > > > >  I can see kswapd non-continous (15s and more) usage with 6.5.y
> > > > >  # ps ax | grep [k]swapd
> > > > >     753 ?        S      0:00 [kswapd0]
> > > > >     754 ?        S      0:00 [kswapd1]
> > > > >     755 ?        S      0:00 [kswapd2]
> > > > >     756 ?        S      0:15 [kswapd3]    <<<<<<<<<
> > > > >     757 ?        S      0:00 [kswapd4]
> > > > >     758 ?        S      0:00 [kswapd5]
> > > > >     759 ?        S      0:00 [kswapd6]
> > > > >     760 ?        S      0:00 [kswapd7]
> > > > >     761 ?        S      0:00 [kswapd8]
> > > > >     762 ?        S      0:00 [kswapd9]
> > > > >     763 ?        S      0:00 [kswapd10]
> > > > >     764 ?        S      0:00 [kswapd11]
> > > > >     765 ?        S      0:00 [kswapd12]
> > > > >     766 ?        S      0:00 [kswapd13]
> > > > >     767 ?        S      0:00 [kswapd14]
> > > > >     768 ?        S      0:00 [kswapd15]
> > > > >
> > > > > and none kswapd usage with 6.6.1, that looks to be promising path
> > > > >
> > > > > # ps ax | grep [k]swapd
> > > > >     808 ?        S      0:00 [kswapd0]
> > > > >     809 ?        S      0:00 [kswapd1]
> > > > >     810 ?        S      0:00 [kswapd2]
> > > > >     811 ?        S      0:00 [kswapd3]    <<<< nice
> > > > >     812 ?        S      0:00 [kswapd4]
> > > > >     813 ?        S      0:00 [kswapd5]
> > > > >     814 ?        S      0:00 [kswapd6]
> > > > >     815 ?        S      0:00 [kswapd7]
> > > > >     816 ?        S      0:00 [kswapd8]
> > > > >     817 ?        S      0:00 [kswapd9]
> > > > >     818 ?        S      0:00 [kswapd10]
> > > > >     819 ?        S      0:00 [kswapd11]
> > > > >     820 ?        S      0:00 [kswapd12]
> > > > >     821 ?        S      0:00 [kswapd13]
> > > > >     822 ?        S      0:00 [kswapd14]
> > > > >     823 ?        S      0:00 [kswapd15]
> > > > >
> > > > > I will install the 6.6.1 on the server which is doing some work and
> > > > > observe it later today.
> > > >
> > > > Thanks. Fingers crossed.
> > >
> > > The 6.6.y was deployed and used from 9th Nov 3PM CEST. So far so good.
> > > The node 3 has 163MiB free of memory and I see
> > > just a few in/out swap usage sometimes (which is expected) and minimal
> > > kswapd3 process usage for almost 4days.
> >
> > Thanks for the update!
> >
> > Just to confirm:
> > 1. MGLRU was enabled, and
>
> Yes, MGLRU is enabled
>
> > 2. The v6.6 deployed did NOT have the patch I attached earlier.
>
> Vanila 6.6, attached patch NOT applied.
>
> > Are both correct?
> >
> > If so, I'd very appreciate it if you could try the attached patch on
> > top of v6.5 and see if it helps. My suspicion is that the problem is
> > compaction related, i.e., kswapd was woken up by high order
> > allocations but didn't properly stop. But what causes the behavior
>
> Sure, I can try it. Will inform you about progress.

Thanks!

> > difference on v6.5 between MGLRU and the active/inactive LRU still
> > puzzles me --the problem might be somehow masked rather than fixed on
> > v6.6.
>
> I'm not sure how I can help with the issue. Any suggestions on what to
> change/try?

Trying the attached patch is good enough for now :)


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-11-14  7:47                     ` Yu Zhao
@ 2023-11-20  8:41                       ` Jaroslav Pulchart
  2023-11-22  6:13                         ` Yu Zhao
  0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2023-11-20  8:41 UTC (permalink / raw)
  To: Yu Zhao; +Cc: linux-mm, akpm, Igor Raits, Daniel Secik, Charan Teja Kalla

[-- Attachment #1: Type: text/plain, Size: 12374 bytes --]

> On Tue, Nov 14, 2023 at 12:30 AM Jaroslav Pulchart
> <jaroslav.pulchart@gooddata.com> wrote:
> >
> > >
> > > On Mon, Nov 13, 2023 at 1:36 AM Jaroslav Pulchart
> > > <jaroslav.pulchart@gooddata.com> wrote:
> > > >
> > > > >
> > > > > On Thu, Nov 9, 2023 at 3:58 AM Jaroslav Pulchart
> > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > >
> > > > > > >
> > > > > > > On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
> > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Hi Jaroslav,
> > > > > > > > > >
> > > > > > > > > > Hi Yu Zhao
> > > > > > > > > >
> > > > > > > > > > thanks for response, see answers inline:
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Hello,
> > > > > > > > > > > >
> > > > > > > > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > > > > > > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > > > > > > > > > > system (16numa domains).
> > > > > > > > > > >
> > > > > > > > > > > Kernel version please?
> > > > > > > > > >
> > > > > > > > > > 6.5.y, but we saw it sooner as it is in investigation from 23th May
> > > > > > > > > > (6.4.y and maybe even the 6.3.y).
> > > > > > > > >
> > > > > > > > > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> > > > > > > > > for you if you run into other problems with v6.6.
> > > > > > > > >
> > > > > > > >
> > > > > > > > I will give it a try using 6.6.y. When it will work we can switch to
> > > > > > > > 6.6.y instead of backporting the stuff to 6.5.y.
> > > > > > > >
> > > > > > > > > > > > Symptoms of my issue are
> > > > > > > > > > > >
> > > > > > > > > > > > /A/ if mult-gen LRU is enabled
> > > > > > > > > > > > 1/ [kswapd3] is consuming 100% CPU
> > > > > > > > > > >
> > > > > > > > > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > > > > > > > > > >
> > > > > > > > > > > >     top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34,
> > > > > > > > > > > > 18.26, 15.01
> > > > > > > > > > > >     Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
> > > > > > > > > > > >     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,
> > > > > > > > > > > > 0.4 si,  0.0 st
> > > > > > > > > > > >     MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
> > > > > > > > > > > >     MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
> > > > > > > > > > > >     ...
> > > > > > > > > > > >         765 root      20   0       0      0      0 R  98.3   0.0
> > > > > > > > > > > > 34969:04 kswapd3
> > > > > > > > > > > >     ...
> > > > > > > > > > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > > > > > > > > > > > observed with swap disk as well and cause IO latency issues due to
> > > > > > > > > > > > some kind of locking)
> > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > /B/ if mult-gen LRU is disabled
> > > > > > > > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > > > > > > > > >     top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05,
> > > > > > > > > > > > 17.77, 14.77
> > > > > > > > > > > >     Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
> > > > > > > > > > > >     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,
> > > > > > > > > > > > 0.4 si,  0.0 st
> > > > > > > > > > > >     MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
> > > > > > > > > > > >     MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
> > > > > > > > > > > >     ...
> > > > > > > > > > > >        765 root      20   0       0      0      0 S   3.6   0.0
> > > > > > > > > > > > 34966:46 [kswapd3]
> > > > > > > > > > > >     ...
> > > > > > > > > > > > 2/ swap space usage is low (4MB)
> > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > > > > > > > > > > >
> > > > > > > > > > > > Both situations are wrong as they are using swap in/out extensively,
> > > > > > > > > > > > however the multi-gen LRU situation is 10times worse.
> > > > > > > > > > >
> > > > > > > > > > > From the stats below, node 3 had the lowest free memory. So I think in
> > > > > > > > > > > both cases, the reclaim activities were as expected.
> > > > > > > > > >
> > > > > > > > > > I do not see a reason for the memory pressure and reclaims. This node
> > > > > > > > > > has the lowest free memory of all nodes (~302MB free) that is true,
> > > > > > > > > > however the swap space usage is just 4MB (still going in and out). So
> > > > > > > > > > what can be the reason for that behaviour?
> > > > > > > > >
> > > > > > > > > The best analogy is that refuel (reclaim) happens before the tank
> > > > > > > > > becomes empty, and it happens even sooner when there is a long road
> > > > > > > > > ahead (high order allocations).
> > > > > > > > >
> > > > > > > > > > The workers/application is running in pre-allocated HugePages and the
> > > > > > > > > > rest is used for a small set of system services and drivers of
> > > > > > > > > > devices. It is static and not growing. The issue persists when I stop
> > > > > > > > > > the system services and free the memory.
> > > > > > > > >
> > > > > > > > > Yes, this helps.
> > > > > > > > >  Also could you attach /proc/buddyinfo from the moment
> > > > > > > > > you hit the problem?
> > > > > > > > >
> > > > > > > >
> > > > > > > > I can. The problem is continuous, it is 100% of time continuously
> > > > > > > > doing in/out and consuming 100% of CPU and locking IO.
> > > > > > > >
> > > > > > > > The output of /proc/buddyinfo is:
> > > > > > > >
> > > > > > > > # cat /proc/buddyinfo
> > > > > > > > Node 0, zone      DMA      7      2      2      1      1      2      1
> > > > > > > >      1      1      2      1
> > > > > > > > Node 0, zone    DMA32   4567   3395   1357    846    439    190     93
> > > > > > > >     61     43     23      4
> > > > > > > > Node 0, zone   Normal     19    190    140    129    136     75     66
> > > > > > > >     41      9      1      5
> > > > > > > > Node 1, zone   Normal    194   1210   2080   1800    715    255    111
> > > > > > > >     56     42     36     55
> > > > > > > > Node 2, zone   Normal    204    768   3766   3394   1742    468    185
> > > > > > > >    194    238     47     74
> > > > > > > > Node 3, zone   Normal   1622   2137   1058    846    388    208     97
> > > > > > > >     44     14     42     10
> > > > > > >
> > > > > > > Again, thinking out loud: there is only one zone on node 3, i.e., the
> > > > > > > normal zone, and this excludes the problem commit
> > > > > > > 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
> > > > > > > reclaim") fixed in v6.6.
> > > > > >
> > > > > > I built vanila 6.6.1 and did the first fast test - spin up and destroy
> > > > > > VMs only - This test does not always trigger the kswapd3 continuous
> > > > > > swap in/out  usage but it uses it and it  looks like there is a
> > > > > > change:
> > > > > >
> > > > > >  I can see kswapd non-continous (15s and more) usage with 6.5.y
> > > > > >  # ps ax | grep [k]swapd
> > > > > >     753 ?        S      0:00 [kswapd0]
> > > > > >     754 ?        S      0:00 [kswapd1]
> > > > > >     755 ?        S      0:00 [kswapd2]
> > > > > >     756 ?        S      0:15 [kswapd3]    <<<<<<<<<
> > > > > >     757 ?        S      0:00 [kswapd4]
> > > > > >     758 ?        S      0:00 [kswapd5]
> > > > > >     759 ?        S      0:00 [kswapd6]
> > > > > >     760 ?        S      0:00 [kswapd7]
> > > > > >     761 ?        S      0:00 [kswapd8]
> > > > > >     762 ?        S      0:00 [kswapd9]
> > > > > >     763 ?        S      0:00 [kswapd10]
> > > > > >     764 ?        S      0:00 [kswapd11]
> > > > > >     765 ?        S      0:00 [kswapd12]
> > > > > >     766 ?        S      0:00 [kswapd13]
> > > > > >     767 ?        S      0:00 [kswapd14]
> > > > > >     768 ?        S      0:00 [kswapd15]
> > > > > >
> > > > > > and none kswapd usage with 6.6.1, that looks to be promising path
> > > > > >
> > > > > > # ps ax | grep [k]swapd
> > > > > >     808 ?        S      0:00 [kswapd0]
> > > > > >     809 ?        S      0:00 [kswapd1]
> > > > > >     810 ?        S      0:00 [kswapd2]
> > > > > >     811 ?        S      0:00 [kswapd3]    <<<< nice
> > > > > >     812 ?        S      0:00 [kswapd4]
> > > > > >     813 ?        S      0:00 [kswapd5]
> > > > > >     814 ?        S      0:00 [kswapd6]
> > > > > >     815 ?        S      0:00 [kswapd7]
> > > > > >     816 ?        S      0:00 [kswapd8]
> > > > > >     817 ?        S      0:00 [kswapd9]
> > > > > >     818 ?        S      0:00 [kswapd10]
> > > > > >     819 ?        S      0:00 [kswapd11]
> > > > > >     820 ?        S      0:00 [kswapd12]
> > > > > >     821 ?        S      0:00 [kswapd13]
> > > > > >     822 ?        S      0:00 [kswapd14]
> > > > > >     823 ?        S      0:00 [kswapd15]
> > > > > >
> > > > > > I will install the 6.6.1 on the server which is doing some work and
> > > > > > observe it later today.
> > > > >
> > > > > Thanks. Fingers crossed.
> > > >
> > > > The 6.6.y was deployed and used from 9th Nov 3PM CEST. So far so good.
> > > > The node 3 has 163MiB free of memory and I see
> > > > just a few in/out swap usage sometimes (which is expected) and minimal
> > > > kswapd3 process usage for almost 4days.
> > >
> > > Thanks for the update!
> > >
> > > Just to confirm:
> > > 1. MGLRU was enabled, and
> >
> > Yes, MGLRU is enabled
> >
> > > 2. The v6.6 deployed did NOT have the patch I attached earlier.
> >
> > Vanila 6.6, attached patch NOT applied.
> >
> > > Are both correct?
> > >
> > > If so, I'd very appreciate it if you could try the attached patch on
> > > top of v6.5 and see if it helps. My suspicion is that the problem is
> > > compaction related, i.e., kswapd was woken up by high order
> > > allocations but didn't properly stop. But what causes the behavior
> >
> > Sure, I can try it. Will inform you about progress.
>
> Thanks!
>
> > > difference on v6.5 between MGLRU and the active/inactive LRU still
> > > puzzles me --the problem might be somehow masked rather than fixed on
> > > v6.6.
> >
> > I'm not sure how I can help with the issue. Any suggestions on what to
> > change/try?
>
> Trying the attached patch is good enough for now :)

So far I have been running "6.5.y + patch" for 4 days without triggering
the infinite swap in/out usage.

I'm observing a similar pattern in kswapd usage - if kswapd is used at
all, it is mostly kswapd3, like on vanilla 6.5.y; this is not observed
with 6.6.y. (Node 3 has 159 MB of free memory.)
# ps ax | grep [k]swapd
    750 ?        S      0:00 [kswapd0]
    751 ?        S      0:00 [kswapd1]
    752 ?        S      0:00 [kswapd2]
    753 ?        S      0:02 [kswapd3]    <<<< it uses kswapd3; the good news is that it is not continuous
    754 ?        S      0:00 [kswapd4]
    755 ?        S      0:00 [kswapd5]
    756 ?        S      0:00 [kswapd6]
    757 ?        S      0:00 [kswapd7]
    758 ?        S      0:00 [kswapd8]
    759 ?        S      0:00 [kswapd9]
    760 ?        S      0:00 [kswapd10]
    761 ?        S      0:00 [kswapd11]
    762 ?        S      0:00 [kswapd12]
    763 ?        S      0:00 [kswapd13]
    764 ?        S      0:00 [kswapd14]
    765 ?        S      0:00 [kswapd15]

The good news is that the system has not ended up in a continuous loop
of swap in/out usage (at least so far). See the attached
swap_in_out_good_vs_bad.png. I will keep it running for the next 3
days.
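
For reference, the same swap in/out rate can also be watched ad hoc
with e.g.:

  vmstat 60     # si/so columns: KiB swapped in/out per second, averaged over 60 s
  sar -W 60     # pswpin/s and pswpout/s (needs sysstat)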

[-- Attachment #2: swap_in_out_good_vs_bad.png --]
[-- Type: image/png, Size: 81234 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-11-20  8:41                       ` Jaroslav Pulchart
@ 2023-11-22  6:13                         ` Yu Zhao
  2023-11-22  7:12                           ` Jaroslav Pulchart
  0 siblings, 1 reply; 30+ messages in thread
From: Yu Zhao @ 2023-11-22  6:13 UTC (permalink / raw)
  To: Jaroslav Pulchart
  Cc: linux-mm, akpm, Igor Raits, Daniel Secik, Charan Teja Kalla,
	Kalesh Singh

On Mon, Nov 20, 2023 at 1:42 AM Jaroslav Pulchart
<jaroslav.pulchart@gooddata.com> wrote:
>
> > On Tue, Nov 14, 2023 at 12:30 AM Jaroslav Pulchart
> > <jaroslav.pulchart@gooddata.com> wrote:
> > >
> > > >
> > > > On Mon, Nov 13, 2023 at 1:36 AM Jaroslav Pulchart
> > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > >
> > > > > >
> > > > > > On Thu, Nov 9, 2023 at 3:58 AM Jaroslav Pulchart
> > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
> > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Jaroslav,
> > > > > > > > > > >
> > > > > > > > > > > Hi Yu Zhao
> > > > > > > > > > >
> > > > > > > > > > > thanks for response, see answers inline:
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hello,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > > > > > > > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > > > > > > > > > > > system (16numa domains).
> > > > > > > > > > > >
> > > > > > > > > > > > Kernel version please?
> > > > > > > > > > >
> > > > > > > > > > > 6.5.y, but we saw it sooner as it is in investigation from 23th May
> > > > > > > > > > > (6.4.y and maybe even the 6.3.y).
> > > > > > > > > >
> > > > > > > > > > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> > > > > > > > > > for you if you run into other problems with v6.6.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I will give it a try using 6.6.y. When it will work we can switch to
> > > > > > > > > 6.6.y instead of backporting the stuff to 6.5.y.
> > > > > > > > >
> > > > > > > > > > > > > Symptoms of my issue are
> > > > > > > > > > > > >
> > > > > > > > > > > > > /A/ if mult-gen LRU is enabled
> > > > > > > > > > > > > 1/ [kswapd3] is consuming 100% CPU
> > > > > > > > > > > >
> > > > > > > > > > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > > > > > > > > > > >
> > > > > > > > > > > > >     top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34,
> > > > > > > > > > > > > 18.26, 15.01
> > > > > > > > > > > > >     Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
> > > > > > > > > > > > >     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,
> > > > > > > > > > > > > 0.4 si,  0.0 st
> > > > > > > > > > > > >     MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
> > > > > > > > > > > > >     MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
> > > > > > > > > > > > >     ...
> > > > > > > > > > > > >         765 root      20   0       0      0      0 R  98.3   0.0
> > > > > > > > > > > > > 34969:04 kswapd3
> > > > > > > > > > > > >     ...
> > > > > > > > > > > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > > > > > > > > > > > > observed with swap disk as well and cause IO latency issues due to
> > > > > > > > > > > > > some kind of locking)
> > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > /B/ if mult-gen LRU is disabled
> > > > > > > > > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > > > > > > > > > >     top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05,
> > > > > > > > > > > > > 17.77, 14.77
> > > > > > > > > > > > >     Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
> > > > > > > > > > > > >     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,
> > > > > > > > > > > > > 0.4 si,  0.0 st
> > > > > > > > > > > > >     MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
> > > > > > > > > > > > >     MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
> > > > > > > > > > > > >     ...
> > > > > > > > > > > > >        765 root      20   0       0      0      0 S   3.6   0.0
> > > > > > > > > > > > > 34966:46 [kswapd3]
> > > > > > > > > > > > >     ...
> > > > > > > > > > > > > 2/ swap space usage is low (4MB)
> > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > > > > > > > > > > > >
> > > > > > > > > > > > > Both situations are wrong as they are using swap in/out extensively,
> > > > > > > > > > > > > however the multi-gen LRU situation is 10times worse.
> > > > > > > > > > > >
> > > > > > > > > > > > From the stats below, node 3 had the lowest free memory. So I think in
> > > > > > > > > > > > both cases, the reclaim activities were as expected.
> > > > > > > > > > >
> > > > > > > > > > > I do not see a reason for the memory pressure and reclaims. This node
> > > > > > > > > > > has the lowest free memory of all nodes (~302MB free) that is true,
> > > > > > > > > > > however the swap space usage is just 4MB (still going in and out). So
> > > > > > > > > > > what can be the reason for that behaviour?
> > > > > > > > > >
> > > > > > > > > > The best analogy is that refuel (reclaim) happens before the tank
> > > > > > > > > > becomes empty, and it happens even sooner when there is a long road
> > > > > > > > > > ahead (high order allocations).
> > > > > > > > > >
> > > > > > > > > > > The workers/application is running in pre-allocated HugePages and the
> > > > > > > > > > > rest is used for a small set of system services and drivers of
> > > > > > > > > > > devices. It is static and not growing. The issue persists when I stop
> > > > > > > > > > > the system services and free the memory.
> > > > > > > > > >
> > > > > > > > > > Yes, this helps.
> > > > > > > > > >  Also could you attach /proc/buddyinfo from the moment
> > > > > > > > > > you hit the problem?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I can. The problem is continuous, it is 100% of time continuously
> > > > > > > > > doing in/out and consuming 100% of CPU and locking IO.
> > > > > > > > >
> > > > > > > > > The output of /proc/buddyinfo is:
> > > > > > > > >
> > > > > > > > > # cat /proc/buddyinfo
> > > > > > > > > Node 0, zone      DMA      7      2      2      1      1      2      1
> > > > > > > > >      1      1      2      1
> > > > > > > > > Node 0, zone    DMA32   4567   3395   1357    846    439    190     93
> > > > > > > > >     61     43     23      4
> > > > > > > > > Node 0, zone   Normal     19    190    140    129    136     75     66
> > > > > > > > >     41      9      1      5
> > > > > > > > > Node 1, zone   Normal    194   1210   2080   1800    715    255    111
> > > > > > > > >     56     42     36     55
> > > > > > > > > Node 2, zone   Normal    204    768   3766   3394   1742    468    185
> > > > > > > > >    194    238     47     74
> > > > > > > > > Node 3, zone   Normal   1622   2137   1058    846    388    208     97
> > > > > > > > >     44     14     42     10
> > > > > > > >
> > > > > > > > Again, thinking out loud: there is only one zone on node 3, i.e., the
> > > > > > > > normal zone, and this excludes the problem commit
> > > > > > > > 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
> > > > > > > > reclaim") fixed in v6.6.
> > > > > > >
> > > > > > > I built vanila 6.6.1 and did the first fast test - spin up and destroy
> > > > > > > VMs only - This test does not always trigger the kswapd3 continuous
> > > > > > > swap in/out  usage but it uses it and it  looks like there is a
> > > > > > > change:
> > > > > > >
> > > > > > >  I can see kswapd non-continous (15s and more) usage with 6.5.y
> > > > > > >  # ps ax | grep [k]swapd
> > > > > > >     753 ?        S      0:00 [kswapd0]
> > > > > > >     754 ?        S      0:00 [kswapd1]
> > > > > > >     755 ?        S      0:00 [kswapd2]
> > > > > > >     756 ?        S      0:15 [kswapd3]    <<<<<<<<<
> > > > > > >     757 ?        S      0:00 [kswapd4]
> > > > > > >     758 ?        S      0:00 [kswapd5]
> > > > > > >     759 ?        S      0:00 [kswapd6]
> > > > > > >     760 ?        S      0:00 [kswapd7]
> > > > > > >     761 ?        S      0:00 [kswapd8]
> > > > > > >     762 ?        S      0:00 [kswapd9]
> > > > > > >     763 ?        S      0:00 [kswapd10]
> > > > > > >     764 ?        S      0:00 [kswapd11]
> > > > > > >     765 ?        S      0:00 [kswapd12]
> > > > > > >     766 ?        S      0:00 [kswapd13]
> > > > > > >     767 ?        S      0:00 [kswapd14]
> > > > > > >     768 ?        S      0:00 [kswapd15]
> > > > > > >
> > > > > > > and none kswapd usage with 6.6.1, that looks to be promising path
> > > > > > >
> > > > > > > # ps ax | grep [k]swapd
> > > > > > >     808 ?        S      0:00 [kswapd0]
> > > > > > >     809 ?        S      0:00 [kswapd1]
> > > > > > >     810 ?        S      0:00 [kswapd2]
> > > > > > >     811 ?        S      0:00 [kswapd3]    <<<< nice
> > > > > > >     812 ?        S      0:00 [kswapd4]
> > > > > > >     813 ?        S      0:00 [kswapd5]
> > > > > > >     814 ?        S      0:00 [kswapd6]
> > > > > > >     815 ?        S      0:00 [kswapd7]
> > > > > > >     816 ?        S      0:00 [kswapd8]
> > > > > > >     817 ?        S      0:00 [kswapd9]
> > > > > > >     818 ?        S      0:00 [kswapd10]
> > > > > > >     819 ?        S      0:00 [kswapd11]
> > > > > > >     820 ?        S      0:00 [kswapd12]
> > > > > > >     821 ?        S      0:00 [kswapd13]
> > > > > > >     822 ?        S      0:00 [kswapd14]
> > > > > > >     823 ?        S      0:00 [kswapd15]
> > > > > > >
> > > > > > > I will install the 6.6.1 on the server which is doing some work and
> > > > > > > observe it later today.
> > > > > >
> > > > > > Thanks. Fingers crossed.
> > > > >
> > > > > The 6.6.y was deployed and used from 9th Nov 3PM CEST. So far so good.
> > > > > The node 3 has 163MiB free of memory and I see
> > > > > just a few in/out swap usage sometimes (which is expected) and minimal
> > > > > kswapd3 process usage for almost 4days.
> > > >
> > > > Thanks for the update!
> > > >
> > > > Just to confirm:
> > > > 1. MGLRU was enabled, and
> > >
> > > Yes, MGLRU is enabled
> > >
> > > > 2. The v6.6 deployed did NOT have the patch I attached earlier.
> > >
> > > Vanila 6.6, attached patch NOT applied.
> > >
> > > > Are both correct?
> > > >
> > > > If so, I'd very appreciate it if you could try the attached patch on
> > > > top of v6.5 and see if it helps. My suspicion is that the problem is
> > > > compaction related, i.e., kswapd was woken up by high order
> > > > allocations but didn't properly stop. But what causes the behavior
> > >
> > > Sure, I can try it. Will inform you about progress.
> >
> > Thanks!
> >
> > > > difference on v6.5 between MGLRU and the active/inactive LRU still
> > > > puzzles me --the problem might be somehow masked rather than fixed on
> > > > v6.6.
> > >
> > > I'm not sure how I can help with the issue. Any suggestions on what to
> > > change/try?
> >
> > Trying the attached patch is good enough for now :)
>
> So far I'm running the "6.5.y + patch" for 4 days without triggering
> the infinite swap in//out usage.
>
> I'm observing a similar pattern in kswapd usage - "if it uses kswapd,
> then it is in majority the kswapd3 - like the vanila 6.5.y which is
> not observed with 6.6.y, (The Node's 3 free mem is 159 MB)
> # ps ax | grep [k]swapd
>     750 ?        S      0:00 [kswapd0]
>     751 ?        S      0:00 [kswapd1]
>     752 ?        S      0:00 [kswapd2]
>     753 ?        S      0:02 [kswapd3]    <<<< it uses kswapd3, good
> is that it is not continuous
>     754 ?        S      0:00 [kswapd4]
>     755 ?        S      0:00 [kswapd5]
>     756 ?        S      0:00 [kswapd6]
>     757 ?        S      0:00 [kswapd7]
>     758 ?        S      0:00 [kswapd8]
>     759 ?        S      0:00 [kswapd9]
>     760 ?        S      0:00 [kswapd10]
>     761 ?        S      0:00 [kswapd11]
>     762 ?        S      0:00 [kswapd12]
>     763 ?        S      0:00 [kswapd13]
>     764 ?        S      0:00 [kswapd14]
>     765 ?        S      0:00 [kswapd15]
>
> Good stuff is that the system did not end in a continuous loop of swap
> in/out usage (at least so far) which is great. See attached
> swap_in_out_good_vs_bad.png. I will keep it running for the next 3
> days.

Thanks again, Jaroslav!

Just a note here: I suspect the problem still exists on v6.6 but
somehow is masked, possibly by reduced memory usage from the kernel
itself and more free memory for userspace. So to be on the safe side,
I'll post the patch and credit you as the reporter and tester.
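
(If you want to sanity-check that hypothesis, one rough way -- just a
sketch, assuming the usual per-node sysfs layout on your box -- is to
snapshot node 3's kernel-side usage on both kernels right after boot
and compare:

    grep -E 'MemTotal|MemFree|MemUsed|Slab|KernelStack|PageTables' \
        /sys/devices/system/node/node3/meminfo

A noticeably lower MemUsed/Slab on v6.6 would support the masking
theory.)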


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-11-22  6:13                         ` Yu Zhao
@ 2023-11-22  7:12                           ` Jaroslav Pulchart
  2023-11-22  7:30                             ` Jaroslav Pulchart
  0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2023-11-22  7:12 UTC (permalink / raw)
  To: Yu Zhao
  Cc: linux-mm, akpm, Igor Raits, Daniel Secik, Charan Teja Kalla,
	Kalesh Singh

[-- Attachment #1: Type: text/plain, Size: 13980 bytes --]

>
> On Mon, Nov 20, 2023 at 1:42 AM Jaroslav Pulchart
> <jaroslav.pulchart@gooddata.com> wrote:
> >
> > > On Tue, Nov 14, 2023 at 12:30 AM Jaroslav Pulchart
> > > <jaroslav.pulchart@gooddata.com> wrote:
> > > >
> > > > >
> > > > > On Mon, Nov 13, 2023 at 1:36 AM Jaroslav Pulchart
> > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > >
> > > > > > >
> > > > > > > On Thu, Nov 9, 2023 at 3:58 AM Jaroslav Pulchart
> > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
> > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hi Jaroslav,
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Yu Zhao
> > > > > > > > > > > >
> > > > > > > > > > > > thanks for response, see answers inline:
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hello,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > > > > > > > > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > > > > > > > > > > > > system (16numa domains).
> > > > > > > > > > > > >
> > > > > > > > > > > > > Kernel version please?
> > > > > > > > > > > >
> > > > > > > > > > > > 6.5.y, but we saw it sooner as it is in investigation from 23th May
> > > > > > > > > > > > (6.4.y and maybe even the 6.3.y).
> > > > > > > > > > >
> > > > > > > > > > > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> > > > > > > > > > > for you if you run into other problems with v6.6.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I will give it a try using 6.6.y. When it will work we can switch to
> > > > > > > > > > 6.6.y instead of backporting the stuff to 6.5.y.
> > > > > > > > > >
> > > > > > > > > > > > > > Symptoms of my issue are
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > /A/ if mult-gen LRU is enabled
> > > > > > > > > > > > > > 1/ [kswapd3] is consuming 100% CPU
> > > > > > > > > > > > >
> > > > > > > > > > > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > > > > > > > > > > > >
> > > > > > > > > > > > > >     top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34,
> > > > > > > > > > > > > > 18.26, 15.01
> > > > > > > > > > > > > >     Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
> > > > > > > > > > > > > >     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,
> > > > > > > > > > > > > > 0.4 si,  0.0 st
> > > > > > > > > > > > > >     MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
> > > > > > > > > > > > > >     MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
> > > > > > > > > > > > > >     ...
> > > > > > > > > > > > > >         765 root      20   0       0      0      0 R  98.3   0.0
> > > > > > > > > > > > > > 34969:04 kswapd3
> > > > > > > > > > > > > >     ...
> > > > > > > > > > > > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > > > > > > > > > > > > > observed with swap disk as well and cause IO latency issues due to
> > > > > > > > > > > > > > some kind of locking)
> > > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > /B/ if mult-gen LRU is disabled
> > > > > > > > > > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > > > > > > > > > > >     top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05,
> > > > > > > > > > > > > > 17.77, 14.77
> > > > > > > > > > > > > >     Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
> > > > > > > > > > > > > >     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,
> > > > > > > > > > > > > > 0.4 si,  0.0 st
> > > > > > > > > > > > > >     MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
> > > > > > > > > > > > > >     MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
> > > > > > > > > > > > > >     ...
> > > > > > > > > > > > > >        765 root      20   0       0      0      0 S   3.6   0.0
> > > > > > > > > > > > > > 34966:46 [kswapd3]
> > > > > > > > > > > > > >     ...
> > > > > > > > > > > > > > 2/ swap space usage is low (4MB)
> > > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Both situations are wrong as they are using swap in/out extensively,
> > > > > > > > > > > > > > however the multi-gen LRU situation is 10times worse.
> > > > > > > > > > > > >
> > > > > > > > > > > > > From the stats below, node 3 had the lowest free memory. So I think in
> > > > > > > > > > > > > both cases, the reclaim activities were as expected.
> > > > > > > > > > > >
> > > > > > > > > > > > I do not see a reason for the memory pressure and reclaims. This node
> > > > > > > > > > > > has the lowest free memory of all nodes (~302MB free) that is true,
> > > > > > > > > > > > however the swap space usage is just 4MB (still going in and out). So
> > > > > > > > > > > > what can be the reason for that behaviour?
> > > > > > > > > > >
> > > > > > > > > > > The best analogy is that refuel (reclaim) happens before the tank
> > > > > > > > > > > becomes empty, and it happens even sooner when there is a long road
> > > > > > > > > > > ahead (high order allocations).
> > > > > > > > > > >
> > > > > > > > > > > > The workers/application is running in pre-allocated HugePages and the
> > > > > > > > > > > > rest is used for a small set of system services and drivers of
> > > > > > > > > > > > devices. It is static and not growing. The issue persists when I stop
> > > > > > > > > > > > the system services and free the memory.
> > > > > > > > > > >
> > > > > > > > > > > Yes, this helps.
> > > > > > > > > > >  Also could you attach /proc/buddyinfo from the moment
> > > > > > > > > > > you hit the problem?
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I can. The problem is continuous, it is 100% of time continuously
> > > > > > > > > > doing in/out and consuming 100% of CPU and locking IO.
> > > > > > > > > >
> > > > > > > > > > The output of /proc/buddyinfo is:
> > > > > > > > > >
> > > > > > > > > > # cat /proc/buddyinfo
> > > > > > > > > > Node 0, zone      DMA      7      2      2      1      1      2      1
> > > > > > > > > >      1      1      2      1
> > > > > > > > > > Node 0, zone    DMA32   4567   3395   1357    846    439    190     93
> > > > > > > > > >     61     43     23      4
> > > > > > > > > > Node 0, zone   Normal     19    190    140    129    136     75     66
> > > > > > > > > >     41      9      1      5
> > > > > > > > > > Node 1, zone   Normal    194   1210   2080   1800    715    255    111
> > > > > > > > > >     56     42     36     55
> > > > > > > > > > Node 2, zone   Normal    204    768   3766   3394   1742    468    185
> > > > > > > > > >    194    238     47     74
> > > > > > > > > > Node 3, zone   Normal   1622   2137   1058    846    388    208     97
> > > > > > > > > >     44     14     42     10
> > > > > > > > >
> > > > > > > > > Again, thinking out loud: there is only one zone on node 3, i.e., the
> > > > > > > > > normal zone, and this excludes the problem commit
> > > > > > > > > 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
> > > > > > > > > reclaim") fixed in v6.6.
> > > > > > > >
> > > > > > > > I built vanila 6.6.1 and did the first fast test - spin up and destroy
> > > > > > > > VMs only - This test does not always trigger the kswapd3 continuous
> > > > > > > > swap in/out  usage but it uses it and it  looks like there is a
> > > > > > > > change:
> > > > > > > >
> > > > > > > >  I can see kswapd non-continous (15s and more) usage with 6.5.y
> > > > > > > >  # ps ax | grep [k]swapd
> > > > > > > >     753 ?        S      0:00 [kswapd0]
> > > > > > > >     754 ?        S      0:00 [kswapd1]
> > > > > > > >     755 ?        S      0:00 [kswapd2]
> > > > > > > >     756 ?        S      0:15 [kswapd3]    <<<<<<<<<
> > > > > > > >     757 ?        S      0:00 [kswapd4]
> > > > > > > >     758 ?        S      0:00 [kswapd5]
> > > > > > > >     759 ?        S      0:00 [kswapd6]
> > > > > > > >     760 ?        S      0:00 [kswapd7]
> > > > > > > >     761 ?        S      0:00 [kswapd8]
> > > > > > > >     762 ?        S      0:00 [kswapd9]
> > > > > > > >     763 ?        S      0:00 [kswapd10]
> > > > > > > >     764 ?        S      0:00 [kswapd11]
> > > > > > > >     765 ?        S      0:00 [kswapd12]
> > > > > > > >     766 ?        S      0:00 [kswapd13]
> > > > > > > >     767 ?        S      0:00 [kswapd14]
> > > > > > > >     768 ?        S      0:00 [kswapd15]
> > > > > > > >
> > > > > > > > and none kswapd usage with 6.6.1, that looks to be promising path
> > > > > > > >
> > > > > > > > # ps ax | grep [k]swapd
> > > > > > > >     808 ?        S      0:00 [kswapd0]
> > > > > > > >     809 ?        S      0:00 [kswapd1]
> > > > > > > >     810 ?        S      0:00 [kswapd2]
> > > > > > > >     811 ?        S      0:00 [kswapd3]    <<<< nice
> > > > > > > >     812 ?        S      0:00 [kswapd4]
> > > > > > > >     813 ?        S      0:00 [kswapd5]
> > > > > > > >     814 ?        S      0:00 [kswapd6]
> > > > > > > >     815 ?        S      0:00 [kswapd7]
> > > > > > > >     816 ?        S      0:00 [kswapd8]
> > > > > > > >     817 ?        S      0:00 [kswapd9]
> > > > > > > >     818 ?        S      0:00 [kswapd10]
> > > > > > > >     819 ?        S      0:00 [kswapd11]
> > > > > > > >     820 ?        S      0:00 [kswapd12]
> > > > > > > >     821 ?        S      0:00 [kswapd13]
> > > > > > > >     822 ?        S      0:00 [kswapd14]
> > > > > > > >     823 ?        S      0:00 [kswapd15]
> > > > > > > >
> > > > > > > > I will install the 6.6.1 on the server which is doing some work and
> > > > > > > > observe it later today.
> > > > > > >
> > > > > > > Thanks. Fingers crossed.
> > > > > >
> > > > > > The 6.6.y was deployed and used from 9th Nov 3PM CEST. So far so good.
> > > > > > The node 3 has 163MiB free of memory and I see
> > > > > > just a few in/out swap usage sometimes (which is expected) and minimal
> > > > > > kswapd3 process usage for almost 4days.
> > > > >
> > > > > Thanks for the update!
> > > > >
> > > > > Just to confirm:
> > > > > 1. MGLRU was enabled, and
> > > >
> > > > Yes, MGLRU is enabled
> > > >
> > > > > 2. The v6.6 deployed did NOT have the patch I attached earlier.
> > > >
> > > > Vanila 6.6, attached patch NOT applied.
> > > >
> > > > > Are both correct?
> > > > >
> > > > > If so, I'd very appreciate it if you could try the attached patch on
> > > > > top of v6.5 and see if it helps. My suspicion is that the problem is
> > > > > compaction related, i.e., kswapd was woken up by high order
> > > > > allocations but didn't properly stop. But what causes the behavior
> > > >
> > > > Sure, I can try it. Will inform you about progress.
> > >
> > > Thanks!
> > >
> > > > > difference on v6.5 between MGLRU and the active/inactive LRU still
> > > > > puzzles me --the problem might be somehow masked rather than fixed on
> > > > > v6.6.
> > > >
> > > > I'm not sure how I can help with the issue. Any suggestions on what to
> > > > change/try?
> > >
> > > Trying the attached patch is good enough for now :)
> >
> > So far I'm running the "6.5.y + patch" for 4 days without triggering
> > the infinite swap in//out usage.
> >
> > I'm observing a similar pattern in kswapd usage - "if it uses kswapd,
> > then it is in majority the kswapd3 - like the vanila 6.5.y which is
> > not observed with 6.6.y, (The Node's 3 free mem is 159 MB)
> > # ps ax | grep [k]swapd
> >     750 ?        S      0:00 [kswapd0]
> >     751 ?        S      0:00 [kswapd1]
> >     752 ?        S      0:00 [kswapd2]
> >     753 ?        S      0:02 [kswapd3]    <<<< it uses kswapd3, good
> > is that it is not continuous
> >     754 ?        S      0:00 [kswapd4]
> >     755 ?        S      0:00 [kswapd5]
> >     756 ?        S      0:00 [kswapd6]
> >     757 ?        S      0:00 [kswapd7]
> >     758 ?        S      0:00 [kswapd8]
> >     759 ?        S      0:00 [kswapd9]
> >     760 ?        S      0:00 [kswapd10]
> >     761 ?        S      0:00 [kswapd11]
> >     762 ?        S      0:00 [kswapd12]
> >     763 ?        S      0:00 [kswapd13]
> >     764 ?        S      0:00 [kswapd14]
> >     765 ?        S      0:00 [kswapd15]
> >
> > Good stuff is that the system did not end in a continuous loop of swap
> > in/out usage (at least so far) which is great. See attached
> > swap_in_out_good_vs_bad.png. I will keep it running for the next 3
> > days.
>
> Thanks again, Jaroslav!
>
> Just a note here: I suspect the problem still exists on v6.6 but
> somehow is masked, possibly by reduced memory usage from the kernel
> itself and more free memory for userspace. So to be on the safe side,
> I'll post the patch and credit you as the reporter and tester.

Morning, let's wait. I reviewed the graph and the swap in/out started
happening again at 1:50 AM CET. It is slower than before (CPU
utilization ~0.3%), but it is still doing in/out; see the attached PNG.
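
For the record, the rate in the graph can be cross-checked against the
cumulative counters in /proc/vmstat (pswpin/pswpout count 4 KiB
pages); a rough sketch, assuming a 10-second sampling interval:

    first=1; prev_in=0; prev_out=0
    while true; do
        in=$(awk '$1 == "pswpin"  { print $2 }' /proc/vmstat)
        out=$(awk '$1 == "pswpout" { print $2 }' /proc/vmstat)
        if [ "$first" -eq 0 ]; then
            echo "$(date +%T) swap-in: $(( (in - prev_in) * 4 / 10 )) kB/s" \
                 "swap-out: $(( (out - prev_out) * 4 / 10 )) kB/s"
        fi
        first=0; prev_in=$in; prev_out=$out
        sleep 10
    done

(sar -W 10 from sysstat reports the same per-second rates, if it is
installed.)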

[-- Attachment #2: in_out_again.png --]
[-- Type: image/png, Size: 23506 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-11-22  7:12                           ` Jaroslav Pulchart
@ 2023-11-22  7:30                             ` Jaroslav Pulchart
  2023-11-22 14:18                               ` Yu Zhao
  0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2023-11-22  7:30 UTC (permalink / raw)
  To: Yu Zhao
  Cc: linux-mm, akpm, Igor Raits, Daniel Secik, Charan Teja Kalla,
	Kalesh Singh

>
> >
> > On Mon, Nov 20, 2023 at 1:42 AM Jaroslav Pulchart
> > <jaroslav.pulchart@gooddata.com> wrote:
> > >
> > > > On Tue, Nov 14, 2023 at 12:30 AM Jaroslav Pulchart
> > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > >
> > > > > >
> > > > > > On Mon, Nov 13, 2023 at 1:36 AM Jaroslav Pulchart
> > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > >
> > > > > > > >
> > > > > > > > On Thu, Nov 9, 2023 at 3:58 AM Jaroslav Pulchart
> > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
> > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> > > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Jaroslav,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hi Yu Zhao
> > > > > > > > > > > > >
> > > > > > > > > > > > > thanks for response, see answers inline:
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > > > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hello,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > > > > > > > > > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > > > > > > > > > > > > > system (16numa domains).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Kernel version please?
> > > > > > > > > > > > >
> > > > > > > > > > > > > 6.5.y, but we saw it sooner as it is in investigation from 23th May
> > > > > > > > > > > > > (6.4.y and maybe even the 6.3.y).
> > > > > > > > > > > >
> > > > > > > > > > > > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> > > > > > > > > > > > for you if you run into other problems with v6.6.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I will give it a try using 6.6.y. When it will work we can switch to
> > > > > > > > > > > 6.6.y instead of backporting the stuff to 6.5.y.
> > > > > > > > > > >
> > > > > > > > > > > > > > > Symptoms of my issue are
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > /A/ if mult-gen LRU is enabled
> > > > > > > > > > > > > > > 1/ [kswapd3] is consuming 100% CPU
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >     top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34,
> > > > > > > > > > > > > > > 18.26, 15.01
> > > > > > > > > > > > > > >     Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
> > > > > > > > > > > > > > >     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,
> > > > > > > > > > > > > > > 0.4 si,  0.0 st
> > > > > > > > > > > > > > >     MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
> > > > > > > > > > > > > > >     MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
> > > > > > > > > > > > > > >     ...
> > > > > > > > > > > > > > >         765 root      20   0       0      0      0 R  98.3   0.0
> > > > > > > > > > > > > > > 34969:04 kswapd3
> > > > > > > > > > > > > > >     ...
> > > > > > > > > > > > > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > > > > > > > > > > > > > > observed with swap disk as well and cause IO latency issues due to
> > > > > > > > > > > > > > > some kind of locking)
> > > > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > /B/ if mult-gen LRU is disabled
> > > > > > > > > > > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > > > > > > > > > > > >     top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05,
> > > > > > > > > > > > > > > 17.77, 14.77
> > > > > > > > > > > > > > >     Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
> > > > > > > > > > > > > > >     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,
> > > > > > > > > > > > > > > 0.4 si,  0.0 st
> > > > > > > > > > > > > > >     MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
> > > > > > > > > > > > > > >     MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
> > > > > > > > > > > > > > >     ...
> > > > > > > > > > > > > > >        765 root      20   0       0      0      0 S   3.6   0.0
> > > > > > > > > > > > > > > 34966:46 [kswapd3]
> > > > > > > > > > > > > > >     ...
> > > > > > > > > > > > > > > 2/ swap space usage is low (4MB)
> > > > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Both situations are wrong as they are using swap in/out extensively,
> > > > > > > > > > > > > > > however the multi-gen LRU situation is 10times worse.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > From the stats below, node 3 had the lowest free memory. So I think in
> > > > > > > > > > > > > > both cases, the reclaim activities were as expected.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I do not see a reason for the memory pressure and reclaims. This node
> > > > > > > > > > > > > has the lowest free memory of all nodes (~302MB free) that is true,
> > > > > > > > > > > > > however the swap space usage is just 4MB (still going in and out). So
> > > > > > > > > > > > > what can be the reason for that behaviour?
> > > > > > > > > > > >
> > > > > > > > > > > > The best analogy is that refuel (reclaim) happens before the tank
> > > > > > > > > > > > becomes empty, and it happens even sooner when there is a long road
> > > > > > > > > > > > ahead (high order allocations).
> > > > > > > > > > > >
> > > > > > > > > > > > > The workers/application is running in pre-allocated HugePages and the
> > > > > > > > > > > > > rest is used for a small set of system services and drivers of
> > > > > > > > > > > > > devices. It is static and not growing. The issue persists when I stop
> > > > > > > > > > > > > the system services and free the memory.
> > > > > > > > > > > >
> > > > > > > > > > > > Yes, this helps.
> > > > > > > > > > > >  Also could you attach /proc/buddyinfo from the moment
> > > > > > > > > > > > you hit the problem?
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I can. The problem is continuous, it is 100% of time continuously
> > > > > > > > > > > doing in/out and consuming 100% of CPU and locking IO.
> > > > > > > > > > >
> > > > > > > > > > > The output of /proc/buddyinfo is:
> > > > > > > > > > >
> > > > > > > > > > > # cat /proc/buddyinfo
> > > > > > > > > > > Node 0, zone      DMA      7      2      2      1      1      2      1
> > > > > > > > > > >      1      1      2      1
> > > > > > > > > > > Node 0, zone    DMA32   4567   3395   1357    846    439    190     93
> > > > > > > > > > >     61     43     23      4
> > > > > > > > > > > Node 0, zone   Normal     19    190    140    129    136     75     66
> > > > > > > > > > >     41      9      1      5
> > > > > > > > > > > Node 1, zone   Normal    194   1210   2080   1800    715    255    111
> > > > > > > > > > >     56     42     36     55
> > > > > > > > > > > Node 2, zone   Normal    204    768   3766   3394   1742    468    185
> > > > > > > > > > >    194    238     47     74
> > > > > > > > > > > Node 3, zone   Normal   1622   2137   1058    846    388    208     97
> > > > > > > > > > >     44     14     42     10
> > > > > > > > > >
> > > > > > > > > > Again, thinking out loud: there is only one zone on node 3, i.e., the
> > > > > > > > > > normal zone, and this excludes the problem commit
> > > > > > > > > > 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
> > > > > > > > > > reclaim") fixed in v6.6.
> > > > > > > > >
> > > > > > > > > I built vanila 6.6.1 and did the first fast test - spin up and destroy
> > > > > > > > > VMs only - This test does not always trigger the kswapd3 continuous
> > > > > > > > > swap in/out  usage but it uses it and it  looks like there is a
> > > > > > > > > change:
> > > > > > > > >
> > > > > > > > >  I can see kswapd non-continous (15s and more) usage with 6.5.y
> > > > > > > > >  # ps ax | grep [k]swapd
> > > > > > > > >     753 ?        S      0:00 [kswapd0]
> > > > > > > > >     754 ?        S      0:00 [kswapd1]
> > > > > > > > >     755 ?        S      0:00 [kswapd2]
> > > > > > > > >     756 ?        S      0:15 [kswapd3]    <<<<<<<<<
> > > > > > > > >     757 ?        S      0:00 [kswapd4]
> > > > > > > > >     758 ?        S      0:00 [kswapd5]
> > > > > > > > >     759 ?        S      0:00 [kswapd6]
> > > > > > > > >     760 ?        S      0:00 [kswapd7]
> > > > > > > > >     761 ?        S      0:00 [kswapd8]
> > > > > > > > >     762 ?        S      0:00 [kswapd9]
> > > > > > > > >     763 ?        S      0:00 [kswapd10]
> > > > > > > > >     764 ?        S      0:00 [kswapd11]
> > > > > > > > >     765 ?        S      0:00 [kswapd12]
> > > > > > > > >     766 ?        S      0:00 [kswapd13]
> > > > > > > > >     767 ?        S      0:00 [kswapd14]
> > > > > > > > >     768 ?        S      0:00 [kswapd15]
> > > > > > > > >
> > > > > > > > > and none kswapd usage with 6.6.1, that looks to be promising path
> > > > > > > > >
> > > > > > > > > # ps ax | grep [k]swapd
> > > > > > > > >     808 ?        S      0:00 [kswapd0]
> > > > > > > > >     809 ?        S      0:00 [kswapd1]
> > > > > > > > >     810 ?        S      0:00 [kswapd2]
> > > > > > > > >     811 ?        S      0:00 [kswapd3]    <<<< nice
> > > > > > > > >     812 ?        S      0:00 [kswapd4]
> > > > > > > > >     813 ?        S      0:00 [kswapd5]
> > > > > > > > >     814 ?        S      0:00 [kswapd6]
> > > > > > > > >     815 ?        S      0:00 [kswapd7]
> > > > > > > > >     816 ?        S      0:00 [kswapd8]
> > > > > > > > >     817 ?        S      0:00 [kswapd9]
> > > > > > > > >     818 ?        S      0:00 [kswapd10]
> > > > > > > > >     819 ?        S      0:00 [kswapd11]
> > > > > > > > >     820 ?        S      0:00 [kswapd12]
> > > > > > > > >     821 ?        S      0:00 [kswapd13]
> > > > > > > > >     822 ?        S      0:00 [kswapd14]
> > > > > > > > >     823 ?        S      0:00 [kswapd15]
> > > > > > > > >
> > > > > > > > > I will install the 6.6.1 on the server which is doing some work and
> > > > > > > > > observe it later today.
> > > > > > > >
> > > > > > > > Thanks. Fingers crossed.
> > > > > > >
> > > > > > > The 6.6.y was deployed and used from 9th Nov 3PM CEST. So far so good.
> > > > > > > The node 3 has 163MiB free of memory and I see
> > > > > > > just a few in/out swap usage sometimes (which is expected) and minimal
> > > > > > > kswapd3 process usage for almost 4days.
> > > > > >
> > > > > > Thanks for the update!
> > > > > >
> > > > > > Just to confirm:
> > > > > > 1. MGLRU was enabled, and
> > > > >
> > > > > Yes, MGLRU is enabled
> > > > >
> > > > > > 2. The v6.6 deployed did NOT have the patch I attached earlier.
> > > > >
> > > > > Vanila 6.6, attached patch NOT applied.
> > > > >
> > > > > > Are both correct?
> > > > > >
> > > > > > If so, I'd very appreciate it if you could try the attached patch on
> > > > > > top of v6.5 and see if it helps. My suspicion is that the problem is
> > > > > > compaction related, i.e., kswapd was woken up by high order
> > > > > > allocations but didn't properly stop. But what causes the behavior
> > > > >
> > > > > Sure, I can try it. Will inform you about progress.
> > > >
> > > > Thanks!
> > > >
> > > > > > difference on v6.5 between MGLRU and the active/inactive LRU still
> > > > > > puzzles me --the problem might be somehow masked rather than fixed on
> > > > > > v6.6.
> > > > >
> > > > > I'm not sure how I can help with the issue. Any suggestions on what to
> > > > > change/try?
> > > >
> > > > Trying the attached patch is good enough for now :)
> > >
> > > So far I'm running the "6.5.y + patch" for 4 days without triggering
> > > the infinite swap in//out usage.
> > >
> > > I'm observing a similar pattern in kswapd usage - "if it uses kswapd,
> > > then it is in majority the kswapd3 - like the vanila 6.5.y which is
> > > not observed with 6.6.y, (The Node's 3 free mem is 159 MB)
> > > # ps ax | grep [k]swapd
> > >     750 ?        S      0:00 [kswapd0]
> > >     751 ?        S      0:00 [kswapd1]
> > >     752 ?        S      0:00 [kswapd2]
> > >     753 ?        S      0:02 [kswapd3]    <<<< it uses kswapd3, good
> > > is that it is not continuous
> > >     754 ?        S      0:00 [kswapd4]
> > >     755 ?        S      0:00 [kswapd5]
> > >     756 ?        S      0:00 [kswapd6]
> > >     757 ?        S      0:00 [kswapd7]
> > >     758 ?        S      0:00 [kswapd8]
> > >     759 ?        S      0:00 [kswapd9]
> > >     760 ?        S      0:00 [kswapd10]
> > >     761 ?        S      0:00 [kswapd11]
> > >     762 ?        S      0:00 [kswapd12]
> > >     763 ?        S      0:00 [kswapd13]
> > >     764 ?        S      0:00 [kswapd14]
> > >     765 ?        S      0:00 [kswapd15]
> > >
> > > Good stuff is that the system did not end in a continuous loop of swap
> > > in/out usage (at least so far) which is great. See attached
> > > swap_in_out_good_vs_bad.png. I will keep it running for the next 3
> > > days.
> >
> > Thanks again, Jaroslav!
> >
> > Just a note here: I suspect the problem still exists on v6.6 but
> > somehow is masked, possibly by reduced memory usage from the kernel
> > itself and more free memory for userspace. So to be on the safe side,
> > I'll post the patch and credit you as the reporter and tester.
>
> Morning, let's wait. I reviewed the graph and the swap in/out started
> to be happening from 1:50 AM CET. Slower than before (util of cpu
> 0.3%) but it is doing in/out see attached png.

I investigated it further: there was an operational issue, and the
system disabled multi-gen LRU yesterday at ~10 AM CET (our temporary
workaround for this problem) by running
   echo N > /sys/kernel/mm/lru_gen/enabled
after an alert was triggered by an unexpected setup of the server.
Could it be that the patch is not functional if lru_gen/enabled is
0x0000?
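
For reference, if I read Documentation/admin-guide/mm/multigen_lru.rst
correctly, /sys/kernel/mm/lru_gen/enabled is a bitmask (0x0001 is the
main switch, the higher bits control clearing of the accessed bit in
leaf and non-leaf page table entries), so the state can be checked and
restored with something like:

    # cat /sys/kernel/mm/lru_gen/enabled
    0x0000                            <<< everything is off
    # echo y > /sys/kernel/mm/lru_gen/enabled
    # cat /sys/kernel/mm/lru_gen/enabled
    0x0007                            <<< expected if all three features are
                                          supported on this box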

I need to reboot the system and do the whole week's test again.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-11-22  7:30                             ` Jaroslav Pulchart
@ 2023-11-22 14:18                               ` Yu Zhao
  2023-11-29 13:54                                 ` Jaroslav Pulchart
  0 siblings, 1 reply; 30+ messages in thread
From: Yu Zhao @ 2023-11-22 14:18 UTC (permalink / raw)
  To: Jaroslav Pulchart
  Cc: Charan Teja Kalla, Daniel Secik, Igor Raits, Kalesh Singh, akpm,
	linux-mm

[-- Attachment #1: Type: text/plain, Size: 15792 bytes --]

On Wed, Nov 22, 2023 at 12:31 AM Jaroslav Pulchart <jaroslav.pulchart@gooddata.com> wrote:

> >
> > >
> > > On Mon, Nov 20, 2023 at 1:42 AM Jaroslav Pulchart
> > > <jaroslav.pulchart@gooddata.com> wrote:
> > > >
> > > > > On Tue, Nov 14, 2023 at 12:30 AM Jaroslav Pulchart
> > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > >
> > > > > > >
> > > > > > > On Mon, Nov 13, 2023 at 1:36 AM Jaroslav Pulchart
> > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > >
> > > > > > > > >
> > > > > > > > > On Thu, Nov 9, 2023 at 3:58 AM Jaroslav Pulchart
> > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
> > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> > > > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Jaroslav,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Yu Zhao
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > thanks for response, see answers inline:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav
> Pulchart
> > > > > > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hello,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I would like to report to you an unpleasant
> behavior of multi-gen LRU
> > > > > > > > > > > > > > > > with strange swap in/out usage on my Dell
> 7525 two socket AMD 74F3
> > > > > > > > > > > > > > > > system (16numa domains).
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Kernel version please?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 6.5.y, but we saw it sooner as it is in
> investigation from 23th May
> > > > > > > > > > > > > > (6.4.y and maybe even the 6.3.y).
> > > > > > > > > > > > >
> > > > > > > > > > > > > v6.6 has a few critical fixes for MGLRU, I can
> backport them to v6.5
> > > > > > > > > > > > > for you if you run into other problems with v6.6.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > I will give it a try using 6.6.y. When it will work
> we can switch to
> > > > > > > > > > > > 6.6.y instead of backporting the stuff to 6.5.y.
> > > > > > > > > > > >
> > > > > > > > > > > > > > > > Symptoms of my issue are
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > /A/ if mult-gen LRU is enabled
> > > > > > > > > > > > > > > > 1/ [kswapd3] is consuming 100% CPU
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Just thinking out loud: kswapd3 means the
> fourth node was under memory pressure.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >     top - 15:03:11 up 34 days,  1:51,  2
> users,  load average: 23.34,
> > > > > > > > > > > > > > > > 18.26, 15.01
> > > > > > > > > > > > > > > >     Tasks: 1226 total,   2 running, 1224
> sleeping,   0 stopped,   0 zombie
> > > > > > > > > > > > > > > >     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1
> id,  0.0 wa,  0.4 hi,
> > > > > > > > > > > > > > > > 0.4 si,  0.0 st
> > > > > > > > > > > > > > > >     MiB Mem : 1047265.+total,  28382.7 free,
> 1021308.+used,    767.6 buff/cache
> > > > > > > > > > > > > > > >     MiB Swap:   8192.0 total,   8187.7
> free,      4.2 used.  25956.7 avail Mem
> > > > > > > > > > > > > > > >     ...
> > > > > > > > > > > > > > > >         765 root      20   0       0      0
>     0 R  98.3   0.0
> > > > > > > > > > > > > > > > 34969:04 kswapd3
> > > > > > > > > > > > > > > >     ...
> > > > > > > > > > > > > > > > 2/ swap space usage is low about ~4MB from
> 8GB as swap in zram (was
> > > > > > > > > > > > > > > > observed with swap disk as well and cause IO
> latency issues due to
> > > > > > > > > > > > > > > > some kind of locking)
> > > > > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical
> ~12MB/s in and ~12MB/s out
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > /B/ if mult-gen LRU is disabled
> > > > > > > > > > > > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > > > > > > > > > > > > >     top - 15:02:49 up 34 days,  1:51,  2
> users,  load average: 23.05,
> > > > > > > > > > > > > > > > 17.77, 14.77
> > > > > > > > > > > > > > > >     Tasks: 1226 total,   1 running, 1225
> sleeping,   0 stopped,   0 zombie
> > > > > > > > > > > > > > > >     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8
> id,  0.0 wa,  0.4 hi,
> > > > > > > > > > > > > > > > 0.4 si,  0.0 st
> > > > > > > > > > > > > > > >     MiB Mem : 1047265.+total,  28378.5 free,
> 1021313.+used,    767.3 buff/cache
> > > > > > > > > > > > > > > >     MiB Swap:   8192.0 total,   8189.0
> free,      3.0 used.  25952.4 avail Mem
> > > > > > > > > > > > > > > >     ...
> > > > > > > > > > > > > > > >        765 root      20   0       0      0
>     0 S   3.6   0.0
> > > > > > > > > > > > > > > > 34966:46 [kswapd3]
> > > > > > > > > > > > > > > >     ...
> > > > > > > > > > > > > > > > 2/ swap space usage is low (4MB)
> > > > > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical
> ~500kB/s in and ~500kB/s out
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Both situations are wrong as they are using
> swap in/out extensively,
> > > > > > > > > > > > > > > > however the multi-gen LRU situation is
> 10times worse.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > From the stats below, node 3 had the lowest
> free memory. So I think in
> > > > > > > > > > > > > > > both cases, the reclaim activities were as
> expected.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I do not see a reason for the memory pressure
> and reclaims. This node
> > > > > > > > > > > > > > has the lowest free memory of all nodes (~302MB
> free) that is true,
> > > > > > > > > > > > > > however the swap space usage is just 4MB (still
> going in and out). So
> > > > > > > > > > > > > > what can be the reason for that behaviour?
> > > > > > > > > > > > >
> > > > > > > > > > > > > The best analogy is that refuel (reclaim) happens
> before the tank
> > > > > > > > > > > > > becomes empty, and it happens even sooner when
> there is a long road
> > > > > > > > > > > > > ahead (high order allocations).
> > > > > > > > > > > > >
> > > > > > > > > > > > > > The workers/application is running in
> pre-allocated HugePages and the
> > > > > > > > > > > > > > rest is used for a small set of system services
> and drivers of
> > > > > > > > > > > > > > devices. It is static and not growing. The issue
> persists when I stop
> > > > > > > > > > > > > > the system services and free the memory.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Yes, this helps.
> > > > > > > > > > > > >  Also could you attach /proc/buddyinfo from the
> moment
> > > > > > > > > > > > > you hit the problem?
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > I can. The problem is continuous, it is 100% of time
> continuously
> > > > > > > > > > > > doing in/out and consuming 100% of CPU and locking
> IO.
> > > > > > > > > > > >
> > > > > > > > > > > > The output of /proc/buddyinfo is:
> > > > > > > > > > > >
> > > > > > > > > > > > # cat /proc/buddyinfo
> > > > > > > > > > > > Node 0, zone      DMA      7      2      2      1
>   1      2      1
> > > > > > > > > > > >      1      1      2      1
> > > > > > > > > > > > Node 0, zone    DMA32   4567   3395   1357    846
> 439    190     93
> > > > > > > > > > > >     61     43     23      4
> > > > > > > > > > > > Node 0, zone   Normal     19    190    140    129
> 136     75     66
> > > > > > > > > > > >     41      9      1      5
> > > > > > > > > > > > Node 1, zone   Normal    194   1210   2080   1800
> 715    255    111
> > > > > > > > > > > >     56     42     36     55
> > > > > > > > > > > > Node 2, zone   Normal    204    768   3766   3394
>  1742    468    185
> > > > > > > > > > > >    194    238     47     74
> > > > > > > > > > > > Node 3, zone   Normal   1622   2137   1058    846
> 388    208     97
> > > > > > > > > > > >     44     14     42     10
> > > > > > > > > > >
> > > > > > > > > > > Again, thinking out loud: there is only one zone on
> node 3, i.e., the
> > > > > > > > > > > normal zone, and this excludes the problem commit
> > > > > > > > > > > 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen
> LRU: fix per-zone
> > > > > > > > > > > reclaim") fixed in v6.6.
> > > > > > > > > >
> > > > > > > > > > I built vanila 6.6.1 and did the first fast test - spin
> up and destroy
> > > > > > > > > > VMs only - This test does not always trigger the kswapd3
> continuous
> > > > > > > > > > swap in/out  usage but it uses it and it  looks like
> there is a
> > > > > > > > > > change:
> > > > > > > > > >
> > > > > > > > > >  I can see kswapd non-continous (15s and more) usage
> with 6.5.y
> > > > > > > > > >  # ps ax | grep [k]swapd
> > > > > > > > > >     753 ?        S      0:00 [kswapd0]
> > > > > > > > > >     754 ?        S      0:00 [kswapd1]
> > > > > > > > > >     755 ?        S      0:00 [kswapd2]
> > > > > > > > > >     756 ?        S      0:15 [kswapd3]    <<<<<<<<<
> > > > > > > > > >     757 ?        S      0:00 [kswapd4]
> > > > > > > > > >     758 ?        S      0:00 [kswapd5]
> > > > > > > > > >     759 ?        S      0:00 [kswapd6]
> > > > > > > > > >     760 ?        S      0:00 [kswapd7]
> > > > > > > > > >     761 ?        S      0:00 [kswapd8]
> > > > > > > > > >     762 ?        S      0:00 [kswapd9]
> > > > > > > > > >     763 ?        S      0:00 [kswapd10]
> > > > > > > > > >     764 ?        S      0:00 [kswapd11]
> > > > > > > > > >     765 ?        S      0:00 [kswapd12]
> > > > > > > > > >     766 ?        S      0:00 [kswapd13]
> > > > > > > > > >     767 ?        S      0:00 [kswapd14]
> > > > > > > > > >     768 ?        S      0:00 [kswapd15]
> > > > > > > > > >
> > > > > > > > > > and none kswapd usage with 6.6.1, that looks to be
> promising path
> > > > > > > > > >
> > > > > > > > > > # ps ax | grep [k]swapd
> > > > > > > > > >     808 ?        S      0:00 [kswapd0]
> > > > > > > > > >     809 ?        S      0:00 [kswapd1]
> > > > > > > > > >     810 ?        S      0:00 [kswapd2]
> > > > > > > > > >     811 ?        S      0:00 [kswapd3]    <<<< nice
> > > > > > > > > >     812 ?        S      0:00 [kswapd4]
> > > > > > > > > >     813 ?        S      0:00 [kswapd5]
> > > > > > > > > >     814 ?        S      0:00 [kswapd6]
> > > > > > > > > >     815 ?        S      0:00 [kswapd7]
> > > > > > > > > >     816 ?        S      0:00 [kswapd8]
> > > > > > > > > >     817 ?        S      0:00 [kswapd9]
> > > > > > > > > >     818 ?        S      0:00 [kswapd10]
> > > > > > > > > >     819 ?        S      0:00 [kswapd11]
> > > > > > > > > >     820 ?        S      0:00 [kswapd12]
> > > > > > > > > >     821 ?        S      0:00 [kswapd13]
> > > > > > > > > >     822 ?        S      0:00 [kswapd14]
> > > > > > > > > >     823 ?        S      0:00 [kswapd15]
> > > > > > > > > >
> > > > > > > > > > I will install the 6.6.1 on the server which is doing
> some work and
> > > > > > > > > > observe it later today.
> > > > > > > > >
> > > > > > > > > Thanks. Fingers crossed.
> > > > > > > >
> > > > > > > > The 6.6.y was deployed and used from 9th Nov 3PM CEST. So
> far so good.
> > > > > > > > The node 3 has 163MiB free of memory and I see
> > > > > > > > just a few in/out swap usage sometimes (which is expected)
> and minimal
> > > > > > > > kswapd3 process usage for almost 4days.
> > > > > > >
> > > > > > > Thanks for the update!
> > > > > > >
> > > > > > > Just to confirm:
> > > > > > > 1. MGLRU was enabled, and
> > > > > >
> > > > > > Yes, MGLRU is enabled
> > > > > >
> > > > > > > 2. The v6.6 deployed did NOT have the patch I attached earlier.
> > > > > >
> > > > > > Vanila 6.6, attached patch NOT applied.
> > > > > >
> > > > > > > Are both correct?
> > > > > > >
> > > > > > > If so, I'd very appreciate it if you could try the attached
> patch on
> > > > > > > top of v6.5 and see if it helps. My suspicion is that the
> problem is
> > > > > > > compaction related, i.e., kswapd was woken up by high order
> > > > > > > allocations but didn't properly stop. But what causes the
> behavior
> > > > > >
> > > > > > Sure, I can try it. Will inform you about progress.
> > > > >
> > > > > Thanks!
> > > > >
> > > > > > > difference on v6.5 between MGLRU and the active/inactive LRU
> still
> > > > > > > puzzles me --the problem might be somehow masked rather than
> fixed on
> > > > > > > v6.6.
> > > > > >
> > > > > > I'm not sure how I can help with the issue. Any suggestions on
> what to
> > > > > > change/try?
> > > > >
> > > > > Trying the attached patch is good enough for now :)
> > > >
> > > > So far I'm running the "6.5.y + patch" for 4 days without triggering
> > > > the infinite swap in//out usage.
> > > >
> > > > I'm observing a similar pattern in kswapd usage - "if it uses kswapd,
> > > > then it is in majority the kswapd3 - like the vanila 6.5.y which is
> > > > not observed with 6.6.y, (The Node's 3 free mem is 159 MB)
> > > > # ps ax | grep [k]swapd
> > > >     750 ?        S      0:00 [kswapd0]
> > > >     751 ?        S      0:00 [kswapd1]
> > > >     752 ?        S      0:00 [kswapd2]
> > > >     753 ?        S      0:02 [kswapd3]    <<<< it uses kswapd3, good
> > > > is that it is not continuous
> > > >     754 ?        S      0:00 [kswapd4]
> > > >     755 ?        S      0:00 [kswapd5]
> > > >     756 ?        S      0:00 [kswapd6]
> > > >     757 ?        S      0:00 [kswapd7]
> > > >     758 ?        S      0:00 [kswapd8]
> > > >     759 ?        S      0:00 [kswapd9]
> > > >     760 ?        S      0:00 [kswapd10]
> > > >     761 ?        S      0:00 [kswapd11]
> > > >     762 ?        S      0:00 [kswapd12]
> > > >     763 ?        S      0:00 [kswapd13]
> > > >     764 ?        S      0:00 [kswapd14]
> > > >     765 ?        S      0:00 [kswapd15]
> > > >
> > > > Good stuff is that the system did not end in a continuous loop of swap
> > > > in/out usage (at least so far) which is great. See attached
> > > > swap_in_out_good_vs_bad.png. I will keep it running for the next 3
> > > > days.
> > >
> > > Thanks again, Jaroslav!
> > >
> > > Just a note here: I suspect the problem still exists on v6.6 but
> > > somehow is masked, possibly by reduced memory usage from the kernel
> > > itself and more free memory for userspace. So to be on the safe side,
> > > I'll post the patch and credit you as the reporter and tester.
> >
> > Morning, let's wait. I reviewed the graph and the swap in/out started
> > to be happening from 1:50 AM CET. Slower than before (util of cpu
> > 0.3%) but it is doing in/out see attached png.
>
> I investigated it more, there was an operation issue and the system
> disabled multi-gen lru yesterday ~10 AM CET (our temporary workaround
> for this problem) by
>    echo N > /sys/kernel/mm/lru_gen/enabled
> when an alert was triggered by an unexpected setup of the server.
> Could it be that the patch is not functional if lru_gen/enabled is
> 0x0000?


That’s correct.

> I need to reboot the system and do the whole week's test again.


Thanks a lot!
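
For reference, the MGLRU state can be confirmed directly from sysfs before
the test is restarted; a minimal check, assuming the default sysfs layout:

    # 0x0000 means MGLRU is fully disabled; 0x0007 is the usual value with
    # all components enabled
    cat /sys/kernel/mm/lru_gen/enabled
    # re-enable all MGLRU components before rerunning the week-long test
    echo y > /sys/kernel/mm/lru_gen/enabled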

>

[-- Attachment #2: Type: text/html, Size: 26004 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-11-22 14:18                               ` Yu Zhao
@ 2023-11-29 13:54                                 ` Jaroslav Pulchart
  2023-12-01 23:52                                   ` Yu Zhao
  0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2023-11-29 13:54 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Charan Teja Kalla, Daniel Secik, Igor Raits, Kalesh Singh, akpm,
	linux-mm

> On Wed, Nov 22, 2023 at 12:31 AM Jaroslav Pulchart <jaroslav.pulchart@gooddata.com> wrote:
>>
>> >
>> > >
>> > > On Mon, Nov 20, 2023 at 1:42 AM Jaroslav Pulchart
>> > > <jaroslav.pulchart@gooddata.com> wrote:
>> > > >
>> > > > > On Tue, Nov 14, 2023 at 12:30 AM Jaroslav Pulchart
>> > > > > <jaroslav.pulchart@gooddata.com> wrote:
>> > > > > >
>> > > > > > >
>> > > > > > > On Mon, Nov 13, 2023 at 1:36 AM Jaroslav Pulchart
>> > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
>> > > > > > > >
>> > > > > > > > >
>> > > > > > > > > On Thu, Nov 9, 2023 at 3:58 AM Jaroslav Pulchart
>> > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
>> > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
>> > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
>> > > > > > > > > > > >
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
>> > > > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Hi Jaroslav,
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > Hi Yu Zhao
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > thanks for response, see answers inline:
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
>> > > > > > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > Hello,
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
>> > > > > > > > > > > > > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
>> > > > > > > > > > > > > > > > system (16numa domains).
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Kernel version please?
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > 6.5.y, but we saw it sooner as it is in investigation from 23th May
>> > > > > > > > > > > > > > (6.4.y and maybe even the 6.3.y).
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
>> > > > > > > > > > > > > for you if you run into other problems with v6.6.
>> > > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > > I will give it a try using 6.6.y. When it will work we can switch to
>> > > > > > > > > > > > 6.6.y instead of backporting the stuff to 6.5.y.
>> > > > > > > > > > > >
>> > > > > > > > > > > > > > > > Symptoms of my issue are
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > /A/ if mult-gen LRU is enabled
>> > > > > > > > > > > > > > > > 1/ [kswapd3] is consuming 100% CPU
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > >     top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34,
>> > > > > > > > > > > > > > > > 18.26, 15.01
>> > > > > > > > > > > > > > > >     Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
>> > > > > > > > > > > > > > > >     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,
>> > > > > > > > > > > > > > > > 0.4 si,  0.0 st
>> > > > > > > > > > > > > > > >     MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
>> > > > > > > > > > > > > > > >     MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
>> > > > > > > > > > > > > > > >     ...
>> > > > > > > > > > > > > > > >         765 root      20   0       0      0      0 R  98.3   0.0
>> > > > > > > > > > > > > > > > 34969:04 kswapd3
>> > > > > > > > > > > > > > > >     ...
>> > > > > > > > > > > > > > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
>> > > > > > > > > > > > > > > > observed with swap disk as well and cause IO latency issues due to
>> > > > > > > > > > > > > > > > some kind of locking)
>> > > > > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > /B/ if mult-gen LRU is disabled
>> > > > > > > > > > > > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
>> > > > > > > > > > > > > > > >     top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05,
>> > > > > > > > > > > > > > > > 17.77, 14.77
>> > > > > > > > > > > > > > > >     Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
>> > > > > > > > > > > > > > > >     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,
>> > > > > > > > > > > > > > > > 0.4 si,  0.0 st
>> > > > > > > > > > > > > > > >     MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
>> > > > > > > > > > > > > > > >     MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
>> > > > > > > > > > > > > > > >     ...
>> > > > > > > > > > > > > > > >        765 root      20   0       0      0      0 S   3.6   0.0
>> > > > > > > > > > > > > > > > 34966:46 [kswapd3]
>> > > > > > > > > > > > > > > >     ...
>> > > > > > > > > > > > > > > > 2/ swap space usage is low (4MB)
>> > > > > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > Both situations are wrong as they are using swap in/out extensively,
>> > > > > > > > > > > > > > > > however the multi-gen LRU situation is 10times worse.
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > From the stats below, node 3 had the lowest free memory. So I think in
>> > > > > > > > > > > > > > > both cases, the reclaim activities were as expected.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > I do not see a reason for the memory pressure and reclaims. This node
>> > > > > > > > > > > > > > has the lowest free memory of all nodes (~302MB free) that is true,
>> > > > > > > > > > > > > > however the swap space usage is just 4MB (still going in and out). So
>> > > > > > > > > > > > > > what can be the reason for that behaviour?
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > The best analogy is that refuel (reclaim) happens before the tank
>> > > > > > > > > > > > > becomes empty, and it happens even sooner when there is a long road
>> > > > > > > > > > > > > ahead (high order allocations).
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > > The workers/application is running in pre-allocated HugePages and the
>> > > > > > > > > > > > > > rest is used for a small set of system services and drivers of
>> > > > > > > > > > > > > > devices. It is static and not growing. The issue persists when I stop
>> > > > > > > > > > > > > > the system services and free the memory.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Yes, this helps.
>> > > > > > > > > > > > >  Also could you attach /proc/buddyinfo from the moment
>> > > > > > > > > > > > > you hit the problem?
>> > > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > > I can. The problem is continuous, it is 100% of time continuously
>> > > > > > > > > > > > doing in/out and consuming 100% of CPU and locking IO.
>> > > > > > > > > > > >
>> > > > > > > > > > > > The output of /proc/buddyinfo is:
>> > > > > > > > > > > >
>> > > > > > > > > > > > # cat /proc/buddyinfo
>> > > > > > > > > > > > Node 0, zone      DMA      7      2      2      1      1      2      1
>> > > > > > > > > > > >      1      1      2      1
>> > > > > > > > > > > > Node 0, zone    DMA32   4567   3395   1357    846    439    190     93
>> > > > > > > > > > > >     61     43     23      4
>> > > > > > > > > > > > Node 0, zone   Normal     19    190    140    129    136     75     66
>> > > > > > > > > > > >     41      9      1      5
>> > > > > > > > > > > > Node 1, zone   Normal    194   1210   2080   1800    715    255    111
>> > > > > > > > > > > >     56     42     36     55
>> > > > > > > > > > > > Node 2, zone   Normal    204    768   3766   3394   1742    468    185
>> > > > > > > > > > > >    194    238     47     74
>> > > > > > > > > > > > Node 3, zone   Normal   1622   2137   1058    846    388    208     97
>> > > > > > > > > > > >     44     14     42     10
>> > > > > > > > > > >
>> > > > > > > > > > > Again, thinking out loud: there is only one zone on node 3, i.e., the
>> > > > > > > > > > > normal zone, and this excludes the problem commit
>> > > > > > > > > > > 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
>> > > > > > > > > > > reclaim") fixed in v6.6.
>> > > > > > > > > >
>> > > > > > > > > > I built vanila 6.6.1 and did the first fast test - spin up and destroy
>> > > > > > > > > > VMs only - This test does not always trigger the kswapd3 continuous
>> > > > > > > > > > swap in/out  usage but it uses it and it  looks like there is a
>> > > > > > > > > > change:
>> > > > > > > > > >
>> > > > > > > > > >  I can see kswapd non-continous (15s and more) usage with 6.5.y
>> > > > > > > > > >  # ps ax | grep [k]swapd
>> > > > > > > > > >     753 ?        S      0:00 [kswapd0]
>> > > > > > > > > >     754 ?        S      0:00 [kswapd1]
>> > > > > > > > > >     755 ?        S      0:00 [kswapd2]
>> > > > > > > > > >     756 ?        S      0:15 [kswapd3]    <<<<<<<<<
>> > > > > > > > > >     757 ?        S      0:00 [kswapd4]
>> > > > > > > > > >     758 ?        S      0:00 [kswapd5]
>> > > > > > > > > >     759 ?        S      0:00 [kswapd6]
>> > > > > > > > > >     760 ?        S      0:00 [kswapd7]
>> > > > > > > > > >     761 ?        S      0:00 [kswapd8]
>> > > > > > > > > >     762 ?        S      0:00 [kswapd9]
>> > > > > > > > > >     763 ?        S      0:00 [kswapd10]
>> > > > > > > > > >     764 ?        S      0:00 [kswapd11]
>> > > > > > > > > >     765 ?        S      0:00 [kswapd12]
>> > > > > > > > > >     766 ?        S      0:00 [kswapd13]
>> > > > > > > > > >     767 ?        S      0:00 [kswapd14]
>> > > > > > > > > >     768 ?        S      0:00 [kswapd15]
>> > > > > > > > > >
>> > > > > > > > > > and none kswapd usage with 6.6.1, that looks to be promising path
>> > > > > > > > > >
>> > > > > > > > > > # ps ax | grep [k]swapd
>> > > > > > > > > >     808 ?        S      0:00 [kswapd0]
>> > > > > > > > > >     809 ?        S      0:00 [kswapd1]
>> > > > > > > > > >     810 ?        S      0:00 [kswapd2]
>> > > > > > > > > >     811 ?        S      0:00 [kswapd3]    <<<< nice
>> > > > > > > > > >     812 ?        S      0:00 [kswapd4]
>> > > > > > > > > >     813 ?        S      0:00 [kswapd5]
>> > > > > > > > > >     814 ?        S      0:00 [kswapd6]
>> > > > > > > > > >     815 ?        S      0:00 [kswapd7]
>> > > > > > > > > >     816 ?        S      0:00 [kswapd8]
>> > > > > > > > > >     817 ?        S      0:00 [kswapd9]
>> > > > > > > > > >     818 ?        S      0:00 [kswapd10]
>> > > > > > > > > >     819 ?        S      0:00 [kswapd11]
>> > > > > > > > > >     820 ?        S      0:00 [kswapd12]
>> > > > > > > > > >     821 ?        S      0:00 [kswapd13]
>> > > > > > > > > >     822 ?        S      0:00 [kswapd14]
>> > > > > > > > > >     823 ?        S      0:00 [kswapd15]
>> > > > > > > > > >
>> > > > > > > > > > I will install the 6.6.1 on the server which is doing some work and
>> > > > > > > > > > observe it later today.
>> > > > > > > > >
>> > > > > > > > > Thanks. Fingers crossed.
>> > > > > > > >
>> > > > > > > > The 6.6.y was deployed and used from 9th Nov 3PM CEST. So far so good.
>> > > > > > > > The node 3 has 163MiB free of memory and I see
>> > > > > > > > just a few in/out swap usage sometimes (which is expected) and minimal
>> > > > > > > > kswapd3 process usage for almost 4days.
>> > > > > > >
>> > > > > > > Thanks for the update!
>> > > > > > >
>> > > > > > > Just to confirm:
>> > > > > > > 1. MGLRU was enabled, and
>> > > > > >
>> > > > > > Yes, MGLRU is enabled
>> > > > > >
>> > > > > > > 2. The v6.6 deployed did NOT have the patch I attached earlier.
>> > > > > >
>> > > > > > Vanila 6.6, attached patch NOT applied.
>> > > > > >
>> > > > > > > Are both correct?
>> > > > > > >
>> > > > > > > If so, I'd very appreciate it if you could try the attached patch on
>> > > > > > > top of v6.5 and see if it helps. My suspicion is that the problem is
>> > > > > > > compaction related, i.e., kswapd was woken up by high order
>> > > > > > > allocations but didn't properly stop. But what causes the behavior
>> > > > > >
>> > > > > > Sure, I can try it. Will inform you about progress.
>> > > > >
>> > > > > Thanks!
>> > > > >
>> > > > > > > difference on v6.5 between MGLRU and the active/inactive LRU still
>> > > > > > > puzzles me --the problem might be somehow masked rather than fixed on
>> > > > > > > v6.6.
>> > > > > >
>> > > > > > I'm not sure how I can help with the issue. Any suggestions on what to
>> > > > > > change/try?
>> > > > >
>> > > > > Trying the attached patch is good enough for now :)
>> > > >
>> > > > So far I'm running the "6.5.y + patch" for 4 days without triggering
>> > > > the infinite swap in//out usage.
>> > > >
>> > > > I'm observing a similar pattern in kswapd usage - "if it uses kswapd,
>> > > > then it is in majority the kswapd3 - like the vanila 6.5.y which is
>> > > > not observed with 6.6.y, (The Node's 3 free mem is 159 MB)
>> > > > # ps ax | grep [k]swapd
>> > > >     750 ?        S      0:00 [kswapd0]
>> > > >     751 ?        S      0:00 [kswapd1]
>> > > >     752 ?        S      0:00 [kswapd2]
>> > > >     753 ?        S      0:02 [kswapd3]    <<<< it uses kswapd3, good
>> > > > is that it is not continuous
>> > > >     754 ?        S      0:00 [kswapd4]
>> > > >     755 ?        S      0:00 [kswapd5]
>> > > >     756 ?        S      0:00 [kswapd6]
>> > > >     757 ?        S      0:00 [kswapd7]
>> > > >     758 ?        S      0:00 [kswapd8]
>> > > >     759 ?        S      0:00 [kswapd9]
>> > > >     760 ?        S      0:00 [kswapd10]
>> > > >     761 ?        S      0:00 [kswapd11]
>> > > >     762 ?        S      0:00 [kswapd12]
>> > > >     763 ?        S      0:00 [kswapd13]
>> > > >     764 ?        S      0:00 [kswapd14]
>> > > >     765 ?        S      0:00 [kswapd15]
>> > > >
>> > > > Good stuff is that the system did not end in a continuous loop of swap
>> > > > in/out usage (at least so far) which is great. See attached
>> > > > swap_in_out_good_vs_bad.png. I will keep it running for the next 3
>> > > > days.
>> > >
>> > > Thanks again, Jaroslav!
>> > >
>> > > Just a note here: I suspect the problem still exists on v6.6 but
>> > > somehow is masked, possibly by reduced memory usage from the kernel
>> > > itself and more free memory for userspace. So to be on the safe side,
>> > > I'll post the patch and credit you as the reporter and tester.
>> >
>> > Morning, let's wait. I reviewed the graph and the swap in/out started
>> > to be happening from 1:50 AM CET. Slower than before (util of cpu
>> > 0.3%) but it is doing in/out see attached png.
>>
>> I investigated it more, there was an operation issue and the system
>> disabled multi-gen lru yesterday ~10 AM CET (our temporary workaround
>> for this problem) by
>>    echo N > /sys/kernel/mm/lru_gen/enabled
>> when an alert was triggered by an unexpected setup of the server.
>> Could it be that the patch is not functional if lru_gen/enabled is
>> 0x0000?
>
>
> That’s correct.
>
>> I need to reboot the system and do the whole week's test again.
>
>
> Thanks a lot!

The server with 6.5.y + the lru patch is stable; no continuous swap in/out
has been observed in the last 7 days!

I assume the fix is correct. Can you share the final patch for 6.6.y with
me? I will use it in our kernel builds until it lands upstream.
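
For completeness, one simple way to keep watching for the continuous swap
in/out pattern; a rough sketch using standard tools:

    # cumulative swap-in/out page counters; on a healthy node they stay flat
    grep -E '^pswp(in|out)' /proc/vmstat
    # si/so columns give the current swap-in/out rate
    vmstat 1
    # accumulated CPU time of the per-node kswapd threads
    ps ax | grep [k]swapd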


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-11-29 13:54                                 ` Jaroslav Pulchart
@ 2023-12-01 23:52                                   ` Yu Zhao
  2023-12-07  8:46                                     ` Charan Teja Kalla
  0 siblings, 1 reply; 30+ messages in thread
From: Yu Zhao @ 2023-12-01 23:52 UTC (permalink / raw)
  To: Jaroslav Pulchart, Charan Teja Kalla
  Cc: Daniel Secik, Igor Raits, Kalesh Singh, akpm, linux-mm

On Wed, Nov 29, 2023 at 6:54 AM Jaroslav Pulchart
<jaroslav.pulchart@gooddata.com> wrote:
>
> > On Wed, Nov 22, 2023 at 12:31 AM Jaroslav Pulchart <jaroslav.pulchart@gooddata.com> wrote:
> >>
> >> >
> >> > >
> >> > > On Mon, Nov 20, 2023 at 1:42 AM Jaroslav Pulchart
> >> > > <jaroslav.pulchart@gooddata.com> wrote:
> >> > > >
> >> > > > > On Tue, Nov 14, 2023 at 12:30 AM Jaroslav Pulchart
> >> > > > > <jaroslav.pulchart@gooddata.com> wrote:
> >> > > > > >
> >> > > > > > >
> >> > > > > > > On Mon, Nov 13, 2023 at 1:36 AM Jaroslav Pulchart
> >> > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> >> > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > On Thu, Nov 9, 2023 at 3:58 AM Jaroslav Pulchart
> >> > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> >> > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > > On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
> >> > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> >> > > > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Hi Jaroslav,
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Hi Yu Zhao
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > thanks for response, see answers inline:
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> >> > > > > > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > Hello,
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> >> > > > > > > > > > > > > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> >> > > > > > > > > > > > > > > > system (16numa domains).
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Kernel version please?
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > 6.5.y, but we saw it sooner as it is in investigation from 23th May
> >> > > > > > > > > > > > > > (6.4.y and maybe even the 6.3.y).
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> >> > > > > > > > > > > > > for you if you run into other problems with v6.6.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > I will give it a try using 6.6.y. When it will work we can switch to
> >> > > > > > > > > > > > 6.6.y instead of backporting the stuff to 6.5.y.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > Symptoms of my issue are
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > /A/ if mult-gen LRU is enabled
> >> > > > > > > > > > > > > > > > 1/ [kswapd3] is consuming 100% CPU
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > >     top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34,
> >> > > > > > > > > > > > > > > > 18.26, 15.01
> >> > > > > > > > > > > > > > > >     Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
> >> > > > > > > > > > > > > > > >     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,
> >> > > > > > > > > > > > > > > > 0.4 si,  0.0 st
> >> > > > > > > > > > > > > > > >     MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
> >> > > > > > > > > > > > > > > >     MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
> >> > > > > > > > > > > > > > > >     ...
> >> > > > > > > > > > > > > > > >         765 root      20   0       0      0      0 R  98.3   0.0
> >> > > > > > > > > > > > > > > > 34969:04 kswapd3
> >> > > > > > > > > > > > > > > >     ...
> >> > > > > > > > > > > > > > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> >> > > > > > > > > > > > > > > > observed with swap disk as well and cause IO latency issues due to
> >> > > > > > > > > > > > > > > > some kind of locking)
> >> > > > > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > /B/ if mult-gen LRU is disabled
> >> > > > > > > > > > > > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> >> > > > > > > > > > > > > > > >     top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05,
> >> > > > > > > > > > > > > > > > 17.77, 14.77
> >> > > > > > > > > > > > > > > >     Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
> >> > > > > > > > > > > > > > > >     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,
> >> > > > > > > > > > > > > > > > 0.4 si,  0.0 st
> >> > > > > > > > > > > > > > > >     MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
> >> > > > > > > > > > > > > > > >     MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
> >> > > > > > > > > > > > > > > >     ...
> >> > > > > > > > > > > > > > > >        765 root      20   0       0      0      0 S   3.6   0.0
> >> > > > > > > > > > > > > > > > 34966:46 [kswapd3]
> >> > > > > > > > > > > > > > > >     ...
> >> > > > > > > > > > > > > > > > 2/ swap space usage is low (4MB)
> >> > > > > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > Both situations are wrong as they are using swap in/out extensively,
> >> > > > > > > > > > > > > > > > however the multi-gen LRU situation is 10times worse.
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > From the stats below, node 3 had the lowest free memory. So I think in
> >> > > > > > > > > > > > > > > both cases, the reclaim activities were as expected.
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > I do not see a reason for the memory pressure and reclaims. This node
> >> > > > > > > > > > > > > > has the lowest free memory of all nodes (~302MB free) that is true,
> >> > > > > > > > > > > > > > however the swap space usage is just 4MB (still going in and out). So
> >> > > > > > > > > > > > > > what can be the reason for that behaviour?
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > The best analogy is that refuel (reclaim) happens before the tank
> >> > > > > > > > > > > > > becomes empty, and it happens even sooner when there is a long road
> >> > > > > > > > > > > > > ahead (high order allocations).
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > > The workers/application is running in pre-allocated HugePages and the
> >> > > > > > > > > > > > > > rest is used for a small set of system services and drivers of
> >> > > > > > > > > > > > > > devices. It is static and not growing. The issue persists when I stop
> >> > > > > > > > > > > > > > the system services and free the memory.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > Yes, this helps.
> >> > > > > > > > > > > > >  Also could you attach /proc/buddyinfo from the moment
> >> > > > > > > > > > > > > you hit the problem?
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > I can. The problem is continuous, it is 100% of time continuously
> >> > > > > > > > > > > > doing in/out and consuming 100% of CPU and locking IO.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > The output of /proc/buddyinfo is:
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > # cat /proc/buddyinfo
> >> > > > > > > > > > > > Node 0, zone      DMA      7      2      2      1      1      2      1
> >> > > > > > > > > > > >      1      1      2      1
> >> > > > > > > > > > > > Node 0, zone    DMA32   4567   3395   1357    846    439    190     93
> >> > > > > > > > > > > >     61     43     23      4
> >> > > > > > > > > > > > Node 0, zone   Normal     19    190    140    129    136     75     66
> >> > > > > > > > > > > >     41      9      1      5
> >> > > > > > > > > > > > Node 1, zone   Normal    194   1210   2080   1800    715    255    111
> >> > > > > > > > > > > >     56     42     36     55
> >> > > > > > > > > > > > Node 2, zone   Normal    204    768   3766   3394   1742    468    185
> >> > > > > > > > > > > >    194    238     47     74
> >> > > > > > > > > > > > Node 3, zone   Normal   1622   2137   1058    846    388    208     97
> >> > > > > > > > > > > >     44     14     42     10
> >> > > > > > > > > > >
> >> > > > > > > > > > > Again, thinking out loud: there is only one zone on node 3, i.e., the
> >> > > > > > > > > > > normal zone, and this excludes the problem commit
> >> > > > > > > > > > > 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
> >> > > > > > > > > > > reclaim") fixed in v6.6.
> >> > > > > > > > > >
> >> > > > > > > > > > I built vanila 6.6.1 and did the first fast test - spin up and destroy
> >> > > > > > > > > > VMs only - This test does not always trigger the kswapd3 continuous
> >> > > > > > > > > > swap in/out  usage but it uses it and it  looks like there is a
> >> > > > > > > > > > change:
> >> > > > > > > > > >
> >> > > > > > > > > >  I can see kswapd non-continous (15s and more) usage with 6.5.y
> >> > > > > > > > > >  # ps ax | grep [k]swapd
> >> > > > > > > > > >     753 ?        S      0:00 [kswapd0]
> >> > > > > > > > > >     754 ?        S      0:00 [kswapd1]
> >> > > > > > > > > >     755 ?        S      0:00 [kswapd2]
> >> > > > > > > > > >     756 ?        S      0:15 [kswapd3]    <<<<<<<<<
> >> > > > > > > > > >     757 ?        S      0:00 [kswapd4]
> >> > > > > > > > > >     758 ?        S      0:00 [kswapd5]
> >> > > > > > > > > >     759 ?        S      0:00 [kswapd6]
> >> > > > > > > > > >     760 ?        S      0:00 [kswapd7]
> >> > > > > > > > > >     761 ?        S      0:00 [kswapd8]
> >> > > > > > > > > >     762 ?        S      0:00 [kswapd9]
> >> > > > > > > > > >     763 ?        S      0:00 [kswapd10]
> >> > > > > > > > > >     764 ?        S      0:00 [kswapd11]
> >> > > > > > > > > >     765 ?        S      0:00 [kswapd12]
> >> > > > > > > > > >     766 ?        S      0:00 [kswapd13]
> >> > > > > > > > > >     767 ?        S      0:00 [kswapd14]
> >> > > > > > > > > >     768 ?        S      0:00 [kswapd15]
> >> > > > > > > > > >
> >> > > > > > > > > > and none kswapd usage with 6.6.1, that looks to be promising path
> >> > > > > > > > > >
> >> > > > > > > > > > # ps ax | grep [k]swapd
> >> > > > > > > > > >     808 ?        S      0:00 [kswapd0]
> >> > > > > > > > > >     809 ?        S      0:00 [kswapd1]
> >> > > > > > > > > >     810 ?        S      0:00 [kswapd2]
> >> > > > > > > > > >     811 ?        S      0:00 [kswapd3]    <<<< nice
> >> > > > > > > > > >     812 ?        S      0:00 [kswapd4]
> >> > > > > > > > > >     813 ?        S      0:00 [kswapd5]
> >> > > > > > > > > >     814 ?        S      0:00 [kswapd6]
> >> > > > > > > > > >     815 ?        S      0:00 [kswapd7]
> >> > > > > > > > > >     816 ?        S      0:00 [kswapd8]
> >> > > > > > > > > >     817 ?        S      0:00 [kswapd9]
> >> > > > > > > > > >     818 ?        S      0:00 [kswapd10]
> >> > > > > > > > > >     819 ?        S      0:00 [kswapd11]
> >> > > > > > > > > >     820 ?        S      0:00 [kswapd12]
> >> > > > > > > > > >     821 ?        S      0:00 [kswapd13]
> >> > > > > > > > > >     822 ?        S      0:00 [kswapd14]
> >> > > > > > > > > >     823 ?        S      0:00 [kswapd15]
> >> > > > > > > > > >
> >> > > > > > > > > > I will install the 6.6.1 on the server which is doing some work and
> >> > > > > > > > > > observe it later today.
> >> > > > > > > > >
> >> > > > > > > > > Thanks. Fingers crossed.
> >> > > > > > > >
> >> > > > > > > > The 6.6.y was deployed and used from 9th Nov 3PM CEST. So far so good.
> >> > > > > > > > The node 3 has 163MiB free of memory and I see
> >> > > > > > > > just a few in/out swap usage sometimes (which is expected) and minimal
> >> > > > > > > > kswapd3 process usage for almost 4days.
> >> > > > > > >
> >> > > > > > > Thanks for the update!
> >> > > > > > >
> >> > > > > > > Just to confirm:
> >> > > > > > > 1. MGLRU was enabled, and
> >> > > > > >
> >> > > > > > Yes, MGLRU is enabled
> >> > > > > >
> >> > > > > > > 2. The v6.6 deployed did NOT have the patch I attached earlier.
> >> > > > > >
> >> > > > > > Vanila 6.6, attached patch NOT applied.
> >> > > > > >
> >> > > > > > > Are both correct?
> >> > > > > > >
> >> > > > > > > If so, I'd very appreciate it if you could try the attached patch on
> >> > > > > > > top of v6.5 and see if it helps. My suspicion is that the problem is
> >> > > > > > > compaction related, i.e., kswapd was woken up by high order
> >> > > > > > > allocations but didn't properly stop. But what causes the behavior
> >> > > > > >
> >> > > > > > Sure, I can try it. Will inform you about progress.
> >> > > > >
> >> > > > > Thanks!
> >> > > > >
> >> > > > > > > difference on v6.5 between MGLRU and the active/inactive LRU still
> >> > > > > > > puzzles me --the problem might be somehow masked rather than fixed on
> >> > > > > > > v6.6.
> >> > > > > >
> >> > > > > > I'm not sure how I can help with the issue. Any suggestions on what to
> >> > > > > > change/try?
> >> > > > >
> >> > > > > Trying the attached patch is good enough for now :)
> >> > > >
> >> > > > So far I'm running the "6.5.y + patch" for 4 days without triggering
> >> > > > the infinite swap in//out usage.
> >> > > >
> >> > > > I'm observing a similar pattern in kswapd usage - "if it uses kswapd,
> >> > > > then it is in majority the kswapd3 - like the vanila 6.5.y which is
> >> > > > not observed with 6.6.y, (The Node's 3 free mem is 159 MB)
> >> > > > # ps ax | grep [k]swapd
> >> > > >     750 ?        S      0:00 [kswapd0]
> >> > > >     751 ?        S      0:00 [kswapd1]
> >> > > >     752 ?        S      0:00 [kswapd2]
> >> > > >     753 ?        S      0:02 [kswapd3]    <<<< it uses kswapd3, good
> >> > > > is that it is not continuous
> >> > > >     754 ?        S      0:00 [kswapd4]
> >> > > >     755 ?        S      0:00 [kswapd5]
> >> > > >     756 ?        S      0:00 [kswapd6]
> >> > > >     757 ?        S      0:00 [kswapd7]
> >> > > >     758 ?        S      0:00 [kswapd8]
> >> > > >     759 ?        S      0:00 [kswapd9]
> >> > > >     760 ?        S      0:00 [kswapd10]
> >> > > >     761 ?        S      0:00 [kswapd11]
> >> > > >     762 ?        S      0:00 [kswapd12]
> >> > > >     763 ?        S      0:00 [kswapd13]
> >> > > >     764 ?        S      0:00 [kswapd14]
> >> > > >     765 ?        S      0:00 [kswapd15]
> >> > > >
> >> > > > Good stuff is that the system did not end in a continuous loop of swap
> >> > > > in/out usage (at least so far) which is great. See attached
> >> > > > swap_in_out_good_vs_bad.png. I will keep it running for the next 3
> >> > > > days.
> >> > >
> >> > > Thanks again, Jaroslav!
> >> > >
> >> > > Just a note here: I suspect the problem still exists on v6.6 but
> >> > > somehow is masked, possibly by reduced memory usage from the kernel
> >> > > itself and more free memory for userspace. So to be on the safe side,
> >> > > I'll post the patch and credit you as the reporter and tester.
> >> >
> >> > Morning, let's wait. I reviewed the graph and the swap in/out started
> >> > to be happening from 1:50 AM CET. Slower than before (util of cpu
> >> > 0.3%) but it is doing in/out see attached png.
> >>
> >> I investigated it more, there was an operation issue and the system
> >> disabled multi-gen lru yesterday ~10 AM CET (our temporary workaround
> >> for this problem) by
> >>    echo N > /sys/kernel/mm/lru_gen/enabled
> >> when an alert was triggered by an unexpected setup of the server.
> >> Could it be that the patch is not functional if lru_gen/enabled is
> >> 0x0000?
> >
> >
> > That’s correct.
> >
> >> I need to reboot the system and do the whole week's test again.
> >
> >
> > Thanks a lot!
>
> The server with 6.5.y + lru patch is stable, no continuous swap in/out
> is observed in the last 7days!
>
> I assume the fix is correct. Can you share with me the final patch for
> 6.6.y, I will use in our kernel builds till it is in the upstream.

Will do. Thank you.

Charan, does the fix previously attached seem acceptable to you? Any
additional feedback? Thanks.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-12-01 23:52                                   ` Yu Zhao
@ 2023-12-07  8:46                                     ` Charan Teja Kalla
  2023-12-07 18:23                                       ` Yu Zhao
  2023-12-08  8:03                                       ` Jaroslav Pulchart
  0 siblings, 2 replies; 30+ messages in thread
From: Charan Teja Kalla @ 2023-12-07  8:46 UTC (permalink / raw)
  To: Yu Zhao, Jaroslav Pulchart
  Cc: Daniel Secik, Igor Raits, Kalesh Singh, akpm, linux-mm

Hi Yu,

On 12/2/2023 5:22 AM, Yu Zhao wrote:
> Charan, does the fix previously attached seem acceptable to you? Any
> additional feedback? Thanks.

First, thanks for taking this patch upstream.

A comment on the code snippet: checking just the 'high wmark' pages might
succeed here but can still fail in the immediate kswapd sleep check, see
prepare_kswapd_sleep(). This can show up as an increased
KSWAPD_HIGH_WMARK_HIT_QUICKLY count, and thus unnecessary kswapd run time.
@Jaroslav: Have you observed something like the above?
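
One quick way to check that, as a rough sketch reading the counters
straight from /proc/vmstat:

    # kswapd_high_wmark_hit_quickly growing much faster than pageoutrun
    # would indicate kswapd being woken up and put back to sleep needlessly
    grep -E 'kswapd_(low|high)_wmark_hit_quickly|pageoutrun' /proc/vmstat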

So, in our downstream kernel, we pass something like the following to
zone_watermark_ok():
    unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH << 2;

It is hard to justify the empirical 'MIN_LRU_BATCH << 2' value; maybe we
should at least use 'MIN_LRU_BATCH' with the reasoning mentioned above, is
all I can say for this patch.

+	mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
+	       WMARK_PROMO : WMARK_HIGH;
+	for (i = 0; i <= sc->reclaim_idx; i++) {
+		struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
+		unsigned long size = wmark_pages(zone, mark);
+
+		if (managed_zone(zone) &&
+		    !zone_watermark_ok(zone, sc->order, size, sc->reclaim_idx, 0))
+			return false;
+	}


Thanks,
Charan


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-12-07  8:46                                     ` Charan Teja Kalla
@ 2023-12-07 18:23                                       ` Yu Zhao
  2023-12-08  8:03                                       ` Jaroslav Pulchart
  1 sibling, 0 replies; 30+ messages in thread
From: Yu Zhao @ 2023-12-07 18:23 UTC (permalink / raw)
  To: Charan Teja Kalla
  Cc: Jaroslav Pulchart, Daniel Secik, Igor Raits, Kalesh Singh, akpm,
	linux-mm

On Thu, Dec 7, 2023 at 1:47 AM Charan Teja Kalla
<quic_charante@quicinc.com> wrote:
>
> Hi yu,
>
> On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > Charan, does the fix previously attached seem acceptable to you? Any
> > additional feedback? Thanks.
>
> First, thanks for taking this patch to upstream.
>
> A comment in code snippet is checking just 'high wmark' pages might
> succeed here but can fail in the immediate kswapd sleep, see
> prepare_kswapd_sleep(). This can show up into the increased
> KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary kswapd run time.
> @Jaroslav: Have you observed something like above?
>
> So, in downstream, we have something like for zone_watermark_ok():
> unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH << 2;
>
> Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value, may be we
> should atleast use the 'MIN_LRU_BATCH' with the mentioned reasoning, is
> what all I can say for this patch.

Yeah, we can add MIN_LRU_BATCH on top of the high watermark.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-12-07  8:46                                     ` Charan Teja Kalla
  2023-12-07 18:23                                       ` Yu Zhao
@ 2023-12-08  8:03                                       ` Jaroslav Pulchart
  2024-01-03 21:30                                         ` Jaroslav Pulchart
  1 sibling, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2023-12-08  8:03 UTC (permalink / raw)
  To: Charan Teja Kalla
  Cc: Yu Zhao, Daniel Secik, Igor Raits, Kalesh Singh, akpm, linux-mm

>
> Hi yu,
>
> On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > Charan, does the fix previously attached seem acceptable to you? Any
> > additional feedback? Thanks.
>
> First, thanks for taking this patch to upstream.
>
> A comment in code snippet is checking just 'high wmark' pages might
> succeed here but can fail in the immediate kswapd sleep, see
> prepare_kswapd_sleep(). This can show up into the increased
> KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary kswapd run time.
> @Jaroslav: Have you observed something like above?

I do not see any unnecessary kswapd run time; on the contrary, it fixes
the continuous kswapd run issue.

>
> So, in downstream, we have something like for zone_watermark_ok():
> unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH << 2;
>
> Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value, may be we
> should atleast use the 'MIN_LRU_BATCH' with the mentioned reasoning, is
> what all I can say for this patch.
>
> +       mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
> +              WMARK_PROMO : WMARK_HIGH;
> +       for (i = 0; i <= sc->reclaim_idx; i++) {
> +               struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
> +               unsigned long size = wmark_pages(zone, mark);
> +
> +               if (managed_zone(zone) &&
> +                   !zone_watermark_ok(zone, sc->order, size, sc->reclaim_idx, 0))
> +                       return false;
> +       }
>
>
> Thanks,
> Charan



-- 
Jaroslav Pulchart
Sr. Principal SW Engineer
GoodData


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-12-08  8:03                                       ` Jaroslav Pulchart
@ 2024-01-03 21:30                                         ` Jaroslav Pulchart
  2024-01-04  3:03                                           ` Yu Zhao
  0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2024-01-03 21:30 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Daniel Secik, Charan Teja Kalla, Igor Raits, Kalesh Singh, akpm,
	linux-mm

>
> >
> > Hi yu,
> >
> > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > Charan, does the fix previously attached seem acceptable to you? Any
> > > additional feedback? Thanks.
> >
> > First, thanks for taking this patch to upstream.
> >
> > A comment in code snippet is checking just 'high wmark' pages might
> > succeed here but can fail in the immediate kswapd sleep, see
> > prepare_kswapd_sleep(). This can show up into the increased
> > KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary kswapd run time.
> > @Jaroslav: Have you observed something like above?
>
> I do not see any unnecessary kswapd run time, on the contrary it is
> fixing the kswapd continuous run issue.
>
> >
> > So, in downstream, we have something like for zone_watermark_ok():
> > unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH << 2;
> >
> > Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value, may be we
> > should atleast use the 'MIN_LRU_BATCH' with the mentioned reasoning, is
> > what all I can say for this patch.
> >
> > +       mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
> > +              WMARK_PROMO : WMARK_HIGH;
> > +       for (i = 0; i <= sc->reclaim_idx; i++) {
> > +               struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
> > +               unsigned long size = wmark_pages(zone, mark);
> > +
> > +               if (managed_zone(zone) &&
> > +                   !zone_watermark_ok(zone, sc->order, size, sc->reclaim_idx, 0))
> > +                       return false;
> > +       }
> >
> >
> > Thanks,
> > Charan
>
>
>
> --
> Jaroslav Pulchart
> Sr. Principal SW Engineer
> GoodData


Hello,

Today we tried to update the servers to 6.6.9, which contains the mglru
fixes (from 6.6.8), and the server behaves much, much worse.

Multiple kswapd* threads immediately went to ~100% load:
    555 root      20   0       0      0      0 R  99.7   0.0   4:32.86 kswapd1
    554 root      20   0       0      0      0 R  99.3   0.0   3:57.76 kswapd0
    556 root      20   0       0      0      0 R  97.7   0.0   3:42.27 kswapd2
Are the changes that went upstream different from the initial patch
which I tested?
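
(One way to compare, as a rough sketch assuming a local linux-stable
checkout, is to list what touched mm/vmscan.c between the releases:)

    # vmscan/MGLRU changes that went into the 6.6.8 and 6.6.9 stable releases
    git log --oneline v6.6.7..v6.6.9 -- mm/vmscan.c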

Best regards,
Jaroslav Pulchart


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2024-01-03 21:30                                         ` Jaroslav Pulchart
@ 2024-01-04  3:03                                           ` Yu Zhao
  2024-01-04  9:46                                             ` Jaroslav Pulchart
  0 siblings, 1 reply; 30+ messages in thread
From: Yu Zhao @ 2024-01-04  3:03 UTC (permalink / raw)
  To: Jaroslav Pulchart
  Cc: Daniel Secik, Charan Teja Kalla, Igor Raits, Kalesh Singh, akpm,
	linux-mm

[-- Attachment #1: Type: text/plain, Size: 2812 bytes --]

On Wed, Jan 3, 2024 at 2:30 PM Jaroslav Pulchart
<jaroslav.pulchart@gooddata.com> wrote:
>
> >
> > >
> > > Hi yu,
> > >
> > > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > > Charan, does the fix previously attached seem acceptable to you? Any
> > > > additional feedback? Thanks.
> > >
> > > First, thanks for taking this patch to upstream.
> > >
> > > A comment in code snippet is checking just 'high wmark' pages might
> > > succeed here but can fail in the immediate kswapd sleep, see
> > > prepare_kswapd_sleep(). This can show up into the increased
> > > KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary kswapd run time.
> > > @Jaroslav: Have you observed something like above?
> >
> > I do not see any unnecessary kswapd run time, on the contrary it is
> > fixing the kswapd continuous run issue.
> >
> > >
> > > So, in downstream, we have something like for zone_watermark_ok():
> > > unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH << 2;
> > >
> > > Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value, may be we
> > > should atleast use the 'MIN_LRU_BATCH' with the mentioned reasoning, is
> > > what all I can say for this patch.
> > >
> > > +       mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
> > > +              WMARK_PROMO : WMARK_HIGH;
> > > +       for (i = 0; i <= sc->reclaim_idx; i++) {
> > > +               struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
> > > +               unsigned long size = wmark_pages(zone, mark);
> > > +
> > > +               if (managed_zone(zone) &&
> > > +                   !zone_watermark_ok(zone, sc->order, size, sc->reclaim_idx, 0))
> > > +                       return false;
> > > +       }
> > >
> > >
> > > Thanks,
> > > Charan
> >
> >
> >
> > --
> > Jaroslav Pulchart
> > Sr. Principal SW Engineer
> > GoodData
>
>
> Hello,
>
> today we try to update servers to 6.6.9 which contains the mglru fixes
> (from 6.6.8) and the server behaves much much worse.
>
> I got multiple kswapd* load to ~100% imediatelly.
>     555 root      20   0       0      0      0 R  99.7   0.0   4:32.86
> kswapd1
>     554 root      20   0       0      0      0 R  99.3   0.0   3:57.76
> kswapd0
>     556 root      20   0       0      0      0 R  97.7   0.0   3:42.27
> kswapd2
> are the changes in upstream different compared to the initial patch
> which I tested?
>
> Best regards,
> Jaroslav Pulchart

Hi Jaroslav,

My apologies for all the trouble!

Yes, there is a slight difference between the fix you verified and
what went into 6.6.9. The fix in 6.6.9 is disabled under a special
condition which I thought wouldn't affect you.

Could you try the attached fix again on top of 6.6.9? It removed that
special condition.

Thanks!
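
In case it helps, applying it should be straightforward; a rough sketch,
assuming a v6.6.9 source tree and the attached file name:

    cd linux-6.6.9
    # mglru-fix-6.6.9.patch is the attachment from this mail
    patch -p1 < ../mglru-fix-6.6.9.patch
    # then rebuild and boot the patched kernel as usual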

[-- Attachment #2: mglru-fix-6.6.9.patch --]
[-- Type: application/octet-stream, Size: 975 bytes --]

diff --git a/mm/vmscan.c b/mm/vmscan.c
index dcc264d3c92f..ae3f73fc933c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5358,8 +5358,7 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
 	if (sc->nr_reclaimed >= max(sc->nr_to_reclaim, compact_gap(sc->order)))
 		return true;
 
-	/* check the order to exclude compaction-induced reclaim */
-	if (!current_is_kswapd() || sc->order)
+	if (!current_is_kswapd())
 		return false;
 
 	mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
@@ -5367,7 +5366,7 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
 
 	for (i = 0; i <= sc->reclaim_idx; i++) {
 		struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
-		unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH;
+		unsigned long size = wmark_pages(zone, mark) + min_wmark_pages(zone);
 
 		if (managed_zone(zone) && !zone_watermark_ok(zone, 0, size, sc->reclaim_idx, 0))
 			return false;

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2024-01-04  3:03                                           ` Yu Zhao
@ 2024-01-04  9:46                                             ` Jaroslav Pulchart
  2024-01-04 14:34                                               ` Jaroslav Pulchart
  0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2024-01-04  9:46 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Daniel Secik, Charan Teja Kalla, Igor Raits, Kalesh Singh, akpm,
	linux-mm

>
> On Wed, Jan 3, 2024 at 2:30 PM Jaroslav Pulchart
> <jaroslav.pulchart@gooddata.com> wrote:
> >
> > >
> > > >
> > > > Hi yu,
> > > >
> > > > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > > > Charan, does the fix previously attached seem acceptable to you? Any
> > > > > additional feedback? Thanks.
> > > >
> > > > First, thanks for taking this patch to upstream.
> > > >
> > > > A comment in code snippet is checking just 'high wmark' pages might
> > > > succeed here but can fail in the immediate kswapd sleep, see
> > > > prepare_kswapd_sleep(). This can show up into the increased
> > > > KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary kswapd run time.
> > > > @Jaroslav: Have you observed something like above?
> > >
> > > I do not see any unnecessary kswapd run time, on the contrary it is
> > > fixing the kswapd continuous run issue.
> > >
> > > >
> > > > So, in downstream, we have something like for zone_watermark_ok():
> > > > unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH << 2;
> > > >
> > > > Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value, may be we
> > > > should atleast use the 'MIN_LRU_BATCH' with the mentioned reasoning, is
> > > > what all I can say for this patch.
> > > >
> > > > +       mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
> > > > +              WMARK_PROMO : WMARK_HIGH;
> > > > +       for (i = 0; i <= sc->reclaim_idx; i++) {
> > > > +               struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
> > > > +               unsigned long size = wmark_pages(zone, mark);
> > > > +
> > > > +               if (managed_zone(zone) &&
> > > > +                   !zone_watermark_ok(zone, sc->order, size, sc->reclaim_idx, 0))
> > > > +                       return false;
> > > > +       }
> > > >
> > > >
> > > > Thanks,
> > > > Charan
> > >
> > >
> > >
> > > --
> > > Jaroslav Pulchart
> > > Sr. Principal SW Engineer
> > > GoodData
> >
> >
> > Hello,
> >
> > today we try to update servers to 6.6.9 which contains the mglru fixes
> > (from 6.6.8) and the server behaves much much worse.
> >
> > I got multiple kswapd* load to ~100% imediatelly.
> >     555 root      20   0       0      0      0 R  99.7   0.0   4:32.86
> > kswapd1
> >     554 root      20   0       0      0      0 R  99.3   0.0   3:57.76
> > kswapd0
> >     556 root      20   0       0      0      0 R  97.7   0.0   3:42.27
> > kswapd2
> > are the changes in upstream different compared to the initial patch
> > which I tested?
> >
> > Best regards,
> > Jaroslav Pulchart
>
> Hi Jaroslav,
>
> My apologies for all the trouble!
>
> Yes, there is a slight difference between the fix you verified and
> what went into 6.6.9. The fix in 6.6.9 is disabled under a special
> condition which I thought wouldn't affect you.
>
> Could you try the attached fix again on top of 6.6.9? It removed that
> special condition.
>
> Thanks!

Thanks for the prompt response. I did a test with the patch and it didn't
help. The situation is super strange.

I tried kernels 6.6.7, 6.6.8 and 6.6.9. With 6.6.9 I see high memory
utilization on all NUMA nodes of the first CPU socket, which is the worst
situation, but the kswapd load is already visible from 6.6.8.

Setup of this server:
* 2 sockets, 4 chiplets per socket
* 32 GB of RAM per chiplet, 28 GB of which are in hugepages
  Note: previously I had 29 GB in hugepages; I freed up 1 GB to avoid
  memory pressure, but on the contrary it is now even worse.

kernel 6.6.7: I do not see kswapd usage when the application started == OK
NUMA nodes: 0 1 2 3 4 5 6 7
HPTotalGiB: 28 28 28 28 28 28 28 28
HPFreeGiB: 28 28 28 28 28 28 28 28
MemTotal: 32264 32701 32701 32686 32701 32659 32701 32696
MemFree: 2766 2715 63 2366 3495 2990 3462 252

kernel 6.6.8: I see kswapd on nodes 2 and 3 when the application started
NUMA nodes: 0 1 2 3 4 5 6 7
HPTotalGiB: 28 28 28 28 28 28 28 28
HPFreeGiB: 28 28 28 28 28 28 28 28
MemTotal: 32264 32701 32701 32686 32701 32701 32659 32696
MemFree: 2744 2788 65 581 3304 3215 3266 2226

kernel 6.6.9: I see kswapd on nodes 0, 1, 2 and 3 when the application started
NUMA nodes: 0 1 2 3 4 5 6 7
HPTotalGiB: 28 28 28 28 28 28 28 28
HPFreeGiB: 28 28 28 28 28 28 28 28
MemTotal: 32264 32701 32701 32686 32659 32701 32701 32696
MemFree: 75 60 60 60 3169 2784 3203 2944
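
(For reference, one way to collect such per-node numbers, as a rough
sketch assuming the standard sysfs layout:)

    for n in /sys/devices/system/node/node[0-7]; do
        echo "== ${n##*/} =="
        grep -E 'MemTotal|MemFree|HugePages_Total|HugePages_Free' "$n/meminfo"
    done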


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2024-01-04  9:46                                             ` Jaroslav Pulchart
@ 2024-01-04 14:34                                               ` Jaroslav Pulchart
  2024-01-04 23:51                                                 ` Igor Raits
  0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2024-01-04 14:34 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Daniel Secik, Charan Teja Kalla, Igor Raits, Kalesh Singh, akpm,
	linux-mm

>
> >
> > On Wed, Jan 3, 2024 at 2:30 PM Jaroslav Pulchart
> > <jaroslav.pulchart@gooddata.com> wrote:
> > >
> > > >
> > > > >
> > > > > Hi yu,
> > > > >
> > > > > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > > > > Charan, does the fix previously attached seem acceptable to you? Any
> > > > > > additional feedback? Thanks.
> > > > >
> > > > > First, thanks for taking this patch to upstream.
> > > > >
> > > > > A comment in code snippet is checking just 'high wmark' pages might
> > > > > succeed here but can fail in the immediate kswapd sleep, see
> > > > > prepare_kswapd_sleep(). This can show up into the increased
> > > > > KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary kswapd run time.
> > > > > @Jaroslav: Have you observed something like above?
> > > >
> > > > I do not see any unnecessary kswapd run time, on the contrary it is
> > > > fixing the kswapd continuous run issue.
> > > >
> > > > >
> > > > > So, in downstream, we have something like for zone_watermark_ok():
> > > > > unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH << 2;
> > > > >
> > > > > Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value, may be we
> > > > > should atleast use the 'MIN_LRU_BATCH' with the mentioned reasoning, is
> > > > > what all I can say for this patch.
> > > > >
> > > > > +       mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
> > > > > +              WMARK_PROMO : WMARK_HIGH;
> > > > > +       for (i = 0; i <= sc->reclaim_idx; i++) {
> > > > > +               struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
> > > > > +               unsigned long size = wmark_pages(zone, mark);
> > > > > +
> > > > > +               if (managed_zone(zone) &&
> > > > > +                   !zone_watermark_ok(zone, sc->order, size, sc->reclaim_idx, 0))
> > > > > +                       return false;
> > > > > +       }
> > > > >
> > > > >
> > > > > Thanks,
> > > > > Charan
> > > >
> > > >
> > > >
> > > > --
> > > > Jaroslav Pulchart
> > > > Sr. Principal SW Engineer
> > > > GoodData
> > >
> > >
> > > Hello,
> > >
> > > today we try to update servers to 6.6.9 which contains the mglru fixes
> > > (from 6.6.8) and the server behaves much much worse.
> > >
> > > I got multiple kswapd* load to ~100% imediatelly.
> > >     555 root      20   0       0      0      0 R  99.7   0.0   4:32.86
> > > kswapd1
> > >     554 root      20   0       0      0      0 R  99.3   0.0   3:57.76
> > > kswapd0
> > >     556 root      20   0       0      0      0 R  97.7   0.0   3:42.27
> > > kswapd2
> > > are the changes in upstream different compared to the initial patch
> > > which I tested?
> > >
> > > Best regards,
> > > Jaroslav Pulchart
> >
> > Hi Jaroslav,
> >
> > My apologies for all the trouble!
> >
> > Yes, there is a slight difference between the fix you verified and
> > what went into 6.6.9. The fix in 6.6.9 is disabled under a special
> > condition which I thought wouldn't affect you.
> >
> > Could you try the attached fix again on top of 6.6.9? It removed that
> > special condition.
> >
> > Thanks!
>
> Thanks for prompt response. I did a test with the patch and it didn't
> help. The situation is super strange.
>
> I tried kernels 6.6.7, 6.6.8 and  6.6.9. I see high memory utilization
> of all numa nodes of the first cpu socket if using 6.6.9 and it is the
> worst situation, but the kswapd load is visible from 6.6.8.
>
> Setup of this server:
> * 4 chiplets per each sockets, there are 2 sockets
> * 32 GB of RAM for each chiplet, 28GB are in hugepages
>   Note: previously I have 29GB in Hugepages, I free up 1GB to avoid
> memory pressure however it is even worse now in contrary.
>
> kernel 6.6.7: I do not see kswapd usage when application started == OK
> NUMA nodes: 0 1 2 3 4 5 6 7
> HPTotalGiB: 28 28 28 28 28 28 28 28
> HPFreeGiB: 28 28 28 28 28 28 28 28
> MemTotal: 32264 32701 32701 32686 32701 32659 32701 32696
> MemFree: 2766 2715 63 2366 3495 2990 3462 252
>
> kernel 6.6.8: I see kswapd on nodes 2 and 3 when application started
> NUMA nodes: 0 1 2 3 4 5 6 7
> HPTotalGiB: 28 28 28 28 28 28 28 28
> HPFreeGiB: 28 28 28 28 28 28 28 28
> MemTotal: 32264 32701 32701 32686 32701 32701 32659 32696
> MemFree: 2744 2788 65 581 3304 3215 3266 2226
>
> kernel 6.6.9: I see kswapd on nodes 0, 1, 2 and 3 when application started
> NUMA nodes: 0 1 2 3 4 5 6 7
> HPTotalGiB: 28 28 28 28 28 28 28 28
> HPFreeGiB: 28 28 28 28 28 28 28 28
> MemTotal: 32264 32701 32701 32686 32659 32701 32701 32696
> MemFree: 75 60 60 60 3169 2784 3203 2944

I ran a few more combinations, and here are the results/findings:

  6.6.7-1  (vanilla)                            == OK, no issue

  6.6.8-1  (vanilla)                            == single kswapd 100% !
  6.6.8-1  (vanilla plus mglru-fix-6.6.9.patch) == OK, no issue
  6.6.8-1  (revert four mglru patches)          == OK, no issue

  6.6.9-1  (vanilla)                            == four kswapd 100% !!!!
  6.6.9-2  (vanilla plus mglru-fix-6.6.9.patch) == four kswapd 100% !!!!
  6.6.9-3  (revert four mglru patches)          == four kswapd 100% !!!!

Summary:
* mglru-fix-6.6.9.patch, or reverting the mglru patches, helps in the case
of kernel 6.6.8,
* there is a (new?) problem with the 6.6.9 kernel, which does not appear
to be related to the mglru patches at all
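
As a side note on the 'MIN_LRU_BATCH << 2' slack discussed in the quoted
snippet above: assuming MIN_LRU_BATCH equals BITS_PER_LONG (64 on this
hardware) and 4 KiB pages, that is only about 1 MiB of extra headroom per
zone. A throwaway userspace sketch of the arithmetic (the constants here
are assumptions, not values read from the running kernel):

/* Rough arithmetic only; assumes MIN_LRU_BATCH == BITS_PER_LONG == 64
 * and a 4 KiB page size, as on a typical x86_64 config. */
#include <stdio.h>

int main(void)
{
	const long min_lru_batch = 64;            /* assumed MIN_LRU_BATCH */
	const long slack_pages   = min_lru_batch << 2;

	printf("extra slack over the watermark: %ld pages (%ld KiB)\n",
	       slack_pages, slack_pages * 4);     /* 256 pages == 1 MiB */
	return 0;
}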


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2024-01-04 14:34                                               ` Jaroslav Pulchart
@ 2024-01-04 23:51                                                 ` Igor Raits
  2024-01-05 17:35                                                   ` Ertman, David M
  0 siblings, 1 reply; 30+ messages in thread
From: Igor Raits @ 2024-01-04 23:51 UTC (permalink / raw)
  To: Jaroslav Pulchart
  Cc: Yu Zhao, Daniel Secik, Charan Teja Kalla, Kalesh Singh, akpm,
	linux-mm, Dave Ertman

Hello everyone,

On Thu, Jan 4, 2024 at 3:34 PM Jaroslav Pulchart
<jaroslav.pulchart@gooddata.com> wrote:
>
> >
> > >
> > > On Wed, Jan 3, 2024 at 2:30 PM Jaroslav Pulchart
> > > <jaroslav.pulchart@gooddata.com> wrote:
> > > >
> > > > >
> > > > > >
> > > > > > Hi yu,
> > > > > >
> > > > > > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > > > > > Charan, does the fix previously attached seem acceptable to you? Any
> > > > > > > additional feedback? Thanks.
> > > > > >
> > > > > > First, thanks for taking this patch to upstream.
> > > > > >
> > > > > > A comment in code snippet is checking just 'high wmark' pages might
> > > > > > succeed here but can fail in the immediate kswapd sleep, see
> > > > > > prepare_kswapd_sleep(). This can show up into the increased
> > > > > > KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary kswapd run time.
> > > > > > @Jaroslav: Have you observed something like above?
> > > > >
> > > > > I do not see any unnecessary kswapd run time, on the contrary it is
> > > > > fixing the kswapd continuous run issue.
> > > > >
> > > > > >
> > > > > > So, in downstream, we have something like for zone_watermark_ok():
> > > > > > unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH << 2;
> > > > > >
> > > > > > Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value, may be we
> > > > > > should atleast use the 'MIN_LRU_BATCH' with the mentioned reasoning, is
> > > > > > what all I can say for this patch.
> > > > > >
> > > > > > +       mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
> > > > > > +              WMARK_PROMO : WMARK_HIGH;
> > > > > > +       for (i = 0; i <= sc->reclaim_idx; i++) {
> > > > > > +               struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
> > > > > > +               unsigned long size = wmark_pages(zone, mark);
> > > > > > +
> > > > > > +               if (managed_zone(zone) &&
> > > > > > +                   !zone_watermark_ok(zone, sc->order, size, sc->reclaim_idx, 0))
> > > > > > +                       return false;
> > > > > > +       }
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Charan
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Jaroslav Pulchart
> > > > > Sr. Principal SW Engineer
> > > > > GoodData
> > > >
> > > >
> > > > Hello,
> > > >
> > > > today we try to update servers to 6.6.9 which contains the mglru fixes
> > > > (from 6.6.8) and the server behaves much much worse.
> > > >
> > > > I got multiple kswapd* load to ~100% imediatelly.
> > > >     555 root      20   0       0      0      0 R  99.7   0.0   4:32.86
> > > > kswapd1
> > > >     554 root      20   0       0      0      0 R  99.3   0.0   3:57.76
> > > > kswapd0
> > > >     556 root      20   0       0      0      0 R  97.7   0.0   3:42.27
> > > > kswapd2
> > > > are the changes in upstream different compared to the initial patch
> > > > which I tested?
> > > >
> > > > Best regards,
> > > > Jaroslav Pulchart
> > >
> > > Hi Jaroslav,
> > >
> > > My apologies for all the trouble!
> > >
> > > Yes, there is a slight difference between the fix you verified and
> > > what went into 6.6.9. The fix in 6.6.9 is disabled under a special
> > > condition which I thought wouldn't affect you.
> > >
> > > Could you try the attached fix again on top of 6.6.9? It removed that
> > > special condition.
> > >
> > > Thanks!
> >
> > Thanks for prompt response. I did a test with the patch and it didn't
> > help. The situation is super strange.
> >
> > I tried kernels 6.6.7, 6.6.8 and  6.6.9. I see high memory utilization
> > of all numa nodes of the first cpu socket if using 6.6.9 and it is the
> > worst situation, but the kswapd load is visible from 6.6.8.
> >
> > Setup of this server:
> > * 4 chiplets per each sockets, there are 2 sockets
> > * 32 GB of RAM for each chiplet, 28GB are in hugepages
> >   Note: previously I have 29GB in Hugepages, I free up 1GB to avoid
> > memory pressure however it is even worse now in contrary.
> >
> > kernel 6.6.7: I do not see kswapd usage when application started == OK
> > NUMA nodes: 0 1 2 3 4 5 6 7
> > HPTotalGiB: 28 28 28 28 28 28 28 28
> > HPFreeGiB: 28 28 28 28 28 28 28 28
> > MemTotal: 32264 32701 32701 32686 32701 32659 32701 32696
> > MemFree: 2766 2715 63 2366 3495 2990 3462 252
> >
> > kernel 6.6.8: I see kswapd on nodes 2 and 3 when application started
> > NUMA nodes: 0 1 2 3 4 5 6 7
> > HPTotalGiB: 28 28 28 28 28 28 28 28
> > HPFreeGiB: 28 28 28 28 28 28 28 28
> > MemTotal: 32264 32701 32701 32686 32701 32701 32659 32696
> > MemFree: 2744 2788 65 581 3304 3215 3266 2226
> >
> > kernel 6.6.9: I see kswapd on nodes 0, 1, 2 and 3 when application started
> > NUMA nodes: 0 1 2 3 4 5 6 7
> > HPTotalGiB: 28 28 28 28 28 28 28 28
> > HPFreeGiB: 28 28 28 28 28 28 28 28
> > MemTotal: 32264 32701 32701 32686 32659 32701 32701 32696
> > MemFree: 75 60 60 60 3169 2784 3203 2944
>
> I run few more combinations, and here are results / findings:
>
>   6.6.7-1  (vanila)                            == OK, no issue
>
>   6.6.8-1  (vanila)                            == single kswapd 100% !
>   6.6.8-1  (vanila plus mglru-fix-6.6.9.patch) == OK, no issue
>   6.6.8-1  (revert four mglru patches)         == OK, no issue
>
>   6.6.9-1  (vanila)                            == four kswapd 100% !!!!
>   6.6.9-2  (vanila plus mglru-fix-6.6.9.patch) == four kswapd 100% !!!!
>   6.6.9-3  (revert four mglru patches)         == four kswapd 100% !!!!
>
> Summary:
> * mglru-fix-6.6.9.patch or reverting mglru patches helps in case of
> kernel 6.6.8,
> * there is (new?) problem in case of 6.6.9 kernel, which looks not to
> be related to mglru patches at all

I was able to bisect this change, and it looks like something is going
wrong with the ice driver…

Usually after booting our server we see something like the table below:
most of the nodes have ~2-3G of free memory, but there are always 1-2
NUMA nodes with a really low amount of free memory. We don't know why,
but it looks like that is what ultimately causes the constant swap
in/out issue. With the final bit of the patch you sent earlier in this
thread it is almost invisible.

NUMA nodes:     0       1       2       3       4       5       6       7
HPTotalGiB:     28      28      28      28      28      28      28      28
HPFreeGiB:      28      28      28      28      28      28      28      28
MemTotal:       32264   32701   32659   32686   32701   32701   32701   32696
MemFree:        2191    2828    92      292     3344    2916    3594    3222


However, after the following patch we see that more NUMA nodes end up
with such a low amount of free memory, and that causes constant memory
reclaim; it looks like something inside the kernel ate all the memory.
This is right after the start of the system as well.

NUMA nodes:     0       1       2       3       4       5       6       7
HPTotalGiB:     28      28      28      28      28      28      28      28
HPFreeGiB:      28      28      28      28      28      28      28      28
MemTotal:       32264   32701   32659   32686   32701   32701   32701   32696
MemFree:        46      59      51      33      3078    3535    2708    3511

The difference is 18G vs 12G of free memory summed across all NUMA
nodes right after boot of the system. If you have any hints on how to
debug what is actually occupying all that memory, ideally in both
cases, we would be happy to debug more!
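
The per-node MemFree numbers above can be read from the standard
/sys/devices/system/node/node<N>/meminfo files; below is a minimal sketch
that sums them across nodes (an illustration of where the 18G-vs-12G
totals come from, not the exact tooling we use):

/* Sum MemFree across NUMA nodes via sysfs. Assumes the usual
 * "Node <N> MemFree: <kB> kB" line format; nodes that do not exist
 * are simply skipped. */
#include <stdio.h>

int main(void)
{
	long total_kb = 0;

	for (int node = 0; node < 64; node++) {
		char path[64], line[256];
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d/meminfo", node);
		f = fopen(path, "r");
		if (!f)
			continue;	/* node not present */

		while (fgets(line, sizeof(line), f)) {
			long kb;

			if (sscanf(line, "Node %*d MemFree: %ld kB", &kb) == 1) {
				printf("node%d MemFree: %5ld MiB\n", node, kb / 1024);
				total_kb += kb;
				break;
			}
		}
		fclose(f);
	}
	printf("sum MemFree: %ld MiB\n", total_kb / 1024);
	return 0;
}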

Dave, would you have any idea why that patch could cause such a boost
in memory utilization?

commit fc4d6d136d42fab207b3ce20a8ebfd61a13f931f
Author: Dave Ertman <david.m.ertman@intel.com>
Date:   Mon Dec 11 13:19:28 2023 -0800

    ice: alter feature support check for SRIOV and LAG

    [ Upstream commit 4d50fcdc2476eef94c14c6761073af5667bb43b6 ]

    Previously, the ice driver had support for using a handler for bonding
    netdev events to ensure that conflicting features were not allowed to be
    activated at the same time.  While this was still in place, additional
    support was added to specifically support SRIOV and LAG together.  These
    both utilized the netdev event handler, but the SRIOV and LAG feature was
    behind a capabilities feature check to make sure the current NVM has
    support.

    The exclusion part of the event handler should be removed since there are
    users who have custom made solutions that depend on the non-exclusion of
    features.

    Wrap the creation/registration and cleanup of the event handler and
    associated structs in the probe flow with a feature check so that the
    only systems that support the full implementation of LAG features will
    initialize support.  This will leave other systems unhindered with
    functionality as it existed before any LAG code was added.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2024-01-04 23:51                                                 ` Igor Raits
@ 2024-01-05 17:35                                                   ` Ertman, David M
  2024-01-08 17:53                                                     ` Jaroslav Pulchart
  0 siblings, 1 reply; 30+ messages in thread
From: Ertman, David M @ 2024-01-05 17:35 UTC (permalink / raw)
  To: Igor Raits, Jaroslav Pulchart
  Cc: Yu Zhao, Daniel Secik, Charan Teja Kalla, Kalesh Singh, akpm, linux-mm

> -----Original Message-----
> From: Igor Raits <igor@gooddata.com>
> Sent: Thursday, January 4, 2024 3:51 PM
> To: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
> Cc: Yu Zhao <yuzhao@google.com>; Daniel Secik
> <daniel.secik@gooddata.com>; Charan Teja Kalla
> <quic_charante@quicinc.com>; Kalesh Singh <kaleshsingh@google.com>;
> akpm@linux-foundation.org; linux-mm@kvack.org; Ertman, David M
> <david.m.ertman@intel.com>
> Subject: Re: high kswapd CPU usage with symmetrical swap in/out pattern
> with multi-gen LRU
> 
> Hello everyone,
> 
> On Thu, Jan 4, 2024 at 3:34 PM Jaroslav Pulchart
> <jaroslav.pulchart@gooddata.com> wrote:
> >
> > >
> > > >
> > > > On Wed, Jan 3, 2024 at 2:30 PM Jaroslav Pulchart
> > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > >
> > > > > >
> > > > > > >
> > > > > > > Hi yu,
> > > > > > >
> > > > > > > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > > > > > > Charan, does the fix previously attached seem acceptable to
> you? Any
> > > > > > > > additional feedback? Thanks.
> > > > > > >
> > > > > > > First, thanks for taking this patch to upstream.
> > > > > > >
> > > > > > > A comment in code snippet is checking just 'high wmark' pages
> might
> > > > > > > succeed here but can fail in the immediate kswapd sleep, see
> > > > > > > prepare_kswapd_sleep(). This can show up into the increased
> > > > > > > KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary
> kswapd run time.
> > > > > > > @Jaroslav: Have you observed something like above?
> > > > > >
> > > > > > I do not see any unnecessary kswapd run time, on the contrary it is
> > > > > > fixing the kswapd continuous run issue.
> > > > > >
> > > > > > >
> > > > > > > So, in downstream, we have something like for
> zone_watermark_ok():
> > > > > > > unsigned long size = wmark_pages(zone, mark) +
> MIN_LRU_BATCH << 2;
> > > > > > >
> > > > > > > Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value,
> may be we
> > > > > > > should atleast use the 'MIN_LRU_BATCH' with the mentioned
> reasoning, is
> > > > > > > what all I can say for this patch.
> > > > > > >
> > > > > > > +       mark = sysctl_numa_balancing_mode &
> NUMA_BALANCING_MEMORY_TIERING ?
> > > > > > > +              WMARK_PROMO : WMARK_HIGH;
> > > > > > > +       for (i = 0; i <= sc->reclaim_idx; i++) {
> > > > > > > +               struct zone *zone = lruvec_pgdat(lruvec)->node_zones +
> i;
> > > > > > > +               unsigned long size = wmark_pages(zone, mark);
> > > > > > > +
> > > > > > > +               if (managed_zone(zone) &&
> > > > > > > +                   !zone_watermark_ok(zone, sc->order, size, sc-
> >reclaim_idx, 0))
> > > > > > > +                       return false;
> > > > > > > +       }
> > > > > > >
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Charan
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Jaroslav Pulchart
> > > > > > Sr. Principal SW Engineer
> > > > > > GoodData
> > > > >
> > > > >
> > > > > Hello,
> > > > >
> > > > > today we try to update servers to 6.6.9 which contains the mglru fixes
> > > > > (from 6.6.8) and the server behaves much much worse.
> > > > >
> > > > > I got multiple kswapd* load to ~100% imediatelly.
> > > > >     555 root      20   0       0      0      0 R  99.7   0.0   4:32.86
> > > > > kswapd1
> > > > >     554 root      20   0       0      0      0 R  99.3   0.0   3:57.76
> > > > > kswapd0
> > > > >     556 root      20   0       0      0      0 R  97.7   0.0   3:42.27
> > > > > kswapd2
> > > > > are the changes in upstream different compared to the initial patch
> > > > > which I tested?
> > > > >
> > > > > Best regards,
> > > > > Jaroslav Pulchart
> > > >
> > > > Hi Jaroslav,
> > > >
> > > > My apologies for all the trouble!
> > > >
> > > > Yes, there is a slight difference between the fix you verified and
> > > > what went into 6.6.9. The fix in 6.6.9 is disabled under a special
> > > > condition which I thought wouldn't affect you.
> > > >
> > > > Could you try the attached fix again on top of 6.6.9? It removed that
> > > > special condition.
> > > >
> > > > Thanks!
> > >
> > > Thanks for prompt response. I did a test with the patch and it didn't
> > > help. The situation is super strange.
> > >
> > > I tried kernels 6.6.7, 6.6.8 and  6.6.9. I see high memory utilization
> > > of all numa nodes of the first cpu socket if using 6.6.9 and it is the
> > > worst situation, but the kswapd load is visible from 6.6.8.
> > >
> > > Setup of this server:
> > > * 4 chiplets per each sockets, there are 2 sockets
> > > * 32 GB of RAM for each chiplet, 28GB are in hugepages
> > >   Note: previously I have 29GB in Hugepages, I free up 1GB to avoid
> > > memory pressure however it is even worse now in contrary.
> > >
> > > kernel 6.6.7: I do not see kswapd usage when application started == OK
> > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > MemTotal: 32264 32701 32701 32686 32701 32659 32701 32696
> > > MemFree: 2766 2715 63 2366 3495 2990 3462 252
> > >
> > > kernel 6.6.8: I see kswapd on nodes 2 and 3 when application started
> > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > MemTotal: 32264 32701 32701 32686 32701 32701 32659 32696
> > > MemFree: 2744 2788 65 581 3304 3215 3266 2226
> > >
> > > kernel 6.6.9: I see kswapd on nodes 0, 1, 2 and 3 when application started
> > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > MemTotal: 32264 32701 32701 32686 32659 32701 32701 32696
> > > MemFree: 75 60 60 60 3169 2784 3203 2944
> >
> > I run few more combinations, and here are results / findings:
> >
> >   6.6.7-1  (vanila)                            == OK, no issue
> >
> >   6.6.8-1  (vanila)                            == single kswapd 100% !
> >   6.6.8-1  (vanila plus mglru-fix-6.6.9.patch) == OK, no issue
> >   6.6.8-1  (revert four mglru patches)         == OK, no issue
> >
> >   6.6.9-1  (vanila)                            == four kswapd 100% !!!!
> >   6.6.9-2  (vanila plus mglru-fix-6.6.9.patch) == four kswapd 100% !!!!
> >   6.6.9-3  (revert four mglru patches)         == four kswapd 100% !!!!
> >
> > Summary:
> > * mglru-fix-6.6.9.patch or reverting mglru patches helps in case of
> > kernel 6.6.8,
> > * there is (new?) problem in case of 6.6.9 kernel, which looks not to
> > be related to mglru patches at all
> 
> I was able to bisect this change and it looks like there is something
> going wrong with the ice driver…
> 
> Usually after booting our server we see something like this. Most of
> the nodes have ~2-3G of free memory. There are always 1-2 NUMA nodes
> that have a really low amount of free memory and we don't know why but
> it looks like that in the end causes the constant swap in/out issue.
> With the final bit of the patch you've sent earlier in this thread it
> is almost invisible.
> 
> NUMA nodes:     0       1       2       3       4       5       6       7
> HPTotalGiB:     28      28      28      28      28      28      28      28
> HPFreeGiB:      28      28      28      28      28      28      28      28
> MemTotal:       32264   32701   32659   32686   32701   32701   32701   32696
> MemFree:        2191    2828    92      292     3344    2916    3594    3222
> 
> 
> However, after the following patch we see that more NUMA nodes have
> such a low amount of memory and  that is causing constant reclaiming
> of memory because it looks like something inside of the kernel ate all
> the memory. This is right after the start of the system as well.
> 
> NUMA nodes:     0       1       2       3       4       5       6       7
> HPTotalGiB:     28      28      28      28      28      28      28      28
> HPFreeGiB:      28      28      28      28      28      28      28      28
> MemTotal:       32264   32701   32659   32686   32701   32701   32701   32696
> MemFree:        46      59      51      33      3078    3535    2708    3511
> 
> The difference is 18G vs 12G of free memory sum'd across all NUMA
> nodes right after boot of the system. If you have some hints on how to
> debug what is actually occupying all that memory, maybe in both cases
> - would be happy to debug more!
> 
> Dave, would you have any idea why that patch could cause such a boost
> in memory utilization?
> 
> commit fc4d6d136d42fab207b3ce20a8ebfd61a13f931f
> Author: Dave Ertman <david.m.ertman@intel.com>
> Date:   Mon Dec 11 13:19:28 2023 -0800
> 
>     ice: alter feature support check for SRIOV and LAG
> 
>     [ Upstream commit 4d50fcdc2476eef94c14c6761073af5667bb43b6 ]
> 
>     Previously, the ice driver had support for using a handler for bonding
>     netdev events to ensure that conflicting features were not allowed to be
>     activated at the same time.  While this was still in place, additional
>     support was added to specifically support SRIOV and LAG together.  These
>     both utilized the netdev event handler, but the SRIOV and LAG feature
> was
>     behind a capabilities feature check to make sure the current NVM has
>     support.
> 
>     The exclusion part of the event handler should be removed since there are
>     users who have custom made solutions that depend on the non-exclusion
> of
>     features.
> 
>     Wrap the creation/registration and cleanup of the event handler and
>     associated structs in the probe flow with a feature check so that the
>     only systems that support the full implementation of LAG features will
>     initialize support.  This will leave other systems unhindered with
>     functionality as it existed before any LAG code was added.

Igor,

I have no idea why that two-line commit would do anything to increase memory usage by the ice driver.
If anything, I would expect it to lower memory usage, since it has the potential to stop the allocation of memory
for the pf->lag struct.
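
For illustration only, here is a minimal userspace sketch of the gating
pattern that commit message describes (placeholder names, not the actual
ice driver code): when the feature check fails, the lag struct is simply
never allocated and no netdev-event handler is registered.

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the driver's private structs. */
struct lag_info { int bond_events; };
struct pf_ctx   { struct lag_info *lag; };

/* Sketch of the probe-time decision: only allocate and register when the
 * NVM/feature check reports full SRIOV+LAG support. */
static int init_lag_if_supported(struct pf_ctx *pf, bool sriov_lag_supported)
{
	if (!sriov_lag_supported) {
		pf->lag = NULL;		/* nothing allocated, no handler registered */
		return 0;
	}

	pf->lag = calloc(1, sizeof(*pf->lag));
	if (!pf->lag)
		return -1;

	/* ...this is where the bonding netdev-event handler would be set up... */
	return 0;
}

int main(void)
{
	struct pf_ctx pf = { 0 };

	init_lag_if_supported(&pf, false);
	printf("lag allocated: %s\n", pf.lag ? "yes" : "no");
	free(pf.lag);
	return 0;
}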

DaveE

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2024-01-05 17:35                                                   ` Ertman, David M
@ 2024-01-08 17:53                                                     ` Jaroslav Pulchart
  2024-01-16  4:58                                                       ` Yu Zhao
  0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2024-01-08 17:53 UTC (permalink / raw)
  To: Ertman, David M, Yu Zhao
  Cc: Igor Raits, Daniel Secik, Charan Teja Kalla, Kalesh Singh, akpm,
	linux-mm

>
> > -----Original Message-----
> > From: Igor Raits <igor@gooddata.com>
> > Sent: Thursday, January 4, 2024 3:51 PM
> > To: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
> > Cc: Yu Zhao <yuzhao@google.com>; Daniel Secik
> > <daniel.secik@gooddata.com>; Charan Teja Kalla
> > <quic_charante@quicinc.com>; Kalesh Singh <kaleshsingh@google.com>;
> > akpm@linux-foundation.org; linux-mm@kvack.org; Ertman, David M
> > <david.m.ertman@intel.com>
> > Subject: Re: high kswapd CPU usage with symmetrical swap in/out pattern
> > with multi-gen LRU
> >
> > Hello everyone,
> >
> > On Thu, Jan 4, 2024 at 3:34 PM Jaroslav Pulchart
> > <jaroslav.pulchart@gooddata.com> wrote:
> > >
> > > >
> > > > >
> > > > > On Wed, Jan 3, 2024 at 2:30 PM Jaroslav Pulchart
> > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > Hi yu,
> > > > > > > >
> > > > > > > > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > > > > > > > Charan, does the fix previously attached seem acceptable to
> > you? Any
> > > > > > > > > additional feedback? Thanks.
> > > > > > > >
> > > > > > > > First, thanks for taking this patch to upstream.
> > > > > > > >
> > > > > > > > A comment in code snippet is checking just 'high wmark' pages
> > might
> > > > > > > > succeed here but can fail in the immediate kswapd sleep, see
> > > > > > > > prepare_kswapd_sleep(). This can show up into the increased
> > > > > > > > KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary
> > kswapd run time.
> > > > > > > > @Jaroslav: Have you observed something like above?
> > > > > > >
> > > > > > > I do not see any unnecessary kswapd run time, on the contrary it is
> > > > > > > fixing the kswapd continuous run issue.
> > > > > > >
> > > > > > > >
> > > > > > > > So, in downstream, we have something like for
> > zone_watermark_ok():
> > > > > > > > unsigned long size = wmark_pages(zone, mark) +
> > MIN_LRU_BATCH << 2;
> > > > > > > >
> > > > > > > > Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value,
> > may be we
> > > > > > > > should atleast use the 'MIN_LRU_BATCH' with the mentioned
> > reasoning, is
> > > > > > > > what all I can say for this patch.
> > > > > > > >
> > > > > > > > +       mark = sysctl_numa_balancing_mode &
> > NUMA_BALANCING_MEMORY_TIERING ?
> > > > > > > > +              WMARK_PROMO : WMARK_HIGH;
> > > > > > > > +       for (i = 0; i <= sc->reclaim_idx; i++) {
> > > > > > > > +               struct zone *zone = lruvec_pgdat(lruvec)->node_zones +
> > i;
> > > > > > > > +               unsigned long size = wmark_pages(zone, mark);
> > > > > > > > +
> > > > > > > > +               if (managed_zone(zone) &&
> > > > > > > > +                   !zone_watermark_ok(zone, sc->order, size, sc-
> > >reclaim_idx, 0))
> > > > > > > > +                       return false;
> > > > > > > > +       }
> > > > > > > >
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Charan
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Jaroslav Pulchart
> > > > > > > Sr. Principal SW Engineer
> > > > > > > GoodData
> > > > > >
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > today we try to update servers to 6.6.9 which contains the mglru fixes
> > > > > > (from 6.6.8) and the server behaves much much worse.
> > > > > >
> > > > > > I got multiple kswapd* load to ~100% imediatelly.
> > > > > >     555 root      20   0       0      0      0 R  99.7   0.0   4:32.86
> > > > > > kswapd1
> > > > > >     554 root      20   0       0      0      0 R  99.3   0.0   3:57.76
> > > > > > kswapd0
> > > > > >     556 root      20   0       0      0      0 R  97.7   0.0   3:42.27
> > > > > > kswapd2
> > > > > > are the changes in upstream different compared to the initial patch
> > > > > > which I tested?
> > > > > >
> > > > > > Best regards,
> > > > > > Jaroslav Pulchart
> > > > >
> > > > > Hi Jaroslav,
> > > > >
> > > > > My apologies for all the trouble!
> > > > >
> > > > > Yes, there is a slight difference between the fix you verified and
> > > > > what went into 6.6.9. The fix in 6.6.9 is disabled under a special
> > > > > condition which I thought wouldn't affect you.
> > > > >
> > > > > Could you try the attached fix again on top of 6.6.9? It removed that
> > > > > special condition.
> > > > >
> > > > > Thanks!
> > > >
> > > > Thanks for prompt response. I did a test with the patch and it didn't
> > > > help. The situation is super strange.
> > > >
> > > > I tried kernels 6.6.7, 6.6.8 and  6.6.9. I see high memory utilization
> > > > of all numa nodes of the first cpu socket if using 6.6.9 and it is the
> > > > worst situation, but the kswapd load is visible from 6.6.8.
> > > >
> > > > Setup of this server:
> > > > * 4 chiplets per each sockets, there are 2 sockets
> > > > * 32 GB of RAM for each chiplet, 28GB are in hugepages
> > > >   Note: previously I have 29GB in Hugepages, I free up 1GB to avoid
> > > > memory pressure however it is even worse now in contrary.
> > > >
> > > > kernel 6.6.7: I do not see kswapd usage when application started == OK
> > > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > > MemTotal: 32264 32701 32701 32686 32701 32659 32701 32696
> > > > MemFree: 2766 2715 63 2366 3495 2990 3462 252
> > > >
> > > > kernel 6.6.8: I see kswapd on nodes 2 and 3 when application started
> > > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > > MemTotal: 32264 32701 32701 32686 32701 32701 32659 32696
> > > > MemFree: 2744 2788 65 581 3304 3215 3266 2226
> > > >
> > > > kernel 6.6.9: I see kswapd on nodes 0, 1, 2 and 3 when application started
> > > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > > MemTotal: 32264 32701 32701 32686 32659 32701 32701 32696
> > > > MemFree: 75 60 60 60 3169 2784 3203 2944
> > >
> > > I run few more combinations, and here are results / findings:
> > >
> > >   6.6.7-1  (vanila)                            == OK, no issue
> > >
> > >   6.6.8-1  (vanila)                            == single kswapd 100% !
> > >   6.6.8-1  (vanila plus mglru-fix-6.6.9.patch) == OK, no issue
> > >   6.6.8-1  (revert four mglru patches)         == OK, no issue
> > >
> > >   6.6.9-1  (vanila)                            == four kswapd 100% !!!!
> > >   6.6.9-2  (vanila plus mglru-fix-6.6.9.patch) == four kswapd 100% !!!!
> > >   6.6.9-3  (revert four mglru patches)         == four kswapd 100% !!!!
> > >
> > > Summary:
> > > * mglru-fix-6.6.9.patch or reverting mglru patches helps in case of
> > > kernel 6.6.8,
> > > * there is (new?) problem in case of 6.6.9 kernel, which looks not to
> > > be related to mglru patches at all
> >
> > I was able to bisect this change and it looks like there is something
> > going wrong with the ice driver…
> >
> > Usually after booting our server we see something like this. Most of
> > the nodes have ~2-3G of free memory. There are always 1-2 NUMA nodes
> > that have a really low amount of free memory and we don't know why but
> > it looks like that in the end causes the constant swap in/out issue.
> > With the final bit of the patch you've sent earlier in this thread it
> > is almost invisible.
> >
> > NUMA nodes:     0       1       2       3       4       5       6       7
> > HPTotalGiB:     28      28      28      28      28      28      28      28
> > HPFreeGiB:      28      28      28      28      28      28      28      28
> > MemTotal:       32264   32701   32659   32686   32701   32701   32701   32696
> > MemFree:        2191    2828    92      292     3344    2916    3594    3222
> >
> >
> > However, after the following patch we see that more NUMA nodes have
> > such a low amount of memory and  that is causing constant reclaiming
> > of memory because it looks like something inside of the kernel ate all
> > the memory. This is right after the start of the system as well.
> >
> > NUMA nodes:     0       1       2       3       4       5       6       7
> > HPTotalGiB:     28      28      28      28      28      28      28      28
> > HPFreeGiB:      28      28      28      28      28      28      28      28
> > MemTotal:       32264   32701   32659   32686   32701   32701   32701   32696
> > MemFree:        46      59      51      33      3078    3535    2708    3511
> >
> > The difference is 18G vs 12G of free memory sum'd across all NUMA
> > nodes right after boot of the system. If you have some hints on how to
> > debug what is actually occupying all that memory, maybe in both cases
> > - would be happy to debug more!
> >
> > Dave, would you have any idea why that patch could cause such a boost
> > in memory utilization?
> >
> > commit fc4d6d136d42fab207b3ce20a8ebfd61a13f931f
> > Author: Dave Ertman <david.m.ertman@intel.com>
> > Date:   Mon Dec 11 13:19:28 2023 -0800
> >
> >     ice: alter feature support check for SRIOV and LAG
> >
> >     [ Upstream commit 4d50fcdc2476eef94c14c6761073af5667bb43b6 ]
> >
> >     Previously, the ice driver had support for using a handler for bonding
> >     netdev events to ensure that conflicting features were not allowed to be
> >     activated at the same time.  While this was still in place, additional
> >     support was added to specifically support SRIOV and LAG together.  These
> >     both utilized the netdev event handler, but the SRIOV and LAG feature
> > was
> >     behind a capabilities feature check to make sure the current NVM has
> >     support.
> >
> >     The exclusion part of the event handler should be removed since there are
> >     users who have custom made solutions that depend on the non-exclusion
> > of
> >     features.
> >
> >     Wrap the creation/registration and cleanup of the event handler and
> >     associated structs in the probe flow with a feature check so that the
> >     only systems that support the full implementation of LAG features will
> >     initialize support.  This will leave other systems unhindered with
> >     functionality as it existed before any LAG code was added.
>
> Igor,
>
> I have no idea why that two line commit would do anything to increase memory usage by the ice driver.
> If anything, I would expect it to lower memory usage as it has the potential to stop the allocation of memory
> for the pf->lag struct.
>
> DaveE

Hello,

I believe we can track these as two different issues. I reported the
ICE driver commit in an email with the subject "[REGRESSION] Intel ICE
Ethernet driver in linux >= 6.6.9 triggers extra memory consumption
and cause continous kswapd* usage and continuous swapping" to
    Jesse Brandeburg <jesse.brandeburg@intel.com>
    Tony Nguyen <anthony.l.nguyen@intel.com>
    intel-wired-lan@lists.osuosl.org
    Dave Ertman <david.m.ertman@intel.com>

Let's track the mglru issue here in this email thread. Yu, the kernel
build with your mglru-fix-6.6.9.patch seems to be OK: it has been
running for 3 days without kswapd usage (excluding the ice driver
commit).

Best!
-- 
Jaroslav Pulchart


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2024-01-08 17:53                                                     ` Jaroslav Pulchart
@ 2024-01-16  4:58                                                       ` Yu Zhao
  2024-01-16 17:34                                                         ` Jaroslav Pulchart
  0 siblings, 1 reply; 30+ messages in thread
From: Yu Zhao @ 2024-01-16  4:58 UTC (permalink / raw)
  To: Jaroslav Pulchart
  Cc: Ertman, David M, Igor Raits, Daniel Secik, Charan Teja Kalla,
	Kalesh Singh, akpm, linux-mm

On Mon, Jan 8, 2024 at 10:54 AM Jaroslav Pulchart
<jaroslav.pulchart@gooddata.com> wrote:
>
> >
> > > -----Original Message-----
> > > From: Igor Raits <igor@gooddata.com>
> > > Sent: Thursday, January 4, 2024 3:51 PM
> > > To: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
> > > Cc: Yu Zhao <yuzhao@google.com>; Daniel Secik
> > > <daniel.secik@gooddata.com>; Charan Teja Kalla
> > > <quic_charante@quicinc.com>; Kalesh Singh <kaleshsingh@google.com>;
> > > akpm@linux-foundation.org; linux-mm@kvack.org; Ertman, David M
> > > <david.m.ertman@intel.com>
> > > Subject: Re: high kswapd CPU usage with symmetrical swap in/out pattern
> > > with multi-gen LRU
> > >
> > > Hello everyone,
> > >
> > > On Thu, Jan 4, 2024 at 3:34 PM Jaroslav Pulchart
> > > <jaroslav.pulchart@gooddata.com> wrote:
> > > >
> > > > >
> > > > > >
> > > > > > On Wed, Jan 3, 2024 at 2:30 PM Jaroslav Pulchart
> > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Hi yu,
> > > > > > > > >
> > > > > > > > > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > > > > > > > > Charan, does the fix previously attached seem acceptable to
> > > you? Any
> > > > > > > > > > additional feedback? Thanks.
> > > > > > > > >
> > > > > > > > > First, thanks for taking this patch to upstream.
> > > > > > > > >
> > > > > > > > > A comment in code snippet is checking just 'high wmark' pages
> > > might
> > > > > > > > > succeed here but can fail in the immediate kswapd sleep, see
> > > > > > > > > prepare_kswapd_sleep(). This can show up into the increased
> > > > > > > > > KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary
> > > kswapd run time.
> > > > > > > > > @Jaroslav: Have you observed something like above?
> > > > > > > >
> > > > > > > > I do not see any unnecessary kswapd run time, on the contrary it is
> > > > > > > > fixing the kswapd continuous run issue.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > So, in downstream, we have something like for
> > > zone_watermark_ok():
> > > > > > > > > unsigned long size = wmark_pages(zone, mark) +
> > > MIN_LRU_BATCH << 2;
> > > > > > > > >
> > > > > > > > > Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value,
> > > may be we
> > > > > > > > > should atleast use the 'MIN_LRU_BATCH' with the mentioned
> > > reasoning, is
> > > > > > > > > what all I can say for this patch.
> > > > > > > > >
> > > > > > > > > +       mark = sysctl_numa_balancing_mode &
> > > NUMA_BALANCING_MEMORY_TIERING ?
> > > > > > > > > +              WMARK_PROMO : WMARK_HIGH;
> > > > > > > > > +       for (i = 0; i <= sc->reclaim_idx; i++) {
> > > > > > > > > +               struct zone *zone = lruvec_pgdat(lruvec)->node_zones +
> > > i;
> > > > > > > > > +               unsigned long size = wmark_pages(zone, mark);
> > > > > > > > > +
> > > > > > > > > +               if (managed_zone(zone) &&
> > > > > > > > > +                   !zone_watermark_ok(zone, sc->order, size, sc-
> > > >reclaim_idx, 0))
> > > > > > > > > +                       return false;
> > > > > > > > > +       }
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Charan
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Jaroslav Pulchart
> > > > > > > > Sr. Principal SW Engineer
> > > > > > > > GoodData
> > > > > > >
> > > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > today we try to update servers to 6.6.9 which contains the mglru fixes
> > > > > > > (from 6.6.8) and the server behaves much much worse.
> > > > > > >
> > > > > > > I got multiple kswapd* load to ~100% imediatelly.
> > > > > > >     555 root      20   0       0      0      0 R  99.7   0.0   4:32.86
> > > > > > > kswapd1
> > > > > > >     554 root      20   0       0      0      0 R  99.3   0.0   3:57.76
> > > > > > > kswapd0
> > > > > > >     556 root      20   0       0      0      0 R  97.7   0.0   3:42.27
> > > > > > > kswapd2
> > > > > > > are the changes in upstream different compared to the initial patch
> > > > > > > which I tested?
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Jaroslav Pulchart
> > > > > >
> > > > > > Hi Jaroslav,
> > > > > >
> > > > > > My apologies for all the trouble!
> > > > > >
> > > > > > Yes, there is a slight difference between the fix you verified and
> > > > > > what went into 6.6.9. The fix in 6.6.9 is disabled under a special
> > > > > > condition which I thought wouldn't affect you.
> > > > > >
> > > > > > Could you try the attached fix again on top of 6.6.9? It removed that
> > > > > > special condition.
> > > > > >
> > > > > > Thanks!
> > > > >
> > > > > Thanks for prompt response. I did a test with the patch and it didn't
> > > > > help. The situation is super strange.
> > > > >
> > > > > I tried kernels 6.6.7, 6.6.8 and  6.6.9. I see high memory utilization
> > > > > of all numa nodes of the first cpu socket if using 6.6.9 and it is the
> > > > > worst situation, but the kswapd load is visible from 6.6.8.
> > > > >
> > > > > Setup of this server:
> > > > > * 4 chiplets per each sockets, there are 2 sockets
> > > > > * 32 GB of RAM for each chiplet, 28GB are in hugepages
> > > > >   Note: previously I have 29GB in Hugepages, I free up 1GB to avoid
> > > > > memory pressure however it is even worse now in contrary.
> > > > >
> > > > > kernel 6.6.7: I do not see kswapd usage when application started == OK
> > > > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > > > MemTotal: 32264 32701 32701 32686 32701 32659 32701 32696
> > > > > MemFree: 2766 2715 63 2366 3495 2990 3462 252
> > > > >
> > > > > kernel 6.6.8: I see kswapd on nodes 2 and 3 when application started
> > > > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > > > MemTotal: 32264 32701 32701 32686 32701 32701 32659 32696
> > > > > MemFree: 2744 2788 65 581 3304 3215 3266 2226
> > > > >
> > > > > kernel 6.6.9: I see kswapd on nodes 0, 1, 2 and 3 when application started
> > > > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > > > MemTotal: 32264 32701 32701 32686 32659 32701 32701 32696
> > > > > MemFree: 75 60 60 60 3169 2784 3203 2944
> > > >
> > > > I run few more combinations, and here are results / findings:
> > > >
> > > >   6.6.7-1  (vanila)                            == OK, no issue
> > > >
> > > >   6.6.8-1  (vanila)                            == single kswapd 100% !
> > > >   6.6.8-1  (vanila plus mglru-fix-6.6.9.patch) == OK, no issue
> > > >   6.6.8-1  (revert four mglru patches)         == OK, no issue
> > > >
> > > >   6.6.9-1  (vanila)                            == four kswapd 100% !!!!
> > > >   6.6.9-2  (vanila plus mglru-fix-6.6.9.patch) == four kswapd 100% !!!!
> > > >   6.6.9-3  (revert four mglru patches)         == four kswapd 100% !!!!
> > > >
> > > > Summary:
> > > > * mglru-fix-6.6.9.patch or reverting mglru patches helps in case of
> > > > kernel 6.6.8,
> > > > * there is (new?) problem in case of 6.6.9 kernel, which looks not to
> > > > be related to mglru patches at all
> > >
> > > I was able to bisect this change and it looks like there is something
> > > going wrong with the ice driver…
> > >
> > > Usually after booting our server we see something like this. Most of
> > > the nodes have ~2-3G of free memory. There are always 1-2 NUMA nodes
> > > that have a really low amount of free memory and we don't know why but
> > > it looks like that in the end causes the constant swap in/out issue.
> > > With the final bit of the patch you've sent earlier in this thread it
> > > is almost invisible.
> > >
> > > NUMA nodes:     0       1       2       3       4       5       6       7
> > > HPTotalGiB:     28      28      28      28      28      28      28      28
> > > HPFreeGiB:      28      28      28      28      28      28      28      28
> > > MemTotal:       32264   32701   32659   32686   32701   32701   32701   32696
> > > MemFree:        2191    2828    92      292     3344    2916    3594    3222
> > >
> > >
> > > However, after the following patch we see that more NUMA nodes have
> > > such a low amount of memory and  that is causing constant reclaiming
> > > of memory because it looks like something inside of the kernel ate all
> > > the memory. This is right after the start of the system as well.
> > >
> > > NUMA nodes:     0       1       2       3       4       5       6       7
> > > HPTotalGiB:     28      28      28      28      28      28      28      28
> > > HPFreeGiB:      28      28      28      28      28      28      28      28
> > > MemTotal:       32264   32701   32659   32686   32701   32701   32701   32696
> > > MemFree:        46      59      51      33      3078    3535    2708    3511
> > >
> > > The difference is 18G vs 12G of free memory sum'd across all NUMA
> > > nodes right after boot of the system. If you have some hints on how to
> > > debug what is actually occupying all that memory, maybe in both cases
> > > - would be happy to debug more!
> > >
> > > Dave, would you have any idea why that patch could cause such a boost
> > > in memory utilization?
> > >
> > > commit fc4d6d136d42fab207b3ce20a8ebfd61a13f931f
> > > Author: Dave Ertman <david.m.ertman@intel.com>
> > > Date:   Mon Dec 11 13:19:28 2023 -0800
> > >
> > >     ice: alter feature support check for SRIOV and LAG
> > >
> > >     [ Upstream commit 4d50fcdc2476eef94c14c6761073af5667bb43b6 ]
> > >
> > >     Previously, the ice driver had support for using a handler for bonding
> > >     netdev events to ensure that conflicting features were not allowed to be
> > >     activated at the same time.  While this was still in place, additional
> > >     support was added to specifically support SRIOV and LAG together.  These
> > >     both utilized the netdev event handler, but the SRIOV and LAG feature
> > > was
> > >     behind a capabilities feature check to make sure the current NVM has
> > >     support.
> > >
> > >     The exclusion part of the event handler should be removed since there are
> > >     users who have custom made solutions that depend on the non-exclusion
> > > of
> > >     features.
> > >
> > >     Wrap the creation/registration and cleanup of the event handler and
> > >     associated structs in the probe flow with a feature check so that the
> > >     only systems that support the full implementation of LAG features will
> > >     initialize support.  This will leave other systems unhindered with
> > >     functionality as it existed before any LAG code was added.
> >
> > Igor,
> >
> > I have no idea why that two line commit would do anything to increase memory usage by the ice driver.
> > If anything, I would expect it to lower memory usage as it has the potential to stop the allocation of memory
> > for the pf->lag struct.
> >
> > DaveE
>
> Hello,
>
> I believe we can track it as two different issues. So I reported the
> ICE driver commit as a email with subject "[REGRESSION] Intel ICE
> Ethernet driver in linux >= 6.6.9 triggers extra memory consumption
> and cause continous kswapd* usage and continuous swapping" to
>     Jesse Brandeburg <jesse.brandeburg@intel.com>
>     Tony Nguyen <anthony.l.nguyen@intel.com>
>     intel-wired-lan@lists.osuosl.org
>     Dave Ertman <david.m.ertman@intel.com>
>
> Lets track the mglru here in this email thread. Yu, the kernel build
> with your mglru-fix-6.6.9.patch seem to be OK at least running it for
> 3days without kswapd usage (excluding the ice driver commit).

Hi Jaroslav,

Do we now have a clear conclusion that mglru-fix-6.6.9.patch made a
difference? IOW, were you able to reproduce the problem consistently
without it?

Thanks!


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2024-01-16  4:58                                                       ` Yu Zhao
@ 2024-01-16 17:34                                                         ` Jaroslav Pulchart
  0 siblings, 0 replies; 30+ messages in thread
From: Jaroslav Pulchart @ 2024-01-16 17:34 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Ertman, David M, Igor Raits, Daniel Secik, Charan Teja Kalla,
	Kalesh Singh, akpm, linux-mm

>
> On Mon, Jan 8, 2024 at 10:54 AM Jaroslav Pulchart
> <jaroslav.pulchart@gooddata.com> wrote:
> >
> > >
> > > > -----Original Message-----
> > > > From: Igor Raits <igor@gooddata.com>
> > > > Sent: Thursday, January 4, 2024 3:51 PM
> > > > To: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
> > > > Cc: Yu Zhao <yuzhao@google.com>; Daniel Secik
> > > > <daniel.secik@gooddata.com>; Charan Teja Kalla
> > > > <quic_charante@quicinc.com>; Kalesh Singh <kaleshsingh@google.com>;
> > > > akpm@linux-foundation.org; linux-mm@kvack.org; Ertman, David M
> > > > <david.m.ertman@intel.com>
> > > > Subject: Re: high kswapd CPU usage with symmetrical swap in/out pattern
> > > > with multi-gen LRU
> > > >
> > > > Hello everyone,
> > > >
> > > > On Thu, Jan 4, 2024 at 3:34 PM Jaroslav Pulchart
> > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > >
> > > > > >
> > > > > > >
> > > > > > > On Wed, Jan 3, 2024 at 2:30 PM Jaroslav Pulchart
> > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Hi yu,
> > > > > > > > > >
> > > > > > > > > > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > > > > > > > > > Charan, does the fix previously attached seem acceptable to
> > > > you? Any
> > > > > > > > > > > additional feedback? Thanks.
> > > > > > > > > >
> > > > > > > > > > First, thanks for taking this patch to upstream.
> > > > > > > > > >
> > > > > > > > > > A comment in code snippet is checking just 'high wmark' pages
> > > > might
> > > > > > > > > > succeed here but can fail in the immediate kswapd sleep, see
> > > > > > > > > > prepare_kswapd_sleep(). This can show up into the increased
> > > > > > > > > > KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary
> > > > kswapd run time.
> > > > > > > > > > @Jaroslav: Have you observed something like above?
> > > > > > > > >
> > > > > > > > > I do not see any unnecessary kswapd run time, on the contrary it is
> > > > > > > > > fixing the kswapd continuous run issue.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > So, in downstream, we have something like for
> > > > zone_watermark_ok():
> > > > > > > > > > unsigned long size = wmark_pages(zone, mark) +
> > > > MIN_LRU_BATCH << 2;
> > > > > > > > > >
> > > > > > > > > > Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value,
> > > > may be we
> > > > > > > > > > should atleast use the 'MIN_LRU_BATCH' with the mentioned
> > > > reasoning, is
> > > > > > > > > > what all I can say for this patch.
> > > > > > > > > >
> > > > > > > > > > +       mark = sysctl_numa_balancing_mode &
> > > > NUMA_BALANCING_MEMORY_TIERING ?
> > > > > > > > > > +              WMARK_PROMO : WMARK_HIGH;
> > > > > > > > > > +       for (i = 0; i <= sc->reclaim_idx; i++) {
> > > > > > > > > > +               struct zone *zone = lruvec_pgdat(lruvec)->node_zones +
> > > > i;
> > > > > > > > > > +               unsigned long size = wmark_pages(zone, mark);
> > > > > > > > > > +
> > > > > > > > > > +               if (managed_zone(zone) &&
> > > > > > > > > > +                   !zone_watermark_ok(zone, sc->order, size, sc-
> > > > >reclaim_idx, 0))
> > > > > > > > > > +                       return false;
> > > > > > > > > > +       }
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Charan
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Jaroslav Pulchart
> > > > > > > > > Sr. Principal SW Engineer
> > > > > > > > > GoodData
> > > > > > > >
> > > > > > > >
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > today we try to update servers to 6.6.9 which contains the mglru fixes
> > > > > > > > (from 6.6.8) and the server behaves much much worse.
> > > > > > > >
> > > > > > > > I got multiple kswapd* load to ~100% imediatelly.
> > > > > > > >     555 root      20   0       0      0      0 R  99.7   0.0   4:32.86
> > > > > > > > kswapd1
> > > > > > > >     554 root      20   0       0      0      0 R  99.3   0.0   3:57.76
> > > > > > > > kswapd0
> > > > > > > >     556 root      20   0       0      0      0 R  97.7   0.0   3:42.27
> > > > > > > > kswapd2
> > > > > > > > are the changes in upstream different compared to the initial patch
> > > > > > > > which I tested?
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Jaroslav Pulchart
> > > > > > >
> > > > > > > Hi Jaroslav,
> > > > > > >
> > > > > > > My apologies for all the trouble!
> > > > > > >
> > > > > > > Yes, there is a slight difference between the fix you verified and
> > > > > > > what went into 6.6.9. The fix in 6.6.9 is disabled under a special
> > > > > > > condition which I thought wouldn't affect you.
> > > > > > >
> > > > > > > Could you try the attached fix again on top of 6.6.9? It removed that
> > > > > > > special condition.
> > > > > > >
> > > > > > > Thanks!
> > > > > >
> > > > > > Thanks for prompt response. I did a test with the patch and it didn't
> > > > > > help. The situation is super strange.
> > > > > >
> > > > > > I tried kernels 6.6.7, 6.6.8 and  6.6.9. I see high memory utilization
> > > > > > of all numa nodes of the first cpu socket if using 6.6.9 and it is the
> > > > > > worst situation, but the kswapd load is visible from 6.6.8.
> > > > > >
> > > > > > Setup of this server:
> > > > > > * 4 chiplets per each sockets, there are 2 sockets
> > > > > > * 32 GB of RAM for each chiplet, 28GB are in hugepages
> > > > > >   Note: previously I have 29GB in Hugepages, I free up 1GB to avoid
> > > > > > memory pressure however it is even worse now in contrary.
> > > > > >
> > > > > > kernel 6.6.7: I do not see kswapd usage when application started == OK
> > > > > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > > > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > > > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > > > > MemTotal: 32264 32701 32701 32686 32701 32659 32701 32696
> > > > > > MemFree: 2766 2715 63 2366 3495 2990 3462 252
> > > > > >
> > > > > > kernel 6.6.8: I see kswapd on nodes 2 and 3 when application started
> > > > > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > > > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > > > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > > > > MemTotal: 32264 32701 32701 32686 32701 32701 32659 32696
> > > > > > MemFree: 2744 2788 65 581 3304 3215 3266 2226
> > > > > >
> > > > > > kernel 6.6.9: I see kswapd on nodes 0, 1, 2 and 3 when application started
> > > > > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > > > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > > > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > > > > MemTotal: 32264 32701 32701 32686 32659 32701 32701 32696
> > > > > > MemFree: 75 60 60 60 3169 2784 3203 2944
> > > > >
> > > > > I run few more combinations, and here are results / findings:
> > > > >
> > > > >   6.6.7-1  (vanila)                            == OK, no issue
> > > > >
> > > > >   6.6.8-1  (vanila)                            == single kswapd 100% !
> > > > >   6.6.8-1  (vanila plus mglru-fix-6.6.9.patch) == OK, no issue
> > > > >   6.6.8-1  (revert four mglru patches)         == OK, no issue
> > > > >
> > > > >   6.6.9-1  (vanila)                            == four kswapd 100% !!!!
> > > > >   6.6.9-2  (vanila plus mglru-fix-6.6.9.patch) == four kswapd 100% !!!!
> > > > >   6.6.9-3  (revert four mglru patches)         == four kswapd 100% !!!!
> > > > >
> > > > > Summary:
> > > > > * mglru-fix-6.6.9.patch or reverting mglru patches helps in case of
> > > > > kernel 6.6.8,
> > > > > * there is (new?) problem in case of 6.6.9 kernel, which looks not to
> > > > > be related to mglru patches at all
> > > >
> > > > I was able to bisect this change and it looks like there is something
> > > > going wrong with the ice driver…
> > > >
> > > > Usually after booting our server we see something like this. Most of
> > > > the nodes have ~2-3G of free memory. There are always 1-2 NUMA nodes
> > > > that have a really low amount of free memory and we don't know why but
> > > > it looks like that in the end causes the constant swap in/out issue.
> > > > With the final bit of the patch you've sent earlier in this thread it
> > > > is almost invisible.
> > > >
> > > > NUMA nodes:     0       1       2       3       4       5       6       7
> > > > HPTotalGiB:     28      28      28      28      28      28      28      28
> > > > HPFreeGiB:      28      28      28      28      28      28      28      28
> > > > MemTotal:       32264   32701   32659   32686   32701   32701   32701   32696
> > > > MemFree:        2191    2828    92      292     3344    2916    3594    3222
> > > >
> > > >
> > > > However, after the following patch we see that more NUMA nodes have
> > > > such a low amount of memory and  that is causing constant reclaiming
> > > > of memory because it looks like something inside of the kernel ate all
> > > > the memory. This is right after the start of the system as well.
> > > >
> > > > NUMA nodes:     0       1       2       3       4       5       6       7
> > > > HPTotalGiB:     28      28      28      28      28      28      28      28
> > > > HPFreeGiB:      28      28      28      28      28      28      28      28
> > > > MemTotal:       32264   32701   32659   32686   32701   32701   32701   32696
> > > > MemFree:        46      59      51      33      3078    3535    2708    3511
> > > >
> > > > The difference is 18G vs 12G of free memory sum'd across all NUMA
> > > > nodes right after boot of the system. If you have some hints on how to
> > > > debug what is actually occupying all that memory, maybe in both cases
> > > > - would be happy to debug more!
> > > >
> > > > Dave, would you have any idea why that patch could cause such a boost
> > > > in memory utilization?
> > > >
> > > > commit fc4d6d136d42fab207b3ce20a8ebfd61a13f931f
> > > > Author: Dave Ertman <david.m.ertman@intel.com>
> > > > Date:   Mon Dec 11 13:19:28 2023 -0800
> > > >
> > > >     ice: alter feature support check for SRIOV and LAG
> > > >
> > > >     [ Upstream commit 4d50fcdc2476eef94c14c6761073af5667bb43b6 ]
> > > >
> > > >     Previously, the ice driver had support for using a handler for bonding
> > > >     netdev events to ensure that conflicting features were not allowed to be
> > > >     activated at the same time.  While this was still in place, additional
> > > >     support was added to specifically support SRIOV and LAG together.  These
> > > >     both utilized the netdev event handler, but the SRIOV and LAG feature
> > > > was
> > > >     behind a capabilities feature check to make sure the current NVM has
> > > >     support.
> > > >
> > > >     The exclusion part of the event handler should be removed since there are
> > > >     users who have custom made solutions that depend on the non-exclusion
> > > > of
> > > >     features.
> > > >
> > > >     Wrap the creation/registration and cleanup of the event handler and
> > > >     associated structs in the probe flow with a feature check so that the
> > > >     only systems that support the full implementation of LAG features will
> > > >     initialize support.  This will leave other systems unhindered with
> > > >     functionality as it existed before any LAG code was added.
> > >
> > > Igor,
> > >
> > > I have no idea why that two-line commit would do anything to increase memory usage in the ice driver.
> > > If anything, I would expect it to lower memory usage as it has the potential to stop the allocation of memory
> > > for the pf->lag struct.
> > >
> > > DaveE
> >
> > Hello,
> >
> > I believe we can track these as two different issues. I reported the
> > ICE driver commit in a separate email with the subject "[REGRESSION] Intel ICE
> > Ethernet driver in linux >= 6.6.9 triggers extra memory consumption
> > and cause continous kswapd* usage and continuous swapping" to
> >     Jesse Brandeburg <jesse.brandeburg@intel.com>
> >     Tony Nguyen <anthony.l.nguyen@intel.com>
> >     intel-wired-lan@lists.osuosl.org
> >     Dave Ertman <david.m.ertman@intel.com>
> >
> > Let's track the mglru issue here in this email thread. Yu, the kernel build
> > with your mglru-fix-6.6.9.patch seems to be OK, at least after running it for
> > 3 days without kswapd usage (excluding the ice driver commit).
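
(A small, hedged aside on how "no kswapd usage / no swapping" can be verified over such a multi-day run: sampling the pswpin/pswpout counters in /proc/vmstat gives swap-in/out throughput directly; the counters are in pages, so they are scaled by the system page size below.)

import os, time

PAGE = os.sysconf("SC_PAGE_SIZE")       # bytes per page, typically 4096
INTERVAL = 5                            # seconds between samples

def swap_counters():
    counters = {}
    with open("/proc/vmstat") as f:
        for line in f:
            key, value = line.split()
            if key in ("pswpin", "pswpout"):
                counters[key] = int(value)
    return counters

prev = swap_counters()
for _ in range(12):                     # one minute of samples
    time.sleep(INTERVAL)
    cur = swap_counters()
    mb_in = (cur["pswpin"] - prev["pswpin"]) * PAGE / INTERVAL / 1e6
    mb_out = (cur["pswpout"] - prev["pswpout"]) * PAGE / INTERVAL / 1e6
    print(f"swap in {mb_in:7.2f} MB/s   swap out {mb_out:7.2f} MB/s")
    prev = cur

(A steady, symmetrical pair of non-zero rates is the pattern named in the thread's subject.)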
>
> Hi Jaroslav,
>
> Do we now have a clear conclusion that mglru-fix-6.6.9.patch made a
> difference? IOW, were you able to reproduce the problem consistently
> without it?
>
> Thanks!


Hi Yu,

the mglru-fix-6.6.9.patch is needed for all kernels >= 6.6.8 up to 6.7. I
tested the new 6.7 (without the mglru fix) and that kernel is fine, as I
cannot trigger the problem there.
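
(One more hedged sketch for anyone reproducing the comparison on other kernels: whether MGLRU is actually active can be confirmed from its documented sysfs knob before blaming or exonerating it.)

# Assumes the documented /sys/kernel/mm/lru_gen/enabled capability mask;
# a non-zero value means the multi-gen LRU is in use.
PATH = "/sys/kernel/mm/lru_gen/enabled"

try:
    with open(PATH) as f:
        mask = int(f.read().strip(), 16)
except FileNotFoundError:
    print("kernel built without CONFIG_LRU_GEN (or too old to have MGLRU)")
else:
    if mask:
        print(f"MGLRU enabled, capability mask {mask:#06x}")
        # To disable at runtime (as root): echo n > /sys/kernel/mm/lru_gen/enabled
    else:
        print("MGLRU present but disabled")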


-- 
Jaroslav Pulchart
Sr. Principal SW Engineer
GoodData


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2024-01-16 17:35 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-11-08 14:35 high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU Jaroslav Pulchart
2023-11-08 18:47 ` Yu Zhao
2023-11-08 20:04   ` Jaroslav Pulchart
2023-11-08 22:09     ` Yu Zhao
2023-11-09  6:39       ` Jaroslav Pulchart
2023-11-09  6:48         ` Yu Zhao
2023-11-09 10:58           ` Jaroslav Pulchart
2023-11-10  1:31             ` Yu Zhao
     [not found]               ` <CAK8fFZ5xUe=JMOxUWgQ-0aqWMXuZYF2EtPOoZQqr89sjrL+zTw@mail.gmail.com>
2023-11-13 20:09                 ` Yu Zhao
2023-11-14  7:29                   ` Jaroslav Pulchart
2023-11-14  7:47                     ` Yu Zhao
2023-11-20  8:41                       ` Jaroslav Pulchart
2023-11-22  6:13                         ` Yu Zhao
2023-11-22  7:12                           ` Jaroslav Pulchart
2023-11-22  7:30                             ` Jaroslav Pulchart
2023-11-22 14:18                               ` Yu Zhao
2023-11-29 13:54                                 ` Jaroslav Pulchart
2023-12-01 23:52                                   ` Yu Zhao
2023-12-07  8:46                                     ` Charan Teja Kalla
2023-12-07 18:23                                       ` Yu Zhao
2023-12-08  8:03                                       ` Jaroslav Pulchart
2024-01-03 21:30                                         ` Jaroslav Pulchart
2024-01-04  3:03                                           ` Yu Zhao
2024-01-04  9:46                                             ` Jaroslav Pulchart
2024-01-04 14:34                                               ` Jaroslav Pulchart
2024-01-04 23:51                                                 ` Igor Raits
2024-01-05 17:35                                                   ` Ertman, David M
2024-01-08 17:53                                                     ` Jaroslav Pulchart
2024-01-16  4:58                                                       ` Yu Zhao
2024-01-16 17:34                                                         ` Jaroslav Pulchart
