* high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
@ 2023-11-08 14:35 Jaroslav Pulchart
  2023-11-08 18:47 ` Yu Zhao
  0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2023-11-08 14:35 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm

Hello,

I would like to report an unpleasant behavior of multi-gen LRU, with
strange swap in/out usage, on my Dell 7525 two-socket AMD 74F3 system
(16 NUMA domains).

The symptoms of my issue are:

/A/ if multi-gen LRU is enabled
1/ [kswapd3] is consuming 100% CPU

    top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34, 18.26, 15.01
    Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
    %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,  0.4 si,  0.0 st
    MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
    MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
    ...
        765 root      20   0       0      0      0 R  98.3   0.0  34969:04 kswapd3
    ...
2/ swap space usage is low, about ~4MB out of the 8GB of swap on zram
(also observed with an on-disk swap device, where it caused IO latency
issues due to some kind of locking)
3/ swap in/out is huge and symmetrical, ~12MB/s in and ~12MB/s out


/B/ if multi-gen LRU is disabled
1/ [kswapd3] is consuming 3%-10% CPU
    top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05, 17.77, 14.77
    Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
    %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,  0.4 si,  0.0 st
    MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
    MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
    ...
       765 root      20   0       0      0      0 S   3.6   0.0  34966:46 [kswapd3]
    ...
2/ swap space usage is low (4MB)
3/ swap in/out is huge and symmetrical, ~500kB/s in and ~500kB/s out

Both situations are wrong as they use swap in/out extensively;
however, the multi-gen LRU situation is 10 times worse.
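
For reference, a minimal sketch of how the two configurations above
can be switched at runtime and how the swap rates can be observed,
assuming the standard lru_gen sysfs interface and vmstat; the exact
commands are illustrative, not necessarily the ones used here:

    cat /sys/kernel/mm/lru_gen/enabled        # non-zero mask (e.g. 0x0007) means MGLRU is enabled
    echo n > /sys/kernel/mm/lru_gen/enabled   # case /B/: disable multi-gen LRU
    echo y > /sys/kernel/mm/lru_gen/enabled   # case /A/: enable it again
    vmstat 1                                  # watch the si/so columns (swap in/out per second)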

The perf record of case /A/
-  100.00%     0.00%  kswapd3  [kernel.kallsyms]  [k] kswapd
   - kswapd
      - 99.88% balance_pgdat
         - 99.84% shrink_node
            - 99.78% shrink_many
               - 61.66% shrink_one
                  - 55.32% try_to_shrink_lruvec
                     - 49.80% try_to_inc_max_seq.constprop.0
                        - 49.53% walk_mm
                           - 49.46% walk_page_range
                              - 49.32% __walk_page_range
                                 - walk_pgd_range
                                    - walk_p4d_range
                                    - walk_pud_range
                                       - 49.02% walk_pmd_range
                                          - 45.94% get_next_vma
                                             - 30.08% mas_find
                                                - 29.33% mas_walk
                                                     26.83% mtree_range_walk
                                               2.86% should_skip_vma
                                               0.58% mas_next_slot
                                            1.25% walk_pmd_range_locked.isra.0
                     - 5.46% evict_folios
                        - 3.41% shrink_folio_list
                           - 1.15% pageout
                              - swap_writepage
                                 - 1.12% swap_writepage_bdev_sync
                                    - 1.01% submit_bio_wait
                                       - 1.00% __submit_bio_noacct
                                          - __submit_bio
                                             - zram_bio_write
                                                - 0.96% zram_write_page
                                                   - 0.82% lzorle_compress
                                                      - lzogeneric1x_1_compress
                                                           0.73% lzo1x_1_do_compress
                             0.68% __remove_mapping
                        - 1.02% isolate_folios
                           - scan_folios
                                0.65% isolate_folio.isra.0
                          0.55% move_folios_to_lru
                  - 5.43% lruvec_is_sizable
                     - 0.93% get_swappiness
                          mem_cgroup_get_nr_swap_pages
               - 32.07% lru_gen_rotate_memcg
                  - 3.23% _raw_spin_lock_irqsave
                       2.32% native_queued_spin_lock_slowpath
                    1.91% get_random_u8
               - 0.94% _raw_spin_unlock_irqrestore
                  - asm_sysvec_apic_timer_interrupt
                     - sysvec_apic_timer_interrupt
                        - 0.69% __sysvec_apic_timer_interrupt
                           - hrtimer_interrupt
                              - 0.65% __hrtimer_run_queues
                                 - 0.63% tick_sched_timer
                                    - 0.62% tick_sched_handle
                                       - update_process_times
                                           0.51% scheduler_tick

The perf record of case /B/
-  100.00%     0.00%  kswapd3  [kernel.kallsyms]  [k] kswapd
   - kswapd
      - 99.66% balance_pgdat
         - 90.96% shrink_node
            - 75.69% shrink_node_memcgs
               - 25.73% shrink_lruvec
                  - 18.74% get_scan_count
                       2.76% mem_cgroup_get_nr_swap_pages
                  - 2.50% blk_finish_plug
                     - __blk_flush_plug
                          blk_mq_flush_plug_list
                    1.02% shrink_inactive_list
                    1.01% inactive_is_low
               - 17.33% shrink_slab_memcg
                  - 4.02% do_shrink_slab
                     - 1.57% nfs4_xattr_entry_count
                        - list_lru_count_one
                             0.56% __rcu_read_unlock
                     - 0.79% super_cache_count
                          list_lru_count_one
                     - 0.68% nfs4_xattr_cache_count
                        - list_lru_count_one
                             xa_load
                    3.12% _find_next_bit
                    1.87% __radix_tree_lookup
                    0.67% up_read
                    0.67% down_read_trylock
               - 16.34% mem_cgroup_iter
                    0.57% __rcu_read_lock
                    0.54% __rcu_read_unlock
               - 9.36% shrink_slab
                  - do_shrink_slab
                     - 2.37% super_cache_count
                          1.04% list_lru_count_one
                       2.14% count_shadow_nodes
                       1.71% kfree_rcu_shrink_count
                 1.24% vmpressure
            - 15.27% prepare_scan_count
               - 15.04% do_flush_stats
                  - 14.93% cgroup_rstat_flush
                     - cgroup_rstat_flush_locked
                          13.20% mem_cgroup_css_rstat_flush
                          0.78% __blkcg_rstat_flush.isra.0
         - 5.87% shrink_active_list
              2.16% __count_memcg_events
              1.64% _raw_spin_lock_irq
              0.94% isolate_lru_folios
           2.24% mem_cgroup_iter


Could I ask for any suggestions on how to avoid this kswapd
utilization pattern? There is free RAM in each NUMA node compared to
the few MB used in swap:
    NUMA stats:
    NUMA nodes:    0     1     2     3     4     5     6     7     8     9    10    11    12    13    14    15
    MemTotal:  65048 65486 65486 65486 65486 65486 65486 65469 65486 65486 65486 65486 65486 65486 65486 65424
    MemFree:     468   601  1200   302   548  1879  2321  2478  1967  2239  1453  2417  2623  2833  2530  2269
The swap in/out usage does not make sense to me, nor does the CPU
utilization by multi-gen LRU.
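
The per-node numbers above can be gathered with something like the
following, assuming numactl/numastat are installed; shown only as a
sketch:

    numactl --hardware                              # node sizes and free memory per NUMA node
    numastat -m | grep -E 'MemTotal|MemFree'        # per-node MemTotal/MemFree in MB
    cat /sys/devices/system/node/node3/meminfo      # raw meminfo of the busiest node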

Many thanks and best regards,
-- 
Jaroslav Pulchart



* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-11-08 14:35 high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU Jaroslav Pulchart
@ 2023-11-08 18:47 ` Yu Zhao
  2023-11-08 20:04   ` Jaroslav Pulchart
  0 siblings, 1 reply; 30+ messages in thread
From: Yu Zhao @ 2023-11-08 18:47 UTC (permalink / raw)
  To: Jaroslav Pulchart; +Cc: linux-mm, akpm

Hi Jaroslav,

On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
<jaroslav.pulchart@gooddata.com> wrote:
>
> Hello,
>
> I would like to report to you an unpleasant behavior of multi-gen LRU
> with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> system (16numa domains).

Kernel version please?

> Symptoms of my issue are
>
> /A/ if mult-gen LRU is enabled
> 1/ [kswapd3] is consuming 100% CPU

Just thinking out loud: kswapd3 means the fourth node was under memory pressure.

>     top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34,
> 18.26, 15.01
>     Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
>     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,
> 0.4 si,  0.0 st
>     MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
>     MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
>     ...
>         765 root      20   0       0      0      0 R  98.3   0.0
> 34969:04 kswapd3
>     ...
> 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> observed with swap disk as well and cause IO latency issues due to
> some kind of locking)
> 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
>
>
> /B/ if mult-gen LRU is disabled
> 1/ [kswapd3] is consuming 3%-10% CPU
>     top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05,
> 17.77, 14.77
>     Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
>     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,
> 0.4 si,  0.0 st
>     MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
>     MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
>     ...
>        765 root      20   0       0      0      0 S   3.6   0.0
> 34966:46 [kswapd3]
>     ...
> 2/ swap space usage is low (4MB)
> 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
>
> Both situations are wrong as they are using swap in/out extensively,
> however the multi-gen LRU situation is 10times worse.

From the stats below, node 3 had the lowest free memory. So I think in
both cases, the reclaim activities were as expected.

> Could I ask for any suggestions on how to avoid the kswapd utilization
> pattern?

The easiest way is to disable NUMA domain so that there would be only
two nodes with 8x more memory. IOW, you have fewer pools but each pool
has more memory and therefore they are less likely to become empty.
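
As a sketch, the NUMA layout can be confirmed before and after
changing the BIOS NUMA-per-socket setting, assuming numactl is
installed:

    numactl --hardware      # lists nodes, their sizes and free memory
    lscpu | grep -i numa    # NUMA node count and CPU-to-node mapping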

> There is a free RAM in each numa node for the few MB used in
> swap:
>     NUMA stats:
>     NUMA nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
>     MemTotal: 65048 65486 65486 65486 65486 65486 65486 65469 65486
> 65486 65486 65486 65486 65486 65486 65424
>     MemFree: 468 601 1200 302 548 1879 2321 2478 1967 2239 1453 2417
> 2623 2833 2530 2269
> the in/out usage does not make sense for me nor the CPU utilization by
> multi-gen LRU.

My questions:
1. Were there any OOM kills with either case?
2. Was THP enabled?
MGLRU might have spent the extra CPU cycles just to avoid OOM kills or
produce more THPs.

If disabling the NUMA domain isn't an option, I'd recommend:
1. Try the latest kernel (6.6.1) if you haven't.
2. Disable THP if it was enabled, to verify whether it has an impact.

Thanks.



* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-11-08 18:47 ` Yu Zhao
@ 2023-11-08 20:04   ` Jaroslav Pulchart
  2023-11-08 22:09     ` Yu Zhao
  0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2023-11-08 20:04 UTC (permalink / raw)
  To: Yu Zhao; +Cc: linux-mm, akpm, Igor Raits, Daniel Secik

>
> Hi Jaroslav,

Hi Yu Zhao

thanks for the response, see answers inline:

>
> On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> <jaroslav.pulchart@gooddata.com> wrote:
> >
> > Hello,
> >
> > I would like to report to you an unpleasant behavior of multi-gen LRU
> > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > system (16numa domains).
>
> Kernel version please?

6.5.y, but we saw it earlier as well; it has been under investigation
since 23rd May (on 6.4.y and maybe even 6.3.y).

>
> > Symptoms of my issue are
> >
> > /A/ if mult-gen LRU is enabled
> > 1/ [kswapd3] is consuming 100% CPU
>
> Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
>
> >     top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34,
> > 18.26, 15.01
> >     Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
> >     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,
> > 0.4 si,  0.0 st
> >     MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
> >     MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
> >     ...
> >         765 root      20   0       0      0      0 R  98.3   0.0
> > 34969:04 kswapd3
> >     ...
> > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > observed with swap disk as well and cause IO latency issues due to
> > some kind of locking)
> > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> >
> >
> > /B/ if mult-gen LRU is disabled
> > 1/ [kswapd3] is consuming 3%-10% CPU
> >     top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05,
> > 17.77, 14.77
> >     Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
> >     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,
> > 0.4 si,  0.0 st
> >     MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
> >     MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
> >     ...
> >        765 root      20   0       0      0      0 S   3.6   0.0
> > 34966:46 [kswapd3]
> >     ...
> > 2/ swap space usage is low (4MB)
> > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> >
> > Both situations are wrong as they are using swap in/out extensively,
> > however the multi-gen LRU situation is 10times worse.
>
> From the stats below, node 3 had the lowest free memory. So I think in
> both cases, the reclaim activities were as expected.

I do not see a reason for the memory pressure and reclaims. This node
has the lowest free memory of all nodes (~302MB free), that is true;
however, the swap space usage is just 4MB (and still going in and
out). So what can be the reason for that behaviour?

The workers/applications are running in pre-allocated HugePages and
the rest is used by a small set of system services and device
drivers. It is static and not growing. The issue persists even when I
stop the system services and free that memory.

>
> > Could I ask for any suggestions on how to avoid the kswapd utilization
> > pattern?
>
> The easiest way is to disable NUMA domain so that there would be only
> two nodes with 8x more memory. IOW, you have fewer pools but each pool
> has more memory and therefore they are less likely to become empty.
>
> > There is a free RAM in each numa node for the few MB used in
> > swap:
> >     NUMA stats:
> >     NUMA nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> >     MemTotal: 65048 65486 65486 65486 65486 65486 65486 65469 65486
> > 65486 65486 65486 65486 65486 65486 65424
> >     MemFree: 468 601 1200 302 548 1879 2321 2478 1967 2239 1453 2417
> > 2623 2833 2530 2269
> > the in/out usage does not make sense for me nor the CPU utilization by
> > multi-gen LRU.
>
> My questions:
> 1. Were there any OOM kills with either case?

There is no OOM. Neither the memory usage nor the swap space usage is
growing; swap stays at a few MB.

> 2. Was THP enabled?

Both situations occur, with THP enabled as well as disabled.

> MGLRU might have spent the extra CPU cycles just to void OOM kills or
> produce more THPs.
>
> If disabling the NUMA domain isn't an option, I'd recommend:

Disabling NUMA is not an option. However, we are now testing a setup
with 1GB less in HugePages on each NUMA node.

> 1. Try the latest kernel (6.6.1) if you haven't.

Not yet; 6.6.1 was only released today.

> 2. Disable THP if it was enabled, to verify whether it has an impact.

I tried disabling THP, without any effect.
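
For completeness, a sketch of how THP is typically checked and
toggled, assuming the usual sysfs knob (not necessarily the exact
commands used here):

    cat /sys/kernel/mm/transparent_hugepage/enabled            # e.g. [always] madvise never
    echo never > /sys/kernel/mm/transparent_hugepage/enabled   # disable THP at runtime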

>
> Thanks.



* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-11-08 20:04   ` Jaroslav Pulchart
@ 2023-11-08 22:09     ` Yu Zhao
  2023-11-09  6:39       ` Jaroslav Pulchart
  0 siblings, 1 reply; 30+ messages in thread
From: Yu Zhao @ 2023-11-08 22:09 UTC (permalink / raw)
  To: Jaroslav Pulchart
  Cc: linux-mm, akpm, Igor Raits, Daniel Secik, Charan Teja Kalla

[-- Attachment #1: Type: text/plain, Size: 5724 bytes --]

On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
<jaroslav.pulchart@gooddata.com> wrote:
>
> >
> > Hi Jaroslav,
>
> Hi Yu Zhao
>
> thanks for response, see answers inline:
>
> >
> > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > <jaroslav.pulchart@gooddata.com> wrote:
> > >
> > > Hello,
> > >
> > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > system (16numa domains).
> >
> > Kernel version please?
>
> 6.5.y, but we saw it sooner as it is in investigation from 23th May
> (6.4.y and maybe even the 6.3.y).

v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
for you if you run into other problems with v6.6.

> > > Symptoms of my issue are
> > >
> > > /A/ if mult-gen LRU is enabled
> > > 1/ [kswapd3] is consuming 100% CPU
> >
> > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> >
> > >     top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34,
> > > 18.26, 15.01
> > >     Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
> > >     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,
> > > 0.4 si,  0.0 st
> > >     MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
> > >     MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
> > >     ...
> > >         765 root      20   0       0      0      0 R  98.3   0.0
> > > 34969:04 kswapd3
> > >     ...
> > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > > observed with swap disk as well and cause IO latency issues due to
> > > some kind of locking)
> > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > >
> > >
> > > /B/ if mult-gen LRU is disabled
> > > 1/ [kswapd3] is consuming 3%-10% CPU
> > >     top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05,
> > > 17.77, 14.77
> > >     Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
> > >     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,
> > > 0.4 si,  0.0 st
> > >     MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
> > >     MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
> > >     ...
> > >        765 root      20   0       0      0      0 S   3.6   0.0
> > > 34966:46 [kswapd3]
> > >     ...
> > > 2/ swap space usage is low (4MB)
> > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > >
> > > Both situations are wrong as they are using swap in/out extensively,
> > > however the multi-gen LRU situation is 10times worse.
> >
> > From the stats below, node 3 had the lowest free memory. So I think in
> > both cases, the reclaim activities were as expected.
>
> I do not see a reason for the memory pressure and reclaims. This node
> has the lowest free memory of all nodes (~302MB free) that is true,
> however the swap space usage is just 4MB (still going in and out). So
> what can be the reason for that behaviour?

The best analogy is that refuel (reclaim) happens before the tank
becomes empty, and it happens even sooner when there is a long road
ahead (high order allocations).
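
To make the analogy concrete: kswapd is woken when the free pages of a
zone drop below its low watermark and keeps reclaiming until the high
watermark is reached again. A sketch of how the watermarks of the
problematic node can be inspected, assuming the standard
/proc/zoneinfo layout:

    grep -A 8 'Node 3, zone' /proc/zoneinfo    # shows pages free / min / low / high for node 3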

> The workers/application is running in pre-allocated HugePages and the
> rest is used for a small set of system services and drivers of
> devices. It is static and not growing. The issue persists when I stop
> the system services and free the memory.

Yes, this helps. Also could you attach /proc/buddyinfo from the moment
you hit the problem?

> > > Could I ask for any suggestions on how to avoid the kswapd utilization
> > > pattern?
> >
> > The easiest way is to disable NUMA domain so that there would be only
> > two nodes with 8x more memory. IOW, you have fewer pools but each pool
> > has more memory and therefore they are less likely to become empty.
> >
> > > There is a free RAM in each numa node for the few MB used in
> > > swap:
> > >     NUMA stats:
> > >     NUMA nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> > >     MemTotal: 65048 65486 65486 65486 65486 65486 65486 65469 65486
> > > 65486 65486 65486 65486 65486 65486 65424
> > >     MemFree: 468 601 1200 302 548 1879 2321 2478 1967 2239 1453 2417
> > > 2623 2833 2530 2269
> > > the in/out usage does not make sense for me nor the CPU utilization by
> > > multi-gen LRU.
> >
> > My questions:
> > 1. Were there any OOM kills with either case?
>
> There is no OOM. The memory usage is not growing nor the swap space
> usage, it is still a few MB there.
>
> > 2. Was THP enabled?
>
> Both situations with enabled and with disabled THP.

My suspicion is that you packed node 3 too perfectly :) And that
might have triggered a known but currently low-priority problem in
MGLRU. I'm attaching a patch for v6.6 and hoping you could verify it
for me, in case v6.6 by itself still has the problem?

> > MGLRU might have spent the extra CPU cycles just to void OOM kills or
> > produce more THPs.
> >
> > If disabling the NUMA domain isn't an option, I'd recommend:
>
> Disabling numa is not an option. However we are now testing a setup
> with -1GB in HugePages per each numa.
>
> > 1. Try the latest kernel (6.6.1) if you haven't.
>
> Not yet, the 6.6.1 was released today.
>
> > 2. Disable THP if it was enabled, to verify whether it has an impact.
>
> I try disabling THP without any effect.

Gotcha. Please try the patch with MGLRU and let me know. Thanks!

(Also CCing Charan @ Qualcomm, who initially reported the problem
that led to the attached patch.)

[-- Attachment #2: 0001-mm-mglru-curb-kswapd-overshooting-high-wmarks.patch --]
[-- Type: application/octet-stream, Size: 3209 bytes --]

From a188169d26b2d40fe0a91393761cf2292984545c Mon Sep 17 00:00:00 2001
From: Yu Zhao <yuzhao@google.com>
Date: Wed, 8 Nov 2023 14:56:58 -0700
Subject: [PATCH] mm/mglru: curb kswapd overshooting high wmarks

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 mm/vmscan.c | 40 +++++++++++++++++++++++++++++++++-------
 1 file changed, 33 insertions(+), 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6f13394b112e..dc0bd2cc27e0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5341,20 +5341,47 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool
 	return try_to_inc_max_seq(lruvec, max_seq, sc, can_swap, false) ? -1 : 0;
 }
 
-static unsigned long get_nr_to_reclaim(struct scan_control *sc)
+static unsigned long get_nr_to_reclaim(struct lruvec *lruvec, struct scan_control *sc)
 {
+	int i;
+	unsigned long nr_to_reclaim;
+
 	/* don't abort memcg reclaim to ensure fairness */
 	if (!root_reclaim(sc))
 		return -1;
 
-	return max(sc->nr_to_reclaim, compact_gap(sc->order));
+	nr_to_reclaim = max(sc->nr_to_reclaim, compact_gap(sc->order));
+	if (sc->nr_reclaimed >= nr_to_reclaim)
+		return 0;
+
+	/* don't abort direct reclaim to avoid premature OOM */
+	if (!current_is_kswapd())
+		return nr_to_reclaim;
+
+	/* abort only if all eligible zones are balanced */
+	for (i = 0; i <= sc->reclaim_idx; i++) {
+		unsigned long wmark;
+		struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
+
+		if (!managed_zone(zone))
+			continue;
+
+		if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
+			wmark = wmark_pages(zone, WMARK_PROMO);
+		else
+			wmark = high_wmark_pages(zone);
+
+		if (!zone_watermark_ok_safe(zone, sc->order, wmark, sc->reclaim_idx))
+			return nr_to_reclaim;
+	}
+
+	return i > sc->reclaim_idx ? 0 : nr_to_reclaim;
 }
 
 static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 {
 	long nr_to_scan;
 	unsigned long scanned = 0;
-	unsigned long nr_to_reclaim = get_nr_to_reclaim(sc);
 	int swappiness = get_swappiness(lruvec, sc);
 
 	/* clean file folios are more likely to exist */
@@ -5376,7 +5403,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 		if (scanned >= nr_to_scan)
 			break;
 
-		if (sc->nr_reclaimed >= nr_to_reclaim)
+		if (sc->nr_reclaimed >= get_nr_to_reclaim(lruvec, sc))
 			break;
 
 		cond_resched();
@@ -5437,7 +5464,6 @@ static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
 	struct lru_gen_folio *lrugen;
 	struct mem_cgroup *memcg;
 	const struct hlist_nulls_node *pos;
-	unsigned long nr_to_reclaim = get_nr_to_reclaim(sc);
 
 	bin = first_bin = get_random_u32_below(MEMCG_NR_BINS);
 restart:
@@ -5470,7 +5496,7 @@ static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
 
 		rcu_read_lock();
 
-		if (sc->nr_reclaimed >= nr_to_reclaim)
+		if (sc->nr_reclaimed >= get_nr_to_reclaim(lruvec, sc))
 			break;
 	}
 
@@ -5481,7 +5507,7 @@ static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
 
 	mem_cgroup_put(memcg);
 
-	if (sc->nr_reclaimed >= nr_to_reclaim)
+	if (!is_a_nulls(pos))
 		return;
 
 	/* restart if raced with lru_gen_rotate_memcg() */
-- 
2.42.0.869.gea05f2083d-goog
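
A typical way to apply the attached patch on a v6.6 tree, sketched
under the assumption that the attachment is saved under the file name
shown above:

    git checkout v6.6
    git am 0001-mm-mglru-curb-kswapd-overshooting-high-wmarks.patch
    # then rebuild, install and boot the patched kernel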



* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-11-08 22:09     ` Yu Zhao
@ 2023-11-09  6:39       ` Jaroslav Pulchart
  2023-11-09  6:48         ` Yu Zhao
  0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2023-11-09  6:39 UTC (permalink / raw)
  To: Yu Zhao; +Cc: linux-mm, akpm, Igor Raits, Daniel Secik, Charan Teja Kalla

>
> On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> <jaroslav.pulchart@gooddata.com> wrote:
> >
> > >
> > > Hi Jaroslav,
> >
> > Hi Yu Zhao
> >
> > thanks for response, see answers inline:
> >
> > >
> > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > <jaroslav.pulchart@gooddata.com> wrote:
> > > >
> > > > Hello,
> > > >
> > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > > system (16numa domains).
> > >
> > > Kernel version please?
> >
> > 6.5.y, but we saw it sooner as it is in investigation from 23th May
> > (6.4.y and maybe even the 6.3.y).
>
> v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> for you if you run into other problems with v6.6.
>

I will give it a try using 6.6.y. If it works, we can switch to
6.6.y instead of backporting the fixes to 6.5.y.

> > > > Symptoms of my issue are
> > > >
> > > > /A/ if mult-gen LRU is enabled
> > > > 1/ [kswapd3] is consuming 100% CPU
> > >
> > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > >
> > > >     top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34,
> > > > 18.26, 15.01
> > > >     Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
> > > >     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,
> > > > 0.4 si,  0.0 st
> > > >     MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
> > > >     MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
> > > >     ...
> > > >         765 root      20   0       0      0      0 R  98.3   0.0
> > > > 34969:04 kswapd3
> > > >     ...
> > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > > > observed with swap disk as well and cause IO latency issues due to
> > > > some kind of locking)
> > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > > >
> > > >
> > > > /B/ if mult-gen LRU is disabled
> > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > >     top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05,
> > > > 17.77, 14.77
> > > >     Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
> > > >     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,
> > > > 0.4 si,  0.0 st
> > > >     MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
> > > >     MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
> > > >     ...
> > > >        765 root      20   0       0      0      0 S   3.6   0.0
> > > > 34966:46 [kswapd3]
> > > >     ...
> > > > 2/ swap space usage is low (4MB)
> > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > > >
> > > > Both situations are wrong as they are using swap in/out extensively,
> > > > however the multi-gen LRU situation is 10times worse.
> > >
> > > From the stats below, node 3 had the lowest free memory. So I think in
> > > both cases, the reclaim activities were as expected.
> >
> > I do not see a reason for the memory pressure and reclaims. This node
> > has the lowest free memory of all nodes (~302MB free) that is true,
> > however the swap space usage is just 4MB (still going in and out). So
> > what can be the reason for that behaviour?
>
> The best analogy is that refuel (reclaim) happens before the tank
> becomes empty, and it happens even sooner when there is a long road
> ahead (high order allocations).
>
> > The workers/application is running in pre-allocated HugePages and the
> > rest is used for a small set of system services and drivers of
> > devices. It is static and not growing. The issue persists when I stop
> > the system services and free the memory.
>
> Yes, this helps.
>  Also could you attach /proc/buddyinfo from the moment
> you hit the problem?
>

I can. The problem is continuous: it is doing swap in/out 100% of the
time, consuming 100% of a CPU and blocking IO.

The output of /proc/buddyinfo is:

# cat /proc/buddyinfo
Node 0, zone      DMA      7      2      2      1      1      2      1      1      1      2      1
Node 0, zone    DMA32   4567   3395   1357    846    439    190     93     61     43     23      4
Node 0, zone   Normal     19    190    140    129    136     75     66     41      9      1      5
Node 1, zone   Normal    194   1210   2080   1800    715    255    111     56     42     36     55
Node 2, zone   Normal    204    768   3766   3394   1742    468    185    194    238     47     74
Node 3, zone   Normal   1622   2137   1058    846    388    208     97     44     14     42     10
Node 4, zone   Normal    282    705    623    274    184     90     63     41     11      1     28
Node 5, zone   Normal    505    620   6180   3706   1724   1083    592    410    417    168     70
Node 6, zone   Normal   1120    357   3314   3437   2264    872    606    209    215    123    265
Node 7, zone   Normal    365   5499  12035   7486   3845   1743    635    243    309    292     78
Node 8, zone   Normal    248    740   2280   1094   1225   2087    846    308    192     65     55
Node 9, zone   Normal    356    763   1625    944    740   1920   1174    696    217    235    111
Node 10, zone   Normal    727   1479   7002   6114   2487   1084    407    269    157     78     16
Node 11, zone   Normal    189   3287   9141   5039   2560   1183   1247    693    506    252      8
Node 12, zone   Normal    142    378   1317    466   1512   1568    646    359    248    264    228
Node 13, zone   Normal    444   1977   3173   2625   2105   1493    931    600    369    266    230
Node 14, zone   Normal    376    221    120    360   2721   2378   1521    826    442    204     59
Node 15, zone   Normal   1210    966    922   2046   4128   2904   1518    744    352    102     58
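
In /proc/buddyinfo each column is the number of free blocks of order
0..10, i.e. 2^order contiguous base pages. A rough sketch, assuming
4KiB base pages, to turn the node 3 rows into approximate free memory:

    awk '/^Node 3,/ { for (i = 5; i <= NF; i++) s += $i * 2^(i-5) * 4096 }
         END { printf "node 3 free: %.0f MiB\n", s / (1024*1024) }' /proc/buddyinfo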


> > > > Could I ask for any suggestions on how to avoid the kswapd utilization
> > > > pattern?
> > >
> > > The easiest way is to disable NUMA domain so that there would be only
> > > two nodes with 8x more memory. IOW, you have fewer pools but each pool
> > > has more memory and therefore they are less likely to become empty.
> > >
> > > > There is a free RAM in each numa node for the few MB used in
> > > > swap:
> > > >     NUMA stats:
> > > >     NUMA nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> > > >     MemTotal: 65048 65486 65486 65486 65486 65486 65486 65469 65486
> > > > 65486 65486 65486 65486 65486 65486 65424
> > > >     MemFree: 468 601 1200 302 548 1879 2321 2478 1967 2239 1453 2417
> > > > 2623 2833 2530 2269
> > > > the in/out usage does not make sense for me nor the CPU utilization by
> > > > multi-gen LRU.
> > >
> > > My questions:
> > > 1. Were there any OOM kills with either case?
> >
> > There is no OOM. The memory usage is not growing nor the swap space
> > usage, it is still a few MB there.
> >
> > > 2. Was THP enabled?
> >
> > Both situations with enabled and with disabled THP.
>
> My suspicion is that you packed the node 3 too perfectly :) And that
> might have triggered a known but currently a low priority problem in
> MGLRU. I'm attaching a patch for v6.6 and hoping you could verify it
> for me in case v6.6 by itself still has the problem?
>

I would not focus just on node 3; we had issues on different servers
with node 0 and node 2 in parallel, but mostly it is node 3.

What our setup looks like:
* each node has 64GB of RAM,
* 61GB of it is in 1GB HugePages,
* the remaining 3GB is used by the host system

There are KVM VMs running with vCPUs pinned to the NUMA domains and
using the HugePages (the topology is exposed to the VMs, no
overcommit, no shared CPUs); the qemu-kvm threads are pinned to the
same NUMA domain as their vCPUs. System services are not pinned. I'm
not sure why node 3 is used the most, as the VMs are balanced and the
host's system services can move between domains.
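
As a sketch, per-node 1GB HugePage reservations of this kind are
usually done through the standard sysfs interface (the node number and
count below are illustrative; 1GB pages generally have to be reserved
early, e.g. at boot via hugepagesz=1G hugepages=..., before memory
fragments):

    echo 61 > /sys/devices/system/node/node3/hugepages/hugepages-1048576kB/nr_hugepages   # reserve 61 x 1GB pages on node 3
    cat /sys/devices/system/node/node3/hugepages/hugepages-1048576kB/nr_hugepages         # verify the reservation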

> > > MGLRU might have spent the extra CPU cycles just to void OOM kills or
> > > produce more THPs.
> > >
> > > If disabling the NUMA domain isn't an option, I'd recommend:
> >
> > Disabling numa is not an option. However we are now testing a setup
> > with -1GB in HugePages per each numa.
> >
> > > 1. Try the latest kernel (6.6.1) if you haven't.
> >
> > Not yet, the 6.6.1 was released today.
> >
> > > 2. Disable THP if it was enabled, to verify whether it has an impact.
> >
> > I try disabling THP without any effect.
>
> Gochat. Please try the patch with MGLRU and let me know. Thanks!
>
> (Also CC Charan @ Qualcomm who initially reported the problem that
> ended up with the attached patch.)

I can try it. Will let you know.



* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-11-09  6:39       ` Jaroslav Pulchart
@ 2023-11-09  6:48         ` Yu Zhao
  2023-11-09 10:58           ` Jaroslav Pulchart
  0 siblings, 1 reply; 30+ messages in thread
From: Yu Zhao @ 2023-11-09  6:48 UTC (permalink / raw)
  To: Jaroslav Pulchart
  Cc: linux-mm, akpm, Igor Raits, Daniel Secik, Charan Teja Kalla

On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
<jaroslav.pulchart@gooddata.com> wrote:
>
> >
> > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> > <jaroslav.pulchart@gooddata.com> wrote:
> > >
> > > >
> > > > Hi Jaroslav,
> > >
> > > Hi Yu Zhao
> > >
> > > thanks for response, see answers inline:
> > >
> > > >
> > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > > > system (16numa domains).
> > > >
> > > > Kernel version please?
> > >
> > > 6.5.y, but we saw it sooner as it is in investigation from 23th May
> > > (6.4.y and maybe even the 6.3.y).
> >
> > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> > for you if you run into other problems with v6.6.
> >
>
> I will give it a try using 6.6.y. When it will work we can switch to
> 6.6.y instead of backporting the stuff to 6.5.y.
>
> > > > > Symptoms of my issue are
> > > > >
> > > > > /A/ if mult-gen LRU is enabled
> > > > > 1/ [kswapd3] is consuming 100% CPU
> > > >
> > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > > >
> > > > >     top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34,
> > > > > 18.26, 15.01
> > > > >     Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
> > > > >     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,
> > > > > 0.4 si,  0.0 st
> > > > >     MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
> > > > >     MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
> > > > >     ...
> > > > >         765 root      20   0       0      0      0 R  98.3   0.0
> > > > > 34969:04 kswapd3
> > > > >     ...
> > > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > > > > observed with swap disk as well and cause IO latency issues due to
> > > > > some kind of locking)
> > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > > > >
> > > > >
> > > > > /B/ if mult-gen LRU is disabled
> > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > >     top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05,
> > > > > 17.77, 14.77
> > > > >     Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
> > > > >     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,
> > > > > 0.4 si,  0.0 st
> > > > >     MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
> > > > >     MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
> > > > >     ...
> > > > >        765 root      20   0       0      0      0 S   3.6   0.0
> > > > > 34966:46 [kswapd3]
> > > > >     ...
> > > > > 2/ swap space usage is low (4MB)
> > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > > > >
> > > > > Both situations are wrong as they are using swap in/out extensively,
> > > > > however the multi-gen LRU situation is 10times worse.
> > > >
> > > > From the stats below, node 3 had the lowest free memory. So I think in
> > > > both cases, the reclaim activities were as expected.
> > >
> > > I do not see a reason for the memory pressure and reclaims. This node
> > > has the lowest free memory of all nodes (~302MB free) that is true,
> > > however the swap space usage is just 4MB (still going in and out). So
> > > what can be the reason for that behaviour?
> >
> > The best analogy is that refuel (reclaim) happens before the tank
> > becomes empty, and it happens even sooner when there is a long road
> > ahead (high order allocations).
> >
> > > The workers/application is running in pre-allocated HugePages and the
> > > rest is used for a small set of system services and drivers of
> > > devices. It is static and not growing. The issue persists when I stop
> > > the system services and free the memory.
> >
> > Yes, this helps.
> >  Also could you attach /proc/buddyinfo from the moment
> > you hit the problem?
> >
>
> I can. The problem is continuous, it is 100% of time continuously
> doing in/out and consuming 100% of CPU and locking IO.
>
> The output of /proc/buddyinfo is:
>
> # cat /proc/buddyinfo
> Node 0, zone      DMA      7      2      2      1      1      2      1
>      1      1      2      1
> Node 0, zone    DMA32   4567   3395   1357    846    439    190     93
>     61     43     23      4
> Node 0, zone   Normal     19    190    140    129    136     75     66
>     41      9      1      5
> Node 1, zone   Normal    194   1210   2080   1800    715    255    111
>     56     42     36     55
> Node 2, zone   Normal    204    768   3766   3394   1742    468    185
>    194    238     47     74
> Node 3, zone   Normal   1622   2137   1058    846    388    208     97
>     44     14     42     10

Again, thinking out loud: there is only one zone on node 3, i.e., the
Normal zone, and this rules out the problem fixed in v6.6 by commit
669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
reclaim").

> Node 4, zone   Normal    282    705    623    274    184     90     63
>     41     11      1     28
> Node 5, zone   Normal    505    620   6180   3706   1724   1083    592
>    410    417    168     70
> Node 6, zone   Normal   1120    357   3314   3437   2264    872    606
>    209    215    123    265
> Node 7, zone   Normal    365   5499  12035   7486   3845   1743    635
>    243    309    292     78
> Node 8, zone   Normal    248    740   2280   1094   1225   2087    846
>    308    192     65     55
> Node 9, zone   Normal    356    763   1625    944    740   1920   1174
>    696    217    235    111
> Node 10, zone   Normal    727   1479   7002   6114   2487   1084
> 407    269    157     78     16
> Node 11, zone   Normal    189   3287   9141   5039   2560   1183
> 1247    693    506    252      8
> Node 12, zone   Normal    142    378   1317    466   1512   1568
> 646    359    248    264    228
> Node 13, zone   Normal    444   1977   3173   2625   2105   1493
> 931    600    369    266    230
> Node 14, zone   Normal    376    221    120    360   2721   2378
> 1521    826    442    204     59
> Node 15, zone   Normal   1210    966    922   2046   4128   2904
> 1518    744    352    102     58
>
>
> > > > > Could I ask for any suggestions on how to avoid the kswapd utilization
> > > > > pattern?
> > > >
> > > > The easiest way is to disable NUMA domain so that there would be only
> > > > two nodes with 8x more memory. IOW, you have fewer pools but each pool
> > > > has more memory and therefore they are less likely to become empty.
> > > >
> > > > > There is a free RAM in each numa node for the few MB used in
> > > > > swap:
> > > > >     NUMA stats:
> > > > >     NUMA nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> > > > >     MemTotal: 65048 65486 65486 65486 65486 65486 65486 65469 65486
> > > > > 65486 65486 65486 65486 65486 65486 65424
> > > > >     MemFree: 468 601 1200 302 548 1879 2321 2478 1967 2239 1453 2417
> > > > > 2623 2833 2530 2269
> > > > > the in/out usage does not make sense for me nor the CPU utilization by
> > > > > multi-gen LRU.
> > > >
> > > > My questions:
> > > > 1. Were there any OOM kills with either case?
> > >
> > > There is no OOM. The memory usage is not growing nor the swap space
> > > usage, it is still a few MB there.
> > >
> > > > 2. Was THP enabled?
> > >
> > > Both situations with enabled and with disabled THP.
> >
> > My suspicion is that you packed the node 3 too perfectly :) And that
> > might have triggered a known but currently a low priority problem in
> > MGLRU. I'm attaching a patch for v6.6 and hoping you could verify it
> > for me in case v6.6 by itself still has the problem?
> >
>
> I would not focus just to node3, we had issues on different servers
> with node0 and node2 both in parallel, but mostly it is the node3.
>
> How our setup looks like:
> * each node has 64GB of RAM,
> * 61GB from it is in 1GB Huge Pages,
> * rest 3GB is used by host system
>
> There are running kvm VMs vCPUs pinned to the NUMA domains and using
> the Huge Pages (topology is exposed to VMs, no-overcommit, no-shared
> cpus), the qemu-kvm threads are pinned to the same numa domain as the
> vCPUs. System services are not pinned, I'm not sure why the node3 is
> used at most as the vms are balanced and the host's system services
> can move between domains.
>
> > > > MGLRU might have spent the extra CPU cycles just to void OOM kills or
> > > > produce more THPs.
> > > >
> > > > If disabling the NUMA domain isn't an option, I'd recommend:
> > >
> > > Disabling numa is not an option. However we are now testing a setup
> > > with -1GB in HugePages per each numa.
> > >
> > > > 1. Try the latest kernel (6.6.1) if you haven't.
> > >
> > > Not yet, the 6.6.1 was released today.
> > >
> > > > 2. Disable THP if it was enabled, to verify whether it has an impact.
> > >
> > > I try disabling THP without any effect.
> >
> > Gochat. Please try the patch with MGLRU and let me know. Thanks!
> >
> > (Also CC Charan @ Qualcomm who initially reported the problem that
> > ended up with the attached patch.)
>
> I can try it. Will let you know.

Great, thanks!



* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-11-09  6:48         ` Yu Zhao
@ 2023-11-09 10:58           ` Jaroslav Pulchart
  2023-11-10  1:31             ` Yu Zhao
  0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2023-11-09 10:58 UTC (permalink / raw)
  To: Yu Zhao; +Cc: linux-mm, akpm, Igor Raits, Daniel Secik, Charan Teja Kalla

>
> On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
> <jaroslav.pulchart@gooddata.com> wrote:
> >
> > >
> > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> > > <jaroslav.pulchart@gooddata.com> wrote:
> > > >
> > > > >
> > > > > Hi Jaroslav,
> > > >
> > > > Hi Yu Zhao
> > > >
> > > > thanks for response, see answers inline:
> > > >
> > > > >
> > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > > > > system (16numa domains).
> > > > >
> > > > > Kernel version please?
> > > >
> > > > 6.5.y, but we saw it sooner as it is in investigation from 23th May
> > > > (6.4.y and maybe even the 6.3.y).
> > >
> > > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> > > for you if you run into other problems with v6.6.
> > >
> >
> > I will give it a try using 6.6.y. When it will work we can switch to
> > 6.6.y instead of backporting the stuff to 6.5.y.
> >
> > > > > > Symptoms of my issue are
> > > > > >
> > > > > > /A/ if mult-gen LRU is enabled
> > > > > > 1/ [kswapd3] is consuming 100% CPU
> > > > >
> > > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > > > >
> > > > > >     top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34,
> > > > > > 18.26, 15.01
> > > > > >     Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
> > > > > >     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,
> > > > > > 0.4 si,  0.0 st
> > > > > >     MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
> > > > > >     MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
> > > > > >     ...
> > > > > >         765 root      20   0       0      0      0 R  98.3   0.0
> > > > > > 34969:04 kswapd3
> > > > > >     ...
> > > > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > > > > > observed with swap disk as well and cause IO latency issues due to
> > > > > > some kind of locking)
> > > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > > > > >
> > > > > >
> > > > > > /B/ if mult-gen LRU is disabled
> > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > > >     top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05,
> > > > > > 17.77, 14.77
> > > > > >     Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
> > > > > >     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,
> > > > > > 0.4 si,  0.0 st
> > > > > >     MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
> > > > > >     MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
> > > > > >     ...
> > > > > >        765 root      20   0       0      0      0 S   3.6   0.0
> > > > > > 34966:46 [kswapd3]
> > > > > >     ...
> > > > > > 2/ swap space usage is low (4MB)
> > > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > > > > >
> > > > > > Both situations are wrong as they are using swap in/out extensively,
> > > > > > however the multi-gen LRU situation is 10times worse.
> > > > >
> > > > > From the stats below, node 3 had the lowest free memory. So I think in
> > > > > both cases, the reclaim activities were as expected.
> > > >
> > > > I do not see a reason for the memory pressure and reclaims. This node
> > > > has the lowest free memory of all nodes (~302MB free) that is true,
> > > > however the swap space usage is just 4MB (still going in and out). So
> > > > what can be the reason for that behaviour?
> > >
> > > The best analogy is that refuel (reclaim) happens before the tank
> > > becomes empty, and it happens even sooner when there is a long road
> > > ahead (high order allocations).
> > >
> > > > The workers/application is running in pre-allocated HugePages and the
> > > > rest is used for a small set of system services and drivers of
> > > > devices. It is static and not growing. The issue persists when I stop
> > > > the system services and free the memory.
> > >
> > > Yes, this helps.
> > >  Also could you attach /proc/buddyinfo from the moment
> > > you hit the problem?
> > >
> >
> > I can. The problem is continuous, it is 100% of time continuously
> > doing in/out and consuming 100% of CPU and locking IO.
> >
> > The output of /proc/buddyinfo is:
> >
> > # cat /proc/buddyinfo
> > Node 0, zone      DMA      7      2      2      1      1      2      1
> >      1      1      2      1
> > Node 0, zone    DMA32   4567   3395   1357    846    439    190     93
> >     61     43     23      4
> > Node 0, zone   Normal     19    190    140    129    136     75     66
> >     41      9      1      5
> > Node 1, zone   Normal    194   1210   2080   1800    715    255    111
> >     56     42     36     55
> > Node 2, zone   Normal    204    768   3766   3394   1742    468    185
> >    194    238     47     74
> > Node 3, zone   Normal   1622   2137   1058    846    388    208     97
> >     44     14     42     10
>
> Again, thinking out loud: there is only one zone on node 3, i.e., the
> normal zone, and this excludes the problem commit
> 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
> reclaim") fixed in v6.6.

I built vanilla 6.6.1 and did a first quick test (spin up and destroy
VMs only). This test does not always trigger the continuous kswapd3
swap in/out usage, but it does exercise it, and it looks like there is
a change:

 I can see non-continuous kswapd usage (15s and more) with 6.5.y:
 # ps ax | grep [k]swapd
    753 ?        S      0:00 [kswapd0]
    754 ?        S      0:00 [kswapd1]
    755 ?        S      0:00 [kswapd2]
    756 ?        S      0:15 [kswapd3]    <<<<<<<<<
    757 ?        S      0:00 [kswapd4]
    758 ?        S      0:00 [kswapd5]
    759 ?        S      0:00 [kswapd6]
    760 ?        S      0:00 [kswapd7]
    761 ?        S      0:00 [kswapd8]
    762 ?        S      0:00 [kswapd9]
    763 ?        S      0:00 [kswapd10]
    764 ?        S      0:00 [kswapd11]
    765 ?        S      0:00 [kswapd12]
    766 ?        S      0:00 [kswapd13]
    767 ?        S      0:00 [kswapd14]
    768 ?        S      0:00 [kswapd15]

and no kswapd usage with 6.6.1, which looks like a promising path:

# ps ax | grep [k]swapd
    808 ?        S      0:00 [kswapd0]
    809 ?        S      0:00 [kswapd1]
    810 ?        S      0:00 [kswapd2]
    811 ?        S      0:00 [kswapd3]    <<<< nice
    812 ?        S      0:00 [kswapd4]
    813 ?        S      0:00 [kswapd5]
    814 ?        S      0:00 [kswapd6]
    815 ?        S      0:00 [kswapd7]
    816 ?        S      0:00 [kswapd8]
    817 ?        S      0:00 [kswapd9]
    818 ?        S      0:00 [kswapd10]
    819 ?        S      0:00 [kswapd11]
    820 ?        S      0:00 [kswapd12]
    821 ?        S      0:00 [kswapd13]
    822 ?        S      0:00 [kswapd14]
    823 ?        S      0:00 [kswapd15]

I will install 6.6.1 on the server which is doing some work and
observe it later today.
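
A sketch of how the per-kswapd CPU usage and the swap traffic can be
watched over time on the loaded server, assuming sysstat's pidstat is
available:

    pidstat -p "$(pgrep -d, kswapd)" 5   # CPU usage of every kswapd thread, every 5 seconds
    vmstat 5                             # si/so columns show swap in/out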


>
> > Node 4, zone   Normal    282    705    623    274    184     90     63
> >     41     11      1     28
> > Node 5, zone   Normal    505    620   6180   3706   1724   1083    592
> >    410    417    168     70
> > Node 6, zone   Normal   1120    357   3314   3437   2264    872    606
> >    209    215    123    265
> > Node 7, zone   Normal    365   5499  12035   7486   3845   1743    635
> >    243    309    292     78
> > Node 8, zone   Normal    248    740   2280   1094   1225   2087    846
> >    308    192     65     55
> > Node 9, zone   Normal    356    763   1625    944    740   1920   1174
> >    696    217    235    111
> > Node 10, zone   Normal    727   1479   7002   6114   2487   1084
> > 407    269    157     78     16
> > Node 11, zone   Normal    189   3287   9141   5039   2560   1183
> > 1247    693    506    252      8
> > Node 12, zone   Normal    142    378   1317    466   1512   1568
> > 646    359    248    264    228
> > Node 13, zone   Normal    444   1977   3173   2625   2105   1493
> > 931    600    369    266    230
> > Node 14, zone   Normal    376    221    120    360   2721   2378
> > 1521    826    442    204     59
> > Node 15, zone   Normal   1210    966    922   2046   4128   2904
> > 1518    744    352    102     58
> >
> >
> > > > > > Could I ask for any suggestions on how to avoid the kswapd utilization
> > > > > > pattern?
> > > > >
> > > > > The easiest way is to disable NUMA domain so that there would be only
> > > > > two nodes with 8x more memory. IOW, you have fewer pools but each pool
> > > > > has more memory and therefore they are less likely to become empty.
> > > > >
> > > > > > There is a free RAM in each numa node for the few MB used in
> > > > > > swap:
> > > > > >     NUMA stats:
> > > > > >     NUMA nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> > > > > >     MemTotal: 65048 65486 65486 65486 65486 65486 65486 65469 65486
> > > > > > 65486 65486 65486 65486 65486 65486 65424
> > > > > >     MemFree: 468 601 1200 302 548 1879 2321 2478 1967 2239 1453 2417
> > > > > > 2623 2833 2530 2269
> > > > > > the in/out usage does not make sense for me nor the CPU utilization by
> > > > > > multi-gen LRU.
> > > > >
> > > > > My questions:
> > > > > 1. Were there any OOM kills with either case?
> > > >
> > > > There is no OOM. The memory usage is not growing nor the swap space
> > > > usage, it is still a few MB there.
> > > >
> > > > > 2. Was THP enabled?
> > > >
> > > > Both situations with enabled and with disabled THP.
> > >
> > > My suspicion is that you packed the node 3 too perfectly :) And that
> > > might have triggered a known but currently a low priority problem in
> > > MGLRU. I'm attaching a patch for v6.6 and hoping you could verify it
> > > for me in case v6.6 by itself still has the problem?
> > >
> >
> > I would not focus just to node3, we had issues on different servers
> > with node0 and node2 both in parallel, but mostly it is the node3.
> >
> > How our setup looks like:
> > * each node has 64GB of RAM,
> > * 61GB from it is in 1GB Huge Pages,
> > * rest 3GB is used by host system
> >
> > There are running kvm VMs vCPUs pinned to the NUMA domains and using
> > the Huge Pages (topology is exposed to VMs, no-overcommit, no-shared
> > cpus), the qemu-kvm threads are pinned to the same numa domain as the
> > vCPUs. System services are not pinned, I'm not sure why the node3 is
> > used at most as the vms are balanced and the host's system services
> > can move between domains.
> >
> > > > > MGLRU might have spent the extra CPU cycles just to void OOM kills or
> > > > > produce more THPs.
> > > > >
> > > > > If disabling the NUMA domain isn't an option, I'd recommend:
> > > >
> > > > Disabling numa is not an option. However we are now testing a setup
> > > > with -1GB in HugePages per each numa.
> > > >
> > > > > 1. Try the latest kernel (6.6.1) if you haven't.
> > > >
> > > > Not yet, the 6.6.1 was released today.
> > > >
> > > > > 2. Disable THP if it was enabled, to verify whether it has an impact.
> > > >
> > > > I try disabling THP without any effect.
> > >
> > > Gochat. Please try the patch with MGLRU and let me know. Thanks!
> > >
> > > (Also CC Charan @ Qualcomm who initially reported the problem that
> > > ended up with the attached patch.)
> >
> > I can try it. Will let you know.
>
> Great, thanks!


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-11-09 10:58           ` Jaroslav Pulchart
@ 2023-11-10  1:31             ` Yu Zhao
       [not found]               ` <CAK8fFZ5xUe=JMOxUWgQ-0aqWMXuZYF2EtPOoZQqr89sjrL+zTw@mail.gmail.com>
  0 siblings, 1 reply; 30+ messages in thread
From: Yu Zhao @ 2023-11-10  1:31 UTC (permalink / raw)
  To: Jaroslav Pulchart
  Cc: linux-mm, akpm, Igor Raits, Daniel Secik, Charan Teja Kalla

On Thu, Nov 9, 2023 at 3:58 AM Jaroslav Pulchart
<jaroslav.pulchart@gooddata.com> wrote:
>
> >
> > On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
> > <jaroslav.pulchart@gooddata.com> wrote:
> > >
> > > >
> > > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > >
> > > > > >
> > > > > > Hi Jaroslav,
> > > > >
> > > > > Hi Yu Zhao
> > > > >
> > > > > thanks for response, see answers inline:
> > > > >
> > > > > >
> > > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > > > > > system (16numa domains).
> > > > > >
> > > > > > Kernel version please?
> > > > >
> > > > > 6.5.y, but we saw it sooner as it is in investigation from 23th May
> > > > > (6.4.y and maybe even the 6.3.y).
> > > >
> > > > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> > > > for you if you run into other problems with v6.6.
> > > >
> > >
> > > I will give it a try using 6.6.y. When it will work we can switch to
> > > 6.6.y instead of backporting the stuff to 6.5.y.
> > >
> > > > > > > Symptoms of my issue are
> > > > > > >
> > > > > > > /A/ if mult-gen LRU is enabled
> > > > > > > 1/ [kswapd3] is consuming 100% CPU
> > > > > >
> > > > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > > > > >
> > > > > > >     top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34,
> > > > > > > 18.26, 15.01
> > > > > > >     Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
> > > > > > >     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,
> > > > > > > 0.4 si,  0.0 st
> > > > > > >     MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
> > > > > > >     MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
> > > > > > >     ...
> > > > > > >         765 root      20   0       0      0      0 R  98.3   0.0
> > > > > > > 34969:04 kswapd3
> > > > > > >     ...
> > > > > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > > > > > > observed with swap disk as well and cause IO latency issues due to
> > > > > > > some kind of locking)
> > > > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > > > > > >
> > > > > > >
> > > > > > > /B/ if mult-gen LRU is disabled
> > > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > > > >     top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05,
> > > > > > > 17.77, 14.77
> > > > > > >     Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
> > > > > > >     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,
> > > > > > > 0.4 si,  0.0 st
> > > > > > >     MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
> > > > > > >     MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
> > > > > > >     ...
> > > > > > >        765 root      20   0       0      0      0 S   3.6   0.0
> > > > > > > 34966:46 [kswapd3]
> > > > > > >     ...
> > > > > > > 2/ swap space usage is low (4MB)
> > > > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > > > > > >
> > > > > > > Both situations are wrong as they are using swap in/out extensively,
> > > > > > > however the multi-gen LRU situation is 10times worse.
> > > > > >
> > > > > > From the stats below, node 3 had the lowest free memory. So I think in
> > > > > > both cases, the reclaim activities were as expected.
> > > > >
> > > > > I do not see a reason for the memory pressure and reclaims. This node
> > > > > has the lowest free memory of all nodes (~302MB free) that is true,
> > > > > however the swap space usage is just 4MB (still going in and out). So
> > > > > what can be the reason for that behaviour?
> > > >
> > > > The best analogy is that refuel (reclaim) happens before the tank
> > > > becomes empty, and it happens even sooner when there is a long road
> > > > ahead (high order allocations).
> > > >
> > > > > The workers/application is running in pre-allocated HugePages and the
> > > > > rest is used for a small set of system services and drivers of
> > > > > devices. It is static and not growing. The issue persists when I stop
> > > > > the system services and free the memory.
> > > >
> > > > Yes, this helps.
> > > >  Also could you attach /proc/buddyinfo from the moment
> > > > you hit the problem?
> > > >
> > >
> > > I can. The problem is continuous, it is 100% of time continuously
> > > doing in/out and consuming 100% of CPU and locking IO.
> > >
> > > The output of /proc/buddyinfo is:
> > >
> > > # cat /proc/buddyinfo
> > > Node 0, zone      DMA      7      2      2      1      1      2      1
> > >      1      1      2      1
> > > Node 0, zone    DMA32   4567   3395   1357    846    439    190     93
> > >     61     43     23      4
> > > Node 0, zone   Normal     19    190    140    129    136     75     66
> > >     41      9      1      5
> > > Node 1, zone   Normal    194   1210   2080   1800    715    255    111
> > >     56     42     36     55
> > > Node 2, zone   Normal    204    768   3766   3394   1742    468    185
> > >    194    238     47     74
> > > Node 3, zone   Normal   1622   2137   1058    846    388    208     97
> > >     44     14     42     10
> >
> > Again, thinking out loud: there is only one zone on node 3, i.e., the
> > normal zone, and this excludes the problem commit
> > 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
> > reclaim") fixed in v6.6.
>
> I built vanila 6.6.1 and did the first fast test - spin up and destroy
> VMs only - This test does not always trigger the kswapd3 continuous
> swap in/out  usage but it uses it and it  looks like there is a
> change:
>
>  I can see kswapd non-continous (15s and more) usage with 6.5.y
>  # ps ax | grep [k]swapd
>     753 ?        S      0:00 [kswapd0]
>     754 ?        S      0:00 [kswapd1]
>     755 ?        S      0:00 [kswapd2]
>     756 ?        S      0:15 [kswapd3]    <<<<<<<<<
>     757 ?        S      0:00 [kswapd4]
>     758 ?        S      0:00 [kswapd5]
>     759 ?        S      0:00 [kswapd6]
>     760 ?        S      0:00 [kswapd7]
>     761 ?        S      0:00 [kswapd8]
>     762 ?        S      0:00 [kswapd9]
>     763 ?        S      0:00 [kswapd10]
>     764 ?        S      0:00 [kswapd11]
>     765 ?        S      0:00 [kswapd12]
>     766 ?        S      0:00 [kswapd13]
>     767 ?        S      0:00 [kswapd14]
>     768 ?        S      0:00 [kswapd15]
>
> and none kswapd usage with 6.6.1, that looks to be promising path
>
> # ps ax | grep [k]swapd
>     808 ?        S      0:00 [kswapd0]
>     809 ?        S      0:00 [kswapd1]
>     810 ?        S      0:00 [kswapd2]
>     811 ?        S      0:00 [kswapd3]    <<<< nice
>     812 ?        S      0:00 [kswapd4]
>     813 ?        S      0:00 [kswapd5]
>     814 ?        S      0:00 [kswapd6]
>     815 ?        S      0:00 [kswapd7]
>     816 ?        S      0:00 [kswapd8]
>     817 ?        S      0:00 [kswapd9]
>     818 ?        S      0:00 [kswapd10]
>     819 ?        S      0:00 [kswapd11]
>     820 ?        S      0:00 [kswapd12]
>     821 ?        S      0:00 [kswapd13]
>     822 ?        S      0:00 [kswapd14]
>     823 ?        S      0:00 [kswapd15]
>
> I will install the 6.6.1 on the server which is doing some work and
> observe it later today.

Thanks. Fingers crossed.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
       [not found]               ` <CAK8fFZ5xUe=JMOxUWgQ-0aqWMXuZYF2EtPOoZQqr89sjrL+zTw@mail.gmail.com>
@ 2023-11-13 20:09                 ` Yu Zhao
  2023-11-14  7:29                   ` Jaroslav Pulchart
  0 siblings, 1 reply; 30+ messages in thread
From: Yu Zhao @ 2023-11-13 20:09 UTC (permalink / raw)
  To: Jaroslav Pulchart
  Cc: linux-mm, akpm, Igor Raits, Daniel Secik, Charan Teja Kalla

[-- Attachment #1: Type: text/plain, Size: 9618 bytes --]

On Mon, Nov 13, 2023 at 1:36 AM Jaroslav Pulchart
<jaroslav.pulchart@gooddata.com> wrote:
>
> >
> > On Thu, Nov 9, 2023 at 3:58 AM Jaroslav Pulchart
> > <jaroslav.pulchart@gooddata.com> wrote:
> > >
> > > >
> > > > On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
> > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > >
> > > > > >
> > > > > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > >
> > > > > > > >
> > > > > > > > Hi Jaroslav,
> > > > > > >
> > > > > > > Hi Yu Zhao
> > > > > > >
> > > > > > > thanks for response, see answers inline:
> > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > >
> > > > > > > > > Hello,
> > > > > > > > >
> > > > > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > > > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > > > > > > > system (16numa domains).
> > > > > > > >
> > > > > > > > Kernel version please?
> > > > > > >
> > > > > > > 6.5.y, but we saw it sooner as it is in investigation from 23th May
> > > > > > > (6.4.y and maybe even the 6.3.y).
> > > > > >
> > > > > > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> > > > > > for you if you run into other problems with v6.6.
> > > > > >
> > > > >
> > > > > I will give it a try using 6.6.y. When it will work we can switch to
> > > > > 6.6.y instead of backporting the stuff to 6.5.y.
> > > > >
> > > > > > > > > Symptoms of my issue are
> > > > > > > > >
> > > > > > > > > /A/ if mult-gen LRU is enabled
> > > > > > > > > 1/ [kswapd3] is consuming 100% CPU
> > > > > > > >
> > > > > > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > > > > > > >
> > > > > > > > >     top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34,
> > > > > > > > > 18.26, 15.01
> > > > > > > > >     Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
> > > > > > > > >     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,
> > > > > > > > > 0.4 si,  0.0 st
> > > > > > > > >     MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
> > > > > > > > >     MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
> > > > > > > > >     ...
> > > > > > > > >         765 root      20   0       0      0      0 R  98.3   0.0
> > > > > > > > > 34969:04 kswapd3
> > > > > > > > >     ...
> > > > > > > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > > > > > > > > observed with swap disk as well and cause IO latency issues due to
> > > > > > > > > some kind of locking)
> > > > > > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > /B/ if mult-gen LRU is disabled
> > > > > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > > > > > >     top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05,
> > > > > > > > > 17.77, 14.77
> > > > > > > > >     Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
> > > > > > > > >     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,
> > > > > > > > > 0.4 si,  0.0 st
> > > > > > > > >     MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
> > > > > > > > >     MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
> > > > > > > > >     ...
> > > > > > > > >        765 root      20   0       0      0      0 S   3.6   0.0
> > > > > > > > > 34966:46 [kswapd3]
> > > > > > > > >     ...
> > > > > > > > > 2/ swap space usage is low (4MB)
> > > > > > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > > > > > > > >
> > > > > > > > > Both situations are wrong as they are using swap in/out extensively,
> > > > > > > > > however the multi-gen LRU situation is 10times worse.
> > > > > > > >
> > > > > > > > From the stats below, node 3 had the lowest free memory. So I think in
> > > > > > > > both cases, the reclaim activities were as expected.
> > > > > > >
> > > > > > > I do not see a reason for the memory pressure and reclaims. This node
> > > > > > > has the lowest free memory of all nodes (~302MB free) that is true,
> > > > > > > however the swap space usage is just 4MB (still going in and out). So
> > > > > > > what can be the reason for that behaviour?
> > > > > >
> > > > > > The best analogy is that refuel (reclaim) happens before the tank
> > > > > > becomes empty, and it happens even sooner when there is a long road
> > > > > > ahead (high order allocations).
> > > > > >
> > > > > > > The workers/application is running in pre-allocated HugePages and the
> > > > > > > rest is used for a small set of system services and drivers of
> > > > > > > devices. It is static and not growing. The issue persists when I stop
> > > > > > > the system services and free the memory.
> > > > > >
> > > > > > Yes, this helps.
> > > > > >  Also could you attach /proc/buddyinfo from the moment
> > > > > > you hit the problem?
> > > > > >
> > > > >
> > > > > I can. The problem is continuous, it is 100% of time continuously
> > > > > doing in/out and consuming 100% of CPU and locking IO.
> > > > >
> > > > > The output of /proc/buddyinfo is:
> > > > >
> > > > > # cat /proc/buddyinfo
> > > > > Node 0, zone      DMA      7      2      2      1      1      2      1
> > > > >      1      1      2      1
> > > > > Node 0, zone    DMA32   4567   3395   1357    846    439    190     93
> > > > >     61     43     23      4
> > > > > Node 0, zone   Normal     19    190    140    129    136     75     66
> > > > >     41      9      1      5
> > > > > Node 1, zone   Normal    194   1210   2080   1800    715    255    111
> > > > >     56     42     36     55
> > > > > Node 2, zone   Normal    204    768   3766   3394   1742    468    185
> > > > >    194    238     47     74
> > > > > Node 3, zone   Normal   1622   2137   1058    846    388    208     97
> > > > >     44     14     42     10
> > > >
> > > > Again, thinking out loud: there is only one zone on node 3, i.e., the
> > > > normal zone, and this excludes the problem commit
> > > > 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
> > > > reclaim") fixed in v6.6.
> > >
> > > I built vanila 6.6.1 and did the first fast test - spin up and destroy
> > > VMs only - This test does not always trigger the kswapd3 continuous
> > > swap in/out  usage but it uses it and it  looks like there is a
> > > change:
> > >
> > >  I can see kswapd non-continous (15s and more) usage with 6.5.y
> > >  # ps ax | grep [k]swapd
> > >     753 ?        S      0:00 [kswapd0]
> > >     754 ?        S      0:00 [kswapd1]
> > >     755 ?        S      0:00 [kswapd2]
> > >     756 ?        S      0:15 [kswapd3]    <<<<<<<<<
> > >     757 ?        S      0:00 [kswapd4]
> > >     758 ?        S      0:00 [kswapd5]
> > >     759 ?        S      0:00 [kswapd6]
> > >     760 ?        S      0:00 [kswapd7]
> > >     761 ?        S      0:00 [kswapd8]
> > >     762 ?        S      0:00 [kswapd9]
> > >     763 ?        S      0:00 [kswapd10]
> > >     764 ?        S      0:00 [kswapd11]
> > >     765 ?        S      0:00 [kswapd12]
> > >     766 ?        S      0:00 [kswapd13]
> > >     767 ?        S      0:00 [kswapd14]
> > >     768 ?        S      0:00 [kswapd15]
> > >
> > > and none kswapd usage with 6.6.1, that looks to be promising path
> > >
> > > # ps ax | grep [k]swapd
> > >     808 ?        S      0:00 [kswapd0]
> > >     809 ?        S      0:00 [kswapd1]
> > >     810 ?        S      0:00 [kswapd2]
> > >     811 ?        S      0:00 [kswapd3]    <<<< nice
> > >     812 ?        S      0:00 [kswapd4]
> > >     813 ?        S      0:00 [kswapd5]
> > >     814 ?        S      0:00 [kswapd6]
> > >     815 ?        S      0:00 [kswapd7]
> > >     816 ?        S      0:00 [kswapd8]
> > >     817 ?        S      0:00 [kswapd9]
> > >     818 ?        S      0:00 [kswapd10]
> > >     819 ?        S      0:00 [kswapd11]
> > >     820 ?        S      0:00 [kswapd12]
> > >     821 ?        S      0:00 [kswapd13]
> > >     822 ?        S      0:00 [kswapd14]
> > >     823 ?        S      0:00 [kswapd15]
> > >
> > > I will install the 6.6.1 on the server which is doing some work and
> > > observe it later today.
> >
> > Thanks. Fingers crossed.
>
> The 6.6.y was deployed and used from 9th Nov 3PM CEST. So far so good.
> The node 3 has 163MiB free of memory and I see
> just a few in/out swap usage sometimes (which is expected) and minimal
> kswapd3 process usage for almost 4days.

Thanks for the update!

Just to confirm:
1. MGLRU was enabled, and
2. The v6.6 you deployed did NOT have the patch I attached earlier.
Are both correct?

If so, I'd very much appreciate it if you could try the attached patch
on top of v6.5 and see if it helps. My suspicion is that the problem is
compaction-related, i.e., kswapd was woken up by high-order allocations
but didn't properly stop. But what causes the behavior difference on
v6.5 between MGLRU and the active/inactive LRU still puzzles me -- the
problem might be somehow masked rather than fixed on v6.6.
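
In case it helps to narrow this down later: the wakeups (and the
allocation order behind them) can be seen with the vmscan tracepoints,
roughly like this, assuming tracefs is mounted at /sys/kernel/tracing:

  echo 1 > /sys/kernel/tracing/events/vmscan/mm_vmscan_wakeup_kswapd/enable
  echo 1 > /sys/kernel/tracing/events/vmscan/mm_vmscan_kswapd_wake/enable
  cat /sys/kernel/tracing/trace_pipe | grep 'nid=3'

mm_vmscan_wakeup_kswapd fires when an allocation asks kswapd to wake up
(it logs the gfp flags and order); mm_vmscan_kswapd_wake fires when
kswapd itself wakes and logs the order it reclaims for. No need to do
this now - just a possible follow-up if the patch doesn't help.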

For any other problems that you suspect might be related to MGLRU,
please let me know and I'd be happy to look into them as well.

[-- Attachment #2: mglru-v6.5.patch --]
[-- Type: application/x-patch, Size: 2989 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-11-13 20:09                 ` Yu Zhao
@ 2023-11-14  7:29                   ` Jaroslav Pulchart
  2023-11-14  7:47                     ` Yu Zhao
  0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2023-11-14  7:29 UTC (permalink / raw)
  To: Yu Zhao; +Cc: linux-mm, akpm, Igor Raits, Daniel Secik, Charan Teja Kalla

>
> On Mon, Nov 13, 2023 at 1:36 AM Jaroslav Pulchart
> <jaroslav.pulchart@gooddata.com> wrote:
> >
> > >
> > > On Thu, Nov 9, 2023 at 3:58 AM Jaroslav Pulchart
> > > <jaroslav.pulchart@gooddata.com> wrote:
> > > >
> > > > >
> > > > > On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
> > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > >
> > > > > > >
> > > > > > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Hi Jaroslav,
> > > > > > > >
> > > > > > > > Hi Yu Zhao
> > > > > > > >
> > > > > > > > thanks for response, see answers inline:
> > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Hello,
> > > > > > > > > >
> > > > > > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > > > > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > > > > > > > > system (16numa domains).
> > > > > > > > >
> > > > > > > > > Kernel version please?
> > > > > > > >
> > > > > > > > 6.5.y, but we saw it sooner as it is in investigation from 23th May
> > > > > > > > (6.4.y and maybe even the 6.3.y).
> > > > > > >
> > > > > > > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> > > > > > > for you if you run into other problems with v6.6.
> > > > > > >
> > > > > >
> > > > > > I will give it a try using 6.6.y. When it will work we can switch to
> > > > > > 6.6.y instead of backporting the stuff to 6.5.y.
> > > > > >
> > > > > > > > > > Symptoms of my issue are
> > > > > > > > > >
> > > > > > > > > > /A/ if mult-gen LRU is enabled
> > > > > > > > > > 1/ [kswapd3] is consuming 100% CPU
> > > > > > > > >
> > > > > > > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > > > > > > > >
> > > > > > > > > >     top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34,
> > > > > > > > > > 18.26, 15.01
> > > > > > > > > >     Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
> > > > > > > > > >     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,
> > > > > > > > > > 0.4 si,  0.0 st
> > > > > > > > > >     MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
> > > > > > > > > >     MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
> > > > > > > > > >     ...
> > > > > > > > > >         765 root      20   0       0      0      0 R  98.3   0.0
> > > > > > > > > > 34969:04 kswapd3
> > > > > > > > > >     ...
> > > > > > > > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > > > > > > > > > observed with swap disk as well and cause IO latency issues due to
> > > > > > > > > > some kind of locking)
> > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > /B/ if mult-gen LRU is disabled
> > > > > > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > > > > > > >     top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05,
> > > > > > > > > > 17.77, 14.77
> > > > > > > > > >     Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
> > > > > > > > > >     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,
> > > > > > > > > > 0.4 si,  0.0 st
> > > > > > > > > >     MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
> > > > > > > > > >     MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
> > > > > > > > > >     ...
> > > > > > > > > >        765 root      20   0       0      0      0 S   3.6   0.0
> > > > > > > > > > 34966:46 [kswapd3]
> > > > > > > > > >     ...
> > > > > > > > > > 2/ swap space usage is low (4MB)
> > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > > > > > > > > >
> > > > > > > > > > Both situations are wrong as they are using swap in/out extensively,
> > > > > > > > > > however the multi-gen LRU situation is 10times worse.
> > > > > > > > >
> > > > > > > > > From the stats below, node 3 had the lowest free memory. So I think in
> > > > > > > > > both cases, the reclaim activities were as expected.
> > > > > > > >
> > > > > > > > I do not see a reason for the memory pressure and reclaims. This node
> > > > > > > > has the lowest free memory of all nodes (~302MB free) that is true,
> > > > > > > > however the swap space usage is just 4MB (still going in and out). So
> > > > > > > > what can be the reason for that behaviour?
> > > > > > >
> > > > > > > The best analogy is that refuel (reclaim) happens before the tank
> > > > > > > becomes empty, and it happens even sooner when there is a long road
> > > > > > > ahead (high order allocations).
> > > > > > >
> > > > > > > > The workers/application is running in pre-allocated HugePages and the
> > > > > > > > rest is used for a small set of system services and drivers of
> > > > > > > > devices. It is static and not growing. The issue persists when I stop
> > > > > > > > the system services and free the memory.
> > > > > > >
> > > > > > > Yes, this helps.
> > > > > > >  Also could you attach /proc/buddyinfo from the moment
> > > > > > > you hit the problem?
> > > > > > >
> > > > > >
> > > > > > I can. The problem is continuous, it is 100% of time continuously
> > > > > > doing in/out and consuming 100% of CPU and locking IO.
> > > > > >
> > > > > > The output of /proc/buddyinfo is:
> > > > > >
> > > > > > # cat /proc/buddyinfo
> > > > > > Node 0, zone      DMA      7      2      2      1      1      2      1
> > > > > >      1      1      2      1
> > > > > > Node 0, zone    DMA32   4567   3395   1357    846    439    190     93
> > > > > >     61     43     23      4
> > > > > > Node 0, zone   Normal     19    190    140    129    136     75     66
> > > > > >     41      9      1      5
> > > > > > Node 1, zone   Normal    194   1210   2080   1800    715    255    111
> > > > > >     56     42     36     55
> > > > > > Node 2, zone   Normal    204    768   3766   3394   1742    468    185
> > > > > >    194    238     47     74
> > > > > > Node 3, zone   Normal   1622   2137   1058    846    388    208     97
> > > > > >     44     14     42     10
> > > > >
> > > > > Again, thinking out loud: there is only one zone on node 3, i.e., the
> > > > > normal zone, and this excludes the problem commit
> > > > > 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
> > > > > reclaim") fixed in v6.6.
> > > >
> > > > I built vanila 6.6.1 and did the first fast test - spin up and destroy
> > > > VMs only - This test does not always trigger the kswapd3 continuous
> > > > swap in/out  usage but it uses it and it  looks like there is a
> > > > change:
> > > >
> > > >  I can see kswapd non-continous (15s and more) usage with 6.5.y
> > > >  # ps ax | grep [k]swapd
> > > >     753 ?        S      0:00 [kswapd0]
> > > >     754 ?        S      0:00 [kswapd1]
> > > >     755 ?        S      0:00 [kswapd2]
> > > >     756 ?        S      0:15 [kswapd3]    <<<<<<<<<
> > > >     757 ?        S      0:00 [kswapd4]
> > > >     758 ?        S      0:00 [kswapd5]
> > > >     759 ?        S      0:00 [kswapd6]
> > > >     760 ?        S      0:00 [kswapd7]
> > > >     761 ?        S      0:00 [kswapd8]
> > > >     762 ?        S      0:00 [kswapd9]
> > > >     763 ?        S      0:00 [kswapd10]
> > > >     764 ?        S      0:00 [kswapd11]
> > > >     765 ?        S      0:00 [kswapd12]
> > > >     766 ?        S      0:00 [kswapd13]
> > > >     767 ?        S      0:00 [kswapd14]
> > > >     768 ?        S      0:00 [kswapd15]
> > > >
> > > > and none kswapd usage with 6.6.1, that looks to be promising path
> > > >
> > > > # ps ax | grep [k]swapd
> > > >     808 ?        S      0:00 [kswapd0]
> > > >     809 ?        S      0:00 [kswapd1]
> > > >     810 ?        S      0:00 [kswapd2]
> > > >     811 ?        S      0:00 [kswapd3]    <<<< nice
> > > >     812 ?        S      0:00 [kswapd4]
> > > >     813 ?        S      0:00 [kswapd5]
> > > >     814 ?        S      0:00 [kswapd6]
> > > >     815 ?        S      0:00 [kswapd7]
> > > >     816 ?        S      0:00 [kswapd8]
> > > >     817 ?        S      0:00 [kswapd9]
> > > >     818 ?        S      0:00 [kswapd10]
> > > >     819 ?        S      0:00 [kswapd11]
> > > >     820 ?        S      0:00 [kswapd12]
> > > >     821 ?        S      0:00 [kswapd13]
> > > >     822 ?        S      0:00 [kswapd14]
> > > >     823 ?        S      0:00 [kswapd15]
> > > >
> > > > I will install the 6.6.1 on the server which is doing some work and
> > > > observe it later today.
> > >
> > > Thanks. Fingers crossed.
> >
> > The 6.6.y was deployed and used from 9th Nov 3PM CEST. So far so good.
> > The node 3 has 163MiB free of memory and I see
> > just a few in/out swap usage sometimes (which is expected) and minimal
> > kswapd3 process usage for almost 4days.
>
> Thanks for the update!
>
> Just to confirm:
> 1. MGLRU was enabled, and

Yes, MGLRU is enabled

> 2. The v6.6 deployed did NOT have the patch I attached earlier.

Vanilla 6.6, attached patch NOT applied.
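
For completeness, "enabled" here means the lru_gen sysfs knob:

  cat /sys/kernel/mm/lru_gen/enabled        # 0x0007 = fully enabled, 0x0000 = disabled
  echo n >/sys/kernel/mm/lru_gen/enabled    # disable MGLRU
  echo y >/sys/kernel/mm/lru_gen/enabled    # re-enable MGLRU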

> Are both correct?
>
> If so, I'd very appreciate it if you could try the attached patch on
> top of v6.5 and see if it helps. My suspicion is that the problem is
> compaction related, i.e., kswapd was woken up by high order
> allocations but didn't properly stop. But what causes the behavior

Sure, I can try it. I will keep you informed about the progress.

> difference on v6.5 between MGLRU and the active/inactive LRU still
> puzzles me --the problem might be somehow masked rather than fixed on
> v6.6.

I'm not sure how I can help with the issue. Any suggestions on what to
change/try?

>
> For any other problems that you suspect might be related to MGLRU,
> please let me know and I'd be happy to look into them as well.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-11-14  7:29                   ` Jaroslav Pulchart
@ 2023-11-14  7:47                     ` Yu Zhao
  2023-11-20  8:41                       ` Jaroslav Pulchart
  0 siblings, 1 reply; 30+ messages in thread
From: Yu Zhao @ 2023-11-14  7:47 UTC (permalink / raw)
  To: Jaroslav Pulchart
  Cc: linux-mm, akpm, Igor Raits, Daniel Secik, Charan Teja Kalla

On Tue, Nov 14, 2023 at 12:30 AM Jaroslav Pulchart
<jaroslav.pulchart@gooddata.com> wrote:
>
> >
> > On Mon, Nov 13, 2023 at 1:36 AM Jaroslav Pulchart
> > <jaroslav.pulchart@gooddata.com> wrote:
> > >
> > > >
> > > > On Thu, Nov 9, 2023 at 3:58 AM Jaroslav Pulchart
> > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > >
> > > > > >
> > > > > > On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
> > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Hi Jaroslav,
> > > > > > > > >
> > > > > > > > > Hi Yu Zhao
> > > > > > > > >
> > > > > > > > > thanks for response, see answers inline:
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hello,
> > > > > > > > > > >
> > > > > > > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > > > > > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > > > > > > > > > system (16numa domains).
> > > > > > > > > >
> > > > > > > > > > Kernel version please?
> > > > > > > > >
> > > > > > > > > 6.5.y, but we saw it sooner as it is in investigation from 23th May
> > > > > > > > > (6.4.y and maybe even the 6.3.y).
> > > > > > > >
> > > > > > > > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> > > > > > > > for you if you run into other problems with v6.6.
> > > > > > > >
> > > > > > >
> > > > > > > I will give it a try using 6.6.y. When it will work we can switch to
> > > > > > > 6.6.y instead of backporting the stuff to 6.5.y.
> > > > > > >
> > > > > > > > > > > Symptoms of my issue are
> > > > > > > > > > >
> > > > > > > > > > > /A/ if mult-gen LRU is enabled
> > > > > > > > > > > 1/ [kswapd3] is consuming 100% CPU
> > > > > > > > > >
> > > > > > > > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > > > > > > > > >
> > > > > > > > > > >     top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34,
> > > > > > > > > > > 18.26, 15.01
> > > > > > > > > > >     Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
> > > > > > > > > > >     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,
> > > > > > > > > > > 0.4 si,  0.0 st
> > > > > > > > > > >     MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
> > > > > > > > > > >     MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
> > > > > > > > > > >     ...
> > > > > > > > > > >         765 root      20   0       0      0      0 R  98.3   0.0
> > > > > > > > > > > 34969:04 kswapd3
> > > > > > > > > > >     ...
> > > > > > > > > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > > > > > > > > > > observed with swap disk as well and cause IO latency issues due to
> > > > > > > > > > > some kind of locking)
> > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > /B/ if mult-gen LRU is disabled
> > > > > > > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > > > > > > > >     top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05,
> > > > > > > > > > > 17.77, 14.77
> > > > > > > > > > >     Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
> > > > > > > > > > >     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,
> > > > > > > > > > > 0.4 si,  0.0 st
> > > > > > > > > > >     MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
> > > > > > > > > > >     MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
> > > > > > > > > > >     ...
> > > > > > > > > > >        765 root      20   0       0      0      0 S   3.6   0.0
> > > > > > > > > > > 34966:46 [kswapd3]
> > > > > > > > > > >     ...
> > > > > > > > > > > 2/ swap space usage is low (4MB)
> > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > > > > > > > > > >
> > > > > > > > > > > Both situations are wrong as they are using swap in/out extensively,
> > > > > > > > > > > however the multi-gen LRU situation is 10times worse.
> > > > > > > > > >
> > > > > > > > > > From the stats below, node 3 had the lowest free memory. So I think in
> > > > > > > > > > both cases, the reclaim activities were as expected.
> > > > > > > > >
> > > > > > > > > I do not see a reason for the memory pressure and reclaims. This node
> > > > > > > > > has the lowest free memory of all nodes (~302MB free) that is true,
> > > > > > > > > however the swap space usage is just 4MB (still going in and out). So
> > > > > > > > > what can be the reason for that behaviour?
> > > > > > > >
> > > > > > > > The best analogy is that refuel (reclaim) happens before the tank
> > > > > > > > becomes empty, and it happens even sooner when there is a long road
> > > > > > > > ahead (high order allocations).
> > > > > > > >
> > > > > > > > > The workers/application is running in pre-allocated HugePages and the
> > > > > > > > > rest is used for a small set of system services and drivers of
> > > > > > > > > devices. It is static and not growing. The issue persists when I stop
> > > > > > > > > the system services and free the memory.
> > > > > > > >
> > > > > > > > Yes, this helps.
> > > > > > > >  Also could you attach /proc/buddyinfo from the moment
> > > > > > > > you hit the problem?
> > > > > > > >
> > > > > > >
> > > > > > > I can. The problem is continuous, it is 100% of time continuously
> > > > > > > doing in/out and consuming 100% of CPU and locking IO.
> > > > > > >
> > > > > > > The output of /proc/buddyinfo is:
> > > > > > >
> > > > > > > # cat /proc/buddyinfo
> > > > > > > Node 0, zone      DMA      7      2      2      1      1      2      1
> > > > > > >      1      1      2      1
> > > > > > > Node 0, zone    DMA32   4567   3395   1357    846    439    190     93
> > > > > > >     61     43     23      4
> > > > > > > Node 0, zone   Normal     19    190    140    129    136     75     66
> > > > > > >     41      9      1      5
> > > > > > > Node 1, zone   Normal    194   1210   2080   1800    715    255    111
> > > > > > >     56     42     36     55
> > > > > > > Node 2, zone   Normal    204    768   3766   3394   1742    468    185
> > > > > > >    194    238     47     74
> > > > > > > Node 3, zone   Normal   1622   2137   1058    846    388    208     97
> > > > > > >     44     14     42     10
> > > > > >
> > > > > > Again, thinking out loud: there is only one zone on node 3, i.e., the
> > > > > > normal zone, and this excludes the problem commit
> > > > > > 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
> > > > > > reclaim") fixed in v6.6.
> > > > >
> > > > > I built vanila 6.6.1 and did the first fast test - spin up and destroy
> > > > > VMs only - This test does not always trigger the kswapd3 continuous
> > > > > swap in/out  usage but it uses it and it  looks like there is a
> > > > > change:
> > > > >
> > > > >  I can see kswapd non-continous (15s and more) usage with 6.5.y
> > > > >  # ps ax | grep [k]swapd
> > > > >     753 ?        S      0:00 [kswapd0]
> > > > >     754 ?        S      0:00 [kswapd1]
> > > > >     755 ?        S      0:00 [kswapd2]
> > > > >     756 ?        S      0:15 [kswapd3]    <<<<<<<<<
> > > > >     757 ?        S      0:00 [kswapd4]
> > > > >     758 ?        S      0:00 [kswapd5]
> > > > >     759 ?        S      0:00 [kswapd6]
> > > > >     760 ?        S      0:00 [kswapd7]
> > > > >     761 ?        S      0:00 [kswapd8]
> > > > >     762 ?        S      0:00 [kswapd9]
> > > > >     763 ?        S      0:00 [kswapd10]
> > > > >     764 ?        S      0:00 [kswapd11]
> > > > >     765 ?        S      0:00 [kswapd12]
> > > > >     766 ?        S      0:00 [kswapd13]
> > > > >     767 ?        S      0:00 [kswapd14]
> > > > >     768 ?        S      0:00 [kswapd15]
> > > > >
> > > > > and none kswapd usage with 6.6.1, that looks to be promising path
> > > > >
> > > > > # ps ax | grep [k]swapd
> > > > >     808 ?        S      0:00 [kswapd0]
> > > > >     809 ?        S      0:00 [kswapd1]
> > > > >     810 ?        S      0:00 [kswapd2]
> > > > >     811 ?        S      0:00 [kswapd3]    <<<< nice
> > > > >     812 ?        S      0:00 [kswapd4]
> > > > >     813 ?        S      0:00 [kswapd5]
> > > > >     814 ?        S      0:00 [kswapd6]
> > > > >     815 ?        S      0:00 [kswapd7]
> > > > >     816 ?        S      0:00 [kswapd8]
> > > > >     817 ?        S      0:00 [kswapd9]
> > > > >     818 ?        S      0:00 [kswapd10]
> > > > >     819 ?        S      0:00 [kswapd11]
> > > > >     820 ?        S      0:00 [kswapd12]
> > > > >     821 ?        S      0:00 [kswapd13]
> > > > >     822 ?        S      0:00 [kswapd14]
> > > > >     823 ?        S      0:00 [kswapd15]
> > > > >
> > > > > I will install the 6.6.1 on the server which is doing some work and
> > > > > observe it later today.
> > > >
> > > > Thanks. Fingers crossed.
> > >
> > > The 6.6.y was deployed and used from 9th Nov 3PM CEST. So far so good.
> > > The node 3 has 163MiB free of memory and I see
> > > just a few in/out swap usage sometimes (which is expected) and minimal
> > > kswapd3 process usage for almost 4days.
> >
> > Thanks for the update!
> >
> > Just to confirm:
> > 1. MGLRU was enabled, and
>
> Yes, MGLRU is enabled
>
> > 2. The v6.6 deployed did NOT have the patch I attached earlier.
>
> Vanila 6.6, attached patch NOT applied.
>
> > Are both correct?
> >
> > If so, I'd very appreciate it if you could try the attached patch on
> > top of v6.5 and see if it helps. My suspicion is that the problem is
> > compaction related, i.e., kswapd was woken up by high order
> > allocations but didn't properly stop. But what causes the behavior
>
> Sure, I can try it. Will inform you about progress.

Thanks!

> > difference on v6.5 between MGLRU and the active/inactive LRU still
> > puzzles me --the problem might be somehow masked rather than fixed on
> > v6.6.
>
> I'm not sure how I can help with the issue. Any suggestions on what to
> change/try?

Trying the attached patch is good enough for now :)


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-11-14  7:47                     ` Yu Zhao
@ 2023-11-20  8:41                       ` Jaroslav Pulchart
  2023-11-22  6:13                         ` Yu Zhao
  0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2023-11-20  8:41 UTC (permalink / raw)
  To: Yu Zhao; +Cc: linux-mm, akpm, Igor Raits, Daniel Secik, Charan Teja Kalla

[-- Attachment #1: Type: text/plain, Size: 12374 bytes --]

> On Tue, Nov 14, 2023 at 12:30 AM Jaroslav Pulchart
> <jaroslav.pulchart@gooddata.com> wrote:
> >
> > >
> > > On Mon, Nov 13, 2023 at 1:36 AM Jaroslav Pulchart
> > > <jaroslav.pulchart@gooddata.com> wrote:
> > > >
> > > > >
> > > > > On Thu, Nov 9, 2023 at 3:58 AM Jaroslav Pulchart
> > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > >
> > > > > > >
> > > > > > > On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
> > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Hi Jaroslav,
> > > > > > > > > >
> > > > > > > > > > Hi Yu Zhao
> > > > > > > > > >
> > > > > > > > > > thanks for response, see answers inline:
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Hello,
> > > > > > > > > > > >
> > > > > > > > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > > > > > > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > > > > > > > > > > system (16numa domains).
> > > > > > > > > > >
> > > > > > > > > > > Kernel version please?
> > > > > > > > > >
> > > > > > > > > > 6.5.y, but we saw it sooner as it is in investigation from 23th May
> > > > > > > > > > (6.4.y and maybe even the 6.3.y).
> > > > > > > > >
> > > > > > > > > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> > > > > > > > > for you if you run into other problems with v6.6.
> > > > > > > > >
> > > > > > > >
> > > > > > > > I will give it a try using 6.6.y. When it will work we can switch to
> > > > > > > > 6.6.y instead of backporting the stuff to 6.5.y.
> > > > > > > >
> > > > > > > > > > > > Symptoms of my issue are
> > > > > > > > > > > >
> > > > > > > > > > > > /A/ if mult-gen LRU is enabled
> > > > > > > > > > > > 1/ [kswapd3] is consuming 100% CPU
> > > > > > > > > > >
> > > > > > > > > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > > > > > > > > > >
> > > > > > > > > > > >     top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34,
> > > > > > > > > > > > 18.26, 15.01
> > > > > > > > > > > >     Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
> > > > > > > > > > > >     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,
> > > > > > > > > > > > 0.4 si,  0.0 st
> > > > > > > > > > > >     MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
> > > > > > > > > > > >     MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
> > > > > > > > > > > >     ...
> > > > > > > > > > > >         765 root      20   0       0      0      0 R  98.3   0.0
> > > > > > > > > > > > 34969:04 kswapd3
> > > > > > > > > > > >     ...
> > > > > > > > > > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > > > > > > > > > > > observed with swap disk as well and cause IO latency issues due to
> > > > > > > > > > > > some kind of locking)
> > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > /B/ if mult-gen LRU is disabled
> > > > > > > > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > > > > > > > > >     top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05,
> > > > > > > > > > > > 17.77, 14.77
> > > > > > > > > > > >     Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
> > > > > > > > > > > >     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,
> > > > > > > > > > > > 0.4 si,  0.0 st
> > > > > > > > > > > >     MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
> > > > > > > > > > > >     MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
> > > > > > > > > > > >     ...
> > > > > > > > > > > >        765 root      20   0       0      0      0 S   3.6   0.0
> > > > > > > > > > > > 34966:46 [kswapd3]
> > > > > > > > > > > >     ...
> > > > > > > > > > > > 2/ swap space usage is low (4MB)
> > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > > > > > > > > > > >
> > > > > > > > > > > > Both situations are wrong as they are using swap in/out extensively,
> > > > > > > > > > > > however the multi-gen LRU situation is 10times worse.
> > > > > > > > > > >
> > > > > > > > > > > From the stats below, node 3 had the lowest free memory. So I think in
> > > > > > > > > > > both cases, the reclaim activities were as expected.
> > > > > > > > > >
> > > > > > > > > > I do not see a reason for the memory pressure and reclaims. This node
> > > > > > > > > > has the lowest free memory of all nodes (~302MB free) that is true,
> > > > > > > > > > however the swap space usage is just 4MB (still going in and out). So
> > > > > > > > > > what can be the reason for that behaviour?
> > > > > > > > >
> > > > > > > > > The best analogy is that refuel (reclaim) happens before the tank
> > > > > > > > > becomes empty, and it happens even sooner when there is a long road
> > > > > > > > > ahead (high order allocations).
> > > > > > > > >
> > > > > > > > > > The workers/application is running in pre-allocated HugePages and the
> > > > > > > > > > rest is used for a small set of system services and drivers of
> > > > > > > > > > devices. It is static and not growing. The issue persists when I stop
> > > > > > > > > > the system services and free the memory.
> > > > > > > > >
> > > > > > > > > Yes, this helps.
> > > > > > > > >  Also could you attach /proc/buddyinfo from the moment
> > > > > > > > > you hit the problem?
> > > > > > > > >
> > > > > > > >
> > > > > > > > I can. The problem is continuous, it is 100% of time continuously
> > > > > > > > doing in/out and consuming 100% of CPU and locking IO.
> > > > > > > >
> > > > > > > > The output of /proc/buddyinfo is:
> > > > > > > >
> > > > > > > > # cat /proc/buddyinfo
> > > > > > > > Node 0, zone      DMA      7      2      2      1      1      2      1
> > > > > > > >      1      1      2      1
> > > > > > > > Node 0, zone    DMA32   4567   3395   1357    846    439    190     93
> > > > > > > >     61     43     23      4
> > > > > > > > Node 0, zone   Normal     19    190    140    129    136     75     66
> > > > > > > >     41      9      1      5
> > > > > > > > Node 1, zone   Normal    194   1210   2080   1800    715    255    111
> > > > > > > >     56     42     36     55
> > > > > > > > Node 2, zone   Normal    204    768   3766   3394   1742    468    185
> > > > > > > >    194    238     47     74
> > > > > > > > Node 3, zone   Normal   1622   2137   1058    846    388    208     97
> > > > > > > >     44     14     42     10
> > > > > > >
> > > > > > > Again, thinking out loud: there is only one zone on node 3, i.e., the
> > > > > > > normal zone, and this excludes the problem commit
> > > > > > > 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
> > > > > > > reclaim") fixed in v6.6.
> > > > > >
> > > > > > I built vanila 6.6.1 and did the first fast test - spin up and destroy
> > > > > > VMs only - This test does not always trigger the kswapd3 continuous
> > > > > > swap in/out  usage but it uses it and it  looks like there is a
> > > > > > change:
> > > > > >
> > > > > >  I can see kswapd non-continous (15s and more) usage with 6.5.y
> > > > > >  # ps ax | grep [k]swapd
> > > > > >     753 ?        S      0:00 [kswapd0]
> > > > > >     754 ?        S      0:00 [kswapd1]
> > > > > >     755 ?        S      0:00 [kswapd2]
> > > > > >     756 ?        S      0:15 [kswapd3]    <<<<<<<<<
> > > > > >     757 ?        S      0:00 [kswapd4]
> > > > > >     758 ?        S      0:00 [kswapd5]
> > > > > >     759 ?        S      0:00 [kswapd6]
> > > > > >     760 ?        S      0:00 [kswapd7]
> > > > > >     761 ?        S      0:00 [kswapd8]
> > > > > >     762 ?        S      0:00 [kswapd9]
> > > > > >     763 ?        S      0:00 [kswapd10]
> > > > > >     764 ?        S      0:00 [kswapd11]
> > > > > >     765 ?        S      0:00 [kswapd12]
> > > > > >     766 ?        S      0:00 [kswapd13]
> > > > > >     767 ?        S      0:00 [kswapd14]
> > > > > >     768 ?        S      0:00 [kswapd15]
> > > > > >
> > > > > > and none kswapd usage with 6.6.1, that looks to be promising path
> > > > > >
> > > > > > # ps ax | grep [k]swapd
> > > > > >     808 ?        S      0:00 [kswapd0]
> > > > > >     809 ?        S      0:00 [kswapd1]
> > > > > >     810 ?        S      0:00 [kswapd2]
> > > > > >     811 ?        S      0:00 [kswapd3]    <<<< nice
> > > > > >     812 ?        S      0:00 [kswapd4]
> > > > > >     813 ?        S      0:00 [kswapd5]
> > > > > >     814 ?        S      0:00 [kswapd6]
> > > > > >     815 ?        S      0:00 [kswapd7]
> > > > > >     816 ?        S      0:00 [kswapd8]
> > > > > >     817 ?        S      0:00 [kswapd9]
> > > > > >     818 ?        S      0:00 [kswapd10]
> > > > > >     819 ?        S      0:00 [kswapd11]
> > > > > >     820 ?        S      0:00 [kswapd12]
> > > > > >     821 ?        S      0:00 [kswapd13]
> > > > > >     822 ?        S      0:00 [kswapd14]
> > > > > >     823 ?        S      0:00 [kswapd15]
> > > > > >
> > > > > > I will install the 6.6.1 on the server which is doing some work and
> > > > > > observe it later today.
> > > > >
> > > > > Thanks. Fingers crossed.
> > > >
> > > > The 6.6.y was deployed and used from 9th Nov 3PM CEST. So far so good.
> > > > The node 3 has 163MiB free of memory and I see
> > > > just a few in/out swap usage sometimes (which is expected) and minimal
> > > > kswapd3 process usage for almost 4days.
> > >
> > > Thanks for the update!
> > >
> > > Just to confirm:
> > > 1. MGLRU was enabled, and
> >
> > Yes, MGLRU is enabled
> >
> > > 2. The v6.6 deployed did NOT have the patch I attached earlier.
> >
> > Vanila 6.6, attached patch NOT applied.
> >
> > > Are both correct?
> > >
> > > If so, I'd very appreciate it if you could try the attached patch on
> > > top of v6.5 and see if it helps. My suspicion is that the problem is
> > > compaction related, i.e., kswapd was woken up by high order
> > > allocations but didn't properly stop. But what causes the behavior
> >
> > Sure, I can try it. Will inform you about progress.
>
> Thanks!
>
> > > difference on v6.5 between MGLRU and the active/inactive LRU still
> > > puzzles me --the problem might be somehow masked rather than fixed on
> > > v6.6.
> >
> > I'm not sure how I can help with the issue. Any suggestions on what to
> > change/try?
>
> Trying the attached patch is good enough for now :)

So far I have been running "6.5.y + patch" for 4 days without triggering
the infinite swap in/out usage.

I'm observing a similar pattern in kswapd usage - if kswapd is used at
all, it is mostly kswapd3, like on vanilla 6.5.y; this is not observed
with 6.6.y. (Node 3 has 159 MB of free memory.)
# ps ax | grep [k]swapd
    750 ?        S      0:00 [kswapd0]
    751 ?        S      0:00 [kswapd1]
    752 ?        S      0:00 [kswapd2]
    753 ?        S      0:02 [kswapd3]    <<<< it uses kswapd3; the good news is that it is not continuous
    754 ?        S      0:00 [kswapd4]
    755 ?        S      0:00 [kswapd5]
    756 ?        S      0:00 [kswapd6]
    757 ?        S      0:00 [kswapd7]
    758 ?        S      0:00 [kswapd8]
    759 ?        S      0:00 [kswapd9]
    760 ?        S      0:00 [kswapd10]
    761 ?        S      0:00 [kswapd11]
    762 ?        S      0:00 [kswapd12]
    763 ?        S      0:00 [kswapd13]
    764 ?        S      0:00 [kswapd14]
    765 ?        S      0:00 [kswapd15]

The good news is that the system has not ended up in a continuous loop
of swap in/out usage (at least so far). See the attached
swap_in_out_good_vs_bad.png. I will keep it running for the next 3
days.
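
For reference, the same swap in/out rate can also be watched ad hoc
with e.g.:

  vmstat 60     # si/so columns: KiB swapped in/out per second, averaged over 60 s
  sar -W 60     # pswpin/s and pswpout/s (needs sysstat)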

[-- Attachment #2: swap_in_out_good_vs_bad.png --]
[-- Type: image/png, Size: 81234 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-11-20  8:41                       ` Jaroslav Pulchart
@ 2023-11-22  6:13                         ` Yu Zhao
  2023-11-22  7:12                           ` Jaroslav Pulchart
  0 siblings, 1 reply; 30+ messages in thread
From: Yu Zhao @ 2023-11-22  6:13 UTC (permalink / raw)
  To: Jaroslav Pulchart
  Cc: linux-mm, akpm, Igor Raits, Daniel Secik, Charan Teja Kalla,
	Kalesh Singh

On Mon, Nov 20, 2023 at 1:42 AM Jaroslav Pulchart
<jaroslav.pulchart@gooddata.com> wrote:
>
> > On Tue, Nov 14, 2023 at 12:30 AM Jaroslav Pulchart
> > <jaroslav.pulchart@gooddata.com> wrote:
> > >
> > > >
> > > > On Mon, Nov 13, 2023 at 1:36 AM Jaroslav Pulchart
> > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > >
> > > > > >
> > > > > > On Thu, Nov 9, 2023 at 3:58 AM Jaroslav Pulchart
> > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
> > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Jaroslav,
> > > > > > > > > > >
> > > > > > > > > > > Hi Yu Zhao
> > > > > > > > > > >
> > > > > > > > > > > thanks for response, see answers inline:
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hello,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > > > > > > > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > > > > > > > > > > > system (16numa domains).
> > > > > > > > > > > >
> > > > > > > > > > > > Kernel version please?
> > > > > > > > > > >
> > > > > > > > > > > 6.5.y, but we saw it sooner as it is in investigation from 23th May
> > > > > > > > > > > (6.4.y and maybe even the 6.3.y).
> > > > > > > > > >
> > > > > > > > > > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> > > > > > > > > > for you if you run into other problems with v6.6.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I will give it a try using 6.6.y. When it will work we can switch to
> > > > > > > > > 6.6.y instead of backporting the stuff to 6.5.y.
> > > > > > > > >
> > > > > > > > > > > > > Symptoms of my issue are
> > > > > > > > > > > > >
> > > > > > > > > > > > > /A/ if mult-gen LRU is enabled
> > > > > > > > > > > > > 1/ [kswapd3] is consuming 100% CPU
> > > > > > > > > > > >
> > > > > > > > > > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > > > > > > > > > > >
> > > > > > > > > > > > >     top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34,
> > > > > > > > > > > > > 18.26, 15.01
> > > > > > > > > > > > >     Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
> > > > > > > > > > > > >     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,
> > > > > > > > > > > > > 0.4 si,  0.0 st
> > > > > > > > > > > > >     MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
> > > > > > > > > > > > >     MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
> > > > > > > > > > > > >     ...
> > > > > > > > > > > > >         765 root      20   0       0      0      0 R  98.3   0.0
> > > > > > > > > > > > > 34969:04 kswapd3
> > > > > > > > > > > > >     ...
> > > > > > > > > > > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > > > > > > > > > > > > observed with swap disk as well and cause IO latency issues due to
> > > > > > > > > > > > > some kind of locking)
> > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > /B/ if mult-gen LRU is disabled
> > > > > > > > > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > > > > > > > > > >     top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05,
> > > > > > > > > > > > > 17.77, 14.77
> > > > > > > > > > > > >     Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
> > > > > > > > > > > > >     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,
> > > > > > > > > > > > > 0.4 si,  0.0 st
> > > > > > > > > > > > >     MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
> > > > > > > > > > > > >     MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
> > > > > > > > > > > > >     ...
> > > > > > > > > > > > >        765 root      20   0       0      0      0 S   3.6   0.0
> > > > > > > > > > > > > 34966:46 [kswapd3]
> > > > > > > > > > > > >     ...
> > > > > > > > > > > > > 2/ swap space usage is low (4MB)
> > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > > > > > > > > > > > >
> > > > > > > > > > > > > Both situations are wrong as they are using swap in/out extensively,
> > > > > > > > > > > > > however the multi-gen LRU situation is 10times worse.
> > > > > > > > > > > >
> > > > > > > > > > > > From the stats below, node 3 had the lowest free memory. So I think in
> > > > > > > > > > > > both cases, the reclaim activities were as expected.
> > > > > > > > > > >
> > > > > > > > > > > I do not see a reason for the memory pressure and reclaims. This node
> > > > > > > > > > > has the lowest free memory of all nodes (~302MB free) that is true,
> > > > > > > > > > > however the swap space usage is just 4MB (still going in and out). So
> > > > > > > > > > > what can be the reason for that behaviour?
> > > > > > > > > >
> > > > > > > > > > The best analogy is that refuel (reclaim) happens before the tank
> > > > > > > > > > becomes empty, and it happens even sooner when there is a long road
> > > > > > > > > > ahead (high order allocations).
> > > > > > > > > >
> > > > > > > > > > > The workers/application is running in pre-allocated HugePages and the
> > > > > > > > > > > rest is used for a small set of system services and drivers of
> > > > > > > > > > > devices. It is static and not growing. The issue persists when I stop
> > > > > > > > > > > the system services and free the memory.
> > > > > > > > > >
> > > > > > > > > > Yes, this helps.
> > > > > > > > > >  Also could you attach /proc/buddyinfo from the moment
> > > > > > > > > > you hit the problem?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I can. The problem is continuous, it is 100% of time continuously
> > > > > > > > > doing in/out and consuming 100% of CPU and locking IO.
> > > > > > > > >
> > > > > > > > > The output of /proc/buddyinfo is:
> > > > > > > > >
> > > > > > > > > # cat /proc/buddyinfo
> > > > > > > > > Node 0, zone      DMA      7      2      2      1      1      2      1
> > > > > > > > >      1      1      2      1
> > > > > > > > > Node 0, zone    DMA32   4567   3395   1357    846    439    190     93
> > > > > > > > >     61     43     23      4
> > > > > > > > > Node 0, zone   Normal     19    190    140    129    136     75     66
> > > > > > > > >     41      9      1      5
> > > > > > > > > Node 1, zone   Normal    194   1210   2080   1800    715    255    111
> > > > > > > > >     56     42     36     55
> > > > > > > > > Node 2, zone   Normal    204    768   3766   3394   1742    468    185
> > > > > > > > >    194    238     47     74
> > > > > > > > > Node 3, zone   Normal   1622   2137   1058    846    388    208     97
> > > > > > > > >     44     14     42     10
> > > > > > > >
> > > > > > > > Again, thinking out loud: there is only one zone on node 3, i.e., the
> > > > > > > > normal zone, and this excludes the problem commit
> > > > > > > > 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
> > > > > > > > reclaim") fixed in v6.6.
> > > > > > >
> > > > > > > I built vanila 6.6.1 and did the first fast test - spin up and destroy
> > > > > > > VMs only - This test does not always trigger the kswapd3 continuous
> > > > > > > swap in/out  usage but it uses it and it  looks like there is a
> > > > > > > change:
> > > > > > >
> > > > > > >  I can see kswapd non-continous (15s and more) usage with 6.5.y
> > > > > > >  # ps ax | grep [k]swapd
> > > > > > >     753 ?        S      0:00 [kswapd0]
> > > > > > >     754 ?        S      0:00 [kswapd1]
> > > > > > >     755 ?        S      0:00 [kswapd2]
> > > > > > >     756 ?        S      0:15 [kswapd3]    <<<<<<<<<
> > > > > > >     757 ?        S      0:00 [kswapd4]
> > > > > > >     758 ?        S      0:00 [kswapd5]
> > > > > > >     759 ?        S      0:00 [kswapd6]
> > > > > > >     760 ?        S      0:00 [kswapd7]
> > > > > > >     761 ?        S      0:00 [kswapd8]
> > > > > > >     762 ?        S      0:00 [kswapd9]
> > > > > > >     763 ?        S      0:00 [kswapd10]
> > > > > > >     764 ?        S      0:00 [kswapd11]
> > > > > > >     765 ?        S      0:00 [kswapd12]
> > > > > > >     766 ?        S      0:00 [kswapd13]
> > > > > > >     767 ?        S      0:00 [kswapd14]
> > > > > > >     768 ?        S      0:00 [kswapd15]
> > > > > > >
> > > > > > > and none kswapd usage with 6.6.1, that looks to be promising path
> > > > > > >
> > > > > > > # ps ax | grep [k]swapd
> > > > > > >     808 ?        S      0:00 [kswapd0]
> > > > > > >     809 ?        S      0:00 [kswapd1]
> > > > > > >     810 ?        S      0:00 [kswapd2]
> > > > > > >     811 ?        S      0:00 [kswapd3]    <<<< nice
> > > > > > >     812 ?        S      0:00 [kswapd4]
> > > > > > >     813 ?        S      0:00 [kswapd5]
> > > > > > >     814 ?        S      0:00 [kswapd6]
> > > > > > >     815 ?        S      0:00 [kswapd7]
> > > > > > >     816 ?        S      0:00 [kswapd8]
> > > > > > >     817 ?        S      0:00 [kswapd9]
> > > > > > >     818 ?        S      0:00 [kswapd10]
> > > > > > >     819 ?        S      0:00 [kswapd11]
> > > > > > >     820 ?        S      0:00 [kswapd12]
> > > > > > >     821 ?        S      0:00 [kswapd13]
> > > > > > >     822 ?        S      0:00 [kswapd14]
> > > > > > >     823 ?        S      0:00 [kswapd15]
> > > > > > >
> > > > > > > I will install the 6.6.1 on the server which is doing some work and
> > > > > > > observe it later today.
> > > > > >
> > > > > > Thanks. Fingers crossed.
> > > > >
> > > > > The 6.6.y was deployed and used from 9th Nov 3PM CEST. So far so good.
> > > > > The node 3 has 163MiB free of memory and I see
> > > > > just a few in/out swap usage sometimes (which is expected) and minimal
> > > > > kswapd3 process usage for almost 4days.
> > > >
> > > > Thanks for the update!
> > > >
> > > > Just to confirm:
> > > > 1. MGLRU was enabled, and
> > >
> > > Yes, MGLRU is enabled
> > >
> > > > 2. The v6.6 deployed did NOT have the patch I attached earlier.
> > >
> > > Vanila 6.6, attached patch NOT applied.
> > >
> > > > Are both correct?
> > > >
> > > > If so, I'd very appreciate it if you could try the attached patch on
> > > > top of v6.5 and see if it helps. My suspicion is that the problem is
> > > > compaction related, i.e., kswapd was woken up by high order
> > > > allocations but didn't properly stop. But what causes the behavior
> > >
> > > Sure, I can try it. Will inform you about progress.
> >
> > Thanks!
> >
> > > > difference on v6.5 between MGLRU and the active/inactive LRU still
> > > > puzzles me --the problem might be somehow masked rather than fixed on
> > > > v6.6.
> > >
> > > I'm not sure how I can help with the issue. Any suggestions on what to
> > > change/try?
> >
> > Trying the attached patch is good enough for now :)
>
> So far I'm running the "6.5.y + patch" for 4 days without triggering
> the infinite swap in//out usage.
>
> I'm observing a similar pattern in kswapd usage - "if it uses kswapd,
> then it is in majority the kswapd3 - like the vanila 6.5.y which is
> not observed with 6.6.y, (The Node's 3 free mem is 159 MB)
> # ps ax | grep [k]swapd
>     750 ?        S      0:00 [kswapd0]
>     751 ?        S      0:00 [kswapd1]
>     752 ?        S      0:00 [kswapd2]
>     753 ?        S      0:02 [kswapd3]    <<<< it uses kswapd3, good
> is that it is not continuous
>     754 ?        S      0:00 [kswapd4]
>     755 ?        S      0:00 [kswapd5]
>     756 ?        S      0:00 [kswapd6]
>     757 ?        S      0:00 [kswapd7]
>     758 ?        S      0:00 [kswapd8]
>     759 ?        S      0:00 [kswapd9]
>     760 ?        S      0:00 [kswapd10]
>     761 ?        S      0:00 [kswapd11]
>     762 ?        S      0:00 [kswapd12]
>     763 ?        S      0:00 [kswapd13]
>     764 ?        S      0:00 [kswapd14]
>     765 ?        S      0:00 [kswapd15]
>
> Good stuff is that the system did not end in a continuous loop of swap
> in/out usage (at least so far) which is great. See attached
> swap_in_out_good_vs_bad.png. I will keep it running for the next 3
> days.

Thanks again, Jaroslav!

Just a note here: I suspect the problem still exists on v6.6 but
somehow is masked, possibly by reduced memory usage from the kernel
itself and more free memory for userspace. So to be on the safe side,
I'll post the patch and credit you as the reporter and tester.
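
(If you want to sanity-check that hypothesis, one rough way -- just a
sketch, assuming the usual per-node sysfs layout on your box -- is to
snapshot node 3's kernel-side usage on both kernels right after boot
and compare:

    grep -E 'MemTotal|MemFree|MemUsed|Slab|KernelStack|PageTables' \
        /sys/devices/system/node/node3/meminfo

A noticeably lower MemUsed/Slab on v6.6 would support the masking
theory.)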


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-11-22  6:13                         ` Yu Zhao
@ 2023-11-22  7:12                           ` Jaroslav Pulchart
  2023-11-22  7:30                             ` Jaroslav Pulchart
  0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2023-11-22  7:12 UTC (permalink / raw)
  To: Yu Zhao
  Cc: linux-mm, akpm, Igor Raits, Daniel Secik, Charan Teja Kalla,
	Kalesh Singh

[-- Attachment #1: Type: text/plain, Size: 13980 bytes --]

>
> On Mon, Nov 20, 2023 at 1:42 AM Jaroslav Pulchart
> <jaroslav.pulchart@gooddata.com> wrote:
> >
> > > On Tue, Nov 14, 2023 at 12:30 AM Jaroslav Pulchart
> > > <jaroslav.pulchart@gooddata.com> wrote:
> > > >
> > > > >
> > > > > On Mon, Nov 13, 2023 at 1:36 AM Jaroslav Pulchart
> > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > >
> > > > > > >
> > > > > > > On Thu, Nov 9, 2023 at 3:58 AM Jaroslav Pulchart
> > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
> > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hi Jaroslav,
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Yu Zhao
> > > > > > > > > > > >
> > > > > > > > > > > > thanks for response, see answers inline:
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hello,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > > > > > > > > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > > > > > > > > > > > > system (16numa domains).
> > > > > > > > > > > > >
> > > > > > > > > > > > > Kernel version please?
> > > > > > > > > > > >
> > > > > > > > > > > > 6.5.y, but we saw it sooner as it is in investigation from 23th May
> > > > > > > > > > > > (6.4.y and maybe even the 6.3.y).
> > > > > > > > > > >
> > > > > > > > > > > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> > > > > > > > > > > for you if you run into other problems with v6.6.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I will give it a try using 6.6.y. When it will work we can switch to
> > > > > > > > > > 6.6.y instead of backporting the stuff to 6.5.y.
> > > > > > > > > >
> > > > > > > > > > > > > > Symptoms of my issue are
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > /A/ if mult-gen LRU is enabled
> > > > > > > > > > > > > > 1/ [kswapd3] is consuming 100% CPU
> > > > > > > > > > > > >
> > > > > > > > > > > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > > > > > > > > > > > >
> > > > > > > > > > > > > >     top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34,
> > > > > > > > > > > > > > 18.26, 15.01
> > > > > > > > > > > > > >     Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
> > > > > > > > > > > > > >     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,
> > > > > > > > > > > > > > 0.4 si,  0.0 st
> > > > > > > > > > > > > >     MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
> > > > > > > > > > > > > >     MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
> > > > > > > > > > > > > >     ...
> > > > > > > > > > > > > >         765 root      20   0       0      0      0 R  98.3   0.0
> > > > > > > > > > > > > > 34969:04 kswapd3
> > > > > > > > > > > > > >     ...
> > > > > > > > > > > > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > > > > > > > > > > > > > observed with swap disk as well and cause IO latency issues due to
> > > > > > > > > > > > > > some kind of locking)
> > > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > /B/ if mult-gen LRU is disabled
> > > > > > > > > > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > > > > > > > > > > >     top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05,
> > > > > > > > > > > > > > 17.77, 14.77
> > > > > > > > > > > > > >     Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
> > > > > > > > > > > > > >     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,
> > > > > > > > > > > > > > 0.4 si,  0.0 st
> > > > > > > > > > > > > >     MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
> > > > > > > > > > > > > >     MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
> > > > > > > > > > > > > >     ...
> > > > > > > > > > > > > >        765 root      20   0       0      0      0 S   3.6   0.0
> > > > > > > > > > > > > > 34966:46 [kswapd3]
> > > > > > > > > > > > > >     ...
> > > > > > > > > > > > > > 2/ swap space usage is low (4MB)
> > > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Both situations are wrong as they are using swap in/out extensively,
> > > > > > > > > > > > > > however the multi-gen LRU situation is 10times worse.
> > > > > > > > > > > > >
> > > > > > > > > > > > > From the stats below, node 3 had the lowest free memory. So I think in
> > > > > > > > > > > > > both cases, the reclaim activities were as expected.
> > > > > > > > > > > >
> > > > > > > > > > > > I do not see a reason for the memory pressure and reclaims. This node
> > > > > > > > > > > > has the lowest free memory of all nodes (~302MB free) that is true,
> > > > > > > > > > > > however the swap space usage is just 4MB (still going in and out). So
> > > > > > > > > > > > what can be the reason for that behaviour?
> > > > > > > > > > >
> > > > > > > > > > > The best analogy is that refuel (reclaim) happens before the tank
> > > > > > > > > > > becomes empty, and it happens even sooner when there is a long road
> > > > > > > > > > > ahead (high order allocations).
> > > > > > > > > > >
> > > > > > > > > > > > The workers/application is running in pre-allocated HugePages and the
> > > > > > > > > > > > rest is used for a small set of system services and drivers of
> > > > > > > > > > > > devices. It is static and not growing. The issue persists when I stop
> > > > > > > > > > > > the system services and free the memory.
> > > > > > > > > > >
> > > > > > > > > > > Yes, this helps.
> > > > > > > > > > >  Also could you attach /proc/buddyinfo from the moment
> > > > > > > > > > > you hit the problem?
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I can. The problem is continuous, it is 100% of time continuously
> > > > > > > > > > doing in/out and consuming 100% of CPU and locking IO.
> > > > > > > > > >
> > > > > > > > > > The output of /proc/buddyinfo is:
> > > > > > > > > >
> > > > > > > > > > # cat /proc/buddyinfo
> > > > > > > > > > Node 0, zone      DMA      7      2      2      1      1      2      1
> > > > > > > > > >      1      1      2      1
> > > > > > > > > > Node 0, zone    DMA32   4567   3395   1357    846    439    190     93
> > > > > > > > > >     61     43     23      4
> > > > > > > > > > Node 0, zone   Normal     19    190    140    129    136     75     66
> > > > > > > > > >     41      9      1      5
> > > > > > > > > > Node 1, zone   Normal    194   1210   2080   1800    715    255    111
> > > > > > > > > >     56     42     36     55
> > > > > > > > > > Node 2, zone   Normal    204    768   3766   3394   1742    468    185
> > > > > > > > > >    194    238     47     74
> > > > > > > > > > Node 3, zone   Normal   1622   2137   1058    846    388    208     97
> > > > > > > > > >     44     14     42     10
> > > > > > > > >
> > > > > > > > > Again, thinking out loud: there is only one zone on node 3, i.e., the
> > > > > > > > > normal zone, and this excludes the problem commit
> > > > > > > > > 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
> > > > > > > > > reclaim") fixed in v6.6.
> > > > > > > >
> > > > > > > > I built vanila 6.6.1 and did the first fast test - spin up and destroy
> > > > > > > > VMs only - This test does not always trigger the kswapd3 continuous
> > > > > > > > swap in/out  usage but it uses it and it  looks like there is a
> > > > > > > > change:
> > > > > > > >
> > > > > > > >  I can see kswapd non-continous (15s and more) usage with 6.5.y
> > > > > > > >  # ps ax | grep [k]swapd
> > > > > > > >     753 ?        S      0:00 [kswapd0]
> > > > > > > >     754 ?        S      0:00 [kswapd1]
> > > > > > > >     755 ?        S      0:00 [kswapd2]
> > > > > > > >     756 ?        S      0:15 [kswapd3]    <<<<<<<<<
> > > > > > > >     757 ?        S      0:00 [kswapd4]
> > > > > > > >     758 ?        S      0:00 [kswapd5]
> > > > > > > >     759 ?        S      0:00 [kswapd6]
> > > > > > > >     760 ?        S      0:00 [kswapd7]
> > > > > > > >     761 ?        S      0:00 [kswapd8]
> > > > > > > >     762 ?        S      0:00 [kswapd9]
> > > > > > > >     763 ?        S      0:00 [kswapd10]
> > > > > > > >     764 ?        S      0:00 [kswapd11]
> > > > > > > >     765 ?        S      0:00 [kswapd12]
> > > > > > > >     766 ?        S      0:00 [kswapd13]
> > > > > > > >     767 ?        S      0:00 [kswapd14]
> > > > > > > >     768 ?        S      0:00 [kswapd15]
> > > > > > > >
> > > > > > > > and none kswapd usage with 6.6.1, that looks to be promising path
> > > > > > > >
> > > > > > > > # ps ax | grep [k]swapd
> > > > > > > >     808 ?        S      0:00 [kswapd0]
> > > > > > > >     809 ?        S      0:00 [kswapd1]
> > > > > > > >     810 ?        S      0:00 [kswapd2]
> > > > > > > >     811 ?        S      0:00 [kswapd3]    <<<< nice
> > > > > > > >     812 ?        S      0:00 [kswapd4]
> > > > > > > >     813 ?        S      0:00 [kswapd5]
> > > > > > > >     814 ?        S      0:00 [kswapd6]
> > > > > > > >     815 ?        S      0:00 [kswapd7]
> > > > > > > >     816 ?        S      0:00 [kswapd8]
> > > > > > > >     817 ?        S      0:00 [kswapd9]
> > > > > > > >     818 ?        S      0:00 [kswapd10]
> > > > > > > >     819 ?        S      0:00 [kswapd11]
> > > > > > > >     820 ?        S      0:00 [kswapd12]
> > > > > > > >     821 ?        S      0:00 [kswapd13]
> > > > > > > >     822 ?        S      0:00 [kswapd14]
> > > > > > > >     823 ?        S      0:00 [kswapd15]
> > > > > > > >
> > > > > > > > I will install the 6.6.1 on the server which is doing some work and
> > > > > > > > observe it later today.
> > > > > > >
> > > > > > > Thanks. Fingers crossed.
> > > > > >
> > > > > > The 6.6.y was deployed and used from 9th Nov 3PM CEST. So far so good.
> > > > > > The node 3 has 163MiB free of memory and I see
> > > > > > just a few in/out swap usage sometimes (which is expected) and minimal
> > > > > > kswapd3 process usage for almost 4days.
> > > > >
> > > > > Thanks for the update!
> > > > >
> > > > > Just to confirm:
> > > > > 1. MGLRU was enabled, and
> > > >
> > > > Yes, MGLRU is enabled
> > > >
> > > > > 2. The v6.6 deployed did NOT have the patch I attached earlier.
> > > >
> > > > Vanila 6.6, attached patch NOT applied.
> > > >
> > > > > Are both correct?
> > > > >
> > > > > If so, I'd very appreciate it if you could try the attached patch on
> > > > > top of v6.5 and see if it helps. My suspicion is that the problem is
> > > > > compaction related, i.e., kswapd was woken up by high order
> > > > > allocations but didn't properly stop. But what causes the behavior
> > > >
> > > > Sure, I can try it. Will inform you about progress.
> > >
> > > Thanks!
> > >
> > > > > difference on v6.5 between MGLRU and the active/inactive LRU still
> > > > > puzzles me --the problem might be somehow masked rather than fixed on
> > > > > v6.6.
> > > >
> > > > I'm not sure how I can help with the issue. Any suggestions on what to
> > > > change/try?
> > >
> > > Trying the attached patch is good enough for now :)
> >
> > So far I'm running the "6.5.y + patch" for 4 days without triggering
> > the infinite swap in//out usage.
> >
> > I'm observing a similar pattern in kswapd usage - "if it uses kswapd,
> > then it is in majority the kswapd3 - like the vanila 6.5.y which is
> > not observed with 6.6.y, (The Node's 3 free mem is 159 MB)
> > # ps ax | grep [k]swapd
> >     750 ?        S      0:00 [kswapd0]
> >     751 ?        S      0:00 [kswapd1]
> >     752 ?        S      0:00 [kswapd2]
> >     753 ?        S      0:02 [kswapd3]    <<<< it uses kswapd3, good
> > is that it is not continuous
> >     754 ?        S      0:00 [kswapd4]
> >     755 ?        S      0:00 [kswapd5]
> >     756 ?        S      0:00 [kswapd6]
> >     757 ?        S      0:00 [kswapd7]
> >     758 ?        S      0:00 [kswapd8]
> >     759 ?        S      0:00 [kswapd9]
> >     760 ?        S      0:00 [kswapd10]
> >     761 ?        S      0:00 [kswapd11]
> >     762 ?        S      0:00 [kswapd12]
> >     763 ?        S      0:00 [kswapd13]
> >     764 ?        S      0:00 [kswapd14]
> >     765 ?        S      0:00 [kswapd15]
> >
> > Good stuff is that the system did not end in a continuous loop of swap
> > in/out usage (at least so far) which is great. See attached
> > swap_in_out_good_vs_bad.png. I will keep it running for the next 3
> > days.
>
> Thanks again, Jaroslav!
>
> Just a note here: I suspect the problem still exists on v6.6 but
> somehow is masked, possibly by reduced memory usage from the kernel
> itself and more free memory for userspace. So to be on the safe side,
> I'll post the patch and credit you as the reporter and tester.

Morning, let's wait. I reviewed the graph and the swap in/out started
happening again at 1:50 AM CET. It is slower than before (CPU
utilization ~0.3%), but it is still doing in/out; see the attached PNG.
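
For the record, the rate in the graph can be cross-checked against the
cumulative counters in /proc/vmstat (pswpin/pswpout count 4 KiB
pages); a rough sketch, assuming a 10-second sampling interval:

    first=1; prev_in=0; prev_out=0
    while true; do
        in=$(awk '$1 == "pswpin"  { print $2 }' /proc/vmstat)
        out=$(awk '$1 == "pswpout" { print $2 }' /proc/vmstat)
        if [ "$first" -eq 0 ]; then
            echo "$(date +%T) swap-in: $(( (in - prev_in) * 4 / 10 )) kB/s" \
                 "swap-out: $(( (out - prev_out) * 4 / 10 )) kB/s"
        fi
        first=0; prev_in=$in; prev_out=$out
        sleep 10
    done

(sar -W 10 from sysstat reports the same per-second rates, if it is
installed.)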

[-- Attachment #2: in_out_again.png --]
[-- Type: image/png, Size: 23506 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-11-22  7:12                           ` Jaroslav Pulchart
@ 2023-11-22  7:30                             ` Jaroslav Pulchart
  2023-11-22 14:18                               ` Yu Zhao
  0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2023-11-22  7:30 UTC (permalink / raw)
  To: Yu Zhao
  Cc: linux-mm, akpm, Igor Raits, Daniel Secik, Charan Teja Kalla,
	Kalesh Singh

>
> >
> > On Mon, Nov 20, 2023 at 1:42 AM Jaroslav Pulchart
> > <jaroslav.pulchart@gooddata.com> wrote:
> > >
> > > > On Tue, Nov 14, 2023 at 12:30 AM Jaroslav Pulchart
> > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > >
> > > > > >
> > > > > > On Mon, Nov 13, 2023 at 1:36 AM Jaroslav Pulchart
> > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > >
> > > > > > > >
> > > > > > > > On Thu, Nov 9, 2023 at 3:58 AM Jaroslav Pulchart
> > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
> > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> > > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Jaroslav,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hi Yu Zhao
> > > > > > > > > > > > >
> > > > > > > > > > > > > thanks for response, see answers inline:
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > > > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hello,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > > > > > > > > > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > > > > > > > > > > > > > system (16numa domains).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Kernel version please?
> > > > > > > > > > > > >
> > > > > > > > > > > > > 6.5.y, but we saw it sooner as it is in investigation from 23th May
> > > > > > > > > > > > > (6.4.y and maybe even the 6.3.y).
> > > > > > > > > > > >
> > > > > > > > > > > > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> > > > > > > > > > > > for you if you run into other problems with v6.6.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I will give it a try using 6.6.y. When it will work we can switch to
> > > > > > > > > > > 6.6.y instead of backporting the stuff to 6.5.y.
> > > > > > > > > > >
> > > > > > > > > > > > > > > Symptoms of my issue are
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > /A/ if mult-gen LRU is enabled
> > > > > > > > > > > > > > > 1/ [kswapd3] is consuming 100% CPU
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >     top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34,
> > > > > > > > > > > > > > > 18.26, 15.01
> > > > > > > > > > > > > > >     Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
> > > > > > > > > > > > > > >     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,
> > > > > > > > > > > > > > > 0.4 si,  0.0 st
> > > > > > > > > > > > > > >     MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
> > > > > > > > > > > > > > >     MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
> > > > > > > > > > > > > > >     ...
> > > > > > > > > > > > > > >         765 root      20   0       0      0      0 R  98.3   0.0
> > > > > > > > > > > > > > > 34969:04 kswapd3
> > > > > > > > > > > > > > >     ...
> > > > > > > > > > > > > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > > > > > > > > > > > > > > observed with swap disk as well and cause IO latency issues due to
> > > > > > > > > > > > > > > some kind of locking)
> > > > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > /B/ if mult-gen LRU is disabled
> > > > > > > > > > > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > > > > > > > > > > > >     top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05,
> > > > > > > > > > > > > > > 17.77, 14.77
> > > > > > > > > > > > > > >     Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
> > > > > > > > > > > > > > >     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,
> > > > > > > > > > > > > > > 0.4 si,  0.0 st
> > > > > > > > > > > > > > >     MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
> > > > > > > > > > > > > > >     MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
> > > > > > > > > > > > > > >     ...
> > > > > > > > > > > > > > >        765 root      20   0       0      0      0 S   3.6   0.0
> > > > > > > > > > > > > > > 34966:46 [kswapd3]
> > > > > > > > > > > > > > >     ...
> > > > > > > > > > > > > > > 2/ swap space usage is low (4MB)
> > > > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Both situations are wrong as they are using swap in/out extensively,
> > > > > > > > > > > > > > > however the multi-gen LRU situation is 10times worse.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > From the stats below, node 3 had the lowest free memory. So I think in
> > > > > > > > > > > > > > both cases, the reclaim activities were as expected.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I do not see a reason for the memory pressure and reclaims. This node
> > > > > > > > > > > > > has the lowest free memory of all nodes (~302MB free) that is true,
> > > > > > > > > > > > > however the swap space usage is just 4MB (still going in and out). So
> > > > > > > > > > > > > what can be the reason for that behaviour?
> > > > > > > > > > > >
> > > > > > > > > > > > The best analogy is that refuel (reclaim) happens before the tank
> > > > > > > > > > > > becomes empty, and it happens even sooner when there is a long road
> > > > > > > > > > > > ahead (high order allocations).
> > > > > > > > > > > >
> > > > > > > > > > > > > The workers/application is running in pre-allocated HugePages and the
> > > > > > > > > > > > > rest is used for a small set of system services and drivers of
> > > > > > > > > > > > > devices. It is static and not growing. The issue persists when I stop
> > > > > > > > > > > > > the system services and free the memory.
> > > > > > > > > > > >
> > > > > > > > > > > > Yes, this helps.
> > > > > > > > > > > >  Also could you attach /proc/buddyinfo from the moment
> > > > > > > > > > > > you hit the problem?
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I can. The problem is continuous, it is 100% of time continuously
> > > > > > > > > > > doing in/out and consuming 100% of CPU and locking IO.
> > > > > > > > > > >
> > > > > > > > > > > The output of /proc/buddyinfo is:
> > > > > > > > > > >
> > > > > > > > > > > # cat /proc/buddyinfo
> > > > > > > > > > > Node 0, zone      DMA      7      2      2      1      1      2      1
> > > > > > > > > > >      1      1      2      1
> > > > > > > > > > > Node 0, zone    DMA32   4567   3395   1357    846    439    190     93
> > > > > > > > > > >     61     43     23      4
> > > > > > > > > > > Node 0, zone   Normal     19    190    140    129    136     75     66
> > > > > > > > > > >     41      9      1      5
> > > > > > > > > > > Node 1, zone   Normal    194   1210   2080   1800    715    255    111
> > > > > > > > > > >     56     42     36     55
> > > > > > > > > > > Node 2, zone   Normal    204    768   3766   3394   1742    468    185
> > > > > > > > > > >    194    238     47     74
> > > > > > > > > > > Node 3, zone   Normal   1622   2137   1058    846    388    208     97
> > > > > > > > > > >     44     14     42     10
> > > > > > > > > >
> > > > > > > > > > Again, thinking out loud: there is only one zone on node 3, i.e., the
> > > > > > > > > > normal zone, and this excludes the problem commit
> > > > > > > > > > 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
> > > > > > > > > > reclaim") fixed in v6.6.
> > > > > > > > >
> > > > > > > > > I built vanila 6.6.1 and did the first fast test - spin up and destroy
> > > > > > > > > VMs only - This test does not always trigger the kswapd3 continuous
> > > > > > > > > swap in/out  usage but it uses it and it  looks like there is a
> > > > > > > > > change:
> > > > > > > > >
> > > > > > > > >  I can see kswapd non-continous (15s and more) usage with 6.5.y
> > > > > > > > >  # ps ax | grep [k]swapd
> > > > > > > > >     753 ?        S      0:00 [kswapd0]
> > > > > > > > >     754 ?        S      0:00 [kswapd1]
> > > > > > > > >     755 ?        S      0:00 [kswapd2]
> > > > > > > > >     756 ?        S      0:15 [kswapd3]    <<<<<<<<<
> > > > > > > > >     757 ?        S      0:00 [kswapd4]
> > > > > > > > >     758 ?        S      0:00 [kswapd5]
> > > > > > > > >     759 ?        S      0:00 [kswapd6]
> > > > > > > > >     760 ?        S      0:00 [kswapd7]
> > > > > > > > >     761 ?        S      0:00 [kswapd8]
> > > > > > > > >     762 ?        S      0:00 [kswapd9]
> > > > > > > > >     763 ?        S      0:00 [kswapd10]
> > > > > > > > >     764 ?        S      0:00 [kswapd11]
> > > > > > > > >     765 ?        S      0:00 [kswapd12]
> > > > > > > > >     766 ?        S      0:00 [kswapd13]
> > > > > > > > >     767 ?        S      0:00 [kswapd14]
> > > > > > > > >     768 ?        S      0:00 [kswapd15]
> > > > > > > > >
> > > > > > > > > and none kswapd usage with 6.6.1, that looks to be promising path
> > > > > > > > >
> > > > > > > > > # ps ax | grep [k]swapd
> > > > > > > > >     808 ?        S      0:00 [kswapd0]
> > > > > > > > >     809 ?        S      0:00 [kswapd1]
> > > > > > > > >     810 ?        S      0:00 [kswapd2]
> > > > > > > > >     811 ?        S      0:00 [kswapd3]    <<<< nice
> > > > > > > > >     812 ?        S      0:00 [kswapd4]
> > > > > > > > >     813 ?        S      0:00 [kswapd5]
> > > > > > > > >     814 ?        S      0:00 [kswapd6]
> > > > > > > > >     815 ?        S      0:00 [kswapd7]
> > > > > > > > >     816 ?        S      0:00 [kswapd8]
> > > > > > > > >     817 ?        S      0:00 [kswapd9]
> > > > > > > > >     818 ?        S      0:00 [kswapd10]
> > > > > > > > >     819 ?        S      0:00 [kswapd11]
> > > > > > > > >     820 ?        S      0:00 [kswapd12]
> > > > > > > > >     821 ?        S      0:00 [kswapd13]
> > > > > > > > >     822 ?        S      0:00 [kswapd14]
> > > > > > > > >     823 ?        S      0:00 [kswapd15]
> > > > > > > > >
> > > > > > > > > I will install the 6.6.1 on the server which is doing some work and
> > > > > > > > > observe it later today.
> > > > > > > >
> > > > > > > > Thanks. Fingers crossed.
> > > > > > >
> > > > > > > The 6.6.y was deployed and used from 9th Nov 3PM CEST. So far so good.
> > > > > > > The node 3 has 163MiB free of memory and I see
> > > > > > > just a few in/out swap usage sometimes (which is expected) and minimal
> > > > > > > kswapd3 process usage for almost 4days.
> > > > > >
> > > > > > Thanks for the update!
> > > > > >
> > > > > > Just to confirm:
> > > > > > 1. MGLRU was enabled, and
> > > > >
> > > > > Yes, MGLRU is enabled
> > > > >
> > > > > > 2. The v6.6 deployed did NOT have the patch I attached earlier.
> > > > >
> > > > > Vanila 6.6, attached patch NOT applied.
> > > > >
> > > > > > Are both correct?
> > > > > >
> > > > > > If so, I'd very appreciate it if you could try the attached patch on
> > > > > > top of v6.5 and see if it helps. My suspicion is that the problem is
> > > > > > compaction related, i.e., kswapd was woken up by high order
> > > > > > allocations but didn't properly stop. But what causes the behavior
> > > > >
> > > > > Sure, I can try it. Will inform you about progress.
> > > >
> > > > Thanks!
> > > >
> > > > > > difference on v6.5 between MGLRU and the active/inactive LRU still
> > > > > > puzzles me --the problem might be somehow masked rather than fixed on
> > > > > > v6.6.
> > > > >
> > > > > I'm not sure how I can help with the issue. Any suggestions on what to
> > > > > change/try?
> > > >
> > > > Trying the attached patch is good enough for now :)
> > >
> > > So far I'm running the "6.5.y + patch" for 4 days without triggering
> > > the infinite swap in//out usage.
> > >
> > > I'm observing a similar pattern in kswapd usage - "if it uses kswapd,
> > > then it is in majority the kswapd3 - like the vanila 6.5.y which is
> > > not observed with 6.6.y, (The Node's 3 free mem is 159 MB)
> > > # ps ax | grep [k]swapd
> > >     750 ?        S      0:00 [kswapd0]
> > >     751 ?        S      0:00 [kswapd1]
> > >     752 ?        S      0:00 [kswapd2]
> > >     753 ?        S      0:02 [kswapd3]    <<<< it uses kswapd3, good
> > > is that it is not continuous
> > >     754 ?        S      0:00 [kswapd4]
> > >     755 ?        S      0:00 [kswapd5]
> > >     756 ?        S      0:00 [kswapd6]
> > >     757 ?        S      0:00 [kswapd7]
> > >     758 ?        S      0:00 [kswapd8]
> > >     759 ?        S      0:00 [kswapd9]
> > >     760 ?        S      0:00 [kswapd10]
> > >     761 ?        S      0:00 [kswapd11]
> > >     762 ?        S      0:00 [kswapd12]
> > >     763 ?        S      0:00 [kswapd13]
> > >     764 ?        S      0:00 [kswapd14]
> > >     765 ?        S      0:00 [kswapd15]
> > >
> > > Good stuff is that the system did not end in a continuous loop of swap
> > > in/out usage (at least so far) which is great. See attached
> > > swap_in_out_good_vs_bad.png. I will keep it running for the next 3
> > > days.
> >
> > Thanks again, Jaroslav!
> >
> > Just a note here: I suspect the problem still exists on v6.6 but
> > somehow is masked, possibly by reduced memory usage from the kernel
> > itself and more free memory for userspace. So to be on the safe side,
> > I'll post the patch and credit you as the reporter and tester.
>
> Morning, let's wait. I reviewed the graph and the swap in/out started
> to be happening from 1:50 AM CET. Slower than before (util of cpu
> 0.3%) but it is doing in/out see attached png.

I investigated it further: there was an operational issue, and the
system disabled multi-gen LRU yesterday at ~10 AM CET (our temporary
workaround for this problem) by running
   echo N > /sys/kernel/mm/lru_gen/enabled
after an alert was triggered by an unexpected setup of the server.
Could it be that the patch is not functional if lru_gen/enabled is
0x0000?
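
For reference, if I read Documentation/admin-guide/mm/multigen_lru.rst
correctly, /sys/kernel/mm/lru_gen/enabled is a bitmask (0x0001 is the
main switch, the higher bits control clearing of the accessed bit in
leaf and non-leaf page table entries), so the state can be checked and
restored with something like:

    # cat /sys/kernel/mm/lru_gen/enabled
    0x0000                            <<< everything is off
    # echo y > /sys/kernel/mm/lru_gen/enabled
    # cat /sys/kernel/mm/lru_gen/enabled
    0x0007                            <<< expected if all three features are
                                          supported on this box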

I need to reboot the system and do the whole week's test again.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-11-22  7:30                             ` Jaroslav Pulchart
@ 2023-11-22 14:18                               ` Yu Zhao
  2023-11-29 13:54                                 ` Jaroslav Pulchart
  0 siblings, 1 reply; 30+ messages in thread
From: Yu Zhao @ 2023-11-22 14:18 UTC (permalink / raw)
  To: Jaroslav Pulchart
  Cc: Charan Teja Kalla, Daniel Secik, Igor Raits, Kalesh Singh, akpm,
	linux-mm

[-- Attachment #1: Type: text/plain, Size: 15792 bytes --]

On Wed, Nov 22, 2023 at 12:31 AM Jaroslav Pulchart <jaroslav.pulchart@gooddata.com> wrote:

> >
> > >
> > > On Mon, Nov 20, 2023 at 1:42 AM Jaroslav Pulchart
> > > <jaroslav.pulchart@gooddata.com> wrote:
> > > >
> > > > > On Tue, Nov 14, 2023 at 12:30 AM Jaroslav Pulchart
> > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > >
> > > > > > >
> > > > > > > On Mon, Nov 13, 2023 at 1:36 AM Jaroslav Pulchart
> > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > >
> > > > > > > > >
> > > > > > > > > On Thu, Nov 9, 2023 at 3:58 AM Jaroslav Pulchart
> > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
> > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> > > > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Jaroslav,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Yu Zhao
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > thanks for response, see answers inline:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav
> Pulchart
> > > > > > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hello,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I would like to report to you an unpleasant
> behavior of multi-gen LRU
> > > > > > > > > > > > > > > > with strange swap in/out usage on my Dell
> 7525 two socket AMD 74F3
> > > > > > > > > > > > > > > > system (16numa domains).
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Kernel version please?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 6.5.y, but we saw it sooner as it is in
> investigation from 23th May
> > > > > > > > > > > > > > (6.4.y and maybe even the 6.3.y).
> > > > > > > > > > > > >
> > > > > > > > > > > > > v6.6 has a few critical fixes for MGLRU, I can
> backport them to v6.5
> > > > > > > > > > > > > for you if you run into other problems with v6.6.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > I will give it a try using 6.6.y. When it will work
> we can switch to
> > > > > > > > > > > > 6.6.y instead of backporting the stuff to 6.5.y.
> > > > > > > > > > > >
> > > > > > > > > > > > > > > > Symptoms of my issue are
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > /A/ if mult-gen LRU is enabled
> > > > > > > > > > > > > > > > 1/ [kswapd3] is consuming 100% CPU
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Just thinking out loud: kswapd3 means the
> fourth node was under memory pressure.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >     top - 15:03:11 up 34 days,  1:51,  2
> users,  load average: 23.34,
> > > > > > > > > > > > > > > > 18.26, 15.01
> > > > > > > > > > > > > > > >     Tasks: 1226 total,   2 running, 1224
> sleeping,   0 stopped,   0 zombie
> > > > > > > > > > > > > > > >     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1
> id,  0.0 wa,  0.4 hi,
> > > > > > > > > > > > > > > > 0.4 si,  0.0 st
> > > > > > > > > > > > > > > >     MiB Mem : 1047265.+total,  28382.7 free,
> 1021308.+used,    767.6 buff/cache
> > > > > > > > > > > > > > > >     MiB Swap:   8192.0 total,   8187.7
> free,      4.2 used.  25956.7 avail Mem
> > > > > > > > > > > > > > > >     ...
> > > > > > > > > > > > > > > >         765 root      20   0       0      0
>     0 R  98.3   0.0
> > > > > > > > > > > > > > > > 34969:04 kswapd3
> > > > > > > > > > > > > > > >     ...
> > > > > > > > > > > > > > > > 2/ swap space usage is low about ~4MB from
> 8GB as swap in zram (was
> > > > > > > > > > > > > > > > observed with swap disk as well and cause IO
> latency issues due to
> > > > > > > > > > > > > > > > some kind of locking)
> > > > > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical
> ~12MB/s in and ~12MB/s out
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > /B/ if mult-gen LRU is disabled
> > > > > > > > > > > > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > > > > > > > > > > > > >     top - 15:02:49 up 34 days,  1:51,  2
> users,  load average: 23.05,
> > > > > > > > > > > > > > > > 17.77, 14.77
> > > > > > > > > > > > > > > >     Tasks: 1226 total,   1 running, 1225
> sleeping,   0 stopped,   0 zombie
> > > > > > > > > > > > > > > >     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8
> id,  0.0 wa,  0.4 hi,
> > > > > > > > > > > > > > > > 0.4 si,  0.0 st
> > > > > > > > > > > > > > > >     MiB Mem : 1047265.+total,  28378.5 free,
> 1021313.+used,    767.3 buff/cache
> > > > > > > > > > > > > > > >     MiB Swap:   8192.0 total,   8189.0
> free,      3.0 used.  25952.4 avail Mem
> > > > > > > > > > > > > > > >     ...
> > > > > > > > > > > > > > > >        765 root      20   0       0      0
>     0 S   3.6   0.0
> > > > > > > > > > > > > > > > 34966:46 [kswapd3]
> > > > > > > > > > > > > > > >     ...
> > > > > > > > > > > > > > > > 2/ swap space usage is low (4MB)
> > > > > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical
> ~500kB/s in and ~500kB/s out
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Both situations are wrong as they are using
> swap in/out extensively,
> > > > > > > > > > > > > > > > however the multi-gen LRU situation is
> 10times worse.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > From the stats below, node 3 had the lowest
> free memory. So I think in
> > > > > > > > > > > > > > > both cases, the reclaim activities were as
> expected.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I do not see a reason for the memory pressure
> and reclaims. This node
> > > > > > > > > > > > > > has the lowest free memory of all nodes (~302MB
> free) that is true,
> > > > > > > > > > > > > > however the swap space usage is just 4MB (still
> going in and out). So
> > > > > > > > > > > > > > what can be the reason for that behaviour?
> > > > > > > > > > > > >
> > > > > > > > > > > > > The best analogy is that refuel (reclaim) happens
> before the tank
> > > > > > > > > > > > > becomes empty, and it happens even sooner when
> there is a long road
> > > > > > > > > > > > > ahead (high order allocations).
> > > > > > > > > > > > >
> > > > > > > > > > > > > > The workers/application is running in
> pre-allocated HugePages and the
> > > > > > > > > > > > > > rest is used for a small set of system services
> and drivers of
> > > > > > > > > > > > > > devices. It is static and not growing. The issue
> persists when I stop
> > > > > > > > > > > > > > the system services and free the memory.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Yes, this helps.
> > > > > > > > > > > > >  Also could you attach /proc/buddyinfo from the
> moment
> > > > > > > > > > > > > you hit the problem?
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > I can. The problem is continuous, it is 100% of time
> continuously
> > > > > > > > > > > > doing in/out and consuming 100% of CPU and locking
> IO.
> > > > > > > > > > > >
> > > > > > > > > > > > The output of /proc/buddyinfo is:
> > > > > > > > > > > >
> > > > > > > > > > > > # cat /proc/buddyinfo
> > > > > > > > > > > > Node 0, zone      DMA      7      2      2      1
>   1      2      1
> > > > > > > > > > > >      1      1      2      1
> > > > > > > > > > > > Node 0, zone    DMA32   4567   3395   1357    846
> 439    190     93
> > > > > > > > > > > >     61     43     23      4
> > > > > > > > > > > > Node 0, zone   Normal     19    190    140    129
> 136     75     66
> > > > > > > > > > > >     41      9      1      5
> > > > > > > > > > > > Node 1, zone   Normal    194   1210   2080   1800
> 715    255    111
> > > > > > > > > > > >     56     42     36     55
> > > > > > > > > > > > Node 2, zone   Normal    204    768   3766   3394
>  1742    468    185
> > > > > > > > > > > >    194    238     47     74
> > > > > > > > > > > > Node 3, zone   Normal   1622   2137   1058    846
> 388    208     97
> > > > > > > > > > > >     44     14     42     10
> > > > > > > > > > >
> > > > > > > > > > > Again, thinking out loud: there is only one zone on
> node 3, i.e., the
> > > > > > > > > > > normal zone, and this excludes the problem commit
> > > > > > > > > > > 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen
> LRU: fix per-zone
> > > > > > > > > > > reclaim") fixed in v6.6.
> > > > > > > > > >
> > > > > > > > > > I built vanila 6.6.1 and did the first fast test - spin
> up and destroy
> > > > > > > > > > VMs only - This test does not always trigger the kswapd3
> continuous
> > > > > > > > > > swap in/out  usage but it uses it and it  looks like
> there is a
> > > > > > > > > > change:
> > > > > > > > > >
> > > > > > > > > >  I can see kswapd non-continous (15s and more) usage
> with 6.5.y
> > > > > > > > > >  # ps ax | grep [k]swapd
> > > > > > > > > >     753 ?        S      0:00 [kswapd0]
> > > > > > > > > >     754 ?        S      0:00 [kswapd1]
> > > > > > > > > >     755 ?        S      0:00 [kswapd2]
> > > > > > > > > >     756 ?        S      0:15 [kswapd3]    <<<<<<<<<
> > > > > > > > > >     757 ?        S      0:00 [kswapd4]
> > > > > > > > > >     758 ?        S      0:00 [kswapd5]
> > > > > > > > > >     759 ?        S      0:00 [kswapd6]
> > > > > > > > > >     760 ?        S      0:00 [kswapd7]
> > > > > > > > > >     761 ?        S      0:00 [kswapd8]
> > > > > > > > > >     762 ?        S      0:00 [kswapd9]
> > > > > > > > > >     763 ?        S      0:00 [kswapd10]
> > > > > > > > > >     764 ?        S      0:00 [kswapd11]
> > > > > > > > > >     765 ?        S      0:00 [kswapd12]
> > > > > > > > > >     766 ?        S      0:00 [kswapd13]
> > > > > > > > > >     767 ?        S      0:00 [kswapd14]
> > > > > > > > > >     768 ?        S      0:00 [kswapd15]
> > > > > > > > > >
> > > > > > > > > > and none kswapd usage with 6.6.1, that looks to be
> promising path
> > > > > > > > > >
> > > > > > > > > > # ps ax | grep [k]swapd
> > > > > > > > > >     808 ?        S      0:00 [kswapd0]
> > > > > > > > > >     809 ?        S      0:00 [kswapd1]
> > > > > > > > > >     810 ?        S      0:00 [kswapd2]
> > > > > > > > > >     811 ?        S      0:00 [kswapd3]    <<<< nice
> > > > > > > > > >     812 ?        S      0:00 [kswapd4]
> > > > > > > > > >     813 ?        S      0:00 [kswapd5]
> > > > > > > > > >     814 ?        S      0:00 [kswapd6]
> > > > > > > > > >     815 ?        S      0:00 [kswapd7]
> > > > > > > > > >     816 ?        S      0:00 [kswapd8]
> > > > > > > > > >     817 ?        S      0:00 [kswapd9]
> > > > > > > > > >     818 ?        S      0:00 [kswapd10]
> > > > > > > > > >     819 ?        S      0:00 [kswapd11]
> > > > > > > > > >     820 ?        S      0:00 [kswapd12]
> > > > > > > > > >     821 ?        S      0:00 [kswapd13]
> > > > > > > > > >     822 ?        S      0:00 [kswapd14]
> > > > > > > > > >     823 ?        S      0:00 [kswapd15]
> > > > > > > > > >
> > > > > > > > > > I will install the 6.6.1 on the server which is doing
> some work and
> > > > > > > > > > observe it later today.
> > > > > > > > >
> > > > > > > > > Thanks. Fingers crossed.
> > > > > > > >
> > > > > > > > The 6.6.y was deployed and used from 9th Nov 3PM CEST. So
> far so good.
> > > > > > > > The node 3 has 163MiB free of memory and I see
> > > > > > > > just a few in/out swap usage sometimes (which is expected)
> and minimal
> > > > > > > > kswapd3 process usage for almost 4days.
> > > > > > >
> > > > > > > Thanks for the update!
> > > > > > >
> > > > > > > Just to confirm:
> > > > > > > 1. MGLRU was enabled, and
> > > > > >
> > > > > > Yes, MGLRU is enabled
> > > > > >
> > > > > > > 2. The v6.6 deployed did NOT have the patch I attached earlier.
> > > > > >
> > > > > > Vanila 6.6, attached patch NOT applied.
> > > > > >
> > > > > > > Are both correct?
> > > > > > >
> > > > > > > If so, I'd very appreciate it if you could try the attached
> patch on
> > > > > > > top of v6.5 and see if it helps. My suspicion is that the
> problem is
> > > > > > > compaction related, i.e., kswapd was woken up by high order
> > > > > > > allocations but didn't properly stop. But what causes the
> behavior
> > > > > >
> > > > > > Sure, I can try it. Will inform you about progress.
> > > > >
> > > > > Thanks!
> > > > >
> > > > > > > difference on v6.5 between MGLRU and the active/inactive LRU
> still
> > > > > > > puzzles me --the problem might be somehow masked rather than
> fixed on
> > > > > > > v6.6.
> > > > > >
> > > > > > I'm not sure how I can help with the issue. Any suggestions on
> what to
> > > > > > change/try?
> > > > >
> > > > > Trying the attached patch is good enough for now :)
> > > >
> > > > So far I'm running the "6.5.y + patch" for 4 days without triggering
> > > > the infinite swap in//out usage.
> > > >
> > > > I'm observing a similar pattern in kswapd usage - "if it uses kswapd,
> > > > then it is in majority the kswapd3 - like the vanila 6.5.y which is
> > > > not observed with 6.6.y, (The Node's 3 free mem is 159 MB)
> > > > # ps ax | grep [k]swapd
> > > >     750 ?        S      0:00 [kswapd0]
> > > >     751 ?        S      0:00 [kswapd1]
> > > >     752 ?        S      0:00 [kswapd2]
> > > >     753 ?        S      0:02 [kswapd3]    <<<< it uses kswapd3, good
> > > > is that it is not continuous
> > > >     754 ?        S      0:00 [kswapd4]
> > > >     755 ?        S      0:00 [kswapd5]
> > > >     756 ?        S      0:00 [kswapd6]
> > > >     757 ?        S      0:00 [kswapd7]
> > > >     758 ?        S      0:00 [kswapd8]
> > > >     759 ?        S      0:00 [kswapd9]
> > > >     760 ?        S      0:00 [kswapd10]
> > > >     761 ?        S      0:00 [kswapd11]
> > > >     762 ?        S      0:00 [kswapd12]
> > > >     763 ?        S      0:00 [kswapd13]
> > > >     764 ?        S      0:00 [kswapd14]
> > > >     765 ?        S      0:00 [kswapd15]
> > > >
> > > > Good stuff is that the system did not end in a continuous loop of swap
> > > > in/out usage (at least so far) which is great. See attached
> > > > swap_in_out_good_vs_bad.png. I will keep it running for the next 3
> > > > days.
> > >
> > > Thanks again, Jaroslav!
> > >
> > > Just a note here: I suspect the problem still exists on v6.6 but
> > > somehow is masked, possibly by reduced memory usage from the kernel
> > > itself and more free memory for userspace. So to be on the safe side,
> > > I'll post the patch and credit you as the reporter and tester.
> >
> > Morning, let's wait. I reviewed the graph and the swap in/out started
> > to be happening from 1:50 AM CET. Slower than before (util of cpu
> > 0.3%) but it is doing in/out see attached png.
>
> I investigated it more, there was an operation issue and the system
> disabled multi-gen lru yesterday ~10 AM CET (our temporary workaround
> for this problem) by
>    echo N > /sys/kernel/mm/lru_gen/enabled
> when an alert was triggered by an unexpected setup of the server.
> Could it be that the patch is not functional if lru_gen/enabled is
> 0x0000?


That’s correct.

> I need to reboot the system and do the whole week's test again.


Thanks a lot!
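
For reference, the MGLRU state can be confirmed directly from sysfs before
the test is restarted; a minimal check, assuming the default sysfs layout:

    # 0x0000 means MGLRU is fully disabled; 0x0007 is the usual value with
    # all components enabled
    cat /sys/kernel/mm/lru_gen/enabled
    # re-enable all MGLRU components before rerunning the week-long test
    echo y > /sys/kernel/mm/lru_gen/enabled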

>

[-- Attachment #2: Type: text/html, Size: 26004 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-11-22 14:18                               ` Yu Zhao
@ 2023-11-29 13:54                                 ` Jaroslav Pulchart
  2023-12-01 23:52                                   ` Yu Zhao
  0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2023-11-29 13:54 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Charan Teja Kalla, Daniel Secik, Igor Raits, Kalesh Singh, akpm,
	linux-mm

> On Wed, Nov 22, 2023 at 12:31 AM Jaroslav Pulchart <jaroslav.pulchart@gooddata.com> wrote:
>>
>> >
>> > >
>> > > On Mon, Nov 20, 2023 at 1:42 AM Jaroslav Pulchart
>> > > <jaroslav.pulchart@gooddata.com> wrote:
>> > > >
>> > > > > On Tue, Nov 14, 2023 at 12:30 AM Jaroslav Pulchart
>> > > > > <jaroslav.pulchart@gooddata.com> wrote:
>> > > > > >
>> > > > > > >
>> > > > > > > On Mon, Nov 13, 2023 at 1:36 AM Jaroslav Pulchart
>> > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
>> > > > > > > >
>> > > > > > > > >
>> > > > > > > > > On Thu, Nov 9, 2023 at 3:58 AM Jaroslav Pulchart
>> > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
>> > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
>> > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
>> > > > > > > > > > > >
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
>> > > > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Hi Jaroslav,
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > Hi Yu Zhao
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > thanks for response, see answers inline:
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
>> > > > > > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > Hello,
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
>> > > > > > > > > > > > > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
>> > > > > > > > > > > > > > > > system (16numa domains).
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Kernel version please?
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > 6.5.y, but we saw it sooner as it is in investigation from 23th May
>> > > > > > > > > > > > > > (6.4.y and maybe even the 6.3.y).
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
>> > > > > > > > > > > > > for you if you run into other problems with v6.6.
>> > > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > > I will give it a try using 6.6.y. When it will work we can switch to
>> > > > > > > > > > > > 6.6.y instead of backporting the stuff to 6.5.y.
>> > > > > > > > > > > >
>> > > > > > > > > > > > > > > > Symptoms of my issue are
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > /A/ if mult-gen LRU is enabled
>> > > > > > > > > > > > > > > > 1/ [kswapd3] is consuming 100% CPU
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > >     top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34,
>> > > > > > > > > > > > > > > > 18.26, 15.01
>> > > > > > > > > > > > > > > >     Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
>> > > > > > > > > > > > > > > >     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,
>> > > > > > > > > > > > > > > > 0.4 si,  0.0 st
>> > > > > > > > > > > > > > > >     MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
>> > > > > > > > > > > > > > > >     MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
>> > > > > > > > > > > > > > > >     ...
>> > > > > > > > > > > > > > > >         765 root      20   0       0      0      0 R  98.3   0.0
>> > > > > > > > > > > > > > > > 34969:04 kswapd3
>> > > > > > > > > > > > > > > >     ...
>> > > > > > > > > > > > > > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
>> > > > > > > > > > > > > > > > observed with swap disk as well and cause IO latency issues due to
>> > > > > > > > > > > > > > > > some kind of locking)
>> > > > > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > /B/ if mult-gen LRU is disabled
>> > > > > > > > > > > > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
>> > > > > > > > > > > > > > > >     top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05,
>> > > > > > > > > > > > > > > > 17.77, 14.77
>> > > > > > > > > > > > > > > >     Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
>> > > > > > > > > > > > > > > >     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,
>> > > > > > > > > > > > > > > > 0.4 si,  0.0 st
>> > > > > > > > > > > > > > > >     MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
>> > > > > > > > > > > > > > > >     MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
>> > > > > > > > > > > > > > > >     ...
>> > > > > > > > > > > > > > > >        765 root      20   0       0      0      0 S   3.6   0.0
>> > > > > > > > > > > > > > > > 34966:46 [kswapd3]
>> > > > > > > > > > > > > > > >     ...
>> > > > > > > > > > > > > > > > 2/ swap space usage is low (4MB)
>> > > > > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > Both situations are wrong as they are using swap in/out extensively,
>> > > > > > > > > > > > > > > > however the multi-gen LRU situation is 10times worse.
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > From the stats below, node 3 had the lowest free memory. So I think in
>> > > > > > > > > > > > > > > both cases, the reclaim activities were as expected.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > I do not see a reason for the memory pressure and reclaims. This node
>> > > > > > > > > > > > > > has the lowest free memory of all nodes (~302MB free) that is true,
>> > > > > > > > > > > > > > however the swap space usage is just 4MB (still going in and out). So
>> > > > > > > > > > > > > > what can be the reason for that behaviour?
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > The best analogy is that refuel (reclaim) happens before the tank
>> > > > > > > > > > > > > becomes empty, and it happens even sooner when there is a long road
>> > > > > > > > > > > > > ahead (high order allocations).
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > > The workers/application is running in pre-allocated HugePages and the
>> > > > > > > > > > > > > > rest is used for a small set of system services and drivers of
>> > > > > > > > > > > > > > devices. It is static and not growing. The issue persists when I stop
>> > > > > > > > > > > > > > the system services and free the memory.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Yes, this helps.
>> > > > > > > > > > > > >  Also could you attach /proc/buddyinfo from the moment
>> > > > > > > > > > > > > you hit the problem?
>> > > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > > I can. The problem is continuous, it is 100% of time continuously
>> > > > > > > > > > > > doing in/out and consuming 100% of CPU and locking IO.
>> > > > > > > > > > > >
>> > > > > > > > > > > > The output of /proc/buddyinfo is:
>> > > > > > > > > > > >
>> > > > > > > > > > > > # cat /proc/buddyinfo
>> > > > > > > > > > > > Node 0, zone      DMA      7      2      2      1      1      2      1
>> > > > > > > > > > > >      1      1      2      1
>> > > > > > > > > > > > Node 0, zone    DMA32   4567   3395   1357    846    439    190     93
>> > > > > > > > > > > >     61     43     23      4
>> > > > > > > > > > > > Node 0, zone   Normal     19    190    140    129    136     75     66
>> > > > > > > > > > > >     41      9      1      5
>> > > > > > > > > > > > Node 1, zone   Normal    194   1210   2080   1800    715    255    111
>> > > > > > > > > > > >     56     42     36     55
>> > > > > > > > > > > > Node 2, zone   Normal    204    768   3766   3394   1742    468    185
>> > > > > > > > > > > >    194    238     47     74
>> > > > > > > > > > > > Node 3, zone   Normal   1622   2137   1058    846    388    208     97
>> > > > > > > > > > > >     44     14     42     10
>> > > > > > > > > > >
>> > > > > > > > > > > Again, thinking out loud: there is only one zone on node 3, i.e., the
>> > > > > > > > > > > normal zone, and this excludes the problem commit
>> > > > > > > > > > > 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
>> > > > > > > > > > > reclaim") fixed in v6.6.
>> > > > > > > > > >
>> > > > > > > > > > I built vanila 6.6.1 and did the first fast test - spin up and destroy
>> > > > > > > > > > VMs only - This test does not always trigger the kswapd3 continuous
>> > > > > > > > > > swap in/out  usage but it uses it and it  looks like there is a
>> > > > > > > > > > change:
>> > > > > > > > > >
>> > > > > > > > > >  I can see kswapd non-continous (15s and more) usage with 6.5.y
>> > > > > > > > > >  # ps ax | grep [k]swapd
>> > > > > > > > > >     753 ?        S      0:00 [kswapd0]
>> > > > > > > > > >     754 ?        S      0:00 [kswapd1]
>> > > > > > > > > >     755 ?        S      0:00 [kswapd2]
>> > > > > > > > > >     756 ?        S      0:15 [kswapd3]    <<<<<<<<<
>> > > > > > > > > >     757 ?        S      0:00 [kswapd4]
>> > > > > > > > > >     758 ?        S      0:00 [kswapd5]
>> > > > > > > > > >     759 ?        S      0:00 [kswapd6]
>> > > > > > > > > >     760 ?        S      0:00 [kswapd7]
>> > > > > > > > > >     761 ?        S      0:00 [kswapd8]
>> > > > > > > > > >     762 ?        S      0:00 [kswapd9]
>> > > > > > > > > >     763 ?        S      0:00 [kswapd10]
>> > > > > > > > > >     764 ?        S      0:00 [kswapd11]
>> > > > > > > > > >     765 ?        S      0:00 [kswapd12]
>> > > > > > > > > >     766 ?        S      0:00 [kswapd13]
>> > > > > > > > > >     767 ?        S      0:00 [kswapd14]
>> > > > > > > > > >     768 ?        S      0:00 [kswapd15]
>> > > > > > > > > >
>> > > > > > > > > > and none kswapd usage with 6.6.1, that looks to be promising path
>> > > > > > > > > >
>> > > > > > > > > > # ps ax | grep [k]swapd
>> > > > > > > > > >     808 ?        S      0:00 [kswapd0]
>> > > > > > > > > >     809 ?        S      0:00 [kswapd1]
>> > > > > > > > > >     810 ?        S      0:00 [kswapd2]
>> > > > > > > > > >     811 ?        S      0:00 [kswapd3]    <<<< nice
>> > > > > > > > > >     812 ?        S      0:00 [kswapd4]
>> > > > > > > > > >     813 ?        S      0:00 [kswapd5]
>> > > > > > > > > >     814 ?        S      0:00 [kswapd6]
>> > > > > > > > > >     815 ?        S      0:00 [kswapd7]
>> > > > > > > > > >     816 ?        S      0:00 [kswapd8]
>> > > > > > > > > >     817 ?        S      0:00 [kswapd9]
>> > > > > > > > > >     818 ?        S      0:00 [kswapd10]
>> > > > > > > > > >     819 ?        S      0:00 [kswapd11]
>> > > > > > > > > >     820 ?        S      0:00 [kswapd12]
>> > > > > > > > > >     821 ?        S      0:00 [kswapd13]
>> > > > > > > > > >     822 ?        S      0:00 [kswapd14]
>> > > > > > > > > >     823 ?        S      0:00 [kswapd15]
>> > > > > > > > > >
>> > > > > > > > > > I will install the 6.6.1 on the server which is doing some work and
>> > > > > > > > > > observe it later today.
>> > > > > > > > >
>> > > > > > > > > Thanks. Fingers crossed.
>> > > > > > > >
>> > > > > > > > The 6.6.y was deployed and used from 9th Nov 3PM CEST. So far so good.
>> > > > > > > > The node 3 has 163MiB free of memory and I see
>> > > > > > > > just a few in/out swap usage sometimes (which is expected) and minimal
>> > > > > > > > kswapd3 process usage for almost 4days.
>> > > > > > >
>> > > > > > > Thanks for the update!
>> > > > > > >
>> > > > > > > Just to confirm:
>> > > > > > > 1. MGLRU was enabled, and
>> > > > > >
>> > > > > > Yes, MGLRU is enabled
>> > > > > >
>> > > > > > > 2. The v6.6 deployed did NOT have the patch I attached earlier.
>> > > > > >
>> > > > > > Vanila 6.6, attached patch NOT applied.
>> > > > > >
>> > > > > > > Are both correct?
>> > > > > > >
>> > > > > > > If so, I'd very appreciate it if you could try the attached patch on
>> > > > > > > top of v6.5 and see if it helps. My suspicion is that the problem is
>> > > > > > > compaction related, i.e., kswapd was woken up by high order
>> > > > > > > allocations but didn't properly stop. But what causes the behavior
>> > > > > >
>> > > > > > Sure, I can try it. Will inform you about progress.
>> > > > >
>> > > > > Thanks!
>> > > > >
>> > > > > > > difference on v6.5 between MGLRU and the active/inactive LRU still
>> > > > > > > puzzles me --the problem might be somehow masked rather than fixed on
>> > > > > > > v6.6.
>> > > > > >
>> > > > > > I'm not sure how I can help with the issue. Any suggestions on what to
>> > > > > > change/try?
>> > > > >
>> > > > > Trying the attached patch is good enough for now :)
>> > > >
>> > > > So far I'm running the "6.5.y + patch" for 4 days without triggering
>> > > > the infinite swap in//out usage.
>> > > >
>> > > > I'm observing a similar pattern in kswapd usage - "if it uses kswapd,
>> > > > then it is in majority the kswapd3 - like the vanila 6.5.y which is
>> > > > not observed with 6.6.y, (The Node's 3 free mem is 159 MB)
>> > > > # ps ax | grep [k]swapd
>> > > >     750 ?        S      0:00 [kswapd0]
>> > > >     751 ?        S      0:00 [kswapd1]
>> > > >     752 ?        S      0:00 [kswapd2]
>> > > >     753 ?        S      0:02 [kswapd3]    <<<< it uses kswapd3, good
>> > > > is that it is not continuous
>> > > >     754 ?        S      0:00 [kswapd4]
>> > > >     755 ?        S      0:00 [kswapd5]
>> > > >     756 ?        S      0:00 [kswapd6]
>> > > >     757 ?        S      0:00 [kswapd7]
>> > > >     758 ?        S      0:00 [kswapd8]
>> > > >     759 ?        S      0:00 [kswapd9]
>> > > >     760 ?        S      0:00 [kswapd10]
>> > > >     761 ?        S      0:00 [kswapd11]
>> > > >     762 ?        S      0:00 [kswapd12]
>> > > >     763 ?        S      0:00 [kswapd13]
>> > > >     764 ?        S      0:00 [kswapd14]
>> > > >     765 ?        S      0:00 [kswapd15]
>> > > >
>> > > > Good stuff is that the system did not end in a continuous loop of swap
>> > > > in/out usage (at least so far) which is great. See attached
>> > > > swap_in_out_good_vs_bad.png. I will keep it running for the next 3
>> > > > days.
>> > >
>> > > Thanks again, Jaroslav!
>> > >
>> > > Just a note here: I suspect the problem still exists on v6.6 but
>> > > somehow is masked, possibly by reduced memory usage from the kernel
>> > > itself and more free memory for userspace. So to be on the safe side,
>> > > I'll post the patch and credit you as the reporter and tester.
>> >
>> > Morning, let's wait. I reviewed the graph and the swap in/out started
>> > to be happening from 1:50 AM CET. Slower than before (util of cpu
>> > 0.3%) but it is doing in/out see attached png.
>>
>> I investigated it more, there was an operation issue and the system
>> disabled multi-gen lru yesterday ~10 AM CET (our temporary workaround
>> for this problem) by
>>    echo N > /sys/kernel/mm/lru_gen/enabled
>> when an alert was triggered by an unexpected setup of the server.
>> Could it be that the patch is not functional if lru_gen/enabled is
>> 0x0000?
>
>
> That’s correct.
>
>> I need to reboot the system and do the whole week's test again.
>
>
> Thanks a lot!

The server with 6.5.y + the lru patch is stable; no continuous swap in/out
has been observed in the last 7 days!

I assume the fix is correct. Can you share the final patch for 6.6.y with
me? I will use it in our kernel builds until it lands upstream.
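
For completeness, one simple way to keep watching for the continuous swap
in/out pattern; a rough sketch using standard tools:

    # cumulative swap-in/out page counters; on a healthy node they stay flat
    grep -E '^pswp(in|out)' /proc/vmstat
    # si/so columns give the current swap-in/out rate
    vmstat 1
    # accumulated CPU time of the per-node kswapd threads
    ps ax | grep [k]swapd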


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-11-29 13:54                                 ` Jaroslav Pulchart
@ 2023-12-01 23:52                                   ` Yu Zhao
  2023-12-07  8:46                                     ` Charan Teja Kalla
  0 siblings, 1 reply; 30+ messages in thread
From: Yu Zhao @ 2023-12-01 23:52 UTC (permalink / raw)
  To: Jaroslav Pulchart, Charan Teja Kalla
  Cc: Daniel Secik, Igor Raits, Kalesh Singh, akpm, linux-mm

On Wed, Nov 29, 2023 at 6:54 AM Jaroslav Pulchart
<jaroslav.pulchart@gooddata.com> wrote:
>
> > On Wed, Nov 22, 2023 at 12:31 AM Jaroslav Pulchart <jaroslav.pulchart@gooddata.com> wrote:
> >>
> >> >
> >> > >
> >> > > On Mon, Nov 20, 2023 at 1:42 AM Jaroslav Pulchart
> >> > > <jaroslav.pulchart@gooddata.com> wrote:
> >> > > >
> >> > > > > On Tue, Nov 14, 2023 at 12:30 AM Jaroslav Pulchart
> >> > > > > <jaroslav.pulchart@gooddata.com> wrote:
> >> > > > > >
> >> > > > > > >
> >> > > > > > > On Mon, Nov 13, 2023 at 1:36 AM Jaroslav Pulchart
> >> > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> >> > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > On Thu, Nov 9, 2023 at 3:58 AM Jaroslav Pulchart
> >> > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> >> > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > > On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
> >> > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> >> > > > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Hi Jaroslav,
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Hi Yu Zhao
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > thanks for response, see answers inline:
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> >> > > > > > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > Hello,
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> >> > > > > > > > > > > > > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> >> > > > > > > > > > > > > > > > system (16numa domains).
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Kernel version please?
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > 6.5.y, but we saw it sooner as it is in investigation from 23th May
> >> > > > > > > > > > > > > > (6.4.y and maybe even the 6.3.y).
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> >> > > > > > > > > > > > > for you if you run into other problems with v6.6.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > I will give it a try using 6.6.y. When it will work we can switch to
> >> > > > > > > > > > > > 6.6.y instead of backporting the stuff to 6.5.y.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > Symptoms of my issue are
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > /A/ if mult-gen LRU is enabled
> >> > > > > > > > > > > > > > > > 1/ [kswapd3] is consuming 100% CPU
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > >     top - 15:03:11 up 34 days,  1:51,  2 users,  load average: 23.34,
> >> > > > > > > > > > > > > > > > 18.26, 15.01
> >> > > > > > > > > > > > > > > >     Tasks: 1226 total,   2 running, 1224 sleeping,   0 stopped,   0 zombie
> >> > > > > > > > > > > > > > > >     %Cpu(s): 12.5 us,  4.7 sy,  0.0 ni, 82.1 id,  0.0 wa,  0.4 hi,
> >> > > > > > > > > > > > > > > > 0.4 si,  0.0 st
> >> > > > > > > > > > > > > > > >     MiB Mem : 1047265.+total,  28382.7 free, 1021308.+used,    767.6 buff/cache
> >> > > > > > > > > > > > > > > >     MiB Swap:   8192.0 total,   8187.7 free,      4.2 used.  25956.7 avail Mem
> >> > > > > > > > > > > > > > > >     ...
> >> > > > > > > > > > > > > > > >         765 root      20   0       0      0      0 R  98.3   0.0
> >> > > > > > > > > > > > > > > > 34969:04 kswapd3
> >> > > > > > > > > > > > > > > >     ...
> >> > > > > > > > > > > > > > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> >> > > > > > > > > > > > > > > > observed with swap disk as well and cause IO latency issues due to
> >> > > > > > > > > > > > > > > > some kind of locking)
> >> > > > > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > /B/ if mult-gen LRU is disabled
> >> > > > > > > > > > > > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> >> > > > > > > > > > > > > > > >     top - 15:02:49 up 34 days,  1:51,  2 users,  load average: 23.05,
> >> > > > > > > > > > > > > > > > 17.77, 14.77
> >> > > > > > > > > > > > > > > >     Tasks: 1226 total,   1 running, 1225 sleeping,   0 stopped,   0 zombie
> >> > > > > > > > > > > > > > > >     %Cpu(s): 14.7 us,  2.8 sy,  0.0 ni, 81.8 id,  0.0 wa,  0.4 hi,
> >> > > > > > > > > > > > > > > > 0.4 si,  0.0 st
> >> > > > > > > > > > > > > > > >     MiB Mem : 1047265.+total,  28378.5 free, 1021313.+used,    767.3 buff/cache
> >> > > > > > > > > > > > > > > >     MiB Swap:   8192.0 total,   8189.0 free,      3.0 used.  25952.4 avail Mem
> >> > > > > > > > > > > > > > > >     ...
> >> > > > > > > > > > > > > > > >        765 root      20   0       0      0      0 S   3.6   0.0
> >> > > > > > > > > > > > > > > > 34966:46 [kswapd3]
> >> > > > > > > > > > > > > > > >     ...
> >> > > > > > > > > > > > > > > > 2/ swap space usage is low (4MB)
> >> > > > > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > Both situations are wrong as they are using swap in/out extensively,
> >> > > > > > > > > > > > > > > > however the multi-gen LRU situation is 10times worse.
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > From the stats below, node 3 had the lowest free memory. So I think in
> >> > > > > > > > > > > > > > > both cases, the reclaim activities were as expected.
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > I do not see a reason for the memory pressure and reclaims. This node
> >> > > > > > > > > > > > > > has the lowest free memory of all nodes (~302MB free) that is true,
> >> > > > > > > > > > > > > > however the swap space usage is just 4MB (still going in and out). So
> >> > > > > > > > > > > > > > what can be the reason for that behaviour?
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > The best analogy is that refuel (reclaim) happens before the tank
> >> > > > > > > > > > > > > becomes empty, and it happens even sooner when there is a long road
> >> > > > > > > > > > > > > ahead (high order allocations).
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > > The workers/application is running in pre-allocated HugePages and the
> >> > > > > > > > > > > > > > rest is used for a small set of system services and drivers of
> >> > > > > > > > > > > > > > devices. It is static and not growing. The issue persists when I stop
> >> > > > > > > > > > > > > > the system services and free the memory.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > Yes, this helps.
> >> > > > > > > > > > > > >  Also could you attach /proc/buddyinfo from the moment
> >> > > > > > > > > > > > > you hit the problem?
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > I can. The problem is continuous, it is 100% of time continuously
> >> > > > > > > > > > > > doing in/out and consuming 100% of CPU and locking IO.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > The output of /proc/buddyinfo is:
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > # cat /proc/buddyinfo
> >> > > > > > > > > > > > Node 0, zone      DMA      7      2      2      1      1      2      1
> >> > > > > > > > > > > >      1      1      2      1
> >> > > > > > > > > > > > Node 0, zone    DMA32   4567   3395   1357    846    439    190     93
> >> > > > > > > > > > > >     61     43     23      4
> >> > > > > > > > > > > > Node 0, zone   Normal     19    190    140    129    136     75     66
> >> > > > > > > > > > > >     41      9      1      5
> >> > > > > > > > > > > > Node 1, zone   Normal    194   1210   2080   1800    715    255    111
> >> > > > > > > > > > > >     56     42     36     55
> >> > > > > > > > > > > > Node 2, zone   Normal    204    768   3766   3394   1742    468    185
> >> > > > > > > > > > > >    194    238     47     74
> >> > > > > > > > > > > > Node 3, zone   Normal   1622   2137   1058    846    388    208     97
> >> > > > > > > > > > > >     44     14     42     10
> >> > > > > > > > > > >
> >> > > > > > > > > > > Again, thinking out loud: there is only one zone on node 3, i.e., the
> >> > > > > > > > > > > normal zone, and this excludes the problem commit
> >> > > > > > > > > > > 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
> >> > > > > > > > > > > reclaim") fixed in v6.6.
> >> > > > > > > > > >
> >> > > > > > > > > > I built vanila 6.6.1 and did the first fast test - spin up and destroy
> >> > > > > > > > > > VMs only - This test does not always trigger the kswapd3 continuous
> >> > > > > > > > > > swap in/out  usage but it uses it and it  looks like there is a
> >> > > > > > > > > > change:
> >> > > > > > > > > >
> >> > > > > > > > > >  I can see kswapd non-continous (15s and more) usage with 6.5.y
> >> > > > > > > > > >  # ps ax | grep [k]swapd
> >> > > > > > > > > >     753 ?        S      0:00 [kswapd0]
> >> > > > > > > > > >     754 ?        S      0:00 [kswapd1]
> >> > > > > > > > > >     755 ?        S      0:00 [kswapd2]
> >> > > > > > > > > >     756 ?        S      0:15 [kswapd3]    <<<<<<<<<
> >> > > > > > > > > >     757 ?        S      0:00 [kswapd4]
> >> > > > > > > > > >     758 ?        S      0:00 [kswapd5]
> >> > > > > > > > > >     759 ?        S      0:00 [kswapd6]
> >> > > > > > > > > >     760 ?        S      0:00 [kswapd7]
> >> > > > > > > > > >     761 ?        S      0:00 [kswapd8]
> >> > > > > > > > > >     762 ?        S      0:00 [kswapd9]
> >> > > > > > > > > >     763 ?        S      0:00 [kswapd10]
> >> > > > > > > > > >     764 ?        S      0:00 [kswapd11]
> >> > > > > > > > > >     765 ?        S      0:00 [kswapd12]
> >> > > > > > > > > >     766 ?        S      0:00 [kswapd13]
> >> > > > > > > > > >     767 ?        S      0:00 [kswapd14]
> >> > > > > > > > > >     768 ?        S      0:00 [kswapd15]
> >> > > > > > > > > >
> >> > > > > > > > > > and none kswapd usage with 6.6.1, that looks to be promising path
> >> > > > > > > > > >
> >> > > > > > > > > > # ps ax | grep [k]swapd
> >> > > > > > > > > >     808 ?        S      0:00 [kswapd0]
> >> > > > > > > > > >     809 ?        S      0:00 [kswapd1]
> >> > > > > > > > > >     810 ?        S      0:00 [kswapd2]
> >> > > > > > > > > >     811 ?        S      0:00 [kswapd3]    <<<< nice
> >> > > > > > > > > >     812 ?        S      0:00 [kswapd4]
> >> > > > > > > > > >     813 ?        S      0:00 [kswapd5]
> >> > > > > > > > > >     814 ?        S      0:00 [kswapd6]
> >> > > > > > > > > >     815 ?        S      0:00 [kswapd7]
> >> > > > > > > > > >     816 ?        S      0:00 [kswapd8]
> >> > > > > > > > > >     817 ?        S      0:00 [kswapd9]
> >> > > > > > > > > >     818 ?        S      0:00 [kswapd10]
> >> > > > > > > > > >     819 ?        S      0:00 [kswapd11]
> >> > > > > > > > > >     820 ?        S      0:00 [kswapd12]
> >> > > > > > > > > >     821 ?        S      0:00 [kswapd13]
> >> > > > > > > > > >     822 ?        S      0:00 [kswapd14]
> >> > > > > > > > > >     823 ?        S      0:00 [kswapd15]
> >> > > > > > > > > >
> >> > > > > > > > > > I will install the 6.6.1 on the server which is doing some work and
> >> > > > > > > > > > observe it later today.
> >> > > > > > > > >
> >> > > > > > > > > Thanks. Fingers crossed.
> >> > > > > > > >
> >> > > > > > > > The 6.6.y was deployed and used from 9th Nov 3PM CEST. So far so good.
> >> > > > > > > > The node 3 has 163MiB free of memory and I see
> >> > > > > > > > just a few in/out swap usage sometimes (which is expected) and minimal
> >> > > > > > > > kswapd3 process usage for almost 4days.
> >> > > > > > >
> >> > > > > > > Thanks for the update!
> >> > > > > > >
> >> > > > > > > Just to confirm:
> >> > > > > > > 1. MGLRU was enabled, and
> >> > > > > >
> >> > > > > > Yes, MGLRU is enabled
> >> > > > > >
> >> > > > > > > 2. The v6.6 deployed did NOT have the patch I attached earlier.
> >> > > > > >
> >> > > > > > Vanila 6.6, attached patch NOT applied.
> >> > > > > >
> >> > > > > > > Are both correct?
> >> > > > > > >
> >> > > > > > > If so, I'd very appreciate it if you could try the attached patch on
> >> > > > > > > top of v6.5 and see if it helps. My suspicion is that the problem is
> >> > > > > > > compaction related, i.e., kswapd was woken up by high order
> >> > > > > > > allocations but didn't properly stop. But what causes the behavior
> >> > > > > >
> >> > > > > > Sure, I can try it. Will inform you about progress.
> >> > > > >
> >> > > > > Thanks!
> >> > > > >
> >> > > > > > > difference on v6.5 between MGLRU and the active/inactive LRU still
> >> > > > > > > puzzles me --the problem might be somehow masked rather than fixed on
> >> > > > > > > v6.6.
> >> > > > > >
> >> > > > > > I'm not sure how I can help with the issue. Any suggestions on what to
> >> > > > > > change/try?
> >> > > > >
> >> > > > > Trying the attached patch is good enough for now :)
> >> > > >
> >> > > > So far I'm running the "6.5.y + patch" for 4 days without triggering
> >> > > > the infinite swap in//out usage.
> >> > > >
> >> > > > I'm observing a similar pattern in kswapd usage - "if it uses kswapd,
> >> > > > then it is in majority the kswapd3 - like the vanila 6.5.y which is
> >> > > > not observed with 6.6.y, (The Node's 3 free mem is 159 MB)
> >> > > > # ps ax | grep [k]swapd
> >> > > >     750 ?        S      0:00 [kswapd0]
> >> > > >     751 ?        S      0:00 [kswapd1]
> >> > > >     752 ?        S      0:00 [kswapd2]
> >> > > >     753 ?        S      0:02 [kswapd3]    <<<< it uses kswapd3, good
> >> > > > is that it is not continuous
> >> > > >     754 ?        S      0:00 [kswapd4]
> >> > > >     755 ?        S      0:00 [kswapd5]
> >> > > >     756 ?        S      0:00 [kswapd6]
> >> > > >     757 ?        S      0:00 [kswapd7]
> >> > > >     758 ?        S      0:00 [kswapd8]
> >> > > >     759 ?        S      0:00 [kswapd9]
> >> > > >     760 ?        S      0:00 [kswapd10]
> >> > > >     761 ?        S      0:00 [kswapd11]
> >> > > >     762 ?        S      0:00 [kswapd12]
> >> > > >     763 ?        S      0:00 [kswapd13]
> >> > > >     764 ?        S      0:00 [kswapd14]
> >> > > >     765 ?        S      0:00 [kswapd15]
> >> > > >
> >> > > > Good stuff is that the system did not end in a continuous loop of swap
> >> > > > in/out usage (at least so far) which is great. See attached
> >> > > > swap_in_out_good_vs_bad.png. I will keep it running for the next 3
> >> > > > days.
> >> > >
> >> > > Thanks again, Jaroslav!
> >> > >
> >> > > Just a note here: I suspect the problem still exists on v6.6 but
> >> > > somehow is masked, possibly by reduced memory usage from the kernel
> >> > > itself and more free memory for userspace. So to be on the safe side,
> >> > > I'll post the patch and credit you as the reporter and tester.
> >> >
> >> > Morning, let's wait. I reviewed the graph and the swap in/out started
> >> > to be happening from 1:50 AM CET. Slower than before (util of cpu
> >> > 0.3%) but it is doing in/out see attached png.
> >>
> >> I investigated it more, there was an operation issue and the system
> >> disabled multi-gen lru yesterday ~10 AM CET (our temporary workaround
> >> for this problem) by
> >>    echo N > /sys/kernel/mm/lru_gen/enabled
> >> when an alert was triggered by an unexpected setup of the server.
> >> Could it be that the patch is not functional if lru_gen/enabled is
> >> 0x0000?
> >
> >
> > That’s correct.
> >
> >> I need to reboot the system and do the whole week's test again.
> >
> >
> > Thanks a lot!
>
> The server with 6.5.y + lru patch is stable, no continuous swap in/out
> is observed in the last 7days!
>
> I assume the fix is correct. Can you share with me the final patch for
> 6.6.y, I will use in our kernel builds till it is in the upstream.

Will do. Thank you.

Charan, does the fix previously attached seem acceptable to you? Any
additional feedback? Thanks.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-12-01 23:52                                   ` Yu Zhao
@ 2023-12-07  8:46                                     ` Charan Teja Kalla
  2023-12-07 18:23                                       ` Yu Zhao
  2023-12-08  8:03                                       ` Jaroslav Pulchart
  0 siblings, 2 replies; 30+ messages in thread
From: Charan Teja Kalla @ 2023-12-07  8:46 UTC (permalink / raw)
  To: Yu Zhao, Jaroslav Pulchart
  Cc: Daniel Secik, Igor Raits, Kalesh Singh, akpm, linux-mm

Hi Yu,

On 12/2/2023 5:22 AM, Yu Zhao wrote:
> Charan, does the fix previously attached seem acceptable to you? Any
> additional feedback? Thanks.

First, thanks for taking this patch upstream.

A comment on the code snippet: checking just the 'high wmark' pages might
succeed here but can still fail in the immediate kswapd sleep check, see
prepare_kswapd_sleep(). This can show up as an increased
KSWAPD_HIGH_WMARK_HIT_QUICKLY count, and thus unnecessary kswapd run time.
@Jaroslav: Have you observed something like the above?
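
One quick way to check that, as a rough sketch reading the counters
straight from /proc/vmstat:

    # kswapd_high_wmark_hit_quickly growing much faster than pageoutrun
    # would indicate kswapd being woken up and put back to sleep needlessly
    grep -E 'kswapd_(low|high)_wmark_hit_quickly|pageoutrun' /proc/vmstat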

So, in our downstream kernel, we pass something like the following to
zone_watermark_ok():
    unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH << 2;

It is hard to justify the empirical 'MIN_LRU_BATCH << 2' value; maybe we
should at least use 'MIN_LRU_BATCH' with the reasoning mentioned above, is
all I can say for this patch.

+	mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
+	       WMARK_PROMO : WMARK_HIGH;
+	for (i = 0; i <= sc->reclaim_idx; i++) {
+		struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
+		unsigned long size = wmark_pages(zone, mark);
+
+		if (managed_zone(zone) &&
+		    !zone_watermark_ok(zone, sc->order, size, sc->reclaim_idx, 0))
+			return false;
+	}


Thanks,
Charan


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-12-07  8:46                                     ` Charan Teja Kalla
@ 2023-12-07 18:23                                       ` Yu Zhao
  2023-12-08  8:03                                       ` Jaroslav Pulchart
  1 sibling, 0 replies; 30+ messages in thread
From: Yu Zhao @ 2023-12-07 18:23 UTC (permalink / raw)
  To: Charan Teja Kalla
  Cc: Jaroslav Pulchart, Daniel Secik, Igor Raits, Kalesh Singh, akpm,
	linux-mm

On Thu, Dec 7, 2023 at 1:47 AM Charan Teja Kalla
<quic_charante@quicinc.com> wrote:
>
> Hi yu,
>
> On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > Charan, does the fix previously attached seem acceptable to you? Any
> > additional feedback? Thanks.
>
> First, thanks for taking this patch to upstream.
>
> A comment in code snippet is checking just 'high wmark' pages might
> succeed here but can fail in the immediate kswapd sleep, see
> prepare_kswapd_sleep(). This can show up into the increased
> KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary kswapd run time.
> @Jaroslav: Have you observed something like above?
>
> So, in downstream, we have something like for zone_watermark_ok():
> unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH << 2;
>
> Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value, may be we
> should atleast use the 'MIN_LRU_BATCH' with the mentioned reasoning, is
> what all I can say for this patch.

Yeah, we can add MIN_LRU_BATCH on top of the high watermark.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-12-07  8:46                                     ` Charan Teja Kalla
  2023-12-07 18:23                                       ` Yu Zhao
@ 2023-12-08  8:03                                       ` Jaroslav Pulchart
  2024-01-03 21:30                                         ` Jaroslav Pulchart
  1 sibling, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2023-12-08  8:03 UTC (permalink / raw)
  To: Charan Teja Kalla
  Cc: Yu Zhao, Daniel Secik, Igor Raits, Kalesh Singh, akpm, linux-mm

>
> Hi yu,
>
> On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > Charan, does the fix previously attached seem acceptable to you? Any
> > additional feedback? Thanks.
>
> First, thanks for taking this patch to upstream.
>
> A comment in code snippet is checking just 'high wmark' pages might
> succeed here but can fail in the immediate kswapd sleep, see
> prepare_kswapd_sleep(). This can show up into the increased
> KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary kswapd run time.
> @Jaroslav: Have you observed something like above?

I do not see any unnecessary kswapd run time; on the contrary, it fixes
the continuous kswapd run issue.

>
> So, in downstream, we have something like for zone_watermark_ok():
> unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH << 2;
>
> Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value, may be we
> should atleast use the 'MIN_LRU_BATCH' with the mentioned reasoning, is
> what all I can say for this patch.
>
> +       mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
> +              WMARK_PROMO : WMARK_HIGH;
> +       for (i = 0; i <= sc->reclaim_idx; i++) {
> +               struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
> +               unsigned long size = wmark_pages(zone, mark);
> +
> +               if (managed_zone(zone) &&
> +                   !zone_watermark_ok(zone, sc->order, size, sc->reclaim_idx, 0))
> +                       return false;
> +       }
>
>
> Thanks,
> Charan



-- 
Jaroslav Pulchart
Sr. Principal SW Engineer
GoodData


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2023-12-08  8:03                                       ` Jaroslav Pulchart
@ 2024-01-03 21:30                                         ` Jaroslav Pulchart
  2024-01-04  3:03                                           ` Yu Zhao
  0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2024-01-03 21:30 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Daniel Secik, Charan Teja Kalla, Igor Raits, Kalesh Singh, akpm,
	linux-mm

>
> >
> > Hi yu,
> >
> > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > Charan, does the fix previously attached seem acceptable to you? Any
> > > additional feedback? Thanks.
> >
> > First, thanks for taking this patch to upstream.
> >
> > A comment in code snippet is checking just 'high wmark' pages might
> > succeed here but can fail in the immediate kswapd sleep, see
> > prepare_kswapd_sleep(). This can show up into the increased
> > KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary kswapd run time.
> > @Jaroslav: Have you observed something like above?
>
> I do not see any unnecessary kswapd run time, on the contrary it is
> fixing the kswapd continuous run issue.
>
> >
> > So, in downstream, we have something like for zone_watermark_ok():
> > unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH << 2;
> >
> > Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value, may be we
> > should atleast use the 'MIN_LRU_BATCH' with the mentioned reasoning, is
> > what all I can say for this patch.
> >
> > +       mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
> > +              WMARK_PROMO : WMARK_HIGH;
> > +       for (i = 0; i <= sc->reclaim_idx; i++) {
> > +               struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
> > +               unsigned long size = wmark_pages(zone, mark);
> > +
> > +               if (managed_zone(zone) &&
> > +                   !zone_watermark_ok(zone, sc->order, size, sc->reclaim_idx, 0))
> > +                       return false;
> > +       }
> >
> >
> > Thanks,
> > Charan
>
>
>
> --
> Jaroslav Pulchart
> Sr. Principal SW Engineer
> GoodData


Hello,

Today we tried to update the servers to 6.6.9, which contains the mglru
fixes (from 6.6.8), and the server behaves much, much worse.

Multiple kswapd* threads immediately went to ~100% load:
    555 root      20   0       0      0      0 R  99.7   0.0   4:32.86 kswapd1
    554 root      20   0       0      0      0 R  99.3   0.0   3:57.76 kswapd0
    556 root      20   0       0      0      0 R  97.7   0.0   3:42.27 kswapd2
Are the changes that went upstream different from the initial patch
which I tested?
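
(One way to compare, as a rough sketch assuming a local linux-stable
checkout, is to list what touched mm/vmscan.c between the releases:)

    # vmscan/MGLRU changes that went into the 6.6.8 and 6.6.9 stable releases
    git log --oneline v6.6.7..v6.6.9 -- mm/vmscan.c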

Best regards,
Jaroslav Pulchart


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2024-01-03 21:30                                         ` Jaroslav Pulchart
@ 2024-01-04  3:03                                           ` Yu Zhao
  2024-01-04  9:46                                             ` Jaroslav Pulchart
  0 siblings, 1 reply; 30+ messages in thread
From: Yu Zhao @ 2024-01-04  3:03 UTC (permalink / raw)
  To: Jaroslav Pulchart
  Cc: Daniel Secik, Charan Teja Kalla, Igor Raits, Kalesh Singh, akpm,
	linux-mm

[-- Attachment #1: Type: text/plain, Size: 2812 bytes --]

On Wed, Jan 3, 2024 at 2:30 PM Jaroslav Pulchart
<jaroslav.pulchart@gooddata.com> wrote:
>
> >
> > >
> > > Hi yu,
> > >
> > > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > > Charan, does the fix previously attached seem acceptable to you? Any
> > > > additional feedback? Thanks.
> > >
> > > First, thanks for taking this patch to upstream.
> > >
> > > A comment in code snippet is checking just 'high wmark' pages might
> > > succeed here but can fail in the immediate kswapd sleep, see
> > > prepare_kswapd_sleep(). This can show up into the increased
> > > KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary kswapd run time.
> > > @Jaroslav: Have you observed something like above?
> >
> > I do not see any unnecessary kswapd run time, on the contrary it is
> > fixing the kswapd continuous run issue.
> >
> > >
> > > So, in downstream, we have something like for zone_watermark_ok():
> > > unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH << 2;
> > >
> > > Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value, may be we
> > > should atleast use the 'MIN_LRU_BATCH' with the mentioned reasoning, is
> > > what all I can say for this patch.
> > >
> > > +       mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
> > > +              WMARK_PROMO : WMARK_HIGH;
> > > +       for (i = 0; i <= sc->reclaim_idx; i++) {
> > > +               struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
> > > +               unsigned long size = wmark_pages(zone, mark);
> > > +
> > > +               if (managed_zone(zone) &&
> > > +                   !zone_watermark_ok(zone, sc->order, size, sc->reclaim_idx, 0))
> > > +                       return false;
> > > +       }
> > >
> > >
> > > Thanks,
> > > Charan
> >
> >
> >
> > --
> > Jaroslav Pulchart
> > Sr. Principal SW Engineer
> > GoodData
>
>
> Hello,
>
> today we try to update servers to 6.6.9 which contains the mglru fixes
> (from 6.6.8) and the server behaves much much worse.
>
> I got multiple kswapd* load to ~100% imediatelly.
>     555 root      20   0       0      0      0 R  99.7   0.0   4:32.86
> kswapd1
>     554 root      20   0       0      0      0 R  99.3   0.0   3:57.76
> kswapd0
>     556 root      20   0       0      0      0 R  97.7   0.0   3:42.27
> kswapd2
> are the changes in upstream different compared to the initial patch
> which I tested?
>
> Best regards,
> Jaroslav Pulchart

Hi Jaroslav,

My apologies for all the trouble!

Yes, there is a slight difference between the fix you verified and
what went into 6.6.9. The fix in 6.6.9 is disabled under a special
condition which I thought wouldn't affect you.

Could you try the attached fix again on top of 6.6.9? It removed that
special condition.

Thanks!
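
In case it helps, applying it should be straightforward; a rough sketch,
assuming a v6.6.9 source tree and the attached file name:

    cd linux-6.6.9
    # mglru-fix-6.6.9.patch is the attachment from this mail
    patch -p1 < ../mglru-fix-6.6.9.patch
    # then rebuild and boot the patched kernel as usual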

[-- Attachment #2: mglru-fix-6.6.9.patch --]
[-- Type: application/octet-stream, Size: 975 bytes --]

diff --git a/mm/vmscan.c b/mm/vmscan.c
index dcc264d3c92f..ae3f73fc933c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5358,8 +5358,7 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
 	if (sc->nr_reclaimed >= max(sc->nr_to_reclaim, compact_gap(sc->order)))
 		return true;
 
-	/* check the order to exclude compaction-induced reclaim */
-	if (!current_is_kswapd() || sc->order)
+	if (!current_is_kswapd())
 		return false;
 
 	mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
@@ -5367,7 +5366,7 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
 
 	for (i = 0; i <= sc->reclaim_idx; i++) {
 		struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
-		unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH;
+		unsigned long size = wmark_pages(zone, mark) + min_wmark_pages(zone);
 
 		if (managed_zone(zone) && !zone_watermark_ok(zone, 0, size, sc->reclaim_idx, 0))
 			return false;

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2024-01-04  3:03                                           ` Yu Zhao
@ 2024-01-04  9:46                                             ` Jaroslav Pulchart
  2024-01-04 14:34                                               ` Jaroslav Pulchart
  0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2024-01-04  9:46 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Daniel Secik, Charan Teja Kalla, Igor Raits, Kalesh Singh, akpm,
	linux-mm

>
> On Wed, Jan 3, 2024 at 2:30 PM Jaroslav Pulchart
> <jaroslav.pulchart@gooddata.com> wrote:
> >
> > >
> > > >
> > > > Hi yu,
> > > >
> > > > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > > > Charan, does the fix previously attached seem acceptable to you? Any
> > > > > additional feedback? Thanks.
> > > >
> > > > First, thanks for taking this patch to upstream.
> > > >
> > > > A comment in code snippet is checking just 'high wmark' pages might
> > > > succeed here but can fail in the immediate kswapd sleep, see
> > > > prepare_kswapd_sleep(). This can show up into the increased
> > > > KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary kswapd run time.
> > > > @Jaroslav: Have you observed something like above?
> > >
> > > I do not see any unnecessary kswapd run time, on the contrary it is
> > > fixing the kswapd continuous run issue.
> > >
> > > >
> > > > So, in downstream, we have something like for zone_watermark_ok():
> > > > unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH << 2;
> > > >
> > > > Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value, may be we
> > > > should atleast use the 'MIN_LRU_BATCH' with the mentioned reasoning, is
> > > > what all I can say for this patch.
> > > >
> > > > +       mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
> > > > +              WMARK_PROMO : WMARK_HIGH;
> > > > +       for (i = 0; i <= sc->reclaim_idx; i++) {
> > > > +               struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
> > > > +               unsigned long size = wmark_pages(zone, mark);
> > > > +
> > > > +               if (managed_zone(zone) &&
> > > > +                   !zone_watermark_ok(zone, sc->order, size, sc->reclaim_idx, 0))
> > > > +                       return false;
> > > > +       }
> > > >
> > > >
> > > > Thanks,
> > > > Charan
> > >
> > >
> > >
> > > --
> > > Jaroslav Pulchart
> > > Sr. Principal SW Engineer
> > > GoodData
> >
> >
> > Hello,
> >
> > today we try to update servers to 6.6.9 which contains the mglru fixes
> > (from 6.6.8) and the server behaves much much worse.
> >
> > I got multiple kswapd* load to ~100% imediatelly.
> >     555 root      20   0       0      0      0 R  99.7   0.0   4:32.86
> > kswapd1
> >     554 root      20   0       0      0      0 R  99.3   0.0   3:57.76
> > kswapd0
> >     556 root      20   0       0      0      0 R  97.7   0.0   3:42.27
> > kswapd2
> > are the changes in upstream different compared to the initial patch
> > which I tested?
> >
> > Best regards,
> > Jaroslav Pulchart
>
> Hi Jaroslav,
>
> My apologies for all the trouble!
>
> Yes, there is a slight difference between the fix you verified and
> what went into 6.6.9. The fix in 6.6.9 is disabled under a special
> condition which I thought wouldn't affect you.
>
> Could you try the attached fix again on top of 6.6.9? It removed that
> special condition.
>
> Thanks!

Thanks for the prompt response. I did a test with the patch and it didn't
help. The situation is super strange.

I tried kernels 6.6.7, 6.6.8 and 6.6.9. With 6.6.9 I see high memory
utilization on all NUMA nodes of the first CPU socket, which is the worst
situation, but the kswapd load is already visible from 6.6.8.

Setup of this server:
* 2 sockets, 4 chiplets per socket
* 32 GB of RAM per chiplet, 28 GB of which are in hugepages
  Note: previously I had 29 GB in hugepages; I freed up 1 GB to avoid
  memory pressure, but on the contrary it is now even worse.

kernel 6.6.7: I do not see kswapd usage when the application started == OK
NUMA nodes: 0 1 2 3 4 5 6 7
HPTotalGiB: 28 28 28 28 28 28 28 28
HPFreeGiB: 28 28 28 28 28 28 28 28
MemTotal: 32264 32701 32701 32686 32701 32659 32701 32696
MemFree: 2766 2715 63 2366 3495 2990 3462 252

kernel 6.6.8: I see kswapd on nodes 2 and 3 when the application started
NUMA nodes: 0 1 2 3 4 5 6 7
HPTotalGiB: 28 28 28 28 28 28 28 28
HPFreeGiB: 28 28 28 28 28 28 28 28
MemTotal: 32264 32701 32701 32686 32701 32701 32659 32696
MemFree: 2744 2788 65 581 3304 3215 3266 2226

kernel 6.6.9: I see kswapd on nodes 0, 1, 2 and 3 when the application started
NUMA nodes: 0 1 2 3 4 5 6 7
HPTotalGiB: 28 28 28 28 28 28 28 28
HPFreeGiB: 28 28 28 28 28 28 28 28
MemTotal: 32264 32701 32701 32686 32659 32701 32701 32696
MemFree: 75 60 60 60 3169 2784 3203 2944
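
(For reference, one way to collect such per-node numbers, as a rough
sketch assuming the standard sysfs layout:)

    for n in /sys/devices/system/node/node[0-7]; do
        echo "== ${n##*/} =="
        grep -E 'MemTotal|MemFree|HugePages_Total|HugePages_Free' "$n/meminfo"
    done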


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2024-01-04  9:46                                             ` Jaroslav Pulchart
@ 2024-01-04 14:34                                               ` Jaroslav Pulchart
  2024-01-04 23:51                                                 ` Igor Raits
  0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2024-01-04 14:34 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Daniel Secik, Charan Teja Kalla, Igor Raits, Kalesh Singh, akpm,
	linux-mm

>
> >
> > On Wed, Jan 3, 2024 at 2:30 PM Jaroslav Pulchart
> > <jaroslav.pulchart@gooddata.com> wrote:
> > >
> > > >
> > > > >
> > > > > Hi yu,
> > > > >
> > > > > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > > > > Charan, does the fix previously attached seem acceptable to you? Any
> > > > > > additional feedback? Thanks.
> > > > >
> > > > > First, thanks for taking this patch to upstream.
> > > > >
> > > > > A comment in code snippet is checking just 'high wmark' pages might
> > > > > succeed here but can fail in the immediate kswapd sleep, see
> > > > > prepare_kswapd_sleep(). This can show up into the increased
> > > > > KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary kswapd run time.
> > > > > @Jaroslav: Have you observed something like above?
> > > >
> > > > I do not see any unnecessary kswapd run time, on the contrary it is
> > > > fixing the kswapd continuous run issue.
> > > >
> > > > >
> > > > > So, in downstream, we have something like for zone_watermark_ok():
> > > > > unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH << 2;
> > > > >
> > > > > Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value, may be we
> > > > > should atleast use the 'MIN_LRU_BATCH' with the mentioned reasoning, is
> > > > > what all I can say for this patch.
> > > > >
> > > > > +       mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
> > > > > +              WMARK_PROMO : WMARK_HIGH;
> > > > > +       for (i = 0; i <= sc->reclaim_idx; i++) {
> > > > > +               struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
> > > > > +               unsigned long size = wmark_pages(zone, mark);
> > > > > +
> > > > > +               if (managed_zone(zone) &&
> > > > > +                   !zone_watermark_ok(zone, sc->order, size, sc->reclaim_idx, 0))
> > > > > +                       return false;
> > > > > +       }
> > > > >
> > > > >
> > > > > Thanks,
> > > > > Charan
> > > >
> > > >
> > > >
> > > > --
> > > > Jaroslav Pulchart
> > > > Sr. Principal SW Engineer
> > > > GoodData
> > >
> > >
> > > Hello,
> > >
> > > today we try to update servers to 6.6.9 which contains the mglru fixes
> > > (from 6.6.8) and the server behaves much much worse.
> > >
> > > I got multiple kswapd* load to ~100% imediatelly.
> > >     555 root      20   0       0      0      0 R  99.7   0.0   4:32.86
> > > kswapd1
> > >     554 root      20   0       0      0      0 R  99.3   0.0   3:57.76
> > > kswapd0
> > >     556 root      20   0       0      0      0 R  97.7   0.0   3:42.27
> > > kswapd2
> > > are the changes in upstream different compared to the initial patch
> > > which I tested?
> > >
> > > Best regards,
> > > Jaroslav Pulchart
> >
> > Hi Jaroslav,
> >
> > My apologies for all the trouble!
> >
> > Yes, there is a slight difference between the fix you verified and
> > what went into 6.6.9. The fix in 6.6.9 is disabled under a special
> > condition which I thought wouldn't affect you.
> >
> > Could you try the attached fix again on top of 6.6.9? It removed that
> > special condition.
> >
> > Thanks!
>
> Thanks for prompt response. I did a test with the patch and it didn't
> help. The situation is super strange.
>
> I tried kernels 6.6.7, 6.6.8 and  6.6.9. I see high memory utilization
> of all numa nodes of the first cpu socket if using 6.6.9 and it is the
> worst situation, but the kswapd load is visible from 6.6.8.
>
> Setup of this server:
> * 4 chiplets per each sockets, there are 2 sockets
> * 32 GB of RAM for each chiplet, 28GB are in hugepages
>   Note: previously I have 29GB in Hugepages, I free up 1GB to avoid
> memory pressure however it is even worse now in contrary.
>
> kernel 6.6.7: I do not see kswapd usage when application started == OK
> NUMA nodes: 0 1 2 3 4 5 6 7
> HPTotalGiB: 28 28 28 28 28 28 28 28
> HPFreeGiB: 28 28 28 28 28 28 28 28
> MemTotal: 32264 32701 32701 32686 32701 32659 32701 32696
> MemFree: 2766 2715 63 2366 3495 2990 3462 252
>
> kernel 6.6.8: I see kswapd on nodes 2 and 3 when application started
> NUMA nodes: 0 1 2 3 4 5 6 7
> HPTotalGiB: 28 28 28 28 28 28 28 28
> HPFreeGiB: 28 28 28 28 28 28 28 28
> MemTotal: 32264 32701 32701 32686 32701 32701 32659 32696
> MemFree: 2744 2788 65 581 3304 3215 3266 2226
>
> kernel 6.6.9: I see kswapd on nodes 0, 1, 2 and 3 when application started
> NUMA nodes: 0 1 2 3 4 5 6 7
> HPTotalGiB: 28 28 28 28 28 28 28 28
> HPFreeGiB: 28 28 28 28 28 28 28 28
> MemTotal: 32264 32701 32701 32686 32659 32701 32701 32696
> MemFree: 75 60 60 60 3169 2784 3203 2944

I ran a few more combinations, and here are the results/findings:

  6.6.7-1  (vanilla)                            == OK, no issue

  6.6.8-1  (vanilla)                            == single kswapd 100% !
  6.6.8-1  (vanilla plus mglru-fix-6.6.9.patch) == OK, no issue
  6.6.8-1  (revert four mglru patches)          == OK, no issue

  6.6.9-1  (vanilla)                            == four kswapd 100% !!!!
  6.6.9-2  (vanilla plus mglru-fix-6.6.9.patch) == four kswapd 100% !!!!
  6.6.9-3  (revert four mglru patches)          == four kswapd 100% !!!!

Summary:
* mglru-fix-6.6.9.patch, or reverting the mglru patches, helps in the case
of kernel 6.6.8,
* there is a (new?) problem with the 6.6.9 kernel, which does not appear
to be related to the mglru patches at all
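
As a side note on the 'MIN_LRU_BATCH << 2' slack discussed in the quoted
snippet above: assuming MIN_LRU_BATCH equals BITS_PER_LONG (64 on this
hardware) and 4 KiB pages, that is only about 1 MiB of extra headroom per
zone. A throwaway userspace sketch of the arithmetic (the constants here
are assumptions, not values read from the running kernel):

/* Rough arithmetic only; assumes MIN_LRU_BATCH == BITS_PER_LONG == 64
 * and a 4 KiB page size, as on a typical x86_64 config. */
#include <stdio.h>

int main(void)
{
	const long min_lru_batch = 64;            /* assumed MIN_LRU_BATCH */
	const long slack_pages   = min_lru_batch << 2;

	printf("extra slack over the watermark: %ld pages (%ld KiB)\n",
	       slack_pages, slack_pages * 4);     /* 256 pages == 1 MiB */
	return 0;
}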


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2024-01-04 14:34                                               ` Jaroslav Pulchart
@ 2024-01-04 23:51                                                 ` Igor Raits
  2024-01-05 17:35                                                   ` Ertman, David M
  0 siblings, 1 reply; 30+ messages in thread
From: Igor Raits @ 2024-01-04 23:51 UTC (permalink / raw)
  To: Jaroslav Pulchart
  Cc: Yu Zhao, Daniel Secik, Charan Teja Kalla, Kalesh Singh, akpm,
	linux-mm, Dave Ertman

Hello everyone,

On Thu, Jan 4, 2024 at 3:34 PM Jaroslav Pulchart
<jaroslav.pulchart@gooddata.com> wrote:
>
> >
> > >
> > > On Wed, Jan 3, 2024 at 2:30 PM Jaroslav Pulchart
> > > <jaroslav.pulchart@gooddata.com> wrote:
> > > >
> > > > >
> > > > > >
> > > > > > Hi yu,
> > > > > >
> > > > > > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > > > > > Charan, does the fix previously attached seem acceptable to you? Any
> > > > > > > additional feedback? Thanks.
> > > > > >
> > > > > > First, thanks for taking this patch to upstream.
> > > > > >
> > > > > > A comment in code snippet is checking just 'high wmark' pages might
> > > > > > succeed here but can fail in the immediate kswapd sleep, see
> > > > > > prepare_kswapd_sleep(). This can show up into the increased
> > > > > > KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary kswapd run time.
> > > > > > @Jaroslav: Have you observed something like above?
> > > > >
> > > > > I do not see any unnecessary kswapd run time, on the contrary it is
> > > > > fixing the kswapd continuous run issue.
> > > > >
> > > > > >
> > > > > > So, in downstream, we have something like for zone_watermark_ok():
> > > > > > unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH << 2;
> > > > > >
> > > > > > Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value, may be we
> > > > > > should atleast use the 'MIN_LRU_BATCH' with the mentioned reasoning, is
> > > > > > what all I can say for this patch.
> > > > > >
> > > > > > +       mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
> > > > > > +              WMARK_PROMO : WMARK_HIGH;
> > > > > > +       for (i = 0; i <= sc->reclaim_idx; i++) {
> > > > > > +               struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
> > > > > > +               unsigned long size = wmark_pages(zone, mark);
> > > > > > +
> > > > > > +               if (managed_zone(zone) &&
> > > > > > +                   !zone_watermark_ok(zone, sc->order, size, sc->reclaim_idx, 0))
> > > > > > +                       return false;
> > > > > > +       }
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Charan
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Jaroslav Pulchart
> > > > > Sr. Principal SW Engineer
> > > > > GoodData
> > > >
> > > >
> > > > Hello,
> > > >
> > > > today we try to update servers to 6.6.9 which contains the mglru fixes
> > > > (from 6.6.8) and the server behaves much much worse.
> > > >
> > > > I got multiple kswapd* load to ~100% imediatelly.
> > > >     555 root      20   0       0      0      0 R  99.7   0.0   4:32.86
> > > > kswapd1
> > > >     554 root      20   0       0      0      0 R  99.3   0.0   3:57.76
> > > > kswapd0
> > > >     556 root      20   0       0      0      0 R  97.7   0.0   3:42.27
> > > > kswapd2
> > > > are the changes in upstream different compared to the initial patch
> > > > which I tested?
> > > >
> > > > Best regards,
> > > > Jaroslav Pulchart
> > >
> > > Hi Jaroslav,
> > >
> > > My apologies for all the trouble!
> > >
> > > Yes, there is a slight difference between the fix you verified and
> > > what went into 6.6.9. The fix in 6.6.9 is disabled under a special
> > > condition which I thought wouldn't affect you.
> > >
> > > Could you try the attached fix again on top of 6.6.9? It removed that
> > > special condition.
> > >
> > > Thanks!
> >
> > Thanks for prompt response. I did a test with the patch and it didn't
> > help. The situation is super strange.
> >
> > I tried kernels 6.6.7, 6.6.8 and  6.6.9. I see high memory utilization
> > of all numa nodes of the first cpu socket if using 6.6.9 and it is the
> > worst situation, but the kswapd load is visible from 6.6.8.
> >
> > Setup of this server:
> > * 4 chiplets per each sockets, there are 2 sockets
> > * 32 GB of RAM for each chiplet, 28GB are in hugepages
> >   Note: previously I have 29GB in Hugepages, I free up 1GB to avoid
> > memory pressure however it is even worse now in contrary.
> >
> > kernel 6.6.7: I do not see kswapd usage when application started == OK
> > NUMA nodes: 0 1 2 3 4 5 6 7
> > HPTotalGiB: 28 28 28 28 28 28 28 28
> > HPFreeGiB: 28 28 28 28 28 28 28 28
> > MemTotal: 32264 32701 32701 32686 32701 32659 32701 32696
> > MemFree: 2766 2715 63 2366 3495 2990 3462 252
> >
> > kernel 6.6.8: I see kswapd on nodes 2 and 3 when application started
> > NUMA nodes: 0 1 2 3 4 5 6 7
> > HPTotalGiB: 28 28 28 28 28 28 28 28
> > HPFreeGiB: 28 28 28 28 28 28 28 28
> > MemTotal: 32264 32701 32701 32686 32701 32701 32659 32696
> > MemFree: 2744 2788 65 581 3304 3215 3266 2226
> >
> > kernel 6.6.9: I see kswapd on nodes 0, 1, 2 and 3 when application started
> > NUMA nodes: 0 1 2 3 4 5 6 7
> > HPTotalGiB: 28 28 28 28 28 28 28 28
> > HPFreeGiB: 28 28 28 28 28 28 28 28
> > MemTotal: 32264 32701 32701 32686 32659 32701 32701 32696
> > MemFree: 75 60 60 60 3169 2784 3203 2944
>
> I run few more combinations, and here are results / findings:
>
>   6.6.7-1  (vanila)                            == OK, no issue
>
>   6.6.8-1  (vanila)                            == single kswapd 100% !
>   6.6.8-1  (vanila plus mglru-fix-6.6.9.patch) == OK, no issue
>   6.6.8-1  (revert four mglru patches)         == OK, no issue
>
>   6.6.9-1  (vanila)                            == four kswapd 100% !!!!
>   6.6.9-2  (vanila plus mglru-fix-6.6.9.patch) == four kswapd 100% !!!!
>   6.6.9-3  (revert four mglru patches)         == four kswapd 100% !!!!
>
> Summary:
> * mglru-fix-6.6.9.patch or reverting mglru patches helps in case of
> kernel 6.6.8,
> * there is (new?) problem in case of 6.6.9 kernel, which looks not to
> be related to mglru patches at all

I was able to bisect this change, and it looks like something is going
wrong with the ice driver…

Usually after booting our server we see something like the table below:
most of the nodes have ~2-3G of free memory, but there are always 1-2
NUMA nodes with a really low amount of free memory. We don't know why,
but it looks like that is what ultimately causes the constant swap
in/out issue. With the final bit of the patch you sent earlier in this
thread it is almost invisible.

NUMA nodes:     0       1       2       3       4       5       6       7
HPTotalGiB:     28      28      28      28      28      28      28      28
HPFreeGiB:      28      28      28      28      28      28      28      28
MemTotal:       32264   32701   32659   32686   32701   32701   32701   32696
MemFree:        2191    2828    92      292     3344    2916    3594    3222


However, after the following patch we see that more NUMA nodes end up
with such a low amount of free memory, and that causes constant memory
reclaim; it looks like something inside the kernel ate all the memory.
This is right after the start of the system as well.

NUMA nodes:     0       1       2       3       4       5       6       7
HPTotalGiB:     28      28      28      28      28      28      28      28
HPFreeGiB:      28      28      28      28      28      28      28      28
MemTotal:       32264   32701   32659   32686   32701   32701   32701   32696
MemFree:        46      59      51      33      3078    3535    2708    3511

The difference is 18G vs 12G of free memory summed across all NUMA
nodes right after boot of the system. If you have any hints on how to
debug what is actually occupying all that memory, ideally in both
cases, we would be happy to debug more!
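
The per-node MemFree numbers above can be read from the standard
/sys/devices/system/node/node<N>/meminfo files; below is a minimal sketch
that sums them across nodes (an illustration of where the 18G-vs-12G
totals come from, not the exact tooling we use):

/* Sum MemFree across NUMA nodes via sysfs. Assumes the usual
 * "Node <N> MemFree: <kB> kB" line format; nodes that do not exist
 * are simply skipped. */
#include <stdio.h>

int main(void)
{
	long total_kb = 0;

	for (int node = 0; node < 64; node++) {
		char path[64], line[256];
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d/meminfo", node);
		f = fopen(path, "r");
		if (!f)
			continue;	/* node not present */

		while (fgets(line, sizeof(line), f)) {
			long kb;

			if (sscanf(line, "Node %*d MemFree: %ld kB", &kb) == 1) {
				printf("node%d MemFree: %5ld MiB\n", node, kb / 1024);
				total_kb += kb;
				break;
			}
		}
		fclose(f);
	}
	printf("sum MemFree: %ld MiB\n", total_kb / 1024);
	return 0;
}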

Dave, would you have any idea why that patch could cause such a boost
in memory utilization?

commit fc4d6d136d42fab207b3ce20a8ebfd61a13f931f
Author: Dave Ertman <david.m.ertman@intel.com>
Date:   Mon Dec 11 13:19:28 2023 -0800

    ice: alter feature support check for SRIOV and LAG

    [ Upstream commit 4d50fcdc2476eef94c14c6761073af5667bb43b6 ]

    Previously, the ice driver had support for using a handler for bonding
    netdev events to ensure that conflicting features were not allowed to be
    activated at the same time.  While this was still in place, additional
    support was added to specifically support SRIOV and LAG together.  These
    both utilized the netdev event handler, but the SRIOV and LAG feature was
    behind a capabilities feature check to make sure the current NVM has
    support.

    The exclusion part of the event handler should be removed since there are
    users who have custom made solutions that depend on the non-exclusion of
    features.

    Wrap the creation/registration and cleanup of the event handler and
    associated structs in the probe flow with a feature check so that the
    only systems that support the full implementation of LAG features will
    initialize support.  This will leave other systems unhindered with
    functionality as it existed before any LAG code was added.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2024-01-04 23:51                                                 ` Igor Raits
@ 2024-01-05 17:35                                                   ` Ertman, David M
  2024-01-08 17:53                                                     ` Jaroslav Pulchart
  0 siblings, 1 reply; 30+ messages in thread
From: Ertman, David M @ 2024-01-05 17:35 UTC (permalink / raw)
  To: Igor Raits, Jaroslav Pulchart
  Cc: Yu Zhao, Daniel Secik, Charan Teja Kalla, Kalesh Singh, akpm, linux-mm

> -----Original Message-----
> From: Igor Raits <igor@gooddata.com>
> Sent: Thursday, January 4, 2024 3:51 PM
> To: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
> Cc: Yu Zhao <yuzhao@google.com>; Daniel Secik
> <daniel.secik@gooddata.com>; Charan Teja Kalla
> <quic_charante@quicinc.com>; Kalesh Singh <kaleshsingh@google.com>;
> akpm@linux-foundation.org; linux-mm@kvack.org; Ertman, David M
> <david.m.ertman@intel.com>
> Subject: Re: high kswapd CPU usage with symmetrical swap in/out pattern
> with multi-gen LRU
> 
> Hello everyone,
> 
> On Thu, Jan 4, 2024 at 3:34 PM Jaroslav Pulchart
> <jaroslav.pulchart@gooddata.com> wrote:
> >
> > >
> > > >
> > > > On Wed, Jan 3, 2024 at 2:30 PM Jaroslav Pulchart
> > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > >
> > > > > >
> > > > > > >
> > > > > > > Hi yu,
> > > > > > >
> > > > > > > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > > > > > > Charan, does the fix previously attached seem acceptable to
> you? Any
> > > > > > > > additional feedback? Thanks.
> > > > > > >
> > > > > > > First, thanks for taking this patch to upstream.
> > > > > > >
> > > > > > > A comment in code snippet is checking just 'high wmark' pages
> might
> > > > > > > succeed here but can fail in the immediate kswapd sleep, see
> > > > > > > prepare_kswapd_sleep(). This can show up into the increased
> > > > > > > KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary
> kswapd run time.
> > > > > > > @Jaroslav: Have you observed something like above?
> > > > > >
> > > > > > I do not see any unnecessary kswapd run time, on the contrary it is
> > > > > > fixing the kswapd continuous run issue.
> > > > > >
> > > > > > >
> > > > > > > So, in downstream, we have something like for
> zone_watermark_ok():
> > > > > > > unsigned long size = wmark_pages(zone, mark) +
> MIN_LRU_BATCH << 2;
> > > > > > >
> > > > > > > Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value,
> may be we
> > > > > > > should atleast use the 'MIN_LRU_BATCH' with the mentioned
> reasoning, is
> > > > > > > what all I can say for this patch.
> > > > > > >
> > > > > > > +       mark = sysctl_numa_balancing_mode &
> NUMA_BALANCING_MEMORY_TIERING ?
> > > > > > > +              WMARK_PROMO : WMARK_HIGH;
> > > > > > > +       for (i = 0; i <= sc->reclaim_idx; i++) {
> > > > > > > +               struct zone *zone = lruvec_pgdat(lruvec)->node_zones +
> i;
> > > > > > > +               unsigned long size = wmark_pages(zone, mark);
> > > > > > > +
> > > > > > > +               if (managed_zone(zone) &&
> > > > > > > +                   !zone_watermark_ok(zone, sc->order, size, sc-
> >reclaim_idx, 0))
> > > > > > > +                       return false;
> > > > > > > +       }
> > > > > > >
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Charan
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Jaroslav Pulchart
> > > > > > Sr. Principal SW Engineer
> > > > > > GoodData
> > > > >
> > > > >
> > > > > Hello,
> > > > >
> > > > > today we try to update servers to 6.6.9 which contains the mglru fixes
> > > > > (from 6.6.8) and the server behaves much much worse.
> > > > >
> > > > > I got multiple kswapd* load to ~100% imediatelly.
> > > > >     555 root      20   0       0      0      0 R  99.7   0.0   4:32.86
> > > > > kswapd1
> > > > >     554 root      20   0       0      0      0 R  99.3   0.0   3:57.76
> > > > > kswapd0
> > > > >     556 root      20   0       0      0      0 R  97.7   0.0   3:42.27
> > > > > kswapd2
> > > > > are the changes in upstream different compared to the initial patch
> > > > > which I tested?
> > > > >
> > > > > Best regards,
> > > > > Jaroslav Pulchart
> > > >
> > > > Hi Jaroslav,
> > > >
> > > > My apologies for all the trouble!
> > > >
> > > > Yes, there is a slight difference between the fix you verified and
> > > > what went into 6.6.9. The fix in 6.6.9 is disabled under a special
> > > > condition which I thought wouldn't affect you.
> > > >
> > > > Could you try the attached fix again on top of 6.6.9? It removed that
> > > > special condition.
> > > >
> > > > Thanks!
> > >
> > > Thanks for prompt response. I did a test with the patch and it didn't
> > > help. The situation is super strange.
> > >
> > > I tried kernels 6.6.7, 6.6.8 and  6.6.9. I see high memory utilization
> > > of all numa nodes of the first cpu socket if using 6.6.9 and it is the
> > > worst situation, but the kswapd load is visible from 6.6.8.
> > >
> > > Setup of this server:
> > > * 4 chiplets per each sockets, there are 2 sockets
> > > * 32 GB of RAM for each chiplet, 28GB are in hugepages
> > >   Note: previously I have 29GB in Hugepages, I free up 1GB to avoid
> > > memory pressure however it is even worse now in contrary.
> > >
> > > kernel 6.6.7: I do not see kswapd usage when application started == OK
> > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > MemTotal: 32264 32701 32701 32686 32701 32659 32701 32696
> > > MemFree: 2766 2715 63 2366 3495 2990 3462 252
> > >
> > > kernel 6.6.8: I see kswapd on nodes 2 and 3 when application started
> > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > MemTotal: 32264 32701 32701 32686 32701 32701 32659 32696
> > > MemFree: 2744 2788 65 581 3304 3215 3266 2226
> > >
> > > kernel 6.6.9: I see kswapd on nodes 0, 1, 2 and 3 when application started
> > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > MemTotal: 32264 32701 32701 32686 32659 32701 32701 32696
> > > MemFree: 75 60 60 60 3169 2784 3203 2944
> >
> > I run few more combinations, and here are results / findings:
> >
> >   6.6.7-1  (vanila)                            == OK, no issue
> >
> >   6.6.8-1  (vanila)                            == single kswapd 100% !
> >   6.6.8-1  (vanila plus mglru-fix-6.6.9.patch) == OK, no issue
> >   6.6.8-1  (revert four mglru patches)         == OK, no issue
> >
> >   6.6.9-1  (vanila)                            == four kswapd 100% !!!!
> >   6.6.9-2  (vanila plus mglru-fix-6.6.9.patch) == four kswapd 100% !!!!
> >   6.6.9-3  (revert four mglru patches)         == four kswapd 100% !!!!
> >
> > Summary:
> > * mglru-fix-6.6.9.patch or reverting mglru patches helps in case of
> > kernel 6.6.8,
> > * there is (new?) problem in case of 6.6.9 kernel, which looks not to
> > be related to mglru patches at all
> 
> I was able to bisect this change and it looks like there is something
> going wrong with the ice driver…
> 
> Usually after booting our server we see something like this. Most of
> the nodes have ~2-3G of free memory. There are always 1-2 NUMA nodes
> that have a really low amount of free memory and we don't know why but
> it looks like that in the end causes the constant swap in/out issue.
> With the final bit of the patch you've sent earlier in this thread it
> is almost invisible.
> 
> NUMA nodes:     0       1       2       3       4       5       6       7
> HPTotalGiB:     28      28      28      28      28      28      28      28
> HPFreeGiB:      28      28      28      28      28      28      28      28
> MemTotal:       32264   32701   32659   32686   32701   32701   32701   32696
> MemFree:        2191    2828    92      292     3344    2916    3594    3222
> 
> 
> However, after the following patch we see that more NUMA nodes have
> such a low amount of memory and  that is causing constant reclaiming
> of memory because it looks like something inside of the kernel ate all
> the memory. This is right after the start of the system as well.
> 
> NUMA nodes:     0       1       2       3       4       5       6       7
> HPTotalGiB:     28      28      28      28      28      28      28      28
> HPFreeGiB:      28      28      28      28      28      28      28      28
> MemTotal:       32264   32701   32659   32686   32701   32701   32701   32696
> MemFree:        46      59      51      33      3078    3535    2708    3511
> 
> The difference is 18G vs 12G of free memory sum'd across all NUMA
> nodes right after boot of the system. If you have some hints on how to
> debug what is actually occupying all that memory, maybe in both cases
> - would be happy to debug more!
> 
> Dave, would you have any idea why that patch could cause such a boost
> in memory utilization?
> 
> commit fc4d6d136d42fab207b3ce20a8ebfd61a13f931f
> Author: Dave Ertman <david.m.ertman@intel.com>
> Date:   Mon Dec 11 13:19:28 2023 -0800
> 
>     ice: alter feature support check for SRIOV and LAG
> 
>     [ Upstream commit 4d50fcdc2476eef94c14c6761073af5667bb43b6 ]
> 
>     Previously, the ice driver had support for using a handler for bonding
>     netdev events to ensure that conflicting features were not allowed to be
>     activated at the same time.  While this was still in place, additional
>     support was added to specifically support SRIOV and LAG together.  These
>     both utilized the netdev event handler, but the SRIOV and LAG feature
> was
>     behind a capabilities feature check to make sure the current NVM has
>     support.
> 
>     The exclusion part of the event handler should be removed since there are
>     users who have custom made solutions that depend on the non-exclusion
> of
>     features.
> 
>     Wrap the creation/registration and cleanup of the event handler and
>     associated structs in the probe flow with a feature check so that the
>     only systems that support the full implementation of LAG features will
>     initialize support.  This will leave other systems unhindered with
>     functionality as it existed before any LAG code was added.

Igor,

I have no idea why that two-line commit would do anything to increase memory usage by the ice driver.
If anything, I would expect it to lower memory usage, since it has the potential to stop the allocation of memory
for the pf->lag struct.
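
For illustration only, here is a minimal userspace sketch of the gating
pattern that commit message describes (placeholder names, not the actual
ice driver code): when the feature check fails, the lag struct is simply
never allocated and no netdev-event handler is registered.

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the driver's private structs. */
struct lag_info { int bond_events; };
struct pf_ctx   { struct lag_info *lag; };

/* Sketch of the probe-time decision: only allocate and register when the
 * NVM/feature check reports full SRIOV+LAG support. */
static int init_lag_if_supported(struct pf_ctx *pf, bool sriov_lag_supported)
{
	if (!sriov_lag_supported) {
		pf->lag = NULL;		/* nothing allocated, no handler registered */
		return 0;
	}

	pf->lag = calloc(1, sizeof(*pf->lag));
	if (!pf->lag)
		return -1;

	/* ...this is where the bonding netdev-event handler would be set up... */
	return 0;
}

int main(void)
{
	struct pf_ctx pf = { 0 };

	init_lag_if_supported(&pf, false);
	printf("lag allocated: %s\n", pf.lag ? "yes" : "no");
	free(pf.lag);
	return 0;
}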

DaveE

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2024-01-05 17:35                                                   ` Ertman, David M
@ 2024-01-08 17:53                                                     ` Jaroslav Pulchart
  2024-01-16  4:58                                                       ` Yu Zhao
  0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2024-01-08 17:53 UTC (permalink / raw)
  To: Ertman, David M, Yu Zhao
  Cc: Igor Raits, Daniel Secik, Charan Teja Kalla, Kalesh Singh, akpm,
	linux-mm

>
> > -----Original Message-----
> > From: Igor Raits <igor@gooddata.com>
> > Sent: Thursday, January 4, 2024 3:51 PM
> > To: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
> > Cc: Yu Zhao <yuzhao@google.com>; Daniel Secik
> > <daniel.secik@gooddata.com>; Charan Teja Kalla
> > <quic_charante@quicinc.com>; Kalesh Singh <kaleshsingh@google.com>;
> > akpm@linux-foundation.org; linux-mm@kvack.org; Ertman, David M
> > <david.m.ertman@intel.com>
> > Subject: Re: high kswapd CPU usage with symmetrical swap in/out pattern
> > with multi-gen LRU
> >
> > Hello everyone,
> >
> > On Thu, Jan 4, 2024 at 3:34 PM Jaroslav Pulchart
> > <jaroslav.pulchart@gooddata.com> wrote:
> > >
> > > >
> > > > >
> > > > > On Wed, Jan 3, 2024 at 2:30 PM Jaroslav Pulchart
> > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > Hi yu,
> > > > > > > >
> > > > > > > > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > > > > > > > Charan, does the fix previously attached seem acceptable to
> > you? Any
> > > > > > > > > additional feedback? Thanks.
> > > > > > > >
> > > > > > > > First, thanks for taking this patch to upstream.
> > > > > > > >
> > > > > > > > A comment in code snippet is checking just 'high wmark' pages
> > might
> > > > > > > > succeed here but can fail in the immediate kswapd sleep, see
> > > > > > > > prepare_kswapd_sleep(). This can show up into the increased
> > > > > > > > KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary
> > kswapd run time.
> > > > > > > > @Jaroslav: Have you observed something like above?
> > > > > > >
> > > > > > > I do not see any unnecessary kswapd run time, on the contrary it is
> > > > > > > fixing the kswapd continuous run issue.
> > > > > > >
> > > > > > > >
> > > > > > > > So, in downstream, we have something like for
> > zone_watermark_ok():
> > > > > > > > unsigned long size = wmark_pages(zone, mark) +
> > MIN_LRU_BATCH << 2;
> > > > > > > >
> > > > > > > > Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value,
> > may be we
> > > > > > > > should atleast use the 'MIN_LRU_BATCH' with the mentioned
> > reasoning, is
> > > > > > > > what all I can say for this patch.
> > > > > > > >
> > > > > > > > +       mark = sysctl_numa_balancing_mode &
> > NUMA_BALANCING_MEMORY_TIERING ?
> > > > > > > > +              WMARK_PROMO : WMARK_HIGH;
> > > > > > > > +       for (i = 0; i <= sc->reclaim_idx; i++) {
> > > > > > > > +               struct zone *zone = lruvec_pgdat(lruvec)->node_zones +
> > i;
> > > > > > > > +               unsigned long size = wmark_pages(zone, mark);
> > > > > > > > +
> > > > > > > > +               if (managed_zone(zone) &&
> > > > > > > > +                   !zone_watermark_ok(zone, sc->order, size, sc-
> > >reclaim_idx, 0))
> > > > > > > > +                       return false;
> > > > > > > > +       }
> > > > > > > >
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Charan
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Jaroslav Pulchart
> > > > > > > Sr. Principal SW Engineer
> > > > > > > GoodData
> > > > > >
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > today we try to update servers to 6.6.9 which contains the mglru fixes
> > > > > > (from 6.6.8) and the server behaves much much worse.
> > > > > >
> > > > > > I got multiple kswapd* load to ~100% imediatelly.
> > > > > >     555 root      20   0       0      0      0 R  99.7   0.0   4:32.86
> > > > > > kswapd1
> > > > > >     554 root      20   0       0      0      0 R  99.3   0.0   3:57.76
> > > > > > kswapd0
> > > > > >     556 root      20   0       0      0      0 R  97.7   0.0   3:42.27
> > > > > > kswapd2
> > > > > > are the changes in upstream different compared to the initial patch
> > > > > > which I tested?
> > > > > >
> > > > > > Best regards,
> > > > > > Jaroslav Pulchart
> > > > >
> > > > > Hi Jaroslav,
> > > > >
> > > > > My apologies for all the trouble!
> > > > >
> > > > > Yes, there is a slight difference between the fix you verified and
> > > > > what went into 6.6.9. The fix in 6.6.9 is disabled under a special
> > > > > condition which I thought wouldn't affect you.
> > > > >
> > > > > Could you try the attached fix again on top of 6.6.9? It removed that
> > > > > special condition.
> > > > >
> > > > > Thanks!
> > > >
> > > > Thanks for prompt response. I did a test with the patch and it didn't
> > > > help. The situation is super strange.
> > > >
> > > > I tried kernels 6.6.7, 6.6.8 and  6.6.9. I see high memory utilization
> > > > of all numa nodes of the first cpu socket if using 6.6.9 and it is the
> > > > worst situation, but the kswapd load is visible from 6.6.8.
> > > >
> > > > Setup of this server:
> > > > * 4 chiplets per each sockets, there are 2 sockets
> > > > * 32 GB of RAM for each chiplet, 28GB are in hugepages
> > > >   Note: previously I have 29GB in Hugepages, I free up 1GB to avoid
> > > > memory pressure however it is even worse now in contrary.
> > > >
> > > > kernel 6.6.7: I do not see kswapd usage when application started == OK
> > > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > > MemTotal: 32264 32701 32701 32686 32701 32659 32701 32696
> > > > MemFree: 2766 2715 63 2366 3495 2990 3462 252
> > > >
> > > > kernel 6.6.8: I see kswapd on nodes 2 and 3 when application started
> > > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > > MemTotal: 32264 32701 32701 32686 32701 32701 32659 32696
> > > > MemFree: 2744 2788 65 581 3304 3215 3266 2226
> > > >
> > > > kernel 6.6.9: I see kswapd on nodes 0, 1, 2 and 3 when application started
> > > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > > MemTotal: 32264 32701 32701 32686 32659 32701 32701 32696
> > > > MemFree: 75 60 60 60 3169 2784 3203 2944
> > >
> > > I run few more combinations, and here are results / findings:
> > >
> > >   6.6.7-1  (vanila)                            == OK, no issue
> > >
> > >   6.6.8-1  (vanila)                            == single kswapd 100% !
> > >   6.6.8-1  (vanila plus mglru-fix-6.6.9.patch) == OK, no issue
> > >   6.6.8-1  (revert four mglru patches)         == OK, no issue
> > >
> > >   6.6.9-1  (vanila)                            == four kswapd 100% !!!!
> > >   6.6.9-2  (vanila plus mglru-fix-6.6.9.patch) == four kswapd 100% !!!!
> > >   6.6.9-3  (revert four mglru patches)         == four kswapd 100% !!!!
> > >
> > > Summary:
> > > * mglru-fix-6.6.9.patch or reverting mglru patches helps in case of
> > > kernel 6.6.8,
> > > * there is (new?) problem in case of 6.6.9 kernel, which looks not to
> > > be related to mglru patches at all
> >
> > I was able to bisect this change and it looks like there is something
> > going wrong with the ice driver…
> >
> > Usually after booting our server we see something like this. Most of
> > the nodes have ~2-3G of free memory. There are always 1-2 NUMA nodes
> > that have a really low amount of free memory and we don't know why but
> > it looks like that in the end causes the constant swap in/out issue.
> > With the final bit of the patch you've sent earlier in this thread it
> > is almost invisible.
> >
> > NUMA nodes:     0       1       2       3       4       5       6       7
> > HPTotalGiB:     28      28      28      28      28      28      28      28
> > HPFreeGiB:      28      28      28      28      28      28      28      28
> > MemTotal:       32264   32701   32659   32686   32701   32701   32701   32696
> > MemFree:        2191    2828    92      292     3344    2916    3594    3222
> >
> >
> > However, after the following patch we see that more NUMA nodes have
> > such a low amount of memory and  that is causing constant reclaiming
> > of memory because it looks like something inside of the kernel ate all
> > the memory. This is right after the start of the system as well.
> >
> > NUMA nodes:     0       1       2       3       4       5       6       7
> > HPTotalGiB:     28      28      28      28      28      28      28      28
> > HPFreeGiB:      28      28      28      28      28      28      28      28
> > MemTotal:       32264   32701   32659   32686   32701   32701   32701   32696
> > MemFree:        46      59      51      33      3078    3535    2708    3511
> >
> > The difference is 18G vs 12G of free memory sum'd across all NUMA
> > nodes right after boot of the system. If you have some hints on how to
> > debug what is actually occupying all that memory, maybe in both cases
> > - would be happy to debug more!
> >
> > Dave, would you have any idea why that patch could cause such a boost
> > in memory utilization?
> >
> > commit fc4d6d136d42fab207b3ce20a8ebfd61a13f931f
> > Author: Dave Ertman <david.m.ertman@intel.com>
> > Date:   Mon Dec 11 13:19:28 2023 -0800
> >
> >     ice: alter feature support check for SRIOV and LAG
> >
> >     [ Upstream commit 4d50fcdc2476eef94c14c6761073af5667bb43b6 ]
> >
> >     Previously, the ice driver had support for using a handler for bonding
> >     netdev events to ensure that conflicting features were not allowed to be
> >     activated at the same time.  While this was still in place, additional
> >     support was added to specifically support SRIOV and LAG together.  These
> >     both utilized the netdev event handler, but the SRIOV and LAG feature
> > was
> >     behind a capabilities feature check to make sure the current NVM has
> >     support.
> >
> >     The exclusion part of the event handler should be removed since there are
> >     users who have custom made solutions that depend on the non-exclusion
> > of
> >     features.
> >
> >     Wrap the creation/registration and cleanup of the event handler and
> >     associated structs in the probe flow with a feature check so that the
> >     only systems that support the full implementation of LAG features will
> >     initialize support.  This will leave other systems unhindered with
> >     functionality as it existed before any LAG code was added.
>
> Igor,
>
> I have no idea why that two line commit would do anything to increase memory usage by the ice driver.
> If anything, I would expect it to lower memory usage as it has the potential to stop the allocation of memory
> for the pf->lag struct.
>
> DaveE

Hello,

I believe we can track these as two different issues. I reported the
ICE driver commit in an email with the subject "[REGRESSION] Intel ICE
Ethernet driver in linux >= 6.6.9 triggers extra memory consumption
and cause continous kswapd* usage and continuous swapping" to
    Jesse Brandeburg <jesse.brandeburg@intel.com>
    Tony Nguyen <anthony.l.nguyen@intel.com>
    intel-wired-lan@lists.osuosl.org
    Dave Ertman <david.m.ertman@intel.com>

Let's track the mglru issue here in this email thread. Yu, the kernel
build with your mglru-fix-6.6.9.patch seems to be OK: it has been
running for 3 days without kswapd usage (excluding the ice driver
commit).

Best!
-- 
Jaroslav Pulchart


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2024-01-08 17:53                                                     ` Jaroslav Pulchart
@ 2024-01-16  4:58                                                       ` Yu Zhao
  2024-01-16 17:34                                                         ` Jaroslav Pulchart
  0 siblings, 1 reply; 30+ messages in thread
From: Yu Zhao @ 2024-01-16  4:58 UTC (permalink / raw)
  To: Jaroslav Pulchart
  Cc: Ertman, David M, Igor Raits, Daniel Secik, Charan Teja Kalla,
	Kalesh Singh, akpm, linux-mm

On Mon, Jan 8, 2024 at 10:54 AM Jaroslav Pulchart
<jaroslav.pulchart@gooddata.com> wrote:
>
> >
> > > -----Original Message-----
> > > From: Igor Raits <igor@gooddata.com>
> > > Sent: Thursday, January 4, 2024 3:51 PM
> > > To: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
> > > Cc: Yu Zhao <yuzhao@google.com>; Daniel Secik
> > > <daniel.secik@gooddata.com>; Charan Teja Kalla
> > > <quic_charante@quicinc.com>; Kalesh Singh <kaleshsingh@google.com>;
> > > akpm@linux-foundation.org; linux-mm@kvack.org; Ertman, David M
> > > <david.m.ertman@intel.com>
> > > Subject: Re: high kswapd CPU usage with symmetrical swap in/out pattern
> > > with multi-gen LRU
> > >
> > > Hello everyone,
> > >
> > > On Thu, Jan 4, 2024 at 3:34 PM Jaroslav Pulchart
> > > <jaroslav.pulchart@gooddata.com> wrote:
> > > >
> > > > >
> > > > > >
> > > > > > On Wed, Jan 3, 2024 at 2:30 PM Jaroslav Pulchart
> > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Hi yu,
> > > > > > > > >
> > > > > > > > > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > > > > > > > > Charan, does the fix previously attached seem acceptable to
> > > you? Any
> > > > > > > > > > additional feedback? Thanks.
> > > > > > > > >
> > > > > > > > > First, thanks for taking this patch to upstream.
> > > > > > > > >
> > > > > > > > > A comment in code snippet is checking just 'high wmark' pages
> > > might
> > > > > > > > > succeed here but can fail in the immediate kswapd sleep, see
> > > > > > > > > prepare_kswapd_sleep(). This can show up into the increased
> > > > > > > > > KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary
> > > kswapd run time.
> > > > > > > > > @Jaroslav: Have you observed something like above?
> > > > > > > >
> > > > > > > > I do not see any unnecessary kswapd run time, on the contrary it is
> > > > > > > > fixing the kswapd continuous run issue.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > So, in downstream, we have something like for
> > > zone_watermark_ok():
> > > > > > > > > unsigned long size = wmark_pages(zone, mark) +
> > > MIN_LRU_BATCH << 2;
> > > > > > > > >
> > > > > > > > > Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value,
> > > may be we
> > > > > > > > > should atleast use the 'MIN_LRU_BATCH' with the mentioned
> > > reasoning, is
> > > > > > > > > what all I can say for this patch.
> > > > > > > > >
> > > > > > > > > +       mark = sysctl_numa_balancing_mode &
> > > NUMA_BALANCING_MEMORY_TIERING ?
> > > > > > > > > +              WMARK_PROMO : WMARK_HIGH;
> > > > > > > > > +       for (i = 0; i <= sc->reclaim_idx; i++) {
> > > > > > > > > +               struct zone *zone = lruvec_pgdat(lruvec)->node_zones +
> > > i;
> > > > > > > > > +               unsigned long size = wmark_pages(zone, mark);
> > > > > > > > > +
> > > > > > > > > +               if (managed_zone(zone) &&
> > > > > > > > > +                   !zone_watermark_ok(zone, sc->order, size, sc-
> > > >reclaim_idx, 0))
> > > > > > > > > +                       return false;
> > > > > > > > > +       }
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Charan
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Jaroslav Pulchart
> > > > > > > > Sr. Principal SW Engineer
> > > > > > > > GoodData
> > > > > > >
> > > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > today we try to update servers to 6.6.9 which contains the mglru fixes
> > > > > > > (from 6.6.8) and the server behaves much much worse.
> > > > > > >
> > > > > > > I got multiple kswapd* load to ~100% imediatelly.
> > > > > > >     555 root      20   0       0      0      0 R  99.7   0.0   4:32.86
> > > > > > > kswapd1
> > > > > > >     554 root      20   0       0      0      0 R  99.3   0.0   3:57.76
> > > > > > > kswapd0
> > > > > > >     556 root      20   0       0      0      0 R  97.7   0.0   3:42.27
> > > > > > > kswapd2
> > > > > > > are the changes in upstream different compared to the initial patch
> > > > > > > which I tested?
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Jaroslav Pulchart
> > > > > >
> > > > > > Hi Jaroslav,
> > > > > >
> > > > > > My apologies for all the trouble!
> > > > > >
> > > > > > Yes, there is a slight difference between the fix you verified and
> > > > > > what went into 6.6.9. The fix in 6.6.9 is disabled under a special
> > > > > > condition which I thought wouldn't affect you.
> > > > > >
> > > > > > Could you try the attached fix again on top of 6.6.9? It removed that
> > > > > > special condition.
> > > > > >
> > > > > > Thanks!
> > > > >
> > > > > Thanks for prompt response. I did a test with the patch and it didn't
> > > > > help. The situation is super strange.
> > > > >
> > > > > I tried kernels 6.6.7, 6.6.8 and  6.6.9. I see high memory utilization
> > > > > of all numa nodes of the first cpu socket if using 6.6.9 and it is the
> > > > > worst situation, but the kswapd load is visible from 6.6.8.
> > > > >
> > > > > Setup of this server:
> > > > > * 4 chiplets per each sockets, there are 2 sockets
> > > > > * 32 GB of RAM for each chiplet, 28GB are in hugepages
> > > > >   Note: previously I have 29GB in Hugepages, I free up 1GB to avoid
> > > > > memory pressure however it is even worse now in contrary.
> > > > >
> > > > > kernel 6.6.7: I do not see kswapd usage when application started == OK
> > > > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > > > MemTotal: 32264 32701 32701 32686 32701 32659 32701 32696
> > > > > MemFree: 2766 2715 63 2366 3495 2990 3462 252
> > > > >
> > > > > kernel 6.6.8: I see kswapd on nodes 2 and 3 when application started
> > > > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > > > MemTotal: 32264 32701 32701 32686 32701 32701 32659 32696
> > > > > MemFree: 2744 2788 65 581 3304 3215 3266 2226
> > > > >
> > > > > kernel 6.6.9: I see kswapd on nodes 0, 1, 2 and 3 when application started
> > > > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > > > MemTotal: 32264 32701 32701 32686 32659 32701 32701 32696
> > > > > MemFree: 75 60 60 60 3169 2784 3203 2944
> > > >
> > > > I run few more combinations, and here are results / findings:
> > > >
> > > >   6.6.7-1  (vanila)                            == OK, no issue
> > > >
> > > >   6.6.8-1  (vanila)                            == single kswapd 100% !
> > > >   6.6.8-1  (vanila plus mglru-fix-6.6.9.patch) == OK, no issue
> > > >   6.6.8-1  (revert four mglru patches)         == OK, no issue
> > > >
> > > >   6.6.9-1  (vanila)                            == four kswapd 100% !!!!
> > > >   6.6.9-2  (vanila plus mglru-fix-6.6.9.patch) == four kswapd 100% !!!!
> > > >   6.6.9-3  (revert four mglru patches)         == four kswapd 100% !!!!
> > > >
> > > > Summary:
> > > > * mglru-fix-6.6.9.patch or reverting mglru patches helps in case of
> > > > kernel 6.6.8,
> > > > * there is (new?) problem in case of 6.6.9 kernel, which looks not to
> > > > be related to mglru patches at all
> > >
> > > I was able to bisect this change and it looks like there is something
> > > going wrong with the ice driver…
> > >
> > > Usually after booting our server we see something like this. Most of
> > > the nodes have ~2-3G of free memory. There are always 1-2 NUMA nodes
> > > that have a really low amount of free memory and we don't know why but
> > > it looks like that in the end causes the constant swap in/out issue.
> > > With the final bit of the patch you've sent earlier in this thread it
> > > is almost invisible.
> > >
> > > NUMA nodes:     0       1       2       3       4       5       6       7
> > > HPTotalGiB:     28      28      28      28      28      28      28      28
> > > HPFreeGiB:      28      28      28      28      28      28      28      28
> > > MemTotal:       32264   32701   32659   32686   32701   32701   32701   32696
> > > MemFree:        2191    2828    92      292     3344    2916    3594    3222
> > >
> > >
> > > However, after the following patch we see that more NUMA nodes have
> > > such a low amount of memory and  that is causing constant reclaiming
> > > of memory because it looks like something inside of the kernel ate all
> > > the memory. This is right after the start of the system as well.
> > >
> > > NUMA nodes:     0       1       2       3       4       5       6       7
> > > HPTotalGiB:     28      28      28      28      28      28      28      28
> > > HPFreeGiB:      28      28      28      28      28      28      28      28
> > > MemTotal:       32264   32701   32659   32686   32701   32701   32701   32696
> > > MemFree:        46      59      51      33      3078    3535    2708    3511
> > >
> > > The difference is 18G vs 12G of free memory sum'd across all NUMA
> > > nodes right after boot of the system. If you have some hints on how to
> > > debug what is actually occupying all that memory, maybe in both cases
> > > - would be happy to debug more!
> > >
> > > Dave, would you have any idea why that patch could cause such a boost
> > > in memory utilization?
> > >
> > > commit fc4d6d136d42fab207b3ce20a8ebfd61a13f931f
> > > Author: Dave Ertman <david.m.ertman@intel.com>
> > > Date:   Mon Dec 11 13:19:28 2023 -0800
> > >
> > >     ice: alter feature support check for SRIOV and LAG
> > >
> > >     [ Upstream commit 4d50fcdc2476eef94c14c6761073af5667bb43b6 ]
> > >
> > >     Previously, the ice driver had support for using a handler for bonding
> > >     netdev events to ensure that conflicting features were not allowed to be
> > >     activated at the same time.  While this was still in place, additional
> > >     support was added to specifically support SRIOV and LAG together.  These
> > >     both utilized the netdev event handler, but the SRIOV and LAG feature
> > > was
> > >     behind a capabilities feature check to make sure the current NVM has
> > >     support.
> > >
> > >     The exclusion part of the event handler should be removed since there are
> > >     users who have custom made solutions that depend on the non-exclusion
> > > of
> > >     features.
> > >
> > >     Wrap the creation/registration and cleanup of the event handler and
> > >     associated structs in the probe flow with a feature check so that the
> > >     only systems that support the full implementation of LAG features will
> > >     initialize support.  This will leave other systems unhindered with
> > >     functionality as it existed before any LAG code was added.
> >
> > Igor,
> >
> > I have no idea why that two line commit would do anything to increase memory usage by the ice driver.
> > If anything, I would expect it to lower memory usage as it has the potential to stop the allocation of memory
> > for the pf->lag struct.
> >
> > DaveE
>
> Hello,
>
> I believe we can track it as two different issues. So I reported the
> ICE driver commit as a email with subject "[REGRESSION] Intel ICE
> Ethernet driver in linux >= 6.6.9 triggers extra memory consumption
> and cause continous kswapd* usage and continuous swapping" to
>     Jesse Brandeburg <jesse.brandeburg@intel.com>
>     Tony Nguyen <anthony.l.nguyen@intel.com>
>     intel-wired-lan@lists.osuosl.org
>     Dave Ertman <david.m.ertman@intel.com>
>
> Lets track the mglru here in this email thread. Yu, the kernel build
> with your mglru-fix-6.6.9.patch seem to be OK at least running it for
> 3days without kswapd usage (excluding the ice driver commit).

Hi Jaroslav,

Do we now have a clear conclusion that mglru-fix-6.6.9.patch made a
difference? IOW, were you able to reproduce the problem consistently
without it?

Thanks!


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
  2024-01-16  4:58                                                       ` Yu Zhao
@ 2024-01-16 17:34                                                         ` Jaroslav Pulchart
  0 siblings, 0 replies; 30+ messages in thread
From: Jaroslav Pulchart @ 2024-01-16 17:34 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Ertman, David M, Igor Raits, Daniel Secik, Charan Teja Kalla,
	Kalesh Singh, akpm, linux-mm

>
> On Mon, Jan 8, 2024 at 10:54 AM Jaroslav Pulchart
> <jaroslav.pulchart@gooddata.com> wrote:
> >
> > >
> > > > -----Original Message-----
> > > > From: Igor Raits <igor@gooddata.com>
> > > > Sent: Thursday, January 4, 2024 3:51 PM
> > > > To: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
> > > > Cc: Yu Zhao <yuzhao@google.com>; Daniel Secik
> > > > <daniel.secik@gooddata.com>; Charan Teja Kalla
> > > > <quic_charante@quicinc.com>; Kalesh Singh <kaleshsingh@google.com>;
> > > > akpm@linux-foundation.org; linux-mm@kvack.org; Ertman, David M
> > > > <david.m.ertman@intel.com>
> > > > Subject: Re: high kswapd CPU usage with symmetrical swap in/out pattern
> > > > with multi-gen LRU
> > > >
> > > > Hello everyone,
> > > >
> > > > On Thu, Jan 4, 2024 at 3:34 PM Jaroslav Pulchart
> > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > >
> > > > > >
> > > > > > >
> > > > > > > On Wed, Jan 3, 2024 at 2:30 PM Jaroslav Pulchart
> > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Hi yu,
> > > > > > > > > >
> > > > > > > > > > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > > > > > > > > > Charan, does the fix previously attached seem acceptable to
> > > > you? Any
> > > > > > > > > > > additional feedback? Thanks.
> > > > > > > > > >
> > > > > > > > > > First, thanks for taking this patch to upstream.
> > > > > > > > > >
> > > > > > > > > > A comment in code snippet is checking just 'high wmark' pages
> > > > might
> > > > > > > > > > succeed here but can fail in the immediate kswapd sleep, see
> > > > > > > > > > prepare_kswapd_sleep(). This can show up into the increased
> > > > > > > > > > KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary
> > > > kswapd run time.
> > > > > > > > > > @Jaroslav: Have you observed something like above?
> > > > > > > > >
> > > > > > > > > I do not see any unnecessary kswapd run time, on the contrary it is
> > > > > > > > > fixing the kswapd continuous run issue.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > So, in downstream, we have something like for
> > > > zone_watermark_ok():
> > > > > > > > > > unsigned long size = wmark_pages(zone, mark) +
> > > > MIN_LRU_BATCH << 2;
> > > > > > > > > >
> > > > > > > > > > Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value,
> > > > may be we
> > > > > > > > > > should atleast use the 'MIN_LRU_BATCH' with the mentioned
> > > > reasoning, is
> > > > > > > > > > what all I can say for this patch.
> > > > > > > > > >
> > > > > > > > > > +       mark = sysctl_numa_balancing_mode &
> > > > NUMA_BALANCING_MEMORY_TIERING ?
> > > > > > > > > > +              WMARK_PROMO : WMARK_HIGH;
> > > > > > > > > > +       for (i = 0; i <= sc->reclaim_idx; i++) {
> > > > > > > > > > +               struct zone *zone = lruvec_pgdat(lruvec)->node_zones +
> > > > i;
> > > > > > > > > > +               unsigned long size = wmark_pages(zone, mark);
> > > > > > > > > > +
> > > > > > > > > > +               if (managed_zone(zone) &&
> > > > > > > > > > +                   !zone_watermark_ok(zone, sc->order, size, sc-
> > > > >reclaim_idx, 0))
> > > > > > > > > > +                       return false;
> > > > > > > > > > +       }
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Charan
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Jaroslav Pulchart
> > > > > > > > > Sr. Principal SW Engineer
> > > > > > > > > GoodData
> > > > > > > >
> > > > > > > >
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > today we try to update servers to 6.6.9 which contains the mglru fixes
> > > > > > > > (from 6.6.8) and the server behaves much much worse.
> > > > > > > >
> > > > > > > > I got multiple kswapd* load to ~100% imediatelly.
> > > > > > > >     555 root      20   0       0      0      0 R  99.7   0.0   4:32.86
> > > > > > > > kswapd1
> > > > > > > >     554 root      20   0       0      0      0 R  99.3   0.0   3:57.76
> > > > > > > > kswapd0
> > > > > > > >     556 root      20   0       0      0      0 R  97.7   0.0   3:42.27
> > > > > > > > kswapd2
> > > > > > > > are the changes in upstream different compared to the initial patch
> > > > > > > > which I tested?
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Jaroslav Pulchart
> > > > > > >
> > > > > > > Hi Jaroslav,
> > > > > > >
> > > > > > > My apologies for all the trouble!
> > > > > > >
> > > > > > > Yes, there is a slight difference between the fix you verified and
> > > > > > > what went into 6.6.9. The fix in 6.6.9 is disabled under a special
> > > > > > > condition which I thought wouldn't affect you.
> > > > > > >
> > > > > > > Could you try the attached fix again on top of 6.6.9? It removed that
> > > > > > > special condition.
> > > > > > >
> > > > > > > Thanks!
> > > > > >
> > > > > > Thanks for prompt response. I did a test with the patch and it didn't
> > > > > > help. The situation is super strange.
> > > > > >
> > > > > > I tried kernels 6.6.7, 6.6.8 and  6.6.9. I see high memory utilization
> > > > > > of all numa nodes of the first cpu socket if using 6.6.9 and it is the
> > > > > > worst situation, but the kswapd load is visible from 6.6.8.
> > > > > >
> > > > > > Setup of this server:
> > > > > > * 4 chiplets per each sockets, there are 2 sockets
> > > > > > * 32 GB of RAM for each chiplet, 28GB are in hugepages
> > > > > >   Note: previously I have 29GB in Hugepages, I free up 1GB to avoid
> > > > > > memory pressure however it is even worse now in contrary.
> > > > > >
> > > > > > kernel 6.6.7: I do not see kswapd usage when application started == OK
> > > > > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > > > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > > > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > > > > MemTotal: 32264 32701 32701 32686 32701 32659 32701 32696
> > > > > > MemFree: 2766 2715 63 2366 3495 2990 3462 252
> > > > > >
> > > > > > kernel 6.6.8: I see kswapd on nodes 2 and 3 when application started
> > > > > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > > > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > > > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > > > > MemTotal: 32264 32701 32701 32686 32701 32701 32659 32696
> > > > > > MemFree: 2744 2788 65 581 3304 3215 3266 2226
> > > > > >
> > > > > > kernel 6.6.9: I see kswapd on nodes 0, 1, 2 and 3 when application started
> > > > > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > > > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > > > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > > > > MemTotal: 32264 32701 32701 32686 32659 32701 32701 32696
> > > > > > MemFree: 75 60 60 60 3169 2784 3203 2944
> > > > >
> > > > > I run few more combinations, and here are results / findings:
> > > > >
> > > > >   6.6.7-1  (vanila)                            == OK, no issue
> > > > >
> > > > >   6.6.8-1  (vanila)                            == single kswapd 100% !
> > > > >   6.6.8-1  (vanila plus mglru-fix-6.6.9.patch) == OK, no issue
> > > > >   6.6.8-1  (revert four mglru patches)         == OK, no issue
> > > > >
> > > > >   6.6.9-1  (vanila)                            == four kswapd 100% !!!!
> > > > >   6.6.9-2  (vanila plus mglru-fix-6.6.9.patch) == four kswapd 100% !!!!
> > > > >   6.6.9-3  (revert four mglru patches)         == four kswapd 100% !!!!
> > > > >
> > > > > Summary:
> > > > > * mglru-fix-6.6.9.patch or reverting mglru patches helps in case of
> > > > > kernel 6.6.8,
> > > > > * there is (new?) problem in case of 6.6.9 kernel, which looks not to
> > > > > be related to mglru patches at all
> > > >
> > > > I was able to bisect this change and it looks like there is something
> > > > going wrong with the ice driver…
> > > >
> > > > Usually after booting our server we see something like this. Most of
> > > > the nodes have ~2-3G of free memory. There are always 1-2 NUMA nodes
> > > > that have a really low amount of free memory and we don't know why but
> > > > it looks like that in the end causes the constant swap in/out issue.
> > > > With the final bit of the patch you've sent earlier in this thread it
> > > > is almost invisible.
> > > >
> > > > NUMA nodes:     0       1       2       3       4       5       6       7
> > > > HPTotalGiB:     28      28      28      28      28      28      28      28
> > > > HPFreeGiB:      28      28      28      28      28      28      28      28
> > > > MemTotal:       32264   32701   32659   32686   32701   32701   32701   32696
> > > > MemFree:        2191    2828    92      292     3344    2916    3594    3222
> > > >
> > > >
> > > > However, after the following patch we see that more NUMA nodes have
> > > > such a low amount of memory and  that is causing constant reclaiming
> > > > of memory because it looks like something inside of the kernel ate all
> > > > the memory. This is right after the start of the system as well.
> > > >
> > > > NUMA nodes:     0       1       2       3       4       5       6       7
> > > > HPTotalGiB:     28      28      28      28      28      28      28      28
> > > > HPFreeGiB:      28      28      28      28      28      28      28      28
> > > > MemTotal:       32264   32701   32659   32686   32701   32701   32701   32696
> > > > MemFree:        46      59      51      33      3078    3535    2708    3511
> > > >
> > > > The difference is 18G vs 12G of free memory sum'd across all NUMA
> > > > nodes right after boot of the system. If you have some hints on how to
> > > > debug what is actually occupying all that memory, maybe in both cases
> > > > - would be happy to debug more!
> > > >
> > > > Dave, would you have any idea why that patch could cause such a boost
> > > > in memory utilization?
> > > >
> > > > commit fc4d6d136d42fab207b3ce20a8ebfd61a13f931f
> > > > Author: Dave Ertman <david.m.ertman@intel.com>
> > > > Date:   Mon Dec 11 13:19:28 2023 -0800
> > > >
> > > >     ice: alter feature support check for SRIOV and LAG
> > > >
> > > >     [ Upstream commit 4d50fcdc2476eef94c14c6761073af5667bb43b6 ]
> > > >
> > > >     Previously, the ice driver had support for using a handler for bonding
> > > >     netdev events to ensure that conflicting features were not allowed to be
> > > >     activated at the same time.  While this was still in place, additional
> > > >     support was added to specifically support SRIOV and LAG together.  These
> > > >     both utilized the netdev event handler, but the SRIOV and LAG feature
> > > > was
> > > >     behind a capabilities feature check to make sure the current NVM has
> > > >     support.
> > > >
> > > >     The exclusion part of the event handler should be removed since there are
> > > >     users who have custom made solutions that depend on the non-exclusion
> > > > of
> > > >     features.
> > > >
> > > >     Wrap the creation/registration and cleanup of the event handler and
> > > >     associated structs in the probe flow with a feature check so that the
> > > >     only systems that support the full implementation of LAG features will
> > > >     initialize support.  This will leave other systems unhindered with
> > > >     functionality as it existed before any LAG code was added.
> > >
> > > Igor,
> > >
> > > I have no idea why that two-line commit would do anything to increase memory usage in the ice driver.
> > > If anything, I would expect it to lower memory usage as it has the potential to stop the allocation of memory
> > > for the pf->lag struct.
> > >
> > > DaveE
> >
> > Hello,
> >
> > I believe we can track these as two different issues. I reported the
> > ICE driver commit in a separate email with the subject "[REGRESSION] Intel ICE
> > Ethernet driver in linux >= 6.6.9 triggers extra memory consumption
> > and cause continous kswapd* usage and continuous swapping" to
> >     Jesse Brandeburg <jesse.brandeburg@intel.com>
> >     Tony Nguyen <anthony.l.nguyen@intel.com>
> >     intel-wired-lan@lists.osuosl.org
> >     Dave Ertman <david.m.ertman@intel.com>
> >
> > Let's track the mglru issue here in this email thread. Yu, the kernel build
> > with your mglru-fix-6.6.9.patch seems to be OK, at least after running it for
> > 3 days without kswapd usage (excluding the ice driver commit).
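
(A small, hedged aside on how "no kswapd usage / no swapping" can be verified over such a multi-day run: sampling the pswpin/pswpout counters in /proc/vmstat gives swap-in/out throughput directly; the counters are in pages, so they are scaled by the system page size below.)

import os, time

PAGE = os.sysconf("SC_PAGE_SIZE")       # bytes per page, typically 4096
INTERVAL = 5                            # seconds between samples

def swap_counters():
    counters = {}
    with open("/proc/vmstat") as f:
        for line in f:
            key, value = line.split()
            if key in ("pswpin", "pswpout"):
                counters[key] = int(value)
    return counters

prev = swap_counters()
for _ in range(12):                     # one minute of samples
    time.sleep(INTERVAL)
    cur = swap_counters()
    mb_in = (cur["pswpin"] - prev["pswpin"]) * PAGE / INTERVAL / 1e6
    mb_out = (cur["pswpout"] - prev["pswpout"]) * PAGE / INTERVAL / 1e6
    print(f"swap in {mb_in:7.2f} MB/s   swap out {mb_out:7.2f} MB/s")
    prev = cur

(A steady, symmetrical pair of non-zero rates is the pattern named in the thread's subject.)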
>
> Hi Jaroslav,
>
> Do we now have a clear conclusion that mglru-fix-6.6.9.patch made a
> difference? IOW, were you able to reproduce the problem consistently
> without it?
>
> Thanks!


Hi Yu,

the mglru-fix-6.6.9.patch is needed for all kernels >= 6.6.8 up to 6.7. I
tested the new 6.7 (without the mglru fix) and that kernel is fine, as I
cannot trigger the problem there.
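
(One more hedged sketch for anyone reproducing the comparison on other kernels: whether MGLRU is actually active can be confirmed from its documented sysfs knob before blaming or exonerating it.)

# Assumes the documented /sys/kernel/mm/lru_gen/enabled capability mask;
# a non-zero value means the multi-gen LRU is in use.
PATH = "/sys/kernel/mm/lru_gen/enabled"

try:
    with open(PATH) as f:
        mask = int(f.read().strip(), 16)
except FileNotFoundError:
    print("kernel built without CONFIG_LRU_GEN (or too old to have MGLRU)")
else:
    if mask:
        print(f"MGLRU enabled, capability mask {mask:#06x}")
        # To disable at runtime (as root): echo n > /sys/kernel/mm/lru_gen/enabled
    else:
        print("MGLRU present but disabled")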


-- 
Jaroslav Pulchart
Sr. Principal SW Engineer
GoodData


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2024-01-16 17:35 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-11-08 14:35 high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU Jaroslav Pulchart
2023-11-08 18:47 ` Yu Zhao
2023-11-08 20:04   ` Jaroslav Pulchart
2023-11-08 22:09     ` Yu Zhao
2023-11-09  6:39       ` Jaroslav Pulchart
2023-11-09  6:48         ` Yu Zhao
2023-11-09 10:58           ` Jaroslav Pulchart
2023-11-10  1:31             ` Yu Zhao
     [not found]               ` <CAK8fFZ5xUe=JMOxUWgQ-0aqWMXuZYF2EtPOoZQqr89sjrL+zTw@mail.gmail.com>
2023-11-13 20:09                 ` Yu Zhao
2023-11-14  7:29                   ` Jaroslav Pulchart
2023-11-14  7:47                     ` Yu Zhao
2023-11-20  8:41                       ` Jaroslav Pulchart
2023-11-22  6:13                         ` Yu Zhao
2023-11-22  7:12                           ` Jaroslav Pulchart
2023-11-22  7:30                             ` Jaroslav Pulchart
2023-11-22 14:18                               ` Yu Zhao
2023-11-29 13:54                                 ` Jaroslav Pulchart
2023-12-01 23:52                                   ` Yu Zhao
2023-12-07  8:46                                     ` Charan Teja Kalla
2023-12-07 18:23                                       ` Yu Zhao
2023-12-08  8:03                                       ` Jaroslav Pulchart
2024-01-03 21:30                                         ` Jaroslav Pulchart
2024-01-04  3:03                                           ` Yu Zhao
2024-01-04  9:46                                             ` Jaroslav Pulchart
2024-01-04 14:34                                               ` Jaroslav Pulchart
2024-01-04 23:51                                                 ` Igor Raits
2024-01-05 17:35                                                   ` Ertman, David M
2024-01-08 17:53                                                     ` Jaroslav Pulchart
2024-01-16  4:58                                                       ` Yu Zhao
2024-01-16 17:34                                                         ` Jaroslav Pulchart
