* high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
@ 2023-11-08 14:35 Jaroslav Pulchart
2023-11-08 18:47 ` Yu Zhao
0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2023-11-08 14:35 UTC (permalink / raw)
To: linux-mm; +Cc: akpm
Hello,
I would like to report an unpleasant behavior of multi-gen LRU:
strange swap in/out usage on my Dell 7525 two-socket AMD 74F3
system (16 NUMA domains).
Symptoms of my issue are:
/A/ if multi-gen LRU is enabled
1/ [kswapd3] is consuming 100% CPU
top - 15:03:11 up 34 days, 1:51, 2 users, load average: 23.34, 18.26, 15.01
Tasks: 1226 total, 2 running, 1224 sleeping, 0 stopped, 0 zombie
%Cpu(s): 12.5 us, 4.7 sy, 0.0 ni, 82.1 id, 0.0 wa, 0.4 hi, 0.4 si, 0.0 st
MiB Mem : 1047265.+total, 28382.7 free, 1021308.+used, 767.6 buff/cache
MiB Swap: 8192.0 total, 8187.7 free, 4.2 used. 25956.7 avail Mem
...
  765 root      20   0       0      0      0 R  98.3   0.0  34969:04 kswapd3
...
2/ swap space usage is low, about ~4MB out of the 8GB of swap on zram
(the same was observed with disk-backed swap as well, where it caused
IO latency issues due to some kind of locking)
3/ swap in/out is huge and symmetrical, ~12MB/s in and ~12MB/s out
/B/ if multi-gen LRU is disabled
1/ [kswapd3] is consuming 3%-10% CPU
top - 15:02:49 up 34 days, 1:51, 2 users, load average: 23.05, 17.77, 14.77
Tasks: 1226 total, 1 running, 1225 sleeping, 0 stopped, 0 zombie
%Cpu(s): 14.7 us, 2.8 sy, 0.0 ni, 81.8 id, 0.0 wa, 0.4 hi, 0.4 si, 0.0 st
MiB Mem : 1047265.+total, 28378.5 free, 1021313.+used, 767.3 buff/cache
MiB Swap: 8192.0 total, 8189.0 free, 3.0 used. 25952.4 avail Mem
...
  765 root      20   0       0      0      0 S   3.6   0.0  34966:46 [kswapd3]
...
2/ swap space usage is low (4MB)
3/ swap in/out is huge and symmetrical, ~500kB/s in and ~500kB/s out
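For reference, rates like the ones above can be derived from the cumulative
pswpin/pswpout counters in /proc/vmstat. A minimal sketch (the helper names
are hypothetical; assumes 4 KiB base pages, as on x86-64):

```python
# Sketch: derive swap in/out rates from two /proc/vmstat snapshots.
# pswpin/pswpout are cumulative counts of pages swapped in/out.

PAGE_SIZE = 4096

def parse_vmstat(text):
    """Return {counter: value} from /proc/vmstat-style text."""
    return {k: int(v) for k, v in (line.split() for line in text.splitlines() if line)}

def swap_rates(snap0, snap1, interval_s):
    """(in, out) rates in MB/s between two snapshots taken interval_s apart."""
    d_in = snap1["pswpin"] - snap0["pswpin"]
    d_out = snap1["pswpout"] - snap0["pswpout"]
    to_mbs = PAGE_SIZE / (interval_s * 1024 * 1024)
    return d_in * to_mbs, d_out * to_mbs

# Example with made-up counter values: +3072 pages each way in 1s = 12 MB/s.
s0 = parse_vmstat("pswpin 1000000\npswpout 2000000")
s1 = parse_vmstat("pswpin 1003072\npswpout 2003072")
print(swap_rates(s0, s1, interval_s=1.0))  # (12.0, 12.0)
```

On a live system the two snapshots would come from reading /proc/vmstat
twice, a fixed interval apart.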
Both situations are wrong as they are swapping in/out extensively;
however, the multi-gen LRU situation is 10 times worse.
The perf record of case /A/:
- 100.00% 0.00% kswapd3 [kernel.kallsyms] [k] kswapd
   - kswapd
      - 99.88% balance_pgdat
         - 99.84% shrink_node
            - 99.78% shrink_many
               - 61.66% shrink_one
                  - 55.32% try_to_shrink_lruvec
                     - 49.80% try_to_inc_max_seq.constprop.0
                        - 49.53% walk_mm
                           - 49.46% walk_page_range
                              - 49.32% __walk_page_range
                                 - walk_pgd_range
                                    - walk_p4d_range
                                       - walk_pud_range
                                          - 49.02% walk_pmd_range
                                             - 45.94% get_next_vma
                                                - 30.08% mas_find
                                                   - 29.33% mas_walk
                                                        26.83% mtree_range_walk
                                                  2.86% should_skip_vma
                                                  0.58% mas_next_slot
                                               1.25% walk_pmd_range_locked.isra.0
                     - 5.46% evict_folios
                        - 3.41% shrink_folio_list
                           - 1.15% pageout
                              - swap_writepage
                                 - 1.12% swap_writepage_bdev_sync
                                    - 1.01% submit_bio_wait
                                       - 1.00% __submit_bio_noacct
                                          - __submit_bio
                                             - zram_bio_write
                                                - 0.96% zram_write_page
                                                   - 0.82% lzorle_compress
                                                      - lzogeneric1x_1_compress
                                                           0.73% lzo1x_1_do_compress
                             0.68% __remove_mapping
                        - 1.02% isolate_folios
                           - scan_folios
                                0.65% isolate_folio.isra.0
                          0.55% move_folios_to_lru
                  - 5.43% lruvec_is_sizable
                  - 0.93% get_swappiness
                       mem_cgroup_get_nr_swap_pages
               - 32.07% lru_gen_rotate_memcg
                  - 3.23% _raw_spin_lock_irqsave
                       2.32% native_queued_spin_lock_slowpath
                    1.91% get_random_u8
                  - 0.94% _raw_spin_unlock_irqrestore
                     - asm_sysvec_apic_timer_interrupt
                        - sysvec_apic_timer_interrupt
                           - 0.69% __sysvec_apic_timer_interrupt
                              - hrtimer_interrupt
                                 - 0.65% __hrtimer_run_queues
                                    - 0.63% tick_sched_timer
                                       - 0.62% tick_sched_handle
                                          - update_process_times
                                               0.51% scheduler_tick
The perf record of case /B/:
- 100.00% 0.00% kswapd3 [kernel.kallsyms] [k] kswapd
   - kswapd
      - 99.66% balance_pgdat
         - 90.96% shrink_node
            - 75.69% shrink_node_memcgs
               - 25.73% shrink_lruvec
                  - 18.74% get_scan_count
                       2.76% mem_cgroup_get_nr_swap_pages
                  - 2.50% blk_finish_plug
                     - __blk_flush_plug
                          blk_mq_flush_plug_list
                    1.02% shrink_inactive_list
                    1.01% inactive_is_low
               - 17.33% shrink_slab_memcg
                  - 4.02% do_shrink_slab
                     - 1.57% nfs4_xattr_entry_count
                        - list_lru_count_one
                             0.56% __rcu_read_unlock
                     - 0.79% super_cache_count
                          list_lru_count_one
                     - 0.68% nfs4_xattr_cache_count
                        - list_lru_count_one
                             xa_load
                    3.12% _find_next_bit
                    1.87% __radix_tree_lookup
                    0.67% up_read
                    0.67% down_read_trylock
               - 16.34% mem_cgroup_iter
                    0.57% __rcu_read_lock
                    0.54% __rcu_read_unlock
               - 9.36% shrink_slab
                  - do_shrink_slab
                     - 2.37% super_cache_count
                          1.04% list_lru_count_one
                       2.14% count_shadow_nodes
                       1.71% kfree_rcu_shrink_count
                 1.24% vmpressure
            - 15.27% prepare_scan_count
               - 15.04% do_flush_stats
                  - 14.93% cgroup_rstat_flush
                     - cgroup_rstat_flush_locked
                          13.20% mem_cgroup_css_rstat_flush
                          0.78% __blkcg_rstat_flush.isra.0
         - 5.87% shrink_active_list
              2.16% __count_memcg_events
              1.64% _raw_spin_lock_irq
              0.94% isolate_lru_folios
           2.24% mem_cgroup_iter
Could I ask for any suggestions on how to avoid this kswapd utilization
pattern? There is free RAM in each NUMA node to cover the few MB used
in swap:
NUMA stats:
NUMA nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
MemTotal: 65048 65486 65486 65486 65486 65486 65486 65469 65486 65486 65486 65486 65486 65486 65486 65424
MemFree: 468 601 1200 302 548 1879 2321 2478 1967 2239 1453 2417 2623 2833 2530 2269
The swap in/out usage does not make sense to me, nor does the CPU
utilization by multi-gen LRU.
Many thanks and best regards,
--
Jaroslav Pulchart
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
2023-11-08 14:35 high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU Jaroslav Pulchart
@ 2023-11-08 18:47 ` Yu Zhao
2023-11-08 20:04 ` Jaroslav Pulchart
0 siblings, 1 reply; 30+ messages in thread
From: Yu Zhao @ 2023-11-08 18:47 UTC (permalink / raw)
To: Jaroslav Pulchart; +Cc: linux-mm, akpm
Hi Jaroslav,
On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
<jaroslav.pulchart@gooddata.com> wrote:
>
> Hello,
>
> I would like to report to you an unpleasant behavior of multi-gen LRU
> with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> system (16numa domains).
Kernel version please?
> Symptoms of my issue are
>
> /A/ if mult-gen LRU is enabled
> 1/ [kswapd3] is consuming 100% CPU
Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> top - 15:03:11 up 34 days, 1:51, 2 users, load average: 23.34,
> 18.26, 15.01
> Tasks: 1226 total, 2 running, 1224 sleeping, 0 stopped, 0 zombie
> %Cpu(s): 12.5 us, 4.7 sy, 0.0 ni, 82.1 id, 0.0 wa, 0.4 hi,
> 0.4 si, 0.0 st
> MiB Mem : 1047265.+total, 28382.7 free, 1021308.+used, 767.6 buff/cache
> MiB Swap: 8192.0 total, 8187.7 free, 4.2 used. 25956.7 avail Mem
> ...
> 765 root 20 0 0 0 0 R 98.3 0.0
> 34969:04 kswapd3
> ...
> 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> observed with swap disk as well and cause IO latency issues due to
> some kind of locking)
> 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
>
>
> /B/ if mult-gen LRU is disabled
> 1/ [kswapd3] is consuming 3%-10% CPU
> top - 15:02:49 up 34 days, 1:51, 2 users, load average: 23.05,
> 17.77, 14.77
> Tasks: 1226 total, 1 running, 1225 sleeping, 0 stopped, 0 zombie
> %Cpu(s): 14.7 us, 2.8 sy, 0.0 ni, 81.8 id, 0.0 wa, 0.4 hi,
> 0.4 si, 0.0 st
> MiB Mem : 1047265.+total, 28378.5 free, 1021313.+used, 767.3 buff/cache
> MiB Swap: 8192.0 total, 8189.0 free, 3.0 used. 25952.4 avail Mem
> ...
> 765 root 20 0 0 0 0 S 3.6 0.0
> 34966:46 [kswapd3]
> ...
> 2/ swap space usage is low (4MB)
> 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
>
> Both situations are wrong as they are using swap in/out extensively,
> however the multi-gen LRU situation is 10times worse.
From the stats below, node 3 had the lowest free memory. So I think in
both cases, the reclaim activities were as expected.
> Could I ask for any suggestions on how to avoid the kswapd utilization
> pattern?
The easiest way is to disable NUMA domain so that there would be only
two nodes with 8x more memory. IOW, you have fewer pools but each pool
has more memory and therefore they are less likely to become empty.
> There is a free RAM in each numa node for the few MB used in
> swap:
> NUMA stats:
> NUMA nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> MemTotal: 65048 65486 65486 65486 65486 65486 65486 65469 65486
> 65486 65486 65486 65486 65486 65486 65424
> MemFree: 468 601 1200 302 548 1879 2321 2478 1967 2239 1453 2417
> 2623 2833 2530 2269
> the in/out usage does not make sense for me nor the CPU utilization by
> multi-gen LRU.
My questions:
1. Were there any OOM kills with either case?
2. Was THP enabled?
MGLRU might have spent the extra CPU cycles just to avoid OOM kills or
to produce more THPs.
If disabling the NUMA domain isn't an option, I'd recommend:
1. Try the latest kernel (6.6.1) if you haven't.
2. Disable THP if it was enabled, to verify whether it has an impact.
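For completeness, both knobs can be inspected at runtime via sysfs (the paths
are the standard ones from the kernel admin guide; the helper function itself
is a hypothetical sketch):

```python
# Sketch: inspect the THP mode and the MGLRU state via sysfs.
from pathlib import Path

def current_mode(text):
    """Extract the bracketed choice, e.g. 'always [madvise] never' -> 'madvise'."""
    return text.split("[", 1)[1].split("]", 1)[0]

# On a live system (assuming sysfs is mounted at /sys):
# thp = current_mode(Path("/sys/kernel/mm/transparent_hugepage/enabled").read_text())
# mglru = Path("/sys/kernel/mm/lru_gen/enabled").read_text().strip()  # "0x0000" = disabled

print(current_mode("always [madvise] never"))  # madvise
```

Writing "never" to the transparent_hugepage/enabled file (or "n" to
lru_gen/enabled) toggles the respective feature for the experiment.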
Thanks.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
2023-11-08 18:47 ` Yu Zhao
@ 2023-11-08 20:04 ` Jaroslav Pulchart
2023-11-08 22:09 ` Yu Zhao
0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2023-11-08 20:04 UTC (permalink / raw)
To: Yu Zhao; +Cc: linux-mm, akpm, Igor Raits, Daniel Secik
>
> Hi Jaroslav,
Hi Yu Zhao
thanks for response, see answers inline:
>
> On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> <jaroslav.pulchart@gooddata.com> wrote:
> >
> > Hello,
> >
> > I would like to report to you an unpleasant behavior of multi-gen LRU
> > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > system (16numa domains).
>
> Kernel version please?
6.5.y, but we saw it earlier as well; it has been under investigation
since May 23rd (on 6.4.y and maybe even 6.3.y).
>
> > Symptoms of my issue are
> >
> > /A/ if mult-gen LRU is enabled
> > 1/ [kswapd3] is consuming 100% CPU
>
> Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
>
> > top - 15:03:11 up 34 days, 1:51, 2 users, load average: 23.34,
> > 18.26, 15.01
> > Tasks: 1226 total, 2 running, 1224 sleeping, 0 stopped, 0 zombie
> > %Cpu(s): 12.5 us, 4.7 sy, 0.0 ni, 82.1 id, 0.0 wa, 0.4 hi,
> > 0.4 si, 0.0 st
> > MiB Mem : 1047265.+total, 28382.7 free, 1021308.+used, 767.6 buff/cache
> > MiB Swap: 8192.0 total, 8187.7 free, 4.2 used. 25956.7 avail Mem
> > ...
> > 765 root 20 0 0 0 0 R 98.3 0.0
> > 34969:04 kswapd3
> > ...
> > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > observed with swap disk as well and cause IO latency issues due to
> > some kind of locking)
> > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> >
> >
> > /B/ if mult-gen LRU is disabled
> > 1/ [kswapd3] is consuming 3%-10% CPU
> > top - 15:02:49 up 34 days, 1:51, 2 users, load average: 23.05,
> > 17.77, 14.77
> > Tasks: 1226 total, 1 running, 1225 sleeping, 0 stopped, 0 zombie
> > %Cpu(s): 14.7 us, 2.8 sy, 0.0 ni, 81.8 id, 0.0 wa, 0.4 hi,
> > 0.4 si, 0.0 st
> > MiB Mem : 1047265.+total, 28378.5 free, 1021313.+used, 767.3 buff/cache
> > MiB Swap: 8192.0 total, 8189.0 free, 3.0 used. 25952.4 avail Mem
> > ...
> > 765 root 20 0 0 0 0 S 3.6 0.0
> > 34966:46 [kswapd3]
> > ...
> > 2/ swap space usage is low (4MB)
> > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> >
> > Both situations are wrong as they are using swap in/out extensively,
> > however the multi-gen LRU situation is 10times worse.
>
> From the stats below, node 3 had the lowest free memory. So I think in
> both cases, the reclaim activities were as expected.
I do not see a reason for the memory pressure and reclaims. This node
has the lowest free memory of all nodes (~302MB free), that is true;
however, the swap space usage is just 4MB (still going in and out). So
what can be the reason for that behaviour?
The workers/applications run in pre-allocated HugePages and the rest
is used by a small set of system services and device drivers. It is
static and not growing. The issue persists even when I stop the system
services and free the memory.
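As a side note, the per-node HugePage accounting can be read from sysfs to
confirm the pool really is static. A small sketch (paths per the kernel's
hugetlbpage documentation; the helper name is hypothetical):

```python
# Sketch: usage of one node's 1 GiB HugePage pool.
from pathlib import Path

def hugepage_usage_gib(nr, free, size_kb=1048576):
    """(total GiB, used GiB) for one node's hugepage pool."""
    to_gib = size_kb / (1024 * 1024)
    return nr * to_gib, (nr - free) * to_gib

# Live read (assuming the standard sysfs layout):
# base = Path("/sys/devices/system/node/node3/hugepages/hugepages-1048576kB")
# nr = int((base / "nr_hugepages").read_text())
# free = int((base / "free_hugepages").read_text())
# print(hugepage_usage_gib(nr, free))

print(hugepage_usage_gib(61, 0))  # (61.0, 61.0)
```

A fully used, fixed-size pool here would confirm that the reclaim pressure
comes from the small non-HugePage remainder of the node.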
>
> > Could I ask for any suggestions on how to avoid the kswapd utilization
> > pattern?
>
> The easiest way is to disable NUMA domain so that there would be only
> two nodes with 8x more memory. IOW, you have fewer pools but each pool
> has more memory and therefore they are less likely to become empty.
>
> > There is a free RAM in each numa node for the few MB used in
> > swap:
> > NUMA stats:
> > NUMA nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> > MemTotal: 65048 65486 65486 65486 65486 65486 65486 65469 65486
> > 65486 65486 65486 65486 65486 65486 65424
> > MemFree: 468 601 1200 302 548 1879 2321 2478 1967 2239 1453 2417
> > 2623 2833 2530 2269
> > the in/out usage does not make sense for me nor the CPU utilization by
> > multi-gen LRU.
>
> My questions:
> 1. Were there any OOM kills with either case?
There are no OOM kills. Neither the memory usage nor the swap space
usage is growing; it stays at a few MB.
> 2. Was THP enabled?
Both situations occur with THP enabled and with it disabled.
> MGLRU might have spent the extra CPU cycles just to void OOM kills or
> produce more THPs.
>
> If disabling the NUMA domain isn't an option, I'd recommend:
Disabling NUMA is not an option. However, we are now testing a setup
with 1GB less in HugePages per NUMA node.
> 1. Try the latest kernel (6.6.1) if you haven't.
Not yet; 6.6.1 was only released today.
> 2. Disable THP if it was enabled, to verify whether it has an impact.
I tried disabling THP, without any effect.
>
> Thanks.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
2023-11-08 20:04 ` Jaroslav Pulchart
@ 2023-11-08 22:09 ` Yu Zhao
2023-11-09 6:39 ` Jaroslav Pulchart
0 siblings, 1 reply; 30+ messages in thread
From: Yu Zhao @ 2023-11-08 22:09 UTC (permalink / raw)
To: Jaroslav Pulchart
Cc: linux-mm, akpm, Igor Raits, Daniel Secik, Charan Teja Kalla
[-- Attachment #1: Type: text/plain, Size: 5724 bytes --]
On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
<jaroslav.pulchart@gooddata.com> wrote:
>
> >
> > Hi Jaroslav,
>
> Hi Yu Zhao
>
> thanks for response, see answers inline:
>
> >
> > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > <jaroslav.pulchart@gooddata.com> wrote:
> > >
> > > Hello,
> > >
> > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > system (16numa domains).
> >
> > Kernel version please?
>
> 6.5.y, but we saw it sooner as it is in investigation from 23th May
> (6.4.y and maybe even the 6.3.y).
v6.6 has a few critical fixes for MGLRU; I can backport them to v6.5
for you if you run into other problems with v6.6.
> > > Symptoms of my issue are
> > >
> > > /A/ if mult-gen LRU is enabled
> > > 1/ [kswapd3] is consuming 100% CPU
> >
> > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> >
> > > top - 15:03:11 up 34 days, 1:51, 2 users, load average: 23.34,
> > > 18.26, 15.01
> > > Tasks: 1226 total, 2 running, 1224 sleeping, 0 stopped, 0 zombie
> > > %Cpu(s): 12.5 us, 4.7 sy, 0.0 ni, 82.1 id, 0.0 wa, 0.4 hi,
> > > 0.4 si, 0.0 st
> > > MiB Mem : 1047265.+total, 28382.7 free, 1021308.+used, 767.6 buff/cache
> > > MiB Swap: 8192.0 total, 8187.7 free, 4.2 used. 25956.7 avail Mem
> > > ...
> > > 765 root 20 0 0 0 0 R 98.3 0.0
> > > 34969:04 kswapd3
> > > ...
> > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > > observed with swap disk as well and cause IO latency issues due to
> > > some kind of locking)
> > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > >
> > >
> > > /B/ if mult-gen LRU is disabled
> > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > top - 15:02:49 up 34 days, 1:51, 2 users, load average: 23.05,
> > > 17.77, 14.77
> > > Tasks: 1226 total, 1 running, 1225 sleeping, 0 stopped, 0 zombie
> > > %Cpu(s): 14.7 us, 2.8 sy, 0.0 ni, 81.8 id, 0.0 wa, 0.4 hi,
> > > 0.4 si, 0.0 st
> > > MiB Mem : 1047265.+total, 28378.5 free, 1021313.+used, 767.3 buff/cache
> > > MiB Swap: 8192.0 total, 8189.0 free, 3.0 used. 25952.4 avail Mem
> > > ...
> > > 765 root 20 0 0 0 0 S 3.6 0.0
> > > 34966:46 [kswapd3]
> > > ...
> > > 2/ swap space usage is low (4MB)
> > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > >
> > > Both situations are wrong as they are using swap in/out extensively,
> > > however the multi-gen LRU situation is 10times worse.
> >
> > From the stats below, node 3 had the lowest free memory. So I think in
> > both cases, the reclaim activities were as expected.
>
> I do not see a reason for the memory pressure and reclaims. This node
> has the lowest free memory of all nodes (~302MB free) that is true,
> however the swap space usage is just 4MB (still going in and out). So
> what can be the reason for that behaviour?
The best analogy is that refuel (reclaim) happens before the tank
becomes empty, and it happens even sooner when there is a long road
ahead (high order allocations).
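The analogy can be sketched as a toy model of the kswapd watermark logic
(illustrative only; in the kernel the padding for high-order allocations
comes from compact_gap() in mm/vmscan.c, and the real check is
zone_watermark_ok()):

```python
# Toy model: kswapd wakes when free pages fall below the low watermark and
# reclaims until the high watermark is restored; for high-order allocations
# the target is padded so compaction has room to work (cf. compact_gap()).

def kswapd_targets(free_pages, low_wmark, high_wmark, order=0):
    compact_gap = (2 << order) if order else 0  # simplified from the kernel
    should_reclaim = free_pages < low_wmark
    return should_reclaim, high_wmark + compact_gap

# A node that still has "free" memory is woken for an order-3 allocation
# and must reclaim past the plain high watermark:
print(kswapd_targets(free_pages=900, low_wmark=1000, high_wmark=1200, order=3))
# (True, 1216)
```

This is why a node can keep reclaiming (and swapping) even though a few
hundred MB appear free: the target is the watermark plus the gap, not zero.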
> The workers/application is running in pre-allocated HugePages and the
> rest is used for a small set of system services and drivers of
> devices. It is static and not growing. The issue persists when I stop
> the system services and free the memory.
Yes, this helps. Also could you attach /proc/buddyinfo from the moment
you hit the problem?
> > > Could I ask for any suggestions on how to avoid the kswapd utilization
> > > pattern?
> >
> > The easiest way is to disable NUMA domain so that there would be only
> > two nodes with 8x more memory. IOW, you have fewer pools but each pool
> > has more memory and therefore they are less likely to become empty.
> >
> > > There is a free RAM in each numa node for the few MB used in
> > > swap:
> > > NUMA stats:
> > > NUMA nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> > > MemTotal: 65048 65486 65486 65486 65486 65486 65486 65469 65486
> > > 65486 65486 65486 65486 65486 65486 65424
> > > MemFree: 468 601 1200 302 548 1879 2321 2478 1967 2239 1453 2417
> > > 2623 2833 2530 2269
> > > the in/out usage does not make sense for me nor the CPU utilization by
> > > multi-gen LRU.
> >
> > My questions:
> > 1. Were there any OOM kills with either case?
>
> There is no OOM. The memory usage is not growing nor the swap space
> usage, it is still a few MB there.
>
> > 2. Was THP enabled?
>
> Both situations with enabled and with disabled THP.
My suspicion is that you packed node 3 too perfectly :) And that might
have triggered a known but currently low-priority problem in MGLRU.
I'm attaching a patch for v6.6, hoping you could verify it for me in
case v6.6 by itself still has the problem.
> > MGLRU might have spent the extra CPU cycles just to void OOM kills or
> > produce more THPs.
> >
> > If disabling the NUMA domain isn't an option, I'd recommend:
>
> Disabling numa is not an option. However we are now testing a setup
> with -1GB in HugePages per each numa.
>
> > 1. Try the latest kernel (6.6.1) if you haven't.
>
> Not yet, the 6.6.1 was released today.
>
> > 2. Disable THP if it was enabled, to verify whether it has an impact.
>
> I try disabling THP without any effect.
Gotcha. Please try the patch with MGLRU and let me know. Thanks!
(Also CC'ing Charan @ Qualcomm, who initially reported the problem
that ended up with the attached patch.)
[-- Attachment #2: 0001-mm-mglru-curb-kswapd-overshooting-high-wmarks.patch --]
[-- Type: application/octet-stream, Size: 3209 bytes --]
From a188169d26b2d40fe0a91393761cf2292984545c Mon Sep 17 00:00:00 2001
From: Yu Zhao <yuzhao@google.com>
Date: Wed, 8 Nov 2023 14:56:58 -0700
Subject: [PATCH] mm/mglru: curb kswapd overshooting high wmarks
Signed-off-by: Yu Zhao <yuzhao@google.com>
---
mm/vmscan.c | 40 +++++++++++++++++++++++++++++++++-------
1 file changed, 33 insertions(+), 7 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6f13394b112e..dc0bd2cc27e0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5341,20 +5341,47 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool
return try_to_inc_max_seq(lruvec, max_seq, sc, can_swap, false) ? -1 : 0;
}
-static unsigned long get_nr_to_reclaim(struct scan_control *sc)
+static unsigned long get_nr_to_reclaim(struct lruvec *lruvec, struct scan_control *sc)
{
+ int i;
+ unsigned long nr_to_reclaim;
+
/* don't abort memcg reclaim to ensure fairness */
if (!root_reclaim(sc))
return -1;
- return max(sc->nr_to_reclaim, compact_gap(sc->order));
+ nr_to_reclaim = max(sc->nr_to_reclaim, compact_gap(sc->order));
+ if (sc->nr_reclaimed >= nr_to_reclaim)
+ return 0;
+
+ /* don't abort direct reclaim to avoid premature OOM */
+ if (!current_is_kswapd())
+ return nr_to_reclaim;
+
+ /* abort only if all eligible zones are balanced */
+ for (i = 0; i <= sc->reclaim_idx; i++) {
+ unsigned long wmark;
+ struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
+
+ if (!managed_zone(zone))
+ continue;
+
+ if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
+ wmark = wmark_pages(zone, WMARK_PROMO);
+ else
+ wmark = high_wmark_pages(zone);
+
+ if (!zone_watermark_ok_safe(zone, sc->order, wmark, sc->reclaim_idx))
+ return nr_to_reclaim;
+ }
+
+ return i > sc->reclaim_idx ? 0 : nr_to_reclaim;
}
static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
{
long nr_to_scan;
unsigned long scanned = 0;
- unsigned long nr_to_reclaim = get_nr_to_reclaim(sc);
int swappiness = get_swappiness(lruvec, sc);
/* clean file folios are more likely to exist */
@@ -5376,7 +5403,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
if (scanned >= nr_to_scan)
break;
- if (sc->nr_reclaimed >= nr_to_reclaim)
+ if (sc->nr_reclaimed >= get_nr_to_reclaim(lruvec, sc))
break;
cond_resched();
@@ -5437,7 +5464,6 @@ static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
struct lru_gen_folio *lrugen;
struct mem_cgroup *memcg;
const struct hlist_nulls_node *pos;
- unsigned long nr_to_reclaim = get_nr_to_reclaim(sc);
bin = first_bin = get_random_u32_below(MEMCG_NR_BINS);
restart:
@@ -5470,7 +5496,7 @@ static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
rcu_read_lock();
- if (sc->nr_reclaimed >= nr_to_reclaim)
+ if (sc->nr_reclaimed >= get_nr_to_reclaim(lruvec, sc))
break;
}
@@ -5481,7 +5507,7 @@ static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
mem_cgroup_put(memcg);
- if (sc->nr_reclaimed >= nr_to_reclaim)
+ if (!is_a_nulls(pos))
return;
/* restart if raced with lru_gen_rotate_memcg() */
--
2.42.0.869.gea05f2083d-goog
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
2023-11-08 22:09 ` Yu Zhao
@ 2023-11-09 6:39 ` Jaroslav Pulchart
2023-11-09 6:48 ` Yu Zhao
0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2023-11-09 6:39 UTC (permalink / raw)
To: Yu Zhao; +Cc: linux-mm, akpm, Igor Raits, Daniel Secik, Charan Teja Kalla
>
> On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> <jaroslav.pulchart@gooddata.com> wrote:
> >
> > >
> > > Hi Jaroslav,
> >
> > Hi Yu Zhao
> >
> > thanks for response, see answers inline:
> >
> > >
> > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > <jaroslav.pulchart@gooddata.com> wrote:
> > > >
> > > > Hello,
> > > >
> > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > > system (16numa domains).
> > >
> > > Kernel version please?
> >
> > 6.5.y, but we saw it sooner as it is in investigation from 23th May
> > (6.4.y and maybe even the 6.3.y).
>
> v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> for you if you run into other problems with v6.6.
>
I will give it a try using 6.6.y. If it works, we can switch to 6.6.y
instead of backporting the fixes to 6.5.y.
> > > > Symptoms of my issue are
> > > >
> > > > /A/ if mult-gen LRU is enabled
> > > > 1/ [kswapd3] is consuming 100% CPU
> > >
> > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > >
> > > > top - 15:03:11 up 34 days, 1:51, 2 users, load average: 23.34,
> > > > 18.26, 15.01
> > > > Tasks: 1226 total, 2 running, 1224 sleeping, 0 stopped, 0 zombie
> > > > %Cpu(s): 12.5 us, 4.7 sy, 0.0 ni, 82.1 id, 0.0 wa, 0.4 hi,
> > > > 0.4 si, 0.0 st
> > > > MiB Mem : 1047265.+total, 28382.7 free, 1021308.+used, 767.6 buff/cache
> > > > MiB Swap: 8192.0 total, 8187.7 free, 4.2 used. 25956.7 avail Mem
> > > > ...
> > > > 765 root 20 0 0 0 0 R 98.3 0.0
> > > > 34969:04 kswapd3
> > > > ...
> > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > > > observed with swap disk as well and cause IO latency issues due to
> > > > some kind of locking)
> > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > > >
> > > >
> > > > /B/ if mult-gen LRU is disabled
> > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > top - 15:02:49 up 34 days, 1:51, 2 users, load average: 23.05,
> > > > 17.77, 14.77
> > > > Tasks: 1226 total, 1 running, 1225 sleeping, 0 stopped, 0 zombie
> > > > %Cpu(s): 14.7 us, 2.8 sy, 0.0 ni, 81.8 id, 0.0 wa, 0.4 hi,
> > > > 0.4 si, 0.0 st
> > > > MiB Mem : 1047265.+total, 28378.5 free, 1021313.+used, 767.3 buff/cache
> > > > MiB Swap: 8192.0 total, 8189.0 free, 3.0 used. 25952.4 avail Mem
> > > > ...
> > > > 765 root 20 0 0 0 0 S 3.6 0.0
> > > > 34966:46 [kswapd3]
> > > > ...
> > > > 2/ swap space usage is low (4MB)
> > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > > >
> > > > Both situations are wrong as they are using swap in/out extensively,
> > > > however the multi-gen LRU situation is 10times worse.
> > >
> > > From the stats below, node 3 had the lowest free memory. So I think in
> > > both cases, the reclaim activities were as expected.
> >
> > I do not see a reason for the memory pressure and reclaims. This node
> > has the lowest free memory of all nodes (~302MB free) that is true,
> > however the swap space usage is just 4MB (still going in and out). So
> > what can be the reason for that behaviour?
>
> The best analogy is that refuel (reclaim) happens before the tank
> becomes empty, and it happens even sooner when there is a long road
> ahead (high order allocations).
>
> > The workers/application is running in pre-allocated HugePages and the
> > rest is used for a small set of system services and drivers of
> > devices. It is static and not growing. The issue persists when I stop
> > the system services and free the memory.
>
> Yes, this helps.
> Also could you attach /proc/buddyinfo from the moment
> you hit the problem?
>
I can. The problem is continuous: it is doing swap in/out 100% of the
time, consuming 100% of a CPU and blocking IO.
The output of /proc/buddyinfo is:
# cat /proc/buddyinfo
Node 0, zone DMA 7 2 2 1 1 2 1 1 1 2 1
Node 0, zone DMA32 4567 3395 1357 846 439 190 93 61 43 23 4
Node 0, zone Normal 19 190 140 129 136 75 66 41 9 1 5
Node 1, zone Normal 194 1210 2080 1800 715 255 111 56 42 36 55
Node 2, zone Normal 204 768 3766 3394 1742 468 185 194 238 47 74
Node 3, zone Normal 1622 2137 1058 846 388 208 97 44 14 42 10
Node 4, zone Normal 282 705 623 274 184 90 63 41 11 1 28
Node 5, zone Normal 505 620 6180 3706 1724 1083 592 410 417 168 70
Node 6, zone Normal 1120 357 3314 3437 2264 872 606 209 215 123 265
Node 7, zone Normal 365 5499 12035 7486 3845 1743 635 243 309 292 78
Node 8, zone Normal 248 740 2280 1094 1225 2087 846 308 192 65 55
Node 9, zone Normal 356 763 1625 944 740 1920 1174 696 217 235 111
Node 10, zone Normal 727 1479 7002 6114 2487 1084 407 269 157 78 16
Node 11, zone Normal 189 3287 9141 5039 2560 1183 1247 693 506 252 8
Node 12, zone Normal 142 378 1317 466 1512 1568 646 359 248 264 228
Node 13, zone Normal 444 1977 3173 2625 2105 1493 931 600 369 266 230
Node 14, zone Normal 376 221 120 360 2721 2378 1521 826 442 204 59
Node 15, zone Normal 1210 966 922 2046 4128 2904 1518 744 352 102 58
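Incidentally, the per-order counts above fold back into free memory per zone
(each column is free blocks of order 0..10). A small sketch (hypothetical
helpers, assuming 4 KiB base pages):

```python
# Sketch: turn one /proc/buddyinfo line into free memory for that zone.
PAGE_SIZE = 4096  # x86-64 base page size

def parse_buddyinfo_line(line):
    """'Node 3, zone Normal 1622 ...' -> (node, zone, per-order free-block counts)."""
    parts = line.split()
    return int(parts[1].rstrip(",")), parts[3], [int(c) for c in parts[4:]]

def free_mib(counts):
    """Sum count << order pages over all orders, converted to MiB."""
    pages = sum(c << order for order, c in enumerate(counts))
    return pages * PAGE_SIZE / (1024 * 1024)

node, zone, counts = parse_buddyinfo_line(
    "Node 3, zone Normal 1622 2137 1058 846 388 208 97 44 14 42 10")
print(node, zone, free_mib(counts))  # 3 Normal 300.5
```

The ~300 MiB result for node 3 matches the ~302MB MemFree reported earlier
in the thread, and the per-order spread shows how fragmented that remainder is.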
> > > > Could I ask for any suggestions on how to avoid the kswapd utilization
> > > > pattern?
> > >
> > > The easiest way is to disable NUMA domain so that there would be only
> > > two nodes with 8x more memory. IOW, you have fewer pools but each pool
> > > has more memory and therefore they are less likely to become empty.
> > >
> > > > There is a free RAM in each numa node for the few MB used in
> > > > swap:
> > > > NUMA stats:
> > > > NUMA nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> > > > MemTotal: 65048 65486 65486 65486 65486 65486 65486 65469 65486
> > > > 65486 65486 65486 65486 65486 65486 65424
> > > > MemFree: 468 601 1200 302 548 1879 2321 2478 1967 2239 1453 2417
> > > > 2623 2833 2530 2269
> > > > the in/out usage does not make sense for me nor the CPU utilization by
> > > > multi-gen LRU.
> > >
> > > My questions:
> > > 1. Were there any OOM kills with either case?
> >
> > There is no OOM. The memory usage is not growing nor the swap space
> > usage, it is still a few MB there.
> >
> > > 2. Was THP enabled?
> >
> > Both situations with enabled and with disabled THP.
>
> My suspicion is that you packed the node 3 too perfectly :) And that
> might have triggered a known but currently a low priority problem in
> MGLRU. I'm attaching a patch for v6.6 and hoping you could verify it
> for me in case v6.6 by itself still has the problem?
>
I would not focus just on node 3; we had issues on different servers
with node 0 and node 2 both in parallel, but mostly it is node 3.
How our setup looks:
* each node has 64GB of RAM,
* 61GB of it is in 1GB HugePages,
* the remaining 3GB is used by the host system.
There are KVM VMs running with vCPUs pinned to the NUMA domains and
using the HugePages (topology is exposed to the VMs, no overcommit,
no shared CPUs); the qemu-kvm threads are pinned to the same NUMA
domain as the vCPUs. System services are not pinned. I'm not sure why
node 3 is used the most, as the VMs are balanced and the host's system
services can move between domains.
> > > MGLRU might have spent the extra CPU cycles just to void OOM kills or
> > > produce more THPs.
> > >
> > > If disabling the NUMA domain isn't an option, I'd recommend:
> >
> > Disabling numa is not an option. However we are now testing a setup
> > with -1GB in HugePages per each numa.
> >
> > > 1. Try the latest kernel (6.6.1) if you haven't.
> >
> > Not yet, the 6.6.1 was released today.
> >
> > > 2. Disable THP if it was enabled, to verify whether it has an impact.
> >
> > I try disabling THP without any effect.
>
> Gochat. Please try the patch with MGLRU and let me know. Thanks!
>
> (Also CC Charan @ Qualcomm who initially reported the problem that
> ended up with the attached patch.)
I can try it. Will let you know.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
2023-11-09 6:39 ` Jaroslav Pulchart
@ 2023-11-09 6:48 ` Yu Zhao
2023-11-09 10:58 ` Jaroslav Pulchart
0 siblings, 1 reply; 30+ messages in thread
From: Yu Zhao @ 2023-11-09 6:48 UTC (permalink / raw)
To: Jaroslav Pulchart
Cc: linux-mm, akpm, Igor Raits, Daniel Secik, Charan Teja Kalla
On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
<jaroslav.pulchart@gooddata.com> wrote:
>
> >
> > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> > <jaroslav.pulchart@gooddata.com> wrote:
> > >
> > > >
> > > > Hi Jaroslav,
> > >
> > > Hi Yu Zhao
> > >
> > > thanks for response, see answers inline:
> > >
> > > >
> > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > > > system (16numa domains).
> > > >
> > > > Kernel version please?
> > >
> > > 6.5.y, but we saw it sooner as it is in investigation from 23th May
> > > (6.4.y and maybe even the 6.3.y).
> >
> > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> > for you if you run into other problems with v6.6.
> >
>
> I will give it a try using 6.6.y. When it works, we can switch to
> 6.6.y instead of backporting the fixes to 6.5.y.
>
> > > > > Symptoms of my issue are
> > > > >
> > > > > /A/ if multi-gen LRU is enabled
> > > > > 1/ [kswapd3] is consuming 100% CPU
> > > >
> > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > > >
> > > > > top - 15:03:11 up 34 days, 1:51, 2 users, load average: 23.34,
> > > > > 18.26, 15.01
> > > > > Tasks: 1226 total, 2 running, 1224 sleeping, 0 stopped, 0 zombie
> > > > > %Cpu(s): 12.5 us, 4.7 sy, 0.0 ni, 82.1 id, 0.0 wa, 0.4 hi,
> > > > > 0.4 si, 0.0 st
> > > > > MiB Mem : 1047265.+total, 28382.7 free, 1021308.+used, 767.6 buff/cache
> > > > > MiB Swap: 8192.0 total, 8187.7 free, 4.2 used. 25956.7 avail Mem
> > > > > ...
> > > > > 765 root 20 0 0 0 0 R 98.3 0.0
> > > > > 34969:04 kswapd3
> > > > > ...
> > > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > > > > observed with swap disk as well and cause IO latency issues due to
> > > > > some kind of locking)
> > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > > > >
> > > > >
> > > > > /B/ if multi-gen LRU is disabled
> > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > > top - 15:02:49 up 34 days, 1:51, 2 users, load average: 23.05,
> > > > > 17.77, 14.77
> > > > > Tasks: 1226 total, 1 running, 1225 sleeping, 0 stopped, 0 zombie
> > > > > %Cpu(s): 14.7 us, 2.8 sy, 0.0 ni, 81.8 id, 0.0 wa, 0.4 hi,
> > > > > 0.4 si, 0.0 st
> > > > > MiB Mem : 1047265.+total, 28378.5 free, 1021313.+used, 767.3 buff/cache
> > > > > MiB Swap: 8192.0 total, 8189.0 free, 3.0 used. 25952.4 avail Mem
> > > > > ...
> > > > > 765 root 20 0 0 0 0 S 3.6 0.0
> > > > > 34966:46 [kswapd3]
> > > > > ...
> > > > > 2/ swap space usage is low (4MB)
> > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > > > >
> > > > > Both situations are wrong as they are using swap in/out extensively,
> > > > > however the multi-gen LRU situation is 10 times worse.
> > > >
> > > > From the stats below, node 3 had the lowest free memory. So I think in
> > > > both cases, the reclaim activities were as expected.
> > >
> > > I do not see a reason for the memory pressure and reclaims. This node
> > > has the lowest free memory of all nodes (~302MB free) that is true,
> > > however the swap space usage is just 4MB (still going in and out). So
> > > what can be the reason for that behaviour?
> >
> > The best analogy is that refuel (reclaim) happens before the tank
> > becomes empty, and it happens even sooner when there is a long road
> > ahead (high order allocations).
> >
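[Editor's note: Yu's refuelling analogy maps onto the kernel's per-zone watermarks: kswapd is woken when a zone's free pages drop below the `low` watermark and keeps reclaiming until they climb back above `high`, so reclaim starts well before the zone is empty. Below is a minimal illustrative sketch of that decision, not the kernel's actual code; the watermark numbers are made up for illustration.]

```python
def kswapd_should_reclaim(nr_free, low, high, reclaiming):
    """Illustrative sketch of per-zone watermark logic:
    kswapd wakes below 'low' and keeps reclaiming until 'high'."""
    if reclaiming:
        return nr_free < high   # keep going until comfortably above 'high'
    return nr_free < low        # wake up once free pages drop below 'low'

# With a tightly packed node, allocations can pull free pages back under
# 'low' as fast as kswapd pushes them over 'high', causing the oscillation.
print(kswapd_should_reclaim(70000, 80000, 100000, reclaiming=False))   # True: wake
print(kswapd_should_reclaim(95000, 80000, 100000, reclaiming=True))    # True: continue
print(kswapd_should_reclaim(105000, 80000, 100000, reclaiming=True))   # False: sleep
```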
> > > The workers/application is running in pre-allocated HugePages and the
> > > rest is used for a small set of system services and drivers of
> > > devices. It is static and not growing. The issue persists when I stop
> > > the system services and free the memory.
> >
> > Yes, this helps.
> > Also could you attach /proc/buddyinfo from the moment
> > you hit the problem?
> >
>
> I can. The problem is continuous: it is doing swap in/out 100% of the
> time, consuming 100% of a CPU, and blocking IO.
>
> The output of /proc/buddyinfo is:
>
> # cat /proc/buddyinfo
> Node 0, zone DMA 7 2 2 1 1 2 1
> 1 1 2 1
> Node 0, zone DMA32 4567 3395 1357 846 439 190 93
> 61 43 23 4
> Node 0, zone Normal 19 190 140 129 136 75 66
> 41 9 1 5
> Node 1, zone Normal 194 1210 2080 1800 715 255 111
> 56 42 36 55
> Node 2, zone Normal 204 768 3766 3394 1742 468 185
> 194 238 47 74
> Node 3, zone Normal 1622 2137 1058 846 388 208 97
> 44 14 42 10
Again, thinking out loud: there is only one zone on node 3, i.e., the
normal zone, and this rules out the problem fixed in v6.6 by commit
669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
reclaim").
> Node 4, zone Normal 282 705 623 274 184 90 63
> 41 11 1 28
> Node 5, zone Normal 505 620 6180 3706 1724 1083 592
> 410 417 168 70
> Node 6, zone Normal 1120 357 3314 3437 2264 872 606
> 209 215 123 265
> Node 7, zone Normal 365 5499 12035 7486 3845 1743 635
> 243 309 292 78
> Node 8, zone Normal 248 740 2280 1094 1225 2087 846
> 308 192 65 55
> Node 9, zone Normal 356 763 1625 944 740 1920 1174
> 696 217 235 111
> Node 10, zone Normal 727 1479 7002 6114 2487 1084
> 407 269 157 78 16
> Node 11, zone Normal 189 3287 9141 5039 2560 1183
> 1247 693 506 252 8
> Node 12, zone Normal 142 378 1317 466 1512 1568
> 646 359 248 264 228
> Node 13, zone Normal 444 1977 3173 2625 2105 1493
> 931 600 369 266 230
> Node 14, zone Normal 376 221 120 360 2721 2378
> 1521 826 442 204 59
> Node 15, zone Normal 1210 966 922 2046 4128 2904
> 1518 744 352 102 58
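[Editor's note: each /proc/buddyinfo column is the count of free blocks of a given order (order 0 = one base page, order 10 = 1024 contiguous pages). The sketch below, assuming 4 KiB base pages, re-joins the wrapped node 3 row and converts the counts to total free memory; the result, ~300 MiB, matches the ~302 MB MemFree for node 3 quoted later in this message.]

```python
PAGE_KIB = 4  # assumption: 4 KiB base pages (x86-64 default)

def parse_buddyinfo_row(row):
    """Parse one /proc/buddyinfo row into (node, zone, per-order counts)."""
    left, right = row.split("zone")
    node = int(left.split(",")[0].split()[1])
    parts = right.split()
    return node, parts[0], [int(c) for c in parts[1:]]

def free_kib(counts):
    # an order-i block is 2**i contiguous base pages
    return sum(n * (2 ** i) * PAGE_KIB for i, n in enumerate(counts))

# node 3 row from above, re-joined (the archive wrapped it across lines)
row = "Node 3, zone Normal 1622 2137 1058 846 388 208 97 44 14 42 10"
node, zone, counts = parse_buddyinfo_row(row)
print(node, zone, free_kib(counts) // 1024, "MiB")  # 3 Normal 300 MiB
```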
>
>
> > > > > Could I ask for any suggestions on how to avoid the kswapd utilization
> > > > > pattern?
> > > >
> > > > The easiest way is to disable NUMA domain so that there would be only
> > > > two nodes with 8x more memory. IOW, you have fewer pools but each pool
> > > > has more memory and therefore they are less likely to become empty.
> > > >
> > > > > There is a free RAM in each numa node for the few MB used in
> > > > > swap:
> > > > > NUMA stats:
> > > > > NUMA nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> > > > > MemTotal: 65048 65486 65486 65486 65486 65486 65486 65469 65486
> > > > > 65486 65486 65486 65486 65486 65486 65424
> > > > > MemFree: 468 601 1200 302 548 1879 2321 2478 1967 2239 1453 2417
> > > > > 2623 2833 2530 2269
> > > > > the in/out usage does not make sense for me nor the CPU utilization by
> > > > > multi-gen LRU.
> > > >
> > > > My questions:
> > > > 1. Were there any OOM kills with either case?
> > >
> > > There is no OOM. The memory usage is not growing nor the swap space
> > > usage, it is still a few MB there.
> > >
> > > > 2. Was THP enabled?
> > >
> > > Both situations with enabled and with disabled THP.
> >
> > My suspicion is that you packed node 3 too perfectly :) And that
> > might have triggered a known but currently low-priority problem in
> > MGLRU. I'm attaching a patch for v6.6, hoping you could verify it
> > for me in case v6.6 by itself still has the problem.
> >
>
> I would not focus just on node 3; we had issues on different servers
> with node 0 and node 2 in parallel, but mostly it is node 3.
>
> What our setup looks like:
> * each node has 64 GB of RAM,
> * 61 GB of it is in 1 GB Huge Pages,
> * the remaining 3 GB is used by the host system
>
> There are KVM VMs running with vCPUs pinned to the NUMA domains and
> using the Huge Pages (topology is exposed to the VMs, no overcommit,
> no shared CPUs); the qemu-kvm threads are pinned to the same NUMA
> domain as the vCPUs. System services are not pinned. I'm not sure why
> node 3 is used the most, as the VMs are balanced and the host's system
> services can move between domains.
>
> > > > MGLRU might have spent the extra CPU cycles just to avoid OOM kills or
> > > > produce more THPs.
> > > >
> > > > If disabling the NUMA domain isn't an option, I'd recommend:
> > >
> > > Disabling NUMA is not an option. However, we are now testing a setup
> > > with 1 GB less in HugePages per NUMA node.
> > >
> > > > 1. Try the latest kernel (6.6.1) if you haven't.
> > >
> > > Not yet, the 6.6.1 was released today.
> > >
> > > > 2. Disable THP if it was enabled, to verify whether it has an impact.
> > >
> > > I tried disabling THP, without any effect.
> >
> > Gotcha. Please try the patch with MGLRU and let me know. Thanks!
> >
> > (Also CC Charan @ Qualcomm who initially reported the problem that
> > ended up with the attached patch.)
>
> I can try it. Will let you know.
Great, thanks!
* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
2023-11-09 6:48 ` Yu Zhao
@ 2023-11-09 10:58 ` Jaroslav Pulchart
2023-11-10 1:31 ` Yu Zhao
0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2023-11-09 10:58 UTC (permalink / raw)
To: Yu Zhao; +Cc: linux-mm, akpm, Igor Raits, Daniel Secik, Charan Teja Kalla
>
> On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
> <jaroslav.pulchart@gooddata.com> wrote:
> >
> > >
> > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> > > <jaroslav.pulchart@gooddata.com> wrote:
> > > >
> > > > >
> > > > > Hi Jaroslav,
> > > >
> > > > Hi Yu Zhao
> > > >
> > > > thanks for response, see answers inline:
> > > >
> > > > >
> > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > > > > system (16 NUMA domains).
> > > > >
> > > > > Kernel version please?
> > > >
> > > > 6.5.y, but we saw it earlier as well; it has been under investigation since 23rd May
> > > > (6.4.y and maybe even the 6.3.y).
> > >
> > > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> > > for you if you run into other problems with v6.6.
> > >
> >
> > I will give it a try using 6.6.y. When it will work we can switch to
> > 6.6.y instead of backporting the stuff to 6.5.y.
> >
> > > > > > Symptoms of my issue are
> > > > > >
> > > > > > /A/ if multi-gen LRU is enabled
> > > > > > 1/ [kswapd3] is consuming 100% CPU
> > > > >
> > > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > > > >
> > > > > > top - 15:03:11 up 34 days, 1:51, 2 users, load average: 23.34,
> > > > > > 18.26, 15.01
> > > > > > Tasks: 1226 total, 2 running, 1224 sleeping, 0 stopped, 0 zombie
> > > > > > %Cpu(s): 12.5 us, 4.7 sy, 0.0 ni, 82.1 id, 0.0 wa, 0.4 hi,
> > > > > > 0.4 si, 0.0 st
> > > > > > MiB Mem : 1047265.+total, 28382.7 free, 1021308.+used, 767.6 buff/cache
> > > > > > MiB Swap: 8192.0 total, 8187.7 free, 4.2 used. 25956.7 avail Mem
> > > > > > ...
> > > > > > 765 root 20 0 0 0 0 R 98.3 0.0
> > > > > > 34969:04 kswapd3
> > > > > > ...
> > > > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > > > > > observed with swap disk as well and cause IO latency issues due to
> > > > > > some kind of locking)
> > > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > > > > >
> > > > > >
> > > > > > /B/ if multi-gen LRU is disabled
> > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > > > top - 15:02:49 up 34 days, 1:51, 2 users, load average: 23.05,
> > > > > > 17.77, 14.77
> > > > > > Tasks: 1226 total, 1 running, 1225 sleeping, 0 stopped, 0 zombie
> > > > > > %Cpu(s): 14.7 us, 2.8 sy, 0.0 ni, 81.8 id, 0.0 wa, 0.4 hi,
> > > > > > 0.4 si, 0.0 st
> > > > > > MiB Mem : 1047265.+total, 28378.5 free, 1021313.+used, 767.3 buff/cache
> > > > > > MiB Swap: 8192.0 total, 8189.0 free, 3.0 used. 25952.4 avail Mem
> > > > > > ...
> > > > > > 765 root 20 0 0 0 0 S 3.6 0.0
> > > > > > 34966:46 [kswapd3]
> > > > > > ...
> > > > > > 2/ swap space usage is low (4MB)
> > > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > > > > >
> > > > > > Both situations are wrong as they are using swap in/out extensively,
> > > > > > however the multi-gen LRU situation is 10 times worse.
> > > > >
> > > > > From the stats below, node 3 had the lowest free memory. So I think in
> > > > > both cases, the reclaim activities were as expected.
> > > >
> > > > I do not see a reason for the memory pressure and reclaims. This node
> > > > has the lowest free memory of all nodes (~302MB free) that is true,
> > > > however the swap space usage is just 4MB (still going in and out). So
> > > > what can be the reason for that behaviour?
> > >
> > > The best analogy is that refuel (reclaim) happens before the tank
> > > becomes empty, and it happens even sooner when there is a long road
> > > ahead (high order allocations).
> > >
> > > > The workers/application is running in pre-allocated HugePages and the
> > > > rest is used for a small set of system services and drivers of
> > > > devices. It is static and not growing. The issue persists when I stop
> > > > the system services and free the memory.
> > >
> > > Yes, this helps.
> > > Also could you attach /proc/buddyinfo from the moment
> > > you hit the problem?
> > >
> >
> > I can. The problem is continuous, it is 100% of time continuously
> > doing in/out and consuming 100% of CPU and locking IO.
> >
> > The output of /proc/buddyinfo is:
> >
> > # cat /proc/buddyinfo
> > Node 0, zone DMA 7 2 2 1 1 2 1
> > 1 1 2 1
> > Node 0, zone DMA32 4567 3395 1357 846 439 190 93
> > 61 43 23 4
> > Node 0, zone Normal 19 190 140 129 136 75 66
> > 41 9 1 5
> > Node 1, zone Normal 194 1210 2080 1800 715 255 111
> > 56 42 36 55
> > Node 2, zone Normal 204 768 3766 3394 1742 468 185
> > 194 238 47 74
> > Node 3, zone Normal 1622 2137 1058 846 388 208 97
> > 44 14 42 10
>
> Again, thinking out loud: there is only one zone on node 3, i.e., the
> normal zone, and this excludes the problem commit
> 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
> reclaim") fixed in v6.6.
I built vanilla 6.6.1 and did a first quick test - spinning up and
destroying VMs only. This test does not always trigger the continuous
kswapd3 swap in/out, but it does exercise it, and it looks like there
is a change:
I can see non-continuous kswapd usage (15 s and more) with 6.5.y:
# ps ax | grep [k]swapd
753 ? S 0:00 [kswapd0]
754 ? S 0:00 [kswapd1]
755 ? S 0:00 [kswapd2]
756 ? S 0:15 [kswapd3] <<<<<<<<<
757 ? S 0:00 [kswapd4]
758 ? S 0:00 [kswapd5]
759 ? S 0:00 [kswapd6]
760 ? S 0:00 [kswapd7]
761 ? S 0:00 [kswapd8]
762 ? S 0:00 [kswapd9]
763 ? S 0:00 [kswapd10]
764 ? S 0:00 [kswapd11]
765 ? S 0:00 [kswapd12]
766 ? S 0:00 [kswapd13]
767 ? S 0:00 [kswapd14]
768 ? S 0:00 [kswapd15]
and no kswapd usage with 6.6.1, which looks to be a promising path:
# ps ax | grep [k]swapd
808 ? S 0:00 [kswapd0]
809 ? S 0:00 [kswapd1]
810 ? S 0:00 [kswapd2]
811 ? S 0:00 [kswapd3] <<<< nice
812 ? S 0:00 [kswapd4]
813 ? S 0:00 [kswapd5]
814 ? S 0:00 [kswapd6]
815 ? S 0:00 [kswapd7]
816 ? S 0:00 [kswapd8]
817 ? S 0:00 [kswapd9]
818 ? S 0:00 [kswapd10]
819 ? S 0:00 [kswapd11]
820 ? S 0:00 [kswapd12]
821 ? S 0:00 [kswapd13]
822 ? S 0:00 [kswapd14]
823 ? S 0:00 [kswapd15]
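[Editor's note: the TIME column in the ps output above is cumulative CPU time since the thread started, so a single snapshot cannot distinguish a currently busy kswapd from one that was busy long ago; the useful signal is the delta between two snapshots, as compared here across kernels. Below is a small helper, a sketch assuming the usual ps/top time formats, for converting TIME values to seconds so such deltas can be computed.]

```python
def ps_time_to_seconds(t):
    """Convert a ps/top cumulative TIME field ('MM:SS', 'HH:MM:SS',
    or 'D-HH:MM:SS') to seconds. The minutes field may exceed 59,
    as in top's '34969:04' earlier in the thread."""
    days = 0
    if "-" in t:
        d, t = t.split("-")
        days = int(d)
    parts = [int(p) for p in t.split(":")]
    while len(parts) < 3:          # pad to hours:minutes:seconds
        parts.insert(0, 0)
    h, m, s = parts
    return ((days * 24 + h) * 60 + m) * 60 + s

print(ps_time_to_seconds("0:15"))      # 15 (the 6.5.y kswapd3 above)
print(ps_time_to_seconds("34969:04"))  # 2098144, i.e. ~24 days of CPU time
```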
I will install 6.6.1 on the server which is doing some work and
observe it later today.
>
> > Node 4, zone Normal 282 705 623 274 184 90 63
> > 41 11 1 28
> > Node 5, zone Normal 505 620 6180 3706 1724 1083 592
> > 410 417 168 70
> > Node 6, zone Normal 1120 357 3314 3437 2264 872 606
> > 209 215 123 265
> > Node 7, zone Normal 365 5499 12035 7486 3845 1743 635
> > 243 309 292 78
> > Node 8, zone Normal 248 740 2280 1094 1225 2087 846
> > 308 192 65 55
> > Node 9, zone Normal 356 763 1625 944 740 1920 1174
> > 696 217 235 111
> > Node 10, zone Normal 727 1479 7002 6114 2487 1084
> > 407 269 157 78 16
> > Node 11, zone Normal 189 3287 9141 5039 2560 1183
> > 1247 693 506 252 8
> > Node 12, zone Normal 142 378 1317 466 1512 1568
> > 646 359 248 264 228
> > Node 13, zone Normal 444 1977 3173 2625 2105 1493
> > 931 600 369 266 230
> > Node 14, zone Normal 376 221 120 360 2721 2378
> > 1521 826 442 204 59
> > Node 15, zone Normal 1210 966 922 2046 4128 2904
> > 1518 744 352 102 58
> >
> >
> > > > > > Could I ask for any suggestions on how to avoid the kswapd utilization
> > > > > > pattern?
> > > > >
> > > > > The easiest way is to disable NUMA domain so that there would be only
> > > > > two nodes with 8x more memory. IOW, you have fewer pools but each pool
> > > > > has more memory and therefore they are less likely to become empty.
> > > > >
> > > > > > There is a free RAM in each numa node for the few MB used in
> > > > > > swap:
> > > > > > NUMA stats:
> > > > > > NUMA nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> > > > > > MemTotal: 65048 65486 65486 65486 65486 65486 65486 65469 65486
> > > > > > 65486 65486 65486 65486 65486 65486 65424
> > > > > > MemFree: 468 601 1200 302 548 1879 2321 2478 1967 2239 1453 2417
> > > > > > 2623 2833 2530 2269
> > > > > > the in/out usage does not make sense for me nor the CPU utilization by
> > > > > > multi-gen LRU.
> > > > >
> > > > > My questions:
> > > > > 1. Were there any OOM kills with either case?
> > > >
> > > > There is no OOM. The memory usage is not growing nor the swap space
> > > > usage, it is still a few MB there.
> > > >
> > > > > 2. Was THP enabled?
> > > >
> > > > Both situations with enabled and with disabled THP.
> > >
> > > My suspicion is that you packed the node 3 too perfectly :) And that
> > > might have triggered a known but currently a low priority problem in
> > > MGLRU. I'm attaching a patch for v6.6 and hoping you could verify it
> > > for me in case v6.6 by itself still has the problem?
> > >
> >
> > I would not focus just to node3, we had issues on different servers
> > with node0 and node2 both in parallel, but mostly it is the node3.
> >
> > How our setup looks like:
> > * each node has 64GB of RAM,
> > * 61GB from it is in 1GB Huge Pages,
> > * rest 3GB is used by host system
> >
> > There are running kvm VMs vCPUs pinned to the NUMA domains and using
> > the Huge Pages (topology is exposed to VMs, no-overcommit, no-shared
> > cpus), the qemu-kvm threads are pinned to the same numa domain as the
> > vCPUs. System services are not pinned, I'm not sure why the node3 is
> > used at most as the vms are balanced and the host's system services
> > can move between domains.
> >
> > > > > MGLRU might have spent the extra CPU cycles just to avoid OOM kills or
> > > > > produce more THPs.
> > > > >
> > > > > If disabling the NUMA domain isn't an option, I'd recommend:
> > > >
> > > > Disabling NUMA is not an option. However, we are now testing a setup
> > > > with 1 GB less in HugePages per NUMA node.
> > > >
> > > > > 1. Try the latest kernel (6.6.1) if you haven't.
> > > >
> > > > Not yet, the 6.6.1 was released today.
> > > >
> > > > > 2. Disable THP if it was enabled, to verify whether it has an impact.
> > > >
> > > > I tried disabling THP, without any effect.
> > >
> > > Gotcha. Please try the patch with MGLRU and let me know. Thanks!
> > >
> > > (Also CC Charan @ Qualcomm who initially reported the problem that
> > > ended up with the attached patch.)
> >
> > I can try it. Will let you know.
>
> Great, thanks!
* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
2023-11-09 10:58 ` Jaroslav Pulchart
@ 2023-11-10 1:31 ` Yu Zhao
[not found] ` <CAK8fFZ5xUe=JMOxUWgQ-0aqWMXuZYF2EtPOoZQqr89sjrL+zTw@mail.gmail.com>
0 siblings, 1 reply; 30+ messages in thread
From: Yu Zhao @ 2023-11-10 1:31 UTC (permalink / raw)
To: Jaroslav Pulchart
Cc: linux-mm, akpm, Igor Raits, Daniel Secik, Charan Teja Kalla
On Thu, Nov 9, 2023 at 3:58 AM Jaroslav Pulchart
<jaroslav.pulchart@gooddata.com> wrote:
>
> >
> > On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
> > <jaroslav.pulchart@gooddata.com> wrote:
> > >
> > > >
> > > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > >
> > > > > >
> > > > > > Hi Jaroslav,
> > > > >
> > > > > Hi Yu Zhao
> > > > >
> > > > > thanks for response, see answers inline:
> > > > >
> > > > > >
> > > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > > > > > system (16 NUMA domains).
> > > > > >
> > > > > > Kernel version please?
> > > > >
> > > > > 6.5.y, but we saw it earlier as well; it has been under investigation since 23rd May
> > > > > (6.4.y and maybe even the 6.3.y).
> > > >
> > > > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> > > > for you if you run into other problems with v6.6.
> > > >
> > >
> > > I will give it a try using 6.6.y. When it will work we can switch to
> > > 6.6.y instead of backporting the stuff to 6.5.y.
> > >
> > > > > > > Symptoms of my issue are
> > > > > > >
> > > > > > > /A/ if multi-gen LRU is enabled
> > > > > > > 1/ [kswapd3] is consuming 100% CPU
> > > > > >
> > > > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > > > > >
> > > > > > > top - 15:03:11 up 34 days, 1:51, 2 users, load average: 23.34,
> > > > > > > 18.26, 15.01
> > > > > > > Tasks: 1226 total, 2 running, 1224 sleeping, 0 stopped, 0 zombie
> > > > > > > %Cpu(s): 12.5 us, 4.7 sy, 0.0 ni, 82.1 id, 0.0 wa, 0.4 hi,
> > > > > > > 0.4 si, 0.0 st
> > > > > > > MiB Mem : 1047265.+total, 28382.7 free, 1021308.+used, 767.6 buff/cache
> > > > > > > MiB Swap: 8192.0 total, 8187.7 free, 4.2 used. 25956.7 avail Mem
> > > > > > > ...
> > > > > > > 765 root 20 0 0 0 0 R 98.3 0.0
> > > > > > > 34969:04 kswapd3
> > > > > > > ...
> > > > > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > > > > > > observed with swap disk as well and cause IO latency issues due to
> > > > > > > some kind of locking)
> > > > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > > > > > >
> > > > > > >
> > > > > > > /B/ if multi-gen LRU is disabled
> > > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > > > > top - 15:02:49 up 34 days, 1:51, 2 users, load average: 23.05,
> > > > > > > 17.77, 14.77
> > > > > > > Tasks: 1226 total, 1 running, 1225 sleeping, 0 stopped, 0 zombie
> > > > > > > %Cpu(s): 14.7 us, 2.8 sy, 0.0 ni, 81.8 id, 0.0 wa, 0.4 hi,
> > > > > > > 0.4 si, 0.0 st
> > > > > > > MiB Mem : 1047265.+total, 28378.5 free, 1021313.+used, 767.3 buff/cache
> > > > > > > MiB Swap: 8192.0 total, 8189.0 free, 3.0 used. 25952.4 avail Mem
> > > > > > > ...
> > > > > > > 765 root 20 0 0 0 0 S 3.6 0.0
> > > > > > > 34966:46 [kswapd3]
> > > > > > > ...
> > > > > > > 2/ swap space usage is low (4MB)
> > > > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > > > > > >
> > > > > > > Both situations are wrong as they are using swap in/out extensively,
> > > > > > > however the multi-gen LRU situation is 10 times worse.
> > > > > >
> > > > > > From the stats below, node 3 had the lowest free memory. So I think in
> > > > > > both cases, the reclaim activities were as expected.
> > > > >
> > > > > I do not see a reason for the memory pressure and reclaims. This node
> > > > > has the lowest free memory of all nodes (~302MB free) that is true,
> > > > > however the swap space usage is just 4MB (still going in and out). So
> > > > > what can be the reason for that behaviour?
> > > >
> > > > The best analogy is that refuel (reclaim) happens before the tank
> > > > becomes empty, and it happens even sooner when there is a long road
> > > > ahead (high order allocations).
> > > >
> > > > > The workers/application is running in pre-allocated HugePages and the
> > > > > rest is used for a small set of system services and drivers of
> > > > > devices. It is static and not growing. The issue persists when I stop
> > > > > the system services and free the memory.
> > > >
> > > > Yes, this helps.
> > > > Also could you attach /proc/buddyinfo from the moment
> > > > you hit the problem?
> > > >
> > >
> > > I can. The problem is continuous, it is 100% of time continuously
> > > doing in/out and consuming 100% of CPU and locking IO.
> > >
> > > The output of /proc/buddyinfo is:
> > >
> > > # cat /proc/buddyinfo
> > > Node 0, zone DMA 7 2 2 1 1 2 1
> > > 1 1 2 1
> > > Node 0, zone DMA32 4567 3395 1357 846 439 190 93
> > > 61 43 23 4
> > > Node 0, zone Normal 19 190 140 129 136 75 66
> > > 41 9 1 5
> > > Node 1, zone Normal 194 1210 2080 1800 715 255 111
> > > 56 42 36 55
> > > Node 2, zone Normal 204 768 3766 3394 1742 468 185
> > > 194 238 47 74
> > > Node 3, zone Normal 1622 2137 1058 846 388 208 97
> > > 44 14 42 10
> >
> > Again, thinking out loud: there is only one zone on node 3, i.e., the
> > normal zone, and this excludes the problem commit
> > 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
> > reclaim") fixed in v6.6.
>
> I built vanilla 6.6.1 and did the first fast test - spin up and destroy
> VMs only - This test does not always trigger the kswapd3 continuous
> swap in/out usage but it uses it and it looks like there is a
> change:
>
> I can see kswapd non-continuous (15s and more) usage with 6.5.y
> # ps ax | grep [k]swapd
> 753 ? S 0:00 [kswapd0]
> 754 ? S 0:00 [kswapd1]
> 755 ? S 0:00 [kswapd2]
> 756 ? S 0:15 [kswapd3] <<<<<<<<<
> 757 ? S 0:00 [kswapd4]
> 758 ? S 0:00 [kswapd5]
> 759 ? S 0:00 [kswapd6]
> 760 ? S 0:00 [kswapd7]
> 761 ? S 0:00 [kswapd8]
> 762 ? S 0:00 [kswapd9]
> 763 ? S 0:00 [kswapd10]
> 764 ? S 0:00 [kswapd11]
> 765 ? S 0:00 [kswapd12]
> 766 ? S 0:00 [kswapd13]
> 767 ? S 0:00 [kswapd14]
> 768 ? S 0:00 [kswapd15]
>
> and no kswapd usage with 6.6.1, which looks to be a promising path
>
> # ps ax | grep [k]swapd
> 808 ? S 0:00 [kswapd0]
> 809 ? S 0:00 [kswapd1]
> 810 ? S 0:00 [kswapd2]
> 811 ? S 0:00 [kswapd3] <<<< nice
> 812 ? S 0:00 [kswapd4]
> 813 ? S 0:00 [kswapd5]
> 814 ? S 0:00 [kswapd6]
> 815 ? S 0:00 [kswapd7]
> 816 ? S 0:00 [kswapd8]
> 817 ? S 0:00 [kswapd9]
> 818 ? S 0:00 [kswapd10]
> 819 ? S 0:00 [kswapd11]
> 820 ? S 0:00 [kswapd12]
> 821 ? S 0:00 [kswapd13]
> 822 ? S 0:00 [kswapd14]
> 823 ? S 0:00 [kswapd15]
>
> I will install the 6.6.1 on the server which is doing some work and
> observe it later today.
Thanks. Fingers crossed.
* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
[not found] ` <CAK8fFZ5xUe=JMOxUWgQ-0aqWMXuZYF2EtPOoZQqr89sjrL+zTw@mail.gmail.com>
@ 2023-11-13 20:09 ` Yu Zhao
2023-11-14 7:29 ` Jaroslav Pulchart
0 siblings, 1 reply; 30+ messages in thread
From: Yu Zhao @ 2023-11-13 20:09 UTC (permalink / raw)
To: Jaroslav Pulchart
Cc: linux-mm, akpm, Igor Raits, Daniel Secik, Charan Teja Kalla
[-- Attachment #1: Type: text/plain, Size: 9618 bytes --]
On Mon, Nov 13, 2023 at 1:36 AM Jaroslav Pulchart
<jaroslav.pulchart@gooddata.com> wrote:
>
> >
> > On Thu, Nov 9, 2023 at 3:58 AM Jaroslav Pulchart
> > <jaroslav.pulchart@gooddata.com> wrote:
> > >
> > > >
> > > > On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
> > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > >
> > > > > >
> > > > > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > >
> > > > > > > >
> > > > > > > > Hi Jaroslav,
> > > > > > >
> > > > > > > Hi Yu Zhao
> > > > > > >
> > > > > > > thanks for response, see answers inline:
> > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > >
> > > > > > > > > Hello,
> > > > > > > > >
> > > > > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > > > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > > > > > > > system (16 NUMA domains).
> > > > > > > >
> > > > > > > > Kernel version please?
> > > > > > >
> > > > > > > 6.5.y, but we saw it earlier as well; it has been under investigation since 23rd May
> > > > > > > (6.4.y and maybe even the 6.3.y).
> > > > > >
> > > > > > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> > > > > > for you if you run into other problems with v6.6.
> > > > > >
> > > > >
> > > > > I will give it a try using 6.6.y. When it will work we can switch to
> > > > > 6.6.y instead of backporting the stuff to 6.5.y.
> > > > >
> > > > > > > > > Symptoms of my issue are
> > > > > > > > >
> > > > > > > > > /A/ if multi-gen LRU is enabled
> > > > > > > > > 1/ [kswapd3] is consuming 100% CPU
> > > > > > > >
> > > > > > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > > > > > > >
> > > > > > > > > top - 15:03:11 up 34 days, 1:51, 2 users, load average: 23.34,
> > > > > > > > > 18.26, 15.01
> > > > > > > > > Tasks: 1226 total, 2 running, 1224 sleeping, 0 stopped, 0 zombie
> > > > > > > > > %Cpu(s): 12.5 us, 4.7 sy, 0.0 ni, 82.1 id, 0.0 wa, 0.4 hi,
> > > > > > > > > 0.4 si, 0.0 st
> > > > > > > > > MiB Mem : 1047265.+total, 28382.7 free, 1021308.+used, 767.6 buff/cache
> > > > > > > > > MiB Swap: 8192.0 total, 8187.7 free, 4.2 used. 25956.7 avail Mem
> > > > > > > > > ...
> > > > > > > > > 765 root 20 0 0 0 0 R 98.3 0.0
> > > > > > > > > 34969:04 kswapd3
> > > > > > > > > ...
> > > > > > > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > > > > > > > > observed with swap disk as well and cause IO latency issues due to
> > > > > > > > > some kind of locking)
> > > > > > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > /B/ if multi-gen LRU is disabled
> > > > > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > > > > > > top - 15:02:49 up 34 days, 1:51, 2 users, load average: 23.05,
> > > > > > > > > 17.77, 14.77
> > > > > > > > > Tasks: 1226 total, 1 running, 1225 sleeping, 0 stopped, 0 zombie
> > > > > > > > > %Cpu(s): 14.7 us, 2.8 sy, 0.0 ni, 81.8 id, 0.0 wa, 0.4 hi,
> > > > > > > > > 0.4 si, 0.0 st
> > > > > > > > > MiB Mem : 1047265.+total, 28378.5 free, 1021313.+used, 767.3 buff/cache
> > > > > > > > > MiB Swap: 8192.0 total, 8189.0 free, 3.0 used. 25952.4 avail Mem
> > > > > > > > > ...
> > > > > > > > > 765 root 20 0 0 0 0 S 3.6 0.0
> > > > > > > > > 34966:46 [kswapd3]
> > > > > > > > > ...
> > > > > > > > > 2/ swap space usage is low (4MB)
> > > > > > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > > > > > > > >
> > > > > > > > > Both situations are wrong as they are using swap in/out extensively,
> > > > > > > > > however the multi-gen LRU situation is 10 times worse.
> > > > > > > >
> > > > > > > > From the stats below, node 3 had the lowest free memory. So I think in
> > > > > > > > both cases, the reclaim activities were as expected.
> > > > > > >
> > > > > > > I do not see a reason for the memory pressure and reclaims. This node
> > > > > > > has the lowest free memory of all nodes (~302MB free) that is true,
> > > > > > > however the swap space usage is just 4MB (still going in and out). So
> > > > > > > what can be the reason for that behaviour?
> > > > > >
> > > > > > The best analogy is that refuel (reclaim) happens before the tank
> > > > > > becomes empty, and it happens even sooner when there is a long road
> > > > > > ahead (high order allocations).
> > > > > >
> > > > > > > The workers/application is running in pre-allocated HugePages and the
> > > > > > > rest is used for a small set of system services and drivers of
> > > > > > > devices. It is static and not growing. The issue persists when I stop
> > > > > > > the system services and free the memory.
> > > > > >
> > > > > > Yes, this helps.
> > > > > > Also could you attach /proc/buddyinfo from the moment
> > > > > > you hit the problem?
> > > > > >
> > > > >
> > > > > I can. The problem is continuous, it is 100% of time continuously
> > > > > doing in/out and consuming 100% of CPU and locking IO.
> > > > >
> > > > > The output of /proc/buddyinfo is:
> > > > >
> > > > > # cat /proc/buddyinfo
> > > > > Node 0, zone DMA 7 2 2 1 1 2 1
> > > > > 1 1 2 1
> > > > > Node 0, zone DMA32 4567 3395 1357 846 439 190 93
> > > > > 61 43 23 4
> > > > > Node 0, zone Normal 19 190 140 129 136 75 66
> > > > > 41 9 1 5
> > > > > Node 1, zone Normal 194 1210 2080 1800 715 255 111
> > > > > 56 42 36 55
> > > > > Node 2, zone Normal 204 768 3766 3394 1742 468 185
> > > > > 194 238 47 74
> > > > > Node 3, zone Normal 1622 2137 1058 846 388 208 97
> > > > > 44 14 42 10
> > > >
> > > > Again, thinking out loud: there is only one zone on node 3, i.e., the
> > > > normal zone, and this excludes the problem commit
> > > > 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
> > > > reclaim") fixed in v6.6.
> > >
> > > I built vanila 6.6.1 and did the first fast test - spin up and destroy
> > > VMs only - This test does not always trigger the kswapd3 continuous
> > > swap in/out usage but it uses it and it looks like there is a
> > > change:
> > >
> > > I can see kswapd non-continous (15s and more) usage with 6.5.y
> > > # ps ax | grep [k]swapd
> > > 753 ? S 0:00 [kswapd0]
> > > 754 ? S 0:00 [kswapd1]
> > > 755 ? S 0:00 [kswapd2]
> > > 756 ? S 0:15 [kswapd3] <<<<<<<<<
> > > 757 ? S 0:00 [kswapd4]
> > > 758 ? S 0:00 [kswapd5]
> > > 759 ? S 0:00 [kswapd6]
> > > 760 ? S 0:00 [kswapd7]
> > > 761 ? S 0:00 [kswapd8]
> > > 762 ? S 0:00 [kswapd9]
> > > 763 ? S 0:00 [kswapd10]
> > > 764 ? S 0:00 [kswapd11]
> > > 765 ? S 0:00 [kswapd12]
> > > 766 ? S 0:00 [kswapd13]
> > > 767 ? S 0:00 [kswapd14]
> > > 768 ? S 0:00 [kswapd15]
> > >
> > > and none kswapd usage with 6.6.1, that looks to be promising path
> > >
> > > # ps ax | grep [k]swapd
> > > 808 ? S 0:00 [kswapd0]
> > > 809 ? S 0:00 [kswapd1]
> > > 810 ? S 0:00 [kswapd2]
> > > 811 ? S 0:00 [kswapd3] <<<< nice
> > > 812 ? S 0:00 [kswapd4]
> > > 813 ? S 0:00 [kswapd5]
> > > 814 ? S 0:00 [kswapd6]
> > > 815 ? S 0:00 [kswapd7]
> > > 816 ? S 0:00 [kswapd8]
> > > 817 ? S 0:00 [kswapd9]
> > > 818 ? S 0:00 [kswapd10]
> > > 819 ? S 0:00 [kswapd11]
> > > 820 ? S 0:00 [kswapd12]
> > > 821 ? S 0:00 [kswapd13]
> > > 822 ? S 0:00 [kswapd14]
> > > 823 ? S 0:00 [kswapd15]
> > >
> > > I will install the 6.6.1 on the server which is doing some work and
> > > observe it later today.
> >
> > Thanks. Fingers crossed.
>
> The 6.6.y was deployed and used from 9th Nov 3PM CEST. So far so good.
> The node 3 has 163MiB free of memory and I see
> just a few in/out swap usage sometimes (which is expected) and minimal
> kswapd3 process usage for almost 4days.
Thanks for the update!
Just to confirm:
1. MGLRU was enabled, and
2. The v6.6 deployed did NOT have the patch I attached earlier.
Are both correct?
If so, I'd very much appreciate it if you could try the attached patch
on top of v6.5 and see if it helps. My suspicion is that the problem
is compaction related, i.e., kswapd was woken up by high-order
allocations but didn't properly stop. But what causes the behavior
difference on v6.5 between MGLRU and the active/inactive LRU still
puzzles me; the problem might be somehow masked rather than fixed in
v6.6.
For any other problems that you suspect might be related to MGLRU,
please let me know and I'd be happy to look into them as well.
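One way to probe the compaction hypothesis is to watch a zone's free pages against its watermarks in /proc/zoneinfo: kswapd is woken when free drops below "low" and keeps reclaiming until free is back above "high". A minimal sketch (the field names are the standard ones, the sample values are made up; the per-order balance checks that matter for high-order wakeups are deliberately omitted):

```python
# Sketch: compare a zone's free pages with its min/low/high watermarks,
# as read from /proc/zoneinfo. Simplified: the real balance_pgdat() also
# performs per-order checks when woken for a high-order allocation.

def parse_zone(text):
    """Extract numeric fields such as 'free', 'min', 'low', 'high'."""
    fields = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[-1].isdigit():
            fields[parts[-2]] = int(parts[-1])
    return fields

def kswapd_may_sleep(zone):
    """kswapd reclaims until free pages are back above the high watermark."""
    return zone["free"] >= zone["high"]

sample = """pages free 150
      min 100
      low 125
      high 150"""
zone = parse_zone(sample)
print(kswapd_may_sleep(zone))  # -> True: at/above the high watermark
```

If kswapd3 spins at 100% CPU while the zone already sits above "high" for order-0, the per-order (compaction) side of the wakeup is the natural suspect.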
[-- Attachment #2: mglru-v6.5.patch --]
[-- Type: application/x-patch, Size: 2989 bytes --]
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
2023-11-13 20:09 ` Yu Zhao
@ 2023-11-14 7:29 ` Jaroslav Pulchart
2023-11-14 7:47 ` Yu Zhao
0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2023-11-14 7:29 UTC (permalink / raw)
To: Yu Zhao; +Cc: linux-mm, akpm, Igor Raits, Daniel Secik, Charan Teja Kalla
>
> On Mon, Nov 13, 2023 at 1:36 AM Jaroslav Pulchart
> <jaroslav.pulchart@gooddata.com> wrote:
> >
> > >
> > > On Thu, Nov 9, 2023 at 3:58 AM Jaroslav Pulchart
> > > <jaroslav.pulchart@gooddata.com> wrote:
> > > >
> > > > >
> > > > > On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
> > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > >
> > > > > > >
> > > > > > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Hi Jaroslav,
> > > > > > > >
> > > > > > > > Hi Yu Zhao
> > > > > > > >
> > > > > > > > thanks for response, see answers inline:
> > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Hello,
> > > > > > > > > >
> > > > > > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > > > > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > > > > > > > > system (16numa domains).
> > > > > > > > >
> > > > > > > > > Kernel version please?
> > > > > > > >
> > > > > > > > 6.5.y, but we saw it sooner as it is in investigation from 23th May
> > > > > > > > (6.4.y and maybe even the 6.3.y).
> > > > > > >
> > > > > > > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> > > > > > > for you if you run into other problems with v6.6.
> > > > > > >
> > > > > >
> > > > > > I will give it a try using 6.6.y. When it will work we can switch to
> > > > > > 6.6.y instead of backporting the stuff to 6.5.y.
> > > > > >
> > > > > > > > > > Symptoms of my issue are
> > > > > > > > > >
> > > > > > > > > > /A/ if mult-gen LRU is enabled
> > > > > > > > > > 1/ [kswapd3] is consuming 100% CPU
> > > > > > > > >
> > > > > > > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > > > > > > > >
> > > > > > > > > > top - 15:03:11 up 34 days, 1:51, 2 users, load average: 23.34,
> > > > > > > > > > 18.26, 15.01
> > > > > > > > > > Tasks: 1226 total, 2 running, 1224 sleeping, 0 stopped, 0 zombie
> > > > > > > > > > %Cpu(s): 12.5 us, 4.7 sy, 0.0 ni, 82.1 id, 0.0 wa, 0.4 hi,
> > > > > > > > > > 0.4 si, 0.0 st
> > > > > > > > > > MiB Mem : 1047265.+total, 28382.7 free, 1021308.+used, 767.6 buff/cache
> > > > > > > > > > MiB Swap: 8192.0 total, 8187.7 free, 4.2 used. 25956.7 avail Mem
> > > > > > > > > > ...
> > > > > > > > > > 765 root 20 0 0 0 0 R 98.3 0.0
> > > > > > > > > > 34969:04 kswapd3
> > > > > > > > > > ...
> > > > > > > > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > > > > > > > > > observed with swap disk as well and cause IO latency issues due to
> > > > > > > > > > some kind of locking)
> > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > /B/ if mult-gen LRU is disabled
> > > > > > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > > > > > > > top - 15:02:49 up 34 days, 1:51, 2 users, load average: 23.05,
> > > > > > > > > > 17.77, 14.77
> > > > > > > > > > Tasks: 1226 total, 1 running, 1225 sleeping, 0 stopped, 0 zombie
> > > > > > > > > > %Cpu(s): 14.7 us, 2.8 sy, 0.0 ni, 81.8 id, 0.0 wa, 0.4 hi,
> > > > > > > > > > 0.4 si, 0.0 st
> > > > > > > > > > MiB Mem : 1047265.+total, 28378.5 free, 1021313.+used, 767.3 buff/cache
> > > > > > > > > > MiB Swap: 8192.0 total, 8189.0 free, 3.0 used. 25952.4 avail Mem
> > > > > > > > > > ...
> > > > > > > > > > 765 root 20 0 0 0 0 S 3.6 0.0
> > > > > > > > > > 34966:46 [kswapd3]
> > > > > > > > > > ...
> > > > > > > > > > 2/ swap space usage is low (4MB)
> > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > > > > > > > > >
> > > > > > > > > > Both situations are wrong as they are using swap in/out extensively,
> > > > > > > > > > however the multi-gen LRU situation is 10times worse.
> > > > > > > > >
> > > > > > > > > From the stats below, node 3 had the lowest free memory. So I think in
> > > > > > > > > both cases, the reclaim activities were as expected.
> > > > > > > >
> > > > > > > > I do not see a reason for the memory pressure and reclaims. This node
> > > > > > > > has the lowest free memory of all nodes (~302MB free) that is true,
> > > > > > > > however the swap space usage is just 4MB (still going in and out). So
> > > > > > > > what can be the reason for that behaviour?
> > > > > > >
> > > > > > > The best analogy is that refuel (reclaim) happens before the tank
> > > > > > > becomes empty, and it happens even sooner when there is a long road
> > > > > > > ahead (high order allocations).
> > > > > > >
> > > > > > > > The workers/application is running in pre-allocated HugePages and the
> > > > > > > > rest is used for a small set of system services and drivers of
> > > > > > > > devices. It is static and not growing. The issue persists when I stop
> > > > > > > > the system services and free the memory.
> > > > > > >
> > > > > > > Yes, this helps.
> > > > > > > Also could you attach /proc/buddyinfo from the moment
> > > > > > > you hit the problem?
> > > > > > >
> > > > > >
> > > > > > I can. The problem is continuous, it is 100% of time continuously
> > > > > > doing in/out and consuming 100% of CPU and locking IO.
> > > > > >
> > > > > > The output of /proc/buddyinfo is:
> > > > > >
> > > > > > # cat /proc/buddyinfo
> > > > > > Node 0, zone DMA 7 2 2 1 1 2 1
> > > > > > 1 1 2 1
> > > > > > Node 0, zone DMA32 4567 3395 1357 846 439 190 93
> > > > > > 61 43 23 4
> > > > > > Node 0, zone Normal 19 190 140 129 136 75 66
> > > > > > 41 9 1 5
> > > > > > Node 1, zone Normal 194 1210 2080 1800 715 255 111
> > > > > > 56 42 36 55
> > > > > > Node 2, zone Normal 204 768 3766 3394 1742 468 185
> > > > > > 194 238 47 74
> > > > > > Node 3, zone Normal 1622 2137 1058 846 388 208 97
> > > > > > 44 14 42 10
> > > > >
> > > > > Again, thinking out loud: there is only one zone on node 3, i.e., the
> > > > > normal zone, and this excludes the problem commit
> > > > > 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
> > > > > reclaim") fixed in v6.6.
> > > >
> > > > I built vanila 6.6.1 and did the first fast test - spin up and destroy
> > > > VMs only - This test does not always trigger the kswapd3 continuous
> > > > swap in/out usage but it uses it and it looks like there is a
> > > > change:
> > > >
> > > > I can see kswapd non-continous (15s and more) usage with 6.5.y
> > > > # ps ax | grep [k]swapd
> > > > 753 ? S 0:00 [kswapd0]
> > > > 754 ? S 0:00 [kswapd1]
> > > > 755 ? S 0:00 [kswapd2]
> > > > 756 ? S 0:15 [kswapd3] <<<<<<<<<
> > > > 757 ? S 0:00 [kswapd4]
> > > > 758 ? S 0:00 [kswapd5]
> > > > 759 ? S 0:00 [kswapd6]
> > > > 760 ? S 0:00 [kswapd7]
> > > > 761 ? S 0:00 [kswapd8]
> > > > 762 ? S 0:00 [kswapd9]
> > > > 763 ? S 0:00 [kswapd10]
> > > > 764 ? S 0:00 [kswapd11]
> > > > 765 ? S 0:00 [kswapd12]
> > > > 766 ? S 0:00 [kswapd13]
> > > > 767 ? S 0:00 [kswapd14]
> > > > 768 ? S 0:00 [kswapd15]
> > > >
> > > > and none kswapd usage with 6.6.1, that looks to be promising path
> > > >
> > > > # ps ax | grep [k]swapd
> > > > 808 ? S 0:00 [kswapd0]
> > > > 809 ? S 0:00 [kswapd1]
> > > > 810 ? S 0:00 [kswapd2]
> > > > 811 ? S 0:00 [kswapd3] <<<< nice
> > > > 812 ? S 0:00 [kswapd4]
> > > > 813 ? S 0:00 [kswapd5]
> > > > 814 ? S 0:00 [kswapd6]
> > > > 815 ? S 0:00 [kswapd7]
> > > > 816 ? S 0:00 [kswapd8]
> > > > 817 ? S 0:00 [kswapd9]
> > > > 818 ? S 0:00 [kswapd10]
> > > > 819 ? S 0:00 [kswapd11]
> > > > 820 ? S 0:00 [kswapd12]
> > > > 821 ? S 0:00 [kswapd13]
> > > > 822 ? S 0:00 [kswapd14]
> > > > 823 ? S 0:00 [kswapd15]
> > > >
> > > > I will install the 6.6.1 on the server which is doing some work and
> > > > observe it later today.
> > >
> > > Thanks. Fingers crossed.
> >
> > The 6.6.y was deployed and used from 9th Nov 3PM CEST. So far so good.
> > The node 3 has 163MiB free of memory and I see
> > just a few in/out swap usage sometimes (which is expected) and minimal
> > kswapd3 process usage for almost 4days.
>
> Thanks for the update!
>
> Just to confirm:
> 1. MGLRU was enabled, and
Yes, MGLRU is enabled
> 2. The v6.6 deployed did NOT have the patch I attached earlier.
Vanilla 6.6, attached patch NOT applied.
> Are both correct?
>
> If so, I'd very appreciate it if you could try the attached patch on
> top of v6.5 and see if it helps. My suspicion is that the problem is
> compaction related, i.e., kswapd was woken up by high order
> allocations but didn't properly stop. But what causes the behavior
Sure, I can try it. Will inform you about progress.
> difference on v6.5 between MGLRU and the active/inactive LRU still
> puzzles me --the problem might be somehow masked rather than fixed on
> v6.6.
I'm not sure how I can help with the issue. Any suggestions on what to
change/try?
>
> For any other problems that you suspect might be related to MGLRU,
> please let me know and I'd be happy to look into them as well.
* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
2023-11-14 7:29 ` Jaroslav Pulchart
@ 2023-11-14 7:47 ` Yu Zhao
2023-11-20 8:41 ` Jaroslav Pulchart
0 siblings, 1 reply; 30+ messages in thread
From: Yu Zhao @ 2023-11-14 7:47 UTC (permalink / raw)
To: Jaroslav Pulchart
Cc: linux-mm, akpm, Igor Raits, Daniel Secik, Charan Teja Kalla
On Tue, Nov 14, 2023 at 12:30 AM Jaroslav Pulchart
<jaroslav.pulchart@gooddata.com> wrote:
>
> >
> > On Mon, Nov 13, 2023 at 1:36 AM Jaroslav Pulchart
> > <jaroslav.pulchart@gooddata.com> wrote:
> > >
> > > >
> > > > On Thu, Nov 9, 2023 at 3:58 AM Jaroslav Pulchart
> > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > >
> > > > > >
> > > > > > On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
> > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Hi Jaroslav,
> > > > > > > > >
> > > > > > > > > Hi Yu Zhao
> > > > > > > > >
> > > > > > > > > thanks for response, see answers inline:
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hello,
> > > > > > > > > > >
> > > > > > > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > > > > > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > > > > > > > > > system (16numa domains).
> > > > > > > > > >
> > > > > > > > > > Kernel version please?
> > > > > > > > >
> > > > > > > > > 6.5.y, but we saw it sooner as it is in investigation from 23th May
> > > > > > > > > (6.4.y and maybe even the 6.3.y).
> > > > > > > >
> > > > > > > > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> > > > > > > > for you if you run into other problems with v6.6.
> > > > > > > >
> > > > > > >
> > > > > > > I will give it a try using 6.6.y. When it will work we can switch to
> > > > > > > 6.6.y instead of backporting the stuff to 6.5.y.
> > > > > > >
> > > > > > > > > > > Symptoms of my issue are
> > > > > > > > > > >
> > > > > > > > > > > /A/ if mult-gen LRU is enabled
> > > > > > > > > > > 1/ [kswapd3] is consuming 100% CPU
> > > > > > > > > >
> > > > > > > > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > > > > > > > > >
> > > > > > > > > > > top - 15:03:11 up 34 days, 1:51, 2 users, load average: 23.34,
> > > > > > > > > > > 18.26, 15.01
> > > > > > > > > > > Tasks: 1226 total, 2 running, 1224 sleeping, 0 stopped, 0 zombie
> > > > > > > > > > > %Cpu(s): 12.5 us, 4.7 sy, 0.0 ni, 82.1 id, 0.0 wa, 0.4 hi,
> > > > > > > > > > > 0.4 si, 0.0 st
> > > > > > > > > > > MiB Mem : 1047265.+total, 28382.7 free, 1021308.+used, 767.6 buff/cache
> > > > > > > > > > > MiB Swap: 8192.0 total, 8187.7 free, 4.2 used. 25956.7 avail Mem
> > > > > > > > > > > ...
> > > > > > > > > > > 765 root 20 0 0 0 0 R 98.3 0.0
> > > > > > > > > > > 34969:04 kswapd3
> > > > > > > > > > > ...
> > > > > > > > > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > > > > > > > > > > observed with swap disk as well and cause IO latency issues due to
> > > > > > > > > > > some kind of locking)
> > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > /B/ if mult-gen LRU is disabled
> > > > > > > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > > > > > > > > top - 15:02:49 up 34 days, 1:51, 2 users, load average: 23.05,
> > > > > > > > > > > 17.77, 14.77
> > > > > > > > > > > Tasks: 1226 total, 1 running, 1225 sleeping, 0 stopped, 0 zombie
> > > > > > > > > > > %Cpu(s): 14.7 us, 2.8 sy, 0.0 ni, 81.8 id, 0.0 wa, 0.4 hi,
> > > > > > > > > > > 0.4 si, 0.0 st
> > > > > > > > > > > MiB Mem : 1047265.+total, 28378.5 free, 1021313.+used, 767.3 buff/cache
> > > > > > > > > > > MiB Swap: 8192.0 total, 8189.0 free, 3.0 used. 25952.4 avail Mem
> > > > > > > > > > > ...
> > > > > > > > > > > 765 root 20 0 0 0 0 S 3.6 0.0
> > > > > > > > > > > 34966:46 [kswapd3]
> > > > > > > > > > > ...
> > > > > > > > > > > 2/ swap space usage is low (4MB)
> > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > > > > > > > > > >
> > > > > > > > > > > Both situations are wrong as they are using swap in/out extensively,
> > > > > > > > > > > however the multi-gen LRU situation is 10times worse.
> > > > > > > > > >
> > > > > > > > > > From the stats below, node 3 had the lowest free memory. So I think in
> > > > > > > > > > both cases, the reclaim activities were as expected.
> > > > > > > > >
> > > > > > > > > I do not see a reason for the memory pressure and reclaims. This node
> > > > > > > > > has the lowest free memory of all nodes (~302MB free) that is true,
> > > > > > > > > however the swap space usage is just 4MB (still going in and out). So
> > > > > > > > > what can be the reason for that behaviour?
> > > > > > > >
> > > > > > > > The best analogy is that refuel (reclaim) happens before the tank
> > > > > > > > becomes empty, and it happens even sooner when there is a long road
> > > > > > > > ahead (high order allocations).
> > > > > > > >
> > > > > > > > > The workers/application is running in pre-allocated HugePages and the
> > > > > > > > > rest is used for a small set of system services and drivers of
> > > > > > > > > devices. It is static and not growing. The issue persists when I stop
> > > > > > > > > the system services and free the memory.
> > > > > > > >
> > > > > > > > Yes, this helps.
> > > > > > > > Also could you attach /proc/buddyinfo from the moment
> > > > > > > > you hit the problem?
> > > > > > > >
> > > > > > >
> > > > > > > I can. The problem is continuous, it is 100% of time continuously
> > > > > > > doing in/out and consuming 100% of CPU and locking IO.
> > > > > > >
> > > > > > > The output of /proc/buddyinfo is:
> > > > > > >
> > > > > > > # cat /proc/buddyinfo
> > > > > > > Node 0, zone DMA 7 2 2 1 1 2 1
> > > > > > > 1 1 2 1
> > > > > > > Node 0, zone DMA32 4567 3395 1357 846 439 190 93
> > > > > > > 61 43 23 4
> > > > > > > Node 0, zone Normal 19 190 140 129 136 75 66
> > > > > > > 41 9 1 5
> > > > > > > Node 1, zone Normal 194 1210 2080 1800 715 255 111
> > > > > > > 56 42 36 55
> > > > > > > Node 2, zone Normal 204 768 3766 3394 1742 468 185
> > > > > > > 194 238 47 74
> > > > > > > Node 3, zone Normal 1622 2137 1058 846 388 208 97
> > > > > > > 44 14 42 10
> > > > > >
> > > > > > Again, thinking out loud: there is only one zone on node 3, i.e., the
> > > > > > normal zone, and this excludes the problem commit
> > > > > > 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
> > > > > > reclaim") fixed in v6.6.
> > > > >
> > > > > I built vanila 6.6.1 and did the first fast test - spin up and destroy
> > > > > VMs only - This test does not always trigger the kswapd3 continuous
> > > > > swap in/out usage but it uses it and it looks like there is a
> > > > > change:
> > > > >
> > > > > I can see kswapd non-continous (15s and more) usage with 6.5.y
> > > > > # ps ax | grep [k]swapd
> > > > > 753 ? S 0:00 [kswapd0]
> > > > > 754 ? S 0:00 [kswapd1]
> > > > > 755 ? S 0:00 [kswapd2]
> > > > > 756 ? S 0:15 [kswapd3] <<<<<<<<<
> > > > > 757 ? S 0:00 [kswapd4]
> > > > > 758 ? S 0:00 [kswapd5]
> > > > > 759 ? S 0:00 [kswapd6]
> > > > > 760 ? S 0:00 [kswapd7]
> > > > > 761 ? S 0:00 [kswapd8]
> > > > > 762 ? S 0:00 [kswapd9]
> > > > > 763 ? S 0:00 [kswapd10]
> > > > > 764 ? S 0:00 [kswapd11]
> > > > > 765 ? S 0:00 [kswapd12]
> > > > > 766 ? S 0:00 [kswapd13]
> > > > > 767 ? S 0:00 [kswapd14]
> > > > > 768 ? S 0:00 [kswapd15]
> > > > >
> > > > > and none kswapd usage with 6.6.1, that looks to be promising path
> > > > >
> > > > > # ps ax | grep [k]swapd
> > > > > 808 ? S 0:00 [kswapd0]
> > > > > 809 ? S 0:00 [kswapd1]
> > > > > 810 ? S 0:00 [kswapd2]
> > > > > 811 ? S 0:00 [kswapd3] <<<< nice
> > > > > 812 ? S 0:00 [kswapd4]
> > > > > 813 ? S 0:00 [kswapd5]
> > > > > 814 ? S 0:00 [kswapd6]
> > > > > 815 ? S 0:00 [kswapd7]
> > > > > 816 ? S 0:00 [kswapd8]
> > > > > 817 ? S 0:00 [kswapd9]
> > > > > 818 ? S 0:00 [kswapd10]
> > > > > 819 ? S 0:00 [kswapd11]
> > > > > 820 ? S 0:00 [kswapd12]
> > > > > 821 ? S 0:00 [kswapd13]
> > > > > 822 ? S 0:00 [kswapd14]
> > > > > 823 ? S 0:00 [kswapd15]
> > > > >
> > > > > I will install the 6.6.1 on the server which is doing some work and
> > > > > observe it later today.
> > > >
> > > > Thanks. Fingers crossed.
> > >
> > > The 6.6.y was deployed and used from 9th Nov 3PM CEST. So far so good.
> > > The node 3 has 163MiB free of memory and I see
> > > just a few in/out swap usage sometimes (which is expected) and minimal
> > > kswapd3 process usage for almost 4days.
> >
> > Thanks for the update!
> >
> > Just to confirm:
> > 1. MGLRU was enabled, and
>
> Yes, MGLRU is enabled
>
> > 2. The v6.6 deployed did NOT have the patch I attached earlier.
>
> Vanila 6.6, attached patch NOT applied.
>
> > Are both correct?
> >
> > If so, I'd very appreciate it if you could try the attached patch on
> > top of v6.5 and see if it helps. My suspicion is that the problem is
> > compaction related, i.e., kswapd was woken up by high order
> > allocations but didn't properly stop. But what causes the behavior
>
> Sure, I can try it. Will inform you about progress.
Thanks!
> > difference on v6.5 between MGLRU and the active/inactive LRU still
> > puzzles me --the problem might be somehow masked rather than fixed on
> > v6.6.
>
> I'm not sure how I can help with the issue. Any suggestions on what to
> change/try?
Trying the attached patch is good enough for now :)
* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
2023-11-14 7:47 ` Yu Zhao
@ 2023-11-20 8:41 ` Jaroslav Pulchart
2023-11-22 6:13 ` Yu Zhao
0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2023-11-20 8:41 UTC (permalink / raw)
To: Yu Zhao; +Cc: linux-mm, akpm, Igor Raits, Daniel Secik, Charan Teja Kalla
[-- Attachment #1: Type: text/plain, Size: 12374 bytes --]
> On Tue, Nov 14, 2023 at 12:30 AM Jaroslav Pulchart
> <jaroslav.pulchart@gooddata.com> wrote:
> >
> > >
> > > On Mon, Nov 13, 2023 at 1:36 AM Jaroslav Pulchart
> > > <jaroslav.pulchart@gooddata.com> wrote:
> > > >
> > > > >
> > > > > On Thu, Nov 9, 2023 at 3:58 AM Jaroslav Pulchart
> > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > >
> > > > > > >
> > > > > > > On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
> > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Hi Jaroslav,
> > > > > > > > > >
> > > > > > > > > > Hi Yu Zhao
> > > > > > > > > >
> > > > > > > > > > thanks for response, see answers inline:
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Hello,
> > > > > > > > > > > >
> > > > > > > > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > > > > > > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > > > > > > > > > > system (16numa domains).
> > > > > > > > > > >
> > > > > > > > > > > Kernel version please?
> > > > > > > > > >
> > > > > > > > > > 6.5.y, but we saw it sooner as it is in investigation from 23th May
> > > > > > > > > > (6.4.y and maybe even the 6.3.y).
> > > > > > > > >
> > > > > > > > > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> > > > > > > > > for you if you run into other problems with v6.6.
> > > > > > > > >
> > > > > > > >
> > > > > > > > I will give it a try using 6.6.y. When it will work we can switch to
> > > > > > > > 6.6.y instead of backporting the stuff to 6.5.y.
> > > > > > > >
> > > > > > > > > > > > Symptoms of my issue are
> > > > > > > > > > > >
> > > > > > > > > > > > /A/ if mult-gen LRU is enabled
> > > > > > > > > > > > 1/ [kswapd3] is consuming 100% CPU
> > > > > > > > > > >
> > > > > > > > > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > > > > > > > > > >
> > > > > > > > > > > > top - 15:03:11 up 34 days, 1:51, 2 users, load average: 23.34,
> > > > > > > > > > > > 18.26, 15.01
> > > > > > > > > > > > Tasks: 1226 total, 2 running, 1224 sleeping, 0 stopped, 0 zombie
> > > > > > > > > > > > %Cpu(s): 12.5 us, 4.7 sy, 0.0 ni, 82.1 id, 0.0 wa, 0.4 hi,
> > > > > > > > > > > > 0.4 si, 0.0 st
> > > > > > > > > > > > MiB Mem : 1047265.+total, 28382.7 free, 1021308.+used, 767.6 buff/cache
> > > > > > > > > > > > MiB Swap: 8192.0 total, 8187.7 free, 4.2 used. 25956.7 avail Mem
> > > > > > > > > > > > ...
> > > > > > > > > > > > 765 root 20 0 0 0 0 R 98.3 0.0
> > > > > > > > > > > > 34969:04 kswapd3
> > > > > > > > > > > > ...
> > > > > > > > > > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > > > > > > > > > > > observed with swap disk as well and cause IO latency issues due to
> > > > > > > > > > > > some kind of locking)
> > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > /B/ if mult-gen LRU is disabled
> > > > > > > > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > > > > > > > > > top - 15:02:49 up 34 days, 1:51, 2 users, load average: 23.05,
> > > > > > > > > > > > 17.77, 14.77
> > > > > > > > > > > > Tasks: 1226 total, 1 running, 1225 sleeping, 0 stopped, 0 zombie
> > > > > > > > > > > > %Cpu(s): 14.7 us, 2.8 sy, 0.0 ni, 81.8 id, 0.0 wa, 0.4 hi,
> > > > > > > > > > > > 0.4 si, 0.0 st
> > > > > > > > > > > > MiB Mem : 1047265.+total, 28378.5 free, 1021313.+used, 767.3 buff/cache
> > > > > > > > > > > > MiB Swap: 8192.0 total, 8189.0 free, 3.0 used. 25952.4 avail Mem
> > > > > > > > > > > > ...
> > > > > > > > > > > > 765 root 20 0 0 0 0 S 3.6 0.0
> > > > > > > > > > > > 34966:46 [kswapd3]
> > > > > > > > > > > > ...
> > > > > > > > > > > > 2/ swap space usage is low (4MB)
> > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > > > > > > > > > > >
> > > > > > > > > > > > Both situations are wrong as they are using swap in/out extensively,
> > > > > > > > > > > > however the multi-gen LRU situation is 10times worse.
> > > > > > > > > > >
> > > > > > > > > > > From the stats below, node 3 had the lowest free memory. So I think in
> > > > > > > > > > > both cases, the reclaim activities were as expected.
> > > > > > > > > >
> > > > > > > > > > I do not see a reason for the memory pressure and reclaims. This node
> > > > > > > > > > has the lowest free memory of all nodes (~302MB free) that is true,
> > > > > > > > > > however the swap space usage is just 4MB (still going in and out). So
> > > > > > > > > > what can be the reason for that behaviour?
> > > > > > > > >
> > > > > > > > > The best analogy is that refuel (reclaim) happens before the tank
> > > > > > > > > becomes empty, and it happens even sooner when there is a long road
> > > > > > > > > ahead (high order allocations).
> > > > > > > > >
> > > > > > > > > > The workers/application is running in pre-allocated HugePages and the
> > > > > > > > > > rest is used for a small set of system services and drivers of
> > > > > > > > > > devices. It is static and not growing. The issue persists when I stop
> > > > > > > > > > the system services and free the memory.
> > > > > > > > >
> > > > > > > > > Yes, this helps.
> > > > > > > > > Also could you attach /proc/buddyinfo from the moment
> > > > > > > > > you hit the problem?
> > > > > > > > >
> > > > > > > >
> > > > > > > > I can. The problem is continuous, it is 100% of time continuously
> > > > > > > > doing in/out and consuming 100% of CPU and locking IO.
> > > > > > > >
> > > > > > > > The output of /proc/buddyinfo is:
> > > > > > > >
> > > > > > > > # cat /proc/buddyinfo
> > > > > > > > Node 0, zone DMA 7 2 2 1 1 2 1
> > > > > > > > 1 1 2 1
> > > > > > > > Node 0, zone DMA32 4567 3395 1357 846 439 190 93
> > > > > > > > 61 43 23 4
> > > > > > > > Node 0, zone Normal 19 190 140 129 136 75 66
> > > > > > > > 41 9 1 5
> > > > > > > > Node 1, zone Normal 194 1210 2080 1800 715 255 111
> > > > > > > > 56 42 36 55
> > > > > > > > Node 2, zone Normal 204 768 3766 3394 1742 468 185
> > > > > > > > 194 238 47 74
> > > > > > > > Node 3, zone Normal 1622 2137 1058 846 388 208 97
> > > > > > > > 44 14 42 10
> > > > > > >
> > > > > > > Again, thinking out loud: there is only one zone on node 3, i.e., the
> > > > > > > normal zone, and this excludes the problem commit
> > > > > > > 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
> > > > > > > reclaim") fixed in v6.6.
> > > > > >
> > > > > > I built vanila 6.6.1 and did the first fast test - spin up and destroy
> > > > > > VMs only - This test does not always trigger the kswapd3 continuous
> > > > > > swap in/out usage but it uses it and it looks like there is a
> > > > > > change:
> > > > > >
> > > > > > I can see kswapd non-continous (15s and more) usage with 6.5.y
> > > > > > # ps ax | grep [k]swapd
> > > > > > 753 ? S 0:00 [kswapd0]
> > > > > > 754 ? S 0:00 [kswapd1]
> > > > > > 755 ? S 0:00 [kswapd2]
> > > > > > 756 ? S 0:15 [kswapd3] <<<<<<<<<
> > > > > > 757 ? S 0:00 [kswapd4]
> > > > > > 758 ? S 0:00 [kswapd5]
> > > > > > 759 ? S 0:00 [kswapd6]
> > > > > > 760 ? S 0:00 [kswapd7]
> > > > > > 761 ? S 0:00 [kswapd8]
> > > > > > 762 ? S 0:00 [kswapd9]
> > > > > > 763 ? S 0:00 [kswapd10]
> > > > > > 764 ? S 0:00 [kswapd11]
> > > > > > 765 ? S 0:00 [kswapd12]
> > > > > > 766 ? S 0:00 [kswapd13]
> > > > > > 767 ? S 0:00 [kswapd14]
> > > > > > 768 ? S 0:00 [kswapd15]
> > > > > >
> > > > > > and none kswapd usage with 6.6.1, that looks to be promising path
> > > > > >
> > > > > > # ps ax | grep [k]swapd
> > > > > > 808 ? S 0:00 [kswapd0]
> > > > > > 809 ? S 0:00 [kswapd1]
> > > > > > 810 ? S 0:00 [kswapd2]
> > > > > > 811 ? S 0:00 [kswapd3] <<<< nice
> > > > > > 812 ? S 0:00 [kswapd4]
> > > > > > 813 ? S 0:00 [kswapd5]
> > > > > > 814 ? S 0:00 [kswapd6]
> > > > > > 815 ? S 0:00 [kswapd7]
> > > > > > 816 ? S 0:00 [kswapd8]
> > > > > > 817 ? S 0:00 [kswapd9]
> > > > > > 818 ? S 0:00 [kswapd10]
> > > > > > 819 ? S 0:00 [kswapd11]
> > > > > > 820 ? S 0:00 [kswapd12]
> > > > > > 821 ? S 0:00 [kswapd13]
> > > > > > 822 ? S 0:00 [kswapd14]
> > > > > > 823 ? S 0:00 [kswapd15]
> > > > > >
> > > > > > I will install the 6.6.1 on the server which is doing some work and
> > > > > > observe it later today.
> > > > >
> > > > > Thanks. Fingers crossed.
> > > >
> > > > The 6.6.y was deployed and used from 9th Nov 3PM CEST. So far so good.
> > > > The node 3 has 163MiB free of memory and I see
> > > > just a few in/out swap usage sometimes (which is expected) and minimal
> > > > kswapd3 process usage for almost 4days.
> > >
> > > Thanks for the update!
> > >
> > > Just to confirm:
> > > 1. MGLRU was enabled, and
> >
> > Yes, MGLRU is enabled
> >
> > > 2. The v6.6 deployed did NOT have the patch I attached earlier.
> >
> > Vanila 6.6, attached patch NOT applied.
> >
> > > Are both correct?
> > >
> > > If so, I'd very appreciate it if you could try the attached patch on
> > > top of v6.5 and see if it helps. My suspicion is that the problem is
> > > compaction related, i.e., kswapd was woken up by high order
> > > allocations but didn't properly stop. But what causes the behavior
> >
> > Sure, I can try it. Will inform you about progress.
>
> Thanks!
>
> > > difference on v6.5 between MGLRU and the active/inactive LRU still
> > > puzzles me --the problem might be somehow masked rather than fixed on
> > > v6.6.
> >
> > I'm not sure how I can help with the issue. Any suggestions on what to
> > change/try?
>
> Trying the attached patch is good enough for now :)
So far I'm running the "6.5.y + patch" for 4 days without triggering
the infinite swap in/out usage.
I'm observing a similar pattern in kswapd usage: if kswapd is used at
all, it is mostly kswapd3, like vanilla 6.5.y and unlike 6.6.y. (Node 3
free memory is 159 MB.)
# ps ax | grep [k]swapd
750 ? S 0:00 [kswapd0]
751 ? S 0:00 [kswapd1]
752 ? S 0:00 [kswapd2]
753 ? S 0:02 [kswapd3] <<<< it uses kswapd3, good is that it is not continuous
754 ? S 0:00 [kswapd4]
755 ? S 0:00 [kswapd5]
756 ? S 0:00 [kswapd6]
757 ? S 0:00 [kswapd7]
758 ? S 0:00 [kswapd8]
759 ? S 0:00 [kswapd9]
760 ? S 0:00 [kswapd10]
761 ? S 0:00 [kswapd11]
762 ? S 0:00 [kswapd12]
763 ? S 0:00 [kswapd13]
764 ? S 0:00 [kswapd14]
765 ? S 0:00 [kswapd15]
The good news is that the system did not end up in a continuous loop of
swap in/out usage (at least so far), which is great. See the attached
swap_in_out_good_vs_bad.png. I will keep it running for the next 3
days.
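For reference, the symmetric in/out rates discussed in this thread (e.g. the
~12 MB/s each way seen with MGLRU enabled) can be derived from two snapshots
of the pswpin/pswpout counters in /proc/vmstat. A minimal sketch; the counter
values below are illustrative, not taken from the report:

```python
# Sketch: derive swap-in/out rates (MB/s) from two /proc/vmstat snapshots.
# pswpin/pswpout are cumulative page counts; multiply the delta by the page
# size and divide by the sampling interval.
PAGE_SIZE = 4096  # bytes; assumes 4 KiB pages as on this x86_64 box

def parse_vmstat(text):
    """Return {counter: value} from /proc/vmstat-style text."""
    counters = {}
    for line in text.splitlines():
        key, _, val = line.partition(" ")
        if val.strip().isdigit():
            counters[key] = int(val)
    return counters

def swap_rates(sample_a, sample_b, interval_s):
    """MB/s swapped in and out between two snapshots taken interval_s apart."""
    a, b = parse_vmstat(sample_a), parse_vmstat(sample_b)
    to_mbs = lambda pages: pages * PAGE_SIZE / interval_s / 1e6
    return (to_mbs(b["pswpin"] - a["pswpin"]),
            to_mbs(b["pswpout"] - a["pswpout"]))

# Illustrative numbers only -- not from the report above.
t0 = "pswpin 1000\npswpout 2000\n"
t1 = "pswpin 4000\npswpout 5000\n"
print(swap_rates(t0, t1, 1.0))  # ~12.3 MB/s each way: the symmetric pattern
```

Reading the two samples from /proc/vmstat one second apart reproduces the
in/out rates plotted in the attached graph.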
[-- Attachment #2: swap_in_out_good_vs_bad.png --]
[-- Type: image/png, Size: 81234 bytes --]
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
2023-11-20 8:41 ` Jaroslav Pulchart
@ 2023-11-22 6:13 ` Yu Zhao
2023-11-22 7:12 ` Jaroslav Pulchart
0 siblings, 1 reply; 30+ messages in thread
From: Yu Zhao @ 2023-11-22 6:13 UTC (permalink / raw)
To: Jaroslav Pulchart
Cc: linux-mm, akpm, Igor Raits, Daniel Secik, Charan Teja Kalla,
Kalesh Singh
On Mon, Nov 20, 2023 at 1:42 AM Jaroslav Pulchart
<jaroslav.pulchart@gooddata.com> wrote:
>
> > On Tue, Nov 14, 2023 at 12:30 AM Jaroslav Pulchart
> > <jaroslav.pulchart@gooddata.com> wrote:
> > >
> > > >
> > > > On Mon, Nov 13, 2023 at 1:36 AM Jaroslav Pulchart
> > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > >
> > > > > >
> > > > > > On Thu, Nov 9, 2023 at 3:58 AM Jaroslav Pulchart
> > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
> > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Jaroslav,
> > > > > > > > > > >
> > > > > > > > > > > Hi Yu Zhao
> > > > > > > > > > >
> > > > > > > > > > > thanks for response, see answers inline:
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hello,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > > > > > > > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > > > > > > > > > > > system (16numa domains).
> > > > > > > > > > > >
> > > > > > > > > > > > Kernel version please?
> > > > > > > > > > >
> > > > > > > > > > > 6.5.y, but we saw it sooner as it is in investigation from 23th May
> > > > > > > > > > > (6.4.y and maybe even the 6.3.y).
> > > > > > > > > >
> > > > > > > > > > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> > > > > > > > > > for you if you run into other problems with v6.6.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I will give it a try using 6.6.y. When it will work we can switch to
> > > > > > > > > 6.6.y instead of backporting the stuff to 6.5.y.
> > > > > > > > >
> > > > > > > > > > > > > Symptoms of my issue are
> > > > > > > > > > > > >
> > > > > > > > > > > > > /A/ if mult-gen LRU is enabled
> > > > > > > > > > > > > 1/ [kswapd3] is consuming 100% CPU
> > > > > > > > > > > >
> > > > > > > > > > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > > > > > > > > > > >
> > > > > > > > > > > > > top - 15:03:11 up 34 days, 1:51, 2 users, load average: 23.34,
> > > > > > > > > > > > > 18.26, 15.01
> > > > > > > > > > > > > Tasks: 1226 total, 2 running, 1224 sleeping, 0 stopped, 0 zombie
> > > > > > > > > > > > > %Cpu(s): 12.5 us, 4.7 sy, 0.0 ni, 82.1 id, 0.0 wa, 0.4 hi,
> > > > > > > > > > > > > 0.4 si, 0.0 st
> > > > > > > > > > > > > MiB Mem : 1047265.+total, 28382.7 free, 1021308.+used, 767.6 buff/cache
> > > > > > > > > > > > > MiB Swap: 8192.0 total, 8187.7 free, 4.2 used. 25956.7 avail Mem
> > > > > > > > > > > > > ...
> > > > > > > > > > > > > 765 root 20 0 0 0 0 R 98.3 0.0
> > > > > > > > > > > > > 34969:04 kswapd3
> > > > > > > > > > > > > ...
> > > > > > > > > > > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > > > > > > > > > > > > observed with swap disk as well and cause IO latency issues due to
> > > > > > > > > > > > > some kind of locking)
> > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > /B/ if mult-gen LRU is disabled
> > > > > > > > > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > > > > > > > > > > top - 15:02:49 up 34 days, 1:51, 2 users, load average: 23.05,
> > > > > > > > > > > > > 17.77, 14.77
> > > > > > > > > > > > > Tasks: 1226 total, 1 running, 1225 sleeping, 0 stopped, 0 zombie
> > > > > > > > > > > > > %Cpu(s): 14.7 us, 2.8 sy, 0.0 ni, 81.8 id, 0.0 wa, 0.4 hi,
> > > > > > > > > > > > > 0.4 si, 0.0 st
> > > > > > > > > > > > > MiB Mem : 1047265.+total, 28378.5 free, 1021313.+used, 767.3 buff/cache
> > > > > > > > > > > > > MiB Swap: 8192.0 total, 8189.0 free, 3.0 used. 25952.4 avail Mem
> > > > > > > > > > > > > ...
> > > > > > > > > > > > > 765 root 20 0 0 0 0 S 3.6 0.0
> > > > > > > > > > > > > 34966:46 [kswapd3]
> > > > > > > > > > > > > ...
> > > > > > > > > > > > > 2/ swap space usage is low (4MB)
> > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > > > > > > > > > > > >
> > > > > > > > > > > > > Both situations are wrong as they are using swap in/out extensively,
> > > > > > > > > > > > > however the multi-gen LRU situation is 10times worse.
> > > > > > > > > > > >
> > > > > > > > > > > > From the stats below, node 3 had the lowest free memory. So I think in
> > > > > > > > > > > > both cases, the reclaim activities were as expected.
> > > > > > > > > > >
> > > > > > > > > > > I do not see a reason for the memory pressure and reclaims. This node
> > > > > > > > > > > has the lowest free memory of all nodes (~302MB free) that is true,
> > > > > > > > > > > however the swap space usage is just 4MB (still going in and out). So
> > > > > > > > > > > what can be the reason for that behaviour?
> > > > > > > > > >
> > > > > > > > > > The best analogy is that refuel (reclaim) happens before the tank
> > > > > > > > > > becomes empty, and it happens even sooner when there is a long road
> > > > > > > > > > ahead (high order allocations).
> > > > > > > > > >
> > > > > > > > > > > The workers/application is running in pre-allocated HugePages and the
> > > > > > > > > > > rest is used for a small set of system services and drivers of
> > > > > > > > > > > devices. It is static and not growing. The issue persists when I stop
> > > > > > > > > > > the system services and free the memory.
> > > > > > > > > >
> > > > > > > > > > Yes, this helps.
> > > > > > > > > > Also could you attach /proc/buddyinfo from the moment
> > > > > > > > > > you hit the problem?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I can. The problem is continuous, it is 100% of time continuously
> > > > > > > > > doing in/out and consuming 100% of CPU and locking IO.
> > > > > > > > >
> > > > > > > > > The output of /proc/buddyinfo is:
> > > > > > > > >
> > > > > > > > > # cat /proc/buddyinfo
> > > > > > > > > Node 0, zone DMA 7 2 2 1 1 2 1
> > > > > > > > > 1 1 2 1
> > > > > > > > > Node 0, zone DMA32 4567 3395 1357 846 439 190 93
> > > > > > > > > 61 43 23 4
> > > > > > > > > Node 0, zone Normal 19 190 140 129 136 75 66
> > > > > > > > > 41 9 1 5
> > > > > > > > > Node 1, zone Normal 194 1210 2080 1800 715 255 111
> > > > > > > > > 56 42 36 55
> > > > > > > > > Node 2, zone Normal 204 768 3766 3394 1742 468 185
> > > > > > > > > 194 238 47 74
> > > > > > > > > Node 3, zone Normal 1622 2137 1058 846 388 208 97
> > > > > > > > > 44 14 42 10
> > > > > > > >
> > > > > > > > Again, thinking out loud: there is only one zone on node 3, i.e., the
> > > > > > > > normal zone, and this excludes the problem commit
> > > > > > > > 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
> > > > > > > > reclaim") fixed in v6.6.
> > > > > > >
> > > > > > > I built vanila 6.6.1 and did the first fast test - spin up and destroy
> > > > > > > VMs only - This test does not always trigger the kswapd3 continuous
> > > > > > > swap in/out usage but it uses it and it looks like there is a
> > > > > > > change:
> > > > > > >
> > > > > > > I can see kswapd non-continous (15s and more) usage with 6.5.y
> > > > > > > # ps ax | grep [k]swapd
> > > > > > > 753 ? S 0:00 [kswapd0]
> > > > > > > 754 ? S 0:00 [kswapd1]
> > > > > > > 755 ? S 0:00 [kswapd2]
> > > > > > > 756 ? S 0:15 [kswapd3] <<<<<<<<<
> > > > > > > 757 ? S 0:00 [kswapd4]
> > > > > > > 758 ? S 0:00 [kswapd5]
> > > > > > > 759 ? S 0:00 [kswapd6]
> > > > > > > 760 ? S 0:00 [kswapd7]
> > > > > > > 761 ? S 0:00 [kswapd8]
> > > > > > > 762 ? S 0:00 [kswapd9]
> > > > > > > 763 ? S 0:00 [kswapd10]
> > > > > > > 764 ? S 0:00 [kswapd11]
> > > > > > > 765 ? S 0:00 [kswapd12]
> > > > > > > 766 ? S 0:00 [kswapd13]
> > > > > > > 767 ? S 0:00 [kswapd14]
> > > > > > > 768 ? S 0:00 [kswapd15]
> > > > > > >
> > > > > > > and no kswapd usage with 6.6.1, which looks to be a promising path
> > > > > > >
> > > > > > > # ps ax | grep [k]swapd
> > > > > > > 808 ? S 0:00 [kswapd0]
> > > > > > > 809 ? S 0:00 [kswapd1]
> > > > > > > 810 ? S 0:00 [kswapd2]
> > > > > > > 811 ? S 0:00 [kswapd3] <<<< nice
> > > > > > > 812 ? S 0:00 [kswapd4]
> > > > > > > 813 ? S 0:00 [kswapd5]
> > > > > > > 814 ? S 0:00 [kswapd6]
> > > > > > > 815 ? S 0:00 [kswapd7]
> > > > > > > 816 ? S 0:00 [kswapd8]
> > > > > > > 817 ? S 0:00 [kswapd9]
> > > > > > > 818 ? S 0:00 [kswapd10]
> > > > > > > 819 ? S 0:00 [kswapd11]
> > > > > > > 820 ? S 0:00 [kswapd12]
> > > > > > > 821 ? S 0:00 [kswapd13]
> > > > > > > 822 ? S 0:00 [kswapd14]
> > > > > > > 823 ? S 0:00 [kswapd15]
> > > > > > >
> > > > > > > I will install the 6.6.1 on the server which is doing some work and
> > > > > > > observe it later today.
> > > > > >
> > > > > > Thanks. Fingers crossed.
> > > > >
> > > > > The 6.6.y was deployed and used from 9th Nov 3PM CEST. So far so good.
> > > > > The node 3 has 163MiB free of memory and I see
> > > > > just a few in/out swap usage sometimes (which is expected) and minimal
> > > > > kswapd3 process usage for almost 4days.
> > > >
> > > > Thanks for the update!
> > > >
> > > > Just to confirm:
> > > > 1. MGLRU was enabled, and
> > >
> > > Yes, MGLRU is enabled
> > >
> > > > 2. The v6.6 deployed did NOT have the patch I attached earlier.
> > >
> > > Vanilla 6.6, attached patch NOT applied.
> > >
> > > > Are both correct?
> > > >
> > > > If so, I'd very appreciate it if you could try the attached patch on
> > > > top of v6.5 and see if it helps. My suspicion is that the problem is
> > > > compaction related, i.e., kswapd was woken up by high order
> > > > allocations but didn't properly stop. But what causes the behavior
> > >
> > > Sure, I can try it. Will inform you about progress.
> >
> > Thanks!
> >
> > > > difference on v6.5 between MGLRU and the active/inactive LRU still
> > > > puzzles me --the problem might be somehow masked rather than fixed on
> > > > v6.6.
> > >
> > > I'm not sure how I can help with the issue. Any suggestions on what to
> > > change/try?
> >
> > Trying the attached patch is good enough for now :)
>
> So far I'm running the "6.5.y + patch" for 4 days without triggering
> the infinite swap in/out usage.
>
> I'm observing a similar pattern in kswapd usage: if kswapd is used at
> all, it is mostly kswapd3, like vanilla 6.5.y and unlike 6.6.y. (Node 3
> free memory is 159 MB.)
> # ps ax | grep [k]swapd
> 750 ? S 0:00 [kswapd0]
> 751 ? S 0:00 [kswapd1]
> 752 ? S 0:00 [kswapd2]
> 753 ? S 0:02 [kswapd3] <<<< it uses kswapd3, good is that it is not continuous
> 754 ? S 0:00 [kswapd4]
> 755 ? S 0:00 [kswapd5]
> 756 ? S 0:00 [kswapd6]
> 757 ? S 0:00 [kswapd7]
> 758 ? S 0:00 [kswapd8]
> 759 ? S 0:00 [kswapd9]
> 760 ? S 0:00 [kswapd10]
> 761 ? S 0:00 [kswapd11]
> 762 ? S 0:00 [kswapd12]
> 763 ? S 0:00 [kswapd13]
> 764 ? S 0:00 [kswapd14]
> 765 ? S 0:00 [kswapd15]
>
> The good news is that the system did not end up in a continuous loop of
> swap in/out usage (at least so far), which is great. See the attached
> swap_in_out_good_vs_bad.png. I will keep it running for the next 3
> days.
Thanks again, Jaroslav!
Just a note here: I suspect the problem still exists on v6.6 but
somehow is masked, possibly by reduced memory usage from the kernel
itself and more free memory for userspace. So to be on the safe side,
I'll post the patch and credit you as the reporter and tester.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
2023-11-22 6:13 ` Yu Zhao
@ 2023-11-22 7:12 ` Jaroslav Pulchart
2023-11-22 7:30 ` Jaroslav Pulchart
0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2023-11-22 7:12 UTC (permalink / raw)
To: Yu Zhao
Cc: linux-mm, akpm, Igor Raits, Daniel Secik, Charan Teja Kalla,
Kalesh Singh
[-- Attachment #1: Type: text/plain, Size: 13980 bytes --]
>
> On Mon, Nov 20, 2023 at 1:42 AM Jaroslav Pulchart
> <jaroslav.pulchart@gooddata.com> wrote:
> >
> > > On Tue, Nov 14, 2023 at 12:30 AM Jaroslav Pulchart
> > > <jaroslav.pulchart@gooddata.com> wrote:
> > > >
> > > > >
> > > > > On Mon, Nov 13, 2023 at 1:36 AM Jaroslav Pulchart
> > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > >
> > > > > > >
> > > > > > > On Thu, Nov 9, 2023 at 3:58 AM Jaroslav Pulchart
> > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
> > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hi Jaroslav,
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Yu Zhao
> > > > > > > > > > > >
> > > > > > > > > > > > thanks for response, see answers inline:
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hello,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > > > > > > > > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > > > > > > > > > > > > system (16numa domains).
> > > > > > > > > > > > >
> > > > > > > > > > > > > Kernel version please?
> > > > > > > > > > > >
> > > > > > > > > > > > 6.5.y, but we saw it sooner as it is in investigation from 23th May
> > > > > > > > > > > > (6.4.y and maybe even the 6.3.y).
> > > > > > > > > > >
> > > > > > > > > > > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> > > > > > > > > > > for you if you run into other problems with v6.6.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I will give it a try using 6.6.y. When it will work we can switch to
> > > > > > > > > > 6.6.y instead of backporting the stuff to 6.5.y.
> > > > > > > > > >
> > > > > > > > > > > > > > Symptoms of my issue are
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > /A/ if mult-gen LRU is enabled
> > > > > > > > > > > > > > 1/ [kswapd3] is consuming 100% CPU
> > > > > > > > > > > > >
> > > > > > > > > > > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > top - 15:03:11 up 34 days, 1:51, 2 users, load average: 23.34,
> > > > > > > > > > > > > > 18.26, 15.01
> > > > > > > > > > > > > > Tasks: 1226 total, 2 running, 1224 sleeping, 0 stopped, 0 zombie
> > > > > > > > > > > > > > %Cpu(s): 12.5 us, 4.7 sy, 0.0 ni, 82.1 id, 0.0 wa, 0.4 hi,
> > > > > > > > > > > > > > 0.4 si, 0.0 st
> > > > > > > > > > > > > > MiB Mem : 1047265.+total, 28382.7 free, 1021308.+used, 767.6 buff/cache
> > > > > > > > > > > > > > MiB Swap: 8192.0 total, 8187.7 free, 4.2 used. 25956.7 avail Mem
> > > > > > > > > > > > > > ...
> > > > > > > > > > > > > > 765 root 20 0 0 0 0 R 98.3 0.0
> > > > > > > > > > > > > > 34969:04 kswapd3
> > > > > > > > > > > > > > ...
> > > > > > > > > > > > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > > > > > > > > > > > > > observed with swap disk as well and cause IO latency issues due to
> > > > > > > > > > > > > > some kind of locking)
> > > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > /B/ if mult-gen LRU is disabled
> > > > > > > > > > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > > > > > > > > > > > top - 15:02:49 up 34 days, 1:51, 2 users, load average: 23.05,
> > > > > > > > > > > > > > 17.77, 14.77
> > > > > > > > > > > > > > Tasks: 1226 total, 1 running, 1225 sleeping, 0 stopped, 0 zombie
> > > > > > > > > > > > > > %Cpu(s): 14.7 us, 2.8 sy, 0.0 ni, 81.8 id, 0.0 wa, 0.4 hi,
> > > > > > > > > > > > > > 0.4 si, 0.0 st
> > > > > > > > > > > > > > MiB Mem : 1047265.+total, 28378.5 free, 1021313.+used, 767.3 buff/cache
> > > > > > > > > > > > > > MiB Swap: 8192.0 total, 8189.0 free, 3.0 used. 25952.4 avail Mem
> > > > > > > > > > > > > > ...
> > > > > > > > > > > > > > 765 root 20 0 0 0 0 S 3.6 0.0
> > > > > > > > > > > > > > 34966:46 [kswapd3]
> > > > > > > > > > > > > > ...
> > > > > > > > > > > > > > 2/ swap space usage is low (4MB)
> > > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Both situations are wrong as they are using swap in/out extensively,
> > > > > > > > > > > > > > however the multi-gen LRU situation is 10times worse.
> > > > > > > > > > > > >
> > > > > > > > > > > > > From the stats below, node 3 had the lowest free memory. So I think in
> > > > > > > > > > > > > both cases, the reclaim activities were as expected.
> > > > > > > > > > > >
> > > > > > > > > > > > I do not see a reason for the memory pressure and reclaims. This node
> > > > > > > > > > > > has the lowest free memory of all nodes (~302MB free) that is true,
> > > > > > > > > > > > however the swap space usage is just 4MB (still going in and out). So
> > > > > > > > > > > > what can be the reason for that behaviour?
> > > > > > > > > > >
> > > > > > > > > > > The best analogy is that refuel (reclaim) happens before the tank
> > > > > > > > > > > becomes empty, and it happens even sooner when there is a long road
> > > > > > > > > > > ahead (high order allocations).
> > > > > > > > > > >
> > > > > > > > > > > > The workers/application is running in pre-allocated HugePages and the
> > > > > > > > > > > > rest is used for a small set of system services and drivers of
> > > > > > > > > > > > devices. It is static and not growing. The issue persists when I stop
> > > > > > > > > > > > the system services and free the memory.
> > > > > > > > > > >
> > > > > > > > > > > Yes, this helps.
> > > > > > > > > > > Also could you attach /proc/buddyinfo from the moment
> > > > > > > > > > > you hit the problem?
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I can. The problem is continuous, it is 100% of time continuously
> > > > > > > > > > doing in/out and consuming 100% of CPU and locking IO.
> > > > > > > > > >
> > > > > > > > > > The output of /proc/buddyinfo is:
> > > > > > > > > >
> > > > > > > > > > # cat /proc/buddyinfo
> > > > > > > > > > Node 0, zone DMA 7 2 2 1 1 2 1
> > > > > > > > > > 1 1 2 1
> > > > > > > > > > Node 0, zone DMA32 4567 3395 1357 846 439 190 93
> > > > > > > > > > 61 43 23 4
> > > > > > > > > > Node 0, zone Normal 19 190 140 129 136 75 66
> > > > > > > > > > 41 9 1 5
> > > > > > > > > > Node 1, zone Normal 194 1210 2080 1800 715 255 111
> > > > > > > > > > 56 42 36 55
> > > > > > > > > > Node 2, zone Normal 204 768 3766 3394 1742 468 185
> > > > > > > > > > 194 238 47 74
> > > > > > > > > > Node 3, zone Normal 1622 2137 1058 846 388 208 97
> > > > > > > > > > 44 14 42 10
> > > > > > > > >
> > > > > > > > > Again, thinking out loud: there is only one zone on node 3, i.e., the
> > > > > > > > > normal zone, and this excludes the problem commit
> > > > > > > > > 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
> > > > > > > > > reclaim") fixed in v6.6.
> > > > > > > >
> > > > > > > > I built vanila 6.6.1 and did the first fast test - spin up and destroy
> > > > > > > > VMs only - This test does not always trigger the kswapd3 continuous
> > > > > > > > swap in/out usage but it uses it and it looks like there is a
> > > > > > > > change:
> > > > > > > >
> > > > > > > > I can see kswapd non-continous (15s and more) usage with 6.5.y
> > > > > > > > # ps ax | grep [k]swapd
> > > > > > > > 753 ? S 0:00 [kswapd0]
> > > > > > > > 754 ? S 0:00 [kswapd1]
> > > > > > > > 755 ? S 0:00 [kswapd2]
> > > > > > > > 756 ? S 0:15 [kswapd3] <<<<<<<<<
> > > > > > > > 757 ? S 0:00 [kswapd4]
> > > > > > > > 758 ? S 0:00 [kswapd5]
> > > > > > > > 759 ? S 0:00 [kswapd6]
> > > > > > > > 760 ? S 0:00 [kswapd7]
> > > > > > > > 761 ? S 0:00 [kswapd8]
> > > > > > > > 762 ? S 0:00 [kswapd9]
> > > > > > > > 763 ? S 0:00 [kswapd10]
> > > > > > > > 764 ? S 0:00 [kswapd11]
> > > > > > > > 765 ? S 0:00 [kswapd12]
> > > > > > > > 766 ? S 0:00 [kswapd13]
> > > > > > > > 767 ? S 0:00 [kswapd14]
> > > > > > > > 768 ? S 0:00 [kswapd15]
> > > > > > > >
> > > > > > > > and no kswapd usage with 6.6.1, which looks to be a promising path
> > > > > > > >
> > > > > > > > # ps ax | grep [k]swapd
> > > > > > > > 808 ? S 0:00 [kswapd0]
> > > > > > > > 809 ? S 0:00 [kswapd1]
> > > > > > > > 810 ? S 0:00 [kswapd2]
> > > > > > > > 811 ? S 0:00 [kswapd3] <<<< nice
> > > > > > > > 812 ? S 0:00 [kswapd4]
> > > > > > > > 813 ? S 0:00 [kswapd5]
> > > > > > > > 814 ? S 0:00 [kswapd6]
> > > > > > > > 815 ? S 0:00 [kswapd7]
> > > > > > > > 816 ? S 0:00 [kswapd8]
> > > > > > > > 817 ? S 0:00 [kswapd9]
> > > > > > > > 818 ? S 0:00 [kswapd10]
> > > > > > > > 819 ? S 0:00 [kswapd11]
> > > > > > > > 820 ? S 0:00 [kswapd12]
> > > > > > > > 821 ? S 0:00 [kswapd13]
> > > > > > > > 822 ? S 0:00 [kswapd14]
> > > > > > > > 823 ? S 0:00 [kswapd15]
> > > > > > > >
> > > > > > > > I will install the 6.6.1 on the server which is doing some work and
> > > > > > > > observe it later today.
> > > > > > >
> > > > > > > Thanks. Fingers crossed.
> > > > > >
> > > > > > The 6.6.y was deployed and used from 9th Nov 3PM CEST. So far so good.
> > > > > > The node 3 has 163MiB free of memory and I see
> > > > > > just a few in/out swap usage sometimes (which is expected) and minimal
> > > > > > kswapd3 process usage for almost 4days.
> > > > >
> > > > > Thanks for the update!
> > > > >
> > > > > Just to confirm:
> > > > > 1. MGLRU was enabled, and
> > > >
> > > > Yes, MGLRU is enabled
> > > >
> > > > > 2. The v6.6 deployed did NOT have the patch I attached earlier.
> > > >
> > > > Vanilla 6.6, attached patch NOT applied.
> > > >
> > > > > Are both correct?
> > > > >
> > > > > If so, I'd very appreciate it if you could try the attached patch on
> > > > > top of v6.5 and see if it helps. My suspicion is that the problem is
> > > > > compaction related, i.e., kswapd was woken up by high order
> > > > > allocations but didn't properly stop. But what causes the behavior
> > > >
> > > > Sure, I can try it. Will inform you about progress.
> > >
> > > Thanks!
> > >
> > > > > difference on v6.5 between MGLRU and the active/inactive LRU still
> > > > > puzzles me --the problem might be somehow masked rather than fixed on
> > > > > v6.6.
> > > >
> > > > I'm not sure how I can help with the issue. Any suggestions on what to
> > > > change/try?
> > >
> > > Trying the attached patch is good enough for now :)
> >
> > So far I'm running the "6.5.y + patch" for 4 days without triggering
> > the infinite swap in/out usage.
> >
> > I'm observing a similar pattern in kswapd usage: if kswapd is used at
> > all, it is mostly kswapd3, like vanilla 6.5.y and unlike 6.6.y. (Node 3
> > free memory is 159 MB.)
> > # ps ax | grep [k]swapd
> > 750 ? S 0:00 [kswapd0]
> > 751 ? S 0:00 [kswapd1]
> > 752 ? S 0:00 [kswapd2]
> > 753 ? S 0:02 [kswapd3] <<<< it uses kswapd3, good is that it is not continuous
> > 754 ? S 0:00 [kswapd4]
> > 755 ? S 0:00 [kswapd5]
> > 756 ? S 0:00 [kswapd6]
> > 757 ? S 0:00 [kswapd7]
> > 758 ? S 0:00 [kswapd8]
> > 759 ? S 0:00 [kswapd9]
> > 760 ? S 0:00 [kswapd10]
> > 761 ? S 0:00 [kswapd11]
> > 762 ? S 0:00 [kswapd12]
> > 763 ? S 0:00 [kswapd13]
> > 764 ? S 0:00 [kswapd14]
> > 765 ? S 0:00 [kswapd15]
> >
> > The good news is that the system did not end up in a continuous loop of
> > swap in/out usage (at least so far), which is great. See the attached
> > swap_in_out_good_vs_bad.png. I will keep it running for the next 3
> > days.
>
> Thanks again, Jaroslav!
>
> Just a note here: I suspect the problem still exists on v6.6 but
> somehow is masked, possibly by reduced memory usage from the kernel
> itself and more free memory for userspace. So to be on the safe side,
> I'll post the patch and credit you as the reporter and tester.
Morning, let's wait. I reviewed the graph and the swap in/out started
happening again from 1:50 AM CET. It is slower than before (CPU
utilization 0.3%), but it is still doing in/out; see the attached PNG.
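A per-thread utilization figure like the 0.3% above can be computed from two
readings of the kswapd thread's /proc/<pid>/stat (utime and stime are
cumulative jiffies). A minimal sketch assuming the usual USER_HZ of 100; the
stat lines below are illustrative and heavily truncated:

```python
# Sketch: estimate a kernel thread's CPU utilization between two readings of
# /proc/<pid>/stat. Fields 14 and 15 (utime, stime) are cumulative jiffies.
CLK_TCK = 100  # jiffies per second; assumes USER_HZ=100 (os.sysconf("SC_CLK_TCK"))

def cpu_jiffies(stat_line):
    """utime + stime from a /proc/<pid>/stat line (comm may contain spaces)."""
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so fields 14/15 are rest[11]/rest[12]
    return int(rest[11]) + int(rest[12])

def cpu_percent(stat_a, stat_b, interval_s):
    """Percent of one CPU used between two stat readings interval_s apart."""
    delta = cpu_jiffies(stat_b) - cpu_jiffies(stat_a)
    return 100.0 * delta / (CLK_TCK * interval_s)

# Illustrative, truncated stat lines -- not real readings.
a = "765 (kswapd3) S 2 0 0 0 -1 0 0 0 0 0 10 20 0 0"
b = "765 (kswapd3) S 2 0 0 0 -1 0 0 0 0 0 13 20 0 0"
print(cpu_percent(a, b, 10.0))  # 3 jiffies over 10 s -> 0.3
```

Sampling /proc/<pid-of-kswapd3>/stat twice with a known interval gives the
same number top reports in its %CPU column.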
[-- Attachment #2: in_out_again.png --]
[-- Type: image/png, Size: 23506 bytes --]
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
2023-11-22 7:12 ` Jaroslav Pulchart
@ 2023-11-22 7:30 ` Jaroslav Pulchart
2023-11-22 14:18 ` Yu Zhao
0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2023-11-22 7:30 UTC (permalink / raw)
To: Yu Zhao
Cc: linux-mm, akpm, Igor Raits, Daniel Secik, Charan Teja Kalla,
Kalesh Singh
>
> >
> > On Mon, Nov 20, 2023 at 1:42 AM Jaroslav Pulchart
> > <jaroslav.pulchart@gooddata.com> wrote:
> > >
> > > > On Tue, Nov 14, 2023 at 12:30 AM Jaroslav Pulchart
> > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > >
> > > > > >
> > > > > > On Mon, Nov 13, 2023 at 1:36 AM Jaroslav Pulchart
> > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > >
> > > > > > > >
> > > > > > > > On Thu, Nov 9, 2023 at 3:58 AM Jaroslav Pulchart
> > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
> > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> > > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Jaroslav,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hi Yu Zhao
> > > > > > > > > > > > >
> > > > > > > > > > > > > thanks for response, see answers inline:
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> > > > > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hello,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> > > > > > > > > > > > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> > > > > > > > > > > > > > > system (16numa domains).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Kernel version please?
> > > > > > > > > > > > >
> > > > > > > > > > > > > 6.5.y, but we saw it sooner as it is in investigation from 23th May
> > > > > > > > > > > > > (6.4.y and maybe even the 6.3.y).
> > > > > > > > > > > >
> > > > > > > > > > > > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> > > > > > > > > > > > for you if you run into other problems with v6.6.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I will give it a try using 6.6.y. When it will work we can switch to
> > > > > > > > > > > 6.6.y instead of backporting the stuff to 6.5.y.
> > > > > > > > > > >
> > > > > > > > > > > > > > > Symptoms of my issue are
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > /A/ if mult-gen LRU is enabled
> > > > > > > > > > > > > > > 1/ [kswapd3] is consuming 100% CPU
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > top - 15:03:11 up 34 days, 1:51, 2 users, load average: 23.34,
> > > > > > > > > > > > > > > 18.26, 15.01
> > > > > > > > > > > > > > > Tasks: 1226 total, 2 running, 1224 sleeping, 0 stopped, 0 zombie
> > > > > > > > > > > > > > > %Cpu(s): 12.5 us, 4.7 sy, 0.0 ni, 82.1 id, 0.0 wa, 0.4 hi,
> > > > > > > > > > > > > > > 0.4 si, 0.0 st
> > > > > > > > > > > > > > > MiB Mem : 1047265.+total, 28382.7 free, 1021308.+used, 767.6 buff/cache
> > > > > > > > > > > > > > > MiB Swap: 8192.0 total, 8187.7 free, 4.2 used. 25956.7 avail Mem
> > > > > > > > > > > > > > > ...
> > > > > > > > > > > > > > > 765 root 20 0 0 0 0 R 98.3 0.0
> > > > > > > > > > > > > > > 34969:04 kswapd3
> > > > > > > > > > > > > > > ...
> > > > > > > > > > > > > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> > > > > > > > > > > > > > > observed with swap disk as well and cause IO latency issues due to
> > > > > > > > > > > > > > > some kind of locking)
> > > > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > /B/ if mult-gen LRU is disabled
> > > > > > > > > > > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> > > > > > > > > > > > > > > top - 15:02:49 up 34 days, 1:51, 2 users, load average: 23.05,
> > > > > > > > > > > > > > > 17.77, 14.77
> > > > > > > > > > > > > > > Tasks: 1226 total, 1 running, 1225 sleeping, 0 stopped, 0 zombie
> > > > > > > > > > > > > > > %Cpu(s): 14.7 us, 2.8 sy, 0.0 ni, 81.8 id, 0.0 wa, 0.4 hi,
> > > > > > > > > > > > > > > 0.4 si, 0.0 st
> > > > > > > > > > > > > > > MiB Mem : 1047265.+total, 28378.5 free, 1021313.+used, 767.3 buff/cache
> > > > > > > > > > > > > > > MiB Swap: 8192.0 total, 8189.0 free, 3.0 used. 25952.4 avail Mem
> > > > > > > > > > > > > > > ...
> > > > > > > > > > > > > > > 765 root 20 0 0 0 0 S 3.6 0.0
> > > > > > > > > > > > > > > 34966:46 [kswapd3]
> > > > > > > > > > > > > > > ...
> > > > > > > > > > > > > > > 2/ swap space usage is low (4MB)
> > > > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Both situations are wrong, as they are using swap in/out extensively;
> > > > > > > > > > > > > > > however, the multi-gen LRU situation is 10 times worse.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > From the stats below, node 3 had the lowest free memory. So I think in
> > > > > > > > > > > > > > both cases, the reclaim activities were as expected.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I do not see a reason for the memory pressure and reclaims. It is true
> > > > > > > > > > > > > that this node has the lowest free memory of all nodes (~302MB free);
> > > > > > > > > > > > > however, the swap space usage is just 4MB (still going in and out). So
> > > > > > > > > > > > > what can be the reason for that behaviour?
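Per-node free memory figures like the ~302MB above can be read straight from sysfs; a minimal sketch, assuming a Linux host with the standard NUMA sysfs layout:

```shell
# Show free memory per NUMA node; "node3" is the node whose kswapd3
# thread appears in the top output above.
grep -H 'MemFree' /sys/devices/system/node/node*/meminfo
```

`numastat -m` (from numactl-tools) gives a similar per-node summary.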
> > > > > > > > > > > >
> > > > > > > > > > > > The best analogy is that refueling (reclaim) happens before the tank
> > > > > > > > > > > > becomes empty, and it happens even sooner when there is a long road
> > > > > > > > > > > > ahead (high-order allocations).
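The "refuel before empty" behavior maps to the per-zone min/low/high watermarks: kswapd wakes when free pages fall below the low watermark and reclaims until the high watermark is reached. A sketch to inspect node 3's watermarks, assuming the standard /proc/zoneinfo layout:

```shell
# Print node 3's Normal-zone free pages and min/low/high watermarks;
# kswapd3 starts reclaiming below "low" and stops only at "high".
awk '/^Node 3, zone +Normal/ { z = 1; print; next }
     /^Node / { z = 0 }
     z && /pages free|^ *(min|low|high) /' /proc/zoneinfo
```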
> > > > > > > > > > > >
> > > > > > > > > > > > > The workers/application run in pre-allocated HugePages, and the rest is
> > > > > > > > > > > > > used for a small set of system services and device drivers. It is
> > > > > > > > > > > > > static and not growing. The issue persists when I stop the system
> > > > > > > > > > > > > services and free the memory.
> > > > > > > > > > > >
> > > > > > > > > > > > Yes, this helps.
> > > > > > > > > > > > Also could you attach /proc/buddyinfo from the moment
> > > > > > > > > > > > you hit the problem?
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I can. The problem is continuous: it is doing swap in/out 100% of the
> > > > > > > > > > > time, consuming 100% of CPU and locking IO.
> > > > > > > > > > >
> > > > > > > > > > > The output of /proc/buddyinfo is:
> > > > > > > > > > >
> > > > > > > > > > > # cat /proc/buddyinfo
> > > > > > > > > > > Node 0, zone      DMA      7      2      2      1      1      2      1      1      1      2      1
> > > > > > > > > > > Node 0, zone    DMA32   4567   3395   1357    846    439    190     93     61     43     23      4
> > > > > > > > > > > Node 0, zone   Normal     19    190    140    129    136     75     66     41      9      1      5
> > > > > > > > > > > Node 1, zone   Normal    194   1210   2080   1800    715    255    111     56     42     36     55
> > > > > > > > > > > Node 2, zone   Normal    204    768   3766   3394   1742    468    185    194    238     47     74
> > > > > > > > > > > Node 3, zone   Normal   1622   2137   1058    846    388    208     97     44     14     42     10
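For reference, each /proc/buddyinfo column is the count of free blocks of order 0 through 10; summing count x 2^order x page size gives the zone's free memory. A sketch assuming 4 KiB base pages:

```shell
# Convert /proc/buddyinfo rows to free KiB per zone; the order-i column
# (field i+5) counts free blocks of 2^i pages, 4 KiB each.
awk '{ kb = 0
       for (i = 5; i <= NF; i++) kb += $i * 2^(i - 5) * 4
       print "Node " $2, $4, kb " KiB free" }' /proc/buddyinfo
```

Applied to the Node 3 row above this gives 307712 KiB (~300 MiB), consistent with the ~302MB free mentioned earlier in the thread.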
> > > > > > > > > >
> > > > > > > > > > Again, thinking out loud: there is only one zone on node 3, i.e., the
> > > > > > > > > > normal zone, and this excludes the problem commit
> > > > > > > > > > 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
> > > > > > > > > > reclaim") fixed in v6.6.
> > > > > > > > >
> > > > > > > > > I built vanilla 6.6.1 and did a first fast test (spin up and destroy
> > > > > > > > > VMs only). This test does not always trigger the continuous kswapd3
> > > > > > > > > swap in/out usage, but it does exercise kswapd, and it looks like
> > > > > > > > > there is a change:
> > > > > > > > >
> > > > > > > > > I can see non-continuous kswapd usage (15s and more) with 6.5.y
> > > > > > > > > # ps ax | grep [k]swapd
> > > > > > > > > 753 ? S 0:00 [kswapd0]
> > > > > > > > > 754 ? S 0:00 [kswapd1]
> > > > > > > > > 755 ? S 0:00 [kswapd2]
> > > > > > > > > 756 ? S 0:15 [kswapd3] <<<<<<<<<
> > > > > > > > > 757 ? S 0:00 [kswapd4]
> > > > > > > > > 758 ? S 0:00 [kswapd5]
> > > > > > > > > 759 ? S 0:00 [kswapd6]
> > > > > > > > > 760 ? S 0:00 [kswapd7]
> > > > > > > > > 761 ? S 0:00 [kswapd8]
> > > > > > > > > 762 ? S 0:00 [kswapd9]
> > > > > > > > > 763 ? S 0:00 [kswapd10]
> > > > > > > > > 764 ? S 0:00 [kswapd11]
> > > > > > > > > 765 ? S 0:00 [kswapd12]
> > > > > > > > > 766 ? S 0:00 [kswapd13]
> > > > > > > > > 767 ? S 0:00 [kswapd14]
> > > > > > > > > 768 ? S 0:00 [kswapd15]
> > > > > > > > >
> > > > > > > > > and no kswapd usage with 6.6.1, which looks to be a promising path
> > > > > > > > >
> > > > > > > > > # ps ax | grep [k]swapd
> > > > > > > > > 808 ? S 0:00 [kswapd0]
> > > > > > > > > 809 ? S 0:00 [kswapd1]
> > > > > > > > > 810 ? S 0:00 [kswapd2]
> > > > > > > > > 811 ? S 0:00 [kswapd3] <<<< nice
> > > > > > > > > 812 ? S 0:00 [kswapd4]
> > > > > > > > > 813 ? S 0:00 [kswapd5]
> > > > > > > > > 814 ? S 0:00 [kswapd6]
> > > > > > > > > 815 ? S 0:00 [kswapd7]
> > > > > > > > > 816 ? S 0:00 [kswapd8]
> > > > > > > > > 817 ? S 0:00 [kswapd9]
> > > > > > > > > 818 ? S 0:00 [kswapd10]
> > > > > > > > > 819 ? S 0:00 [kswapd11]
> > > > > > > > > 820 ? S 0:00 [kswapd12]
> > > > > > > > > 821 ? S 0:00 [kswapd13]
> > > > > > > > > 822 ? S 0:00 [kswapd14]
> > > > > > > > > 823 ? S 0:00 [kswapd15]
> > > > > > > > >
> > > > > > > > > I will install 6.6.1 on the server which is doing some work and
> > > > > > > > > observe it later today.
> > > > > > > >
> > > > > > > > Thanks. Fingers crossed.
> > > > > > >
> > > > > > > 6.6.y has been deployed and in use since 9th Nov 3PM CEST. So far so
> > > > > > > good. Node 3 has 163MiB of free memory, and I see
> > > > > > > just a little swap in/out sometimes (which is expected) and minimal
> > > > > > > kswapd3 process usage for almost 4 days.
> > > > > >
> > > > > > Thanks for the update!
> > > > > >
> > > > > > Just to confirm:
> > > > > > 1. MGLRU was enabled, and
> > > > >
> > > > > Yes, MGLRU is enabled
> > > > >
> > > > > > 2. The v6.6 deployed did NOT have the patch I attached earlier.
> > > > >
> > > > > Vanilla 6.6, attached patch NOT applied.
> > > > >
> > > > > > Are both correct?
> > > > > >
> > > > > > If so, I'd very much appreciate it if you could try the attached patch
> > > > > > on top of v6.5 and see if it helps. My suspicion is that the problem is
> > > > > > compaction related, i.e., kswapd was woken up by high-order
> > > > > > allocations but didn't properly stop. But what causes the behavior
> > > > >
> > > > > Sure, I can try it. Will inform you about progress.
> > > >
> > > > Thanks!
> > > >
> > > > > > difference on v6.5 between MGLRU and the active/inactive LRU still
> > > > > > puzzles me -- the problem might be somehow masked rather than fixed on
> > > > > > v6.6.
> > > > >
> > > > > I'm not sure how I can help with the issue. Any suggestions on what to
> > > > > change/try?
> > > >
> > > > Trying the attached patch is good enough for now :)
> > >
> > > So far I'm running the "6.5.y + patch" for 4 days without triggering
> > > the infinite swap in/out usage.
> > >
> > > I'm observing a similar pattern in kswapd usage: if kswapd runs, it is
> > > mostly kswapd3, like vanilla 6.5.y and unlike 6.6.y. (Node 3's free
> > > memory is 159 MB.)
> > > # ps ax | grep [k]swapd
> > > 750 ? S 0:00 [kswapd0]
> > > 751 ? S 0:00 [kswapd1]
> > > 752 ? S 0:00 [kswapd2]
> > > 753 ? S 0:02 [kswapd3] <<<< uses kswapd3; good thing is that it is not continuous
> > > 754 ? S 0:00 [kswapd4]
> > > 755 ? S 0:00 [kswapd5]
> > > 756 ? S 0:00 [kswapd6]
> > > 757 ? S 0:00 [kswapd7]
> > > 758 ? S 0:00 [kswapd8]
> > > 759 ? S 0:00 [kswapd9]
> > > 760 ? S 0:00 [kswapd10]
> > > 761 ? S 0:00 [kswapd11]
> > > 762 ? S 0:00 [kswapd12]
> > > 763 ? S 0:00 [kswapd13]
> > > 764 ? S 0:00 [kswapd14]
> > > 765 ? S 0:00 [kswapd15]
> > >
> > > The good news is that the system did not end up in a continuous loop of
> > > swap in/out usage (at least so far), which is great. See the attached
> > > swap_in_out_good_vs_bad.png. I will keep it running for the next 3
> > > days.
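For long observation runs like this, the symmetry can also be confirmed from the raw kernel counters rather than from top; a sketch, assuming a standard /proc/vmstat:

```shell
# Log the cumulative swap-in/out page counters every 5 seconds;
# pswpin and pswpout climbing in lock-step is the symmetrical
# pattern reported in this thread.
while sleep 5; do
    echo "$(date +%s) $(grep -E '^pswp(in|out) ' /proc/vmstat | tr '\n' ' ')"
done
```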
> >
> > Thanks again, Jaroslav!
> >
> > Just a note here: I suspect the problem still exists on v6.6 but
> > somehow is masked, possibly by reduced memory usage from the kernel
> > itself and more free memory for userspace. So to be on the safe side,
> > I'll post the patch and credit you as the reporter and tester.
>
> Morning, let's wait. I reviewed the graph, and the swap in/out started
> happening again from 1:50 AM CET. Slower than before (CPU util 0.3%),
> but it is doing in/out; see the attached png.
I investigated it more: there was an operational issue, and the system
disabled multi-gen LRU yesterday at ~10 AM CET (our temporary workaround
for this problem) via
echo N > /sys/kernel/mm/lru_gen/enabled
after an alert was triggered by an unexpected setup of the server.
Could it be that the patch is not functional if lru_gen/enabled is
0x0000?
I need to reboot the system and do the whole week's test again.
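According to the kernel's multi-gen LRU admin documentation, /sys/kernel/mm/lru_gen/enabled is a feature bit mask, and 0x0000 means MGLRU is fully disabled, so a patch to the MGLRU code paths would indeed not be exercised. A sketch to check before a test run:

```shell
# Bit 0x0001 of the lru_gen mask is the MGLRU main switch; a mask of
# 0x0000 means none of the MGLRU code paths (patched or not) run.
mask=$(cat /sys/kernel/mm/lru_gen/enabled)
if [ $(( mask & 0x0001 )) -ne 0 ]; then
    echo "MGLRU enabled (mask=$mask)"
else
    echo "MGLRU disabled (mask=$mask)"
fi
```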
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
2023-11-22 7:30 ` Jaroslav Pulchart
@ 2023-11-22 14:18 ` Yu Zhao
2023-11-29 13:54 ` Jaroslav Pulchart
0 siblings, 1 reply; 30+ messages in thread
From: Yu Zhao @ 2023-11-22 14:18 UTC (permalink / raw)
To: Jaroslav Pulchart
Cc: Charan Teja Kalla, Daniel Secik, Igor Raits, Kalesh Singh, akpm,
linux-mm
On Wed, Nov 22, 2023 at 12:31 AM Jaroslav Pulchart <jaroslav.pulchart@gooddata.com> wrote:
> > [...]
> > Morning, let's wait. I reviewed the graph, and the swap in/out started
> > happening again from 1:50 AM CET. Slower than before (CPU util 0.3%),
> > but it is doing in/out; see the attached png.
>
> I investigated it more: there was an operational issue, and the system
> disabled multi-gen LRU yesterday at ~10 AM CET (our temporary workaround
> for this problem) via
> echo N > /sys/kernel/mm/lru_gen/enabled
> after an alert was triggered by an unexpected setup of the server.
> Could it be that the patch is not functional if lru_gen/enabled is
> 0x0000?
That’s correct.
> I need to reboot the system and do the whole week's test again.
Thanks a lot!
>
* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
2023-11-22 14:18 ` Yu Zhao
@ 2023-11-29 13:54 ` Jaroslav Pulchart
2023-12-01 23:52 ` Yu Zhao
0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2023-11-29 13:54 UTC (permalink / raw)
To: Yu Zhao
Cc: Charan Teja Kalla, Daniel Secik, Igor Raits, Kalesh Singh, akpm,
linux-mm
> On Wed, Nov 22, 2023 at 12:31 AM Jaroslav Pulchart <jaroslav.pulchart@gooddata.com> wrote:
>>
>> >
>> > >
>> > > On Mon, Nov 20, 2023 at 1:42 AM Jaroslav Pulchart
>> > > <jaroslav.pulchart@gooddata.com> wrote:
>> > > >
>> > > > > On Tue, Nov 14, 2023 at 12:30 AM Jaroslav Pulchart
>> > > > > <jaroslav.pulchart@gooddata.com> wrote:
>> > > > > >
>> > > > > > >
>> > > > > > > On Mon, Nov 13, 2023 at 1:36 AM Jaroslav Pulchart
>> > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
>> > > > > > > >
>> > > > > > > > >
>> > > > > > > > > On Thu, Nov 9, 2023 at 3:58 AM Jaroslav Pulchart
>> > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
>> > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
>> > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
>> > > > > > > > > > > >
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
>> > > > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Hi Jaroslav,
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > Hi Yu Zhao
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > thanks for response, see answers inline:
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
>> > > > > > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > Hello,
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
>> > > > > > > > > > > > > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
>> > > > > > > > > > > > > > > > system (16numa domains).
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Kernel version please?
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > 6.5.y, but we saw it sooner as it is in investigation from 23th May
>> > > > > > > > > > > > > > (6.4.y and maybe even the 6.3.y).
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
>> > > > > > > > > > > > > for you if you run into other problems with v6.6.
>> > > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > > I will give it a try using 6.6.y. When it will work we can switch to
>> > > > > > > > > > > > 6.6.y instead of backporting the stuff to 6.5.y.
>> > > > > > > > > > > >
>> > > > > > > > > > > > > > > > Symptoms of my issue are
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > /A/ if mult-gen LRU is enabled
>> > > > > > > > > > > > > > > > 1/ [kswapd3] is consuming 100% CPU
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > top - 15:03:11 up 34 days, 1:51, 2 users, load average: 23.34,
>> > > > > > > > > > > > > > > > 18.26, 15.01
>> > > > > > > > > > > > > > > > Tasks: 1226 total, 2 running, 1224 sleeping, 0 stopped, 0 zombie
>> > > > > > > > > > > > > > > > %Cpu(s): 12.5 us, 4.7 sy, 0.0 ni, 82.1 id, 0.0 wa, 0.4 hi,
>> > > > > > > > > > > > > > > > 0.4 si, 0.0 st
>> > > > > > > > > > > > > > > > MiB Mem : 1047265.+total, 28382.7 free, 1021308.+used, 767.6 buff/cache
>> > > > > > > > > > > > > > > > MiB Swap: 8192.0 total, 8187.7 free, 4.2 used. 25956.7 avail Mem
>> > > > > > > > > > > > > > > > ...
>> > > > > > > > > > > > > > > > 765 root 20 0 0 0 0 R 98.3 0.0
>> > > > > > > > > > > > > > > > 34969:04 kswapd3
>> > > > > > > > > > > > > > > > ...
>> > > > > > > > > > > > > > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
>> > > > > > > > > > > > > > > > observed with swap disk as well and cause IO latency issues due to
>> > > > > > > > > > > > > > > > some kind of locking)
>> > > > > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > /B/ if mult-gen LRU is disabled
>> > > > > > > > > > > > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
>> > > > > > > > > > > > > > > > top - 15:02:49 up 34 days, 1:51, 2 users, load average: 23.05,
>> > > > > > > > > > > > > > > > 17.77, 14.77
>> > > > > > > > > > > > > > > > Tasks: 1226 total, 1 running, 1225 sleeping, 0 stopped, 0 zombie
>> > > > > > > > > > > > > > > > %Cpu(s): 14.7 us, 2.8 sy, 0.0 ni, 81.8 id, 0.0 wa, 0.4 hi,
>> > > > > > > > > > > > > > > > 0.4 si, 0.0 st
>> > > > > > > > > > > > > > > > MiB Mem : 1047265.+total, 28378.5 free, 1021313.+used, 767.3 buff/cache
>> > > > > > > > > > > > > > > > MiB Swap: 8192.0 total, 8189.0 free, 3.0 used. 25952.4 avail Mem
>> > > > > > > > > > > > > > > > ...
>> > > > > > > > > > > > > > > > 765 root 20 0 0 0 0 S 3.6 0.0
>> > > > > > > > > > > > > > > > 34966:46 [kswapd3]
>> > > > > > > > > > > > > > > > ...
>> > > > > > > > > > > > > > > > 2/ swap space usage is low (4MB)
>> > > > > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > Both situations are wrong as they are using swap in/out extensively,
>> > > > > > > > > > > > > > > > however the multi-gen LRU situation is 10times worse.
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > From the stats below, node 3 had the lowest free memory. So I think in
>> > > > > > > > > > > > > > > both cases, the reclaim activities were as expected.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > I do not see a reason for the memory pressure and reclaims. This node
>> > > > > > > > > > > > > > has the lowest free memory of all nodes (~302MB free) that is true,
>> > > > > > > > > > > > > > however the swap space usage is just 4MB (still going in and out). So
>> > > > > > > > > > > > > > what can be the reason for that behaviour?
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > The best analogy is that refuel (reclaim) happens before the tank
>> > > > > > > > > > > > > becomes empty, and it happens even sooner when there is a long road
>> > > > > > > > > > > > > ahead (high order allocations).
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > > The workers/application is running in pre-allocated HugePages and the
>> > > > > > > > > > > > > > rest is used for a small set of system services and drivers of
>> > > > > > > > > > > > > > devices. It is static and not growing. The issue persists when I stop
>> > > > > > > > > > > > > > the system services and free the memory.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Yes, this helps.
>> > > > > > > > > > > > > Also could you attach /proc/buddyinfo from the moment
>> > > > > > > > > > > > > you hit the problem?
>> > > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > > I can. The problem is continuous, it is 100% of time continuously
>> > > > > > > > > > > > doing in/out and consuming 100% of CPU and locking IO.
>> > > > > > > > > > > >
>> > > > > > > > > > > > The output of /proc/buddyinfo is:
>> > > > > > > > > > > >
>> > > > > > > > > > > > # cat /proc/buddyinfo
>> > > > > > > > > > > > Node 0, zone DMA 7 2 2 1 1 2 1
>> > > > > > > > > > > > 1 1 2 1
>> > > > > > > > > > > > Node 0, zone DMA32 4567 3395 1357 846 439 190 93
>> > > > > > > > > > > > 61 43 23 4
>> > > > > > > > > > > > Node 0, zone Normal 19 190 140 129 136 75 66
>> > > > > > > > > > > > 41 9 1 5
>> > > > > > > > > > > > Node 1, zone Normal 194 1210 2080 1800 715 255 111
>> > > > > > > > > > > > 56 42 36 55
>> > > > > > > > > > > > Node 2, zone Normal 204 768 3766 3394 1742 468 185
>> > > > > > > > > > > > 194 238 47 74
>> > > > > > > > > > > > Node 3, zone Normal 1622 2137 1058 846 388 208 97
>> > > > > > > > > > > > 44 14 42 10
>> > > > > > > > > > >
>> > > > > > > > > > > Again, thinking out loud: there is only one zone on node 3, i.e., the
>> > > > > > > > > > > normal zone, and this excludes the problem commit
>> > > > > > > > > > > 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
>> > > > > > > > > > > reclaim") fixed in v6.6.
>> > > > > > > > > >
>> > > > > > > > > > I built vanilla 6.6.1 and did a first fast test - spinning up and
>> > > > > > > > > > destroying VMs only. This test does not always trigger the continuous
>> > > > > > > > > > kswapd3 swap in/out usage, but it does exercise kswapd, and it looks
>> > > > > > > > > > like there is a change:
>> > > > > > > > > >
>> > > > > > > > > > I can see non-continuous kswapd usage (15s and more) with 6.5.y
>> > > > > > > > > > # ps ax | grep [k]swapd
>> > > > > > > > > > 753 ? S 0:00 [kswapd0]
>> > > > > > > > > > 754 ? S 0:00 [kswapd1]
>> > > > > > > > > > 755 ? S 0:00 [kswapd2]
>> > > > > > > > > > 756 ? S 0:15 [kswapd3] <<<<<<<<<
>> > > > > > > > > > 757 ? S 0:00 [kswapd4]
>> > > > > > > > > > 758 ? S 0:00 [kswapd5]
>> > > > > > > > > > 759 ? S 0:00 [kswapd6]
>> > > > > > > > > > 760 ? S 0:00 [kswapd7]
>> > > > > > > > > > 761 ? S 0:00 [kswapd8]
>> > > > > > > > > > 762 ? S 0:00 [kswapd9]
>> > > > > > > > > > 763 ? S 0:00 [kswapd10]
>> > > > > > > > > > 764 ? S 0:00 [kswapd11]
>> > > > > > > > > > 765 ? S 0:00 [kswapd12]
>> > > > > > > > > > 766 ? S 0:00 [kswapd13]
>> > > > > > > > > > 767 ? S 0:00 [kswapd14]
>> > > > > > > > > > 768 ? S 0:00 [kswapd15]
>> > > > > > > > > >
>> > > > > > > > > > and no kswapd usage with 6.6.1; that looks to be a promising path
>> > > > > > > > > >
>> > > > > > > > > > # ps ax | grep [k]swapd
>> > > > > > > > > > 808 ? S 0:00 [kswapd0]
>> > > > > > > > > > 809 ? S 0:00 [kswapd1]
>> > > > > > > > > > 810 ? S 0:00 [kswapd2]
>> > > > > > > > > > 811 ? S 0:00 [kswapd3] <<<< nice
>> > > > > > > > > > 812 ? S 0:00 [kswapd4]
>> > > > > > > > > > 813 ? S 0:00 [kswapd5]
>> > > > > > > > > > 814 ? S 0:00 [kswapd6]
>> > > > > > > > > > 815 ? S 0:00 [kswapd7]
>> > > > > > > > > > 816 ? S 0:00 [kswapd8]
>> > > > > > > > > > 817 ? S 0:00 [kswapd9]
>> > > > > > > > > > 818 ? S 0:00 [kswapd10]
>> > > > > > > > > > 819 ? S 0:00 [kswapd11]
>> > > > > > > > > > 820 ? S 0:00 [kswapd12]
>> > > > > > > > > > 821 ? S 0:00 [kswapd13]
>> > > > > > > > > > 822 ? S 0:00 [kswapd14]
>> > > > > > > > > > 823 ? S 0:00 [kswapd15]
>> > > > > > > > > >
>> > > > > > > > > > I will install the 6.6.1 on the server which is doing some work and
>> > > > > > > > > > observe it later today.
>> > > > > > > > >
>> > > > > > > > > Thanks. Fingers crossed.
>> > > > > > > >
>> > > > > > > > 6.6.y has been deployed and in use since 9th Nov 3 PM CEST. So far so good.
>> > > > > > > > Node 3 has 163 MiB of free memory and I see
>> > > > > > > > just a little swap in/out usage sometimes (which is expected) and minimal
>> > > > > > > > kswapd3 process usage for almost 4 days.
>> > > > > > >
>> > > > > > > Thanks for the update!
>> > > > > > >
>> > > > > > > Just to confirm:
>> > > > > > > 1. MGLRU was enabled, and
>> > > > > >
>> > > > > > Yes, MGLRU is enabled
>> > > > > >
>> > > > > > > 2. The v6.6 deployed did NOT have the patch I attached earlier.
>> > > > > >
>> > > > > > Vanilla 6.6; attached patch NOT applied.
>> > > > > >
>> > > > > > > Are both correct?
>> > > > > > >
>> > > > > > > If so, I'd very appreciate it if you could try the attached patch on
>> > > > > > > top of v6.5 and see if it helps. My suspicion is that the problem is
>> > > > > > > compaction related, i.e., kswapd was woken up by high order
>> > > > > > > allocations but didn't properly stop. But what causes the behavior
>> > > > > >
>> > > > > > Sure, I can try it. Will inform you about progress.
>> > > > >
>> > > > > Thanks!
>> > > > >
>> > > > > > > difference on v6.5 between MGLRU and the active/inactive LRU still
>> > > > > > > puzzles me --the problem might be somehow masked rather than fixed on
>> > > > > > > v6.6.
>> > > > > >
>> > > > > > I'm not sure how I can help with the issue. Any suggestions on what to
>> > > > > > change/try?
>> > > > >
>> > > > > Trying the attached patch is good enough for now :)
>> > > >
>> > > > So far I'm running the "6.5.y + patch" for 4 days without triggering
>> > > > the infinite swap in/out usage.
>> > > >
>> > > > I'm observing a similar pattern in kswapd usage - if it uses kswapd,
>> > > > then it is mostly kswapd3 - like vanilla 6.5.y and unlike 6.6.y.
>> > > > (Node 3's free mem is 159 MB)
>> > > > # ps ax | grep [k]swapd
>> > > > 750 ? S 0:00 [kswapd0]
>> > > > 751 ? S 0:00 [kswapd1]
>> > > > 752 ? S 0:00 [kswapd2]
>> > > > 753 ? S 0:02 [kswapd3] <<<< it uses kswapd3; the good thing is that it is not continuous
>> > > > 754 ? S 0:00 [kswapd4]
>> > > > 755 ? S 0:00 [kswapd5]
>> > > > 756 ? S 0:00 [kswapd6]
>> > > > 757 ? S 0:00 [kswapd7]
>> > > > 758 ? S 0:00 [kswapd8]
>> > > > 759 ? S 0:00 [kswapd9]
>> > > > 760 ? S 0:00 [kswapd10]
>> > > > 761 ? S 0:00 [kswapd11]
>> > > > 762 ? S 0:00 [kswapd12]
>> > > > 763 ? S 0:00 [kswapd13]
>> > > > 764 ? S 0:00 [kswapd14]
>> > > > 765 ? S 0:00 [kswapd15]
>> > > >
>> > > > The good news is that the system did not end up in a continuous loop
>> > > > of swap in/out usage (at least so far), which is great. See the attached
>> > > > swap_in_out_good_vs_bad.png. I will keep it running for the next 3
>> > > > days.
>> > >
>> > > Thanks again, Jaroslav!
>> > >
>> > > Just a note here: I suspect the problem still exists on v6.6 but
>> > > somehow is masked, possibly by reduced memory usage from the kernel
>> > > itself and more free memory for userspace. So to be on the safe side,
>> > > I'll post the patch and credit you as the reporter and tester.
>> >
>> > Morning, let's wait. I reviewed the graph and the swap in/out started
>> > to be happening from 1:50 AM CET. Slower than before (util of cpu
>> > 0.3%) but it is doing in/out see attached png.
>>
>> I investigated it more, there was an operation issue and the system
>> disabled multi-gen lru yesterday ~10 AM CET (our temporary workaround
>> for this problem) by
>> echo N > /sys/kernel/mm/lru_gen/enabled
>> when an alert was triggered by an unexpected setup of the server.
>> Could it be that the patch is not functional if lru_gen/enabled is
>> 0x0000?
>
>
> That’s correct.
>
>> I need to reboot the system and do the whole week's test again.
>
>
> Thanks a lot!
The server with 6.5.y + lru patch is stable: no continuous swap in/out
has been observed in the last 7 days!
I assume the fix is correct. Can you share the final patch for
6.6.y with me? I will use it in our kernel builds until it lands upstream.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
2023-11-29 13:54 ` Jaroslav Pulchart
@ 2023-12-01 23:52 ` Yu Zhao
2023-12-07 8:46 ` Charan Teja Kalla
0 siblings, 1 reply; 30+ messages in thread
From: Yu Zhao @ 2023-12-01 23:52 UTC (permalink / raw)
To: Jaroslav Pulchart, Charan Teja Kalla
Cc: Daniel Secik, Igor Raits, Kalesh Singh, akpm, linux-mm
On Wed, Nov 29, 2023 at 6:54 AM Jaroslav Pulchart
<jaroslav.pulchart@gooddata.com> wrote:
>
> > On Wed, Nov 22, 2023 at 12:31 AM Jaroslav Pulchart <jaroslav.pulchart@gooddata.com> wrote:
> >>
> >> >
> >> > >
> >> > > On Mon, Nov 20, 2023 at 1:42 AM Jaroslav Pulchart
> >> > > <jaroslav.pulchart@gooddata.com> wrote:
> >> > > >
> >> > > > > On Tue, Nov 14, 2023 at 12:30 AM Jaroslav Pulchart
> >> > > > > <jaroslav.pulchart@gooddata.com> wrote:
> >> > > > > >
> >> > > > > > >
> >> > > > > > > On Mon, Nov 13, 2023 at 1:36 AM Jaroslav Pulchart
> >> > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> >> > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > On Thu, Nov 9, 2023 at 3:58 AM Jaroslav Pulchart
> >> > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> >> > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > > On Wed, Nov 8, 2023 at 10:39 PM Jaroslav Pulchart
> >> > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > On Wed, Nov 8, 2023 at 12:04 PM Jaroslav Pulchart
> >> > > > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Hi Jaroslav,
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > Hi Yu Zhao
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > thanks for response, see answers inline:
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > On Wed, Nov 8, 2023 at 6:35 AM Jaroslav Pulchart
> >> > > > > > > > > > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > Hello,
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > I would like to report to you an unpleasant behavior of multi-gen LRU
> >> > > > > > > > > > > > > > > > with strange swap in/out usage on my Dell 7525 two socket AMD 74F3
> >> > > > > > > > > > > > > > > > system (16numa domains).
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Kernel version please?
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > 6.5.y, but we saw it sooner as it is in investigation from 23th May
> >> > > > > > > > > > > > > > (6.4.y and maybe even the 6.3.y).
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > v6.6 has a few critical fixes for MGLRU, I can backport them to v6.5
> >> > > > > > > > > > > > > for you if you run into other problems with v6.6.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > I will give it a try using 6.6.y. When it will work we can switch to
> >> > > > > > > > > > > > 6.6.y instead of backporting the stuff to 6.5.y.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > Symptoms of my issue are
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > /A/ if mult-gen LRU is enabled
> >> > > > > > > > > > > > > > > > 1/ [kswapd3] is consuming 100% CPU
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Just thinking out loud: kswapd3 means the fourth node was under memory pressure.
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > top - 15:03:11 up 34 days, 1:51, 2 users, load average: 23.34,
> >> > > > > > > > > > > > > > > > 18.26, 15.01
> >> > > > > > > > > > > > > > > > Tasks: 1226 total, 2 running, 1224 sleeping, 0 stopped, 0 zombie
> >> > > > > > > > > > > > > > > > %Cpu(s): 12.5 us, 4.7 sy, 0.0 ni, 82.1 id, 0.0 wa, 0.4 hi,
> >> > > > > > > > > > > > > > > > 0.4 si, 0.0 st
> >> > > > > > > > > > > > > > > > MiB Mem : 1047265.+total, 28382.7 free, 1021308.+used, 767.6 buff/cache
> >> > > > > > > > > > > > > > > > MiB Swap: 8192.0 total, 8187.7 free, 4.2 used. 25956.7 avail Mem
> >> > > > > > > > > > > > > > > > ...
> >> > > > > > > > > > > > > > > > 765 root 20 0 0 0 0 R 98.3 0.0
> >> > > > > > > > > > > > > > > > 34969:04 kswapd3
> >> > > > > > > > > > > > > > > > ...
> >> > > > > > > > > > > > > > > > 2/ swap space usage is low about ~4MB from 8GB as swap in zram (was
> >> > > > > > > > > > > > > > > > observed with swap disk as well and cause IO latency issues due to
> >> > > > > > > > > > > > > > > > some kind of locking)
> >> > > > > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~12MB/s in and ~12MB/s out
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > /B/ if mult-gen LRU is disabled
> >> > > > > > > > > > > > > > > > 1/ [kswapd3] is consuming 3%-10% CPU
> >> > > > > > > > > > > > > > > > top - 15:02:49 up 34 days, 1:51, 2 users, load average: 23.05,
> >> > > > > > > > > > > > > > > > 17.77, 14.77
> >> > > > > > > > > > > > > > > > Tasks: 1226 total, 1 running, 1225 sleeping, 0 stopped, 0 zombie
> >> > > > > > > > > > > > > > > > %Cpu(s): 14.7 us, 2.8 sy, 0.0 ni, 81.8 id, 0.0 wa, 0.4 hi,
> >> > > > > > > > > > > > > > > > 0.4 si, 0.0 st
> >> > > > > > > > > > > > > > > > MiB Mem : 1047265.+total, 28378.5 free, 1021313.+used, 767.3 buff/cache
> >> > > > > > > > > > > > > > > > MiB Swap: 8192.0 total, 8189.0 free, 3.0 used. 25952.4 avail Mem
> >> > > > > > > > > > > > > > > > ...
> >> > > > > > > > > > > > > > > > 765 root 20 0 0 0 0 S 3.6 0.0
> >> > > > > > > > > > > > > > > > 34966:46 [kswapd3]
> >> > > > > > > > > > > > > > > > ...
> >> > > > > > > > > > > > > > > > 2/ swap space usage is low (4MB)
> >> > > > > > > > > > > > > > > > 3/ swap In/Out is huge and symmetrical ~500kB/s in and ~500kB/s out
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > Both situations are wrong as they are using swap in/out extensively,
> >> > > > > > > > > > > > > > > > however the multi-gen LRU situation is 10times worse.
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > From the stats below, node 3 had the lowest free memory. So I think in
> >> > > > > > > > > > > > > > > both cases, the reclaim activities were as expected.
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > I do not see a reason for the memory pressure and reclaims. This node
> >> > > > > > > > > > > > > > has the lowest free memory of all nodes (~302MB free) that is true,
> >> > > > > > > > > > > > > > however the swap space usage is just 4MB (still going in and out). So
> >> > > > > > > > > > > > > > what can be the reason for that behaviour?
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > The best analogy is that refuel (reclaim) happens before the tank
> >> > > > > > > > > > > > > becomes empty, and it happens even sooner when there is a long road
> >> > > > > > > > > > > > > ahead (high order allocations).
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > > The workers/application is running in pre-allocated HugePages and the
> >> > > > > > > > > > > > > > rest is used for a small set of system services and drivers of
> >> > > > > > > > > > > > > > devices. It is static and not growing. The issue persists when I stop
> >> > > > > > > > > > > > > > the system services and free the memory.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > Yes, this helps.
> >> > > > > > > > > > > > > Also could you attach /proc/buddyinfo from the moment
> >> > > > > > > > > > > > > you hit the problem?
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > I can. The problem is continuous, it is 100% of time continuously
> >> > > > > > > > > > > > doing in/out and consuming 100% of CPU and locking IO.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > The output of /proc/buddyinfo is:
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > # cat /proc/buddyinfo
> >> > > > > > > > > > > > Node 0, zone DMA 7 2 2 1 1 2 1
> >> > > > > > > > > > > > 1 1 2 1
> >> > > > > > > > > > > > Node 0, zone DMA32 4567 3395 1357 846 439 190 93
> >> > > > > > > > > > > > 61 43 23 4
> >> > > > > > > > > > > > Node 0, zone Normal 19 190 140 129 136 75 66
> >> > > > > > > > > > > > 41 9 1 5
> >> > > > > > > > > > > > Node 1, zone Normal 194 1210 2080 1800 715 255 111
> >> > > > > > > > > > > > 56 42 36 55
> >> > > > > > > > > > > > Node 2, zone Normal 204 768 3766 3394 1742 468 185
> >> > > > > > > > > > > > 194 238 47 74
> >> > > > > > > > > > > > Node 3, zone Normal 1622 2137 1058 846 388 208 97
> >> > > > > > > > > > > > 44 14 42 10
> >> > > > > > > > > > >
> >> > > > > > > > > > > Again, thinking out loud: there is only one zone on node 3, i.e., the
> >> > > > > > > > > > > normal zone, and this excludes the problem commit
> >> > > > > > > > > > > 669281ee7ef731fb5204df9d948669bf32a5e68d ("Multi-gen LRU: fix per-zone
> >> > > > > > > > > > > reclaim") fixed in v6.6.
> >> > > > > > > > > >
> >> > > > > > > > > > I built vanila 6.6.1 and did the first fast test - spin up and destroy
> >> > > > > > > > > > VMs only - This test does not always trigger the kswapd3 continuous
> >> > > > > > > > > > swap in/out usage but it uses it and it looks like there is a
> >> > > > > > > > > > change:
> >> > > > > > > > > >
> >> > > > > > > > > > I can see kswapd non-continous (15s and more) usage with 6.5.y
> >> > > > > > > > > > # ps ax | grep [k]swapd
> >> > > > > > > > > > 753 ? S 0:00 [kswapd0]
> >> > > > > > > > > > 754 ? S 0:00 [kswapd1]
> >> > > > > > > > > > 755 ? S 0:00 [kswapd2]
> >> > > > > > > > > > 756 ? S 0:15 [kswapd3] <<<<<<<<<
> >> > > > > > > > > > 757 ? S 0:00 [kswapd4]
> >> > > > > > > > > > 758 ? S 0:00 [kswapd5]
> >> > > > > > > > > > 759 ? S 0:00 [kswapd6]
> >> > > > > > > > > > 760 ? S 0:00 [kswapd7]
> >> > > > > > > > > > 761 ? S 0:00 [kswapd8]
> >> > > > > > > > > > 762 ? S 0:00 [kswapd9]
> >> > > > > > > > > > 763 ? S 0:00 [kswapd10]
> >> > > > > > > > > > 764 ? S 0:00 [kswapd11]
> >> > > > > > > > > > 765 ? S 0:00 [kswapd12]
> >> > > > > > > > > > 766 ? S 0:00 [kswapd13]
> >> > > > > > > > > > 767 ? S 0:00 [kswapd14]
> >> > > > > > > > > > 768 ? S 0:00 [kswapd15]
> >> > > > > > > > > >
> >> > > > > > > > > > and none kswapd usage with 6.6.1, that looks to be promising path
> >> > > > > > > > > >
> >> > > > > > > > > > # ps ax | grep [k]swapd
> >> > > > > > > > > > 808 ? S 0:00 [kswapd0]
> >> > > > > > > > > > 809 ? S 0:00 [kswapd1]
> >> > > > > > > > > > 810 ? S 0:00 [kswapd2]
> >> > > > > > > > > > 811 ? S 0:00 [kswapd3] <<<< nice
> >> > > > > > > > > > 812 ? S 0:00 [kswapd4]
> >> > > > > > > > > > 813 ? S 0:00 [kswapd5]
> >> > > > > > > > > > 814 ? S 0:00 [kswapd6]
> >> > > > > > > > > > 815 ? S 0:00 [kswapd7]
> >> > > > > > > > > > 816 ? S 0:00 [kswapd8]
> >> > > > > > > > > > 817 ? S 0:00 [kswapd9]
> >> > > > > > > > > > 818 ? S 0:00 [kswapd10]
> >> > > > > > > > > > 819 ? S 0:00 [kswapd11]
> >> > > > > > > > > > 820 ? S 0:00 [kswapd12]
> >> > > > > > > > > > 821 ? S 0:00 [kswapd13]
> >> > > > > > > > > > 822 ? S 0:00 [kswapd14]
> >> > > > > > > > > > 823 ? S 0:00 [kswapd15]
> >> > > > > > > > > >
> >> > > > > > > > > > I will install the 6.6.1 on the server which is doing some work and
> >> > > > > > > > > > observe it later today.
> >> > > > > > > > >
> >> > > > > > > > > Thanks. Fingers crossed.
> >> > > > > > > >
> >> > > > > > > > The 6.6.y was deployed and used from 9th Nov 3PM CEST. So far so good.
> >> > > > > > > > The node 3 has 163MiB free of memory and I see
> >> > > > > > > > just a few in/out swap usage sometimes (which is expected) and minimal
> >> > > > > > > > kswapd3 process usage for almost 4days.
> >> > > > > > >
> >> > > > > > > Thanks for the update!
> >> > > > > > >
> >> > > > > > > Just to confirm:
> >> > > > > > > 1. MGLRU was enabled, and
> >> > > > > >
> >> > > > > > Yes, MGLRU is enabled
> >> > > > > >
> >> > > > > > > 2. The v6.6 deployed did NOT have the patch I attached earlier.
> >> > > > > >
> >> > > > > > Vanila 6.6, attached patch NOT applied.
> >> > > > > >
> >> > > > > > > Are both correct?
> >> > > > > > >
> >> > > > > > > If so, I'd very appreciate it if you could try the attached patch on
> >> > > > > > > top of v6.5 and see if it helps. My suspicion is that the problem is
> >> > > > > > > compaction related, i.e., kswapd was woken up by high order
> >> > > > > > > allocations but didn't properly stop. But what causes the behavior
> >> > > > > >
> >> > > > > > Sure, I can try it. Will inform you about progress.
> >> > > > >
> >> > > > > Thanks!
> >> > > > >
> >> > > > > > > difference on v6.5 between MGLRU and the active/inactive LRU still
> >> > > > > > > puzzles me --the problem might be somehow masked rather than fixed on
> >> > > > > > > v6.6.
> >> > > > > >
> >> > > > > > I'm not sure how I can help with the issue. Any suggestions on what to
> >> > > > > > change/try?
> >> > > > >
> >> > > > > Trying the attached patch is good enough for now :)
> >> > > >
> >> > > > So far I'm running the "6.5.y + patch" for 4 days without triggering
> >> > > > the infinite swap in//out usage.
> >> > > >
> >> > > > I'm observing a similar pattern in kswapd usage - "if it uses kswapd,
> >> > > > then it is in majority the kswapd3 - like the vanila 6.5.y which is
> >> > > > not observed with 6.6.y, (The Node's 3 free mem is 159 MB)
> >> > > > # ps ax | grep [k]swapd
> >> > > > 750 ? S 0:00 [kswapd0]
> >> > > > 751 ? S 0:00 [kswapd1]
> >> > > > 752 ? S 0:00 [kswapd2]
> >> > > > 753 ? S 0:02 [kswapd3] <<<< it uses kswapd3, good
> >> > > > is that it is not continuous
> >> > > > 754 ? S 0:00 [kswapd4]
> >> > > > 755 ? S 0:00 [kswapd5]
> >> > > > 756 ? S 0:00 [kswapd6]
> >> > > > 757 ? S 0:00 [kswapd7]
> >> > > > 758 ? S 0:00 [kswapd8]
> >> > > > 759 ? S 0:00 [kswapd9]
> >> > > > 760 ? S 0:00 [kswapd10]
> >> > > > 761 ? S 0:00 [kswapd11]
> >> > > > 762 ? S 0:00 [kswapd12]
> >> > > > 763 ? S 0:00 [kswapd13]
> >> > > > 764 ? S 0:00 [kswapd14]
> >> > > > 765 ? S 0:00 [kswapd15]
> >> > > >
> >> > > > Good stuff is that the system did not end in a continuous loop of swap
> >> > > > in/out usage (at least so far) which is great. See attached
> >> > > > swap_in_out_good_vs_bad.png. I will keep it running for the next 3
> >> > > > days.
> >> > >
> >> > > Thanks again, Jaroslav!
> >> > >
> >> > > Just a note here: I suspect the problem still exists on v6.6 but
> >> > > somehow is masked, possibly by reduced memory usage from the kernel
> >> > > itself and more free memory for userspace. So to be on the safe side,
> >> > > I'll post the patch and credit you as the reporter and tester.
> >> >
> >> > Morning, let's wait. I reviewed the graph and the swap in/out started
> >> > to be happening from 1:50 AM CET. Slower than before (util of cpu
> >> > 0.3%) but it is doing in/out see attached png.
> >>
> >> I investigated it more, there was an operation issue and the system
> >> disabled multi-gen lru yesterday ~10 AM CET (our temporary workaround
> >> for this problem) by
> >> echo N > /sys/kernel/mm/lru_gen/enabled
> >> when an alert was triggered by an unexpected setup of the server.
> >> Could it be that the patch is not functional if lru_gen/enabled is
> >> 0x0000?
> >
> >
> > That’s correct.
> >
> >> I need to reboot the system and do the whole week's test again.
> >
> >
> > Thanks a lot!
>
> The server with 6.5.y + lru patch is stable, no continuous swap in/out
> is observed in the last 7days!
>
> I assume the fix is correct. Can you share with me the final patch for
> 6.6.y, I will use in our kernel builds till it is in the upstream.
Will do. Thank you.
Charan, does the fix previously attached seem acceptable to you? Any
additional feedback? Thanks.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
2023-12-01 23:52 ` Yu Zhao
@ 2023-12-07 8:46 ` Charan Teja Kalla
2023-12-07 18:23 ` Yu Zhao
2023-12-08 8:03 ` Jaroslav Pulchart
0 siblings, 2 replies; 30+ messages in thread
From: Charan Teja Kalla @ 2023-12-07 8:46 UTC (permalink / raw)
To: Yu Zhao, Jaroslav Pulchart
Cc: Daniel Secik, Igor Raits, Kalesh Singh, akpm, linux-mm
Hi Yu,
On 12/2/2023 5:22 AM, Yu Zhao wrote:
> Charan, does the fix previously attached seem acceptable to you? Any
> additional feedback? Thanks.
First, thanks for taking this patch to upstream.
A comment on the code snippet: checking just the 'high wmark' pages might
succeed here, but the immediate kswapd sleep can still fail, see
prepare_kswapd_sleep(). This can show up as an increased
KSWAPD_HIGH_WMARK_HIT_QUICKLY count, and thus unnecessary kswapd run time.
@Jaroslav: Have you observed anything like the above?
So, in downstream, we have something like this for zone_watermark_ok():
unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH << 2;
It is hard to justify this empirical 'MIN_LRU_BATCH << 2' value; maybe we
should at least use 'MIN_LRU_BATCH', with the reasoning mentioned above, is
all I can say for this patch.
+ mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
+ WMARK_PROMO : WMARK_HIGH;
+ for (i = 0; i <= sc->reclaim_idx; i++) {
+ struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
+ unsigned long size = wmark_pages(zone, mark);
+
+ if (managed_zone(zone) &&
+ !zone_watermark_ok(zone, sc->order, size, sc->reclaim_idx, 0))
+ return false;
+ }
Thanks,
Charan
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
2023-12-07 8:46 ` Charan Teja Kalla
@ 2023-12-07 18:23 ` Yu Zhao
2023-12-08 8:03 ` Jaroslav Pulchart
1 sibling, 0 replies; 30+ messages in thread
From: Yu Zhao @ 2023-12-07 18:23 UTC (permalink / raw)
To: Charan Teja Kalla
Cc: Jaroslav Pulchart, Daniel Secik, Igor Raits, Kalesh Singh, akpm,
linux-mm
On Thu, Dec 7, 2023 at 1:47 AM Charan Teja Kalla
<quic_charante@quicinc.com> wrote:
>
> Hi yu,
>
> On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > Charan, does the fix previously attached seem acceptable to you? Any
> > additional feedback? Thanks.
>
> First, thanks for taking this patch to upstream.
>
> A comment in code snippet is checking just 'high wmark' pages might
> succeed here but can fail in the immediate kswapd sleep, see
> prepare_kswapd_sleep(). This can show up into the increased
> KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary kswapd run time.
> @Jaroslav: Have you observed something like above?
>
> So, in downstream, we have something like for zone_watermark_ok():
> unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH << 2;
>
> Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value, may be we
> should atleast use the 'MIN_LRU_BATCH' with the mentioned reasoning, is
> what all I can say for this patch.
Yeah, we can add MIN_LRU_BATCH on top of the high watermark.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
2023-12-07 8:46 ` Charan Teja Kalla
2023-12-07 18:23 ` Yu Zhao
@ 2023-12-08 8:03 ` Jaroslav Pulchart
2024-01-03 21:30 ` Jaroslav Pulchart
1 sibling, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2023-12-08 8:03 UTC (permalink / raw)
To: Charan Teja Kalla
Cc: Yu Zhao, Daniel Secik, Igor Raits, Kalesh Singh, akpm, linux-mm
>
> Hi yu,
>
> On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > Charan, does the fix previously attached seem acceptable to you? Any
> > additional feedback? Thanks.
>
> First, thanks for taking this patch to upstream.
>
> A comment in code snippet is checking just 'high wmark' pages might
> succeed here but can fail in the immediate kswapd sleep, see
> prepare_kswapd_sleep(). This can show up into the increased
> KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary kswapd run time.
> @Jaroslav: Have you observed something like above?
I do not see any unnecessary kswapd run time; on the contrary, it fixes
the kswapd continuous-run issue.
>
> So, in downstream, we have something like for zone_watermark_ok():
> unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH << 2;
>
> Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value, may be we
> should atleast use the 'MIN_LRU_BATCH' with the mentioned reasoning, is
> what all I can say for this patch.
>
> + mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
> + WMARK_PROMO : WMARK_HIGH;
> + for (i = 0; i <= sc->reclaim_idx; i++) {
> + struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
> + unsigned long size = wmark_pages(zone, mark);
> +
> + if (managed_zone(zone) &&
> + !zone_watermark_ok(zone, sc->order, size, sc->reclaim_idx, 0))
> + return false;
> + }
>
>
> Thanks,
> Charan
--
Jaroslav Pulchart
Sr. Principal SW Engineer
GoodData
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
2023-12-08 8:03 ` Jaroslav Pulchart
@ 2024-01-03 21:30 ` Jaroslav Pulchart
2024-01-04 3:03 ` Yu Zhao
0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2024-01-03 21:30 UTC (permalink / raw)
To: Yu Zhao
Cc: Daniel Secik, Charan Teja Kalla, Igor Raits, Kalesh Singh, akpm,
linux-mm
>
> >
> > Hi yu,
> >
> > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > Charan, does the fix previously attached seem acceptable to you? Any
> > > additional feedback? Thanks.
> >
> > First, thanks for taking this patch to upstream.
> >
> > A comment in code snippet is checking just 'high wmark' pages might
> > succeed here but can fail in the immediate kswapd sleep, see
> > prepare_kswapd_sleep(). This can show up into the increased
> > KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary kswapd run time.
> > @Jaroslav: Have you observed something like above?
>
> I do not see any unnecessary kswapd run time, on the contrary it is
> fixing the kswapd continuous run issue.
>
> >
> > So, in downstream, we have something like for zone_watermark_ok():
> > unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH << 2;
> >
> > Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value, may be we
> > should atleast use the 'MIN_LRU_BATCH' with the mentioned reasoning, is
> > what all I can say for this patch.
> >
> > + mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
> > + WMARK_PROMO : WMARK_HIGH;
> > + for (i = 0; i <= sc->reclaim_idx; i++) {
> > + struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
> > + unsigned long size = wmark_pages(zone, mark);
> > +
> > + if (managed_zone(zone) &&
> > + !zone_watermark_ok(zone, sc->order, size, sc->reclaim_idx, 0))
> > + return false;
> > + }
> >
> >
> > Thanks,
> > Charan
>
>
>
> --
> Jaroslav Pulchart
> Sr. Principal SW Engineer
> GoodData
Hello,
today we tried to update the servers to 6.6.9, which contains the mglru fixes
(from 6.6.8), and the server behaves much, much worse.
I got multiple kswapd* threads at ~100% load immediately:
555 root 20 0 0 0 0 R 99.7 0.0 4:32.86 kswapd1
554 root 20 0 0 0 0 R 99.3 0.0 3:57.76 kswapd0
556 root 20 0 0 0 0 R 97.7 0.0 3:42.27 kswapd2
Are the changes in upstream different compared to the initial patch
which I tested?
Best regards,
Jaroslav Pulchart
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
2024-01-03 21:30 ` Jaroslav Pulchart
@ 2024-01-04 3:03 ` Yu Zhao
2024-01-04 9:46 ` Jaroslav Pulchart
0 siblings, 1 reply; 30+ messages in thread
From: Yu Zhao @ 2024-01-04 3:03 UTC (permalink / raw)
To: Jaroslav Pulchart
Cc: Daniel Secik, Charan Teja Kalla, Igor Raits, Kalesh Singh, akpm,
linux-mm
[-- Attachment #1: Type: text/plain, Size: 2812 bytes --]
On Wed, Jan 3, 2024 at 2:30 PM Jaroslav Pulchart
<jaroslav.pulchart@gooddata.com> wrote:
>
> >
> > >
> > > Hi yu,
> > >
> > > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > > Charan, does the fix previously attached seem acceptable to you? Any
> > > > additional feedback? Thanks.
> > >
> > > First, thanks for taking this patch to upstream.
> > >
> > > A comment in code snippet is checking just 'high wmark' pages might
> > > succeed here but can fail in the immediate kswapd sleep, see
> > > prepare_kswapd_sleep(). This can show up into the increased
> > > KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary kswapd run time.
> > > @Jaroslav: Have you observed something like above?
> >
> > I do not see any unnecessary kswapd run time, on the contrary it is
> > fixing the kswapd continuous run issue.
> >
> > >
> > > So, in downstream, we have something like for zone_watermark_ok():
> > > unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH << 2;
> > >
> > > Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value, may be we
> > > should atleast use the 'MIN_LRU_BATCH' with the mentioned reasoning, is
> > > what all I can say for this patch.
> > >
> > > + mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
> > > + WMARK_PROMO : WMARK_HIGH;
> > > + for (i = 0; i <= sc->reclaim_idx; i++) {
> > > + struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
> > > + unsigned long size = wmark_pages(zone, mark);
> > > +
> > > + if (managed_zone(zone) &&
> > > + !zone_watermark_ok(zone, sc->order, size, sc->reclaim_idx, 0))
> > > + return false;
> > > + }
> > >
> > >
> > > Thanks,
> > > Charan
> >
> >
> >
> > --
> > Jaroslav Pulchart
> > Sr. Principal SW Engineer
> > GoodData
>
>
> Hello,
>
> today we try to update servers to 6.6.9 which contains the mglru fixes
> (from 6.6.8) and the server behaves much much worse.
>
> I got multiple kswapd* load to ~100% imediatelly.
> 555 root 20 0 0 0 0 R 99.7 0.0 4:32.86
> kswapd1
> 554 root 20 0 0 0 0 R 99.3 0.0 3:57.76
> kswapd0
> 556 root 20 0 0 0 0 R 97.7 0.0 3:42.27
> kswapd2
> are the changes in upstream different compared to the initial patch
> which I tested?
>
> Best regards,
> Jaroslav Pulchart
Hi Jaroslav,
My apologies for all the trouble!
Yes, there is a slight difference between the fix you verified and
what went into 6.6.9. The fix in 6.6.9 is disabled under a special
condition which I thought wouldn't affect you.
Could you try the attached fix again on top of 6.6.9? It removed that
special condition.
Thanks!
[-- Attachment #2: mglru-fix-6.6.9.patch --]
[-- Type: application/octet-stream, Size: 975 bytes --]
diff --git a/mm/vmscan.c b/mm/vmscan.c
index dcc264d3c92f..ae3f73fc933c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5358,8 +5358,7 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
if (sc->nr_reclaimed >= max(sc->nr_to_reclaim, compact_gap(sc->order)))
return true;
- /* check the order to exclude compaction-induced reclaim */
- if (!current_is_kswapd() || sc->order)
+ if (!current_is_kswapd())
return false;
mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
@@ -5367,7 +5366,7 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
for (i = 0; i <= sc->reclaim_idx; i++) {
struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
- unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH;
+ unsigned long size = wmark_pages(zone, mark) + min_wmark_pages(zone);
if (managed_zone(zone) && !zone_watermark_ok(zone, 0, size, sc->reclaim_idx, 0))
return false;
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
2024-01-04 3:03 ` Yu Zhao
@ 2024-01-04 9:46 ` Jaroslav Pulchart
2024-01-04 14:34 ` Jaroslav Pulchart
0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2024-01-04 9:46 UTC (permalink / raw)
To: Yu Zhao
Cc: Daniel Secik, Charan Teja Kalla, Igor Raits, Kalesh Singh, akpm,
linux-mm
>
> On Wed, Jan 3, 2024 at 2:30 PM Jaroslav Pulchart
> <jaroslav.pulchart@gooddata.com> wrote:
> >
> > >
> > > >
> > > > Hi yu,
> > > >
> > > > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > > > Charan, does the fix previously attached seem acceptable to you? Any
> > > > > additional feedback? Thanks.
> > > >
> > > > First, thanks for taking this patch to upstream.
> > > >
> > > > A comment in code snippet is checking just 'high wmark' pages might
> > > > succeed here but can fail in the immediate kswapd sleep, see
> > > > prepare_kswapd_sleep(). This can show up into the increased
> > > > KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary kswapd run time.
> > > > @Jaroslav: Have you observed something like above?
> > >
> > > I do not see any unnecessary kswapd run time, on the contrary it is
> > > fixing the kswapd continuous run issue.
> > >
> > > >
> > > > So, in downstream, we have something like for zone_watermark_ok():
> > > > unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH << 2;
> > > >
> > > > Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value, may be we
> > > > should atleast use the 'MIN_LRU_BATCH' with the mentioned reasoning, is
> > > > what all I can say for this patch.
> > > >
> > > > + mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
> > > > + WMARK_PROMO : WMARK_HIGH;
> > > > + for (i = 0; i <= sc->reclaim_idx; i++) {
> > > > + struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
> > > > + unsigned long size = wmark_pages(zone, mark);
> > > > +
> > > > + if (managed_zone(zone) &&
> > > > + !zone_watermark_ok(zone, sc->order, size, sc->reclaim_idx, 0))
> > > > + return false;
> > > > + }
> > > >
> > > >
> > > > Thanks,
> > > > Charan
> > >
> > >
> > >
> > > --
> > > Jaroslav Pulchart
> > > Sr. Principal SW Engineer
> > > GoodData
> >
> >
> > Hello,
> >
> > today we try to update servers to 6.6.9 which contains the mglru fixes
> > (from 6.6.8) and the server behaves much much worse.
> >
> > I got multiple kswapd* load to ~100% imediatelly.
> > 555 root 20 0 0 0 0 R 99.7 0.0 4:32.86
> > kswapd1
> > 554 root 20 0 0 0 0 R 99.3 0.0 3:57.76
> > kswapd0
> > 556 root 20 0 0 0 0 R 97.7 0.0 3:42.27
> > kswapd2
> > are the changes in upstream different compared to the initial patch
> > which I tested?
> >
> > Best regards,
> > Jaroslav Pulchart
>
> Hi Jaroslav,
>
> My apologies for all the trouble!
>
> Yes, there is a slight difference between the fix you verified and
> what went into 6.6.9. The fix in 6.6.9 is disabled under a special
> condition which I thought wouldn't affect you.
>
> Could you try the attached fix again on top of 6.6.9? It removed that
> special condition.
>
> Thanks!
Thanks for the prompt response. I did a test with the patch, and it didn't
help. The situation is very strange.
I tried kernels 6.6.7, 6.6.8 and 6.6.9. With 6.6.9 I see high memory
utilization on all NUMA nodes of the first CPU socket, which is the worst
case, but the kswapd load is already visible from 6.6.8.
Setup of this server:
* 2 sockets, 4 chiplets per socket
* 32 GB of RAM per chiplet, 28 GB of which are in hugepages
Note: previously I had 29 GB in hugepages; I freed up 1 GB to avoid
memory pressure, but on the contrary it is now even worse.
kernel 6.6.7: I do not see kswapd usage when application started == OK
NUMA nodes: 0 1 2 3 4 5 6 7
HPTotalGiB: 28 28 28 28 28 28 28 28
HPFreeGiB: 28 28 28 28 28 28 28 28
MemTotal: 32264 32701 32701 32686 32701 32659 32701 32696
MemFree: 2766 2715 63 2366 3495 2990 3462 252
kernel 6.6.8: I see kswapd on nodes 2 and 3 when application started
NUMA nodes: 0 1 2 3 4 5 6 7
HPTotalGiB: 28 28 28 28 28 28 28 28
HPFreeGiB: 28 28 28 28 28 28 28 28
MemTotal: 32264 32701 32701 32686 32701 32701 32659 32696
MemFree: 2744 2788 65 581 3304 3215 3266 2226
kernel 6.6.9: I see kswapd on nodes 0, 1, 2 and 3 when application started
NUMA nodes: 0 1 2 3 4 5 6 7
HPTotalGiB: 28 28 28 28 28 28 28 28
HPFreeGiB: 28 28 28 28 28 28 28 28
MemTotal: 32264 32701 32701 32686 32659 32701 32701 32696
MemFree: 75 60 60 60 3169 2784 3203 2944
* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
2024-01-04 9:46 ` Jaroslav Pulchart
@ 2024-01-04 14:34 ` Jaroslav Pulchart
2024-01-04 23:51 ` Igor Raits
0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2024-01-04 14:34 UTC (permalink / raw)
To: Yu Zhao
Cc: Daniel Secik, Charan Teja Kalla, Igor Raits, Kalesh Singh, akpm,
linux-mm
>
> >
> > On Wed, Jan 3, 2024 at 2:30 PM Jaroslav Pulchart
> > <jaroslav.pulchart@gooddata.com> wrote:
> > >
> > > >
> > > > >
> > > > > Hi yu,
> > > > >
> > > > > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > > > > Charan, does the fix previously attached seem acceptable to you? Any
> > > > > > additional feedback? Thanks.
> > > > >
> > > > > First, thanks for taking this patch to upstream.
> > > > >
> > > > > A comment in code snippet is checking just 'high wmark' pages might
> > > > > succeed here but can fail in the immediate kswapd sleep, see
> > > > > prepare_kswapd_sleep(). This can show up into the increased
> > > > > KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary kswapd run time.
> > > > > @Jaroslav: Have you observed something like above?
> > > >
> > > > I do not see any unnecessary kswapd run time, on the contrary it is
> > > > fixing the kswapd continuous run issue.
> > > >
> > > > >
> > > > > So, in downstream, we have something like for zone_watermark_ok():
> > > > > unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH << 2;
> > > > >
> > > > > Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value, may be we
> > > > > should atleast use the 'MIN_LRU_BATCH' with the mentioned reasoning, is
> > > > > what all I can say for this patch.
> > > > >
> > > > > + mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
> > > > > + WMARK_PROMO : WMARK_HIGH;
> > > > > + for (i = 0; i <= sc->reclaim_idx; i++) {
> > > > > + struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
> > > > > + unsigned long size = wmark_pages(zone, mark);
> > > > > +
> > > > > + if (managed_zone(zone) &&
> > > > > + !zone_watermark_ok(zone, sc->order, size, sc->reclaim_idx, 0))
> > > > > + return false;
> > > > > + }
> > > > >
> > > > >
> > > > > Thanks,
> > > > > Charan
> > > >
> > > >
> > > >
> > > > --
> > > > Jaroslav Pulchart
> > > > Sr. Principal SW Engineer
> > > > GoodData
> > >
> > >
> > > Hello,
> > >
> > > today we try to update servers to 6.6.9 which contains the mglru fixes
> > > (from 6.6.8) and the server behaves much much worse.
> > >
> > > I got multiple kswapd* load to ~100% imediatelly.
> > > 555 root 20 0 0 0 0 R 99.7 0.0 4:32.86
> > > kswapd1
> > > 554 root 20 0 0 0 0 R 99.3 0.0 3:57.76
> > > kswapd0
> > > 556 root 20 0 0 0 0 R 97.7 0.0 3:42.27
> > > kswapd2
> > > are the changes in upstream different compared to the initial patch
> > > which I tested?
> > >
> > > Best regards,
> > > Jaroslav Pulchart
> >
> > Hi Jaroslav,
> >
> > My apologies for all the trouble!
> >
> > Yes, there is a slight difference between the fix you verified and
> > what went into 6.6.9. The fix in 6.6.9 is disabled under a special
> > condition which I thought wouldn't affect you.
> >
> > Could you try the attached fix again on top of 6.6.9? It removed that
> > special condition.
> >
> > Thanks!
>
> Thanks for prompt response. I did a test with the patch and it didn't
> help. The situation is super strange.
>
> I tried kernels 6.6.7, 6.6.8 and 6.6.9. I see high memory utilization
> of all numa nodes of the first cpu socket if using 6.6.9 and it is the
> worst situation, but the kswapd load is visible from 6.6.8.
>
> Setup of this server:
> * 4 chiplets per each sockets, there are 2 sockets
> * 32 GB of RAM for each chiplet, 28GB are in hugepages
> Note: previously I have 29GB in Hugepages, I free up 1GB to avoid
> memory pressure however it is even worse now in contrary.
>
> kernel 6.6.7: I do not see kswapd usage when application started == OK
> NUMA nodes: 0 1 2 3 4 5 6 7
> HPTotalGiB: 28 28 28 28 28 28 28 28
> HPFreeGiB: 28 28 28 28 28 28 28 28
> MemTotal: 32264 32701 32701 32686 32701 32659 32701 32696
> MemFree: 2766 2715 63 2366 3495 2990 3462 252
>
> kernel 6.6.8: I see kswapd on nodes 2 and 3 when application started
> NUMA nodes: 0 1 2 3 4 5 6 7
> HPTotalGiB: 28 28 28 28 28 28 28 28
> HPFreeGiB: 28 28 28 28 28 28 28 28
> MemTotal: 32264 32701 32701 32686 32701 32701 32659 32696
> MemFree: 2744 2788 65 581 3304 3215 3266 2226
>
> kernel 6.6.9: I see kswapd on nodes 0, 1, 2 and 3 when application started
> NUMA nodes: 0 1 2 3 4 5 6 7
> HPTotalGiB: 28 28 28 28 28 28 28 28
> HPFreeGiB: 28 28 28 28 28 28 28 28
> MemTotal: 32264 32701 32701 32686 32659 32701 32701 32696
> MemFree: 75 60 60 60 3169 2784 3203 2944
I ran a few more combinations; here are the results / findings:
6.6.7-1 (vanilla) == OK, no issue
6.6.8-1 (vanilla) == single kswapd at 100%!
6.6.8-1 (vanilla plus mglru-fix-6.6.9.patch) == OK, no issue
6.6.8-1 (revert four mglru patches) == OK, no issue
6.6.9-1 (vanilla) == four kswapd at 100%!
6.6.9-2 (vanilla plus mglru-fix-6.6.9.patch) == four kswapd at 100%!
6.6.9-3 (revert four mglru patches) == four kswapd at 100%!
Summary:
* mglru-fix-6.6.9.patch or reverting the mglru patches helps in the case
of kernel 6.6.8,
* there is a (new?) problem in the 6.6.9 kernel, which appears not to
be related to the mglru patches at all
* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
2024-01-04 14:34 ` Jaroslav Pulchart
@ 2024-01-04 23:51 ` Igor Raits
2024-01-05 17:35 ` Ertman, David M
0 siblings, 1 reply; 30+ messages in thread
From: Igor Raits @ 2024-01-04 23:51 UTC (permalink / raw)
To: Jaroslav Pulchart
Cc: Yu Zhao, Daniel Secik, Charan Teja Kalla, Kalesh Singh, akpm,
linux-mm, Dave Ertman
Hello everyone,
On Thu, Jan 4, 2024 at 3:34 PM Jaroslav Pulchart
<jaroslav.pulchart@gooddata.com> wrote:
>
> >
> > >
> > > On Wed, Jan 3, 2024 at 2:30 PM Jaroslav Pulchart
> > > <jaroslav.pulchart@gooddata.com> wrote:
> > > >
> > > > >
> > > > > >
> > > > > > Hi yu,
> > > > > >
> > > > > > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > > > > > Charan, does the fix previously attached seem acceptable to you? Any
> > > > > > > additional feedback? Thanks.
> > > > > >
> > > > > > First, thanks for taking this patch to upstream.
> > > > > >
> > > > > > A comment in code snippet is checking just 'high wmark' pages might
> > > > > > succeed here but can fail in the immediate kswapd sleep, see
> > > > > > prepare_kswapd_sleep(). This can show up into the increased
> > > > > > KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary kswapd run time.
> > > > > > @Jaroslav: Have you observed something like above?
> > > > >
> > > > > I do not see any unnecessary kswapd run time, on the contrary it is
> > > > > fixing the kswapd continuous run issue.
> > > > >
> > > > > >
> > > > > > So, in downstream, we have something like for zone_watermark_ok():
> > > > > > unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH << 2;
> > > > > >
> > > > > > Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value, may be we
> > > > > > should atleast use the 'MIN_LRU_BATCH' with the mentioned reasoning, is
> > > > > > what all I can say for this patch.
> > > > > >
> > > > > > + mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
> > > > > > + WMARK_PROMO : WMARK_HIGH;
> > > > > > + for (i = 0; i <= sc->reclaim_idx; i++) {
> > > > > > + struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
> > > > > > + unsigned long size = wmark_pages(zone, mark);
> > > > > > +
> > > > > > + if (managed_zone(zone) &&
> > > > > > + !zone_watermark_ok(zone, sc->order, size, sc->reclaim_idx, 0))
> > > > > > + return false;
> > > > > > + }
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Charan
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Jaroslav Pulchart
> > > > > Sr. Principal SW Engineer
> > > > > GoodData
> > > >
> > > >
> > > > Hello,
> > > >
> > > > today we try to update servers to 6.6.9 which contains the mglru fixes
> > > > (from 6.6.8) and the server behaves much much worse.
> > > >
> > > > I got multiple kswapd* load to ~100% imediatelly.
> > > > 555 root 20 0 0 0 0 R 99.7 0.0 4:32.86
> > > > kswapd1
> > > > 554 root 20 0 0 0 0 R 99.3 0.0 3:57.76
> > > > kswapd0
> > > > 556 root 20 0 0 0 0 R 97.7 0.0 3:42.27
> > > > kswapd2
> > > > are the changes in upstream different compared to the initial patch
> > > > which I tested?
> > > >
> > > > Best regards,
> > > > Jaroslav Pulchart
> > >
> > > Hi Jaroslav,
> > >
> > > My apologies for all the trouble!
> > >
> > > Yes, there is a slight difference between the fix you verified and
> > > what went into 6.6.9. The fix in 6.6.9 is disabled under a special
> > > condition which I thought wouldn't affect you.
> > >
> > > Could you try the attached fix again on top of 6.6.9? It removed that
> > > special condition.
> > >
> > > Thanks!
> >
> > Thanks for prompt response. I did a test with the patch and it didn't
> > help. The situation is super strange.
> >
> > I tried kernels 6.6.7, 6.6.8 and 6.6.9. I see high memory utilization
> > of all numa nodes of the first cpu socket if using 6.6.9 and it is the
> > worst situation, but the kswapd load is visible from 6.6.8.
> >
> > Setup of this server:
> > * 4 chiplets per each sockets, there are 2 sockets
> > * 32 GB of RAM for each chiplet, 28GB are in hugepages
> > Note: previously I have 29GB in Hugepages, I free up 1GB to avoid
> > memory pressure however it is even worse now in contrary.
> >
> > kernel 6.6.7: I do not see kswapd usage when application started == OK
> > NUMA nodes: 0 1 2 3 4 5 6 7
> > HPTotalGiB: 28 28 28 28 28 28 28 28
> > HPFreeGiB: 28 28 28 28 28 28 28 28
> > MemTotal: 32264 32701 32701 32686 32701 32659 32701 32696
> > MemFree: 2766 2715 63 2366 3495 2990 3462 252
> >
> > kernel 6.6.8: I see kswapd on nodes 2 and 3 when application started
> > NUMA nodes: 0 1 2 3 4 5 6 7
> > HPTotalGiB: 28 28 28 28 28 28 28 28
> > HPFreeGiB: 28 28 28 28 28 28 28 28
> > MemTotal: 32264 32701 32701 32686 32701 32701 32659 32696
> > MemFree: 2744 2788 65 581 3304 3215 3266 2226
> >
> > kernel 6.6.9: I see kswapd on nodes 0, 1, 2 and 3 when application started
> > NUMA nodes: 0 1 2 3 4 5 6 7
> > HPTotalGiB: 28 28 28 28 28 28 28 28
> > HPFreeGiB: 28 28 28 28 28 28 28 28
> > MemTotal: 32264 32701 32701 32686 32659 32701 32701 32696
> > MemFree: 75 60 60 60 3169 2784 3203 2944
>
> I run few more combinations, and here are results / findings:
>
> 6.6.7-1 (vanila) == OK, no issue
>
> 6.6.8-1 (vanila) == single kswapd 100% !
> 6.6.8-1 (vanila plus mglru-fix-6.6.9.patch) == OK, no issue
> 6.6.8-1 (revert four mglru patches) == OK, no issue
>
> 6.6.9-1 (vanila) == four kswapd 100% !!!!
> 6.6.9-2 (vanila plus mglru-fix-6.6.9.patch) == four kswapd 100% !!!!
> 6.6.9-3 (revert four mglru patches) == four kswapd 100% !!!!
>
> Summary:
> * mglru-fix-6.6.9.patch or reverting mglru patches helps in case of
> kernel 6.6.8,
> * there is (new?) problem in case of 6.6.9 kernel, which looks not to
> be related to mglru patches at all
I was able to bisect this change, and it looks like something is going
wrong with the ice driver…
Usually after booting our server we see something like this. Most of
the nodes have ~2-3 GB of free memory. There are always 1-2 NUMA nodes
with a really low amount of free memory; we don't know why, but it
looks like that is what ultimately causes the constant swap in/out issue.
With the final bit of the patch you sent earlier in this thread it
is almost invisible.
NUMA nodes: 0 1 2 3 4 5 6 7
HPTotalGiB: 28 28 28 28 28 28 28 28
HPFreeGiB: 28 28 28 28 28 28 28 28
MemTotal: 32264 32701 32659 32686 32701 32701 32701 32696
MemFree: 2191 2828 92 292 3344 2916 3594 3222
However, after the following patch we see that more NUMA nodes have
such a low amount of free memory, and that causes constant reclaiming
of memory; it looks like something inside the kernel ate all the
memory. This is right after system startup as well.
NUMA nodes: 0 1 2 3 4 5 6 7
HPTotalGiB: 28 28 28 28 28 28 28 28
HPFreeGiB: 28 28 28 28 28 28 28 28
MemTotal: 32264 32701 32659 32686 32701 32701 32701 32696
MemFree: 46 59 51 33 3078 3535 2708 3511
The difference is 18 GB vs 12 GB of free memory summed across all NUMA
nodes right after boot of the system. If you have some hints on how to
debug what is actually occupying all that memory, ideally in both cases,
we would be happy to debug more!
Dave, would you have any idea why that patch could cause such a boost
in memory utilization?
commit fc4d6d136d42fab207b3ce20a8ebfd61a13f931f
Author: Dave Ertman <david.m.ertman@intel.com>
Date: Mon Dec 11 13:19:28 2023 -0800
ice: alter feature support check for SRIOV and LAG
[ Upstream commit 4d50fcdc2476eef94c14c6761073af5667bb43b6 ]
Previously, the ice driver had support for using a handler for bonding
netdev events to ensure that conflicting features were not allowed to be
activated at the same time. While this was still in place, additional
support was added to specifically support SRIOV and LAG together. These
both utilized the netdev event handler, but the SRIOV and LAG feature was
behind a capabilities feature check to make sure the current NVM has
support.
The exclusion part of the event handler should be removed since there are
users who have custom made solutions that depend on the non-exclusion of
features.
Wrap the creation/registration and cleanup of the event handler and
associated structs in the probe flow with a feature check so that the
only systems that support the full implementation of LAG features will
initialize support. This will leave other systems unhindered with
functionality as it existed before any LAG code was added.
* RE: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
2024-01-04 23:51 ` Igor Raits
@ 2024-01-05 17:35 ` Ertman, David M
2024-01-08 17:53 ` Jaroslav Pulchart
0 siblings, 1 reply; 30+ messages in thread
From: Ertman, David M @ 2024-01-05 17:35 UTC (permalink / raw)
To: Igor Raits, Jaroslav Pulchart
Cc: Yu Zhao, Daniel Secik, Charan Teja Kalla, Kalesh Singh, akpm, linux-mm
> -----Original Message-----
> From: Igor Raits <igor@gooddata.com>
> Sent: Thursday, January 4, 2024 3:51 PM
> To: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
> Cc: Yu Zhao <yuzhao@google.com>; Daniel Secik
> <daniel.secik@gooddata.com>; Charan Teja Kalla
> <quic_charante@quicinc.com>; Kalesh Singh <kaleshsingh@google.com>;
> akpm@linux-foundation.org; linux-mm@kvack.org; Ertman, David M
> <david.m.ertman@intel.com>
> Subject: Re: high kswapd CPU usage with symmetrical swap in/out pattern
> with multi-gen LRU
>
> Hello everyone,
>
> On Thu, Jan 4, 2024 at 3:34 PM Jaroslav Pulchart
> <jaroslav.pulchart@gooddata.com> wrote:
> >
> > >
> > > >
> > > > On Wed, Jan 3, 2024 at 2:30 PM Jaroslav Pulchart
> > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > >
> > > > > >
> > > > > > >
> > > > > > > Hi yu,
> > > > > > >
> > > > > > > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > > > > > > Charan, does the fix previously attached seem acceptable to
> you? Any
> > > > > > > > additional feedback? Thanks.
> > > > > > >
> > > > > > > First, thanks for taking this patch to upstream.
> > > > > > >
> > > > > > > A comment in code snippet is checking just 'high wmark' pages
> might
> > > > > > > succeed here but can fail in the immediate kswapd sleep, see
> > > > > > > prepare_kswapd_sleep(). This can show up into the increased
> > > > > > > KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary
> kswapd run time.
> > > > > > > @Jaroslav: Have you observed something like above?
> > > > > >
> > > > > > I do not see any unnecessary kswapd run time, on the contrary it is
> > > > > > fixing the kswapd continuous run issue.
> > > > > >
> > > > > > >
> > > > > > > So, in downstream, we have something like for
> zone_watermark_ok():
> > > > > > > unsigned long size = wmark_pages(zone, mark) +
> MIN_LRU_BATCH << 2;
> > > > > > >
> > > > > > > Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value,
> may be we
> > > > > > > should atleast use the 'MIN_LRU_BATCH' with the mentioned
> reasoning, is
> > > > > > > what all I can say for this patch.
> > > > > > >
> > > > > > > + mark = sysctl_numa_balancing_mode &
> NUMA_BALANCING_MEMORY_TIERING ?
> > > > > > > + WMARK_PROMO : WMARK_HIGH;
> > > > > > > + for (i = 0; i <= sc->reclaim_idx; i++) {
> > > > > > > + struct zone *zone = lruvec_pgdat(lruvec)->node_zones +
> i;
> > > > > > > + unsigned long size = wmark_pages(zone, mark);
> > > > > > > +
> > > > > > > + if (managed_zone(zone) &&
> > > > > > > + !zone_watermark_ok(zone, sc->order, size, sc-
> >reclaim_idx, 0))
> > > > > > > + return false;
> > > > > > > + }
> > > > > > >
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Charan
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Jaroslav Pulchart
> > > > > > Sr. Principal SW Engineer
> > > > > > GoodData
> > > > >
> > > > >
> > > > > Hello,
> > > > >
> > > > > today we try to update servers to 6.6.9 which contains the mglru fixes
> > > > > (from 6.6.8) and the server behaves much much worse.
> > > > >
> > > > > I got multiple kswapd* load to ~100% imediatelly.
> > > > > 555 root 20 0 0 0 0 R 99.7 0.0 4:32.86
> > > > > kswapd1
> > > > > 554 root 20 0 0 0 0 R 99.3 0.0 3:57.76
> > > > > kswapd0
> > > > > 556 root 20 0 0 0 0 R 97.7 0.0 3:42.27
> > > > > kswapd2
> > > > > are the changes in upstream different compared to the initial patch
> > > > > which I tested?
> > > > >
> > > > > Best regards,
> > > > > Jaroslav Pulchart
> > > >
> > > > Hi Jaroslav,
> > > >
> > > > My apologies for all the trouble!
> > > >
> > > > Yes, there is a slight difference between the fix you verified and
> > > > what went into 6.6.9. The fix in 6.6.9 is disabled under a special
> > > > condition which I thought wouldn't affect you.
> > > >
> > > > Could you try the attached fix again on top of 6.6.9? It removed that
> > > > special condition.
> > > >
> > > > Thanks!
> > >
> > > Thanks for prompt response. I did a test with the patch and it didn't
> > > help. The situation is super strange.
> > >
> > > I tried kernels 6.6.7, 6.6.8 and 6.6.9. I see high memory utilization
> > > of all numa nodes of the first cpu socket if using 6.6.9 and it is the
> > > worst situation, but the kswapd load is visible from 6.6.8.
> > >
> > > Setup of this server:
> > > * 4 chiplets per each sockets, there are 2 sockets
> > > * 32 GB of RAM for each chiplet, 28GB are in hugepages
> > > Note: previously I have 29GB in Hugepages, I free up 1GB to avoid
> > > memory pressure however it is even worse now in contrary.
> > >
> > > kernel 6.6.7: I do not see kswapd usage when application started == OK
> > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > MemTotal: 32264 32701 32701 32686 32701 32659 32701 32696
> > > MemFree: 2766 2715 63 2366 3495 2990 3462 252
> > >
> > > kernel 6.6.8: I see kswapd on nodes 2 and 3 when application started
> > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > MemTotal: 32264 32701 32701 32686 32701 32701 32659 32696
> > > MemFree: 2744 2788 65 581 3304 3215 3266 2226
> > >
> > > kernel 6.6.9: I see kswapd on nodes 0, 1, 2 and 3 when application started
> > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > MemTotal: 32264 32701 32701 32686 32659 32701 32701 32696
> > > MemFree: 75 60 60 60 3169 2784 3203 2944
> >
> > I run few more combinations, and here are results / findings:
> >
> > 6.6.7-1 (vanila) == OK, no issue
> >
> > 6.6.8-1 (vanila) == single kswapd 100% !
> > 6.6.8-1 (vanila plus mglru-fix-6.6.9.patch) == OK, no issue
> > 6.6.8-1 (revert four mglru patches) == OK, no issue
> >
> > 6.6.9-1 (vanila) == four kswapd 100% !!!!
> > 6.6.9-2 (vanila plus mglru-fix-6.6.9.patch) == four kswapd 100% !!!!
> > 6.6.9-3 (revert four mglru patches) == four kswapd 100% !!!!
> >
> > Summary:
> > * mglru-fix-6.6.9.patch or reverting mglru patches helps in case of
> > kernel 6.6.8,
> > * there is (new?) problem in case of 6.6.9 kernel, which looks not to
> > be related to mglru patches at all
>
> I was able to bisect this change and it looks like there is something
> going wrong with the ice driver…
>
> Usually after booting our server we see something like this. Most of
> the nodes have ~2-3G of free memory. There are always 1-2 NUMA nodes
> that have a really low amount of free memory and we don't know why but
> it looks like that in the end causes the constant swap in/out issue.
> With the final bit of the patch you've sent earlier in this thread it
> is almost invisible.
>
> NUMA nodes: 0 1 2 3 4 5 6 7
> HPTotalGiB: 28 28 28 28 28 28 28 28
> HPFreeGiB: 28 28 28 28 28 28 28 28
> MemTotal: 32264 32701 32659 32686 32701 32701 32701 32696
> MemFree: 2191 2828 92 292 3344 2916 3594 3222
>
>
> However, after the following patch we see that more NUMA nodes have
> such a low amount of memory and that is causing constant reclaiming
> of memory because it looks like something inside of the kernel ate all
> the memory. This is right after the start of the system as well.
>
> NUMA nodes: 0 1 2 3 4 5 6 7
> HPTotalGiB: 28 28 28 28 28 28 28 28
> HPFreeGiB: 28 28 28 28 28 28 28 28
> MemTotal: 32264 32701 32659 32686 32701 32701 32701 32696
> MemFree: 46 59 51 33 3078 3535 2708 3511
>
> The difference is 18G vs 12G of free memory sum'd across all NUMA
> nodes right after boot of the system. If you have some hints on how to
> debug what is actually occupying all that memory, maybe in both cases
> - would be happy to debug more!
>
> Dave, would you have any idea why that patch could cause such a boost
> in memory utilization?
>
> commit fc4d6d136d42fab207b3ce20a8ebfd61a13f931f
> Author: Dave Ertman <david.m.ertman@intel.com>
> Date: Mon Dec 11 13:19:28 2023 -0800
>
> ice: alter feature support check for SRIOV and LAG
>
> [ Upstream commit 4d50fcdc2476eef94c14c6761073af5667bb43b6 ]
>
> Previously, the ice driver had support for using a handler for bonding
> netdev events to ensure that conflicting features were not allowed to be
> activated at the same time. While this was still in place, additional
> support was added to specifically support SRIOV and LAG together. These
> both utilized the netdev event handler, but the SRIOV and LAG feature
> was
> behind a capabilities feature check to make sure the current NVM has
> support.
>
> The exclusion part of the event handler should be removed since there are
> users who have custom made solutions that depend on the non-exclusion
> of
> features.
>
> Wrap the creation/registration and cleanup of the event handler and
> associated structs in the probe flow with a feature check so that the
> only systems that support the full implementation of LAG features will
> initialize support. This will leave other systems unhindered with
> functionality as it existed before any LAG code was added.
Igor,
I have no idea why that two-line commit would do anything to increase
memory usage by the ice driver. If anything, I would expect it to lower
memory usage, as it has the potential to stop the allocation of memory
for the pf->lag struct.
DaveE
* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
2024-01-05 17:35 ` Ertman, David M
@ 2024-01-08 17:53 ` Jaroslav Pulchart
2024-01-16 4:58 ` Yu Zhao
0 siblings, 1 reply; 30+ messages in thread
From: Jaroslav Pulchart @ 2024-01-08 17:53 UTC (permalink / raw)
To: Ertman, David M, Yu Zhao
Cc: Igor Raits, Daniel Secik, Charan Teja Kalla, Kalesh Singh, akpm,
linux-mm
>
> > -----Original Message-----
> > From: Igor Raits <igor@gooddata.com>
> > Sent: Thursday, January 4, 2024 3:51 PM
> > To: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
> > Cc: Yu Zhao <yuzhao@google.com>; Daniel Secik
> > <daniel.secik@gooddata.com>; Charan Teja Kalla
> > <quic_charante@quicinc.com>; Kalesh Singh <kaleshsingh@google.com>;
> > akpm@linux-foundation.org; linux-mm@kvack.org; Ertman, David M
> > <david.m.ertman@intel.com>
> > Subject: Re: high kswapd CPU usage with symmetrical swap in/out pattern
> > with multi-gen LRU
> >
> > Hello everyone,
> >
> > On Thu, Jan 4, 2024 at 3:34 PM Jaroslav Pulchart
> > <jaroslav.pulchart@gooddata.com> wrote:
> > >
> > > >
> > > > >
> > > > > On Wed, Jan 3, 2024 at 2:30 PM Jaroslav Pulchart
> > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > Hi yu,
> > > > > > > >
> > > > > > > > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > > > > > > > Charan, does the fix previously attached seem acceptable to
> > you? Any
> > > > > > > > > additional feedback? Thanks.
> > > > > > > >
> > > > > > > > First, thanks for taking this patch to upstream.
> > > > > > > >
> > > > > > > > A comment in code snippet is checking just 'high wmark' pages
> > might
> > > > > > > > succeed here but can fail in the immediate kswapd sleep, see
> > > > > > > > prepare_kswapd_sleep(). This can show up into the increased
> > > > > > > > KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary
> > kswapd run time.
> > > > > > > > @Jaroslav: Have you observed something like above?
> > > > > > >
> > > > > > > I do not see any unnecessary kswapd run time, on the contrary it is
> > > > > > > fixing the kswapd continuous run issue.
> > > > > > >
> > > > > > > >
> > > > > > > > So, in downstream, we have something like for
> > zone_watermark_ok():
> > > > > > > > unsigned long size = wmark_pages(zone, mark) +
> > MIN_LRU_BATCH << 2;
> > > > > > > >
> > > > > > > > Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value,
> > may be we
> > > > > > > > should atleast use the 'MIN_LRU_BATCH' with the mentioned
> > reasoning, is
> > > > > > > > what all I can say for this patch.
> > > > > > > >
> > > > > > > > + mark = sysctl_numa_balancing_mode &
> > NUMA_BALANCING_MEMORY_TIERING ?
> > > > > > > > + WMARK_PROMO : WMARK_HIGH;
> > > > > > > > + for (i = 0; i <= sc->reclaim_idx; i++) {
> > > > > > > > + struct zone *zone = lruvec_pgdat(lruvec)->node_zones +
> > i;
> > > > > > > > + unsigned long size = wmark_pages(zone, mark);
> > > > > > > > +
> > > > > > > > + if (managed_zone(zone) &&
> > > > > > > > + !zone_watermark_ok(zone, sc->order, size, sc-
> > >reclaim_idx, 0))
> > > > > > > > + return false;
> > > > > > > > + }
> > > > > > > >
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Charan
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Jaroslav Pulchart
> > > > > > > Sr. Principal SW Engineer
> > > > > > > GoodData
> > > > > >
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > today we try to update servers to 6.6.9 which contains the mglru fixes
> > > > > > (from 6.6.8) and the server behaves much much worse.
> > > > > >
> > > > > > I got multiple kswapd* load to ~100% imediatelly.
> > > > > > 555 root 20 0 0 0 0 R 99.7 0.0 4:32.86
> > > > > > kswapd1
> > > > > > 554 root 20 0 0 0 0 R 99.3 0.0 3:57.76
> > > > > > kswapd0
> > > > > > 556 root 20 0 0 0 0 R 97.7 0.0 3:42.27
> > > > > > kswapd2
> > > > > > are the changes in upstream different compared to the initial patch
> > > > > > which I tested?
> > > > > >
> > > > > > Best regards,
> > > > > > Jaroslav Pulchart
> > > > >
> > > > > Hi Jaroslav,
> > > > >
> > > > > My apologies for all the trouble!
> > > > >
> > > > > Yes, there is a slight difference between the fix you verified and
> > > > > what went into 6.6.9. The fix in 6.6.9 is disabled under a special
> > > > > condition which I thought wouldn't affect you.
> > > > >
> > > > > Could you try the attached fix again on top of 6.6.9? It removed that
> > > > > special condition.
> > > > >
> > > > > Thanks!
> > > >
> > > > Thanks for prompt response. I did a test with the patch and it didn't
> > > > help. The situation is super strange.
> > > >
> > > > I tried kernels 6.6.7, 6.6.8 and 6.6.9. I see high memory utilization
> > > > of all numa nodes of the first cpu socket if using 6.6.9 and it is the
> > > > worst situation, but the kswapd load is visible from 6.6.8.
> > > >
> > > > Setup of this server:
> > > > * 4 chiplets per each sockets, there are 2 sockets
> > > > * 32 GB of RAM for each chiplet, 28GB are in hugepages
> > > > Note: previously I have 29GB in Hugepages, I free up 1GB to avoid
> > > > memory pressure however it is even worse now in contrary.
> > > >
> > > > kernel 6.6.7: I do not see kswapd usage when application started == OK
> > > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > > MemTotal: 32264 32701 32701 32686 32701 32659 32701 32696
> > > > MemFree: 2766 2715 63 2366 3495 2990 3462 252
> > > >
> > > > kernel 6.6.8: I see kswapd on nodes 2 and 3 when application started
> > > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > > MemTotal: 32264 32701 32701 32686 32701 32701 32659 32696
> > > > MemFree: 2744 2788 65 581 3304 3215 3266 2226
> > > >
> > > > kernel 6.6.9: I see kswapd on nodes 0, 1, 2 and 3 when application started
> > > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > > MemTotal: 32264 32701 32701 32686 32659 32701 32701 32696
> > > > MemFree: 75 60 60 60 3169 2784 3203 2944
> > >
> > > I run few more combinations, and here are results / findings:
> > >
> > > 6.6.7-1 (vanila) == OK, no issue
> > >
> > > 6.6.8-1 (vanila) == single kswapd 100% !
> > > 6.6.8-1 (vanila plus mglru-fix-6.6.9.patch) == OK, no issue
> > > 6.6.8-1 (revert four mglru patches) == OK, no issue
> > >
> > > 6.6.9-1 (vanila) == four kswapd 100% !!!!
> > > 6.6.9-2 (vanila plus mglru-fix-6.6.9.patch) == four kswapd 100% !!!!
> > > 6.6.9-3 (revert four mglru patches) == four kswapd 100% !!!!
> > >
> > > Summary:
> > > * mglru-fix-6.6.9.patch or reverting mglru patches helps in case of
> > > kernel 6.6.8,
> > > * there is (new?) problem in case of 6.6.9 kernel, which looks not to
> > > be related to mglru patches at all
> >
> > I was able to bisect this change and it looks like there is something
> > going wrong with the ice driver…
> >
> > Usually after booting our server we see something like this. Most of
> > the nodes have ~2-3G of free memory. There are always 1-2 NUMA nodes
> > that have a really low amount of free memory and we don't know why but
> > it looks like that in the end causes the constant swap in/out issue.
> > With the final bit of the patch you've sent earlier in this thread it
> > is almost invisible.
> >
> > NUMA nodes: 0 1 2 3 4 5 6 7
> > HPTotalGiB: 28 28 28 28 28 28 28 28
> > HPFreeGiB: 28 28 28 28 28 28 28 28
> > MemTotal: 32264 32701 32659 32686 32701 32701 32701 32696
> > MemFree: 2191 2828 92 292 3344 2916 3594 3222
> >
> >
> > However, after the following patch we see that more NUMA nodes have
> > such a low amount of memory and that is causing constant reclaiming
> > of memory because it looks like something inside of the kernel ate all
> > the memory. This is right after the start of the system as well.
> >
> > NUMA nodes: 0 1 2 3 4 5 6 7
> > HPTotalGiB: 28 28 28 28 28 28 28 28
> > HPFreeGiB: 28 28 28 28 28 28 28 28
> > MemTotal: 32264 32701 32659 32686 32701 32701 32701 32696
> > MemFree: 46 59 51 33 3078 3535 2708 3511
> >
> > The difference is 18G vs 12G of free memory sum'd across all NUMA
> > nodes right after boot of the system. If you have some hints on how to
> > debug what is actually occupying all that memory, maybe in both cases
> > - would be happy to debug more!
> >
> > Dave, would you have any idea why that patch could cause such a boost
> > in memory utilization?
> >
> > commit fc4d6d136d42fab207b3ce20a8ebfd61a13f931f
> > Author: Dave Ertman <david.m.ertman@intel.com>
> > Date: Mon Dec 11 13:19:28 2023 -0800
> >
> > ice: alter feature support check for SRIOV and LAG
> >
> > [ Upstream commit 4d50fcdc2476eef94c14c6761073af5667bb43b6 ]
> >
> > Previously, the ice driver had support for using a handler for bonding
> > netdev events to ensure that conflicting features were not allowed to be
> > activated at the same time. While this was still in place, additional
> > support was added to specifically support SRIOV and LAG together. These
> > both utilized the netdev event handler, but the SRIOV and LAG feature
> > was
> > behind a capabilities feature check to make sure the current NVM has
> > support.
> >
> > The exclusion part of the event handler should be removed since there are
> > users who have custom made solutions that depend on the non-exclusion
> > of
> > features.
> >
> > Wrap the creation/registration and cleanup of the event handler and
> > associated structs in the probe flow with a feature check so that the
> > only systems that support the full implementation of LAG features will
> > initialize support. This will leave other systems unhindered with
> > functionality as it existed before any LAG code was added.
>
> Igor,
>
> I have no idea why that two line commit would do anything to increase memory usage by the ice driver.
> If anything, I would expect it to lower memory usage as it has the potential to stop the allocation of memory
> for the pf->lag struct.
>
> DaveE
Hello,
I believe we can track these as two different issues. I reported the
ICE driver commit as an email with the subject "[REGRESSION] Intel ICE
Ethernet driver in linux >= 6.6.9 triggers extra memory consumption
and cause continous kswapd* usage and continuous swapping" to
Jesse Brandeburg <jesse.brandeburg@intel.com>
Tony Nguyen <anthony.l.nguyen@intel.com>
intel-wired-lan@lists.osuosl.org
Dave Ertman <david.m.ertman@intel.com>
Let's track the mglru issue here in this email thread. Yu, the kernel
build with your mglru-fix-6.6.9.patch seems to be OK, at least after
running it for 3 days without kswapd usage (excluding the ice driver commit).
Best!
--
Jaroslav Pulchart
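As a sanity check on the "18G vs 12G of free memory sum'd across all NUMA
nodes" figure quoted earlier in the thread, the two MemFree rows can simply be
summed (values in MiB, copied from the before/after snapshots; this is only a
back-of-the-envelope check, not part of the reported data):

```python
# MemFree (MiB) per NUMA node right after boot, from the two snapshots
# quoted upthread: before and after the bisected ice driver commit.
before = [2191, 2828, 92, 292, 3344, 2916, 3594, 3222]
after = [46, 59, 51, 33, 3078, 3535, 2708, 3511]

total_before_gib = sum(before) / 1024  # roughly the "18G" figure
total_after_gib = sum(after) / 1024    # roughly the "12G" figure
print(round(total_before_gib, 1), round(total_after_gib, 1))  # → 18.0 12.7
```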
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
2024-01-08 17:53 ` Jaroslav Pulchart
@ 2024-01-16 4:58 ` Yu Zhao
2024-01-16 17:34 ` Jaroslav Pulchart
0 siblings, 1 reply; 30+ messages in thread
From: Yu Zhao @ 2024-01-16 4:58 UTC (permalink / raw)
To: Jaroslav Pulchart
Cc: Ertman, David M, Igor Raits, Daniel Secik, Charan Teja Kalla,
Kalesh Singh, akpm, linux-mm
On Mon, Jan 8, 2024 at 10:54 AM Jaroslav Pulchart
<jaroslav.pulchart@gooddata.com> wrote:
>
> >
> > > -----Original Message-----
> > > From: Igor Raits <igor@gooddata.com>
> > > Sent: Thursday, January 4, 2024 3:51 PM
> > > To: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
> > > Cc: Yu Zhao <yuzhao@google.com>; Daniel Secik
> > > <daniel.secik@gooddata.com>; Charan Teja Kalla
> > > <quic_charante@quicinc.com>; Kalesh Singh <kaleshsingh@google.com>;
> > > akpm@linux-foundation.org; linux-mm@kvack.org; Ertman, David M
> > > <david.m.ertman@intel.com>
> > > Subject: Re: high kswapd CPU usage with symmetrical swap in/out pattern
> > > with multi-gen LRU
> > >
> > > Hello everyone,
> > >
> > > On Thu, Jan 4, 2024 at 3:34 PM Jaroslav Pulchart
> > > <jaroslav.pulchart@gooddata.com> wrote:
> > > >
> > > > >
> > > > > >
> > > > > > On Wed, Jan 3, 2024 at 2:30 PM Jaroslav Pulchart
> > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Hi yu,
> > > > > > > > >
> > > > > > > > > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > > > > > > > > Charan, does the fix previously attached seem acceptable to
> > > you? Any
> > > > > > > > > > additional feedback? Thanks.
> > > > > > > > >
> > > > > > > > > First, thanks for taking this patch to upstream.
> > > > > > > > >
> > > > > > > > > A comment in code snippet is checking just 'high wmark' pages
> > > might
> > > > > > > > > succeed here but can fail in the immediate kswapd sleep, see
> > > > > > > > > prepare_kswapd_sleep(). This can show up into the increased
> > > > > > > > > KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary
> > > kswapd run time.
> > > > > > > > > @Jaroslav: Have you observed something like above?
> > > > > > > >
> > > > > > > > I do not see any unnecessary kswapd run time, on the contrary it is
> > > > > > > > fixing the kswapd continuous run issue.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > So, in downstream, we have something like for
> > > zone_watermark_ok():
> > > > > > > > > unsigned long size = wmark_pages(zone, mark) +
> > > MIN_LRU_BATCH << 2;
> > > > > > > > >
> > > > > > > > > Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value,
> > > may be we
> > > > > > > > > should atleast use the 'MIN_LRU_BATCH' with the mentioned
> > > reasoning, is
> > > > > > > > > what all I can say for this patch.
> > > > > > > > >
> > > > > > > > > + mark = sysctl_numa_balancing_mode &
> > > NUMA_BALANCING_MEMORY_TIERING ?
> > > > > > > > > + WMARK_PROMO : WMARK_HIGH;
> > > > > > > > > + for (i = 0; i <= sc->reclaim_idx; i++) {
> > > > > > > > > + struct zone *zone = lruvec_pgdat(lruvec)->node_zones +
> > > i;
> > > > > > > > > + unsigned long size = wmark_pages(zone, mark);
> > > > > > > > > +
> > > > > > > > > + if (managed_zone(zone) &&
> > > > > > > > > + !zone_watermark_ok(zone, sc->order, size, sc-
> > > >reclaim_idx, 0))
> > > > > > > > > + return false;
> > > > > > > > > + }
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Charan
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Jaroslav Pulchart
> > > > > > > > Sr. Principal SW Engineer
> > > > > > > > GoodData
> > > > > > >
> > > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > today we try to update servers to 6.6.9 which contains the mglru fixes
> > > > > > > (from 6.6.8) and the server behaves much much worse.
> > > > > > >
> > > > > > > I got multiple kswapd* load to ~100% imediatelly.
> > > > > > > 555 root 20 0 0 0 0 R 99.7 0.0 4:32.86
> > > > > > > kswapd1
> > > > > > > 554 root 20 0 0 0 0 R 99.3 0.0 3:57.76
> > > > > > > kswapd0
> > > > > > > 556 root 20 0 0 0 0 R 97.7 0.0 3:42.27
> > > > > > > kswapd2
> > > > > > > are the changes in upstream different compared to the initial patch
> > > > > > > which I tested?
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Jaroslav Pulchart
> > > > > >
> > > > > > Hi Jaroslav,
> > > > > >
> > > > > > My apologies for all the trouble!
> > > > > >
> > > > > > Yes, there is a slight difference between the fix you verified and
> > > > > > what went into 6.6.9. The fix in 6.6.9 is disabled under a special
> > > > > > condition which I thought wouldn't affect you.
> > > > > >
> > > > > > Could you try the attached fix again on top of 6.6.9? It removed that
> > > > > > special condition.
> > > > > >
> > > > > > Thanks!
> > > > >
> > > > > Thanks for prompt response. I did a test with the patch and it didn't
> > > > > help. The situation is super strange.
> > > > >
> > > > > I tried kernels 6.6.7, 6.6.8 and 6.6.9. I see high memory utilization
> > > > > of all numa nodes of the first cpu socket if using 6.6.9 and it is the
> > > > > worst situation, but the kswapd load is visible from 6.6.8.
> > > > >
> > > > > Setup of this server:
> > > > > * 4 chiplets per each sockets, there are 2 sockets
> > > > > * 32 GB of RAM for each chiplet, 28GB are in hugepages
> > > > > Note: previously I have 29GB in Hugepages, I free up 1GB to avoid
> > > > > memory pressure however it is even worse now in contrary.
> > > > >
> > > > > kernel 6.6.7: I do not see kswapd usage when application started == OK
> > > > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > > > MemTotal: 32264 32701 32701 32686 32701 32659 32701 32696
> > > > > MemFree: 2766 2715 63 2366 3495 2990 3462 252
> > > > >
> > > > > kernel 6.6.8: I see kswapd on nodes 2 and 3 when application started
> > > > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > > > MemTotal: 32264 32701 32701 32686 32701 32701 32659 32696
> > > > > MemFree: 2744 2788 65 581 3304 3215 3266 2226
> > > > >
> > > > > kernel 6.6.9: I see kswapd on nodes 0, 1, 2 and 3 when application started
> > > > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > > > MemTotal: 32264 32701 32701 32686 32659 32701 32701 32696
> > > > > MemFree: 75 60 60 60 3169 2784 3203 2944
> > > >
> > > > I run few more combinations, and here are results / findings:
> > > >
> > > > 6.6.7-1 (vanila) == OK, no issue
> > > >
> > > > 6.6.8-1 (vanila) == single kswapd 100% !
> > > > 6.6.8-1 (vanila plus mglru-fix-6.6.9.patch) == OK, no issue
> > > > 6.6.8-1 (revert four mglru patches) == OK, no issue
> > > >
> > > > 6.6.9-1 (vanila) == four kswapd 100% !!!!
> > > > 6.6.9-2 (vanila plus mglru-fix-6.6.9.patch) == four kswapd 100% !!!!
> > > > 6.6.9-3 (revert four mglru patches) == four kswapd 100% !!!!
> > > >
> > > > Summary:
> > > > * mglru-fix-6.6.9.patch or reverting mglru patches helps in case of
> > > > kernel 6.6.8,
> > > > * there is (new?) problem in case of 6.6.9 kernel, which looks not to
> > > > be related to mglru patches at all
> > >
> > > I was able to bisect this change and it looks like there is something
> > > going wrong with the ice driver…
> > >
> > > Usually after booting our server we see something like this. Most of
> > > the nodes have ~2-3G of free memory. There are always 1-2 NUMA nodes
> > > that have a really low amount of free memory and we don't know why but
> > > it looks like that in the end causes the constant swap in/out issue.
> > > With the final bit of the patch you've sent earlier in this thread it
> > > is almost invisible.
> > >
> > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > MemTotal: 32264 32701 32659 32686 32701 32701 32701 32696
> > > MemFree: 2191 2828 92 292 3344 2916 3594 3222
> > >
> > >
> > > However, after the following patch we see that more NUMA nodes have
> > > such a low amount of memory and that is causing constant reclaiming
> > > of memory because it looks like something inside of the kernel ate all
> > > the memory. This is right after the start of the system as well.
> > >
> > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > MemTotal: 32264 32701 32659 32686 32701 32701 32701 32696
> > > MemFree: 46 59 51 33 3078 3535 2708 3511
> > >
> > > The difference is 18G vs 12G of free memory sum'd across all NUMA
> > > nodes right after boot of the system. If you have some hints on how to
> > > debug what is actually occupying all that memory, maybe in both cases
> > > - would be happy to debug more!
> > >
> > > Dave, would you have any idea why that patch could cause such a boost
> > > in memory utilization?
> > >
> > > commit fc4d6d136d42fab207b3ce20a8ebfd61a13f931f
> > > Author: Dave Ertman <david.m.ertman@intel.com>
> > > Date: Mon Dec 11 13:19:28 2023 -0800
> > >
> > > ice: alter feature support check for SRIOV and LAG
> > >
> > > [ Upstream commit 4d50fcdc2476eef94c14c6761073af5667bb43b6 ]
> > >
> > > Previously, the ice driver had support for using a handler for bonding
> > > netdev events to ensure that conflicting features were not allowed to be
> > > activated at the same time. While this was still in place, additional
> > > support was added to specifically support SRIOV and LAG together. These
> > > both utilized the netdev event handler, but the SRIOV and LAG feature
> > > was
> > > behind a capabilities feature check to make sure the current NVM has
> > > support.
> > >
> > > The exclusion part of the event handler should be removed since there are
> > > users who have custom made solutions that depend on the non-exclusion
> > > of
> > > features.
> > >
> > > Wrap the creation/registration and cleanup of the event handler and
> > > associated structs in the probe flow with a feature check so that the
> > > only systems that support the full implementation of LAG features will
> > > initialize support. This will leave other systems unhindered with
> > > functionality as it existed before any LAG code was added.
> >
> > Igor,
> >
> > I have no idea why that two line commit would do anything to increase memory usage by the ice driver.
> > If anything, I would expect it to lower memory usage as it has the potential to stop the allocation of memory
> > for the pf->lag struct.
> >
> > DaveE
>
> Hello,
>
> I believe we can track it as two different issues. So I reported the
> ICE driver commit as a email with subject "[REGRESSION] Intel ICE
> Ethernet driver in linux >= 6.6.9 triggers extra memory consumption
> and cause continous kswapd* usage and continuous swapping" to
> Jesse Brandeburg <jesse.brandeburg@intel.com>
> Tony Nguyen <anthony.l.nguyen@intel.com>
> intel-wired-lan@lists.osuosl.org
> Dave Ertman <david.m.ertman@intel.com>
>
> Lets track the mglru here in this email thread. Yu, the kernel build
> with your mglru-fix-6.6.9.patch seem to be OK at least running it for
> 3days without kswapd usage (excluding the ice driver commit).
Hi Jaroslav,
Do we now have a clear conclusion that mglru-fix-6.6.9.patch made a
difference? IOW, were you able to reproduce the problem consistently
without it?
Thanks!
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU
2024-01-16 4:58 ` Yu Zhao
@ 2024-01-16 17:34 ` Jaroslav Pulchart
0 siblings, 0 replies; 30+ messages in thread
From: Jaroslav Pulchart @ 2024-01-16 17:34 UTC (permalink / raw)
To: Yu Zhao
Cc: Ertman, David M, Igor Raits, Daniel Secik, Charan Teja Kalla,
Kalesh Singh, akpm, linux-mm
>
> On Mon, Jan 8, 2024 at 10:54 AM Jaroslav Pulchart
> <jaroslav.pulchart@gooddata.com> wrote:
> >
> > >
> > > > -----Original Message-----
> > > > From: Igor Raits <igor@gooddata.com>
> > > > Sent: Thursday, January 4, 2024 3:51 PM
> > > > To: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
> > > > Cc: Yu Zhao <yuzhao@google.com>; Daniel Secik
> > > > <daniel.secik@gooddata.com>; Charan Teja Kalla
> > > > <quic_charante@quicinc.com>; Kalesh Singh <kaleshsingh@google.com>;
> > > > akpm@linux-foundation.org; linux-mm@kvack.org; Ertman, David M
> > > > <david.m.ertman@intel.com>
> > > > Subject: Re: high kswapd CPU usage with symmetrical swap in/out pattern
> > > > with multi-gen LRU
> > > >
> > > > Hello everyone,
> > > >
> > > > On Thu, Jan 4, 2024 at 3:34 PM Jaroslav Pulchart
> > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > >
> > > > > >
> > > > > > >
> > > > > > > On Wed, Jan 3, 2024 at 2:30 PM Jaroslav Pulchart
> > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Hi yu,
> > > > > > > > > >
> > > > > > > > > > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > > > > > > > > > Charan, does the fix previously attached seem acceptable to
> > > > you? Any
> > > > > > > > > > > additional feedback? Thanks.
> > > > > > > > > >
> > > > > > > > > > First, thanks for taking this patch to upstream.
> > > > > > > > > >
> > > > > > > > > > A comment in code snippet is checking just 'high wmark' pages
> > > > might
> > > > > > > > > > succeed here but can fail in the immediate kswapd sleep, see
> > > > > > > > > > prepare_kswapd_sleep(). This can show up into the increased
> > > > > > > > > > KSWAPD_HIGH_WMARK_HIT_QUICKLY, thus unnecessary
> > > > kswapd run time.
> > > > > > > > > > @Jaroslav: Have you observed something like above?
> > > > > > > > >
> > > > > > > > > I do not see any unnecessary kswapd run time, on the contrary it is
> > > > > > > > > fixing the kswapd continuous run issue.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > So, in downstream, we have something like for
> > > > zone_watermark_ok():
> > > > > > > > > > unsigned long size = wmark_pages(zone, mark) +
> > > > MIN_LRU_BATCH << 2;
> > > > > > > > > >
> > > > > > > > > > Hard to convince of this 'MIN_LRU_BATCH << 2' empirical value,
> > > > may be we
> > > > > > > > > > should atleast use the 'MIN_LRU_BATCH' with the mentioned
> > > > reasoning, is
> > > > > > > > > > what all I can say for this patch.
> > > > > > > > > >
> > > > > > > > > > + mark = sysctl_numa_balancing_mode &
> > > > NUMA_BALANCING_MEMORY_TIERING ?
> > > > > > > > > > + WMARK_PROMO : WMARK_HIGH;
> > > > > > > > > > + for (i = 0; i <= sc->reclaim_idx; i++) {
> > > > > > > > > > + struct zone *zone = lruvec_pgdat(lruvec)->node_zones +
> > > > i;
> > > > > > > > > > + unsigned long size = wmark_pages(zone, mark);
> > > > > > > > > > +
> > > > > > > > > > + if (managed_zone(zone) &&
> > > > > > > > > > + !zone_watermark_ok(zone, sc->order, size, sc-
> > > > >reclaim_idx, 0))
> > > > > > > > > > + return false;
> > > > > > > > > > + }
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Charan
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Jaroslav Pulchart
> > > > > > > > > Sr. Principal SW Engineer
> > > > > > > > > GoodData
> > > > > > > >
> > > > > > > >
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > today we try to update servers to 6.6.9 which contains the mglru fixes
> > > > > > > > (from 6.6.8) and the server behaves much much worse.
> > > > > > > >
> > > > > > > > I got multiple kswapd* load to ~100% imediatelly.
> > > > > > > > 555 root 20 0 0 0 0 R 99.7 0.0 4:32.86
> > > > > > > > kswapd1
> > > > > > > > 554 root 20 0 0 0 0 R 99.3 0.0 3:57.76
> > > > > > > > kswapd0
> > > > > > > > 556 root 20 0 0 0 0 R 97.7 0.0 3:42.27
> > > > > > > > kswapd2
> > > > > > > > are the changes in upstream different compared to the initial patch
> > > > > > > > which I tested?
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Jaroslav Pulchart
> > > > > > >
> > > > > > > Hi Jaroslav,
> > > > > > >
> > > > > > > My apologies for all the trouble!
> > > > > > >
> > > > > > > Yes, there is a slight difference between the fix you verified and
> > > > > > > what went into 6.6.9. The fix in 6.6.9 is disabled under a special
> > > > > > > condition which I thought wouldn't affect you.
> > > > > > >
> > > > > > > Could you try the attached fix again on top of 6.6.9? It removed that
> > > > > > > special condition.
> > > > > > >
> > > > > > > Thanks!
> > > > > >
> > > > > > Thanks for prompt response. I did a test with the patch and it didn't
> > > > > > help. The situation is super strange.
> > > > > >
> > > > > > I tried kernels 6.6.7, 6.6.8 and 6.6.9. I see high memory utilization
> > > > > > of all numa nodes of the first cpu socket if using 6.6.9 and it is the
> > > > > > worst situation, but the kswapd load is visible from 6.6.8.
> > > > > >
> > > > > > Setup of this server:
> > > > > > * 4 chiplets per each sockets, there are 2 sockets
> > > > > > * 32 GB of RAM for each chiplet, 28GB are in hugepages
> > > > > > Note: previously I have 29GB in Hugepages, I free up 1GB to avoid
> > > > > > memory pressure however it is even worse now in contrary.
> > > > > >
> > > > > > kernel 6.6.7: I do not see kswapd usage when application started == OK
> > > > > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > > > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > > > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > > > > MemTotal: 32264 32701 32701 32686 32701 32659 32701 32696
> > > > > > MemFree: 2766 2715 63 2366 3495 2990 3462 252
> > > > > >
> > > > > > kernel 6.6.8: I see kswapd on nodes 2 and 3 when application started
> > > > > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > > > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > > > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > > > > MemTotal: 32264 32701 32701 32686 32701 32701 32659 32696
> > > > > > MemFree: 2744 2788 65 581 3304 3215 3266 2226
> > > > > >
> > > > > > kernel 6.6.9: I see kswapd on nodes 0, 1, 2 and 3 when application started
> > > > > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > > > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > > > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > > > > MemTotal: 32264 32701 32701 32686 32659 32701 32701 32696
> > > > > > MemFree: 75 60 60 60 3169 2784 3203 2944
> > > > >
> > > > > I run few more combinations, and here are results / findings:
> > > > >
> > > > > 6.6.7-1 (vanila) == OK, no issue
> > > > >
> > > > > 6.6.8-1 (vanila) == single kswapd 100% !
> > > > > 6.6.8-1 (vanila plus mglru-fix-6.6.9.patch) == OK, no issue
> > > > > 6.6.8-1 (revert four mglru patches) == OK, no issue
> > > > >
> > > > > 6.6.9-1 (vanila) == four kswapd 100% !!!!
> > > > > 6.6.9-2 (vanila plus mglru-fix-6.6.9.patch) == four kswapd 100% !!!!
> > > > > 6.6.9-3 (revert four mglru patches) == four kswapd 100% !!!!
> > > > >
> > > > > Summary:
> > > > > * mglru-fix-6.6.9.patch or reverting mglru patches helps in case of
> > > > > kernel 6.6.8,
> > > > > * there is (new?) problem in case of 6.6.9 kernel, which looks not to
> > > > > be related to mglru patches at all
> > > >
> > > > I was able to bisect this change and it looks like there is something
> > > > going wrong with the ice driver…
> > > >
> > > > Usually after booting our server we see something like this. Most of
> > > > the nodes have ~2-3G of free memory. There are always 1-2 NUMA nodes
> > > > that have a really low amount of free memory and we don't know why but
> > > > it looks like that in the end causes the constant swap in/out issue.
> > > > With the final bit of the patch you've sent earlier in this thread it
> > > > is almost invisible.
> > > >
> > > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > > MemTotal: 32264 32701 32659 32686 32701 32701 32701 32696
> > > > MemFree: 2191 2828 92 292 3344 2916 3594 3222
> > > >
> > > >
> > > > However, after the following patch we see that more NUMA nodes have
> > > > such a low amount of memory and that is causing constant reclaiming
> > > > of memory because it looks like something inside of the kernel ate all
> > > > the memory. This is right after the start of the system as well.
> > > >
> > > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > > MemTotal: 32264 32701 32659 32686 32701 32701 32701 32696
> > > > MemFree: 46 59 51 33 3078 3535 2708 3511
> > > >
> > > > The difference is 18G vs 12G of free memory sum'd across all NUMA
> > > > nodes right after boot of the system. If you have some hints on how to
> > > > debug what is actually occupying all that memory, maybe in both cases
> > > > - would be happy to debug more!
> > > >
> > > > Dave, would you have any idea why that patch could cause such a boost
> > > > in memory utilization?
> > > >
> > > > commit fc4d6d136d42fab207b3ce20a8ebfd61a13f931f
> > > > Author: Dave Ertman <david.m.ertman@intel.com>
> > > > Date: Mon Dec 11 13:19:28 2023 -0800
> > > >
> > > > ice: alter feature support check for SRIOV and LAG
> > > >
> > > > [ Upstream commit 4d50fcdc2476eef94c14c6761073af5667bb43b6 ]
> > > >
> > > > Previously, the ice driver had support for using a handler for bonding
> > > > netdev events to ensure that conflicting features were not allowed to be
> > > > activated at the same time. While this was still in place, additional
> > > > support was added to specifically support SRIOV and LAG together. These
> > > > both utilized the netdev event handler, but the SRIOV and LAG feature
> > > > was
> > > > behind a capabilities feature check to make sure the current NVM has
> > > > support.
> > > >
> > > > The exclusion part of the event handler should be removed since there are
> > > > users who have custom made solutions that depend on the non-exclusion
> > > > of
> > > > features.
> > > >
> > > > Wrap the creation/registration and cleanup of the event handler and
> > > > associated structs in the probe flow with a feature check so that the
> > > > only systems that support the full implementation of LAG features will
> > > > initialize support. This will leave other systems unhindered with
> > > > functionality as it existed before any LAG code was added.
> > >
> > > Igor,
> > >
> > > I have no idea why that two line commit would do anything to increase memory usage by the ice driver.
> > > If anything, I would expect it to lower memory usage as it has the potential to stop the allocation of memory
> > > for the pf->lag struct.
> > >
> > > DaveE
> >
> > Hello,
> >
> > I believe we can track it as two different issues. So I reported the
> > ICE driver commit as a email with subject "[REGRESSION] Intel ICE
> > Ethernet driver in linux >= 6.6.9 triggers extra memory consumption
> > and cause continous kswapd* usage and continuous swapping" to
> > Jesse Brandeburg <jesse.brandeburg@intel.com>
> > Tony Nguyen <anthony.l.nguyen@intel.com>
> > intel-wired-lan@lists.osuosl.org
> > Dave Ertman <david.m.ertman@intel.com>
> >
> > Lets track the mglru here in this email thread. Yu, the kernel build
> > with your mglru-fix-6.6.9.patch seem to be OK at least running it for
> > 3days without kswapd usage (excluding the ice driver commit).
>
> Hi Jaroslav,
>
> Do we now have a clear conclusion that mglru-fix-6.6.9.patch made a
> difference? IOW, were you able to reproduce the problem consistently
> without it?
>
> Thanks!
Hi Yu,
Yes: the mglru-fix-6.6.9.patch is needed for all kernels >= 6.6.8 and
< 6.7. I tested the new 6.7 (without the mglru fix) and that kernel is
fine; I cannot trigger the problem there.
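The affected range reported above can be written down as a small predicate
(a hypothetical helper for tracking purposes only; the function name and the
three-component comparison are assumptions, not anything from the kernel tree):

```python
def needs_mglru_fix(version: str) -> bool:
    """True for kernels that need mglru-fix-6.6.9.patch:
    6.6.8 <= version < 6.7 (6.6.7 and 6.7 are reported fine)."""
    v = tuple(int(x) for x in version.split("."))
    return (6, 6, 8) <= v < (6, 7)

for k in ("6.6.7", "6.6.8", "6.6.9", "6.7"):
    print(k, needs_mglru_fix(k))
```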
On Tue, Jan 16, 2024 at 5:59 AM Yu Zhao <yuzhao@google.com> wrote:
>
> On Mon, Jan 8, 2024 at 10:54 AM Jaroslav Pulchart
> <jaroslav.pulchart@gooddata.com> wrote:
> >
> > >
> > > > -----Original Message-----
> > > > From: Igor Raits <igor@gooddata.com>
> > > > Sent: Thursday, January 4, 2024 3:51 PM
> > > > To: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
> > > > Cc: Yu Zhao <yuzhao@google.com>; Daniel Secik
> > > > <daniel.secik@gooddata.com>; Charan Teja Kalla
> > > > <quic_charante@quicinc.com>; Kalesh Singh <kaleshsingh@google.com>;
> > > > akpm@linux-foundation.org; linux-mm@kvack.org; Ertman, David M
> > > > <david.m.ertman@intel.com>
> > > > Subject: Re: high kswapd CPU usage with symmetrical swap in/out pattern
> > > > with multi-gen LRU
> > > >
> > > > Hello everyone,
> > > >
> > > > On Thu, Jan 4, 2024 at 3:34 PM Jaroslav Pulchart
> > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > >
> > > > > >
> > > > > > >
> > > > > > > On Wed, Jan 3, 2024 at 2:30 PM Jaroslav Pulchart
> > > > > > > <jaroslav.pulchart@gooddata.com> wrote:
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Hi Yu,
> > > > > > > > > >
> > > > > > > > > > On 12/2/2023 5:22 AM, Yu Zhao wrote:
> > > > > > > > > > > Charan, does the fix previously attached seem acceptable to
> > > > you? Any
> > > > > > > > > > > additional feedback? Thanks.
> > > > > > > > > >
> > > > > > > > > > First, thanks for taking this patch to upstream.
> > > > > > > > > >
> > > > > > > > > > A comment on the code snippet: checking just the 'high wmark'
> > > > > > > > > > pages might succeed here but can fail in the immediate kswapd
> > > > > > > > > > sleep, see prepare_kswapd_sleep(). This can show up as an
> > > > > > > > > > increased KSWAPD_HIGH_WMARK_HIT_QUICKLY count, and thus
> > > > > > > > > > unnecessary kswapd run time.
> > > > > > > > > > @Jaroslav: Have you observed something like the above?
> > > > > > > > >
> > > > > > > > > I do not see any unnecessary kswapd run time; on the contrary,
> > > > > > > > > it is fixing the kswapd continuous-run issue.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > So, in downstream, we have something like this for zone_watermark_ok():
> > > > > > > > > >     unsigned long size = wmark_pages(zone, mark) + MIN_LRU_BATCH << 2;
> > > > > > > > > >
> > > > > > > > > > Hard to be convinced by this 'MIN_LRU_BATCH << 2' empirical value;
> > > > > > > > > > maybe we should at least use 'MIN_LRU_BATCH' with the mentioned
> > > > > > > > > > reasoning, is all I can say for this patch.
> > > > > > > > > >
> > > > > > > > > > +	mark = sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ?
> > > > > > > > > > +	       WMARK_PROMO : WMARK_HIGH;
> > > > > > > > > > +	for (i = 0; i <= sc->reclaim_idx; i++) {
> > > > > > > > > > +		struct zone *zone = lruvec_pgdat(lruvec)->node_zones + i;
> > > > > > > > > > +		unsigned long size = wmark_pages(zone, mark);
> > > > > > > > > > +
> > > > > > > > > > +		if (managed_zone(zone) &&
> > > > > > > > > > +		    !zone_watermark_ok(zone, sc->order, size, sc->reclaim_idx, 0))
> > > > > > > > > > +			return false;
> > > > > > > > > > +	}
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Charan
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Jaroslav Pulchart
> > > > > > > > > Sr. Principal SW Engineer
> > > > > > > > > GoodData
> > > > > > > >
> > > > > > > >
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > today we tried to update the servers to 6.6.9, which contains the
> > > > > > > > mglru fixes (from 6.6.8), and the server behaves much, much worse.
> > > > > > > >
> > > > > > > > Multiple kswapd* threads went to ~100% load immediately:
> > > > > > > > 555 root 20 0 0 0 0 R 99.7 0.0 4:32.86
> > > > > > > > kswapd1
> > > > > > > > 554 root 20 0 0 0 0 R 99.3 0.0 3:57.76
> > > > > > > > kswapd0
> > > > > > > > 556 root 20 0 0 0 0 R 97.7 0.0 3:42.27
> > > > > > > > kswapd2
> > > > > > > > Are the changes in upstream different compared to the initial
> > > > > > > > patch which I tested?
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Jaroslav Pulchart
> > > > > > >
> > > > > > > Hi Jaroslav,
> > > > > > >
> > > > > > > My apologies for all the trouble!
> > > > > > >
> > > > > > > Yes, there is a slight difference between the fix you verified and
> > > > > > > what went into 6.6.9. The fix in 6.6.9 is disabled under a special
> > > > > > > condition which I thought wouldn't affect you.
> > > > > > >
> > > > > > > Could you try the attached fix again on top of 6.6.9? It removed that
> > > > > > > special condition.
> > > > > > >
> > > > > > > Thanks!
> > > > > >
> > > > > > Thanks for the prompt response. I did a test with the patch and it
> > > > > > didn't help. The situation is super strange.
> > > > > >
> > > > > > I tried kernels 6.6.7, 6.6.8 and 6.6.9. I see high memory
> > > > > > utilization on all NUMA nodes of the first CPU socket when using
> > > > > > 6.6.9, which is the worst situation, but the kswapd load is
> > > > > > already visible from 6.6.8.
> > > > > >
> > > > > > Setup of this server:
> > > > > > * 2 sockets, 4 chiplets per socket
> > > > > > * 32 GB of RAM per chiplet, of which 28 GB are in hugepages
> > > > > > Note: previously I had 29 GB in hugepages; I freed up 1 GB to
> > > > > > avoid memory pressure, but on the contrary it is even worse now.
> > > > > >
> > > > > > kernel 6.6.7: I do not see kswapd usage when application started == OK
> > > > > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > > > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > > > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > > > > MemTotal: 32264 32701 32701 32686 32701 32659 32701 32696
> > > > > > MemFree: 2766 2715 63 2366 3495 2990 3462 252
> > > > > >
> > > > > > kernel 6.6.8: I see kswapd on nodes 2 and 3 when application started
> > > > > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > > > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > > > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > > > > MemTotal: 32264 32701 32701 32686 32701 32701 32659 32696
> > > > > > MemFree: 2744 2788 65 581 3304 3215 3266 2226
> > > > > >
> > > > > > kernel 6.6.9: I see kswapd on nodes 0, 1, 2 and 3 when application started
> > > > > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > > > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > > > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > > > > MemTotal: 32264 32701 32701 32686 32659 32701 32701 32696
> > > > > > MemFree: 75 60 60 60 3169 2784 3203 2944
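Tables like the ones above can be regenerated with a short script. This is a sketch assuming the standard Linux sysfs per-node meminfo layout; the demo input at the end uses two fake nodes so it runs anywhere:

```shell
# Summarize per-NUMA-node MemTotal/MemFree in MiB, in the same shape as the
# tables quoted above. Input lines look like: "Node 0 MemTotal: 32264 kB".
summarize_nodes() {
  awk '$3 == "MemTotal:" { tot[$2] = int($4 / 1024); if ($2 + 0 > max) max = $2 + 0 }
       $3 == "MemFree:"  { fre[$2] = int($4 / 1024) }
       END {
         printf "NUMA nodes:"; for (n = 0; n <= max; n++) printf " %6d", n;      print ""
         printf "MemTotal:  "; for (n = 0; n <= max; n++) printf " %6d", tot[n]; print ""
         printf "MemFree:   "; for (n = 0; n <= max; n++) printf " %6d", fre[n]; print ""
       }'
}

# On a live system (standard sysfs layout):
#   cat /sys/devices/system/node/node*/meminfo | summarize_nodes
# Self-contained demo with two fake nodes:
printf 'Node 0 MemTotal: 2048 kB\nNode 0 MemFree: 1024 kB\nNode 1 MemTotal: 4096 kB\nNode 1 MemFree: 2048 kB\n' | summarize_nodes
```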
> > > > >
> > > > > I ran a few more combinations; here are the results/findings:
> > > > >
> > > > > 6.6.7-1 (vanilla) == OK, no issue
> > > > >
> > > > > 6.6.8-1 (vanilla) == single kswapd 100% !
> > > > > 6.6.8-1 (vanilla plus mglru-fix-6.6.9.patch) == OK, no issue
> > > > > 6.6.8-1 (revert four mglru patches) == OK, no issue
> > > > >
> > > > > 6.6.9-1 (vanilla) == four kswapd 100% !!!!
> > > > > 6.6.9-2 (vanilla plus mglru-fix-6.6.9.patch) == four kswapd 100% !!!!
> > > > > 6.6.9-3 (revert four mglru patches) == four kswapd 100% !!!!
> > > > >
> > > > > Summary:
> > > > > * mglru-fix-6.6.9.patch or reverting the mglru patches helps in the
> > > > > case of kernel 6.6.8,
> > > > > * there is a (new?) problem in the case of the 6.6.9 kernel, which
> > > > > does not look to be related to the mglru patches at all
> > > >
> > > > I was able to bisect this change, and it looks like something is
> > > > going wrong with the ice driver…
> > > >
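The bisection workflow Igor describes can be scripted end to end with `git bisect run`. This throwaway demo builds a tiny repository where the regression lands in a known commit and lets git find it automatically; the kernel case is the same flow with the good/bad endpoints being v6.6.8 and v6.6.9 and the test command being a build plus a reproducer:

```shell
# Demo of git bisect run: 8 commits, regression introduced in commit 5,
# bisect pinpoints it without manual good/bad marking.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo
cd repo
git config user.email tester@example.com
git config user.name tester
for i in 1 2 3 4 5 6 7 8; do
  echo "$i" > n                                            # make every commit distinct
  if [ "$i" -ge 5 ]; then echo bug > flag; else echo ok > flag; fi
  git add -A
  git commit -qm "commit $i"
done
# Known-bad tip and known-good first commit (for the kernel: v6.6.9 / v6.6.8).
git bisect start HEAD HEAD~7 > /dev/null
# The "reproducer": exit 0 when the tree is good, non-zero when it is bad.
git bisect run sh -c '! grep -q bug flag' > /dev/null 2>&1
result=$(git bisect log | tail -n 1)
echo "$result"
cd /
```

The last line of `git bisect log` names the first bad commit, which here is the one labeled "commit 5".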
> > > > Usually after booting our server we see something like this: most of
> > > > the nodes have ~2-3G of free memory. There are always 1-2 NUMA nodes
> > > > that have a really low amount of free memory; we don't know why, but
> > > > it looks like that in the end causes the constant swap in/out issue.
> > > > With the final bit of the patch you sent earlier in this thread it is
> > > > almost invisible.
> > > >
> > > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > > MemTotal: 32264 32701 32659 32686 32701 32701 32701 32696
> > > > MemFree: 2191 2828 92 292 3344 2916 3594 3222
> > > >
> > > >
> > > > However, after the following patch we see that more NUMA nodes have
> > > > such a low amount of free memory, which causes constant reclaiming of
> > > > memory; it looks like something inside the kernel ate all of it. This
> > > > is right after the start of the system as well.
> > > >
> > > > NUMA nodes: 0 1 2 3 4 5 6 7
> > > > HPTotalGiB: 28 28 28 28 28 28 28 28
> > > > HPFreeGiB: 28 28 28 28 28 28 28 28
> > > > MemTotal: 32264 32701 32659 32686 32701 32701 32701 32696
> > > > MemFree: 46 59 51 33 3078 3535 2708 3511
> > > >
> > > > The difference is 18G vs 12G of free memory summed across all NUMA
> > > > nodes right after boot of the system. If you have some hints on how
> > > > to debug what is actually occupying all that memory, maybe in both
> > > > cases, we would be happy to debug more!
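The "18G vs 12G" figure is just the MemFree rows above summed; that arithmetic, plus the usual places to look for kernel-side consumers, can be sketched as:

```shell
# Sum a per-node MemFree row (values in MiB) to compare total free memory
# before and after the suspect patch; rows are the ones quoted above.
sum_row() { awk '{ s = 0; for (i = 2; i <= NF; i++) s += $i; print s " MiB" }'; }

echo "MemFree: 2191 2828 92 292 3344 2916 3594 3222" | sum_row   # before the patch
echo "MemFree: 46 59 51 33 3078 3535 2708 3511"      | sum_row   # after the patch

# To see where the difference went, kernel-side candidates worth watching
# (standard procfs; slabtop is from procps) include:
#   grep -E 'Slab|SReclaimable|SUnreclaim|Percpu' /proc/meminfo
#   slabtop -o -s c | head
```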
> > > >
> > > > Dave, would you have any idea why that patch could cause such a boost
> > > > in memory utilization?
> > > >
> > > > commit fc4d6d136d42fab207b3ce20a8ebfd61a13f931f
> > > > Author: Dave Ertman <david.m.ertman@intel.com>
> > > > Date: Mon Dec 11 13:19:28 2023 -0800
> > > >
> > > > ice: alter feature support check for SRIOV and LAG
> > > >
> > > > [ Upstream commit 4d50fcdc2476eef94c14c6761073af5667bb43b6 ]
> > > >
> > > > Previously, the ice driver had support for using a handler for bonding
> > > > netdev events to ensure that conflicting features were not allowed to be
> > > > activated at the same time. While this was still in place, additional
> > > > support was added to specifically support SRIOV and LAG together. These
> > > > both utilized the netdev event handler, but the SRIOV and LAG feature
> > > > was
> > > > behind a capabilities feature check to make sure the current NVM has
> > > > support.
> > > >
> > > > The exclusion part of the event handler should be removed since there are
> > > > users who have custom made solutions that depend on the non-exclusion
> > > > of
> > > > features.
> > > >
> > > > Wrap the creation/registration and cleanup of the event handler and
> > > > associated structs in the probe flow with a feature check so that the
> > > > only systems that support the full implementation of LAG features will
> > > > initialize support. This will leave other systems unhindered with
> > > > functionality as it existed before any LAG code was added.
> > >
> > > Igor,
> > >
> > > I have no idea why that two-line commit would do anything to increase
> > > memory usage by the ice driver. If anything, I would expect it to lower
> > > memory usage, as it has the potential to stop the allocation of memory
> > > for the pf->lag struct.
> > >
> > > DaveE
> >
> > Hello,
> >
> > I believe we can track these as two different issues. I reported the
> > ICE driver commit in an email with the subject "[REGRESSION] Intel ICE
> > Ethernet driver in linux >= 6.6.9 triggers extra memory consumption
> > and cause continous kswapd* usage and continuous swapping" to
> > Jesse Brandeburg <jesse.brandeburg@intel.com>
> > Tony Nguyen <anthony.l.nguyen@intel.com>
> > intel-wired-lan@lists.osuosl.org
> > Dave Ertman <david.m.ertman@intel.com>
> >
> > Let's track the mglru issue here in this email thread. Yu, the kernel
> > build with your mglru-fix-6.6.9.patch seems to be OK, at least running
> > for 3 days without kswapd usage (excluding the ice driver commit).
>
> Hi Jaroslav,
>
> Do we now have a clear conclusion that mglru-fix-6.6.9.patch made a
> difference? IOW, were you able to reproduce the problem consistently
> without it?
>
> Thanks!
--
Jaroslav Pulchart
Sr. Principal SW Engineer
GoodData
end of thread, other threads:[~2024-01-16 17:35 UTC | newest]
Thread overview: 30+ messages
2023-11-08 14:35 high kswapd CPU usage with symmetrical swap in/out pattern with multi-gen LRU Jaroslav Pulchart
2023-11-08 18:47 ` Yu Zhao
2023-11-08 20:04 ` Jaroslav Pulchart
2023-11-08 22:09 ` Yu Zhao
2023-11-09 6:39 ` Jaroslav Pulchart
2023-11-09 6:48 ` Yu Zhao
2023-11-09 10:58 ` Jaroslav Pulchart
2023-11-10 1:31 ` Yu Zhao
[not found] ` <CAK8fFZ5xUe=JMOxUWgQ-0aqWMXuZYF2EtPOoZQqr89sjrL+zTw@mail.gmail.com>
2023-11-13 20:09 ` Yu Zhao
2023-11-14 7:29 ` Jaroslav Pulchart
2023-11-14 7:47 ` Yu Zhao
2023-11-20 8:41 ` Jaroslav Pulchart
2023-11-22 6:13 ` Yu Zhao
2023-11-22 7:12 ` Jaroslav Pulchart
2023-11-22 7:30 ` Jaroslav Pulchart
2023-11-22 14:18 ` Yu Zhao
2023-11-29 13:54 ` Jaroslav Pulchart
2023-12-01 23:52 ` Yu Zhao
2023-12-07 8:46 ` Charan Teja Kalla
2023-12-07 18:23 ` Yu Zhao
2023-12-08 8:03 ` Jaroslav Pulchart
2024-01-03 21:30 ` Jaroslav Pulchart
2024-01-04 3:03 ` Yu Zhao
2024-01-04 9:46 ` Jaroslav Pulchart
2024-01-04 14:34 ` Jaroslav Pulchart
2024-01-04 23:51 ` Igor Raits
2024-01-05 17:35 ` Ertman, David M
2024-01-08 17:53 ` Jaroslav Pulchart
2024-01-16 4:58 ` Yu Zhao
2024-01-16 17:34 ` Jaroslav Pulchart