* Kswapd in 3.2.0-rc5 is a CPU hog
@ 2011-12-21 3:10 Nikolay S.
2011-12-21 9:52 ` Michal Hocko
0 siblings, 1 reply; 32+ messages in thread
From: Nikolay S. @ 2011-12-21 3:10 UTC (permalink / raw)
To: linux-kernel
Hello,
I'm using 3.2-rc5 on a machine which at the moment does almost nothing
except file system operations and network I/O (i.e., it is a file
server), and there is a problem with kswapd.
I'm playing with dd:
dd if=/some/big/file of=/dev/null bs=8M
I.e., I'm filling the page cache.
When the machine has just been rebooted, kswapd is almost idle during
this operation, at just 5-8 percent according to top.
After ~5 days of uptime (5 days, 2:10), the same operation drives
kswapd to ~70% CPU:
PID USER S %CPU %MEM TIME+ SWAP COMMAND
420 root R 70 0.0 22:09.60 0 kswapd0
17717 nowhere D 27 0.2 0:01.81 10m dd
In fact, kswapd's CPU usage for this operation steadily increases over
time.
Read performance also degrades over time. After a reboot:
dd if=/some/big/file of=/dev/null bs=8M
1019+1 records in
1019+1 records out
8553494018 bytes (8.6 GB) copied, 16.211 s, 528 MB/s
After ~5 days uptime:
dd if=/some/big/file of=/dev/null bs=8M
1019+1 records in
1019+1 records out
8553494018 bytes (8.6 GB) copied, 29.0507 s, 294 MB/s
Whereas raw disk sequential read performance stays the same:
dd if=/some/big/file of=/dev/null bs=8M iflag=direct
1019+1 records in
1019+1 records out
8553494018 bytes (8.6 GB) copied, 14.7286 s, 581 MB/s
Also, after dropping caches the situation improves somewhat, but not to
the state of a freshly restarted system:
PID USER S %CPU %MEM TIME+ SWAP COMMAND
420 root S 39 0.0 23:31.17 0 kswapd0
19829 nowhere D 24 0.2 0:02.72 7764 dd
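"Dropping caches" here presumably means the usual /proc/sys/vm/drop_caches interface; a minimal sketch (root required; the helper name is purely illustrative):

```shell
#!/bin/sh
# Drop reclaimable kernel caches -- a guess at what "dropping caches"
# above refers to. Writing to drop_caches needs root; sync first so
# dirty pages do not pin cache entries.
drop_caches() {
    sync
    # 1 = page cache, 2 = dentries + inodes (slab), 3 = both
    echo 3 > /proc/sys/vm/drop_caches
}
```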
perf shows:
31.24% kswapd0 [kernel.kallsyms] [k] _raw_spin_lock
26.19% kswapd0 [kernel.kallsyms] [k] shrink_slab
16.28% kswapd0 [kernel.kallsyms] [k] prune_super
6.55% kswapd0 [kernel.kallsyms] [k] grab_super_passive
5.35% kswapd0 [kernel.kallsyms] [k] down_read_trylock
4.03% kswapd0 [kernel.kallsyms] [k] up_read
2.31% kswapd0 [kernel.kallsyms] [k] put_super
1.81% kswapd0 [kernel.kallsyms] [k] drop_super
0.99% kswapd0 [kernel.kallsyms] [k] __put_super
0.25% kswapd0 [kernel.kallsyms] [k] __isolate_lru_page
0.23% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk
0.19% kswapd0 [r8169] [k] rtl8169_interrupt
0.15% kswapd0 [kernel.kallsyms] [k] twa_interrupt
P.S.: The message above was written a couple of days ago. Now I'm at 10
days of uptime, and this is the result as of today:
PID USER S %CPU %MEM TIME+ SWAP COMMAND
420 root R 93 0.0 110:48.48 0 kswapd0
30085 nowhere D 42 0.2 0:04.36 10m dd
PPS: Please CC me.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Kswapd in 3.2.0-rc5 is a CPU hog
2011-12-21 3:10 Kswapd in 3.2.0-rc5 is a CPU hog Nikolay S.
@ 2011-12-21 9:52 ` Michal Hocko
2011-12-21 10:15 ` nowhere
2011-12-21 22:55 ` Dave Chinner
0 siblings, 2 replies; 32+ messages in thread
From: Michal Hocko @ 2011-12-21 9:52 UTC (permalink / raw)
To: Nikolay S.; +Cc: linux-kernel, linux-mm
[Let's CC linux-mm]
On Wed 21-12-11 07:10:36, Nikolay S. wrote:
> Hello,
>
> I'm using 3.2-rc5 on a machine, which atm does almost nothing except
> file system operations and network i/o (i.e. file server). And there is
> a problem with kswapd.
What kind of filesystem do you use?
>
> I'm playing with dd:
> dd if=/some/big/file of=/dev/null bs=8M
>
> I.e. I'm filling page cache.
>
> So when the machine is just rebooted, kswapd during this operation is
> almost idle, just 5-8 percent according to top.
>
> After ~5 days of uptime (5 days, 2:10), the same operation demands ~70%
> for kswapd:
>
> PID USER S %CPU %MEM TIME+ SWAP COMMAND
> 420 root R 70 0.0 22:09.60 0 kswapd0
> 17717 nowhere D 27 0.2 0:01.81 10m dd
>
> In fact, kswapd cpu usage on this operation steadily increases over
> time.
>
> Also read performance degrades over time. After reboot:
> dd if=/some/big/file of=/dev/null bs=8M
> 1019+1 records in
> 1019+1 records out
> 8553494018 bytes (8.6 GB) copied, 16.211 s, 528 MB/s
>
> After ~5 days uptime:
> dd if=/some/big/file of=/dev/null bs=8M
> 1019+1 records in
> 1019+1 records out
> 8553494018 bytes (8.6 GB) copied, 29.0507 s, 294 MB/s
>
> Whereas raw disk sequential read performance stays the same:
> dd if=/some/big/file of=/dev/null bs=8M iflag=direct
> 1019+1 records in
> 1019+1 records out
> 8553494018 bytes (8.6 GB) copied, 14.7286 s, 581 MB/s
>
> Also after dropping caches, situation somehow improves, but not to the
> state of freshly restarted system:
> PID USER S %CPU %MEM TIME+ SWAP COMMAND
> 420 root S 39 0.0 23:31.17 0 kswapd0
> 19829 nowhere D 24 0.2 0:02.72 7764 dd
>
> perf shows:
>
> 31.24% kswapd0 [kernel.kallsyms] [k] _raw_spin_lock
> 26.19% kswapd0 [kernel.kallsyms] [k] shrink_slab
> 16.28% kswapd0 [kernel.kallsyms] [k] prune_super
> 6.55% kswapd0 [kernel.kallsyms] [k] grab_super_passive
> 5.35% kswapd0 [kernel.kallsyms] [k] down_read_trylock
> 4.03% kswapd0 [kernel.kallsyms] [k] up_read
> 2.31% kswapd0 [kernel.kallsyms] [k] put_super
> 1.81% kswapd0 [kernel.kallsyms] [k] drop_super
> 0.99% kswapd0 [kernel.kallsyms] [k] __put_super
> 0.25% kswapd0 [kernel.kallsyms] [k] __isolate_lru_page
> 0.23% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk
> 0.19% kswapd0 [r8169] [k] rtl8169_interrupt
> 0.15% kswapd0 [kernel.kallsyms] [k] twa_interrupt
Quite a lot of time is spent shrinking slabs (dcache, I guess), plus a
lot of spinlock contention.
Could you also take a few snapshots of /proc/420/stack to see what
kswapd is doing?
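Such snapshots could be collected with something like the following sketch (the helper name is illustrative; reading /proc/PID/stack requires root and CONFIG_STACKTRACE, and 420 is simply kswapd0's PID on this machine):

```shell
#!/bin/sh
# Take periodic snapshots of a task's kernel stack, e.g.
# snapshot_stack 420 5 2 for five samples of PID 420, 2s apart.
snapshot_stack() {
    pid=$1
    n=${2:-5}
    pause=${3:-2}
    i=0
    while [ "$i" -lt "$n" ]; do
        i=$((i + 1))
        echo "=== snapshot $i: /proc/$pid/stack ==="
        cat "/proc/$pid/stack"   # needs root on most systems
        sleep "$pause"
    done
}
```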
>
> P.S.: The message above was written couple of days ago. Now I'm at 10
> days uptime, and this is the result as of today
> PID USER S %CPU %MEM TIME+ SWAP COMMAND
> 420 root R 93 0.0 110:48.48 0 kswapd0
> 30085 nowhere D 42 0.2 0:04.36 10m dd
>
> PPS: Please CC me.
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
* Re: Kswapd in 3.2.0-rc5 is a CPU hog
2011-12-21 9:52 ` Michal Hocko
@ 2011-12-21 10:15 ` nowhere
2011-12-21 10:24 ` Michal Hocko
2011-12-21 22:55 ` Dave Chinner
1 sibling, 1 reply; 32+ messages in thread
From: nowhere @ 2011-12-21 10:15 UTC (permalink / raw)
To: Michal Hocko; +Cc: linux-kernel, linux-mm
On Wed, 21/12/2011 at 10:52 +0100, Michal Hocko wrote:
> [Let's CC linux-mm]
>
> On Wed 21-12-11 07:10:36, Nikolay S. wrote:
> > Hello,
> >
> > I'm using 3.2-rc5 on a machine, which atm does almost nothing except
> > file system operations and network i/o (i.e. file server). And there is
> > a problem with kswapd.
>
> What kind of filesystem do you use?
Well, that is XFS.
I have a large volume with ~200000 files, and a periodic job which
checks all files' timestamps every 30 minutes and takes action if a
timestamp has changed.
>
> >
> > I'm playing with dd:
> > dd if=/some/big/file of=/dev/null bs=8M
> >
> > I.e. I'm filling page cache.
> >
> > So when the machine is just rebooted, kswapd during this operation is
> > almost idle, just 5-8 percent according to top.
> >
> > After ~5 days of uptime (5 days, 2:10), the same operation demands ~70%
> > for kswapd:
> >
> > PID USER S %CPU %MEM TIME+ SWAP COMMAND
> > 420 root R 70 0.0 22:09.60 0 kswapd0
> > 17717 nowhere D 27 0.2 0:01.81 10m dd
> >
> > In fact, kswapd cpu usage on this operation steadily increases over
> > time.
> >
> > Also read performance degrades over time. After reboot:
> > dd if=/some/big/file of=/dev/null bs=8M
> > 1019+1 records in
> > 1019+1 records out
> > 8553494018 bytes (8.6 GB) copied, 16.211 s, 528 MB/s
> >
> > After ~5 days uptime:
> > dd if=/some/big/file of=/dev/null bs=8M
> > 1019+1 records in
> > 1019+1 records out
> > 8553494018 bytes (8.6 GB) copied, 29.0507 s, 294 MB/s
> >
> > Whereas raw disk sequential read performance stays the same:
> > dd if=/some/big/file of=/dev/null bs=8M iflag=direct
> > 1019+1 records in
> > 1019+1 records out
> > 8553494018 bytes (8.6 GB) copied, 14.7286 s, 581 MB/s
> >
> > Also after dropping caches, situation somehow improves, but not to the
> > state of freshly restarted system:
> > PID USER S %CPU %MEM TIME+ SWAP COMMAND
> > 420 root S 39 0.0 23:31.17 0 kswapd0
> > 19829 nowhere D 24 0.2 0:02.72 7764 dd
> >
> > perf shows:
> >
> > 31.24% kswapd0 [kernel.kallsyms] [k] _raw_spin_lock
> > 26.19% kswapd0 [kernel.kallsyms] [k] shrink_slab
> > 16.28% kswapd0 [kernel.kallsyms] [k] prune_super
> > 6.55% kswapd0 [kernel.kallsyms] [k] grab_super_passive
> > 5.35% kswapd0 [kernel.kallsyms] [k] down_read_trylock
> > 4.03% kswapd0 [kernel.kallsyms] [k] up_read
> > 2.31% kswapd0 [kernel.kallsyms] [k] put_super
> > 1.81% kswapd0 [kernel.kallsyms] [k] drop_super
> > 0.99% kswapd0 [kernel.kallsyms] [k] __put_super
> > 0.25% kswapd0 [kernel.kallsyms] [k] __isolate_lru_page
> > 0.23% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk
> > 0.19% kswapd0 [r8169] [k] rtl8169_interrupt
> > 0.15% kswapd0 [kernel.kallsyms] [k] twa_interrupt
>
> Quite a lot of time spent shrinking slab (dcache I guess) and a lot of
> spin lock contention.
This is slabinfo, sorted by num objects:
xfs_inode 192941 193205 960 17 4 : tunables 0 0 0 : slabdata 11365 11365 0
dentry 118818 118818 192 21 1 : tunables 0 0 0 : slabdata 5658 5658 0
kmalloc-256 107920 107920 256 16 1 : tunables 0 0 0 : slabdata 6745 6745 0
kmalloc-64 59912 102656 64 64 1 : tunables 0 0 0 : slabdata 1604 1604 0
radix_tree_node 30618 33474 568 14 2 : tunables 0 0 0 : slabdata 2391 2391 0
kmalloc-96 27092 41202 96 42 1 : tunables 0 0 0 : slabdata 981 981 0
buffer_head 24892 63843 104 39 1 : tunables 0 0 0 : slabdata 1637 1637 0
kmalloc-192 23332 34503 192 21 1 : tunables 0 0 0 : slabdata 1643 1643 0
sysfs_dir_cache 17444 17444 144 28 1 : tunables 0 0 0 : slabdata 623 623 0
arp_cache 12863 14796 320 12 1 : tunables 0 0 0 : slabdata 1233 1233 0
kmalloc-512 11051 11440 512 16 2 : tunables 0 0 0 : slabdata 715 715 0
kmalloc-128 10611 13152 128 32 1 : tunables 0 0 0 : slabdata 411 411 0
ext4_inode_cache 9660 18018 880 18 4 : tunables 0 0 0 : slabdata 1001 1001 0
kmalloc-8 8704 8704 8 512 1 : tunables 0 0 0 : slabdata 17 17 0
ext4_io_page 6912 6912 16 256 1 : tunables 0 0 0 : slabdata 27 27 0
anon_vma_chain 6701 10880 48 85 1 : tunables 0 0 0 : slabdata 128 128 0
Acpi-Namespace 6611 8058 40 102 1 : tunables 0 0 0 : slabdata 79 79 0
fsnotify_event_holder 6290 6970 24 170 1 : tunables 0 0 0 : slabdata 41 41 0
kmalloc-1024 5813 5888 1024 16 4 : tunables 0 0 0 : slabdata 368 368 0
vm_area_struct 5664 5664 168 24 1 : tunables 0 0 0 : slabdata 236 236 0
reiser_inode_cache 3992 5198 704 23 4 : tunables 0 0 0 : slabdata 226 226 0
Acpi-ParseExt 3808 3808 72 56 1 : tunables 0 0 0 : slabdata 68 68 0
kmalloc-2048 3587 3888 2048 16 8 : tunables 0 0 0 : slabdata 243 243 0
proc_inode_cache 3498 3510 624 13 2 : tunables 0 0 0 : slabdata 270 270 0
anon_vma 3380 3640 72 56 1 : tunables 0 0 0 : slabdata 65 65 0
kmalloc-16 3072 3072 16 256 1 : tunables 0 0 0 : slabdata 12 12 0
inode_cache 3024 3024 560 14 2 : tunables 0 0 0 : slabdata 216 216 0
ext4_allocation_context 3000 3000 136 30 1 : tunables 0 0 0 : slabdata 100 100 0
nf_conntrack_ffffffff81776d40 2910 3549 312 13 1 : tunables 0 0 0 : slabdata 273 273 0
kmalloc-4096 2792 3136 4096 8 8 : tunables 0 0 0 : slabdata 392 392 0
ext4_free_data 2701 2701 56 73 1 : tunables 0 0 0 : slabdata 37 37 0
pid_namespace 2130 2130 2112 15 8 : tunables 0 0 0 : slabdata 142 142 0
mqueue_inode_cache 2124 2124 896 18 4 : tunables 0 0 0 : slabdata 118 118 0
jbd2_revoke_record 1664 1664 32 128 1 : tunables 0 0 0 : slabdata 13 13 0
kmalloc-32 1434 3072 32 128 1 : tunables 0 0 0 : slabdata 24 24 0
shmem_inode_cache 1400 1521 624 13 2 : tunables 0 0 0 : slabdata 117 117 0
xfs_ili 1139 1260 216 18 1 : tunables 0 0 0 : slabdata 70 70 0
nfsd4_stateids 1131 1496 120 34 1 : tunables 0 0 0 : slabdata 44 44 0
idr_layer_cache 900 900 544 15 2 : tunables 0 0 0 : slabdata 60 60 0
jbd2_journal_head 742 1008 112 36 1 : tunables 0 0 0 : slabdata 28 28 0
fsnotify_event 648 648 112 36 1 : tunables 0 0 0 : slabdata 18 18 0
sock_inode_cache 577 624 640 12 2 : tunables 0 0 0 : slabdata 52 52 0
tw_sock_TCP 504 504 192 21 1 : tunables 0 0 0 : slabdata 24 24 0
TCP 356 414 1728 18 8 : tunables 0 0 0 : slabdata 23 23 0
RAW 342 342 832 19 4 : tunables 0 0 0 : slabdata 18 18 0
jbd2_journal_handle 340 340 24 170 1 : tunables 0 0 0 : slabdata 2 2 0
blkdev_requests 322 322 344 23 2 : tunables 0 0 0 : slabdata 14 14 0
task_struct 293 357 1504 21 8 : tunables 0 0 0 : slabdata 17 17 0
UDP 285 285 832 19 4 : tunables 0 0 0 : slabdata 15 15 0
files_cache 276 276 704 23 4 : tunables 0 0 0 : slabdata 12 12 0
nfsd4_openowners 220 320 392 20 2 : tunables 0 0 0 : slabdata 16 16 0
mm_struct 216 216 896 18 4 : tunables 0 0 0 : slabdata 12 12 0
sighand_cache 199 225 2112 15 8 : tunables 0 0 0 : slabdata 15 15 0
nfsd4_delegations 198 198 368 22 2 : tunables 0 0 0 : slabdata 9 9 0
kmem_cache_node 192 192 64 64 1 : tunables 0 0 0 : slabdata 3 3 0
xfs_buf_item 162 162 224 18 1 : tunables 0 0 0 : slabdata 9 9 0
ip_fib_trie 146 146 56 73 1 : tunables 0 0 0 : slabdata 2 2 0
ext4_io_end 140 154 1128 14 4 : tunables 0 0 0 : slabdata 11 11 0
dnotify_mark 120 120 136 30 1 : tunables 0 0 0 : slabdata 4 4 0
TCPv6 104 119 1856 17 8 : tunables 0 0 0 : slabdata 7 7 0
cfq_queue 102 102 232 17 1 : tunables 0 0 0 : slabdata 6 6 0
Acpi-State 102 102 80 51 1 : tunables 0 0 0 : slabdata 2 2 0
sigqueue 100 100 160 25 1 : tunables 0 0 0 : slabdata 4 4 0
xfs_efd_item 80 80 400 20 2 : tunables 0 0 0 : slabdata 4 4 0
tw_sock_TCPv6 64 144 256 16 1 : tunables 0 0 0 : slabdata 9 9 0
bdev_cache 57 57 832 19 4 : tunables 0 0 0 : slabdata 3 3 0
blkdev_queue 54 54 1744 18 8 : tunables 0 0 0 : slabdata 3 3 0
net_namespace 52 52 2432 13 8 : tunables 0 0 0 : slabdata 4 4 0
kmalloc-8192 52 52 8192 4 8 : tunables 0 0 0 : slabdata 13 13 0
kmem_cache 42 42 192 21 1 : tunables 0 0 0 : slabdata 2 2 0
xfs_log_ticket 40 40 200 20 1 : tunables 0 0 0 : slabdata 2 2 0
xfs_btree_cur 38 38 208 19 1 : tunables 0 0 0 : slabdata 2 2 0
rpc_inode_cache 38 38 832 19 4 : tunables 0 0 0 : slabdata 2 2 0
nf_conntrack_expect 34 34 240 17 1 : tunables 0 0 0 : slabdata 2 2 0
xfs_da_state 32 32 488 16 2 : tunables 0 0 0 : slabdata 2 2 0
UDPv6 32 32 1024 16 4 : tunables 0 0 0 : slabdata 2 2 0
xfs_trans 28 28 280 14 1 : tunables 0 0 0 : slabdata 2 2 0
taskstats 24 24 328 12 1 : tunables 0 0 0 : slabdata 2 2 0
dio 24 24 640 12 2 : tunables 0 0 0 : slabdata 2 2 0
posix_timers_cache 23 23 176 23 1 : tunables 0 0 0 : slabdata 1 1 0
hugetlbfs_inode_cache 14 14 560 14 2 : tunables 0 0 0 : slabdata 1 1 0
xfrm_dst_cache 0 0 384 21 2 : tunables 0 0 0 : slabdata 0 0 0
user_namespace 0 0 1072 15 4 : tunables 0 0 0 : slabdata 0 0 0
UDPLITEv6 0 0 1024 16 4 : tunables 0 0 0 : slabdata 0 0 0
UDP-Lite 0 0 832 19 4 : tunables 0 0 0 : slabdata 0 0 0
kcopyd_job 0 0 3240 10 8 : tunables 0 0 0 : slabdata 0 0 0
flow_cache 0 0 104 39 1 : tunables 0 0 0 : slabdata 0 0 0
ext2_xattr 0 0 88 46 1 : tunables 0 0 0 : slabdata 0 0 0
ext2_inode_cache 0 0 752 21 4 : tunables 0 0 0 : slabdata 0 0 0
dquot 0 0 256 16 1 : tunables 0 0 0 : slabdata 0 0 0
dm_uevent 0 0 2608 12 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-96 0 0 96 42 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-8192 0 0 8192 4 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-8 0 0 8 512 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-64 0 0 64 64 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-512 0 0 512 16 2 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-4096 0 0 4096 8 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-32 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-256 0 0 256 16 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-2048 0 0 2048 16 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-192 0 0 192 21 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-16 0 0 16 256 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-128 0 0 128 32 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-1024 0 0 1024 16 4 : tunables 0 0 0 : slabdata 0 0 0
bsg_cmd 0 0 312 13 1 : tunables 0 0 0 : slabdata 0 0 0
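A listing like the one above can be produced by sorting /proc/slabinfo on the active-object column; a sketch (the helper name is illustrative, and /proc/slabinfo is typically readable only by root):

```shell
#!/bin/sh
# Sort a slabinfo-style dump by number of active objects (the second
# column), skipping the two header lines. Defaults to /proc/slabinfo,
# but any saved copy can be passed as $1.
sort_slabinfo() {
    tail -n +3 "${1:-/proc/slabinfo}" | sort -rn -k2
}
```

Usage: `sort_slabinfo | head -20` shows the biggest caches first.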
> Could you also take few snapshots of /proc/420/stack to see what kswapd
> is doing.
Uhm, there is no such entry in /proc. I guess I need to enable some
kernel option and recompile?
That would reset the uptime, and the problem would take another 5-10
days to reappear...
>
> >
> > P.S.: The message above was written couple of days ago. Now I'm at 10
> > days uptime, and this is the result as of today
> > PID USER S %CPU %MEM TIME+ SWAP COMMAND
> > 420 root R 93 0.0 110:48.48 0 kswapd0
> > 30085 nowhere D 42 0.2 0:04.36 10m dd
> >
> > PPS: Please CC me.
>
* Re: Kswapd in 3.2.0-rc5 is a CPU hog
2011-12-21 10:15 ` nowhere
@ 2011-12-21 10:24 ` Michal Hocko
2011-12-21 10:52 ` nowhere
2011-12-21 14:06 ` Alex Elder
0 siblings, 2 replies; 32+ messages in thread
From: Michal Hocko @ 2011-12-21 10:24 UTC (permalink / raw)
To: nowhere
Cc: linux-kernel, linux-mm, Christoph Hellwig, Dave Chinner, xfs,
Ben Myers, Alex Elder
[Let's cc also some fs and xfs people]
On Wed 21-12-11 14:15:21, nowhere wrote:
> On Wed, 21/12/2011 at 10:52 +0100, Michal Hocko wrote:
> > [Let's CC linux-mm]
> >
> > On Wed 21-12-11 07:10:36, Nikolay S. wrote:
> > > Hello,
> > >
> > > I'm using 3.2-rc5 on a machine, which atm does almost nothing except
> > > file system operations and network i/o (i.e. file server). And there is
> > > a problem with kswapd.
> >
> > What kind of filesystem do you use?
>
> Well, that is XFS.
> I have a large volume with ~200000 files, and a periodic job, which
> checks all file's timestamps once per 30 minutes and makes actions if
> timestamp has changed.
Is this the first time you are seeing this? I am not familiar with XFS
at all, but with that number of files it sounds like the dcache
shrinker might be really busy...
> > > I'm playing with dd:
> > > dd if=/some/big/file of=/dev/null bs=8M
> > >
> > > I.e. I'm filling page cache.
> > >
> > > So when the machine is just rebooted, kswapd during this operation is
> > > almost idle, just 5-8 percent according to top.
> > >
> > > After ~5 days of uptime (5 days, 2:10), the same operation demands ~70%
> > > for kswapd:
> > >
> > > PID USER S %CPU %MEM TIME+ SWAP COMMAND
> > > 420 root R 70 0.0 22:09.60 0 kswapd0
> > > 17717 nowhere D 27 0.2 0:01.81 10m dd
> > >
> > > In fact, kswapd cpu usage on this operation steadily increases over
> > > time.
> > >
> > > Also read performance degrades over time. After reboot:
> > > dd if=/some/big/file of=/dev/null bs=8M
> > > 1019+1 records in
> > > 1019+1 records out
> > > 8553494018 bytes (8.6 GB) copied, 16.211 s, 528 MB/s
> > >
> > > After ~5 days uptime:
> > > dd if=/some/big/file of=/dev/null bs=8M
> > > 1019+1 records in
> > > 1019+1 records out
> > > 8553494018 bytes (8.6 GB) copied, 29.0507 s, 294 MB/s
> > >
> > > Whereas raw disk sequential read performance stays the same:
> > > dd if=/some/big/file of=/dev/null bs=8M iflag=direct
> > > 1019+1 records in
> > > 1019+1 records out
> > > 8553494018 bytes (8.6 GB) copied, 14.7286 s, 581 MB/s
> > >
> > > Also after dropping caches, situation somehow improves, but not to the
> > > state of freshly restarted system:
> > > PID USER S %CPU %MEM TIME+ SWAP COMMAND
> > > 420 root S 39 0.0 23:31.17 0 kswapd0
> > > 19829 nowhere D 24 0.2 0:02.72 7764 dd
> > >
> > > perf shows:
> > >
> > > 31.24% kswapd0 [kernel.kallsyms] [k] _raw_spin_lock
> > > 26.19% kswapd0 [kernel.kallsyms] [k] shrink_slab
> > > 16.28% kswapd0 [kernel.kallsyms] [k] prune_super
> > > 6.55% kswapd0 [kernel.kallsyms] [k] grab_super_passive
> > > 5.35% kswapd0 [kernel.kallsyms] [k] down_read_trylock
> > > 4.03% kswapd0 [kernel.kallsyms] [k] up_read
> > > 2.31% kswapd0 [kernel.kallsyms] [k] put_super
> > > 1.81% kswapd0 [kernel.kallsyms] [k] drop_super
> > > 0.99% kswapd0 [kernel.kallsyms] [k] __put_super
> > > 0.25% kswapd0 [kernel.kallsyms] [k] __isolate_lru_page
> > > 0.23% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk
> > > 0.19% kswapd0 [r8169] [k] rtl8169_interrupt
> > > 0.15% kswapd0 [kernel.kallsyms] [k] twa_interrupt
> >
> > Quite a lot of time spent shrinking slab (dcache I guess) and a lot of
> > spin lock contention.
>
> This is slabinfo, sorted by num objects:
>
> xfs_inode 192941 193205 960 17 4 : tunables 0 0 0 : slabdata 11365 11365 0
> dentry 118818 118818 192 21 1 : tunables 0 0 0 : slabdata 5658 5658 0
> kmalloc-256 107920 107920 256 16 1 : tunables 0 0 0 : slabdata 6745 6745 0
> kmalloc-64 59912 102656 64 64 1 : tunables 0 0 0 : slabdata 1604 1604 0
> radix_tree_node 30618 33474 568 14 2 : tunables 0 0 0 : slabdata 2391 2391 0
> kmalloc-96 27092 41202 96 42 1 : tunables 0 0 0 : slabdata 981 981 0
> buffer_head 24892 63843 104 39 1 : tunables 0 0 0 : slabdata 1637 1637 0
> kmalloc-192 23332 34503 192 21 1 : tunables 0 0 0 : slabdata 1643 1643 0
> sysfs_dir_cache 17444 17444 144 28 1 : tunables 0 0 0 : slabdata 623 623 0
> arp_cache 12863 14796 320 12 1 : tunables 0 0 0 : slabdata 1233 1233 0
> kmalloc-512 11051 11440 512 16 2 : tunables 0 0 0 : slabdata 715 715 0
> kmalloc-128 10611 13152 128 32 1 : tunables 0 0 0 : slabdata 411 411 0
> ext4_inode_cache 9660 18018 880 18 4 : tunables 0 0 0 : slabdata 1001 1001 0
> kmalloc-8 8704 8704 8 512 1 : tunables 0 0 0 : slabdata 17 17 0
> ext4_io_page 6912 6912 16 256 1 : tunables 0 0 0 : slabdata 27 27 0
> anon_vma_chain 6701 10880 48 85 1 : tunables 0 0 0 : slabdata 128 128 0
> Acpi-Namespace 6611 8058 40 102 1 : tunables 0 0 0 : slabdata 79 79 0
> fsnotify_event_holder 6290 6970 24 170 1 : tunables 0 0 0 : slabdata 41 41 0
> kmalloc-1024 5813 5888 1024 16 4 : tunables 0 0 0 : slabdata 368 368 0
> vm_area_struct 5664 5664 168 24 1 : tunables 0 0 0 : slabdata 236 236 0
> reiser_inode_cache 3992 5198 704 23 4 : tunables 0 0 0 : slabdata 226 226 0
> Acpi-ParseExt 3808 3808 72 56 1 : tunables 0 0 0 : slabdata 68 68 0
> kmalloc-2048 3587 3888 2048 16 8 : tunables 0 0 0 : slabdata 243 243 0
> proc_inode_cache 3498 3510 624 13 2 : tunables 0 0 0 : slabdata 270 270 0
> anon_vma 3380 3640 72 56 1 : tunables 0 0 0 : slabdata 65 65 0
> kmalloc-16 3072 3072 16 256 1 : tunables 0 0 0 : slabdata 12 12 0
> inode_cache 3024 3024 560 14 2 : tunables 0 0 0 : slabdata 216 216 0
> ext4_allocation_context 3000 3000 136 30 1 : tunables 0 0 0 : slabdata 100 100 0
> nf_conntrack_ffffffff81776d40 2910 3549 312 13 1 : tunables 0 0 0 : slabdata 273 273 0
> kmalloc-4096 2792 3136 4096 8 8 : tunables 0 0 0 : slabdata 392 392 0
> ext4_free_data 2701 2701 56 73 1 : tunables 0 0 0 : slabdata 37 37 0
> pid_namespace 2130 2130 2112 15 8 : tunables 0 0 0 : slabdata 142 142 0
> mqueue_inode_cache 2124 2124 896 18 4 : tunables 0 0 0 : slabdata 118 118 0
> jbd2_revoke_record 1664 1664 32 128 1 : tunables 0 0 0 : slabdata 13 13 0
> kmalloc-32 1434 3072 32 128 1 : tunables 0 0 0 : slabdata 24 24 0
> shmem_inode_cache 1400 1521 624 13 2 : tunables 0 0 0 : slabdata 117 117 0
> xfs_ili 1139 1260 216 18 1 : tunables 0 0 0 : slabdata 70 70 0
> nfsd4_stateids 1131 1496 120 34 1 : tunables 0 0 0 : slabdata 44 44 0
> idr_layer_cache 900 900 544 15 2 : tunables 0 0 0 : slabdata 60 60 0
> jbd2_journal_head 742 1008 112 36 1 : tunables 0 0 0 : slabdata 28 28 0
> fsnotify_event 648 648 112 36 1 : tunables 0 0 0 : slabdata 18 18 0
> sock_inode_cache 577 624 640 12 2 : tunables 0 0 0 : slabdata 52 52 0
> tw_sock_TCP 504 504 192 21 1 : tunables 0 0 0 : slabdata 24 24 0
> TCP 356 414 1728 18 8 : tunables 0 0 0 : slabdata 23 23 0
> RAW 342 342 832 19 4 : tunables 0 0 0 : slabdata 18 18 0
> jbd2_journal_handle 340 340 24 170 1 : tunables 0 0 0 : slabdata 2 2 0
> blkdev_requests 322 322 344 23 2 : tunables 0 0 0 : slabdata 14 14 0
> task_struct 293 357 1504 21 8 : tunables 0 0 0 : slabdata 17 17 0
> UDP 285 285 832 19 4 : tunables 0 0 0 : slabdata 15 15 0
> files_cache 276 276 704 23 4 : tunables 0 0 0 : slabdata 12 12 0
> nfsd4_openowners 220 320 392 20 2 : tunables 0 0 0 : slabdata 16 16 0
> mm_struct 216 216 896 18 4 : tunables 0 0 0 : slabdata 12 12 0
> sighand_cache 199 225 2112 15 8 : tunables 0 0 0 : slabdata 15 15 0
> nfsd4_delegations 198 198 368 22 2 : tunables 0 0 0 : slabdata 9 9 0
> kmem_cache_node 192 192 64 64 1 : tunables 0 0 0 : slabdata 3 3 0
> xfs_buf_item 162 162 224 18 1 : tunables 0 0 0 : slabdata 9 9 0
> ip_fib_trie 146 146 56 73 1 : tunables 0 0 0 : slabdata 2 2 0
> ext4_io_end 140 154 1128 14 4 : tunables 0 0 0 : slabdata 11 11 0
> dnotify_mark 120 120 136 30 1 : tunables 0 0 0 : slabdata 4 4 0
> TCPv6 104 119 1856 17 8 : tunables 0 0 0 : slabdata 7 7 0
> cfq_queue 102 102 232 17 1 : tunables 0 0 0 : slabdata 6 6 0
> Acpi-State 102 102 80 51 1 : tunables 0 0 0 : slabdata 2 2 0
> sigqueue 100 100 160 25 1 : tunables 0 0 0 : slabdata 4 4 0
> xfs_efd_item 80 80 400 20 2 : tunables 0 0 0 : slabdata 4 4 0
> tw_sock_TCPv6 64 144 256 16 1 : tunables 0 0 0 : slabdata 9 9 0
> bdev_cache 57 57 832 19 4 : tunables 0 0 0 : slabdata 3 3 0
> blkdev_queue 54 54 1744 18 8 : tunables 0 0 0 : slabdata 3 3 0
> net_namespace 52 52 2432 13 8 : tunables 0 0 0 : slabdata 4 4 0
> kmalloc-8192 52 52 8192 4 8 : tunables 0 0 0 : slabdata 13 13 0
> kmem_cache 42 42 192 21 1 : tunables 0 0 0 : slabdata 2 2 0
> xfs_log_ticket 40 40 200 20 1 : tunables 0 0 0 : slabdata 2 2 0
> xfs_btree_cur 38 38 208 19 1 : tunables 0 0 0 : slabdata 2 2 0
> rpc_inode_cache 38 38 832 19 4 : tunables 0 0 0 : slabdata 2 2 0
> nf_conntrack_expect 34 34 240 17 1 : tunables 0 0 0 : slabdata 2 2 0
> xfs_da_state 32 32 488 16 2 : tunables 0 0 0 : slabdata 2 2 0
> UDPv6 32 32 1024 16 4 : tunables 0 0 0 : slabdata 2 2 0
> xfs_trans 28 28 280 14 1 : tunables 0 0 0 : slabdata 2 2 0
> taskstats 24 24 328 12 1 : tunables 0 0 0 : slabdata 2 2 0
> dio 24 24 640 12 2 : tunables 0 0 0 : slabdata 2 2 0
> posix_timers_cache 23 23 176 23 1 : tunables 0 0 0 : slabdata 1 1 0
> hugetlbfs_inode_cache 14 14 560 14 2 : tunables 0 0 0 : slabdata 1 1 0
> xfrm_dst_cache 0 0 384 21 2 : tunables 0 0 0 : slabdata 0 0 0
> user_namespace 0 0 1072 15 4 : tunables 0 0 0 : slabdata 0 0 0
> UDPLITEv6 0 0 1024 16 4 : tunables 0 0 0 : slabdata 0 0 0
> UDP-Lite 0 0 832 19 4 : tunables 0 0 0 : slabdata 0 0 0
> kcopyd_job 0 0 3240 10 8 : tunables 0 0 0 : slabdata 0 0 0
> flow_cache 0 0 104 39 1 : tunables 0 0 0 : slabdata 0 0 0
> ext2_xattr 0 0 88 46 1 : tunables 0 0 0 : slabdata 0 0 0
> ext2_inode_cache 0 0 752 21 4 : tunables 0 0 0 : slabdata 0 0 0
> dquot 0 0 256 16 1 : tunables 0 0 0 : slabdata 0 0 0
> dm_uevent 0 0 2608 12 8 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-96 0 0 96 42 1 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-8192 0 0 8192 4 8 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-8 0 0 8 512 1 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-64 0 0 64 64 1 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-512 0 0 512 16 2 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-4096 0 0 4096 8 8 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-32 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-256 0 0 256 16 1 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-2048 0 0 2048 16 8 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-192 0 0 192 21 1 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-16 0 0 16 256 1 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-128 0 0 128 32 1 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-1024 0 0 1024 16 4 : tunables 0 0 0 : slabdata 0 0 0
> bsg_cmd 0 0 312 13 1 : tunables 0 0 0 : slabdata 0 0 0
>
> > Could you also take few snapshots of /proc/420/stack to see what kswapd
> > is doing.
>
> Uhm, there is no such entry in proc. Guess I need to enable some kernel
> option and recompile?
Yes, you need CONFIG_STACKTRACE. But you can get similar information
via sysrq+t.
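Without a console, sysrq+t can be triggered through /proc; a sketch (the helper name is illustrative; this needs root, and the 't' command must be allowed by the kernel.sysrq sysctl):

```shell
#!/bin/sh
# Console-less equivalent of Alt+SysRq+t: dump every task's state and
# kernel stack into the kernel ring buffer, then show the tail of it.
sysrq_t() {
    echo t > /proc/sysrq-trigger &&
    dmesg | tail -n 200      # the task dump ends up here
}
```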
> This will reset uptime and a problem for another 5-10 days..
Then don't do that ;)
> > > P.S.: The message above was written couple of days ago. Now I'm at 10
> > > days uptime, and this is the result as of today
> > > PID USER S %CPU %MEM TIME+ SWAP COMMAND
> > > 420 root R 93 0.0 110:48.48 0 kswapd0
> > > 30085 nowhere D 42 0.2 0:04.36 10m dd
> > >
> > > PPS: Please CC me.
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
* Re: Kswapd in 3.2.0-rc5 is a CPU hog
2011-12-21 10:24 ` Michal Hocko
@ 2011-12-21 10:52 ` nowhere
2011-12-21 14:06 ` Alex Elder
1 sibling, 0 replies; 32+ messages in thread
From: nowhere @ 2011-12-21 10:52 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-kernel, linux-mm, Christoph Hellwig, Dave Chinner, xfs,
Ben Myers, Alex Elder
On Wed, 21/12/2011 at 11:24 +0100, Michal Hocko wrote:
> [Let's cc also some fs and xfs people]
>
> On Wed 21-12-11 14:15:21, nowhere wrote:
> > On Wed, 21/12/2011 at 10:52 +0100, Michal Hocko wrote:
> > > [Let's CC linux-mm]
> > >
> > > On Wed 21-12-11 07:10:36, Nikolay S. wrote:
> > > > Hello,
> > > >
> > > > I'm using 3.2-rc5 on a machine, which atm does almost nothing except
> > > > file system operations and network i/o (i.e. file server). And there is
> > > > a problem with kswapd.
> > >
> > > What kind of filesystem do you use?
> >
> > Well, that is XFS.
> > I have a large volume with ~200000 files, and a periodic job, which
> > checks all file's timestamps once per 30 minutes and makes actions if
> > timestamp has changed.
>
> Is it the first time you are seeing this? I am not familiar with xfs at
> all but the number of files sounds like dcache shrinker might be really
> busy...
This particular behavior, yes. In 3.0, kswapd would sometimes hang in
D-state, pushing the load average above 1; that's how I noticed it.
> > > > I'm playing with dd:
> > > > dd if=/some/big/file of=/dev/null bs=8M
> > > >
> > > > I.e. I'm filling page cache.
> > > >
> > > > So when the machine is just rebooted, kswapd during this operation is
> > > > almost idle, just 5-8 percent according to top.
> > > >
> > > > After ~5 days of uptime (5 days, 2:10), the same operation demands ~70%
> > > > for kswapd:
> > > >
> > > > PID USER S %CPU %MEM TIME+ SWAP COMMAND
> > > > 420 root R 70 0.0 22:09.60 0 kswapd0
> > > > 17717 nowhere D 27 0.2 0:01.81 10m dd
> > > >
> > > > In fact, kswapd cpu usage on this operation steadily increases over
> > > > time.
> > > >
> > > > Also read performance degrades over time. After reboot:
> > > > dd if=/some/big/file of=/dev/null bs=8M
> > > > 1019+1 records in
> > > > 1019+1 records out
> > > > 8553494018 bytes (8.6 GB) copied, 16.211 s, 528 MB/s
> > > >
> > > > After ~5 days uptime:
> > > > dd if=/some/big/file of=/dev/null bs=8M
> > > > 1019+1 records in
> > > > 1019+1 records out
> > > > 8553494018 bytes (8.6 GB) copied, 29.0507 s, 294 MB/s
> > > >
> > > > Whereas raw disk sequential read performance stays the same:
> > > > dd if=/some/big/file of=/dev/null bs=8M iflag=direct
> > > > 1019+1 records in
> > > > 1019+1 records out
> > > > 8553494018 bytes (8.6 GB) copied, 14.7286 s, 581 MB/s
> > > >
> > > > Also after dropping caches, situation somehow improves, but not to the
> > > > state of freshly restarted system:
> > > > PID USER S %CPU %MEM TIME+ SWAP COMMAND
> > > > 420 root S 39 0.0 23:31.17 0 kswapd0
> > > > 19829 nowhere D 24 0.2 0:02.72 7764 dd
> > > >
> > > > perf shows:
> > > >
> > > > 31.24% kswapd0 [kernel.kallsyms] [k] _raw_spin_lock
> > > > 26.19% kswapd0 [kernel.kallsyms] [k] shrink_slab
> > > > 16.28% kswapd0 [kernel.kallsyms] [k] prune_super
> > > > 6.55% kswapd0 [kernel.kallsyms] [k] grab_super_passive
> > > > 5.35% kswapd0 [kernel.kallsyms] [k] down_read_trylock
> > > > 4.03% kswapd0 [kernel.kallsyms] [k] up_read
> > > > 2.31% kswapd0 [kernel.kallsyms] [k] put_super
> > > > 1.81% kswapd0 [kernel.kallsyms] [k] drop_super
> > > > 0.99% kswapd0 [kernel.kallsyms] [k] __put_super
> > > > 0.25% kswapd0 [kernel.kallsyms] [k] __isolate_lru_page
> > > > 0.23% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk
> > > > 0.19% kswapd0 [r8169] [k] rtl8169_interrupt
> > > > 0.15% kswapd0 [kernel.kallsyms] [k] twa_interrupt
> > >
> > > Quite a lot of time spent shrinking slab (dcache I guess) and a lot of
> > > spin lock contention.
> >
> > This is slabinfo, sorted by num objects:
> >
> > xfs_inode 192941 193205 960 17 4 : tunables 0 0 0 : slabdata 11365 11365 0
> > dentry 118818 118818 192 21 1 : tunables 0 0 0 : slabdata 5658 5658 0
> > kmalloc-256 107920 107920 256 16 1 : tunables 0 0 0 : slabdata 6745 6745 0
> > kmalloc-64 59912 102656 64 64 1 : tunables 0 0 0 : slabdata 1604 1604 0
> > radix_tree_node 30618 33474 568 14 2 : tunables 0 0 0 : slabdata 2391 2391 0
> > kmalloc-96 27092 41202 96 42 1 : tunables 0 0 0 : slabdata 981 981 0
> > buffer_head 24892 63843 104 39 1 : tunables 0 0 0 : slabdata 1637 1637 0
> > kmalloc-192 23332 34503 192 21 1 : tunables 0 0 0 : slabdata 1643 1643 0
> > sysfs_dir_cache 17444 17444 144 28 1 : tunables 0 0 0 : slabdata 623 623 0
> > arp_cache 12863 14796 320 12 1 : tunables 0 0 0 : slabdata 1233 1233 0
> > kmalloc-512 11051 11440 512 16 2 : tunables 0 0 0 : slabdata 715 715 0
> > kmalloc-128 10611 13152 128 32 1 : tunables 0 0 0 : slabdata 411 411 0
> > ext4_inode_cache 9660 18018 880 18 4 : tunables 0 0 0 : slabdata 1001 1001 0
> > kmalloc-8 8704 8704 8 512 1 : tunables 0 0 0 : slabdata 17 17 0
> > ext4_io_page 6912 6912 16 256 1 : tunables 0 0 0 : slabdata 27 27 0
> > anon_vma_chain 6701 10880 48 85 1 : tunables 0 0 0 : slabdata 128 128 0
> > Acpi-Namespace 6611 8058 40 102 1 : tunables 0 0 0 : slabdata 79 79 0
> > fsnotify_event_holder 6290 6970 24 170 1 : tunables 0 0 0 : slabdata 41 41 0
> > kmalloc-1024 5813 5888 1024 16 4 : tunables 0 0 0 : slabdata 368 368 0
> > vm_area_struct 5664 5664 168 24 1 : tunables 0 0 0 : slabdata 236 236 0
> > reiser_inode_cache 3992 5198 704 23 4 : tunables 0 0 0 : slabdata 226 226 0
> > Acpi-ParseExt 3808 3808 72 56 1 : tunables 0 0 0 : slabdata 68 68 0
> > kmalloc-2048 3587 3888 2048 16 8 : tunables 0 0 0 : slabdata 243 243 0
> > proc_inode_cache 3498 3510 624 13 2 : tunables 0 0 0 : slabdata 270 270 0
> > anon_vma 3380 3640 72 56 1 : tunables 0 0 0 : slabdata 65 65 0
> > kmalloc-16 3072 3072 16 256 1 : tunables 0 0 0 : slabdata 12 12 0
> > inode_cache 3024 3024 560 14 2 : tunables 0 0 0 : slabdata 216 216 0
> > ext4_allocation_context 3000 3000 136 30 1 : tunables 0 0 0 : slabdata 100 100 0
> > nf_conntrack_ffffffff81776d40 2910 3549 312 13 1 : tunables 0 0 0 : slabdata 273 273 0
> > kmalloc-4096 2792 3136 4096 8 8 : tunables 0 0 0 : slabdata 392 392 0
> > ext4_free_data 2701 2701 56 73 1 : tunables 0 0 0 : slabdata 37 37 0
> > pid_namespace 2130 2130 2112 15 8 : tunables 0 0 0 : slabdata 142 142 0
> > mqueue_inode_cache 2124 2124 896 18 4 : tunables 0 0 0 : slabdata 118 118 0
> > jbd2_revoke_record 1664 1664 32 128 1 : tunables 0 0 0 : slabdata 13 13 0
> > kmalloc-32 1434 3072 32 128 1 : tunables 0 0 0 : slabdata 24 24 0
> > shmem_inode_cache 1400 1521 624 13 2 : tunables 0 0 0 : slabdata 117 117 0
> > xfs_ili 1139 1260 216 18 1 : tunables 0 0 0 : slabdata 70 70 0
> > nfsd4_stateids 1131 1496 120 34 1 : tunables 0 0 0 : slabdata 44 44 0
> > idr_layer_cache 900 900 544 15 2 : tunables 0 0 0 : slabdata 60 60 0
> > jbd2_journal_head 742 1008 112 36 1 : tunables 0 0 0 : slabdata 28 28 0
> > fsnotify_event 648 648 112 36 1 : tunables 0 0 0 : slabdata 18 18 0
> > sock_inode_cache 577 624 640 12 2 : tunables 0 0 0 : slabdata 52 52 0
> > tw_sock_TCP 504 504 192 21 1 : tunables 0 0 0 : slabdata 24 24 0
> > TCP 356 414 1728 18 8 : tunables 0 0 0 : slabdata 23 23 0
> > RAW 342 342 832 19 4 : tunables 0 0 0 : slabdata 18 18 0
> > jbd2_journal_handle 340 340 24 170 1 : tunables 0 0 0 : slabdata 2 2 0
> > blkdev_requests 322 322 344 23 2 : tunables 0 0 0 : slabdata 14 14 0
> > task_struct 293 357 1504 21 8 : tunables 0 0 0 : slabdata 17 17 0
> > UDP 285 285 832 19 4 : tunables 0 0 0 : slabdata 15 15 0
> > files_cache 276 276 704 23 4 : tunables 0 0 0 : slabdata 12 12 0
> > nfsd4_openowners 220 320 392 20 2 : tunables 0 0 0 : slabdata 16 16 0
> > mm_struct 216 216 896 18 4 : tunables 0 0 0 : slabdata 12 12 0
> > sighand_cache 199 225 2112 15 8 : tunables 0 0 0 : slabdata 15 15 0
> > nfsd4_delegations 198 198 368 22 2 : tunables 0 0 0 : slabdata 9 9 0
> > kmem_cache_node 192 192 64 64 1 : tunables 0 0 0 : slabdata 3 3 0
> > xfs_buf_item 162 162 224 18 1 : tunables 0 0 0 : slabdata 9 9 0
> > ip_fib_trie 146 146 56 73 1 : tunables 0 0 0 : slabdata 2 2 0
> > ext4_io_end 140 154 1128 14 4 : tunables 0 0 0 : slabdata 11 11 0
> > dnotify_mark 120 120 136 30 1 : tunables 0 0 0 : slabdata 4 4 0
> > TCPv6 104 119 1856 17 8 : tunables 0 0 0 : slabdata 7 7 0
> > cfq_queue 102 102 232 17 1 : tunables 0 0 0 : slabdata 6 6 0
> > Acpi-State 102 102 80 51 1 : tunables 0 0 0 : slabdata 2 2 0
> > sigqueue 100 100 160 25 1 : tunables 0 0 0 : slabdata 4 4 0
> > xfs_efd_item 80 80 400 20 2 : tunables 0 0 0 : slabdata 4 4 0
> > tw_sock_TCPv6 64 144 256 16 1 : tunables 0 0 0 : slabdata 9 9 0
> > bdev_cache 57 57 832 19 4 : tunables 0 0 0 : slabdata 3 3 0
> > blkdev_queue 54 54 1744 18 8 : tunables 0 0 0 : slabdata 3 3 0
> > net_namespace 52 52 2432 13 8 : tunables 0 0 0 : slabdata 4 4 0
> > kmalloc-8192 52 52 8192 4 8 : tunables 0 0 0 : slabdata 13 13 0
> > kmem_cache 42 42 192 21 1 : tunables 0 0 0 : slabdata 2 2 0
> > xfs_log_ticket 40 40 200 20 1 : tunables 0 0 0 : slabdata 2 2 0
> > xfs_btree_cur 38 38 208 19 1 : tunables 0 0 0 : slabdata 2 2 0
> > rpc_inode_cache 38 38 832 19 4 : tunables 0 0 0 : slabdata 2 2 0
> > nf_conntrack_expect 34 34 240 17 1 : tunables 0 0 0 : slabdata 2 2 0
> > xfs_da_state 32 32 488 16 2 : tunables 0 0 0 : slabdata 2 2 0
> > UDPv6 32 32 1024 16 4 : tunables 0 0 0 : slabdata 2 2 0
> > xfs_trans 28 28 280 14 1 : tunables 0 0 0 : slabdata 2 2 0
> > taskstats 24 24 328 12 1 : tunables 0 0 0 : slabdata 2 2 0
> > dio 24 24 640 12 2 : tunables 0 0 0 : slabdata 2 2 0
> > posix_timers_cache 23 23 176 23 1 : tunables 0 0 0 : slabdata 1 1 0
> > hugetlbfs_inode_cache 14 14 560 14 2 : tunables 0 0 0 : slabdata 1 1 0
> > xfrm_dst_cache 0 0 384 21 2 : tunables 0 0 0 : slabdata 0 0 0
> > user_namespace 0 0 1072 15 4 : tunables 0 0 0 : slabdata 0 0 0
> > UDPLITEv6 0 0 1024 16 4 : tunables 0 0 0 : slabdata 0 0 0
> > UDP-Lite 0 0 832 19 4 : tunables 0 0 0 : slabdata 0 0 0
> > kcopyd_job 0 0 3240 10 8 : tunables 0 0 0 : slabdata 0 0 0
> > flow_cache 0 0 104 39 1 : tunables 0 0 0 : slabdata 0 0 0
> > ext2_xattr 0 0 88 46 1 : tunables 0 0 0 : slabdata 0 0 0
> > ext2_inode_cache 0 0 752 21 4 : tunables 0 0 0 : slabdata 0 0 0
> > dquot 0 0 256 16 1 : tunables 0 0 0 : slabdata 0 0 0
> > dm_uevent 0 0 2608 12 8 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-96 0 0 96 42 1 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-8192 0 0 8192 4 8 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-8 0 0 8 512 1 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-64 0 0 64 64 1 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-512 0 0 512 16 2 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-4096 0 0 4096 8 8 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-32 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-256 0 0 256 16 1 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-2048 0 0 2048 16 8 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-192 0 0 192 21 1 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-16 0 0 16 256 1 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-128 0 0 128 32 1 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-1024 0 0 1024 16 4 : tunables 0 0 0 : slabdata 0 0 0
> > bsg_cmd 0 0 312 13 1 : tunables 0 0 0 : slabdata 0 0 0
> >
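For reference, a listing like the one above can be reproduced on a live
system by sorting /proc/slabinfo on its active-object column; a minimal
sketch (the helper name is mine, not something from this thread):

```shell
# sort_slabinfo: order slabinfo-shaped input by active objects
# (column 2), largest first. /proc/slabinfo begins with two
# header lines, which tail skips.
sort_slabinfo() {
    tail -n +3 | sort -k2,2 -rn
}

# On a live system (reading slabinfo usually requires root):
#   sort_slabinfo < /proc/slabinfo | head -n 20
```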
> > > Could you also take few snapshots of /proc/420/stack to see what kswapd
> > > is doing.
> >
> > Uhm, there is no such entry in proc. Guess I need to enable some kernel
> > option and recompile?
>
> Yes, you need CONFIG_STACKTRACE. But you can get a similar information
> by sysrq+t
Ok, here it is (4 outputs at random time during dd):
<6>kswapd0 R running task 0 420 2 0x00000000
<4> 0000000000000001 0000000000000001 ffff88011b3bdc80 0000000000000000
<4> 0000000000000000 0000000000000001 ffff88011b3bde70 ffff8800d0ab1468
<4> ffffffffffffff96 ffffffff81059391 0000000000000010 0000000000000246
<4>Call Trace:
<4> [<ffffffff81059391>] ? down_read_trylock+0x11/0x20
<4> [<ffffffff810da7ad>] ? grab_super_passive+0x1d/0xb0
<4> [<ffffffff810d9ced>] ? drop_super+0xd/0x20
<4> [<ffffffff810da87c>] ? prune_super+0x3c/0x1c0
<4> [<ffffffff810a15c4>] ? shrink_slab+0x74/0x1f0
<4> [<ffffffff810a3d6a>] ? kswapd+0x5fa/0x890
<4> [<ffffffff81054c30>] ? wake_up_bit+0x40/0x40
<4> [<ffffffff810a3770>] ? shrink_zone+0x500/0x500
<4> [<ffffffff810a3770>] ? shrink_zone+0x500/0x500
<4> [<ffffffff81054796>] ? kthread+0x96/0xa0
<4> [<ffffffff813e2cf4>] ? kernel_thread_helper+0x4/0x10
<4> [<ffffffff81054700>] ? kthread_worker_fn+0x120/0x120
<4> [<ffffffff813e2cf0>] ? gs_change+0xb/0xb
<6>kswapd0 R running task 0 420 2 0x00000000
<4> ffff88011aeff630 000000000000003f ffffffff811c9601 ffff880118bb7cc0
<4> 0000000000000040 0000000000001443 ffff880117cc4000 0000000000000001
<4> 0000000000000000 ffff88011b3bde70 0000000000000000 ffffffffa00f85bf
<4>Call Trace:
<4> [<ffffffff811c9601>] ? radix_tree_gang_lookup_tag+0x81/0xf0
<4> [<ffffffffa00f85bf>] ? xfs_perag_get_tag+0x1f/0x40 [xfs]
<4> [<ffffffff810da7ad>] ? grab_super_passive+0x1d/0xb0
<4> [<ffffffff810d9c48>] ? put_super+0x18/0x20
<4> [<ffffffff810da87c>] ? prune_super+0x3c/0x1c0
<4> [<ffffffff810a15e0>] ? shrink_slab+0x90/0x1f0
<4> [<ffffffff810a3d6a>] ? kswapd+0x5fa/0x890
<4> [<ffffffff81054c30>] ? wake_up_bit+0x40/0x40
<4> [<ffffffff810a3770>] ? shrink_zone+0x500/0x500
<4> [<ffffffff810a3770>] ? shrink_zone+0x500/0x500
<4> [<ffffffff81054796>] ? kthread+0x96/0xa0
<4> [<ffffffff813e2cf4>] ? kernel_thread_helper+0x4/0x10
<4> [<ffffffff81054700>] ? kthread_worker_fn+0x120/0x120
<4> [<ffffffff813e2cf0>] ? gs_change+0xb/0xb
<6>kswapd0 R running task 0 420 2 0x00000000
<4> ffff88011aeff630 000000000000003f ffffffff811c9601 ffff88011b0646c0
<4> 0000000000000040 0000000000000d3e ffff880117cc4000 0000000000000005
<4> 0000000000000000 ffff88011b3bde70 0000000000000000 ffffffffa00f85bf
<4>Call Trace:
<4> [<ffffffff811c9601>] ? radix_tree_gang_lookup_tag+0x81/0xf0
<4> [<ffffffffa00f85bf>] ? xfs_perag_get_tag+0x1f/0x40 [xfs]
<4> [<ffffffff810da7d5>] ? grab_super_passive+0x45/0xb0
<4> [<ffffffff810da87c>] ? prune_super+0x3c/0x1c0
<4> [<ffffffff810a15e0>] ? shrink_slab+0x90/0x1f0
<4> [<ffffffff810a3d6a>] ? kswapd+0x5fa/0x890
<4> [<ffffffff81054c30>] ? wake_up_bit+0x40/0x40
<4> [<ffffffff810a3770>] ? shrink_zone+0x500/0x500
<4> [<ffffffff810a3770>] ? shrink_zone+0x500/0x500
<4> [<ffffffff81054796>] ? kthread+0x96/0xa0
<4> [<ffffffff813e2cf4>] ? kernel_thread_helper+0x4/0x10
<4> [<ffffffff81054700>] ? kthread_worker_fn+0x120/0x120
<4> [<ffffffff813e2cf0>] ? gs_change+0xb/0xb
<6>kswapd0 R running task 0 420 2 0x00000000
<4> 000000000003337a 000000000001747d 0000000000000000 0000000000000004
<4> ffffffff816492c0 0000000000000000 ffff880117cc4000 0000000000000000
<4> 0000000000000000 ffff88011b3bde70 0000000000000000 ffffffffa00f85bf
<4>Call Trace:
<4> [<ffffffffa00f85bf>] ? xfs_perag_get_tag+0x1f/0x40 [xfs]
<4> [<ffffffff810da7d5>] ? grab_super_passive+0x45/0xb0
<4> [<ffffffff810da87c>] ? prune_super+0x3c/0x1c0
<4> [<ffffffff810a15e0>] ? shrink_slab+0x90/0x1f0
<4> [<ffffffff810a3d6a>] ? kswapd+0x5fa/0x890
<4> [<ffffffff81054c30>] ? wake_up_bit+0x40/0x40
<4> [<ffffffff810a3770>] ? shrink_zone+0x500/0x500
<4> [<ffffffff810a3770>] ? shrink_zone+0x500/0x500
<4> [<ffffffff81054796>] ? kthread+0x96/0xa0
<4> [<ffffffff813e2cf4>] ? kernel_thread_helper+0x4/0x10
<4> [<ffffffff81054700>] ? kthread_worker_fn+0x120/0x120
<4> [<ffffffff813e2cf0>] ? gs_change+0xb/0xb
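With CONFIG_STACKTRACE enabled, snapshots like the four above can also be
collected without sysrq+t by reading /proc/&lt;pid&gt;/stack in a loop; a sketch
(the function name is mine, and PID 420 is taken from the earlier top
output):

```shell
# sample_stacks PID COUNT: print COUNT kernel stack snapshots of
# PID, one second apart. Reading /proc/PID/stack typically
# requires root.
sample_stacks() {
    pid=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "--- sample $i ---"
        cat "/proc/$pid/stack"
        sleep 1
        i=$((i + 1))
    done
}

# e.g.: sample_stacks 420 4
```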
> > This will reset uptime, and the problem won't reappear for another 5-10 days..
>
> Then don't do that ;)
>
> > > > P.S.: The message above was written couple of days ago. Now I'm at 10
> > > > days uptime, and this is the result as of today
> > > > PID USER S %CPU %MEM TIME+ SWAP COMMAND
> > > > 420 root R 93 0.0 110:48.48 0 kswapd0
> > > > 30085 nowhere D 42 0.2 0:04.36 10m dd
> > > >
> > > > PPS: Please CC me.
>
* Re: Kswapd in 3.2.0-rc5 is a CPU hog
2011-12-21 10:24 ` Michal Hocko
2011-12-21 10:52 ` nowhere
@ 2011-12-21 14:06 ` Alex Elder
2011-12-21 14:19 ` nowhere
1 sibling, 1 reply; 32+ messages in thread
From: Alex Elder @ 2011-12-21 14:06 UTC (permalink / raw)
To: Michal Hocko
Cc: nowhere, linux-kernel, linux-mm, Christoph Hellwig, Dave Chinner,
xfs, Ben Myers, Alex Elder
On Wed, 2011-12-21 at 11:24 +0100, Michal Hocko wrote:
> [Let's cc also some fs and xfs people]
>
> On Wed 21-12-11 14:15:21, nowhere wrote:
> > On Wed, 21/12/2011 at 10:52 +0100, Michal Hocko wrote:
> > > [Let's CC linux-mm]
> > >
> > > On Wed 21-12-11 07:10:36, Nikolay S. wrote:
> > > > Hello,
> > > >
> > > > I'm using 3.2-rc5 on a machine, which atm does almost nothing except
> > > > file system operations and network i/o (i.e. file server). And there is
> > > > a problem with kswapd.
> > >
> > > What kind of filesystem do you use?
> >
> > Well, that is XFS.
> > I have a large volume with ~200000 files, and a periodic job which
> > checks all files' timestamps once every 30 minutes and takes action
> > if a timestamp has changed.
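A periodic timestamp check like the one described can be sketched with
find's -mmin predicate; the function name and directory below are
hypothetical, not the poster's actual job:

```shell
# recently_changed DIR MINUTES: list regular files under DIR whose
# modification time is within the last MINUTES minutes.
recently_changed() {
    find "$1" -type f -mmin "-$2" -print
}

# e.g. run every 30 minutes from cron:
#   recently_changed /srv/export 30
```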
>
> Is this the first time you are seeing this? I am not familiar with
> XFS at all, but with that number of files the dcache shrinker might
> be really busy...
>
> > > > I'm playing with dd:
> > > > dd if=/some/big/file of=/dev/null bs=8M
Just to be clear, is it always the same file you are
operating on in all of the cases you show here? If
so, is there anything else happening to that file in
the meantime? Is it being written to at all?
Is the filesystem in question (the one with the
big file) also being NFS served? Are there any
other filesystems being served, and if so, are
they also XFS?
I presume you are doing nothing special with respect
to the file serving, i.e. in all these cases and
timings, etc. the system has been and continues to
act as an NFS server.
Is the filesystem anywhere near full?
I don't know if these answers change much but it
might help to complete the picture.
Thanks.
-Alex
> > > > I.e. I'm filling page cache.
> > > >
> > > > So when the machine is just rebooted, kswapd during this operation is
> > > > almost idle, just 5-8 percent according to top.
> > > >
> > > > After ~5 days of uptime (5 days, 2:10), the same operation demands ~70%
> > > > for kswapd:
> > > >
> > > > PID USER S %CPU %MEM TIME+ SWAP COMMAND
> > > > 420 root R 70 0.0 22:09.60 0 kswapd0
> > > > 17717 nowhere D 27 0.2 0:01.81 10m dd
> > > >
> > > > In fact, kswapd cpu usage on this operation steadily increases over
> > > > time.
> > > >
> > > > Also read performance degrades over time. After reboot:
> > > > dd if=/some/big/file of=/dev/null bs=8M
> > > > 1019+1 records in
> > > > 1019+1 records out
> > > > 8553494018 bytes (8.6 GB) copied, 16.211 s, 528 MB/s
> > > >
> > > > After ~5 days uptime:
> > > > dd if=/some/big/file of=/dev/null bs=8M
> > > > 1019+1 records in
> > > > 1019+1 records out
> > > > 8553494018 bytes (8.6 GB) copied, 29.0507 s, 294 MB/s
> > > >
> > > > Whereas raw disk sequential read performance stays the same:
> > > > dd if=/some/big/file of=/dev/null bs=8M iflag=direct
> > > > 1019+1 records in
> > > > 1019+1 records out
> > > > 8553494018 bytes (8.6 GB) copied, 14.7286 s, 581 MB/s
> > > >
> > > > Also, after dropping caches the situation improves somewhat, but not
> > > > to the state of a freshly restarted system:
> > > > PID USER S %CPU %MEM TIME+ SWAP COMMAND
> > > > 420 root S 39 0.0 23:31.17 0 kswapd0
> > > > 19829 nowhere D 24 0.2 0:02.72 7764 dd
> > > >
> > > > perf shows:
> > > >
> > > > 31.24% kswapd0 [kernel.kallsyms] [k] _raw_spin_lock
> > > > 26.19% kswapd0 [kernel.kallsyms] [k] shrink_slab
> > > > 16.28% kswapd0 [kernel.kallsyms] [k] prune_super
> > > > 6.55% kswapd0 [kernel.kallsyms] [k] grab_super_passive
> > > > 5.35% kswapd0 [kernel.kallsyms] [k] down_read_trylock
> > > > 4.03% kswapd0 [kernel.kallsyms] [k] up_read
> > > > 2.31% kswapd0 [kernel.kallsyms] [k] put_super
> > > > 1.81% kswapd0 [kernel.kallsyms] [k] drop_super
> > > > 0.99% kswapd0 [kernel.kallsyms] [k] __put_super
> > > > 0.25% kswapd0 [kernel.kallsyms] [k] __isolate_lru_page
> > > > 0.23% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk
> > > > 0.19% kswapd0 [r8169] [k] rtl8169_interrupt
> > > > 0.15% kswapd0 [kernel.kallsyms] [k] twa_interrupt
> > >
> > > Quite a lot of time spent shrinking slab (dcache I guess) and a lot of
> > > spin lock contention.
> >
> > This is slabinfo, sorted by num objects:
> >
> > xfs_inode 192941 193205 960 17 4 : tunables 0 0 0 : slabdata 11365 11365 0
> > dentry 118818 118818 192 21 1 : tunables 0 0 0 : slabdata 5658 5658 0
> > kmalloc-256 107920 107920 256 16 1 : tunables 0 0 0 : slabdata 6745 6745 0
> > kmalloc-64 59912 102656 64 64 1 : tunables 0 0 0 : slabdata 1604 1604 0
> > radix_tree_node 30618 33474 568 14 2 : tunables 0 0 0 : slabdata 2391 2391 0
> > kmalloc-96 27092 41202 96 42 1 : tunables 0 0 0 : slabdata 981 981 0
> > buffer_head 24892 63843 104 39 1 : tunables 0 0 0 : slabdata 1637 1637 0
> > kmalloc-192 23332 34503 192 21 1 : tunables 0 0 0 : slabdata 1643 1643 0
> > sysfs_dir_cache 17444 17444 144 28 1 : tunables 0 0 0 : slabdata 623 623 0
> > arp_cache 12863 14796 320 12 1 : tunables 0 0 0 : slabdata 1233 1233 0
> > kmalloc-512 11051 11440 512 16 2 : tunables 0 0 0 : slabdata 715 715 0
> > kmalloc-128 10611 13152 128 32 1 : tunables 0 0 0 : slabdata 411 411 0
> > ext4_inode_cache 9660 18018 880 18 4 : tunables 0 0 0 : slabdata 1001 1001 0
> > kmalloc-8 8704 8704 8 512 1 : tunables 0 0 0 : slabdata 17 17 0
> > ext4_io_page 6912 6912 16 256 1 : tunables 0 0 0 : slabdata 27 27 0
> > anon_vma_chain 6701 10880 48 85 1 : tunables 0 0 0 : slabdata 128 128 0
> > Acpi-Namespace 6611 8058 40 102 1 : tunables 0 0 0 : slabdata 79 79 0
> > fsnotify_event_holder 6290 6970 24 170 1 : tunables 0 0 0 : slabdata 41 41 0
> > kmalloc-1024 5813 5888 1024 16 4 : tunables 0 0 0 : slabdata 368 368 0
> > vm_area_struct 5664 5664 168 24 1 : tunables 0 0 0 : slabdata 236 236 0
> > reiser_inode_cache 3992 5198 704 23 4 : tunables 0 0 0 : slabdata 226 226 0
> > Acpi-ParseExt 3808 3808 72 56 1 : tunables 0 0 0 : slabdata 68 68 0
> > kmalloc-2048 3587 3888 2048 16 8 : tunables 0 0 0 : slabdata 243 243 0
> > proc_inode_cache 3498 3510 624 13 2 : tunables 0 0 0 : slabdata 270 270 0
> > anon_vma 3380 3640 72 56 1 : tunables 0 0 0 : slabdata 65 65 0
> > kmalloc-16 3072 3072 16 256 1 : tunables 0 0 0 : slabdata 12 12 0
> > inode_cache 3024 3024 560 14 2 : tunables 0 0 0 : slabdata 216 216 0
> > ext4_allocation_context 3000 3000 136 30 1 : tunables 0 0 0 : slabdata 100 100 0
> > nf_conntrack_ffffffff81776d40 2910 3549 312 13 1 : tunables 0 0 0 : slabdata 273 273 0
> > kmalloc-4096 2792 3136 4096 8 8 : tunables 0 0 0 : slabdata 392 392 0
> > ext4_free_data 2701 2701 56 73 1 : tunables 0 0 0 : slabdata 37 37 0
> > pid_namespace 2130 2130 2112 15 8 : tunables 0 0 0 : slabdata 142 142 0
> > mqueue_inode_cache 2124 2124 896 18 4 : tunables 0 0 0 : slabdata 118 118 0
> > jbd2_revoke_record 1664 1664 32 128 1 : tunables 0 0 0 : slabdata 13 13 0
> > kmalloc-32 1434 3072 32 128 1 : tunables 0 0 0 : slabdata 24 24 0
> > shmem_inode_cache 1400 1521 624 13 2 : tunables 0 0 0 : slabdata 117 117 0
> > xfs_ili 1139 1260 216 18 1 : tunables 0 0 0 : slabdata 70 70 0
> > nfsd4_stateids 1131 1496 120 34 1 : tunables 0 0 0 : slabdata 44 44 0
> > idr_layer_cache 900 900 544 15 2 : tunables 0 0 0 : slabdata 60 60 0
> > jbd2_journal_head 742 1008 112 36 1 : tunables 0 0 0 : slabdata 28 28 0
> > fsnotify_event 648 648 112 36 1 : tunables 0 0 0 : slabdata 18 18 0
> > sock_inode_cache 577 624 640 12 2 : tunables 0 0 0 : slabdata 52 52 0
> > tw_sock_TCP 504 504 192 21 1 : tunables 0 0 0 : slabdata 24 24 0
> > TCP 356 414 1728 18 8 : tunables 0 0 0 : slabdata 23 23 0
> > RAW 342 342 832 19 4 : tunables 0 0 0 : slabdata 18 18 0
> > jbd2_journal_handle 340 340 24 170 1 : tunables 0 0 0 : slabdata 2 2 0
> > blkdev_requests 322 322 344 23 2 : tunables 0 0 0 : slabdata 14 14 0
> > task_struct 293 357 1504 21 8 : tunables 0 0 0 : slabdata 17 17 0
> > UDP 285 285 832 19 4 : tunables 0 0 0 : slabdata 15 15 0
> > files_cache 276 276 704 23 4 : tunables 0 0 0 : slabdata 12 12 0
> > nfsd4_openowners 220 320 392 20 2 : tunables 0 0 0 : slabdata 16 16 0
> > mm_struct 216 216 896 18 4 : tunables 0 0 0 : slabdata 12 12 0
> > sighand_cache 199 225 2112 15 8 : tunables 0 0 0 : slabdata 15 15 0
> > nfsd4_delegations 198 198 368 22 2 : tunables 0 0 0 : slabdata 9 9 0
> > kmem_cache_node 192 192 64 64 1 : tunables 0 0 0 : slabdata 3 3 0
> > xfs_buf_item 162 162 224 18 1 : tunables 0 0 0 : slabdata 9 9 0
> > ip_fib_trie 146 146 56 73 1 : tunables 0 0 0 : slabdata 2 2 0
> > ext4_io_end 140 154 1128 14 4 : tunables 0 0 0 : slabdata 11 11 0
> > dnotify_mark 120 120 136 30 1 : tunables 0 0 0 : slabdata 4 4 0
> > TCPv6 104 119 1856 17 8 : tunables 0 0 0 : slabdata 7 7 0
> > cfq_queue 102 102 232 17 1 : tunables 0 0 0 : slabdata 6 6 0
> > Acpi-State 102 102 80 51 1 : tunables 0 0 0 : slabdata 2 2 0
> > sigqueue 100 100 160 25 1 : tunables 0 0 0 : slabdata 4 4 0
> > xfs_efd_item 80 80 400 20 2 : tunables 0 0 0 : slabdata 4 4 0
> > tw_sock_TCPv6 64 144 256 16 1 : tunables 0 0 0 : slabdata 9 9 0
> > bdev_cache 57 57 832 19 4 : tunables 0 0 0 : slabdata 3 3 0
> > blkdev_queue 54 54 1744 18 8 : tunables 0 0 0 : slabdata 3 3 0
> > net_namespace 52 52 2432 13 8 : tunables 0 0 0 : slabdata 4 4 0
> > kmalloc-8192 52 52 8192 4 8 : tunables 0 0 0 : slabdata 13 13 0
> > kmem_cache 42 42 192 21 1 : tunables 0 0 0 : slabdata 2 2 0
> > xfs_log_ticket 40 40 200 20 1 : tunables 0 0 0 : slabdata 2 2 0
> > xfs_btree_cur 38 38 208 19 1 : tunables 0 0 0 : slabdata 2 2 0
> > rpc_inode_cache 38 38 832 19 4 : tunables 0 0 0 : slabdata 2 2 0
> > nf_conntrack_expect 34 34 240 17 1 : tunables 0 0 0 : slabdata 2 2 0
> > xfs_da_state 32 32 488 16 2 : tunables 0 0 0 : slabdata 2 2 0
> > UDPv6 32 32 1024 16 4 : tunables 0 0 0 : slabdata 2 2 0
> > xfs_trans 28 28 280 14 1 : tunables 0 0 0 : slabdata 2 2 0
> > taskstats 24 24 328 12 1 : tunables 0 0 0 : slabdata 2 2 0
> > dio 24 24 640 12 2 : tunables 0 0 0 : slabdata 2 2 0
> > posix_timers_cache 23 23 176 23 1 : tunables 0 0 0 : slabdata 1 1 0
> > hugetlbfs_inode_cache 14 14 560 14 2 : tunables 0 0 0 : slabdata 1 1 0
> > xfrm_dst_cache 0 0 384 21 2 : tunables 0 0 0 : slabdata 0 0 0
> > user_namespace 0 0 1072 15 4 : tunables 0 0 0 : slabdata 0 0 0
> > UDPLITEv6 0 0 1024 16 4 : tunables 0 0 0 : slabdata 0 0 0
> > UDP-Lite 0 0 832 19 4 : tunables 0 0 0 : slabdata 0 0 0
> > kcopyd_job 0 0 3240 10 8 : tunables 0 0 0 : slabdata 0 0 0
> > flow_cache 0 0 104 39 1 : tunables 0 0 0 : slabdata 0 0 0
> > ext2_xattr 0 0 88 46 1 : tunables 0 0 0 : slabdata 0 0 0
> > ext2_inode_cache 0 0 752 21 4 : tunables 0 0 0 : slabdata 0 0 0
> > dquot 0 0 256 16 1 : tunables 0 0 0 : slabdata 0 0 0
> > dm_uevent 0 0 2608 12 8 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-96 0 0 96 42 1 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-8192 0 0 8192 4 8 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-8 0 0 8 512 1 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-64 0 0 64 64 1 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-512 0 0 512 16 2 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-4096 0 0 4096 8 8 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-32 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-256 0 0 256 16 1 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-2048 0 0 2048 16 8 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-192 0 0 192 21 1 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-16 0 0 16 256 1 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-128 0 0 128 32 1 : tunables 0 0 0 : slabdata 0 0 0
> > dma-kmalloc-1024 0 0 1024 16 4 : tunables 0 0 0 : slabdata 0 0 0
> > bsg_cmd 0 0 312 13 1 : tunables 0 0 0 : slabdata 0 0 0
> >
> > > Could you also take few snapshots of /proc/420/stack to see what kswapd
> > > is doing.
> >
> > Uhm, there is no such entry in proc. Guess I need to enable some kernel
> > option and recompile?
>
> Yes, you need CONFIG_STACKTRACE. But you can get a similar information
> by sysrq+t
>
> > This will reset uptime, and the problem won't reappear for another 5-10 days..
>
> Then don't do that ;)
>
> > > > P.S.: The message above was written couple of days ago. Now I'm at 10
> > > > days uptime, and this is the result as of today
> > > > PID USER S %CPU %MEM TIME+ SWAP COMMAND
> > > > 420 root R 93 0.0 110:48.48 0 kswapd0
> > > > 30085 nowhere D 42 0.2 0:04.36 10m dd
> > > >
> > > > PPS: Please CC me.
>
* Re: Kswapd in 3.2.0-rc5 is a CPU hog
2011-12-21 14:06 ` Alex Elder
@ 2011-12-21 14:19 ` nowhere
0 siblings, 0 replies; 32+ messages in thread
From: nowhere @ 2011-12-21 14:19 UTC (permalink / raw)
To: elder
Cc: Michal Hocko, linux-kernel, linux-mm, Christoph Hellwig,
Dave Chinner, xfs, Ben Myers, Alex Elder
On Wed, 21/12/2011 at 08:06 -0600, Alex Elder wrote:
> On Wed, 2011-12-21 at 11:24 +0100, Michal Hocko wrote:
> > [Let's cc also some fs and xfs people]
> >
> > On Wed 21-12-11 14:15:21, nowhere wrote:
> > > On Wed, 21/12/2011 at 10:52 +0100, Michal Hocko wrote:
> > > > [Let's CC linux-mm]
> > > >
> > > > On Wed 21-12-11 07:10:36, Nikolay S. wrote:
> > > > > Hello,
> > > > >
> > > > > I'm using 3.2-rc5 on a machine, which atm does almost nothing except
> > > > > file system operations and network i/o (i.e. file server). And there is
> > > > > a problem with kswapd.
> > > >
> > > > What kind of filesystem do you use?
> > >
> > > Well, that is XFS.
> > > I have a large volume with ~200000 files, and a periodic job which
> > > checks all files' timestamps once every 30 minutes and takes action
> > > if a timestamp has changed.
> >
> > Is this the first time you are seeing this? I am not familiar with
> > XFS at all, but with that number of files the dcache shrinker might
> > be really busy...
> >
> > > > > I'm playing with dd:
> > > > > dd if=/some/big/file of=/dev/null bs=8M
>
>
> Just to be clear, is it always the same file you are
> operating on in all of the cases you show here? If
> so, is there anything else happening to that file in
> the meantime? Is it being written to at all?
No, this file is not modified. And yes, this is always the same file.
> Is the filesystem in question (the one with the
> big file) also being NFS served? Are there any
> other filesystems being served, and if so, are
> they also XFS?
Yes, the FS is exported, but nothing happens over NFS at the time of
dd'ing. There is also an ext4 FS which is exported, but it is idle as
well.
> I presume you are doing nothing special with respect
> to the file serving, i.e. in all these cases and
> timings, etc. the system has been and continues to
> act as an NFS server.
Yes, that is correct. NFS + a little FTP + a little Samba, in a
read-only fashion.
> Is the filesystem anywhere near full?
The FS in question is nearly full. There are ~10G free out of 6.4TB.
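A free-space figure like that is what df reports for the mount; a small
sketch (the helper name is mine, and near-full filesystems can make
allocator work noticeably harder, which may matter here):

```shell
# avail_kb PATH: available space, in 1K blocks, on the filesystem
# holding PATH (POSIX df -P output: header line, then one data
# row whose 4th column is "Available").
avail_kb() {
    df -P "$1" | awk 'NR == 2 { print $4 }'
}

# e.g.: avail_kb /some/big/file
```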
>
> I don't know if these answers change much but it
> might help to complete the picture.
>
> Thanks.
>
> -Alex
> > > > > I.e. I'm filling page cache.
> > > > >
> > > > > So when the machine is just rebooted, kswapd during this operation is
> > > > > almost idle, just 5-8 percent according to top.
> > > > >
> > > > > After ~5 days of uptime (5 days, 2:10), the same operation demands ~70%
> > > > > for kswapd:
> > > > >
> > > > > PID USER S %CPU %MEM TIME+ SWAP COMMAND
> > > > > 420 root R 70 0.0 22:09.60 0 kswapd0
> > > > > 17717 nowhere D 27 0.2 0:01.81 10m dd
> > > > >
> > > > > In fact, kswapd cpu usage on this operation steadily increases over
> > > > > time.
> > > > >
> > > > > Also read performance degrades over time. After reboot:
> > > > > dd if=/some/big/file of=/dev/null bs=8M
> > > > > 1019+1 records in
> > > > > 1019+1 records out
> > > > > 8553494018 bytes (8.6 GB) copied, 16.211 s, 528 MB/s
> > > > >
> > > > > After ~5 days uptime:
> > > > > dd if=/some/big/file of=/dev/null bs=8M
> > > > > 1019+1 records in
> > > > > 1019+1 records out
> > > > > 8553494018 bytes (8.6 GB) copied, 29.0507 s, 294 MB/s
> > > > >
> > > > > Whereas raw disk sequential read performance stays the same:
> > > > > dd if=/some/big/file of=/dev/null bs=8M iflag=direct
> > > > > 1019+1 records in
> > > > > 1019+1 records out
> > > > > 8553494018 bytes (8.6 GB) copied, 14.7286 s, 581 MB/s
> > > > >
> > > > > Also after dropping caches, situation somehow improves, but not to the
> > > > > state of freshly restarted system:
> > > > > PID USER S %CPU %MEM TIME+ SWAP COMMAND
> > > > > 420 root S 39 0.0 23:31.17 0 kswapd0
> > > > > 19829 nowhere D 24 0.2 0:02.72 7764 dd
> > > > >
> > > > > perf shows:
> > > > >
> > > > > 31.24% kswapd0 [kernel.kallsyms] [k] _raw_spin_lock
> > > > > 26.19% kswapd0 [kernel.kallsyms] [k] shrink_slab
> > > > > 16.28% kswapd0 [kernel.kallsyms] [k] prune_super
> > > > > 6.55% kswapd0 [kernel.kallsyms] [k] grab_super_passive
> > > > > 5.35% kswapd0 [kernel.kallsyms] [k] down_read_trylock
> > > > > 4.03% kswapd0 [kernel.kallsyms] [k] up_read
> > > > > 2.31% kswapd0 [kernel.kallsyms] [k] put_super
> > > > > 1.81% kswapd0 [kernel.kallsyms] [k] drop_super
> > > > > 0.99% kswapd0 [kernel.kallsyms] [k] __put_super
> > > > > 0.25% kswapd0 [kernel.kallsyms] [k] __isolate_lru_page
> > > > > 0.23% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk
> > > > > 0.19% kswapd0 [r8169] [k] rtl8169_interrupt
> > > > > 0.15% kswapd0 [kernel.kallsyms] [k] twa_interrupt
> > > >
> > > > Quite a lot of time spent shrinking slab (dcache I guess) and a lot of
> > > > spin lock contention.
> > >
> > > This is slabinfo, sorted by num objects:
> > >
> > > xfs_inode 192941 193205 960 17 4 : tunables 0 0 0 : slabdata 11365 11365 0
> > > dentry 118818 118818 192 21 1 : tunables 0 0 0 : slabdata 5658 5658 0
> > > kmalloc-256 107920 107920 256 16 1 : tunables 0 0 0 : slabdata 6745 6745 0
> > > kmalloc-64 59912 102656 64 64 1 : tunables 0 0 0 : slabdata 1604 1604 0
> > > radix_tree_node 30618 33474 568 14 2 : tunables 0 0 0 : slabdata 2391 2391 0
> > > kmalloc-96 27092 41202 96 42 1 : tunables 0 0 0 : slabdata 981 981 0
> > > buffer_head 24892 63843 104 39 1 : tunables 0 0 0 : slabdata 1637 1637 0
> > > kmalloc-192 23332 34503 192 21 1 : tunables 0 0 0 : slabdata 1643 1643 0
> > > sysfs_dir_cache 17444 17444 144 28 1 : tunables 0 0 0 : slabdata 623 623 0
> > > arp_cache 12863 14796 320 12 1 : tunables 0 0 0 : slabdata 1233 1233 0
> > > kmalloc-512 11051 11440 512 16 2 : tunables 0 0 0 : slabdata 715 715 0
> > > kmalloc-128 10611 13152 128 32 1 : tunables 0 0 0 : slabdata 411 411 0
> > > ext4_inode_cache 9660 18018 880 18 4 : tunables 0 0 0 : slabdata 1001 1001 0
> > > kmalloc-8 8704 8704 8 512 1 : tunables 0 0 0 : slabdata 17 17 0
> > > ext4_io_page 6912 6912 16 256 1 : tunables 0 0 0 : slabdata 27 27 0
> > > anon_vma_chain 6701 10880 48 85 1 : tunables 0 0 0 : slabdata 128 128 0
> > > Acpi-Namespace 6611 8058 40 102 1 : tunables 0 0 0 : slabdata 79 79 0
> > > fsnotify_event_holder 6290 6970 24 170 1 : tunables 0 0 0 : slabdata 41 41 0
> > > kmalloc-1024 5813 5888 1024 16 4 : tunables 0 0 0 : slabdata 368 368 0
> > > vm_area_struct 5664 5664 168 24 1 : tunables 0 0 0 : slabdata 236 236 0
> > > reiser_inode_cache 3992 5198 704 23 4 : tunables 0 0 0 : slabdata 226 226 0
> > > Acpi-ParseExt 3808 3808 72 56 1 : tunables 0 0 0 : slabdata 68 68 0
> > > kmalloc-2048 3587 3888 2048 16 8 : tunables 0 0 0 : slabdata 243 243 0
> > > proc_inode_cache 3498 3510 624 13 2 : tunables 0 0 0 : slabdata 270 270 0
> > > anon_vma 3380 3640 72 56 1 : tunables 0 0 0 : slabdata 65 65 0
> > > kmalloc-16 3072 3072 16 256 1 : tunables 0 0 0 : slabdata 12 12 0
> > > inode_cache 3024 3024 560 14 2 : tunables 0 0 0 : slabdata 216 216 0
> > > ext4_allocation_context 3000 3000 136 30 1 : tunables 0 0 0 : slabdata 100 100 0
> > > nf_conntrack_ffffffff81776d40 2910 3549 312 13 1 : tunables 0 0 0 : slabdata 273 273 0
> > > kmalloc-4096 2792 3136 4096 8 8 : tunables 0 0 0 : slabdata 392 392 0
> > > ext4_free_data 2701 2701 56 73 1 : tunables 0 0 0 : slabdata 37 37 0
> > > pid_namespace 2130 2130 2112 15 8 : tunables 0 0 0 : slabdata 142 142 0
> > > mqueue_inode_cache 2124 2124 896 18 4 : tunables 0 0 0 : slabdata 118 118 0
> > > jbd2_revoke_record 1664 1664 32 128 1 : tunables 0 0 0 : slabdata 13 13 0
> > > kmalloc-32 1434 3072 32 128 1 : tunables 0 0 0 : slabdata 24 24 0
> > > shmem_inode_cache 1400 1521 624 13 2 : tunables 0 0 0 : slabdata 117 117 0
> > > xfs_ili 1139 1260 216 18 1 : tunables 0 0 0 : slabdata 70 70 0
> > > nfsd4_stateids 1131 1496 120 34 1 : tunables 0 0 0 : slabdata 44 44 0
> > > idr_layer_cache 900 900 544 15 2 : tunables 0 0 0 : slabdata 60 60 0
> > > jbd2_journal_head 742 1008 112 36 1 : tunables 0 0 0 : slabdata 28 28 0
> > > fsnotify_event 648 648 112 36 1 : tunables 0 0 0 : slabdata 18 18 0
> > > sock_inode_cache 577 624 640 12 2 : tunables 0 0 0 : slabdata 52 52 0
> > > tw_sock_TCP 504 504 192 21 1 : tunables 0 0 0 : slabdata 24 24 0
> > > TCP 356 414 1728 18 8 : tunables 0 0 0 : slabdata 23 23 0
> > > RAW 342 342 832 19 4 : tunables 0 0 0 : slabdata 18 18 0
> > > jbd2_journal_handle 340 340 24 170 1 : tunables 0 0 0 : slabdata 2 2 0
> > > blkdev_requests 322 322 344 23 2 : tunables 0 0 0 : slabdata 14 14 0
> > > task_struct 293 357 1504 21 8 : tunables 0 0 0 : slabdata 17 17 0
> > > UDP 285 285 832 19 4 : tunables 0 0 0 : slabdata 15 15 0
> > > files_cache 276 276 704 23 4 : tunables 0 0 0 : slabdata 12 12 0
> > > nfsd4_openowners 220 320 392 20 2 : tunables 0 0 0 : slabdata 16 16 0
> > > mm_struct 216 216 896 18 4 : tunables 0 0 0 : slabdata 12 12 0
> > > sighand_cache 199 225 2112 15 8 : tunables 0 0 0 : slabdata 15 15 0
> > > nfsd4_delegations 198 198 368 22 2 : tunables 0 0 0 : slabdata 9 9 0
> > > kmem_cache_node 192 192 64 64 1 : tunables 0 0 0 : slabdata 3 3 0
> > > xfs_buf_item 162 162 224 18 1 : tunables 0 0 0 : slabdata 9 9 0
> > > ip_fib_trie 146 146 56 73 1 : tunables 0 0 0 : slabdata 2 2 0
> > > ext4_io_end 140 154 1128 14 4 : tunables 0 0 0 : slabdata 11 11 0
> > > dnotify_mark 120 120 136 30 1 : tunables 0 0 0 : slabdata 4 4 0
> > > TCPv6 104 119 1856 17 8 : tunables 0 0 0 : slabdata 7 7 0
> > > cfq_queue 102 102 232 17 1 : tunables 0 0 0 : slabdata 6 6 0
> > > Acpi-State 102 102 80 51 1 : tunables 0 0 0 : slabdata 2 2 0
> > > sigqueue 100 100 160 25 1 : tunables 0 0 0 : slabdata 4 4 0
> > > xfs_efd_item 80 80 400 20 2 : tunables 0 0 0 : slabdata 4 4 0
> > > tw_sock_TCPv6 64 144 256 16 1 : tunables 0 0 0 : slabdata 9 9 0
> > > bdev_cache 57 57 832 19 4 : tunables 0 0 0 : slabdata 3 3 0
> > > blkdev_queue 54 54 1744 18 8 : tunables 0 0 0 : slabdata 3 3 0
> > > net_namespace 52 52 2432 13 8 : tunables 0 0 0 : slabdata 4 4 0
> > > kmalloc-8192 52 52 8192 4 8 : tunables 0 0 0 : slabdata 13 13 0
> > > kmem_cache 42 42 192 21 1 : tunables 0 0 0 : slabdata 2 2 0
> > > xfs_log_ticket 40 40 200 20 1 : tunables 0 0 0 : slabdata 2 2 0
> > > xfs_btree_cur 38 38 208 19 1 : tunables 0 0 0 : slabdata 2 2 0
> > > rpc_inode_cache 38 38 832 19 4 : tunables 0 0 0 : slabdata 2 2 0
> > > nf_conntrack_expect 34 34 240 17 1 : tunables 0 0 0 : slabdata 2 2 0
> > > xfs_da_state 32 32 488 16 2 : tunables 0 0 0 : slabdata 2 2 0
> > > UDPv6 32 32 1024 16 4 : tunables 0 0 0 : slabdata 2 2 0
> > > xfs_trans 28 28 280 14 1 : tunables 0 0 0 : slabdata 2 2 0
> > > taskstats 24 24 328 12 1 : tunables 0 0 0 : slabdata 2 2 0
> > > dio 24 24 640 12 2 : tunables 0 0 0 : slabdata 2 2 0
> > > posix_timers_cache 23 23 176 23 1 : tunables 0 0 0 : slabdata 1 1 0
> > > hugetlbfs_inode_cache 14 14 560 14 2 : tunables 0 0 0 : slabdata 1 1 0
> > > xfrm_dst_cache 0 0 384 21 2 : tunables 0 0 0 : slabdata 0 0 0
> > > user_namespace 0 0 1072 15 4 : tunables 0 0 0 : slabdata 0 0 0
> > > UDPLITEv6 0 0 1024 16 4 : tunables 0 0 0 : slabdata 0 0 0
> > > UDP-Lite 0 0 832 19 4 : tunables 0 0 0 : slabdata 0 0 0
> > > kcopyd_job 0 0 3240 10 8 : tunables 0 0 0 : slabdata 0 0 0
> > > flow_cache 0 0 104 39 1 : tunables 0 0 0 : slabdata 0 0 0
> > > ext2_xattr 0 0 88 46 1 : tunables 0 0 0 : slabdata 0 0 0
> > > ext2_inode_cache 0 0 752 21 4 : tunables 0 0 0 : slabdata 0 0 0
> > > dquot 0 0 256 16 1 : tunables 0 0 0 : slabdata 0 0 0
> > > dm_uevent 0 0 2608 12 8 : tunables 0 0 0 : slabdata 0 0 0
> > > dma-kmalloc-96 0 0 96 42 1 : tunables 0 0 0 : slabdata 0 0 0
> > > dma-kmalloc-8192 0 0 8192 4 8 : tunables 0 0 0 : slabdata 0 0 0
> > > dma-kmalloc-8 0 0 8 512 1 : tunables 0 0 0 : slabdata 0 0 0
> > > dma-kmalloc-64 0 0 64 64 1 : tunables 0 0 0 : slabdata 0 0 0
> > > dma-kmalloc-512 0 0 512 16 2 : tunables 0 0 0 : slabdata 0 0 0
> > > dma-kmalloc-4096 0 0 4096 8 8 : tunables 0 0 0 : slabdata 0 0 0
> > > dma-kmalloc-32 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
> > > dma-kmalloc-256 0 0 256 16 1 : tunables 0 0 0 : slabdata 0 0 0
> > > dma-kmalloc-2048 0 0 2048 16 8 : tunables 0 0 0 : slabdata 0 0 0
> > > dma-kmalloc-192 0 0 192 21 1 : tunables 0 0 0 : slabdata 0 0 0
> > > dma-kmalloc-16 0 0 16 256 1 : tunables 0 0 0 : slabdata 0 0 0
> > > dma-kmalloc-128 0 0 128 32 1 : tunables 0 0 0 : slabdata 0 0 0
> > > dma-kmalloc-1024 0 0 1024 16 4 : tunables 0 0 0 : slabdata 0 0 0
> > > bsg_cmd 0 0 312 13 1 : tunables 0 0 0 : slabdata 0 0 0
> > >
> > > > Could you also take few snapshots of /proc/420/stack to see what kswapd
> > > > is doing.
> > >
> > > Uhm, there is no such entry in proc. Guess I need to enable some kernel
> > > option and recompile?
> >
> > Yes, you need CONFIG_STACKTRACE. But you can get a similar information
> > by sysrq+t
> >
> > > This will reset the uptime, and the problem won't show up again for another 5-10 days..
> >
> > Then don't do that ;)
> >
> > > > > P.S.: The message above was written couple of days ago. Now I'm at 10
> > > > > days uptime, and this is the result as of today
> > > > > PID USER S %CPU %MEM TIME+ SWAP COMMAND
> > > > > 420 root R 93 0.0 110:48.48 0 kswapd0
> > > > > 30085 nowhere D 42 0.2 0:04.36 10m dd
> > > > >
> > > > > PPS: Please CC me.
> >
>
>
>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Kswapd in 3.2.0-rc5 is a CPU hog
2011-12-21 9:52 ` Michal Hocko
2011-12-21 10:15 ` nowhere
@ 2011-12-21 22:55 ` Dave Chinner
2011-12-23 9:01 ` nowhere
1 sibling, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2011-12-21 22:55 UTC (permalink / raw)
To: Michal Hocko; +Cc: Nikolay S., linux-kernel, linux-mm
On Wed, Dec 21, 2011 at 10:52:49AM +0100, Michal Hocko wrote:
> [Let's CC linux-mm]
>
> On Wed 21-12-11 07:10:36, Nikolay S. wrote:
> > Hello,
> >
> > I'm using 3.2-rc5 on a machine, which atm does almost nothing except
> > file system operations and network i/o (i.e. file server). And there is
> > a problem with kswapd.
>
> What kind of filesystem do you use?
>
> >
> > I'm playing with dd:
> > dd if=/some/big/file of=/dev/null bs=8M
> >
> > I.e. I'm filling page cache.
> >
> > So when the machine is just rebooted, kswapd during this operation is
> > almost idle, just 5-8 percent according to top.
> >
> > After ~5 days of uptime (5 days, 2:10), the same operation demands ~70%
> > for kswapd:
> >
> > PID USER S %CPU %MEM TIME+ SWAP COMMAND
> > 420 root R 70 0.0 22:09.60 0 kswapd0
> > 17717 nowhere D 27 0.2 0:01.81 10m dd
> >
> > In fact, kswapd cpu usage on this operation steadily increases over
> > time.
> >
> > Also read performance degrades over time. After reboot:
> > dd if=/some/big/file of=/dev/null bs=8M
> > 1019+1 records in
> > 1019+1 records out
> > 8553494018 bytes (8.6 GB) copied, 16.211 s, 528 MB/s
> >
> > After ~5 days uptime:
> > dd if=/some/big/file of=/dev/null bs=8M
> > 1019+1 records in
> > 1019+1 records out
> > 8553494018 bytes (8.6 GB) copied, 29.0507 s, 294 MB/s
> >
> > Whereas raw disk sequential read performance stays the same:
> > dd if=/some/big/file of=/dev/null bs=8M iflag=direct
> > 1019+1 records in
> > 1019+1 records out
> > 8553494018 bytes (8.6 GB) copied, 14.7286 s, 581 MB/s
> >
> > Also after dropping caches, situation somehow improves, but not to the
> > state of freshly restarted system:
> > PID USER S %CPU %MEM TIME+ SWAP COMMAND
> > 420 root S 39 0.0 23:31.17 0 kswapd0
> > 19829 nowhere D 24 0.2 0:02.72 7764 dd
> >
> > perf shows:
> >
> > 31.24% kswapd0 [kernel.kallsyms] [k] _raw_spin_lock
> > 26.19% kswapd0 [kernel.kallsyms] [k] shrink_slab
> > 16.28% kswapd0 [kernel.kallsyms] [k] prune_super
> > 6.55% kswapd0 [kernel.kallsyms] [k] grab_super_passive
> > 5.35% kswapd0 [kernel.kallsyms] [k] down_read_trylock
> > 4.03% kswapd0 [kernel.kallsyms] [k] up_read
> > 2.31% kswapd0 [kernel.kallsyms] [k] put_super
> > 1.81% kswapd0 [kernel.kallsyms] [k] drop_super
> > 0.99% kswapd0 [kernel.kallsyms] [k] __put_super
> > 0.25% kswapd0 [kernel.kallsyms] [k] __isolate_lru_page
> > 0.23% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk
> > 0.19% kswapd0 [r8169] [k] rtl8169_interrupt
> > 0.15% kswapd0 [kernel.kallsyms] [k] twa_interrupt
>
> Quite a lot of time spent shrinking slab (dcache I guess) and a lot of
> spin lock contention.
That's just scanning superblocks, not apparently doing anything
useful like shrinking the dentries or inodes attached to each sb.
In other words, the shrinkers are being called an awful lot and
basically have nothing to do. I'd suspect a problem higher up the
stack, in how shrink_slab is operating or being called.
I'd suggest gathering event traces for mm_shrink_slab_start/
mm_shrink_slab_end to try to see how the shrinkers are being
driven...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Kswapd in 3.2.0-rc5 is a CPU hog
2011-12-21 22:55 ` Dave Chinner
@ 2011-12-23 9:01 ` nowhere
2011-12-23 10:20 ` Dave Chinner
0 siblings, 1 reply; 32+ messages in thread
From: nowhere @ 2011-12-23 9:01 UTC (permalink / raw)
To: Dave Chinner; +Cc: Michal Hocko, linux-kernel, linux-mm
On Thu, 22/12/2011 at 09:55 +1100, Dave Chinner wrote:
> On Wed, Dec 21, 2011 at 10:52:49AM +0100, Michal Hocko wrote:
> > [Let's CC linux-mm]
> >
> > On Wed 21-12-11 07:10:36, Nikolay S. wrote:
> > > Hello,
> > >
> > > I'm using 3.2-rc5 on a machine, which atm does almost nothing except
> > > file system operations and network i/o (i.e. file server). And there is
> > > a problem with kswapd.
> >
> > What kind of filesystem do you use?
> >
> > >
> > > I'm playing with dd:
> > > dd if=/some/big/file of=/dev/null bs=8M
> > >
> > > I.e. I'm filling page cache.
> > >
> > > So when the machine is just rebooted, kswapd during this operation is
> > > almost idle, just 5-8 percent according to top.
> > >
> > > After ~5 days of uptime (5 days, 2:10), the same operation demands ~70%
> > > for kswapd:
> > >
> > > PID USER S %CPU %MEM TIME+ SWAP COMMAND
> > > 420 root R 70 0.0 22:09.60 0 kswapd0
> > > 17717 nowhere D 27 0.2 0:01.81 10m dd
> > >
> > > In fact, kswapd cpu usage on this operation steadily increases over
> > > time.
> > >
> > > Also read performance degrades over time. After reboot:
> > > dd if=/some/big/file of=/dev/null bs=8M
> > > 1019+1 records in
> > > 1019+1 records out
> > > 8553494018 bytes (8.6 GB) copied, 16.211 s, 528 MB/s
> > >
> > > After ~5 days uptime:
> > > dd if=/some/big/file of=/dev/null bs=8M
> > > 1019+1 records in
> > > 1019+1 records out
> > > 8553494018 bytes (8.6 GB) copied, 29.0507 s, 294 MB/s
> > >
> > > Whereas raw disk sequential read performance stays the same:
> > > dd if=/some/big/file of=/dev/null bs=8M iflag=direct
> > > 1019+1 records in
> > > 1019+1 records out
> > > 8553494018 bytes (8.6 GB) copied, 14.7286 s, 581 MB/s
> > >
> > > Also after dropping caches, situation somehow improves, but not to the
> > > state of freshly restarted system:
> > > PID USER S %CPU %MEM TIME+ SWAP COMMAND
> > > 420 root S 39 0.0 23:31.17 0 kswapd0
> > > 19829 nowhere D 24 0.2 0:02.72 7764 dd
> > >
> > > perf shows:
> > >
> > > 31.24% kswapd0 [kernel.kallsyms] [k] _raw_spin_lock
> > > 26.19% kswapd0 [kernel.kallsyms] [k] shrink_slab
> > > 16.28% kswapd0 [kernel.kallsyms] [k] prune_super
> > > 6.55% kswapd0 [kernel.kallsyms] [k] grab_super_passive
> > > 5.35% kswapd0 [kernel.kallsyms] [k] down_read_trylock
> > > 4.03% kswapd0 [kernel.kallsyms] [k] up_read
> > > 2.31% kswapd0 [kernel.kallsyms] [k] put_super
> > > 1.81% kswapd0 [kernel.kallsyms] [k] drop_super
> > > 0.99% kswapd0 [kernel.kallsyms] [k] __put_super
> > > 0.25% kswapd0 [kernel.kallsyms] [k] __isolate_lru_page
> > > 0.23% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk
> > > 0.19% kswapd0 [r8169] [k] rtl8169_interrupt
> > > 0.15% kswapd0 [kernel.kallsyms] [k] twa_interrupt
> >
> > Quite a lot of time spent shrinking slab (dcache I guess) and a lot of
> > spin lock contention.
>
> That's just scanning superblocks, not apparently doing anything
> useful like shrinking dentries or inodes attached to each sb. i.e.
> the shrinkers are being called an awful lot and basically have
> nothing to do. I'd be suspecting a problem higher up in the stack to
> do with how shrink_slab is operating or being called.
>
> I'd suggest gathering event traces for mm_shrink_slab_start/
> mm_shrink_slab_end to try to see how the shrinkers are being
> driven...
>
> Cheers,
>
> Dave.
I have recompiled the kernel with tracing enabled, and today the problem
is visible again. Here is the trace for mm_shrink_slab_start (it is HUGE):
kswapd0 421 [000] 103976.627873: mm_shrink_slab_start: prune_super+0x0 0xffff88011b00d300: objects to shrink 12 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942483 cache items 1500 delt
kswapd0 421 [000] 103976.627882: mm_shrink_slab_start: prune_super+0x0 0xffff88011a20ab00: objects to shrink 267 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942483 cache items 5300 del
kswapd0 421 [000] 103976.627884: mm_shrink_slab_start: prune_super+0x0 0xffff88011a892300: objects to shrink 110 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942483 cache items 2000 del
kswapd0 421 [000] 103976.627887: mm_shrink_slab_start: prune_super+0x0 0xffff88011a893700: objects to shrink 31 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942483 cache items 4900 delt
kswapd0 421 [000] 103976.627888: mm_shrink_slab_start: prune_super+0x0 0xffff88011a386700: objects to shrink 0 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942483 cache items 300 delta
kswapd0 421 [000] 103976.627889: mm_shrink_slab_start: prune_super+0x0 0xffff88011a381700: objects to shrink 6 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942483 cache items 4700 delta
kswapd0 421 [000] 103976.627890: mm_shrink_slab_start: prune_super+0x0 0xffff88011a381300: objects to shrink 3 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942483 cache items 600 delta
kswapd0 421 [000] 103976.627893: mm_shrink_slab_start: xfs_buftarg_shrink+0x0 0xffff88011a171058: objects to shrink 0 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942483 cache items 1 d
kswapd0 421 [000] 103976.627895: mm_shrink_slab_start: xfs_buftarg_shrink+0x0 0xffff88011a29e658: objects to shrink 0 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942483 cache items 1 d
kswapd0 421 [000] 103976.627897: mm_shrink_slab_start: prune_super+0x0 0xffff88011ab17300: objects to shrink 311 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942483 cache items 4900 del
kswapd0 421 [000] 103976.627897: mm_shrink_slab_start: xfs_buftarg_shrink+0x0 0xffff88011b3c4d18: objects to shrink 27 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942483 cache items 49
kswapd0 421 [000] 103976.628108: mm_shrink_slab_start: prune_super+0x0 0xffff88011b00d300: objects to shrink 12 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 95 lru_pgs 942483 cache items 1500 delt
kswapd0 421 [000] 103976.628110: mm_shrink_slab_start: prune_super+0x0 0xffff88011a20ab00: objects to shrink 267 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 95 lru_pgs 942483 cache items 5300 del
kswapd0 421 [000] 103976.628111: mm_shrink_slab_start: prune_super+0x0 0xffff88011a892300: objects to shrink 110 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 95 lru_pgs 942483 cache items 2000 del
kswapd0 421 [000] 103976.628112: mm_shrink_slab_start: prune_super+0x0 0xffff88011a893700: objects to shrink 31 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 95 lru_pgs 942483 cache items 4900 delt
kswapd0 421 [000] 103976.628113: mm_shrink_slab_start: prune_super+0x0 0xffff88011a386700: objects to shrink 0 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 95 lru_pgs 942483 cache items 300 delta
kswapd0 421 [000] 103976.628113: mm_shrink_slab_start: prune_super+0x0 0xffff88011a381700: objects to shrink 6 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 95 lru_pgs 942483 cache items 4700 delta
kswapd0 421 [000] 103976.628114: mm_shrink_slab_start: prune_super+0x0 0xffff88011a381300: objects to shrink 3 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 95 lru_pgs 942483 cache items 600 delta
kswapd0 421 [000] 103976.628115: mm_shrink_slab_start: xfs_buftarg_shrink+0x0 0xffff88011a171058: objects to shrink 0 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 95 lru_pgs 942483 cache items 1 d
kswapd0 421 [000] 103976.628116: mm_shrink_slab_start: xfs_buftarg_shrink+0x0 0xffff88011a29e658: objects to shrink 0 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 95 lru_pgs 942483 cache items 1 d
kswapd0 421 [000] 103976.628116: mm_shrink_slab_start: prune_super+0x0 0xffff88011ab17300: objects to shrink 311 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 95 lru_pgs 942483 cache items 4900 del
kswapd0 421 [000] 103976.628117: mm_shrink_slab_start: xfs_buftarg_shrink+0x0 0xffff88011b3c4d18: objects to shrink 27 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 95 lru_pgs 942483 cache items 49
kswapd0 421 [000] 103976.628161: mm_shrink_slab_start: prune_super+0x0 0xffff88011b00d300: objects to shrink 12 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 8 lru_pgs 942483 cache items 1500 delta
kswapd0 421 [000] 103976.628163: mm_shrink_slab_start: prune_super+0x0 0xffff88011a20ab00: objects to shrink 268 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 8 lru_pgs 942483 cache items 5300 delt
kswapd0 421 [000] 103976.628163: mm_shrink_slab_start: prune_super+0x0 0xffff88011a892300: objects to shrink 110 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 8 lru_pgs 942483 cache items 2000 delt
kswapd0 421 [000] 103976.628165: mm_shrink_slab_start: prune_super+0x0 0xffff88011a893700: objects to shrink 31 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 8 lru_pgs 942483 cache items 4900 delta
kswapd0 421 [000] 103976.628165: mm_shrink_slab_start: prune_super+0x0 0xffff88011a386700: objects to shrink 0 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 8 lru_pgs 942483 cache items 300 delta 0
kswapd0 421 [000] 103976.628166: mm_shrink_slab_start: prune_super+0x0 0xffff88011a381700: objects to shrink 6 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 8 lru_pgs 942483 cache items 4700 delta
kswapd0 421 [000] 103976.628167: mm_shrink_slab_start: prune_super+0x0 0xffff88011a381300: objects to shrink 3 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 8 lru_pgs 942483 cache items 600 delta 0
kswapd0 421 [000] 103976.628167: mm_shrink_slab_start: xfs_buftarg_shrink+0x0 0xffff88011a171058: objects to shrink 0 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 8 lru_pgs 942483 cache items 1 de
kswapd0 421 [000] 103976.628168: mm_shrink_slab_start: xfs_buftarg_shrink+0x0 0xffff88011a29e658: objects to shrink 0 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 8 lru_pgs 942483 cache items 1 de
kswapd0 421 [000] 103976.628169: mm_shrink_slab_start: prune_super+0x0 0xffff88011ab17300: objects to shrink 311 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 8 lru_pgs 942483 cache items 4900 delt
kswapd0 421 [000] 103976.628169: mm_shrink_slab_start: xfs_buftarg_shrink+0x0 0xffff88011b3c4d18: objects to shrink 27 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 8 lru_pgs 942483 cache items 493
kswapd0 421 [000] 103976.628208: mm_shrink_slab_start: prune_super+0x0 0xffff88011b00d300: objects to shrink 12 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942462 cache items 1500 delt
kswapd0 421 [000] 103976.628210: mm_shrink_slab_start: prune_super+0x0 0xffff88011a20ab00: objects to shrink 268 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942462 cache items 5300 del
kswapd0 421 [000] 103976.628210: mm_shrink_slab_start: prune_super+0x0 0xffff88011a892300: objects to shrink 110 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942462 cache items 2000 del
kswapd0 421 [000] 103976.628212: mm_shrink_slab_start: prune_super+0x0 0xffff88011a893700: objects to shrink 31 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942462 cache items 4900 delt
kswapd0 421 [000] 103976.628212: mm_shrink_slab_start: prune_super+0x0 0xffff88011a386700: objects to shrink 0 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942462 cache items 300 delta
kswapd0 421 [000] 103976.628213: mm_shrink_slab_start: prune_super+0x0 0xffff88011a381700: objects to shrink 6 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942462 cache items 4700 delta
kswapd0 421 [000] 103976.628214: mm_shrink_slab_start: prune_super+0x0 0xffff88011a381300: objects to shrink 3 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942462 cache items 600 delta
kswapd0 421 [000] 103976.628214: mm_shrink_slab_start: xfs_buftarg_shrink+0x0 0xffff88011a171058: objects to shrink 0 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942462 cache items 1 d
kswapd0 421 [000] 103976.628215: mm_shrink_slab_start: xfs_buftarg_shrink+0x0 0xffff88011a29e658: objects to shrink 0 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942462 cache items 1 d
kswapd0 421 [000] 103976.628216: mm_shrink_slab_start: prune_super+0x0 0xffff88011ab17300: objects to shrink 311 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942462 cache items 4900 del
kswapd0 421 [000] 103976.628216: mm_shrink_slab_start: xfs_buftarg_shrink+0x0 0xffff88011b3c4d18: objects to shrink 27 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942462 cache items 49
And mm_shrink_slab_end:
kswapd0 421 [000] 104433.026125: mm_shrink_slab_end: prune_super+0x0 0xffff88011b00d300: unused scan count 85 new scan count 85 total_scan 0 last shrinker return val 0
kswapd0 421 [000] 104433.026133: mm_shrink_slab_end: prune_super+0x0 0xffff88011a20ab00: unused scan count 265 new scan count 265 total_scan 0 last shrinker return val 0
kswapd0 421 [000] 104433.026134: mm_shrink_slab_end: prune_super+0x0 0xffff88011a892300: unused scan count 217 new scan count 217 total_scan 0 last shrinker return val 0
kswapd0 421 [000] 104433.026137: mm_shrink_slab_end: prune_super+0x0 0xffff88011a893700: unused scan count 421 new scan count 421 total_scan 0 last shrinker return val 0
kswapd0 421 [000] 104433.026138: mm_shrink_slab_end: prune_super+0x0 0xffff88011a386700: unused scan count 0 new scan count 0 total_scan 0 last shrinker return val 0
kswapd0 421 [000] 104433.026139: mm_shrink_slab_end: prune_super+0x0 0xffff88011a381700: unused scan count 361 new scan count 361 total_scan 0 last shrinker return val 0
kswapd0 421 [000] 104433.026140: mm_shrink_slab_end: prune_super+0x0 0xffff88011a381300: unused scan count 20 new scan count 20 total_scan 0 last shrinker return val 0
kswapd0 421 [000] 104433.026143: mm_shrink_slab_end: xfs_buftarg_shrink+0x0 0xffff88011a171058: unused scan count 0 new scan count 0 total_scan 0 last shrinker return val 0
kswapd0 421 [000] 104433.026144: mm_shrink_slab_end: xfs_buftarg_shrink+0x0 0xffff88011a29e658: unused scan count 0 new scan count 0 total_scan 0 last shrinker return val 0
kswapd0 421 [000] 104433.026146: mm_shrink_slab_end: prune_super+0x0 0xffff88011ab17300: unused scan count 711 new scan count 725 total_scan 14 last shrinker return val 0
kswapd0 421 [000] 104433.026146: mm_shrink_slab_end: xfs_buftarg_shrink+0x0 0xffff88011b3c4d18: unused scan count 50 new scan count 51 total_scan 1 last shrinker return val 0
kswapd0 421 [000] 104433.026280: mm_shrink_slab_end: prune_super+0x0 0xffff88011b00d300: unused scan count 85 new scan count 85 total_scan 0 last shrinker return val 0
kswapd0 421 [000] 104433.026282: mm_shrink_slab_end: prune_super+0x0 0xffff88011a20ab00: unused scan count 265 new scan count 265 total_scan 0 last shrinker return val 0
kswapd0 421 [000] 104433.026282: mm_shrink_slab_end: prune_super+0x0 0xffff88011a892300: unused scan count 217 new scan count 217 total_scan 0 last shrinker return val 0
kswapd0 421 [000] 104433.026284: mm_shrink_slab_end: prune_super+0x0 0xffff88011a893700: unused scan count 421 new scan count 421 total_scan 0 last shrinker return val 0
kswapd0 421 [000] 104433.026285: mm_shrink_slab_end: prune_super+0x0 0xffff88011a386700: unused scan count 0 new scan count 0 total_scan 0 last shrinker return val 0
kswapd0 421 [000] 104433.026285: mm_shrink_slab_end: prune_super+0x0 0xffff88011a381700: unused scan count 361 new scan count 361 total_scan 0 last shrinker return val 0
kswapd0 421 [000] 104433.026286: mm_shrink_slab_end: prune_super+0x0 0xffff88011a381300: unused scan count 20 new scan count 20 total_scan 0 last shrinker return val 0
kswapd0 421 [000] 104433.026287: mm_shrink_slab_end: xfs_buftarg_shrink+0x0 0xffff88011a171058: unused scan count 0 new scan count 0 total_scan 0 last shrinker return val 0
kswapd0 421 [000] 104433.026287: mm_shrink_slab_end: xfs_buftarg_shrink+0x0 0xffff88011a29e658: unused scan count 0 new scan count 0 total_scan 0 last shrinker return val 0
kswapd0 421 [000] 104433.026288: mm_shrink_slab_end: prune_super+0x0 0xffff88011ab17300: unused scan count 725 new scan count 759 total_scan 34 last shrinker return val 0
kswapd0 421 [000] 104433.026289: mm_shrink_slab_end: xfs_buftarg_shrink+0x0 0xffff88011b3c4d18: unused scan count 51 new scan count 54 total_scan 3 last shrinker return val 0
kswapd0 421 [000] 104433.026329: mm_shrink_slab_end: prune_super+0x0 0xffff88011b00d300: unused scan count 85 new scan count 85 total_scan 0 last shrinker return val 0
kswapd0 421 [000] 104433.026331: mm_shrink_slab_end: prune_super+0x0 0xffff88011a20ab00: unused scan count 265 new scan count 265 total_scan 0 last shrinker return val 0
kswapd0 421 [000] 104433.026331: mm_shrink_slab_end: prune_super+0x0 0xffff88011a892300: unused scan count 217 new scan count 217 total_scan 0 last shrinker return val 0
kswapd0 421 [000] 104433.026333: mm_shrink_slab_end: prune_super+0x0 0xffff88011a893700: unused scan count 421 new scan count 421 total_scan 0 last shrinker return val 0
kswapd0 421 [000] 104433.026333: mm_shrink_slab_end: prune_super+0x0 0xffff88011a386700: unused scan count 0 new scan count 0 total_scan 0 last shrinker return val 0
kswapd0 421 [000] 104433.026334: mm_shrink_slab_end: prune_super+0x0 0xffff88011a381700: unused scan count 361 new scan count 361 total_scan 0 last shrinker return val 0
kswapd0 421 [000] 104433.026335: mm_shrink_slab_end: prune_super+0x0 0xffff88011a381300: unused scan count 20 new scan count 20 total_scan 0 last shrinker return val 0
kswapd0 421 [000] 104433.026335: mm_shrink_slab_end: xfs_buftarg_shrink+0x0 0xffff88011a171058: unused scan count 0 new scan count 0 total_scan 0 last shrinker return val 0
kswapd0 421 [000] 104433.026336: mm_shrink_slab_end: xfs_buftarg_shrink+0x0 0xffff88011a29e658: unused scan count 0 new scan count 0 total_scan 0 last shrinker return val 0
kswapd0 421 [000] 104433.026337: mm_shrink_slab_end: prune_super+0x0 0xffff88011ab17300: unused scan count 759 new scan count 760 total_scan 1 last shrinker return val 0
kswapd0 421 [000] 104433.026337: mm_shrink_slab_end: xfs_buftarg_shrink+0x0 0xffff88011b3c4d18: unused scan count 54 new scan count 54 total_scan 0 last shrinker return val 0
kswapd0 421 [000] 104433.026376: mm_shrink_slab_end: prune_super+0x0 0xffff88011b00d300: unused scan count 85 new scan count 85 total_scan 0 last shrinker return val 0
kswapd0 421 [000] 104433.026378: mm_shrink_slab_end: prune_super+0x0 0xffff88011a20ab00: unused scan count 265 new scan count 265 total_scan 0 last shrinker return val 0
kswapd0 421 [000] 104433.026378: mm_shrink_slab_end: prune_super+0x0 0xffff88011a892300: unused scan count 217 new scan count 217 total_scan 0 last shrinker return val 0
kswapd0 421 [000] 104433.026380: mm_shrink_slab_end: prune_super+0x0 0xffff88011a893700: unused scan count 421 new scan count 421 total_scan 0 last shrinker return val 0
kswapd0 421 [000] 104433.026380: mm_shrink_slab_end: prune_super+0x0 0xffff88011a386700: unused scan count 0 new scan count 0 total_scan 0 last shrinker return val 0
kswapd0 421 [000] 104433.026381: mm_shrink_slab_end: prune_super+0x0 0xffff88011a381700: unused scan count 361 new scan count 361 total_scan 0 last shrinker return val 0
kswapd0 421 [000] 104433.026382: mm_shrink_slab_end: prune_super+0x0 0xffff88011a381300: unused scan count 20 new scan count 20 total_scan 0 last shrinker return val 0
kswapd0 421 [000] 104433.026382: mm_shrink_slab_end: xfs_buftarg_shrink+0x0 0xffff88011a171058: unused scan count 0 new scan count 0 total_scan 0 last shrinker return val 0
kswapd0 421 [000] 104433.026383: mm_shrink_slab_end: xfs_buftarg_shrink+0x0 0xffff88011a29e658: unused scan count 0 new scan count 0 total_scan 0 last shrinker return val 0
kswapd0 421 [000] 104433.026384: mm_shrink_slab_end: prune_super+0x0 0xffff88011ab17300: unused scan count 760 new scan count 774 total_scan 14 last shrinker return val 0
kswapd0 421 [000] 104433.026384: mm_shrink_slab_end: xfs_buftarg_shrink+0x0 0xffff88011b3c4d18: unused scan count 54 new scan count 55 total_scan 1 last shrinker return val 0
PID USER S %CPU %MEM TIME+ COMMAND
4438 nowhere D 27 0.2 0:02.54 dd
421 root S 17 0.0 1:45.79 kswapd0
Uptime is 1 day, 5:02
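As an aside, the "delta 0" values in the start events above fall out of the integer arithmetic shrink_slab() uses to convert LRU scan pressure into slab scan work. A rough Python model of that calculation (the constant 4 and the default seeks value of 2 are from my reading of mm/vmscan.c around 3.2 — verify against your tree; the helper name is made up):

```python
# Rough model of the shrink_slab() per-call scan-count calculation
# (circa Linux 3.2). All arithmetic is integer, as in the kernel:
#   delta = (4 * nr_pages_scanned / shrinker->seeks) * max_pass / (lru_pages + 1)

def shrink_slab_delta(nr_pages_scanned, lru_pages, max_pass, seeks=2):
    delta = (4 * nr_pages_scanned) // seeks   # scaled LRU scan pressure
    delta *= max_pass                         # objects the shrinker reports
    return delta // (lru_pages + 1)           # share of the cache to scan

# Numbers from the trace above: 32 pages scanned against ~942k LRU pages
# and caches of only a few thousand objects -> delta rounds to 0, so each
# shrinker is invoked but asked to scan nothing.
print(shrink_slab_delta(32, 942483, 1500))   # -> 0
print(shrink_slab_delta(32, 942483, 5300))   # -> 0
```

If that model is right, it is consistent with Dave's reading: with a huge page cache and tiny slab caches, kswapd calls every registered shrinker on each pass yet hands them zero work, so all the CPU time goes into the superblock scanning overhead itself.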
* Re: Kswapd in 3.2.0-rc5 is a CPU hog
2011-12-23 9:01 ` nowhere
@ 2011-12-23 10:20 ` Dave Chinner
2011-12-23 11:04 ` nowhere
0 siblings, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2011-12-23 10:20 UTC (permalink / raw)
To: nowhere; +Cc: Michal Hocko, linux-kernel, linux-mm
On Fri, Dec 23, 2011 at 01:01:20PM +0400, nowhere wrote:
> On Thu, 22/12/2011 at 09:55 +1100, Dave Chinner wrote:
> > On Wed, Dec 21, 2011 at 10:52:49AM +0100, Michal Hocko wrote:
> > > [Let's CC linux-mm]
> > >
> > > On Wed 21-12-11 07:10:36, Nikolay S. wrote:
> > > > Hello,
> > > >
> > > > I'm using 3.2-rc5 on a machine, which atm does almost nothing except
> > > > file system operations and network i/o (i.e. file server). And there is
> > > > a problem with kswapd.
> > >
> > > What kind of filesystem do you use?
> > >
> > > >
> > > > I'm playing with dd:
> > > > dd if=/some/big/file of=/dev/null bs=8M
> > > >
> > > > I.e. I'm filling page cache.
> > > >
> > > > So when the machine is just rebooted, kswapd during this operation is
> > > > almost idle, just 5-8 percent according to top.
> > > >
> > > > After ~5 days of uptime (5 days, 2:10), the same operation demands ~70%
> > > > for kswapd:
> > > >
> > > > PID USER S %CPU %MEM TIME+ SWAP COMMAND
> > > > 420 root R 70 0.0 22:09.60 0 kswapd0
> > > > 17717 nowhere D 27 0.2 0:01.81 10m dd
> > > >
> > > > In fact, kswapd cpu usage on this operation steadily increases over
> > > > time.
> > > >
> > > > Also read performance degrades over time. After reboot:
> > > > dd if=/some/big/file of=/dev/null bs=8M
> > > > 1019+1 records in
> > > > 1019+1 records out
> > > > 8553494018 bytes (8.6 GB) copied, 16.211 s, 528 MB/s
> > > >
> > > > After ~5 days uptime:
> > > > dd if=/some/big/file of=/dev/null bs=8M
> > > > 1019+1 records in
> > > > 1019+1 records out
> > > > 8553494018 bytes (8.6 GB) copied, 29.0507 s, 294 MB/s
> > > >
> > > > Whereas raw disk sequential read performance stays the same:
> > > > dd if=/some/big/file of=/dev/null bs=8M iflag=direct
> > > > 1019+1 records in
> > > > 1019+1 records out
> > > > 8553494018 bytes (8.6 GB) copied, 14.7286 s, 581 MB/s
> > > >
> > > > Also after dropping caches, situation somehow improves, but not to the
> > > > state of freshly restarted system:
> > > > PID USER S %CPU %MEM TIME+ SWAP COMMAND
> > > > 420 root S 39 0.0 23:31.17 0 kswapd0
> > > > 19829 nowhere D 24 0.2 0:02.72 7764 dd
> > > >
> > > > perf shows:
> > > >
> > > > 31.24% kswapd0 [kernel.kallsyms] [k] _raw_spin_lock
> > > > 26.19% kswapd0 [kernel.kallsyms] [k] shrink_slab
> > > > 16.28% kswapd0 [kernel.kallsyms] [k] prune_super
> > > > 6.55% kswapd0 [kernel.kallsyms] [k] grab_super_passive
> > > > 5.35% kswapd0 [kernel.kallsyms] [k] down_read_trylock
> > > > 4.03% kswapd0 [kernel.kallsyms] [k] up_read
> > > > 2.31% kswapd0 [kernel.kallsyms] [k] put_super
> > > > 1.81% kswapd0 [kernel.kallsyms] [k] drop_super
> > > > 0.99% kswapd0 [kernel.kallsyms] [k] __put_super
> > > > 0.25% kswapd0 [kernel.kallsyms] [k] __isolate_lru_page
> > > > 0.23% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk
> > > > 0.19% kswapd0 [r8169] [k] rtl8169_interrupt
> > > > 0.15% kswapd0 [kernel.kallsyms] [k] twa_interrupt
> > >
> > > Quite a lot of time spent shrinking slab (dcache I guess) and a lot of
> > > spin lock contention.
> >
> > That's just scanning superblocks, not apparently doing anything
> > useful like shrinking dentries or inodes attached to each sb. i.e.
> > the shrinkers are being called an awful lot and basically have
> > nothing to do. I'd be suspecting a problem higher up in the stack to
> > do with how shrink_slab is operating or being called.
> >
> > I'd suggest gathering event traces for mm_shrink_slab_start/
> > mm_shrink_slab_end to try to see how the shrinkers are being
> > driven...
> >
> > Cheers,
> >
> > Dave.
>
> I have recompiled kernel with tracers, and today the problem is visible
> again. So here is the trace for mm_shrink_slab_start (it is HUGE):
>
> kswapd0 421 [000] 103976.627873: mm_shrink_slab_start: prune_super+0x0 0xffff88011b00d300: objects to shrink 12 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942483 cache items 1500 delt
> kswapd0 421 [000] 103976.627882: mm_shrink_slab_start: prune_super+0x0 0xffff88011a20ab00: objects to shrink 267 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942483 cache items 5300 del
And possibly useless in this form. I need to see the
mm_shrink_slab_start/mm_shrink_slab_end events interleaved so I can
see exactly how much work each shrinker call is doing, and the start
events are truncated so not all the info I need is present.
Perhaps you should install trace-cmd.
$ trace-cmd record -e mm_shrink_slab*
(wait 30s, then ^C)
$ trace-cmd report > shrink.trace
And then compress and attach the trace file or put up on the web
somewhere for me to download if it's too large for email...
As it is, there's ~940k pages in the LRU, and shrink_slab is being
called after 32, 95, 8, 8, 32 and 32 pages on the LRU have been
scanned. That seems like the shrinkers are being called rather too
often.
The end traces indicate the shrinker caches aren't able to free
anything. So it looks like the vmscan code has got itself in a
situation where it is not scanning many pages between shrinker
callouts, and the shrinkers scan but can't make any progress. Looks
like a vmscan balancing problem right now, not anything to do with
the shrinker code. A better trace will confirm that.
FWIW, if you use trace-cmd, it might be worthwhile collecting all the
vmscan trace events too, as that might help the VM folk understand
the problem without needing to ask you for more info.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Kswapd in 3.2.0-rc5 is a CPU hog
2011-12-23 10:20 ` Dave Chinner
@ 2011-12-23 11:04 ` nowhere
2011-12-23 20:45 ` Dave Chinner
0 siblings, 1 reply; 32+ messages in thread
From: nowhere @ 2011-12-23 11:04 UTC (permalink / raw)
To: Dave Chinner; +Cc: Michal Hocko, linux-kernel, linux-mm
On Fri, 23/12/2011 at 21:20 +1100, Dave Chinner wrote:
> On Fri, Dec 23, 2011 at 01:01:20PM +0400, nowhere wrote:
> > On Thu, 22/12/2011 at 09:55 +1100, Dave Chinner wrote:
> > > On Wed, Dec 21, 2011 at 10:52:49AM +0100, Michal Hocko wrote:
> > > > [Let's CC linux-mm]
> > > >
> > > > On Wed 21-12-11 07:10:36, Nikolay S. wrote:
> > > > > Hello,
> > > > >
> > > > > I'm using 3.2-rc5 on a machine, which atm does almost nothing except
> > > > > file system operations and network i/o (i.e. file server). And there is
> > > > > a problem with kswapd.
> > > >
> > > > What kind of filesystem do you use?
> > > >
> > > > >
> > > > > I'm playing with dd:
> > > > > dd if=/some/big/file of=/dev/null bs=8M
> > > > >
> > > > > I.e. I'm filling page cache.
> > > > >
> > > > > So when the machine is just rebooted, kswapd during this operation is
> > > > > almost idle, just 5-8 percent according to top.
> > > > >
> > > > > After ~5 days of uptime (5 days, 2:10), the same operation demands ~70%
> > > > > for kswapd:
> > > > >
> > > > > PID USER S %CPU %MEM TIME+ SWAP COMMAND
> > > > > 420 root R 70 0.0 22:09.60 0 kswapd0
> > > > > 17717 nowhere D 27 0.2 0:01.81 10m dd
> > > > >
> > > > > In fact, kswapd cpu usage on this operation steadily increases over
> > > > > time.
> > > > >
> > > > > Also read performance degrades over time. After reboot:
> > > > > dd if=/some/big/file of=/dev/null bs=8M
> > > > > 1019+1 records in
> > > > > 1019+1 records out
> > > > > 8553494018 bytes (8.6 GB) copied, 16.211 s, 528 MB/s
> > > > >
> > > > > After ~5 days uptime:
> > > > > dd if=/some/big/file of=/dev/null bs=8M
> > > > > 1019+1 records in
> > > > > 1019+1 records out
> > > > > 8553494018 bytes (8.6 GB) copied, 29.0507 s, 294 MB/s
> > > > >
> > > > > Whereas raw disk sequential read performance stays the same:
> > > > > dd if=/some/big/file of=/dev/null bs=8M iflag=direct
> > > > > 1019+1 records in
> > > > > 1019+1 records out
> > > > > 8553494018 bytes (8.6 GB) copied, 14.7286 s, 581 MB/s
> > > > >
> > > > > Also after dropping caches, situation somehow improves, but not to the
> > > > > state of freshly restarted system:
> > > > > PID USER S %CPU %MEM TIME+ SWAP COMMAND
> > > > > 420 root S 39 0.0 23:31.17 0 kswapd0
> > > > > 19829 nowhere D 24 0.2 0:02.72 7764 dd
> > > > >
> > > > > perf shows:
> > > > >
> > > > > 31.24% kswapd0 [kernel.kallsyms] [k] _raw_spin_lock
> > > > > 26.19% kswapd0 [kernel.kallsyms] [k] shrink_slab
> > > > > 16.28% kswapd0 [kernel.kallsyms] [k] prune_super
> > > > > 6.55% kswapd0 [kernel.kallsyms] [k] grab_super_passive
> > > > > 5.35% kswapd0 [kernel.kallsyms] [k] down_read_trylock
> > > > > 4.03% kswapd0 [kernel.kallsyms] [k] up_read
> > > > > 2.31% kswapd0 [kernel.kallsyms] [k] put_super
> > > > > 1.81% kswapd0 [kernel.kallsyms] [k] drop_super
> > > > > 0.99% kswapd0 [kernel.kallsyms] [k] __put_super
> > > > > 0.25% kswapd0 [kernel.kallsyms] [k] __isolate_lru_page
> > > > > 0.23% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk
> > > > > 0.19% kswapd0 [r8169] [k] rtl8169_interrupt
> > > > > 0.15% kswapd0 [kernel.kallsyms] [k] twa_interrupt
> > > >
> > > > Quite a lot of time spent shrinking slab (dcache I guess) and a lot of
> > > > spin lock contention.
> > >
> > > That's just scanning superblocks, not apparently doing anything
> > > useful like shrinking dentries or inodes attached to each sb. i.e.
> > > the shrinkers are being called an awful lot and basically have
> > > nothing to do. I'd be suspecting a problem higher up in the stack to
> > > do with how shrink_slab is operating or being called.
> > >
> > > I'd suggest gathering event traces for mm_shrink_slab_start/
> > > mm_shrink_slab_end to try to see how the shrinkers are being
> > > driven...
> > >
> > > Cheers,
> > >
> > > Dave.
> >
> > I have recompiled kernel with tracers, and today the problem is visible
> > again. So here is the trace for mm_shrink_slab_start (it is HUGE):
> >
> > kswapd0 421 [000] 103976.627873: mm_shrink_slab_start: prune_super+0x0 0xffff88011b00d300: objects to shrink 12 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942483 cache items 1500 delt
> > kswapd0 421 [000] 103976.627882: mm_shrink_slab_start: prune_super+0x0 0xffff88011a20ab00: objects to shrink 267 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942483 cache items 5300 del
>
> And possibly useless in this form. I need to see the
> mm_shrink_slab_start/mm_shrink_slab_end events interleaved so I can
> see exactly how much work each shrinker call is doing, and the start
> events are truncated so not all the info I need is present.
>
> Perhaps you should install trace-cmd.
>
> $ trace-cmd record -e mm_shrink_slab*
> (wait 30s, then ^C)
> $ trace-cmd report > shrink.trace
>
> And then compress and attach the trace file or put up on the web
> somewhere for me to download if it's too large for email...
>
> As it is, there's ~940k pages in the LRU, and shrink_slab is being
> called after 32, 95, 8, 8, 32 and 32 pages on the LRU have been
> scanned. That seems like the shrinkers are being called rather too
> often.
>
> The end traces indicate the shrinker caches aren't able to free
> anything. So it looks like the vmscan code has got itself in a
> situation where it is not scanning many pages between shrinker
> callouts, and the shrinkers scan but can't make any progress. Looks
> like a vmscan balancing problem right now, not anything to do with
> the shrinker code. A better trace will confirm that.
>
> FWIW, if you use trace-cmd, it might be worthwhile collecting all the
> vmscan trace events too, as that might help the VM folk understand
> the problem without needing to ask you for more info.
./trace-cmd record -e vmscan/*
Here is the report of trace-cmd while dd'ing
https://80.237.6.56/report-dd.xz
And one more during normal operation
https://80.237.6.56/report-normal.xz
>
> Cheers,
>
> Dave.
* Re: Kswapd in 3.2.0-rc5 is a CPU hog
2011-12-23 11:04 ` nowhere
@ 2011-12-23 20:45 ` Dave Chinner
2011-12-25 9:09 ` Hillf Danton
2011-12-27 2:15 ` KAMEZAWA Hiroyuki
0 siblings, 2 replies; 32+ messages in thread
From: Dave Chinner @ 2011-12-23 20:45 UTC (permalink / raw)
To: nowhere; +Cc: Michal Hocko, linux-kernel, linux-mm
On Fri, Dec 23, 2011 at 03:04:02PM +0400, nowhere wrote:
> On Fri, 23/12/2011 at 21:20 +1100, Dave Chinner wrote:
> > On Fri, Dec 23, 2011 at 01:01:20PM +0400, nowhere wrote:
> > > On Thu, 22/12/2011 at 09:55 +1100, Dave Chinner wrote:
> > > > On Wed, Dec 21, 2011 at 10:52:49AM +0100, Michal Hocko wrote:
> > > > > [Let's CC linux-mm]
> > > > >
> > > > > On Wed 21-12-11 07:10:36, Nikolay S. wrote:
> > > > > > Hello,
> > > > > >
> > > > > > I'm using 3.2-rc5 on a machine, which atm does almost nothing except
> > > > > > file system operations and network i/o (i.e. file server). And there is
> > > > > > a problem with kswapd.
> > > > >
> > > > > What kind of filesystem do you use?
> > > > >
> > > > > >
> > > > > > I'm playing with dd:
> > > > > > dd if=/some/big/file of=/dev/null bs=8M
> > > > > >
> > > > > > I.e. I'm filling page cache.
> > > > > >
> > > > > > So when the machine is just rebooted, kswapd during this operation is
> > > > > > almost idle, just 5-8 percent according to top.
> > > > > >
> > > > > > After ~5 days of uptime (5 days, 2:10), the same operation demands ~70%
> > > > > > for kswapd:
> > > > > >
> > > > > > PID USER S %CPU %MEM TIME+ SWAP COMMAND
> > > > > > 420 root R 70 0.0 22:09.60 0 kswapd0
> > > > > > 17717 nowhere D 27 0.2 0:01.81 10m dd
> > > > > >
> > > > > > In fact, kswapd cpu usage on this operation steadily increases over
> > > > > > time.
> > > > > >
> > > > > > Also read performance degrades over time. After reboot:
> > > > > > dd if=/some/big/file of=/dev/null bs=8M
> > > > > > 1019+1 records in
> > > > > > 1019+1 records out
> > > > > > 8553494018 bytes (8.6 GB) copied, 16.211 s, 528 MB/s
> > > > > >
> > > > > > After ~5 days uptime:
> > > > > > dd if=/some/big/file of=/dev/null bs=8M
> > > > > > 1019+1 records in
> > > > > > 1019+1 records out
> > > > > > 8553494018 bytes (8.6 GB) copied, 29.0507 s, 294 MB/s
> > > > > >
> > > > > > Whereas raw disk sequential read performance stays the same:
> > > > > > dd if=/some/big/file of=/dev/null bs=8M iflag=direct
> > > > > > 1019+1 records in
> > > > > > 1019+1 records out
> > > > > > 8553494018 bytes (8.6 GB) copied, 14.7286 s, 581 MB/s
> > > > > >
> > > > > > Also after dropping caches, situation somehow improves, but not to the
> > > > > > state of freshly restarted system:
> > > > > > PID USER S %CPU %MEM TIME+ SWAP COMMAND
> > > > > > 420 root S 39 0.0 23:31.17 0 kswapd0
> > > > > > 19829 nowhere D 24 0.2 0:02.72 7764 dd
> > > > > >
> > > > > > perf shows:
> > > > > >
> > > > > > 31.24% kswapd0 [kernel.kallsyms] [k] _raw_spin_lock
> > > > > > 26.19% kswapd0 [kernel.kallsyms] [k] shrink_slab
> > > > > > 16.28% kswapd0 [kernel.kallsyms] [k] prune_super
> > > > > > 6.55% kswapd0 [kernel.kallsyms] [k] grab_super_passive
> > > > > > 5.35% kswapd0 [kernel.kallsyms] [k] down_read_trylock
> > > > > > 4.03% kswapd0 [kernel.kallsyms] [k] up_read
> > > > > > 2.31% kswapd0 [kernel.kallsyms] [k] put_super
> > > > > > 1.81% kswapd0 [kernel.kallsyms] [k] drop_super
> > > > > > 0.99% kswapd0 [kernel.kallsyms] [k] __put_super
> > > > > > 0.25% kswapd0 [kernel.kallsyms] [k] __isolate_lru_page
> > > > > > 0.23% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk
> > > > > > 0.19% kswapd0 [r8169] [k] rtl8169_interrupt
> > > > > > 0.15% kswapd0 [kernel.kallsyms] [k] twa_interrupt
> > > > >
> > > > > Quite a lot of time spent shrinking slab (dcache I guess) and a lot of
> > > > > spin lock contention.
> > > >
> > > > That's just scanning superblocks, not apparently doing anything
> > > > useful like shrinking dentries or inodes attached to each sb. i.e.
> > > > the shrinkers are being called an awful lot and basically have
> > > > nothing to do. I'd be suspecting a problem higher up in the stack to
> > > > do with how shrink_slab is operating or being called.
> > > >
> > > > I'd suggest gathering event traces for mm_shrink_slab_start/
> > > > mm_shrink_slab_end to try to see how the shrinkers are being
> > > > driven...
> > > >
> > > > Cheers,
> > > >
> > > > Dave.
> > >
> > > I have recompiled kernel with tracers, and today the problem is visible
> > > again. So here is the trace for mm_shrink_slab_start (it is HUGE):
> > >
> > > kswapd0 421 [000] 103976.627873: mm_shrink_slab_start: prune_super+0x0 0xffff88011b00d300: objects to shrink 12 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942483 cache items 1500 delt
> > > kswapd0 421 [000] 103976.627882: mm_shrink_slab_start: prune_super+0x0 0xffff88011a20ab00: objects to shrink 267 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942483 cache items 5300 del
> >
> > And possibly useless in this form. I need to see the
> > mm_shrink_slab_start/mm_shrink_slab_end events interleaved so I can
> > see exactly how much work each shrinker call is doing, and the start
> > events are truncated so not all the info I need is present.
> >
> > Perhaps you should install trace-cmd.
> >
> > $ trace-cmd record -e mm_shrink_slab*
> > (wait 30s, then ^C)
> > $ trace-cmd report > shrink.trace
> >
> > And then compress and attach the trace file or put up on the web
> > somewhere for me to download if it's too large for email...
> >
> > As it is, there's ~940k pages in the LRU, and shrink_slab is being
> > called after 32, 95, 8, 8, 32 and 32 pages on the LRU have been
> > scanned. That seems like the shrinkers are being called rather too
> > often.
> >
> > The end traces indicate the shrinker caches aren't able to free
> > anything. So it looks like the vmscan code has got itself in a
> > situation where it is not scanning many pages between shrinker
> > callouts, and the shrinkers scan but can't make any progress. Looks
> > like a vmscan balancing problem right now, not anything to do with
> > the shrinker code. A better trace will confirm that.
> >
> > FWIW, if you use trace-cmd, it might be worthwhile collecting all the
> > vmscan trace events too, as that might help the VM folk understand
> > the problem without needing to ask you for more info.
>
> ./trace-cmd record -e vmscan/*
>
> Here is the report of trace-cmd while dd'ing
> https://80.237.6.56/report-dd.xz
Ok, it's not a shrink_slab() problem - it's just being called every ~100us
by kswapd. The pattern is:
- reclaim 94 (batches of 32,32,30) pages from inactive list
of zone 1, node 0, prio 12
- call shrink_slab
- scan all caches
- all shrinkers return 0 saying nothing to shrink
- 40us gap
- reclaim 10-30 pages from inactive list of zone 2, node 0, prio 12
- call shrink_slab
- scan all caches
- all shrinkers return 0 saying nothing to shrink
- 40us gap
- isolate 9 pages from LRU zone ?, node ?, none isolated, none freed
- isolate 22 pages from LRU zone ?, node ?, none isolated, none freed
- call shrink_slab
- scan all caches
- all shrinkers return 0 saying nothing to shrink
- 40us gap
And it just repeats over and over again. After a while, nid=0,zone=1
drops out of the traces, so reclaim only comes in batches of 10-30
pages from zone 2 between each shrink_slab() call.
The trace starts at 111209.881s, with 944776 pages on the LRUs. It
finishes at 111216.1 with kswapd going to sleep on node 0 with
930067 pages on the LRU. So 7 seconds to free 15,000 pages (call it
2,000 pages/s) which is awfully slow....
vmscan gurus - time for you to step in now...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Kswapd in 3.2.0-rc5 is a CPU hog
2011-12-23 20:45 ` Dave Chinner
@ 2011-12-25 9:09 ` Hillf Danton
2011-12-25 10:21 ` Nikolay S.
2011-12-27 2:15 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 32+ messages in thread
From: Hillf Danton @ 2011-12-25 9:09 UTC (permalink / raw)
To: Dave Chinner; +Cc: nowhere, Michal Hocko, linux-kernel, linux-mm
On Sat, Dec 24, 2011 at 4:45 AM, Dave Chinner <david@fromorbit.com> wrote:
[...]
>
> Ok, it's not a shrink_slab() problem - it's just being called every ~100us
> by kswapd. The pattern is:
>
> - reclaim 94 (batches of 32,32,30) pages from inactive list
> of zone 1, node 0, prio 12
> - call shrink_slab
> - scan all caches
> - all shrinkers return 0 saying nothing to shrink
> - 40us gap
> - reclaim 10-30 pages from inactive list of zone 2, node 0, prio 12
> - call shrink_slab
> - scan all caches
> - all shrinkers return 0 saying nothing to shrink
> - 40us gap
> - isolate 9 pages from LRU zone ?, node ?, none isolated, none freed
> - isolate 22 pages from LRU zone ?, node ?, none isolated, none freed
> - call shrink_slab
> - scan all caches
> - all shrinkers return 0 saying nothing to shrink
> 40us gap
>
> And it just repeats over and over again. After a while, nid=0,zone=1
> drops out of the traces, so reclaim only comes in batches of 10-30
> pages from zone 2 between each shrink_slab() call.
>
> The trace starts at 111209.881s, with 944776 pages on the LRUs. It
> finishes at 111216.1 with kswapd going to sleep on node 0 with
> 930067 pages on the LRU. So 7 seconds to free 15,000 pages (call it
> 2,000 pages/s) which is awfully slow....
>
Hi all,
In the hope that it helps, here is some added debug info.
Hillf
---
--- a/mm/memcontrol.c Fri Dec 9 21:57:40 2011
+++ b/mm/memcontrol.c Sun Dec 25 17:08:14 2011
@@ -1038,7 +1038,11 @@ void mem_cgroup_lru_del_list(struct page
memcg = root_mem_cgroup;
mz = page_cgroup_zoneinfo(memcg, page);
/* huge page split is done under lru_lock. so, we have no races. */
- MEM_CGROUP_ZSTAT(mz, lru) -= 1 << compound_order(page);
+ if (WARN_ON_ONCE(MEM_CGROUP_ZSTAT(mz, lru) <
+ (1 << compound_order(page))))
+ MEM_CGROUP_ZSTAT(mz, lru) = 0;
+ else
+ MEM_CGROUP_ZSTAT(mz, lru) -= 1 << compound_order(page);
}
void mem_cgroup_lru_del(struct page *page)
* Re: Kswapd in 3.2.0-rc5 is a CPU hog
2011-12-25 9:09 ` Hillf Danton
@ 2011-12-25 10:21 ` Nikolay S.
2011-12-26 12:35 ` Hillf Danton
0 siblings, 1 reply; 32+ messages in thread
From: Nikolay S. @ 2011-12-25 10:21 UTC (permalink / raw)
To: Hillf Danton; +Cc: Dave Chinner, Michal Hocko, linux-kernel, linux-mm
On Sun, 25/12/2011 at 17:09 +0800, Hillf Danton wrote:
> On Sat, Dec 24, 2011 at 4:45 AM, Dave Chinner <david@fromorbit.com> wrote:
> [...]
> >
> > Ok, it's not a shrink_slab() problem - it's just being called every ~100us
> > by kswapd. The pattern is:
> >
> > - reclaim 94 (batches of 32,32,30) pages from inactive list
> > of zone 1, node 0, prio 12
> > - call shrink_slab
> > - scan all caches
> > - all shrinkers return 0 saying nothing to shrink
> > - 40us gap
> > - reclaim 10-30 pages from inactive list of zone 2, node 0, prio 12
> > - call shrink_slab
> > - scan all caches
> > - all shrinkers return 0 saying nothing to shrink
> > - 40us gap
> > - isolate 9 pages from LRU zone ?, node ?, none isolated, none freed
> > - isolate 22 pages from LRU zone ?, node ?, none isolated, none freed
> > - call shrink_slab
> > - scan all caches
> > - all shrinkers return 0 saying nothing to shrink
> > 40us gap
> >
> > And it just repeats over and over again. After a while, nid=0,zone=1
> > drops out of the traces, so reclaim only comes in batches of 10-30
> > pages from zone 2 between each shrink_slab() call.
> >
> > The trace starts at 111209.881s, with 944776 pages on the LRUs. It
> > finishes at 111216.1 with kswapd going to sleep on node 0 with
> > 930067 pages on the LRU. So 7 seconds to free 15,000 pages (call it
> > 2,000 pages/s) which is awfully slow....
> >
> Hi all,
>
> In the hope that it helps, here is some added debug info.
>
> Hillf
> ---
>
> --- a/mm/memcontrol.c Fri Dec 9 21:57:40 2011
> +++ b/mm/memcontrol.c Sun Dec 25 17:08:14 2011
> @@ -1038,7 +1038,11 @@ void mem_cgroup_lru_del_list(struct page
> memcg = root_mem_cgroup;
> mz = page_cgroup_zoneinfo(memcg, page);
> /* huge page split is done under lru_lock. so, we have no races. */
> - MEM_CGROUP_ZSTAT(mz, lru) -= 1 << compound_order(page);
> + if (WARN_ON_ONCE(MEM_CGROUP_ZSTAT(mz, lru) <
> + (1 << compound_order(page))))
> + MEM_CGROUP_ZSTAT(mz, lru) = 0;
> + else
> + MEM_CGROUP_ZSTAT(mz, lru) -= 1 << compound_order(page);
> }
>
> void mem_cgroup_lru_del(struct page *page)
Hello,
Uhm... is this patch against 3.2-rc4? I cannot apply it. There's no
mem_cgroup_lru_del_list(), only a void mem_cgroup_del_lru_list(). Should I
place the changes there?
Also, -rc7 is out now. Might the problem be addressed as part of some
ongoing work? Is there any reason to try -rc7 (the problem requires
several days of uptime to become obvious)?
* Re: Kswapd in 3.2.0-rc5 is a CPU hog
2011-12-25 10:21 ` Nikolay S.
@ 2011-12-26 12:35 ` Hillf Danton
2011-12-27 0:20 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 32+ messages in thread
From: Hillf Danton @ 2011-12-26 12:35 UTC (permalink / raw)
To: Nikolay S.; +Cc: Dave Chinner, Michal Hocko, linux-kernel, linux-mm
On Sun, Dec 25, 2011 at 6:21 PM, Nikolay S. <nowhere@hakkenden.ath.cx> wrote:
>
> Uhm... is this patch against 3.2-rc4? I cannot apply it. There's no
> mem_cgroup_lru_del_list(), only a void mem_cgroup_del_lru_list(). Should I
> place the changes there?
>
> Also, -rc7 is out now. Might the problem be addressed as part of some
> ongoing work? Is there any reason to try -rc7 (the problem requires
> several days of uptime to become obvious)?
>
Sorry, Nikolay, it is not based on -next, nor on -rc5 (I assumed it was).
The following is based on -next; if you want to test -rc5, please
grep for MEM_CGROUP_ZSTAT in mm/memcontrol.c and make the change there.
Best regards
Hillf
---
--- a/mm/memcontrol.c Mon Dec 26 20:34:38 2011
+++ b/mm/memcontrol.c Mon Dec 26 20:37:54 2011
@@ -1076,7 +1076,11 @@ void mem_cgroup_lru_del_list(struct page
VM_BUG_ON(!memcg);
mz = page_cgroup_zoneinfo(memcg, page);
/* huge page split is done under lru_lock. so, we have no races. */
- MEM_CGROUP_ZSTAT(mz, lru) -= 1 << compound_order(page);
+ if (WARN_ON_ONCE(MEM_CGROUP_ZSTAT(mz, lru) <
+ (1 << compound_order(page))))
+ MEM_CGROUP_ZSTAT(mz, lru) = 0;
+ else
+ MEM_CGROUP_ZSTAT(mz, lru) -= 1 << compound_order(page);
}
void mem_cgroup_lru_del(struct page *page)
* Re: Kswapd in 3.2.0-rc5 is a CPU hog
2011-12-26 12:35 ` Hillf Danton
@ 2011-12-27 0:20 ` KAMEZAWA Hiroyuki
2011-12-27 13:33 ` Hillf Danton
0 siblings, 1 reply; 32+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-12-27 0:20 UTC (permalink / raw)
To: Hillf Danton
Cc: Nikolay S., Dave Chinner, Michal Hocko, linux-kernel, linux-mm
On Mon, 26 Dec 2011 20:35:46 +0800
Hillf Danton <dhillf@gmail.com> wrote:
> On Sun, Dec 25, 2011 at 6:21 PM, Nikolay S. <nowhere@hakkenden.ath.cx> wrote:
> >
> > Uhm... is this patch against 3.2-rc4? I cannot apply it. There's no
> > mem_cgroup_lru_del_list(), only a void mem_cgroup_del_lru_list(). Should I
> > place the changes there?
> >
> > Also, -rc7 is out now. Might the problem be addressed as part of some
> > ongoing work? Is there any reason to try -rc7 (the problem requires
> > several days of uptime to become obvious)?
> >
>
> Sorry, Nikolay, it is not based on -next, nor on -rc5 (I assumed it was).
> The following is based on -next; if you want to test -rc5, please
> grep for MEM_CGROUP_ZSTAT in mm/memcontrol.c and make the change there.
>
> Best regards
>
Hmm? Is memcg used here? Why do you consider this will help?
Thanks,
-Kame
> Hillf
> ---
>
> --- a/mm/memcontrol.c Mon Dec 26 20:34:38 2011
> +++ b/mm/memcontrol.c Mon Dec 26 20:37:54 2011
> @@ -1076,7 +1076,11 @@ void mem_cgroup_lru_del_list(struct page
> VM_BUG_ON(!memcg);
> mz = page_cgroup_zoneinfo(memcg, page);
> /* huge page split is done under lru_lock. so, we have no races. */
> - MEM_CGROUP_ZSTAT(mz, lru) -= 1 << compound_order(page);
> + if (WARN_ON_ONCE(MEM_CGROUP_ZSTAT(mz, lru) <
> + (1 << compound_order(page))))
> + MEM_CGROUP_ZSTAT(mz, lru) = 0;
> + else
> + MEM_CGROUP_ZSTAT(mz, lru) -= 1 << compound_order(page);
> }
>
> void mem_cgroup_lru_del(struct page *page)
>
> --
>
* Re: Kswapd in 3.2.0-rc5 is a CPU hog
2011-12-23 20:45 ` Dave Chinner
2011-12-25 9:09 ` Hillf Danton
@ 2011-12-27 2:15 ` KAMEZAWA Hiroyuki
2011-12-27 2:50 ` Nikolay S.
2011-12-27 3:57 ` Minchan Kim
1 sibling, 2 replies; 32+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-12-27 2:15 UTC (permalink / raw)
To: Dave Chinner; +Cc: nowhere, Michal Hocko, linux-kernel, linux-mm
On Sat, 24 Dec 2011 07:45:03 +1100
Dave Chinner <david@fromorbit.com> wrote:
> On Fri, Dec 23, 2011 at 03:04:02PM +0400, nowhere wrote:
> > On Fri, 23/12/2011 at 21:20 +1100, Dave Chinner wrote:
> > > On Fri, Dec 23, 2011 at 01:01:20PM +0400, nowhere wrote:
> > > > On Thu, 22/12/2011 at 09:55 +1100, Dave Chinner wrote:
> > > > > On Wed, Dec 21, 2011 at 10:52:49AM +0100, Michal Hocko wrote:
> > Here is the report of trace-cmd while dd'ing
> > https://80.237.6.56/report-dd.xz
>
> Ok, it's not a shrink_slab() problem - it's just being called every ~100us
> by kswapd. The pattern is:
>
> - reclaim 94 (batches of 32,32,30) pages from inactive list
> of zone 1, node 0, prio 12
> - call shrink_slab
> - scan all caches
> - all shrinkers return 0 saying nothing to shrink
> - 40us gap
> - reclaim 10-30 pages from inactive list of zone 2, node 0, prio 12
> - call shrink_slab
> - scan all caches
> - all shrinkers return 0 saying nothing to shrink
> - 40us gap
> - isolate 9 pages from LRU zone ?, node ?, none isolated, none freed
> - isolate 22 pages from LRU zone ?, node ?, none isolated, none freed
> - call shrink_slab
> - scan all caches
> - all shrinkers return 0 saying nothing to shrink
> 40us gap
>
> And it just repeats over and over again. After a while, nid=0,zone=1
> drops out of the traces, so reclaim only comes in batches of 10-30
> pages from zone 2 between each shrink_slab() call.
>
> The trace starts at 111209.881s, with 944776 pages on the LRUs. It
> finishes at 111216.1 with kswapd going to sleep on node 0 with
> 930067 pages on the LRU. So 7 seconds to free 15,000 pages (call it
> 2,000 pages/s) which is awfully slow....
>
> vmscan gurus - time for you to step in now...
>
Can you show /proc/zoneinfo? I want to know each zone's size.
Below is my memo.
In the trace log, priority = 11 or 12. So I think kswapd can reclaim memory
to satisfy the "sc.nr_reclaimed >= SWAP_CLUSTER_MAX" condition and loops again.
Looking at balance_pgdat() and the trace log, I guess it does:
wake up
shrink_zone(zone=0(DMA?)) => nothing to reclaim.
shrink_slab()
shrink_zone(zone=1(DMA32?)) => reclaim 32,32,31 pages
shrink_slab()
shrink_zone(zone=2(NORMAL?)) => reclaim 13 pages.
shrink_slab()
sleep or retry.
Why does shrink_slab() need to be called as frequently as this?
BTW, I'm sorry if I'm missing something... why does only kswapd reclaim
memory during the 'dd' operation? (no direct reclaim by dd.)
Does this log record the CPU hog after 'dd'?
Thanks,
-Kame
* Re: Kswapd in 3.2.0-rc5 is a CPU hog
2011-12-27 2:15 ` KAMEZAWA Hiroyuki
@ 2011-12-27 2:50 ` Nikolay S.
2011-12-27 4:44 ` KAMEZAWA Hiroyuki
2011-12-27 3:57 ` Minchan Kim
1 sibling, 1 reply; 32+ messages in thread
From: Nikolay S. @ 2011-12-27 2:50 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: Dave Chinner, Michal Hocko, linux-kernel, linux-mm
On Tue, 27/12/2011 at 11:15 +0900, KAMEZAWA Hiroyuki wrote:
> On Sat, 24 Dec 2011 07:45:03 +1100
> Dave Chinner <david@fromorbit.com> wrote:
>
> > On Fri, Dec 23, 2011 at 03:04:02PM +0400, nowhere wrote:
> > > On Fri, 23/12/2011 at 21:20 +1100, Dave Chinner wrote:
> > > > On Fri, Dec 23, 2011 at 01:01:20PM +0400, nowhere wrote:
> > > > > On Thu, 22/12/2011 at 09:55 +1100, Dave Chinner wrote:
> > > > > > On Wed, Dec 21, 2011 at 10:52:49AM +0100, Michal Hocko wrote:
>
> > > Here is the report of trace-cmd while dd'ing
> > > https://80.237.6.56/report-dd.xz
> >
> > Ok, it's not a shrink_slab() problem - it's just being called every ~100us
> > by kswapd. The pattern is:
> >
> > - reclaim 94 (batches of 32,32,30) pages from inactive list
> > of zone 1, node 0, prio 12
> > - call shrink_slab
> > - scan all caches
> > - all shrinkers return 0 saying nothing to shrink
> > - 40us gap
> > - reclaim 10-30 pages from inactive list of zone 2, node 0, prio 12
> > - call shrink_slab
> > - scan all caches
> > - all shrinkers return 0 saying nothing to shrink
> > - 40us gap
> > - isolate 9 pages from LRU zone ?, node ?, none isolated, none freed
> > - isolate 22 pages from LRU zone ?, node ?, none isolated, none freed
> > - call shrink_slab
> > - scan all caches
> > - all shrinkers return 0 saying nothing to shrink
> > 40us gap
> >
> > And it just repeats over and over again. After a while, nid=0,zone=1
> > drops out of the traces, so reclaim only comes in batches of 10-30
> > pages from zone 2 between each shrink_slab() call.
> >
> > The trace starts at 111209.881s, with 944776 pages on the LRUs. It
> > finishes at 111216.1 with kswapd going to sleep on node 0 with
> > 930067 pages on the LRU. So 7 seconds to free 15,000 pages (call it
> > 2,000 pages/s) which is awfully slow....
> >
> > vmscan gurus - time for you to step in now...
> >
>
> Can you show /proc/zoneinfo? I want to know each zone's size.
$ cat /proc/zoneinfo
Node 0, zone DMA
pages free 3980
min 64
low 80
high 96
scanned 0
spanned 4080
present 3916
nr_free_pages 3980
nr_inactive_anon 0
nr_active_anon 0
nr_inactive_file 0
nr_active_file 0
nr_unevictable 0
nr_mlock 0
nr_anon_pages 0
nr_mapped 0
nr_file_pages 0
nr_dirty 0
nr_writeback 0
nr_slab_reclaimable 0
nr_slab_unreclaimable 0
nr_page_table_pages 0
nr_kernel_stack 0
nr_unstable 0
nr_bounce 0
nr_vmscan_write 0
nr_vmscan_immediate_reclaim 0
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 0
nr_dirtied 0
nr_written 0
nr_anon_transparent_hugepages 0
protection: (0, 3503, 4007, 4007)
pagesets
cpu: 0
count: 0
high: 0
batch: 1
vm stats threshold: 4
cpu: 1
count: 0
high: 0
batch: 1
vm stats threshold: 4
all_unreclaimable: 1
start_pfn: 16
inactive_ratio: 1
Node 0, zone DMA32
pages free 19620
min 14715
low 18393
high 22072
scanned 0
spanned 1044480
present 896960
nr_free_pages 19620
nr_inactive_anon 43203
nr_active_anon 206577
nr_inactive_file 412249
nr_active_file 126151
nr_unevictable 7
nr_mlock 7
nr_anon_pages 108557
nr_mapped 6683
nr_file_pages 540415
nr_dirty 5
nr_writeback 0
nr_slab_reclaimable 58887
nr_slab_unreclaimable 12145
nr_page_table_pages 1389
nr_kernel_stack 100
nr_unstable 0
nr_bounce 0
nr_vmscan_write 1021
nr_vmscan_immediate_reclaim 69337
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 1861
nr_dirtied 1586363
nr_written 1245872
nr_anon_transparent_hugepages 272
protection: (0, 0, 504, 504)
pagesets
cpu: 0
count: 4
high: 186
batch: 31
vm stats threshold: 24
cpu: 1
count: 0
high: 186
batch: 31
vm stats threshold: 24
all_unreclaimable: 0
start_pfn: 4096
inactive_ratio: 5
Node 0, zone Normal
pages free 2854
min 2116
low 2645
high 3174
scanned 0
spanned 131072
present 129024
nr_free_pages 2854
nr_inactive_anon 20682
nr_active_anon 10262
nr_inactive_file 47083
nr_active_file 11292
nr_unevictable 518
nr_mlock 518
nr_anon_pages 22801
nr_mapped 1798
nr_file_pages 58853
nr_dirty 0
nr_writeback 0
nr_slab_reclaimable 4347
nr_slab_unreclaimable 5955
nr_page_table_pages 769
nr_kernel_stack 128
nr_unstable 0
nr_bounce 0
nr_vmscan_write 5285
nr_vmscan_immediate_reclaim 51475
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 28
nr_dirtied 251597
nr_written 191561
nr_anon_transparent_hugepages 16
protection: (0, 0, 0, 0)
pagesets
cpu: 0
count: 30
high: 186
batch: 31
vm stats threshold: 12
cpu: 1
count: 0
high: 186
batch: 31
vm stats threshold: 12
all_unreclaimable: 0
start_pfn: 1048576
inactive_ratio:
>
> Below is my memo.
>
> In trace log, priority = 11 or 12. Then, I think kswapd can reclaim memory
> to satisfy "sc.nr_reclaimed >= SWAP_CLUSTER_MAX" condition and loops again.
>
> Seeing balance_pgdat() and trace log, I guess it does
>
> wake up
>
> shrink_zone(zone=0(DMA?)) => nothing to reclaim.
> shrink_slab()
> shrink_zone(zone=1(DMA32?)) => reclaim 32,32,31 pages
> shrink_slab()
> shrink_zone(zone=2(NORMAL?)) => reclaim 13 pages.
> > shrink_slab()
>
> sleep or retry.
>
> > Why does shrink_slab() need to be called frequently like this?
>
> > BTW, I'm sorry if I'm missing something... Why does only kswapd reclaim memory
> > during the 'dd' operation? (no direct reclaim by dd.)
> > Does this log record the cpu hog after 'dd'?
report-dd.xz is while dd was running.
report-normal.xz - some time after.
> Thanks,
> -Kame
>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Kswapd in 3.2.0-rc5 is a CPU hog
2011-12-27 2:15 ` KAMEZAWA Hiroyuki
2011-12-27 2:50 ` Nikolay S.
@ 2011-12-27 3:57 ` Minchan Kim
2011-12-27 4:56 ` KAMEZAWA Hiroyuki
2012-01-11 1:17 ` Rik van Riel
1 sibling, 2 replies; 32+ messages in thread
From: Minchan Kim @ 2011-12-27 3:57 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Dave Chinner, nowhere, Michal Hocko, linux-kernel, linux-mm
On Tue, Dec 27, 2011 at 11:15:43AM +0900, KAMEZAWA Hiroyuki wrote:
> On Sat, 24 Dec 2011 07:45:03 +1100
> Dave Chinner <david@fromorbit.com> wrote:
>
> > On Fri, Dec 23, 2011 at 03:04:02PM +0400, nowhere wrote:
> > > В Пт., 23/12/2011 в 21:20 +1100, Dave Chinner пишет:
> > > > On Fri, Dec 23, 2011 at 01:01:20PM +0400, nowhere wrote:
> > > > > В Чт., 22/12/2011 в 09:55 +1100, Dave Chinner пишет:
> > > > > > On Wed, Dec 21, 2011 at 10:52:49AM +0100, Michal Hocko wrote:
>
> > > Here is the report of trace-cmd while dd'ing
> > > https://80.237.6.56/report-dd.xz
> >
> > Ok, it's not a shrink_slab() problem - it's just being called every ~100us
> > by kswapd. The pattern is:
> >
> > - reclaim 94 (batches of 32,32,30) pages from inactive list
> > of zone 1, node 0, prio 12
> > - call shrink_slab
> > - scan all caches
> > - all shrinkers return 0 saying nothing to shrink
> > - 40us gap
> > - reclaim 10-30 pages from inactive list of zone 2, node 0, prio 12
> > - call shrink_slab
> > - scan all caches
> > - all shrinkers return 0 saying nothing to shrink
> > - 40us gap
> > - isolate 9 pages from LRU zone ?, node ?, none isolated, none freed
> > - isolate 22 pages from LRU zone ?, node ?, none isolated, none freed
> > - call shrink_slab
> > - scan all caches
> > - all shrinkers return 0 saying nothing to shrink
> > 40us gap
> >
> > And it just repeats over and over again. After a while, nid=0,zone=1
> > drops out of the traces, so reclaim only comes in batches of 10-30
> > pages from zone 2 between each shrink_slab() call.
> >
> > The trace starts at 111209.881s, with 944776 pages on the LRUs. It
> > finishes at 111216.1 with kswapd going to sleep on node 0 with
> > 930067 pages on the LRU. So 7 seconds to free 15,000 pages (call it
> > 2,000 pages/s) which is awfully slow....
> >
> > vmscan gurus - time for you to step in now...
> >
>
> Can you show /proc/zoneinfo ? I want to know each zone's size.
>
> Below is my memo.
>
> In trace log, priority = 11 or 12. Then, I think kswapd can reclaim memory
> to satisfy "sc.nr_reclaimed >= SWAP_CLUSTER_MAX" condition and loops again.
>
> Seeing balance_pgdat() and trace log, I guess it does
>
> wake up
>
> shrink_zone(zone=0(DMA?)) => nothing to reclaim.
> shrink_slab()
> shrink_zone(zone=1(DMA32?)) => reclaim 32,32,31 pages
> shrink_slab()
> shrink_zone(zone=2(NORMAL?)) => reclaim 13 pages.
> shrink_slab()
>
> sleep or retry.
>
> Why does shrink_slab() need to be called frequently like this?
I guess it's caused by the small NORMAL zone.
The scenario I think is as follows:
1. dd consumes memory in the NORMAL zone
2. dd enters direct reclaim and wakes up kswapd
3. kswapd reclaims some memory in the NORMAL zone until it reaches the high watermark
4. schedule
5. dd consumes memory again in the NORMAL zone
6. kswapd fails to reach the high watermark due to 5.
7. loop again, goto 3.
The point is the race between reclaim speed and memory consumption speed.
So kswapd can never reach the point where enough pages are free in the NORMAL zone.
>
> > BTW, I'm sorry if I'm missing something... Why does only kswapd reclaim memory
> > during the 'dd' operation? (no direct reclaim by dd.)
> > Does this log record the cpu hog after 'dd'?
If the above scenario is right, dd wouldn't enter direct reclaim to reclaim memory.
>
> Thanks,
> -Kame
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Kswapd in 3.2.0-rc5 is a CPU hog
2011-12-27 2:50 ` Nikolay S.
@ 2011-12-27 4:44 ` KAMEZAWA Hiroyuki
2011-12-27 6:06 ` nowhere
2011-12-28 21:33 ` Dave Chinner
0 siblings, 2 replies; 32+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-12-27 4:44 UTC (permalink / raw)
To: Nikolay S.; +Cc: Dave Chinner, Michal Hocko, linux-kernel, linux-mm
On Tue, 27 Dec 2011 06:50:08 +0400
"Nikolay S." <nowhere@hakkenden.ath.cx> wrote:
> В Вт., 27/12/2011 в 11:15 +0900, KAMEZAWA Hiroyuki пишет:
> > On Sat, 24 Dec 2011 07:45:03 +1100
> > Dave Chinner <david@fromorbit.com> wrote:
> >
> > > On Fri, Dec 23, 2011 at 03:04:02PM +0400, nowhere wrote:
> > > > В Пт., 23/12/2011 в 21:20 +1100, Dave Chinner пишет:
> > > > > On Fri, Dec 23, 2011 at 01:01:20PM +0400, nowhere wrote:
> > > > > > В Чт., 22/12/2011 в 09:55 +1100, Dave Chinner пишет:
> > > > > > > On Wed, Dec 21, 2011 at 10:52:49AM +0100, Michal Hocko wrote:
> >
> > > > Here is the report of trace-cmd while dd'ing
> > > > https://80.237.6.56/report-dd.xz
> > >
> > > Ok, it's not a shrink_slab() problem - it's just being called every ~100us
> > > by kswapd. The pattern is:
> > >
> > > - reclaim 94 (batches of 32,32,30) pages from inactive list
> > > of zone 1, node 0, prio 12
> > > - call shrink_slab
> > > - scan all caches
> > > - all shrinkers return 0 saying nothing to shrink
> > > - 40us gap
> > > - reclaim 10-30 pages from inactive list of zone 2, node 0, prio 12
> > > - call shrink_slab
> > > - scan all caches
> > > - all shrinkers return 0 saying nothing to shrink
> > > - 40us gap
> > > - isolate 9 pages from LRU zone ?, node ?, none isolated, none freed
> > > - isolate 22 pages from LRU zone ?, node ?, none isolated, none freed
> > > - call shrink_slab
> > > - scan all caches
> > > - all shrinkers return 0 saying nothing to shrink
> > > 40us gap
> > >
> > > And it just repeats over and over again. After a while, nid=0,zone=1
> > > drops out of the traces, so reclaim only comes in batches of 10-30
> > > pages from zone 2 between each shrink_slab() call.
> > >
> > > The trace starts at 111209.881s, with 944776 pages on the LRUs. It
> > > finishes at 111216.1 with kswapd going to sleep on node 0 with
> > > 930067 pages on the LRU. So 7 seconds to free 15,000 pages (call it
> > > 2,000 pages/s) which is awfully slow....
> > >
> > > vmscan gurus - time for you to step in now...
> > >
> >
> > Can you show /proc/zoneinfo ? I want to know each zone's size.
>
Thanks,
Questions:
1. Does this system have no swap?
2. In what kernel version did you not see the kswapd issue?
3. Is this a real host, or virtualized?
> $ cat /proc/zoneinfo
...
Node 0, zone DMA32
pages free 19620
min 14715
low 18393
high 22072
scanned 0
spanned 1044480
present 896960
nr_free_pages 19620
nr_inactive_anon 43203
nr_active_anon 206577
nr_inactive_file 412249
nr_active_file 126151
Then, DMA32 (zone=1) file pages are large enough (> 32 << 12).
Hmm, assuming all free pages are used for file cache (of dd):
(412249 + 126151 + 19620) >> 12 = 136
So the 32, 32, 30 scans seem to work as designed.
> Node 0, zone Normal
> pages free 2854
> min 2116
> low 2645
> high 3174
> scanned 0
> spanned 131072
> present 129024
> nr_free_pages 2854
> nr_inactive_anon 20682
> nr_active_anon 10262
> nr_inactive_file 47083
> nr_active_file 11292
Hmm, NORMAL is much smaller than DMA32 (only 500MB).
Then, at priority=12,
13 << 12 = 53248
13 pages per scan seems to work as designed.
To me, it seems kswapd is doing its usual work... reclaiming small amounts of memory
until free pages are enough. And it seems 'dd' allocates its memory from ZONE_DMA32
because of gfp_t fallbacks.
Memo.
1. Why should shrink_slab(), which is not zone aware, be called per zone?
Isn't it enough to call it once per priority?
2. What is the spinlock contention that perf showed?
And if shrink_slab() doesn't consume cpu as the trace shows, why does perf
say shrink_slab() is heavy?
3. Because 8/9 of memory is in DMA32, calling shrink_slab() frequently
while scanning NORMAL seems to be a waste of time.
Thanks,
-Kame
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Kswapd in 3.2.0-rc5 is a CPU hog
2011-12-27 3:57 ` Minchan Kim
@ 2011-12-27 4:56 ` KAMEZAWA Hiroyuki
2012-01-10 22:33 ` Andrew Morton
2012-01-11 0:33 ` Dave Chinner
2012-01-11 1:17 ` Rik van Riel
1 sibling, 2 replies; 32+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-12-27 4:56 UTC (permalink / raw)
To: Minchan Kim; +Cc: Dave Chinner, nowhere, Michal Hocko, linux-kernel, linux-mm
On Tue, 27 Dec 2011 12:57:31 +0900
Minchan Kim <minchan@kernel.org> wrote:
> On Tue, Dec 27, 2011 at 11:15:43AM +0900, KAMEZAWA Hiroyuki wrote:
> > On Sat, 24 Dec 2011 07:45:03 +1100
> > Dave Chinner <david@fromorbit.com> wrote:
> >
> > > On Fri, Dec 23, 2011 at 03:04:02PM +0400, nowhere wrote:
> > > > В Пт., 23/12/2011 в 21:20 +1100, Dave Chinner пишет:
> > > > > On Fri, Dec 23, 2011 at 01:01:20PM +0400, nowhere wrote:
> > > > > > В Чт., 22/12/2011 в 09:55 +1100, Dave Chinner пишет:
> > > > > > > On Wed, Dec 21, 2011 at 10:52:49AM +0100, Michal Hocko wrote:
> >
> > > > Here is the report of trace-cmd while dd'ing
> > > > https://80.237.6.56/report-dd.xz
> > >
> > > Ok, it's not a shrink_slab() problem - it's just being called every ~100us
> > > by kswapd. The pattern is:
> > >
> > > - reclaim 94 (batches of 32,32,30) pages from inactive list
> > > of zone 1, node 0, prio 12
> > > - call shrink_slab
> > > - scan all caches
> > > - all shrinkers return 0 saying nothing to shrink
> > > - 40us gap
> > > - reclaim 10-30 pages from inactive list of zone 2, node 0, prio 12
> > > - call shrink_slab
> > > - scan all caches
> > > - all shrinkers return 0 saying nothing to shrink
> > > - 40us gap
> > > - isolate 9 pages from LRU zone ?, node ?, none isolated, none freed
> > > - isolate 22 pages from LRU zone ?, node ?, none isolated, none freed
> > > - call shrink_slab
> > > - scan all caches
> > > - all shrinkers return 0 saying nothing to shrink
> > > 40us gap
> > >
> > > And it just repeats over and over again. After a while, nid=0,zone=1
> > > drops out of the traces, so reclaim only comes in batches of 10-30
> > > pages from zone 2 between each shrink_slab() call.
> > >
> > > The trace starts at 111209.881s, with 944776 pages on the LRUs. It
> > > finishes at 111216.1 with kswapd going to sleep on node 0 with
> > > 930067 pages on the LRU. So 7 seconds to free 15,000 pages (call it
> > > 2,000 pages/s) which is awfully slow....
> > >
> > > vmscan gurus - time for you to step in now...
> > >
> >
> > Can you show /proc/zoneinfo ? I want to know each zone's size.
> >
> > Below is my memo.
> >
> > In trace log, priority = 11 or 12. Then, I think kswapd can reclaim memory
> > to satisfy "sc.nr_reclaimed >= SWAP_CLUSTER_MAX" condition and loops again.
> >
> > Seeing balance_pgdat() and trace log, I guess it does
> >
> > wake up
> >
> > shrink_zone(zone=0(DMA?)) => nothing to reclaim.
> > shrink_slab()
> > shrink_zone(zone=1(DMA32?)) => reclaim 32,32,31 pages
> > shrink_slab()
> > shrink_zone(zone=2(NORMAL?)) => reclaim 13 pages.
> > shrink_slab()
> >
> > sleep or retry.
> >
> > Why does shrink_slab() need to be called frequently like this?
>
> I guess it's caused by small NORMAL zone.
You're right. I confirmed his zoneinfo.
> The scenario I think is as follows,
>
> 1. dd consumes memory in NORMAL zone
> 2. dd enters direct reclaim and wakes up kswapd
> 3. kswapd reclaims some memory in NORMAL zone until it reaches the high watermark
> 4. schedule
> 5. dd consumes memory again in NORMAL zone
> 6. kswapd fails to reach the high watermark due to 5.
> 7. loop again, goto 3.
>
> The point is the race between reclaim speed and memory consumption speed.
> So kswapd can never reach the point where enough pages are free in NORMAL zone.
>
> >
> > > BTW, I'm sorry if I'm missing something... Why does only kswapd reclaim memory
> > > during the 'dd' operation? (no direct reclaim by dd.)
> > > Does this log record the cpu hog after 'dd'?
>
> If the above scenario is right, dd wouldn't enter direct reclaim to reclaim memory.
>
I think you're right. IIUC, kswapd's behavior is what we usually see.
Hmm, if I understand correctly,
- dd's slowdown is caused by kswapd's cpu consumption.
- kswapd's cpu consumption is enlarged by shrink_slab() (per perf)
- kswapd can't stop because the NORMAL zone is small.
- memory reclaim speed is sufficient because dd can't get enough cpu.
I wonder whether reducing the shrink_slab() calls may help, but I'm not sure
where the lock contention comes from...
Regards,
-Kame
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Kswapd in 3.2.0-rc5 is a CPU hog
2011-12-27 4:44 ` KAMEZAWA Hiroyuki
@ 2011-12-27 6:06 ` nowhere
2011-12-28 21:33 ` Dave Chinner
1 sibling, 0 replies; 32+ messages in thread
From: nowhere @ 2011-12-27 6:06 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: Dave Chinner, Michal Hocko, linux-kernel, linux-mm
В Вт., 27/12/2011 в 13:44 +0900, KAMEZAWA Hiroyuki пишет:
> On Tue, 27 Dec 2011 06:50:08 +0400
> "Nikolay S." <nowhere@hakkenden.ath.cx> wrote:
>
> > В Вт., 27/12/2011 в 11:15 +0900, KAMEZAWA Hiroyuki пишет:
> > > On Sat, 24 Dec 2011 07:45:03 +1100
> > > Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > > On Fri, Dec 23, 2011 at 03:04:02PM +0400, nowhere wrote:
> > > > > В Пт., 23/12/2011 в 21:20 +1100, Dave Chinner пишет:
> > > > > > On Fri, Dec 23, 2011 at 01:01:20PM +0400, nowhere wrote:
> > > > > > > В Чт., 22/12/2011 в 09:55 +1100, Dave Chinner пишет:
> > > > > > > > On Wed, Dec 21, 2011 at 10:52:49AM +0100, Michal Hocko wrote:
> > >
> > > > > Here is the report of trace-cmd while dd'ing
> > > > > https://80.237.6.56/report-dd.xz
> > > >
> > > > Ok, it's not a shrink_slab() problem - it's just being called every ~100us
> > > > by kswapd. The pattern is:
> > > >
> > > > - reclaim 94 (batches of 32,32,30) pages from inactive list
> > > > of zone 1, node 0, prio 12
> > > > - call shrink_slab
> > > > - scan all caches
> > > > - all shrinkers return 0 saying nothing to shrink
> > > > - 40us gap
> > > > - reclaim 10-30 pages from inactive list of zone 2, node 0, prio 12
> > > > - call shrink_slab
> > > > - scan all caches
> > > > - all shrinkers return 0 saying nothing to shrink
> > > > - 40us gap
> > > > - isolate 9 pages from LRU zone ?, node ?, none isolated, none freed
> > > > - isolate 22 pages from LRU zone ?, node ?, none isolated, none freed
> > > > - call shrink_slab
> > > > - scan all caches
> > > > - all shrinkers return 0 saying nothing to shrink
> > > > 40us gap
> > > >
> > > > And it just repeats over and over again. After a while, nid=0,zone=1
> > > > drops out of the traces, so reclaim only comes in batches of 10-30
> > > > pages from zone 2 between each shrink_slab() call.
> > > >
> > > > The trace starts at 111209.881s, with 944776 pages on the LRUs. It
> > > > finishes at 111216.1 with kswapd going to sleep on node 0 with
> > > > 930067 pages on the LRU. So 7 seconds to free 15,000 pages (call it
> > > > 2,000 pages/s) which is awfully slow....
> > > >
> > > > vmscan gurus - time for you to step in now...
> > > >
> > >
> > > Can you show /proc/zoneinfo ? I want to know each zone's size.
> >
>
> Thanks,
> Questions:
> 1. Does this system have no swap?
It has. 4G
> 2. In what kernel version did you not see the kswapd issue?
Hmm... 3.1 and below, I presume
> 3. Is this a real host, or virtualized?
100% real
>
> > $ cat /proc/zoneinfo
> ...
> Node 0, zone DMA32
> pages free 19620
> min 14715
> low 18393
> high 22072
> scanned 0
> spanned 1044480
> present 896960
> nr_free_pages 19620
> nr_inactive_anon 43203
> nr_active_anon 206577
> nr_inactive_file 412249
> nr_active_file 126151
>
> Then, DMA32 (zone=1) file pages are large enough (> 32 << 12).
> Hmm, assuming all free pages are used for file cache (of dd):
>
> (412249 + 126151 + 19620) >> 12 = 136
>
> So the 32, 32, 30 scans seem to work as designed.
>
> > Node 0, zone Normal
> > pages free 2854
> > min 2116
> > low 2645
> > high 3174
> > scanned 0
> > spanned 131072
> > present 129024
> > nr_free_pages 2854
> > nr_inactive_anon 20682
> > nr_active_anon 10262
> > nr_inactive_file 47083
> > nr_active_file 11292
>
> Hmm, NORMAL is much smaller than DMA32. (only 500MB.)
>
> Then, at priority=12,
>
> 13 << 12 = 53248
>
> 13 pages per scan seems to work as designed.
> To me, it seems kswapd is doing its usual work... reclaiming small amounts of memory
> until free pages are enough. And it seems 'dd' allocates its memory from ZONE_DMA32
> because of gfp_t fallbacks.
>
>
> Memo.
>
> 1. Why should shrink_slab(), which is not zone aware, be called per zone?
> Isn't it enough to call it once per priority?
>
> 2. What is the spinlock contention that perf showed?
> And if shrink_slab() doesn't consume cpu as the trace shows, why does perf
> say shrink_slab() is heavy?
>
> 3. Because 8/9 of memory is in DMA32, calling shrink_slab() frequently
> while scanning NORMAL seems to be a waste of time.
>
>
> Thanks,
> -Kame
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Kswapd in 3.2.0-rc5 is a CPU hog
2011-12-27 0:20 ` KAMEZAWA Hiroyuki
@ 2011-12-27 13:33 ` Hillf Danton
2011-12-28 0:06 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 32+ messages in thread
From: Hillf Danton @ 2011-12-27 13:33 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Nikolay S., Dave Chinner, Michal Hocko, linux-kernel, linux-mm
On Tue, Dec 27, 2011 at 8:20 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 26 Dec 2011 20:35:46 +0800
> Hillf Danton <dhillf@gmail.com> wrote:
>
>> On Sun, Dec 25, 2011 at 6:21 PM, Nikolay S. <nowhere@hakkenden.ath.cx> wrote:
>> >
>> > Uhm.., is this patch against 3.2-rc4? I can not apply it. There's no
>> > mem_cgroup_lru_del_list(), but void mem_cgroup_del_lru_list(). Should I
>> > place changes there?
>> >
>> > And also, -rc7 is here. May the problem be addressed as part of some
>> > ongoing work? Is there any reason to try -rc7 (the problem requires
>> > several days of uptime to become obvious)?
>> >
>>
>> Sorry, Nikolay, it is not based on the -next, nor on the -rc5 (I assumed it was).
>> The following is based on -next, and if you want to test -rc5, please
>> grep MEM_CGROUP_ZSTAT mm/memcontrol.c and change it.
>>
>> Best regard
>>
>
> Hmm ? memcg is used ? Why do you consider this will be a help ?
>
Hi Kame
Please see the livelock at
https://lkml.org/lkml/2011/12/23/222
and if it is related to the hog here, please check whether the following is
in the right direction.
Thanks
Hillf
---
--- a/mm/memcontrol.c Mon Dec 26 20:34:38 2011
+++ b/mm/memcontrol.c Tue Dec 27 20:05:12 2011
@@ -3637,6 +3637,7 @@ static int mem_cgroup_force_empty_list(s
list = &mz->lruvec.lists[lru];
loop = MEM_CGROUP_ZSTAT(mz, lru);
+ WARN_ON((long)loop < 0);
/* give some margin against EBUSY etc...*/
loop += 256;
busy = NULL;
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Kswapd in 3.2.0-rc5 is a CPU hog
2011-12-27 13:33 ` Hillf Danton
@ 2011-12-28 0:06 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 32+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-12-28 0:06 UTC (permalink / raw)
To: Hillf Danton
Cc: Nikolay S., Dave Chinner, Michal Hocko, linux-kernel, linux-mm
On Tue, 27 Dec 2011 21:33:04 +0800
Hillf Danton <dhillf@gmail.com> wrote:
> On Tue, Dec 27, 2011 at 8:20 AM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Mon, 26 Dec 2011 20:35:46 +0800
> > Hillf Danton <dhillf@gmail.com> wrote:
> >
> >> On Sun, Dec 25, 2011 at 6:21 PM, Nikolay S. <nowhere@hakkenden.ath.cx> wrote:
> >> >
> >> > Uhm.., is this patch against 3.2-rc4? I can not apply it. There's no
> >> > mem_cgroup_lru_del_list(), but void mem_cgroup_del_lru_list(). Should I
> >> > place changes there?
> >> >
> >> > And also, -rc7 is here. May the problem be addressed as part of some
> >> > ongoing work? Is there any reason to try -rc7 (the problem requires
> >> > several days of uptime to become obvious)?
> >> >
> >>
> >> Sorry, Nikolay, it is not based on the -next, nor on the -rc5 (I assumed it was).
> >> The following is based on -next, and if you want to test -rc5, please
> >> grep MEM_CGROUP_ZSTAT mm/memcontrol.c and change it.
> >>
> >> Best regard
> >>
> >
> > Hmm ? memcg is used ? Why do you consider this will be a help ?
> >
>
> Hi Kame
>
> Please see the livelock at
> https://lkml.org/lkml/2011/12/23/222
>
> and if it is related to the hog here, please check whether the following is
> in the right direction.
>
But linux-next and his 3.2-rc5 have totally different LRU designs.
3.2-rc doesn't have the unified memcg/global LRU, and memcg doesn't affect
kswapd behavior if it is not used.
Thanks,
-Kame
> Thanks
> Hillf
> ---
>
> --- a/mm/memcontrol.c Mon Dec 26 20:34:38 2011
> +++ b/mm/memcontrol.c Tue Dec 27 20:05:12 2011
> @@ -3637,6 +3637,7 @@ static int mem_cgroup_force_empty_list(s
> list = &mz->lruvec.lists[lru];
>
> loop = MEM_CGROUP_ZSTAT(mz, lru);
> + WARN_ON((long)loop < 0);
> /* give some margin against EBUSY etc...*/
> loop += 256;
> busy = NULL;
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Kswapd in 3.2.0-rc5 is a CPU hog
2011-12-27 4:44 ` KAMEZAWA Hiroyuki
2011-12-27 6:06 ` nowhere
@ 2011-12-28 21:33 ` Dave Chinner
2011-12-28 22:57 ` KOSAKI Motohiro
1 sibling, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2011-12-28 21:33 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: Nikolay S., Michal Hocko, linux-kernel, linux-mm
On Tue, Dec 27, 2011 at 01:44:05PM +0900, KAMEZAWA Hiroyuki wrote:
> To me, it seems kswapd does usual work...reclaim small memory until free
> gets enough. And it seems 'dd' allocates its memory from ZONE_DMA32 because
> of gfp_t fallbacks.
>
>
> Memo.
>
> 1. Why should shrink_slab(), which is not zone aware, be called per zone?
> Isn't it enough to call it once per priority?
It is intended that it should be zone aware, but the current
shrinkers only have global LRUs and hence cannot discriminate
between objects from different zones easily. And if only a single
node/zone is being scanned, then we still have to call shrink_slab()
to try to free objects in that zone/node, despite its current
global scope.
I have some prototype patches that make the major slab caches and
shrinkers zone/node aware - that is the eventual goal here - but
all the major slab cache LRUs need to be converted to be node
aware first. Then we can pass a nodemask into shrink_slab() and down
to the shrinkers so that those that have per-node LRUs can scan only
the appropriate nodes for objects to free. This is something that I'm
working on in my spare time, but I have very little of that at the
moment, unfortunately.
> 2. what spinlock contention that perf showed ?
> And if shrink_slab() doesn't consume cpu as trace shows, why perf
> says shrink_slab() is heavy..
There isn't any spin lock contention - it's just showing how
expensive locking superblocks is when it's being done every few
microseconds for no good reason.
> 3. because 8/9 of memory is in DMA32, calling shrink_slab() frequently
> at scanning NORMAL seems to be time wasting.
Especially as the shrink_slab() calls are returning zero pages freed
every single time (i.e. the slab caches are empty). kswapd needs to
back off here, I think, or free more memory at a time. Only freeing
100 pages at a time is pretty inefficient, esp. as we have 4 orders
of magnitude more pages on the LRU and that is consuming >90% of
RAM...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Kswapd in 3.2.0-rc5 is a CPU hog
2011-12-28 21:33 ` Dave Chinner
@ 2011-12-28 22:57 ` KOSAKI Motohiro
2012-01-02 7:00 ` Dave Chinner
0 siblings, 1 reply; 32+ messages in thread
From: KOSAKI Motohiro @ 2011-12-28 22:57 UTC (permalink / raw)
To: Dave Chinner
Cc: KAMEZAWA Hiroyuki, Nikolay S., Michal Hocko, linux-kernel, linux-mm
(12/28/11 4:33 PM), Dave Chinner wrote:
> On Tue, Dec 27, 2011 at 01:44:05PM +0900, KAMEZAWA Hiroyuki wrote:
>> To me, it seems kswapd does usual work...reclaim small memory until free
>> gets enough. And it seems 'dd' allocates its memory from ZONE_DMA32 because
>> of gfp_t fallbacks.
>>
>>
>> Memo.
>>
>> 1. Why should shrink_slab(), which is not zone aware, be called per zone?
>> Isn't it enough to call it once per priority?
>
> It is intended that it should be zone aware, but the current
> shrinkers only have global LRUs and hence cannot discriminate
> between objects from different zones easily. And if only a single
> node/zone is being scanned, then we still have to call shrink_slab()
> to try to free objects in that zone/node, despite its current
> global scope.
>
> I have some prototype patches that make the major slab caches and
> shrinkers zone/node aware - that is the eventual goal here - but
> all the major slab cache LRUs need to be converted to be node
> aware first. Then we can pass a nodemask into shrink_slab() and down
> to the shrinkers so that those that have per-node LRUs can scan only
> the appropriate nodes for objects to free. This is something that I'm
> working on in my spare time, but I have very little of that at the
> moment, unfortunately.
His machine only has one node, so a per-node basis doesn't help him. We need
either 1) to make per-zone LRUs or 2) to implement more clever inter-zone
balancing code, but that seems off topic. I'm not sure why you are starting
to talk about a per-node shrinker.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Kswapd in 3.2.0-rc5 is a CPU hog
2011-12-28 22:57 ` KOSAKI Motohiro
@ 2012-01-02 7:00 ` Dave Chinner
0 siblings, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2012-01-02 7:00 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: KAMEZAWA Hiroyuki, Nikolay S., Michal Hocko, linux-kernel, linux-mm
On Wed, Dec 28, 2011 at 05:57:38PM -0500, KOSAKI Motohiro wrote:
> (12/28/11 4:33 PM), Dave Chinner wrote:
> >On Tue, Dec 27, 2011 at 01:44:05PM +0900, KAMEZAWA Hiroyuki wrote:
> >>To me, it seems kswapd does usual work...reclaim small memory until free
> >>gets enough. And it seems 'dd' allocates its memory from ZONE_DMA32 because
> >>of gfp_t fallbacks.
> >>
> >>
> >>Memo.
> >>
> >>1. Why should shrink_slab(), which is not zone aware, be called per zone?
> >> Isn't it enough to call it once per priority?
> >
> >It is intended that it should be zone aware, but the current
> >shrinkers only have global LRUs and hence cannot discriminate
> >between objects from different zones easily. And if only a single
> >node/zone is being scanned, then we still have to call shrink_slab()
> >to try to free objects in that zone/node, despite its current
> >global scope.
> >
> >I have some prototype patches that make the major slab caches and
> >shrinkers zone/node aware - that is the eventual goal here - but
> >all the major slab cache LRUs need to be converted to be node
> >aware first. Then we can pass a nodemask into shrink_slab() and down
> >to the shrinkers so that those that have per-node LRUs can scan only
> >the appropriate nodes for objects to free. This is something that I'm
> >working on in my spare time, but I have very little of that at the
> >moment, unfortunately.
>
> His machine only has one node, so a per-node basis doesn't help him.
You've got hung up on a little detail like how to define locality(*)
and so completely missed my point.
That being: Improving reclaim granularity won't help at all until
the subsystems index their caches in a more fine grained manner.
And with work in progress to make the shrinkers *locality* aware,
then maybe per-priority is not sufficient, either.
IOWs, I don't think there's a right answer to your question, but
there's a good chance that it will be irrelevant given work that is
in progress...
Cheers,
Dave.
----
(*) Keep in mind that the biggest problem with the current
LRU/shrinker implementations is not the shrinker - it's that lock
contention in fast paths is a major performance limiting factor. In
the VFS, that limit is hit somewhere between 8-16 concurrent
processes all banging on a filesystem doing metadata intensive work.
That's the real problem we have to solve first, and what I'm trying
to do is solve it in a way that -aligns- to the MM architecture.
Forcing everyone to know about MM zones is not the right way to go -
it means we can't easily change the internals of the MM subsystem
without touching 20-30 external shrinker implementations. IOWs,
there needs to be a level of abstraction between the MM internals
and the shrinker API exposed to other subsystems.
However, we still need to expose some type of locality information
through that interface to allow the MM to shrink caches in a more
efficient manner. Given that the allocation interfaces expose
per-node interfaces and so programmers are familiar with this method
of specifying locality, it seems natural to make the shrinker API
symmetric with that. And given that most caches are allocated out of
a single zone per node, having per-zone LRUs is just going to waste
a lot of memory....
But we should really discuss this when I post the patches again...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Kswapd in 3.2.0-rc5 is a CPU hog
2011-12-27 4:56 ` KAMEZAWA Hiroyuki
@ 2012-01-10 22:33 ` Andrew Morton
2012-01-11 3:25 ` Nikolay S.
2012-01-11 0:33 ` Dave Chinner
1 sibling, 1 reply; 32+ messages in thread
From: Andrew Morton @ 2012-01-10 22:33 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Minchan Kim, Dave Chinner, nowhere, Michal Hocko, linux-kernel, linux-mm
On Tue, 27 Dec 2011 13:56:58 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> Hmm, if I understand correctly,
>
> - dd's slowdown is caused by kswapd's cpu consumption.
> - kswapd's cpu consumption is enlarged by shrink_slab() (by perf)
> - kswapd can't stop because NORMAL zone is small.
> - memory reclaim speed is enough because dd can't get enough cpu.
>
> I wonder reducing to call shrink_slab() may be a help but I'm not sure
> where lock conention comes from...
Nikolay, it sounds as if this problem has only recently started
happening? Was 3.1 OK?
If so, we should work out what we did post-3.1 to cause this.
* Re: Kswapd in 3.2.0-rc5 is a CPU hog
2011-12-27 4:56 ` KAMEZAWA Hiroyuki
2012-01-10 22:33 ` Andrew Morton
@ 2012-01-11 0:33 ` Dave Chinner
1 sibling, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2012-01-11 0:33 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Minchan Kim, nowhere, Michal Hocko, linux-kernel, linux-mm
On Tue, Dec 27, 2011 at 01:56:58PM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 27 Dec 2011 12:57:31 +0900
> Minchan Kim <minchan@kernel.org> wrote:
> > The scenario I think is as follows,
> >
> > 1. dd consumes memory in NORMAL zone
> > 2. dd enters direct reclaim and wakes up kswapd
> > 3. kswapd reclaims some memory in NORMAL zone until it reaches the high watermark
> > 4. schedule
> > 5. dd consumes memory again in NORMAL zone
> > 6. kswapd fails to reach the high watermark due to 5.
> > 7. loop again, goto 3.
> >
> > The point is the speed of reclaim vs. memory consumption.
> > So kswapd cannot reach a point where enough pages are in NORMAL zone.
> >
> > >
> > > BTW. I'm sorry if I miss something ...Why only kswapd reclaims memory
> > > while 'dd' operation ? (no direct relcaim by dd.)
> > > Is this log record cpu hog after 'dd' ?
> >
> > If above scenario is right, dd couldn't enter direct reclaim to reclaim memory.
> >
>
> I think you're right. IIUC, kswapd's behavior is what we usually see.
>
> Hmm, if I understand correctly,
>
> - dd's slowdown is caused by kswapd's cpu consumption.
> - kswapd's cpu consumption is dominated by shrink_slab() (according to perf)
> - kswapd can't stop because NORMAL zone is small.
> - memory reclaim speed is sufficient because dd can't get enough cpu.
>
> I wonder if reducing calls to shrink_slab() may help, but I'm not sure
> where the lock contention comes from...
There is no lock contention. It's simply the overhead of grabbing a
passive reference to every superblock in the machine 3 times every
100us.
FWIW, I don't think kswapd should be polling the shrinkers this
often when there are no/very few pages available to be freed from
the slab caches. There are many more shrinkers now than there were
in the past, so the overhead of polling all shrinkers very quickly
is significant.
FWIW, when we move to locality aware shrinkers, the polling overhead
is going to be even higher, so either the VM needs to call
shrink_slab less aggressively, or shrink_slab() needs to have some
method of reducing shrinker polling frequency.....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Kswapd in 3.2.0-rc5 is a CPU hog
2011-12-27 3:57 ` Minchan Kim
2011-12-27 4:56 ` KAMEZAWA Hiroyuki
@ 2012-01-11 1:17 ` Rik van Riel
1 sibling, 0 replies; 32+ messages in thread
From: Rik van Riel @ 2012-01-11 1:17 UTC (permalink / raw)
To: Minchan Kim
Cc: KAMEZAWA Hiroyuki, Dave Chinner, nowhere, Michal Hocko,
linux-kernel, linux-mm
On 12/26/2011 10:57 PM, Minchan Kim wrote:
> I guess it's caused by small NORMAL zone.
> The scenario I think is as follows,
I guess it is exaggerated by a small NORMAL zone. Even on my
system, where the NORMAL zone is about 3x as large as the DMA32
zone, I can see that the pages in the NORMAL zone get recycled
slightly faster than those in the DMA32 zone...
> 1. dd consumes memory in NORMAL zone
> 2. dd enters direct reclaim and wakes up kswapd
> 3. kswapd reclaims some memory in NORMAL zone until it reaches the high watermark
> 4. schedule
> 5. dd consumes memory again in NORMAL zone
> 6. kswapd fails to reach the high watermark due to 5.
> 7. loop again, goto 3.
>
> The point is the speed of reclaim vs. memory consumption.
> So kswapd cannot reach a point where enough pages are in NORMAL zone.
I wonder if it would make sense for kswapd to count how many
pages it needs to free in each zone (at step 2), and then stop
reclaiming once it has freed that many pages.
That could leave the NORMAL (or HIGHMEM, on 32 bit) zone below
the watermark, but as long as the other zones are still good,
allocations can proceed just fine.
It would have the disadvantage of kswapd stopping reclaim with
possibly a zone below the watermark, but the advantage of
better balancing of allocations between zones.
Does this idea make sense?
--
All rights reversed
* Re: Kswapd in 3.2.0-rc5 is a CPU hog
2012-01-10 22:33 ` Andrew Morton
@ 2012-01-11 3:25 ` Nikolay S.
2012-01-11 4:42 ` Andrew Morton
0 siblings, 1 reply; 32+ messages in thread
From: Nikolay S. @ 2012-01-11 3:25 UTC (permalink / raw)
To: Andrew Morton
Cc: KAMEZAWA Hiroyuki, Minchan Kim, Dave Chinner, Michal Hocko,
linux-kernel, linux-mm
On Tue, 10/01/2012 at 14:33 -0800, Andrew Morton wrote:
> On Tue, 27 Dec 2011 13:56:58 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> > Hmm, if I understand correctly,
> >
> > - dd's slowdown is caused by kswapd's cpu consumption.
> > - kswapd's cpu consumption is dominated by shrink_slab() (according to perf)
> > - kswapd can't stop because NORMAL zone is small.
> > - memory reclaim speed is sufficient because dd can't get enough cpu.
> >
> > I wonder if reducing calls to shrink_slab() may help, but I'm not sure
> > where the lock contention comes from...
>
> Nikolay, it sounds as if this problem has only recently started
> happening? Was 3.1 OK?
>
> If so, we should work out what we did post-3.1 to cause this.
Yes, 3.1 was OK.
Recently I upgraded to 3.2, and I cannot reproduce the problem.
I'm now at 5 days of uptime; the machine usage pattern and the software are
all the same, but the problem is not visible anymore:
PID USER S %CPU %MEM TIME+ SWAP COMMAND
14822 nowhere R 30 0.2 0:01.52 10m dd
416 root S 7 0.0 6:26.72 0 kswapd0
(also, kswapd's run time after 5 days is only 6.5 minutes, whereas with
-rc5 it was 22 minutes after 5 days).
I can provide similar traces to see what has changed in kswapd's
activities (if it is of any value).
* Re: Kswapd in 3.2.0-rc5 is a CPU hog
2012-01-11 3:25 ` Nikolay S.
@ 2012-01-11 4:42 ` Andrew Morton
0 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2012-01-11 4:42 UTC (permalink / raw)
To: Nikolay S.
Cc: KAMEZAWA Hiroyuki, Minchan Kim, Dave Chinner, Michal Hocko,
linux-kernel, linux-mm
On Wed, 11 Jan 2012 07:25:20 +0400 "Nikolay S." <nowhere@hakkenden.ath.cx> wrote:
> On Tue, 10/01/2012 at 14:33 -0800, Andrew Morton wrote:
> > On Tue, 27 Dec 2011 13:56:58 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> > > Hmm, if I understand correctly,
> > >
> > > - dd's slowdown is caused by kswapd's cpu consumption.
> > > - kswapd's cpu consumption is dominated by shrink_slab() (according to perf)
> > > - kswapd can't stop because NORMAL zone is small.
> > > - memory reclaim speed is sufficient because dd can't get enough cpu.
> > >
> > > I wonder if reducing calls to shrink_slab() may help, but I'm not sure
> > > where the lock contention comes from...
> >
> > Nikolay, it sounds as if this problem has only recently started
> > happening? Was 3.1 OK?
> >
> > If so, we should work out what we did post-3.1 to cause this.
>
> Yes, 3.1 was OK.
> Recently I upgraded to 3.2, and I cannot reproduce the problem.
> I'm now at 5 days of uptime; the machine usage pattern and the software are
> all the same, but the problem is not visible anymore:
>
> PID USER S %CPU %MEM TIME+ SWAP COMMAND
> 14822 nowhere R 30 0.2 0:01.52 10m dd
> 416 root S 7 0.0 6:26.72 0 kswapd0
>
> (also, kswapd's run time after 5 days is only 6.5 minutes, whereas with
> -rc5 it was 22 minutes after 5 days).
>
> I can provide similar traces to see what has changed in kswapd's
> activities (if it is of any value)
OK, thanks for the followup. I suppose we assume this is fixed unless
someone reports it in 3.2.
Thread overview: 32+ messages
2011-12-21 3:10 Kswapd in 3.2.0-rc5 is a CPU hog Nikolay S.
2011-12-21 9:52 ` Michal Hocko
2011-12-21 10:15 ` nowhere
2011-12-21 10:24 ` Michal Hocko
2011-12-21 10:52 ` nowhere
2011-12-21 14:06 ` Alex Elder
2011-12-21 14:19 ` nowhere
2011-12-21 22:55 ` Dave Chinner
2011-12-23 9:01 ` nowhere
2011-12-23 10:20 ` Dave Chinner
2011-12-23 11:04 ` nowhere
2011-12-23 20:45 ` Dave Chinner
2011-12-25 9:09 ` Hillf Danton
2011-12-25 10:21 ` Nikolay S.
2011-12-26 12:35 ` Hillf Danton
2011-12-27 0:20 ` KAMEZAWA Hiroyuki
2011-12-27 13:33 ` Hillf Danton
2011-12-28 0:06 ` KAMEZAWA Hiroyuki
2011-12-27 2:15 ` KAMEZAWA Hiroyuki
2011-12-27 2:50 ` Nikolay S.
2011-12-27 4:44 ` KAMEZAWA Hiroyuki
2011-12-27 6:06 ` nowhere
2011-12-28 21:33 ` Dave Chinner
2011-12-28 22:57 ` KOSAKI Motohiro
2012-01-02 7:00 ` Dave Chinner
2011-12-27 3:57 ` Minchan Kim
2011-12-27 4:56 ` KAMEZAWA Hiroyuki
2012-01-10 22:33 ` Andrew Morton
2012-01-11 3:25 ` Nikolay S.
2012-01-11 4:42 ` Andrew Morton
2012-01-11 0:33 ` Dave Chinner
2012-01-11 1:17 ` Rik van Riel