* Over-eager swapping
@ 2012-04-23 9:27 ` Richard Davies
0 siblings, 0 replies; 75+ messages in thread
From: Richard Davies @ 2012-04-23 9:27 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Minchan Kim, Wu Fengguang, Balbir Singh, Christoph Lameter,
Lee Schermerhorn, Chris Webb
We run a number of relatively large x86-64 hosts with twenty or so qemu-kvm
virtual machines on each of them, and I'm having some trouble with over-eager
swapping on some of the machines. This is resulting in load spikes during the
swapping and customer reports of very poor response latency from the virtual
machines which have been swapped out, despite the hosts apparently having
large amounts of free memory, and running fine if swap is turned off.
All of the hosts are currently running a 3.1.4 or 3.2.2 kernel and have ksm
enabled with 64GB of RAM and 2x eight-core AMD Opteron 6128 processors.
However, we have seen this same problem since 2010 on a 2.6.32.7 kernel and
older hardware - see http://marc.info/?l=linux-mm&m=128075337008943
(previous helpful contributors cc:ed here - thanks).
We have /proc/sys/vm/swappiness set to 0. The kernel config is here:
http://users.org.uk/config-3.1.4
The rrd graphs at http://imgur.com/a/Fklxr show a typical incident.
We estimate memory used from /proc/meminfo as:
= MemTotal - MemFree - Buffers + SwapTotal - SwapFree
The first rrd shows memory used increasing as a VM starts, but not getting
near the 64GB of physical RAM.
The second rrd shows the heavy swapping this VM start caused.
The third rrd shows a multi-gigabyte jump in swap used (= SwapTotal - SwapFree).
The fourth rrd shows the large load spike (from 1 to 15) caused by this swap
storm.
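For reference, the two estimates above (memory used and swap used) can be computed from /proc/meminfo with a short script. This is a minimal sketch, assuming the standard /proc/meminfo field layout; the helper names are my own, and the demo values are taken from the dump below:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        name, rest = line.split(":", 1)
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields

def usage_estimates(fields):
    """Return (memory_used_kb, swap_used_kb) using the formulas above:
    memory used = MemTotal - MemFree - Buffers + SwapTotal - SwapFree
    swap used   = SwapTotal - SwapFree
    """
    mem_used = (fields["MemTotal"] - fields["MemFree"] - fields["Buffers"]
                + fields["SwapTotal"] - fields["SwapFree"])
    swap_used = fields["SwapTotal"] - fields["SwapFree"]
    return mem_used, swap_used

# Demo with the values from the /proc/meminfo dump below; on a live host
# you would read open("/proc/meminfo").read() instead.
sample = """MemTotal: 65915384 kB
MemFree: 271104 kB
Buffers: 36274368 kB
SwapTotal: 33054708 kB
SwapFree: 30067948 kB"""
mem_used, swap_used = usage_estimates(parse_meminfo(sample))
print(f"memory used: {mem_used} kB, swap used: {swap_used} kB")
```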
It is obviously hard to capture all of the relevant data during an actual
incident. However, as of this morning, the relevant stats are as below.
Any help much appreciated! Our strong belief is that there is unnecessary
swapping going on here, and that it is causing these load spikes. We would
like to run with swap for real out-of-memory situations, but at present it is
causing these kinds of load spikes on machines which run completely happily
with swap disabled.
Thanks,
Richard.
# cat /proc/meminfo
MemTotal: 65915384 kB
MemFree: 271104 kB
Buffers: 36274368 kB
Cached: 31048 kB
SwapCached: 1830860 kB
Active: 30594144 kB
Inactive: 32295972 kB
Active(anon): 21883428 kB
Inactive(anon): 4695308 kB
Active(file): 8710716 kB
Inactive(file): 27600664 kB
Unevictable: 6740 kB
Mlocked: 6740 kB
SwapTotal: 33054708 kB
SwapFree: 30067948 kB
Dirty: 1044 kB
Writeback: 0 kB
AnonPages: 24962708 kB
Mapped: 7320 kB
Shmem: 48 kB
Slab: 2210964 kB
SReclaimable: 1013272 kB
SUnreclaim: 1197692 kB
KernelStack: 6816 kB
PageTables: 129248 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 66012400 kB
Committed_AS: 67375852 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 259380 kB
VmallocChunk: 34308695568 kB
HardwareCorrupted: 0 kB
AnonHugePages: 155648 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 576 kB
DirectMap2M: 2095104 kB
DirectMap1G: 65011712 kB
# cat /proc/sys/vm/zone_reclaim_mode
0
# cat /proc/sys/vm/min_unmapped_ratio
1
# cat /proc/slabinfo
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
ext4_groupinfo_1k 32 32 128 32 1 : tunables 0 0 0 : slabdata 1 1 0
RAWv6 34 34 960 34 8 : tunables 0 0 0 : slabdata 1 1 0
UDPLITEv6 0 0 960 34 8 : tunables 0 0 0 : slabdata 0 0 0
UDPv6 544 544 960 34 8 : tunables 0 0 0 : slabdata 16 16 0
tw_sock_TCPv6 0 0 320 25 2 : tunables 0 0 0 : slabdata 0 0 0
TCPv6 72 72 1728 18 8 : tunables 0 0 0 : slabdata 4 4 0
nf_conntrack_expect 592 592 216 37 2 : tunables 0 0 0 : slabdata 16 16 0
nf_conntrack_ffffffff8199a280 933 1856 280 29 2 : tunables 0 0 0 : slabdata 64 64 0
dm_raid1_read_record 0 0 1064 30 8 : tunables 0 0 0 : slabdata 0 0 0
dm_snap_pending_exception 0 0 104 39 1 : tunables 0 0 0 : slabdata 0 0 0
dm_crypt_io 1811 2574 152 26 1 : tunables 0 0 0 : slabdata 99 99 0
kcopyd_job 0 0 3240 10 8 : tunables 0 0 0 : slabdata 0 0 0
dm_uevent 0 0 2608 12 8 : tunables 0 0 0 : slabdata 0 0 0
cfq_queue 0 0 232 35 2 : tunables 0 0 0 : slabdata 0 0 0
bsg_cmd 0 0 312 26 2 : tunables 0 0 0 : slabdata 0 0 0
mqueue_inode_cache 36 36 896 36 8 : tunables 0 0 0 : slabdata 1 1 0
udf_inode_cache 0 0 656 24 4 : tunables 0 0 0 : slabdata 0 0 0
fuse_request 0 0 608 26 4 : tunables 0 0 0 : slabdata 0 0 0
fuse_inode 0 0 704 46 8 : tunables 0 0 0 : slabdata 0 0 0
ntfs_big_inode_cache 0 0 832 39 8 : tunables 0 0 0 : slabdata 0 0 0
ntfs_inode_cache 0 0 280 29 2 : tunables 0 0 0 : slabdata 0 0 0
isofs_inode_cache 0 0 600 27 4 : tunables 0 0 0 : slabdata 0 0 0
fat_inode_cache 0 0 664 24 4 : tunables 0 0 0 : slabdata 0 0 0
fat_cache 0 0 40 102 1 : tunables 0 0 0 : slabdata 0 0 0
hugetlbfs_inode_cache 28 28 568 28 4 : tunables 0 0 0 : slabdata 1 1 0
squashfs_inode_cache 0 0 640 25 4 : tunables 0 0 0 : slabdata 0 0 0
jbd2_journal_handle 2720 2720 24 170 1 : tunables 0 0 0 : slabdata 16 16 0
jbd2_journal_head 818 1620 112 36 1 : tunables 0 0 0 : slabdata 45 45 0
jbd2_revoke_record 2048 4096 32 128 1 : tunables 0 0 0 : slabdata 32 32 0
ext4_inode_cache 2754 5328 864 37 8 : tunables 0 0 0 : slabdata 144 144 0
ext4_xattr 0 0 88 46 1 : tunables 0 0 0 : slabdata 0 0 0
ext4_free_data 1168 2628 56 73 1 : tunables 0 0 0 : slabdata 36 36 0
ext4_allocation_context 540 540 136 30 1 : tunables 0 0 0 : slabdata 18 18 0
ext4_io_end 0 0 1128 29 8 : tunables 0 0 0 : slabdata 0 0 0
ext4_io_page 256 256 16 256 1 : tunables 0 0 0 : slabdata 1 1 0
configfs_dir_cache 0 0 88 46 1 : tunables 0 0 0 : slabdata 0 0 0
kioctx 0 0 384 42 4 : tunables 0 0 0 : slabdata 0 0 0
inotify_inode_mark 30 30 136 30 1 : tunables 0 0 0 : slabdata 1 1 0
kvm_async_pf 448 448 144 28 1 : tunables 0 0 0 : slabdata 16 16 0
kvm_vcpu 64 94 13856 2 8 : tunables 0 0 0 : slabdata 47 47 0
UDP-Lite 0 0 768 42 8 : tunables 0 0 0 : slabdata 0 0 0
xfrm_dst_cache 0 0 448 36 4 : tunables 0 0 0 : slabdata 0 0 0
ip_fib_trie 219 219 56 73 1 : tunables 0 0 0 : slabdata 3 3 0
arp_cache 417 500 320 25 2 : tunables 0 0 0 : slabdata 20 20 0
RAW 672 672 768 42 8 : tunables 0 0 0 : slabdata 16 16 0
UDP 672 672 768 42 8 : tunables 0 0 0 : slabdata 16 16 0
tw_sock_TCP 512 1088 256 32 2 : tunables 0 0 0 : slabdata 34 34 0
TCP 345 357 1536 21 8 : tunables 0 0 0 : slabdata 17 17 0
blkdev_queue 414 440 1616 20 8 : tunables 0 0 0 : slabdata 22 22 0
blkdev_requests 945 2209 344 47 4 : tunables 0 0 0 : slabdata 47 47 0
sock_inode_cache 456 475 640 25 4 : tunables 0 0 0 : slabdata 19 19 0
shmem_inode_cache 2063 2375 632 25 4 : tunables 0 0 0 : slabdata 95 95 0
Acpi-ParseExt 3848 3864 72 56 1 : tunables 0 0 0 : slabdata 69 69 0
Acpi-Namespace 633667 1059270 40 102 1 : tunables 0 0 0 : slabdata 10385 10385 0
task_delay_info 1238 1584 112 36 1 : tunables 0 0 0 : slabdata 44 44 0
taskstats 384 384 328 24 2 : tunables 0 0 0 : slabdata 16 16 0
proc_inode_cache 2460 3250 616 26 4 : tunables 0 0 0 : slabdata 125 125 0
sigqueue 400 400 160 25 1 : tunables 0 0 0 : slabdata 16 16 0
bdev_cache 701 714 768 42 8 : tunables 0 0 0 : slabdata 17 17 0
sysfs_dir_cache 31662 34425 80 51 1 : tunables 0 0 0 : slabdata 675 675 0
inode_cache 2546 3886 552 29 4 : tunables 0 0 0 : slabdata 134 134 0
dentry 9452 14868 192 42 2 : tunables 0 0 0 : slabdata 354 354 0
buffer_head 8175114 8360937 104 39 1 : tunables 0 0 0 : slabdata 214383 214383 0
vm_area_struct 35344 35834 176 46 2 : tunables 0 0 0 : slabdata 782 782 0
files_cache 736 874 704 46 8 : tunables 0 0 0 : slabdata 19 19 0
signal_cache 1011 1296 896 36 8 : tunables 0 0 0 : slabdata 36 36 0
sighand_cache 682 945 2112 15 8 : tunables 0 0 0 : slabdata 63 63 0
task_struct 1057 1386 1520 21 8 : tunables 0 0 0 : slabdata 66 66 0
anon_vma 2417 2856 72 56 1 : tunables 0 0 0 : slabdata 51 51 0
shared_policy_node 4877 6800 48 85 1 : tunables 0 0 0 : slabdata 80 80 0
numa_policy 45589 48450 24 170 1 : tunables 0 0 0 : slabdata 285 285 0
radix_tree_node 227192 248388 568 28 4 : tunables 0 0 0 : slabdata 9174 9174 0
idr_layer_cache 603 660 544 30 4 : tunables 0 0 0 : slabdata 22 22 0
dma-kmalloc-8192 0 0 8192 4 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-4096 0 0 4096 8 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-2048 0 0 2048 16 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-1024 0 0 1024 32 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-512 0 0 512 32 4 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-256 0 0 256 32 2 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-128 0 0 128 32 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-64 0 0 64 64 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-32 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-16 0 0 16 256 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-8 0 0 8 512 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-192 0 0 192 42 2 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-96 0 0 96 42 1 : tunables 0 0 0 : slabdata 0 0 0
kmalloc-8192 88 100 8192 4 8 : tunables 0 0 0 : slabdata 25 25 0
kmalloc-4096 3567 3704 4096 8 8 : tunables 0 0 0 : slabdata 463 463 0
kmalloc-2048 55140 55936 2048 16 8 : tunables 0 0 0 : slabdata 3496 3496 0
kmalloc-1024 5960 6496 1024 32 8 : tunables 0 0 0 : slabdata 203 203 0
kmalloc-512 12185 12704 512 32 4 : tunables 0 0 0 : slabdata 397 397 0
kmalloc-256 195078 199040 256 32 2 : tunables 0 0 0 : slabdata 6220 6220 0
kmalloc-128 45645 47328 128 32 1 : tunables 0 0 0 : slabdata 1479 1479 0
kmalloc-64 14647251 14776576 64 64 1 : tunables 0 0 0 : slabdata 230884 230884 0
kmalloc-32 5573 7552 32 128 1 : tunables 0 0 0 : slabdata 59 59 0
kmalloc-16 7550 10752 16 256 1 : tunables 0 0 0 : slabdata 42 42 0
kmalloc-8 13805 14848 8 512 1 : tunables 0 0 0 : slabdata 29 29 0
kmalloc-192 47641 50883 192 42 2 : tunables 0 0 0 : slabdata 1214 1214 0
kmalloc-96 3673 6006 96 42 1 : tunables 0 0 0 : slabdata 143 143 0
kmem_cache 32 32 256 32 2 : tunables 0 0 0 : slabdata 1 1 0
kmem_cache_node 495 576 64 64 1 : tunables 0 0 0 : slabdata 9 9 0
# cat /proc/buddyinfo
Node 0, zone DMA 0 0 1 0 2 1 1 0 1 1 3
Node 0, zone DMA32 9148 1941 657 673 131 53 18 2 0 0 0
Node 0, zone Normal 8080 13 0 2 0 2 1 0 1 0 0
Node 1, zone Normal 19071 3239 675 200 413 37 4 1 2 0 0
Node 2, zone Normal 37716 3924 154 9 3 1 2 0 1 0 0
Node 3, zone Normal 20015 4590 1768 996 334 20 1 1 1 0 0
# grep MemFree /sys/devices/system/node/node*/meminfo
/sys/devices/system/node/node0/meminfo:Node 0 MemFree: 201460 kB
/sys/devices/system/node/node1/meminfo:Node 1 MemFree: 283224 kB
/sys/devices/system/node/node2/meminfo:Node 2 MemFree: 287060 kB
/sys/devices/system/node/node3/meminfo:Node 3 MemFree: 316928 kB
# cat /proc/vmstat
nr_free_pages 224933
nr_inactive_anon 1173838
nr_active_anon 5209232
nr_inactive_file 6998686
nr_active_file 2180311
nr_unevictable 1685
nr_mlock 1685
nr_anon_pages 5940145
nr_mapped 1836
nr_file_pages 9635092
nr_dirty 603
nr_writeback 0
nr_slab_reclaimable 253121
nr_slab_unreclaimable 299440
nr_page_table_pages 32311
nr_kernel_stack 854
nr_unstable 0
nr_bounce 0
nr_vmscan_write 50485772
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 12
nr_dirtied 5630347228
nr_written 5625041387
numa_hit 28372623283
numa_miss 4761673976
numa_foreign 4761673976
numa_interleave 30490
numa_local 28372334279
numa_other 4761962980
nr_anon_transparent_hugepages 76
nr_dirty_threshold 8192
nr_dirty_background_threshold 4096
pgpgin 9523143630
pgpgout 23124688920
pswpin 57978726
pswpout 50121412
pgalloc_dma 0
pgalloc_dma32 1132547190
pgalloc_normal 32421613044
pgalloc_movable 0
pgfree 39379011152
pgactivate 751722445
pgdeactivate 591205976
pgfault 41103638391
pgmajfault 11853858
pgrefill_dma 0
pgrefill_dma32 24124080
pgrefill_normal 540719764
pgrefill_movable 0
pgsteal_dma 0
pgsteal_dma32 297677595
pgsteal_normal 4784595717
pgsteal_movable 0
pgscan_kswapd_dma 0
pgscan_kswapd_dma32 241277864
pgscan_kswapd_normal 4004618399
pgscan_kswapd_movable 0
pgscan_direct_dma 0
pgscan_direct_dma32 65729843
pgscan_direct_normal 1012932822
pgscan_direct_movable 0
zone_reclaim_failed 0
pginodesteal 66
slabs_scanned 668153728
kswapd_steal 4063341017
kswapd_inodesteal 2063
kswapd_low_wmark_hit_quickly 9834
kswapd_high_wmark_hit_quickly 488468
kswapd_skip_congestion_wait 580150
pageoutrun 22006623
allocstall 926752
pgrotated 28467920
compact_blocks_moved 522323130
compact_pages_moved 5774251432
compact_pagemigrate_failed 5267247
compact_stall 121045
compact_fail 68349
compact_success 52696
htlb_buddy_alloc_success 0
htlb_buddy_alloc_fail 0
unevictable_pgs_culled 19976952
unevictable_pgs_scanned 0
unevictable_pgs_rescued 33137561
unevictable_pgs_mlocked 35042070
unevictable_pgs_munlocked 33138335
unevictable_pgs_cleared 0
unevictable_pgs_stranded 0
unevictable_pgs_mlockfreed 1024
thp_fault_alloc 263176
thp_fault_fallback 717335
thp_collapse_alloc 21307
thp_collapse_alloc_failed 91103
thp_split 90328
* Re: Over-eager swapping
2012-04-23 9:27 ` Richard Davies
@ 2012-04-23 12:07 ` Zdenek Kaspar
-1 siblings, 0 replies; 75+ messages in thread
From: Zdenek Kaspar @ 2012-04-23 12:07 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-mm
On 23.4.2012 11:27, Richard Davies wrote:
> We run a number of relatively large x86-64 hosts with twenty or so qemu-kvm
> virtual machines on each of them, and I'm having some trouble with over-eager
> swapping on some of the machines. This is resulting in load spikes during the
> swapping and customer reports of very poor response latency from the virtual
> machines which have been swapped out, despite the hosts apparently having
> large amounts of free memory, and running fine if swap is turned off.
>
>
> All of the hosts are currently running a 3.1.4 or 3.2.2 kernel and have ksm
> enabled with 64GB of RAM and 2x eight-core AMD Opteron 6128 processors.
> However, we have seen this same problem since 2010 on a 2.6.32.7 kernel and
> older hardware - see http://marc.info/?l=linux-mm&m=128075337008943
> (previous helpful contributors cc:ed here - thanks).
>
> We have /proc/sys/vm/swappiness set to 0. The kernel config is here:
> http://users.org.uk/config-3.1.4
>
>
> The rrd graphs at http://imgur.com/a/Fklxr show a typical incident.
>
> We estimate memory used from /proc/meminfo as:
>
> = MemTotal - MemFree - Buffers + SwapTotal - SwapFree
>
> The first rrd shows memory used increasing as a VM starts, but not getting
> near the 64GB of physical RAM.
>
> The second rrd shows the heavy swapping this VM start caused.
>
> The third rrd shows a multi-gigabyte jump in swap used = SwapTotal - SwapFree
>
> The fourth rrd shows the large load spike (from 1 to 15) caused by this swap
> storm.
>
>
> It is obviously hard to capture all of the relevant data actually during an
> incident. However, as of this morning, the relevant stats are as below.
>
> Any help much appreciated! Our strong belief is that there is unnecessary
> swapping going on here, and causing these load spikes. We would like to run
> with swap for real out-of-memory situations, but at present it is causing
> these kind of load spikes on machines which run completely happily with swap
> disabled.
>
> Thanks,
>
> Richard.
>
> [... quoted /proc/meminfo, slabinfo, buddyinfo and vmstat dumps snipped;
> see the original message above ...]
> buffer_head 8175114 8360937 104 39 1 : tunables 0 0 0 : slabdata 214383 214383 0
> vm_area_struct 35344 35834 176 46 2 : tunables 0 0 0 : slabdata 782 782 0
> files_cache 736 874 704 46 8 : tunables 0 0 0 : slabdata 19 19 0
> signal_cache 1011 1296 896 36 8 : tunables 0 0 0 : slabdata 36 36 0
> sighand_cache 682 945 2112 15 8 : tunables 0 0 0 : slabdata 63 63 0
> task_struct 1057 1386 1520 21 8 : tunables 0 0 0 : slabdata 66 66 0
> anon_vma 2417 2856 72 56 1 : tunables 0 0 0 : slabdata 51 51 0
> shared_policy_node 4877 6800 48 85 1 : tunables 0 0 0 : slabdata 80 80 0
> numa_policy 45589 48450 24 170 1 : tunables 0 0 0 : slabdata 285 285 0
> radix_tree_node 227192 248388 568 28 4 : tunables 0 0 0 : slabdata 9174 9174 0
> idr_layer_cache 603 660 544 30 4 : tunables 0 0 0 : slabdata 22 22 0
> dma-kmalloc-8192 0 0 8192 4 8 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-4096 0 0 4096 8 8 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-2048 0 0 2048 16 8 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-1024 0 0 1024 32 8 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-512 0 0 512 32 4 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-256 0 0 256 32 2 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-128 0 0 128 32 1 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-64 0 0 64 64 1 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-32 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-16 0 0 16 256 1 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-8 0 0 8 512 1 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-192 0 0 192 42 2 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-96 0 0 96 42 1 : tunables 0 0 0 : slabdata 0 0 0
> kmalloc-8192 88 100 8192 4 8 : tunables 0 0 0 : slabdata 25 25 0
> kmalloc-4096 3567 3704 4096 8 8 : tunables 0 0 0 : slabdata 463 463 0
> kmalloc-2048 55140 55936 2048 16 8 : tunables 0 0 0 : slabdata 3496 3496 0
> kmalloc-1024 5960 6496 1024 32 8 : tunables 0 0 0 : slabdata 203 203 0
> kmalloc-512 12185 12704 512 32 4 : tunables 0 0 0 : slabdata 397 397 0
> kmalloc-256 195078 199040 256 32 2 : tunables 0 0 0 : slabdata 6220 6220 0
> kmalloc-128 45645 47328 128 32 1 : tunables 0 0 0 : slabdata 1479 1479 0
> kmalloc-64 14647251 14776576 64 64 1 : tunables 0 0 0 : slabdata 230884 230884 0
> kmalloc-32 5573 7552 32 128 1 : tunables 0 0 0 : slabdata 59 59 0
> kmalloc-16 7550 10752 16 256 1 : tunables 0 0 0 : slabdata 42 42 0
> kmalloc-8 13805 14848 8 512 1 : tunables 0 0 0 : slabdata 29 29 0
> kmalloc-192 47641 50883 192 42 2 : tunables 0 0 0 : slabdata 1214 1214 0
> kmalloc-96 3673 6006 96 42 1 : tunables 0 0 0 : slabdata 143 143 0
> kmem_cache 32 32 256 32 2 : tunables 0 0 0 : slabdata 1 1 0
> kmem_cache_node 495 576 64 64 1 : tunables 0 0 0 : slabdata 9 9 0
>
> # cat /proc/buddyinfo
> Node 0, zone DMA 0 0 1 0 2 1 1 0 1 1 3
> Node 0, zone DMA32 9148 1941 657 673 131 53 18 2 0 0 0
> Node 0, zone Normal 8080 13 0 2 0 2 1 0 1 0 0
> Node 1, zone Normal 19071 3239 675 200 413 37 4 1 2 0 0
> Node 2, zone Normal 37716 3924 154 9 3 1 2 0 1 0 0
> Node 3, zone Normal 20015 4590 1768 996 334 20 1 1 1 0 0
>
> # grep MemFree /sys/devices/system/node/node*/meminfo
> /sys/devices/system/node/node0/meminfo:Node 0 MemFree: 201460 kB
> /sys/devices/system/node/node1/meminfo:Node 1 MemFree: 283224 kB
> /sys/devices/system/node/node2/meminfo:Node 2 MemFree: 287060 kB
> /sys/devices/system/node/node3/meminfo:Node 3 MemFree: 316928 kB
>
> # cat /proc/vmstat
> nr_free_pages 224933
> nr_inactive_anon 1173838
> nr_active_anon 5209232
> nr_inactive_file 6998686
> nr_active_file 2180311
> nr_unevictable 1685
> nr_mlock 1685
> nr_anon_pages 5940145
> nr_mapped 1836
> nr_file_pages 9635092
> nr_dirty 603
> nr_writeback 0
> nr_slab_reclaimable 253121
> nr_slab_unreclaimable 299440
> nr_page_table_pages 32311
> nr_kernel_stack 854
> nr_unstable 0
> nr_bounce 0
> nr_vmscan_write 50485772
> nr_writeback_temp 0
> nr_isolated_anon 0
> nr_isolated_file 0
> nr_shmem 12
> nr_dirtied 5630347228
> nr_written 5625041387
> numa_hit 28372623283
> numa_miss 4761673976
> numa_foreign 4761673976
> numa_interleave 30490
> numa_local 28372334279
> numa_other 4761962980
> nr_anon_transparent_hugepages 76
> nr_dirty_threshold 8192
> nr_dirty_background_threshold 4096
> pgpgin 9523143630
> pgpgout 23124688920
> pswpin 57978726
> pswpout 50121412
> pgalloc_dma 0
> pgalloc_dma32 1132547190
> pgalloc_normal 32421613044
> pgalloc_movable 0
> pgfree 39379011152
> pgactivate 751722445
> pgdeactivate 591205976
> pgfault 41103638391
> pgmajfault 11853858
> pgrefill_dma 0
> pgrefill_dma32 24124080
> pgrefill_normal 540719764
> pgrefill_movable 0
> pgsteal_dma 0
> pgsteal_dma32 297677595
> pgsteal_normal 4784595717
> pgsteal_movable 0
> pgscan_kswapd_dma 0
> pgscan_kswapd_dma32 241277864
> pgscan_kswapd_normal 4004618399
> pgscan_kswapd_movable 0
> pgscan_direct_dma 0
> pgscan_direct_dma32 65729843
> pgscan_direct_normal 1012932822
> pgscan_direct_movable 0
> zone_reclaim_failed 0
> pginodesteal 66
> slabs_scanned 668153728
> kswapd_steal 4063341017
> kswapd_inodesteal 2063
> kswapd_low_wmark_hit_quickly 9834
> kswapd_high_wmark_hit_quickly 488468
> kswapd_skip_congestion_wait 580150
> pageoutrun 22006623
> allocstall 926752
> pgrotated 28467920
> compact_blocks_moved 522323130
> compact_pages_moved 5774251432
> compact_pagemigrate_failed 5267247
> compact_stall 121045
> compact_fail 68349
> compact_success 52696
> htlb_buddy_alloc_success 0
> htlb_buddy_alloc_fail 0
> unevictable_pgs_culled 19976952
> unevictable_pgs_scanned 0
> unevictable_pgs_rescued 33137561
> unevictable_pgs_mlocked 35042070
> unevictable_pgs_munlocked 33138335
> unevictable_pgs_cleared 0
> unevictable_pgs_stranded 0
> unevictable_pgs_mlockfreed 1024
> thp_fault_alloc 263176
> thp_fault_fallback 717335
> thp_collapse_alloc 21307
> thp_collapse_alloc_failed 91103
> thp_split 90328
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>
Since I have this issue too: does anyone on the list have an idea
whether it's possible to disable memory reclaim for specific processes,
without patching binaries? It's really frustrating to see
latency-sensitive processes swapped out, for example:
./getswap.pl
PID COMMAND SWSIZE
1 init 116 kB
594 udevd 288 kB
1211 dhclient 344 kB
1255 rsyslogd 208 kB
1274 rpcbind 140 kB
1292 rpc.statd 444 kB
1310 mdadm 84 kB
1421 upsmon 436 kB
1422 upsmon 408 kB
1432 sshd 556 kB
1454 ksmtuned 96 kB
1463 crond 552 kB
1494 smartd 164 kB
1502 mingetty 76 kB
2200 smbd 620 kB
2212 smbd 748 kB
2213 nmbd 532 kB
2265 rpc.mountd 428 kB
2282 tgtd 92 kB
2283 tgtd 96 kB
2328 qemu-vm3 15512 kB
2366 qemu-vm2 13204 kB
2410 qemu-vm4 17140 kB
2448 qemu-vm5 38532 kB
2495 qemu-vm6 19148 kB
2534 qemu-vm7 44552 kB
2579 qemu-vm9 18788 kB
2620 qemu-vm10 19256 kB
2699 qemu-vm8 40204 kB
6376 qemu-vm1 29232 kB
7646 ntpd 280 kB
32322 smbd 468 kB
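The getswap.pl source isn't shown; below is a minimal Python sketch of the same idea, assuming the script sums the VmSwap field from each /proc/PID/status (a field available on kernels since roughly 2.6.34):

```python
import glob
import re

def vmswap_kb(status_text):
    """Parse the 'VmSwap: N kB' line from a /proc/PID/status blob;
    returns 0 if the field is absent (e.g. kernel threads)."""
    m = re.search(r"^VmSwap:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(m.group(1)) if m else 0

def swapped_processes():
    """Yield (pid, comm, swap_kb) for every process with swap in use."""
    for path in glob.glob("/proc/[0-9]*/status"):
        try:
            with open(path) as f:
                text = f.read()
        except OSError:          # process exited while we were scanning
            continue
        kb = vmswap_kb(text)
        if kb:
            comm = re.search(r"^Name:\s+(\S+)", text, re.MULTILINE).group(1)
            yield int(path.split("/")[2]), comm, kb

if __name__ == "__main__":
    for pid, comm, kb in sorted(swapped_processes()):
        print(f"{pid:>6} {comm:<16} {kb} kB")
```

This only reproduces the output format above approximately; the real script may read /proc/PID/smaps instead, which gives the same totals per mapping.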
NOTE to OP: I just avoid swap where possible, but with qemu-kvm you can
use hugetlbfs as another workaround, though you will sacrifice some
functionality, such as KSM and perhaps memory ballooning.
HTH, Z.
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Over-eager swapping
2012-04-23 9:27 ` Richard Davies
@ 2012-04-23 17:19 ` Dave Hansen
1 sibling, 0 replies; 75+ messages in thread
From: Dave Hansen @ 2012-04-23 17:19 UTC (permalink / raw)
To: Richard Davies
Cc: linux-kernel, linux-mm, Minchan Kim, Wu Fengguang, Balbir Singh,
Christoph Lameter, Lee Schermerhorn, Chris Webb, Badari
On 04/23/2012 02:27 AM, Richard Davies wrote:
> # cat /proc/meminfo
> MemTotal: 65915384 kB
> MemFree: 271104 kB
> Buffers: 36274368 kB
Your "Buffers" are the only thing that really stands out here. We used
to see this kind of thing on ext3 a lot, but it's gotten much better
lately. From slabinfo, you can see all the buffer_heads:
buffer_head 8175114 8360937 104 39 1 : tunables 0 0 0 : slabdata 214383 214383 0
I _think_ this was a filesystem issue where the FS for some reason kept
the buffers locked down. The swapping just comes later, as so much of
RAM is eaten up by buffers.
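As a sanity check on these numbers: the buffer_head structs themselves account for well under 1 GB; the 36 GB of "Buffers" in meminfo is the page-cache data those heads describe. A small Python sketch of the arithmetic from the slabinfo fields:

```python
# Footprint of the buffer_head slab cache, computed from the
# /proc/slabinfo line quoted above:
#   buffer_head 8175114 8360937 104 39 1 : ... : slabdata 214383 214383 0
num_objs = 8_360_937        # total allocated objects
objsize = 104               # bytes per buffer_head
objs_per_slab = 39
pages_per_slab = 1
num_slabs = 214_383
PAGE_SIZE = 4096

obj_bytes = num_objs * objsize                    # payload of the structs
slab_bytes = num_slabs * pages_per_slab * PAGE_SIZE   # pages backing them

print(f"object payload: {obj_bytes / 2**20:.0f} MiB")   # ~829 MiB
print(f"slab pages:     {slab_bytes / 2**20:.0f} MiB")  # ~837 MiB
```

So the slab overhead is consistent (214383 slabs × 39 objects = 8360937 objects), and the heads cost ~0.8 GiB of unreclaimable memory on top of the buffer pages they track.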
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Over-eager swapping
2012-04-23 9:27 ` Richard Davies
@ 2012-04-24 0:35 ` Minchan Kim
1 sibling, 0 replies; 75+ messages in thread
From: Minchan Kim @ 2012-04-24 0:35 UTC (permalink / raw)
To: Richard Davies
Cc: linux-kernel, linux-mm, Minchan Kim, Wu Fengguang, Balbir Singh,
Christoph Lameter, Lee Schermerhorn, Chris Webb
On 04/23/2012 06:27 PM, Richard Davies wrote:
> We run a number of relatively large x86-64 hosts with twenty or so qemu-kvm
> virtual machines on each of them, and I'm having some trouble with over-eager
> swapping on some of the machines. This is resulting in load spikes during the
> swapping and customer reports of very poor response latency from the virtual
> machines which have been swapped out, despite the hosts apparently having
> large amounts of free memory, and running fine if swap is turned off.
>
>
> All of the hosts are currently running a 3.1.4 or 3.2.2 kernel and have ksm
> enabled with 64GB of RAM and 2x eight-core AMD Opteron 6128 processors.
> However, we have seen this same problem since 2010 on a 2.6.32.7 kernel and
> older hardware - see http://marc.info/?l=linux-mm&m=128075337008943
> (previous helpful contributors cc:ed here - thanks).
>
> We have /proc/sys/vm/swappiness set to 0. The kernel config is here:
> http://users.org.uk/config-3.1.4
Although you set swappiness to 0, the kernel can still swap out anon
pages in the current implementation. I think that's a severe problem.
Could this patch help you?
http://permalink.gmane.org/gmane.linux.kernel.mm/74824
It prevents anon pages from being swapped out until only a few
page-cache pages remain.
--
Kind regards,
Minchan Kim
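For context on why swappiness = 0 does not stop swapping: in kernels of this era, get_scan_count() weights anon vs. file scanning by swappiness, but a low-memory check forces anon-only scanning once little page cache remains. A simplified, illustrative Python sketch (the condition and constants approximate the 3.x logic; this is not the kernel code):

```python
# Simplified sketch of the anon/file scan split in get_scan_count()
# (kernels around 3.1/3.2), ignoring the recent_scanned/recent_rotated
# feedback terms.

def scan_split(swappiness, free_pages, file_pages, high_wmark):
    # When file + free pages fall below the high watermark, global
    # reclaim scans only anon pages -- even with swappiness == 0.
    # This is why swappiness = 0 does not fully prevent swapping.
    if file_pages + free_pages <= high_wmark:
        return (1.0, 0.0)               # (anon share, file share)
    anon_prio = swappiness              # 0..100
    file_prio = 200 - swappiness
    total = anon_prio + file_prio       # always 200
    return (anon_prio / total, file_prio / total)

print(scan_split(0, free_pages=10**6, file_pages=10**7, high_wmark=16384))
# -> (0.0, 1.0): with plenty of page cache, swappiness 0 scans only file
print(scan_split(0, free_pages=1000, file_pages=5000, high_wmark=16384))
# -> (1.0, 0.0): with little cache left, anon is scanned anyway
```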
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Over-eager swapping
2012-04-23 9:27 ` Richard Davies
@ 2012-04-24 11:16 ` Peter Lieven
1 sibling, 0 replies; 75+ messages in thread
From: Peter Lieven @ 2012-04-24 11:16 UTC (permalink / raw)
To: Richard Davies
Cc: linux-kernel, linux-mm, Minchan Kim, Wu Fengguang, Balbir Singh,
Christoph Lameter, Lee Schermerhorn, Chris Webb
On 23.04.2012 11:27, Richard Davies wrote:
> We run a number of relatively large x86-64 hosts with twenty or so qemu-kvm
> virtual machines on each of them, and I'm having some trouble with over-eager
> swapping on some of the machines. This is resulting in load spikes during the
> swapping and customer reports of very poor response latency from the virtual
> machines which have been swapped out, despite the hosts apparently having
> large amounts of free memory, and running fine if swap is turned off.
>
>
> All of the hosts are currently running a 3.1.4 or 3.2.2 kernel and have ksm
> enabled with 64GB of RAM and 2x eight-core AMD Opteron 6128 processors.
> However, we have seen this same problem since 2010 on a 2.6.32.7 kernel and
> older hardware - see http://marc.info/?l=linux-mm&m=128075337008943
> (previous helpful contributors cc:ed here - thanks).
>
> We have /proc/sys/vm/swappiness set to 0. The kernel config is here:
> http://users.org.uk/config-3.1.4
>
>
> The rrd graphs at http://imgur.com/a/Fklxr show a typical incident.
>
> We estimate memory used from /proc/meminfo as:
>
> = MemTotal - MemFree - Buffers + SwapTotal - SwapFree
>
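For reference, the estimate above can be computed directly from /proc/meminfo. A minimal sketch (an editorial illustration, not part of the original report; field names are as they appear in the meminfo dump quoted later in this message):

```shell
# Estimate "memory used" as MemTotal - MemFree - Buffers + SwapTotal - SwapFree
# (all values are in kB, read from /proc/meminfo).
f=/proc/meminfo
get() { awk -v k="$1:" '$1 == k { print $2 }' "$f"; }
used=$(( $(get MemTotal) - $(get MemFree) - $(get Buffers) \
       + $(get SwapTotal) - $(get SwapFree) ))
echo "used: ${used} kB"
```

On the figures reported below (MemTotal 65915384, MemFree 271104, Buffers 36274368, SwapTotal 33054708, SwapFree 30067948) this works out to 32356672 kB, i.e. roughly 31 GB "used" on a 64 GB host.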
> The first rrd shows memory used increasing as a VM starts, but not getting
> near the 64GB of physical RAM.
>
> The second rrd shows the heavy swapping this VM start caused.
>
> The third rrd shows a multi-gigabyte jump in swap used = SwapTotal - SwapFree
>
> The fourth rrd shows the large load spike (from 1 to 15) caused by this swap
> storm.
>
>
> It is obviously hard to capture all of the relevant data actually during an
> incident. However, as of this morning, the relevant stats are as below.
>
> Any help much appreciated! Our strong belief is that there is unnecessary
> swapping going on here, and causing these load spikes. We would like to run
> with swap for real out-of-memory situations, but at present it is causing
> these kinds of load spikes on machines which run completely happily with swap
> disabled.
I wonder why so many buffers are allocated at all. Can you describe what your
storage backend is and provide your qemu-kvm command line? Anyhow,
qemu-devel@nongnu.org might be the better list to discuss this.
Peter
> Thanks,
>
> Richard.
>
>
> # cat /proc/meminfo
> MemTotal: 65915384 kB
> MemFree: 271104 kB
> Buffers: 36274368 kB
> Cached: 31048 kB
> SwapCached: 1830860 kB
> Active: 30594144 kB
> Inactive: 32295972 kB
> Active(anon): 21883428 kB
> Inactive(anon): 4695308 kB
> Active(file): 8710716 kB
> Inactive(file): 27600664 kB
> Unevictable: 6740 kB
> Mlocked: 6740 kB
> SwapTotal: 33054708 kB
> SwapFree: 30067948 kB
> Dirty: 1044 kB
> Writeback: 0 kB
> AnonPages: 24962708 kB
> Mapped: 7320 kB
> Shmem: 48 kB
> Slab: 2210964 kB
> SReclaimable: 1013272 kB
> SUnreclaim: 1197692 kB
> KernelStack: 6816 kB
> PageTables: 129248 kB
> NFS_Unstable: 0 kB
> Bounce: 0 kB
> WritebackTmp: 0 kB
> CommitLimit: 66012400 kB
> Committed_AS: 67375852 kB
> VmallocTotal: 34359738367 kB
> VmallocUsed: 259380 kB
> VmallocChunk: 34308695568 kB
> HardwareCorrupted: 0 kB
> AnonHugePages: 155648 kB
> HugePages_Total: 0
> HugePages_Free: 0
> HugePages_Rsvd: 0
> HugePages_Surp: 0
> Hugepagesize: 2048 kB
> DirectMap4k: 576 kB
> DirectMap2M: 2095104 kB
> DirectMap1G: 65011712 kB
>
> # cat /proc/sys/vm/zone_reclaim_mode
> 0
>
> # cat /proc/sys/vm/min_unmapped_ratio
> 1
>
> # cat /proc/slabinfo
> slabinfo - version: 2.1
> # name<active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables<limit> <batchcount> <sharedfactor> : slabdata<active_slabs> <num_slabs> <sharedavail>
> ext4_groupinfo_1k 32 32 128 32 1 : tunables 0 0 0 : slabdata 1 1 0
> RAWv6 34 34 960 34 8 : tunables 0 0 0 : slabdata 1 1 0
> UDPLITEv6 0 0 960 34 8 : tunables 0 0 0 : slabdata 0 0 0
> UDPv6 544 544 960 34 8 : tunables 0 0 0 : slabdata 16 16 0
> tw_sock_TCPv6 0 0 320 25 2 : tunables 0 0 0 : slabdata 0 0 0
> TCPv6 72 72 1728 18 8 : tunables 0 0 0 : slabdata 4 4 0
> nf_conntrack_expect 592 592 216 37 2 : tunables 0 0 0 : slabdata 16 16 0
> nf_conntrack_ffffffff8199a280 933 1856 280 29 2 : tunables 0 0 0 : slabdata 64 64 0
> dm_raid1_read_record 0 0 1064 30 8 : tunables 0 0 0 : slabdata 0 0 0
> dm_snap_pending_exception 0 0 104 39 1 : tunables 0 0 0 : slabdata 0 0 0
> dm_crypt_io 1811 2574 152 26 1 : tunables 0 0 0 : slabdata 99 99 0
> kcopyd_job 0 0 3240 10 8 : tunables 0 0 0 : slabdata 0 0 0
> dm_uevent 0 0 2608 12 8 : tunables 0 0 0 : slabdata 0 0 0
> cfq_queue 0 0 232 35 2 : tunables 0 0 0 : slabdata 0 0 0
> bsg_cmd 0 0 312 26 2 : tunables 0 0 0 : slabdata 0 0 0
> mqueue_inode_cache 36 36 896 36 8 : tunables 0 0 0 : slabdata 1 1 0
> udf_inode_cache 0 0 656 24 4 : tunables 0 0 0 : slabdata 0 0 0
> fuse_request 0 0 608 26 4 : tunables 0 0 0 : slabdata 0 0 0
> fuse_inode 0 0 704 46 8 : tunables 0 0 0 : slabdata 0 0 0
> ntfs_big_inode_cache 0 0 832 39 8 : tunables 0 0 0 : slabdata 0 0 0
> ntfs_inode_cache 0 0 280 29 2 : tunables 0 0 0 : slabdata 0 0 0
> isofs_inode_cache 0 0 600 27 4 : tunables 0 0 0 : slabdata 0 0 0
> fat_inode_cache 0 0 664 24 4 : tunables 0 0 0 : slabdata 0 0 0
> fat_cache 0 0 40 102 1 : tunables 0 0 0 : slabdata 0 0 0
> hugetlbfs_inode_cache 28 28 568 28 4 : tunables 0 0 0 : slabdata 1 1 0
> squashfs_inode_cache 0 0 640 25 4 : tunables 0 0 0 : slabdata 0 0 0
> jbd2_journal_handle 2720 2720 24 170 1 : tunables 0 0 0 : slabdata 16 16 0
> jbd2_journal_head 818 1620 112 36 1 : tunables 0 0 0 : slabdata 45 45 0
> jbd2_revoke_record 2048 4096 32 128 1 : tunables 0 0 0 : slabdata 32 32 0
> ext4_inode_cache 2754 5328 864 37 8 : tunables 0 0 0 : slabdata 144 144 0
> ext4_xattr 0 0 88 46 1 : tunables 0 0 0 : slabdata 0 0 0
> ext4_free_data 1168 2628 56 73 1 : tunables 0 0 0 : slabdata 36 36 0
> ext4_allocation_context 540 540 136 30 1 : tunables 0 0 0 : slabdata 18 18 0
> ext4_io_end 0 0 1128 29 8 : tunables 0 0 0 : slabdata 0 0 0
> ext4_io_page 256 256 16 256 1 : tunables 0 0 0 : slabdata 1 1 0
> configfs_dir_cache 0 0 88 46 1 : tunables 0 0 0 : slabdata 0 0 0
> kioctx 0 0 384 42 4 : tunables 0 0 0 : slabdata 0 0 0
> inotify_inode_mark 30 30 136 30 1 : tunables 0 0 0 : slabdata 1 1 0
> kvm_async_pf 448 448 144 28 1 : tunables 0 0 0 : slabdata 16 16 0
> kvm_vcpu 64 94 13856 2 8 : tunables 0 0 0 : slabdata 47 47 0
> UDP-Lite 0 0 768 42 8 : tunables 0 0 0 : slabdata 0 0 0
> xfrm_dst_cache 0 0 448 36 4 : tunables 0 0 0 : slabdata 0 0 0
> ip_fib_trie 219 219 56 73 1 : tunables 0 0 0 : slabdata 3 3 0
> arp_cache 417 500 320 25 2 : tunables 0 0 0 : slabdata 20 20 0
> RAW 672 672 768 42 8 : tunables 0 0 0 : slabdata 16 16 0
> UDP 672 672 768 42 8 : tunables 0 0 0 : slabdata 16 16 0
> tw_sock_TCP 512 1088 256 32 2 : tunables 0 0 0 : slabdata 34 34 0
> TCP 345 357 1536 21 8 : tunables 0 0 0 : slabdata 17 17 0
> blkdev_queue 414 440 1616 20 8 : tunables 0 0 0 : slabdata 22 22 0
> blkdev_requests 945 2209 344 47 4 : tunables 0 0 0 : slabdata 47 47 0
> sock_inode_cache 456 475 640 25 4 : tunables 0 0 0 : slabdata 19 19 0
> shmem_inode_cache 2063 2375 632 25 4 : tunables 0 0 0 : slabdata 95 95 0
> Acpi-ParseExt 3848 3864 72 56 1 : tunables 0 0 0 : slabdata 69 69 0
> Acpi-Namespace 633667 1059270 40 102 1 : tunables 0 0 0 : slabdata 10385 10385 0
> task_delay_info 1238 1584 112 36 1 : tunables 0 0 0 : slabdata 44 44 0
> taskstats 384 384 328 24 2 : tunables 0 0 0 : slabdata 16 16 0
> proc_inode_cache 2460 3250 616 26 4 : tunables 0 0 0 : slabdata 125 125 0
> sigqueue 400 400 160 25 1 : tunables 0 0 0 : slabdata 16 16 0
> bdev_cache 701 714 768 42 8 : tunables 0 0 0 : slabdata 17 17 0
> sysfs_dir_cache 31662 34425 80 51 1 : tunables 0 0 0 : slabdata 675 675 0
> inode_cache 2546 3886 552 29 4 : tunables 0 0 0 : slabdata 134 134 0
> dentry 9452 14868 192 42 2 : tunables 0 0 0 : slabdata 354 354 0
> buffer_head 8175114 8360937 104 39 1 : tunables 0 0 0 : slabdata 214383 214383 0
> vm_area_struct 35344 35834 176 46 2 : tunables 0 0 0 : slabdata 782 782 0
> files_cache 736 874 704 46 8 : tunables 0 0 0 : slabdata 19 19 0
> signal_cache 1011 1296 896 36 8 : tunables 0 0 0 : slabdata 36 36 0
> sighand_cache 682 945 2112 15 8 : tunables 0 0 0 : slabdata 63 63 0
> task_struct 1057 1386 1520 21 8 : tunables 0 0 0 : slabdata 66 66 0
> anon_vma 2417 2856 72 56 1 : tunables 0 0 0 : slabdata 51 51 0
> shared_policy_node 4877 6800 48 85 1 : tunables 0 0 0 : slabdata 80 80 0
> numa_policy 45589 48450 24 170 1 : tunables 0 0 0 : slabdata 285 285 0
> radix_tree_node 227192 248388 568 28 4 : tunables 0 0 0 : slabdata 9174 9174 0
> idr_layer_cache 603 660 544 30 4 : tunables 0 0 0 : slabdata 22 22 0
> dma-kmalloc-8192 0 0 8192 4 8 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-4096 0 0 4096 8 8 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-2048 0 0 2048 16 8 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-1024 0 0 1024 32 8 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-512 0 0 512 32 4 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-256 0 0 256 32 2 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-128 0 0 128 32 1 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-64 0 0 64 64 1 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-32 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-16 0 0 16 256 1 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-8 0 0 8 512 1 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-192 0 0 192 42 2 : tunables 0 0 0 : slabdata 0 0 0
> dma-kmalloc-96 0 0 96 42 1 : tunables 0 0 0 : slabdata 0 0 0
> kmalloc-8192 88 100 8192 4 8 : tunables 0 0 0 : slabdata 25 25 0
> kmalloc-4096 3567 3704 4096 8 8 : tunables 0 0 0 : slabdata 463 463 0
> kmalloc-2048 55140 55936 2048 16 8 : tunables 0 0 0 : slabdata 3496 3496 0
> kmalloc-1024 5960 6496 1024 32 8 : tunables 0 0 0 : slabdata 203 203 0
> kmalloc-512 12185 12704 512 32 4 : tunables 0 0 0 : slabdata 397 397 0
> kmalloc-256 195078 199040 256 32 2 : tunables 0 0 0 : slabdata 6220 6220 0
> kmalloc-128 45645 47328 128 32 1 : tunables 0 0 0 : slabdata 1479 1479 0
> kmalloc-64 14647251 14776576 64 64 1 : tunables 0 0 0 : slabdata 230884 230884 0
> kmalloc-32 5573 7552 32 128 1 : tunables 0 0 0 : slabdata 59 59 0
> kmalloc-16 7550 10752 16 256 1 : tunables 0 0 0 : slabdata 42 42 0
> kmalloc-8 13805 14848 8 512 1 : tunables 0 0 0 : slabdata 29 29 0
> kmalloc-192 47641 50883 192 42 2 : tunables 0 0 0 : slabdata 1214 1214 0
> kmalloc-96 3673 6006 96 42 1 : tunables 0 0 0 : slabdata 143 143 0
> kmem_cache 32 32 256 32 2 : tunables 0 0 0 : slabdata 1 1 0
> kmem_cache_node 495 576 64 64 1 : tunables 0 0 0 : slabdata 9 9 0
>
> # cat /proc/buddyinfo
> Node 0, zone DMA 0 0 1 0 2 1 1 0 1 1 3
> Node 0, zone DMA32 9148 1941 657 673 131 53 18 2 0 0 0
> Node 0, zone Normal 8080 13 0 2 0 2 1 0 1 0 0
> Node 1, zone Normal 19071 3239 675 200 413 37 4 1 2 0 0
> Node 2, zone Normal 37716 3924 154 9 3 1 2 0 1 0 0
> Node 3, zone Normal 20015 4590 1768 996 334 20 1 1 1 0 0
>
> # grep MemFree /sys/devices/system/node/node*/meminfo
> /sys/devices/system/node/node0/meminfo:Node 0 MemFree: 201460 kB
> /sys/devices/system/node/node1/meminfo:Node 1 MemFree: 283224 kB
> /sys/devices/system/node/node2/meminfo:Node 2 MemFree: 287060 kB
> /sys/devices/system/node/node3/meminfo:Node 3 MemFree: 316928 kB
>
> # cat /proc/vmstat
> nr_free_pages 224933
> nr_inactive_anon 1173838
> nr_active_anon 5209232
> nr_inactive_file 6998686
> nr_active_file 2180311
> nr_unevictable 1685
> nr_mlock 1685
> nr_anon_pages 5940145
> nr_mapped 1836
> nr_file_pages 9635092
> nr_dirty 603
> nr_writeback 0
> nr_slab_reclaimable 253121
> nr_slab_unreclaimable 299440
> nr_page_table_pages 32311
> nr_kernel_stack 854
> nr_unstable 0
> nr_bounce 0
> nr_vmscan_write 50485772
> nr_writeback_temp 0
> nr_isolated_anon 0
> nr_isolated_file 0
> nr_shmem 12
> nr_dirtied 5630347228
> nr_written 5625041387
> numa_hit 28372623283
> numa_miss 4761673976
> numa_foreign 4761673976
> numa_interleave 30490
> numa_local 28372334279
> numa_other 4761962980
> nr_anon_transparent_hugepages 76
> nr_dirty_threshold 8192
> nr_dirty_background_threshold 4096
> pgpgin 9523143630
> pgpgout 23124688920
> pswpin 57978726
> pswpout 50121412
> pgalloc_dma 0
> pgalloc_dma32 1132547190
> pgalloc_normal 32421613044
> pgalloc_movable 0
> pgfree 39379011152
> pgactivate 751722445
> pgdeactivate 591205976
> pgfault 41103638391
> pgmajfault 11853858
> pgrefill_dma 0
> pgrefill_dma32 24124080
> pgrefill_normal 540719764
> pgrefill_movable 0
> pgsteal_dma 0
> pgsteal_dma32 297677595
> pgsteal_normal 4784595717
> pgsteal_movable 0
> pgscan_kswapd_dma 0
> pgscan_kswapd_dma32 241277864
> pgscan_kswapd_normal 4004618399
> pgscan_kswapd_movable 0
> pgscan_direct_dma 0
> pgscan_direct_dma32 65729843
> pgscan_direct_normal 1012932822
> pgscan_direct_movable 0
> zone_reclaim_failed 0
> pginodesteal 66
> slabs_scanned 668153728
> kswapd_steal 4063341017
> kswapd_inodesteal 2063
> kswapd_low_wmark_hit_quickly 9834
> kswapd_high_wmark_hit_quickly 488468
> kswapd_skip_congestion_wait 580150
> pageoutrun 22006623
> allocstall 926752
> pgrotated 28467920
> compact_blocks_moved 522323130
> compact_pages_moved 5774251432
> compact_pagemigrate_failed 5267247
> compact_stall 121045
> compact_fail 68349
> compact_success 52696
> htlb_buddy_alloc_success 0
> htlb_buddy_alloc_fail 0
> unevictable_pgs_culled 19976952
> unevictable_pgs_scanned 0
> unevictable_pgs_rescued 33137561
> unevictable_pgs_mlocked 35042070
> unevictable_pgs_munlocked 33138335
> unevictable_pgs_cleared 0
> unevictable_pgs_stranded 0
> unevictable_pgs_mlockfreed 1024
> thp_fault_alloc 263176
> thp_fault_fallback 717335
> thp_collapse_alloc 21307
> thp_collapse_alloc_failed 91103
> thp_split 90328
>
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Over-eager swapping
2012-04-23 9:27 ` Richard Davies
@ 2012-04-25 14:41 ` Rik van Riel
0 siblings, 0 replies; 75+ messages in thread
From: Rik van Riel @ 2012-04-25 14:41 UTC (permalink / raw)
To: Richard Davies
Cc: linux-kernel, linux-mm, Minchan Kim, Wu Fengguang, Balbir Singh,
Christoph Lameter, Lee Schermerhorn, Chris Webb
On 04/23/2012 05:27 AM, Richard Davies wrote:
> The rrd graphs at http://imgur.com/a/Fklxr show a typical incident.
>
> We estimate memory used from /proc/meminfo as:
>
> = MemTotal - MemFree - Buffers + SwapTotal - SwapFree
>
> The first rrd shows memory used increasing as a VM starts, but not getting
> near the 64GB of physical RAM.
>
> The second rrd shows the heavy swapping this VM start caused.
>
> The third rrd shows a multi-gigabyte jump in swap used = SwapTotal - SwapFree
>
> The fourth rrd shows the large load spike (from 1 to 15) caused by this swap
> storm.
These are exactly the kind of swap storms that led me
to make the VM tweaks that got merged into 3.4-rc :)
See these commits:
fe2c2a106663130a5ab45cb0e3414b52df2fff0c
7be62de99adcab4449d416977b4274985c5fe023
aff622495c9a0b56148192e53bdec539f5e147f2
1480de0340a8d5f094b74d7c4b902456c9a06903
496b919b3bdd957d4b1727df79bfa3751bced1c1
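Since these commits were merged for 3.4-rc, a quick heuristic for whether a host's kernel predates them is a version comparison. A rough sketch (an editorial illustration using GNU `sort -V`; distribution kernels may of course carry backports, so this is only indicative):

```shell
# The reclaim fixes above were merged for 3.4-rc; compare the running
# kernel version against 3.4 (heuristic only -- backports not detected).
fixed="3.4"
running="$(uname -r)"
oldest="$(printf '%s\n' "$fixed" "$running" | sort -V | head -n1)"
if [ "$oldest" = "$fixed" ]; then
    echo "kernel $running is >= $fixed and should contain the fixes"
else
    echo "kernel $running predates $fixed"
fi
```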
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Over-eager swapping
2010-08-19 10:20 ` Chris Webb
@ 2010-08-19 19:03 ` Christoph Lameter
0 siblings, 0 replies; 75+ messages in thread
From: Christoph Lameter @ 2010-08-19 19:03 UTC (permalink / raw)
To: Chris Webb
Cc: Lee Schermerhorn, Wu Fengguang, Minchan Kim, linux-mm,
linux-kernel, KOSAKI Motohiro, Pekka Enberg, Andi Kleen
On Thu, 19 Aug 2010, Chris Webb wrote:
> I tried this on a handful of the problem hosts before re-adding their swap.
> One of them now runs without dipping into swap. The other three I tried had
> the same behaviour of sitting at zero swap usage for a while, before
> suddenly spiralling up with %wait going through the roof. I had to swapoff
> on them to bring them back into a sane state. So it looks like it helps a
> bit, but doesn't cure the problem.
>
> I could definitely believe an explanation that we're swapping in preference
> to allocating remote zone pages somehow, given the imbalance in free memory
> between the nodes which we saw. However, I read the documentation for
> vm.zone_reclaim_mode, which suggests to me that when it was set to zero,
> pages from remote zones should be allocated automatically in preference to
> swap given that zone_reclaim_mode & 4 == 0?
If zone reclaim is off, then pages from other nodes will be allocated when a
node is filled up with page cache.
Zone reclaim typically only evicts clean page cache pages in order to keep
the additional overhead down. Enabling swapping allows a more aggressive
form of recovering memory in preference to going off node.
The VM should work fine even without zone reclaim.
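For reference, zone_reclaim_mode is a bitmask (per Documentation/sysctl/vm.txt: bit 1 enables zone reclaim, bit 2 allows writing out dirty pages, bit 4 allows swapping pages during zone reclaim). A small sketch that decodes the setting, as an editorial illustration:

```shell
# Decode /proc/sys/vm/zone_reclaim_mode: bit 1 = zone reclaim on,
# bit 2 = write out dirty pages, bit 4 = swap pages during zone reclaim.
# The file only exists on NUMA kernels, so fall back to 0 if absent.
mode="$(cat /proc/sys/vm/zone_reclaim_mode 2>/dev/null || echo 0)"
if [ $((mode & 1)) -ne 0 ]; then echo "zone reclaim enabled"; fi
if [ $((mode & 2)) -ne 0 ]; then echo "zone reclaim writes dirty pages"; fi
if [ $((mode & 4)) -ne 0 ]; then echo "zone reclaim may swap pages"; fi
if [ "$mode" -eq 0 ]; then
    echo "zone reclaim off: allocations fall back to other nodes"
fi
```

With the hosts above reporting mode 0, the last branch applies: zone reclaim itself cannot be the source of the swapping.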
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Over-eager swapping
2010-08-19 9:25 ` Chris Webb
@ 2010-08-19 15:13 ` Balbir Singh
-1 siblings, 0 replies; 75+ messages in thread
From: Balbir Singh @ 2010-08-19 15:13 UTC (permalink / raw)
To: Chris Webb
Cc: linux-mm, linux-kernel, Wu Fengguang, Minchan Kim, KOSAKI Motohiro
* Chris Webb <chris@arachsys.com> [2010-08-19 10:25:36]:
> Balbir Singh <balbir@linux.vnet.ibm.com> writes:
>
> > Can you give an idea of what the meminfo inside the guest looks like.
>
> Sorry for the slow reply here. Unfortunately not, as these guests are run on
> behalf of customers. They install them with operating systems of their
> choice, and run them on our service.
>
Thanks for clarifying.
> > Have you looked at
> > http://kerneltrap.org/mailarchive/linux-kernel/2010/6/8/4580772
>
> Yes, I've been watching these discussions with interest. Our application is
> one where we have little to no control over what goes on inside the guests,
> but these sorts of things definitely make sense where the two are under the
> same administrative control.
>
Not necessarily, in some cases you can use a guest that uses less
page cache, but that might not matter in your case at the moment.
> > Do we have reason to believe the problem can be solved entirely in the
> > host?
>
> It's not clear to me why this should be difficult, given that the total size
> of vm allocated to guests (and system processes) is always strictly less
> than the total amount of RAM available in the host. I do understand that it
> won't allow for as impressive overcommit (except by ksm) or be as efficient,
> because file-backed guest pages won't get evicted by pressure in the host as
> they are indistinguishable from anonymous pages.
>
> After all, a solution that isn't ideal, but does work, is to turn off swap
> completely! This is what we've been doing to date. The only problem with
> this is that we can't dip into swap in an emergency if there's no swap there
> at all.
If you are not overcommitting it should work, in my experiments I've
seen a lot of memory used by the host as page cache on behalf of the
guest. I've done my experiments using cgroups to identify accurate
usage.
--
Three Cheers,
Balbir
^ permalink raw reply [flat|nested] 75+ messages in thread
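The cgroup-based accounting Balbir mentions can be sketched roughly as below; the mount point and the per-VM group names (vm-*) are illustrative assumptions, not anything from this thread, and the v1 memory controller is assumed:

```shell
# Report memory charged to each (hypothetical) per-VM memory cgroup.
cgroot=/sys/fs/cgroup/memory
report=""
for g in "$cgroot"/vm-*; do
  if [ -r "$g/memory.usage_in_bytes" ]; then
    report="$report $(basename "$g")=$(cat "$g/memory.usage_in_bytes")"
  fi
done
msg="guest usage:${report:- none found (is the v1 memory controller mounted?)}"
echo "$msg"
```

Per-group figures separate what each guest is actually charged from what the host otherwise lumps together as one pool of anonymous memory.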
* Re: Over-eager swapping
2010-08-18 16:13 ` Christoph Lameter
@ 2010-08-19 10:20 ` Chris Webb
-1 siblings, 0 replies; 75+ messages in thread
From: Chris Webb @ 2010-08-19 10:20 UTC (permalink / raw)
To: Christoph Lameter
Cc: Lee Schermerhorn, Wu Fengguang, Minchan Kim, linux-mm,
linux-kernel, KOSAKI Motohiro, Pekka Enberg, Andi Kleen
Christoph Lameter <cl@linux-foundation.org> writes:
> On Wed, 18 Aug 2010, Chris Webb wrote:
>
> > > != 0. And even then, zone reclaim should only reclaim file pages, not
> > > anon. In theory...
> >
> > Hi. This is zero on all our machines:
> >
> > # sysctl vm.zone_reclaim_mode
> > vm.zone_reclaim_mode = 0
>
> Set it to 1.
I tried this on a handful of the problem hosts before re-adding their swap.
One of them now runs without dipping into swap. The other three I tried had
the same behaviour of sitting at zero swap usage for a while, before
suddenly spiralling up with %wait going through the roof. I had to swapoff
on them to bring them back into a sane state. So it looks like it helps a
bit, but doesn't cure the problem.
I could definitely believe an explanation that we're swapping in preference
to allocating remote zone pages somehow, given the imbalance in free memory
between the nodes which we saw. However, I read the documentation for
vm.zone_reclaim_mode, which suggests to me that when it was set to zero,
pages from remote zones should be allocated automatically in preference to
swap given that zone_reclaim_mode & 4 == 0?
Cheers,
Chris.
^ permalink raw reply [flat|nested] 75+ messages in thread
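Chris's reading of the bitmask can be checked mechanically. A sketch decoding the three bits documented in the kernel's vm sysctl notes (1 = zone reclaim on, 2 = write dirty pages during reclaim, 4 = swap pages during reclaim), applied to the value 0 quoted above:

```shell
# With vm.zone_reclaim_mode = 0 no bit is set, so in particular bit 4
# (zone-local swapping) should be off, as Chris argues.
mode=0
decoded=""
if [ $((mode & 1)) -ne 0 ]; then decoded="$decoded zone-reclaim"; fi
if [ $((mode & 2)) -ne 0 ]; then decoded="$decoded write-dirty"; fi
if [ $((mode & 4)) -ne 0 ]; then decoded="$decoded swap"; fi
echo "zone_reclaim_mode=$mode:${decoded:- nothing enabled}"
```

Setting the sysctl to 1, as Christoph suggests, enables only zone-local reclaim of clean page cache, not writeback or swap.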
* Re: Over-eager swapping
2010-08-18 16:45 ` Balbir Singh
@ 2010-08-19 9:25 ` Chris Webb
-1 siblings, 0 replies; 75+ messages in thread
From: Chris Webb @ 2010-08-19 9:25 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, linux-kernel, Wu Fengguang, Minchan Kim, KOSAKI Motohiro
Balbir Singh <balbir@linux.vnet.ibm.com> writes:
> Can you give an idea of what the meminfo inside the guest looks like.
Sorry for the slow reply here. Unfortunately not, as these guests are run on
behalf of customers. They install them with operating systems of their
choice, and run them on our service.
> Have you looked at
> http://kerneltrap.org/mailarchive/linux-kernel/2010/6/8/4580772
Yes, I've been watching these discussions with interest. Our application is
one where we have little to no control over what goes on inside the guests,
but these sorts of things definitely make sense where the two are under the
same administrative control.
> Do we have reason to believe the problem can be solved entirely in the
> host?
It's not clear to me why this should be difficult, given that the total size
of vm allocated to guests (and system processes) is always strictly less
than the total amount of RAM available in the host. I do understand that it
won't allow for as impressive overcommit (except by ksm) or be as efficient,
because file-backed guest pages won't get evicted by pressure in the host as
they are indistinguishable from anonymous pages.
After all, a solution that isn't ideal, but does work, is to turn off swap
completely! This is what we've been doing to date. The only problem with
this is that we can't dip into swap in an emergency if there's no swap there
at all.
Best wishes,
Chris.
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Over-eager swapping
2010-08-18 16:13 ` Christoph Lameter
@ 2010-08-19 5:16 ` Balbir Singh
-1 siblings, 0 replies; 75+ messages in thread
From: Balbir Singh @ 2010-08-19 5:16 UTC (permalink / raw)
To: Christoph Lameter
Cc: Chris Webb, Lee Schermerhorn, Wu Fengguang, Minchan Kim,
linux-mm, linux-kernel, KOSAKI Motohiro, Pekka Enberg,
Andi Kleen
* Christoph Lameter <cl@linux-foundation.org> [2010-08-18 11:13:03]:
> On Wed, 18 Aug 2010, Chris Webb wrote:
>
> > > != 0. And even then, zone reclaim should only reclaim file pages, not
> > > anon. In theory...
> >
> > Hi. This is zero on all our machines:
> >
> > # sysctl vm.zone_reclaim_mode
> > vm.zone_reclaim_mode = 0
>
> Set it to 1.
>
Isn't that bad in terms of how we treat the cost of remote node
allocations? Is local zone_reclaim() always a good thing or is it
something for Chris to try and see if that helps his situation?
--
Three Cheers,
Balbir
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Over-eager swapping
2010-08-03 4:28 ` Wu Fengguang
@ 2010-08-19 5:13 ` Balbir Singh
-1 siblings, 0 replies; 75+ messages in thread
From: Balbir Singh @ 2010-08-19 5:13 UTC (permalink / raw)
To: Wu Fengguang
Cc: Minchan Kim, Chris Webb, linux-mm, linux-kernel, KOSAKI Motohiro
* Wu Fengguang <fengguang.wu@intel.com> [2010-08-03 12:28:35]:
> On Tue, Aug 03, 2010 at 12:09:18PM +0800, Minchan Kim wrote:
> > On Tue, Aug 3, 2010 at 12:31 PM, Chris Webb <chris@arachsys.com> wrote:
> > > Minchan Kim <minchan.kim@gmail.com> writes:
> > >
> > >> Another possibility is _zone_reclaim_ in NUMA.
> > >> Your working set has many anonymous page.
> > >>
> > >> The zone_reclaim set priority to ZONE_RECLAIM_PRIORITY.
> > >> It can make reclaim mode to lumpy so it can page out anon pages.
> > >>
> > >> Could you show me /proc/sys/vm/[zone_reclaim_mode/min_unmapped_ratio] ?
> > >
> > > Sure, no problem. On the machine with the /proc/meminfo I showed earlier,
> > > these are
> > >
> > > # cat /proc/sys/vm/zone_reclaim_mode
> > > 0
> > > # cat /proc/sys/vm/min_unmapped_ratio
> > > 1
> >
> > if zone_reclaim_mode is zero, it doesn't swap out anon_pages.
>
> If there are lots of order-1 or higher allocations, anonymous pages
> will be randomly evicted, regardless of their LRU ages. This is
> probably another factor behind what the users are reporting. Are there easy ways to
> confirm this other than patching the kernel?
>
> Chris, what's in your /proc/slabinfo?
>
I don't know if Chris saw the link I pointed to earlier, but one of
the reclaim challenges with virtual machines is that cached memory
in the guest (in fact all memory) shows up as anonymous on the host.
If the guests are doing a lot of caching and the guest reclaim sees
no reason to evict the cache, the host will see pressure.
That is one of the reasons I wanted to see meminfo inside the guest if
possible. Setting swappiness to 0 inside the guest is one way of
avoiding double caching that might take place, but I've not found it
to be very effective.
Do we have reason to believe the problem can be solved entirely in the
host?
--
Three Cheers,
Balbir
^ permalink raw reply [flat|nested] 75+ messages in thread
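Wu's order-1-and-higher hypothesis can be eyeballed without patching the kernel via /proc/buddyinfo, where each column after the zone name counts free blocks of order 0, 1, 2, and so on; a rough sketch:

```shell
# Sum free blocks of order >= 1 per zone; a persistently small number here
# points at the high-order pressure that can evict pages regardless of LRU age.
summary=$(awk '{
  sub(/,$/, "", $2)                    # strip trailing comma from node id
  s = 0
  for (i = 6; i <= NF; i++) s += $i    # $5 = order-0 count; $6.. = order 1+
  printf "node %s zone %s: %d free blocks of order >= 1\n", $2, $4, s
}' /proc/buddyinfo 2>/dev/null || true)
msg="${summary:-/proc/buddyinfo not readable here}"
echo "$msg"
```

/proc/slabinfo, which Wu asks for, then shows which caches are driving those higher-order allocations.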
* Re: Over-eager swapping
2010-08-02 12:47 ` Chris Webb
@ 2010-08-18 16:45 ` Balbir Singh
-1 siblings, 0 replies; 75+ messages in thread
From: Balbir Singh @ 2010-08-18 16:45 UTC (permalink / raw)
To: Chris Webb; +Cc: linux-mm, linux-kernel
* Chris Webb <chris@arachsys.com> [2010-08-02 13:47:35]:
> We run a number of relatively large x86-64 hosts with twenty or so qemu-kvm
> virtual machines on each of them, and I'm having some trouble with over-eager
> swapping on some (but not all) of the machines. This is resulting in
> customer reports of very poor response latency from the virtual machines
> which have been swapped out, despite the hosts apparently having large
> amounts of free memory, and running fine if swap is turned off.
>
> All of the hosts are running a 2.6.32.7 kernel and have ksm enabled with
> 32GB of RAM and 2x quad-core processors. There is a cluster of Xeon E5420
> machines which apparently doesn't exhibit the problem, and a cluster of
> 2352/2378 Opteron (NUMA) machines, some of which do. The kernel config of
> the affected machines is at
>
> http://cdw.me.uk/tmp/config-2.6.32.7
>
> This differs very little from the config on the unaffected Xeon machines,
> essentially just
>
> -CONFIG_MCORE2=y
> +CONFIG_MK8=y
> -CONFIG_X86_P6_NOP=y
>
> On a typical affected machine, the virtual machines and other processes
> would apparently leave around 5.5GB of RAM available for buffers, but the
> system seems to want to swap out 3GB of anonymous pages to give itself more
> like 9GB of buffers:
>
> # cat /proc/meminfo
> MemTotal: 33083420 kB
> MemFree: 693164 kB
> Buffers: 8834380 kB
> Cached: 11212 kB
> SwapCached: 1443524 kB
> Active: 21656844 kB
> Inactive: 8119352 kB
> Active(anon): 17203092 kB
> Inactive(anon): 3729032 kB
> Active(file): 4453752 kB
> Inactive(file): 4390320 kB
> Unevictable: 5472 kB
> Mlocked: 5472 kB
> SwapTotal: 25165816 kB
> SwapFree: 21854572 kB
> Dirty: 4300 kB
> Writeback: 4 kB
> AnonPages: 20780368 kB
> Mapped: 6056 kB
> Shmem: 56 kB
> Slab: 961512 kB
> SReclaimable: 438276 kB
> SUnreclaim: 523236 kB
> KernelStack: 10152 kB
> PageTables: 67176 kB
> NFS_Unstable: 0 kB
> Bounce: 0 kB
> WritebackTmp: 0 kB
> CommitLimit: 41707524 kB
> Committed_AS: 39870868 kB
> VmallocTotal: 34359738367 kB
> VmallocUsed: 150880 kB
> VmallocChunk: 34342404996 kB
> HardwareCorrupted: 0 kB
> HugePages_Total: 0
> HugePages_Free: 0
> HugePages_Rsvd: 0
> HugePages_Surp: 0
> Hugepagesize: 2048 kB
> DirectMap4k: 5824 kB
> DirectMap2M: 3205120 kB
> DirectMap1G: 30408704 kB
>
> We see this despite the machine having vm.swappiness set to 0 in an attempt
> to skew the reclaim as far as possible in favour of releasing page cache
> instead of swapping anonymous pages.
>
> After running swapoff -a, the machine is immediately much healthier. Even
> while the swap is still being reduced, load goes down and response times in
> virtual machines are much improved. Once the swap is completely gone, there
> are still several gigabytes of RAM left free which are used for buffers, and
> the virtual machines are no longer laggy because they are no longer swapped
> out. Running swapon -a again, the affected machine waits for about a minute
> with zero swap in use, before the amount of swap in use very rapidly
> increases to around 2GB and then continues to increase more steadily to 3GB.
>
> We could run with these machines without swap (in the worst cases we're
> already doing so), but I'd prefer to have a reserve of swap available in
> case of genuine emergency. If it's a choice between swapping out a guest or
> oom-killing it, I'd prefer to swap... but I really don't want to swap out
> running virtual machines in order to have eight gigabytes of page cache
> instead of five!
>
> Is this a problem with the page reclaim priorities, or am I just tuning
> these hosts incorrectly? Is there more detailed info than /proc/meminfo
> available which might shed more light on what's going wrong here?
>
Can you give an idea of what the meminfo inside the guest looks like.
Have you looked at
http://kerneltrap.org/mailarchive/linux-kernel/2010/6/8/4580772
--
Three Cheers,
Balbir
^ permalink raw reply [flat|nested] 75+ messages in thread
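The "memory used" estimate from the top of the thread (MemTotal - MemFree - Buffers + SwapTotal - SwapFree) can be applied to the /proc/meminfo Chris posted; a sketch over those exact figures, not a general-purpose tool:

```shell
# Estimate used memory in kB from the quoted meminfo values.
used_kb=$(awk '
  /^MemTotal:/  { total = $2 }
  /^MemFree:/   { free  = $2 }
  /^Buffers:/   { buf   = $2 }
  /^SwapTotal:/ { st    = $2 }
  /^SwapFree:/  { sf    = $2 }
  END { print total - free - buf + st - sf }
' <<'EOF'
MemTotal:       33083420 kB
MemFree:          693164 kB
Buffers:         8834380 kB
SwapTotal:      25165816 kB
SwapFree:       21854572 kB
EOF
)
echo "estimated used: ${used_kb} kB"
```

On these figures the swap component alone (SwapTotal - SwapFree) is 3311244 kB, the roughly 3GB of swap-out discussed in the thread.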
* Re: Over-eager swapping
2010-08-18 16:13 ` Christoph Lameter
@ 2010-08-18 16:32 ` Chris Webb
-1 siblings, 0 replies; 75+ messages in thread
From: Chris Webb @ 2010-08-18 16:32 UTC (permalink / raw)
To: Christoph Lameter
Cc: Lee Schermerhorn, Wu Fengguang, Minchan Kim, linux-mm,
linux-kernel, KOSAKI Motohiro, Pekka Enberg, Andi Kleen
Christoph Lameter <cl@linux-foundation.org> writes:
> On Wed, 18 Aug 2010, Chris Webb wrote:
>
> > > != 0. And even then, zone reclaim should only reclaim file pages, not
> > > anon. In theory...
> >
> > Hi. This is zero on all our machines:
> >
> > # sysctl vm.zone_reclaim_mode
> > vm.zone_reclaim_mode = 0
>
> Set it to 1.
I'll try this tonight: setting this to one and readding swap on one of the
problem machines.
Best wishes,
Chris.
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Over-eager swapping
2010-08-18 16:13 ` Wu Fengguang
@ 2010-08-18 16:31 ` Chris Webb
-1 siblings, 0 replies; 75+ messages in thread
From: Chris Webb @ 2010-08-18 16:31 UTC (permalink / raw)
To: Wu Fengguang
Cc: Lee Schermerhorn, Minchan Kim, linux-mm, linux-kernel,
KOSAKI Motohiro, Pekka Enberg, Andi Kleen, Christoph Lameter
Wu Fengguang <fengguang.wu@intel.com> writes:
> Chris, can you post /proc/vmstat on the problem machines?
Here's /proc/vmstat from one of the bad machines with swap taken out:
# cat /proc/vmstat
nr_free_pages 115572
nr_inactive_anon 562140
nr_active_anon 5015609
nr_inactive_file 997097
nr_active_file 996989
nr_unevictable 1368
nr_mlock 1368
nr_anon_pages 5862299
nr_mapped 1414
nr_file_pages 1994569
nr_dirty 619
nr_writeback 0
nr_slab_reclaimable 88883
nr_slab_unreclaimable 129859
nr_page_table_pages 15744
nr_kernel_stack 1132
nr_unstable 0
nr_bounce 0
nr_vmscan_write 68708505
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 14
numa_hit 15295188815
numa_miss 9391232519
numa_foreign 9391232519
numa_interleave 16982
numa_local 15294742520
numa_other 9391678814
pgpgin 20644565778
pgpgout 28740368207
pswpin 63818244
pswpout 61199234
pgalloc_dma 0
pgalloc_dma32 4967135753
pgalloc_normal 19812671901
pgalloc_movable 0
pgfree 24779926775
pgactivate 1290396237
pgdeactivate 1289759899
pgfault 19993995783
pgmajfault 21059190
pgrefill_dma 0
pgrefill_dma32 133366009
pgrefill_normal 921184739
pgrefill_movable 0
pgsteal_dma 0
pgsteal_dma32 1275354745
pgsteal_normal 5641309780
pgsteal_movable 0
pgscan_kswapd_dma 0
pgscan_kswapd_dma32 1333139288
pgscan_kswapd_normal 5870516663
pgscan_kswapd_movable 0
pgscan_direct_dma 0
pgscan_direct_dma32 1064518
pgscan_direct_normal 13317302
pgscan_direct_movable 0
zone_reclaim_failed 0
pginodesteal 0
slabs_scanned 1682790400
kswapd_steal 6902288285
kswapd_inodesteal 4909342
pageoutrun 65408579
allocstall 33223
pgrotated 68402979
htlb_buddy_alloc_success 0
htlb_buddy_alloc_fail 0
unevictable_pgs_culled 3538872
unevictable_pgs_scanned 0
unevictable_pgs_rescued 4989403
unevictable_pgs_mlocked 5192009
unevictable_pgs_munlocked 4989074
unevictable_pgs_cleared 2295
unevictable_pgs_stranded 0
unevictable_pgs_mlockfreed 0
The not-so-bad machine that's 3GB into swap, which I mentioned previously, has:
# cat /proc/vmstat
nr_free_pages 898394
nr_inactive_anon 834445
nr_active_anon 4118034
nr_inactive_file 904411
nr_active_file 910902
nr_unevictable 2440
nr_mlock 2440
nr_anon_pages 4836349
nr_mapped 1553
nr_file_pages 2243152
nr_dirty 1097
nr_writeback 0
nr_slab_reclaimable 88788
nr_slab_unreclaimable 127310
nr_page_table_pages 14762
nr_kernel_stack 532
nr_unstable 0
nr_bounce 0
nr_vmscan_write 37404214
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 12
numa_hit 14220178949
numa_miss 3903552922
numa_foreign 3903552922
numa_interleave 16282
numa_local 14219905325
numa_other 3903826546
pgpgin 6500403846
pgpgout 13255814979
pswpin 36384510
pswpout 36380545
pgalloc_dma 4
pgalloc_dma32 2019546454
pgalloc_normal 16466621455
pgalloc_movable 0
pgfree 18487068066
pgactivate 530670561
pgdeactivate 506674301
pgfault 19986735100
pgmajfault 10611234
pgrefill_dma 0
pgrefill_dma32 41306492
pgrefill_normal 318767138
pgrefill_movable 0
pgsteal_dma 0
pgsteal_dma32 214447663
pgsteal_normal 1645250232
pgsteal_movable 0
pgscan_kswapd_dma 0
pgscan_kswapd_dma32 218030201
pgscan_kswapd_normal 1812499810
pgscan_kswapd_movable 0
pgscan_direct_dma 0
pgscan_direct_dma32 157144
pgscan_direct_normal 1095919
pgscan_direct_movable 0
zone_reclaim_failed 0
pginodesteal 0
slabs_scanned 50051072
kswapd_steal 1858447127
kswapd_inodesteal 202297
pageoutrun 15070446
allocstall 3104
pgrotated 37181651
htlb_buddy_alloc_success 0
htlb_buddy_alloc_fail 0
unevictable_pgs_culled 2113384
unevictable_pgs_scanned 0
unevictable_pgs_rescued 3055005
unevictable_pgs_mlocked 3184675
unevictable_pgs_munlocked 3045129
unevictable_pgs_cleared 10034
unevictable_pgs_stranded 0
unevictable_pgs_mlockfreed 0
Best wishes,
Chris.
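The numa_miss counters dominate both dumps. As a sanity check, the off-node allocation fraction they imply can be computed directly; the sketch below is mine, not from the thread, and assumes the standard two-column /proc/vmstat format:

```python
def numa_miss_ratio(vmstat_text):
    """Fraction of page allocations that fell off-node, from the
    numa_hit/numa_miss counters in /proc/vmstat."""
    stats = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    hit, miss = stats["numa_hit"], stats["numa_miss"]
    return miss / (hit + miss)

# Counters quoted above for the "bad" machine:
bad = "numa_hit 15295188815\nnuma_miss 9391232519"
print(f"{numa_miss_ratio(bad):.0%} of allocations were off-node")  # 38%
```

By this measure roughly 38% of allocations landed off-node on the bad machine versus about 22% on the not-so-bad one, consistent with the NUMA-imbalance theory raised elsewhere in the thread.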
* Re: Over-eager swapping
2010-08-18 15:57 ` Christoph Lameter
@ 2010-08-18 16:20 ` Wu Fengguang
0 siblings, 0 replies; 75+ messages in thread
From: Wu Fengguang @ 2010-08-18 16:20 UTC (permalink / raw)
To: Christoph Lameter
Cc: Chris Webb, Minchan Kim, linux-mm, linux-kernel, KOSAKI Motohiro,
Pekka Enberg, Andi Kleen, Lee Schermerhorn
On Wed, Aug 18, 2010 at 11:57:09PM +0800, Christoph Lameter wrote:
> On Wed, 18 Aug 2010, Wu Fengguang wrote:
>
> > Andi, Christoph and Lee:
> >
> > This looks like an "unbalanced NUMA memory usage leading to premature
> > swapping" problem.
>
> Is zone reclaim active? It may not activate on smaller systems, leading
> to unbalanced memory usage between nodes.
Another possibility is that there are many low-watermark page allocations,
triggering kswapd page-out activity.
Thanks,
Fengguang
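The low-watermark theory is checkable from /proc/zoneinfo. The sketch below is illustrative only (it assumes the standard "pages free/min/low/high" layout) and lists zones whose free pages have fallen below the low watermark, the point at which kswapd is woken:

```python
def zones_below_low(zoneinfo_text):
    """Return zones from /proc/zoneinfo whose free page count is below
    the 'low' watermark, i.e. where kswapd would start reclaiming."""
    zones, cur = [], None
    for raw in zoneinfo_text.splitlines():
        line = raw.strip()
        if line.startswith("Node") and "zone" in line:
            cur = {"name": " ".join(line.split())}
            zones.append(cur)
        elif cur is not None and line.startswith("pages free"):
            cur["free"] = int(line.split()[-1])
        elif cur is not None and line.split()[:1] in (["min"], ["low"], ["high"]):
            key, value = line.split()
            cur[key] = int(value)
    return [z["name"] for z in zones if z.get("free", 0) < z.get("low", 0)]

sample = """\
Node 0, zone   Normal
  pages free     900
        min      1024
        low      1280
        high     1536
Node 1, zone   Normal
  pages free     50000
        min      1024
        low      1280
        high     1536
"""
print(zones_below_low(sample))  # ['Node 0, zone Normal']
```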
* Re: Over-eager swapping
2010-08-18 15:58 ` Chris Webb
@ 2010-08-18 16:13 ` Wu Fengguang
0 siblings, 0 replies; 75+ messages in thread
From: Wu Fengguang @ 2010-08-18 16:13 UTC (permalink / raw)
To: Chris Webb
Cc: Lee Schermerhorn, Minchan Kim, linux-mm, linux-kernel,
KOSAKI Motohiro, Pekka Enberg, Andi Kleen, Christoph Lameter
On Wed, Aug 18, 2010 at 11:58:25PM +0800, Chris Webb wrote:
> Lee Schermerhorn <Lee.Schermerhorn@hp.com> writes:
>
> > On Wed, 2010-08-18 at 23:21 +0800, Wu Fengguang wrote:
> > > Andi, Christoph and Lee:
> > >
> > > This looks like an "unbalanced NUMA memory usage leading to premature
> > > swapping" problem.
> >
> > What is the value of the vm.zone_reclaim_mode sysctl? If it is !0, the
> > system will go into zone reclaim before allocating off-node pages.
> > However, it shouldn't "swap" in this case unless (zone_reclaim_mode & 4)
> > != 0. And even then, zone reclaim should only reclaim file pages, not
> > anon. In theory...
>
> Hi. This is zero on all our machines:
>
> # sysctl vm.zone_reclaim_mode
> vm.zone_reclaim_mode = 0
Chris, can you post /proc/vmstat on the problem machines?
Thanks,
Fengguang
* Re: Over-eager swapping
2010-08-18 15:58 ` Chris Webb
@ 2010-08-18 16:13 ` Christoph Lameter
0 siblings, 0 replies; 75+ messages in thread
From: Christoph Lameter @ 2010-08-18 16:13 UTC (permalink / raw)
To: Chris Webb
Cc: Lee Schermerhorn, Wu Fengguang, Minchan Kim, linux-mm,
linux-kernel, KOSAKI Motohiro, Pekka Enberg, Andi Kleen
On Wed, 18 Aug 2010, Chris Webb wrote:
> > != 0. And even then, zone reclaim should only reclaim file pages, not
> > anon. In theory...
>
> Hi. This is zero on all our machines:
>
> # sysctl vm.zone_reclaim_mode
> vm.zone_reclaim_mode = 0
Set it to 1.
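For anyone following along: zone_reclaim_mode is an ordinary sysctl, so the change can be made at runtime and persisted. These are the standard sysctl idioms (run as root); bit 1 enables zone reclaim itself:

```shell
# Enable zone reclaim: prefer reclaiming clean file pages on the local
# node over allocating from a remote node (bit 1 of zone_reclaim_mode).
sysctl -w vm.zone_reclaim_mode=1

# Persist the setting across reboots.
echo 'vm.zone_reclaim_mode = 1' >> /etc/sysctl.conf

# Confirm the current value.
sysctl vm.zone_reclaim_mode
```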
* Re: Over-eager swapping
2010-08-18 15:57 ` Lee Schermerhorn
@ 2010-08-18 15:58 ` Chris Webb
0 siblings, 0 replies; 75+ messages in thread
From: Chris Webb @ 2010-08-18 15:58 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: Wu Fengguang, Minchan Kim, linux-mm, linux-kernel,
KOSAKI Motohiro, Pekka Enberg, Andi Kleen, Christoph Lameter
Lee Schermerhorn <Lee.Schermerhorn@hp.com> writes:
> On Wed, 2010-08-18 at 23:21 +0800, Wu Fengguang wrote:
> > Andi, Christoph and Lee:
> >
> > This looks like an "unbalanced NUMA memory usage leading to premature
> > swapping" problem.
>
> What is the value of the vm.zone_reclaim_mode sysctl? If it is !0, the
> system will go into zone reclaim before allocating off-node pages.
> However, it shouldn't "swap" in this case unless (zone_reclaim_mode & 4)
> != 0. And even then, zone reclaim should only reclaim file pages, not
> anon. In theory...
Hi. This is zero on all our machines:
# sysctl vm.zone_reclaim_mode
vm.zone_reclaim_mode = 0
Cheers,
Chris.
* Re: Over-eager swapping
2010-08-18 15:21 ` Wu Fengguang
@ 2010-08-18 15:57 ` Lee Schermerhorn
0 siblings, 0 replies; 75+ messages in thread
From: Lee Schermerhorn @ 2010-08-18 15:57 UTC (permalink / raw)
To: Wu Fengguang
Cc: Chris Webb, Minchan Kim, linux-mm, linux-kernel, KOSAKI Motohiro,
Pekka Enberg, Andi Kleen, Christoph Lameter
On Wed, 2010-08-18 at 23:21 +0800, Wu Fengguang wrote:
> Andi, Christoph and Lee:
>
> This looks like an "unbalanced NUMA memory usage leading to premature
> swapping" problem.
What is the value of the vm.zone_reclaim_mode sysctl? If it is !0, the
system will go into zone reclaim before allocating off-node pages.
However, it shouldn't "swap" in this case unless (zone_reclaim_mode & 4)
!= 0. And even then, zone reclaim should only reclaim file pages, not
anon. In theory...
Note: zone_reclaim_mode will be enabled by default [= 1] if the SLIT
contains any distances > 2.0 [20]. Check SLIT values via 'numactl
--hardware'.
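That rule of thumb can be checked mechanically. A sketch (mine, assuming the usual `numactl --hardware` distance-table format) that finds the largest inter-node SLIT distance; values above the kernel's RECLAIM_DISTANCE of 20 make zone_reclaim_mode default on:

```python
def max_remote_distance(numactl_output):
    """Largest off-diagonal entry in the 'node distances' table
    printed by `numactl --hardware`."""
    lines = numactl_output.splitlines()
    start = next(i for i, l in enumerate(lines)
                 if l.startswith("node distances"))
    best = 0
    for line in lines[start + 2:]:          # skip the column-header row
        cells = line.replace(":", " ").split()
        if not cells:
            break
        node = int(cells[0])                # row label: the node number
        for col, dist in enumerate(cells[1:]):
            if col != node:                 # skip the local (diagonal) entry
                best = max(best, int(dist))
    return best

sample = """\
node distances:
node   0   1
  0:  10  21
  1:  21  10
"""
print(max_remote_distance(sample) > 20)  # True: zone reclaim defaults on
```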
Lee
>
> Thanks,
> Fengguang
>
> On Wed, Aug 18, 2010 at 10:46:59PM +0800, Chris Webb wrote:
> > Wu Fengguang <fengguang.wu@intel.com> writes:
> >
> > > Did you enable any NUMA policy? That could start swapping even if
> > > there are lots of free pages in some nodes.
> >
> > Hi. Thanks for the follow-up. We haven't done any configuration or tuning of
> > NUMA behaviour, but NUMA support is definitely compiled into the kernel:
> >
> > # zgrep NUMA /proc/config.gz
> > CONFIG_NUMA_IRQ_DESC=y
> > CONFIG_NUMA=y
> > CONFIG_K8_NUMA=y
> > CONFIG_X86_64_ACPI_NUMA=y
> > # CONFIG_NUMA_EMU is not set
> > CONFIG_ACPI_NUMA=y
> > # grep -i numa /var/log/dmesg.boot
> > NUMA: Allocated memnodemap from b000 - 1b540
> > NUMA: Using 20 for the hash shift.
> >
> > > Are your free pages equally distributed over the nodes? Or limited to
> > > some of the nodes? Try this command:
> > >
> > > grep MemFree /sys/devices/system/node/node*/meminfo
> >
> > My worst-case machines currently have swap completely turned off to make them
> > usable for clients, but I have one machine which is about 3GB into swap with
> > 8GB of buffers and 3GB free. This shows
> >
> > # grep MemFree /sys/devices/system/node/node*/meminfo
> > /sys/devices/system/node/node0/meminfo:Node 0 MemFree: 954500 kB
> > /sys/devices/system/node/node1/meminfo:Node 1 MemFree: 2374528 kB
> >
> > I could definitely imagine that one of the nodes could have dipped down to
> > zero in the past. I'll try enabling swap on one of our machines with the bad
> > problem late tonight and repeat the experiment. The node meminfo on this box
> > currently looks like
> >
> > # grep MemFree /sys/devices/system/node/node*/meminfo
> > /sys/devices/system/node/node0/meminfo:Node 0 MemFree: 82732 kB
> > /sys/devices/system/node/node1/meminfo:Node 1 MemFree: 1723896 kB
> >
> > Best wishes,
> >
> > Chris.
* Re: Over-eager swapping
2010-08-18 15:21 ` Wu Fengguang
@ 2010-08-18 15:57 ` Christoph Lameter
0 siblings, 0 replies; 75+ messages in thread
From: Christoph Lameter @ 2010-08-18 15:57 UTC (permalink / raw)
To: Wu Fengguang
Cc: Chris Webb, Minchan Kim, linux-mm, linux-kernel, KOSAKI Motohiro,
Pekka Enberg, Andi Kleen, Lee Schermerhorn
On Wed, 18 Aug 2010, Wu Fengguang wrote:
> Andi, Christoph and Lee:
>
> This looks like an "unbalanced NUMA memory usage leading to premature
> swapping" problem.
Is zone reclaim active? It may not activate on smaller systems, leading
to unbalanced memory usage between nodes.
* Re: Over-eager swapping
2010-08-18 14:46 ` Chris Webb
@ 2010-08-18 15:21 ` Wu Fengguang
0 siblings, 0 replies; 75+ messages in thread
From: Wu Fengguang @ 2010-08-18 15:21 UTC (permalink / raw)
To: Chris Webb
Cc: Minchan Kim, linux-mm, linux-kernel, KOSAKI Motohiro,
Pekka Enberg, Andi Kleen, Lee Schermerhorn, Christoph Lameter
Andi, Christoph and Lee:
This looks like an "unbalanced NUMA memory usage leading to premature
swapping" problem.
Thanks,
Fengguang
On Wed, Aug 18, 2010 at 10:46:59PM +0800, Chris Webb wrote:
> Wu Fengguang <fengguang.wu@intel.com> writes:
>
> > Did you enable any NUMA policy? That could start swapping even if
> > there are lots of free pages in some nodes.
>
> Hi. Thanks for the follow-up. We haven't done any configuration or tuning of
> NUMA behaviour, but NUMA support is definitely compiled into the kernel:
>
> # zgrep NUMA /proc/config.gz
> CONFIG_NUMA_IRQ_DESC=y
> CONFIG_NUMA=y
> CONFIG_K8_NUMA=y
> CONFIG_X86_64_ACPI_NUMA=y
> # CONFIG_NUMA_EMU is not set
> CONFIG_ACPI_NUMA=y
> # grep -i numa /var/log/dmesg.boot
> NUMA: Allocated memnodemap from b000 - 1b540
> NUMA: Using 20 for the hash shift.
>
> > Are your free pages equally distributed over the nodes? Or limited to
> > some of the nodes? Try this command:
> >
> > grep MemFree /sys/devices/system/node/node*/meminfo
>
> My worst-case machines currently have swap completely turned off to make them
> usable for clients, but I have one machine which is about 3GB into swap with
> 8GB of buffers and 3GB free. This shows
>
> # grep MemFree /sys/devices/system/node/node*/meminfo
> /sys/devices/system/node/node0/meminfo:Node 0 MemFree: 954500 kB
> /sys/devices/system/node/node1/meminfo:Node 1 MemFree: 2374528 kB
>
> I could definitely imagine that one of the nodes could have dipped down to
> zero in the past. I'll try enabling swap on one of our machines with the bad
> problem late tonight and repeat the experiment. The node meminfo on this box
> currently looks like
>
> # grep MemFree /sys/devices/system/node/node*/meminfo
> /sys/devices/system/node/node0/meminfo:Node 0 MemFree: 82732 kB
> /sys/devices/system/node/node1/meminfo:Node 1 MemFree: 1723896 kB
>
> Best wishes,
>
> Chris.
* Re: Over-eager swapping
2010-08-18 14:38 ` Wu Fengguang
@ 2010-08-18 14:46 ` Chris Webb
0 siblings, 0 replies; 75+ messages in thread
From: Chris Webb @ 2010-08-18 14:46 UTC (permalink / raw)
To: Wu Fengguang
Cc: Minchan Kim, linux-mm, linux-kernel, KOSAKI Motohiro, Pekka Enberg
Wu Fengguang <fengguang.wu@intel.com> writes:
> Did you enable any NUMA policy? That could start swapping even if
> there are lots of free pages in some nodes.
Hi. Thanks for the follow-up. We haven't done any configuration or tuning of
NUMA behaviour, but NUMA support is definitely compiled into the kernel:
# zgrep NUMA /proc/config.gz
CONFIG_NUMA_IRQ_DESC=y
CONFIG_NUMA=y
CONFIG_K8_NUMA=y
CONFIG_X86_64_ACPI_NUMA=y
# CONFIG_NUMA_EMU is not set
CONFIG_ACPI_NUMA=y
# grep -i numa /var/log/dmesg.boot
NUMA: Allocated memnodemap from b000 - 1b540
NUMA: Using 20 for the hash shift.
> Are your free pages equally distributed over the nodes? Or limited to
> some of the nodes? Try this command:
>
> grep MemFree /sys/devices/system/node/node*/meminfo
My worst-case machines currently have swap completely turned off to make them
usable for clients, but I have one machine which is about 3GB into swap with
8GB of buffers and 3GB free. This shows
# grep MemFree /sys/devices/system/node/node*/meminfo
/sys/devices/system/node/node0/meminfo:Node 0 MemFree: 954500 kB
/sys/devices/system/node/node1/meminfo:Node 1 MemFree: 2374528 kB
I could definitely imagine that one of the nodes could have dipped down to
zero in the past. I'll try enabling swap on one of our machines with the bad
problem late tonight and repeat the experiment. The node meminfo on this box
currently looks like
# grep MemFree /sys/devices/system/node/node*/meminfo
/sys/devices/system/node/node0/meminfo:Node 0 MemFree: 82732 kB
/sys/devices/system/node/node1/meminfo:Node 1 MemFree: 1723896 kB
Best wishes,
Chris.
* Re: Over-eager swapping
2010-08-04 12:04 ` Chris Webb
@ 2010-08-18 14:38 ` Wu Fengguang
0 siblings, 0 replies; 75+ messages in thread
From: Wu Fengguang @ 2010-08-18 14:38 UTC (permalink / raw)
To: Chris Webb
Cc: Minchan Kim, linux-mm, linux-kernel, KOSAKI Motohiro, Pekka Enberg
Chris,
Did you enable any NUMA policy? That could start swapping even if
there are lots of free pages in some nodes.
Are your free pages equally distributed over the nodes? Or limited to
some of the nodes? Try this command:
grep MemFree /sys/devices/system/node/node*/meminfo
Thanks,
Fengguang
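One kernel setting that can produce exactly this behaviour is zone reclaim,
which makes the kernel reclaim (and potentially swap) on the local node rather
than allocate from a remote one. A sketch for checking it alongside the grep
above; the /proc path is standard on NUMA kernels, but the interpretation
strings are mine:

```shell
# Interpret the value of /proc/sys/vm/zone_reclaim_mode. When non-zero,
# the kernel prefers reclaiming on the local node over falling back to
# free pages on a remote node.
explain_zone_reclaim() {
    case "$1" in
        0) echo "off: allocations fall back to other nodes before reclaiming" ;;
        *) echo "on (mode $1): local reclaim/swap is tried before remote nodes" ;;
    esac
}

# Live usage:
#   explain_zone_reclaim "$(cat /proc/sys/vm/zone_reclaim_mode)"
#   numactl --show    # shows any explicit NUMA policy for this shell
```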
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Over-eager swapping
2010-08-04 11:49 ` Wu Fengguang
@ 2010-08-04 12:04 ` Chris Webb
0 siblings, 0 replies; 75+ messages in thread
From: Chris Webb @ 2010-08-04 12:04 UTC (permalink / raw)
To: Wu Fengguang
Cc: Minchan Kim, linux-mm, linux-kernel, KOSAKI Motohiro, Pekka Enberg
Wu Fengguang <fengguang.wu@intel.com> writes:
> Maybe turn off KSM? It helps to isolate problems. It's a relatively new
> and complex feature after all.
Good idea! I'll give that a go on one of the machines without swap at the
moment, re-add the swap with ksm turned off, and see what happens.
> > However, your suggestion is right that the CPU loads on these machines are
> > typically quite high. The large number of kvm virtual machines they run mean
> > that loads of eight or even sixteen in /proc/loadavg are not unusual, and
> > these are higher when there's swap than after it has been removed. I assume
> > this is mostly because of increased IO wait, as this number increases
> > significantly in top.
>
> iowait = CPU (idle) waiting for disk IO
>
> So iowait is not CPU load, but rather disk load :)
Sorry, yes, I wrote very unclearly here. What I should have written is that
the load numbers are fairly high even without swap, when the IO wait figure
is pretty small. This is presumably normal CPU load from the guests.
The load average rises significantly when swap is added, but I think that
rise is due to an increase in processes waiting for IO (io wait %age
increases considerably) rather than extra CPU work. Presumably this is the
IO from swapping.
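The "IO from swapping" part is easy to confirm directly from /proc/vmstat;
a sketch (pswpin/pswpout are real vmstat counters, counted in pages; the
4 kB page size is an assumption that holds on x86-64):

```shell
# Swap-in/out bandwidth from two samples of the pswpin/pswpout counters
# in /proc/vmstat (counters are in pages; 4 kB pages assumed).
swap_rate_kb() {  # args: pswpin0 pswpout0 pswpin1 pswpout1 interval_s
    awk -v i0="$1" -v o0="$2" -v i1="$3" -v o1="$4" -v dt="$5" \
        'BEGIN { printf "in %d kB/s, out %d kB/s\n",
                 (i1 - i0) * 4 / dt, (o1 - o0) * 4 / dt }'
}

# Live usage, over a 10 second interval:
#   s0=$(awk '/^pswp/ { print $2 }' /proc/vmstat); sleep 10
#   s1=$(awk '/^pswp/ { print $2 }' /proc/vmstat)
#   swap_rate_kb $s0 $s1 10
```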
Cheers,
Chris.
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Over-eager swapping
2010-08-04 9:58 ` Chris Webb
@ 2010-08-04 11:49 ` Wu Fengguang
0 siblings, 0 replies; 75+ messages in thread
From: Wu Fengguang @ 2010-08-04 11:49 UTC (permalink / raw)
To: Chris Webb
Cc: Minchan Kim, linux-mm, linux-kernel, KOSAKI Motohiro, Pekka Enberg
On Wed, Aug 04, 2010 at 05:58:12PM +0800, Chris Webb wrote:
> Wu Fengguang <fengguang.wu@intel.com> writes:
>
> > This is interesting. Why is it waiting for 1m here? Are there high CPU
> > loads? Would you do a
> >
> > echo t > /proc/sysrq-trigger
> >
> > and show us the dmesg?
>
> Annoyingly, magic-sysrq isn't compiled in on these kernels. Is there another
> way I can get this info for you? Replacing the kernels on the machines is a
> painful job as I have to give the clients running on them quite a bit of
> notice of the reboot, and I haven't been able to reproduce the problem on a
> test machine.
Maybe turn off KSM? It helps to isolate problems. It's a relatively new
and complex feature after all.
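For the record, KSM can be stopped at runtime through sysfs without a reboot
(these knobs exist on KSM-enabled kernels). Note that mode 2 also unmerges
already-shared pages, which will push memory use back up, so it's worth
watching MemFree while doing it:

```shell
# Before: see how much KSM is actually sharing (counts are in pages)
grep . /sys/kernel/mm/ksm/pages_shared /sys/kernel/mm/ksm/pages_sharing

# 0 = just stop the ksmd scanner; 2 = stop it and unmerge everything
echo 2 > /sys/kernel/mm/ksm/run
```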
> I also think the swap use is much better following a reboot, and only starts
> to spiral out of control after the machines have been running for a week or
> so.
Something deteriorates over a long time.. It may take time to catch this bug..
> However, your suggestion is right that the CPU loads on these machines are
> typically quite high. The large number of kvm virtual machines they run mean
> that loads of eight or even sixteen in /proc/loadavg are not unusual, and
> these are higher when there's swap than after it has been removed. I assume
> this is mostly because of increased IO wait, as this number increases
> significantly in top.
iowait = CPU (idle) waiting for disk IO
So iowait is not CPU load, but rather disk load :)
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Over-eager swapping
2010-08-04 3:24 ` Wu Fengguang
@ 2010-08-04 9:58 ` Chris Webb
0 siblings, 0 replies; 75+ messages in thread
From: Chris Webb @ 2010-08-04 9:58 UTC (permalink / raw)
To: Wu Fengguang
Cc: Minchan Kim, linux-mm, linux-kernel, KOSAKI Motohiro, Pekka Enberg
Wu Fengguang <fengguang.wu@intel.com> writes:
> This is interesting. Why is it waiting for 1m here? Are there high CPU
> loads? Would you do a
>
> echo t > /proc/sysrq-trigger
>
> and show us the dmesg?
Annoyingly, magic-sysrq isn't compiled in on these kernels. Is there another
way I can get this info for you? Replacing the kernels on the machines is a
painful job as I have to give the clients running on them quite a bit of
notice of the reboot, and I haven't been able to reproduce the problem on a
test machine.
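One possible substitute when magic-sysrq isn't built in, assuming
/proc/<pid>/stack is available (CONFIG_STACKTRACE; the dstate helper name is
mine): dump kernel stacks of just the uninterruptible tasks, which is usually
the interesting part of what "echo t" would have shown.

```shell
# Filter "pid stat" pairs down to D-state (uninterruptible sleep) pids.
dstate() { awk '$2 ~ /^D/ { print $1 }'; }

# Live usage (needs root; /proc/<pid>/stack availability is an assumption):
#   for p in $(ps -eo pid=,stat= | dstate); do
#       echo "== pid $p =="
#       cat /proc/$p/stack
#   done
```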
I also think the swap use is much better following a reboot, and only starts
to spiral out of control after the machines have been running for a week or
so.
However, your suggestion is right that the CPU loads on these machines are
typically quite high. The large number of kvm virtual machines they run mean
that loads of eight or even sixteen in /proc/loadavg are not unusual, and
these are higher when there's swap than after it has been removed. I assume
this is mostly because of increased IO wait, as this number increases
significantly in top.
Cheers,
Chris.
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Over-eager swapping
2010-08-04 3:10 ` Minchan Kim
@ 2010-08-04 3:24 ` Wu Fengguang
0 siblings, 0 replies; 75+ messages in thread
From: Wu Fengguang @ 2010-08-04 3:24 UTC (permalink / raw)
To: Minchan Kim
Cc: Chris Webb, linux-mm, linux-kernel, KOSAKI Motohiro, Pekka Enberg
On Wed, Aug 04, 2010 at 11:10:46AM +0800, Minchan Kim wrote:
> On Wed, Aug 4, 2010 at 11:21 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> > Chris,
> >
> > Your slabinfo does contain many order 1-3 slab caches, this is a major source
> > of high order allocations and hence lumpy reclaim. fork() is another.
> >
> > In another thread, Pekka Enberg offers a tip:
> >
> > You can pass "slub_debug=O" as a kernel parameter to disable higher
> > order allocations if you want to test things.
> >
> > Note that the parameter works on a CONFIG_SLUB_DEBUG=y kernel.
> >
> > Thanks,
> > Fengguang
>
> He said the following:
> "After running swapoff -a, the machine is immediately much healthier. Even
> while the swap is still being reduced, load goes down and response times in
> virtual machines are much improved. Once the swap is completely gone, there
> are still several gigabytes of RAM left free which are used for buffers, and
> the virtual machines are no longer laggy because they are no longer swapped
> out.
>
> Running swapon -a again, the affected machine waits for about a minute
> with zero swap in use,
This is interesting. Why is it waiting for 1m here? Are there high CPU
loads? Would you do a
echo t > /proc/sysrq-trigger
and show us the dmesg?
Thanks,
Fengguang
> before the amount of swap in use very rapidly
> increases to around 2GB and then continues to increase more steadily to 3GB."
>
> 1. His system works well without swap.
> 2. His system increase swap by 2G rapidly and more steadily to 3GB.
>
> So I thought it isn't likely to be related to normal lumpy reclaim.
>
> Of course, without swap, lumpy reclaim can scan more file pages to make
> contiguous page frames, so it could still work well. But I can't
> understand 2.
>
> Hmm, I have no idea. :(
>
> Off-Topic:
>
> Hi, Pekka.
>
> Document says.
> "Debugging options may require the minimum possible slab order to increase as
> a result of storing the metadata (for example, caches with PAGE_SIZE object
> sizes). This has a higher likelihood of resulting in slab allocation errors
> in low memory situations or if there's high fragmentation of memory. To
> switch off debugging for such caches by default, use
>
> slub_debug=O"
>
> But when I tested it on my machine (2.6.34) with slub_debug=O, it
> increased objsize and pagesperslab. It even increased the number of
> slabs (but I am not sure about this part, since the samples might not
> be from the same point after booting).
> What am I missing now?
>
> But SLAB seems to consume smaller pages than SLUB. Hmm.
> Is SLAB more suitable than SLUB in small-memory systems (e.g. embedded)?
>
>
> --
> Kind regards,
> Minchan Kim
>
> slabinfo - version: 2.1
> # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> kvm_vcpu 0 0 9200 3 8 : tunables 0 0 0 : slabdata 0 0 0
> kmalloc_dma-512 16 16 512 16 2 : tunables 0 0 0 : slabdata 1 1 0
> RAWv6 17 17 960 17 4 : tunables 0 0 0 : slabdata 1 1 0
> UDPLITEv6 0 0 960 17 4 : tunables 0 0 0 : slabdata 0 0 0
> UDPv6 51 51 960 17 4 : tunables 0 0 0 : slabdata 3 3 0
> TCPv6 72 72 1728 18 8 : tunables 0 0 0 : slabdata 4 4 0
> nf_conntrack_c10a8540 0 0 280 29 2 : tunables 0 0 0 : slabdata 0 0 0
> dm_raid1_read_record 0 0 1056 31 8 : tunables 0 0 0 : slabdata 0 0 0
> dm_uevent 0 0 2464 13 8 : tunables 0 0 0 : slabdata 0 0 0
> mqueue_inode_cache 18 18 896 18 4 : tunables 0 0 0 : slabdata 1 1 0
> fuse_request 18 18 432 18 2 : tunables 0 0 0 : slabdata 1 1 0
> fuse_inode 21 21 768 21 4 : tunables 0 0 0 : slabdata 1 1 0
> nfsd4_stateowners 0 0 344 23 2 : tunables 0 0 0 : slabdata 0 0 0
> nfs_read_data 72 72 448 18 2 : tunables 0 0 0 : slabdata 4 4 0
> nfs_inode_cache 0 0 1040 31 8 : tunables 0 0 0 : slabdata 0 0 0
> ecryptfs_inode_cache 0 0 1280 25 8 : tunables 0 0 0 : slabdata 0 0 0
> hugetlbfs_inode_cache 24 24 656 24 4 : tunables 0 0 0 : slabdata 1 1 0
> ext4_inode_cache 0 0 1128 29 8 : tunables 0 0 0 : slabdata 0 0 0
> ext2_inode_cache 0 0 944 17 4 : tunables 0 0 0 : slabdata 0 0 0
> ext3_inode_cache 5032 5032 928 17 4 : tunables 0 0 0 : slabdata 296 296 0
> rpc_inode_cache 18 18 896 18 4 : tunables 0 0 0 : slabdata 1 1 0
> UNIX 532 532 832 19 4 : tunables 0 0 0 : slabdata 28 28 0
> UDP-Lite 0 0 832 19 4 : tunables 0 0 0 : slabdata 0 0 0
> UDP 76 76 832 19 4 : tunables 0 0 0 : slabdata 4 4 0
> TCP 60 60 1600 20 8 : tunables 0 0 0 : slabdata 3 3 0
> sgpool-128 48 48 2560 12 8 : tunables 0 0 0 : slabdata 4 4 0
> sgpool-64 100 100 1280 25 8 : tunables 0 0 0 : slabdata 4 4 0
> blkdev_queue 76 76 1688 19 8 : tunables 0 0 0 : slabdata 4 4 0
> biovec-256 10 10 3072 10 8 : tunables 0 0 0 : slabdata 1 1 0
> biovec-128 21 21 1536 21 8 : tunables 0 0 0 : slabdata 1 1 0
> biovec-64 84 84 768 21 4 : tunables 0 0 0 : slabdata 4 4 0
> bip-256 10 10 3200 10 8 : tunables 0 0 0 : slabdata 1 1 0
> bip-128 0 0 1664 19 8 : tunables 0 0 0 : slabdata 0 0 0
> bip-64 0 0 896 18 4 : tunables 0 0 0 : slabdata 0 0 0
> bip-16 100 100 320 25 2 : tunables 0 0 0 : slabdata 4 4 0
> sock_inode_cache 609 609 768 21 4 : tunables 0 0 0 : slabdata 29 29 0
> skbuff_fclone_cache 84 84 384 21 2 : tunables 0 0 0 : slabdata 4 4 0
> shmem_inode_cache 1835 1840 784 20 4 : tunables 0 0 0 : slabdata 92 92 0
> taskstats 96 96 328 24 2 : tunables 0 0 0 : slabdata 4 4 0
> proc_inode_cache 1584 1584 680 24 4 : tunables 0 0 0 : slabdata 66 66 0
> bdev_cache 72 72 896 18 4 : tunables 0 0 0 : slabdata 4 4 0
> inode_cache 7126 7128 656 24 4 : tunables 0 0 0 : slabdata 297 297 0
> signal_cache 332 350 640 25 4 : tunables 0 0 0 : slabdata 14 14 0
> sighand_cache 246 253 1408 23 8 : tunables 0 0 0 : slabdata 11 11 0
> task_xstate 193 196 576 28 4 : tunables 0 0 0 : slabdata 7 7 0
> task_struct 274 285 5472 5 8 : tunables 0 0 0 : slabdata 57 57 0
> radix_tree_node 3208 3213 296 27 2 : tunables 0 0 0 : slabdata 119 119 0
> kmalloc-8192 20 20 8192 4 8 : tunables 0 0 0 : slabdata 5 5 0
> kmalloc-4096 78 80 4096 8 8 : tunables 0 0 0 : slabdata 10 10 0
> kmalloc-2048 400 400 2048 16 8 : tunables 0 0 0 : slabdata 25 25 0
> kmalloc-1024 326 336 1024 16 4 : tunables 0 0 0 : slabdata 21 21 0
> kmalloc-512 758 784 512 16 2 : tunables 0 0 0 : slabdata 49 49 0
> slabinfo - version: 2.1
> # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> kvm_vcpu 0 0 9248 3 8 : tunables 0 0 0 : slabdata 0 0 0
> kmalloc_dma-512 29 29 560 29 4 : tunables 0 0 0 : slabdata 1 1 0
> clip_arp_cache 0 0 320 25 2 : tunables 0 0 0 : slabdata 0 0 0
> ip6_dst_cache 25 25 320 25 2 : tunables 0 0 0 : slabdata 1 1 0
> ndisc_cache 25 25 320 25 2 : tunables 0 0 0 : slabdata 1 1 0
> RAWv6 16 16 1024 16 4 : tunables 0 0 0 : slabdata 1 1 0
> UDPLITEv6 0 0 960 17 4 : tunables 0 0 0 : slabdata 0 0 0
> UDPv6 68 68 960 17 4 : tunables 0 0 0 : slabdata 4 4 0
> tw_sock_TCPv6 0 0 320 25 2 : tunables 0 0 0 : slabdata 0 0 0
> TCPv6 36 36 1792 18 8 : tunables 0 0 0 : slabdata 2 2 0
> nf_conntrack_c10a8540 0 0 320 25 2 : tunables 0 0 0 : slabdata 0 0 0
> dm_raid1_read_record 0 0 1096 29 8 : tunables 0 0 0 : slabdata 0 0 0
> kcopyd_job 0 0 376 21 2 : tunables 0 0 0 : slabdata 0 0 0
> dm_uevent 0 0 2504 13 8 : tunables 0 0 0 : slabdata 0 0 0
> dm_rq_target_io 0 0 272 30 2 : tunables 0 0 0 : slabdata 0 0 0
> mqueue_inode_cache 17 17 960 17 4 : tunables 0 0 0 : slabdata 1 1 0
> fuse_request 17 17 480 17 2 : tunables 0 0 0 : slabdata 1 1 0
> fuse_inode 19 19 832 19 4 : tunables 0 0 0 : slabdata 1 1 0
> nfsd4_stateowners 0 0 392 20 2 : tunables 0 0 0 : slabdata 0 0 0
> nfs_write_data 48 48 512 16 2 : tunables 0 0 0 : slabdata 3 3 0
> nfs_read_data 32 32 512 16 2 : tunables 0 0 0 : slabdata 2 2 0
> nfs_inode_cache 0 0 1080 30 8 : tunables 0 0 0 : slabdata 0 0 0
> ecryptfs_key_record_cache 0 0 576 28 4 : tunables 0 0 0 : slabdata 0 0 0
> ecryptfs_sb_cache 0 0 640 25 4 : tunables 0 0 0 : slabdata 0 0 0
> ecryptfs_inode_cache 0 0 1280 25 8 : tunables 0 0 0 : slabdata 0 0 0
> ecryptfs_auth_tok_list_item 0 0 896 18 4 : tunables 0 0 0 : slabdata 0 0 0
> hugetlbfs_inode_cache 23 23 696 23 4 : tunables 0 0 0 : slabdata 1 1 0
> ext4_inode_cache 0 0 1168 28 8 : tunables 0 0 0 : slabdata 0 0 0
> ext2_inode_cache 0 0 984 16 4 : tunables 0 0 0 : slabdata 0 0 0
> ext3_inode_cache 5391 5392 968 16 4 : tunables 0 0 0 : slabdata 337 337 0
> dquot 0 0 320 25 2 : tunables 0 0 0 : slabdata 0 0 0
> kioctx 0 0 384 21 2 : tunables 0 0 0 : slabdata 0 0 0
> rpc_buffers 30 30 2112 15 8 : tunables 0 0 0 : slabdata 2 2 0
> rpc_inode_cache 18 18 896 18 4 : tunables 0 0 0 : slabdata 1 1 0
> UNIX 556 558 896 18 4 : tunables 0 0 0 : slabdata 31 31 0
> UDP-Lite 0 0 832 19 4 : tunables 0 0 0 : slabdata 0 0 0
> ip_dst_cache 125 125 320 25 2 : tunables 0 0 0 : slabdata 5 5 0
> arp_cache 100 100 320 25 2 : tunables 0 0 0 : slabdata 4 4 0
> RAW 19 19 832 19 4 : tunables 0 0 0 : slabdata 1 1 0
> UDP 76 76 832 19 4 : tunables 0 0 0 : slabdata 4 4 0
> TCP 76 76 1664 19 8 : tunables 0 0 0 : slabdata 4 4 0
> sgpool-128 48 48 2624 12 8 : tunables 0 0 0 : slabdata 4 4 0
> sgpool-64 96 96 1344 24 8 : tunables 0 0 0 : slabdata 4 4 0
> sgpool-32 92 92 704 23 4 : tunables 0 0 0 : slabdata 4 4 0
> sgpool-16 84 84 384 21 2 : tunables 0 0 0 : slabdata 4 4 0
> blkdev_queue 72 72 1736 18 8 : tunables 0 0 0 : slabdata 4 4 0
> biovec-256 10 10 3136 10 8 : tunables 0 0 0 : slabdata 1 1 0
> biovec-128 20 20 1600 20 8 : tunables 0 0 0 : slabdata 1 1 0
> biovec-64 76 76 832 19 4 : tunables 0 0 0 : slabdata 4 4 0
> bip-256 10 10 3200 10 8 : tunables 0 0 0 : slabdata 1 1 0
> bip-128 0 0 1664 19 8 : tunables 0 0 0 : slabdata 0 0 0
> bip-64 0 0 896 18 4 : tunables 0 0 0 : slabdata 0 0 0
> bip-16 0 0 320 25 2 : tunables 0 0 0 : slabdata 0 0 0
> sock_inode_cache 629 630 768 21 4 : tunables 0 0 0 : slabdata 30 30 0
> skbuff_fclone_cache 72 72 448 18 2 : tunables 0 0 0 : slabdata 4 4 0
> shmem_inode_cache 1862 1862 824 19 4 : tunables 0 0 0 : slabdata 98 98 0
> taskstats 84 84 376 21 2 : tunables 0 0 0 : slabdata 4 4 0
> proc_inode_cache 1623 1650 720 22 4 : tunables 0 0 0 : slabdata 75 75 0
> bdev_cache 68 68 960 17 4 : tunables 0 0 0 : slabdata 4 4 0
> inode_cache 7125 7130 696 23 4 : tunables 0 0 0 : slabdata 310 310 0
> mm_struct 135 138 704 23 4 : tunables 0 0 0 : slabdata 6 6 0
> files_cache 142 150 320 25 2 : tunables 0 0 0 : slabdata 6 6 0
> signal_cache 229 230 704 23 4 : tunables 0 0 0 : slabdata 10 10 0
> sighand_cache 228 230 1408 23 8 : tunables 0 0 0 : slabdata 10 10 0
> task_xstate 195 200 640 25 4 : tunables 0 0 0 : slabdata 8 8 0
> task_struct 271 285 5520 5 8 : tunables 0 0 0 : slabdata 57 57 0
> radix_tree_node 3484 3504 336 24 2 : tunables 0 0 0 : slabdata 146 146 0
> kmalloc-8192 20 20 8192 4 8 : tunables 0 0 0 : slabdata 5 5 0
> kmalloc-4096 79 80 4096 8 8 : tunables 0 0 0 : slabdata 10 10 0
> kmalloc-2048 388 390 2096 15 8 : tunables 0 0 0 : slabdata 26 26 0
> kmalloc-1024 382 390 1072 30 8 : tunables 0 0 0 : slabdata 13 13 0
> kmalloc-512 796 812 560 29 4 : tunables 0 0 0 : slabdata 28 28 0
> kmalloc-256 153 156 304 26 2 : tunables 0 0 0 : slabdata 6 6 0
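As an aside, the high-order caches Fengguang mentions can be picked straight
out of /proc/slabinfo; a sketch against the "slabinfo - version: 2.1" column
layout shown in the dumps above, where pagesperslab is field 6 (the sample
dentry line in my testing is hypothetical):

```shell
# List slab caches that need high-order pages (pagesperslab > 1), i.e.
# the allocations that can drive lumpy reclaim. Input is slabinfo 2.1.
high_order_caches() {
    awk '!/^(slabinfo|#)/ && $6 > 1 {
             print $1, "order", int(log($6) / log(2) + 0.5)
         }'
}

# Live usage:  high_order_caches < /proc/slabinfo
```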
^ permalink raw reply [flat|nested] 75+ messages in thread
> kmalloc-512 796 812 560 29 4 : tunables 0 0 0 : slabdata 28 28 0
> kmalloc-256 153 156 304 26 2 : tunables 0 0 0 : slabdata 6 6 0
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Over-eager swapping
2010-08-04 2:21 ` Wu Fengguang
@ 2010-08-04 3:10 ` Minchan Kim
2010-08-04 3:24 ` Wu Fengguang
-1 siblings, 1 reply; 75+ messages in thread
From: Minchan Kim @ 2010-08-04 3:10 UTC (permalink / raw)
To: Wu Fengguang
Cc: Chris Webb, linux-mm, linux-kernel, KOSAKI Motohiro, Pekka Enberg
[-- Attachment #1: Type: text/plain, Size: 2349 bytes --]
On Wed, Aug 4, 2010 at 11:21 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> Chris,
>
> Your slabinfo does contain many order 1-3 slab caches; these are a major source
> of high-order allocations and hence lumpy reclaim. fork() is another.
>
> In another thread, Pekka Enberg offers a tip:
>
> You can pass "slub_debug=O" as a kernel parameter to disable higher
> order allocations if you want to test things.
>
> Note that the parameter works on a CONFIG_SLUB_DEBUG=y kernel.
>
> Thanks,
> Fengguang
He said the following:
"After running swapoff -a, the machine is immediately much healthier. Even
while the swap is still being reduced, load goes down and response times in
virtual machines are much improved. Once the swap is completely gone, there
are still several gigabytes of RAM left free which are used for buffers, and
the virtual machines are no longer laggy because they are no longer swapped
out. Running swapon -a again, the affected machine waits for about a minute
with zero swap in use, before the amount of swap in use very rapidly
increases to around 2GB and then continues to increase more steadily to 3GB."
1. His system works well without swap.
2. His system's swap use increases rapidly to 2GB, then more steadily to 3GB.
So I thought it isn't likely to be related to normal lumpy reclaim.
Of course, without swap, lumpy reclaim can scan more file pages to make
contiguous page frames, so the system could still work well. But I can't
explain point 2.
Hmm, I have no idea. :(
Off-Topic:
Hi, Pekka.
The documentation says:
"Debugging options may require the minimum possible slab order to increase as
a result of storing the metadata (for example, caches with PAGE_SIZE object
sizes). This has a higher likelihood of resulting in slab allocation errors
in low memory situations or if there's high fragmentation of memory. To
switch off debugging for such caches by default, use
slub_debug=O"
But when I tested it on my machine (2.6.34) with slub_debug=O, it
increased objsize and pagesperslab. It even increased the number of
slabs (but I am not sure about this part, since the two snapshots might
not be from the same time after booting).
What am I missing?
Also, SLAB seems to use lower-order pages than SLUB. Hmm.
Is SLAB more suitable than SLUB on small-memory systems (e.g. embedded)?
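The comparison Minchan describes can be automated. Below is a small sketch (not part of the thread) that parses two slabinfo 2.1 snapshots, like the two attached logs, and reports caches whose objsize or pagesperslab differ between a slub_debug=O boot and a default boot. The sample lines are copied from the attachments; the helper names are my own.

```python
# Hypothetical helper: compare two /proc/slabinfo (version 2.1) snapshots
# and report caches whose object size or pages-per-slab changed.
def parse_slabinfo(text):
    caches = {}
    for line in text.splitlines():
        if line.startswith(("slabinfo", "#")) or not line.strip():
            continue  # skip the version banner and column-header lines
        fields = line.split()
        # layout: name active_objs num_objs objsize objperslab pagesperslab ...
        caches[fields[0]] = (int(fields[3]), int(fields[5]))
    return caches

def diff_slabinfo(a, b):
    """Caches present in both snapshots whose (objsize, pagesperslab) differ."""
    ca, cb = parse_slabinfo(a), parse_slabinfo(b)
    return {n: (ca[n], cb[n]) for n in ca.keys() & cb.keys() if ca[n] != cb[n]}

# Sample lines taken from the slub_debug.log / slub_debug_disable.log attachments.
with_debug = """\
slabinfo - version: 2.1
TCPv6 72 72 1728 18 8 : tunables 0 0 0 : slabdata 4 4 0
fuse_request 18 18 432 18 2 : tunables 0 0 0 : slabdata 1 1 0
"""
without_debug = """\
slabinfo - version: 2.1
TCPv6 36 36 1792 18 8 : tunables 0 0 0 : slabdata 2 2 0
fuse_request 17 17 480 17 2 : tunables 0 0 0 : slabdata 1 1 0
"""
changed = diff_slabinfo(with_debug, without_debug)
print(changed)
```

Note that on these sample caches objsize is larger in the "disabled" log, which matches Minchan's observation above.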
--
Kind regards,
Minchan Kim
[-- Attachment #2: slub_debug.log --]
[-- Type: text/x-log, Size: 5786 bytes --]
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
kvm_vcpu 0 0 9200 3 8 : tunables 0 0 0 : slabdata 0 0 0
kmalloc_dma-512 16 16 512 16 2 : tunables 0 0 0 : slabdata 1 1 0
RAWv6 17 17 960 17 4 : tunables 0 0 0 : slabdata 1 1 0
UDPLITEv6 0 0 960 17 4 : tunables 0 0 0 : slabdata 0 0 0
UDPv6 51 51 960 17 4 : tunables 0 0 0 : slabdata 3 3 0
TCPv6 72 72 1728 18 8 : tunables 0 0 0 : slabdata 4 4 0
nf_conntrack_c10a8540 0 0 280 29 2 : tunables 0 0 0 : slabdata 0 0 0
dm_raid1_read_record 0 0 1056 31 8 : tunables 0 0 0 : slabdata 0 0 0
dm_uevent 0 0 2464 13 8 : tunables 0 0 0 : slabdata 0 0 0
mqueue_inode_cache 18 18 896 18 4 : tunables 0 0 0 : slabdata 1 1 0
fuse_request 18 18 432 18 2 : tunables 0 0 0 : slabdata 1 1 0
fuse_inode 21 21 768 21 4 : tunables 0 0 0 : slabdata 1 1 0
nfsd4_stateowners 0 0 344 23 2 : tunables 0 0 0 : slabdata 0 0 0
nfs_read_data 72 72 448 18 2 : tunables 0 0 0 : slabdata 4 4 0
nfs_inode_cache 0 0 1040 31 8 : tunables 0 0 0 : slabdata 0 0 0
ecryptfs_inode_cache 0 0 1280 25 8 : tunables 0 0 0 : slabdata 0 0 0
hugetlbfs_inode_cache 24 24 656 24 4 : tunables 0 0 0 : slabdata 1 1 0
ext4_inode_cache 0 0 1128 29 8 : tunables 0 0 0 : slabdata 0 0 0
ext2_inode_cache 0 0 944 17 4 : tunables 0 0 0 : slabdata 0 0 0
ext3_inode_cache 5032 5032 928 17 4 : tunables 0 0 0 : slabdata 296 296 0
rpc_inode_cache 18 18 896 18 4 : tunables 0 0 0 : slabdata 1 1 0
UNIX 532 532 832 19 4 : tunables 0 0 0 : slabdata 28 28 0
UDP-Lite 0 0 832 19 4 : tunables 0 0 0 : slabdata 0 0 0
UDP 76 76 832 19 4 : tunables 0 0 0 : slabdata 4 4 0
TCP 60 60 1600 20 8 : tunables 0 0 0 : slabdata 3 3 0
sgpool-128 48 48 2560 12 8 : tunables 0 0 0 : slabdata 4 4 0
sgpool-64 100 100 1280 25 8 : tunables 0 0 0 : slabdata 4 4 0
blkdev_queue 76 76 1688 19 8 : tunables 0 0 0 : slabdata 4 4 0
biovec-256 10 10 3072 10 8 : tunables 0 0 0 : slabdata 1 1 0
biovec-128 21 21 1536 21 8 : tunables 0 0 0 : slabdata 1 1 0
biovec-64 84 84 768 21 4 : tunables 0 0 0 : slabdata 4 4 0
bip-256 10 10 3200 10 8 : tunables 0 0 0 : slabdata 1 1 0
bip-128 0 0 1664 19 8 : tunables 0 0 0 : slabdata 0 0 0
bip-64 0 0 896 18 4 : tunables 0 0 0 : slabdata 0 0 0
bip-16 100 100 320 25 2 : tunables 0 0 0 : slabdata 4 4 0
sock_inode_cache 609 609 768 21 4 : tunables 0 0 0 : slabdata 29 29 0
skbuff_fclone_cache 84 84 384 21 2 : tunables 0 0 0 : slabdata 4 4 0
shmem_inode_cache 1835 1840 784 20 4 : tunables 0 0 0 : slabdata 92 92 0
taskstats 96 96 328 24 2 : tunables 0 0 0 : slabdata 4 4 0
proc_inode_cache 1584 1584 680 24 4 : tunables 0 0 0 : slabdata 66 66 0
bdev_cache 72 72 896 18 4 : tunables 0 0 0 : slabdata 4 4 0
inode_cache 7126 7128 656 24 4 : tunables 0 0 0 : slabdata 297 297 0
signal_cache 332 350 640 25 4 : tunables 0 0 0 : slabdata 14 14 0
sighand_cache 246 253 1408 23 8 : tunables 0 0 0 : slabdata 11 11 0
task_xstate 193 196 576 28 4 : tunables 0 0 0 : slabdata 7 7 0
task_struct 274 285 5472 5 8 : tunables 0 0 0 : slabdata 57 57 0
radix_tree_node 3208 3213 296 27 2 : tunables 0 0 0 : slabdata 119 119 0
kmalloc-8192 20 20 8192 4 8 : tunables 0 0 0 : slabdata 5 5 0
kmalloc-4096 78 80 4096 8 8 : tunables 0 0 0 : slabdata 10 10 0
kmalloc-2048 400 400 2048 16 8 : tunables 0 0 0 : slabdata 25 25 0
kmalloc-1024 326 336 1024 16 4 : tunables 0 0 0 : slabdata 21 21 0
kmalloc-512 758 784 512 16 2 : tunables 0 0 0 : slabdata 49 49 0
[-- Attachment #3: slub_debug_disable.log --]
[-- Type: text/x-log, Size: 8050 bytes --]
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
kvm_vcpu 0 0 9248 3 8 : tunables 0 0 0 : slabdata 0 0 0
kmalloc_dma-512 29 29 560 29 4 : tunables 0 0 0 : slabdata 1 1 0
clip_arp_cache 0 0 320 25 2 : tunables 0 0 0 : slabdata 0 0 0
ip6_dst_cache 25 25 320 25 2 : tunables 0 0 0 : slabdata 1 1 0
ndisc_cache 25 25 320 25 2 : tunables 0 0 0 : slabdata 1 1 0
RAWv6 16 16 1024 16 4 : tunables 0 0 0 : slabdata 1 1 0
UDPLITEv6 0 0 960 17 4 : tunables 0 0 0 : slabdata 0 0 0
UDPv6 68 68 960 17 4 : tunables 0 0 0 : slabdata 4 4 0
tw_sock_TCPv6 0 0 320 25 2 : tunables 0 0 0 : slabdata 0 0 0
TCPv6 36 36 1792 18 8 : tunables 0 0 0 : slabdata 2 2 0
nf_conntrack_c10a8540 0 0 320 25 2 : tunables 0 0 0 : slabdata 0 0 0
dm_raid1_read_record 0 0 1096 29 8 : tunables 0 0 0 : slabdata 0 0 0
kcopyd_job 0 0 376 21 2 : tunables 0 0 0 : slabdata 0 0 0
dm_uevent 0 0 2504 13 8 : tunables 0 0 0 : slabdata 0 0 0
dm_rq_target_io 0 0 272 30 2 : tunables 0 0 0 : slabdata 0 0 0
mqueue_inode_cache 17 17 960 17 4 : tunables 0 0 0 : slabdata 1 1 0
fuse_request 17 17 480 17 2 : tunables 0 0 0 : slabdata 1 1 0
fuse_inode 19 19 832 19 4 : tunables 0 0 0 : slabdata 1 1 0
nfsd4_stateowners 0 0 392 20 2 : tunables 0 0 0 : slabdata 0 0 0
nfs_write_data 48 48 512 16 2 : tunables 0 0 0 : slabdata 3 3 0
nfs_read_data 32 32 512 16 2 : tunables 0 0 0 : slabdata 2 2 0
nfs_inode_cache 0 0 1080 30 8 : tunables 0 0 0 : slabdata 0 0 0
ecryptfs_key_record_cache 0 0 576 28 4 : tunables 0 0 0 : slabdata 0 0 0
ecryptfs_sb_cache 0 0 640 25 4 : tunables 0 0 0 : slabdata 0 0 0
ecryptfs_inode_cache 0 0 1280 25 8 : tunables 0 0 0 : slabdata 0 0 0
ecryptfs_auth_tok_list_item 0 0 896 18 4 : tunables 0 0 0 : slabdata 0 0 0
hugetlbfs_inode_cache 23 23 696 23 4 : tunables 0 0 0 : slabdata 1 1 0
ext4_inode_cache 0 0 1168 28 8 : tunables 0 0 0 : slabdata 0 0 0
ext2_inode_cache 0 0 984 16 4 : tunables 0 0 0 : slabdata 0 0 0
ext3_inode_cache 5391 5392 968 16 4 : tunables 0 0 0 : slabdata 337 337 0
dquot 0 0 320 25 2 : tunables 0 0 0 : slabdata 0 0 0
kioctx 0 0 384 21 2 : tunables 0 0 0 : slabdata 0 0 0
rpc_buffers 30 30 2112 15 8 : tunables 0 0 0 : slabdata 2 2 0
rpc_inode_cache 18 18 896 18 4 : tunables 0 0 0 : slabdata 1 1 0
UNIX 556 558 896 18 4 : tunables 0 0 0 : slabdata 31 31 0
UDP-Lite 0 0 832 19 4 : tunables 0 0 0 : slabdata 0 0 0
ip_dst_cache 125 125 320 25 2 : tunables 0 0 0 : slabdata 5 5 0
arp_cache 100 100 320 25 2 : tunables 0 0 0 : slabdata 4 4 0
RAW 19 19 832 19 4 : tunables 0 0 0 : slabdata 1 1 0
UDP 76 76 832 19 4 : tunables 0 0 0 : slabdata 4 4 0
TCP 76 76 1664 19 8 : tunables 0 0 0 : slabdata 4 4 0
sgpool-128 48 48 2624 12 8 : tunables 0 0 0 : slabdata 4 4 0
sgpool-64 96 96 1344 24 8 : tunables 0 0 0 : slabdata 4 4 0
sgpool-32 92 92 704 23 4 : tunables 0 0 0 : slabdata 4 4 0
sgpool-16 84 84 384 21 2 : tunables 0 0 0 : slabdata 4 4 0
blkdev_queue 72 72 1736 18 8 : tunables 0 0 0 : slabdata 4 4 0
biovec-256 10 10 3136 10 8 : tunables 0 0 0 : slabdata 1 1 0
biovec-128 20 20 1600 20 8 : tunables 0 0 0 : slabdata 1 1 0
biovec-64 76 76 832 19 4 : tunables 0 0 0 : slabdata 4 4 0
bip-256 10 10 3200 10 8 : tunables 0 0 0 : slabdata 1 1 0
bip-128 0 0 1664 19 8 : tunables 0 0 0 : slabdata 0 0 0
bip-64 0 0 896 18 4 : tunables 0 0 0 : slabdata 0 0 0
bip-16 0 0 320 25 2 : tunables 0 0 0 : slabdata 0 0 0
sock_inode_cache 629 630 768 21 4 : tunables 0 0 0 : slabdata 30 30 0
skbuff_fclone_cache 72 72 448 18 2 : tunables 0 0 0 : slabdata 4 4 0
shmem_inode_cache 1862 1862 824 19 4 : tunables 0 0 0 : slabdata 98 98 0
taskstats 84 84 376 21 2 : tunables 0 0 0 : slabdata 4 4 0
proc_inode_cache 1623 1650 720 22 4 : tunables 0 0 0 : slabdata 75 75 0
bdev_cache 68 68 960 17 4 : tunables 0 0 0 : slabdata 4 4 0
inode_cache 7125 7130 696 23 4 : tunables 0 0 0 : slabdata 310 310 0
mm_struct 135 138 704 23 4 : tunables 0 0 0 : slabdata 6 6 0
files_cache 142 150 320 25 2 : tunables 0 0 0 : slabdata 6 6 0
signal_cache 229 230 704 23 4 : tunables 0 0 0 : slabdata 10 10 0
sighand_cache 228 230 1408 23 8 : tunables 0 0 0 : slabdata 10 10 0
task_xstate 195 200 640 25 4 : tunables 0 0 0 : slabdata 8 8 0
task_struct 271 285 5520 5 8 : tunables 0 0 0 : slabdata 57 57 0
radix_tree_node 3484 3504 336 24 2 : tunables 0 0 0 : slabdata 146 146 0
kmalloc-8192 20 20 8192 4 8 : tunables 0 0 0 : slabdata 5 5 0
kmalloc-4096 79 80 4096 8 8 : tunables 0 0 0 : slabdata 10 10 0
kmalloc-2048 388 390 2096 15 8 : tunables 0 0 0 : slabdata 26 26 0
kmalloc-1024 382 390 1072 30 8 : tunables 0 0 0 : slabdata 13 13 0
kmalloc-512 796 812 560 29 4 : tunables 0 0 0 : slabdata 28 28 0
kmalloc-256 153 156 304 26 2 : tunables 0 0 0 : slabdata 6 6 0
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Over-eager swapping
2010-08-03 21:49 ` Chris Webb
@ 2010-08-04 2:21 ` Wu Fengguang
-1 siblings, 0 replies; 75+ messages in thread
From: Wu Fengguang @ 2010-08-04 2:21 UTC (permalink / raw)
To: Chris Webb
Cc: Minchan Kim, linux-mm, linux-kernel, KOSAKI Motohiro, Pekka Enberg
Chris,
Your slabinfo does contain many order 1-3 slab caches; these are a major source
of high-order allocations and hence lumpy reclaim. fork() is another.
In another thread, Pekka Enberg offers a tip:
You can pass "slub_debug=O" as a kernel parameter to disable higher
order allocations if you want to test things.
Note that the parameter works on a CONFIG_SLUB_DEBUG=y kernel.
Thanks,
Fengguang
On Wed, Aug 04, 2010 at 05:49:46AM +0800, Chris Webb wrote:
> Wu Fengguang <fengguang.wu@intel.com> writes:
>
> > Chris, what's in your /proc/slabinfo?
>
> Hi. Sorry for the slow reply. The exact machine from which I previously
> extracted that /proc/memstat has unfortunately had swap turned off by a
> colleague while I was away, presumably because its behaviour became too
> bad. However, here is info from another member of the cluster, this time
> with 5GB of buffers and 2GB of swap in use, i.e. the same general problem:
>
> # cat /proc/meminfo
> MemTotal: 33084008 kB
> MemFree: 2291464 kB
> Buffers: 4908468 kB
> Cached: 16056 kB
> SwapCached: 1427480 kB
> Active: 22885508 kB
> Inactive: 5719520 kB
> Active(anon): 20466488 kB
> Inactive(anon): 3215888 kB
> Active(file): 2419020 kB
> Inactive(file): 2503632 kB
> Unevictable: 10688 kB
> Mlocked: 10688 kB
> SwapTotal: 25165816 kB
> SwapFree: 22798248 kB
> Dirty: 2616 kB
> Writeback: 0 kB
> AnonPages: 23410296 kB
> Mapped: 6324 kB
> Shmem: 56 kB
> Slab: 692296 kB
> SReclaimable: 189032 kB
> SUnreclaim: 503264 kB
> KernelStack: 4568 kB
> PageTables: 65588 kB
> NFS_Unstable: 0 kB
> Bounce: 0 kB
> WritebackTmp: 0 kB
> CommitLimit: 41707820 kB
> Committed_AS: 34859884 kB
> VmallocTotal: 34359738367 kB
> VmallocUsed: 147616 kB
> VmallocChunk: 34342399496 kB
> HardwareCorrupted: 0 kB
> HugePages_Total: 0
> HugePages_Free: 0
> HugePages_Rsvd: 0
> HugePages_Surp: 0
> Hugepagesize: 2048 kB
> DirectMap4k: 5888 kB
> DirectMap2M: 2156544 kB
> DirectMap1G: 31457280 kB
>
> # cat /proc/slabinfo
> slabinfo - version: 2.1
> # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> kmalloc_dma-512 32 32 512 32 4 : tunables 0 0 0 : slabdata 1 1 0
> nf_conntrack_expect 312 312 208 39 2 : tunables 0 0 0 : slabdata 8 8 0
> nf_conntrack 240 240 272 30 2 : tunables 0 0 0 : slabdata 8 8 0
> dm_raid1_read_record 0 0 1064 30 8 : tunables 0 0 0 : slabdata 0 0 0
> dm_crypt_io 240 260 152 26 1 : tunables 0 0 0 : slabdata 10 10 0
> kcopyd_job 0 0 368 22 2 : tunables 0 0 0 : slabdata 0 0 0
> dm_uevent 0 0 2608 12 8 : tunables 0 0 0 : slabdata 0 0 0
> dm_rq_target_io 0 0 376 21 2 : tunables 0 0 0 : slabdata 0 0 0
> cfq_queue 0 0 168 24 1 : tunables 0 0 0 : slabdata 0 0 0
> bsg_cmd 0 0 312 26 2 : tunables 0 0 0 : slabdata 0 0 0
> mqueue_inode_cache 36 36 896 36 8 : tunables 0 0 0 : slabdata 1 1 0
> udf_inode_cache 0 0 640 25 4 : tunables 0 0 0 : slabdata 0 0 0
> fuse_request 0 0 632 25 4 : tunables 0 0 0 : slabdata 0 0 0
> fuse_inode 0 0 704 23 4 : tunables 0 0 0 : slabdata 0 0 0
> ntfs_big_inode_cache 0 0 832 39 8 : tunables 0 0 0 : slabdata 0 0 0
> ntfs_inode_cache 0 0 264 31 2 : tunables 0 0 0 : slabdata 0 0 0
> isofs_inode_cache 0 0 616 26 4 : tunables 0 0 0 : slabdata 0 0 0
> fat_inode_cache 0 0 648 25 4 : tunables 0 0 0 : slabdata 0 0 0
> fat_cache 0 0 40 102 1 : tunables 0 0 0 : slabdata 0 0 0
> hugetlbfs_inode_cache 28 28 584 28 4 : tunables 0 0 0 : slabdata 1 1 0
> squashfs_inode_cache 0 0 640 25 4 : tunables 0 0 0 : slabdata 0 0 0
> journal_handle 1360 1360 24 170 1 : tunables 0 0 0 : slabdata 8 8 0
> journal_head 288 288 112 36 1 : tunables 0 0 0 : slabdata 8 8 0
> revoke_table 512 512 16 256 1 : tunables 0 0 0 : slabdata 2 2 0
> revoke_record 1024 1024 32 128 1 : tunables 0 0 0 : slabdata 8 8 0
> ext4_inode_cache 0 0 896 36 8 : tunables 0 0 0 : slabdata 0 0 0
> ext4_free_block_extents 0 0 56 73 1 : tunables 0 0 0 : slabdata 0 0 0
> ext4_alloc_context 0 0 144 28 1 : tunables 0 0 0 : slabdata 0 0 0
> ext4_prealloc_space 0 0 104 39 1 : tunables 0 0 0 : slabdata 0 0 0
> ext4_system_zone 0 0 40 102 1 : tunables 0 0 0 : slabdata 0 0 0
> ext2_inode_cache 0 0 752 21 4 : tunables 0 0 0 : slabdata 0 0 0
> ext3_inode_cache 2371 2457 768 21 4 : tunables 0 0 0 : slabdata 117 117 0
> ext3_xattr 0 0 88 46 1 : tunables 0 0 0 : slabdata 0 0 0
> configfs_dir_cache 0 0 88 46 1 : tunables 0 0 0 : slabdata 0 0 0
> kioctx 0 0 320 25 2 : tunables 0 0 0 : slabdata 0 0 0
> inotify_inode_mark_entry 36 36 112 36 1 : tunables 0 0 0 : slabdata 1 1 0
> posix_timers_cache 224 224 144 28 1 : tunables 0 0 0 : slabdata 8 8 0
> kvm_vcpu 38 45 10256 3 8 : tunables 0 0 0 : slabdata 15 15 0
> kvm_rmap_desc 19408 21828 40 102 1 : tunables 0 0 0 : slabdata 214 214 0
> kvm_pte_chain 14514 28543 56 73 1 : tunables 0 0 0 : slabdata 391 391 0
> UDP-Lite 0 0 768 21 4 : tunables 0 0 0 : slabdata 0 0 0
> ip_dst_cache 221 231 384 21 2 : tunables 0 0 0 : slabdata 11 11 0
> UDP 168 168 768 21 4 : tunables 0 0 0 : slabdata 8 8 0
> tw_sock_TCP 256 256 256 32 2 : tunables 0 0 0 : slabdata 8 8 0
> TCP 191 220 1472 22 8 : tunables 0 0 0 : slabdata 10 10 0
> blkdev_queue 178 210 2128 15 8 : tunables 0 0 0 : slabdata 14 14 0
> blkdev_requests 608 816 336 24 2 : tunables 0 0 0 : slabdata 34 34 0
> fsnotify_event 0 0 104 39 1 : tunables 0 0 0 : slabdata 0 0 0
> sock_inode_cache 250 300 640 25 4 : tunables 0 0 0 : slabdata 12 12 0
> file_lock_cache 176 176 184 22 1 : tunables 0 0 0 : slabdata 8 8 0
> shmem_inode_cache 1617 1827 776 21 4 : tunables 0 0 0 : slabdata 87 87 0
> Acpi-ParseExt 1692 1736 72 56 1 : tunables 0 0 0 : slabdata 31 31 0
> proc_inode_cache 1182 1326 616 26 4 : tunables 0 0 0 : slabdata 51 51 0
> sigqueue 200 200 160 25 1 : tunables 0 0 0 : slabdata 8 8 0
> radix_tree_node 65891 69542 560 29 4 : tunables 0 0 0 : slabdata 2398 2398 0
> bdev_cache 312 312 832 39 8 : tunables 0 0 0 : slabdata 8 8 0
> sysfs_dir_cache 21585 22287 80 51 1 : tunables 0 0 0 : slabdata 437 437 0
> inode_cache 2903 2996 568 28 4 : tunables 0 0 0 : slabdata 107 107 0
> dentry 8532 8631 192 21 1 : tunables 0 0 0 : slabdata 411 411 0
> buffer_head 1227688 1296648 112 36 1 : tunables 0 0 0 : slabdata 36018 36018 0
> vm_area_struct 18494 19389 176 23 1 : tunables 0 0 0 : slabdata 843 843 0
> files_cache 236 322 704 23 4 : tunables 0 0 0 : slabdata 14 14 0
> signal_cache 606 702 832 39 8 : tunables 0 0 0 : slabdata 18 18 0
> sighand_cache 415 480 2112 15 8 : tunables 0 0 0 : slabdata 32 32 0
> task_struct 671 840 1616 20 8 : tunables 0 0 0 : slabdata 42 42 0
> anon_vma 1511 1920 32 128 1 : tunables 0 0 0 : slabdata 15 15 0
> shared_policy_node 255 255 48 85 1 : tunables 0 0 0 : slabdata 3 3 0
> numa_policy 19205 20910 24 170 1 : tunables 0 0 0 : slabdata 123 123 0
> idr_layer_cache 373 390 544 30 4 : tunables 0 0 0 : slabdata 13 13 0
> kmalloc-8192 36 36 8192 4 8 : tunables 0 0 0 : slabdata 9 9 0
> kmalloc-4096 2284 2592 4096 8 8 : tunables 0 0 0 : slabdata 324 324 0
> kmalloc-2048 750 896 2048 16 8 : tunables 0 0 0 : slabdata 56 56 0
> kmalloc-1024 4025 4320 1024 32 8 : tunables 0 0 0 : slabdata 135 135 0
> kmalloc-512 1358 1760 512 32 4 : tunables 0 0 0 : slabdata 55 55 0
> kmalloc-256 1402 1952 256 32 2 : tunables 0 0 0 : slabdata 61 61 0
> kmalloc-128 8625 9280 128 32 1 : tunables 0 0 0 : slabdata 290 290 0
> kmalloc-64 7030122 7455232 64 64 1 : tunables 0 0 0 : slabdata 116488 116488 0
> kmalloc-32 18603 19712 32 128 1 : tunables 0 0 0 : slabdata 154 154 0
> kmalloc-16 8895 9728 16 256 1 : tunables 0 0 0 : slabdata 38 38 0
> kmalloc-8 9047 10752 8 512 1 : tunables 0 0 0 : slabdata 21 21 0
> kmalloc-192 5130 9135 192 21 1 : tunables 0 0 0 : slabdata 435 435 0
> kmalloc-96 1905 2940 96 42 1 : tunables 0 0 0 : slabdata 70 70 0
> kmem_cache_node 196 256 64 64 1 : tunables 0 0 0 : slabdata 4 4 0
>
> # cat /proc/buddyinfo
> Node 0, zone DMA 2 0 2 2 2 2 2 1 2 2 2
> Node 0, zone DMA32 61877 10368 111 10 2 3 1 0 0 0 0
> Node 0, zone Normal 2036 0 14 12 6 3 3 0 1 0 0
> Node 1, zone Normal 483348 15 2 3 7 1 3 1 0 0 0
>
> Best wishes,
>
> Chris.
^ permalink raw reply [flat|nested] 75+ messages in thread
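Fengguang's point about order 1-3 caches can be checked mechanically. A sketch, not from the thread: in slabinfo 2.1, pagesperslab is a power of two, so each cache's allocation order is log2(pagesperslab), and anything above order 0 requires physically contiguous pages. The sample lines are copied from Chris's slabinfo above; the function name is my own.

```python
# Sketch: flag slab caches that need high-order (order >= 1) page
# allocations from a /proc/slabinfo (version 2.1) dump.
import math

def high_order_caches(slabinfo_text, min_order=1):
    result = {}
    for line in slabinfo_text.splitlines():
        if line.startswith(("slabinfo", "#")) or not line.strip():
            continue  # skip banner and header lines
        fields = line.split()
        pages = int(fields[5])          # <pagesperslab> column
        order = int(math.log2(pages))   # pagesperslab is a power of two
        if order >= min_order:
            result[fields[0]] = order
    return result

# Lines taken from Chris's /proc/slabinfo in the message above.
sample = """\
slabinfo - version: 2.1
buffer_head 1227688 1296648 112 36 1 : tunables 0 0 0 : slabdata 36018 36018 0
sighand_cache 415 480 2112 15 8 : tunables 0 0 0 : slabdata 32 32 0
TCP 191 220 1472 22 8 : tunables 0 0 0 : slabdata 10 10 0
kmalloc-512 1358 1760 512 32 4 : tunables 0 0 0 : slabdata 55 55 0
"""
print(high_order_caches(sample))
# buffer_head uses order-0 pages; the other three need order 2-3 allocations
```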
* Re: Over-eager swapping
2010-08-03 4:28 ` Wu Fengguang
@ 2010-08-03 21:49 ` Chris Webb
-1 siblings, 0 replies; 75+ messages in thread
From: Chris Webb @ 2010-08-03 21:49 UTC (permalink / raw)
To: Wu Fengguang; +Cc: Minchan Kim, linux-mm, linux-kernel, KOSAKI Motohiro
Wu Fengguang <fengguang.wu@intel.com> writes:
> Chris, what's in your /proc/slabinfo?
Hi. Sorry for the slow reply. The exact machine from which I previously
extracted that /proc/memstat has unfortunately had swap turned off by a
colleague while I was away, presumably because its behaviour became too
bad. However, here is info from another member of the cluster, this time
with 5GB of buffers and 2GB of swap in use, i.e. the same general problem:
# cat /proc/meminfo
MemTotal: 33084008 kB
MemFree: 2291464 kB
Buffers: 4908468 kB
Cached: 16056 kB
SwapCached: 1427480 kB
Active: 22885508 kB
Inactive: 5719520 kB
Active(anon): 20466488 kB
Inactive(anon): 3215888 kB
Active(file): 2419020 kB
Inactive(file): 2503632 kB
Unevictable: 10688 kB
Mlocked: 10688 kB
SwapTotal: 25165816 kB
SwapFree: 22798248 kB
Dirty: 2616 kB
Writeback: 0 kB
AnonPages: 23410296 kB
Mapped: 6324 kB
Shmem: 56 kB
Slab: 692296 kB
SReclaimable: 189032 kB
SUnreclaim: 503264 kB
KernelStack: 4568 kB
PageTables: 65588 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 41707820 kB
Committed_AS: 34859884 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 147616 kB
VmallocChunk: 34342399496 kB
HardwareCorrupted: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 5888 kB
DirectMap2M: 2156544 kB
DirectMap1G: 31457280 kB
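[Editorial note: the "memory used" estimate described at the top of the thread
(MemTotal - MemFree - Buffers + SwapTotal - SwapFree) can be computed directly
from a /proc/meminfo snapshot like the one above. A minimal sketch, assuming the
field names used by 2.6.32+ kernels; it deliberately ignores Cached, which is
tiny on these hosts:]

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style 'Key:  value kB' lines into a dict of kB values."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])
    return info

def used_kb(info):
    """Estimated memory used: MemTotal - MemFree - Buffers plus swap in use."""
    return (info["MemTotal"] - info["MemFree"] - info["Buffers"]
            + info["SwapTotal"] - info["SwapFree"])

# Headline figures from the snapshot quoted above.
sample = """\
MemTotal:       33084008 kB
MemFree:         2291464 kB
Buffers:         4908468 kB
SwapTotal:      25165816 kB
SwapFree:       22798248 kB"""

print(used_kb(parse_meminfo(sample)))  # 28251644 kB, i.e. roughly 27GB in use
```

[On these numbers the host is using about 27GB of its 32GB, yet still has 2GB
of swap occupied, which is the anomaly being discussed.]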
# cat /proc/slabinfo
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
kmalloc_dma-512 32 32 512 32 4 : tunables 0 0 0 : slabdata 1 1 0
nf_conntrack_expect 312 312 208 39 2 : tunables 0 0 0 : slabdata 8 8 0
nf_conntrack 240 240 272 30 2 : tunables 0 0 0 : slabdata 8 8 0
dm_raid1_read_record 0 0 1064 30 8 : tunables 0 0 0 : slabdata 0 0 0
dm_crypt_io 240 260 152 26 1 : tunables 0 0 0 : slabdata 10 10 0
kcopyd_job 0 0 368 22 2 : tunables 0 0 0 : slabdata 0 0 0
dm_uevent 0 0 2608 12 8 : tunables 0 0 0 : slabdata 0 0 0
dm_rq_target_io 0 0 376 21 2 : tunables 0 0 0 : slabdata 0 0 0
cfq_queue 0 0 168 24 1 : tunables 0 0 0 : slabdata 0 0 0
bsg_cmd 0 0 312 26 2 : tunables 0 0 0 : slabdata 0 0 0
mqueue_inode_cache 36 36 896 36 8 : tunables 0 0 0 : slabdata 1 1 0
udf_inode_cache 0 0 640 25 4 : tunables 0 0 0 : slabdata 0 0 0
fuse_request 0 0 632 25 4 : tunables 0 0 0 : slabdata 0 0 0
fuse_inode 0 0 704 23 4 : tunables 0 0 0 : slabdata 0 0 0
ntfs_big_inode_cache 0 0 832 39 8 : tunables 0 0 0 : slabdata 0 0 0
ntfs_inode_cache 0 0 264 31 2 : tunables 0 0 0 : slabdata 0 0 0
isofs_inode_cache 0 0 616 26 4 : tunables 0 0 0 : slabdata 0 0 0
fat_inode_cache 0 0 648 25 4 : tunables 0 0 0 : slabdata 0 0 0
fat_cache 0 0 40 102 1 : tunables 0 0 0 : slabdata 0 0 0
hugetlbfs_inode_cache 28 28 584 28 4 : tunables 0 0 0 : slabdata 1 1 0
squashfs_inode_cache 0 0 640 25 4 : tunables 0 0 0 : slabdata 0 0 0
journal_handle 1360 1360 24 170 1 : tunables 0 0 0 : slabdata 8 8 0
journal_head 288 288 112 36 1 : tunables 0 0 0 : slabdata 8 8 0
revoke_table 512 512 16 256 1 : tunables 0 0 0 : slabdata 2 2 0
revoke_record 1024 1024 32 128 1 : tunables 0 0 0 : slabdata 8 8 0
ext4_inode_cache 0 0 896 36 8 : tunables 0 0 0 : slabdata 0 0 0
ext4_free_block_extents 0 0 56 73 1 : tunables 0 0 0 : slabdata 0 0 0
ext4_alloc_context 0 0 144 28 1 : tunables 0 0 0 : slabdata 0 0 0
ext4_prealloc_space 0 0 104 39 1 : tunables 0 0 0 : slabdata 0 0 0
ext4_system_zone 0 0 40 102 1 : tunables 0 0 0 : slabdata 0 0 0
ext2_inode_cache 0 0 752 21 4 : tunables 0 0 0 : slabdata 0 0 0
ext3_inode_cache 2371 2457 768 21 4 : tunables 0 0 0 : slabdata 117 117 0
ext3_xattr 0 0 88 46 1 : tunables 0 0 0 : slabdata 0 0 0
configfs_dir_cache 0 0 88 46 1 : tunables 0 0 0 : slabdata 0 0 0
kioctx 0 0 320 25 2 : tunables 0 0 0 : slabdata 0 0 0
inotify_inode_mark_entry 36 36 112 36 1 : tunables 0 0 0 : slabdata 1 1 0
posix_timers_cache 224 224 144 28 1 : tunables 0 0 0 : slabdata 8 8 0
kvm_vcpu 38 45 10256 3 8 : tunables 0 0 0 : slabdata 15 15 0
kvm_rmap_desc 19408 21828 40 102 1 : tunables 0 0 0 : slabdata 214 214 0
kvm_pte_chain 14514 28543 56 73 1 : tunables 0 0 0 : slabdata 391 391 0
UDP-Lite 0 0 768 21 4 : tunables 0 0 0 : slabdata 0 0 0
ip_dst_cache 221 231 384 21 2 : tunables 0 0 0 : slabdata 11 11 0
UDP 168 168 768 21 4 : tunables 0 0 0 : slabdata 8 8 0
tw_sock_TCP 256 256 256 32 2 : tunables 0 0 0 : slabdata 8 8 0
TCP 191 220 1472 22 8 : tunables 0 0 0 : slabdata 10 10 0
blkdev_queue 178 210 2128 15 8 : tunables 0 0 0 : slabdata 14 14 0
blkdev_requests 608 816 336 24 2 : tunables 0 0 0 : slabdata 34 34 0
fsnotify_event 0 0 104 39 1 : tunables 0 0 0 : slabdata 0 0 0
sock_inode_cache 250 300 640 25 4 : tunables 0 0 0 : slabdata 12 12 0
file_lock_cache 176 176 184 22 1 : tunables 0 0 0 : slabdata 8 8 0
shmem_inode_cache 1617 1827 776 21 4 : tunables 0 0 0 : slabdata 87 87 0
Acpi-ParseExt 1692 1736 72 56 1 : tunables 0 0 0 : slabdata 31 31 0
proc_inode_cache 1182 1326 616 26 4 : tunables 0 0 0 : slabdata 51 51 0
sigqueue 200 200 160 25 1 : tunables 0 0 0 : slabdata 8 8 0
radix_tree_node 65891 69542 560 29 4 : tunables 0 0 0 : slabdata 2398 2398 0
bdev_cache 312 312 832 39 8 : tunables 0 0 0 : slabdata 8 8 0
sysfs_dir_cache 21585 22287 80 51 1 : tunables 0 0 0 : slabdata 437 437 0
inode_cache 2903 2996 568 28 4 : tunables 0 0 0 : slabdata 107 107 0
dentry 8532 8631 192 21 1 : tunables 0 0 0 : slabdata 411 411 0
buffer_head 1227688 1296648 112 36 1 : tunables 0 0 0 : slabdata 36018 36018 0
vm_area_struct 18494 19389 176 23 1 : tunables 0 0 0 : slabdata 843 843 0
files_cache 236 322 704 23 4 : tunables 0 0 0 : slabdata 14 14 0
signal_cache 606 702 832 39 8 : tunables 0 0 0 : slabdata 18 18 0
sighand_cache 415 480 2112 15 8 : tunables 0 0 0 : slabdata 32 32 0
task_struct 671 840 1616 20 8 : tunables 0 0 0 : slabdata 42 42 0
anon_vma 1511 1920 32 128 1 : tunables 0 0 0 : slabdata 15 15 0
shared_policy_node 255 255 48 85 1 : tunables 0 0 0 : slabdata 3 3 0
numa_policy 19205 20910 24 170 1 : tunables 0 0 0 : slabdata 123 123 0
idr_layer_cache 373 390 544 30 4 : tunables 0 0 0 : slabdata 13 13 0
kmalloc-8192 36 36 8192 4 8 : tunables 0 0 0 : slabdata 9 9 0
kmalloc-4096 2284 2592 4096 8 8 : tunables 0 0 0 : slabdata 324 324 0
kmalloc-2048 750 896 2048 16 8 : tunables 0 0 0 : slabdata 56 56 0
kmalloc-1024 4025 4320 1024 32 8 : tunables 0 0 0 : slabdata 135 135 0
kmalloc-512 1358 1760 512 32 4 : tunables 0 0 0 : slabdata 55 55 0
kmalloc-256 1402 1952 256 32 2 : tunables 0 0 0 : slabdata 61 61 0
kmalloc-128 8625 9280 128 32 1 : tunables 0 0 0 : slabdata 290 290 0
kmalloc-64 7030122 7455232 64 64 1 : tunables 0 0 0 : slabdata 116488 116488 0
kmalloc-32 18603 19712 32 128 1 : tunables 0 0 0 : slabdata 154 154 0
kmalloc-16 8895 9728 16 256 1 : tunables 0 0 0 : slabdata 38 38 0
kmalloc-8 9047 10752 8 512 1 : tunables 0 0 0 : slabdata 21 21 0
kmalloc-192 5130 9135 192 21 1 : tunables 0 0 0 : slabdata 435 435 0
kmalloc-96 1905 2940 96 42 1 : tunables 0 0 0 : slabdata 70 70 0
kmem_cache_node 196 256 64 64 1 : tunables 0 0 0 : slabdata 4 4 0
# cat /proc/buddyinfo
Node 0, zone DMA 2 0 2 2 2 2 2 1 2 2 2
Node 0, zone DMA32 61877 10368 111 10 2 3 1 0 0 0 0
Node 0, zone Normal 2036 0 14 12 6 3 3 0 1 0 0
Node 1, zone Normal 483348 15 2 3 7 1 3 1 0 0 0
Best wishes,
Chris.
^ permalink raw reply [flat|nested] 75+ messages in thread
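[Editorial note: one thing that stands out in the slabinfo dump above is
kmalloc-64: 7,455,232 objects of 64 bytes is roughly 455MB, which accounts for
most of the ~491MB of SUnreclaim reported in the meminfo. A short sketch for
ranking caches by footprint, assuming the slabinfo 2.1 column order; this is a
standalone parser for illustration, not part of any kernel tool:]

```python
def slab_usage(text):
    """Return (name, bytes) per cache from /proc/slabinfo 2.1 output,
    computed as num_objs * objsize, largest first."""
    rows = []
    for line in text.splitlines():
        if line.startswith(("slabinfo", "#")):
            continue  # skip the two header lines
        fields = line.split()
        # Columns: name, active_objs, num_objs, objsize, ...
        name, num_objs, objsize = fields[0], int(fields[2]), int(fields[3])
        rows.append((name, num_objs * objsize))
    return sorted(rows, key=lambda r: -r[1])

# A few representative rows from the dump quoted above.
sample = """\
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>
kmalloc-64 7030122 7455232 64 64 1 : tunables 0 0 0 : slabdata 116488 116488 0
buffer_head 1227688 1296648 112 36 1 : tunables 0 0 0 : slabdata 36018 36018 0
dentry 8532 8631 192 21 1 : tunables 0 0 0 : slabdata 411 411 0"""

for name, nbytes in slab_usage(sample):
    print(name, nbytes // (1024 * 1024), "MB")
```

[The 64-byte cache dominating unreclaimable slab is consistent with the
kvm_rmap_desc/kvm_pte_chain style of allocation pattern seen on KVM hosts.]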

* Re: Over-eager swapping
2010-08-03 4:47 ` Minchan Kim
@ 2010-08-03 6:39 ` Wu Fengguang
1 sibling, 0 replies; 75+ messages in thread
From: Wu Fengguang @ 2010-08-03 6:39 UTC (permalink / raw)
To: Minchan Kim; +Cc: Chris Webb, linux-mm, linux-kernel, KOSAKI Motohiro
On Tue, Aug 03, 2010 at 12:47:36PM +0800, Minchan Kim wrote:
> On Tue, Aug 3, 2010 at 1:28 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> > On Tue, Aug 03, 2010 at 12:09:18PM +0800, Minchan Kim wrote:
> >> On Tue, Aug 3, 2010 at 12:31 PM, Chris Webb <chris@arachsys.com> wrote:
> >> > Minchan Kim <minchan.kim@gmail.com> writes:
> >> >
> >> >> Another possibility is _zone_reclaim_ in NUMA.
> >> >> Your working set has many anonymous page.
> >> >>
> >> >> The zone_reclaim set priority to ZONE_RECLAIM_PRIORITY.
> >> >> It can make reclaim mode to lumpy so it can page out anon pages.
> >> >>
> >> >> Could you show me /proc/sys/vm/[zone_reclaim_mode/min_unmapped_ratio] ?
> >> >
> >> > Sure, no problem. On the machine with the /proc/meminfo I showed earlier,
> >> > these are
> >> >
> >> > # cat /proc/sys/vm/zone_reclaim_mode
> >> > 0
> >> > # cat /proc/sys/vm/min_unmapped_ratio
> >> > 1
> >>
> >> if zone_reclaim_mode is zero, it doesn't swap out anon_pages.
> >
> > If there are lots of order-1 or higher allocations, anonymous pages
> > will be randomly evicted, regardless of their LRU ages. This is
>
> I thought the amount swapped out (i.e. ~3GB) was too large to be explained
> by entering lumpy mode. But it's possible. :)
>
> > probably another factor behind the users' complaints. Are there easy ways to
> > confirm this other than patching the kernel?
>
> cat /proc/buddyinfo can help?
Some high order slab caches may show up there :)
> Off-topic:
> It would be better to add new vmstat of lumpy entrance.
I think it's a good debug entry. Although convenient, lumpy reclaim
is accompanied with some bad side effects. When something goes wrong,
it helps to check the number of lumpy reclaims.
Thanks,
Fengguang
> Pseudo code.
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 0f9f624..d10ff4e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1641,7 +1641,7 @@ out:
> }
> }
>
> -static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc)
> +static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc, struct zone *zone)
> {
> /*
> * If we need a large contiguous chunk of memory, or have
> @@ -1654,6 +1654,9 @@ static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc)
> sc->lumpy_reclaim_mode = 1;
> else
> sc->lumpy_reclaim_mode = 0;
> +
> + if (sc->lumpy_reclaim_mode)
> + inc_zone_state(zone, NR_LUMPY);
> }
>
> /*
> @@ -1670,7 +1673,7 @@ static void shrink_zone(int priority, struct zone *zone,
>
> get_scan_count(zone, sc, nr, priority);
>
> - set_lumpy_reclaim_mode(priority, sc);
> + set_lumpy_reclaim_mode(priority, sc, zone);
>
> while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> nr[LRU_INACTIVE_FILE]) {
>
> --
> Kind regards,
> Minchan Kim
^ permalink raw reply [flat|nested] 75+ messages in thread
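[Editorial note: the /proc/buddyinfo suggestion above can be made quantitative.
Column n of each row is the number of free blocks of 2^n pages, so summing over
the higher orders shows how little contiguous memory a zone really has, which is
what drives lumpy reclaim. A minimal sketch over the dump quoted earlier in the
thread, assuming the standard buddyinfo row format and 4KB pages:]

```python
def parse_buddyinfo(text):
    """Map 'nodeN/zone' to its list of free-block counts per order."""
    zones = {}
    for line in text.splitlines():
        fields = line.split()
        # e.g. ['Node', '0,', 'zone', 'Normal', '2036', '0', ...]
        node, zone = fields[1].rstrip(","), fields[3]
        zones[f"node{node}/{zone}"] = [int(n) for n in fields[4:]]
    return zones

def free_pages_at_or_above(counts, order):
    """Free pages sitting in blocks of at least the given order."""
    return sum(n * 2**o for o, n in enumerate(counts) if o >= order)

sample = """\
Node 0, zone Normal 2036 0 14 12 6 3 3 0 1 0 0
Node 1, zone Normal 483348 15 2 3 7 1 3 1 0 0 0"""

for name, counts in parse_buddyinfo(sample).items():
    print(name, "order>=1 free pages:", free_pages_at_or_above(counts, 1))
```

[Node 1 has ~483k free order-0 pages (nearly 2GB) but only a few hundred pages
in order-1+ blocks: plenty of free memory, almost none of it contiguous, so any
order-1 or higher allocation pressure forces reclaim despite the free memory.]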
* Re: Over-eager swapping
2010-08-03 4:28 ` Wu Fengguang
@ 2010-08-03 4:47 ` Minchan Kim
1 sibling, 0 replies; 75+ messages in thread
From: Minchan Kim @ 2010-08-03 4:47 UTC (permalink / raw)
To: Wu Fengguang; +Cc: Chris Webb, linux-mm, linux-kernel, KOSAKI Motohiro
On Tue, Aug 3, 2010 at 1:28 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> On Tue, Aug 03, 2010 at 12:09:18PM +0800, Minchan Kim wrote:
>> On Tue, Aug 3, 2010 at 12:31 PM, Chris Webb <chris@arachsys.com> wrote:
>> > Minchan Kim <minchan.kim@gmail.com> writes:
>> >
>> >> Another possibility is _zone_reclaim_ in NUMA.
>> >> Your working set has many anonymous page.
>> >>
>> >> The zone_reclaim set priority to ZONE_RECLAIM_PRIORITY.
>> >> It can make reclaim mode to lumpy so it can page out anon pages.
>> >>
>> >> Could you show me /proc/sys/vm/[zone_reclaim_mode/min_unmapped_ratio] ?
>> >
>> > Sure, no problem. On the machine with the /proc/meminfo I showed earlier,
>> > these are
>> >
>> > # cat /proc/sys/vm/zone_reclaim_mode
>> > 0
>> > # cat /proc/sys/vm/min_unmapped_ratio
>> > 1
>>
>> if zone_reclaim_mode is zero, it doesn't swap out anon_pages.
>
> If there are lots of order-1 or higher allocations, anonymous pages
> will be randomly evicted, regardless of their LRU ages. This is
I thought the amount swapped out (i.e. ~3GB) was too large to be explained
by entering lumpy mode. But it's possible. :)
> probably another factor behind the users' complaints. Are there easy ways to
> confirm this other than patching the kernel?
cat /proc/buddyinfo can help?
Off-topic:
It would be better to add new vmstat of lumpy entrance.
Pseudo code.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0f9f624..d10ff4e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1641,7 +1641,7 @@ out:
}
}
-static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc)
+static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc, struct zone *zone)
{
/*
* If we need a large contiguous chunk of memory, or have
@@ -1654,6 +1654,9 @@ static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc)
sc->lumpy_reclaim_mode = 1;
else
sc->lumpy_reclaim_mode = 0;
+
+ if (sc->lumpy_reclaim_mode)
+ inc_zone_state(zone, NR_LUMPY);
}
/*
@@ -1670,7 +1673,7 @@ static void shrink_zone(int priority, struct zone *zone,
get_scan_count(zone, sc, nr, priority);
- set_lumpy_reclaim_mode(priority, sc);
+ set_lumpy_reclaim_mode(priority, sc, zone);
while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
nr[LRU_INACTIVE_FILE]) {
--
Kind regards,
Minchan Kim
^ permalink raw reply related [flat|nested] 75+ messages in thread
* Re: Over-eager swapping
2010-08-03 4:09 ` Minchan Kim
@ 2010-08-03 4:28 ` Wu Fengguang
1 sibling, 0 replies; 75+ messages in thread
From: Wu Fengguang @ 2010-08-03 4:28 UTC (permalink / raw)
To: Minchan Kim; +Cc: Chris Webb, linux-mm, linux-kernel, KOSAKI Motohiro
On Tue, Aug 03, 2010 at 12:09:18PM +0800, Minchan Kim wrote:
> On Tue, Aug 3, 2010 at 12:31 PM, Chris Webb <chris@arachsys.com> wrote:
> > Minchan Kim <minchan.kim@gmail.com> writes:
> >
> >> Another possibility is _zone_reclaim_ in NUMA.
> >> Your working set has many anonymous page.
> >>
> >> The zone_reclaim set priority to ZONE_RECLAIM_PRIORITY.
> >> It can make reclaim mode to lumpy so it can page out anon pages.
> >>
> >> Could you show me /proc/sys/vm/[zone_reclaim_mode/min_unmapped_ratio] ?
> >
> > Sure, no problem. On the machine with the /proc/meminfo I showed earlier,
> > these are
> >
> > # cat /proc/sys/vm/zone_reclaim_mode
> > 0
> > # cat /proc/sys/vm/min_unmapped_ratio
> > 1
>
> if zone_reclaim_mode is zero, it doesn't swap out anon_pages.
If there are lots of order-1 or higher allocations, anonymous pages
will be randomly evicted, regardless of their LRU ages. This is
probably another factor behind the users' complaints. Are there easy ways to
confirm this other than patching the kernel?
Chris, what's in your /proc/slabinfo?
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Over-eager swapping
2010-08-03 3:31 ` Chris Webb
@ 2010-08-03 4:09 ` Minchan Kim
1 sibling, 0 replies; 75+ messages in thread
From: Minchan Kim @ 2010-08-03 4:09 UTC (permalink / raw)
To: Chris Webb; +Cc: linux-mm, linux-kernel, KOSAKI Motohiro, Wu Fengguang
On Tue, Aug 3, 2010 at 12:31 PM, Chris Webb <chris@arachsys.com> wrote:
> Minchan Kim <minchan.kim@gmail.com> writes:
>
>> Another possibility is _zone_reclaim_ in NUMA.
>> Your working set has many anonymous page.
>>
>> The zone_reclaim set priority to ZONE_RECLAIM_PRIORITY.
>> It can make reclaim mode to lumpy so it can page out anon pages.
>>
>> Could you show me /proc/sys/vm/[zone_reclaim_mode/min_unmapped_ratio] ?
>
> Sure, no problem. On the machine with the /proc/meminfo I showed earlier,
> these are
>
> # cat /proc/sys/vm/zone_reclaim_mode
> 0
> # cat /proc/sys/vm/min_unmapped_ratio
> 1
if zone_reclaim_mode is zero, it doesn't swap out anon_pages.
1) How does the VM come to reclaim anonymous pages when vm_swappiness ==
0 and there is a large page cache?
2) I find it odd that the file pages on your system are almost entirely
Buffers, while Cached is only about 16MB.
Why do those buffers survive while anonymous pages are swapped out and
cached pages are reclaimed?
Hmm. I have no idea. :(
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Over-eager swapping
2010-08-02 23:55 ` Minchan Kim
@ 2010-08-03 3:31 ` Chris Webb
1 sibling, 0 replies; 75+ messages in thread
From: Chris Webb @ 2010-08-03 3:31 UTC (permalink / raw)
To: Minchan Kim; +Cc: linux-mm, linux-kernel, KOSAKI Motohiro, Wu Fengguang
Minchan Kim <minchan.kim@gmail.com> writes:
> Another possibility is _zone_reclaim_ in NUMA.
> Your working set has many anonymous page.
>
> The zone_reclaim set priority to ZONE_RECLAIM_PRIORITY.
> It can make reclaim mode to lumpy so it can page out anon pages.
>
> Could you show me /proc/sys/vm/[zone_reclaim_mode/min_unmapped_ratio] ?
Sure, no problem. On the machine with the /proc/meminfo I showed earlier,
these are
# cat /proc/sys/vm/zone_reclaim_mode
0
# cat /proc/sys/vm/min_unmapped_ratio
1
I haven't changed either of these from the kernel default.
Many thanks,
Chris.
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Over-eager swapping
2010-08-02 12:47 ` Chris Webb
@ 2010-08-02 23:55 ` Minchan Kim
0 siblings, 0 replies; 75+ messages in thread
From: Minchan Kim @ 2010-08-02 23:55 UTC (permalink / raw)
To: Chris Webb; +Cc: linux-mm, linux-kernel, KOSAKI Motohiro, Wu Fengguang
On Mon, Aug 2, 2010 at 9:47 PM, Chris Webb <chris@arachsys.com> wrote:
> We run a number of relatively large x86-64 hosts with twenty or so qemu-kvm
> virtual machines on each of them, and I'm having some trouble with over-eager
> swapping on some (but not all) of the machines. This is resulting in
> customer reports of very poor response latency from the virtual machines
> which have been swapped out, despite the hosts apparently having large
> amounts of free memory, and running fine if swap is turned off.
>
> All of the hosts are running a 2.6.32.7 kernel and have ksm enabled with
> 32GB of RAM and 2x quad-core processors. There is a cluster of Xeon E5420
> machines which apparently doesn't exhibit the problem, and a cluster of
> 2352/2378 Opteron (NUMA) machines, some of which do. The kernel config of
> the affected machines is at
>
> http://cdw.me.uk/tmp/config-2.6.32.7
>
> This differs very little from the config on the unaffected Xeon machines,
> essentially just
>
> -CONFIG_MCORE2=y
> +CONFIG_MK8=y
> -CONFIG_X86_P6_NOP=y
>
> On a typical affected machine, the virtual machines and other processes
> would apparently leave around 5.5GB of RAM available for buffers, but the
> system seems to want to swap out 3GB of anonymous pages to give itself more
> like 9GB of buffers:
>
> # cat /proc/meminfo
> MemTotal: 33083420 kB
> MemFree: 693164 kB
> Buffers: 8834380 kB
> Cached: 11212 kB
> SwapCached: 1443524 kB
> Active: 21656844 kB
> Inactive: 8119352 kB
> Active(anon): 17203092 kB
> Inactive(anon): 3729032 kB
> Active(file): 4453752 kB
> Inactive(file): 4390320 kB
> Unevictable: 5472 kB
> Mlocked: 5472 kB
> SwapTotal: 25165816 kB
> SwapFree: 21854572 kB
> Dirty: 4300 kB
> Writeback: 4 kB
> AnonPages: 20780368 kB
> Mapped: 6056 kB
> Shmem: 56 kB
> Slab: 961512 kB
> SReclaimable: 438276 kB
> SUnreclaim: 523236 kB
> KernelStack: 10152 kB
> PageTables: 67176 kB
> NFS_Unstable: 0 kB
> Bounce: 0 kB
> WritebackTmp: 0 kB
> CommitLimit: 41707524 kB
> Committed_AS: 39870868 kB
> VmallocTotal: 34359738367 kB
> VmallocUsed: 150880 kB
> VmallocChunk: 34342404996 kB
> HardwareCorrupted: 0 kB
> HugePages_Total: 0
> HugePages_Free: 0
> HugePages_Rsvd: 0
> HugePages_Surp: 0
> Hugepagesize: 2048 kB
> DirectMap4k: 5824 kB
> DirectMap2M: 3205120 kB
> DirectMap1G: 30408704 kB
>
> We see this despite the machine having vm.swappiness set to 0 in an attempt
> to skew the reclaim as far as possible in favour of releasing page cache
> instead of swapping anonymous pages.
>
Hmm, strange.
We reclaim only anon pages when the system has little page cache
(i.e. file + free <= high watermark), but your meminfo shows plenty
of page cache pages, so that isn't the likely cause here.
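For reference, that condition can be sketched as below (a simplified
model of the check described above, not the actual kernel code; the
watermark value is an assumed placeholder, since the real one is
per-zone):

```python
# Simplified model of the "reclaim only anon pages" condition: anon-only
# reclaim kicks in when file pages plus free pages fall to or below the
# zone's high watermark. Values in kB, taken from the posted meminfo;
# the watermark is a placeholder, not the machine's real per-zone value.
def anon_only_reclaim(file_kb, free_kb, high_watermark_kb):
    return file_kb + free_kb <= high_watermark_kb

file_kb = 4453752 + 4390320   # Active(file) + Inactive(file)
free_kb = 693164              # MemFree
high_watermark_kb = 100000    # assumed placeholder

# With ~9GB of file pages, this machine is nowhere near that state.
print(anon_only_reclaim(file_kb, free_kb, high_watermark_kb))  # False
```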
Another possibility is _zone_reclaim_ on NUMA.
Your working set has many anonymous pages.
zone_reclaim sets the reclaim priority to ZONE_RECLAIM_PRIORITY,
which can switch reclaim into lumpy mode, and lumpy reclaim can
page out anon pages.
Could you show me /proc/sys/vm/[zone_reclaim_mode/min_unmapped_ratio] ?
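(For scale: min_unmapped_ratio is the percentage of a zone's pages that
zone reclaim tries to keep resident as unmapped page cache. Treating the
whole 32GB machine as a single zone -- a rough simplification, since the
real threshold is computed per zone -- the default value of 1 works out
to a few hundred MB:)

```python
# Back-of-the-envelope for min_unmapped_ratio: the percentage of a
# zone's pages kept as unmapped page cache before zone reclaim runs.
# The whole machine is treated as one zone here for illustration only.
PAGE_KB = 4
mem_total_kb = 33083420        # MemTotal from the posted meminfo
min_unmapped_ratio = 1         # the reported sysctl value (percent)

threshold_pages = min_unmapped_ratio * (mem_total_kb // PAGE_KB) // 100
threshold_mb = threshold_pages * PAGE_KB // 1024

print(threshold_pages)  # 82708 pages
print(threshold_mb)     # 323 MB
```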
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 75+ messages in thread
* Over-eager swapping
@ 2010-08-02 12:47 ` Chris Webb
0 siblings, 0 replies; 75+ messages in thread
From: Chris Webb @ 2010-08-02 12:47 UTC (permalink / raw)
To: linux-mm, linux-kernel
We run a number of relatively large x86-64 hosts with twenty or so qemu-kvm
virtual machines on each of them, and I'm having some trouble with over-eager
swapping on some (but not all) of the machines. This is resulting in
customer reports of very poor response latency from the virtual machines
which have been swapped out, despite the hosts apparently having large
amounts of free memory, and running fine if swap is turned off.
All of the hosts are running a 2.6.32.7 kernel and have ksm enabled with
32GB of RAM and 2x quad-core processors. There is a cluster of Xeon E5420
machines which apparently doesn't exhibit the problem, and a cluster of
2352/2378 Opteron (NUMA) machines, some of which do. The kernel config of
the affected machines is at
http://cdw.me.uk/tmp/config-2.6.32.7
This differs very little from the config on the unaffected Xeon machines,
essentially just
-CONFIG_MCORE2=y
+CONFIG_MK8=y
-CONFIG_X86_P6_NOP=y
On a typical affected machine, the virtual machines and other processes
would apparently leave around 5.5GB of RAM available for buffers, but the
system seems to want to swap out 3GB of anonymous pages to give itself more
like 9GB of buffers:
# cat /proc/meminfo
MemTotal: 33083420 kB
MemFree: 693164 kB
Buffers: 8834380 kB
Cached: 11212 kB
SwapCached: 1443524 kB
Active: 21656844 kB
Inactive: 8119352 kB
Active(anon): 17203092 kB
Inactive(anon): 3729032 kB
Active(file): 4453752 kB
Inactive(file): 4390320 kB
Unevictable: 5472 kB
Mlocked: 5472 kB
SwapTotal: 25165816 kB
SwapFree: 21854572 kB
Dirty: 4300 kB
Writeback: 4 kB
AnonPages: 20780368 kB
Mapped: 6056 kB
Shmem: 56 kB
Slab: 961512 kB
SReclaimable: 438276 kB
SUnreclaim: 523236 kB
KernelStack: 10152 kB
PageTables: 67176 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 41707524 kB
Committed_AS: 39870868 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 150880 kB
VmallocChunk: 34342404996 kB
HardwareCorrupted: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 5824 kB
DirectMap2M: 3205120 kB
DirectMap1G: 30408704 kB
We see this despite the machine having vm.swappiness set to 0 in an attempt
to skew the reclaim as far as possible in favour of releasing page cache
instead of swapping anonymous pages.
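On kernels of this vintage, swappiness feeds into the anon/file scan
balance roughly as follows (a simplified sketch of the 2.6.32-era
get_scan_ratio() weighting; the real code also scales both weights by
recent rotation/scan statistics, which are omitted here):

```python
# Sketch of how vm.swappiness biases reclaim between the anon and file
# LRU lists: anon gets weight `swappiness`, file gets `200 - swappiness`.
def scan_weights(swappiness):
    anon_prio = swappiness
    file_prio = 200 - swappiness
    total = anon_prio + file_prio
    return anon_prio / total, file_prio / total

print(scan_weights(0))   # (0.0, 1.0): all pressure on the file LRU
print(scan_weights(60))  # (0.3, 0.7): the default setting
```

So with swappiness set to 0 the scan balance is entirely toward file
pages, which makes the anon swap-out described here all the more
surprising.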
After running swapoff -a, the machine is immediately much healthier. Even
while the swap is still being reduced, load goes down and response times in
virtual machines are much improved. Once the swap is completely gone, there
are still several gigabytes of RAM left free which are used for buffers, and
the virtual machines are no longer laggy because they are no longer swapped
out. Running swapon -a again, the affected machine waits for about a minute
with zero swap in use, before the amount of swap in use very rapidly
increases to around 2GB and then continues to increase more steadily to 3GB.
We could run with these machines without swap (in the worst cases we're
already doing so), but I'd prefer to have a reserve of swap available in
case of genuine emergency. If it's a choice between swapping out a guest or
oom-killing it, I'd prefer to swap... but I really don't want to swap out
running virtual machines in order to have eight gigabytes of page cache
instead of five!
Is this a problem with the page reclaim priorities, or am I just tuning
these hosts incorrectly? Is there more detailed info than /proc/meminfo
available which might shed more light on what's going wrong here?
Best wishes,
Chris.
^ permalink raw reply [flat|nested] 75+ messages in thread
end of thread, other threads:[~2012-04-25 14:42 UTC | newest]
Thread overview: 75+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-23  9:27 Over-eager swapping Richard Davies
2012-04-23 12:07 ` Zdenek Kaspar
2012-04-23 17:19 ` Dave Hansen
2012-04-24  0:35 ` Minchan Kim
2012-04-24 11:16 ` Peter Lieven
2012-04-25 14:41 ` Rik van Riel
-- strict thread matches above, loose matches on Subject: below --
2010-08-02 12:47 Chris Webb
2010-08-02 23:55 ` Minchan Kim
2010-08-03  3:31 ` Chris Webb
2010-08-03  4:09 ` Minchan Kim
2010-08-03  4:28 ` Wu Fengguang
2010-08-03  4:47 ` Minchan Kim
2010-08-03  6:39 ` Wu Fengguang
2010-08-03 21:49 ` Chris Webb
2010-08-04  2:21 ` Wu Fengguang
2010-08-04  3:10 ` Minchan Kim
2010-08-04  3:24 ` Wu Fengguang
2010-08-04  9:58 ` Chris Webb
2010-08-04 11:49 ` Wu Fengguang
2010-08-04 12:04 ` Chris Webb
2010-08-18 14:38 ` Wu Fengguang
2010-08-18 14:46 ` Chris Webb
2010-08-18 15:21 ` Wu Fengguang
2010-08-18 15:57 ` Christoph Lameter
2010-08-18 16:20 ` Wu Fengguang
2010-08-18 15:57 ` Lee Schermerhorn
2010-08-18 15:58 ` Chris Webb
2010-08-18 16:13 ` Christoph Lameter
2010-08-18 16:32 ` Chris Webb
2010-08-19  5:16 ` Balbir Singh
2010-08-19 10:20 ` Chris Webb
2010-08-19 19:03 ` Christoph Lameter
2010-08-18 16:13 ` Wu Fengguang
2010-08-18 16:31 ` Chris Webb
2010-08-19  5:13 ` Balbir Singh
2010-08-18 16:45 ` Balbir Singh
2010-08-19  9:25 ` Chris Webb
2010-08-19 15:13 ` Balbir Singh