* Over-eager swapping
@ 2012-04-23  9:27 ` Richard Davies
  0 siblings, 0 replies; 75+ messages in thread
From: Richard Davies @ 2012-04-23  9:27 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Minchan Kim, Wu Fengguang, Balbir Singh, Christoph Lameter,
	Lee Schermerhorn, Chris Webb

We run a number of relatively large x86-64 hosts with twenty or so qemu-kvm
virtual machines on each of them, and I'm having some trouble with over-eager
swapping on some of the machines. This results in load spikes during the
swapping and customer reports of very poor response latency from the virtual
machines which have been swapped out, despite the hosts apparently having
large amounts of free memory and running fine if swap is turned off.


All of the hosts are currently running a 3.1.4 or 3.2.2 kernel with ksm
enabled, and have 64GB of RAM and two eight-core AMD Opteron 6128 processors.
However, we have seen this same problem since 2010 on a 2.6.32.7 kernel and
older hardware - see http://marc.info/?l=linux-mm&m=128075337008943
(previous helpful contributors are cc:ed here - thanks).

We have /proc/sys/vm/swappiness set to 0. The kernel config is here:
http://users.org.uk/config-3.1.4
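
For reference, the swappiness setting can be applied at runtime and made
persistent in the usual way, e.g.:

# sysctl -w vm.swappiness=0
# echo "vm.swappiness = 0" >> /etc/sysctl.conf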


The rrd graphs at http://imgur.com/a/Fklxr show a typical incident.

We estimate memory used from /proc/meminfo as:

  = MemTotal - MemFree - Buffers + SwapTotal - SwapFree
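
As a quick sketch, that estimate can be computed directly from /proc/meminfo
with something like:

  awk '/^(MemTotal|MemFree|Buffers|SwapTotal|SwapFree):/ { v[$1] = $2 }
       END { used = v["MemTotal:"] - v["MemFree:"] - v["Buffers:"] + v["SwapTotal:"] - v["SwapFree:"]
             print used, "kB used" }' /proc/meminfo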

The first rrd shows memory used increasing as a VM starts, but not getting
near the 64GB of physical RAM.

The second rrd shows the heavy swapping this VM start caused.

The third rrd shows a multi-gigabyte jump in swap used (SwapTotal - SwapFree).

The fourth rrd shows the large load spike (from 1 to 15) caused by this swap
storm.


It is obviously hard to capture all of the relevant data during an actual
incident. However, as of this morning, the relevant stats are as below.
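
When an incident does hit, a simple loop along these lines (just a sketch -
the output path and interval are arbitrary) could be left running to snapshot
the counters:

  while true; do
      date
      cat /proc/vmstat
      grep -E 'MemFree|Buffers|Cached|Swap' /proc/meminfo
      sleep 10
  done >> /var/tmp/swap-incident.log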

Any help is much appreciated! Our strong belief is that unnecessary swapping
is going on here and causing these load spikes. We would like to run with
swap for real out-of-memory situations, but at present it causes these kinds
of load spikes on machines which run completely happily with swap disabled.

Thanks,

Richard.


# cat /proc/meminfo
MemTotal:       65915384 kB
MemFree:          271104 kB
Buffers:        36274368 kB
Cached:            31048 kB
SwapCached:      1830860 kB
Active:         30594144 kB
Inactive:       32295972 kB
Active(anon):   21883428 kB
Inactive(anon):  4695308 kB
Active(file):    8710716 kB
Inactive(file): 27600664 kB
Unevictable:        6740 kB
Mlocked:            6740 kB
SwapTotal:      33054708 kB
SwapFree:       30067948 kB
Dirty:              1044 kB
Writeback:             0 kB
AnonPages:      24962708 kB
Mapped:             7320 kB
Shmem:                48 kB
Slab:            2210964 kB
SReclaimable:    1013272 kB
SUnreclaim:      1197692 kB
KernelStack:        6816 kB
PageTables:       129248 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    66012400 kB
Committed_AS:   67375852 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      259380 kB
VmallocChunk:   34308695568 kB
HardwareCorrupted:     0 kB
AnonHugePages:    155648 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:         576 kB
DirectMap2M:     2095104 kB
DirectMap1G:    65011712 kB

# cat /proc/sys/vm/zone_reclaim_mode
0

# cat /proc/sys/vm/min_unmapped_ratio
1

# cat /proc/slabinfo
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
ext4_groupinfo_1k     32     32    128   32    1 : tunables    0    0    0 : slabdata      1      1      0
RAWv6                 34     34    960   34    8 : tunables    0    0    0 : slabdata      1      1      0
UDPLITEv6              0      0    960   34    8 : tunables    0    0    0 : slabdata      0      0      0
UDPv6                544    544    960   34    8 : tunables    0    0    0 : slabdata     16     16      0
tw_sock_TCPv6          0      0    320   25    2 : tunables    0    0    0 : slabdata      0      0      0
TCPv6                 72     72   1728   18    8 : tunables    0    0    0 : slabdata      4      4      0
nf_conntrack_expect    592    592    216   37    2 : tunables    0    0    0 : slabdata     16     16      0
nf_conntrack_ffffffff8199a280    933   1856    280   29    2 : tunables    0    0    0 : slabdata     64     64      0
dm_raid1_read_record      0      0   1064   30    8 : tunables    0    0    0 : slabdata      0      0      0
dm_snap_pending_exception      0      0    104   39    1 : tunables    0    0    0 : slabdata      0      0      0
dm_crypt_io         1811   2574    152   26    1 : tunables    0    0    0 : slabdata     99     99      0
kcopyd_job             0      0   3240   10    8 : tunables    0    0    0 : slabdata      0      0      0
dm_uevent              0      0   2608   12    8 : tunables    0    0    0 : slabdata      0      0      0
cfq_queue              0      0    232   35    2 : tunables    0    0    0 : slabdata      0      0      0
bsg_cmd                0      0    312   26    2 : tunables    0    0    0 : slabdata      0      0      0
mqueue_inode_cache     36     36    896   36    8 : tunables    0    0    0 : slabdata      1      1      0
udf_inode_cache        0      0    656   24    4 : tunables    0    0    0 : slabdata      0      0      0
fuse_request           0      0    608   26    4 : tunables    0    0    0 : slabdata      0      0      0
fuse_inode             0      0    704   46    8 : tunables    0    0    0 : slabdata      0      0      0
ntfs_big_inode_cache      0      0    832   39    8 : tunables    0    0    0 : slabdata      0      0      0
ntfs_inode_cache       0      0    280   29    2 : tunables    0    0    0 : slabdata      0      0      0
isofs_inode_cache      0      0    600   27    4 : tunables    0    0    0 : slabdata      0      0      0
fat_inode_cache        0      0    664   24    4 : tunables    0    0    0 : slabdata      0      0      0
fat_cache              0      0     40  102    1 : tunables    0    0    0 : slabdata      0      0      0
hugetlbfs_inode_cache     28     28    568   28    4 : tunables    0    0    0 : slabdata      1      1      0
squashfs_inode_cache      0      0    640   25    4 : tunables    0    0    0 : slabdata      0      0      0
jbd2_journal_handle   2720   2720     24  170    1 : tunables    0    0    0 : slabdata     16     16      0
jbd2_journal_head    818   1620    112   36    1 : tunables    0    0    0 : slabdata     45     45      0
jbd2_revoke_record   2048   4096     32  128    1 : tunables    0    0    0 : slabdata     32     32      0
ext4_inode_cache    2754   5328    864   37    8 : tunables    0    0    0 : slabdata    144    144      0
ext4_xattr             0      0     88   46    1 : tunables    0    0    0 : slabdata      0      0      0
ext4_free_data      1168   2628     56   73    1 : tunables    0    0    0 : slabdata     36     36      0
ext4_allocation_context    540    540    136   30    1 : tunables    0    0    0 : slabdata     18     18      0
ext4_io_end            0      0   1128   29    8 : tunables    0    0    0 : slabdata      0      0      0
ext4_io_page         256    256     16  256    1 : tunables    0    0    0 : slabdata      1      1      0
configfs_dir_cache      0      0     88   46    1 : tunables    0    0    0 : slabdata      0      0      0
kioctx                 0      0    384   42    4 : tunables    0    0    0 : slabdata      0      0      0
inotify_inode_mark     30     30    136   30    1 : tunables    0    0    0 : slabdata      1      1      0
kvm_async_pf         448    448    144   28    1 : tunables    0    0    0 : slabdata     16     16      0
kvm_vcpu              64     94  13856    2    8 : tunables    0    0    0 : slabdata     47     47      0
UDP-Lite               0      0    768   42    8 : tunables    0    0    0 : slabdata      0      0      0
xfrm_dst_cache         0      0    448   36    4 : tunables    0    0    0 : slabdata      0      0      0
ip_fib_trie          219    219     56   73    1 : tunables    0    0    0 : slabdata      3      3      0
arp_cache            417    500    320   25    2 : tunables    0    0    0 : slabdata     20     20      0
RAW                  672    672    768   42    8 : tunables    0    0    0 : slabdata     16     16      0
UDP                  672    672    768   42    8 : tunables    0    0    0 : slabdata     16     16      0
tw_sock_TCP          512   1088    256   32    2 : tunables    0    0    0 : slabdata     34     34      0
TCP                  345    357   1536   21    8 : tunables    0    0    0 : slabdata     17     17      0
blkdev_queue         414    440   1616   20    8 : tunables    0    0    0 : slabdata     22     22      0
blkdev_requests      945   2209    344   47    4 : tunables    0    0    0 : slabdata     47     47      0
sock_inode_cache     456    475    640   25    4 : tunables    0    0    0 : slabdata     19     19      0
shmem_inode_cache   2063   2375    632   25    4 : tunables    0    0    0 : slabdata     95     95      0
Acpi-ParseExt       3848   3864     72   56    1 : tunables    0    0    0 : slabdata     69     69      0
Acpi-Namespace    633667 1059270     40  102    1 : tunables    0    0    0 : slabdata  10385  10385      0
task_delay_info     1238   1584    112   36    1 : tunables    0    0    0 : slabdata     44     44      0
taskstats            384    384    328   24    2 : tunables    0    0    0 : slabdata     16     16      0
proc_inode_cache    2460   3250    616   26    4 : tunables    0    0    0 : slabdata    125    125      0
sigqueue             400    400    160   25    1 : tunables    0    0    0 : slabdata     16     16      0
bdev_cache           701    714    768   42    8 : tunables    0    0    0 : slabdata     17     17      0
sysfs_dir_cache    31662  34425     80   51    1 : tunables    0    0    0 : slabdata    675    675      0
inode_cache         2546   3886    552   29    4 : tunables    0    0    0 : slabdata    134    134      0
dentry              9452  14868    192   42    2 : tunables    0    0    0 : slabdata    354    354      0
buffer_head       8175114 8360937    104   39    1 : tunables    0    0    0 : slabdata 214383 214383      0
vm_area_struct     35344  35834    176   46    2 : tunables    0    0    0 : slabdata    782    782      0
files_cache          736    874    704   46    8 : tunables    0    0    0 : slabdata     19     19      0
signal_cache        1011   1296    896   36    8 : tunables    0    0    0 : slabdata     36     36      0
sighand_cache        682    945   2112   15    8 : tunables    0    0    0 : slabdata     63     63      0
task_struct         1057   1386   1520   21    8 : tunables    0    0    0 : slabdata     66     66      0
anon_vma            2417   2856     72   56    1 : tunables    0    0    0 : slabdata     51     51      0
shared_policy_node   4877   6800     48   85    1 : tunables    0    0    0 : slabdata     80     80      0
numa_policy        45589  48450     24  170    1 : tunables    0    0    0 : slabdata    285    285      0
radix_tree_node   227192 248388    568   28    4 : tunables    0    0    0 : slabdata   9174   9174      0
idr_layer_cache      603    660    544   30    4 : tunables    0    0    0 : slabdata     22     22      0
dma-kmalloc-8192       0      0   8192    4    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-4096       0      0   4096    8    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-2048       0      0   2048   16    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-1024       0      0   1024   32    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-512        0      0    512   32    4 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-256        0      0    256   32    2 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-128        0      0    128   32    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-64         0      0     64   64    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-32         0      0     32  128    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-16         0      0     16  256    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-8          0      0      8  512    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-192        0      0    192   42    2 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-96         0      0     96   42    1 : tunables    0    0    0 : slabdata      0      0      0
kmalloc-8192          88    100   8192    4    8 : tunables    0    0    0 : slabdata     25     25      0
kmalloc-4096        3567   3704   4096    8    8 : tunables    0    0    0 : slabdata    463    463      0
kmalloc-2048       55140  55936   2048   16    8 : tunables    0    0    0 : slabdata   3496   3496      0
kmalloc-1024        5960   6496   1024   32    8 : tunables    0    0    0 : slabdata    203    203      0
kmalloc-512        12185  12704    512   32    4 : tunables    0    0    0 : slabdata    397    397      0
kmalloc-256       195078 199040    256   32    2 : tunables    0    0    0 : slabdata   6220   6220      0
kmalloc-128        45645  47328    128   32    1 : tunables    0    0    0 : slabdata   1479   1479      0
kmalloc-64        14647251 14776576     64   64    1 : tunables    0    0    0 : slabdata 230884 230884      0
kmalloc-32          5573   7552     32  128    1 : tunables    0    0    0 : slabdata     59     59      0
kmalloc-16          7550  10752     16  256    1 : tunables    0    0    0 : slabdata     42     42      0
kmalloc-8          13805  14848      8  512    1 : tunables    0    0    0 : slabdata     29     29      0
kmalloc-192        47641  50883    192   42    2 : tunables    0    0    0 : slabdata   1214   1214      0
kmalloc-96          3673   6006     96   42    1 : tunables    0    0    0 : slabdata    143    143      0
kmem_cache            32     32    256   32    2 : tunables    0    0    0 : slabdata      1      1      0
kmem_cache_node      495    576     64   64    1 : tunables    0    0    0 : slabdata      9      9      0

# cat /proc/buddyinfo
Node 0, zone      DMA      0      0      1      0      2      1      1      0      1      1      3
Node 0, zone    DMA32   9148   1941    657    673    131     53     18      2      0      0      0
Node 0, zone   Normal   8080     13      0      2      0      2      1      0      1      0      0
Node 1, zone   Normal  19071   3239    675    200    413     37      4      1      2      0      0
Node 2, zone   Normal  37716   3924    154      9      3      1      2      0      1      0      0
Node 3, zone   Normal  20015   4590   1768    996    334     20      1      1      1      0      0

# grep MemFree /sys/devices/system/node/node*/meminfo
/sys/devices/system/node/node0/meminfo:Node 0 MemFree:          201460 kB
/sys/devices/system/node/node1/meminfo:Node 1 MemFree:          283224 kB
/sys/devices/system/node/node2/meminfo:Node 2 MemFree:          287060 kB
/sys/devices/system/node/node3/meminfo:Node 3 MemFree:          316928 kB

# cat /proc/vmstat
nr_free_pages 224933
nr_inactive_anon 1173838
nr_active_anon 5209232
nr_inactive_file 6998686
nr_active_file 2180311
nr_unevictable 1685
nr_mlock 1685
nr_anon_pages 5940145
nr_mapped 1836
nr_file_pages 9635092
nr_dirty 603
nr_writeback 0
nr_slab_reclaimable 253121
nr_slab_unreclaimable 299440
nr_page_table_pages 32311
nr_kernel_stack 854
nr_unstable 0
nr_bounce 0
nr_vmscan_write 50485772
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 12
nr_dirtied 5630347228
nr_written 5625041387
numa_hit 28372623283
numa_miss 4761673976
numa_foreign 4761673976
numa_interleave 30490
numa_local 28372334279
numa_other 4761962980
nr_anon_transparent_hugepages 76
nr_dirty_threshold 8192
nr_dirty_background_threshold 4096
pgpgin 9523143630
pgpgout 23124688920
pswpin 57978726
pswpout 50121412
pgalloc_dma 0
pgalloc_dma32 1132547190
pgalloc_normal 32421613044
pgalloc_movable 0
pgfree 39379011152
pgactivate 751722445
pgdeactivate 591205976
pgfault 41103638391
pgmajfault 11853858
pgrefill_dma 0
pgrefill_dma32 24124080
pgrefill_normal 540719764
pgrefill_movable 0
pgsteal_dma 0
pgsteal_dma32 297677595
pgsteal_normal 4784595717
pgsteal_movable 0
pgscan_kswapd_dma 0
pgscan_kswapd_dma32 241277864
pgscan_kswapd_normal 4004618399
pgscan_kswapd_movable 0
pgscan_direct_dma 0
pgscan_direct_dma32 65729843
pgscan_direct_normal 1012932822
pgscan_direct_movable 0
zone_reclaim_failed 0
pginodesteal 66
slabs_scanned 668153728
kswapd_steal 4063341017
kswapd_inodesteal 2063
kswapd_low_wmark_hit_quickly 9834
kswapd_high_wmark_hit_quickly 488468
kswapd_skip_congestion_wait 580150
pageoutrun 22006623
allocstall 926752
pgrotated 28467920
compact_blocks_moved 522323130
compact_pages_moved 5774251432
compact_pagemigrate_failed 5267247
compact_stall 121045
compact_fail 68349
compact_success 52696
htlb_buddy_alloc_success 0
htlb_buddy_alloc_fail 0
unevictable_pgs_culled 19976952
unevictable_pgs_scanned 0
unevictable_pgs_rescued 33137561
unevictable_pgs_mlocked 35042070
unevictable_pgs_munlocked 33138335
unevictable_pgs_cleared 0
unevictable_pgs_stranded 0
unevictable_pgs_mlockfreed 1024
thp_fault_alloc 263176
thp_fault_fallback 717335
thp_collapse_alloc 21307
thp_collapse_alloc_failed 91103
thp_split 90328

* Re: Over-eager swapping
  2012-04-23  9:27 ` Richard Davies
@ 2012-04-23 12:07   ` Zdenek Kaspar
  -1 siblings, 0 replies; 75+ messages in thread
From: Zdenek Kaspar @ 2012-04-23 12:07 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm

On 23.4.2012 11:27, Richard Davies wrote:
> We run a number of relatively large x86-64 hosts with twenty or so qemu-kvm
> virtual machines on each of them, and I'm having some trouble with over-eager
> swapping on some of the machines. This results in load spikes during the
> swapping and customer reports of very poor response latency from the virtual
> machines which have been swapped out, despite the hosts apparently having
> large amounts of free memory and running fine if swap is turned off.
>
> [...]

Since I have this issue too...

Does anyone on the list have an idea whether it is possible to disable memory
reclaim for specific processes, without patching binaries?

It's really frustrating to see latency-sensitive processes swapped out, for
example:

./getswap.pl
  PID         COMMAND        SWSIZE
    1            init        116 kB
  594           udevd        288 kB
 1211        dhclient        344 kB
 1255        rsyslogd        208 kB
 1274         rpcbind        140 kB
 1292       rpc.statd        444 kB
 1310           mdadm         84 kB
 1421          upsmon        436 kB
 1422          upsmon        408 kB
 1432            sshd        556 kB
 1454        ksmtuned         96 kB
 1463           crond        552 kB
 1494          smartd        164 kB
 1502        mingetty         76 kB
 2200            smbd        620 kB
 2212            smbd        748 kB
 2213            nmbd        532 kB
 2265      rpc.mountd        428 kB
 2282            tgtd         92 kB
 2283            tgtd         96 kB
 2328        qemu-vm3      15512 kB
 2366        qemu-vm2      13204 kB
 2410        qemu-vm4      17140 kB
 2448        qemu-vm5      38532 kB
 2495        qemu-vm6      19148 kB
 2534        qemu-vm7      44552 kB
 2579        qemu-vm9      18788 kB
 2620       qemu-vm10      19256 kB
 2699        qemu-vm8      40204 kB
 6376        qemu-vm1      29232 kB
 7646            ntpd        280 kB
32322            smbd        468 kB
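
A minimal sketch of a similar per-process report without the perl script,
assuming the kernel exposes VmSwap in /proc/<pid>/status (as these 3.x
kernels do):

  for d in /proc/[0-9]*; do
      swap=$(awk '/^VmSwap:/ { print $2 }' "$d/status" 2>/dev/null)
      [ -n "$swap" ] && [ "$swap" -gt 0 ] &&
          printf "%6s %16s %10s kB\n" "${d#/proc/}" "$(cat $d/comm)" "$swap"
  done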

NOTE to OP: I just don't use swap where possible, but with qemu-kvm you can
use hugetlbfs as another workaround, though you will sacrifice some
functionality such as KSM and possibly memory ballooning.
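
For example, roughly (sizes and paths here are only illustrative; -mem-path
tells qemu-kvm to back guest RAM with hugetlbfs):

# echo 8192 > /proc/sys/vm/nr_hugepages     # 8192 x 2MB pages = 16GB reserved
# mount -t hugetlbfs none /dev/hugepages
# qemu-kvm -m 4096 -mem-path /dev/hugepages ...   # plus the usual options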

HTH, Z.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Over-eager swapping
  2012-04-23  9:27 ` Richard Davies
@ 2012-04-23 17:19   ` Dave Hansen
  -1 siblings, 0 replies; 75+ messages in thread
From: Dave Hansen @ 2012-04-23 17:19 UTC (permalink / raw)
  To: Richard Davies
  Cc: linux-kernel, linux-mm, Minchan Kim, Wu Fengguang, Balbir Singh,
	Christoph Lameter, Lee Schermerhorn, Chris Webb, Badari

On 04/23/2012 02:27 AM, Richard Davies wrote:
> # cat /proc/meminfo
> MemTotal:       65915384 kB
> MemFree:          271104 kB
> Buffers:        36274368 kB

Your "Buffers" are the only thing that really stands out here.  We used
to see this kind of thing on ext3 a lot, but it's gotten much better
lately.  From slabinfo, you can see all the buffer_heads:

buffer_head       8175114 8360937    104   39    1 : tunables    0    0    0 : slabdata 214383 214383      0

I _think_ this was a filesystem issue where the FS for some reason kept
the buffers locked down.  The swapping just comes later, once so much of
RAM has been eaten up by buffers.
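
(A generic way to check whether those buffers are actually reclaimable --
not something suggested in this thread -- is to drop the clean page/buffer
cache and see whether Buffers shrinks:)

  sync
  grep -E '^(Buffers|Cached):' /proc/meminfo
  echo 1 > /proc/sys/vm/drop_caches    # drop clean page cache, incl. block-device buffers
  grep -E '^(Buffers|Cached):' /proc/meminfo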


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Over-eager swapping
  2012-04-23  9:27 ` Richard Davies
@ 2012-04-24  0:35   ` Minchan Kim
  -1 siblings, 0 replies; 75+ messages in thread
From: Minchan Kim @ 2012-04-24  0:35 UTC (permalink / raw)
  To: Richard Davies
  Cc: linux-kernel, linux-mm, Minchan Kim, Wu Fengguang, Balbir Singh,
	Christoph Lameter, Lee Schermerhorn, Chris Webb

On 04/23/2012 06:27 PM, Richard Davies wrote:

> We run a number of relatively large x86-64 hosts with twenty or so qemu-kvm
> virtual machines on each of them, and I'm have some trouble with over-eager
> swapping on some of the machines. This is resulting in load spikes during the
> swapping and customer reports of very poor response latency from the virtual
> machines which have been swapped out, despite the hosts apparently having
> large amounts of free memory, and running fine if swap is turned off.
> 
> 
> All of the hosts are currently running a 3.1.4 or 3.2.2 kernel and have ksm
> enabled with 64GB of RAM and 2x eight-core AMD Opteron 6128 processors.
> However, we have seen this same problem since 2010 on a 2.6.32.7 kernel and
> older hardware - see http://marc.info/?l=linux-mm&m=128075337008943
> (previous helpful contributors cc:ed here - thanks).
> 
> We have /proc/sys/vm/swappiness set to 0. The kernel config is here:
> http://users.org.uk/config-3.1.4


Although you set swappiness to 0, the kernel can still swap out anon pages
in the current implementation. I think it's a severe problem.

Could this patch help you?
http://permalink.gmane.org/gmane.linux.kernel.mm/74824
It prevents anon pages from being swapped out until only a small amount of
page cache remains.
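
(To confirm that anon pages really are being swapped out during an
incident, the counters quoted from /proc/vmstat above can be sampled over
time, e.g.:)

  while sleep 10; do
      date
      grep -E '^(pswpin|pswpout|pgmajfault|allocstall) ' /proc/vmstat
  done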

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Over-eager swapping
  2012-04-23  9:27 ` Richard Davies
@ 2012-04-24 11:16   ` Peter Lieven
  -1 siblings, 0 replies; 75+ messages in thread
From: Peter Lieven @ 2012-04-24 11:16 UTC (permalink / raw)
  To: Richard Davies
  Cc: linux-kernel, linux-mm, Minchan Kim, Wu Fengguang, Balbir Singh,
	Christoph Lameter, Lee Schermerhorn, Chris Webb

On 23.04.2012 11:27, Richard Davies wrote:
> We run a number of relatively large x86-64 hosts with twenty or so qemu-kvm
> virtual machines on each of them, and I'm have some trouble with over-eager
> swapping on some of the machines. This is resulting in load spikes during the
> swapping and customer reports of very poor response latency from the virtual
> machines which have been swapped out, despite the hosts apparently having
> large amounts of free memory, and running fine if swap is turned off.
>
>
> All of the hosts are currently running a 3.1.4 or 3.2.2 kernel and have ksm
> enabled with 64GB of RAM and 2x eight-core AMD Opteron 6128 processors.
> However, we have seen this same problem since 2010 on a 2.6.32.7 kernel and
> older hardware - see http://marc.info/?l=linux-mm&m=128075337008943
> (previous helpful contributors cc:ed here - thanks).
>
> We have /proc/sys/vm/swappiness set to 0. The kernel config is here:
> http://users.org.uk/config-3.1.4
>
>
> The rrd graphs at http://imgur.com/a/Fklxr show a typical incident.
>
> We estimate memory used from /proc/meminfo as:
>
>    = MemTotal - MemFree - Buffers + SwapTotal - SwapFree
>
> The first rrd shows memory used increasing as a VM starts, but not getting
> near the 64GB of physical RAM.
>
> The second rrd shows the heavy swapping this VM start caused.
>
> The third rrd shows a multi-gigabyte jump in swap used = SwapTotal - SwapFree
>
> The fourth rrd shows the large load spike (from 1 to 15) caused by this swap
> storm.
>
>
> It is obviously hard to capture all of the relevant data actually during an
> incident. However, as of this morning, the relevant stats are as below.
>
> Any help much appreciated! Our strong belief is that there is unnecessary
> swapping going on here, and causing these load spikes. We would like to run
> with swap for real out-of-memory situations, but at present it is causing
> these kind of load spikes on machines which run completely happily with swap
> disabled.
I wonder why so many buffers are being allocated at all. Can you describe
your storage backend and provide your qemu-kvm command line (e.g. captured
with the snippet below)? Anyhow, qemu-devel@nongnu.org might be the better
list to discuss this.
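
The running command lines -- in particular any cache= settings on the
-drive options, which control whether guest I/O goes through the host's
buffer cache -- can be captured with something like this (a generic
snippet, not from the thread):

  for pid in $(pgrep -f qemu-kvm); do
      echo "=== pid $pid ==="
      tr '\0' ' ' < /proc/$pid/cmdline
      echo
  done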

Peter
> Thanks,
>
> Richard.
>
>
> # cat /proc/meminfo
> MemTotal:       65915384 kB
> MemFree:          271104 kB
> Buffers:        36274368 kB
> Cached:            31048 kB
> SwapCached:      1830860 kB
> Active:         30594144 kB
> Inactive:       32295972 kB
> Active(anon):   21883428 kB
> Inactive(anon):  4695308 kB
> Active(file):    8710716 kB
> Inactive(file): 27600664 kB
> Unevictable:        6740 kB
> Mlocked:            6740 kB
> SwapTotal:      33054708 kB
> SwapFree:       30067948 kB
> Dirty:              1044 kB
> Writeback:             0 kB
> AnonPages:      24962708 kB
> Mapped:             7320 kB
> Shmem:                48 kB
> Slab:            2210964 kB
> SReclaimable:    1013272 kB
> SUnreclaim:      1197692 kB
> KernelStack:        6816 kB
> PageTables:       129248 kB
> NFS_Unstable:          0 kB
> Bounce:                0 kB
> WritebackTmp:          0 kB
> CommitLimit:    66012400 kB
> Committed_AS:   67375852 kB
> VmallocTotal:   34359738367 kB
> VmallocUsed:      259380 kB
> VmallocChunk:   34308695568 kB
> HardwareCorrupted:     0 kB
> AnonHugePages:    155648 kB
> HugePages_Total:       0
> HugePages_Free:        0
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> Hugepagesize:       2048 kB
> DirectMap4k:         576 kB
> DirectMap2M:     2095104 kB
> DirectMap1G:    65011712 kB
>
> # cat /proc/sys/vm/zone_reclaim_mode
> 0
>
> # cat /proc/sys/vm/min_unmapped_ratio
> 1
>
> # cat /proc/slabinfo
> slabinfo - version: 2.1
> # name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> ext4_groupinfo_1k     32     32    128   32    1 : tunables    0    0    0 : slabdata      1      1      0
> RAWv6                 34     34    960   34    8 : tunables    0    0    0 : slabdata      1      1      0
> UDPLITEv6              0      0    960   34    8 : tunables    0    0    0 : slabdata      0      0      0
> UDPv6                544    544    960   34    8 : tunables    0    0    0 : slabdata     16     16      0
> tw_sock_TCPv6          0      0    320   25    2 : tunables    0    0    0 : slabdata      0      0      0
> TCPv6                 72     72   1728   18    8 : tunables    0    0    0 : slabdata      4      4      0
> nf_conntrack_expect    592    592    216   37    2 : tunables    0    0    0 : slabdata     16     16      0
> nf_conntrack_ffffffff8199a280    933   1856    280   29    2 : tunables    0    0    0 : slabdata     64     64      0
> dm_raid1_read_record      0      0   1064   30    8 : tunables    0    0    0 : slabdata      0      0      0
> dm_snap_pending_exception      0      0    104   39    1 : tunables    0    0    0 : slabdata      0      0      0
> dm_crypt_io         1811   2574    152   26    1 : tunables    0    0    0 : slabdata     99     99      0
> kcopyd_job             0      0   3240   10    8 : tunables    0    0    0 : slabdata      0      0      0
> dm_uevent              0      0   2608   12    8 : tunables    0    0    0 : slabdata      0      0      0
> cfq_queue              0      0    232   35    2 : tunables    0    0    0 : slabdata      0      0      0
> bsg_cmd                0      0    312   26    2 : tunables    0    0    0 : slabdata      0      0      0
> mqueue_inode_cache     36     36    896   36    8 : tunables    0    0    0 : slabdata      1      1      0
> udf_inode_cache        0      0    656   24    4 : tunables    0    0    0 : slabdata      0      0      0
> fuse_request           0      0    608   26    4 : tunables    0    0    0 : slabdata      0      0      0
> fuse_inode             0      0    704   46    8 : tunables    0    0    0 : slabdata      0      0      0
> ntfs_big_inode_cache      0      0    832   39    8 : tunables    0    0    0 : slabdata      0      0      0
> ntfs_inode_cache       0      0    280   29    2 : tunables    0    0    0 : slabdata      0      0      0
> isofs_inode_cache      0      0    600   27    4 : tunables    0    0    0 : slabdata      0      0      0
> fat_inode_cache        0      0    664   24    4 : tunables    0    0    0 : slabdata      0      0      0
> fat_cache              0      0     40  102    1 : tunables    0    0    0 : slabdata      0      0      0
> hugetlbfs_inode_cache     28     28    568   28    4 : tunables    0    0    0 : slabdata      1      1      0
> squashfs_inode_cache      0      0    640   25    4 : tunables    0    0    0 : slabdata      0      0      0
> jbd2_journal_handle   2720   2720     24  170    1 : tunables    0    0    0 : slabdata     16     16      0
> jbd2_journal_head    818   1620    112   36    1 : tunables    0    0    0 : slabdata     45     45      0
> jbd2_revoke_record   2048   4096     32  128    1 : tunables    0    0    0 : slabdata     32     32      0
> ext4_inode_cache    2754   5328    864   37    8 : tunables    0    0    0 : slabdata    144    144      0
> ext4_xattr             0      0     88   46    1 : tunables    0    0    0 : slabdata      0      0      0
> ext4_free_data      1168   2628     56   73    1 : tunables    0    0    0 : slabdata     36     36      0
> ext4_allocation_context    540    540    136   30    1 : tunables    0    0    0 : slabdata     18     18      0
> ext4_io_end            0      0   1128   29    8 : tunables    0    0    0 : slabdata      0      0      0
> ext4_io_page         256    256     16  256    1 : tunables    0    0    0 : slabdata      1      1      0
> configfs_dir_cache      0      0     88   46    1 : tunables    0    0    0 : slabdata      0      0      0
> kioctx                 0      0    384   42    4 : tunables    0    0    0 : slabdata      0      0      0
> inotify_inode_mark     30     30    136   30    1 : tunables    0    0    0 : slabdata      1      1      0
> kvm_async_pf         448    448    144   28    1 : tunables    0    0    0 : slabdata     16     16      0
> kvm_vcpu              64     94  13856    2    8 : tunables    0    0    0 : slabdata     47     47      0
> UDP-Lite               0      0    768   42    8 : tunables    0    0    0 : slabdata      0      0      0
> xfrm_dst_cache         0      0    448   36    4 : tunables    0    0    0 : slabdata      0      0      0
> ip_fib_trie          219    219     56   73    1 : tunables    0    0    0 : slabdata      3      3      0
> arp_cache            417    500    320   25    2 : tunables    0    0    0 : slabdata     20     20      0
> RAW                  672    672    768   42    8 : tunables    0    0    0 : slabdata     16     16      0
> UDP                  672    672    768   42    8 : tunables    0    0    0 : slabdata     16     16      0
> tw_sock_TCP          512   1088    256   32    2 : tunables    0    0    0 : slabdata     34     34      0
> TCP                  345    357   1536   21    8 : tunables    0    0    0 : slabdata     17     17      0
> blkdev_queue         414    440   1616   20    8 : tunables    0    0    0 : slabdata     22     22      0
> blkdev_requests      945   2209    344   47    4 : tunables    0    0    0 : slabdata     47     47      0
> sock_inode_cache     456    475    640   25    4 : tunables    0    0    0 : slabdata     19     19      0
> shmem_inode_cache   2063   2375    632   25    4 : tunables    0    0    0 : slabdata     95     95      0
> Acpi-ParseExt       3848   3864     72   56    1 : tunables    0    0    0 : slabdata     69     69      0
> Acpi-Namespace    633667 1059270     40  102    1 : tunables    0    0    0 : slabdata  10385  10385      0
> task_delay_info     1238   1584    112   36    1 : tunables    0    0    0 : slabdata     44     44      0
> taskstats            384    384    328   24    2 : tunables    0    0    0 : slabdata     16     16      0
> proc_inode_cache    2460   3250    616   26    4 : tunables    0    0    0 : slabdata    125    125      0
> sigqueue             400    400    160   25    1 : tunables    0    0    0 : slabdata     16     16      0
> bdev_cache           701    714    768   42    8 : tunables    0    0    0 : slabdata     17     17      0
> sysfs_dir_cache    31662  34425     80   51    1 : tunables    0    0    0 : slabdata    675    675      0
> inode_cache         2546   3886    552   29    4 : tunables    0    0    0 : slabdata    134    134      0
> dentry              9452  14868    192   42    2 : tunables    0    0    0 : slabdata    354    354      0
> buffer_head       8175114 8360937    104   39    1 : tunables    0    0    0 : slabdata 214383 214383      0
> vm_area_struct     35344  35834    176   46    2 : tunables    0    0    0 : slabdata    782    782      0
> files_cache          736    874    704   46    8 : tunables    0    0    0 : slabdata     19     19      0
> signal_cache        1011   1296    896   36    8 : tunables    0    0    0 : slabdata     36     36      0
> sighand_cache        682    945   2112   15    8 : tunables    0    0    0 : slabdata     63     63      0
> task_struct         1057   1386   1520   21    8 : tunables    0    0    0 : slabdata     66     66      0
> anon_vma            2417   2856     72   56    1 : tunables    0    0    0 : slabdata     51     51      0
> shared_policy_node   4877   6800     48   85    1 : tunables    0    0    0 : slabdata     80     80      0
> numa_policy        45589  48450     24  170    1 : tunables    0    0    0 : slabdata    285    285      0
> radix_tree_node   227192 248388    568   28    4 : tunables    0    0    0 : slabdata   9174   9174      0
> idr_layer_cache      603    660    544   30    4 : tunables    0    0    0 : slabdata     22     22      0
> dma-kmalloc-8192       0      0   8192    4    8 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-4096       0      0   4096    8    8 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-2048       0      0   2048   16    8 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-1024       0      0   1024   32    8 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-512        0      0    512   32    4 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-256        0      0    256   32    2 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-128        0      0    128   32    1 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-64         0      0     64   64    1 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-32         0      0     32  128    1 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-16         0      0     16  256    1 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-8          0      0      8  512    1 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-192        0      0    192   42    2 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-96         0      0     96   42    1 : tunables    0    0    0 : slabdata      0      0      0
> kmalloc-8192          88    100   8192    4    8 : tunables    0    0    0 : slabdata     25     25      0
> kmalloc-4096        3567   3704   4096    8    8 : tunables    0    0    0 : slabdata    463    463      0
> kmalloc-2048       55140  55936   2048   16    8 : tunables    0    0    0 : slabdata   3496   3496      0
> kmalloc-1024        5960   6496   1024   32    8 : tunables    0    0    0 : slabdata    203    203      0
> kmalloc-512        12185  12704    512   32    4 : tunables    0    0    0 : slabdata    397    397      0
> kmalloc-256       195078 199040    256   32    2 : tunables    0    0    0 : slabdata   6220   6220      0
> kmalloc-128        45645  47328    128   32    1 : tunables    0    0    0 : slabdata   1479   1479      0
> kmalloc-64        14647251 14776576     64   64    1 : tunables    0    0    0 : slabdata 230884 230884      0
> kmalloc-32          5573   7552     32  128    1 : tunables    0    0    0 : slabdata     59     59      0
> kmalloc-16          7550  10752     16  256    1 : tunables    0    0    0 : slabdata     42     42      0
> kmalloc-8          13805  14848      8  512    1 : tunables    0    0    0 : slabdata     29     29      0
> kmalloc-192        47641  50883    192   42    2 : tunables    0    0    0 : slabdata   1214   1214      0
> kmalloc-96          3673   6006     96   42    1 : tunables    0    0    0 : slabdata    143    143      0
> kmem_cache            32     32    256   32    2 : tunables    0    0    0 : slabdata      1      1      0
> kmem_cache_node      495    576     64   64    1 : tunables    0    0    0 : slabdata      9      9      0
>
> # cat /proc/buddyinfo
> Node 0, zone      DMA      0      0      1      0      2      1      1      0      1      1      3
> Node 0, zone    DMA32   9148   1941    657    673    131     53     18      2      0      0      0
> Node 0, zone   Normal   8080     13      0      2      0      2      1      0      1      0      0
> Node 1, zone   Normal  19071   3239    675    200    413     37      4      1      2      0      0
> Node 2, zone   Normal  37716   3924    154      9      3      1      2      0      1      0      0
> Node 3, zone   Normal  20015   4590   1768    996    334     20      1      1      1      0      0
>
> # grep MemFree /sys/devices/system/node/node*/meminfo
> /sys/devices/system/node/node0/meminfo:Node 0 MemFree:          201460 kB
> /sys/devices/system/node/node1/meminfo:Node 1 MemFree:          283224 kB
> /sys/devices/system/node/node2/meminfo:Node 2 MemFree:          287060 kB
> /sys/devices/system/node/node3/meminfo:Node 3 MemFree:          316928 kB
>
> # cat /proc/vmstat
> nr_free_pages 224933
> nr_inactive_anon 1173838
> nr_active_anon 5209232
> nr_inactive_file 6998686
> nr_active_file 2180311
> nr_unevictable 1685
> nr_mlock 1685
> nr_anon_pages 5940145
> nr_mapped 1836
> nr_file_pages 9635092
> nr_dirty 603
> nr_writeback 0
> nr_slab_reclaimable 253121
> nr_slab_unreclaimable 299440
> nr_page_table_pages 32311
> nr_kernel_stack 854
> nr_unstable 0
> nr_bounce 0
> nr_vmscan_write 50485772
> nr_writeback_temp 0
> nr_isolated_anon 0
> nr_isolated_file 0
> nr_shmem 12
> nr_dirtied 5630347228
> nr_written 5625041387
> numa_hit 28372623283
> numa_miss 4761673976
> numa_foreign 4761673976
> numa_interleave 30490
> numa_local 28372334279
> numa_other 4761962980
> nr_anon_transparent_hugepages 76
> nr_dirty_threshold 8192
> nr_dirty_background_threshold 4096
> pgpgin 9523143630
> pgpgout 23124688920
> pswpin 57978726
> pswpout 50121412
> pgalloc_dma 0
> pgalloc_dma32 1132547190
> pgalloc_normal 32421613044
> pgalloc_movable 0
> pgfree 39379011152
> pgactivate 751722445
> pgdeactivate 591205976
> pgfault 41103638391
> pgmajfault 11853858
> pgrefill_dma 0
> pgrefill_dma32 24124080
> pgrefill_normal 540719764
> pgrefill_movable 0
> pgsteal_dma 0
> pgsteal_dma32 297677595
> pgsteal_normal 4784595717
> pgsteal_movable 0
> pgscan_kswapd_dma 0
> pgscan_kswapd_dma32 241277864
> pgscan_kswapd_normal 4004618399
> pgscan_kswapd_movable 0
> pgscan_direct_dma 0
> pgscan_direct_dma32 65729843
> pgscan_direct_normal 1012932822
> pgscan_direct_movable 0
> zone_reclaim_failed 0
> pginodesteal 66
> slabs_scanned 668153728
> kswapd_steal 4063341017
> kswapd_inodesteal 2063
> kswapd_low_wmark_hit_quickly 9834
> kswapd_high_wmark_hit_quickly 488468
> kswapd_skip_congestion_wait 580150
> pageoutrun 22006623
> allocstall 926752
> pgrotated 28467920
> compact_blocks_moved 522323130
> compact_pages_moved 5774251432
> compact_pagemigrate_failed 5267247
> compact_stall 121045
> compact_fail 68349
> compact_success 52696
> htlb_buddy_alloc_success 0
> htlb_buddy_alloc_fail 0
> unevictable_pgs_culled 19976952
> unevictable_pgs_scanned 0
> unevictable_pgs_rescued 33137561
> unevictable_pgs_mlocked 35042070
> unevictable_pgs_munlocked 33138335
> unevictable_pgs_cleared 0
> unevictable_pgs_stranded 0
> unevictable_pgs_mlockfreed 1024
> thp_fault_alloc 263176
> thp_fault_fallback 717335
> thp_collapse_alloc 21307
> thp_collapse_alloc_failed 91103
> thp_split 90328
>


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Over-eager swapping
@ 2012-04-24 11:16   ` Peter Lieven
  0 siblings, 0 replies; 75+ messages in thread
From: Peter Lieven @ 2012-04-24 11:16 UTC (permalink / raw)
  To: Richard Davies
  Cc: linux-kernel, linux-mm, Minchan Kim, Wu Fengguang, Balbir Singh,
	Christoph Lameter, Lee Schermerhorn, Chris Webb

On 23.04.2012 11:27, Richard Davies wrote:
> We run a number of relatively large x86-64 hosts with twenty or so qemu-kvm
> virtual machines on each of them, and I'm have some trouble with over-eager
> swapping on some of the machines. This is resulting in load spikes during the
> swapping and customer reports of very poor response latency from the virtual
> machines which have been swapped out, despite the hosts apparently having
> large amounts of free memory, and running fine if swap is turned off.
>
>
> All of the hosts are currently running a 3.1.4 or 3.2.2 kernel and have ksm
> enabled with 64GB of RAM and 2x eight-core AMD Opteron 6128 processors.
> However, we have seen this same problem since 2010 on a 2.6.32.7 kernel and
> older hardware - see http://marc.info/?l=linux-mm&m=128075337008943
> (previous helpful contributors cc:ed here - thanks).
>
> We have /proc/sys/vm/swappiness set to 0. The kernel config is here:
> http://users.org.uk/config-3.1.4
>
>
> The rrd graphs at http://imgur.com/a/Fklxr show a typical incident.
>
> We estimate memory used from /proc/meminfo as:
>
>    = MemTotal - MemFree - Buffers + SwapTotal - SwapFree
>
> The first rrd shows memory used increasing as a VM starts, but not getting
> near the 64GB of physical RAM.
>
> The second rrd shows the heavy swapping this VM start caused.
>
> The third rrd shows a multi-gigabyte jump in swap used = SwapTotal - SwapFree
>
> The fourth rrd shows the large load spike (from 1 to 15) caused by this swap
> storm.
>
>
> It is obviously hard to capture all of the relevant data actually during an
> incident. However, as of this morning, the relevant stats are as below.
>
> Any help much appreciated! Our strong belief is that there is unnecessary
> swapping going on here, and causing these load spikes. We would like to run
> with swap for real out-of-memory situations, but at present it is causing
> these kind of load spikes on machines which run completely happily with swap
> disabled.
I wonder why so many buffers are being allocated at all. Can you describe
your storage backend and provide your qemu-kvm command line? Anyhow,
qemu-devel@nongnu.org might be the better list to discuss this.

Peter
> Thanks,
>
> Richard.
>
>
> # cat /proc/meminfo
> MemTotal:       65915384 kB
> MemFree:          271104 kB
> Buffers:        36274368 kB
> Cached:            31048 kB
> SwapCached:      1830860 kB
> Active:         30594144 kB
> Inactive:       32295972 kB
> Active(anon):   21883428 kB
> Inactive(anon):  4695308 kB
> Active(file):    8710716 kB
> Inactive(file): 27600664 kB
> Unevictable:        6740 kB
> Mlocked:            6740 kB
> SwapTotal:      33054708 kB
> SwapFree:       30067948 kB
> Dirty:              1044 kB
> Writeback:             0 kB
> AnonPages:      24962708 kB
> Mapped:             7320 kB
> Shmem:                48 kB
> Slab:            2210964 kB
> SReclaimable:    1013272 kB
> SUnreclaim:      1197692 kB
> KernelStack:        6816 kB
> PageTables:       129248 kB
> NFS_Unstable:          0 kB
> Bounce:                0 kB
> WritebackTmp:          0 kB
> CommitLimit:    66012400 kB
> Committed_AS:   67375852 kB
> VmallocTotal:   34359738367 kB
> VmallocUsed:      259380 kB
> VmallocChunk:   34308695568 kB
> HardwareCorrupted:     0 kB
> AnonHugePages:    155648 kB
> HugePages_Total:       0
> HugePages_Free:        0
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> Hugepagesize:       2048 kB
> DirectMap4k:         576 kB
> DirectMap2M:     2095104 kB
> DirectMap1G:    65011712 kB
>
> # cat /proc/sys/vm/zone_reclaim_mode
> 0
>
> # cat /proc/sys/vm/min_unmapped_ratio
> 1
>
> # cat /proc/slabinfo
> slabinfo - version: 2.1
> # name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> ext4_groupinfo_1k     32     32    128   32    1 : tunables    0    0    0 : slabdata      1      1      0
> RAWv6                 34     34    960   34    8 : tunables    0    0    0 : slabdata      1      1      0
> UDPLITEv6              0      0    960   34    8 : tunables    0    0    0 : slabdata      0      0      0
> UDPv6                544    544    960   34    8 : tunables    0    0    0 : slabdata     16     16      0
> tw_sock_TCPv6          0      0    320   25    2 : tunables    0    0    0 : slabdata      0      0      0
> TCPv6                 72     72   1728   18    8 : tunables    0    0    0 : slabdata      4      4      0
> nf_conntrack_expect    592    592    216   37    2 : tunables    0    0    0 : slabdata     16     16      0
> nf_conntrack_ffffffff8199a280    933   1856    280   29    2 : tunables    0    0    0 : slabdata     64     64      0
> dm_raid1_read_record      0      0   1064   30    8 : tunables    0    0    0 : slabdata      0      0      0
> dm_snap_pending_exception      0      0    104   39    1 : tunables    0    0    0 : slabdata      0      0      0
> dm_crypt_io         1811   2574    152   26    1 : tunables    0    0    0 : slabdata     99     99      0
> kcopyd_job             0      0   3240   10    8 : tunables    0    0    0 : slabdata      0      0      0
> dm_uevent              0      0   2608   12    8 : tunables    0    0    0 : slabdata      0      0      0
> cfq_queue              0      0    232   35    2 : tunables    0    0    0 : slabdata      0      0      0
> bsg_cmd                0      0    312   26    2 : tunables    0    0    0 : slabdata      0      0      0
> mqueue_inode_cache     36     36    896   36    8 : tunables    0    0    0 : slabdata      1      1      0
> udf_inode_cache        0      0    656   24    4 : tunables    0    0    0 : slabdata      0      0      0
> fuse_request           0      0    608   26    4 : tunables    0    0    0 : slabdata      0      0      0
> fuse_inode             0      0    704   46    8 : tunables    0    0    0 : slabdata      0      0      0
> ntfs_big_inode_cache      0      0    832   39    8 : tunables    0    0    0 : slabdata      0      0      0
> ntfs_inode_cache       0      0    280   29    2 : tunables    0    0    0 : slabdata      0      0      0
> isofs_inode_cache      0      0    600   27    4 : tunables    0    0    0 : slabdata      0      0      0
> fat_inode_cache        0      0    664   24    4 : tunables    0    0    0 : slabdata      0      0      0
> fat_cache              0      0     40  102    1 : tunables    0    0    0 : slabdata      0      0      0
> hugetlbfs_inode_cache     28     28    568   28    4 : tunables    0    0    0 : slabdata      1      1      0
> squashfs_inode_cache      0      0    640   25    4 : tunables    0    0    0 : slabdata      0      0      0
> jbd2_journal_handle   2720   2720     24  170    1 : tunables    0    0    0 : slabdata     16     16      0
> jbd2_journal_head    818   1620    112   36    1 : tunables    0    0    0 : slabdata     45     45      0
> jbd2_revoke_record   2048   4096     32  128    1 : tunables    0    0    0 : slabdata     32     32      0
> ext4_inode_cache    2754   5328    864   37    8 : tunables    0    0    0 : slabdata    144    144      0
> ext4_xattr             0      0     88   46    1 : tunables    0    0    0 : slabdata      0      0      0
> ext4_free_data      1168   2628     56   73    1 : tunables    0    0    0 : slabdata     36     36      0
> ext4_allocation_context    540    540    136   30    1 : tunables    0    0    0 : slabdata     18     18      0
> ext4_io_end            0      0   1128   29    8 : tunables    0    0    0 : slabdata      0      0      0
> ext4_io_page         256    256     16  256    1 : tunables    0    0    0 : slabdata      1      1      0
> configfs_dir_cache      0      0     88   46    1 : tunables    0    0    0 : slabdata      0      0      0
> kioctx                 0      0    384   42    4 : tunables    0    0    0 : slabdata      0      0      0
> inotify_inode_mark     30     30    136   30    1 : tunables    0    0    0 : slabdata      1      1      0
> kvm_async_pf         448    448    144   28    1 : tunables    0    0    0 : slabdata     16     16      0
> kvm_vcpu              64     94  13856    2    8 : tunables    0    0    0 : slabdata     47     47      0
> UDP-Lite               0      0    768   42    8 : tunables    0    0    0 : slabdata      0      0      0
> xfrm_dst_cache         0      0    448   36    4 : tunables    0    0    0 : slabdata      0      0      0
> ip_fib_trie          219    219     56   73    1 : tunables    0    0    0 : slabdata      3      3      0
> arp_cache            417    500    320   25    2 : tunables    0    0    0 : slabdata     20     20      0
> RAW                  672    672    768   42    8 : tunables    0    0    0 : slabdata     16     16      0
> UDP                  672    672    768   42    8 : tunables    0    0    0 : slabdata     16     16      0
> tw_sock_TCP          512   1088    256   32    2 : tunables    0    0    0 : slabdata     34     34      0
> TCP                  345    357   1536   21    8 : tunables    0    0    0 : slabdata     17     17      0
> blkdev_queue         414    440   1616   20    8 : tunables    0    0    0 : slabdata     22     22      0
> blkdev_requests      945   2209    344   47    4 : tunables    0    0    0 : slabdata     47     47      0
> sock_inode_cache     456    475    640   25    4 : tunables    0    0    0 : slabdata     19     19      0
> shmem_inode_cache   2063   2375    632   25    4 : tunables    0    0    0 : slabdata     95     95      0
> Acpi-ParseExt       3848   3864     72   56    1 : tunables    0    0    0 : slabdata     69     69      0
> Acpi-Namespace    633667 1059270     40  102    1 : tunables    0    0    0 : slabdata  10385  10385      0
> task_delay_info     1238   1584    112   36    1 : tunables    0    0    0 : slabdata     44     44      0
> taskstats            384    384    328   24    2 : tunables    0    0    0 : slabdata     16     16      0
> proc_inode_cache    2460   3250    616   26    4 : tunables    0    0    0 : slabdata    125    125      0
> sigqueue             400    400    160   25    1 : tunables    0    0    0 : slabdata     16     16      0
> bdev_cache           701    714    768   42    8 : tunables    0    0    0 : slabdata     17     17      0
> sysfs_dir_cache    31662  34425     80   51    1 : tunables    0    0    0 : slabdata    675    675      0
> inode_cache         2546   3886    552   29    4 : tunables    0    0    0 : slabdata    134    134      0
> dentry              9452  14868    192   42    2 : tunables    0    0    0 : slabdata    354    354      0
> buffer_head       8175114 8360937    104   39    1 : tunables    0    0    0 : slabdata 214383 214383      0
> vm_area_struct     35344  35834    176   46    2 : tunables    0    0    0 : slabdata    782    782      0
> files_cache          736    874    704   46    8 : tunables    0    0    0 : slabdata     19     19      0
> signal_cache        1011   1296    896   36    8 : tunables    0    0    0 : slabdata     36     36      0
> sighand_cache        682    945   2112   15    8 : tunables    0    0    0 : slabdata     63     63      0
> task_struct         1057   1386   1520   21    8 : tunables    0    0    0 : slabdata     66     66      0
> anon_vma            2417   2856     72   56    1 : tunables    0    0    0 : slabdata     51     51      0
> shared_policy_node   4877   6800     48   85    1 : tunables    0    0    0 : slabdata     80     80      0
> numa_policy        45589  48450     24  170    1 : tunables    0    0    0 : slabdata    285    285      0
> radix_tree_node   227192 248388    568   28    4 : tunables    0    0    0 : slabdata   9174   9174      0
> idr_layer_cache      603    660    544   30    4 : tunables    0    0    0 : slabdata     22     22      0
> dma-kmalloc-8192       0      0   8192    4    8 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-4096       0      0   4096    8    8 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-2048       0      0   2048   16    8 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-1024       0      0   1024   32    8 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-512        0      0    512   32    4 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-256        0      0    256   32    2 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-128        0      0    128   32    1 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-64         0      0     64   64    1 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-32         0      0     32  128    1 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-16         0      0     16  256    1 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-8          0      0      8  512    1 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-192        0      0    192   42    2 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-96         0      0     96   42    1 : tunables    0    0    0 : slabdata      0      0      0
> kmalloc-8192          88    100   8192    4    8 : tunables    0    0    0 : slabdata     25     25      0
> kmalloc-4096        3567   3704   4096    8    8 : tunables    0    0    0 : slabdata    463    463      0
> kmalloc-2048       55140  55936   2048   16    8 : tunables    0    0    0 : slabdata   3496   3496      0
> kmalloc-1024        5960   6496   1024   32    8 : tunables    0    0    0 : slabdata    203    203      0
> kmalloc-512        12185  12704    512   32    4 : tunables    0    0    0 : slabdata    397    397      0
> kmalloc-256       195078 199040    256   32    2 : tunables    0    0    0 : slabdata   6220   6220      0
> kmalloc-128        45645  47328    128   32    1 : tunables    0    0    0 : slabdata   1479   1479      0
> kmalloc-64        14647251 14776576     64   64    1 : tunables    0    0    0 : slabdata 230884 230884      0
> kmalloc-32          5573   7552     32  128    1 : tunables    0    0    0 : slabdata     59     59      0
> kmalloc-16          7550  10752     16  256    1 : tunables    0    0    0 : slabdata     42     42      0
> kmalloc-8          13805  14848      8  512    1 : tunables    0    0    0 : slabdata     29     29      0
> kmalloc-192        47641  50883    192   42    2 : tunables    0    0    0 : slabdata   1214   1214      0
> kmalloc-96          3673   6006     96   42    1 : tunables    0    0    0 : slabdata    143    143      0
> kmem_cache            32     32    256   32    2 : tunables    0    0    0 : slabdata      1      1      0
> kmem_cache_node      495    576     64   64    1 : tunables    0    0    0 : slabdata      9      9      0
>
> # cat /proc/buddyinfo
> Node 0, zone      DMA      0      0      1      0      2      1      1      0      1      1      3
> Node 0, zone    DMA32   9148   1941    657    673    131     53     18      2      0      0      0
> Node 0, zone   Normal   8080     13      0      2      0      2      1      0      1      0      0
> Node 1, zone   Normal  19071   3239    675    200    413     37      4      1      2      0      0
> Node 2, zone   Normal  37716   3924    154      9      3      1      2      0      1      0      0
> Node 3, zone   Normal  20015   4590   1768    996    334     20      1      1      1      0      0
>
> # grep MemFree /sys/devices/system/node/node*/meminfo
> /sys/devices/system/node/node0/meminfo:Node 0 MemFree:          201460 kB
> /sys/devices/system/node/node1/meminfo:Node 1 MemFree:          283224 kB
> /sys/devices/system/node/node2/meminfo:Node 2 MemFree:          287060 kB
> /sys/devices/system/node/node3/meminfo:Node 3 MemFree:          316928 kB
>
> # cat /proc/vmstat
> nr_free_pages 224933
> nr_inactive_anon 1173838
> nr_active_anon 5209232
> nr_inactive_file 6998686
> nr_active_file 2180311
> nr_unevictable 1685
> nr_mlock 1685
> nr_anon_pages 5940145
> nr_mapped 1836
> nr_file_pages 9635092
> nr_dirty 603
> nr_writeback 0
> nr_slab_reclaimable 253121
> nr_slab_unreclaimable 299440
> nr_page_table_pages 32311
> nr_kernel_stack 854
> nr_unstable 0
> nr_bounce 0
> nr_vmscan_write 50485772
> nr_writeback_temp 0
> nr_isolated_anon 0
> nr_isolated_file 0
> nr_shmem 12
> nr_dirtied 5630347228
> nr_written 5625041387
> numa_hit 28372623283
> numa_miss 4761673976
> numa_foreign 4761673976
> numa_interleave 30490
> numa_local 28372334279
> numa_other 4761962980
> nr_anon_transparent_hugepages 76
> nr_dirty_threshold 8192
> nr_dirty_background_threshold 4096
> pgpgin 9523143630
> pgpgout 23124688920
> pswpin 57978726
> pswpout 50121412
> pgalloc_dma 0
> pgalloc_dma32 1132547190
> pgalloc_normal 32421613044
> pgalloc_movable 0
> pgfree 39379011152
> pgactivate 751722445
> pgdeactivate 591205976
> pgfault 41103638391
> pgmajfault 11853858
> pgrefill_dma 0
> pgrefill_dma32 24124080
> pgrefill_normal 540719764
> pgrefill_movable 0
> pgsteal_dma 0
> pgsteal_dma32 297677595
> pgsteal_normal 4784595717
> pgsteal_movable 0
> pgscan_kswapd_dma 0
> pgscan_kswapd_dma32 241277864
> pgscan_kswapd_normal 4004618399
> pgscan_kswapd_movable 0
> pgscan_direct_dma 0
> pgscan_direct_dma32 65729843
> pgscan_direct_normal 1012932822
> pgscan_direct_movable 0
> zone_reclaim_failed 0
> pginodesteal 66
> slabs_scanned 668153728
> kswapd_steal 4063341017
> kswapd_inodesteal 2063
> kswapd_low_wmark_hit_quickly 9834
> kswapd_high_wmark_hit_quickly 488468
> kswapd_skip_congestion_wait 580150
> pageoutrun 22006623
> allocstall 926752
> pgrotated 28467920
> compact_blocks_moved 522323130
> compact_pages_moved 5774251432
> compact_pagemigrate_failed 5267247
> compact_stall 121045
> compact_fail 68349
> compact_success 52696
> htlb_buddy_alloc_success 0
> htlb_buddy_alloc_fail 0
> unevictable_pgs_culled 19976952
> unevictable_pgs_scanned 0
> unevictable_pgs_rescued 33137561
> unevictable_pgs_mlocked 35042070
> unevictable_pgs_munlocked 33138335
> unevictable_pgs_cleared 0
> unevictable_pgs_stranded 0
> unevictable_pgs_mlockfreed 1024
> thp_fault_alloc 263176
> thp_fault_fallback 717335
> thp_collapse_alloc 21307
> thp_collapse_alloc_failed 91103
> thp_split 90328
>

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Over-eager swapping
  2012-04-23  9:27 ` Richard Davies
@ 2012-04-25 14:41   ` Rik van Riel
  -1 siblings, 0 replies; 75+ messages in thread
From: Rik van Riel @ 2012-04-25 14:41 UTC (permalink / raw)
  To: Richard Davies
  Cc: linux-kernel, linux-mm, Minchan Kim, Wu Fengguang, Balbir Singh,
	Christoph Lameter, Lee Schermerhorn, Chris Webb

On 04/23/2012 05:27 AM, Richard Davies wrote:

> The rrd graphs at http://imgur.com/a/Fklxr show a typical incident.
>
> We estimate memory used from /proc/meminfo as:
>
>    = MemTotal - MemFree - Buffers + SwapTotal - SwapFree
>
> The first rrd shows memory used increasing as a VM starts, but not getting
> near the 64GB of physical RAM.
>
> The second rrd shows the heavy swapping this VM start caused.
>
> The third rrd shows a multi-gigabyte jump in swap used = SwapTotal - SwapFree
>
> The fourth rrd shows the large load spike (from 1 to 15) caused by this swap
> storm.

These are exactly the kind of swap storms that led me to
make the VM tweaks that got merged into 3.4-rc :)

See these commits:

fe2c2a106663130a5ab45cb0e3414b52df2fff0c
7be62de99adcab4449d416977b4274985c5fe023
aff622495c9a0b56148192e53bdec539f5e147f2
1480de0340a8d5f094b74d7c4b902456c9a06903
496b919b3bdd957d4b1727df79bfa3751bced1c1
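
(As a hedged aside for anyone following along: one way to check whether a
given kernel tree already carries these fixes is to ask git which tag first
contains each commit. A minimal sketch, run from inside a clone of Linus'
tree; the abbreviated hashes are simply the ones listed above:)

  # for c in fe2c2a106663 7be62de99adc aff622495c9a 1480de0340a8 496b919b3bdd; do git describe --contains "$c" 2>/dev/null || echo "$c: not in this tree"; done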

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Over-eager swapping
  2010-08-19 10:20                                     ` Chris Webb
@ 2010-08-19 19:03                                       ` Christoph Lameter
  -1 siblings, 0 replies; 75+ messages in thread
From: Christoph Lameter @ 2010-08-19 19:03 UTC (permalink / raw)
  To: Chris Webb
  Cc: Lee Schermerhorn, Wu Fengguang, Minchan Kim, linux-mm,
	linux-kernel, KOSAKI Motohiro, Pekka Enberg, Andi Kleen

On Thu, 19 Aug 2010, Chris Webb wrote:

> I tried this on a handful of the problem hosts before re-adding their swap.
> One of them now runs without dipping into swap. The other three I tried had
> the same behaviour of sitting at zero swap usage for a while, before
> suddenly spiralling up with %wait going through the roof. I had to swapoff
> on them to bring them back into a sane state. So it looks like it helps a
> bit, but doesn't cure the problem.
>
> I could definitely believe an explanation that we're swapping in preference
> to allocating remote zone pages somehow, given the imbalance in free memory
> between the nodes which we saw. However, I read the documentation for
> vm.zone_reclaim_mode, which suggests to me that when it was set to zero,
> pages from remote zones should be allocated automatically in preference to
> swap given that zone_reclaim_mode & 4 == 0?

If zone reclaim is off then pages from other nodes will be allocated if a
node is filled up with page cache.

zone reclaim typically only evicts clean page cache pages in order to keep
the additional overhead down. Enabling swapping allows a more aggressive
form of reclaiming memory locally in preference to going off-node.

The VM should work fine even without zone reclaim.
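
(A small reference for the knob being discussed: vm.zone_reclaim_mode is a
bitmask per Documentation/sysctl/vm.txt, where 1 enables zone reclaim, 2
lets it write out dirty pages, and 4 lets it swap. A minimal sketch of the
conservative setting suggested here, assuming a sysctl-managed host; the
persistence file path is the usual default and may differ on your distro:)

  # sysctl -w vm.zone_reclaim_mode=1
  # echo 'vm.zone_reclaim_mode = 1' >> /etc/sysctl.conf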

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Over-eager swapping
  2010-08-19  9:25     ` Chris Webb
@ 2010-08-19 15:13       ` Balbir Singh
  -1 siblings, 0 replies; 75+ messages in thread
From: Balbir Singh @ 2010-08-19 15:13 UTC (permalink / raw)
  To: Chris Webb
  Cc: linux-mm, linux-kernel, Wu Fengguang, Minchan Kim, KOSAKI Motohiro

* Chris Webb <chris@arachsys.com> [2010-08-19 10:25:36]:

> Balbir Singh <balbir@linux.vnet.ibm.com> writes:
> 
> > Can you give an idea of what the meminfo inside the guest looks like.
> 
> Sorry for the slow reply here. Unfortunately not, as these guests are run on
> behalf of customers. They install them with operating systems of their
> choice, and run them on our service.
>

Thanks for clarifying.
 
> > Have you looked at
> > http://kerneltrap.org/mailarchive/linux-kernel/2010/6/8/4580772
> 
> Yes, I've been watching these discussions with interest. Our application is
> one where we have little to no control over what goes on inside the guests,
> but these sorts of things definitely make sense where the two are under the
> same administrative control.
>

Not necessarily; in some cases you can use a guest that uses less
page cache, but that might not matter in your case at the moment.
 
> > Do we have reason to believe the problem can be solved entirely in the
> > host?
> 
> It's not clear to me why this should be difficult, given that the total size
> of vm allocated to guests (and system processes) is always strictly less
> than the total amount of RAM available in the host. I do understand that it
> won't allow for as impressive overcommit (except by ksm) or be as efficient,
> because file-backed guest pages won't get evicted by pressure in the host as
> they are indistinguishable from anonymous pages.
>
> After all, a solution that isn't ideal, but does work, is to turn off swap
> completely! This is what we've been doing to date. The only problem with
> this is that we can't dip into swap in an emergency if there's no swap there
> at all.

If you are not overcommitting, it should work; in my experiments I've
seen a lot of memory used by the host as page cache on behalf of the
guest. I've done my experiments using cgroups to identify accurate
usage.
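
(For anyone wanting to reproduce that kind of measurement, a rough sketch of
per-guest accounting with the v1 memory controller. The mount point, group
name and PID below are hypothetical, and the swap line in memory.stat only
appears when swap accounting is enabled:)

  # mount -t cgroup -o memory none /cgroup/memory    # if not already mounted
  # mkdir /cgroup/memory/guest1
  # echo 12345 > /cgroup/memory/guest1/tasks         # move the qemu-kvm PID in
  # grep -E '^(cache|rss|swap) ' /cgroup/memory/guest1/memory.stat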

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Over-eager swapping
  2010-08-18 16:13                                   ` Christoph Lameter
@ 2010-08-19 10:20                                     ` Chris Webb
  -1 siblings, 0 replies; 75+ messages in thread
From: Chris Webb @ 2010-08-19 10:20 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Lee Schermerhorn, Wu Fengguang, Minchan Kim, linux-mm,
	linux-kernel, KOSAKI Motohiro, Pekka Enberg, Andi Kleen

Christoph Lameter <cl@linux-foundation.org> writes:

> On Wed, 18 Aug 2010, Chris Webb wrote:
> 
> > > != 0.  And even then, zone reclaim should only reclaim file pages, not
> > > anon.  In theory...
> >
> > Hi. This is zero on all our machines:
> >
> > # sysctl vm.zone_reclaim_mode
> > vm.zone_reclaim_mode = 0
> 
> Set it to 1.

I tried this on a handful of the problem hosts before re-adding their swap.
One of them now runs without dipping into swap. The other three I tried had
the same behaviour of sitting at zero swap usage for a while, before
suddenly spiralling up with %wait going through the roof. I had to swapoff
on them to bring them back into a sane state. So it looks like it helps a
bit, but doesn't cure the problem.

I could definitely believe an explanation that we're swapping in preference
to allocating remote zone pages somehow, given the imbalance in free memory
between the nodes which we saw. However, I read the documentation for
vm.zone_reclaim_mode, which suggests to me that when it was set to zero,
pages from remote zones should be allocated automatically in preference to
swap given that zone_reclaim_mode & 4 == 0?
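
(A trivial way to double-check that bit from a shell, just to rule out a
stray setting; a sketch, with the bit value taken from
Documentation/sysctl/vm.txt where 4 means "zone reclaim swaps pages":)

  # mode=$(cat /proc/sys/vm/zone_reclaim_mode)
  # [ $((mode & 4)) -ne 0 ] && echo "zone reclaim may swap" || echo "zone reclaim will not swap"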

Cheers,

Chris.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Over-eager swapping
  2010-08-18 16:45   ` Balbir Singh
@ 2010-08-19  9:25     ` Chris Webb
  -1 siblings, 0 replies; 75+ messages in thread
From: Chris Webb @ 2010-08-19  9:25 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-mm, linux-kernel, Wu Fengguang, Minchan Kim, KOSAKI Motohiro

Balbir Singh <balbir@linux.vnet.ibm.com> writes:

> Can you give an idea of what the meminfo inside the guest looks like.

Sorry for the slow reply here. Unfortunately not, as these guests are run on
behalf of customers. They install them with operating systems of their
choice, and run them on our service.

> Have you looked at
> http://kerneltrap.org/mailarchive/linux-kernel/2010/6/8/4580772

Yes, I've been watching these discussions with interest. Our application is
one where we have little to no control over what goes on inside the guests,
but these sorts of things definitely make sense where the two are under the
same administrative control.

> Do we have reason to believe the problem can be solved entirely in the
> host?

It's not clear to me why this should be difficult, given that the total size
of vm allocated to guests (and system processes) is always strictly less
than the total amount of RAM available in the host. I do understand that it
won't allow for as impressive overcommit (except by ksm) or be as efficient,
because file-backed guest pages won't get evicted by pressure in the host as
they are indistinguishable from anonymous pages.

After all, a solution that isn't ideal, but does work, is to turn off swap
completely! This is what we've been doing to date. The only problem with
this is that we can't dip into swap in an emergency if there's no swap there
at all.

Best wishes,

Chris.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Over-eager swapping
  2010-08-18 16:13                                   ` Christoph Lameter
@ 2010-08-19  5:16                                     ` Balbir Singh
  -1 siblings, 0 replies; 75+ messages in thread
From: Balbir Singh @ 2010-08-19  5:16 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Chris Webb, Lee Schermerhorn, Wu Fengguang, Minchan Kim,
	linux-mm, linux-kernel, KOSAKI Motohiro, Pekka Enberg,
	Andi Kleen

* Christoph Lameter <cl@linux-foundation.org> [2010-08-18 11:13:03]:

> On Wed, 18 Aug 2010, Chris Webb wrote:
> 
> > > != 0.  And even then, zone reclaim should only reclaim file pages, not
> > > anon.  In theory...
> >
> > Hi. This is zero on all our machines:
> >
> > # sysctl vm.zone_reclaim_mode
> > vm.zone_reclaim_mode = 0
> 
> Set it to 1.
>

Isn't that bad in terms of how we treat the cost of remote node
allocations? Is local zone_reclaim() always a good thing, or is it
something for Chris to try, to see if that helps his situation?

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Over-eager swapping
  2010-08-03  4:28         ` Wu Fengguang
@ 2010-08-19  5:13           ` Balbir Singh
  -1 siblings, 0 replies; 75+ messages in thread
From: Balbir Singh @ 2010-08-19  5:13 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Minchan Kim, Chris Webb, linux-mm, linux-kernel, KOSAKI Motohiro

* Wu Fengguang <fengguang.wu@intel.com> [2010-08-03 12:28:35]:

> On Tue, Aug 03, 2010 at 12:09:18PM +0800, Minchan Kim wrote:
> > On Tue, Aug 3, 2010 at 12:31 PM, Chris Webb <chris@arachsys.com> wrote:
> > > Minchan Kim <minchan.kim@gmail.com> writes:
> > >
> > >> Another possibility is _zone_reclaim_ in NUMA.
> > >> Your working set has many anonymous page.
> > >>
> > >> The zone_reclaim set priority to ZONE_RECLAIM_PRIORITY.
> > >> It can make reclaim mode to lumpy so it can page out anon pages.
> > >>
> > >> Could you show me /proc/sys/vm/[zone_reclaim_mode/min_unmapped_ratio] ?
> > >
> > > Sure, no problem. On the machine with the /proc/meminfo I showed earlier,
> > > these are
> > >
> > >  # cat /proc/sys/vm/zone_reclaim_mode
> > >  0
> > >  # cat /proc/sys/vm/min_unmapped_ratio
> > >  1
> > 
> > if zone_reclaim_mode is zero, it doesn't swap out anon_pages.
> 
> If there are lots of order-1 or higher allocations, anonymous pages
> will be randomly evicted, regardless of their LRU ages. This is
> probably another factor behind the users' complaints. Are there easy ways to
> confirm this other than patching the kernel?
> 
> Chris, what's in your /proc/slabinfo?
>

I don't know if Chris saw the link I pointed to earlier, but one of
the reclaim challenges with virtual machines is that cached memory
in the guest (in fact all memory) shows up as anonymous on the host.
If the guests are doing a lot of caching and the guest reclaim sees
no reason to evict the cache, the host will see pressure.

That is one of the reasons I wanted to see meminfo inside the guest if
possible. Setting swappiness to 0 inside the guest is one way of
avoiding double caching that might take place, but I've not found it
to be very effective. 
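
(For completeness, the guest-side knob is just the usual sysctl; a sketch,
assuming a Linux guest:)

  # sysctl -w vm.swappiness=0    # inside the guest: prefer dropping page cache over swapping anon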

Do we have reason to believe the problem can be solved entirely in the
host?

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Over-eager swapping
  2010-08-02 12:47 ` Chris Webb
@ 2010-08-18 16:45   ` Balbir Singh
  -1 siblings, 0 replies; 75+ messages in thread
From: Balbir Singh @ 2010-08-18 16:45 UTC (permalink / raw)
  To: Chris Webb; +Cc: linux-mm, linux-kernel

* Chris Webb <chris@arachsys.com> [2010-08-02 13:47:35]:

> We run a number of relatively large x86-64 hosts with twenty or so qemu-kvm
> virtual machines on each of them, and I'm having some trouble with over-eager
> swapping on some (but not all) of the machines. This is resulting in
> customer reports of very poor response latency from the virtual machines
> which have been swapped out, despite the hosts apparently having large
> amounts of free memory, and running fine if swap is turned off.
> 
> All of the hosts are running a 2.6.32.7 kernel and have ksm enabled with
> 32GB of RAM and 2x quad-core processors. There is a cluster of Xeon E5420
> machines which apparently doesn't exhibit the problem, and a cluster of
> 2352/2378 Opteron (NUMA) machines, some of which do. The kernel config of
> the affected machines is at
> 
>   http://cdw.me.uk/tmp/config-2.6.32.7
> 
> This differs very little from the config on the unaffected Xeon machines,
> essentially just
> 
>   -CONFIG_MCORE2=y
>   +CONFIG_MK8=y
>   -CONFIG_X86_P6_NOP=y
> 
> On a typical affected machine, the virtual machines and other processes
> would apparently leave around 5.5GB of RAM available for buffers, but the
> system seems to want to swap out 3GB of anonymous pages to give itself more
> like 9GB of buffers:
> 
>   # cat /proc/meminfo 
>   MemTotal:       33083420 kB
>   MemFree:          693164 kB
>   Buffers:         8834380 kB
>   Cached:            11212 kB
>   SwapCached:      1443524 kB
>   Active:         21656844 kB
>   Inactive:        8119352 kB
>   Active(anon):   17203092 kB
>   Inactive(anon):  3729032 kB
>   Active(file):    4453752 kB
>   Inactive(file):  4390320 kB
>   Unevictable:        5472 kB
>   Mlocked:            5472 kB
>   SwapTotal:      25165816 kB
>   SwapFree:       21854572 kB
>   Dirty:              4300 kB
>   Writeback:             4 kB
>   AnonPages:      20780368 kB
>   Mapped:             6056 kB
>   Shmem:                56 kB
>   Slab:             961512 kB
>   SReclaimable:     438276 kB
>   SUnreclaim:       523236 kB
>   KernelStack:       10152 kB
>   PageTables:        67176 kB
>   NFS_Unstable:          0 kB
>   Bounce:                0 kB
>   WritebackTmp:          0 kB
>   CommitLimit:    41707524 kB
>   Committed_AS:   39870868 kB
>   VmallocTotal:   34359738367 kB
>   VmallocUsed:      150880 kB
>   VmallocChunk:   34342404996 kB
>   HardwareCorrupted:     0 kB
>   HugePages_Total:       0
>   HugePages_Free:        0
>   HugePages_Rsvd:        0
>   HugePages_Surp:        0
>   Hugepagesize:       2048 kB
>   DirectMap4k:        5824 kB
>   DirectMap2M:     3205120 kB
>   DirectMap1G:    30408704 kB
> 
> We see this despite the machine having vm.swappiness set to 0 in an attempt
> to skew the reclaim as far as possible in favour of releasing page cache
> instead of swapping anonymous pages.
>
> After running swapoff -a, the machine is immediately much healthier. Even
> while the swap is still being reduced, load goes down and response times in
> virtual machines are much improved. Once the swap is completely gone, there
> are still several gigabytes of RAM left free which are used for buffers, and
> the virtual machines are no longer laggy because they are no longer swapped
> out. Running swapon -a again, the affected machine waits for about a minute
> with zero swap in use, before the amount of swap in use very rapidly
> increases to around 2GB and then continues to increase more steadily to 3GB.
> 
> We could run with these machines without swap (in the worst cases we're
> already doing so), but I'd prefer to have a reserve of swap available in
> case of genuine emergency. If it's a choice between swapping out a guest or
> oom-killing it, I'd prefer to swap... but I really don't want to swap out
> running virtual machines in order to have eight gigabytes of page cache
> instead of five!
> 
> Is this a problem with the page reclaim priorities, or am I just tuning
> these hosts incorrectly? Is there more detailed info than /proc/meminfo
> available which might shed more light on what's going wrong here?
> 

Can you give an idea of what the meminfo inside the guest looks like?
Have you looked at
http://kerneltrap.org/mailarchive/linux-kernel/2010/6/8/4580772


-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Over-eager swapping
  2010-08-18 16:13                                   ` Christoph Lameter
@ 2010-08-18 16:32                                     ` Chris Webb
  -1 siblings, 0 replies; 75+ messages in thread
From: Chris Webb @ 2010-08-18 16:32 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Lee Schermerhorn, Wu Fengguang, Minchan Kim, linux-mm,
	linux-kernel, KOSAKI Motohiro, Pekka Enberg, Andi Kleen

Christoph Lameter <cl@linux-foundation.org> writes:

> On Wed, 18 Aug 2010, Chris Webb wrote:
> 
> > > != 0.  And even then, zone reclaim should only reclaim file pages, not
> > > anon.  In theory...
> >
> > Hi. This is zero on all our machines:
> >
> > # sysctl vm.zone_reclaim_mode
> > vm.zone_reclaim_mode = 0
> 
> Set it to 1.

I'll try this tonight: setting this to one and re-adding swap on one of the
problem machines.

Best wishes,

Chris.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Over-eager swapping
  2010-08-18 16:13                                   ` Wu Fengguang
@ 2010-08-18 16:31                                     ` Chris Webb
  -1 siblings, 0 replies; 75+ messages in thread
From: Chris Webb @ 2010-08-18 16:31 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Lee Schermerhorn, Minchan Kim, linux-mm, linux-kernel,
	KOSAKI Motohiro, Pekka Enberg, Andi Kleen, Christoph Lameter

Wu Fengguang <fengguang.wu@intel.com> writes:

> Chris, can you post /proc/vmstat on the problem machines?

Here's /proc/vmstat from one of the bad machines with swap taken out:

  # cat /proc/vmstat
  nr_free_pages 115572
  nr_inactive_anon 562140
  nr_active_anon 5015609
  nr_inactive_file 997097
  nr_active_file 996989
  nr_unevictable 1368
  nr_mlock 1368
  nr_anon_pages 5862299
  nr_mapped 1414
  nr_file_pages 1994569
  nr_dirty 619
  nr_writeback 0
  nr_slab_reclaimable 88883
  nr_slab_unreclaimable 129859
  nr_page_table_pages 15744
  nr_kernel_stack 1132
  nr_unstable 0
  nr_bounce 0
  nr_vmscan_write 68708505
  nr_writeback_temp 0
  nr_isolated_anon 0
  nr_isolated_file 0
  nr_shmem 14
  numa_hit 15295188815
  numa_miss 9391232519
  numa_foreign 9391232519
  numa_interleave 16982
  numa_local 15294742520
  numa_other 9391678814
  pgpgin 20644565778
  pgpgout 28740368207
  pswpin 63818244
  pswpout 61199234
  pgalloc_dma 0
  pgalloc_dma32 4967135753
  pgalloc_normal 19812671901
  pgalloc_movable 0
  pgfree 24779926775
  pgactivate 1290396237
  pgdeactivate 1289759899
  pgfault 19993995783
  pgmajfault 21059190
  pgrefill_dma 0
  pgrefill_dma32 133366009
  pgrefill_normal 921184739
  pgrefill_movable 0
  pgsteal_dma 0
  pgsteal_dma32 1275354745
  pgsteal_normal 5641309780
  pgsteal_movable 0
  pgscan_kswapd_dma 0
  pgscan_kswapd_dma32 1333139288
  pgscan_kswapd_normal 5870516663
  pgscan_kswapd_movable 0
  pgscan_direct_dma 0
  pgscan_direct_dma32 1064518
  pgscan_direct_normal 13317302
  pgscan_direct_movable 0
  zone_reclaim_failed 0
  pginodesteal 0
  slabs_scanned 1682790400
  kswapd_steal 6902288285
  kswapd_inodesteal 4909342
  pageoutrun 65408579
  allocstall 33223
  pgrotated 68402979
  htlb_buddy_alloc_success 0
  htlb_buddy_alloc_fail 0
  unevictable_pgs_culled 3538872
  unevictable_pgs_scanned 0
  unevictable_pgs_rescued 4989403
  unevictable_pgs_mlocked 5192009
  unevictable_pgs_munlocked 4989074
  unevictable_pgs_cleared 2295
  unevictable_pgs_stranded 0
  unevictable_pgs_mlockfreed 0

The not-so-bad machine that's 3G into swap, which I mentioned previously, has

  # cat /proc/vmstat 
  nr_free_pages 898394
  nr_inactive_anon 834445
  nr_active_anon 4118034
  nr_inactive_file 904411
  nr_active_file 910902
  nr_unevictable 2440
  nr_mlock 2440
  nr_anon_pages 4836349
  nr_mapped 1553
  nr_file_pages 2243152
  nr_dirty 1097
  nr_writeback 0
  nr_slab_reclaimable 88788
  nr_slab_unreclaimable 127310
  nr_page_table_pages 14762
  nr_kernel_stack 532
  nr_unstable 0
  nr_bounce 0
  nr_vmscan_write 37404214
  nr_writeback_temp 0
  nr_isolated_anon 0
  nr_isolated_file 0
  nr_shmem 12
  numa_hit 14220178949
  numa_miss 3903552922
  numa_foreign 3903552922
  numa_interleave 16282
  numa_local 14219905325
  numa_other 3903826546
  pgpgin 6500403846
  pgpgout 13255814979
  pswpin 36384510
  pswpout 36380545
  pgalloc_dma 4
  pgalloc_dma32 2019546454
  pgalloc_normal 16466621455
  pgalloc_movable 0
  pgfree 18487068066
  pgactivate 530670561
  pgdeactivate 506674301
  pgfault 19986735100
  pgmajfault 10611234
  pgrefill_dma 0
  pgrefill_dma32 41306492
  pgrefill_normal 318767138
  pgrefill_movable 0
  pgsteal_dma 0
  pgsteal_dma32 214447663
  pgsteal_normal 1645250232
  pgsteal_movable 0
  pgscan_kswapd_dma 0
  pgscan_kswapd_dma32 218030201
  pgscan_kswapd_normal 1812499810
  pgscan_kswapd_movable 0
  pgscan_direct_dma 0
  pgscan_direct_dma32 157144
  pgscan_direct_normal 1095919
  pgscan_direct_movable 0
  zone_reclaim_failed 0
  pginodesteal 0
  slabs_scanned 50051072
  kswapd_steal 1858447127
  kswapd_inodesteal 202297
  pageoutrun 15070446
  allocstall 3104
  pgrotated 37181651
  htlb_buddy_alloc_success 0
  htlb_buddy_alloc_fail 0
  unevictable_pgs_culled 2113384
  unevictable_pgs_scanned 0
  unevictable_pgs_rescued 3055005
  unevictable_pgs_mlocked 3184675
  unevictable_pgs_munlocked 3045129
  unevictable_pgs_cleared 10034
  unevictable_pgs_stranded 0
  unevictable_pgs_mlockfreed 0
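
(As a hedged side note on the numa_hit/numa_miss counters above: a rough way
to express them as an off-node allocation ratio. Against these two dumps it
works out to roughly 38% off-node on the bad machine and about 22% on the
not-so-bad one:)

  # awk '/^numa_hit/ {hit=$2} /^numa_miss/ {miss=$2} END {printf "off-node allocations: %.1f%%\n", 100*miss/(hit+miss)}' /proc/vmstat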

Best wishes,

Chris.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Over-eager swapping
  2010-08-18 15:57                               ` Christoph Lameter
@ 2010-08-18 16:20                                 ` Wu Fengguang
  -1 siblings, 0 replies; 75+ messages in thread
From: Wu Fengguang @ 2010-08-18 16:20 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Chris Webb, Minchan Kim, linux-mm, linux-kernel, KOSAKI Motohiro,
	Pekka Enberg, Andi Kleen, Lee Schermerhorn

On Wed, Aug 18, 2010 at 11:57:09PM +0800, Christoph Lameter wrote:
> On Wed, 18 Aug 2010, Wu Fengguang wrote:
> 
> > Andi, Christoph and Lee:
> >
> > This looks like an "unbalanced NUMA memory usage leading to premature
> > swapping" problem.
> 
> Is zone reclaim active? It may not activate on smaller systems, leading
> to unbalanced memory usage between nodes.

Another possibility is that there are many low-watermark page allocations,
leading to kswapd page-out activity.
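
(A quick way to eyeball how close each zone sits to its watermarks, plus the
related reclaim counters; a sketch, assuming the usual /proc/zoneinfo layout
where the four lines after each zone header are pages free, min, low and
high:)

  # grep -A4 '^Node' /proc/zoneinfo
  # grep -E 'allocstall|pageoutrun|kswapd' /proc/vmstat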

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Over-eager swapping
  2010-08-18 15:58                                 ` Chris Webb
@ 2010-08-18 16:13                                   ` Wu Fengguang
  -1 siblings, 0 replies; 75+ messages in thread
From: Wu Fengguang @ 2010-08-18 16:13 UTC (permalink / raw)
  To: Chris Webb
  Cc: Lee Schermerhorn, Minchan Kim, linux-mm, linux-kernel,
	KOSAKI Motohiro, Pekka Enberg, Andi Kleen, Christoph Lameter

On Wed, Aug 18, 2010 at 11:58:25PM +0800, Chris Webb wrote:
> Lee Schermerhorn <Lee.Schermerhorn@hp.com> writes:
> 
> > On Wed, 2010-08-18 at 23:21 +0800, Wu Fengguang wrote:
> > > Andi, Christoph and Lee:
> > > 
> > > This looks like an "unbalanced NUMA memory usage leading to premature
> > > swapping" problem.
> > 
> > What is the value of the vm.zone_reclaim_mode sysctl?  If it is !0, the
> > system will go into zone reclaim before allocating off-node pages.
> > However, it shouldn't "swap" in this case unless (zone_reclaim_mode & 4)
> > != 0.  And even then, zone reclaim should only reclaim file pages, not
> > anon.  In theory...
> 
> Hi. This is zero on all our machines:
> 
> # sysctl vm.zone_reclaim_mode
> vm.zone_reclaim_mode = 0

Chris, can you post /proc/vmstat on the problem machines?

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Over-eager swapping
  2010-08-18 15:58                                 ` Chris Webb
@ 2010-08-18 16:13                                   ` Christoph Lameter
  -1 siblings, 0 replies; 75+ messages in thread
From: Christoph Lameter @ 2010-08-18 16:13 UTC (permalink / raw)
  To: Chris Webb
  Cc: Lee Schermerhorn, Wu Fengguang, Minchan Kim, linux-mm,
	linux-kernel, KOSAKI Motohiro, Pekka Enberg, Andi Kleen

On Wed, 18 Aug 2010, Chris Webb wrote:

> > != 0.  And even then, zone reclaim should only reclaim file pages, not
> > anon.  In theory...
>
> Hi. This is zero on all our machines:
>
> # sysctl vm.zone_reclaim_mode
> vm.zone_reclaim_mode = 0

Set it to 1.
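
For example (this takes effect immediately; add it to /etc/sysctl.conf to
persist across reboots):

  # sysctl -w vm.zone_reclaim_mode=1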


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Over-eager swapping
  2010-08-18 15:57                               ` Lee Schermerhorn
@ 2010-08-18 15:58                                 ` Chris Webb
  -1 siblings, 0 replies; 75+ messages in thread
From: Chris Webb @ 2010-08-18 15:58 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Wu Fengguang, Minchan Kim, linux-mm, linux-kernel,
	KOSAKI Motohiro, Pekka Enberg, Andi Kleen, Christoph Lameter

Lee Schermerhorn <Lee.Schermerhorn@hp.com> writes:

> On Wed, 2010-08-18 at 23:21 +0800, Wu Fengguang wrote:
> > Andi, Christoph and Lee:
> > 
> > This looks like an "unbalanced NUMA memory usage leading to premature
> > swapping" problem.
> 
> What is the value of the vm.zone_reclaim_mode sysctl?  If it is !0, the
> system will go into zone reclaim before allocating off-node pages.
> However, it shouldn't "swap" in this case unless (zone_reclaim_mode & 4)
> != 0.  And even then, zone reclaim should only reclaim file pages, not
> anon.  In theory...

Hi. This is zero on all our machines:

# sysctl vm.zone_reclaim_mode
vm.zone_reclaim_mode = 0

Cheers,

Chris.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Over-eager swapping
  2010-08-18 15:21                             ` Wu Fengguang
@ 2010-08-18 15:57                               ` Lee Schermerhorn
  -1 siblings, 0 replies; 75+ messages in thread
From: Lee Schermerhorn @ 2010-08-18 15:57 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Chris Webb, Minchan Kim, linux-mm, linux-kernel, KOSAKI Motohiro,
	Pekka Enberg, Andi Kleen, Christoph Lameter

On Wed, 2010-08-18 at 23:21 +0800, Wu Fengguang wrote:
> Andi, Christoph and Lee:
> 
> This looks like an "unbalanced NUMA memory usage leading to premature
> swapping" problem.

What is the value of the vm.zone_reclaim_mode sysctl?  If it is !0, the
system will go into zone reclaim before allocating off-node pages.
However, it shouldn't "swap" in this case unless (zone_reclaim_mode & 4)
!= 0.  And even then, zone reclaim should only reclaim file pages, not
anon.  In theory...

Note:  zone_reclaim_mode will be enabled by default [= 1] if the SLIT
contains any distances > 2.0 [20].  Check SLIT values via 'numactl
--hardware'.
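
For reference, a quick way to check both (bit 0 of zone_reclaim_mode turns
zone reclaim on, bit 1 allows writing out dirty pages, bit 2 allows swap):

  # cat /proc/sys/vm/zone_reclaim_mode
  # numactl --hardware | grep -A3 'node distances'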

Lee

> 
> Thanks,
> Fengguang
> 
> On Wed, Aug 18, 2010 at 10:46:59PM +0800, Chris Webb wrote:
> > Wu Fengguang <fengguang.wu@intel.com> writes:
> > 
> > > Did you enable any NUMA policy? That could start swapping even if
> > > there are lots of free pages in some nodes.
> > 
> > Hi. Thanks for the follow-up. We haven't done any configuration or tuning of
> > NUMA behaviour, but NUMA support is definitely compiled into the kernel:
> > 
> >   # zgrep NUMA /proc/config.gz 
> >   CONFIG_NUMA_IRQ_DESC=y
> >   CONFIG_NUMA=y
> >   CONFIG_K8_NUMA=y
> >   CONFIG_X86_64_ACPI_NUMA=y
> >   # CONFIG_NUMA_EMU is not set
> >   CONFIG_ACPI_NUMA=y
> >   # grep -i numa /var/log/dmesg.boot 
> >   NUMA: Allocated memnodemap from b000 - 1b540
> >   NUMA: Using 20 for the hash shift.
> > 
> > > Are your free pages equally distributed over the nodes? Or limited to
> > > some of the nodes? Try this command:
> > > 
> > >         grep MemFree /sys/devices/system/node/node*/meminfo
> > 
> > My worst-case machines currently have swap completely turned off to make them
> > usable for clients, but I have one machine which is about 3GB into swap with
> > 8GB of buffers and 3GB free. This shows
> > 
> >   # grep MemFree /sys/devices/system/node/node*/meminfo
> >   /sys/devices/system/node/node0/meminfo:Node 0 MemFree:          954500 kB
> >   /sys/devices/system/node/node1/meminfo:Node 1 MemFree:         2374528 kB
> > 
> > I could definitely imagine that one of the nodes could have dipped down to
> > zero in the past. I'll try enabling swap on one of our machines with the bad
> > problem late tonight and repeat the experiment. The node meminfo on this box
> > currently looks like
> > 
> >   # grep MemFree /sys/devices/system/node/node*/meminfo
> >   /sys/devices/system/node/node0/meminfo:Node 0 MemFree:           82732 kB
> >   /sys/devices/system/node/node1/meminfo:Node 1 MemFree:         1723896 kB
> > 
> > Best wishes,
> > 
> > Chris.
> 



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Over-eager swapping
  2010-08-18 15:21                             ` Wu Fengguang
@ 2010-08-18 15:57                               ` Christoph Lameter
  -1 siblings, 0 replies; 75+ messages in thread
From: Christoph Lameter @ 2010-08-18 15:57 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Chris Webb, Minchan Kim, linux-mm, linux-kernel, KOSAKI Motohiro,
	Pekka Enberg, Andi Kleen, Lee Schermerhorn

On Wed, 18 Aug 2010, Wu Fengguang wrote:

> Andi, Christoph and Lee:
>
> This looks like an "unbalanced NUMA memory usage leading to premature
> swapping" problem.

Is zone reclaim active? It may not activate on smaller systems, leading
to unbalanced memory usage between nodes.



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Over-eager swapping
  2010-08-18 14:46                           ` Chris Webb
@ 2010-08-18 15:21                             ` Wu Fengguang
  -1 siblings, 0 replies; 75+ messages in thread
From: Wu Fengguang @ 2010-08-18 15:21 UTC (permalink / raw)
  To: Chris Webb
  Cc: Minchan Kim, linux-mm, linux-kernel, KOSAKI Motohiro,
	Pekka Enberg, Andi Kleen, Lee Schermerhorn, Christoph Lameter

Andi, Christoph and Lee:

This looks like an "unbalanced NUMA memory usage leading to premature
swapping" problem.

Thanks,
Fengguang

On Wed, Aug 18, 2010 at 10:46:59PM +0800, Chris Webb wrote:
> Wu Fengguang <fengguang.wu@intel.com> writes:
> 
> > Did you enable any NUMA policy? That could start swapping even if
> > there are lots of free pages in some nodes.
> 
> Hi. Thanks for the follow-up. We haven't done any configuration or tuning of
> NUMA behaviour, but NUMA support is definitely compiled into the kernel:
> 
>   # zgrep NUMA /proc/config.gz 
>   CONFIG_NUMA_IRQ_DESC=y
>   CONFIG_NUMA=y
>   CONFIG_K8_NUMA=y
>   CONFIG_X86_64_ACPI_NUMA=y
>   # CONFIG_NUMA_EMU is not set
>   CONFIG_ACPI_NUMA=y
>   # grep -i numa /var/log/dmesg.boot 
> >   NUMA: Allocated memnodemap from b000 - 1b540
>   NUMA: Using 20 for the hash shift.
> 
> > Are your free pages equally distributed over the nodes? Or limited to
> > some of the nodes? Try this command:
> > 
> >         grep MemFree /sys/devices/system/node/node*/meminfo
> 
> My worst-case machines currently have swap completely turned off to make them
> usable for clients, but I have one machine which is about 3GB into swap with
> 8GB of buffers and 3GB free. This shows
> 
>   # grep MemFree /sys/devices/system/node/node*/meminfo
>   /sys/devices/system/node/node0/meminfo:Node 0 MemFree:          954500 kB
>   /sys/devices/system/node/node1/meminfo:Node 1 MemFree:         2374528 kB
> 
> I could definitely imagine that one of the nodes could have dipped down to
> zero in the past. I'll try enabling swap on one of our machines with the bad
> problem late tonight and repeat the experiment. The node meminfo on this box
> currently looks like
> 
>   # grep MemFree /sys/devices/system/node/node*/meminfo
>   /sys/devices/system/node/node0/meminfo:Node 0 MemFree:           82732 kB
>   /sys/devices/system/node/node1/meminfo:Node 1 MemFree:         1723896 kB
> 
> Best wishes,
> 
> Chris.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Over-eager swapping
  2010-08-18 14:38                         ` Wu Fengguang
@ 2010-08-18 14:46                           ` Chris Webb
  -1 siblings, 0 replies; 75+ messages in thread
From: Chris Webb @ 2010-08-18 14:46 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Minchan Kim, linux-mm, linux-kernel, KOSAKI Motohiro, Pekka Enberg

Wu Fengguang <fengguang.wu@intel.com> writes:

> Did you enable any NUMA policy? That could start swapping even if
> there are lots of free pages in some nodes.

Hi. Thanks for the follow-up. We haven't done any configuration or tuning of
NUMA behaviour, but NUMA support is definitely compiled into the kernel:

  # zgrep NUMA /proc/config.gz 
  CONFIG_NUMA_IRQ_DESC=y
  CONFIG_NUMA=y
  CONFIG_K8_NUMA=y
  CONFIG_X86_64_ACPI_NUMA=y
  # CONFIG_NUMA_EMU is not set
  CONFIG_ACPI_NUMA=y
  # grep -i numa /var/log/dmesg.boot 
  NUMA: Allocated memnodemap from b000 - 1b540
  NUMA: Using 20 for the hash shift.

> Are your free pages equally distributed over the nodes? Or limited to
> some of the nodes? Try this command:
> 
>         grep MemFree /sys/devices/system/node/node*/meminfo

My worst-case machines currently have swap completely turned off to make them
usable for clients, but I have one machine which is about 3GB into swap with
8GB of buffers and 3GB free. This shows

  # grep MemFree /sys/devices/system/node/node*/meminfo
  /sys/devices/system/node/node0/meminfo:Node 0 MemFree:          954500 kB
  /sys/devices/system/node/node1/meminfo:Node 1 MemFree:         2374528 kB

I could definitely imagine that one of the nodes could have dipped down to
zero in the past. I'll try enabling swap on one of our machines with the bad
problem late tonight and repeat the experiment. The node meminfo on this box
currently looks like

  # grep MemFree /sys/devices/system/node/node*/meminfo
  /sys/devices/system/node/node0/meminfo:Node 0 MemFree:           82732 kB
  /sys/devices/system/node/node1/meminfo:Node 1 MemFree:         1723896 kB

Best wishes,

Chris.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Over-eager swapping
  2010-08-04 12:04                       ` Chris Webb
@ 2010-08-18 14:38                         ` Wu Fengguang
  -1 siblings, 0 replies; 75+ messages in thread
From: Wu Fengguang @ 2010-08-18 14:38 UTC (permalink / raw)
  To: Chris Webb
  Cc: Minchan Kim, linux-mm, linux-kernel, KOSAKI Motohiro, Pekka Enberg

Chris,

Did you enable any NUMA policy? That could start swapping even if
there are lots of free pages in some nodes.

Are your free pages equally distributed over the nodes? Or limited to
some of the nodes? Try this command:

        grep MemFree /sys/devices/system/node/node*/meminfo
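
The per-node allocation counters (numa_hit/numa_miss/numa_foreign) may also
show whether one node keeps running dry; the sysfs files below should exist
on any NUMA-enabled kernel:

        grep . /sys/devices/system/node/node*/numastat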

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Over-eager swapping
  2010-08-04 11:49                     ` Wu Fengguang
@ 2010-08-04 12:04                       ` Chris Webb
  -1 siblings, 0 replies; 75+ messages in thread
From: Chris Webb @ 2010-08-04 12:04 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Minchan Kim, linux-mm, linux-kernel, KOSAKI Motohiro, Pekka Enberg

Wu Fengguang <fengguang.wu@intel.com> writes:

> Maybe turn off KSM? It helps to isolate problems. It's a relatively new
> and complex feature after all.

Good idea! I'll give that a go on one of the machines without swap at the
moment, re-add the swap with ksm turned off, and see what happens.

> > However, your suggestion is right that the CPU loads on these machines are
> > typically quite high. The large number of kvm virtual machines they run means
> > that loads of eight or even sixteen in /proc/loadavg are not unusual, and
> > these are higher when there's swap than after it has been removed. I assume
> > this is mostly because of increased IO wait, as this number increases
> > significantly in top.
> 
> iowait = CPU (idle) waiting for disk IO
> 
> So iowait means not CPU load, but somehow disk load :)

Sorry, yes, I wrote very unclearly here. What I should have written is that
the load numbers are fairly high even without swap, when the IO wait figure
is pretty small. This is presumably normal CPU load from the guests.

The load average rises significantly when swap is added, but I think that
rise is due to an increase in processes waiting for IO (io wait %age
increases considerably) rather than extra CPU work. Presumably this is the
IO from swapping.
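
(For reference, watching something like

  # vmstat 5

during one of these episodes should make this visible: the "b" column is
processes blocked on IO, "si"/"so" are swap-in/swap-out, and "wa" is the
iowait percentage.)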

Cheers,

Chris.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Over-eager swapping
  2010-08-04  9:58                   ` Chris Webb
@ 2010-08-04 11:49                     ` Wu Fengguang
  -1 siblings, 0 replies; 75+ messages in thread
From: Wu Fengguang @ 2010-08-04 11:49 UTC (permalink / raw)
  To: Chris Webb
  Cc: Minchan Kim, linux-mm, linux-kernel, KOSAKI Motohiro, Pekka Enberg

On Wed, Aug 04, 2010 at 05:58:12PM +0800, Chris Webb wrote:
> Wu Fengguang <fengguang.wu@intel.com> writes:
> 
> > This is interesting. Why is it waiting for 1m here? Are there high CPU
> > loads? Would you do a
> > 
> >         echo t > /proc/sysrq-trigger
> > 
> > and show us the dmesg?
> 
> Annoyingly, magic-sysrq isn't compiled in on these kernels. Is there another
> way I can get this info for you? Replacing the kernels on the machines is a
> painful job as I have to give the clients running on them quite a bit of
> notice of the reboot, and I haven't been able to reproduce the problem on a
> test machine.

Maybe turn off KSM? It helps to isolate problems. It's a relatively new
and complex feature after all.

> I also think the swap use is much better following a reboot, and only starts
> to spiral out of control after the machines have been running for a week or
> so.

Something deteriorates over a long time... It may take time to catch this bug.

> However, your suggestion is right that the CPU loads on these machines are
> typically quite high. The large number of kvm virtual machines they run means
> that loads of eight or even sixteen in /proc/loadavg are not unusual, and
> these are higher when there's swap than after it has been removed. I assume
> this is mostly because of increased IO wait, as this number increases
> significantly in top.

iowait = CPU (idle) waiting for disk IO

So iowait indicates disk load rather than CPU load :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Over-eager swapping
  2010-08-04  3:24                 ` Wu Fengguang
@ 2010-08-04  9:58                   ` Chris Webb
  -1 siblings, 0 replies; 75+ messages in thread
From: Chris Webb @ 2010-08-04  9:58 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Minchan Kim, linux-mm, linux-kernel, KOSAKI Motohiro, Pekka Enberg

Wu Fengguang <fengguang.wu@intel.com> writes:

> This is interesting. Why is it waiting for 1m here? Are there high CPU
> loads? Would you do a
> 
>         echo t > /proc/sysrq-trigger
> 
> and show us the dmesg?

Annoyingly, magic-sysrq isn't compiled in on these kernels. Is there another
way I can get this info for you? Replacing the kernels on the machines is a
painful job as I have to give the clients running on them quite a bit of
notice of the reboot, and I haven't been able to reproduce the problem on a
test machine.

I also think the swap use is much better following a reboot, and only starts
to spiral out of control after the machines have been running for a week or
so.

However, your suggestion is right that the CPU loads on these machines are
typically quite high. The large number of kvm virtual machines they run means
that loads of eight or even sixteen in /proc/loadavg are not unusual, and
these are higher when there's swap than after it has been removed. I assume
this is mostly because of increased IO wait, as this number increases
significantly in top.

Cheers,

Chris.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Over-eager swapping
@ 2010-08-04  3:24                 ` Wu Fengguang
  0 siblings, 0 replies; 75+ messages in thread
From: Wu Fengguang @ 2010-08-04  3:24 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Chris Webb, linux-mm, linux-kernel, KOSAKI Motohiro, Pekka Enberg

On Wed, Aug 04, 2010 at 11:10:46AM +0800, Minchan Kim wrote:
> On Wed, Aug 4, 2010 at 11:21 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> > Chris,
> >
> > Your slabinfo does contain many order 1-3 slab caches, this is a major source
> > of high order allocations and hence lumpy reclaim. fork() is another.
> >
> > In another thread, Pekka Enberg offers a tip:
> >
> >        You can pass "slub_debug=o" as a kernel parameter to disable higher
> >        order allocations if you want to test things.
> >
> > Note that the parameter works on a CONFIG_SLUB_DEBUG=y kernel.
> >
> > Thanks,
> > Fengguang
> 
> He said the following:
> "After running swapoff -a, the machine is immediately much healthier. Even
> while the swap is still being reduced, load goes down and response times in
> virtual machines are much improved. Once the swap is completely gone, there
> are still several gigabytes of RAM left free which are used for buffers, and
> the virtual machines are no longer laggy because they are no longer swapped
> out.
>
> Running swapon -a again, the affected machine waits for about a minute
> with zero swap in use,

This is interesting. Why is it waiting for 1m here? Are there high CPU
loads? Would you do a

        echo t > /proc/sysrq-trigger

and show us the dmesg?

Thanks,
Fengguang

> before the amount of swap in use very rapidly
> increases to around 2GB and then continues to increase more steadily to 3GB."
> 
> 1. His system works well without swap.
> 2. His system increase swap by 2G rapidly and more steadily to 3GB.
> 
> So I thought it isn't likely to relate normal lumpy.
> 
> Of course, without swap, lumpy can scan more file pages to make
> contiguous page frames, so it could still work well. But I can't
> understand 2.
> 
> Hmm, I have no idea. :(
> 
> Off-Topic:
> 
> Hi, Pekka.
> 
> Document says.
> "Debugging options may require the minimum possible slab order to increase as
> a result of storing the metadata (for example, caches with PAGE_SIZE object
> sizes).  This has a higher likelihood of resulting in slab allocation errors
> in low memory situations or if there's high fragmentation of memory.  To
> switch off debugging for such caches by default, use
> 
>        slub_debug=O"
> 
> But when I tested it on my machine (2.6.34) with slub_debug=O, it
> increases objsize and pagesperslab. It even increases the number of
> slabs (but I am not sure about this part, since the measurements might
> not be from the same time after booting).
> What am I missing now?
> 
> But SLAB seems to consume smaller pages than SLUB. Hmm.
> Is SLAB more suitable than SLUB on small-memory systems (e.g. embedded)?
> 
> 
> --
> Kind regards,
> Minchan Kim

> 
> slabinfo - version: 2.1
> # name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> kvm_vcpu               0      0   9200    3    8 : tunables    0    0    0 : slabdata      0      0      0
> kmalloc_dma-512       16     16    512   16    2 : tunables    0    0    0 : slabdata      1      1      0
> RAWv6                 17     17    960   17    4 : tunables    0    0    0 : slabdata      1      1      0
> UDPLITEv6              0      0    960   17    4 : tunables    0    0    0 : slabdata      0      0      0
> UDPv6                 51     51    960   17    4 : tunables    0    0    0 : slabdata      3      3      0
> TCPv6                 72     72   1728   18    8 : tunables    0    0    0 : slabdata      4      4      0
> nf_conntrack_c10a8540      0      0    280   29    2 : tunables    0    0    0 : slabdata      0      0      0
> dm_raid1_read_record      0      0   1056   31    8 : tunables    0    0    0 : slabdata      0      0      0
> dm_uevent              0      0   2464   13    8 : tunables    0    0    0 : slabdata      0      0      0
> mqueue_inode_cache     18     18    896   18    4 : tunables    0    0    0 : slabdata      1      1      0
> fuse_request          18     18    432   18    2 : tunables    0    0    0 : slabdata      1      1      0
> fuse_inode            21     21    768   21    4 : tunables    0    0    0 : slabdata      1      1      0
> nfsd4_stateowners      0      0    344   23    2 : tunables    0    0    0 : slabdata      0      0      0
> nfs_read_data         72     72    448   18    2 : tunables    0    0    0 : slabdata      4      4      0
> nfs_inode_cache        0      0   1040   31    8 : tunables    0    0    0 : slabdata      0      0      0
> ecryptfs_inode_cache      0      0   1280   25    8 : tunables    0    0    0 : slabdata      0      0      0
> hugetlbfs_inode_cache     24     24    656   24    4 : tunables    0    0    0 : slabdata      1      1      0
> ext4_inode_cache       0      0   1128   29    8 : tunables    0    0    0 : slabdata      0      0      0
> ext2_inode_cache       0      0    944   17    4 : tunables    0    0    0 : slabdata      0      0      0
> ext3_inode_cache    5032   5032    928   17    4 : tunables    0    0    0 : slabdata    296    296      0
> rpc_inode_cache       18     18    896   18    4 : tunables    0    0    0 : slabdata      1      1      0
> UNIX                 532    532    832   19    4 : tunables    0    0    0 : slabdata     28     28      0
> UDP-Lite               0      0    832   19    4 : tunables    0    0    0 : slabdata      0      0      0
> UDP                   76     76    832   19    4 : tunables    0    0    0 : slabdata      4      4      0
> TCP                   60     60   1600   20    8 : tunables    0    0    0 : slabdata      3      3      0
> sgpool-128            48     48   2560   12    8 : tunables    0    0    0 : slabdata      4      4      0
> sgpool-64            100    100   1280   25    8 : tunables    0    0    0 : slabdata      4      4      0
> blkdev_queue          76     76   1688   19    8 : tunables    0    0    0 : slabdata      4      4      0
> biovec-256            10     10   3072   10    8 : tunables    0    0    0 : slabdata      1      1      0
> biovec-128            21     21   1536   21    8 : tunables    0    0    0 : slabdata      1      1      0
> biovec-64             84     84    768   21    4 : tunables    0    0    0 : slabdata      4      4      0
> bip-256               10     10   3200   10    8 : tunables    0    0    0 : slabdata      1      1      0
> bip-128                0      0   1664   19    8 : tunables    0    0    0 : slabdata      0      0      0
> bip-64                 0      0    896   18    4 : tunables    0    0    0 : slabdata      0      0      0
> bip-16               100    100    320   25    2 : tunables    0    0    0 : slabdata      4      4      0
> sock_inode_cache     609    609    768   21    4 : tunables    0    0    0 : slabdata     29     29      0
> skbuff_fclone_cache     84     84    384   21    2 : tunables    0    0    0 : slabdata      4      4      0
> shmem_inode_cache   1835   1840    784   20    4 : tunables    0    0    0 : slabdata     92     92      0
> taskstats             96     96    328   24    2 : tunables    0    0    0 : slabdata      4      4      0
> proc_inode_cache    1584   1584    680   24    4 : tunables    0    0    0 : slabdata     66     66      0
> bdev_cache            72     72    896   18    4 : tunables    0    0    0 : slabdata      4      4      0
> inode_cache         7126   7128    656   24    4 : tunables    0    0    0 : slabdata    297    297      0
> signal_cache         332    350    640   25    4 : tunables    0    0    0 : slabdata     14     14      0
> sighand_cache        246    253   1408   23    8 : tunables    0    0    0 : slabdata     11     11      0
> task_xstate          193    196    576   28    4 : tunables    0    0    0 : slabdata      7      7      0
> task_struct          274    285   5472    5    8 : tunables    0    0    0 : slabdata     57     57      0
> radix_tree_node     3208   3213    296   27    2 : tunables    0    0    0 : slabdata    119    119      0
> kmalloc-8192          20     20   8192    4    8 : tunables    0    0    0 : slabdata      5      5      0
> kmalloc-4096          78     80   4096    8    8 : tunables    0    0    0 : slabdata     10     10      0
> kmalloc-2048         400    400   2048   16    8 : tunables    0    0    0 : slabdata     25     25      0
> kmalloc-1024         326    336   1024   16    4 : tunables    0    0    0 : slabdata     21     21      0
> kmalloc-512          758    784    512   16    2 : tunables    0    0    0 : slabdata     49     49      0

> slabinfo - version: 2.1
> # name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> kvm_vcpu               0      0   9248    3    8 : tunables    0    0    0 : slabdata      0      0      0
> kmalloc_dma-512       29     29    560   29    4 : tunables    0    0    0 : slabdata      1      1      0
> clip_arp_cache         0      0    320   25    2 : tunables    0    0    0 : slabdata      0      0      0
> ip6_dst_cache         25     25    320   25    2 : tunables    0    0    0 : slabdata      1      1      0
> ndisc_cache           25     25    320   25    2 : tunables    0    0    0 : slabdata      1      1      0
> RAWv6                 16     16   1024   16    4 : tunables    0    0    0 : slabdata      1      1      0
> UDPLITEv6              0      0    960   17    4 : tunables    0    0    0 : slabdata      0      0      0
> UDPv6                 68     68    960   17    4 : tunables    0    0    0 : slabdata      4      4      0
> tw_sock_TCPv6          0      0    320   25    2 : tunables    0    0    0 : slabdata      0      0      0
> TCPv6                 36     36   1792   18    8 : tunables    0    0    0 : slabdata      2      2      0
> nf_conntrack_c10a8540      0      0    320   25    2 : tunables    0    0    0 : slabdata      0      0      0
> dm_raid1_read_record      0      0   1096   29    8 : tunables    0    0    0 : slabdata      0      0      0
> kcopyd_job             0      0    376   21    2 : tunables    0    0    0 : slabdata      0      0      0
> dm_uevent              0      0   2504   13    8 : tunables    0    0    0 : slabdata      0      0      0
> dm_rq_target_io        0      0    272   30    2 : tunables    0    0    0 : slabdata      0      0      0
> mqueue_inode_cache     17     17    960   17    4 : tunables    0    0    0 : slabdata      1      1      0
> fuse_request          17     17    480   17    2 : tunables    0    0    0 : slabdata      1      1      0
> fuse_inode            19     19    832   19    4 : tunables    0    0    0 : slabdata      1      1      0
> nfsd4_stateowners      0      0    392   20    2 : tunables    0    0    0 : slabdata      0      0      0
> nfs_write_data        48     48    512   16    2 : tunables    0    0    0 : slabdata      3      3      0
> nfs_read_data         32     32    512   16    2 : tunables    0    0    0 : slabdata      2      2      0
> nfs_inode_cache        0      0   1080   30    8 : tunables    0    0    0 : slabdata      0      0      0
> ecryptfs_key_record_cache      0      0    576   28    4 : tunables    0    0    0 : slabdata      0      0      0
> ecryptfs_sb_cache      0      0    640   25    4 : tunables    0    0    0 : slabdata      0      0      0
> ecryptfs_inode_cache      0      0   1280   25    8 : tunables    0    0    0 : slabdata      0      0      0
> ecryptfs_auth_tok_list_item      0      0    896   18    4 : tunables    0    0    0 : slabdata      0      0      0
> hugetlbfs_inode_cache     23     23    696   23    4 : tunables    0    0    0 : slabdata      1      1      0
> ext4_inode_cache       0      0   1168   28    8 : tunables    0    0    0 : slabdata      0      0      0
> ext2_inode_cache       0      0    984   16    4 : tunables    0    0    0 : slabdata      0      0      0
> ext3_inode_cache    5391   5392    968   16    4 : tunables    0    0    0 : slabdata    337    337      0
> dquot                  0      0    320   25    2 : tunables    0    0    0 : slabdata      0      0      0
> kioctx                 0      0    384   21    2 : tunables    0    0    0 : slabdata      0      0      0
> rpc_buffers           30     30   2112   15    8 : tunables    0    0    0 : slabdata      2      2      0
> rpc_inode_cache       18     18    896   18    4 : tunables    0    0    0 : slabdata      1      1      0
> UNIX                 556    558    896   18    4 : tunables    0    0    0 : slabdata     31     31      0
> UDP-Lite               0      0    832   19    4 : tunables    0    0    0 : slabdata      0      0      0
> ip_dst_cache         125    125    320   25    2 : tunables    0    0    0 : slabdata      5      5      0
> arp_cache            100    100    320   25    2 : tunables    0    0    0 : slabdata      4      4      0
> RAW                   19     19    832   19    4 : tunables    0    0    0 : slabdata      1      1      0
> UDP                   76     76    832   19    4 : tunables    0    0    0 : slabdata      4      4      0
> TCP                   76     76   1664   19    8 : tunables    0    0    0 : slabdata      4      4      0
> sgpool-128            48     48   2624   12    8 : tunables    0    0    0 : slabdata      4      4      0
> sgpool-64             96     96   1344   24    8 : tunables    0    0    0 : slabdata      4      4      0
> sgpool-32             92     92    704   23    4 : tunables    0    0    0 : slabdata      4      4      0
> sgpool-16             84     84    384   21    2 : tunables    0    0    0 : slabdata      4      4      0
> blkdev_queue          72     72   1736   18    8 : tunables    0    0    0 : slabdata      4      4      0
> biovec-256            10     10   3136   10    8 : tunables    0    0    0 : slabdata      1      1      0
> biovec-128            20     20   1600   20    8 : tunables    0    0    0 : slabdata      1      1      0
> biovec-64             76     76    832   19    4 : tunables    0    0    0 : slabdata      4      4      0
> bip-256               10     10   3200   10    8 : tunables    0    0    0 : slabdata      1      1      0
> bip-128                0      0   1664   19    8 : tunables    0    0    0 : slabdata      0      0      0
> bip-64                 0      0    896   18    4 : tunables    0    0    0 : slabdata      0      0      0
> bip-16                 0      0    320   25    2 : tunables    0    0    0 : slabdata      0      0      0
> sock_inode_cache     629    630    768   21    4 : tunables    0    0    0 : slabdata     30     30      0
> skbuff_fclone_cache     72     72    448   18    2 : tunables    0    0    0 : slabdata      4      4      0
> shmem_inode_cache   1862   1862    824   19    4 : tunables    0    0    0 : slabdata     98     98      0
> taskstats             84     84    376   21    2 : tunables    0    0    0 : slabdata      4      4      0
> proc_inode_cache    1623   1650    720   22    4 : tunables    0    0    0 : slabdata     75     75      0
> bdev_cache            68     68    960   17    4 : tunables    0    0    0 : slabdata      4      4      0
> inode_cache         7125   7130    696   23    4 : tunables    0    0    0 : slabdata    310    310      0
> mm_struct            135    138    704   23    4 : tunables    0    0    0 : slabdata      6      6      0
> files_cache          142    150    320   25    2 : tunables    0    0    0 : slabdata      6      6      0
> signal_cache         229    230    704   23    4 : tunables    0    0    0 : slabdata     10     10      0
> sighand_cache        228    230   1408   23    8 : tunables    0    0    0 : slabdata     10     10      0
> task_xstate          195    200    640   25    4 : tunables    0    0    0 : slabdata      8      8      0
> task_struct          271    285   5520    5    8 : tunables    0    0    0 : slabdata     57     57      0
> radix_tree_node     3484   3504    336   24    2 : tunables    0    0    0 : slabdata    146    146      0
> kmalloc-8192          20     20   8192    4    8 : tunables    0    0    0 : slabdata      5      5      0
> kmalloc-4096          79     80   4096    8    8 : tunables    0    0    0 : slabdata     10     10      0
> kmalloc-2048         388    390   2096   15    8 : tunables    0    0    0 : slabdata     26     26      0
> kmalloc-1024         382    390   1072   30    8 : tunables    0    0    0 : slabdata     13     13      0
> kmalloc-512          796    812    560   29    4 : tunables    0    0    0 : slabdata     28     28      0
> kmalloc-256          153    156    304   26    2 : tunables    0    0    0 : slabdata      6      6      0


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Over-eager swapping
  2010-08-04  2:21             ` Wu Fengguang
  (?)
@ 2010-08-04  3:10             ` Minchan Kim
  2010-08-04  3:24                 ` Wu Fengguang
  -1 siblings, 1 reply; 75+ messages in thread
From: Minchan Kim @ 2010-08-04  3:10 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Chris Webb, linux-mm, linux-kernel, KOSAKI Motohiro, Pekka Enberg

[-- Attachment #1: Type: text/plain, Size: 2349 bytes --]

On Wed, Aug 4, 2010 at 11:21 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> Chris,
>
> Your slabinfo does contain many order 1-3 slab caches; these are a major source
> of high-order allocations and hence of lumpy reclaim. fork() is another.
>
> In another thread, Pekka Enberg offers a tip:
>
>        You can pass "slub_debug=o" as a kernel parameter to disable higher
>        order allocations if you want to test things.
>
> Note that the parameter works on a CONFIG_SLUB_DEBUG=y kernel.
>
> Thanks,
> Fengguang

He said the following:
"After running swapoff -a, the machine is immediately much healthier. Even
while the swap is still being reduced, load goes down and response times in
virtual machines are much improved. Once the swap is completely gone, there
are still several gigabytes of RAM left free which are used for buffers, and
the virtual machines are no longer laggy because they are no longer swapped
out. Running swapon -a again, the affected machine waits for about a minute
with zero swap in use, before the amount of swap in use very rapidly
increases to around 2GB and then continues to increase more steadily to 3GB."

1. His system works well without swap.
2. His system increases swap usage rapidly to 2GB and then more steadily to 3GB.

So I don't think this is likely to be related to normal lumpy reclaim.

Of course, without swap, lumpy reclaim can scan more file pages to make
contiguous page frames, so it could still work well. But I can't
understand point 2.

Hmm, I have no idea. :(

Off-Topic:

Hi, Pekka.

The documentation says:
"Debugging options may require the minimum possible slab order to increase as
a result of storing the metadata (for example, caches with PAGE_SIZE object
sizes).  This has a higher likelihood of resulting in slab allocation errors
in low memory situations or if there's high fragmentation of memory.  To
switch off debugging for such caches by default, use

       slub_debug=O"

But when I tested it on my machine (2.6.34) with slub_debug=O, it
increased objsize and pagesperslab. It even increased the number of
slabs (but I am not sure about that part, since the two readings were not
taken at the same time after booting).
What am I missing?

But SLAB seems to consume smaller pages than SLUB. Hmm.
Is SLAB more suitable than SLUB on small-memory systems (e.g. embedded)?
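
(For what it's worth, a minimal sketch for comparing the actual per-cache
page order between a boot with and without slub_debug=O, assuming SLUB's
sysfs interface under /sys/kernel/slab is available:

        # Print each SLUB cache's current allocation order
        # (order 0 = one page per slab, order 3 = eight pages per slab).
        for c in /sys/kernel/slab/*; do
                printf '%-28s order %s\n' "$(basename "$c")" "$(cat "$c"/order)"
        done

Comparing that output across the two boots shows directly whether the
parameter lowered any cache's order, independently of what /proc/slabinfo
reports.)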


--
Kind regards,
Minchan Kim

[-- Attachment #2: slub_debug.log --]
[-- Type: text/x-log, Size: 5786 bytes --]


slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
kvm_vcpu               0      0   9200    3    8 : tunables    0    0    0 : slabdata      0      0      0
kmalloc_dma-512       16     16    512   16    2 : tunables    0    0    0 : slabdata      1      1      0
RAWv6                 17     17    960   17    4 : tunables    0    0    0 : slabdata      1      1      0
UDPLITEv6              0      0    960   17    4 : tunables    0    0    0 : slabdata      0      0      0
UDPv6                 51     51    960   17    4 : tunables    0    0    0 : slabdata      3      3      0
TCPv6                 72     72   1728   18    8 : tunables    0    0    0 : slabdata      4      4      0
nf_conntrack_c10a8540      0      0    280   29    2 : tunables    0    0    0 : slabdata      0      0      0
dm_raid1_read_record      0      0   1056   31    8 : tunables    0    0    0 : slabdata      0      0      0
dm_uevent              0      0   2464   13    8 : tunables    0    0    0 : slabdata      0      0      0
mqueue_inode_cache     18     18    896   18    4 : tunables    0    0    0 : slabdata      1      1      0
fuse_request          18     18    432   18    2 : tunables    0    0    0 : slabdata      1      1      0
fuse_inode            21     21    768   21    4 : tunables    0    0    0 : slabdata      1      1      0
nfsd4_stateowners      0      0    344   23    2 : tunables    0    0    0 : slabdata      0      0      0
nfs_read_data         72     72    448   18    2 : tunables    0    0    0 : slabdata      4      4      0
nfs_inode_cache        0      0   1040   31    8 : tunables    0    0    0 : slabdata      0      0      0
ecryptfs_inode_cache      0      0   1280   25    8 : tunables    0    0    0 : slabdata      0      0      0
hugetlbfs_inode_cache     24     24    656   24    4 : tunables    0    0    0 : slabdata      1      1      0
ext4_inode_cache       0      0   1128   29    8 : tunables    0    0    0 : slabdata      0      0      0
ext2_inode_cache       0      0    944   17    4 : tunables    0    0    0 : slabdata      0      0      0
ext3_inode_cache    5032   5032    928   17    4 : tunables    0    0    0 : slabdata    296    296      0
rpc_inode_cache       18     18    896   18    4 : tunables    0    0    0 : slabdata      1      1      0
UNIX                 532    532    832   19    4 : tunables    0    0    0 : slabdata     28     28      0
UDP-Lite               0      0    832   19    4 : tunables    0    0    0 : slabdata      0      0      0
UDP                   76     76    832   19    4 : tunables    0    0    0 : slabdata      4      4      0
TCP                   60     60   1600   20    8 : tunables    0    0    0 : slabdata      3      3      0
sgpool-128            48     48   2560   12    8 : tunables    0    0    0 : slabdata      4      4      0
sgpool-64            100    100   1280   25    8 : tunables    0    0    0 : slabdata      4      4      0
blkdev_queue          76     76   1688   19    8 : tunables    0    0    0 : slabdata      4      4      0
biovec-256            10     10   3072   10    8 : tunables    0    0    0 : slabdata      1      1      0
biovec-128            21     21   1536   21    8 : tunables    0    0    0 : slabdata      1      1      0
biovec-64             84     84    768   21    4 : tunables    0    0    0 : slabdata      4      4      0
bip-256               10     10   3200   10    8 : tunables    0    0    0 : slabdata      1      1      0
bip-128                0      0   1664   19    8 : tunables    0    0    0 : slabdata      0      0      0
bip-64                 0      0    896   18    4 : tunables    0    0    0 : slabdata      0      0      0
bip-16               100    100    320   25    2 : tunables    0    0    0 : slabdata      4      4      0
sock_inode_cache     609    609    768   21    4 : tunables    0    0    0 : slabdata     29     29      0
skbuff_fclone_cache     84     84    384   21    2 : tunables    0    0    0 : slabdata      4      4      0
shmem_inode_cache   1835   1840    784   20    4 : tunables    0    0    0 : slabdata     92     92      0
taskstats             96     96    328   24    2 : tunables    0    0    0 : slabdata      4      4      0
proc_inode_cache    1584   1584    680   24    4 : tunables    0    0    0 : slabdata     66     66      0
bdev_cache            72     72    896   18    4 : tunables    0    0    0 : slabdata      4      4      0
inode_cache         7126   7128    656   24    4 : tunables    0    0    0 : slabdata    297    297      0
signal_cache         332    350    640   25    4 : tunables    0    0    0 : slabdata     14     14      0
sighand_cache        246    253   1408   23    8 : tunables    0    0    0 : slabdata     11     11      0
task_xstate          193    196    576   28    4 : tunables    0    0    0 : slabdata      7      7      0
task_struct          274    285   5472    5    8 : tunables    0    0    0 : slabdata     57     57      0
radix_tree_node     3208   3213    296   27    2 : tunables    0    0    0 : slabdata    119    119      0
kmalloc-8192          20     20   8192    4    8 : tunables    0    0    0 : slabdata      5      5      0
kmalloc-4096          78     80   4096    8    8 : tunables    0    0    0 : slabdata     10     10      0
kmalloc-2048         400    400   2048   16    8 : tunables    0    0    0 : slabdata     25     25      0
kmalloc-1024         326    336   1024   16    4 : tunables    0    0    0 : slabdata     21     21      0
kmalloc-512          758    784    512   16    2 : tunables    0    0    0 : slabdata     49     49      0

[-- Attachment #3: slub_debug_disable.log --]
[-- Type: text/x-log, Size: 8050 bytes --]

slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
kvm_vcpu               0      0   9248    3    8 : tunables    0    0    0 : slabdata      0      0      0
kmalloc_dma-512       29     29    560   29    4 : tunables    0    0    0 : slabdata      1      1      0
clip_arp_cache         0      0    320   25    2 : tunables    0    0    0 : slabdata      0      0      0
ip6_dst_cache         25     25    320   25    2 : tunables    0    0    0 : slabdata      1      1      0
ndisc_cache           25     25    320   25    2 : tunables    0    0    0 : slabdata      1      1      0
RAWv6                 16     16   1024   16    4 : tunables    0    0    0 : slabdata      1      1      0
UDPLITEv6              0      0    960   17    4 : tunables    0    0    0 : slabdata      0      0      0
UDPv6                 68     68    960   17    4 : tunables    0    0    0 : slabdata      4      4      0
tw_sock_TCPv6          0      0    320   25    2 : tunables    0    0    0 : slabdata      0      0      0
TCPv6                 36     36   1792   18    8 : tunables    0    0    0 : slabdata      2      2      0
nf_conntrack_c10a8540      0      0    320   25    2 : tunables    0    0    0 : slabdata      0      0      0
dm_raid1_read_record      0      0   1096   29    8 : tunables    0    0    0 : slabdata      0      0      0
kcopyd_job             0      0    376   21    2 : tunables    0    0    0 : slabdata      0      0      0
dm_uevent              0      0   2504   13    8 : tunables    0    0    0 : slabdata      0      0      0
dm_rq_target_io        0      0    272   30    2 : tunables    0    0    0 : slabdata      0      0      0
mqueue_inode_cache     17     17    960   17    4 : tunables    0    0    0 : slabdata      1      1      0
fuse_request          17     17    480   17    2 : tunables    0    0    0 : slabdata      1      1      0
fuse_inode            19     19    832   19    4 : tunables    0    0    0 : slabdata      1      1      0
nfsd4_stateowners      0      0    392   20    2 : tunables    0    0    0 : slabdata      0      0      0
nfs_write_data        48     48    512   16    2 : tunables    0    0    0 : slabdata      3      3      0
nfs_read_data         32     32    512   16    2 : tunables    0    0    0 : slabdata      2      2      0
nfs_inode_cache        0      0   1080   30    8 : tunables    0    0    0 : slabdata      0      0      0
ecryptfs_key_record_cache      0      0    576   28    4 : tunables    0    0    0 : slabdata      0      0      0
ecryptfs_sb_cache      0      0    640   25    4 : tunables    0    0    0 : slabdata      0      0      0
ecryptfs_inode_cache      0      0   1280   25    8 : tunables    0    0    0 : slabdata      0      0      0
ecryptfs_auth_tok_list_item      0      0    896   18    4 : tunables    0    0    0 : slabdata      0      0      0
hugetlbfs_inode_cache     23     23    696   23    4 : tunables    0    0    0 : slabdata      1      1      0
ext4_inode_cache       0      0   1168   28    8 : tunables    0    0    0 : slabdata      0      0      0
ext2_inode_cache       0      0    984   16    4 : tunables    0    0    0 : slabdata      0      0      0
ext3_inode_cache    5391   5392    968   16    4 : tunables    0    0    0 : slabdata    337    337      0
dquot                  0      0    320   25    2 : tunables    0    0    0 : slabdata      0      0      0
kioctx                 0      0    384   21    2 : tunables    0    0    0 : slabdata      0      0      0
rpc_buffers           30     30   2112   15    8 : tunables    0    0    0 : slabdata      2      2      0
rpc_inode_cache       18     18    896   18    4 : tunables    0    0    0 : slabdata      1      1      0
UNIX                 556    558    896   18    4 : tunables    0    0    0 : slabdata     31     31      0
UDP-Lite               0      0    832   19    4 : tunables    0    0    0 : slabdata      0      0      0
ip_dst_cache         125    125    320   25    2 : tunables    0    0    0 : slabdata      5      5      0
arp_cache            100    100    320   25    2 : tunables    0    0    0 : slabdata      4      4      0
RAW                   19     19    832   19    4 : tunables    0    0    0 : slabdata      1      1      0
UDP                   76     76    832   19    4 : tunables    0    0    0 : slabdata      4      4      0
TCP                   76     76   1664   19    8 : tunables    0    0    0 : slabdata      4      4      0
sgpool-128            48     48   2624   12    8 : tunables    0    0    0 : slabdata      4      4      0
sgpool-64             96     96   1344   24    8 : tunables    0    0    0 : slabdata      4      4      0
sgpool-32             92     92    704   23    4 : tunables    0    0    0 : slabdata      4      4      0
sgpool-16             84     84    384   21    2 : tunables    0    0    0 : slabdata      4      4      0
blkdev_queue          72     72   1736   18    8 : tunables    0    0    0 : slabdata      4      4      0
biovec-256            10     10   3136   10    8 : tunables    0    0    0 : slabdata      1      1      0
biovec-128            20     20   1600   20    8 : tunables    0    0    0 : slabdata      1      1      0
biovec-64             76     76    832   19    4 : tunables    0    0    0 : slabdata      4      4      0
bip-256               10     10   3200   10    8 : tunables    0    0    0 : slabdata      1      1      0
bip-128                0      0   1664   19    8 : tunables    0    0    0 : slabdata      0      0      0
bip-64                 0      0    896   18    4 : tunables    0    0    0 : slabdata      0      0      0
bip-16                 0      0    320   25    2 : tunables    0    0    0 : slabdata      0      0      0
sock_inode_cache     629    630    768   21    4 : tunables    0    0    0 : slabdata     30     30      0
skbuff_fclone_cache     72     72    448   18    2 : tunables    0    0    0 : slabdata      4      4      0
shmem_inode_cache   1862   1862    824   19    4 : tunables    0    0    0 : slabdata     98     98      0
taskstats             84     84    376   21    2 : tunables    0    0    0 : slabdata      4      4      0
proc_inode_cache    1623   1650    720   22    4 : tunables    0    0    0 : slabdata     75     75      0
bdev_cache            68     68    960   17    4 : tunables    0    0    0 : slabdata      4      4      0
inode_cache         7125   7130    696   23    4 : tunables    0    0    0 : slabdata    310    310      0
mm_struct            135    138    704   23    4 : tunables    0    0    0 : slabdata      6      6      0
files_cache          142    150    320   25    2 : tunables    0    0    0 : slabdata      6      6      0
signal_cache         229    230    704   23    4 : tunables    0    0    0 : slabdata     10     10      0
sighand_cache        228    230   1408   23    8 : tunables    0    0    0 : slabdata     10     10      0
task_xstate          195    200    640   25    4 : tunables    0    0    0 : slabdata      8      8      0
task_struct          271    285   5520    5    8 : tunables    0    0    0 : slabdata     57     57      0
radix_tree_node     3484   3504    336   24    2 : tunables    0    0    0 : slabdata    146    146      0
kmalloc-8192          20     20   8192    4    8 : tunables    0    0    0 : slabdata      5      5      0
kmalloc-4096          79     80   4096    8    8 : tunables    0    0    0 : slabdata     10     10      0
kmalloc-2048         388    390   2096   15    8 : tunables    0    0    0 : slabdata     26     26      0
kmalloc-1024         382    390   1072   30    8 : tunables    0    0    0 : slabdata     13     13      0
kmalloc-512          796    812    560   29    4 : tunables    0    0    0 : slabdata     28     28      0
kmalloc-256          153    156    304   26    2 : tunables    0    0    0 : slabdata      6      6      0

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Over-eager swapping
  2010-08-03 21:49           ` Chris Webb
@ 2010-08-04  2:21             ` Wu Fengguang
  -1 siblings, 0 replies; 75+ messages in thread
From: Wu Fengguang @ 2010-08-04  2:21 UTC (permalink / raw)
  To: Chris Webb
  Cc: Minchan Kim, linux-mm, linux-kernel, KOSAKI Motohiro, Pekka Enberg

Chris,

Your slabinfo does contain many order 1-3 slab caches; these are a major source
of high-order allocations and hence of lumpy reclaim. fork() is another.
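
(One way to spot those caches, assuming the slabinfo 2.1 layout quoted
below, where the sixth column is <pagesperslab>:

        # List caches whose slabs span more than one page;
        # 2/4/8 pages per slab correspond to order 1/2/3 allocations.
        awk 'NR > 2 && $6 > 1 { printf "%-28s %d pages/slab\n", $1, $6 }' /proc/slabinfo

This is only a reading aid for the dump, not anything the reclaim path
itself consults.)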

In another thread, Pekka Enberg offers a tip:

        You can pass "slub_debug=o" as a kernel parameter to disable higher
        order allocations if you want to test things.

Note that the parameter works on a CONFIG_SLUB_DEBUG=y kernel.

Thanks,
Fengguang


On Wed, Aug 04, 2010 at 05:49:46AM +0800, Chris Webb wrote:
> Wu Fengguang <fengguang.wu@intel.com> writes:
> 
> > Chris, what's in your /proc/slabinfo?
> 
> Hi. Sorry for the slow reply. The exact machine from which I previously
> extracted that /proc/memstat has unfortunately had swap turned off by a
> colleague while I was away, presumably because its behaviour became too
> bad. However, here is info from another member of the cluster, this time
> with 5GB of buffers and 2GB of swap in use, i.e. the same general problem:
> 
> # cat /proc/meminfo 
> MemTotal:       33084008 kB
> MemFree:         2291464 kB
> Buffers:         4908468 kB
> Cached:            16056 kB
> SwapCached:      1427480 kB
> Active:         22885508 kB
> Inactive:        5719520 kB
> Active(anon):   20466488 kB
> Inactive(anon):  3215888 kB
> Active(file):    2419020 kB
> Inactive(file):  2503632 kB
> Unevictable:       10688 kB
> Mlocked:           10688 kB
> SwapTotal:      25165816 kB
> SwapFree:       22798248 kB
> Dirty:              2616 kB
> Writeback:             0 kB
> AnonPages:      23410296 kB
> Mapped:             6324 kB
> Shmem:                56 kB
> Slab:             692296 kB
> SReclaimable:     189032 kB
> SUnreclaim:       503264 kB
> KernelStack:        4568 kB
> PageTables:        65588 kB
> NFS_Unstable:          0 kB
> Bounce:                0 kB
> WritebackTmp:          0 kB
> CommitLimit:    41707820 kB
> Committed_AS:   34859884 kB
> VmallocTotal:   34359738367 kB
> VmallocUsed:      147616 kB
> VmallocChunk:   34342399496 kB
> HardwareCorrupted:     0 kB
> HugePages_Total:       0
> HugePages_Free:        0
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> Hugepagesize:       2048 kB
> DirectMap4k:        5888 kB
> DirectMap2M:     2156544 kB
> DirectMap1G:    31457280 kB
> 
> # cat /proc/slabinfo 
> slabinfo - version: 2.1
> # name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> kmalloc_dma-512       32     32    512   32    4 : tunables    0    0    0 : slabdata      1      1      0
> nf_conntrack_expect    312    312    208   39    2 : tunables    0    0    0 : slabdata      8      8      0
> nf_conntrack         240    240    272   30    2 : tunables    0    0    0 : slabdata      8      8      0
> dm_raid1_read_record      0      0   1064   30    8 : tunables    0    0    0 : slabdata      0      0      0
> dm_crypt_io          240    260    152   26    1 : tunables    0    0    0 : slabdata     10     10      0
> kcopyd_job             0      0    368   22    2 : tunables    0    0    0 : slabdata      0      0      0
> dm_uevent              0      0   2608   12    8 : tunables    0    0    0 : slabdata      0      0      0
> dm_rq_target_io        0      0    376   21    2 : tunables    0    0    0 : slabdata      0      0      0
> cfq_queue              0      0    168   24    1 : tunables    0    0    0 : slabdata      0      0      0
> bsg_cmd                0      0    312   26    2 : tunables    0    0    0 : slabdata      0      0      0
> mqueue_inode_cache     36     36    896   36    8 : tunables    0    0    0 : slabdata      1      1      0
> udf_inode_cache        0      0    640   25    4 : tunables    0    0    0 : slabdata      0      0      0
> fuse_request           0      0    632   25    4 : tunables    0    0    0 : slabdata      0      0      0
> fuse_inode             0      0    704   23    4 : tunables    0    0    0 : slabdata      0      0      0
> ntfs_big_inode_cache      0      0    832   39    8 : tunables    0    0    0 : slabdata      0      0      0
> ntfs_inode_cache       0      0    264   31    2 : tunables    0    0    0 : slabdata      0      0      0
> isofs_inode_cache      0      0    616   26    4 : tunables    0    0    0 : slabdata      0      0      0
> fat_inode_cache        0      0    648   25    4 : tunables    0    0    0 : slabdata      0      0      0
> fat_cache              0      0     40  102    1 : tunables    0    0    0 : slabdata      0      0      0
> hugetlbfs_inode_cache     28     28    584   28    4 : tunables    0    0    0 : slabdata      1      1      0
> squashfs_inode_cache      0      0    640   25    4 : tunables    0    0    0 : slabdata      0      0      0
> journal_handle      1360   1360     24  170    1 : tunables    0    0    0 : slabdata      8      8      0
> journal_head         288    288    112   36    1 : tunables    0    0    0 : slabdata      8      8      0
> revoke_table         512    512     16  256    1 : tunables    0    0    0 : slabdata      2      2      0
> revoke_record       1024   1024     32  128    1 : tunables    0    0    0 : slabdata      8      8      0
> ext4_inode_cache       0      0    896   36    8 : tunables    0    0    0 : slabdata      0      0      0
> ext4_free_block_extents      0      0     56   73    1 : tunables    0    0    0 : slabdata      0      0      0
> ext4_alloc_context      0      0    144   28    1 : tunables    0    0    0 : slabdata      0      0      0
> ext4_prealloc_space      0      0    104   39    1 : tunables    0    0    0 : slabdata      0      0      0
> ext4_system_zone       0      0     40  102    1 : tunables    0    0    0 : slabdata      0      0      0
> ext2_inode_cache       0      0    752   21    4 : tunables    0    0    0 : slabdata      0      0      0
> ext3_inode_cache    2371   2457    768   21    4 : tunables    0    0    0 : slabdata    117    117      0
> ext3_xattr             0      0     88   46    1 : tunables    0    0    0 : slabdata      0      0      0
> configfs_dir_cache      0      0     88   46    1 : tunables    0    0    0 : slabdata      0      0      0
> kioctx                 0      0    320   25    2 : tunables    0    0    0 : slabdata      0      0      0
> inotify_inode_mark_entry     36     36    112   36    1 : tunables    0    0    0 : slabdata      1      1      0
> posix_timers_cache    224    224    144   28    1 : tunables    0    0    0 : slabdata      8      8      0
> kvm_vcpu              38     45  10256    3    8 : tunables    0    0    0 : slabdata     15     15      0
> kvm_rmap_desc      19408  21828     40  102    1 : tunables    0    0    0 : slabdata    214    214      0
> kvm_pte_chain      14514  28543     56   73    1 : tunables    0    0    0 : slabdata    391    391      0
> UDP-Lite               0      0    768   21    4 : tunables    0    0    0 : slabdata      0      0      0
> ip_dst_cache         221    231    384   21    2 : tunables    0    0    0 : slabdata     11     11      0
> UDP                  168    168    768   21    4 : tunables    0    0    0 : slabdata      8      8      0
> tw_sock_TCP          256    256    256   32    2 : tunables    0    0    0 : slabdata      8      8      0
> TCP                  191    220   1472   22    8 : tunables    0    0    0 : slabdata     10     10      0
> blkdev_queue         178    210   2128   15    8 : tunables    0    0    0 : slabdata     14     14      0
> blkdev_requests      608    816    336   24    2 : tunables    0    0    0 : slabdata     34     34      0
> fsnotify_event         0      0    104   39    1 : tunables    0    0    0 : slabdata      0      0      0
> sock_inode_cache     250    300    640   25    4 : tunables    0    0    0 : slabdata     12     12      0
> file_lock_cache      176    176    184   22    1 : tunables    0    0    0 : slabdata      8      8      0
> shmem_inode_cache   1617   1827    776   21    4 : tunables    0    0    0 : slabdata     87     87      0
> Acpi-ParseExt       1692   1736     72   56    1 : tunables    0    0    0 : slabdata     31     31      0
> proc_inode_cache    1182   1326    616   26    4 : tunables    0    0    0 : slabdata     51     51      0
> sigqueue             200    200    160   25    1 : tunables    0    0    0 : slabdata      8      8      0
> radix_tree_node    65891  69542    560   29    4 : tunables    0    0    0 : slabdata   2398   2398      0
> bdev_cache           312    312    832   39    8 : tunables    0    0    0 : slabdata      8      8      0
> sysfs_dir_cache    21585  22287     80   51    1 : tunables    0    0    0 : slabdata    437    437      0
> inode_cache         2903   2996    568   28    4 : tunables    0    0    0 : slabdata    107    107      0
> dentry              8532   8631    192   21    1 : tunables    0    0    0 : slabdata    411    411      0
> buffer_head       1227688 1296648    112   36    1 : tunables    0    0    0 : slabdata  36018  36018      0
> vm_area_struct     18494  19389    176   23    1 : tunables    0    0    0 : slabdata    843    843      0
> files_cache          236    322    704   23    4 : tunables    0    0    0 : slabdata     14     14      0
> signal_cache         606    702    832   39    8 : tunables    0    0    0 : slabdata     18     18      0
> sighand_cache        415    480   2112   15    8 : tunables    0    0    0 : slabdata     32     32      0
> task_struct          671    840   1616   20    8 : tunables    0    0    0 : slabdata     42     42      0
> anon_vma            1511   1920     32  128    1 : tunables    0    0    0 : slabdata     15     15      0
> shared_policy_node    255    255     48   85    1 : tunables    0    0    0 : slabdata      3      3      0
> numa_policy        19205  20910     24  170    1 : tunables    0    0    0 : slabdata    123    123      0
> idr_layer_cache      373    390    544   30    4 : tunables    0    0    0 : slabdata     13     13      0
> kmalloc-8192          36     36   8192    4    8 : tunables    0    0    0 : slabdata      9      9      0
> kmalloc-4096        2284   2592   4096    8    8 : tunables    0    0    0 : slabdata    324    324      0
> kmalloc-2048         750    896   2048   16    8 : tunables    0    0    0 : slabdata     56     56      0
> kmalloc-1024        4025   4320   1024   32    8 : tunables    0    0    0 : slabdata    135    135      0
> kmalloc-512         1358   1760    512   32    4 : tunables    0    0    0 : slabdata     55     55      0
> kmalloc-256         1402   1952    256   32    2 : tunables    0    0    0 : slabdata     61     61      0
> kmalloc-128         8625   9280    128   32    1 : tunables    0    0    0 : slabdata    290    290      0
> kmalloc-64        7030122 7455232     64   64    1 : tunables    0    0    0 : slabdata 116488 116488      0
> kmalloc-32         18603  19712     32  128    1 : tunables    0    0    0 : slabdata    154    154      0
> kmalloc-16          8895   9728     16  256    1 : tunables    0    0    0 : slabdata     38     38      0
> kmalloc-8           9047  10752      8  512    1 : tunables    0    0    0 : slabdata     21     21      0
> kmalloc-192         5130   9135    192   21    1 : tunables    0    0    0 : slabdata    435    435      0
> kmalloc-96          1905   2940     96   42    1 : tunables    0    0    0 : slabdata     70     70      0
> kmem_cache_node      196    256     64   64    1 : tunables    0    0    0 : slabdata      4      4      0
> 
> # cat /proc/buddyinfo 
> Node 0, zone      DMA      2      0      2      2      2      2      2      1      2      2      2 
> Node 0, zone    DMA32  61877  10368    111     10      2      3      1      0      0      0      0 
> Node 0, zone   Normal   2036      0     14     12      6      3      3      0      1      0      0 
> Node 1, zone   Normal 483348     15      2      3      7      1      3      1      0      0      0 
>  
> Best wishes,
> 
> Chris.

^ permalink raw reply	[flat|nested] 75+ messages in thread


* Re: Over-eager swapping
  2010-08-03  4:28         ` Wu Fengguang
@ 2010-08-03 21:49           ` Chris Webb
  -1 siblings, 0 replies; 75+ messages in thread
From: Chris Webb @ 2010-08-03 21:49 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Minchan Kim, linux-mm, linux-kernel, KOSAKI Motohiro

Wu Fengguang <fengguang.wu@intel.com> writes:

> Chris, what's in your /proc/slabinfo?

Hi. Sorry for the slow reply. The exact machine from which I previously
extracted that /proc/memstat has unfortunately had swap turned off by a
colleague while I was away, presumably because its behaviour became too
bad. However, here is info from another member of the cluster, this time
with 5GB of buffers and 2GB of swap in use, i.e. the same general problem:

# cat /proc/meminfo 
MemTotal:       33084008 kB
MemFree:         2291464 kB
Buffers:         4908468 kB
Cached:            16056 kB
SwapCached:      1427480 kB
Active:         22885508 kB
Inactive:        5719520 kB
Active(anon):   20466488 kB
Inactive(anon):  3215888 kB
Active(file):    2419020 kB
Inactive(file):  2503632 kB
Unevictable:       10688 kB
Mlocked:           10688 kB
SwapTotal:      25165816 kB
SwapFree:       22798248 kB
Dirty:              2616 kB
Writeback:             0 kB
AnonPages:      23410296 kB
Mapped:             6324 kB
Shmem:                56 kB
Slab:             692296 kB
SReclaimable:     189032 kB
SUnreclaim:       503264 kB
KernelStack:        4568 kB
PageTables:        65588 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    41707820 kB
Committed_AS:   34859884 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      147616 kB
VmallocChunk:   34342399496 kB
HardwareCorrupted:     0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        5888 kB
DirectMap2M:     2156544 kB
DirectMap1G:    31457280 kB
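
(From the figures above: swap in use = SwapTotal - SwapFree
= 25165816 kB - 22798248 kB = 2367568 kB, i.e. about 2.3 GB, and
Buffers = 4908468 kB, i.e. about 4.7 GB, which is where the "5GB of
buffers and 2GB of swap" above comes from.)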

# cat /proc/slabinfo 
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
kmalloc_dma-512       32     32    512   32    4 : tunables    0    0    0 : slabdata      1      1      0
nf_conntrack_expect    312    312    208   39    2 : tunables    0    0    0 : slabdata      8      8      0
nf_conntrack         240    240    272   30    2 : tunables    0    0    0 : slabdata      8      8      0
dm_raid1_read_record      0      0   1064   30    8 : tunables    0    0    0 : slabdata      0      0      0
dm_crypt_io          240    260    152   26    1 : tunables    0    0    0 : slabdata     10     10      0
kcopyd_job             0      0    368   22    2 : tunables    0    0    0 : slabdata      0      0      0
dm_uevent              0      0   2608   12    8 : tunables    0    0    0 : slabdata      0      0      0
dm_rq_target_io        0      0    376   21    2 : tunables    0    0    0 : slabdata      0      0      0
cfq_queue              0      0    168   24    1 : tunables    0    0    0 : slabdata      0      0      0
bsg_cmd                0      0    312   26    2 : tunables    0    0    0 : slabdata      0      0      0
mqueue_inode_cache     36     36    896   36    8 : tunables    0    0    0 : slabdata      1      1      0
udf_inode_cache        0      0    640   25    4 : tunables    0    0    0 : slabdata      0      0      0
fuse_request           0      0    632   25    4 : tunables    0    0    0 : slabdata      0      0      0
fuse_inode             0      0    704   23    4 : tunables    0    0    0 : slabdata      0      0      0
ntfs_big_inode_cache      0      0    832   39    8 : tunables    0    0    0 : slabdata      0      0      0
ntfs_inode_cache       0      0    264   31    2 : tunables    0    0    0 : slabdata      0      0      0
isofs_inode_cache      0      0    616   26    4 : tunables    0    0    0 : slabdata      0      0      0
fat_inode_cache        0      0    648   25    4 : tunables    0    0    0 : slabdata      0      0      0
fat_cache              0      0     40  102    1 : tunables    0    0    0 : slabdata      0      0      0
hugetlbfs_inode_cache     28     28    584   28    4 : tunables    0    0    0 : slabdata      1      1      0
squashfs_inode_cache      0      0    640   25    4 : tunables    0    0    0 : slabdata      0      0      0
journal_handle      1360   1360     24  170    1 : tunables    0    0    0 : slabdata      8      8      0
journal_head         288    288    112   36    1 : tunables    0    0    0 : slabdata      8      8      0
revoke_table         512    512     16  256    1 : tunables    0    0    0 : slabdata      2      2      0
revoke_record       1024   1024     32  128    1 : tunables    0    0    0 : slabdata      8      8      0
ext4_inode_cache       0      0    896   36    8 : tunables    0    0    0 : slabdata      0      0      0
ext4_free_block_extents      0      0     56   73    1 : tunables    0    0    0 : slabdata      0      0      0
ext4_alloc_context      0      0    144   28    1 : tunables    0    0    0 : slabdata      0      0      0
ext4_prealloc_space      0      0    104   39    1 : tunables    0    0    0 : slabdata      0      0      0
ext4_system_zone       0      0     40  102    1 : tunables    0    0    0 : slabdata      0      0      0
ext2_inode_cache       0      0    752   21    4 : tunables    0    0    0 : slabdata      0      0      0
ext3_inode_cache    2371   2457    768   21    4 : tunables    0    0    0 : slabdata    117    117      0
ext3_xattr             0      0     88   46    1 : tunables    0    0    0 : slabdata      0      0      0
configfs_dir_cache      0      0     88   46    1 : tunables    0    0    0 : slabdata      0      0      0
kioctx                 0      0    320   25    2 : tunables    0    0    0 : slabdata      0      0      0
inotify_inode_mark_entry     36     36    112   36    1 : tunables    0    0    0 : slabdata      1      1      0
posix_timers_cache    224    224    144   28    1 : tunables    0    0    0 : slabdata      8      8      0
kvm_vcpu              38     45  10256    3    8 : tunables    0    0    0 : slabdata     15     15      0
kvm_rmap_desc      19408  21828     40  102    1 : tunables    0    0    0 : slabdata    214    214      0
kvm_pte_chain      14514  28543     56   73    1 : tunables    0    0    0 : slabdata    391    391      0
UDP-Lite               0      0    768   21    4 : tunables    0    0    0 : slabdata      0      0      0
ip_dst_cache         221    231    384   21    2 : tunables    0    0    0 : slabdata     11     11      0
UDP                  168    168    768   21    4 : tunables    0    0    0 : slabdata      8      8      0
tw_sock_TCP          256    256    256   32    2 : tunables    0    0    0 : slabdata      8      8      0
TCP                  191    220   1472   22    8 : tunables    0    0    0 : slabdata     10     10      0
blkdev_queue         178    210   2128   15    8 : tunables    0    0    0 : slabdata     14     14      0
blkdev_requests      608    816    336   24    2 : tunables    0    0    0 : slabdata     34     34      0
fsnotify_event         0      0    104   39    1 : tunables    0    0    0 : slabdata      0      0      0
sock_inode_cache     250    300    640   25    4 : tunables    0    0    0 : slabdata     12     12      0
file_lock_cache      176    176    184   22    1 : tunables    0    0    0 : slabdata      8      8      0
shmem_inode_cache   1617   1827    776   21    4 : tunables    0    0    0 : slabdata     87     87      0
Acpi-ParseExt       1692   1736     72   56    1 : tunables    0    0    0 : slabdata     31     31      0
proc_inode_cache    1182   1326    616   26    4 : tunables    0    0    0 : slabdata     51     51      0
sigqueue             200    200    160   25    1 : tunables    0    0    0 : slabdata      8      8      0
radix_tree_node    65891  69542    560   29    4 : tunables    0    0    0 : slabdata   2398   2398      0
bdev_cache           312    312    832   39    8 : tunables    0    0    0 : slabdata      8      8      0
sysfs_dir_cache    21585  22287     80   51    1 : tunables    0    0    0 : slabdata    437    437      0
inode_cache         2903   2996    568   28    4 : tunables    0    0    0 : slabdata    107    107      0
dentry              8532   8631    192   21    1 : tunables    0    0    0 : slabdata    411    411      0
buffer_head       1227688 1296648    112   36    1 : tunables    0    0    0 : slabdata  36018  36018      0
vm_area_struct     18494  19389    176   23    1 : tunables    0    0    0 : slabdata    843    843      0
files_cache          236    322    704   23    4 : tunables    0    0    0 : slabdata     14     14      0
signal_cache         606    702    832   39    8 : tunables    0    0    0 : slabdata     18     18      0
sighand_cache        415    480   2112   15    8 : tunables    0    0    0 : slabdata     32     32      0
task_struct          671    840   1616   20    8 : tunables    0    0    0 : slabdata     42     42      0
anon_vma            1511   1920     32  128    1 : tunables    0    0    0 : slabdata     15     15      0
shared_policy_node    255    255     48   85    1 : tunables    0    0    0 : slabdata      3      3      0
numa_policy        19205  20910     24  170    1 : tunables    0    0    0 : slabdata    123    123      0
idr_layer_cache      373    390    544   30    4 : tunables    0    0    0 : slabdata     13     13      0
kmalloc-8192          36     36   8192    4    8 : tunables    0    0    0 : slabdata      9      9      0
kmalloc-4096        2284   2592   4096    8    8 : tunables    0    0    0 : slabdata    324    324      0
kmalloc-2048         750    896   2048   16    8 : tunables    0    0    0 : slabdata     56     56      0
kmalloc-1024        4025   4320   1024   32    8 : tunables    0    0    0 : slabdata    135    135      0
kmalloc-512         1358   1760    512   32    4 : tunables    0    0    0 : slabdata     55     55      0
kmalloc-256         1402   1952    256   32    2 : tunables    0    0    0 : slabdata     61     61      0
kmalloc-128         8625   9280    128   32    1 : tunables    0    0    0 : slabdata    290    290      0
kmalloc-64        7030122 7455232     64   64    1 : tunables    0    0    0 : slabdata 116488 116488      0
kmalloc-32         18603  19712     32  128    1 : tunables    0    0    0 : slabdata    154    154      0
kmalloc-16          8895   9728     16  256    1 : tunables    0    0    0 : slabdata     38     38      0
kmalloc-8           9047  10752      8  512    1 : tunables    0    0    0 : slabdata     21     21      0
kmalloc-192         5130   9135    192   21    1 : tunables    0    0    0 : slabdata    435    435      0
kmalloc-96          1905   2940     96   42    1 : tunables    0    0    0 : slabdata     70     70      0
kmem_cache_node      196    256     64   64    1 : tunables    0    0    0 : slabdata      4      4      0
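
A quick way to see which caches dominate a listing like this is to rank
them by approximate footprint (a rough sketch; it assumes the slabinfo
2.1 layout above, with num_objs in field 3 and objsize in field 4):

  # rank slab caches by approximate memory use (num_objs * objsize)
  awk 'NR > 2 { printf "%10.1f MB  %s\n", $3 * $4 / 1048576, $1 }' \
      /proc/slabinfo | sort -rn | head

By that measure kmalloc-64 is the largest single cache above, at roughly
7455232 * 64 bytes = ~455MB.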

# cat /proc/buddyinfo 
Node 0, zone      DMA      2      0      2      2      2      2      2      1      2      2      2 
Node 0, zone    DMA32  61877  10368    111     10      2      3      1      0      0      0      0 
Node 0, zone   Normal   2036      0     14     12      6      3      3      0      1      0      0 
Node 1, zone   Normal 483348     15      2      3      7      1      3      1      0      0      0 
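
The columns after the zone name in /proc/buddyinfo are counts of free
blocks at order 0 through 10, each block being 2^order pages of 4KB, so
a small sketch like this turns the rows into something easier to read:

  # decode buddyinfo: MB free per zone and the highest order still populated
  awk '{ kb = 0; top = -1
         for (i = 5; i <= NF; i++) {
             kb += $i * 2^(i-5) * 4
             if ($i > 0) top = i - 5
         }
         printf "%s %s %-7s %9.1f MB free, highest populated order %d\n",
                $1, $2, $4, kb / 1024, top }' /proc/buddyinfo

Node 1's Normal zone above is badly fragmented: almost all of its free
memory is in order-0 pages, which is exactly the situation in which
order-1 and higher allocations force extra reclaim.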
 
Best wishes,

Chris.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Over-eager swapping
  2010-08-03  4:47           ` Minchan Kim
@ 2010-08-03  6:39             ` Wu Fengguang
  -1 siblings, 0 replies; 75+ messages in thread
From: Wu Fengguang @ 2010-08-03  6:39 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Chris Webb, linux-mm, linux-kernel, KOSAKI Motohiro

On Tue, Aug 03, 2010 at 12:47:36PM +0800, Minchan Kim wrote:
> On Tue, Aug 3, 2010 at 1:28 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> > On Tue, Aug 03, 2010 at 12:09:18PM +0800, Minchan Kim wrote:
> >> On Tue, Aug 3, 2010 at 12:31 PM, Chris Webb <chris@arachsys.com> wrote:
> >> > Minchan Kim <minchan.kim@gmail.com> writes:
> >> >
> >> >> Another possibility is _zone_reclaim_ in NUMA.
> >> >> Your working set has many anonymous page.
> >> >>
> >> >> The zone_reclaim set priority to ZONE_RECLAIM_PRIORITY.
> >> >> It can make reclaim mode to lumpy so it can page out anon pages.
> >> >>
> >> >> Could you show me /proc/sys/vm/[zone_reclaim_mode/min_unmapped_ratio] ?
> >> >
> >> > Sure, no problem. On the machine with the /proc/meminfo I showed earlier,
> >> > these are
> >> >
> >> >  # cat /proc/sys/vm/zone_reclaim_mode
> >> >  0
> >> >  # cat /proc/sys/vm/min_unmapped_ratio
> >> >  1
> >>
> >> if zone_reclaim_mode is zero, it doesn't swap out anon_pages.
> >
> > If there are lots of order-1 or higher allocations, anonymous pages
> > will be randomly evicted, regardless of their LRU ages. This is
> 
> I thought swapped out page is huge (ie, 3G) even though it enters lumpy mode.
> But it's possible. :)
> 
> > probably another factor why the users claim. Are there easy ways to
> > confirm this other than patching the kernel?
> 
> cat /proc/buddyinfo can help?

Some high order slab caches may show up there :)

> Off-topic:
> It would be better to add new vmstat of lumpy entrance.

I think it's a good debug entry. Although convenient, lumpy reclaim
is accompanied by some bad side effects. When something goes wrong,
it helps to check the number of lumpy reclaims.

Thanks,
Fengguang

> Pseudo code.
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 0f9f624..d10ff4e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1641,7 +1641,7 @@ out:
>         }
>  }
> 
> -static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc)
> +static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc, struct zone *zone)
>  {
>         /*
>          * If we need a large contiguous chunk of memory, or have
> @@ -1654,6 +1654,9 @@ static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc)
>                 sc->lumpy_reclaim_mode = 1;
>         else
>                 sc->lumpy_reclaim_mode = 0;
> +
> +       if (sc->lumpy_reclaim_mode)
> +               inc_zone_state(zone, NR_LUMPY);
>  }
> 
>  /*
> @@ -1670,7 +1673,7 @@ static void shrink_zone(int priority, struct zone *zone,
> 
>         get_scan_count(zone, sc, nr, priority);
> 
> -       set_lumpy_reclaim_mode(priority, sc);
> +       set_lumpy_reclaim_mode(priority, sc, zone);
> 
>         while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
>                                         nr[LRU_INACTIVE_FILE]) {
> 
> -- 
> Kind regards,
> Minchan Kim

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Over-eager swapping
  2010-08-03  4:28         ` Wu Fengguang
@ 2010-08-03  4:47           ` Minchan Kim
  -1 siblings, 0 replies; 75+ messages in thread
From: Minchan Kim @ 2010-08-03  4:47 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Chris Webb, linux-mm, linux-kernel, KOSAKI Motohiro

On Tue, Aug 3, 2010 at 1:28 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> On Tue, Aug 03, 2010 at 12:09:18PM +0800, Minchan Kim wrote:
>> On Tue, Aug 3, 2010 at 12:31 PM, Chris Webb <chris@arachsys.com> wrote:
>> > Minchan Kim <minchan.kim@gmail.com> writes:
>> >
>> >> Another possibility is _zone_reclaim_ in NUMA.
>> >> Your working set has many anonymous page.
>> >>
>> >> The zone_reclaim set priority to ZONE_RECLAIM_PRIORITY.
>> >> It can make reclaim mode to lumpy so it can page out anon pages.
>> >>
>> >> Could you show me /proc/sys/vm/[zone_reclaim_mode/min_unmapped_ratio] ?
>> >
>> > Sure, no problem. On the machine with the /proc/meminfo I showed earlier,
>> > these are
>> >
>> >  # cat /proc/sys/vm/zone_reclaim_mode
>> >  0
>> >  # cat /proc/sys/vm/min_unmapped_ratio
>> >  1
>>
>> if zone_reclaim_mode is zero, it doesn't swap out anon_pages.
>
> If there are lots of order-1 or higher allocations, anonymous pages
> will be randomly evicted, regardless of their LRU ages. This is

I thought the amount swapped out (i.e. 3GB) was too large even for lumpy
reclaim mode. But it's possible. :)

> probably another factor why the users claim. Are there easy ways to
> confirm this other than patching the kernel?

cat /proc/buddyinfo can help?

Off-topic:
It would be better to add a new vmstat counter for entering lumpy reclaim.

Pseudo code.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0f9f624..d10ff4e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1641,7 +1641,7 @@ out:
        }
 }

-static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc)
+static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc, struct zone *zone)
 {
        /*
         * If we need a large contiguous chunk of memory, or have
@@ -1654,6 +1654,9 @@ static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc)
                sc->lumpy_reclaim_mode = 1;
        else
                sc->lumpy_reclaim_mode = 0;
+
+       if (sc->lumpy_reclaim_mode)
+               inc_zone_state(zone, NR_LUMPY);
 }

 /*
@@ -1670,7 +1673,7 @@ static void shrink_zone(int priority, struct zone *zone,

        get_scan_count(zone, sc, nr, priority);

-       set_lumpy_reclaim_mode(priority, sc);
+       set_lumpy_reclaim_mode(priority, sc, zone);

        while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
                                        nr[LRU_INACTIVE_FILE]) {

-- 
Kind regards,
Minchan Kim

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* Re: Over-eager swapping
  2010-08-03  4:09       ` Minchan Kim
@ 2010-08-03  4:28         ` Wu Fengguang
  -1 siblings, 0 replies; 75+ messages in thread
From: Wu Fengguang @ 2010-08-03  4:28 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Chris Webb, linux-mm, linux-kernel, KOSAKI Motohiro

On Tue, Aug 03, 2010 at 12:09:18PM +0800, Minchan Kim wrote:
> On Tue, Aug 3, 2010 at 12:31 PM, Chris Webb <chris@arachsys.com> wrote:
> > Minchan Kim <minchan.kim@gmail.com> writes:
> >
> >> Another possibility is _zone_reclaim_ in NUMA.
> >> Your working set has many anonymous page.
> >>
> >> The zone_reclaim set priority to ZONE_RECLAIM_PRIORITY.
> >> It can make reclaim mode to lumpy so it can page out anon pages.
> >>
> >> Could you show me /proc/sys/vm/[zone_reclaim_mode/min_unmapped_ratio] ?
> >
> > Sure, no problem. On the machine with the /proc/meminfo I showed earlier,
> > these are
> >
> >  # cat /proc/sys/vm/zone_reclaim_mode
> >  0
> >  # cat /proc/sys/vm/min_unmapped_ratio
> >  1
> 
> if zone_reclaim_mode is zero, it doesn't swap out anon_pages.

If there are lots of order-1 or higher allocations, anonymous pages
will be randomly evicted, regardless of their LRU ages. This is
probably another factor behind what the users are reporting. Are there
easy ways to confirm this other than patching the kernel?
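
One low-effort check, sketched here on the assumption of the slabinfo
2.1 layout (objperslab in field 5, pagesperslab in field 6), is to list
the caches whose slabs are bigger than a single page, i.e. the ones that
need order-1 or higher allocations:

  # caches whose slabs span more than one page, i.e. need order >= 1
  awk 'NR > 2 && $6 > 1 { printf "%-28s %5d objs/slab, %d pages/slab\n",
                                 $1, $5, $6 }' /proc/slabinfo

If those caches are growing new slabs while the swapping happens, that
points at high-order allocations pushing reclaim into lumpy mode.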

Chris, what's in your /proc/slabinfo?

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Over-eager swapping
  2010-08-03  3:31     ` Chris Webb
@ 2010-08-03  4:09       ` Minchan Kim
  -1 siblings, 0 replies; 75+ messages in thread
From: Minchan Kim @ 2010-08-03  4:09 UTC (permalink / raw)
  To: Chris Webb; +Cc: linux-mm, linux-kernel, KOSAKI Motohiro, Wu Fengguang

On Tue, Aug 3, 2010 at 12:31 PM, Chris Webb <chris@arachsys.com> wrote:
> Minchan Kim <minchan.kim@gmail.com> writes:
>
>> Another possibility is _zone_reclaim_ in NUMA.
>> Your working set has many anonymous page.
>>
>> The zone_reclaim set priority to ZONE_RECLAIM_PRIORITY.
>> It can make reclaim mode to lumpy so it can page out anon pages.
>>
>> Could you show me /proc/sys/vm/[zone_reclaim_mode/min_unmapped_ratio] ?
>
> Sure, no problem. On the machine with the /proc/meminfo I showed earlier,
> these are
>
>  # cat /proc/sys/vm/zone_reclaim_mode
>  0
>  # cat /proc/sys/vm/min_unmapped_ratio
>  1

If zone_reclaim_mode is zero, it doesn't swap out anon pages.

1) How is the VM reclaiming anonymous pages even though vm_swappiness is
zero and there is a big page cache?
2) I suspect the file pages on your system are almost entirely Buffers,
while Cached is only around 10MB.
Why do those buffers remain while anon pages are swapped out and cached
pages are reclaimed?

Hmm. I have no idea. :(

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Over-eager swapping
  2010-08-02 23:55   ` Minchan Kim
@ 2010-08-03  3:31     ` Chris Webb
  -1 siblings, 0 replies; 75+ messages in thread
From: Chris Webb @ 2010-08-03  3:31 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, linux-kernel, KOSAKI Motohiro, Wu Fengguang

Minchan Kim <minchan.kim@gmail.com> writes:

> Another possibility is _zone_reclaim_ in NUMA.
> Your working set has many anonymous page.
> 
> The zone_reclaim set priority to ZONE_RECLAIM_PRIORITY.
> It can make reclaim mode to lumpy so it can page out anon pages.
> 
> Could you show me /proc/sys/vm/[zone_reclaim_mode/min_unmapped_ratio] ?

Sure, no problem. On the machine with the /proc/meminfo I showed earlier,
these are

  # cat /proc/sys/vm/zone_reclaim_mode 
  0
  # cat /proc/sys/vm/min_unmapped_ratio 
  1

I haven't changed either of these from the kernel default.

Many thanks,

Chris.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Over-eager swapping
  2010-08-02 12:47 ` Chris Webb
@ 2010-08-02 23:55   ` Minchan Kim
  -1 siblings, 0 replies; 75+ messages in thread
From: Minchan Kim @ 2010-08-02 23:55 UTC (permalink / raw)
  To: Chris Webb; +Cc: linux-mm, linux-kernel, KOSAKI Motohiro, Wu Fengguang

On Mon, Aug 2, 2010 at 9:47 PM, Chris Webb <chris@arachsys.com> wrote:
> We run a number of relatively large x86-64 hosts with twenty or so qemu-kvm
> virtual machines on each of them, and I'm have some trouble with over-eager
> swapping on some (but not all) of the machines. This is resulting in
> customer reports of very poor response latency from the virtual machines
> which have been swapped out, despite the hosts apparently having large
> amounts of free memory, and running fine if swap is turned off.
>
> All of the hosts are running a 2.6.32.7 kernel and have ksm enabled with
> 32GB of RAM and 2x quad-core processors. There is a cluster of Xeon E5420
> machines which apparently doesn't exhibit the problem, and a cluster of
> 2352/2378 Opteron (NUMA) machines, some of which do. The kernel config of
> the affected machines is at
>
>  http://cdw.me.uk/tmp/config-2.6.32.7
>
> This differs very little from the config on the unaffected Xeon machines,
> essentially just
>
>  -CONFIG_MCORE2=y
>  +CONFIG_MK8=y
>  -CONFIG_X86_P6_NOP=y
>
> On a typical affected machine, the virtual machines and other processes
> would apparently leave around 5.5GB of RAM available for buffers, but the
> system seems to want to swap out 3GB of anonymous pages to give itself more
> like 9GB of buffers:
>
>  # cat /proc/meminfo
>  MemTotal:       33083420 kB
>  MemFree:          693164 kB
>  Buffers:         8834380 kB
>  Cached:            11212 kB
>  SwapCached:      1443524 kB
>  Active:         21656844 kB
>  Inactive:        8119352 kB
>  Active(anon):   17203092 kB
>  Inactive(anon):  3729032 kB
>  Active(file):    4453752 kB
>  Inactive(file):  4390320 kB
>  Unevictable:        5472 kB
>  Mlocked:            5472 kB
>  SwapTotal:      25165816 kB
>  SwapFree:       21854572 kB
>  Dirty:              4300 kB
>  Writeback:             4 kB
>  AnonPages:      20780368 kB
>  Mapped:             6056 kB
>  Shmem:                56 kB
>  Slab:             961512 kB
>  SReclaimable:     438276 kB
>  SUnreclaim:       523236 kB
>  KernelStack:       10152 kB
>  PageTables:        67176 kB
>  NFS_Unstable:          0 kB
>  Bounce:                0 kB
>  WritebackTmp:          0 kB
>  CommitLimit:    41707524 kB
>  Committed_AS:   39870868 kB
>  VmallocTotal:   34359738367 kB
>  VmallocUsed:      150880 kB
>  VmallocChunk:   34342404996 kB
>  HardwareCorrupted:     0 kB
>  HugePages_Total:       0
>  HugePages_Free:        0
>  HugePages_Rsvd:        0
>  HugePages_Surp:        0
>  Hugepagesize:       2048 kB
>  DirectMap4k:        5824 kB
>  DirectMap2M:     3205120 kB
>  DirectMap1G:    30408704 kB
>
> We see this despite the machine having vm.swappiness set to 0 in an attempt
> to skew the reclaim as far as possible in favour of releasing page cache
> instead of swapping anonymous pages.
>

Hmm, strange.
We reclaim only anon pages when the system has very little page cache
(i.e. file + free <= high_water_mark).
But in your meminfo, your system has lots of page cache pages,
so that isn't likely.
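
That condition is checked per zone, so the raw inputs to compare live in
/proc/zoneinfo rather than /proc/meminfo. A rough way to pull them out,
assuming the usual zoneinfo layout, is:

  # per-zone inputs to the "file + free <= high watermark" check (in pages)
  grep -E '^Node|pages free|^        high|nr_inactive_file|nr_active_file' \
      /proc/zoneinfo

and then compare pages free + nr_inactive_file + nr_active_file against
the high watermark for each zone.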

Another possibility is _zone_reclaim_ on NUMA.
Your working set has many anonymous pages.

zone_reclaim sets the reclaim priority to ZONE_RECLAIM_PRIORITY,
which can switch reclaim into lumpy mode and so page out anon pages.

Could you show me /proc/sys/vm/[zone_reclaim_mode/min_unmapped_ratio] ?

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Over-eager swapping
@ 2010-08-02 12:47 ` Chris Webb
  0 siblings, 0 replies; 75+ messages in thread
From: Chris Webb @ 2010-08-02 12:47 UTC (permalink / raw)
  To: linux-mm, linux-kernel

We run a number of relatively large x86-64 hosts with twenty or so qemu-kvm
virtual machines on each of them, and I'm having some trouble with over-eager
swapping on some (but not all) of the machines. This is resulting in
customer reports of very poor response latency from the virtual machines
which have been swapped out, despite the hosts apparently having large
amounts of free memory, and running fine if swap is turned off.

All of the hosts are running a 2.6.32.7 kernel and have ksm enabled with
32GB of RAM and 2x quad-core processors. There is a cluster of Xeon E5420
machines which apparently doesn't exhibit the problem, and a cluster of
2352/2378 Opteron (NUMA) machines, some of which do. The kernel config of
the affected machines is at

  http://cdw.me.uk/tmp/config-2.6.32.7

This differs very little from the config on the unaffected Xeon machines,
essentially just

  -CONFIG_MCORE2=y
  +CONFIG_MK8=y
  -CONFIG_X86_P6_NOP=y

On a typical affected machine, the virtual machines and other processes
would apparently leave around 5.5GB of RAM available for buffers, but the
system seems to want to swap out 3GB of anonymous pages to give itself more
like 9GB of buffers:

  # cat /proc/meminfo 
  MemTotal:       33083420 kB
  MemFree:          693164 kB
  Buffers:         8834380 kB
  Cached:            11212 kB
  SwapCached:      1443524 kB
  Active:         21656844 kB
  Inactive:        8119352 kB
  Active(anon):   17203092 kB
  Inactive(anon):  3729032 kB
  Active(file):    4453752 kB
  Inactive(file):  4390320 kB
  Unevictable:        5472 kB
  Mlocked:            5472 kB
  SwapTotal:      25165816 kB
  SwapFree:       21854572 kB
  Dirty:              4300 kB
  Writeback:             4 kB
  AnonPages:      20780368 kB
  Mapped:             6056 kB
  Shmem:                56 kB
  Slab:             961512 kB
  SReclaimable:     438276 kB
  SUnreclaim:       523236 kB
  KernelStack:       10152 kB
  PageTables:        67176 kB
  NFS_Unstable:          0 kB
  Bounce:                0 kB
  WritebackTmp:          0 kB
  CommitLimit:    41707524 kB
  Committed_AS:   39870868 kB
  VmallocTotal:   34359738367 kB
  VmallocUsed:      150880 kB
  VmallocChunk:   34342404996 kB
  HardwareCorrupted:     0 kB
  HugePages_Total:       0
  HugePages_Free:        0
  HugePages_Rsvd:        0
  HugePages_Surp:        0
  Hugepagesize:       2048 kB
  DirectMap4k:        5824 kB
  DirectMap2M:     3205120 kB
  DirectMap1G:    30408704 kB

We see this despite the machine having vm.swappiness set to 0 in an attempt
to skew the reclaim as far as possible in favour of releasing page cache
instead of swapping anonymous pages.
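
For reference, that setting is the standard sysctl knob; applying and
checking it looks like this (nothing here is specific to these hosts):

  sysctl vm.swappiness=0                          # takes effect immediately
  echo 'vm.swappiness = 0' >> /etc/sysctl.conf    # reapplied at boot
  cat /proc/sys/vm/swappiness                     # confirm the current value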

After running swapoff -a, the machine is immediately much healthier. Even
while the swap is still being reduced, load goes down and response times in
virtual machines are much improved. Once the swap is completely gone, there
are still several gigabytes of RAM left free which are used for buffers, and
the virtual machines are no longer laggy because they are no longer swapped
out. Running swapon -a again, the affected machine waits for about a minute
with zero swap in use, before the amount of swap in use very rapidly
increases to around 2GB and then continues to increase more steadily to 3GB.

We could run with these machines without swap (in the worst cases we're
already doing so), but I'd prefer to have a reserve of swap available in
case of genuine emergency. If it's a choice between swapping out a guest or
oom-killing it, I'd prefer to swap... but I really don't want to swap out
running virtual machines in order to have eight gigabytes of page cache
instead of five!

Is this a problem with the page reclaim priorities, or am I just tuning
these hosts incorrectly? Is there more detailed info than /proc/meminfo
available which might shed more light on what's going wrong here?
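
A few standard places that usually give more detail than /proc/meminfo
alone, suggested here only as a starting point:

  # reclaim and swap counters: page scans, steals, swap-ins/outs, stalls
  grep -E 'pgscan|pgsteal|pswp|allocstall' /proc/vmstat

  # watch si/so (KB swapped in/out per second) while an incident happens
  vmstat 5

  # free-block counts per order, to gauge memory fragmentation
  cat /proc/buddyinfo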

Best wishes,

Chris.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Over-eager swapping
@ 2010-08-02 12:47 ` Chris Webb
  0 siblings, 0 replies; 75+ messages in thread
From: Chris Webb @ 2010-08-02 12:47 UTC (permalink / raw)
  To: linux-mm, linux-kernel

We run a number of relatively large x86-64 hosts with twenty or so qemu-kvm
virtual machines on each of them, and I'm have some trouble with over-eager
swapping on some (but not all) of the machines. This is resulting in
customer reports of very poor response latency from the virtual machines
which have been swapped out, despite the hosts apparently having large
amounts of free memory, and running fine if swap is turned off.

All of the hosts are running a 2.6.32.7 kernel and have ksm enabled with
32GB of RAM and 2x quad-core processors. There is a cluster of Xeon E5420
machines which apparently doesn't exhibit the problem, and a cluster of
2352/2378 Opteron (NUMA) machines, some of which do. The kernel config of
the affected machines is at

  http://cdw.me.uk/tmp/config-2.6.32.7

This differs very little from the config on the unaffected Xeon machines,
essentially just

  -CONFIG_MCORE2=y
  +CONFIG_MK8=y
  -CONFIG_X86_P6_NOP=y

On a typical affected machine, the virtual machines and other processes
would apparently leave around 5.5GB of RAM available for buffers, but the
system seems to want to swap out 3GB of anonymous pages to give itself more
like 9GB of buffers:

  # cat /proc/meminfo 
  MemTotal:       33083420 kB
  MemFree:          693164 kB
  Buffers:         8834380 kB
  Cached:            11212 kB
  SwapCached:      1443524 kB
  Active:         21656844 kB
  Inactive:        8119352 kB
  Active(anon):   17203092 kB
  Inactive(anon):  3729032 kB
  Active(file):    4453752 kB
  Inactive(file):  4390320 kB
  Unevictable:        5472 kB
  Mlocked:            5472 kB
  SwapTotal:      25165816 kB
  SwapFree:       21854572 kB
  Dirty:              4300 kB
  Writeback:             4 kB
  AnonPages:      20780368 kB
  Mapped:             6056 kB
  Shmem:                56 kB
  Slab:             961512 kB
  SReclaimable:     438276 kB
  SUnreclaim:       523236 kB
  KernelStack:       10152 kB
  PageTables:        67176 kB
  NFS_Unstable:          0 kB
  Bounce:                0 kB
  WritebackTmp:          0 kB
  CommitLimit:    41707524 kB
  Committed_AS:   39870868 kB
  VmallocTotal:   34359738367 kB
  VmallocUsed:      150880 kB
  VmallocChunk:   34342404996 kB
  HardwareCorrupted:     0 kB
  HugePages_Total:       0
  HugePages_Free:        0
  HugePages_Rsvd:        0
  HugePages_Surp:        0
  Hugepagesize:       2048 kB
  DirectMap4k:        5824 kB
  DirectMap2M:     3205120 kB
  DirectMap1G:    30408704 kB

We see this despite the machine having vm.swappiness set to 0 in an attempt
to skew the reclaim as far as possible in favour of releasing page cache
instead of swapping anonymous pages.

After running swapoff -a, the machine is immediately much healthier. Even
while the swap is still being reduced, load goes down and response times in
virtual machines are much improved. Once the swap is completely gone, there
are still several gigabytes of RAM left free which are used for buffers, and
the virtual machines are no longer laggy because they are no longer swapped
out. Running swapon -a again, the affected machine waits for about a minute
with zero swap in use, before the amount of swap in use very rapidly
increases to around 2GB and then continues to increase more steadily to 3GB.

We could run with these machines without swap (in the worst cases we're
already doing so), but I'd prefer to have a reserve of swap available in
case of genuine emergency. If it's a choice between swapping out a guest or
oom-killing it, I'd prefer to swap... but I really don't want to swap out
running virtual machines in order to have eight gigabytes of page cache
instead of five!

Is this a problem with the page reclaim priorities, or am I just tuning
these hosts incorrectly? Is there more detailed info than /proc/meminfo
available which might shed more light on what's going wrong here?
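
I can also capture the reclaim and swap-out counters from /proc/vmstat
before and after one of these episodes if that would help, e.g. to show
whether it's kswapd or direct reclaim doing the swapping; something like:

  # grep -E '^(pgscan|pgsteal|pswpin|pswpout|pgmajfault)' /proc/vmstat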

Best wishes,

Chris.


Thread overview: 75+ messages (newest: 2012-04-25 14:42 UTC)
2012-04-23  9:27 Over-eager swapping Richard Davies
2012-04-23 12:07 ` Zdenek Kaspar
2012-04-23 17:19 ` Dave Hansen
2012-04-24  0:35 ` Minchan Kim
2012-04-24 11:16 ` Peter Lieven
2012-04-25 14:41 ` Rik van Riel
  -- strict thread matches above, loose matches on Subject: below --
2010-08-02 12:47 Chris Webb
2010-08-02 23:55 ` Minchan Kim
2010-08-03  3:31   ` Chris Webb
2010-08-03  4:09     ` Minchan Kim
2010-08-03  4:28       ` Wu Fengguang
2010-08-03  4:47         ` Minchan Kim
2010-08-03  6:39           ` Wu Fengguang
2010-08-03 21:49         ` Chris Webb
2010-08-04  2:21           ` Wu Fengguang
2010-08-04  3:10             ` Minchan Kim
2010-08-04  3:24               ` Wu Fengguang
2010-08-04  9:58                 ` Chris Webb
2010-08-04 11:49                   ` Wu Fengguang
2010-08-04 12:04                     ` Chris Webb
2010-08-18 14:38                       ` Wu Fengguang
2010-08-18 14:46                         ` Chris Webb
2010-08-18 15:21                           ` Wu Fengguang
2010-08-18 15:57                             ` Christoph Lameter
2010-08-18 16:20                               ` Wu Fengguang
2010-08-18 15:57                             ` Lee Schermerhorn
2010-08-18 15:58                               ` Chris Webb
2010-08-18 16:13                                 ` Christoph Lameter
2010-08-18 16:32                                   ` Chris Webb
2010-08-19  5:16                                   ` Balbir Singh
2010-08-19 10:20                                   ` Chris Webb
2010-08-19 19:03                                     ` Christoph Lameter
2010-08-18 16:13                                 ` Wu Fengguang
2010-08-18 16:31                                   ` Chris Webb
2010-08-19  5:13         ` Balbir Singh
2010-08-18 16:45 ` Balbir Singh
2010-08-19  9:25   ` Chris Webb
2010-08-19 15:13     ` Balbir Singh
