* Kswapd in 3.2.0-rc5 is a CPU hog
@ 2011-12-21  3:10 Nikolay S.
  2011-12-21  9:52 ` Michal Hocko
  0 siblings, 1 reply; 32+ messages in thread
From: Nikolay S. @ 2011-12-21  3:10 UTC (permalink / raw)
  To: linux-kernel

Hello,

I'm using 3.2-rc5 on a machine which at the moment does almost nothing
except file system operations and network I/O (i.e. a file server),
and there is a problem with kswapd.

I'm playing with dd:
dd if=/some/big/file of=/dev/null bs=8M

I.e. I'm filling page cache.

So when the machine is just rebooted, kswapd during this operation is
almost idle, just 5-8 percent according to top.

After ~5 days of uptime (5 days, 2:10), the same operation demands ~70%
CPU for kswapd:

  PID USER      S %CPU %MEM    TIME+  SWAP COMMAND
  420 root      R   70  0.0  22:09.60    0 kswapd0
17717 nowhere   D   27  0.2   0:01.81  10m dd

In fact, kswapd cpu usage on this operation steadily increases over
time.

Also read performance degrades over time. After reboot:
dd if=/some/big/file of=/dev/null bs=8M
1019+1 records in
1019+1 records out
8553494018 bytes (8.6 GB) copied, 16.211 s, 528 MB/s

After ~5 days uptime:
dd if=/some/big/file of=/dev/null bs=8M
1019+1 records in
1019+1 records out
8553494018 bytes (8.6 GB) copied, 29.0507 s, 294 MB/s

Whereas raw disk sequential read performance stays the same:
dd if=/some/big/file of=/dev/null bs=8M iflag=direct
1019+1 records in
1019+1 records out
8553494018 bytes (8.6 GB) copied, 14.7286 s, 581 MB/s

Also, after dropping caches, the situation improves somewhat, but not
to the state of a freshly restarted system:
  PID USER      S %CPU %MEM    TIME+  SWAP COMMAND
  420 root      S   39  0.0  23:31.17    0 kswapd0
19829 nowhere   D   24  0.2   0:02.72 7764 dd

perf shows:

    31.24%  kswapd0  [kernel.kallsyms]  [k] _raw_spin_lock
    26.19%  kswapd0  [kernel.kallsyms]  [k] shrink_slab
    16.28%  kswapd0  [kernel.kallsyms]  [k] prune_super
     6.55%  kswapd0  [kernel.kallsyms]  [k] grab_super_passive
     5.35%  kswapd0  [kernel.kallsyms]  [k] down_read_trylock
     4.03%  kswapd0  [kernel.kallsyms]  [k] up_read
     2.31%  kswapd0  [kernel.kallsyms]  [k] put_super
     1.81%  kswapd0  [kernel.kallsyms]  [k] drop_super
     0.99%  kswapd0  [kernel.kallsyms]  [k] __put_super
     0.25%  kswapd0  [kernel.kallsyms]  [k] __isolate_lru_page
     0.23%  kswapd0  [kernel.kallsyms]  [k] free_pcppages_bulk
     0.19%  kswapd0  [r8169]            [k] rtl8169_interrupt
     0.15%  kswapd0  [kernel.kallsyms]  [k] twa_interrupt

P.S.: The message above was written a couple of days ago. Now I'm at 10
days of uptime, and this is the result as of today:
  PID USER      S %CPU %MEM    TIME+  SWAP COMMAND
  420 root      R   93  0.0 110:48.48    0 kswapd0
30085 nowhere   D   42  0.2   0:04.36  10m dd

PPS: Please CC me.



* Re: Kswapd in 3.2.0-rc5 is a CPU hog
  2011-12-21  3:10 Kswapd in 3.2.0-rc5 is a CPU hog Nikolay S.
@ 2011-12-21  9:52 ` Michal Hocko
  2011-12-21 10:15   ` nowhere
  2011-12-21 22:55   ` Dave Chinner
  0 siblings, 2 replies; 32+ messages in thread
From: Michal Hocko @ 2011-12-21  9:52 UTC (permalink / raw)
  To: Nikolay S.; +Cc: linux-kernel, linux-mm

[Let's CC linux-mm]

On Wed 21-12-11 07:10:36, Nikolay S. wrote:
> Hello,
> 
> I'm using 3.2-rc5 on a machine which at the moment does almost nothing
> except file system operations and network I/O (i.e. a file server),
> and there is a problem with kswapd.

What kind of filesystem do you use?

> 
> I'm playing with dd:
> dd if=/some/big/file of=/dev/null bs=8M
> 
> I.e. I'm filling page cache.
> 
> So when the machine is just rebooted, kswapd during this operation is
> almost idle, just 5-8 percent according to top.
> 
> After ~5 days of uptime (5 days, 2:10), the same operation demands ~70%
> CPU for kswapd:
> 
>   PID USER      S %CPU %MEM    TIME+  SWAP COMMAND
>   420 root      R   70  0.0  22:09.60    0 kswapd0
> 17717 nowhere   D   27  0.2   0:01.81  10m dd
> 
> In fact, kswapd cpu usage on this operation steadily increases over
> time.
> 
> Also read performance degrades over time. After reboot:
> dd if=/some/big/file of=/dev/null bs=8M
> 1019+1 records in
> 1019+1 records out
> 8553494018 bytes (8.6 GB) copied, 16.211 s, 528 MB/s
> 
> After ~5 days uptime:
> dd if=/some/big/file of=/dev/null bs=8M
> 1019+1 records in
> 1019+1 records out
> 8553494018 bytes (8.6 GB) copied, 29.0507 s, 294 MB/s
> 
> Whereas raw disk sequential read performance stays the same:
> dd if=/some/big/file of=/dev/null bs=8M iflag=direct
> 1019+1 records in
> 1019+1 records out
> 8553494018 bytes (8.6 GB) copied, 14.7286 s, 581 MB/s
> 
> Also, after dropping caches, the situation improves somewhat, but not
> to the state of a freshly restarted system:
>   PID USER      S %CPU %MEM    TIME+  SWAP COMMAND
>   420 root      S   39  0.0  23:31.17    0 kswapd0
> 19829 nowhere   D   24  0.2   0:02.72 7764 dd
> 
> perf shows:
> 
>     31.24%  kswapd0  [kernel.kallsyms]  [k] _raw_spin_lock
>     26.19%  kswapd0  [kernel.kallsyms]  [k] shrink_slab
>     16.28%  kswapd0  [kernel.kallsyms]  [k] prune_super
>      6.55%  kswapd0  [kernel.kallsyms]  [k] grab_super_passive
>      5.35%  kswapd0  [kernel.kallsyms]  [k] down_read_trylock
>      4.03%  kswapd0  [kernel.kallsyms]  [k] up_read
>      2.31%  kswapd0  [kernel.kallsyms]  [k] put_super
>      1.81%  kswapd0  [kernel.kallsyms]  [k] drop_super
>      0.99%  kswapd0  [kernel.kallsyms]  [k] __put_super
>      0.25%  kswapd0  [kernel.kallsyms]  [k] __isolate_lru_page
>      0.23%  kswapd0  [kernel.kallsyms]  [k] free_pcppages_bulk
>      0.19%  kswapd0  [r8169]            [k] rtl8169_interrupt
>      0.15%  kswapd0  [kernel.kallsyms]  [k] twa_interrupt

Quite a lot of time spent shrinking slab (dcache I guess) and a lot of
spin lock contention.
Could you also take a few snapshots of /proc/420/stack to see what
kswapd is doing?
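
For context: in 3.2 every superblock registers its own shrinker, so each
shrink_slab() pass calls prune_super() once per mounted filesystem, and
grab_super_passive() takes the global sb_lock spinlock on every call.
That would line up with the _raw_spin_lock, down_read_trylock and
put_super samples above. Roughly, paraphrased from 3.2's fs/super.c
(trimmed, not line-for-line):

/* Per-superblock shrinker entry point, registered at mount time. */
static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
{
	struct super_block *sb =
		container_of(shrink, struct super_block, s_shrink);

	if (!grab_super_passive(sb))	/* sb_lock + s_umount trylock */
		return -1;
	/* ... scan/prune unused dentries and inodes ... */
	drop_super(sb);			/* takes sb_lock again */
	return 0;
}

bool grab_super_passive(struct super_block *sb)
{
	spin_lock(&sb_lock);		/* one global spinlock */
	sb->s_count++;
	spin_unlock(&sb_lock);

	if (down_read_trylock(&sb->s_umount)) {
		if (sb->s_root)
			return true;
		up_read(&sb->s_umount);
	}
	put_super(sb);			/* sb_lock once more */
	return false;
}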

> 
> P.S.: The message above was written a couple of days ago. Now I'm at 10
> days of uptime, and this is the result as of today:
>   PID USER      S %CPU %MEM    TIME+  SWAP COMMAND
>   420 root      R   93  0.0 110:48.48    0 kswapd0
> 30085 nowhere   D   42  0.2   0:04.36  10m dd
> 
> PPS: Please CC me.

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


* Re: Kswapd in 3.2.0-rc5 is a CPU hog
  2011-12-21  9:52 ` Michal Hocko
@ 2011-12-21 10:15   ` nowhere
  2011-12-21 10:24     ` Michal Hocko
  2011-12-21 22:55   ` Dave Chinner
  1 sibling, 1 reply; 32+ messages in thread
From: nowhere @ 2011-12-21 10:15 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-kernel, linux-mm

On Wed, 21/12/2011 at 10:52 +0100, Michal Hocko wrote:
> [Let's CC linux-mm]
> 
> On Wed 21-12-11 07:10:36, Nikolay S. wrote:
> > Hello,
> > 
> > I'm using 3.2-rc5 on a machine which at the moment does almost nothing
> > except file system operations and network I/O (i.e. a file server),
> > and there is a problem with kswapd.
> 
> What kind of filesystem do you use?

Well, that is XFS.
I have a large volume with ~200000 files and a periodic job which
checks all files' timestamps every 30 minutes and takes action if a
timestamp has changed.
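
The job is basically an mtime sweep over the volume. A hypothetical
sketch of that kind of scan (the real job is not this code; the path
and names are made up), which also shows why ~200000 dentries and
inodes end up cached - every stat() touches one of each:

#define _XOPEN_SOURCE 500	/* for nftw() */
#include <ftw.h>
#include <stdio.h>
#include <time.h>

static time_t last_run;		/* time of the previous sweep */

static int check_file(const char *path, const struct stat *st,
		      int type, struct FTW *ftw)
{
	/* Act on regular files modified since the last sweep. */
	if (type == FTW_F && st->st_mtime > last_run)
		printf("changed: %s\n", path);
	return 0;		/* continue the walk */
}

int main(void)
{
	last_run = time(NULL) - 30 * 60;	/* 30-minute period */
	return nftw("/some/big/volume", check_file, 64, FTW_PHYS);
}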

> 
> > 
> > I'm playing with dd:
> > dd if=/some/big/file of=/dev/null bs=8M
> > 
> > I.e. I'm filling page cache.
> > 
> > So when the machine is just rebooted, kswapd during this operation is
> > almost idle, just 5-8 percent according to top.
> > 
> > After ~5 days of uptime (5 days, 2:10), the same operation demands ~70%
> > CPU for kswapd:
> > 
> >   PID USER      S %CPU %MEM    TIME+  SWAP COMMAND
> >   420 root      R   70  0.0  22:09.60    0 kswapd0
> > 17717 nowhere   D   27  0.2   0:01.81  10m dd
> > 
> > In fact, kswapd cpu usage on this operation steadily increases over
> > time.
> > 
> > Also read performance degrades over time. After reboot:
> > dd if=/some/big/file of=/dev/null bs=8M
> > 1019+1 records in
> > 1019+1 records out
> > 8553494018 bytes (8.6 GB) copied, 16.211 s, 528 MB/s
> > 
> > After ~5 days uptime:
> > dd if=/some/big/file of=/dev/null bs=8M
> > 1019+1 records in
> > 1019+1 records out
> > 8553494018 bytes (8.6 GB) copied, 29.0507 s, 294 MB/s
> > 
> > Whereas raw disk sequential read performance stays the same:
> > dd if=/some/big/file of=/dev/null bs=8M iflag=direct
> > 1019+1 records in
> > 1019+1 records out
> > 8553494018 bytes (8.6 GB) copied, 14.7286 s, 581 MB/s
> > 
> > Also, after dropping caches, the situation improves somewhat, but not
> > to the state of a freshly restarted system:
> >   PID USER      S %CPU %MEM    TIME+  SWAP COMMAND
> >   420 root      S   39  0.0  23:31.17    0 kswapd0
> > 19829 nowhere   D   24  0.2   0:02.72 7764 dd
> > 
> > perf shows:
> > 
> >     31.24%  kswapd0  [kernel.kallsyms]  [k] _raw_spin_lock
> >     26.19%  kswapd0  [kernel.kallsyms]  [k] shrink_slab
> >     16.28%  kswapd0  [kernel.kallsyms]  [k] prune_super
> >      6.55%  kswapd0  [kernel.kallsyms]  [k] grab_super_passive
> >      5.35%  kswapd0  [kernel.kallsyms]  [k] down_read_trylock
> >      4.03%  kswapd0  [kernel.kallsyms]  [k] up_read
> >      2.31%  kswapd0  [kernel.kallsyms]  [k] put_super
> >      1.81%  kswapd0  [kernel.kallsyms]  [k] drop_super
> >      0.99%  kswapd0  [kernel.kallsyms]  [k] __put_super
> >      0.25%  kswapd0  [kernel.kallsyms]  [k] __isolate_lru_page
> >      0.23%  kswapd0  [kernel.kallsyms]  [k] free_pcppages_bulk
> >      0.19%  kswapd0  [r8169]            [k] rtl8169_interrupt
> >      0.15%  kswapd0  [kernel.kallsyms]  [k] twa_interrupt
> 
> Quite a lot of time spent shrinking slab (dcache I guess) and a lot of
> spin lock contention.

This is slabinfo, sorted by num objects:

xfs_inode         192941 193205    960   17    4 : tunables    0    0    0 : slabdata  11365  11365      0
dentry            118818 118818    192   21    1 : tunables    0    0    0 : slabdata   5658   5658      0
kmalloc-256       107920 107920    256   16    1 : tunables    0    0    0 : slabdata   6745   6745      0
kmalloc-64         59912 102656     64   64    1 : tunables    0    0    0 : slabdata   1604   1604      0
radix_tree_node    30618  33474    568   14    2 : tunables    0    0    0 : slabdata   2391   2391      0
kmalloc-96         27092  41202     96   42    1 : tunables    0    0    0 : slabdata    981    981      0
buffer_head        24892  63843    104   39    1 : tunables    0    0    0 : slabdata   1637   1637      0
kmalloc-192        23332  34503    192   21    1 : tunables    0    0    0 : slabdata   1643   1643      0
sysfs_dir_cache    17444  17444    144   28    1 : tunables    0    0    0 : slabdata    623    623      0
arp_cache          12863  14796    320   12    1 : tunables    0    0    0 : slabdata   1233   1233      0
kmalloc-512        11051  11440    512   16    2 : tunables    0    0    0 : slabdata    715    715      0
kmalloc-128        10611  13152    128   32    1 : tunables    0    0    0 : slabdata    411    411      0
ext4_inode_cache    9660  18018    880   18    4 : tunables    0    0    0 : slabdata   1001   1001      0
kmalloc-8           8704   8704      8  512    1 : tunables    0    0    0 : slabdata     17     17      0
ext4_io_page        6912   6912     16  256    1 : tunables    0    0    0 : slabdata     27     27      0
anon_vma_chain      6701  10880     48   85    1 : tunables    0    0    0 : slabdata    128    128      0
Acpi-Namespace      6611   8058     40  102    1 : tunables    0    0    0 : slabdata     79     79      0
fsnotify_event_holder   6290   6970     24  170    1 : tunables    0    0    0 : slabdata     41     41      0
kmalloc-1024        5813   5888   1024   16    4 : tunables    0    0    0 : slabdata    368    368      0
vm_area_struct      5664   5664    168   24    1 : tunables    0    0    0 : slabdata    236    236      0
reiser_inode_cache   3992   5198    704   23    4 : tunables    0    0    0 : slabdata    226    226      0
Acpi-ParseExt       3808   3808     72   56    1 : tunables    0    0    0 : slabdata     68     68      0
kmalloc-2048        3587   3888   2048   16    8 : tunables    0    0    0 : slabdata    243    243      0
proc_inode_cache    3498   3510    624   13    2 : tunables    0    0    0 : slabdata    270    270      0
anon_vma            3380   3640     72   56    1 : tunables    0    0    0 : slabdata     65     65      0
kmalloc-16          3072   3072     16  256    1 : tunables    0    0    0 : slabdata     12     12      0
inode_cache         3024   3024    560   14    2 : tunables    0    0    0 : slabdata    216    216      0
ext4_allocation_context   3000   3000    136   30    1 : tunables    0    0    0 : slabdata    100    100      0
nf_conntrack_ffffffff81776d40   2910   3549    312   13    1 : tunables    0    0    0 : slabdata    273    273      0
kmalloc-4096        2792   3136   4096    8    8 : tunables    0    0    0 : slabdata    392    392      0
ext4_free_data      2701   2701     56   73    1 : tunables    0    0    0 : slabdata     37     37      0
pid_namespace       2130   2130   2112   15    8 : tunables    0    0    0 : slabdata    142    142      0
mqueue_inode_cache   2124   2124    896   18    4 : tunables    0    0    0 : slabdata    118    118      0
jbd2_revoke_record   1664   1664     32  128    1 : tunables    0    0    0 : slabdata     13     13      0
kmalloc-32          1434   3072     32  128    1 : tunables    0    0    0 : slabdata     24     24      0
shmem_inode_cache   1400   1521    624   13    2 : tunables    0    0    0 : slabdata    117    117      0
xfs_ili             1139   1260    216   18    1 : tunables    0    0    0 : slabdata     70     70      0
nfsd4_stateids      1131   1496    120   34    1 : tunables    0    0    0 : slabdata     44     44      0
idr_layer_cache      900    900    544   15    2 : tunables    0    0    0 : slabdata     60     60      0
jbd2_journal_head    742   1008    112   36    1 : tunables    0    0    0 : slabdata     28     28      0
fsnotify_event       648    648    112   36    1 : tunables    0    0    0 : slabdata     18     18      0
sock_inode_cache     577    624    640   12    2 : tunables    0    0    0 : slabdata     52     52      0
tw_sock_TCP          504    504    192   21    1 : tunables    0    0    0 : slabdata     24     24      0
TCP                  356    414   1728   18    8 : tunables    0    0    0 : slabdata     23     23      0
RAW                  342    342    832   19    4 : tunables    0    0    0 : slabdata     18     18      0
jbd2_journal_handle    340    340     24  170    1 : tunables    0    0    0 : slabdata      2      2      0
blkdev_requests      322    322    344   23    2 : tunables    0    0    0 : slabdata     14     14      0
task_struct          293    357   1504   21    8 : tunables    0    0    0 : slabdata     17     17      0
UDP                  285    285    832   19    4 : tunables    0    0    0 : slabdata     15     15      0
files_cache          276    276    704   23    4 : tunables    0    0    0 : slabdata     12     12      0
nfsd4_openowners     220    320    392   20    2 : tunables    0    0    0 : slabdata     16     16      0
mm_struct            216    216    896   18    4 : tunables    0    0    0 : slabdata     12     12      0
sighand_cache        199    225   2112   15    8 : tunables    0    0    0 : slabdata     15     15      0
nfsd4_delegations    198    198    368   22    2 : tunables    0    0    0 : slabdata      9      9      0
kmem_cache_node      192    192     64   64    1 : tunables    0    0    0 : slabdata      3      3      0
xfs_buf_item         162    162    224   18    1 : tunables    0    0    0 : slabdata      9      9      0
ip_fib_trie          146    146     56   73    1 : tunables    0    0    0 : slabdata      2      2      0
ext4_io_end          140    154   1128   14    4 : tunables    0    0    0 : slabdata     11     11      0
dnotify_mark         120    120    136   30    1 : tunables    0    0    0 : slabdata      4      4      0
TCPv6                104    119   1856   17    8 : tunables    0    0    0 : slabdata      7      7      0
cfq_queue            102    102    232   17    1 : tunables    0    0    0 : slabdata      6      6      0
Acpi-State           102    102     80   51    1 : tunables    0    0    0 : slabdata      2      2      0
sigqueue             100    100    160   25    1 : tunables    0    0    0 : slabdata      4      4      0
xfs_efd_item          80     80    400   20    2 : tunables    0    0    0 : slabdata      4      4      0
tw_sock_TCPv6         64    144    256   16    1 : tunables    0    0    0 : slabdata      9      9      0
bdev_cache            57     57    832   19    4 : tunables    0    0    0 : slabdata      3      3      0
blkdev_queue          54     54   1744   18    8 : tunables    0    0    0 : slabdata      3      3      0
net_namespace         52     52   2432   13    8 : tunables    0    0    0 : slabdata      4      4      0
kmalloc-8192          52     52   8192    4    8 : tunables    0    0    0 : slabdata     13     13      0
kmem_cache            42     42    192   21    1 : tunables    0    0    0 : slabdata      2      2      0
xfs_log_ticket        40     40    200   20    1 : tunables    0    0    0 : slabdata      2      2      0
xfs_btree_cur         38     38    208   19    1 : tunables    0    0    0 : slabdata      2      2      0
rpc_inode_cache       38     38    832   19    4 : tunables    0    0    0 : slabdata      2      2      0
nf_conntrack_expect     34     34    240   17    1 : tunables    0    0    0 : slabdata      2      2      0
xfs_da_state          32     32    488   16    2 : tunables    0    0    0 : slabdata      2      2      0
UDPv6                 32     32   1024   16    4 : tunables    0    0    0 : slabdata      2      2      0
xfs_trans             28     28    280   14    1 : tunables    0    0    0 : slabdata      2      2      0
taskstats             24     24    328   12    1 : tunables    0    0    0 : slabdata      2      2      0
dio                   24     24    640   12    2 : tunables    0    0    0 : slabdata      2      2      0
posix_timers_cache     23     23    176   23    1 : tunables    0    0    0 : slabdata      1      1      0
hugetlbfs_inode_cache     14     14    560   14    2 : tunables    0    0    0 : slabdata      1      1      0
xfrm_dst_cache         0      0    384   21    2 : tunables    0    0    0 : slabdata      0      0      0
user_namespace         0      0   1072   15    4 : tunables    0    0    0 : slabdata      0      0      0
UDPLITEv6              0      0   1024   16    4 : tunables    0    0    0 : slabdata      0      0      0
UDP-Lite               0      0    832   19    4 : tunables    0    0    0 : slabdata      0      0      0
kcopyd_job             0      0   3240   10    8 : tunables    0    0    0 : slabdata      0      0      0
flow_cache             0      0    104   39    1 : tunables    0    0    0 : slabdata      0      0      0
ext2_xattr             0      0     88   46    1 : tunables    0    0    0 : slabdata      0      0      0
ext2_inode_cache       0      0    752   21    4 : tunables    0    0    0 : slabdata      0      0      0
dquot                  0      0    256   16    1 : tunables    0    0    0 : slabdata      0      0      0
dm_uevent              0      0   2608   12    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-96         0      0     96   42    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-8192       0      0   8192    4    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-8          0      0      8  512    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-64         0      0     64   64    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-512        0      0    512   16    2 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-4096       0      0   4096    8    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-32         0      0     32  128    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-256        0      0    256   16    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-2048       0      0   2048   16    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-192        0      0    192   21    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-16         0      0     16  256    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-128        0      0    128   32    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-1024       0      0   1024   16    4 : tunables    0    0    0 : slabdata      0      0      0
bsg_cmd                0      0    312   13    1 : tunables    0    0    0 : slabdata      0      0      0
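
For scale, the top line alone: ~193000 xfs_inode objects at 960 bytes
each is roughly 185 MB of inode slab, and every cached dentry pins one
of those inodes.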

> Could you also take a few snapshots of /proc/420/stack to see what
> kswapd is doing?

Uhm, there is no such entry in /proc. I guess I need to enable some
kernel option and recompile?
This will reset uptime, and it will take another 5-10 days for the
problem to build up again..

> 
> > 
> > P.S.: The message above was written a couple of days ago. Now I'm at 10
> > days of uptime, and this is the result as of today:
> >   PID USER      S %CPU %MEM    TIME+  SWAP COMMAND
> >   420 root      R   93  0.0 110:48.48    0 kswapd0
> > 30085 nowhere   D   42  0.2   0:04.36  10m dd
> > 
> > PPS: Please CC me.
> 




* Re: Kswapd in 3.2.0-rc5 is a CPU hog
  2011-12-21 10:15   ` nowhere
@ 2011-12-21 10:24     ` Michal Hocko
  2011-12-21 10:52       ` nowhere
  2011-12-21 14:06       ` Alex Elder
  0 siblings, 2 replies; 32+ messages in thread
From: Michal Hocko @ 2011-12-21 10:24 UTC (permalink / raw)
  To: nowhere
  Cc: linux-kernel, linux-mm, Christoph Hellwig, Dave Chinner, xfs,
	Ben Myers, Alex Elder

[Let's cc also some fs and xfs people]

On Wed 21-12-11 14:15:21, nowhere wrote:
> On Wed, 21/12/2011 at 10:52 +0100, Michal Hocko wrote:
> > [Let's CC linux-mm]
> > 
> > On Wed 21-12-11 07:10:36, Nikolay S. wrote:
> > > Hello,
> > > 
> > > I'm using 3.2-rc5 on a machine which at the moment does almost nothing
> > > except file system operations and network I/O (i.e. a file server),
> > > and there is a problem with kswapd.
> > 
> > What kind of filesystem do you use?
> 
> Well, that is XFS.
> I have a large volume with ~200000 files and a periodic job which
> checks all files' timestamps every 30 minutes and takes action if a
> timestamp has changed.

Is this the first time you are seeing this? I am not familiar with xfs
at all, but the number of files sounds like the dcache shrinker might
be really busy...
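
That would also fit the dd test: in 3.2, shrink_slab() gives each
registered shrinker work in proportion to the number of LRU pages just
scanned, so a big streaming read keeps every shrinker busy even when
there is little slab left to reclaim. The core of the loop, paraphrased
from 3.2's mm/vmscan.c (trimmed, not line-for-line):

list_for_each_entry(shrinker, &shrinker_list, list) {
	unsigned long long delta;
	long max_pass, total_scan;

	/* How many objects could this shrinker free right now? */
	max_pass = do_shrinker_shrink(shrinker, shrink, 0);

	/* Slab scanning is proportional to page scanning just done. */
	delta = (4 * nr_pages_scanned) / shrinker->seeks;
	delta *= max_pass;
	do_div(delta, lru_pages + 1);
	total_scan = delta;

	/* ... the shrinker is then invoked in batches until total_scan
	 * is consumed; for a superblock that means prune_super(). */
}

The more page cache dd pushes through, the larger nr_pages_scanned gets,
and the harder kswapd hammers the per-sb shrinkers.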

> > > I'm playing with dd:
> > > dd if=/some/big/file of=/dev/null bs=8M
> > > 
> > > I.e. I'm filling page cache.
> > > 
> > > So when the machine is just rebooted, kswapd during this operation is
> > > almost idle, just 5-8 percent according to top.
> > > 
> > > After ~5 days of uptime (5 days, 2:10), the same operation demands ~70%
> > > CPU for kswapd:
> > > 
> > >   PID USER      S %CPU %MEM    TIME+  SWAP COMMAND
> > >   420 root      R   70  0.0  22:09.60    0 kswapd0
> > > 17717 nowhere   D   27  0.2   0:01.81  10m dd
> > > 
> > > In fact, kswapd cpu usage on this operation steadily increases over
> > > time.
> > > 
> > > Also read performance degrades over time. After reboot:
> > > dd if=/some/big/file of=/dev/null bs=8M
> > > 1019+1 records in
> > > 1019+1 records out
> > > 8553494018 bytes (8.6 GB) copied, 16.211 s, 528 MB/s
> > > 
> > > After ~5 days uptime:
> > > dd if=/some/big/file of=/dev/null bs=8M
> > > 1019+1 records in
> > > 1019+1 records out
> > > 8553494018 bytes (8.6 GB) copied, 29.0507 s, 294 MB/s
> > > 
> > > Whereas raw disk sequential read performance stays the same:
> > > dd if=/some/big/file of=/dev/null bs=8M iflag=direct
> > > 1019+1 records in
> > > 1019+1 records out
> > > 8553494018 bytes (8.6 GB) copied, 14.7286 s, 581 MB/s
> > > 
> > > Also, after dropping caches, the situation improves somewhat, but not
> > > to the state of a freshly restarted system:
> > >   PID USER      S %CPU %MEM    TIME+  SWAP COMMAND
> > >   420 root      S   39  0.0  23:31.17    0 kswapd0
> > > 19829 nowhere   D   24  0.2   0:02.72 7764 dd
> > > 
> > > perf shows:
> > > 
> > >     31.24%  kswapd0  [kernel.kallsyms]  [k] _raw_spin_lock
> > >     26.19%  kswapd0  [kernel.kallsyms]  [k] shrink_slab
> > >     16.28%  kswapd0  [kernel.kallsyms]  [k] prune_super
> > >      6.55%  kswapd0  [kernel.kallsyms]  [k] grab_super_passive
> > >      5.35%  kswapd0  [kernel.kallsyms]  [k] down_read_trylock
> > >      4.03%  kswapd0  [kernel.kallsyms]  [k] up_read
> > >      2.31%  kswapd0  [kernel.kallsyms]  [k] put_super
> > >      1.81%  kswapd0  [kernel.kallsyms]  [k] drop_super
> > >      0.99%  kswapd0  [kernel.kallsyms]  [k] __put_super
> > >      0.25%  kswapd0  [kernel.kallsyms]  [k] __isolate_lru_page
> > >      0.23%  kswapd0  [kernel.kallsyms]  [k] free_pcppages_bulk
> > >      0.19%  kswapd0  [r8169]            [k] rtl8169_interrupt
> > >      0.15%  kswapd0  [kernel.kallsyms]  [k] twa_interrupt
> > 
> > Quite a lot of time spent shrinking slab (dcache I guess) and a lot of
> > spin lock contention.
> 
> This is slabinfo, sorted by num objects:
> 
> xfs_inode         192941 193205    960   17    4 : tunables    0    0    0 : slabdata  11365  11365      0
> dentry            118818 118818    192   21    1 : tunables    0    0    0 : slabdata   5658   5658      0
> kmalloc-256       107920 107920    256   16    1 : tunables    0    0    0 : slabdata   6745   6745      0
> kmalloc-64         59912 102656     64   64    1 : tunables    0    0    0 : slabdata   1604   1604      0
> radix_tree_node    30618  33474    568   14    2 : tunables    0    0    0 : slabdata   2391   2391      0
> kmalloc-96         27092  41202     96   42    1 : tunables    0    0    0 : slabdata    981    981      0
> buffer_head        24892  63843    104   39    1 : tunables    0    0    0 : slabdata   1637   1637      0
> kmalloc-192        23332  34503    192   21    1 : tunables    0    0    0 : slabdata   1643   1643      0
> sysfs_dir_cache    17444  17444    144   28    1 : tunables    0    0    0 : slabdata    623    623      0
> arp_cache          12863  14796    320   12    1 : tunables    0    0    0 : slabdata   1233   1233      0
> kmalloc-512        11051  11440    512   16    2 : tunables    0    0    0 : slabdata    715    715      0
> kmalloc-128        10611  13152    128   32    1 : tunables    0    0    0 : slabdata    411    411      0
> ext4_inode_cache    9660  18018    880   18    4 : tunables    0    0    0 : slabdata   1001   1001      0
> kmalloc-8           8704   8704      8  512    1 : tunables    0    0    0 : slabdata     17     17      0
> ext4_io_page        6912   6912     16  256    1 : tunables    0    0    0 : slabdata     27     27      0
> anon_vma_chain      6701  10880     48   85    1 : tunables    0    0    0 : slabdata    128    128      0
> Acpi-Namespace      6611   8058     40  102    1 : tunables    0    0    0 : slabdata     79     79      0
> fsnotify_event_holder   6290   6970     24  170    1 : tunables    0    0    0 : slabdata     41     41      0
> kmalloc-1024        5813   5888   1024   16    4 : tunables    0    0    0 : slabdata    368    368      0
> vm_area_struct      5664   5664    168   24    1 : tunables    0    0    0 : slabdata    236    236      0
> reiser_inode_cache   3992   5198    704   23    4 : tunables    0    0    0 : slabdata    226    226      0
> Acpi-ParseExt       3808   3808     72   56    1 : tunables    0    0    0 : slabdata     68     68      0
> kmalloc-2048        3587   3888   2048   16    8 : tunables    0    0    0 : slabdata    243    243      0
> proc_inode_cache    3498   3510    624   13    2 : tunables    0    0    0 : slabdata    270    270      0
> anon_vma            3380   3640     72   56    1 : tunables    0    0    0 : slabdata     65     65      0
> kmalloc-16          3072   3072     16  256    1 : tunables    0    0    0 : slabdata     12     12      0
> inode_cache         3024   3024    560   14    2 : tunables    0    0    0 : slabdata    216    216      0
> ext4_allocation_context   3000   3000    136   30    1 : tunables    0    0    0 : slabdata    100    100      0
> nf_conntrack_ffffffff81776d40   2910   3549    312   13    1 : tunables    0    0    0 : slabdata    273    273      0
> kmalloc-4096        2792   3136   4096    8    8 : tunables    0    0    0 : slabdata    392    392      0
> ext4_free_data      2701   2701     56   73    1 : tunables    0    0    0 : slabdata     37     37      0
> pid_namespace       2130   2130   2112   15    8 : tunables    0    0    0 : slabdata    142    142      0
> mqueue_inode_cache   2124   2124    896   18    4 : tunables    0    0    0 : slabdata    118    118      0
> jbd2_revoke_record   1664   1664     32  128    1 : tunables    0    0    0 : slabdata     13     13      0
> kmalloc-32          1434   3072     32  128    1 : tunables    0    0    0 : slabdata     24     24      0
> shmem_inode_cache   1400   1521    624   13    2 : tunables    0    0    0 : slabdata    117    117      0
> xfs_ili             1139   1260    216   18    1 : tunables    0    0    0 : slabdata     70     70      0
> nfsd4_stateids      1131   1496    120   34    1 : tunables    0    0    0 : slabdata     44     44      0
> idr_layer_cache      900    900    544   15    2 : tunables    0    0    0 : slabdata     60     60      0
> jbd2_journal_head    742   1008    112   36    1 : tunables    0    0    0 : slabdata     28     28      0
> fsnotify_event       648    648    112   36    1 : tunables    0    0    0 : slabdata     18     18      0
> sock_inode_cache     577    624    640   12    2 : tunables    0    0    0 : slabdata     52     52      0
> tw_sock_TCP          504    504    192   21    1 : tunables    0    0    0 : slabdata     24     24      0
> TCP                  356    414   1728   18    8 : tunables    0    0    0 : slabdata     23     23      0
> RAW                  342    342    832   19    4 : tunables    0    0    0 : slabdata     18     18      0
> jbd2_journal_handle    340    340     24  170    1 : tunables    0    0    0 : slabdata      2      2      0
> blkdev_requests      322    322    344   23    2 : tunables    0    0    0 : slabdata     14     14      0
> task_struct          293    357   1504   21    8 : tunables    0    0    0 : slabdata     17     17      0
> UDP                  285    285    832   19    4 : tunables    0    0    0 : slabdata     15     15      0
> files_cache          276    276    704   23    4 : tunables    0    0    0 : slabdata     12     12      0
> nfsd4_openowners     220    320    392   20    2 : tunables    0    0    0 : slabdata     16     16      0
> mm_struct            216    216    896   18    4 : tunables    0    0    0 : slabdata     12     12      0
> sighand_cache        199    225   2112   15    8 : tunables    0    0    0 : slabdata     15     15      0
> nfsd4_delegations    198    198    368   22    2 : tunables    0    0    0 : slabdata      9      9      0
> kmem_cache_node      192    192     64   64    1 : tunables    0    0    0 : slabdata      3      3      0
> xfs_buf_item         162    162    224   18    1 : tunables    0    0    0 : slabdata      9      9      0
> ip_fib_trie          146    146     56   73    1 : tunables    0    0    0 : slabdata      2      2      0
> ext4_io_end          140    154   1128   14    4 : tunables    0    0    0 : slabdata     11     11      0
> dnotify_mark         120    120    136   30    1 : tunables    0    0    0 : slabdata      4      4      0
> TCPv6                104    119   1856   17    8 : tunables    0    0    0 : slabdata      7      7      0
> cfq_queue            102    102    232   17    1 : tunables    0    0    0 : slabdata      6      6      0
> Acpi-State           102    102     80   51    1 : tunables    0    0    0 : slabdata      2      2      0
> sigqueue             100    100    160   25    1 : tunables    0    0    0 : slabdata      4      4      0
> xfs_efd_item          80     80    400   20    2 : tunables    0    0    0 : slabdata      4      4      0
> tw_sock_TCPv6         64    144    256   16    1 : tunables    0    0    0 : slabdata      9      9      0
> bdev_cache            57     57    832   19    4 : tunables    0    0    0 : slabdata      3      3      0
> blkdev_queue          54     54   1744   18    8 : tunables    0    0    0 : slabdata      3      3      0
> net_namespace         52     52   2432   13    8 : tunables    0    0    0 : slabdata      4      4      0
> kmalloc-8192          52     52   8192    4    8 : tunables    0    0    0 : slabdata     13     13      0
> kmem_cache            42     42    192   21    1 : tunables    0    0    0 : slabdata      2      2      0
> xfs_log_ticket        40     40    200   20    1 : tunables    0    0    0 : slabdata      2      2      0
> xfs_btree_cur         38     38    208   19    1 : tunables    0    0    0 : slabdata      2      2      0
> rpc_inode_cache       38     38    832   19    4 : tunables    0    0    0 : slabdata      2      2      0
> nf_conntrack_expect     34     34    240   17    1 : tunables    0    0    0 : slabdata      2      2      0
> xfs_da_state          32     32    488   16    2 : tunables    0    0    0 : slabdata      2      2      0
> UDPv6                 32     32   1024   16    4 : tunables    0    0    0 : slabdata      2      2      0
> xfs_trans             28     28    280   14    1 : tunables    0    0    0 : slabdata      2      2      0
> taskstats             24     24    328   12    1 : tunables    0    0    0 : slabdata      2      2      0
> dio                   24     24    640   12    2 : tunables    0    0    0 : slabdata      2      2      0
> posix_timers_cache     23     23    176   23    1 : tunables    0    0    0 : slabdata      1      1      0
> hugetlbfs_inode_cache     14     14    560   14    2 : tunables    0    0    0 : slabdata      1      1      0
> xfrm_dst_cache         0      0    384   21    2 : tunables    0    0    0 : slabdata      0      0      0
> user_namespace         0      0   1072   15    4 : tunables    0    0    0 : slabdata      0      0      0
> UDPLITEv6              0      0   1024   16    4 : tunables    0    0    0 : slabdata      0      0      0
> UDP-Lite               0      0    832   19    4 : tunables    0    0    0 : slabdata      0      0      0
> kcopyd_job             0      0   3240   10    8 : tunables    0    0    0 : slabdata      0      0      0
> flow_cache             0      0    104   39    1 : tunables    0    0    0 : slabdata      0      0      0
> ext2_xattr             0      0     88   46    1 : tunables    0    0    0 : slabdata      0      0      0
> ext2_inode_cache       0      0    752   21    4 : tunables    0    0    0 : slabdata      0      0      0
> dquot                  0      0    256   16    1 : tunables    0    0    0 : slabdata      0      0      0
> dm_uevent              0      0   2608   12    8 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-96         0      0     96   42    1 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-8192       0      0   8192    4    8 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-8          0      0      8  512    1 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-64         0      0     64   64    1 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-512        0      0    512   16    2 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-4096       0      0   4096    8    8 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-32         0      0     32  128    1 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-256        0      0    256   16    1 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-2048       0      0   2048   16    8 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-192        0      0    192   21    1 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-16         0      0     16  256    1 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-128        0      0    128   32    1 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-1024       0      0   1024   16    4 : tunables    0    0    0 : slabdata      0      0      0
> bsg_cmd                0      0    312   13    1 : tunables    0    0    0 : slabdata      0      0      0
> 
> > Could you also take a few snapshots of /proc/420/stack to see what
> > kswapd is doing?
> 
> Uhm, there is no such entry in /proc. I guess I need to enable some
> kernel option and recompile?

Yes, you need CONFIG_STACKTRACE. But you can get similar information
via sysrq+t.
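(With CONFIG_MAGIC_SYSRQ enabled, echo t > /proc/sysrq-trigger dumps
all tasks' stacks to the kernel log.)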

> This will reset uptime, and it will take another 5-10 days for the
> problem to build up again..

Then don't do that ;)

> > > P.S.: The message above was written a couple of days ago. Now I'm at 10
> > > days of uptime, and this is the result as of today:
> > >   PID USER      S %CPU %MEM    TIME+  SWAP COMMAND
> > >   420 root      R   93  0.0 110:48.48    0 kswapd0
> > > 30085 nowhere   D   42  0.2   0:04.36  10m dd
> > > 
> > > PPS: Please CC me.

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


* Re: Kswapd in 3.2.0-rc5 is a CPU hog
  2011-12-21 10:24     ` Michal Hocko
@ 2011-12-21 10:52       ` nowhere
  2011-12-21 14:06       ` Alex Elder
  1 sibling, 0 replies; 32+ messages in thread
From: nowhere @ 2011-12-21 10:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, Christoph Hellwig, Dave Chinner, xfs,
	Ben Myers, Alex Elder

On Wed, 21/12/2011 at 11:24 +0100, Michal Hocko wrote:
> [Let's cc also some fs and xfs people]
> 
> On Wed 21-12-11 14:15:21, nowhere wrote:
> > On Wed, 21/12/2011 at 10:52 +0100, Michal Hocko wrote:
> > > [Let's CC linux-mm]
> > > 
> > > On Wed 21-12-11 07:10:36, Nikolay S. wrote:
> > > > Hello,
> > > > 
> > > > I'm using 3.2-rc5 on a machine which at the moment does almost nothing
> > > > except file system operations and network I/O (i.e. a file server),
> > > > and there is a problem with kswapd.
> > > 
> > > What kind of filesystem do you use?
> > 
> > Well, that is XFS.
> > I have a large volume with ~200000 files and a periodic job which
> > checks all files' timestamps every 30 minutes and takes action if a
> > timestamp has changed.
> 
> Is this the first time you are seeing this? I am not familiar with xfs
> at all, but the number of files sounds like the dcache shrinker might
> be really busy...

This behavior - yes. In 3.0 kswapd sometimes hung in D-state, pushing
the load average above 1 - that's how I noticed it.

> > > > I'm playing with dd:
> > > > dd if=/some/big/file of=/dev/null bs=8M
> > > > 
> > > > I.e. I'm filling page cache.
> > > > 
> > > > So when the machine is just rebooted, kswapd during this operation is
> > > > almost idle, just 5-8 percent according to top.
> > > > 
> > > > After ~5 days of uptime (5 days, 2:10), the same operation demands ~70%
> > > > CPU for kswapd:
> > > > 
> > > >   PID USER      S %CPU %MEM    TIME+  SWAP COMMAND
> > > >   420 root      R   70  0.0  22:09.60    0 kswapd0
> > > > 17717 nowhere   D   27  0.2   0:01.81  10m dd
> > > > 
> > > > In fact, kswapd cpu usage on this operation steadily increases over
> > > > time.
> > > > 
> > > > Also read performance degrades over time. After reboot:
> > > > dd if=/some/big/file of=/dev/null bs=8M
> > > > 1019+1 records in
> > > > 1019+1 records out
> > > > 8553494018 bytes (8.6 GB) copied, 16.211 s, 528 MB/s
> > > > 
> > > > After ~5 days uptime:
> > > > dd if=/some/big/file of=/dev/null bs=8M
> > > > 1019+1 records in
> > > > 1019+1 records out
> > > > 8553494018 bytes (8.6 GB) copied, 29.0507 s, 294 MB/s
> > > > 
> > > > Whereas raw disk sequential read performance stays the same:
> > > > dd if=/some/big/file of=/dev/null bs=8M iflag=direct
> > > > 1019+1 records in
> > > > 1019+1 records out
> > > > 8553494018 bytes (8.6 GB) copied, 14.7286 s, 581 MB/s
> > > > 
> > > > Also, after dropping caches, the situation improves somewhat, but not
> > > > to the state of a freshly restarted system:
> > > >   PID USER      S %CPU %MEM    TIME+  SWAP COMMAND
> > > >   420 root      S   39  0.0  23:31.17    0 kswapd0
> > > > 19829 nowhere   D   24  0.2   0:02.72 7764 dd
> > > > 
> > > > perf shows:
> > > > 
> > > >     31.24%  kswapd0  [kernel.kallsyms]  [k] _raw_spin_lock
> > > >     26.19%  kswapd0  [kernel.kallsyms]  [k] shrink_slab
> > > >     16.28%  kswapd0  [kernel.kallsyms]  [k] prune_super
> > > >      6.55%  kswapd0  [kernel.kallsyms]  [k] grab_super_passive
> > > >      5.35%  kswapd0  [kernel.kallsyms]  [k] down_read_trylock
> > > >      4.03%  kswapd0  [kernel.kallsyms]  [k] up_read
> > > >      2.31%  kswapd0  [kernel.kallsyms]  [k] put_super
> > > >      1.81%  kswapd0  [kernel.kallsyms]  [k] drop_super
> > > >      0.99%  kswapd0  [kernel.kallsyms]  [k] __put_super
> > > >      0.25%  kswapd0  [kernel.kallsyms]  [k] __isolate_lru_page
> > > >      0.23%  kswapd0  [kernel.kallsyms]  [k] free_pcppages_bulk
> > > >      0.19%  kswapd0  [r8169]            [k] rtl8169_interrupt
> > > >      0.15%  kswapd0  [kernel.kallsyms]  [k] twa_interrupt
> > > 
> > > Quite a lot of time spent shrinking slab (dcache I guess) and a lot of
> > > spin lock contention.
> > 
> > This is slabinfo, sorted by num objects:
> > 
> > xfs_inode         192941 193205    960   17    4 : tunables    0    0    0 : slabdata  11365  11365      0
> > dentry            118818 118818    192   21    1 : tunables    0    0    0 : slabdata   5658   5658      0
> > kmalloc-256       107920 107920    256   16    1 : tunables    0    0    0 : slabdata   6745   6745      0
> > kmalloc-64         59912 102656     64   64    1 : tunables    0    0    0 : slabdata   1604   1604      0
> > radix_tree_node    30618  33474    568   14    2 : tunables    0    0    0 : slabdata   2391   2391      0
> > kmalloc-96         27092  41202     96   42    1 : tunables    0    0    0 : slabdata    981    981      0
> > buffer_head        24892  63843    104   39    1 : tunables    0    0    0 : slabdata   1637   1637      0
> > kmalloc-192        23332  34503    192   21    1 : tunables    0    0    0 : slabdata   1643   1643      0
> > sysfs_dir_cache    17444  17444    144   28    1 : tunables    0    0    0 : slabdata    623    623      0
> > arp_cache          12863  14796    320   12    1 : tunables    0    0    0 : slabdata   1233   1233      0
> > kmalloc-512        11051  11440    512   16    2 : tunables    0    0    0 : slabdata    715    715      0
> > kmalloc-128        10611  13152    128   32    1 : tunables    0    0    0 : slabdata    411    411      0
> > ext4_inode_cache    9660  18018    880   18    4 : tunables    0    0    0 : slabdata   1001   1001      0
> > kmalloc-8           8704   8704      8  512    1 : tunables    0    0    0 : slabdata     17     17      0
> > ext4_io_page        6912   6912     16  256    1 : tunables    0    0    0 : slabdata     27     27      0
> > anon_vma_chain      6701  10880     48   85    1 : tunables    0    0    0 : slabdata    128    128      0
> > Acpi-Namespace      6611   8058     40  102    1 : tunables    0    0    0 : slabdata     79     79      0
> > fsnotify_event_holder   6290   6970     24  170    1 : tunables    0    0    0 : slabdata     41     41      0
> > kmalloc-1024        5813   5888   1024   16    4 : tunables    0    0    0 : slabdata    368    368      0
> > vm_area_struct      5664   5664    168   24    1 : tunables    0    0    0 : slabdata    236    236      0
> > reiser_inode_cache   3992   5198    704   23    4 : tunables    0    0    0 : slabdata    226    226      0
> > Acpi-ParseExt       3808   3808     72   56    1 : tunables    0    0    0 : slabdata     68     68      0
> > kmalloc-2048        3587   3888   2048   16    8 : tunables    0    0    0 : slabdata    243    243      0
> > proc_inode_cache    3498   3510    624   13    2 : tunables    0    0    0 : slabdata    270    270      0
> > anon_vma            3380   3640     72   56    1 : tunables    0    0    0 : slabdata     65     65      0
> > kmalloc-16          3072   3072     16  256    1 : tunables    0    0    0 : slabdata     12     12      0
> > inode_cache         3024   3024    560   14    2 : tunables    0    0    0 : slabdata    216    216      0
> > ext4_allocation_context   3000   3000    136   30    1 : tunables    0    0    0 : slabdata    100    100      0
> > nf_conntrack_ffffffff81776d40   2910   3549    312   13    1 : tunables    0    0    0 : slabdata    273    273      0
> > kmalloc-4096        2792   3136   4096    8    8 : tunables    0    0    0 : slabdata    392    392      0
> > ext4_free_data      2701   2701     56   73    1 : tunables    0    0    0 : slabdata     37     37      0
> > pid_namespace       2130   2130   2112   15    8 : tunables    0    0    0 : slabdata    142    142      0
> > mqueue_inode_cache   2124   2124    896   18    4 : tunables    0    0    0 : slabdata    118    118      0
> > jbd2_revoke_record   1664   1664     32  128    1 : tunables    0    0    0 : slabdata     13     13      0
> > kmalloc-32          1434   3072     32  128    1 : tunables    0    0    0 : slabdata     24     24      0
> > shmem_inode_cache   1400   1521    624   13    2 : tunables    0    0    0 : slabdata    117    117      0
> > xfs_ili             1139   1260    216   18    1 : tunables    0    0    0 : slabdata     70     70      0
> > nfsd4_stateids      1131   1496    120   34    1 : tunables    0    0    0 : slabdata     44     44      0
> > idr_layer_cache      900    900    544   15    2 : tunables    0    0    0 : slabdata     60     60      0
> > jbd2_journal_head    742   1008    112   36    1 : tunables    0    0    0 : slabdata     28     28      0
> > fsnotify_event       648    648    112   36    1 : tunables    0    0    0 : slabdata     18     18      0
> > sock_inode_cache     577    624    640   12    2 : tunables    0    0    0 : slabdata     52     52      0
> > tw_sock_TCP          504    504    192   21    1 : tunables    0    0    0 : slabdata     24     24      0
> > TCP                  356    414   1728   18    8 : tunables    0    0    0 : slabdata     23     23      0
> > RAW                  342    342    832   19    4 : tunables    0    0    0 : slabdata     18     18      0
> > jbd2_journal_handle    340    340     24  170    1 : tunables    0    0    0 : slabdata      2      2      0
> > blkdev_requests      322    322    344   23    2 : tunables    0    0    0 : slabdata     14     14      0
> > task_struct          293    357   1504   21    8 : tunables    0    0    0 : slabdata     17     17      0
> > UDP                  285    285    832   19    4 : tunables    0    0    0 : slabdata     15     15      0
> > files_cache          276    276    704   23    4 : tunables    0    0    0 : slabdata     12     12      0
> > nfsd4_openowners     220    320    392   20    2 : tunables    0    0    0 : slabdata     16     16      0
> > mm_struct            216    216    896   18    4 : tunables    0    0    0 : slabdata     12     12      0
> > sighand_cache        199    225   2112   15    8 : tunables    0    0    0 : slabdata     15     15      0
> > nfsd4_delegations    198    198    368   22    2 : tunables    0    0    0 : slabdata      9      9      0
> > kmem_cache_node      192    192     64   64    1 : tunables    0    0    0 : slabdata      3      3      0
> > xfs_buf_item         162    162    224   18    1 : tunables    0    0    0 : slabdata      9      9      0
> > ip_fib_trie          146    146     56   73    1 : tunables    0    0    0 : slabdata      2      2      0
> > ext4_io_end          140    154   1128   14    4 : tunables    0    0    0 : slabdata     11     11      0
> > dnotify_mark         120    120    136   30    1 : tunables    0    0    0 : slabdata      4      4      0
> > TCPv6                104    119   1856   17    8 : tunables    0    0    0 : slabdata      7      7      0
> > cfq_queue            102    102    232   17    1 : tunables    0    0    0 : slabdata      6      6      0
> > Acpi-State           102    102     80   51    1 : tunables    0    0    0 : slabdata      2      2      0
> > sigqueue             100    100    160   25    1 : tunables    0    0    0 : slabdata      4      4      0
> > xfs_efd_item          80     80    400   20    2 : tunables    0    0    0 : slabdata      4      4      0
> > tw_sock_TCPv6         64    144    256   16    1 : tunables    0    0    0 : slabdata      9      9      0
> > bdev_cache            57     57    832   19    4 : tunables    0    0    0 : slabdata      3      3      0
> > blkdev_queue          54     54   1744   18    8 : tunables    0    0    0 : slabdata      3      3      0
> > net_namespace         52     52   2432   13    8 : tunables    0    0    0 : slabdata      4      4      0
> > kmalloc-8192          52     52   8192    4    8 : tunables    0    0    0 : slabdata     13     13      0
> > kmem_cache            42     42    192   21    1 : tunables    0    0    0 : slabdata      2      2      0
> > xfs_log_ticket        40     40    200   20    1 : tunables    0    0    0 : slabdata      2      2      0
> > xfs_btree_cur         38     38    208   19    1 : tunables    0    0    0 : slabdata      2      2      0
> > rpc_inode_cache       38     38    832   19    4 : tunables    0    0    0 : slabdata      2      2      0
> > nf_conntrack_expect     34     34    240   17    1 : tunables    0    0    0 : slabdata      2      2      0
> > xfs_da_state          32     32    488   16    2 : tunables    0    0    0 : slabdata      2      2      0
> > UDPv6                 32     32   1024   16    4 : tunables    0    0    0 : slabdata      2      2      0
> > xfs_trans             28     28    280   14    1 : tunables    0    0    0 : slabdata      2      2      0
> > taskstats             24     24    328   12    1 : tunables    0    0    0 : slabdata      2      2      0
> > dio                   24     24    640   12    2 : tunables    0    0    0 : slabdata      2      2      0
> > posix_timers_cache     23     23    176   23    1 : tunables    0    0    0 : slabdata      1      1      0
> > hugetlbfs_inode_cache     14     14    560   14    2 : tunables    0    0    0 : slabdata      1      1      0
> > xfrm_dst_cache         0      0    384   21    2 : tunables    0    0    0 : slabdata      0      0      0
> > user_namespace         0      0   1072   15    4 : tunables    0    0    0 : slabdata      0      0      0
> > UDPLITEv6              0      0   1024   16    4 : tunables    0    0    0 : slabdata      0      0      0
> > UDP-Lite               0      0    832   19    4 : tunables    0    0    0 : slabdata      0      0      0
> > kcopyd_job             0      0   3240   10    8 : tunables    0    0    0 : slabdata      0      0      0
> > flow_cache             0      0    104   39    1 : tunables    0    0    0 : slabdata      0      0      0
> > ext2_xattr             0      0     88   46    1 : tunables    0    0    0 : slabdata      0      0      0
> > ext2_inode_cache       0      0    752   21    4 : tunables    0    0    0 : slabdata      0      0      0
> > dquot                  0      0    256   16    1 : tunables    0    0    0 : slabdata      0      0      0
> > dm_uevent              0      0   2608   12    8 : tunables    0    0    0 : slabdata      0      0      0
> > dma-kmalloc-96         0      0     96   42    1 : tunables    0    0    0 : slabdata      0      0      0
> > dma-kmalloc-8192       0      0   8192    4    8 : tunables    0    0    0 : slabdata      0      0      0
> > dma-kmalloc-8          0      0      8  512    1 : tunables    0    0    0 : slabdata      0      0      0
> > dma-kmalloc-64         0      0     64   64    1 : tunables    0    0    0 : slabdata      0      0      0
> > dma-kmalloc-512        0      0    512   16    2 : tunables    0    0    0 : slabdata      0      0      0
> > dma-kmalloc-4096       0      0   4096    8    8 : tunables    0    0    0 : slabdata      0      0      0
> > dma-kmalloc-32         0      0     32  128    1 : tunables    0    0    0 : slabdata      0      0      0
> > dma-kmalloc-256        0      0    256   16    1 : tunables    0    0    0 : slabdata      0      0      0
> > dma-kmalloc-2048       0      0   2048   16    8 : tunables    0    0    0 : slabdata      0      0      0
> > dma-kmalloc-192        0      0    192   21    1 : tunables    0    0    0 : slabdata      0      0      0
> > dma-kmalloc-16         0      0     16  256    1 : tunables    0    0    0 : slabdata      0      0      0
> > dma-kmalloc-128        0      0    128   32    1 : tunables    0    0    0 : slabdata      0      0      0
> > dma-kmalloc-1024       0      0   1024   16    4 : tunables    0    0    0 : slabdata      0      0      0
> > bsg_cmd                0      0    312   13    1 : tunables    0    0    0 : slabdata      0      0      0
> > 
> > > Could you also take a few snapshots of /proc/420/stack to see what
> > > kswapd is doing?
> > 
> > Uhm, there is no such entry in /proc. I guess I need to enable some
> > kernel option and recompile?
> 
> Yes, you need CONFIG_STACKTRACE. But you can get similar information
> via sysrq+t.

Ok, here it is (4 outputs taken at random times during dd):

<6>kswapd0         R  running task        0   420      2 0x00000000
<4> 0000000000000001 0000000000000001 ffff88011b3bdc80 0000000000000000
<4> 0000000000000000 0000000000000001 ffff88011b3bde70 ffff8800d0ab1468
<4> ffffffffffffff96 ffffffff81059391 0000000000000010 0000000000000246
<4>Call Trace:
<4> [<ffffffff81059391>] ? down_read_trylock+0x11/0x20
<4> [<ffffffff810da7ad>] ? grab_super_passive+0x1d/0xb0
<4> [<ffffffff810d9ced>] ? drop_super+0xd/0x20
<4> [<ffffffff810da87c>] ? prune_super+0x3c/0x1c0
<4> [<ffffffff810a15c4>] ? shrink_slab+0x74/0x1f0
<4> [<ffffffff810a3d6a>] ? kswapd+0x5fa/0x890
<4> [<ffffffff81054c30>] ? wake_up_bit+0x40/0x40
<4> [<ffffffff810a3770>] ? shrink_zone+0x500/0x500
<4> [<ffffffff810a3770>] ? shrink_zone+0x500/0x500
<4> [<ffffffff81054796>] ? kthread+0x96/0xa0
<4> [<ffffffff813e2cf4>] ? kernel_thread_helper+0x4/0x10
<4> [<ffffffff81054700>] ? kthread_worker_fn+0x120/0x120
<4> [<ffffffff813e2cf0>] ? gs_change+0xb/0xb

<6>kswapd0         R  running task        0   420      2 0x00000000
<4> ffff88011aeff630 000000000000003f ffffffff811c9601 ffff880118bb7cc0
<4> 0000000000000040 0000000000001443 ffff880117cc4000 0000000000000001
<4> 0000000000000000 ffff88011b3bde70 0000000000000000 ffffffffa00f85bf
<4>Call Trace:
<4> [<ffffffff811c9601>] ? radix_tree_gang_lookup_tag+0x81/0xf0
<4> [<ffffffffa00f85bf>] ? xfs_perag_get_tag+0x1f/0x40 [xfs]
<4> [<ffffffff810da7ad>] ? grab_super_passive+0x1d/0xb0
<4> [<ffffffff810d9c48>] ? put_super+0x18/0x20
<4> [<ffffffff810da87c>] ? prune_super+0x3c/0x1c0
<4> [<ffffffff810a15e0>] ? shrink_slab+0x90/0x1f0
<4> [<ffffffff810a3d6a>] ? kswapd+0x5fa/0x890
<4> [<ffffffff81054c30>] ? wake_up_bit+0x40/0x40
<4> [<ffffffff810a3770>] ? shrink_zone+0x500/0x500
<4> [<ffffffff810a3770>] ? shrink_zone+0x500/0x500
<4> [<ffffffff81054796>] ? kthread+0x96/0xa0
<4> [<ffffffff813e2cf4>] ? kernel_thread_helper+0x4/0x10
<4> [<ffffffff81054700>] ? kthread_worker_fn+0x120/0x120
<4> [<ffffffff813e2cf0>] ? gs_change+0xb/0xb

<6>kswapd0         R  running task        0   420      2 0x00000000
<4> ffff88011aeff630 000000000000003f ffffffff811c9601 ffff88011b0646c0
<4> 0000000000000040 0000000000000d3e ffff880117cc4000 0000000000000005
<4> 0000000000000000 ffff88011b3bde70 0000000000000000 ffffffffa00f85bf
<4>Call Trace:
<4> [<ffffffff811c9601>] ? radix_tree_gang_lookup_tag+0x81/0xf0
<4> [<ffffffffa00f85bf>] ? xfs_perag_get_tag+0x1f/0x40 [xfs]
<4> [<ffffffff810da7d5>] ? grab_super_passive+0x45/0xb0
<4> [<ffffffff810da87c>] ? prune_super+0x3c/0x1c0
<4> [<ffffffff810a15e0>] ? shrink_slab+0x90/0x1f0
<4> [<ffffffff810a3d6a>] ? kswapd+0x5fa/0x890
<4> [<ffffffff81054c30>] ? wake_up_bit+0x40/0x40
<4> [<ffffffff810a3770>] ? shrink_zone+0x500/0x500
<4> [<ffffffff810a3770>] ? shrink_zone+0x500/0x500
<4> [<ffffffff81054796>] ? kthread+0x96/0xa0
<4> [<ffffffff813e2cf4>] ? kernel_thread_helper+0x4/0x10
<4> [<ffffffff81054700>] ? kthread_worker_fn+0x120/0x120
<4> [<ffffffff813e2cf0>] ? gs_change+0xb/0xb

<6>kswapd0         R  running task        0   420      2 0x00000000
<4> 000000000003337a 000000000001747d 0000000000000000 0000000000000004
<4> ffffffff816492c0 0000000000000000 ffff880117cc4000 0000000000000000
<4> 0000000000000000 ffff88011b3bde70 0000000000000000 ffffffffa00f85bf
<4>Call Trace:
<4> [<ffffffffa00f85bf>] ? xfs_perag_get_tag+0x1f/0x40 [xfs]
<4> [<ffffffff810da7d5>] ? grab_super_passive+0x45/0xb0
<4> [<ffffffff810da87c>] ? prune_super+0x3c/0x1c0
<4> [<ffffffff810a15e0>] ? shrink_slab+0x90/0x1f0
<4> [<ffffffff810a3d6a>] ? kswapd+0x5fa/0x890
<4> [<ffffffff81054c30>] ? wake_up_bit+0x40/0x40
<4> [<ffffffff810a3770>] ? shrink_zone+0x500/0x500
<4> [<ffffffff810a3770>] ? shrink_zone+0x500/0x500
<4> [<ffffffff81054796>] ? kthread+0x96/0xa0
<4> [<ffffffff813e2cf4>] ? kernel_thread_helper+0x4/0x10
<4> [<ffffffff81054700>] ? kthread_worker_fn+0x120/0x120
<4> [<ffffffff813e2cf0>] ? gs_change+0xb/0xb


> > This will reset uptime, and the problem will take another 5-10 days to reappear..
> 
> Then don't do that ;)
> 
> > > > P.S.: The message above was written couple of days ago. Now I'm at 10
> > > > days uptime, and this is the result as of today
> > > >   PID USER      S %CPU %MEM    TIME+  SWAP COMMAND
> > > >   420 root      R   93  0.0 110:48.48    0 kswapd0
> > > > 30085 nowhere   D   42  0.2   0:04.36  10m dd
> > > > 
> > > > PPS: Please CC me.
> 



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Kswapd in 3.2.0-rc5 is a CPU hog
  2011-12-21 10:24     ` Michal Hocko
  2011-12-21 10:52       ` nowhere
@ 2011-12-21 14:06       ` Alex Elder
  2011-12-21 14:19         ` nowhere
  1 sibling, 1 reply; 32+ messages in thread
From: Alex Elder @ 2011-12-21 14:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: nowhere, linux-kernel, linux-mm, Christoph Hellwig, Dave Chinner,
	xfs, Ben Myers, Alex Elder

On Wed, 2011-12-21 at 11:24 +0100, Michal Hocko wrote:
> [Let's cc also some fs and xfs people]
> 
> On Wed 21-12-11 14:15:21, nowhere wrote:
> > On Wed, 21/12/2011 at 10:52 +0100, Michal Hocko wrote:
> > > [Let's CC linux-mm]
> > > 
> > > On Wed 21-12-11 07:10:36, Nikolay S. wrote:
> > > > Hello,
> > > > 
> > > > I'm using 3.2-rc5 on a machine, which atm does almost nothing except
> > > > file system operations and network i/o (i.e. file server). And there is
> > > > a problem with kswapd.
> > > 
> > > What kind of filesystem do you use?
> > 
> > Well, that is XFS.
> > I have a large volume with ~200000 files, and a periodic job which
> > checks all files' timestamps once per 30 minutes and takes action if
> > a timestamp has changed.
> 
> Is it the first time you are seeing this? I am not familiar with xfs at
> all, but the number of files sounds like the dcache shrinker might be
> really busy...
> 
> > > > I'm playing with dd:
> > > > dd if=/some/big/file of=/dev/null bs=8M


Just to be clear, is it always the same file you are
operating on in all of the cases you show here?  If
so, is there anything else happening to that file in
the meantime?  Is it being written to at all?

Is the filesystem in question (the one with the
big file) also being NFS served?  Are there any
other filesystems being served, and if so, are
they also XFS?

I presume you are doing nothing special with respect
to the file serving, i.e. in all these cases, timings,
etc., the system has been acting, and continues to
act, as an NFS server.

Is the filesystem anywhere near full?

I don't know if these answers change much but it
might help to complete the picture.

Thanks.

					-Alex
> > > > I.e. I'm filling page cache.
> > > > 
> > > > So when the machine is just rebooted, kswapd during this operation is
> > > > almost idle, just 5-8 percent according to top.
> > > > 
> > > > After ~5 days of uptime (5 days,  2:10), the same operation demands ~70%
> > > > for kswapd:
> > > > 
> > > >   PID USER      S %CPU %MEM    TIME+  SWAP COMMAND
> > > >   420 root      R   70  0.0  22:09.60    0 kswapd0
> > > > 17717 nowhere   D   27  0.2   0:01.81  10m dd
> > > > 
> > > > In fact, kswapd cpu usage on this operation steadily increases over
> > > > time.
> > > > 
> > > > Also read performance degrades over time. After reboot:
> > > > dd if=/some/big/file of=/dev/null bs=8M
> > > > 1019+1 records in
> > > > 1019+1 records out
> > > > 8553494018 bytes (8.6 GB) copied, 16.211 s, 528 MB/s
> > > > 
> > > > After ~5 days uptime:
> > > > dd if=/some/big/file of=/dev/null bs=8M
> > > > 1019+1 records in
> > > > 1019+1 records out
> > > > 8553494018 bytes (8.6 GB) copied, 29.0507 s, 294 MB/s
> > > > 
> > > > Whereas raw disk sequential read performance stays the same:
> > > > dd if=/some/big/file of=/dev/null bs=8M iflag=direct
> > > > 1019+1 records in
> > > > 1019+1 records out
> > > > 8553494018 bytes (8.6 GB) copied, 14.7286 s, 581 MB/s
> > > > 
> > > > Also after dropping caches, situation somehow improves, but not to the
> > > > state of freshly restarted system:
> > > >   PID USER      S %CPU %MEM    TIME+  SWAP COMMAND
> > > >   420 root      S   39  0.0  23:31.17    0 kswapd0
> > > > 19829 nowhere   D   24  0.2   0:02.72 7764 dd
> > > > 
> > > > perf shows:
> > > > 
> > > >     31.24%  kswapd0  [kernel.kallsyms]  [k] _raw_spin_lock
> > > >     26.19%  kswapd0  [kernel.kallsyms]  [k] shrink_slab
> > > >     16.28%  kswapd0  [kernel.kallsyms]  [k] prune_super
> > > >      6.55%  kswapd0  [kernel.kallsyms]  [k] grab_super_passive
> > > >      5.35%  kswapd0  [kernel.kallsyms]  [k] down_read_trylock
> > > >      4.03%  kswapd0  [kernel.kallsyms]  [k] up_read
> > > >      2.31%  kswapd0  [kernel.kallsyms]  [k] put_super
> > > >      1.81%  kswapd0  [kernel.kallsyms]  [k] drop_super
> > > >      0.99%  kswapd0  [kernel.kallsyms]  [k] __put_super
> > > >      0.25%  kswapd0  [kernel.kallsyms]  [k] __isolate_lru_page
> > > >      0.23%  kswapd0  [kernel.kallsyms]  [k] free_pcppages_bulk
> > > >      0.19%  kswapd0  [r8169]            [k] rtl8169_interrupt
> > > >      0.15%  kswapd0  [kernel.kallsyms]  [k] twa_interrupt
> > > 
> > > Quite a lot of time spent shrinking slab (dcache I guess) and a lot of
> > > spin lock contention.
> > 
> > This is slabinfo, sorted by num objects:
> > 
> > xfs_inode         192941 193205    960   17    4 : tunables    0    0    0 : slabdata  11365  11365      0
> > dentry            118818 118818    192   21    1 : tunables    0    0    0 : slabdata   5658   5658      0
> > kmalloc-256       107920 107920    256   16    1 : tunables    0    0    0 : slabdata   6745   6745      0
> > kmalloc-64         59912 102656     64   64    1 : tunables    0    0    0 : slabdata   1604   1604      0
> > radix_tree_node    30618  33474    568   14    2 : tunables    0    0    0 : slabdata   2391   2391      0
> > kmalloc-96         27092  41202     96   42    1 : tunables    0    0    0 : slabdata    981    981      0
> > buffer_head        24892  63843    104   39    1 : tunables    0    0    0 : slabdata   1637   1637      0
> > kmalloc-192        23332  34503    192   21    1 : tunables    0    0    0 : slabdata   1643   1643      0
> > sysfs_dir_cache    17444  17444    144   28    1 : tunables    0    0    0 : slabdata    623    623      0
> > arp_cache          12863  14796    320   12    1 : tunables    0    0    0 : slabdata   1233   1233      0
> > kmalloc-512        11051  11440    512   16    2 : tunables    0    0    0 : slabdata    715    715      0
> > kmalloc-128        10611  13152    128   32    1 : tunables    0    0    0 : slabdata    411    411      0
> > ext4_inode_cache    9660  18018    880   18    4 : tunables    0    0    0 : slabdata   1001   1001      0
> > kmalloc-8           8704   8704      8  512    1 : tunables    0    0    0 : slabdata     17     17      0
> > ext4_io_page        6912   6912     16  256    1 : tunables    0    0    0 : slabdata     27     27      0
> > anon_vma_chain      6701  10880     48   85    1 : tunables    0    0    0 : slabdata    128    128      0
> > Acpi-Namespace      6611   8058     40  102    1 : tunables    0    0    0 : slabdata     79     79      0
> > fsnotify_event_holder   6290   6970     24  170    1 : tunables    0    0    0 : slabdata     41     41      0
> > kmalloc-1024        5813   5888   1024   16    4 : tunables    0    0    0 : slabdata    368    368      0
> > vm_area_struct      5664   5664    168   24    1 : tunables    0    0    0 : slabdata    236    236      0
> > reiser_inode_cache   3992   5198    704   23    4 : tunables    0    0    0 : slabdata    226    226      0
> > Acpi-ParseExt       3808   3808     72   56    1 : tunables    0    0    0 : slabdata     68     68      0
> > kmalloc-2048        3587   3888   2048   16    8 : tunables    0    0    0 : slabdata    243    243      0
> > proc_inode_cache    3498   3510    624   13    2 : tunables    0    0    0 : slabdata    270    270      0
> > anon_vma            3380   3640     72   56    1 : tunables    0    0    0 : slabdata     65     65      0
> > kmalloc-16          3072   3072     16  256    1 : tunables    0    0    0 : slabdata     12     12      0
> > inode_cache         3024   3024    560   14    2 : tunables    0    0    0 : slabdata    216    216      0
> > ext4_allocation_context   3000   3000    136   30    1 : tunables    0    0    0 : slabdata    100    100      0
> > nf_conntrack_ffffffff81776d40   2910   3549    312   13    1 : tunables    0    0    0 : slabdata    273    273      0
> > kmalloc-4096        2792   3136   4096    8    8 : tunables    0    0    0 : slabdata    392    392      0
> > ext4_free_data      2701   2701     56   73    1 : tunables    0    0    0 : slabdata     37     37      0
> > pid_namespace       2130   2130   2112   15    8 : tunables    0    0    0 : slabdata    142    142      0
> > mqueue_inode_cache   2124   2124    896   18    4 : tunables    0    0    0 : slabdata    118    118      0
> > jbd2_revoke_record   1664   1664     32  128    1 : tunables    0    0    0 : slabdata     13     13      0
> > kmalloc-32          1434   3072     32  128    1 : tunables    0    0    0 : slabdata     24     24      0
> > shmem_inode_cache   1400   1521    624   13    2 : tunables    0    0    0 : slabdata    117    117      0
> > xfs_ili             1139   1260    216   18    1 : tunables    0    0    0 : slabdata     70     70      0
> > nfsd4_stateids      1131   1496    120   34    1 : tunables    0    0    0 : slabdata     44     44      0
> > idr_layer_cache      900    900    544   15    2 : tunables    0    0    0 : slabdata     60     60      0
> > jbd2_journal_head    742   1008    112   36    1 : tunables    0    0    0 : slabdata     28     28      0
> > fsnotify_event       648    648    112   36    1 : tunables    0    0    0 : slabdata     18     18      0
> > sock_inode_cache     577    624    640   12    2 : tunables    0    0    0 : slabdata     52     52      0
> > tw_sock_TCP          504    504    192   21    1 : tunables    0    0    0 : slabdata     24     24      0
> > TCP                  356    414   1728   18    8 : tunables    0    0    0 : slabdata     23     23      0
> > RAW                  342    342    832   19    4 : tunables    0    0    0 : slabdata     18     18      0
> > jbd2_journal_handle    340    340     24  170    1 : tunables    0    0    0 : slabdata      2      2      0
> > blkdev_requests      322    322    344   23    2 : tunables    0    0    0 : slabdata     14     14      0
> > task_struct          293    357   1504   21    8 : tunables    0    0    0 : slabdata     17     17      0
> > UDP                  285    285    832   19    4 : tunables    0    0    0 : slabdata     15     15      0
> > files_cache          276    276    704   23    4 : tunables    0    0    0 : slabdata     12     12      0
> > nfsd4_openowners     220    320    392   20    2 : tunables    0    0    0 : slabdata     16     16      0
> > mm_struct            216    216    896   18    4 : tunables    0    0    0 : slabdata     12     12      0
> > sighand_cache        199    225   2112   15    8 : tunables    0    0    0 : slabdata     15     15      0
> > nfsd4_delegations    198    198    368   22    2 : tunables    0    0    0 : slabdata      9      9      0
> > kmem_cache_node      192    192     64   64    1 : tunables    0    0    0 : slabdata      3      3      0
> > xfs_buf_item         162    162    224   18    1 : tunables    0    0    0 : slabdata      9      9      0
> > ip_fib_trie          146    146     56   73    1 : tunables    0    0    0 : slabdata      2      2      0
> > ext4_io_end          140    154   1128   14    4 : tunables    0    0    0 : slabdata     11     11      0
> > dnotify_mark         120    120    136   30    1 : tunables    0    0    0 : slabdata      4      4      0
> > TCPv6                104    119   1856   17    8 : tunables    0    0    0 : slabdata      7      7      0
> > cfq_queue            102    102    232   17    1 : tunables    0    0    0 : slabdata      6      6      0
> > Acpi-State           102    102     80   51    1 : tunables    0    0    0 : slabdata      2      2      0
> > sigqueue             100    100    160   25    1 : tunables    0    0    0 : slabdata      4      4      0
> > xfs_efd_item          80     80    400   20    2 : tunables    0    0    0 : slabdata      4      4      0
> > tw_sock_TCPv6         64    144    256   16    1 : tunables    0    0    0 : slabdata      9      9      0
> > bdev_cache            57     57    832   19    4 : tunables    0    0    0 : slabdata      3      3      0
> > blkdev_queue          54     54   1744   18    8 : tunables    0    0    0 : slabdata      3      3      0
> > net_namespace         52     52   2432   13    8 : tunables    0    0    0 : slabdata      4      4      0
> > kmalloc-8192          52     52   8192    4    8 : tunables    0    0    0 : slabdata     13     13      0
> > kmem_cache            42     42    192   21    1 : tunables    0    0    0 : slabdata      2      2      0
> > xfs_log_ticket        40     40    200   20    1 : tunables    0    0    0 : slabdata      2      2      0
> > xfs_btree_cur         38     38    208   19    1 : tunables    0    0    0 : slabdata      2      2      0
> > rpc_inode_cache       38     38    832   19    4 : tunables    0    0    0 : slabdata      2      2      0
> > nf_conntrack_expect     34     34    240   17    1 : tunables    0    0    0 : slabdata      2      2      0
> > xfs_da_state          32     32    488   16    2 : tunables    0    0    0 : slabdata      2      2      0
> > UDPv6                 32     32   1024   16    4 : tunables    0    0    0 : slabdata      2      2      0
> > xfs_trans             28     28    280   14    1 : tunables    0    0    0 : slabdata      2      2      0
> > taskstats             24     24    328   12    1 : tunables    0    0    0 : slabdata      2      2      0
> > dio                   24     24    640   12    2 : tunables    0    0    0 : slabdata      2      2      0
> > posix_timers_cache     23     23    176   23    1 : tunables    0    0    0 : slabdata      1      1      0
> > hugetlbfs_inode_cache     14     14    560   14    2 : tunables    0    0    0 : slabdata      1      1      0
> > xfrm_dst_cache         0      0    384   21    2 : tunables    0    0    0 : slabdata      0      0      0
> > user_namespace         0      0   1072   15    4 : tunables    0    0    0 : slabdata      0      0      0
> > UDPLITEv6              0      0   1024   16    4 : tunables    0    0    0 : slabdata      0      0      0
> > UDP-Lite               0      0    832   19    4 : tunables    0    0    0 : slabdata      0      0      0
> > kcopyd_job             0      0   3240   10    8 : tunables    0    0    0 : slabdata      0      0      0
> > flow_cache             0      0    104   39    1 : tunables    0    0    0 : slabdata      0      0      0
> > ext2_xattr             0      0     88   46    1 : tunables    0    0    0 : slabdata      0      0      0
> > ext2_inode_cache       0      0    752   21    4 : tunables    0    0    0 : slabdata      0      0      0
> > dquot                  0      0    256   16    1 : tunables    0    0    0 : slabdata      0      0      0
> > dm_uevent              0      0   2608   12    8 : tunables    0    0    0 : slabdata      0      0      0
> > dma-kmalloc-96         0      0     96   42    1 : tunables    0    0    0 : slabdata      0      0      0
> > dma-kmalloc-8192       0      0   8192    4    8 : tunables    0    0    0 : slabdata      0      0      0
> > dma-kmalloc-8          0      0      8  512    1 : tunables    0    0    0 : slabdata      0      0      0
> > dma-kmalloc-64         0      0     64   64    1 : tunables    0    0    0 : slabdata      0      0      0
> > dma-kmalloc-512        0      0    512   16    2 : tunables    0    0    0 : slabdata      0      0      0
> > dma-kmalloc-4096       0      0   4096    8    8 : tunables    0    0    0 : slabdata      0      0      0
> > dma-kmalloc-32         0      0     32  128    1 : tunables    0    0    0 : slabdata      0      0      0
> > dma-kmalloc-256        0      0    256   16    1 : tunables    0    0    0 : slabdata      0      0      0
> > dma-kmalloc-2048       0      0   2048   16    8 : tunables    0    0    0 : slabdata      0      0      0
> > dma-kmalloc-192        0      0    192   21    1 : tunables    0    0    0 : slabdata      0      0      0
> > dma-kmalloc-16         0      0     16  256    1 : tunables    0    0    0 : slabdata      0      0      0
> > dma-kmalloc-128        0      0    128   32    1 : tunables    0    0    0 : slabdata      0      0      0
> > dma-kmalloc-1024       0      0   1024   16    4 : tunables    0    0    0 : slabdata      0      0      0
> > bsg_cmd                0      0    312   13    1 : tunables    0    0    0 : slabdata      0      0      0
> > 
> > > Could you also take few snapshots of /proc/420/stack to see what kswapd
> > > is doing.
> > 
> > Uhm, there is no such entry in proc. Guess I need to enable some kernel
> > option and recompile?
> 
> Yes, you need CONFIG_STACKTRACE. But you can get similar information
> by sysrq+t
> 
> > This will reset uptime, and the problem will take another 5-10 days to reappear..
> 
> Then don't do that ;)
> 
> > > > P.S.: The message above was written couple of days ago. Now I'm at 10
> > > > days uptime, and this is the result as of today
> > > >   PID USER      S %CPU %MEM    TIME+  SWAP COMMAND
> > > >   420 root      R   93  0.0 110:48.48    0 kswapd0
> > > > 30085 nowhere   D   42  0.2   0:04.36  10m dd
> > > > 
> > > > PPS: Please CC me.
> 




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Kswapd in 3.2.0-rc5 is a CPU hog
  2011-12-21 14:06       ` Alex Elder
@ 2011-12-21 14:19         ` nowhere
  0 siblings, 0 replies; 32+ messages in thread
From: nowhere @ 2011-12-21 14:19 UTC (permalink / raw)
  To: elder
  Cc: Michal Hocko, linux-kernel, linux-mm, Christoph Hellwig,
	Dave Chinner, xfs, Ben Myers, Alex Elder

On Wed, 21/12/2011 at 08:06 -0600, Alex Elder wrote:
> On Wed, 2011-12-21 at 11:24 +0100, Michal Hocko wrote:
> > [Let's cc also some fs and xfs people]
> > 
> > On Wed 21-12-11 14:15:21, nowhere wrote:
> > > On Wed, 21/12/2011 at 10:52 +0100, Michal Hocko wrote:
> > > > [Let's CC linux-mm]
> > > > 
> > > > On Wed 21-12-11 07:10:36, Nikolay S. wrote:
> > > > > Hello,
> > > > > 
> > > > > I'm using 3.2-rc5 on a machine, which atm does almost nothing except
> > > > > file system operations and network i/o (i.e. file server). And there is
> > > > > a problem with kswapd.
> > > > 
> > > > What kind of filesystem do you use?
> > > 
> > > Well, that is XFS.
> > > I have a large volume with ~200000 files, and a periodic job which
> > > checks all files' timestamps once per 30 minutes and takes action if
> > > a timestamp has changed.
> > 
> > Is it the first time you are seeing this? I am not familiar with xfs at
> > all, but the number of files sounds like the dcache shrinker might be
> > really busy...
> > 
> > > > > I'm playing with dd:
> > > > > dd if=/some/big/file of=/dev/null bs=8M
> 
> 
> Just to be clear, is it always the same file you are
> operating on in all of the cases you show here?  If
> so, is there anything else happening to that file in
> the meantime?  Is it being written to at all?

No, this file is not modified. And yes, this is always the same file.

> Is the filesystem in question (the one with the
> big file) also being NFS served?  Are there any
> other filesystems being served, and if so, are
> they also XFS?

Yes, the FS is exported. But nothing happens over NFS at the time of
dd'ing. There is also an ext4 fs which is exported, but it is also idle.

> I presume you are doing nothing special with respect
> to the file serving, i.e. in all these cases, timings,
> etc., the system has been acting, and continues to
> act, as an NFS server.

Yes, that is correct. NFS + a little FTP + a little Samba, all in
read-only fashion.

> Is the filesystem anywhere near full?

The FS in question is nearly full: there are ~10G free out of 6.4TB.

> 
> I don't know if these answers change much but it
> might help to complete the picture.
> 
> Thanks.
> 
> 					-Alex
> > > > > I.e. I'm filling page cache.
> > > > > 
> > > > > So when the machine is just rebooted, kswapd during this operation is
> > > > > almost idle, just 5-8 percent according to top.
> > > > > 
> > > > > After ~5 days of uptime (5 days,  2:10), the same operation demands ~70%
> > > > > for kswapd:
> > > > > 
> > > > >   PID USER      S %CPU %MEM    TIME+  SWAP COMMAND
> > > > >   420 root      R   70  0.0  22:09.60    0 kswapd0
> > > > > 17717 nowhere   D   27  0.2   0:01.81  10m dd
> > > > > 
> > > > > In fact, kswapd cpu usage on this operation steadily increases over
> > > > > time.
> > > > > 
> > > > > Also read performance degrades over time. After reboot:
> > > > > dd if=/some/big/file of=/dev/null bs=8M
> > > > > 1019+1 records in
> > > > > 1019+1 records out
> > > > > 8553494018 bytes (8.6 GB) copied, 16.211 s, 528 MB/s
> > > > > 
> > > > > After ~5 days uptime:
> > > > > dd if=/some/big/file of=/dev/null bs=8M
> > > > > 1019+1 records in
> > > > > 1019+1 records out
> > > > > 8553494018 bytes (8.6 GB) copied, 29.0507 s, 294 MB/s
> > > > > 
> > > > > Whereas raw disk sequential read performance stays the same:
> > > > > dd if=/some/big/file of=/dev/null bs=8M iflag=direct
> > > > > 1019+1 records in
> > > > > 1019+1 records out
> > > > > 8553494018 bytes (8.6 GB) copied, 14.7286 s, 581 MB/s
> > > > > 
> > > > > Also after dropping caches, situation somehow improves, but not to the
> > > > > state of freshly restarted system:
> > > > >   PID USER      S %CPU %MEM    TIME+  SWAP COMMAND
> > > > >   420 root      S   39  0.0  23:31.17    0 kswapd0
> > > > > 19829 nowhere   D   24  0.2   0:02.72 7764 dd
> > > > > 
> > > > > perf shows:
> > > > > 
> > > > >     31.24%  kswapd0  [kernel.kallsyms]  [k] _raw_spin_lock
> > > > >     26.19%  kswapd0  [kernel.kallsyms]  [k] shrink_slab
> > > > >     16.28%  kswapd0  [kernel.kallsyms]  [k] prune_super
> > > > >      6.55%  kswapd0  [kernel.kallsyms]  [k] grab_super_passive
> > > > >      5.35%  kswapd0  [kernel.kallsyms]  [k] down_read_trylock
> > > > >      4.03%  kswapd0  [kernel.kallsyms]  [k] up_read
> > > > >      2.31%  kswapd0  [kernel.kallsyms]  [k] put_super
> > > > >      1.81%  kswapd0  [kernel.kallsyms]  [k] drop_super
> > > > >      0.99%  kswapd0  [kernel.kallsyms]  [k] __put_super
> > > > >      0.25%  kswapd0  [kernel.kallsyms]  [k] __isolate_lru_page
> > > > >      0.23%  kswapd0  [kernel.kallsyms]  [k] free_pcppages_bulk
> > > > >      0.19%  kswapd0  [r8169]            [k] rtl8169_interrupt
> > > > >      0.15%  kswapd0  [kernel.kallsyms]  [k] twa_interrupt
> > > > 
> > > > Quite a lot of time spent shrinking slab (dcache I guess) and a lot of
> > > > spin lock contention.
> > > 
> > > This is slabinfo, sorted by num objects:
> > > 
> > > xfs_inode         192941 193205    960   17    4 : tunables    0    0    0 : slabdata  11365  11365      0
> > > dentry            118818 118818    192   21    1 : tunables    0    0    0 : slabdata   5658   5658      0
> > > kmalloc-256       107920 107920    256   16    1 : tunables    0    0    0 : slabdata   6745   6745      0
> > > kmalloc-64         59912 102656     64   64    1 : tunables    0    0    0 : slabdata   1604   1604      0
> > > radix_tree_node    30618  33474    568   14    2 : tunables    0    0    0 : slabdata   2391   2391      0
> > > kmalloc-96         27092  41202     96   42    1 : tunables    0    0    0 : slabdata    981    981      0
> > > buffer_head        24892  63843    104   39    1 : tunables    0    0    0 : slabdata   1637   1637      0
> > > kmalloc-192        23332  34503    192   21    1 : tunables    0    0    0 : slabdata   1643   1643      0
> > > sysfs_dir_cache    17444  17444    144   28    1 : tunables    0    0    0 : slabdata    623    623      0
> > > arp_cache          12863  14796    320   12    1 : tunables    0    0    0 : slabdata   1233   1233      0
> > > kmalloc-512        11051  11440    512   16    2 : tunables    0    0    0 : slabdata    715    715      0
> > > kmalloc-128        10611  13152    128   32    1 : tunables    0    0    0 : slabdata    411    411      0
> > > ext4_inode_cache    9660  18018    880   18    4 : tunables    0    0    0 : slabdata   1001   1001      0
> > > kmalloc-8           8704   8704      8  512    1 : tunables    0    0    0 : slabdata     17     17      0
> > > ext4_io_page        6912   6912     16  256    1 : tunables    0    0    0 : slabdata     27     27      0
> > > anon_vma_chain      6701  10880     48   85    1 : tunables    0    0    0 : slabdata    128    128      0
> > > Acpi-Namespace      6611   8058     40  102    1 : tunables    0    0    0 : slabdata     79     79      0
> > > fsnotify_event_holder   6290   6970     24  170    1 : tunables    0    0    0 : slabdata     41     41      0
> > > kmalloc-1024        5813   5888   1024   16    4 : tunables    0    0    0 : slabdata    368    368      0
> > > vm_area_struct      5664   5664    168   24    1 : tunables    0    0    0 : slabdata    236    236      0
> > > reiser_inode_cache   3992   5198    704   23    4 : tunables    0    0    0 : slabdata    226    226      0
> > > Acpi-ParseExt       3808   3808     72   56    1 : tunables    0    0    0 : slabdata     68     68      0
> > > kmalloc-2048        3587   3888   2048   16    8 : tunables    0    0    0 : slabdata    243    243      0
> > > proc_inode_cache    3498   3510    624   13    2 : tunables    0    0    0 : slabdata    270    270      0
> > > anon_vma            3380   3640     72   56    1 : tunables    0    0    0 : slabdata     65     65      0
> > > kmalloc-16          3072   3072     16  256    1 : tunables    0    0    0 : slabdata     12     12      0
> > > inode_cache         3024   3024    560   14    2 : tunables    0    0    0 : slabdata    216    216      0
> > > ext4_allocation_context   3000   3000    136   30    1 : tunables    0    0    0 : slabdata    100    100      0
> > > nf_conntrack_ffffffff81776d40   2910   3549    312   13    1 : tunables    0    0    0 : slabdata    273    273      0
> > > kmalloc-4096        2792   3136   4096    8    8 : tunables    0    0    0 : slabdata    392    392      0
> > > ext4_free_data      2701   2701     56   73    1 : tunables    0    0    0 : slabdata     37     37      0
> > > pid_namespace       2130   2130   2112   15    8 : tunables    0    0    0 : slabdata    142    142      0
> > > mqueue_inode_cache   2124   2124    896   18    4 : tunables    0    0    0 : slabdata    118    118      0
> > > jbd2_revoke_record   1664   1664     32  128    1 : tunables    0    0    0 : slabdata     13     13      0
> > > kmalloc-32          1434   3072     32  128    1 : tunables    0    0    0 : slabdata     24     24      0
> > > shmem_inode_cache   1400   1521    624   13    2 : tunables    0    0    0 : slabdata    117    117      0
> > > xfs_ili             1139   1260    216   18    1 : tunables    0    0    0 : slabdata     70     70      0
> > > nfsd4_stateids      1131   1496    120   34    1 : tunables    0    0    0 : slabdata     44     44      0
> > > idr_layer_cache      900    900    544   15    2 : tunables    0    0    0 : slabdata     60     60      0
> > > jbd2_journal_head    742   1008    112   36    1 : tunables    0    0    0 : slabdata     28     28      0
> > > fsnotify_event       648    648    112   36    1 : tunables    0    0    0 : slabdata     18     18      0
> > > sock_inode_cache     577    624    640   12    2 : tunables    0    0    0 : slabdata     52     52      0
> > > tw_sock_TCP          504    504    192   21    1 : tunables    0    0    0 : slabdata     24     24      0
> > > TCP                  356    414   1728   18    8 : tunables    0    0    0 : slabdata     23     23      0
> > > RAW                  342    342    832   19    4 : tunables    0    0    0 : slabdata     18     18      0
> > > jbd2_journal_handle    340    340     24  170    1 : tunables    0    0    0 : slabdata      2      2      0
> > > blkdev_requests      322    322    344   23    2 : tunables    0    0    0 : slabdata     14     14      0
> > > task_struct          293    357   1504   21    8 : tunables    0    0    0 : slabdata     17     17      0
> > > UDP                  285    285    832   19    4 : tunables    0    0    0 : slabdata     15     15      0
> > > files_cache          276    276    704   23    4 : tunables    0    0    0 : slabdata     12     12      0
> > > nfsd4_openowners     220    320    392   20    2 : tunables    0    0    0 : slabdata     16     16      0
> > > mm_struct            216    216    896   18    4 : tunables    0    0    0 : slabdata     12     12      0
> > > sighand_cache        199    225   2112   15    8 : tunables    0    0    0 : slabdata     15     15      0
> > > nfsd4_delegations    198    198    368   22    2 : tunables    0    0    0 : slabdata      9      9      0
> > > kmem_cache_node      192    192     64   64    1 : tunables    0    0    0 : slabdata      3      3      0
> > > xfs_buf_item         162    162    224   18    1 : tunables    0    0    0 : slabdata      9      9      0
> > > ip_fib_trie          146    146     56   73    1 : tunables    0    0    0 : slabdata      2      2      0
> > > ext4_io_end          140    154   1128   14    4 : tunables    0    0    0 : slabdata     11     11      0
> > > dnotify_mark         120    120    136   30    1 : tunables    0    0    0 : slabdata      4      4      0
> > > TCPv6                104    119   1856   17    8 : tunables    0    0    0 : slabdata      7      7      0
> > > cfq_queue            102    102    232   17    1 : tunables    0    0    0 : slabdata      6      6      0
> > > Acpi-State           102    102     80   51    1 : tunables    0    0    0 : slabdata      2      2      0
> > > sigqueue             100    100    160   25    1 : tunables    0    0    0 : slabdata      4      4      0
> > > xfs_efd_item          80     80    400   20    2 : tunables    0    0    0 : slabdata      4      4      0
> > > tw_sock_TCPv6         64    144    256   16    1 : tunables    0    0    0 : slabdata      9      9      0
> > > bdev_cache            57     57    832   19    4 : tunables    0    0    0 : slabdata      3      3      0
> > > blkdev_queue          54     54   1744   18    8 : tunables    0    0    0 : slabdata      3      3      0
> > > net_namespace         52     52   2432   13    8 : tunables    0    0    0 : slabdata      4      4      0
> > > kmalloc-8192          52     52   8192    4    8 : tunables    0    0    0 : slabdata     13     13      0
> > > kmem_cache            42     42    192   21    1 : tunables    0    0    0 : slabdata      2      2      0
> > > xfs_log_ticket        40     40    200   20    1 : tunables    0    0    0 : slabdata      2      2      0
> > > xfs_btree_cur         38     38    208   19    1 : tunables    0    0    0 : slabdata      2      2      0
> > > rpc_inode_cache       38     38    832   19    4 : tunables    0    0    0 : slabdata      2      2      0
> > > nf_conntrack_expect     34     34    240   17    1 : tunables    0    0    0 : slabdata      2      2      0
> > > xfs_da_state          32     32    488   16    2 : tunables    0    0    0 : slabdata      2      2      0
> > > UDPv6                 32     32   1024   16    4 : tunables    0    0    0 : slabdata      2      2      0
> > > xfs_trans             28     28    280   14    1 : tunables    0    0    0 : slabdata      2      2      0
> > > taskstats             24     24    328   12    1 : tunables    0    0    0 : slabdata      2      2      0
> > > dio                   24     24    640   12    2 : tunables    0    0    0 : slabdata      2      2      0
> > > posix_timers_cache     23     23    176   23    1 : tunables    0    0    0 : slabdata      1      1      0
> > > hugetlbfs_inode_cache     14     14    560   14    2 : tunables    0    0    0 : slabdata      1      1      0
> > > xfrm_dst_cache         0      0    384   21    2 : tunables    0    0    0 : slabdata      0      0      0
> > > user_namespace         0      0   1072   15    4 : tunables    0    0    0 : slabdata      0      0      0
> > > UDPLITEv6              0      0   1024   16    4 : tunables    0    0    0 : slabdata      0      0      0
> > > UDP-Lite               0      0    832   19    4 : tunables    0    0    0 : slabdata      0      0      0
> > > kcopyd_job             0      0   3240   10    8 : tunables    0    0    0 : slabdata      0      0      0
> > > flow_cache             0      0    104   39    1 : tunables    0    0    0 : slabdata      0      0      0
> > > ext2_xattr             0      0     88   46    1 : tunables    0    0    0 : slabdata      0      0      0
> > > ext2_inode_cache       0      0    752   21    4 : tunables    0    0    0 : slabdata      0      0      0
> > > dquot                  0      0    256   16    1 : tunables    0    0    0 : slabdata      0      0      0
> > > dm_uevent              0      0   2608   12    8 : tunables    0    0    0 : slabdata      0      0      0
> > > dma-kmalloc-96         0      0     96   42    1 : tunables    0    0    0 : slabdata      0      0      0
> > > dma-kmalloc-8192       0      0   8192    4    8 : tunables    0    0    0 : slabdata      0      0      0
> > > dma-kmalloc-8          0      0      8  512    1 : tunables    0    0    0 : slabdata      0      0      0
> > > dma-kmalloc-64         0      0     64   64    1 : tunables    0    0    0 : slabdata      0      0      0
> > > dma-kmalloc-512        0      0    512   16    2 : tunables    0    0    0 : slabdata      0      0      0
> > > dma-kmalloc-4096       0      0   4096    8    8 : tunables    0    0    0 : slabdata      0      0      0
> > > dma-kmalloc-32         0      0     32  128    1 : tunables    0    0    0 : slabdata      0      0      0
> > > dma-kmalloc-256        0      0    256   16    1 : tunables    0    0    0 : slabdata      0      0      0
> > > dma-kmalloc-2048       0      0   2048   16    8 : tunables    0    0    0 : slabdata      0      0      0
> > > dma-kmalloc-192        0      0    192   21    1 : tunables    0    0    0 : slabdata      0      0      0
> > > dma-kmalloc-16         0      0     16  256    1 : tunables    0    0    0 : slabdata      0      0      0
> > > dma-kmalloc-128        0      0    128   32    1 : tunables    0    0    0 : slabdata      0      0      0
> > > dma-kmalloc-1024       0      0   1024   16    4 : tunables    0    0    0 : slabdata      0      0      0
> > > bsg_cmd                0      0    312   13    1 : tunables    0    0    0 : slabdata      0      0      0
> > > 
> > > > Could you also take few snapshots of /proc/420/stack to see what kswapd
> > > > is doing.
> > > 
> > > Uhm, there is no such entry in proc. Guess I need to enable some kernel
> > > option and recompile?
> > 
> > Yes, you need CONFIG_STACKTRACE. But you can get similar information
> > by sysrq+t
> > 
> > > This will reset uptime, and the problem will take another 5-10 days to reappear..
> > 
> > Then don't do that ;)
> > 
> > > > > P.S.: The message above was written couple of days ago. Now I'm at 10
> > > > > days uptime, and this is the result as of today
> > > > >   PID USER      S %CPU %MEM    TIME+  SWAP COMMAND
> > > > >   420 root      R   93  0.0 110:48.48    0 kswapd0
> > > > > 30085 nowhere   D   42  0.2   0:04.36  10m dd
> > > > > 
> > > > > PPS: Please CC me.
> > 
> 
> 
> 



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Kswapd in 3.2.0-rc5 is a CPU hog
  2011-12-21  9:52 ` Michal Hocko
  2011-12-21 10:15   ` nowhere
@ 2011-12-21 22:55   ` Dave Chinner
  2011-12-23  9:01     ` nowhere
  1 sibling, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2011-12-21 22:55 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Nikolay S., linux-kernel, linux-mm

On Wed, Dec 21, 2011 at 10:52:49AM +0100, Michal Hocko wrote:
> [Let's CC linux-mm]
> 
> On Wed 21-12-11 07:10:36, Nikolay S. wrote:
> > Hello,
> > 
> > I'm using 3.2-rc5 on a machine, which atm does almost nothing except
> > file system operations and network i/o (i.e. file server). And there is
> > a problem with kswapd.
> 
> What kind of filesystem do you use?
> 
> > 
> > I'm playing with dd:
> > dd if=/some/big/file of=/dev/null bs=8M
> > 
> > I.e. I'm filling page cache.
> > 
> > So when the machine is just rebooted, kswapd during this operation is
> > almost idle, just 5-8 percent according to top.
> > 
> > After ~5 days of uptime (5 days,  2:10), the same operation demands ~70%
> > for kswapd:
> > 
> >   PID USER      S %CPU %MEM    TIME+  SWAP COMMAND
> >   420 root      R   70  0.0  22:09.60    0 kswapd0
> > 17717 nowhere   D   27  0.2   0:01.81  10m dd
> > 
> > In fact, kswapd cpu usage on this operation steadily increases over
> > time.
> > 
> > Also read performance degrades over time. After reboot:
> > dd if=/some/big/file of=/dev/null bs=8M
> > 1019+1 records in
> > 1019+1 records out
> > 8553494018 bytes (8.6 GB) copied, 16.211 s, 528 MB/s
> > 
> > After ~5 days uptime:
> > dd if=/some/big/file of=/dev/null bs=8M
> > 1019+1 records in
> > 1019+1 records out
> > 8553494018 bytes (8.6 GB) copied, 29.0507 s, 294 MB/s
> > 
> > Whereas raw disk sequential read performance stays the same:
> > dd if=/some/big/file of=/dev/null bs=8M iflag=direct
> > 1019+1 records in
> > 1019+1 records out
> > 8553494018 bytes (8.6 GB) copied, 14.7286 s, 581 MB/s
> > 
> > Also after dropping caches, situation somehow improves, but not to the
> > state of freshly restarted system:
> >   PID USER      S %CPU %MEM    TIME+  SWAP COMMAND
> >   420 root      S   39  0.0  23:31.17    0 kswapd0
> > 19829 nowhere   D   24  0.2   0:02.72 7764 dd
> > 
> > perf shows:
> > 
> >     31.24%  kswapd0  [kernel.kallsyms]  [k] _raw_spin_lock
> >     26.19%  kswapd0  [kernel.kallsyms]  [k] shrink_slab
> >     16.28%  kswapd0  [kernel.kallsyms]  [k] prune_super
> >      6.55%  kswapd0  [kernel.kallsyms]  [k] grab_super_passive
> >      5.35%  kswapd0  [kernel.kallsyms]  [k] down_read_trylock
> >      4.03%  kswapd0  [kernel.kallsyms]  [k] up_read
> >      2.31%  kswapd0  [kernel.kallsyms]  [k] put_super
> >      1.81%  kswapd0  [kernel.kallsyms]  [k] drop_super
> >      0.99%  kswapd0  [kernel.kallsyms]  [k] __put_super
> >      0.25%  kswapd0  [kernel.kallsyms]  [k] __isolate_lru_page
> >      0.23%  kswapd0  [kernel.kallsyms]  [k] free_pcppages_bulk
> >      0.19%  kswapd0  [r8169]            [k] rtl8169_interrupt
> >      0.15%  kswapd0  [kernel.kallsyms]  [k] twa_interrupt
> 
> Quite a lot of time spent shrinking slab (dcache I guess) and a lot of
> spin lock contention.

That's just scanning superblocks, not apparently doing anything
useful like shrinking dentries or inodes attached to each sb. i.e.
the shrinkers are being called an awful lot and basically have
nothing to do. I'd be suspecting a problem higher up in the stack to
do with how shrink_slab is operating or being called.
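
Roughly, the 3.2-era shrink_slab() sizes each shrinker's work per call
as (a sketch from memory, with seeks = DEFAULT_SEEKS = 2 for these
shrinkers; the names follow the tracepoint fields):

  delta = (4 * pgs_scanned / seeks) * cache_items / (lru_pgs + 1)

With an LRU of ~1M page-cache pages and slab caches of only a few
thousand objects, delta truncates to zero on almost every call, so the
shrinkers get polled constantly while each call is sized to do nothing.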

I'd suggest gathering event traces for mm_shrink_slab_start/
mm_shrink_slab_end to try to see how the shrinkers are being
driven...
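
Something like this should capture them, assuming debugfs is mounted in
the usual place (trace-cmd or perf with the vmscan:mm_shrink_slab_*
events would work just as well):

  cd /sys/kernel/debug/tracing
  echo 1 > events/vmscan/mm_shrink_slab_start/enable
  echo 1 > events/vmscan/mm_shrink_slab_end/enable
  cat trace_pipe > /tmp/shrink_slab.txt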

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Kswapd in 3.2.0-rc5 is a CPU hog
  2011-12-21 22:55   ` Dave Chinner
@ 2011-12-23  9:01     ` nowhere
  2011-12-23 10:20       ` Dave Chinner
  0 siblings, 1 reply; 32+ messages in thread
From: nowhere @ 2011-12-23  9:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Michal Hocko, linux-kernel, linux-mm

On Thu, 22/12/2011 at 09:55 +1100, Dave Chinner wrote:
> On Wed, Dec 21, 2011 at 10:52:49AM +0100, Michal Hocko wrote:
> > [Let's CC linux-mm]
> > 
> > On Wed 21-12-11 07:10:36, Nikolay S. wrote:
> > > Hello,
> > > 
> > > I'm using 3.2-rc5 on a machine, which atm does almost nothing except
> > > file system operations and network i/o (i.e. file server). And there is
> > > a problem with kswapd.
> > 
> > What kind of filesystem do you use?
> > 
> > > 
> > > I'm playing with dd:
> > > dd if=/some/big/file of=/dev/null bs=8M
> > > 
> > > I.e. I'm filling page cache.
> > > 
> > > So when the machine is just rebooted, kswapd during this operation is
> > > almost idle, just 5-8 percent according to top.
> > > 
> > > After ~5 days of uptime (5 days,  2:10), the same operation demands ~70%
> > > for kswapd:
> > > 
> > >   PID USER      S %CPU %MEM    TIME+  SWAP COMMAND
> > >   420 root      R   70  0.0  22:09.60    0 kswapd0
> > > 17717 nowhere   D   27  0.2   0:01.81  10m dd
> > > 
> > > In fact, kswapd cpu usage on this operation steadily increases over
> > > time.
> > > 
> > > Also read performance degrades over time. After reboot:
> > > dd if=/some/big/file of=/dev/null bs=8M
> > > 1019+1 records in
> > > 1019+1 records out
> > > 8553494018 bytes (8.6 GB) copied, 16.211 s, 528 MB/s
> > > 
> > > After ~5 days uptime:
> > > dd if=/some/big/file of=/dev/null bs=8M
> > > 1019+1 records in
> > > 1019+1 records out
> > > 8553494018 bytes (8.6 GB) copied, 29.0507 s, 294 MB/s
> > > 
> > > Whereas raw disk sequential read performance stays the same:
> > > dd if=/some/big/file of=/dev/null bs=8M iflag=direct
> > > 1019+1 records in
> > > 1019+1 records out
> > > 8553494018 bytes (8.6 GB) copied, 14.7286 s, 581 MB/s
> > > 
> > > Also after dropping caches, situation somehow improves, but not to the
> > > state of freshly restarted system:
> > >   PID USER      S %CPU %MEM    TIME+  SWAP COMMAND
> > >   420 root      S   39  0.0  23:31.17    0 kswapd0
> > > 19829 nowhere   D   24  0.2   0:02.72 7764 dd
> > > 
> > > perf shows:
> > > 
> > >     31.24%  kswapd0  [kernel.kallsyms]  [k] _raw_spin_lock
> > >     26.19%  kswapd0  [kernel.kallsyms]  [k] shrink_slab
> > >     16.28%  kswapd0  [kernel.kallsyms]  [k] prune_super
> > >      6.55%  kswapd0  [kernel.kallsyms]  [k] grab_super_passive
> > >      5.35%  kswapd0  [kernel.kallsyms]  [k] down_read_trylock
> > >      4.03%  kswapd0  [kernel.kallsyms]  [k] up_read
> > >      2.31%  kswapd0  [kernel.kallsyms]  [k] put_super
> > >      1.81%  kswapd0  [kernel.kallsyms]  [k] drop_super
> > >      0.99%  kswapd0  [kernel.kallsyms]  [k] __put_super
> > >      0.25%  kswapd0  [kernel.kallsyms]  [k] __isolate_lru_page
> > >      0.23%  kswapd0  [kernel.kallsyms]  [k] free_pcppages_bulk
> > >      0.19%  kswapd0  [r8169]            [k] rtl8169_interrupt
> > >      0.15%  kswapd0  [kernel.kallsyms]  [k] twa_interrupt
> > 
> > Quite a lot of time spent shrinking slab (dcache I guess) and a lot of
> > spin lock contention.
> 
> That's just scanning superblocks, not apparently doing anything
> useful like shrinking dentries or inodes attached to each sb. i.e.
> the shrinkers are being called an awful lot and basically have
> nothing to do. I'd be suspecting a problem higher up in the stack to
> do with how shrink_slab is operating or being called.
> 
> I'd suggest gathering event traces for mm_shrink_slab_start/
> mm_shrink_slab_end to try to see how the shrinkers are being
> driven...
> 
> Cheers,
> 
> Dave.

I have recompiled the kernel with the tracepoints enabled, and today the
problem is visible again. So here is the trace for mm_shrink_slab_start
(it is HUGE):

         kswapd0   421 [000] 103976.627873: mm_shrink_slab_start: prune_super+0x0 0xffff88011b00d300: objects to shrink 12 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942483 cache items 1500 delt
         kswapd0   421 [000] 103976.627882: mm_shrink_slab_start: prune_super+0x0 0xffff88011a20ab00: objects to shrink 267 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942483 cache items 5300 del
         kswapd0   421 [000] 103976.627884: mm_shrink_slab_start: prune_super+0x0 0xffff88011a892300: objects to shrink 110 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942483 cache items 2000 del
         kswapd0   421 [000] 103976.627887: mm_shrink_slab_start: prune_super+0x0 0xffff88011a893700: objects to shrink 31 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942483 cache items 4900 delt
         kswapd0   421 [000] 103976.627888: mm_shrink_slab_start: prune_super+0x0 0xffff88011a386700: objects to shrink 0 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942483 cache items 300 delta 
         kswapd0   421 [000] 103976.627889: mm_shrink_slab_start: prune_super+0x0 0xffff88011a381700: objects to shrink 6 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942483 cache items 4700 delta
         kswapd0   421 [000] 103976.627890: mm_shrink_slab_start: prune_super+0x0 0xffff88011a381300: objects to shrink 3 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942483 cache items 600 delta 
         kswapd0   421 [000] 103976.627893: mm_shrink_slab_start: xfs_buftarg_shrink+0x0 0xffff88011a171058: objects to shrink 0 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942483 cache items 1 d
         kswapd0   421 [000] 103976.627895: mm_shrink_slab_start: xfs_buftarg_shrink+0x0 0xffff88011a29e658: objects to shrink 0 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942483 cache items 1 d
         kswapd0   421 [000] 103976.627897: mm_shrink_slab_start: prune_super+0x0 0xffff88011ab17300: objects to shrink 311 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942483 cache items 4900 del
         kswapd0   421 [000] 103976.627897: mm_shrink_slab_start: xfs_buftarg_shrink+0x0 0xffff88011b3c4d18: objects to shrink 27 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942483 cache items 49
         kswapd0   421 [000] 103976.628108: mm_shrink_slab_start: prune_super+0x0 0xffff88011b00d300: objects to shrink 12 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 95 lru_pgs 942483 cache items 1500 delt
         kswapd0   421 [000] 103976.628110: mm_shrink_slab_start: prune_super+0x0 0xffff88011a20ab00: objects to shrink 267 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 95 lru_pgs 942483 cache items 5300 del
         kswapd0   421 [000] 103976.628111: mm_shrink_slab_start: prune_super+0x0 0xffff88011a892300: objects to shrink 110 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 95 lru_pgs 942483 cache items 2000 del
         kswapd0   421 [000] 103976.628112: mm_shrink_slab_start: prune_super+0x0 0xffff88011a893700: objects to shrink 31 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 95 lru_pgs 942483 cache items 4900 delt
         kswapd0   421 [000] 103976.628113: mm_shrink_slab_start: prune_super+0x0 0xffff88011a386700: objects to shrink 0 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 95 lru_pgs 942483 cache items 300 delta 
         kswapd0   421 [000] 103976.628113: mm_shrink_slab_start: prune_super+0x0 0xffff88011a381700: objects to shrink 6 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 95 lru_pgs 942483 cache items 4700 delta
         kswapd0   421 [000] 103976.628114: mm_shrink_slab_start: prune_super+0x0 0xffff88011a381300: objects to shrink 3 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 95 lru_pgs 942483 cache items 600 delta 
         kswapd0   421 [000] 103976.628115: mm_shrink_slab_start: xfs_buftarg_shrink+0x0 0xffff88011a171058: objects to shrink 0 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 95 lru_pgs 942483 cache items 1 d
         kswapd0   421 [000] 103976.628116: mm_shrink_slab_start: xfs_buftarg_shrink+0x0 0xffff88011a29e658: objects to shrink 0 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 95 lru_pgs 942483 cache items 1 d
         kswapd0   421 [000] 103976.628116: mm_shrink_slab_start: prune_super+0x0 0xffff88011ab17300: objects to shrink 311 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 95 lru_pgs 942483 cache items 4900 del
         kswapd0   421 [000] 103976.628117: mm_shrink_slab_start: xfs_buftarg_shrink+0x0 0xffff88011b3c4d18: objects to shrink 27 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 95 lru_pgs 942483 cache items 49
         kswapd0   421 [000] 103976.628161: mm_shrink_slab_start: prune_super+0x0 0xffff88011b00d300: objects to shrink 12 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 8 lru_pgs 942483 cache items 1500 delta
         kswapd0   421 [000] 103976.628163: mm_shrink_slab_start: prune_super+0x0 0xffff88011a20ab00: objects to shrink 268 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 8 lru_pgs 942483 cache items 5300 delt
         kswapd0   421 [000] 103976.628163: mm_shrink_slab_start: prune_super+0x0 0xffff88011a892300: objects to shrink 110 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 8 lru_pgs 942483 cache items 2000 delt
         kswapd0   421 [000] 103976.628165: mm_shrink_slab_start: prune_super+0x0 0xffff88011a893700: objects to shrink 31 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 8 lru_pgs 942483 cache items 4900 delta
         kswapd0   421 [000] 103976.628165: mm_shrink_slab_start: prune_super+0x0 0xffff88011a386700: objects to shrink 0 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 8 lru_pgs 942483 cache items 300 delta 0
         kswapd0   421 [000] 103976.628166: mm_shrink_slab_start: prune_super+0x0 0xffff88011a381700: objects to shrink 6 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 8 lru_pgs 942483 cache items 4700 delta 
         kswapd0   421 [000] 103976.628167: mm_shrink_slab_start: prune_super+0x0 0xffff88011a381300: objects to shrink 3 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 8 lru_pgs 942483 cache items 600 delta 0
         kswapd0   421 [000] 103976.628167: mm_shrink_slab_start: xfs_buftarg_shrink+0x0 0xffff88011a171058: objects to shrink 0 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 8 lru_pgs 942483 cache items 1 de
         kswapd0   421 [000] 103976.628168: mm_shrink_slab_start: xfs_buftarg_shrink+0x0 0xffff88011a29e658: objects to shrink 0 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 8 lru_pgs 942483 cache items 1 de
         kswapd0   421 [000] 103976.628169: mm_shrink_slab_start: prune_super+0x0 0xffff88011ab17300: objects to shrink 311 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 8 lru_pgs 942483 cache items 4900 delt
         kswapd0   421 [000] 103976.628169: mm_shrink_slab_start: xfs_buftarg_shrink+0x0 0xffff88011b3c4d18: objects to shrink 27 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 8 lru_pgs 942483 cache items 493
         kswapd0   421 [000] 103976.628208: mm_shrink_slab_start: prune_super+0x0 0xffff88011b00d300: objects to shrink 12 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942462 cache items 1500 delt
         kswapd0   421 [000] 103976.628210: mm_shrink_slab_start: prune_super+0x0 0xffff88011a20ab00: objects to shrink 268 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942462 cache items 5300 del
         kswapd0   421 [000] 103976.628210: mm_shrink_slab_start: prune_super+0x0 0xffff88011a892300: objects to shrink 110 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942462 cache items 2000 del
         kswapd0   421 [000] 103976.628212: mm_shrink_slab_start: prune_super+0x0 0xffff88011a893700: objects to shrink 31 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942462 cache items 4900 delt
         kswapd0   421 [000] 103976.628212: mm_shrink_slab_start: prune_super+0x0 0xffff88011a386700: objects to shrink 0 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942462 cache items 300 delta 
         kswapd0   421 [000] 103976.628213: mm_shrink_slab_start: prune_super+0x0 0xffff88011a381700: objects to shrink 6 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942462 cache items 4700 delta
         kswapd0   421 [000] 103976.628214: mm_shrink_slab_start: prune_super+0x0 0xffff88011a381300: objects to shrink 3 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942462 cache items 600 delta 
         kswapd0   421 [000] 103976.628214: mm_shrink_slab_start: xfs_buftarg_shrink+0x0 0xffff88011a171058: objects to shrink 0 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942462 cache items 1 d
         kswapd0   421 [000] 103976.628215: mm_shrink_slab_start: xfs_buftarg_shrink+0x0 0xffff88011a29e658: objects to shrink 0 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942462 cache items 1 d
         kswapd0   421 [000] 103976.628216: mm_shrink_slab_start: prune_super+0x0 0xffff88011ab17300: objects to shrink 311 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942462 cache items 4900 del
         kswapd0   421 [000] 103976.628216: mm_shrink_slab_start: xfs_buftarg_shrink+0x0 0xffff88011b3c4d18: objects to shrink 27 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942462 cache items 49

And mm_shrink_slab_end:

         kswapd0   421 [000] 104433.026125: mm_shrink_slab_end: prune_super+0x0 0xffff88011b00d300: unused scan count 85 new scan count 85 total_scan 0 last shrinker return val 0
         kswapd0   421 [000] 104433.026133: mm_shrink_slab_end: prune_super+0x0 0xffff88011a20ab00: unused scan count 265 new scan count 265 total_scan 0 last shrinker return val 0
         kswapd0   421 [000] 104433.026134: mm_shrink_slab_end: prune_super+0x0 0xffff88011a892300: unused scan count 217 new scan count 217 total_scan 0 last shrinker return val 0
         kswapd0   421 [000] 104433.026137: mm_shrink_slab_end: prune_super+0x0 0xffff88011a893700: unused scan count 421 new scan count 421 total_scan 0 last shrinker return val 0
         kswapd0   421 [000] 104433.026138: mm_shrink_slab_end: prune_super+0x0 0xffff88011a386700: unused scan count 0 new scan count 0 total_scan 0 last shrinker return val 0
         kswapd0   421 [000] 104433.026139: mm_shrink_slab_end: prune_super+0x0 0xffff88011a381700: unused scan count 361 new scan count 361 total_scan 0 last shrinker return val 0
         kswapd0   421 [000] 104433.026140: mm_shrink_slab_end: prune_super+0x0 0xffff88011a381300: unused scan count 20 new scan count 20 total_scan 0 last shrinker return val 0
         kswapd0   421 [000] 104433.026143: mm_shrink_slab_end: xfs_buftarg_shrink+0x0 0xffff88011a171058: unused scan count 0 new scan count 0 total_scan 0 last shrinker return val 0
         kswapd0   421 [000] 104433.026144: mm_shrink_slab_end: xfs_buftarg_shrink+0x0 0xffff88011a29e658: unused scan count 0 new scan count 0 total_scan 0 last shrinker return val 0
         kswapd0   421 [000] 104433.026146: mm_shrink_slab_end: prune_super+0x0 0xffff88011ab17300: unused scan count 711 new scan count 725 total_scan 14 last shrinker return val 0
         kswapd0   421 [000] 104433.026146: mm_shrink_slab_end: xfs_buftarg_shrink+0x0 0xffff88011b3c4d18: unused scan count 50 new scan count 51 total_scan 1 last shrinker return val 0
         kswapd0   421 [000] 104433.026280: mm_shrink_slab_end: prune_super+0x0 0xffff88011b00d300: unused scan count 85 new scan count 85 total_scan 0 last shrinker return val 0
         kswapd0   421 [000] 104433.026282: mm_shrink_slab_end: prune_super+0x0 0xffff88011a20ab00: unused scan count 265 new scan count 265 total_scan 0 last shrinker return val 0
         kswapd0   421 [000] 104433.026282: mm_shrink_slab_end: prune_super+0x0 0xffff88011a892300: unused scan count 217 new scan count 217 total_scan 0 last shrinker return val 0
         kswapd0   421 [000] 104433.026284: mm_shrink_slab_end: prune_super+0x0 0xffff88011a893700: unused scan count 421 new scan count 421 total_scan 0 last shrinker return val 0
         kswapd0   421 [000] 104433.026285: mm_shrink_slab_end: prune_super+0x0 0xffff88011a386700: unused scan count 0 new scan count 0 total_scan 0 last shrinker return val 0
         kswapd0   421 [000] 104433.026285: mm_shrink_slab_end: prune_super+0x0 0xffff88011a381700: unused scan count 361 new scan count 361 total_scan 0 last shrinker return val 0
         kswapd0   421 [000] 104433.026286: mm_shrink_slab_end: prune_super+0x0 0xffff88011a381300: unused scan count 20 new scan count 20 total_scan 0 last shrinker return val 0
         kswapd0   421 [000] 104433.026287: mm_shrink_slab_end: xfs_buftarg_shrink+0x0 0xffff88011a171058: unused scan count 0 new scan count 0 total_scan 0 last shrinker return val 0
         kswapd0   421 [000] 104433.026287: mm_shrink_slab_end: xfs_buftarg_shrink+0x0 0xffff88011a29e658: unused scan count 0 new scan count 0 total_scan 0 last shrinker return val 0
         kswapd0   421 [000] 104433.026288: mm_shrink_slab_end: prune_super+0x0 0xffff88011ab17300: unused scan count 725 new scan count 759 total_scan 34 last shrinker return val 0
         kswapd0   421 [000] 104433.026289: mm_shrink_slab_end: xfs_buftarg_shrink+0x0 0xffff88011b3c4d18: unused scan count 51 new scan count 54 total_scan 3 last shrinker return val 0
         kswapd0   421 [000] 104433.026329: mm_shrink_slab_end: prune_super+0x0 0xffff88011b00d300: unused scan count 85 new scan count 85 total_scan 0 last shrinker return val 0
         kswapd0   421 [000] 104433.026331: mm_shrink_slab_end: prune_super+0x0 0xffff88011a20ab00: unused scan count 265 new scan count 265 total_scan 0 last shrinker return val 0
         kswapd0   421 [000] 104433.026331: mm_shrink_slab_end: prune_super+0x0 0xffff88011a892300: unused scan count 217 new scan count 217 total_scan 0 last shrinker return val 0
         kswapd0   421 [000] 104433.026333: mm_shrink_slab_end: prune_super+0x0 0xffff88011a893700: unused scan count 421 new scan count 421 total_scan 0 last shrinker return val 0
         kswapd0   421 [000] 104433.026333: mm_shrink_slab_end: prune_super+0x0 0xffff88011a386700: unused scan count 0 new scan count 0 total_scan 0 last shrinker return val 0
         kswapd0   421 [000] 104433.026334: mm_shrink_slab_end: prune_super+0x0 0xffff88011a381700: unused scan count 361 new scan count 361 total_scan 0 last shrinker return val 0
         kswapd0   421 [000] 104433.026335: mm_shrink_slab_end: prune_super+0x0 0xffff88011a381300: unused scan count 20 new scan count 20 total_scan 0 last shrinker return val 0
         kswapd0   421 [000] 104433.026335: mm_shrink_slab_end: xfs_buftarg_shrink+0x0 0xffff88011a171058: unused scan count 0 new scan count 0 total_scan 0 last shrinker return val 0
         kswapd0   421 [000] 104433.026336: mm_shrink_slab_end: xfs_buftarg_shrink+0x0 0xffff88011a29e658: unused scan count 0 new scan count 0 total_scan 0 last shrinker return val 0
         kswapd0   421 [000] 104433.026337: mm_shrink_slab_end: prune_super+0x0 0xffff88011ab17300: unused scan count 759 new scan count 760 total_scan 1 last shrinker return val 0
         kswapd0   421 [000] 104433.026337: mm_shrink_slab_end: xfs_buftarg_shrink+0x0 0xffff88011b3c4d18: unused scan count 54 new scan count 54 total_scan 0 last shrinker return val 0
         kswapd0   421 [000] 104433.026376: mm_shrink_slab_end: prune_super+0x0 0xffff88011b00d300: unused scan count 85 new scan count 85 total_scan 0 last shrinker return val 0
         kswapd0   421 [000] 104433.026378: mm_shrink_slab_end: prune_super+0x0 0xffff88011a20ab00: unused scan count 265 new scan count 265 total_scan 0 last shrinker return val 0
         kswapd0   421 [000] 104433.026378: mm_shrink_slab_end: prune_super+0x0 0xffff88011a892300: unused scan count 217 new scan count 217 total_scan 0 last shrinker return val 0
         kswapd0   421 [000] 104433.026380: mm_shrink_slab_end: prune_super+0x0 0xffff88011a893700: unused scan count 421 new scan count 421 total_scan 0 last shrinker return val 0
         kswapd0   421 [000] 104433.026380: mm_shrink_slab_end: prune_super+0x0 0xffff88011a386700: unused scan count 0 new scan count 0 total_scan 0 last shrinker return val 0
         kswapd0   421 [000] 104433.026381: mm_shrink_slab_end: prune_super+0x0 0xffff88011a381700: unused scan count 361 new scan count 361 total_scan 0 last shrinker return val 0
         kswapd0   421 [000] 104433.026382: mm_shrink_slab_end: prune_super+0x0 0xffff88011a381300: unused scan count 20 new scan count 20 total_scan 0 last shrinker return val 0
         kswapd0   421 [000] 104433.026382: mm_shrink_slab_end: xfs_buftarg_shrink+0x0 0xffff88011a171058: unused scan count 0 new scan count 0 total_scan 0 last shrinker return val 0
         kswapd0   421 [000] 104433.026383: mm_shrink_slab_end: xfs_buftarg_shrink+0x0 0xffff88011a29e658: unused scan count 0 new scan count 0 total_scan 0 last shrinker return val 0
         kswapd0   421 [000] 104433.026384: mm_shrink_slab_end: prune_super+0x0 0xffff88011ab17300: unused scan count 760 new scan count 774 total_scan 14 last shrinker return val 0
         kswapd0   421 [000] 104433.026384: mm_shrink_slab_end: xfs_buftarg_shrink+0x0 0xffff88011b3c4d18: unused scan count 54 new scan count 55 total_scan 1 last shrinker return val 0

  PID USER      S %CPU %MEM    TIME+  COMMAND
 4438 nowhere   D   27  0.2   0:02.54 dd
  421 root      S   17  0.0   1:45.79 kswapd0

Uptime is 1 day,  5:02


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Kswapd in 3.2.0-rc5 is a CPU hog
  2011-12-23  9:01     ` nowhere
@ 2011-12-23 10:20       ` Dave Chinner
  2011-12-23 11:04         ` nowhere
  0 siblings, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2011-12-23 10:20 UTC (permalink / raw)
  To: nowhere; +Cc: Michal Hocko, linux-kernel, linux-mm

On Fri, Dec 23, 2011 at 01:01:20PM +0400, nowhere wrote:
> On Thu, 22/12/2011 at 09:55 +1100, Dave Chinner wrote:
> > On Wed, Dec 21, 2011 at 10:52:49AM +0100, Michal Hocko wrote:
> > > [Let's CC linux-mm]
> > > 
> > > > [...]
> > > 
> > > Quite a lot of time spent shrinking slab (dcache I guess) and a lot of
> > > spin lock contention.
> > 
> > That's just scanning superblocks, not apparently doing anything
> > useful like shrinking dentries or inodes attached to each sb. i.e.
> > the shrinkers are being called an awful lot and basically have
> > nothing to do. I'd be suspecting a problem higher up in the stack to
> > do with how shrink_slab is operating or being called.
> > 
> > I'd suggest gathering event traces for mm_shrink_slab_start/
> > mm_shrink_slab_end to try to see how the shrinkers are being
> > driven...
> > 
> > Cheers,
> > 
> > Dave.
> 
> I have recompiled the kernel with tracers, and today the problem is visible
> again. So here is the trace for mm_shrink_slab_start (it is HUGE):
> 
>          kswapd0   421 [000] 103976.627873: mm_shrink_slab_start: prune_super+0x0 0xffff88011b00d300: objects to shrink 12 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942483 cache items 1500 delt
>          kswapd0   421 [000] 103976.627882: mm_shrink_slab_start: prune_super+0x0 0xffff88011a20ab00: objects to shrink 267 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942483 cache items 5300 del

And possibly useless in this form. I need to see the
mm_shrink_slab_start/mm_shrink_slab_end events interleaved so I can
see exactly how much work each shrinker call is doing, and the start
events are truncated so not all the info I need is present.

Perhaps you should install trace-cmd.

$ trace-cmd record -e mm_shrink_slab*
(wait 30s, then ^C)
$ trace-cmd report > shrink.trace

And then compress and attach the trace file or put it up on the web
somewhere for me to download if it's too large for email...

As it is, there are ~940k pages on the LRUs, and shrink_slab is being
called after 32, 95, 8, 8, 32 and 32 pages on the LRU have been
scanned. That seems like the shrinkers are being called rather too
often.

The end traces indicate the shrinker caches aren't able to free
anything. So it looks like the vmscan code has got itself in a
situation where it is not scanning many pages between shrinker
callouts, and the shrinkers scan but can't make any progress. Looks
like a vmscan balancing problem right now, not anything to do with
the shrinker code. A better trace will confirm that.
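
FWIW, the zeros fall out of the proportional-work calculation inside
shrink_slab(). A standalone sketch of that calculation (simplified from
the 3.x mm/vmscan.c, not the verbatim kernel code), using the numbers
from the truncated start trace above:

        #include <stdio.h>

        /* sketch: why a 32-page LRU scan gives the shrinkers no work */
        int main(void)
        {
                unsigned long nr_pages_scanned = 32;    /* pgs_scanned */
                unsigned long lru_pages = 942483;       /* lru_pgs */
                unsigned long max_pass = 5300;          /* cache items */
                unsigned long seeks = 2;                /* DEFAULT_SEEKS */
                unsigned long delta;

                delta = (4 * nr_pages_scanned) / seeks; /* = 64 */
                delta = delta * max_pass / (lru_pages + 1);

                /* 64 * 5300 / 942484 == 0: nothing to scan, so each
                 * call is pure list-walking and locking overhead */
                printf("objects to scan this call: %lu\n", delta);
                return 0;
        }

The sub-batch remainders carry over between calls, which would also explain
the "new scan count" creeping up by a handful in the end traces while the
shrinkers still do nothing (SHRINK_BATCH being 128).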

FWIW, if you use trace-cmd, it might be worthwhile collecting all the
vmscan trace events too, as that might help the VM folk understand
the problem without needing to ask you for more info.
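
Something like this should grab both in one go (one possible invocation;
check your trace-cmd version's syntax):

$ trace-cmd record -e vmscan sleep 30
$ trace-cmd report | xz > vmscan.trace.xz

-e vmscan enables every event in the vmscan subsystem, which includes the
mm_shrink_slab_start/end pair.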

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Kswapd in 3.2.0-rc5 is a CPU hog
  2011-12-23 10:20       ` Dave Chinner
@ 2011-12-23 11:04         ` nowhere
  2011-12-23 20:45           ` Dave Chinner
  0 siblings, 1 reply; 32+ messages in thread
From: nowhere @ 2011-12-23 11:04 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Michal Hocko, linux-kernel, linux-mm

On Fri, 23/12/2011 at 21:20 +1100, Dave Chinner wrote:
> On Fri, Dec 23, 2011 at 01:01:20PM +0400, nowhere wrote:
> > On Thu, 22/12/2011 at 09:55 +1100, Dave Chinner wrote:
> > > On Wed, Dec 21, 2011 at 10:52:49AM +0100, Michal Hocko wrote:
> > > [...]
> > 
> > I have recompiled the kernel with tracers, and today the problem is visible
> > again. So here is the trace for mm_shrink_slab_start (it is HUGE):
> > 
> >          kswapd0   421 [000] 103976.627873: mm_shrink_slab_start: prune_super+0x0 0xffff88011b00d300: objects to shrink 12 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942483 cache items 1500 delt
> >          kswapd0   421 [000] 103976.627882: mm_shrink_slab_start: prune_super+0x0 0xffff88011a20ab00: objects to shrink 267 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942483 cache items 5300 del
> 
> And possibly useless in this form. I need to see the
> mm_shrink_slab_start/mm_shrink_slab_end events interleaved so I can
> see exactly how much work each shrinker call is doing, and the start
> events are truncated so not all the info I need is present.
> 
> Perhaps you should install trace-cmd.
> 
> $ trace-cmd record -e mm_shrink_slab*
> (wait 30s, then ^C)
> $ trace-cmd report > shrink.trace
> 
> > And then compress and attach the trace file or put it up on the web
> > somewhere for me to download if it's too large for email...
> > 
> > As it is, there are ~940k pages on the LRUs, and shrink_slab is being
> called after 32, 95, 8, 8, 32 and 32 pages on the LRU have been
> scanned. That seems like the shrinkers are being called rather too
> often.
> 
> The end traces indicate the shrinker caches aren't able to free
> anything. So it looks like the vmscan code has got itself in a
> situation where it is not scanning many pages between shrinker
> callouts, and the shrinkers scan but can't make any progress. Looks
> like a vmscan balancing problem right now, not anything to do with
> the shrinker code. A better trace will confirm that.
> 
> FWIW, if you use trace-cmd, it might be worthwhile collecting all the
> vmscan trace events too, as that might help the VM folk understand
> the problem without needing to ask you for more info.

./trace-cmd record -e vmscan/*

Here is the report of trace-cmd while dd'ing
https://80.237.6.56/report-dd.xz


And one more during normal operation
https://80.237.6.56/report-normal.xz

> 
> Cheers,
> 
> Dave.



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Kswapd in 3.2.0-rc5 is a CPU hog
  2011-12-23 11:04         ` nowhere
@ 2011-12-23 20:45           ` Dave Chinner
  2011-12-25  9:09             ` Hillf Danton
  2011-12-27  2:15             ` KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 32+ messages in thread
From: Dave Chinner @ 2011-12-23 20:45 UTC (permalink / raw)
  To: nowhere; +Cc: Michal Hocko, linux-kernel, linux-mm

On Fri, Dec 23, 2011 at 03:04:02PM +0400, nowhere wrote:
> On Fri, 23/12/2011 at 21:20 +1100, Dave Chinner wrote:
> > [...]
> 
> ./trace-cmd record -e vmscan/*
> 
> Here is the report of trace-cmd while dd'ing
> https://80.237.6.56/report-dd.xz

Ok, it's not a shrink_slab() problem - it's just being called every ~100us
by kswapd. The pattern is:

	- reclaim 94 (batches of 32,32,30) pages from the inactive list
	  of zone 1, node 0, prio 12
	- call shrink_slab
		- scan all caches
		- all shrinkers return 0 saying nothing to shrink
	- 40us gap
	- reclaim 10-30 pages from inactive list of zone 2, node 0, prio 12
	- call shrink_slab
		- scan all caches
		- all shrinkers return 0 saying nothing to shrink
	- 40us gap
	- isolate 9 pages from LRU zone ?, node ?, none isolated, none freed
	- isolate 22 pages from LRU zone ?, node ?, none isolated, none freed
	- call shrink_slab
		- scan all caches
		- all shrinkers return 0 saying nothing to shrink
	40us gap

And it just repeats over and over again. After a while, nid=0,zone=1
drops out of the traces, so reclaim only comes in batches of 10-30
pages from zone 2 between each shrink_slab() call.

The trace starts at 111209.881s, with 944776 pages on the LRUs. It
finishes at 111216.1 with kswapd going to sleep on node 0 with
930067 pages on the LRU. So 7 seconds to free 15,000 pages (call it
2,000 pages/s) which is awfully slow....
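
If anyone wants to double-check that rate: summing the per-call counts from
the report should do it, assuming the mm_vmscan_lru_shrink_inactive events
carry an nr_reclaimed= field as they do in kernels of this era:

$ trace-cmd report | awk '/mm_vmscan_lru_shrink_inactive/ {
        for (i = 1; i <= NF; i++)
                if (sub(/^nr_reclaimed=/, "", $i)) total += $i
  } END { print total, "pages reclaimed in the trace window" }'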

vmscan gurus - time for you to step in now...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Kswapd in 3.2.0-rc5 is a CPU hog
  2011-12-23 20:45           ` Dave Chinner
@ 2011-12-25  9:09             ` Hillf Danton
  2011-12-25 10:21               ` Nikolay S.
  2011-12-27  2:15             ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 32+ messages in thread
From: Hillf Danton @ 2011-12-25  9:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: nowhere, Michal Hocko, linux-kernel, linux-mm

On Sat, Dec 24, 2011 at 4:45 AM, Dave Chinner <david@fromorbit.com> wrote:
[...]
>
> Ok, it's not a shrink_slab() problem - it's just being called every ~100us
> by kswapd. The pattern is:
>
>        - reclaim 94 (batches of 32,32,30) pages from the inactive list
>          of zone 1, node 0, prio 12
>        - call shrink_slab
>                - scan all caches
>                - all shrinkers return 0 saying nothing to shrink
>        - 40us gap
>        - reclaim 10-30 pages from inactive list of zone 2, node 0, prio 12
>        - call shrink_slab
>                - scan all caches
>                - all shrinkers return 0 saying nothing to shrink
>        - 40us gap
>        - isolate 9 pages from LRU zone ?, node ?, none isolated, none freed
>        - isolate 22 pages from LRU zone ?, node ?, none isolated, none freed
>        - call shrink_slab
>                - scan all caches
>                - all shrinkers return 0 saying nothing to shrink
>        40us gap
>
> And it just repeats over and over again. After a while, nid=0,zone=1
> drops out of the traces, so reclaim only comes in batches of 10-30
> pages from zone 2 between each shrink_slab() call.
>
> The trace starts at 111209.881s, with 944776 pages on the LRUs. It
> finishes at 111216.1 with kswapd going to sleep on node 0 with
> 930067 pages on the LRU. So 7 seconds to free 15,000 pages (call it
> 2,000 pages/s) which is awfully slow....
>
Hi all,

In the hope it helps, the patch below adds some debug info.

Hillf
---

--- a/mm/memcontrol.c	Fri Dec  9 21:57:40 2011
+++ b/mm/memcontrol.c	Sun Dec 25 17:08:14 2011
@@ -1038,7 +1038,11 @@ void mem_cgroup_lru_del_list(struct page
 		memcg = root_mem_cgroup;
 	mz = page_cgroup_zoneinfo(memcg, page);
 	/* huge page split is done under lru_lock. so, we have no races. */
-	MEM_CGROUP_ZSTAT(mz, lru) -= 1 << compound_order(page);
+	if (WARN_ON_ONCE(MEM_CGROUP_ZSTAT(mz, lru) <
+				(1 << compound_order(page))))
+		MEM_CGROUP_ZSTAT(mz, lru) = 0;
+	else
+		MEM_CGROUP_ZSTAT(mz, lru) -= 1 << compound_order(page);
 }

 void mem_cgroup_lru_del(struct page *page)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Kswapd in 3.2.0-rc5 is a CPU hog
  2011-12-25  9:09             ` Hillf Danton
@ 2011-12-25 10:21               ` Nikolay S.
  2011-12-26 12:35                 ` Hillf Danton
  0 siblings, 1 reply; 32+ messages in thread
From: Nikolay S. @ 2011-12-25 10:21 UTC (permalink / raw)
  To: Hillf Danton; +Cc: Dave Chinner, Michal Hocko, linux-kernel, linux-mm

On Sun, 25/12/2011 at 17:09 +0800, Hillf Danton wrote:
> On Sat, Dec 24, 2011 at 4:45 AM, Dave Chinner <david@fromorbit.com> wrote:
> [...]
> >
> > Ok, it's not a shrink_slab() problem - it's just being called every ~100us
> > by kswapd. The pattern is:
> >
> >        - reclaim 94 (batches of 32,32,30) pages from the inactive list
> >          of zone 1, node 0, prio 12
> >        - call shrink_slab
> >                - scan all caches
> >                - all shrinkers return 0 saying nothing to shrink
> >        - 40us gap
> >        - reclaim 10-30 pages from inactive list of zone 2, node 0, prio 12
> >        - call shrink_slab
> >                - scan all caches
> >                - all shrinkers return 0 saying nothing to shrink
> >        - 40us gap
> >        - isolate 9 pages from LRU zone ?, node ?, none isolated, none freed
> >        - isolate 22 pages from LRU zone ?, node ?, none isolated, none freed
> >        - call shrink_slab
> >                - scan all caches
> >                - all shrinkers return 0 saying nothing to shrink
> >        40us gap
> >
> > And it just repeats over and over again. After a while, nid=0,zone=1
> > drops out of the traces, so reclaim only comes in batches of 10-30
> > pages from zone 2 between each shrink_slab() call.
> >
> > The trace starts at 111209.881s, with 944776 pages on the LRUs. It
> > finishes at 111216.1 with kswapd going to sleep on node 0 with
> > 930067 pages on the LRU. So 7 seconds to free 15,000 pages (call it
> > 2,000 pages/s) which is awfully slow....
> >
> Hi all,
> 
> In hope, the added debug info is helpful.
> 
> Hillf
> ---
> 
> --- a/mm/memcontrol.c	Fri Dec  9 21:57:40 2011
> +++ b/mm/memcontrol.c	Sun Dec 25 17:08:14 2011
> @@ -1038,7 +1038,11 @@ void mem_cgroup_lru_del_list(struct page
>  		memcg = root_mem_cgroup;
>  	mz = page_cgroup_zoneinfo(memcg, page);
>  	/* huge page split is done under lru_lock. so, we have no races. */
> -	MEM_CGROUP_ZSTAT(mz, lru) -= 1 << compound_order(page);
> +	if (WARN_ON_ONCE(MEM_CGROUP_ZSTAT(mz, lru) <
> +				(1 << compound_order(page))))
> +		MEM_CGROUP_ZSTAT(mz, lru) = 0;
> +	else
> +		MEM_CGROUP_ZSTAT(mz, lru) -= 1 << compound_order(page);
>  }
> 
>  void mem_cgroup_lru_del(struct page *page)

Hello,

Uhm, is this patch against 3.2-rc4? I cannot apply it. There's no
mem_cgroup_lru_del_list(), only mem_cgroup_del_lru_list(). Should I
place the changes there?

Also, -rc7 is here. Might the problem already be addressed as part of
some ongoing work? Is there any reason to try -rc7 (the problem
requires several days of uptime to become obvious)?


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Kswapd in 3.2.0-rc5 is a CPU hog
  2011-12-25 10:21               ` Nikolay S.
@ 2011-12-26 12:35                 ` Hillf Danton
  2011-12-27  0:20                   ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 32+ messages in thread
From: Hillf Danton @ 2011-12-26 12:35 UTC (permalink / raw)
  To: Nikolay S.; +Cc: Dave Chinner, Michal Hocko, linux-kernel, linux-mm

On Sun, Dec 25, 2011 at 6:21 PM, Nikolay S. <nowhere@hakkenden.ath.cx> wrote:
>
> Uhm, is this patch against 3.2-rc4? I cannot apply it. There's no
> mem_cgroup_lru_del_list(), only mem_cgroup_del_lru_list(). Should I
> place the changes there?
>
> Also, -rc7 is here. Might the problem already be addressed as part of
> some ongoing work? Is there any reason to try -rc7 (the problem
> requires several days of uptime to become obvious)?
>

Sorry, Nikolay, it was based neither on -next nor on -rc5 (I assumed it was).
The following is based on -next; if you want to test -rc5, please
grep for MEM_CGROUP_ZSTAT in mm/memcontrol.c and make the change there.
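
Concretely, something along these lines on an -rc5 tree (illustrative only;
the surrounding function is named differently there, as you found):

$ grep -n 'MEM_CGROUP_ZSTAT(mz, lru) -=' mm/memcontrol.c

Then wrap that subtraction in the same WARN_ON_ONCE() check as in the hunk
below.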

Best regards,

Hillf
---

--- a/mm/memcontrol.c	Mon Dec 26 20:34:38 2011
+++ b/mm/memcontrol.c	Mon Dec 26 20:37:54 2011
@@ -1076,7 +1076,11 @@ void mem_cgroup_lru_del_list(struct page
 	VM_BUG_ON(!memcg);
 	mz = page_cgroup_zoneinfo(memcg, page);
 	/* huge page split is done under lru_lock. so, we have no races. */
-	MEM_CGROUP_ZSTAT(mz, lru) -= 1 << compound_order(page);
+	if (WARN_ON_ONCE(MEM_CGROUP_ZSTAT(mz, lru) <
+				(1 << compound_order(page))))
+		MEM_CGROUP_ZSTAT(mz, lru) = 0;
+	else
+		MEM_CGROUP_ZSTAT(mz, lru) -= 1 << compound_order(page);
 }

 void mem_cgroup_lru_del(struct page *page)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Kswapd in 3.2.0-rc5 is a CPU hog
  2011-12-26 12:35                 ` Hillf Danton
@ 2011-12-27  0:20                   ` KAMEZAWA Hiroyuki
  2011-12-27 13:33                     ` Hillf Danton
  0 siblings, 1 reply; 32+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-12-27  0:20 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Nikolay S., Dave Chinner, Michal Hocko, linux-kernel, linux-mm

On Mon, 26 Dec 2011 20:35:46 +0800
Hillf Danton <dhillf@gmail.com> wrote:

> On Sun, Dec 25, 2011 at 6:21 PM, Nikolay S. <nowhere@hakkenden.ath.cx> wrote:
> >
> > Uhm, is this patch against 3.2-rc4? I cannot apply it. There's no
> > mem_cgroup_lru_del_list(), only mem_cgroup_del_lru_list(). Should I
> > place the changes there?
> >
> > Also, -rc7 is here. Might the problem already be addressed as part of
> > some ongoing work? Is there any reason to try -rc7 (the problem
> > requires several days of uptime to become obvious)?
> >
> 
> Sorry, Nikolay, it was based neither on -next nor on -rc5 (I assumed it was).
> The following is based on -next; if you want to test -rc5, please
> grep for MEM_CGROUP_ZSTAT in mm/memcontrol.c and make the change there.
> 
> Best regards,
> 

Hmm? Is memcg in use here? Why do you think this will help?

Thanks,
-Kame

> Hillf
> ---
> 
> --- a/mm/memcontrol.c	Mon Dec 26 20:34:38 2011
> +++ b/mm/memcontrol.c	Mon Dec 26 20:37:54 2011
> @@ -1076,7 +1076,11 @@ void mem_cgroup_lru_del_list(struct page
>  	VM_BUG_ON(!memcg);
>  	mz = page_cgroup_zoneinfo(memcg, page);
>  	/* huge page split is done under lru_lock. so, we have no races. */
> -	MEM_CGROUP_ZSTAT(mz, lru) -= 1 << compound_order(page);
> +	if (WARN_ON_ONCE(MEM_CGROUP_ZSTAT(mz, lru) <
> +				(1 << compound_order(page))))
> +		MEM_CGROUP_ZSTAT(mz, lru) = 0;
> +	else
> +		MEM_CGROUP_ZSTAT(mz, lru) -= 1 << compound_order(page);
>  }
> 
>  void mem_cgroup_lru_del(struct page *page)
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Kswapd in 3.2.0-rc5 is a CPU hog
  2011-12-23 20:45           ` Dave Chinner
  2011-12-25  9:09             ` Hillf Danton
@ 2011-12-27  2:15             ` KAMEZAWA Hiroyuki
  2011-12-27  2:50               ` Nikolay S.
  2011-12-27  3:57               ` Minchan Kim
  1 sibling, 2 replies; 32+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-12-27  2:15 UTC (permalink / raw)
  To: Dave Chinner; +Cc: nowhere, Michal Hocko, linux-kernel, linux-mm

On Sat, 24 Dec 2011 07:45:03 +1100
Dave Chinner <david@fromorbit.com> wrote:

> On Fri, Dec 23, 2011 at 03:04:02PM +0400, nowhere wrote:
> > On Fri, 23/12/2011 at 21:20 +1100, Dave Chinner wrote:
> > > On Fri, Dec 23, 2011 at 01:01:20PM +0400, nowhere wrote:
> > > > On Thu, 22/12/2011 at 09:55 +1100, Dave Chinner wrote:
> > > > > On Wed, Dec 21, 2011 at 10:52:49AM +0100, Michal Hocko wrote:

> > Here is the report of trace-cmd while dd'ing
> > https://80.237.6.56/report-dd.xz
> 
> Ok, it's not a shrink_slab() problem - it's just being called every ~100us
> by kswapd. The pattern is:
> 
> 	- reclaim 94 (batches of 32,32,30) pages from the inactive list
> 	  of zone 1, node 0, prio 12
> 	- call shrink_slab
> 		- scan all caches
> 		- all shrinkers return 0 saying nothing to shrink
> 	- 40us gap
> 	- reclaim 10-30 pages from inactive list of zone 2, node 0, prio 12
> 	- call shrink_slab
> 		- scan all caches
> 		- all shrinkers return 0 saying nothing to shrink
> 	- 40us gap
> 	- isolate 9 pages from LRU zone ?, node ?, none isolated, none freed
> 	- isolate 22 pages from LRU zone ?, node ?, none isolated, none freed
> 	- call shrink_slab
> 		- scan all caches
> 		- all shrinkers return 0 saying nothing to shrink
> 	40us gap
> 
> And it just repeats over and over again. After a while, nid=0,zone=1
> drops out of the traces, so reclaim only comes in batches of 10-30
> pages from zone 2 between each shrink_slab() call.
> 
> The trace starts at 111209.881s, with 944776 pages on the LRUs. It
> finishes at 111216.1 with kswapd going to sleep on node 0 with
> 930067 pages on the LRU. So 7 seconds to free 15,000 pages (call it
> 2,000 pages/s) which is awfully slow....
> 
> vmscan gurus - time for you to step in now...
>
 
Can you show /proc/zoneinfo ? I want to know each zone's size.

Below is my memo.

In the trace log, priority is 11 or 12. So I think kswapd can reclaim enough
memory to satisfy the "sc.nr_reclaimed >= SWAP_CLUSTER_MAX" condition and
loops again.

Looking at balance_pgdat() and the trace log, I guess it does:

	wake up

	shrink_zone(zone=0(DMA?))     => nothing to reclaim.
		shrink_slab()
	shrink_zone(zone=1(DMA32?))   => reclaim 32,32,31 pages 
		shrink_slab()
	shrink_zone(zone=2(NORMAL?))  => reclaim 13 pages. 
		shrink_slab()

	sleep or retry.
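
In code terms, that matches the 3.2 balance_pgdat() structure, where
shrink_slab() sits inside the per-zone loop. A condensed sketch from my
reading of mm/vmscan.c (not the verbatim source):

        for (priority = DEF_PRIORITY; priority >= 0; priority--) {
                for (i = 0; i <= end_zone; i++) {
                        struct zone *zone = pgdat->node_zones + i;
                        /* ... watermark checks ... */
                        shrink_zone(priority, zone, &sc);
                        reclaim_state->reclaimed_slab = 0;
                        /* one shrinker-list walk per zone, per pass */
                        nr_slab = shrink_slab(&shrink, sc.nr_scanned,
                                              lru_pages);
                        sc.nr_reclaimed += reclaim_state->reclaimed_slab;
                }
                /* enough progress at this priority: don't raise it */
                if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
                        break;
        }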

Why does shrink_slab() need to be called as frequently as this?

BTW, I'm sorry if I'm missing something... Why does only kswapd reclaim
memory during the 'dd' operation? (There is no direct reclaim by dd.)
Does this log record the CPU hog after 'dd'?

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Kswapd in 3.2.0-rc5 is a CPU hog
  2011-12-27  2:15             ` KAMEZAWA Hiroyuki
@ 2011-12-27  2:50               ` Nikolay S.
  2011-12-27  4:44                 ` KAMEZAWA Hiroyuki
  2011-12-27  3:57               ` Minchan Kim
  1 sibling, 1 reply; 32+ messages in thread
From: Nikolay S. @ 2011-12-27  2:50 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Dave Chinner, Michal Hocko, linux-kernel, linux-mm

On Tue, 27/12/2011 at 11:15 +0900, KAMEZAWA Hiroyuki wrote:
> On Sat, 24 Dec 2011 07:45:03 +1100
> Dave Chinner <david@fromorbit.com> wrote:
> 
> > [...]
>  
> Can you show /proc/zoneinfo ? I want to know each zone's size.

$ cat /proc/zoneinfo 
Node 0, zone      DMA
  pages free     3980
        min      64
        low      80
        high     96
        scanned  0
        spanned  4080
        present  3916
    nr_free_pages 3980
    nr_inactive_anon 0
    nr_active_anon 0
    nr_inactive_file 0
    nr_active_file 0
    nr_unevictable 0
    nr_mlock     0
    nr_anon_pages 0
    nr_mapped    0
    nr_file_pages 0
    nr_dirty     0
    nr_writeback 0
    nr_slab_reclaimable 0
    nr_slab_unreclaimable 0
    nr_page_table_pages 0
    nr_kernel_stack 0
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 0
    nr_vmscan_immediate_reclaim 0
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem     0
    nr_dirtied   0
    nr_written   0
    nr_anon_transparent_hugepages 0
        protection: (0, 3503, 4007, 4007)
  pagesets
    cpu: 0
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 4
    cpu: 1
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 4
  all_unreclaimable: 1
  start_pfn:         16
  inactive_ratio:    1
Node 0, zone    DMA32
  pages free     19620
        min      14715
        low      18393
        high     22072
        scanned  0
        spanned  1044480
        present  896960
    nr_free_pages 19620
    nr_inactive_anon 43203
    nr_active_anon 206577
    nr_inactive_file 412249
    nr_active_file 126151
    nr_unevictable 7
    nr_mlock     7
    nr_anon_pages 108557
    nr_mapped    6683
    nr_file_pages 540415
    nr_dirty     5
    nr_writeback 0
    nr_slab_reclaimable 58887
    nr_slab_unreclaimable 12145
    nr_page_table_pages 1389
    nr_kernel_stack 100
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 1021
    nr_vmscan_immediate_reclaim 69337
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem     1861
    nr_dirtied   1586363
    nr_written   1245872
    nr_anon_transparent_hugepages 272
        protection: (0, 0, 504, 504)
  pagesets
    cpu: 0
              count: 4
              high:  186
              batch: 31
  vm stats threshold: 24
    cpu: 1
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 24
  all_unreclaimable: 0
  start_pfn:         4096
  inactive_ratio:    5
Node 0, zone   Normal
  pages free     2854
        min      2116
        low      2645
        high     3174
        scanned  0
        spanned  131072
        present  129024
    nr_free_pages 2854
    nr_inactive_anon 20682
    nr_active_anon 10262
    nr_inactive_file 47083
    nr_active_file 11292
    nr_unevictable 518
    nr_mlock     518
    nr_anon_pages 22801
    nr_mapped    1798
    nr_file_pages 58853
    nr_dirty     0
    nr_writeback 0
    nr_slab_reclaimable 4347
    nr_slab_unreclaimable 5955
    nr_page_table_pages 769
    nr_kernel_stack 128
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 5285
    nr_vmscan_immediate_reclaim 51475
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem     28
    nr_dirtied   251597
    nr_written   191561
    nr_anon_transparent_hugepages 16
        protection: (0, 0, 0, 0)
  pagesets
    cpu: 0
              count: 30
              high:  186
              batch: 31
  vm stats threshold: 12
    cpu: 1
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 12
  all_unreclaimable: 0
  start_pfn:         1048576
  inactive_ratio:

> 
> [...]
> 
> BTW, I'm sorry if I'm missing something... Why does only kswapd reclaim
> memory during the 'dd' operation? (There is no direct reclaim by dd.)
> Does this log record the CPU hog after 'dd'?

report-dd.xz was taken while dd was running.
report-normal.xz was taken some time after.

> Thanks,
> -Kame

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Kswapd in 3.2.0-rc5 is a CPU hog
  2011-12-27  2:15             ` KAMEZAWA Hiroyuki
  2011-12-27  2:50               ` Nikolay S.
@ 2011-12-27  3:57               ` Minchan Kim
  2011-12-27  4:56                 ` KAMEZAWA Hiroyuki
  2012-01-11  1:17                 ` Rik van Riel
  1 sibling, 2 replies; 32+ messages in thread
From: Minchan Kim @ 2011-12-27  3:57 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Dave Chinner, nowhere, Michal Hocko, linux-kernel, linux-mm

On Tue, Dec 27, 2011 at 11:15:43AM +0900, KAMEZAWA Hiroyuki wrote:
> On Sat, 24 Dec 2011 07:45:03 +1100
> Dave Chinner <david@fromorbit.com> wrote:
> 
> > [...]
>  
> Can you show /proc/zoneinfo ? I want to know each zone's size.
> 
> Below is my memo.
> 
> In the trace log, priority is 11 or 12. So I think kswapd can reclaim enough
> memory to satisfy the "sc.nr_reclaimed >= SWAP_CLUSTER_MAX" condition and
> loops again.
> 
> Looking at balance_pgdat() and the trace log, I guess it does:
> 
> 	wake up
> 
> 	shrink_zone(zone=0(DMA?))     => nothing to reclaim.
> 		shrink_slab()
> 	shrink_zone(zone=1(DMA32?))   => reclaim 32,32,31 pages 
> 		shrink_slab()
> 	shrink_zone(zone=2(NORMAL?))  => reclaim 13 pages. 
> 		shrink_slab()
> 
> 	sleep or retry.
> 
> Why does shrink_slab() need to be called as frequently as this?

I guess it's caused by the small NORMAL zone.
The scenario I think is as follows:

1. dd consumes memory in the NORMAL zone
2. dd enters direct reclaim and wakes up kswapd
3. kswapd reclaims some memory in the NORMAL zone until it reaches the high watermark
4. schedule
5. dd consumes memory again in the NORMAL zone
6. kswapd fails to reach the high watermark because of 5.
7. loop again, goto 3.

The point is the speed of reclaim vs. memory consumption.
So kswapd can never reach a point where enough pages are free in the NORMAL zone.
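
In code terms, the gate that keeps kswapd in that loop is the per-zone
watermark test in balance_pgdat(). Condensed from my reading of the 3.2
source, so treat the exact arguments as approximate:

        /* kswapd may only go back to sleep once every zone sits above
         * its high watermark; a small Normal zone that dd keeps
         * refilling can defeat this check indefinitely */
        if (!zone_watermark_ok_safe(zone, order,
                                    high_wmark_pages(zone), end_zone, 0))
                all_zones_ok = 0;       /* -> loop_again */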

> 
> BTW, I'm sorry if I'm missing something... Why does only kswapd reclaim
> memory during the 'dd' operation? (There is no direct reclaim by dd.)
> Does this log record the CPU hog after 'dd'?

If the above scenario is right, dd wouldn't have to enter direct reclaim at all.


> 
> Thanks,
> -Kame

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Kswapd in 3.2.0-rc5 is a CPU hog
  2011-12-27  2:50               ` Nikolay S.
@ 2011-12-27  4:44                 ` KAMEZAWA Hiroyuki
  2011-12-27  6:06                   ` nowhere
  2011-12-28 21:33                   ` Dave Chinner
  0 siblings, 2 replies; 32+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-12-27  4:44 UTC (permalink / raw)
  To: Nikolay S.; +Cc: Dave Chinner, Michal Hocko, linux-kernel, linux-mm

On Tue, 27 Dec 2011 06:50:08 +0400
"Nikolay S." <nowhere@hakkenden.ath.cx> wrote:

> On Tue, 27/12/2011 at 11:15 +0900, KAMEZAWA Hiroyuki wrote:
> > On Sat, 24 Dec 2011 07:45:03 +1100
> > Dave Chinner <david@fromorbit.com> wrote:
> > 
> > > [...]
> >  
> > Can you show /proc/zoneinfo ? I want to know each zone's size.
> 

Thanks.
Questions:
 1. Does this system have swap?
 2. In what kernel version did you not see the kswapd issue?
 3. Is this a real host, or virtualized?


> $ cat /proc/zoneinfo 
...
Node 0, zone    DMA32
  pages free     19620
        min      14715
        low      18393
        high     22072
        scanned  0
        spanned  1044480
        present  896960
    nr_free_pages 19620
    nr_inactive_anon 43203
    nr_active_anon 206577
    nr_inactive_file 412249
    nr_active_file 126151

So the DMA32 (zone=1) file LRUs are large enough (> 32 << 12).
Hmm, assuming all free pages are used for file cache (by dd):


(412249 + 126151 + 19620) >> 12 = 136

So the 32, 32, 30 scan batches seem to work as designed.

> Node 0, zone   Normal
>   pages free     2854
>         min      2116
>         low      2645
>         high     3174
>         scanned  0
>         spanned  131072
>         present  129024
>     nr_free_pages 2854
>     nr_inactive_anon 20682
>     nr_active_anon 10262
>     nr_inactive_file 47083
>     nr_active_file 11292

Hmm, NORMAL is much smaller than DMA32. (only 500MB.)

Then, at priority=12,

  13 << 12 = 53248

So 13 pages per scan seems to work as designed.
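
That batch size is what get_scan_count() produces. A simplified sketch of
the 3.2 logic (not the verbatim source):

        scan = zone_nr_lru_pages(zone, sc, lru);
        scan >>= priority;      /* prio 12: 53248 >> 12 = 13 for Normal,
                                 * ~558020 >> 12 = 136 for DMA32 file */
        nr[lru] = scan;
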
To me, it seems kswapd is doing its usual work: reclaiming small amounts
of memory until free pages get back above the watermark. And it seems 'dd'
allocates its memory from ZONE_DMA32 because of the allocator's zone
fallback.


Memo:

1. Why is shrink_slab() called per zone, when it is not zone aware?
   Isn't it enough to call it once per priority? (See the sketch below.)

2. What is the spin lock contention that perf showed?
   And if shrink_slab() doesn't consume CPU, as the trace shows, why does
   perf say shrink_slab() is heavy?

3. Because 8/9 of memory is in DMA32, calling shrink_slab() this
   frequently while scanning NORMAL looks like wasted time.
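
For (1), an untested illustration of what "once per priority" could look
like: hoist the call out of the zone loop in balance_pgdat() (names as in
3.2; a sketch, not a tested patch):

        for (priority = DEF_PRIORITY; priority >= 0; priority--) {
                unsigned long nr_scanned_total = 0;

                for (i = 0; i <= end_zone; i++) {
                        /* ... */
                        shrink_zone(priority, zone, &sc);
                        nr_scanned_total += sc.nr_scanned;
                }
                /* one shrinker walk per priority, fed the summed count */
                reclaim_state->reclaimed_slab = 0;
                shrink_slab(&shrink, nr_scanned_total, lru_pages);
                sc.nr_reclaimed += reclaim_state->reclaimed_slab;
        }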
 

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Kswapd in 3.2.0-rc5 is a CPU hog
  2011-12-27  3:57               ` Minchan Kim
@ 2011-12-27  4:56                 ` KAMEZAWA Hiroyuki
  2012-01-10 22:33                   ` Andrew Morton
  2012-01-11  0:33                   ` Dave Chinner
  2012-01-11  1:17                 ` Rik van Riel
  1 sibling, 2 replies; 32+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-12-27  4:56 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Dave Chinner, nowhere, Michal Hocko, linux-kernel, linux-mm

On Tue, 27 Dec 2011 12:57:31 +0900
Minchan Kim <minchan@kernel.org> wrote:

> On Tue, Dec 27, 2011 at 11:15:43AM +0900, KAMEZAWA Hiroyuki wrote:
> > On Sat, 24 Dec 2011 07:45:03 +1100
> > Dave Chinner <david@fromorbit.com> wrote:
> > 
> > > [...]
> >  
> > Can you show /proc/zoneinfo ? I want to know each zone's size.
> > 
> > [...]
> > 
> > Why does shrink_slab() need to be called as frequently as this?
> 
> I guess it's caused by the small NORMAL zone.

You're right. I confirmed his zoneinfo.

> The scenario I think is as follows,
> 
> 1. dd comsumes memory in NORMAL zone
> 2. dd enter direct reclaim and wakeup kswapd
> 3. kswapd reclaims some memory in NORMAL zone until it reclaims high wamrk
> 4. schedule
> 5. dd consumes memory again in NORMAL zone
> 6. kswapd fail to reclaim memory by high watermark due to 5.
> 7. loop again, goto 3.
> 
> The point is speed between reclaim VS memory consumption.
> So kswapd cannot reach a point which enough pages are in NORMAL zone.
> 
> > 
> > BTW. I'm sorry if I miss something ...Why only kswapd reclaims memory
> > while 'dd' operation ? (no direct relcaim by dd.)
> > Is this log record cpu hog after 'dd' ?
> 
> If the above scenario is right, dd couldn't have entered direct reclaim to reclaim memory.
> 

I think you're right. IIUC, kswapd's behavior is what we usually see.

Hmm, if I understand correctly,

 - dd's slowdown is caused by kswapd's cpu consumption.
 - kswapd's cpu consumption is enlarged by shrink_slab() (by perf)
 - kswapd can't stop because NORMAL zone is small.
 - memory reclaim speed is enough because dd can't get enough cpu.

I wonder whether calling shrink_slab() less often may be a help, but I'm
not sure where the lock contention comes from...
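
To make the calling pattern concrete: from memory, the 3.2-era
balance_pgdat() inner loop is roughly the following (a heavily
simplified sketch; watermark checks and bookkeeping are omitted and
helper details are approximate):

	for (i = 0; i <= end_zone; i++) {
		struct zone *zone = pgdat->node_zones + i;

		if (!populated_zone(zone))
			continue;

		/* LRU reclaim for this one zone at this priority */
		shrink_zone(priority, zone, &sc);

		/* ...followed by a *global* slab shrink, once per zone */
		reclaim_state->reclaimed_slab = 0;
		shrink_slab(&shrink, sc.nr_scanned, lru_pages);
		sc.nr_reclaimed += reclaim_state->reclaimed_slab;
	}

Since shrink_slab() walks (and takes references on) every registered
shrinker even when the slab caches are empty, each of those calls is
pure overhead here, which would explain the ~100us cadence in the trace.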

Regards,
-Kame







^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Kswapd in 3.2.0-rc5 is a CPU hog
  2011-12-27  4:44                 ` KAMEZAWA Hiroyuki
@ 2011-12-27  6:06                   ` nowhere
  2011-12-28 21:33                   ` Dave Chinner
  1 sibling, 0 replies; 32+ messages in thread
From: nowhere @ 2011-12-27  6:06 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Dave Chinner, Michal Hocko, linux-kernel, linux-mm

On Tue, 27/12/2011 at 13:44 +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 27 Dec 2011 06:50:08 +0400
> "Nikolay S." <nowhere@hakkenden.ath.cx> wrote:
> 
> > On Tue, 27/12/2011 at 11:15 +0900, KAMEZAWA Hiroyuki wrote:
> > > On Sat, 24 Dec 2011 07:45:03 +1100
> > > Dave Chinner <david@fromorbit.com> wrote:
> > > 
> > > > On Fri, Dec 23, 2011 at 03:04:02PM +0400, nowhere wrote:
> > > > > On Fri, 23/12/2011 at 21:20 +1100, Dave Chinner wrote:
> > > > > > On Fri, Dec 23, 2011 at 01:01:20PM +0400, nowhere wrote:
> > > > > > > On Thu, 22/12/2011 at 09:55 +1100, Dave Chinner wrote:
> > > > > > > > On Wed, Dec 21, 2011 at 10:52:49AM +0100, Michal Hocko wrote:
> > > 
> > > > > Here is the report of trace-cmd while dd'ing
> > > > > https://80.237.6.56/report-dd.xz
> > > > 
> > > > Ok, it's not a shrink_slab() problem - it's just being called every
> > > > ~100us by kswapd. The pattern is:
> > > > 
> > > > 	- reclaim 94 (batches of 32,32,30) pages from inactive list
> > > > 	  of zone 1, node 0, prio 12
> > > > 	- call shrink_slab
> > > > 		- scan all caches
> > > > 		- all shrinkers return 0 saying nothing to shrink
> > > > 	- 40us gap
> > > > 	- reclaim 10-30 pages from inactive list of zone 2, node 0, prio 12
> > > > 	- call shrink_slab
> > > > 		- scan all caches
> > > > 		- all shrinkers return 0 saying nothing to shrink
> > > > 	- 40us gap
> > > > 	- isolate 9 pages from LRU zone ?, node ?, none isolated, none freed
> > > > 	- isolate 22 pages from LRU zone ?, node ?, none isolated, none freed
> > > > 	- call shrink_slab
> > > > 		- scan all caches
> > > > 		- all shrinkers return 0 saying nothing to shrink
> > > > 	- 40us gap
> > > > 
> > > > And it just repeats over and over again. After a while, nid=0,zone=1
> > > > drops out of the traces, so reclaim only comes in batches of 10-30
> > > > pages from zone 2 between each shrink_slab() call.
> > > > 
> > > > The trace starts at 111209.881s, with 944776 pages on the LRUs. It
> > > > finishes at 111216.1 with kswapd going to sleep on node 0 with
> > > > 930067 pages on the LRU. So 7 seconds to free 15,000 pages (call it
> > > > 2,000 pages/s) which is awfully slow....
> > > > 
> > > > vmscan gurus - time for you to step in now...
> > > >
> > >  
> > > Can you show /proc/zoneinfo? I want to know each zone's size.
> > 
> 
> Thanks, 
> Question:
>  1. Does this system have no swap?

It has: 4G.

>  2. With what kernel version did you not see the kswapd issue?

Hmm... 3.1 and below, I presume

>  3. Is this a real host, or virtualized?

100% real

> 
> > $ cat /proc/zoneinfo 
> ...
> Node 0, zone    DMA32
>   pages free     19620
>         min      14715
>         low      18393
>         high     22072
>         scanned  0
>         spanned  1044480
>         present  896960
>     nr_free_pages 19620
>     nr_inactive_anon 43203
>     nr_active_anon 206577
>     nr_inactive_file 412249
>     nr_active_file 126151
> 
> Then, DMA32 (zone=1) file pages are large enough (> 32 << 12).
> Hmm, assuming all free pages are used for file cache (dd's):
> 
> 
> (412249 + 126151 + 19620) >> 12 = 136
> 
> So, the 32, 32, 30 scan seems to work as designed.
> 
> > Node 0, zone   Normal
> >   pages free     2854
> >         min      2116
> >         low      2645
> >         high     3174
> >         scanned  0
> >         spanned  131072
> >         present  129024
> >     nr_free_pages 2854
> >     nr_inactive_anon 20682
> >     nr_active_anon 10262
> >     nr_inactive_file 47083
> >     nr_active_file 11292
> 
> Hmm, NORMAL is much smaller than DMA32. (only 500MB.)
> 
> Then, at priority=12,
> 
>   13 << 12 = 53248
> 
> 13 pages per scan seems to work as designed.
> To me, it seems kswapd is doing its usual work... reclaiming small amounts
> of memory until free pages are enough. And it seems 'dd' allocates its
> memory from ZONE_DMA32 because of gfp_t zone fallbacks.
> 
> 
> Memo.
> 
> 1. Why should shrink_slab(), which is not zone aware, be called per zone?
>    Isn't it enough to call it once per priority?
> 
> 2. What is the spinlock contention that perf showed?
>    And if shrink_slab() doesn't consume cpu, as the trace shows, why does
>    perf say shrink_slab() is heavy?
> 
> 3. Because 8/9 of memory is in DMA32, calling shrink_slab() frequently
>    while scanning NORMAL seems to be a waste of time.
>  
> 
> Thanks,
> -Kame
> 
> 
> 
> 
> 



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Kswapd in 3.2.0-rc5 is a CPU hog
  2011-12-27  0:20                   ` KAMEZAWA Hiroyuki
@ 2011-12-27 13:33                     ` Hillf Danton
  2011-12-28  0:06                       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 32+ messages in thread
From: Hillf Danton @ 2011-12-27 13:33 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Nikolay S., Dave Chinner, Michal Hocko, linux-kernel, linux-mm

On Tue, Dec 27, 2011 at 8:20 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 26 Dec 2011 20:35:46 +0800
> Hillf Danton <dhillf@gmail.com> wrote:
>
>> On Sun, Dec 25, 2011 at 6:21 PM, Nikolay S. <nowhere@hakkenden.ath.cx> wrote:
>> >
>> > Uhm... is this patch against 3.2-rc4? I cannot apply it. There's no
>> > mem_cgroup_lru_del_list(), only void mem_cgroup_del_lru_list(). Should I
>> > place the changes there?
>> >
>> > Also, -rc7 is out. Might the problem be addressed as part of some
>> > ongoing work? Is there any reason to try -rc7 (the problem requires
>> > several days of uptime to become obvious)?
>> >
>>
>> Sorry, Nikolay, it is based neither on -next nor on -rc5 (I assumed it was).
>> The following is based on -next; if you want to test -rc5, please
>> grep for MEM_CGROUP_ZSTAT in mm/memcontrol.c and change it accordingly.
>>
>> Best regards
>>
>
> Hmm? Is memcg in use? Why do you consider this will be a help?
>

Hi Kame

Please see the livelock at
        https://lkml.org/lkml/2011/12/23/222

and if it is related to the hog here, please check whether the following
is in the right direction.

Thanks
Hillf
---

--- a/mm/memcontrol.c	Mon Dec 26 20:34:38 2011
+++ b/mm/memcontrol.c	Tue Dec 27 20:05:12 2011
@@ -3637,6 +3637,7 @@ static int mem_cgroup_force_empty_list(s
 	list = &mz->lruvec.lists[lru];

 	loop = MEM_CGROUP_ZSTAT(mz, lru);
+	WARN_ON((long)loop < 0);
 	/* give some margin against EBUSY etc...*/
 	loop += 256;
 	busy = NULL;
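
For context, the failure mode this WARN_ON would catch: the loop bound
comes from an unsigned counter, so if that counter ever underflows, the
"margin" only makes it larger. A sketch of the effect (illustrative C
only, not the kernel function):

	unsigned long loop = (unsigned long)-3;	/* underflowed stat */

	loop += 256;	/* the margin against EBUSY keeps it huge */
	while (loop--)	/* ~2^64 iterations: a livelock */
		;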

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Kswapd in 3.2.0-rc5 is a CPU hog
  2011-12-27 13:33                     ` Hillf Danton
@ 2011-12-28  0:06                       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 32+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-12-28  0:06 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Nikolay S., Dave Chinner, Michal Hocko, linux-kernel, linux-mm

On Tue, 27 Dec 2011 21:33:04 +0800
Hillf Danton <dhillf@gmail.com> wrote:

> On Tue, Dec 27, 2011 at 8:20 AM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Mon, 26 Dec 2011 20:35:46 +0800
> > Hillf Danton <dhillf@gmail.com> wrote:
> >
> >> On Sun, Dec 25, 2011 at 6:21 PM, Nikolay S. <nowhere@hakkenden.ath.cx> wrote:
> >> >
> >> > Uhm... is this patch against 3.2-rc4? I cannot apply it. There's no
> >> > mem_cgroup_lru_del_list(), only void mem_cgroup_del_lru_list(). Should I
> >> > place the changes there?
> >> >
> >> > Also, -rc7 is out. Might the problem be addressed as part of some
> >> > ongoing work? Is there any reason to try -rc7 (the problem requires
> >> > several days of uptime to become obvious)?
> >> >
> >>
> >> Sorry, Nikolay, it is based neither on -next nor on -rc5 (I assumed it was).
> >> The following is based on -next; if you want to test -rc5, please
> >> grep for MEM_CGROUP_ZSTAT in mm/memcontrol.c and change it accordingly.
> >>
> >> Best regards
> >>
> >
> > Hmm? Is memcg in use? Why do you consider this will be a help?
> >
> 
> Hi Kame
> 
> Please see the livelock at
>         https://lkml.org/lkml/2011/12/23/222
> 
> and if it is related to the hog here, please check whether the following
> is in the right direction.
> 

But linux-next and his 3.2-rc5 have totally different LRU designs.
3.2-rc doesn't have the unified memcg/global LRU, and memcg doesn't
affect kswapd behavior if it is not used.

Thanks,
-Kame

> Thanks
> Hillf
> ---
> 
> --- a/mm/memcontrol.c	Mon Dec 26 20:34:38 2011
> +++ b/mm/memcontrol.c	Tue Dec 27 20:05:12 2011
> @@ -3637,6 +3637,7 @@ static int mem_cgroup_force_empty_list(s
>  	list = &mz->lruvec.lists[lru];
> 
>  	loop = MEM_CGROUP_ZSTAT(mz, lru);
> +	WARN_ON((long)loop < 0);
>  	/* give some margin against EBUSY etc...*/
>  	loop += 256;
>  	busy = NULL;


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Kswapd in 3.2.0-rc5 is a CPU hog
  2011-12-27  4:44                 ` KAMEZAWA Hiroyuki
  2011-12-27  6:06                   ` nowhere
@ 2011-12-28 21:33                   ` Dave Chinner
  2011-12-28 22:57                     ` KOSAKI Motohiro
  1 sibling, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2011-12-28 21:33 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Nikolay S., Michal Hocko, linux-kernel, linux-mm

On Tue, Dec 27, 2011 at 01:44:05PM +0900, KAMEZAWA Hiroyuki wrote:
> To me, it seems kswapd is doing its usual work... reclaiming small amounts
> of memory until free pages are enough. And it seems 'dd' allocates its
> memory from ZONE_DMA32 because of gfp_t zone fallbacks.
> 
> 
> Memo.
> 
> 1. Why should shrink_slab(), which is not zone aware, be called per zone?
>    Isn't it enough to call it once per priority?

It is intended that it should be zone aware, but the current
shrinkers only have global LRUs and hence cannot discriminate
between objects from different zones easily. And if only a single
node/zone is being scanned, then we still have to call shrink_slab()
to try to free objects in that zone/node, despite its current
global scope.

I have some prototype patches that make the major slab caches and
shrinkers zone/node aware - that is the eventual goal here - but
all the major slab cache LRUs need to be converted to be node
aware first. Then we can pass a nodemask into shrink_slab() and down
to the shrinkers so that those that have per-node LRUs can scan only
the appropriate nodes for objects to free. This is something that I'm
working on in my spare time, but I have very little of that at the
moment, unfortunately.
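
As a sketch of the direction (purely hypothetical; 3.2's
struct shrink_control has only gfp_mask and nr_to_scan, and the
nodemask field below is my assumption about where this could go):

	struct shrink_control {
		gfp_t gfp_mask;
		unsigned long nr_to_scan;
		nodemask_t *nodes_to_scan;	/* hypothetical addition */
	};

A node-aware shrinker would then walk only its per-node LRU lists for
the nodes set in nodes_to_scan, instead of one global LRU.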

> 2. What is the spinlock contention that perf showed?
>    And if shrink_slab() doesn't consume cpu, as the trace shows, why does
>    perf say shrink_slab() is heavy?

There isn't any spin lock contention - it's just showing how
expensive locking superblocks is when it's being done every few
microseconds for no good reason.
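
From memory, the 3.2-era prune_super() begins roughly like this
(abbreviated sketch; the dentry/inode counting and pruning body is
elided and the early-return details are approximate):

	static int prune_super(struct shrinker *shrink,
			       struct shrink_control *sc)
	{
		struct super_block *sb =
			container_of(shrink, struct super_block, s_shrink);

		/* don't recurse into the FS from GFP_NOFS reclaim */
		if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS))
			return -1;

		/* takes the global sb_lock, then s_umount read-trylock */
		if (!grab_super_passive(sb))
			return -1;

		/* ... count (or prune) dentries and inodes ... */

		drop_super(sb);	/* and sb_lock again on the way out */
		return 0;	/* abbreviated; real code returns counts */
	}

Do that for every superblock, once per zone per priority loop, every
~100us, and the sb_lock/s_umount traffic alone is enough to dominate
the profile even though nothing is ever freed.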

> 3. Because 8/9 of memory is in DMA32, calling shrink_slab() frequently
>    while scanning NORMAL seems to be a waste of time.

Especially as the shrink_slab() calls are returning zero pages freed
every single time (i.e. the slab caches are empty). kswapd needs to
back off here, I think, or free more memory at a time. Only freeing
100 pages at a time is pretty inefficient, especially as we have 4 orders
of magnitude more pages on the LRU and that is consuming >90% of
RAM...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Kswapd in 3.2.0-rc5 is a CPU hog
  2011-12-28 21:33                   ` Dave Chinner
@ 2011-12-28 22:57                     ` KOSAKI Motohiro
  2012-01-02  7:00                       ` Dave Chinner
  0 siblings, 1 reply; 32+ messages in thread
From: KOSAKI Motohiro @ 2011-12-28 22:57 UTC (permalink / raw)
  To: Dave Chinner
  Cc: KAMEZAWA Hiroyuki, Nikolay S., Michal Hocko, linux-kernel, linux-mm

(12/28/11 4:33 PM), Dave Chinner wrote:
> On Tue, Dec 27, 2011 at 01:44:05PM +0900, KAMEZAWA Hiroyuki wrote:
>> To me, it seems kswapd is doing its usual work... reclaiming small amounts
>> of memory until free pages are enough. And it seems 'dd' allocates its
>> memory from ZONE_DMA32 because of gfp_t zone fallbacks.
>>
>>
>> Memo.
>>
>> 1. Why should shrink_slab(), which is not zone aware, be called per zone?
>>     Isn't it enough to call it once per priority?
>
> It is intended that it should be zone aware, but the current
> shrinkers only have global LRUs and hence cannot discriminate
> between objects from different zones easily. And if only a single
> node/zone is being scanned, then we still have to call shrink_slab()
> to try to free objects in that zone/node, despite its current
> global scope.
>
> I have some prototype patches that make the major slab caches and
> shrinkers zone/node aware - that is the eventual goal here - but
> all the major slab cache LRUs need to be converted to be node
> aware first. Then we can pass a nodemask into shrink_slab() and down
> to the shrinkers so that those that have per-node LRUs can scan only
> the appropriate nodes for objects to free. This is something that I'm
> working on in my spare time, but I have very little of that at the
> moment, unfortunately.

His machine only has one node, so a per-node basis doesn't help him. We
need either 1) to make per-zone LRUs or 2) to implement more clever
inter-zone balancing code, but that seems off topic. I'm not sure why
you are starting to talk about a per-node shrinker.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Kswapd in 3.2.0-rc5 is a CPU hog
  2011-12-28 22:57                     ` KOSAKI Motohiro
@ 2012-01-02  7:00                       ` Dave Chinner
  0 siblings, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2012-01-02  7:00 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: KAMEZAWA Hiroyuki, Nikolay S., Michal Hocko, linux-kernel, linux-mm

On Wed, Dec 28, 2011 at 05:57:38PM -0500, KOSAKI Motohiro wrote:
> (12/28/11 4:33 PM), Dave Chinner wrote:
> >On Tue, Dec 27, 2011 at 01:44:05PM +0900, KAMEZAWA Hiroyuki wrote:
> >>To me, it seems kswapd is doing its usual work... reclaiming small amounts
> >>of memory until free pages are enough. And it seems 'dd' allocates its
> >>memory from ZONE_DMA32 because of gfp_t zone fallbacks.
> >>
> >>
> >>Memo.
> >>
> >>1. Why should shrink_slab(), which is not zone aware, be called per zone?
> >>    Isn't it enough to call it once per priority?
> >
> >It is intended that it should be zone aware, but the current
> >shrinkers only have global LRUs and hence cannot discriminate
> >between objects from different zones easily. And if only a single
> >node/zone is being scanned, then we still have to call shrink_slab()
> >to try to free objects in that zone/node, despite its current
> >global scope.
> >
> >I have some prototype patches that make the major slab caches and
> >shrinkers zone/node aware - that is the eventual goal here - but
> >all the major slab cache LRUs need to be converted to be node
> >aware first. Then we can pass a nodemask into shrink_slab() and down
> >to the shrinkers so that those that have per-node LRUs can scan only
> >the appropriate nodes for objects to free. This is something that I'm
> >working on in my spare time, but I have very little of that at the
> >moment, unfortunately.
> 
> His machine only has one node, so a per-node basis doesn't help him.

You've got hung up on a little detail like how to define locality(*)
and so completely missed my point.

That being: improving reclaim granularity won't help at all until
the subsystems index their caches in a more fine-grained manner.
And with work in progress to make the shrinkers *locality* aware,
maybe per-priority is not sufficient, either.

IOWs, I don't think there's a right answer to your question, but
there's a good chance that it will be irrelevant given work that is
in progress...

Cheers,

Dave.

----

(*) Keep in mind that the biggest problem with the current
LRU/shrinker implementations is not the shrinker - it's that lock
contention in fast paths is a major performance limiting factor. In
the VFS, that limit is hit somewhere between 8-16 concurrent
processes all banging on a filesystem doing metadata intensive work.

That's the real problem we have to solve first, and what I'm trying
to do is solve it in a way that -aligns- to the MM architecture.
Forcing everyone to know about MM zones is not the right way to go -
it means we can't easily change the internals of the MM subsystem
without touching 20-30 external shrinker implementations. IOWs,
there needs to be a level of abstraction between the MM internals
and the shrinker API exposed to other subsystems.

However, we still need to expose some type of locality information
through that interface to allow the MM to shrink caches in a more
efficient manner.  Given that the allocation interfaces expose
per-node interfaces and so programmers are familiar with this method
of specifying locality, it seems natural to make the shrinker API
symmetric with that. And given that most caches are allocated out of
a single zone per node, having per-zone LRUs is just going to waste
a lot of memory....
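
To illustrate the shape being argued for (all names invented; this is
only a sketch of the per-node idea, not code from any tree):

	/* one LRU per node: lock traffic and reclaim stay node-local */
	struct cache_lru_node {
		spinlock_t		lock;
		struct list_head	list;
		long			nr_items;
	};

	struct cache_lru {
		struct cache_lru_node	node[MAX_NUMNODES];
	};

shrink_slab() could then hand a node id (or nodemask) to each
shrinker, which walks only cache_lru.node[nid], while the MM remains
free to rearrange its internal zones without touching any shrinker.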

But we should really discuss this when I post the patches again...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Kswapd in 3.2.0-rc5 is a CPU hog
  2011-12-27  4:56                 ` KAMEZAWA Hiroyuki
@ 2012-01-10 22:33                   ` Andrew Morton
  2012-01-11  3:25                     ` Nikolay S.
  2012-01-11  0:33                   ` Dave Chinner
  1 sibling, 1 reply; 32+ messages in thread
From: Andrew Morton @ 2012-01-10 22:33 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Minchan Kim, Dave Chinner, nowhere, Michal Hocko, linux-kernel, linux-mm

On Tue, 27 Dec 2011 13:56:58 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> Hmm, if I understand correctly,
> 
>  - dd's slowdown is caused by kswapd's cpu consumption.
>  - kswapd's cpu consumption is enlarged by shrink_slab() (by perf)
>  - kswapd can't stop because NORMAL zone is small.
>  - memory reclaim speed is enough because dd can't get enough cpu.
> 
> I wonder whether calling shrink_slab() less often may be a help, but I'm
> not sure where the lock contention comes from...

Nikolay, it sounds as if this problem has only recently started
happening?  Was 3.1 OK?

If so, we should work out what we did post-3.1 to cause this.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Kswapd in 3.2.0-rc5 is a CPU hog
  2011-12-27  4:56                 ` KAMEZAWA Hiroyuki
  2012-01-10 22:33                   ` Andrew Morton
@ 2012-01-11  0:33                   ` Dave Chinner
  1 sibling, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2012-01-11  0:33 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Minchan Kim, nowhere, Michal Hocko, linux-kernel, linux-mm

On Tue, Dec 27, 2011 at 01:56:58PM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 27 Dec 2011 12:57:31 +0900
> Minchan Kim <minchan@kernel.org> wrote:
> > The scenario I think is as follows,
> > 
> > 1. dd consumes memory in the NORMAL zone
> > 2. dd enters direct reclaim and wakes up kswapd
> > 3. kswapd reclaims some memory in the NORMAL zone until it reaches the
> >    high watermark
> > 4. schedule
> > 5. dd consumes memory again in the NORMAL zone
> > 6. kswapd fails to reach the high watermark due to 5.
> > 7. loop again, goto 3.
> > 
> > The point is the speed of reclaim vs. memory consumption.
> > So kswapd never reaches a point where enough pages are free in the NORMAL zone.
> > 
> > > 
> > > BTW, I'm sorry if I'm missing something... Why does only kswapd reclaim
> > > memory during the 'dd' operation? (No direct reclaim by dd.)
> > > Does this log record the cpu hog after 'dd'?
> > 
> > If the above scenario is right, dd couldn't have entered direct reclaim to reclaim memory.
> > 
> 
> I think you're right. IIUC, kswapd's behavior is what we usually see.
> 
> Hmm, if I understand correctly,
> 
>  - dd's slowdown is caused by kswapd's cpu consumption.
>  - kswapd's cpu consumption is enlarged by shrink_slab() (by perf)
>  - kswapd can't stop because NORMAL zone is small.
>  - memory reclaim speed is enough because dd can't get enough cpu.
> 
> I wonder whether calling shrink_slab() less often may be a help, but I'm
> not sure where the lock contention comes from...

There is no lock contention. It's simply the overhead of grabbing a
passive reference to every superblock in the machine 3 times every
100us.

FWIW, I don't think kswapd should be polling the shrinkers this
often when there are no/very few pages available to be freed from
the slab caches. There are many more shrinkers now than there were
in the past, so the overhead of polling all shrinkers very quickly
is significant.

FWIW, when we move to locality aware shrinkers, the polling overhead
is going to be even higher, so either the VM needs to call
shrink_slab less aggressively, or shrink_slab() needs to have some
method of reducing shrinker polling frequency.....
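
One illustrative shape for such a back-off (nothing like this exists
in 3.2; all names here are invented):

	/* inside kswapd's loop: stop polling the shrinkers for a
	 * while once they keep reporting nothing to free */
	if (time_after_eq(jiffies, next_slab_scan)) {
		if (shrink_slab(&shrink, sc.nr_scanned, lru_pages))
			empty_passes = 0;	/* progress: keep polling */
		else if (++empty_passes >= 3)	/* arbitrary threshold */
			next_slab_scan = jiffies + HZ / 10;	/* ~100ms */
	}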

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Kswapd in 3.2.0-rc5 is a CPU hog
  2011-12-27  3:57               ` Minchan Kim
  2011-12-27  4:56                 ` KAMEZAWA Hiroyuki
@ 2012-01-11  1:17                 ` Rik van Riel
  1 sibling, 0 replies; 32+ messages in thread
From: Rik van Riel @ 2012-01-11  1:17 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KAMEZAWA Hiroyuki, Dave Chinner, nowhere, Michal Hocko,
	linux-kernel, linux-mm

On 12/26/2011 10:57 PM, Minchan Kim wrote:

> I guess it's caused by a small NORMAL zone.
> The scenario I think is as follows,

I guess it is exaggerated by a small NORMAL zone.  Even on my
system, where the NORMAL zone is about 3x as large as the DMA32
zone, I can see that the pages in the NORMAL zone get recycled
slightly faster than those in the DMA32 zone...

> 1. dd consumes memory in the NORMAL zone
> 2. dd enters direct reclaim and wakes up kswapd
> 3. kswapd reclaims some memory in the NORMAL zone until it reaches the
>    high watermark
> 4. schedule
> 5. dd consumes memory again in the NORMAL zone
> 6. kswapd fails to reach the high watermark due to 5.
> 7. loop again, goto 3.
>
> The point is the speed of reclaim vs. memory consumption.
> So kswapd never reaches a point where enough pages are free in the NORMAL zone.

I wonder if it would make sense for kswapd to count how many
pages it needs to free in each zone (at step 2), and then stop
reclaiming once it has freed that many pages.

That could leave the NORMAL (or HIGHMEM, on 32 bit) zone below
the watermark, but as long as the other zones are still good,
allocations can proceed just fine.

It would have the disadvantage of kswapd stopping reclaim with
possibly a zone below the watermark, but the advantage of
better balancing of allocations between zones.
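
A sketch of the bookkeeping (illustrative only, not a patch;
high_wmark_pages() and zone_page_state() are real 3.2 accessors, the
surrounding arrays and logic are made up):

	/* at wakeup: remember how far each zone is below its mark */
	deficit[zid] = high_wmark_pages(zone) -
		       zone_page_state(zone, NR_FREE_PAGES);

	/* while reclaiming: stop once that debt is repaid, even if
	 * new allocations pushed the zone below the mark again */
	if (reclaimed[zid] >= deficit[zid])
		continue;	/* this zone is done for this wakeup */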

Does this idea make sense?

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Kswapd in 3.2.0-rc5 is a CPU hog
  2012-01-10 22:33                   ` Andrew Morton
@ 2012-01-11  3:25                     ` Nikolay S.
  2012-01-11  4:42                       ` Andrew Morton
  0 siblings, 1 reply; 32+ messages in thread
From: Nikolay S. @ 2012-01-11  3:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Minchan Kim, Dave Chinner, Michal Hocko,
	linux-kernel, linux-mm

On Tue, 10/01/2012 at 14:33 -0800, Andrew Morton wrote:
> On Tue, 27 Dec 2011 13:56:58 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > Hmm, if I understand correctly,
> > 
> >  - dd's slowdown is caused by kswapd's cpu consumption.
> >  - kswapd's cpu consumption is enlarged by shrink_slab() (by perf)
> >  - kswapd can't stop because NORMAL zone is small.
> >  - memory reclaim speed is enough because dd can't get enough cpu.
> > 
> > I wonder whether calling shrink_slab() less often may be a help, but I'm
> > not sure where the lock contention comes from...
> 
> Nikolay, it sounds as if this problem has only recently started
> happening?  Was 3.1 OK?
> 
> If so, we should work out what we did post-3.1 to cause this.

Yes, 3.1 was OK.
Recently I upgraded to 3.2, and I cannot reproduce the problem.
I'm now at 5 days of uptime; the machine usage pattern and the software
are all the same, but the problem is not visible anymore:

  PID USER      S %CPU %MEM    TIME+  SWAP COMMAND
14822 nowhere   R   30  0.2   0:01.52  10m dd
  416 root      S    7  0.0   6:26.72    0 kswapd0

(also, kswapd's run time after 5 days is only 6.5 minutes, whereas with
-rc5 it was 22 minutes after 5 days).

I can provide similar traces to see what has changed in kswapd's
activities (if that is of any value).


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Kswapd in 3.2.0-rc5 is a CPU hog
  2012-01-11  3:25                     ` Nikolay S.
@ 2012-01-11  4:42                       ` Andrew Morton
  0 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2012-01-11  4:42 UTC (permalink / raw)
  To: Nikolay S.
  Cc: KAMEZAWA Hiroyuki, Minchan Kim, Dave Chinner, Michal Hocko,
	linux-kernel, linux-mm

On Wed, 11 Jan 2012 07:25:20 +0400 "Nikolay S." <nowhere@hakkenden.ath.cx> wrote:

> On Tue, 10/01/2012 at 14:33 -0800, Andrew Morton wrote:
> > On Tue, 27 Dec 2011 13:56:58 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > 
> > > Hmm, if I understand correctly,
> > > 
> > >  - dd's slowdown is caused by kswapd's cpu consumption.
> > >  - kswapd's cpu consumption is enlarged by shrink_slab() (by perf)
> > >  - kswapd can't stop because NORMAL zone is small.
> > >  - memory reclaim speed is enough because dd can't get enough cpu.
> > > 
> > > I wonder whether calling shrink_slab() less often may be a help, but I'm
> > > not sure where the lock contention comes from...
> > 
> > Nikolay, it sounds as if this problem has only recently started
> > happening?  Was 3.1 OK?
> > 
> > If so, we should work out what we did post-3.1 to cause this.
> 
> Yes, 3.1 was OK.
> Recently I upgraded to 3.2, and I cannot reproduce the problem.
> I'm now at 5 days of uptime; the machine usage pattern and the software
> are all the same, but the problem is not visible anymore:
> 
>   PID USER      S %CPU %MEM    TIME+  SWAP COMMAND
> 14822 nowhere   R   30  0.2   0:01.52  10m dd
>   416 root      S    7  0.0   6:26.72    0 kswapd0
> 
> (also, kswapd's run time after 5 days is only 6.5 minutes, whereas with
> -rc5 it was 22 minutes after 5 days).
> 
> I can provide similar traces to see what has changed in kswapd's
> activities (if that is of any value).

OK, thanks for the followup.  I suppose we assume this is fixed unless
someone reports it in 3.2.


^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2012-01-11  4:38 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-12-21  3:10 Kswapd in 3.2.0-rc5 is a CPU hog Nikolay S.
2011-12-21  9:52 ` Michal Hocko
2011-12-21 10:15   ` nowhere
2011-12-21 10:24     ` Michal Hocko
2011-12-21 10:52       ` nowhere
2011-12-21 14:06       ` Alex Elder
2011-12-21 14:19         ` nowhere
2011-12-21 22:55   ` Dave Chinner
2011-12-23  9:01     ` nowhere
2011-12-23 10:20       ` Dave Chinner
2011-12-23 11:04         ` nowhere
2011-12-23 20:45           ` Dave Chinner
2011-12-25  9:09             ` Hillf Danton
2011-12-25 10:21               ` Nikolay S.
2011-12-26 12:35                 ` Hillf Danton
2011-12-27  0:20                   ` KAMEZAWA Hiroyuki
2011-12-27 13:33                     ` Hillf Danton
2011-12-28  0:06                       ` KAMEZAWA Hiroyuki
2011-12-27  2:15             ` KAMEZAWA Hiroyuki
2011-12-27  2:50               ` Nikolay S.
2011-12-27  4:44                 ` KAMEZAWA Hiroyuki
2011-12-27  6:06                   ` nowhere
2011-12-28 21:33                   ` Dave Chinner
2011-12-28 22:57                     ` KOSAKI Motohiro
2012-01-02  7:00                       ` Dave Chinner
2011-12-27  3:57               ` Minchan Kim
2011-12-27  4:56                 ` KAMEZAWA Hiroyuki
2012-01-10 22:33                   ` Andrew Morton
2012-01-11  3:25                     ` Nikolay S.
2012-01-11  4:42                       ` Andrew Morton
2012-01-11  0:33                   ` Dave Chinner
2012-01-11  1:17                 ` Rik van Riel
