* VM balancing issues on 2.6.13: dentry cache not getting shrunk enough
@ 2005-09-11 10:57 Theodore Ts'o
  2005-09-11 12:00 ` Dipankar Sarma
  0 siblings, 1 reply; 32+ messages in thread
From: Theodore Ts'o @ 2005-09-11 10:57 UTC
  To: linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2089 bytes --]

I've been noticing this for a while (probably since at least 2.6.11 or
so, though I haven't been paying close attention), but I haven't had the
time to gather proof of the cause and to write it up until now.

I have a T40 laptop (Pentium M processor) with 2 gigs of memory, and
from time to time, after the system has been up for a while, the
dentry cache grows huge, as does the ext3_inode_cache:

slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
dentry_cache      434515 514112    136   29    1 : tunables  120   60    0 : slabdata  17728  17728      0
ext3_inode_cache  587635 589992    464    8    1 : tunables   54   27    0 : slabdata  73748  73749      0

Leading to an impending shortage in low memory:

LowFree:          9268 kB

... and if I don't take corrective measures, very shortly thereafter
the system will become unresponsive and will end up thrashing itself
to death, with symptoms that are identical to a case of 2.4 lowmem
exhaustion --- except this is on a 2.6.13 kernel, where all of these
problems were supposed to be solved.

It turns out I can head off the system lockup by requesting the
formation of hugepages, which will immediately cause a dramatic
reduction of memory usage in both high and low memory as various
caches are flushed:

	echo 100 > /proc/sys/vm/nr_hugepages
	echo 0 > /proc/sys/vm/nr_hugepages

The question is why the kernel isn't able to release dentry cache
entries automatically when it starts thrashing due to a lack of low
memory.  Clearly it can, since requesting hugepages
does shrink the dentry cache:

dentry_cache       20097  20097    136   29    1 : tunables  120   60    0 : slabdata    693    693      0
ext3_inode_cache   17782  17784    464    8    1 : tunables   54   27    0 : slabdata   2223   2223      0

LowFree:        835916 kB

Has anyone else seen this, or have some ideas about how to fix it?

Thanks, regards,

						- Ted


[-- Attachment #2: slabinfo --]
[-- Type: text/plain, Size: 13055 bytes --]

slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
nfs_write_data        36     36    448    9    1 : tunables   54   27    0 : slabdata      4      4      0
nfs_read_data         32     36    448    9    1 : tunables   54   27    0 : slabdata      4      4      0
nfs_inode_cache       69     72    592    6    1 : tunables   54   27    0 : slabdata     12     12      0
nfs_page               0      0     64   61    1 : tunables  120   60    0 : slabdata      0      0      0
rpc_buffers            8      8   2048    2    1 : tunables   24   12    0 : slabdata      4      4      0
rpc_tasks              8     20    192   20    1 : tunables  120   60    0 : slabdata      1      1      0
rpc_inode_cache        8      9    448    9    1 : tunables   54   27    0 : slabdata      1      1      0
uhci_urb_priv          0      0     44   88    1 : tunables  120   60    0 : slabdata      0      0      0
fib6_nodes             7    119     32  119    1 : tunables  120   60    0 : slabdata      1      1      0
ip6_dst_cache          7     15    256   15    1 : tunables  120   60    0 : slabdata      1      1      0
ndisc_cache            1     20    192   20    1 : tunables  120   60    0 : slabdata      1      1      0
RAWv6                  3      6    640    6    1 : tunables   54   27    0 : slabdata      1      1      0
UDPv6                  2      7    576    7    1 : tunables   54   27    0 : slabdata      1      1      0
request_sock_TCPv6      0      0    128   31    1 : tunables  120   60    0 : slabdata      0      0      0
TCPv6                 12     14   1088    7    2 : tunables   24   12    0 : slabdata      2      2      0
ip_fib_alias          11    226     16  226    1 : tunables  120   60    0 : slabdata      1      1      0
ip_fib_hash           11    119     32  119    1 : tunables  120   60    0 : slabdata      1      1      0
UNIX                 343    350    384   10    1 : tunables   54   27    0 : slabdata     35     35      0
tcp_tw_bucket          0      0    128   31    1 : tunables  120   60    0 : slabdata      0      0      0
tcp_bind_bucket       29    226     16  226    1 : tunables  120   60    0 : slabdata      1      1      0
inet_peer_cache        0      0     64   61    1 : tunables  120   60    0 : slabdata      0      0      0
secpath_cache          0      0    128   31    1 : tunables  120   60    0 : slabdata      0      0      0
xfrm_dst_cache         0      0    320   12    1 : tunables   54   27    0 : slabdata      0      0      0
ip_dst_cache          29     45    256   15    1 : tunables  120   60    0 : slabdata      3      3      0
arp_cache              4     31    128   31    1 : tunables  120   60    0 : slabdata      1      1      0
RAW                    2      9    448    9    1 : tunables   54   27    0 : slabdata      1      1      0
UDP                   28     28    512    7    1 : tunables   54   27    0 : slabdata      4      4      0
request_sock_TCP       0      0     64   61    1 : tunables  120   60    0 : slabdata      0      0      0
TCP                  144    148    960    4    1 : tunables   54   27    0 : slabdata     37     37      0
flow_cache             0      0    128   31    1 : tunables  120   60    0 : slabdata      0      0      0
cfq_ioc_pool           0      0     48   81    1 : tunables  120   60    0 : slabdata      0      0      0
cfq_pool               0      0     96   41    1 : tunables  120   60    0 : slabdata      0      0      0
crq_pool               0      0     44   88    1 : tunables  120   60    0 : slabdata      0      0      0
deadline_drq           0      0     48   81    1 : tunables  120   60    0 : slabdata      0      0      0
as_arq                65    130     60   65    1 : tunables  120   60    0 : slabdata      2      2      0
mqueue_inode_cache      1      7    512    7    1 : tunables   54   27    0 : slabdata      1      1      0
hugetlbfs_inode_cache      1     12    316   12    1 : tunables   54   27    0 : slabdata      1      1      0
ext2_inode_cache       0      0    444    9    1 : tunables   54   27    0 : slabdata      0      0      0
ext2_xattr             0      0     44   88    1 : tunables  120   60    0 : slabdata      0      0      0
journal_handle         8    185     20  185    1 : tunables  120   60    0 : slabdata      1      1      0
journal_head        2985   3000     52   75    1 : tunables  120   60    0 : slabdata     40     40      0
revoke_table           6    290     12  290    1 : tunables  120   60    0 : slabdata      1      1      0
revoke_record          0      0     16  226    1 : tunables  120   60    0 : slabdata      0      0      0
ext3_inode_cache  587635 589992    464    8    1 : tunables   54   27    0 : slabdata  73748  73749      0
ext3_xattr             0      0     44   88    1 : tunables  120   60    0 : slabdata      0      0      0
dnotify_cache          5    185     20  185    1 : tunables  120   60    0 : slabdata      1      1      0
eventpoll_pwq          0      0     36  107    1 : tunables  120   60    0 : slabdata      0      0      0
eventpoll_epi          0      0    128   31    1 : tunables  120   60    0 : slabdata      0      0      0
inotify_event_cache      0      0     28  135    1 : tunables  120   60    0 : slabdata      0      0      0
inotify_watch_cache      0      0     36  107    1 : tunables  120   60    0 : slabdata      0      0      0
kioctx                 0      0    192   20    1 : tunables  120   60    0 : slabdata      0      0      0
kiocb                  0      0    128   31    1 : tunables  120   60    0 : slabdata      0      0      0
fasync_cache           3    226     16  226    1 : tunables  120   60    0 : slabdata      1      1      0
shmem_inode_cache    963    963    408    9    1 : tunables   54   27    0 : slabdata    107    107      0
posix_timers_cache      0      0     96   41    1 : tunables  120   60    0 : slabdata      0      0      0
uid_cache             10     61     64   61    1 : tunables  120   60    0 : slabdata      1      1      0
blkdev_ioc            95    135     28  135    1 : tunables  120   60    0 : slabdata      1      1      0
blkdev_queue          25     30    380   10    1 : tunables   54   27    0 : slabdata      3      3      0
blkdev_requests       78     78    152   26    1 : tunables  120   60    0 : slabdata      3      3      0
biovec-(256)         256    256   3072    2    2 : tunables   24   12    0 : slabdata    128    128      0
biovec-128           256    260   1536    5    2 : tunables   24   12    0 : slabdata     52     52      0
biovec-64            256    260    768    5    1 : tunables   54   27    0 : slabdata     52     52      0
biovec-16            256    260    192   20    1 : tunables  120   60    0 : slabdata     13     13      0
biovec-4             258    305     64   61    1 : tunables  120   60    0 : slabdata      5      5      0
biovec-1             340    904     16  226    1 : tunables  120   60    0 : slabdata      4      4      0
bio                  374    465    128   31    1 : tunables  120   60    0 : slabdata     14     15      0
file_lock_cache       45     45     88   45    1 : tunables  120   60    0 : slabdata      1      1      0
sock_inode_cache     570    570    384   10    1 : tunables   54   27    0 : slabdata     57     57      0
skbuff_head_cache    880   1160    192   20    1 : tunables  120   60    0 : slabdata     58     58      0
proc_inode_cache     672    672    332   12    1 : tunables   54   27    0 : slabdata     56     56      0
sigqueue              75    108    148   27    1 : tunables  120   60    0 : slabdata      4      4      0
radix_tree_node    27827  29162    276   14    1 : tunables   54   27    0 : slabdata   2083   2083      0
bdev_cache             7      9    448    9    1 : tunables   54   27    0 : slabdata      1      1      0
sysfs_dir_cache     3540   3552     40   96    1 : tunables  120   60    0 : slabdata     37     37      0
mnt_cache             28     31    128   31    1 : tunables  120   60    0 : slabdata      1      1      0
inode_cache         1251   1404    316   12    1 : tunables   54   27    0 : slabdata    117    117      0
dentry_cache      434515 514112    136   29    1 : tunables  120   60    0 : slabdata  17728  17728      0
filp                4500   4660    192   20    1 : tunables  120   60    0 : slabdata    233    233      0
names_cache            7      7   4096    1    1 : tunables   24   12    0 : slabdata      7      7      0
key_jar               20     31    128   31    1 : tunables  120   60    0 : slabdata      1      1      0
idr_layer_cache       91    116    136   29    1 : tunables  120   60    0 : slabdata      4      4      0
buffer_head       153510 162891     48   81    1 : tunables  120   60    0 : slabdata   2011   2011      0
mm_struct            119    119    576    7    1 : tunables   54   27    0 : slabdata     17     17      0
vm_area_struct      8115   8640     88   45    1 : tunables  120   60    0 : slabdata    192    192      0
fs_cache             113    119     32  119    1 : tunables  120   60    0 : slabdata      1      1      0
files_cache          114    117    448    9    1 : tunables   54   27    0 : slabdata     13     13      0
signal_cache         135    140    384   10    1 : tunables   54   27    0 : slabdata     14     14      0
sighand_cache        132    135   1344    3    1 : tunables   24   12    0 : slabdata     45     45      0
task_struct          150    153   1328    3    1 : tunables   24   12    0 : slabdata     51     51      0
anon_vma            3535   3663      8  407    1 : tunables  120   60    0 : slabdata      9      9      0
pgd                  115    115   4096    1    1 : tunables   24   12    0 : slabdata    115    115      0
size-131072(DMA)       0      0 131072    1   32 : tunables    8    4    0 : slabdata      0      0      0
size-131072            0      0 131072    1   32 : tunables    8    4    0 : slabdata      0      0      0
size-65536(DMA)        0      0  65536    1   16 : tunables    8    4    0 : slabdata      0      0      0
size-65536             0      0  65536    1   16 : tunables    8    4    0 : slabdata      0      0      0
size-32768(DMA)        0      0  32768    1    8 : tunables    8    4    0 : slabdata      0      0      0
size-32768            18     18  32768    1    8 : tunables    8    4    0 : slabdata     18     18      0
size-16384(DMA)        0      0  16384    1    4 : tunables    8    4    0 : slabdata      0      0      0
size-16384             1      1  16384    1    4 : tunables    8    4    0 : slabdata      1      1      0
size-8192(DMA)         0      0   8192    1    2 : tunables    8    4    0 : slabdata      0      0      0
size-8192            158    158   8192    1    2 : tunables    8    4    0 : slabdata    158    158      0
size-4096(DMA)         0      0   4096    1    1 : tunables   24   12    0 : slabdata      0      0      0
size-4096            385    387   4096    1    1 : tunables   24   12    0 : slabdata    385    387      0
size-2048(DMA)         0      0   2048    2    1 : tunables   24   12    0 : slabdata      0      0      0
size-2048             75     76   2048    2    1 : tunables   24   12    0 : slabdata     38     38      0
size-1024(DMA)         0      0   1024    4    1 : tunables   54   27    0 : slabdata      0      0      0
size-1024            212    212   1024    4    1 : tunables   54   27    0 : slabdata     53     53      0
size-512(DMA)          0      0    512    8    1 : tunables   54   27    0 : slabdata      0      0      0
size-512             375    456    512    8    1 : tunables   54   27    0 : slabdata     57     57      0
size-256(DMA)          0      0    256   15    1 : tunables  120   60    0 : slabdata      0      0      0
size-256             645    750    256   15    1 : tunables  120   60    0 : slabdata     50     50      0
size-192(DMA)          0      0    192   20    1 : tunables  120   60    0 : slabdata      0      0      0
size-192             100    100    192   20    1 : tunables  120   60    0 : slabdata      5      5      0
size-128(DMA)          0      0    128   31    1 : tunables  120   60    0 : slabdata      0      0      0
size-128            4259   4557    128   31    1 : tunables  120   60    0 : slabdata    147    147      0
size-64(DMA)           0      0     64   61    1 : tunables  120   60    0 : slabdata      0      0      0
size-64           150913 150914     64   61    1 : tunables  120   60    0 : slabdata   2474   2474      0
size-32(DMA)           0      0     32  119    1 : tunables  120   60    0 : slabdata      0      0      0
size-32             3273   3332     32  119    1 : tunables  120   60    0 : slabdata     28     28      0
kmem_cache           124    124    128   31    1 : tunables  120   60    0 : slabdata      4      4      0

[-- Attachment #3: meminfo --]
[-- Type: text/plain, Size: 670 bytes --]

MemTotal:      2074880 kB
MemFree:         15220 kB
Buffers:        339900 kB
Cached:         798368 kB
SwapCached:      18252 kB
Active:        1025436 kB
Inactive:       603900 kB
HighTotal:     1178944 kB
HighFree:         5952 kB
LowTotal:       895936 kB
LowFree:          9268 kB
SwapTotal:     2124352 kB
SwapFree:      2060040 kB
Dirty:            9356 kB
Writeback:           0 kB
Mapped:         691788 kB
Slab:           405400 kB
CommitLimit:   3161792 kB
Committed_AS:  1206060 kB
PageTables:       5276 kB
VmallocTotal:   114680 kB
VmallocUsed:     24256 kB
VmallocChunk:    89588 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     4096 kB

[-- Attachment #4: config.gz --]
[-- Type: application/octet-stream, Size: 11644 bytes --]


* Re: VM balancing issues on 2.6.13: dentry cache not getting shrunk enough
  2005-09-11 10:57 VM balancing issues on 2.6.13: dentry cache not getting shrunk enough Theodore Ts'o
@ 2005-09-11 12:00 ` Dipankar Sarma
  2005-09-12  3:16   ` Theodore Ts'o
  0 siblings, 1 reply; 32+ messages in thread
From: Dipankar Sarma @ 2005-09-11 12:00 UTC
  To: Theodore Ts'o, linux-mm, linux-kernel; +Cc: Bharata B. Rao

Hi Ted,

On Sun, Sep 11, 2005 at 06:57:09AM -0400, Theodore Ts'o wrote:
> 
> I have a T40 laptop (Pentium M processor) with 2 gigs of memory, and
> from time to time, after the system has been up for a while, the
> dentry cache grows huge, as does the ext3_inode_cache:
> 
> slabinfo - version: 2.1
> # name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> dentry_cache      434515 514112    136   29    1 : tunables  120   60    0 : slabdata  17728  17728      0
> ext3_inode_cache  587635 589992    464    8    1 : tunables   54   27    0 : slabdata  73748  73749      0
> 
> Leading to an impending shortage in low memory:
> 
> LowFree:          9268 kB

Do you have the /proc/sys/fs/dentry-state output when such lowmem
shortage happens ?

> 
> It turns out I can head off the system lockup by requesting the
> formation of hugepages, which will immediately cause a dramatic
> reduction of memory usage in both high and low memory as various
> caches are flushed:
> 
> 	echo 100 > /proc/sys/vm/nr_hugepages
> 	echo 0 > /proc/sys/vm/nr_hugepages
> 
> The question is why the kernel isn't able to release dentry cache
> entries automatically when it starts thrashing due to a lack of low
> memory.  Clearly it can, since requesting hugepages
> does shrink the dentry cache:

This is a problem that Bharata is investigating at the moment.
But he hasn't seen anything that can't be cured by a small amount of
memory pressure - IOW, dentries do get freed under memory pressure. So
your case might be very useful. Bharata is maintaining an instrumentation
patch to collect more information and an alternative dentry aging patch
(using rbtree). Perhaps you could try those.

Thanks
Dipankar


* Re: VM balancing issues on 2.6.13: dentry cache not getting shrunk enough
  2005-09-11 12:00 ` Dipankar Sarma
@ 2005-09-12  3:16   ` Theodore Ts'o
  2005-09-12  6:16     ` Martin J. Bligh
  2005-09-13  8:47     ` Bharata B Rao
  0 siblings, 2 replies; 32+ messages in thread
From: Theodore Ts'o @ 2005-09-12  3:16 UTC
  To: Dipankar Sarma; +Cc: linux-mm, linux-kernel, Bharata B. Rao

On Sun, Sep 11, 2005 at 05:30:46PM +0530, Dipankar Sarma wrote:
> Do you have the /proc/sys/fs/dentry-state output when such lowmem
> shortage happens ?

Not yet, but the situation occurs on my laptop about 2 or 3 times
(when I'm not travelling and so it doesn't get rebooted).  So
reproducing it isn't utterly trivial, but it does happen often
enough that it should be possible to get the necessary data.

> This is a problem that Bharata is investigating at the moment.
> But he hasn't seen anything that can't be cured by a small amount of
> memory pressure - IOW, dentries do get freed under memory pressure. So
> your case might be very useful. Bharata is maintaining an instrumentation
> patch to collect more information and an alternative dentry aging patch
> (using rbtree). Perhaps you could try those.

Send it to me, and I'd be happy to try either the instrumentation
patch or the dentry aging patch.
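
In the meantime, a small userspace watcher along these lines could
capture the relevant /proc files automatically the next time LowFree
starts to dive.  (Hypothetical helper, not something posted in this
thread; the 32 MB trigger is arbitrary.)

/* lowmem-watch.c: poll LowFree in /proc/meminfo and dump dentry-state,
 * meminfo and slabinfo once it drops below a threshold. */
#include <stdio.h>
#include <unistd.h>

static void dump(const char *path)
{
        char buf[4096];
        size_t n;
        FILE *f = fopen(path, "r");

        if (!f)
                return;
        printf("---- %s ----\n", path);
        while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
                fwrite(buf, 1, n, stdout);
        fclose(f);
}

int main(void)
{
        char line[256];
        unsigned long lowfree;

        for (;;) {
                FILE *f = fopen("/proc/meminfo", "r");

                if (!f)
                        return 1;
                lowfree = ~0UL;
                while (fgets(line, sizeof(line), f))
                        if (sscanf(line, "LowFree: %lu kB", &lowfree) == 1)
                                break;
                fclose(f);
                if (lowfree < 32768) {          /* arbitrary 32 MB trigger */
                        dump("/proc/sys/fs/dentry-state");
                        dump("/proc/meminfo");
                        dump("/proc/slabinfo");
                        return 0;
                }
                sleep(60);
        }
}

Run it in the background with stdout redirected to a file.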

Thanks, regards,

							- Ted


* Re: VM balancing issues on 2.6.13: dentry cache not getting shrunk enough
  2005-09-12  3:16   ` Theodore Ts'o
@ 2005-09-12  6:16     ` Martin J. Bligh
  2005-09-12 12:53       ` Bharata B Rao
  2005-09-13  8:47     ` Bharata B Rao
  1 sibling, 1 reply; 32+ messages in thread
From: Martin J. Bligh @ 2005-09-12  6:16 UTC
  To: Theodore Ts'o, Dipankar Sarma; +Cc: linux-mm, linux-kernel, Bharata B. Rao



--Theodore Ts'o <tytso@mit.edu> wrote (on Sunday, September 11, 2005 23:16:36 -0400):

> On Sun, Sep 11, 2005 at 05:30:46PM +0530, Dipankar Sarma wrote:
>> Do you have the /proc/sys/fs/dentry-state output when such lowmem
>> shortage happens ?
> 
> Not yet, but the situation occurs on my laptop about 2 or 3 times
> (when I'm not travelling and so it doesn't get rebooted).  So
> reproducing it isn't utterly trivial, but it does happen often
> enough that it should be possible to get the necessary data.
>
>> This is a problem that Bharata is investigating at the moment.
>> But he hasn't seen anything that can't be cured by a small amount of
>> memory pressure - IOW, dentries do get freed under memory pressure. So
>> your case might be very useful. Bharata is maintaining an instrumentation
>> patch to collect more information and an alternative dentry aging patch
>> (using rbtree). Perhaps you could try those.
> 
> Send it to me, and I'd be happy to try either the instrumentation
> patch or the dentry aging patch.

The other thing that might be helpful is to shove a printk in prune_dcache
so we can see when it's getting called, and how successful it is, if the
more sophisticated stuff doesn't help ;-)
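
Untested, but something like this would do it -- a sketch from memory of
2.6.13's prune_dcache() with the loop body abbreviated; only the lines
marked "new" are additions:

static void prune_dcache(int count)
{
        int nr_requested = count;       /* new */
        int nr_freed = 0;               /* new */

        spin_lock(&dcache_lock);
        for (; count ; count--) {
                struct dentry *dentry;
                struct list_head *tmp;

                tmp = dentry_unused.prev;
                if (tmp == &dentry_unused)
                        break;
                list_del_init(tmp);
                dentry = list_entry(tmp, struct dentry, d_lru);
                dentry_stat.nr_unused--;

                /* ... the existing d_count/DCACHE_REFERENCED checks,
                 * which may "continue" without freeing ... */

                prune_one_dentry(dentry);
                nr_freed++;             /* new */
        }
        spin_unlock(&dcache_lock);
        printk(KERN_DEBUG "prune_dcache: requested %d freed %d\n",     /* new */
               nr_requested, nr_freed);
}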

M.



* Re: VM balancing issues on 2.6.13: dentry cache not getting shrunk enough
  2005-09-12  6:16     ` Martin J. Bligh
@ 2005-09-12 12:53       ` Bharata B Rao
  0 siblings, 0 replies; 32+ messages in thread
From: Bharata B Rao @ 2005-09-12 12:53 UTC
  To: Martin J. Bligh; +Cc: Theodore Ts'o, Dipankar Sarma, linux-mm, linux-kernel

On Sun, Sep 11, 2005 at 11:16:30PM -0700, Martin J. Bligh wrote:
> 
> 
> --Theodore Ts'o <tytso@mit.edu> wrote (on Sunday, September 11, 2005 23:16:36 -0400):
> 
> > On Sun, Sep 11, 2005 at 05:30:46PM +0530, Dipankar Sarma wrote:
> >> Do you have the /proc/sys/fs/dentry-state output when such lowmem
> >> shortage happens ?
> > 
> > Not yet, but the situation occurs on my laptop about 2 or 3 times
> > (when I'm not travelling and so it doesn't get rebooted).  So
> > reproducing it isn't utterly trivial, but it does happen often
> > enough that it should be possible to get the necessary data.
> >
> >> This is a problem that Bharata is investigating at the moment.
> >> But he hasn't seen anything that can't be cured by a small amount of
> >> memory pressure - IOW, dentries do get freed under memory pressure. So
> >> your case might be very useful. Bharata is maintaining an instrumentation
> >> patch to collect more information and an alternative dentry aging patch
> >> (using rbtree). Perhaps you could try those.
> > 
> > Send it to me, and I'd be happy to try either the instrumentation
> > patch or the dentry aging patch.
> 
> The other thing that might be helpful is to shove a printk in prune_dcache
> so we can see when it's getting called, and how successful it is, if the
> more sophisticated stuff doesn't help ;-)
> 

I have incorporated this in the dcache stats patch I have. I will 
post it tomorrow after adding some more instrumentation data
(number of in-use and free dentries in the lru list) and after a bit of
cleanup and testing.

Regards,
Bharata.


* Re: VM balancing issues on 2.6.13: dentry cache not getting shrunk enough
  2005-09-12  3:16   ` Theodore Ts'o
  2005-09-12  6:16     ` Martin J. Bligh
@ 2005-09-13  8:47     ` Bharata B Rao
  2005-09-13 21:59       ` David Chinner
                         ` (2 more replies)
  1 sibling, 3 replies; 32+ messages in thread
From: Bharata B Rao @ 2005-09-13  8:47 UTC
  To: Theodore Ts'o, Dipankar Sarma, linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1942 bytes --]

On Sun, Sep 11, 2005 at 11:16:36PM -0400, Theodore Ts'o wrote:
> On Sun, Sep 11, 2005 at 05:30:46PM +0530, Dipankar Sarma wrote:
> > Do you have the /proc/sys/fs/dentry-state output when such lowmem
> > shortage happens ?
> 
> Not yet, but the situation occurs on my laptop about 2 or 3 times
> (when I'm not travelling and so it doesn't get rebooted).  So
> reproducing it isn't utterly trivial, but it does happen often
> enough that it should be possible to get the necessary data.
> 
> > This is a problem that Bharata is investigating at the moment.
> > But he hasn't seen anything that can't be cured by a small amount of
> > memory pressure - IOW, dentries do get freed under memory pressure. So
> > your case might be very useful. Bharata is maintaining an instrumentation
> > patch to collect more information and an alternative dentry aging patch
> > (using rbtree). Perhaps you could try those.
> 
> Send it to me, and I'd be happy to try either the instrumentation
> patch or the dentry aging patch.
> 

Ted,

I am sending two patches here.

First is the dentry_stats patch, which collects some dcache statistics
and puts them into /proc/meminfo. This patch provides information
about how dentries are distributed across dcache slab pages, how many
free and in-use dentries are present in the dentry_unused lru list, and
how prune_dcache() performs with respect to freeing the requested
number of dentries.

Second is Sonny Rao's rbtree dentry reclaim patch, which is an attempt
to address this dcache fragmentation problem.

These patches apply on 2.6.13-rc7 and 2.6.13 cleanly.

Could you please apply the dcache_stats patch and check if the problem
can be reproduced? When that happens, could you please capture
/proc/meminfo, /proc/sys/fs/dentry-state and /proc/slabinfo?

It would be nice if you could also try the rbtree patch to check if
it improves the situation. The rbtree patch applies on top of the stats
patch.

Regards,
Bharata.

[-- Attachment #2: dcache_stats.patch --]
[-- Type: text/plain, Size: 9875 bytes --]



This patch gathers some statistics about the dcache and exports them
as part of /proc/meminfo.

The following data is collected:

1. A count of pages with 1,2,3,... dentries.

2. Number of dentries requested for freeing and the actual number
of dentries freed during the last invocation of prune_dcache.

3. Information about the dcache lru list: number of in-use, free,
referenced and total dentries.

Original Author: Dave Hansen <haveblue@us.ibm.com>

Signed-off-by: Bharata B Rao <bharata@in.ibm.com>
---

 arch/i386/mm/init.c    |    8 +++++++
 fs/dcache.c            |   56 +++++++++++++++++++++++++++++++++++++++++++++++++
 fs/proc/proc_misc.c    |   27 +++++++++++++++++++++++
 include/linux/dcache.h |   11 +++++++++
 include/linux/mm.h     |    3 ++
 mm/bootmem.c           |    4 +++
 6 files changed, 109 insertions(+)

diff -puN include/linux/mm.h~dcache_stats include/linux/mm.h
--- linux-2.6.13-rc7/include/linux/mm.h~dcache_stats	2005-09-12 10:57:52.000000000 +0530
+++ linux-2.6.13-rc7-bharata/include/linux/mm.h	2005-09-13 11:21:52.601920944 +0530
@@ -225,6 +225,9 @@ struct page {
 					 * to show when page is mapped
 					 * & limit reverse map searches.
 					 */
+	int nr_dentry;			/* Number of dentries in this page */
+	spinlock_t nr_dentry_lock;
+
 	unsigned long private;		/* Mapping-private opaque data:
 					 * usually used for buffer_heads
 					 * if PagePrivate set; used for
diff -puN arch/i386/mm/init.c~dcache_stats arch/i386/mm/init.c
--- linux-2.6.13-rc7/arch/i386/mm/init.c~dcache_stats	2005-09-12 10:57:52.000000000 +0530
+++ linux-2.6.13-rc7-bharata/arch/i386/mm/init.c	2005-09-13 11:22:29.357333272 +0530
@@ -272,6 +272,7 @@ void __init one_highpage_init(struct pag
 		set_page_count(page, 1);
 		__free_page(page);
 		totalhigh_pages++;
+		spin_lock_init(&page->nr_dentry_lock);
 	} else
 		SetPageReserved(page);
 }
@@ -669,6 +670,7 @@ static int noinline do_test_wp_bit(void)
 void free_initmem(void)
 {
 	unsigned long addr;
+	struct page *page;
 
 	addr = (unsigned long)(&__init_begin);
 	for (; addr < (unsigned long)(&__init_end); addr += PAGE_SIZE) {
@@ -676,6 +678,8 @@ void free_initmem(void)
 		set_page_count(virt_to_page(addr), 1);
 		memset((void *)addr, 0xcc, PAGE_SIZE);
 		free_page(addr);
+		page = virt_to_page(addr);
+		spin_lock_init(&page->nr_dentry_lock);
 		totalram_pages++;
 	}
 	printk (KERN_INFO "Freeing unused kernel memory: %dk freed\n", (__init_end - __init_begin) >> 10);
@@ -684,12 +688,16 @@ void free_initmem(void)
 #ifdef CONFIG_BLK_DEV_INITRD
 void free_initrd_mem(unsigned long start, unsigned long end)
 {
+	struct page *page;
+
 	if (start < end)
 		printk (KERN_INFO "Freeing initrd memory: %ldk freed\n", (end - start) >> 10);
 	for (; start < end; start += PAGE_SIZE) {
 		ClearPageReserved(virt_to_page(start));
 		set_page_count(virt_to_page(start), 1);
 		free_page(start);
+		page = virt_to_page(start);
+		spin_lock_init(&page->nr_dentry_lock);
 		totalram_pages++;
 	}
 }
diff -puN fs/dcache.c~dcache_stats fs/dcache.c
--- linux-2.6.13-rc7/fs/dcache.c~dcache_stats	2005-09-12 10:57:52.000000000 +0530
+++ linux-2.6.13-rc7-bharata/fs/dcache.c	2005-09-13 12:27:07.079829848 +0530
@@ -33,6 +33,7 @@
 #include <linux/seqlock.h>
 #include <linux/swap.h>
 #include <linux/bootmem.h>
+#include <linux/pagemap.h>
 
 /* #define DCACHE_DEBUG 1 */
 
@@ -69,12 +70,48 @@ struct dentry_stat_t dentry_stat = {
 	.age_limit = 45,
 };
 
+atomic_t nr_dentry[30]; /* I have seen a max of 27 dentries in a page */
+struct lru_dentry_stat lru_dentry_stat;
+DEFINE_SPINLOCK(prune_dcache_lock);
+
+void get_dstat_info(void)
+{
+	struct dentry *dentry;
+
+	lru_dentry_stat.nr_total = lru_dentry_stat.nr_inuse = 0;
+	lru_dentry_stat.nr_ref = lru_dentry_stat.nr_free = 0;
+
+	spin_lock(&dcache_lock);
+	list_for_each_entry(dentry, &dentry_unused, d_lru) {
+		if (atomic_read(&dentry->d_count))
+			lru_dentry_stat.nr_inuse++;
+		if (dentry->d_flags & DCACHE_REFERENCED)
+			lru_dentry_stat.nr_ref++;
+	}
+	lru_dentry_stat.nr_total = dentry_stat.nr_unused;
+	lru_dentry_stat.nr_free = lru_dentry_stat.nr_total -
+		lru_dentry_stat.nr_inuse;
+	spin_unlock(&dcache_lock);
+}
+
 static void d_callback(struct rcu_head *head)
 {
 	struct dentry * dentry = container_of(head, struct dentry, d_rcu);
+	unsigned long flags;
+	struct page *page;
 
 	if (dname_external(dentry))
 		kfree(dentry->d_name.name);
+
+	page = virt_to_page(dentry);
+	spin_lock_irqsave(&page->nr_dentry_lock, flags);
+	atomic_dec(&nr_dentry[page->nr_dentry]);
+	if (--page->nr_dentry != 0)
+		atomic_inc(&nr_dentry[page->nr_dentry]);
+	BUG_ON(atomic_read(&nr_dentry[page->nr_dentry]) < 0);
+	BUG_ON(page->nr_dentry > 29);
+	spin_unlock_irqrestore(&page->nr_dentry_lock, flags);
+
 	kmem_cache_free(dentry_cache, dentry); 
 }
 
@@ -393,6 +430,9 @@ static inline void prune_one_dentry(stru
  
 static void prune_dcache(int count)
 {
+	int nr_requested = count;
+	int nr_freed = 0;
+
 	spin_lock(&dcache_lock);
 	for (; count ; count--) {
 		struct dentry *dentry;
@@ -427,8 +467,13 @@ static void prune_dcache(int count)
 			continue;
 		}
 		prune_one_dentry(dentry);
+		nr_freed++;
 	}
 	spin_unlock(&dcache_lock);
+	spin_lock(&prune_dcache_lock);
+	lru_dentry_stat.dprune_req = nr_requested;
+	lru_dentry_stat.dprune_freed = nr_freed;
+	spin_unlock(&prune_dcache_lock);
 }
 
 /*
@@ -720,6 +765,8 @@ struct dentry *d_alloc(struct dentry * p
 {
 	struct dentry *dentry;
 	char *dname;
+	unsigned long flags;
+	struct page *page;
 
 	dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL); 
 	if (!dentry)
@@ -769,6 +816,15 @@ struct dentry *d_alloc(struct dentry * p
 	dentry_stat.nr_dentry++;
 	spin_unlock(&dcache_lock);
 
+	page = virt_to_page(dentry);
+	spin_lock_irqsave(&page->nr_dentry_lock, flags);
+	if (page->nr_dentry != 0)
+		atomic_dec(&nr_dentry[page->nr_dentry]);
+	atomic_inc(&nr_dentry[++page->nr_dentry]);
+	BUG_ON(atomic_read(&nr_dentry[page->nr_dentry]) < 0);
+	BUG_ON(page->nr_dentry > 29);
+	spin_unlock_irqrestore(&page->nr_dentry_lock, flags);
+
 	return dentry;
 }
 
diff -puN fs/proc/proc_misc.c~dcache_stats fs/proc/proc_misc.c
--- linux-2.6.13-rc7/fs/proc/proc_misc.c~dcache_stats	2005-09-12 10:57:52.000000000 +0530
+++ linux-2.6.13-rc7-bharata/fs/proc/proc_misc.c	2005-09-13 11:49:43.460911768 +0530
@@ -45,6 +45,7 @@
 #include <linux/sysrq.h>
 #include <linux/vmalloc.h>
 #include <linux/crash_dump.h>
+#include <linux/dcache.h>
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
 #include <asm/io.h>
@@ -115,6 +116,9 @@ static int uptime_read_proc(char *page, 
 	return proc_calc_metrics(page, start, off, count, eof, len);
 }
 
+extern atomic_t nr_dentry[];
+extern spinlock_t prune_dcache_lock;
+
 static int meminfo_read_proc(char *page, char **start, off_t off,
 				 int count, int *eof, void *data)
 {
@@ -128,6 +132,7 @@ static int meminfo_read_proc(char *page,
 	unsigned long allowed;
 	struct vmalloc_info vmi;
 	long cached;
+	int j, total_dcache_pages = 0;
 
 	get_page_state(&ps);
 	get_zone_counts(&active, &inactive, &free);
@@ -200,6 +205,28 @@ static int meminfo_read_proc(char *page,
 		vmi.largest_chunk >> 10
 		);
 
+		for (j =0; j < 30; j++) {
+			len += sprintf(page + len, "pages_with_[%2d]_dentries: %d\n",
+					j, atomic_read(&nr_dentry[j]));
+			total_dcache_pages += atomic_read(&nr_dentry[j]);
+		}
+		len += sprintf(page + len, "dcache_pages total: %d\n",
+			total_dcache_pages);
+
+		spin_lock(&prune_dcache_lock);
+		len += sprintf(page + len, "prune_dcache: requested  %d freed %d\n",
+			lru_dentry_stat.dprune_req, lru_dentry_stat.dprune_freed);
+		spin_unlock(&prune_dcache_lock);
+
+		get_dstat_info();
+		len += sprintf(page + len, "dcache lru list data:\n"
+			"dentries total: %d\n"
+			"dentries in_use: %d\n"
+			"dentries free: %d\n"
+			"dentries referenced: %d\n",
+			lru_dentry_stat.nr_total, lru_dentry_stat.nr_inuse,
+			lru_dentry_stat.nr_free, lru_dentry_stat.nr_ref);
+
 		len += hugetlb_report_meminfo(page + len);
 
 	return proc_calc_metrics(page, start, off, count, eof, len);
diff -puN mm/bootmem.c~dcache_stats mm/bootmem.c
--- linux-2.6.13-rc7/mm/bootmem.c~dcache_stats	2005-09-12 10:57:52.000000000 +0530
+++ linux-2.6.13-rc7-bharata/mm/bootmem.c	2005-09-13 11:26:31.358543496 +0530
@@ -291,12 +291,14 @@ static unsigned long __init free_all_boo
 			page = pfn_to_page(pfn);
 			count += BITS_PER_LONG;
 			__ClearPageReserved(page);
+			spin_lock_init(&page->nr_dentry_lock);
 			order = ffs(BITS_PER_LONG) - 1;
 			set_page_refs(page, order);
 			for (j = 1; j < BITS_PER_LONG; j++) {
 				if (j + 16 < BITS_PER_LONG)
 					prefetchw(page + j + 16);
 				__ClearPageReserved(page + j);
+				spin_lock_init(&((page + j)->nr_dentry_lock));
 			}
 			__free_pages(page, order);
 			i += BITS_PER_LONG;
@@ -311,6 +313,7 @@ static unsigned long __init free_all_boo
 					__ClearPageReserved(page);
 					set_page_refs(page, 0);
 					__free_page(page);
+					spin_lock_init(&page->nr_dentry_lock);
 				}
 			}
 		} else {
@@ -331,6 +334,7 @@ static unsigned long __init free_all_boo
 		__ClearPageReserved(page);
 		set_page_count(page, 1);
 		__free_page(page);
+		spin_lock_init(&page->nr_dentry_lock);
 	}
 	total += count;
 	bdata->node_bootmem_map = NULL;
diff -puN include/linux/dcache.h~dcache_stats include/linux/dcache.h
--- linux-2.6.13-rc7/include/linux/dcache.h~dcache_stats	2005-09-12 17:30:01.000000000 +0530
+++ linux-2.6.13-rc7-bharata/include/linux/dcache.h	2005-09-13 12:27:07.080829696 +0530
@@ -46,6 +46,17 @@ struct dentry_stat_t {
 };
 extern struct dentry_stat_t dentry_stat;
 
+struct lru_dentry_stat {
+	int nr_total;
+	int nr_inuse;
+	int nr_ref;
+	int nr_free;
+	int dprune_req;
+	int dprune_freed;
+};
+extern struct lru_dentry_stat lru_dentry_stat;
+extern void get_dstat_info(void);
+
 /* Name hashing routines. Initial hash value */
 /* Hash courtesy of the R5 hash in reiserfs modulo sign bits */
 #define init_name_hash()		0
_

[-- Attachment #3: rbtree_dcache_reclaim.patch --]
[-- Type: text/plain, Size: 6971 bytes --]



This patch maintains the dentries in a red-black tree. The RB tree is
scanned in-order and dentries are put at the end of the LRU list
to increase the chances of freeing all the dentries of a given page.

Original Author: Santhosh Rao <raosanth@us.ibm.com>

Signed-off-by: Bharata B Rao <bharata@in.ibm.com>
---

 fs/dcache.c            |  143 +++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/dcache.h |    2 
 2 files changed, 141 insertions(+), 4 deletions(-)

diff -puN fs/dcache.c~rbtree_dcache_reclaim fs/dcache.c
--- linux-2.6.13-rc7/fs/dcache.c~rbtree_dcache_reclaim	2005-09-13 12:11:11.279133640 +0530
+++ linux-2.6.13-rc7-bharata/fs/dcache.c	2005-09-13 12:15:02.732947312 +0530
@@ -34,6 +34,7 @@
 #include <linux/swap.h>
 #include <linux/bootmem.h>
 #include <linux/pagemap.h>
+#include <linux/rbtree.h>
 
 /* #define DCACHE_DEBUG 1 */
 
@@ -70,6 +71,50 @@ struct dentry_stat_t dentry_stat = {
 	.age_limit = 45,
 };
 
+static struct rb_root dentry_tree = RB_ROOT;
+
+#define RB_NONE (2)
+#define ON_RB(node)	((node)->rb_color != RB_NONE)
+#define RB_CLEAR(node)	((node)->rb_color = RB_NONE )
+  
+
+/* take a dentry safely off the rbtree */
+static void drb_delete(struct dentry* dentry)
+{
+	if (ON_RB(&dentry->d_rb)) {
+		rb_erase(&dentry->d_rb, &dentry_tree);
+		RB_CLEAR(&dentry->d_rb);
+	} else {
+		/* All allocated dentry objs should be in the tree */
+		BUG_ON(1);
+	}
+}
+
+static  struct dentry * drb_insert(struct dentry * dentry)
+{
+	struct rb_node ** p = &dentry_tree.rb_node;
+	struct rb_node * parent = NULL;
+	struct rb_node * node    = &dentry->d_rb;
+	struct dentry  * cur    = NULL;
+
+	while (*p) {
+		parent = *p;
+		cur = rb_entry(parent, struct dentry, d_rb);
+
+		if (dentry < cur)
+			p = &(*p)->rb_left;
+		else if (dentry > cur)
+			p = &(*p)->rb_right;
+		else {
+			return cur;
+		}
+	}
+
+	rb_link_node(node, parent, p);
+	rb_insert_color(node,&dentry_tree); 
+	return NULL;
+}
+
 atomic_t nr_dentry[30]; /* I have seen a max of 27 dentries in a page */
 struct lru_dentry_stat lru_dentry_stat;
 DEFINE_SPINLOCK(prune_dcache_lock);
@@ -232,6 +277,7 @@ kill_it: {
   		list_del(&dentry->d_child);
 		dentry_stat.nr_dentry--;	/* For d_free, below */
 		/*drops the locks, at that point nobody can reach this dentry */
+		drb_delete(dentry);
 		dentry_iput(dentry);
 		parent = dentry->d_parent;
 		d_free(dentry);
@@ -407,6 +453,7 @@ static inline void prune_one_dentry(stru
 	__d_drop(dentry);
 	list_del(&dentry->d_child);
 	dentry_stat.nr_dentry--;	/* For d_free, below */
+	drb_delete(dentry);
 	dentry_iput(dentry);
 	parent = dentry->d_parent;
 	d_free(dentry);
@@ -416,7 +463,7 @@ static inline void prune_one_dentry(stru
 }
 
 /**
- * prune_dcache - shrink the dcache
+ * prune_lru - shrink the lru list
  * @count: number of entries to try and free
  *
  * Shrink the dcache. This is done when we need
@@ -428,7 +475,7 @@ static inline void prune_one_dentry(stru
  * all the dentries are in use.
  */
  
-static void prune_dcache(int count)
+static void prune_lru(int count)
 {
 	int nr_requested = count;
 	int nr_freed = 0;
@@ -476,6 +523,93 @@ static void prune_dcache(int count)
 	spin_unlock(&prune_dcache_lock);
 }
 
+/**
+ * prune_dcache - try and "intelligently" shrink the dcache
+ * @requested - num of dentrys to try and free
+ *
+ * The basic strategy here is to scan through our tree of dentrys
+ * in-order and put them at the end of the lru - free list
+ * Why in-order?  Because, we want the chances of actually freeing
+ * all 15-27 (depending on arch) dentrys on a given page, instead
+ * of just in random lru order, which tends to lower dcache utilization
+ * and not free many pages.
+ */
+static void prune_dcache(unsigned  requested)
+{
+	/* ------ debug --------- */
+	//static int mod = 0;
+	//int flag = 0, removed = 0;
+	/* ------ debug --------- */
+
+	unsigned found = 0;
+	unsigned count;
+	struct rb_node * next;
+	struct dentry *dentry;
+#define NUM_LRU_PTRS 8
+	struct rb_node *lru_ptrs[NUM_LRU_PTRS];
+	struct list_head *cur;
+	int i;
+
+	spin_lock(&dcache_lock);
+	
+       	cur = dentry_unused.prev;
+
+	/* grab NUM_LRU_PTRS entrys off the end of lru list */
+	/* we'll use these as pseudo-random starting points in the tree */
+	for (i = 0 ; i < NUM_LRU_PTRS ; i++ ){
+		if ( cur == &dentry_unused ) {
+			/* if there aren't NUM_LRU_PTRS entrys, we probably
+			   can't even free a page now, give up */
+			spin_unlock(&dcache_lock);
+			return;
+		}
+		lru_ptrs[i] = &(list_entry(cur,struct dentry, d_lru)->d_rb); 
+		cur = cur->prev;
+	}
+	
+	i = 0;
+	
+	do {
+		count = 4 * PAGE_SIZE / sizeof(struct dentry) ; /* arbitrary heuristic */
+		next = lru_ptrs[i];
+		for (; count ; count--) {
+			if( ! next ) {
+				//flag = 1;  /* ------ debug --------- */
+				break;
+			}
+			dentry = list_entry(next, struct dentry, d_rb);
+			next = rb_next(next);
+			prefetch(next);
+			if( ! list_empty( &dentry->d_lru) ) {
+				list_del_init(&dentry->d_lru);
+				dentry_stat.nr_unused--;
+			}
+			if (atomic_read(&dentry->d_count)) {
+				//removed++; 	/* ------ debug --------- */
+				continue;
+			} else {
+				list_add_tail(&dentry->d_lru, &dentry_unused);
+				dentry_stat.nr_unused++;
+				found++;
+			}
+		}
+		i++;
+	} while ( (found < requested / 2) && (i < NUM_LRU_PTRS ) );
+#undef NUM_LRU_PTRS
+
+	spin_unlock(&dcache_lock);
+	
+	/* ------ debug --------- */
+	//mod++;	
+	//if ( ! (mod & 64) ) {
+	//	mod = 0;
+	//	printk("prune_dcache: i %d flag %d, found %d removed %d\n",i,flag,found,removed);
+	//}
+	/* ------ debug --------- */
+
+	prune_lru(found);
+}
+
 /*
  * Shrink the dcache for the specified super block.
  * This allows us to unmount a device without disturbing
@@ -687,7 +821,7 @@ void shrink_dcache_parent(struct dentry 
 	int found;
 
 	while ((found = select_parent(parent)) != 0)
-		prune_dcache(found);
+		prune_lru(found);
 }
 
 /**
@@ -725,7 +859,7 @@ void shrink_dcache_anon(struct hlist_hea
 			}
 		}
 		spin_unlock(&dcache_lock);
-		prune_dcache(found);
+		prune_lru(found);
 	} while(found);
 }
 
@@ -814,6 +948,7 @@ struct dentry *d_alloc(struct dentry * p
 	if (parent)
 		list_add(&dentry->d_child, &parent->d_subdirs);
 	dentry_stat.nr_dentry++;
+	drb_insert(dentry);
 	spin_unlock(&dcache_lock);
 
 	page = virt_to_page(dentry);
diff -puN include/linux/dcache.h~rbtree_dcache_reclaim include/linux/dcache.h
--- linux-2.6.13-rc7/include/linux/dcache.h~rbtree_dcache_reclaim	2005-09-13 12:11:11.284132880 +0530
+++ linux-2.6.13-rc7-bharata/include/linux/dcache.h	2005-09-13 12:11:11.306129536 +0530
@@ -9,6 +9,7 @@
 #include <linux/cache.h>
 #include <linux/rcupdate.h>
 #include <asm/bug.h>
+#include <linux/rbtree.h>
 
 struct nameidata;
 struct vfsmount;
@@ -104,6 +105,7 @@ struct dentry {
 	struct dentry *d_parent;	/* parent directory */
 	struct qstr d_name;
 
+	struct rb_node   d_rb;
 	struct list_head d_lru;		/* LRU list */
 	struct list_head d_child;	/* child of parent list */
 	struct list_head d_subdirs;	/* our children */
_


* Re: VM balancing issues on 2.6.13: dentry cache not getting shrunk enough
  2005-09-13  8:47     ` Bharata B Rao
@ 2005-09-13 21:59       ` David Chinner
  2005-09-14  9:01         ` Andi Kleen
  2005-09-14 15:48         ` Sonny Rao
  2005-09-14 21:34       ` Marcelo Tosatti
  2005-09-14 23:08       ` Marcelo Tosatti
  2 siblings, 2 replies; 32+ messages in thread
From: David Chinner @ 2005-09-13 21:59 UTC
  To: Bharata B Rao; +Cc: Theodore Ts'o, Dipankar Sarma, linux-mm, linux-kernel

On Tue, Sep 13, 2005 at 02:17:52PM +0530, Bharata B Rao wrote:
> 
> Second is Sonny Rao's rbtree dentry reclaim patch which is an attempt
> to address this dcache fragmentation problem.

FYI, in the past I've tried this patch to reduce dcache fragmentation on
an Altix (16k pages, 62 dentries to a slab page) under heavy
fileserver workloads and it had no measurable effect. It appeared
that there was almost always at least one active dentry on each page
in the slab.  The story may very well be different on 4k page
machines, however.

Typically, fragmentation was bad enough that reclaim removed ~90% of
the working set of dentries to free about 1% of the memory in the
dentry slab. We had to get down to freeing > 95% of the dentry cache
before fragmentation started to reduce and the system stopped trying to
reclaim the dcache, which we then spent the next 10 minutes
repopulating...

We also tried separating out directory dentries into a separate slab
so that (potentially) longer lived dentries were clustered together
rather than sparsely distributed around the slab cache.  Once again,
it had no measurable effect on the level of fragmentation (with or
without the rbtree patch).

FWIW, the inode cache was showing very similar levels of fragmentation
under reclaim as well.

Cheers,

Dave.
-- 
Dave Chinner
R&D Software Engineer
SGI Australian Software Group


* Re: VM balancing issues on 2.6.13: dentry cache not getting shrunk enough
  2005-09-13 21:59       ` David Chinner
@ 2005-09-14  9:01         ` Andi Kleen
  2005-09-14  9:16           ` Manfred Spraul
                             ` (3 more replies)
  2005-09-14 15:48         ` Sonny Rao
  1 sibling, 4 replies; 32+ messages in thread
From: Andi Kleen @ 2005-09-14  9:01 UTC
  To: David Chinner
  Cc: Bharata B Rao, Theodore Ts'o, Dipankar Sarma, linux-mm,
	linux-kernel, manfred

On Tuesday 13 September 2005 23:59, David Chinner wrote:
> On Tue, Sep 13, 2005 at 02:17:52PM +0530, Bharata B Rao wrote:
> > Second is Sonny Rao's rbtree dentry reclaim patch which is an attempt
> > to address this dcache fragmentation problem.
>
> FYI, in the past I've tried this patch to reduce dcache fragmentation on
> an Altix (16k pages, 62 dentries to a slab page) under heavy
> fileserver workloads and it had no measurable effect. It appeared
> that there was almost always at least one active dentry on each page
> in the slab.  The story may very well be different on 4k page
> machines, however.

I always thought dentry freeing would work much better if it
was turned upside down.

Instead of starting from the high level dcache lists it could
be driven by slab: on memory pressure slab tries to return pages with unused 
cache objects. In that case it should check whether only a small number
of pinned objects is left on the page, and if so use a new callback to
the higher level user (=dcache) to ask it to free those objects.

The slab datastructures are not completely suited for this right now,
but it could be done by using one more of the list_heads in struct page
for slab backing pages.

It would probably not be very LRU, but a simple hack would be to have
slowly increasing dcache generations. Each dentry use updates the
generation. A first slab memory freeing pass would only free objects
with older generations.

Using slowly increasing generations has the advantage over timestamps
that you can avoid dirtying cache lines in the common case when the
generation doesn't change on access (= no additional cache line
bouncing), and it would easily allow tuning the aging rate under stress
by changing the length of a generation.
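
As a rough illustration of the idea -- purely hypothetical, since
struct dentry has no d_generation field and slab has no per-object
reclaim callback in 2.6.13:

/* Sketch only: d_generation and the slab->dcache callback are invented. */
static unsigned int dcache_generation;  /* bumped slowly, e.g. from a timer */

static inline void dentry_touch(struct dentry *dentry)
{
        unsigned int gen = dcache_generation;

        /* Common case: generation unchanged since the last use, so the
         * cache line is not dirtied at all. */
        if (dentry->d_generation != gen)
                dentry->d_generation = gen;
}

/* What the hypothetical per-object callback from slab might look like:
 * give up only unpinned dentries at least one generation old. */
static int dentry_try_reclaim(struct dentry *dentry)
{
        if (atomic_read(&dentry->d_count))
                return 0;       /* pinned: this page can't be freed */
        if (dentry->d_generation == dcache_generation)
                return 0;       /* used in the current generation */
        /* prune_one_dentry()-style teardown would go here */
        return 1;
}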

-Andi


* Re: VM balancing issues on 2.6.13: dentry cache not getting shrunk enough
  2005-09-14  9:01         ` Andi Kleen
@ 2005-09-14  9:16           ` Manfred Spraul
  2005-09-14  9:43             ` Andrew Morton
  2005-09-14  9:35           ` Andrew Morton
                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 32+ messages in thread
From: Manfred Spraul @ 2005-09-14  9:16 UTC
  To: Andi Kleen
  Cc: David Chinner, Bharata B Rao, Theodore Ts'o, Dipankar Sarma,
	linux-mm, linux-kernel

Andi Kleen wrote:

>The slab datastructures are not completely suited for this right now,
>but it could be done by using one more of the list_heads in struct page
>for slab backing pages.
>
>  
>
I agree, I even started prototyping something a year ago, but ran out of 
time.
One tricky point is directory dentries: as far as I see, they are
pinned and unfreeable if a (freeable) directory entry is in the cache.

--
    Manfred


* Re: VM balancing issues on 2.6.13: dentry cache not getting shrunk enough
  2005-09-14  9:01         ` Andi Kleen
  2005-09-14  9:16           ` Manfred Spraul
@ 2005-09-14  9:35           ` Andrew Morton
  2005-09-14 13:57           ` Martin J. Bligh
  2005-09-14 22:48           ` David Chinner
  3 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2005-09-14  9:35 UTC
  To: Andi Kleen; +Cc: dgc, bharata, tytso, dipankar, linux-mm, linux-kernel, manfred

Andi Kleen <ak@suse.de> wrote:
>
> On Tuesday 13 September 2005 23:59, David Chinner wrote:
> > On Tue, Sep 13, 2005 at 02:17:52PM +0530, Bharata B Rao wrote:
> > > Second is Sonny Rao's rbtree dentry reclaim patch which is an attempt
> > > to address this dcache fragmentation problem.
> >
> > FYI, in the past I've tried this patch to reduce dcache fragmentation on
> > an Altix (16k pages, 62 dentries to a slab page) under heavy
> > fileserver workloads and it had no measurable effect. It appeared
> > that there was almost always at least one active dentry on each page
> > in the slab.  The story may very well be different on 4k page
> > machines, however.
> 
> I always thought dentry freeing would work much better if it
> was turned upside down.
> 
> Instead of starting from the high level dcache lists it could
> be driven by slab: on memory pressure slab tries to return pages with unused 
> cache objects. In that case it should check whether only a small number
> of pinned objects is left on the page, and if so use a new callback to
> the higher level user (=dcache) to ask it to free those objects.

Considered doing that with buffer_heads a few years ago.  It's impossible
unless you have a global lock, which bh's don't have.  dentries _do_ have a
global lock, and we'd be tied to having it for ever more.

The shrinking code would have to be able to deal with a dentry which is going
through destruction by other call paths, so dcache_lock coverage would have
to be extended considerably - it would have to cover the kmem_cache_free(),
for example.   Or we put some i_am_alive flag into the dentry.

> The slab datastructures are not completely suited for this right now,
> but it could be done by using one more of the list_heads in struct page
> for slab backing pages.

Yes, some help would be needed in the slab code.

There's only one list_head in struct page and slab is already using it.



* Re: VM balancing issues on 2.6.13: dentry cache not getting shrunk enough
  2005-09-14  9:16           ` Manfred Spraul
@ 2005-09-14  9:43             ` Andrew Morton
  2005-09-14  9:52               ` Dipankar Sarma
  2005-09-14 22:44               ` Theodore Ts'o
  0 siblings, 2 replies; 32+ messages in thread
From: Andrew Morton @ 2005-09-14  9:43 UTC
  To: Manfred Spraul; +Cc: ak, dgc, bharata, tytso, dipankar, linux-mm, linux-kernel

Manfred Spraul <manfred@colorfullife.com> wrote:
>
> One tricky point is directory dentries: as far as I see, they are
>  pinned and unfreeable if a (freeable) directory entry is in the cache.
>

Well.  That's the whole problem.

I don't think it's been demonstrated that Ted's problem was caused by
internal fragmentation, btw.  Ted, could you run slabtop and see what the
dcache occupancy is?  Monitor it as you start to manually apply pressure? 
If the occupancy falls to 10% and not many slab pages are freed up yet then
yup, it's internal fragmentation.
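
(For reference, occupancy can also be computed straight from
/proc/slabinfo; a throwaway userspace helper, not from this thread,
might look like this:)

/* occ.c: print dentry_cache occupancy from /proc/slabinfo (2.1 format). */
#include <stdio.h>
#include <string.h>

int main(void)
{
        char line[512], name[64];
        unsigned long active, total;
        FILE *f = fopen("/proc/slabinfo", "r");

        if (!f) {
                perror("/proc/slabinfo");
                return 1;
        }
        while (fgets(line, sizeof(line), f)) {
                /* header and version lines fail the 3-field parse */
                if (sscanf(line, "%63s %lu %lu", name, &active, &total) == 3 &&
                    strcmp(name, "dentry_cache") == 0 && total > 0)
                        printf("dentry_cache: %lu/%lu objects = %lu%%\n",
                               active, total, 100 * active / total);
        }
        fclose(f);
        return 0;
}

Running it under watch(1) while applying pressure would show whether
occupancy collapses before pages actually come free.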

I've found that internal fragmentation due to pinned directory dentries can
be very high if you're running silly benchmarks which create a
regular-shaped directory tree and can easily produce pathological
patterns.  For real-world things with irregular creation and access
patterns and irregular directory sizes the fragmentation isn't as easy to
demonstrate.

Another approach would be to do an aging round on a directory's children
when an unfreeable dentry is encountered on the LRU.  Something like that. 
If internal fragmentation is indeed the problem.


* Re: VM balancing issues on 2.6.13: dentry cache not getting shrunk enough
  2005-09-14  9:43             ` Andrew Morton
@ 2005-09-14  9:52               ` Dipankar Sarma
  2005-09-14 22:44               ` Theodore Ts'o
  1 sibling, 0 replies; 32+ messages in thread
From: Dipankar Sarma @ 2005-09-14  9:52 UTC
  To: Andrew Morton
  Cc: Manfred Spraul, ak, dgc, bharata, tytso, linux-mm, linux-kernel

On Wed, Sep 14, 2005 at 02:43:13AM -0700, Andrew Morton wrote:
> Manfred Spraul <manfred@colorfullife.com> wrote:
> >
> > One tricky point is directory dentries: as far as I see, they are
> >  pinned and unfreeable if a (freeable) directory entry is in the cache.
> >
> I don't think it's been demonstrated that Ted's problem was caused by
> internal fragmentation, btw.  Ted, could you run slabtop and see what the
> dcache occupancy is?  Monitor it as you start to manually apply pressure? 
> If the occupancy falls to 10% and not many slab pages are freed up yet then
> yup, it's internal fragmentation.
> 
> I've found that internal fragmentation due to pinned directory dentries can
> be very high if you're running silly benchmarks which create some
> regular-shaped directory tree which can easily create pathological
> patterns.  For real-world things with irregular creation and access
> patterns and irregular directory sizes the fragmentation isn't as easy to
> demonstrate.
> 
> Another approach would be to do an aging round on a directory's children
> when an unfreeable dentry is encountered on the LRU.  Something like that. 
> If internal fragmentation is indeed the problem.

One other point to look at is whether fragmentation is due to pinned
dentries or not. We can get that information only from dcache itself.
That is what we need to ascertain first using the instrumentation
patch. Solving the problem of a large # of pinned dentries and a large #
of free LRU dentries will likely require different approaches. Even the
LRU dentries are sometimes pinned due to the lazy-lru stuff that
we did for lock-free dcache. Let us get some accurate dentry
stats first from the instrumentation patch.

Thanks
Dipankar


* Re: VM balancing issues on 2.6.13: dentry cache not getting shrunk enough
  2005-09-14  9:01         ` Andi Kleen
  2005-09-14  9:16           ` Manfred Spraul
  2005-09-14  9:35           ` Andrew Morton
@ 2005-09-14 13:57           ` Martin J. Bligh
  2005-09-14 15:37             ` Sonny Rao
  2005-09-15  7:21             ` Helge Hafting
  2005-09-14 22:48           ` David Chinner
  3 siblings, 2 replies; 32+ messages in thread
From: Martin J. Bligh @ 2005-09-14 13:57 UTC
  To: Andi Kleen, David Chinner
  Cc: Bharata B Rao, Theodore Ts'o, Dipankar Sarma, linux-mm,
	linux-kernel, manfred

>> > Second is Sonny Rao's rbtree dentry reclaim patch which is an attempt
>> > to address this dcache fragmentation problem.
>> 
>> FYI, in the past I've tried this patch to reduce dcache fragmentation on
>> an Altix (16k pages, 62 dentries to a slab page) under heavy
>> fileserver workloads and it had no measurable effect. It appeared
>> that there was almost always at least one active dentry on each page
>> in the slab.  The story may very well be different on 4k page
>> machines, however.
> 
> I always thought dentry freeing would work much better if it
> was turned upside down.
> 
> Instead of starting from the high level dcache lists it could
> be driven by slab: on memory pressure slab tries to return pages with unused 
> cache objects. In that case it should check whether only a small number
> of pinned objects is left on the page, and if so use a new callback to
> the higher level user (=dcache) to ask it to free those objects.
> 
> The slab datastructures are not completely suited for this right now,
> but it could be done by using one more of the list_heads in struct page
> for slab backing pages.
> 
> It would probably not be very LRU, but a simple hack would be to have
> slowly increasing dcache generations. Each dentry use updates the
> generation. A first slab memory freeing pass would only free objects
> with older generations.

If they're freeable, we should easily be able to move them, and therefore 
compact a fragmented slab. That way we can preserve the LRU'ness of it.
Stage 1: free the oldest entries. Stage 2: compact the slab into whole
pages. Stage 3: free whole pages back to the page allocator.
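
In sketch form, with every helper below invented purely to name the
stages (nothing like them exists today):

/* hand-waving, untested sketch of the three stages above */
static int shrink_and_compact(kmem_cache_t *cachep, int nr_oldest)
{
	int freed;

	/* Stage 1: free the oldest unused objects, in LRU order */
	freed = cache_free_oldest(cachep, nr_oldest);

	/* Stage 2: migrate the remaining freeable objects out of
	 * sparsely used slab pages into partially full ones */
	cache_compact(cachep);

	/* Stage 3: hand the now-empty slab pages back to the page
	 * allocator */
	freed += cache_release_empty_pages(cachep);

	return freed;
}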

> Using slowly increasing generations has the advantage of timestamps
> that you can avoid dirtying cache lines in the common case when 
> the generation doesn't change on access (= no additional cache line bouncing)
> and it would easily allow to tune the aging rate under stress by changing the 
> length of the generation.

LRU algorithm may need general tweaking like this anyway ... strict LRU
is expensive to keep.
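
To make the generation idea concrete, something like this might do
(rough and untested; d_gen and dcache_gen are invented names):

static unsigned long dcache_gen;	/* advanced slowly, e.g. from a timer */

static inline void dentry_touch(struct dentry *dentry)
{
	/* common case: generation unchanged, so no store and no
	 * cacheline dirtying or bouncing on the lookup fast path */
	if (dentry->d_gen != dcache_gen)
		dentry->d_gen = dcache_gen;
}

/* a first slab freeing pass would only reap old objects */
static inline int dentry_reapable(struct dentry *dentry)
{
	return dcache_gen - dentry->d_gen >= 1;
}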

M.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: VM balancing issues on 2.6.13: dentry cache not getting shrunk enough
  2005-09-14 13:57           ` Martin J. Bligh
@ 2005-09-14 15:37             ` Sonny Rao
  2005-09-15  7:21             ` Helge Hafting
  1 sibling, 0 replies; 32+ messages in thread
From: Sonny Rao @ 2005-09-14 15:37 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Andi Kleen, David Chinner, Bharata B Rao, Theodore Ts'o,
	Dipankar Sarma, linux-mm, linux-kernel, manfred

On Wed, Sep 14, 2005 at 06:57:56AM -0700, Martin J. Bligh wrote:
> >> > Second is Sonny Rao's rbtree dentry reclaim patch which is an attempt
> >> > to improve this dcache fragmentation problem.
> >> 
> >> FYI, in the past I've tried this patch to reduce dcache fragmentation on
> >> an Altix (16k pages, 62 dentries to a slab page) under heavy
> >> fileserver workloads and it had no measurable effect. It appeared
> >> that there was almost always at least one active dentry on each page
> >> in the slab.  The story may very well be different on 4k page
> >> machines, however.
> > 
> > I always thought dentry freeing would work much better if it
> > was turned upside down.
> > 
> > Instead of starting from the high level dcache lists it could
> > be driven by slab: on memory pressure slab tries to return pages with unused 
> > cache objects. In that case it should check if there are only
> > a small number of pinned objects on the page set left, and if 
> > yes use a new callback to the higher level user (=dcache) and ask them
> > to free the object.
> > 
> > The slab datastructures are not completely suited for this right now,
> > but it could be done by using one more of the list_heads in struct page
> > for slab backing pages.
> > 
> > It would probably not be very LRU but a simple hack of having slowly 
> > increasing dcache generations. Each dentry use updates the generation.
> > First slab memory freeing pass only frees objects with older generations.
> 
> If they're freeable, we should easily be able to move them, and therefore 
> compact a fragmented slab. That way we can preserve the LRU'ness of it.
> Stage 1: free the oldest entries. Stage 2: compact the slab into whole
> pages. Stage 3: free whole pages back to the page allocator.
> 
> > Using slowly increasing generations has the advantage of timestamps
> > that you can avoid dirtying cache lines in the common case when 
> > the generation doesn't change on access (= no additional cache line bouncing)
> > and it would easily allow to tune the aging rate under stress by changing the 
> > length of the generation.
> 
> LRU algorithm may need general tweaking like this anyway ... strict LRU
> is expensive to keep.

Based on what I remember, I'd contend it isn't really LRU today, so I
wouldn't try to stick to something that we aren't doing. :)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: VM balancing issues on 2.6.13: dentry cache not getting shrunk enough
  2005-09-13 21:59       ` David Chinner
  2005-09-14  9:01         ` Andi Kleen
@ 2005-09-14 15:48         ` Sonny Rao
  2005-09-14 22:02           ` David Chinner
  1 sibling, 1 reply; 32+ messages in thread
From: Sonny Rao @ 2005-09-14 15:48 UTC (permalink / raw)
  To: David Chinner
  Cc: Bharata B Rao, Theodore Ts'o, Dipankar Sarma, linux-mm, linux-kernel

On Wed, Sep 14, 2005 at 07:59:32AM +1000, David Chinner wrote:
> On Tue, Sep 13, 2005 at 02:17:52PM +0530, Bharata B Rao wrote:
> > 
> > Second is Sonny Rao's rbtree dentry reclaim patch which is an attempt
> > to improve this dcache fragmentation problem.
> 
> FYI, in the past I've tried this patch to reduce dcache fragmentation on
> an Altix (16k pages, 62 dentries to a slab page) under heavy
> fileserver workloads and it had no measurable effect. It appeared
> that there was almost always at least one active dentry on each page
> in the slab.  The story may very well be different on 4k page
> machines, however.
> 
> Typically, fragmentation was bad enough that reclaim removed ~90% of
> the working set of dentries to free about 1% of the memory in the
> dentry slab. We had to get down to freeing > 95% of the dentry cache
> before fragmentation started to reduce and the system stopped trying to
> reclaim the dcache which we then spent the next 10 minutes
> repopulating......
> 
> We also tried separating out directory dentries into a separate slab
> so that (potentially) longer lived dentries were clustered together
> rather than sparsely distributed around the slab cache.  Once again,
> it had no measurable effect on the level of fragmentation (with or
> without the rbtree patch).

I'm not surprised... With 62 dentries per page, the likelihood of
success is very small, and in fact performance could degrade since we
are holding the dcache lock more often and doing less useful work.

It has been over a year and my memory is hazy, but I think I did see
about a 10% improvement on my workload (some sort of SFS simulation
with millions of files being randomly accessed)  on an x86 machine but CPU
utilization also went way up which I think was the dcache lock.

Whatever happened to the vfs_cache_pressure band-aid/sledgehammer?
Is it not considered an option?


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: VM balancing issues on 2.6.13: dentry cache not getting shrunk enough
  2005-09-13  8:47     ` Bharata B Rao
  2005-09-13 21:59       ` David Chinner
@ 2005-09-14 21:34       ` Marcelo Tosatti
  2005-09-14 21:43         ` Dipankar Sarma
  2005-09-15  4:28         ` Bharata B Rao
  2005-09-14 23:08       ` Marcelo Tosatti
  2 siblings, 2 replies; 32+ messages in thread
From: Marcelo Tosatti @ 2005-09-14 21:34 UTC (permalink / raw)
  To: Bharata B Rao; +Cc: Theodore Ts'o, Dipankar Sarma, linux-mm, linux-kernel

On Tue, Sep 13, 2005 at 02:17:52PM +0530, Bharata B Rao wrote:
> On Sun, Sep 11, 2005 at 11:16:36PM -0400, Theodore Ts'o wrote:
> > On Sun, Sep 11, 2005 at 05:30:46PM +0530, Dipankar Sarma wrote:
> > > Do you have the /proc/sys/fs/dentry-state output when such lowmem
> > > shortage happens ?
> > 
> > Not yet, but the situation occurs on my laptop about 2 or 3 times
> > (when I'm not travelling and so it doesn't get rebooted).  So
> > reproducing it isn't utterly trivial, but it does happen often
> > enough that it should be possible to get the necessary data.
> > 
> > > This is a problem that Bharata has been investigating at the moment.
> > > But he hasn't seen anything that can't be cured by a small memory
> > > pressure - IOW, dentries do get freed under memory pressure. So
> > > your case might be very useful. Bharata is maintaining an instrumentation
> > > patch to collect more information and an alternative dentry aging patch 
> > > (using rbtree). Perhaps you could try with those.
> > 
> > Send it to me, and I'd be happy to try either the instrumentation
> > patch or the dentry aging patch.
> > 
> 
> Ted,
> 
> I am sending two patches here.
> 
> First is dentry_stats patch which collects some dcache statistics
> and puts it into /proc/meminfo. This patch provides information 
> about how dentries are distributed in dcache slab pages, how many
> free and in use dentries are present in dentry_unused lru list and
> how prune_dcache() performs with respect to freeing the requested
> number of dentries.

Hi Bharata,

+void get_dstat_info(void)
+{
+       struct dentry *dentry;
+
+       lru_dentry_stat.nr_total = lru_dentry_stat.nr_inuse = 0;
+       lru_dentry_stat.nr_ref = lru_dentry_stat.nr_free = 0;
+
+       spin_lock(&dcache_lock);
+       list_for_each_entry(dentry, &dentry_unused, d_lru) {
+               if (atomic_read(&dentry->d_count))
+                       lru_dentry_stat.nr_inuse++;

Dentries on dentry_unused list with d_count positive? Is that possible 
at all? As far as my limited understanding goes, only dentries with zero 
count can be part of the dentry_unused list.

+               if (dentry->d_flags & DCACHE_REFERENCED)
+                       lru_dentry_stat.nr_ref++;
+       }


@@ -393,6 +430,9 @@ static inline void prune_one_dentry(stru

 static void prune_dcache(int count)
 {
+       int nr_requested = count;
+       int nr_freed = 0;
+
        spin_lock(&dcache_lock);
        for (; count ; count--) {
                struct dentry *dentry;
@@ -427,8 +467,13 @@ static void prune_dcache(int count)
                        continue;
                }
                prune_one_dentry(dentry);
+               nr_freed++;
        }
        spin_unlock(&dcache_lock);
+       spin_lock(&prune_dcache_lock);
+       lru_dentry_stat.dprune_req = nr_requested;
+       lru_dentry_stat.dprune_freed = nr_freed;

Don't you mean "+=" ? 

+       spin_unlock(&prune_dcache_lock);




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: VM balancing issues on 2.6.13: dentry cache not getting shrunk enough
  2005-09-14 21:34       ` Marcelo Tosatti
@ 2005-09-14 21:43         ` Dipankar Sarma
  2005-09-15  4:28         ` Bharata B Rao
  1 sibling, 0 replies; 32+ messages in thread
From: Dipankar Sarma @ 2005-09-14 21:43 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Bharata B Rao, Theodore Ts'o, linux-mm, linux-kernel

On Wed, Sep 14, 2005 at 06:34:04PM -0300, Marcelo Tosatti wrote:
> On Tue, Sep 13, 2005 at 02:17:52PM +0530, Bharata B Rao wrote:
> > On Sun, Sep 11, 2005 at 11:16:36PM -0400, Theodore Ts'o wrote:
> > 
> > Ted,
> > 
> > I am sending two patches here.
> > 
> > First is dentry_stats patch which collects some dcache statistics
> > and puts it into /proc/meminfo. This patch provides information 
> > about how dentries are distributed in dcache slab pages, how many
> > free and in use dentries are present in dentry_unused lru list and
> > how prune_dcache() performs with respect to freeing the requested
> > number of dentries.
> 
> Hi Bharata,
> 
> +void get_dstat_info(void)
> +{
> +       struct dentry *dentry;
> +
> +       lru_dentry_stat.nr_total = lru_dentry_stat.nr_inuse = 0;
> +       lru_dentry_stat.nr_ref = lru_dentry_stat.nr_free = 0;
> +
> +       spin_lock(&dcache_lock);
> +       list_for_each_entry(dentry, &dentry_unused, d_lru) {
> +               if (atomic_read(&dentry->d_count))
> +                       lru_dentry_stat.nr_inuse++;
> 
> Dentries on dentry_unused list with d_count positive? Is that possible 
> at all? As far as my limited understanding goes, only dentries with zero 
> count can be part of the dentry_unused list.

That changed with the lock-free dcache implementation during
2.5. If we strictly updated the lru list, we would have to acquire
the dcache_lock in __d_lookup() on every successful lookup. So we
did lazy-lru: leave the dentries with non-zero refcounts on the list
and clean them up later, when we acquire dcache_lock for other
purposes.
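
So prune_dcache() has to be prepared to find busy dentries on the
list. Simplified from the current loop (abridged from memory, the
DCACHE_REFERENCED handling omitted):

static void prune_dcache(int count)
{
	spin_lock(&dcache_lock);
	for (; count; count--) {
		struct list_head *tmp = dentry_unused.prev;
		struct dentry *dentry;

		if (tmp == &dentry_unused)
			break;
		list_del_init(tmp);
		dentry_stat.nr_unused--;
		dentry = list_entry(tmp, struct dentry, d_lru);

		spin_lock(&dentry->d_lock);
		/*
		 * An in-use dentry left behind by a lazy lookup:
		 * don't free it, just keep it off the unused list.
		 */
		if (atomic_read(&dentry->d_count)) {
			spin_unlock(&dentry->d_lock);
			continue;
		}
		prune_one_dentry(dentry);
	}
	spin_unlock(&dcache_lock);
}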

Thanks
Dipankar

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: VM balancing issues on 2.6.13: dentry cache not getting shrunk enough
  2005-09-14 15:48         ` Sonny Rao
@ 2005-09-14 22:02           ` David Chinner
  2005-09-14 22:40             ` Sonny Rao
  0 siblings, 1 reply; 32+ messages in thread
From: David Chinner @ 2005-09-14 22:02 UTC (permalink / raw)
  To: Sonny Rao
  Cc: Bharata B Rao, Theodore Ts'o, Dipankar Sarma, linux-mm, linux-kernel

On Wed, Sep 14, 2005 at 11:48:52AM -0400, Sonny Rao wrote:
> On Wed, Sep 14, 2005 at 07:59:32AM +1000, David Chinner wrote:
> > On Tue, Sep 13, 2005 at 02:17:52PM +0530, Bharata B Rao wrote:
> > > 
> > > Second is Sonny Rao's rbtree dentry reclaim patch which is an attempt
> > > to improve this dcache fragmentation problem.
> > 
> > FYI, in the past I've tried this patch to reduce dcache fragmentation on
> > an Altix (16k pages, 62 dentries to a slab page) under heavy
> > fileserver workloads and it had no measurable effect. It appeared
> > that there was almost always at least one active dentry on each page
> > in the slab.  The story may very well be different on 4k page
> > machines, however.

....

> I'm not surprised... With 62 dentries per page, the likelihood of
> success is very small, and in fact performance could degrade since we
> are holding the dcache lock more often and doing less useful work.
> 
> It has been over a year and my memory is hazy, but I think I did see
> about a 10% improvement on my workload (some sort of SFS simulation
> with millions of files being randomly accessed)  on an x86 machine but CPU
> utilization also went way up which I think was the dcache lock.

Hmmm - can't say that I've had the same experience. I did not notice
any decrease in fragmentation or increase in CPU usage...

FWIW, SFS is just one workload that produces fragmentation.  Any
load that mixes or switches repeatedly between filesystem traversals
and producing memory pressure via the page cache tends to result in
fragmentation of the inode and dentry slabs...

> Whatever happened to the vfs_cache_pressure band-aid/sledgehammer?
> Is it not considered an option?

All that did was increase the fragmentation levels. Instead of
seeing a 4-5:1 free/used ratio in the dcache, it would push out to
10-15:1 if vfs_cache_pressure was used to prefer reclaiming dentries
over page cache pages. Going the other way and preferring reclaim of
page cache pages did nothing to change the level of fragmentation.
Reclaim still freed most of the dentries in the working set but it
took a little longer to do it.
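
For reference, all that knob changes is the number of reclaimable
dentries the dcache shrinker reports back to the VM. Roughly, from
memory of the current code:

static int shrink_dcache_memory(int nr, unsigned int gfp_mask)
{
	if (nr) {
		if (!(gfp_mask & __GFP_FS))
			return -1;
		prune_dcache(nr);
	}
	/* sysctl_vfs_cache_pressure defaults to 100, i.e. unscaled */
	return (dentry_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
}

It only biases how hard the VM pushes on the cache as a whole; it says
nothing about which dentries get freed, so the fragmentation pattern
is unchanged.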

Right now our only solution to prevent fragmentation on reclaim is
to throw more memory at the machine to prevent reclaim from
happening as the workload changes.

Cheers,

Dave.
-- 
Dave Chinner
R&D Software Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: VM balancing issues on 2.6.13: dentry cache not getting shrunk enough
  2005-09-14 22:02           ` David Chinner
@ 2005-09-14 22:40             ` Sonny Rao
  2005-09-15  1:14               ` David Chinner
  0 siblings, 1 reply; 32+ messages in thread
From: Sonny Rao @ 2005-09-14 22:40 UTC (permalink / raw)
  To: David Chinner
  Cc: Bharata B Rao, Theodore Ts'o, Dipankar Sarma, linux-mm, linux-kernel

On Thu, Sep 15, 2005 at 08:02:22AM +1000, David Chinner wrote:
> On Wed, Sep 14, 2005 at 11:48:52AM -0400, Sonny Rao wrote:
> > On Wed, Sep 14, 2005 at 07:59:32AM +1000, David Chinner wrote:
> > > On Tue, Sep 13, 2005 at 02:17:52PM +0530, Bharata B Rao wrote:
> > > > 
> > > > Second is Sonny Rao's rbtree dentry reclaim patch which is an attempt
> > > > to improve this dcache fragmentation problem.
> > > 
> > > FYI, in the past I've tried this patch to reduce dcache fragmentation on
> > > an Altix (16k pages, 62 dentries to a slab page) under heavy
> > > fileserver workloads and it had no measurable effect. It appeared
> > > that there was almost always at least one active dentry on each page
> > > in the slab.  The story may very well be different on 4k page
> > > machines, however.
> 
> ....
> 
> > I'm not surprised... With 62 dentries per page, the likelihood of
> > success is very small, and in fact performance could degrade since we
> > are holding the dcache lock more often and doing less useful work.
> > 
> > It has been over a year and my memory is hazy, but I think I did see
> > about a 10% improvement on my workload (some sort of SFS simulation
> > with millions of files being randomly accessed)  on an x86 machine but CPU
> > utilization also went way up which I think was the dcache lock.
> 
> Hmmm - can't say that I've had the same experience. I did not notice
> any decrease in fragmentation or increase in CPU usage...

Well, this was on an x86 machine with 8 cores but relatively poor
scalability and horrific memory latencies ... i.e. it tends to
exaggerate the effects of bad locks compared to what I would see on a
more scalable POWER machine.  We actually ran SFS on a 4-way POWER-5
machine with the patch and didn't see any real change in throughput,
and fragmentation was a little better.  I can go dig out the data if
someone is really interested.

In your case with 62 dentry objects per page (which is only going to
get much worse as we bump up base page sizes), I think the chances of
success of this approach or anything similar are horrible because we
aren't really solving any of the fundamental issues.

For me, the patch was mostly an experiment to see if the "blunderbuss"
effect (to quote mjb) could be controlled any better than we do
today.  Mostly, it didn't seem worth it to me -- especially since we
wanted the global dcache lock to go away. 
 
> FWIW, SFS is just one workload that produces fragmentation.  Any
> load that mixes or switches repeatedly between filesystem traversals
> and producing memory pressure via the page cache tends to result in
> fragmentation of the inode and dentry slabs...

Yep, and that's more or less how I "simulated" SFS, just had tons of
small files.  I wasn't trying to really simulate the networking part
or op mixture etc -- just the slab fragmentation as a "worst-case".

> > Whatever happened to the vfs_cache_pressure band-aid/sledgehammer?
> > Is it not considered an option?
> 
> All that did was increase the fragmentation levels. Instead of
> seeing a 4-5:1 free/used ratio in the dcache, it would push out to
> 10-15:1 if vfs_cache_pressure was used to prefer reclaiming dentries
> over page cache pages. Going the other way and preferring reclaim of
> page cache pages did nothing to change the level of fragmentation.
> Reclaim still freed most of the dentries in the working set but it
> took a little longer to do it.

Yes, but on systems with smaller pages it does seem to have some
positive effect.  I don't really know how well this has been
quantified. 
 
> Right now our only solution to prevent fragmentation on reclaim is
> to throw more memory at the machine to prevent reclaim from
> happening as the workload changes.

That is unfortunate, but interesting, because I didn't know whether this
was a "real problem"; some have contended it isn't.  I know SPEC SFS is a
somewhat questionable workload (really, what isn't though?), so the
evidence gathered from that didn't seem to convince many people.  

What kind of (real) workload are you seeing this on?

Thanks,

Sonny


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: VM balancing issues on 2.6.13: dentry cache not getting shrunk enough
  2005-09-14  9:43             ` Andrew Morton
  2005-09-14  9:52               ` Dipankar Sarma
@ 2005-09-14 22:44               ` Theodore Ts'o
  1 sibling, 0 replies; 32+ messages in thread
From: Theodore Ts'o @ 2005-09-14 22:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Manfred Spraul, ak, dgc, bharata, dipankar, linux-mm, linux-kernel

On Wed, Sep 14, 2005 at 02:43:13AM -0700, Andrew Morton wrote:
> Manfred Spraul <manfred@colorfullife.com> wrote:
> >
> > One tricky point are directory dentries: As far as I see, they are 
> >  pinned and unfreeable if a (freeable) directory entry is in the cache.
> >
> 
> Well.  That's the whole problem.
> 
> I don't think it's been demonstrated that Ted's problem was caused by
> internal fragmentation, btw.  Ted, could you run slabtop, see what the
> dcache occupancy is?  Monitor it as you start to manually apply pressure? 
> If the occupancy falls to 10% and not many slab pages are freed up yet then
> yup, it's internal fragmentation.

The next time I can get my machine into that state, sure, I'll try it.
I used to be able to reproduce it using normal laptop usage patterns
(Lotus notes running under wine, kernel builds, apt-get upgrades,
openoffice, firefox, etc.)  about twice a week with 2.6.13-rc5, but
with 2.6.13, it happened once or twice, but since then I haven't been
able to trigger it.  (Predictably, not after I posted about it on
LKML.  :-/)

I've been trying a few things in the hopes of deliberately triggering
it, but so far, no luck.  Maybe I should go back to 2.6.13-rc5 and see
if I have an easier time of reproducing the failure case.

						- Ted

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: VM balancing issues on 2.6.13: dentry cache not getting shrunk enough
  2005-09-14  9:01         ` Andi Kleen
                             ` (2 preceding siblings ...)
  2005-09-14 13:57           ` Martin J. Bligh
@ 2005-09-14 22:48           ` David Chinner
  3 siblings, 0 replies; 32+ messages in thread
From: David Chinner @ 2005-09-14 22:48 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Bharata B Rao, Theodore Ts'o, Dipankar Sarma, linux-mm,
	linux-kernel, manfred

On Wed, Sep 14, 2005 at 11:01:15AM +0200, Andi Kleen wrote:
> On Tuesday 13 September 2005 23:59, David Chinner wrote:
> > On Tue, Sep 13, 2005 at 02:17:52PM +0530, Bharata B Rao wrote:
> > > Second is Sonny Rao's rbtree dentry reclaim patch which is an attempt
> > > to improve this dcache fragmentation problem.
> >
> > FYI, in the past I've tried this patch to reduce dcache fragmentation on
> > an Altix (16k pages, 62 dentries to a slab page) under heavy
> > fileserver workloads and it had no measurable effect. It appeared
> > that there was almost always at least one active dentry on each page
> > in the slab.  The story may very well be different on 4k page
> > machines, however.
> 
> I always thought dentry freeing would work much better if it
> was turned upside down.
> 
> Instead of starting from the high level dcache lists it could
> be driven by slab: on memory pressure slab tries to return pages with unused 
> cache objects. In that case it should check if there are only
> a small number of pinned objects on the page set left, and if 
> yes use a new callback to the higher level user (=dcache) and ask them
> to free the object.

If you add a slab free object callback, then you have the beginnings
of a more flexible solution to memory reclaim from the slabs.

For example, you can easily implement a reclaim-not-allocate method
for new slab allocations, for when there is no memory available or the
size of the slab has passed some configurable high water mark...
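
i.e. something like the following, where every name is invented:

/* per-object reclaim callback, supplied by the cache owner (dcache) */
typedef int (*kmem_reclaim_t)(kmem_cache_t *cachep, void *obj);

/*
 * Invented registration hook: slab would invoke reclaim_obj on the
 * few pinned objects of a mostly-empty backing page, or refuse new
 * allocations once the cache grows past max_objects.
 */
int kmem_cache_set_reclaim(kmem_cache_t *cachep,
			   kmem_reclaim_t reclaim_obj,
			   unsigned long max_objects);

The dcache side would then be a callback that tries to prune exactly
that one dentry and reports back whether the page can now go.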

Right now there is no way to control the size of a slab cache.  Part
of the reason for the fragmentation I have seen is the massive
changes in size of the caches due to the OS making wrong decisions
about memory reclaim when small changes in the workload occur. We
currently have no way to provide hints to help the OS make the right
decision for a given workload....

Cheers,

Dave.
-- 
Dave Chinner
R&D Software Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: VM balancing issues on 2.6.13: dentry cache not getting shrunk enough
  2005-09-13  8:47     ` Bharata B Rao
  2005-09-13 21:59       ` David Chinner
  2005-09-14 21:34       ` Marcelo Tosatti
@ 2005-09-14 23:08       ` Marcelo Tosatti
  2005-09-15  9:39         ` Bharata B Rao
  2 siblings, 1 reply; 32+ messages in thread
From: Marcelo Tosatti @ 2005-09-14 23:08 UTC (permalink / raw)
  To: Bharata B Rao; +Cc: Theodore Ts'o, Dipankar Sarma, linux-mm, linux-kernel

On Tue, Sep 13, 2005 at 02:17:52PM +0530, Bharata B Rao wrote:
> On Sun, Sep 11, 2005 at 11:16:36PM -0400, Theodore Ts'o wrote:
> > On Sun, Sep 11, 2005 at 05:30:46PM +0530, Dipankar Sarma wrote:
> > > Do you have the /proc/sys/fs/dentry-state output when such lowmem
> > > shortage happens ?
> > 
> > Not yet, but the situation occurs on my laptop about 2 or 3 times
> > (when I'm not travelling and so it doesn't get rebooted).  So
> > reproducing it isn't utterly trivial, but it does happen often
> > enough that it should be possible to get the necessary data.
> > 
> > > This is a problem that Bharata has been investigating at the moment.
> > > But he hasn't seen anything that can't be cured by a small memory
> > > pressure - IOW, dentries do get freed under memory pressure. So
> > > your case might be very useful. Bharata is maintaining an instrumentation
> > > patch to collect more information and an alternative dentry aging patch 
> > > (using rbtree). Perhaps you could try with those.
> > 
> > Send it to me, and I'd be happy to try either the instrumentation
> > patch or the dentry aging patch.
> > 
> 
> Ted,
> 
> I am sending two patches here.
> 
> First is dentry_stats patch which collects some dcache statistics
> and puts it into /proc/meminfo. This patch provides information 
> about how dentries are distributed in dcache slab pages, how many
> free and in use dentries are present in dentry_unused lru list and
> how prune_dcache() performs with respect to freeing the requested
> number of dentries.

Bharata, 

Ideally one should move the "nr_requested/nr_freed" counters from your
stats patch into "struct shrinker" (or somewhere else more appropriate
in which per-shrinkable-cache stats are maintained), and use the
"mod_page_state" infrastructure to do lockless per-CPU accounting. ie.
break /proc/vmstats's "slabs_scanned" apart in meaningful pieces.

IMO something along that line should be merged into mainline to walk
away from the "what the fuck is going on" state of things.
 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: VM balancing issues on 2.6.13: dentry cache not getting shrunk enough
  2005-09-14 22:40             ` Sonny Rao
@ 2005-09-15  1:14               ` David Chinner
  0 siblings, 0 replies; 32+ messages in thread
From: David Chinner @ 2005-09-15  1:14 UTC (permalink / raw)
  To: Sonny Rao
  Cc: Bharata B Rao, Theodore Ts'o, Dipankar Sarma, linux-mm, linux-kernel

On Wed, Sep 14, 2005 at 06:40:40PM -0400, Sonny Rao wrote:
> On Thu, Sep 15, 2005 at 08:02:22AM +1000, David Chinner wrote:
> > Right now our only solution to prevent fragmentation on reclaim is
> > to throw more memory at the machine to prevent reclaim from
> > happening as the workload changes.
> 
> That is unfortunate, but interesting, because I didn't know whether this
> was a "real problem"; some have contended it isn't.  I know SPEC SFS is a
> somewhat questionable workload (really, what isn't though?), so the
> evidence gathered from that didn't seem to convince many people.  
> 
> What kind of (real) workload are you seeing this on?

Nothing special. Here's an example from a local Altix build
server (8p, 12GiB RAM):

linvfs_icache     3376574 3891360    672   24    1 : tunables   54   27    8 : slabdata 162140 162140      0
dentry_cache      2632811 3007186    256   62    1 : tunables  120   60    8 : slabdata  48503  48503      0

I just copied and untarred some stuff I need to look at (~2GiB
data) and when that completed we now have:

linvfs_icache     590840 2813328    672   24    1 : tunables   54   27    8 : slabdata 117222 117222
dentry_cache      491984 2717708    256   62    1 : tunables  120   60    8 : slabdata  43834  43834

A few minutes later, with people doing normal work (rsync, kernel and
userspace package builds, tar, etc), a bit more had been reclaimed:

linvfs_icache     580589 2797992    672   24    1 : tunables   54   27    8 : slabdata 116583 116583      0
dentry_cache      412009 2418558    256   62    1 : tunables  120   60    8 : slabdata  39009  39009      0

We started with ~2.9GiB of active slab objects in ~210k pages
(3.3GiB RAM) in these two slabs. We've trimmed their active size
down to ~500MiB, but we still have 155k pages (2.5GiB) allocated to
the slabs. 

I've seen much worse than this on build servers with more memory and
larger filesystems, especially after the filesystems have been
crawled by a backup program over night and we've ended up with > 10
million objects in each of these caches. 

Cheers,

Dave.
-- 
Dave Chinner
R&D Software Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: VM balancing issues on 2.6.13: dentry cache not getting shrunk enough
  2005-09-14 21:34       ` Marcelo Tosatti
  2005-09-14 21:43         ` Dipankar Sarma
@ 2005-09-15  4:28         ` Bharata B Rao
  1 sibling, 0 replies; 32+ messages in thread
From: Bharata B Rao @ 2005-09-15  4:28 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Theodore Ts'o, Dipankar Sarma, linux-mm, linux-kernel

On Wed, Sep 14, 2005 at 06:34:04PM -0300, Marcelo Tosatti wrote:
> On Tue, Sep 13, 2005 at 02:17:52PM +0530, Bharata B Rao wrote:
> > On Sun, Sep 11, 2005 at 11:16:36PM -0400, Theodore Ts'o wrote:
> > > On Sun, Sep 11, 2005 at 05:30:46PM +0530, Dipankar Sarma wrote:
> > > > Do you have the /proc/sys/fs/dentry-state output when such lowmem
> > > > shortage happens ?
> > > 
> > > Not yet, but the situation occurs on my laptop about 2 or 3 times
> > > (when I'm not travelling and so it doesn't get rebooted).  So
> > > reproducing it isn't utterly trivial, but it does happen often
> > > enough that it should be possible to get the necessary data.
> > > 
> > > > This is a problem that Bharata has been investigating at the moment.
> > > > But he hasn't seen anything that can't be cured by a small memory
> > > > pressure - IOW, dentries do get freed under memory pressure. So
> > > > your case might be very useful. Bharata is maintaining an instrumentation
> > > > patch to collect more information and an alternative dentry aging patch 
> > > > (using rbtree). Perhaps you could try with those.
> > > 
> > > Send it to me, and I'd be happy to try either the instrumentation
> > > patch or the dentry aging patch.
> > > 
> > 
> > Ted,
> > 
> > I am sending two patches here.
> > 
> > First is dentry_stats patch which collects some dcache statistics
> > and puts it into /proc/meminfo. This patch provides information 
> > about how dentries are distributed in dcache slab pages, how many
> > free and in use dentries are present in dentry_unused lru list and
> > how prune_dcache() performs with respect to freeing the requested
> > number of dentries.
> 
> Hi Bharata,
> 
> +void get_dstat_info(void)
> +{
> +       struct dentry *dentry;
> +
> +       lru_dentry_stat.nr_total = lru_dentry_stat.nr_inuse = 0;
> +       lru_dentry_stat.nr_ref = lru_dentry_stat.nr_free = 0;
> +
> +       spin_lock(&dcache_lock);
> +       list_for_each_entry(dentry, &dentry_unused, d_lru) {
> +               if (atomic_read(&dentry->d_count))
> +                       lru_dentry_stat.nr_inuse++;
> 
> Dentries on dentry_unused list with d_count positive? Is that possible 
> at all? As far as my limited understanding goes, only dentries with zero 
> count can be part of the dentry_unused list.

As Dipankar mentioned, it's now possible to have positive d_count dentries
on the unused_list. BTW I think we need a better way to get this data than
walking the entire unused_list linearly, which might not be
scalable with a huge number of dentries.

> 
> +               if (dentry->d_flags & DCACHE_REFERENCED)
> +                       lru_dentry_stat.nr_ref++;
> +       }
> 
> 
> @@ -393,6 +430,9 @@ static inline void prune_one_dentry(stru
> 
>  static void prune_dcache(int count)
>  {
> +       int nr_requested = count;
> +       int nr_freed = 0;
> +
>         spin_lock(&dcache_lock);
>         for (; count ; count--) {
>                 struct dentry *dentry;
> @@ -427,8 +467,13 @@ static void prune_dcache(int count)
>                         continue;
>                 }
>                 prune_one_dentry(dentry);
> +               nr_freed++;
>         }
>         spin_unlock(&dcache_lock);
> +       spin_lock(&prune_dcache_lock);
> +       lru_dentry_stat.dprune_req = nr_requested;
> +       lru_dentry_stat.dprune_freed = nr_freed;
> 
> Don't you mean "+=" ? 

No. Actually here I am capturing the number of dentries freed
per invocation of prune_dcache.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: VM balancing issues on 2.6.13: dentry cache not getting shrunk enough
  2005-09-14 13:57           ` Martin J. Bligh
  2005-09-14 15:37             ` Sonny Rao
@ 2005-09-15  7:21             ` Helge Hafting
  1 sibling, 0 replies; 32+ messages in thread
From: Helge Hafting @ 2005-09-15  7:21 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Andi Kleen, David Chinner, Bharata B Rao, Theodore Ts'o,
	Dipankar Sarma, linux-mm, linux-kernel, manfred

Martin J. Bligh wrote:

>
>If they're freeable, we should easily be able to move them, and therefore 
>compact a fragmented slab. That way we can preserve the LRU'ness of it.
>Stage 1: free the oldest entries. Stage 2: compact the slab into whole
>pages. Stage 3: free whole pages back to the page allocator.
>  
>
That seems like the perfect solution to me.  Freeing up 95% or more
gives us clean pages - and moving instead of actually freeing
everything avoids the cost of repopulating the cache later. :-)

Helge Hafting

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: VM balancing issues on 2.6.13: dentry cache not getting shrunk enough
  2005-09-14 23:08       ` Marcelo Tosatti
@ 2005-09-15  9:39         ` Bharata B Rao
  2005-09-15 13:29           ` Marcelo Tosatti
  0 siblings, 1 reply; 32+ messages in thread
From: Bharata B Rao @ 2005-09-15  9:39 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Theodore Ts'o, Dipankar Sarma, linux-mm, linux-kernel

On Wed, Sep 14, 2005 at 08:08:43PM -0300, Marcelo Tosatti wrote:
> On Tue, Sep 13, 2005 at 02:17:52PM +0530, Bharata B Rao wrote:
> > 
<snip>
> > First is dentry_stats patch which collects some dcache statistics
> > and puts it into /proc/meminfo. This patch provides information 
> > about how dentries are distributed in dcache slab pages, how many
> > free and in use dentries are present in dentry_unused lru list and
> > how prune_dcache() performs with respect to freeing the requested
> > number of dentries.
> 
> Bharata, 
> 
> Ideally one should move the "nr_requested/nr_freed" counters from your
> stats patch into "struct shrinker" (or somewhere else more appropriate
> in which per-shrinkable-cache stats are maintained), and use the
> "mod_page_state" infrastructure to do lockless per-CPU accounting. ie.
> break /proc/vmstats's "slabs_scanned" apart in meaningful pieces.

Yes, I agree that we should have the nr_requested and nr_freed type of
counters in an appropriate place. And "struct shrinker" is probably the
right place for them.

Essentially you are suggesting that we maintain per cpu statistics
of 'requested to free'(scanned) slab objects and actual freed objects.
And this should be on per shrinkable cache basis.

Is it ok to maintain these requested/freed counters as growing counters,
or would it make more sense to have them reflect the statistics from
the latest/last attempt of cache shrink? And where would be the right
place to export this information? (/proc/slabinfo?, since it already
gives details of all caches)

If I understand correctly, "slabs_scanned" is the sum total number
of objects from all shrinkable caches scanned for possible freeing.
I didn't get why this is part of page_state which mostly includes
page related statistics.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: VM balancing issues on 2.6.13: dentry cache not getting shrunk enough
  2005-09-15  9:39         ` Bharata B Rao
@ 2005-09-15 13:29           ` Marcelo Tosatti
  2005-10-02 16:32             ` Bharata B Rao
  0 siblings, 1 reply; 32+ messages in thread
From: Marcelo Tosatti @ 2005-09-15 13:29 UTC (permalink / raw)
  To: Bharata B Rao; +Cc: Theodore Ts'o, Dipankar Sarma, linux-mm, linux-kernel

On Thu, Sep 15, 2005 at 03:09:45PM +0530, Bharata B Rao wrote:
> On Wed, Sep 14, 2005 at 08:08:43PM -0300, Marcelo Tosatti wrote:
> > On Tue, Sep 13, 2005 at 02:17:52PM +0530, Bharata B Rao wrote:
> > > 
> <snip>
> > > First is dentry_stats patch which collects some dcache statistics
> > > and puts it into /proc/meminfo. This patch provides information 
> > > about how dentries are distributed in dcache slab pages, how many
> > > free and in use dentries are present in dentry_unused lru list and
> > > how prune_dcache() performs with respect to freeing the requested
> > > number of dentries.
> > 
> > Bharata, 
> > 
> > Ideally one should move the "nr_requested/nr_freed" counters from your
> > stats patch into "struct shrinker" (or somewhere else more appropriate
> > in which per-shrinkable-cache stats are maintained), and use the
> > "mod_page_state" infrastructure to do lockless per-CPU accounting. ie.
> > break /proc/vmstats's "slabs_scanned" apart in meaningful pieces.
> 
> Yes, I agree that we should have the nr_requested and nr_freed type of
> counters in an appropriate place. And "struct shrinker" is probably the
> right place for them.
> 
> Essentially you are suggesting that we maintain per cpu statistics
> of 'requested to free'(scanned) slab objects and actual freed objects.
> And this should be on per shrinkable cache basis.

Yep. 

> Is it ok to maintain these requested/freed counters as growing counters,
> or would it make more sense to have them reflect the statistics from
> the latest/last attempt of cache shrink?

It makes a lot more sense to account for all shrink attempts: it is necessary
to know how the reclaiming process is behaving over time. That's why I wondered
about using "=" instead of "+=" in your patch.

> And where would be the right place to export this information?
> (/proc/slabinfo?, since it already gives details of all caches)

My feeling is that changing /proc/slabinfo format might break userspace
applications.

> If I understand correctly, "slabs_scanned" is the sum total number
> of objects from all shrinkable caches scanned for possible freeing.

Yep.

> I didn't get why this is part of page_state which mostly includes
> page related statistics.

Well, page_state contains most of the reclaiming statistics - its scope
is broader than "struct page" information.

To me it seems like the best place.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: VM balancing issues on 2.6.13: dentry cache not getting shrunk enough
  2005-09-15 13:29           ` Marcelo Tosatti
@ 2005-10-02 16:32             ` Bharata B Rao
  2005-10-02 20:06               ` Marcelo
  0 siblings, 1 reply; 32+ messages in thread
From: Bharata B Rao @ 2005-10-02 16:32 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Theodore Ts'o, Dipankar Sarma, linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3630 bytes --]

On Thu, Sep 15, 2005 at 10:29:10AM -0300, Marcelo Tosatti wrote:
> On Thu, Sep 15, 2005 at 03:09:45PM +0530, Bharata B Rao wrote:
> > On Wed, Sep 14, 2005 at 08:08:43PM -0300, Marcelo Tosatti wrote:
> > > On Tue, Sep 13, 2005 at 02:17:52PM +0530, Bharata B Rao wrote:
> > > > 
> > <snip>
> > > > First is dentry_stats patch which collects some dcache statistics
> > > > and puts it into /proc/meminfo. This patch provides information 
> > > > about how dentries are distributed in dcache slab pages, how many
> > > > free and in use dentries are present in dentry_unused lru list and
> > > > how prune_dcache() performs with respect to freeing the requested
> > > > number of dentries.
> > > 
> > > Bharata, 
> > > 
> > > Ideally one should move the "nr_requested/nr_freed" counters from your
> > > stats patch into "struct shrinker" (or somewhere else more appropriate
> > > in which per-shrinkable-cache stats are maintained), and use the
> > > "mod_page_state" infrastructure to do lockless per-CPU accounting. ie.
> > > break /proc/vmstats's "slabs_scanned" apart in meaningful pieces.
> > 
> > Yes, I agree that we should have the nr_requested and nr_freed type of
> > counters in an appropriate place. And "struct shrinker" is probably the
> > right place for them.
> > 
> > Essentially you are suggesting that we maintain per cpu statistics
> > of 'requested to free'(scanned) slab objects and actual freed objects.
> > And this should be on per shrinkable cache basis.
> 
> Yep. 
> 
> > Is it ok to maintain these requested/freed counters as growing counters,
> > or would it make more sense to have them reflect the statistics from
> > the latest/last attempt of cache shrink?
> 
> It makes a lot more sense to account for all shrink attempts: it is necessary
> to know how the reclaiming process is behaving over time. That's why I wondered
> about using "=" instead of "+=" in your patch.
> 
> > And where would be the right place to export this information?
> > (/proc/slabinfo?, since it already gives details of all caches)
> 
> My feeling is that changing /proc/slabinfo format might break userspace
> applications.
> 
> > If I understand correctly, "slabs_scanned" is the sum total number
> > of objects from all shrinkable caches scanned for possible freeing.
> 
> Yep.
> 
> > I didn't get why this is part of page_state which mostly includes
> > page related statistics.
> 
> Well, page_state contains most of the reclaiming statistics - its scope
> is broader than "struct page" information.
> 
> To me it seems like the best place.
> 

Marcelo,

The attached patch is an attempt to break the "slabs_scanned" into
meaningful pieces as you suggested.

But I couldn't do this cleanly because kmem_cache_t isn't defined
in a .h file and I didn't want to touch too many files in the first
attempt.

What I am doing here is making the "requested to free" and
"actual freed" counters as part of struct shrinker. With this I can
update these statistics seamlessly from shrink_slab().

I don't have these as per-CPU counters because I wasn't sure if shrink_slab()
would have many concurrent executions warranting lockless per-CPU
counters for these.

I am displaying this information as part of /proc/slabinfo and I have
verified that it at least isn't breaking slabtop.

I thought about having this as part of /proc/vmstat and using
mod_page_state infrastructure as you suggested, but having the
"requested to free" and "actual freed" counters in struct page_state
for only those caches which set the shrinker function didn't look
good.

If you think that all this can be done in a better way, please
let me know.

Regards,
Bharata.

[-- Attachment #2: cache_shrink_stats.patch --]
[-- Type: text/plain, Size: 6764 bytes --]



Signed-off-by: Bharata B Rao <bharata@in.ibm.com>
---

 fs/dcache.c          |    4 +++-
 fs/dquot.c           |    4 +++-
 fs/inode.c           |    4 +++-
 include/linux/mm.h   |   15 ++++++++++++++-
 include/linux/slab.h |    3 +++
 mm/slab.c            |   14 ++++++++++++++
 mm/vmscan.c          |   19 +++++++------------
 7 files changed, 47 insertions(+), 16 deletions(-)

diff -puN mm/vmscan.c~cache_shrink_stats mm/vmscan.c
--- linux-2.6.14-rc2-shrink/mm/vmscan.c~cache_shrink_stats	2005-09-28 11:17:01.508944136 +0530
+++ linux-2.6.14-rc2-shrink-bharata/mm/vmscan.c	2005-09-28 17:18:57.799566152 +0530
@@ -84,17 +84,6 @@ struct scan_control {
 	int swap_cluster_max;
 };
 
-/*
- * The list of shrinker callbacks used by to apply pressure to
- * ageable caches.
- */
-struct shrinker {
-	shrinker_t		shrinker;
-	struct list_head	list;
-	int			seeks;	/* seeks to recreate an obj */
-	long			nr;	/* objs pending delete */
-};
-
 #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
 
 #ifdef ARCH_HAS_PREFETCH
@@ -146,6 +135,8 @@ struct shrinker *set_shrinker(int seeks,
 	        shrinker->shrinker = theshrinker;
 	        shrinker->seeks = seeks;
 	        shrinker->nr = 0;
+		atomic_set(&shrinker->nr_req, 0);
+		atomic_set(&shrinker->nr_freed, 0);
 	        down_write(&shrinker_rwsem);
 	        list_add_tail(&shrinker->list, &shrinker_list);
 	        up_write(&shrinker_rwsem);
@@ -221,9 +212,13 @@ static int shrink_slab(unsigned long sca
 			shrink_ret = (*shrinker->shrinker)(this_scan, gfp_mask);
 			if (shrink_ret == -1)
 				break;
-			if (shrink_ret < nr_before)
+			if (shrink_ret < nr_before) {
 				ret += nr_before - shrink_ret;
+				atomic_add(nr_before - shrink_ret,
+					&shrinker->nr_freed);
+			}
 			mod_page_state(slabs_scanned, this_scan);
+			atomic_add(this_scan, &shrinker->nr_req);
 			total_scan -= this_scan;
 
 			cond_resched();
diff -puN fs/inode.c~cache_shrink_stats fs/inode.c
--- linux-2.6.14-rc2-shrink/fs/inode.c~cache_shrink_stats	2005-09-28 11:25:58.000000000 +0530
+++ linux-2.6.14-rc2-shrink-bharata/fs/inode.c	2005-09-28 14:02:24.422431992 +0530
@@ -1357,11 +1357,13 @@ void __init inode_init_early(void)
 void __init inode_init(unsigned long mempages)
 {
 	int loop;
+	struct shrinker *shrinker;
 
 	/* inode slab cache */
 	inode_cachep = kmem_cache_create("inode_cache", sizeof(struct inode),
 				0, SLAB_RECLAIM_ACCOUNT|SLAB_PANIC, init_once, NULL);
-	set_shrinker(DEFAULT_SEEKS, shrink_icache_memory);
+	shrinker = set_shrinker(DEFAULT_SEEKS, shrink_icache_memory);
+	kmem_set_shrinker(inode_cachep, shrinker);
 
 	/* Hash may have been set up in inode_init_early */
 	if (!hashdist)
diff -puN fs/dquot.c~cache_shrink_stats fs/dquot.c
--- linux-2.6.14-rc2-shrink/fs/dquot.c~cache_shrink_stats	2005-09-28 11:28:51.000000000 +0530
+++ linux-2.6.14-rc2-shrink-bharata/fs/dquot.c	2005-09-28 14:06:13.197652872 +0530
@@ -1793,6 +1793,7 @@ static int __init dquot_init(void)
 {
 	int i;
 	unsigned long nr_hash, order;
+	struct shrinker *shrinker;
 
 	printk(KERN_NOTICE "VFS: Disk quotas %s\n", __DQUOT_VERSION__);
 
@@ -1824,7 +1825,8 @@ static int __init dquot_init(void)
 	printk("Dquot-cache hash table entries: %ld (order %ld, %ld bytes)\n",
 			nr_hash, order, (PAGE_SIZE << order));
 
-	set_shrinker(DEFAULT_SEEKS, shrink_dqcache_memory);
+	shrinker = set_shrinker(DEFAULT_SEEKS, shrink_dqcache_memory);
+	kmem_set_shrinker(dquot_cachep, shrinker);
 
 	return 0;
 }
diff -puN fs/dcache.c~cache_shrink_stats fs/dcache.c
--- linux-2.6.14-rc2-shrink/fs/dcache.c~cache_shrink_stats	2005-09-28 11:31:35.000000000 +0530
+++ linux-2.6.14-rc2-shrink-bharata/fs/dcache.c	2005-09-28 13:47:46.507895288 +0530
@@ -1668,6 +1668,7 @@ static void __init dcache_init_early(voi
 static void __init dcache_init(unsigned long mempages)
 {
 	int loop;
+	struct shrinker *shrinker;
 
 	/* 
 	 * A constructor could be added for stable state like the lists,
@@ -1680,7 +1681,8 @@ static void __init dcache_init(unsigned 
 					 SLAB_RECLAIM_ACCOUNT|SLAB_PANIC,
 					 NULL, NULL);
 	
-	set_shrinker(DEFAULT_SEEKS, shrink_dcache_memory);
+	shrinker = set_shrinker(DEFAULT_SEEKS, shrink_dcache_memory);
+	kmem_set_shrinker(dentry_cache, shrinker);
 
 	/* Hash may have been set up in dcache_init_early */
 	if (!hashdist)
diff -puN mm/slab.c~cache_shrink_stats mm/slab.c
--- linux-2.6.14-rc2-shrink/mm/slab.c~cache_shrink_stats	2005-09-28 11:40:00.285338264 +0530
+++ linux-2.6.14-rc2-shrink-bharata/mm/slab.c	2005-09-28 14:26:52.187297816 +0530
@@ -400,6 +400,9 @@ struct kmem_cache_s {
 	/* de-constructor func */
 	void (*dtor)(void *, kmem_cache_t *, unsigned long);
 
+	/* shrinker data for this cache */
+	struct shrinker *shrinker;
+
 /* 4) cache creation/removal */
 	const char		*name;
 	struct list_head	next;
@@ -3483,6 +3486,12 @@ static int s_show(struct seq_file *m, vo
 			allochit, allocmiss, freehit, freemiss);
 	}
 #endif
+	/* shrinker stats */
+	if (cachep->shrinker) {
+		seq_printf(m, " : shrinker stat %7lu %7lu",
+			atomic_read(&cachep->shrinker->nr_req),
+			atomic_read(&cachep->shrinker->nr_freed));
+	}
 	seq_putc(m, '\n');
 	spin_unlock_irq(&cachep->spinlock);
 	return 0;
@@ -3606,3 +3615,8 @@ char *kstrdup(const char *s, unsigned in
 	return buf;
 }
 EXPORT_SYMBOL(kstrdup);
+
+void kmem_set_shrinker(kmem_cache_t *cachep, struct shrinker *shrinker)
+{
+	cachep->shrinker = shrinker;
+}
diff -puN include/linux/mm.h~cache_shrink_stats include/linux/mm.h
--- linux-2.6.14-rc2-shrink/include/linux/mm.h~cache_shrink_stats	2005-09-28 12:41:09.664507840 +0530
+++ linux-2.6.14-rc2-shrink-bharata/include/linux/mm.h	2005-09-28 12:41:46.014981728 +0530
@@ -755,7 +755,20 @@ typedef int (*shrinker_t)(int nr_to_scan
  */
 
 #define DEFAULT_SEEKS 2
-struct shrinker;
+
+/*
+ * The list of shrinker callbacks used by to apply pressure to
+ * ageable caches.
+ */
+struct shrinker {
+	shrinker_t		shrinker;
+	struct list_head	list;
+	int			seeks;	/* seeks to recreate an obj */
+	long			nr;	/* objs pending delete */
+	atomic_t		nr_req; /* objs scanned for possible freeing */
+	atomic_t		nr_freed; /* actual number of objects freed */
+};
+
 extern struct shrinker *set_shrinker(int, shrinker_t);
 extern void remove_shrinker(struct shrinker *shrinker);
 
diff -puN include/linux/slab.h~cache_shrink_stats include/linux/slab.h
--- linux-2.6.14-rc2-shrink/include/linux/slab.h~cache_shrink_stats	2005-09-28 13:52:53.852171856 +0530
+++ linux-2.6.14-rc2-shrink-bharata/include/linux/slab.h	2005-09-28 14:07:42.127133536 +0530
@@ -147,6 +147,9 @@ extern kmem_cache_t	*bio_cachep;
 
 extern atomic_t slab_reclaim_pages;
 
+struct shrinker;
+extern void kmem_set_shrinker(kmem_cache_t *cachep, struct shrinker *shrinker);
+
 #endif	/* __KERNEL__ */
 
 #endif	/* _LINUX_SLAB_H */
_

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: VM balancing issues on 2.6.13: dentry cache not getting shrunk enough
  2005-10-02 16:32             ` Bharata B Rao
@ 2005-10-02 20:06               ` Marcelo
  2005-10-04 13:36                 ` shrinkable cache statistics [was Re: VM balancing issues on 2.6.13: dentry cache not getting shrunk enough] Bharata B Rao
  0 siblings, 1 reply; 32+ messages in thread
From: Marcelo @ 2005-10-02 20:06 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: Marcelo Tosatti, Theodore Ts'o, Dipankar Sarma, linux-mm,
	linux-kernel


Bharata,

On Sun, Oct 02, 2005 at 10:02:29PM +0530, Bharata B Rao wrote:
> 
> Marcelo,
> 
> The attached patch is an attempt to break the "slabs_scanned" into
> meaningful pieces as you suggested.
> 
> But I couldn't do this cleanly because kmem_cache_t isn't defined
> in a .h file and I didn't want to touch too many files in the first
> attempt.
> 
> What I am doing here is making the "requested to free" and
> "actual freed" counters as part of struct shrinker. With this I can
> update these statistics seamlessly from shrink_slab().
> 
> I don't have these as per-CPU counters because I wasn't sure if shrink_slab()
> would have many concurrent executions warranting lockless per-CPU
> counters for these.

Per-CPU counters are interesting because they avoid the atomic
operation _and_ potential cacheline bouncing. Given the fact that less
commonly used counters in the reclaim path are already per-CPU,
I think that it might be worth doing it here too.

> I am displaying this information as part of /proc/slabinfo and I have
> verified that it at least isn't breaking slabtop.
> 
> I thought about having this as part of /proc/vmstat and using
> mod_page_state infrastructure as you suggested, but having the
> "requested to free" and "actual freed" counters in struct page_state
> for only those caches which set the shrinker function didn't look
> good.

OK... You could change the atomic counters to per-CPU variables
in "struct shrinker".

> If you think that all this can be done in a better way, please
> let me know. 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* shrinkable cache statistics [was Re: VM balancing issues on 2.6.13: dentry cache not getting shrunk enough]
  2005-10-02 20:06               ` Marcelo
@ 2005-10-04 13:36                 ` Bharata B Rao
  2005-10-05 21:25                   ` Marcelo Tosatti
  0 siblings, 1 reply; 32+ messages in thread
From: Bharata B Rao @ 2005-10-04 13:36 UTC (permalink / raw)
  To: Marcelo; +Cc: Theodore Ts'o, Dipankar Sarma, linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 4574 bytes --]

Marcelo,

Here's my next attempt at breaking the "slabs_scanned" from /proc/vmstat
into meaningful per-cache statistics. Now I have the statistics counters
as per-CPU. [An issue remaining is that more than one cache is part
of mbcache and they all share a common shrinker routine, so I am
displaying the collective shrinker stats on each of them in
/proc/slabinfo ==> some duplication]

With this patch (and my earlier dcache stats patch) I observed some
interesting results with the following test scenario on a 8cpu p3 box:

- Ran an application which consumes 40% of the total memory.
- Ran dbench on tmpfs with 128 clients twice (serially).
- Ran a find on an ext3 partition having ~9.5 million entries (files and
  directories included)

At the end of this run, I have the following results:

[root@llm09 bharata]# cat /proc/meminfo
MemTotal:      3872528 kB
MemFree:       1420940 kB
Buffers:        714068 kB
Cached:          21536 kB
SwapCached:       2264 kB
Active:        1672680 kB
Inactive:       637460 kB
HighTotal:     3014616 kB
HighFree:      1411740 kB
LowTotal:       857912 kB
LowFree:          9200 kB
SwapTotal:     2096472 kB
SwapFree:      2051408 kB
Dirty:             172 kB
Writeback:           0 kB
Mapped:        1583680 kB
Slab:           119564 kB
CommitLimit:   4032736 kB
Committed_AS:  1647260 kB
PageTables:       2248 kB
VmallocTotal:   114680 kB
VmallocUsed:      1264 kB
VmallocChunk:   113384 kB
nr_dentries/page        nr_pages        nr_inuse
         0              0               0
         1              5               2
         2              12              4
         3              26              9
         4              46              18
         5              76              40
         6              82              47
         7              91              59
         8              122             93
         9              114             102
        10              142             136
        11              138             185
        12              118             164
        13              128             206
        14              126             208
        15              120             219
        16              136             261
        17              159             315
        18              145             311
        19              179             379
        20              192             407
        21              256             631
        22              286             741
        23              316             816
        24              342             934
        25              381             1177
        26              664             2813
        27              0               0
        28              0               0
        29              0               0
Total:                  4402            10277
dcache lru: total 75369 inuse 3599

[Here,
nr_dentries/page - Number of dentries per page
nr_pages - Number of pages with given number of dentries
nr_inuse - Number of inuse dentries in those pages.
E.g.: from the above data, there are 26 pages with 3 dentries each,
and of the 78 total dentries on those 26 pages, 9 are in use.]

[root@llm09 bharata]# grep shrinker /proc/slabinfo
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail> : shrinker stat <nr requested> <nr freed>
ext3_xattr             0      0     48   78    1 : tunables  120   60    8 : slabdata      0      0      0 : shrinker stat       0       0
dquot                  0      0    160   24    1 : tunables  120   60    8 : slabdata      0      0      0 : shrinker stat       0       0
inode_cache         1301   1390    400   10    1 : tunables   54   27    8 : slabdata    139    139      0 : shrinker stat  682752  681900
dentry_cache       82110 114452    152   26    1 : tunables  120   60    8 : slabdata   4402   4402      0 : shrinker stat 1557760  760100

[root@llm09 bharata]# grep slabs_scanned /proc/vmstat
slabs_scanned 2240512

[root@llm09 bharata]# cat /proc/sys/fs/dentry-state
82046   75369   45      0       3599    0
[The order of the dentry-state output is like this:
total dentries in dentry hash list, total dentries in lru list, age limit,
want_pages, inuse dentries in lru list, dummy]

So, we can see that under low memory pressure, even though the
shrinker runs on the dcache repeatedly, not many dentries are freed
from it. And the dcache lru list still holds a huge number of free
dentries.

Regards,
Bharata.

[-- Attachment #2: cache_shrink_stats.patch --]
[-- Type: text/plain, Size: 8730 bytes --]


This patch adds two more fields to each shrinkable cache entry
in /proc/slabinfo: the number of objects scanned for freeing and the
actual number of objects freed.
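
For example, in the dentry_cache line shown earlier in this thread,
"shrinker stat 1557760  760100" means the dcache shrinker was asked
to scan 1557760 dentries and actually freed 760100 of them.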

Signed-off-by: Bharata B Rao <bharata@in.ibm.com>
---

 fs/dcache.c          |    4 +++-
 fs/dquot.c           |    4 +++-
 fs/inode.c           |    4 +++-
 fs/mbcache.c         |    2 ++
 include/linux/mm.h   |   39 ++++++++++++++++++++++++++++++++++++++-
 include/linux/slab.h |    3 +++
 mm/slab.c            |   15 +++++++++++++++
 mm/vmscan.c          |   23 +++++++++++------------
 8 files changed, 78 insertions(+), 16 deletions(-)

diff -puN mm/vmscan.c~cache_shrink_stats mm/vmscan.c
--- linux-2.6.14-rc2-shrink/mm/vmscan.c~cache_shrink_stats	2005-09-28 11:17:01.000000000 +0530
+++ linux-2.6.14-rc2-shrink-bharata/mm/vmscan.c	2005-10-04 15:27:52.000000000 +0530
@@ -84,17 +84,6 @@ struct scan_control {
 	int swap_cluster_max;
 };
 
-/*
- * The list of shrinker callbacks used by to apply pressure to
- * ageable caches.
- */
-struct shrinker {
-	shrinker_t		shrinker;
-	struct list_head	list;
-	int			seeks;	/* seeks to recreate an obj */
-	long			nr;	/* objs pending delete */
-};
-
 #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
 
 #ifdef ARCH_HAS_PREFETCH
@@ -146,6 +135,11 @@ struct shrinker *set_shrinker(int seeks,
 	        shrinker->shrinker = theshrinker;
 	        shrinker->seeks = seeks;
 	        shrinker->nr = 0;
+		shrinker->s_stats = alloc_percpu(struct shrinker_stats);
+		if (!shrinker->s_stats) {
+			kfree(shrinker);
+			return NULL;
+		}
 	        down_write(&shrinker_rwsem);
 	        list_add_tail(&shrinker->list, &shrinker_list);
 	        up_write(&shrinker_rwsem);
@@ -162,6 +156,7 @@ void remove_shrinker(struct shrinker *sh
 	down_write(&shrinker_rwsem);
 	list_del(&shrinker->list);
 	up_write(&shrinker_rwsem);
+	free_percpu(shrinker->s_stats);
 	kfree(shrinker);
 }
 EXPORT_SYMBOL(remove_shrinker);
@@ -221,8 +216,12 @@ static int shrink_slab(unsigned long sca
 			shrink_ret = (*shrinker->shrinker)(this_scan, gfp_mask);
 			if (shrink_ret == -1)
 				break;
-			if (shrink_ret < nr_before)
+			if (shrink_ret < nr_before) {
 				ret += nr_before - shrink_ret;
+				shrinker_stat_add(shrinker, nr_freed,
+					(nr_before - shrink_ret));
+			}
+			shrinker_stat_add(shrinker, nr_req, this_scan);
 			mod_page_state(slabs_scanned, this_scan);
 			total_scan -= this_scan;
 
diff -puN fs/inode.c~cache_shrink_stats fs/inode.c
--- linux-2.6.14-rc2-shrink/fs/inode.c~cache_shrink_stats	2005-09-28 11:25:58.000000000 +0530
+++ linux-2.6.14-rc2-shrink-bharata/fs/inode.c	2005-09-28 14:02:24.000000000 +0530
@@ -1357,11 +1357,13 @@ void __init inode_init_early(void)
 void __init inode_init(unsigned long mempages)
 {
 	int loop;
+	struct shrinker *shrinker;
 
 	/* inode slab cache */
 	inode_cachep = kmem_cache_create("inode_cache", sizeof(struct inode),
 				0, SLAB_RECLAIM_ACCOUNT|SLAB_PANIC, init_once, NULL);
-	set_shrinker(DEFAULT_SEEKS, shrink_icache_memory);
+	shrinker = set_shrinker(DEFAULT_SEEKS, shrink_icache_memory);
+	kmem_set_shrinker(inode_cachep, shrinker);
 
 	/* Hash may have been set up in inode_init_early */
 	if (!hashdist)
diff -puN fs/dquot.c~cache_shrink_stats fs/dquot.c
--- linux-2.6.14-rc2-shrink/fs/dquot.c~cache_shrink_stats	2005-09-28 11:28:51.000000000 +0530
+++ linux-2.6.14-rc2-shrink-bharata/fs/dquot.c	2005-09-28 14:06:13.000000000 +0530
@@ -1793,6 +1793,7 @@ static int __init dquot_init(void)
 {
 	int i;
 	unsigned long nr_hash, order;
+	struct shrinker *shrinker;
 
 	printk(KERN_NOTICE "VFS: Disk quotas %s\n", __DQUOT_VERSION__);
 
@@ -1824,7 +1825,8 @@ static int __init dquot_init(void)
 	printk("Dquot-cache hash table entries: %ld (order %ld, %ld bytes)\n",
 			nr_hash, order, (PAGE_SIZE << order));
 
-	set_shrinker(DEFAULT_SEEKS, shrink_dqcache_memory);
+	shrinker = set_shrinker(DEFAULT_SEEKS, shrink_dqcache_memory);
+	kmem_set_shrinker(dquot_cachep, shrinker);
 
 	return 0;
 }
diff -puN fs/dcache.c~cache_shrink_stats fs/dcache.c
--- linux-2.6.14-rc2-shrink/fs/dcache.c~cache_shrink_stats	2005-09-28 11:31:35.000000000 +0530
+++ linux-2.6.14-rc2-shrink-bharata/fs/dcache.c	2005-09-28 13:47:46.000000000 +0530
@@ -1668,6 +1668,7 @@ static void __init dcache_init_early(voi
 static void __init dcache_init(unsigned long mempages)
 {
 	int loop;
+	struct shrinker *shrinker;
 
 	/* 
 	 * A constructor could be added for stable state like the lists,
@@ -1680,7 +1681,8 @@ static void __init dcache_init(unsigned 
 					 SLAB_RECLAIM_ACCOUNT|SLAB_PANIC,
 					 NULL, NULL);
 	
-	set_shrinker(DEFAULT_SEEKS, shrink_dcache_memory);
+	shrinker = set_shrinker(DEFAULT_SEEKS, shrink_dcache_memory);
+	kmem_set_shrinker(dentry_cache, shrinker);
 
 	/* Hash may have been set up in dcache_init_early */
 	if (!hashdist)
diff -puN mm/slab.c~cache_shrink_stats mm/slab.c
--- linux-2.6.14-rc2-shrink/mm/slab.c~cache_shrink_stats	2005-09-28 11:40:00.000000000 +0530
+++ linux-2.6.14-rc2-shrink-bharata/mm/slab.c	2005-10-04 14:09:53.000000000 +0530
@@ -400,6 +400,9 @@ struct kmem_cache_s {
 	/* de-constructor func */
 	void (*dtor)(void *, kmem_cache_t *, unsigned long);
 
+	/* shrinker data for this cache */
+	struct shrinker *shrinker;
+
 /* 4) cache creation/removal */
 	const char		*name;
 	struct list_head	next;
@@ -3363,6 +3366,7 @@ static void *s_start(struct seq_file *m,
 				" <error> <maxfreeable> <nodeallocs> <remotefrees>");
 		seq_puts(m, " : cpustat <allochit> <allocmiss> <freehit> <freemiss>");
 #endif
+		seq_puts(m, " : shrinker stat <nr requested> <nr freed>");
 		seq_putc(m, '\n');
 	}
 	p = cache_chain.next;
@@ -3483,6 +3487,12 @@ static int s_show(struct seq_file *m, vo
 			allochit, allocmiss, freehit, freemiss);
 	}
 #endif
+	/* shrinker stats */
+	if (cachep->shrinker) {
+		seq_printf(m, " : shrinker stat %7lu %7lu",
+			shrinker_stat_read(cachep->shrinker, nr_req),
+			shrinker_stat_read(cachep->shrinker, nr_freed));
+	}
 	seq_putc(m, '\n');
 	spin_unlock_irq(&cachep->spinlock);
 	return 0;
@@ -3606,3 +3616,8 @@ char *kstrdup(const char *s, unsigned in
 	return buf;
 }
 EXPORT_SYMBOL(kstrdup);
+
+void kmem_set_shrinker(kmem_cache_t *cachep, struct shrinker *shrinker)
+{
+	cachep->shrinker = shrinker;
+}
diff -puN include/linux/mm.h~cache_shrink_stats include/linux/mm.h
--- linux-2.6.14-rc2-shrink/include/linux/mm.h~cache_shrink_stats	2005-09-28 12:41:09.000000000 +0530
+++ linux-2.6.14-rc2-shrink-bharata/include/linux/mm.h	2005-10-04 12:29:22.000000000 +0530
@@ -755,7 +755,44 @@ typedef int (*shrinker_t)(int nr_to_scan
  */
 
 #define DEFAULT_SEEKS 2
-struct shrinker;
+
+struct shrinker_stats {
+	unsigned long nr_req; /* objs scanned for possible freeing */
+	unsigned long nr_freed; /* actual number of objects freed */
+};
+
+/*
+ * The list of shrinker callbacks used to apply pressure to
+ * ageable caches.
+ */
+struct shrinker {
+	shrinker_t		shrinker;
+	struct list_head	list;
+	int			seeks;	/* seeks to recreate an obj */
+	long			nr;	/* objs pending delete */
+	struct shrinker_stats	*s_stats;
+};
+
+#define shrinker_stat_add(shrinker, field, addnd)		\
+	do {							\
+		preempt_disable();				\
+		(per_cpu_ptr(shrinker->s_stats,			\
+			smp_processor_id())->field += addnd);	\
+		preempt_enable();				\
+	} while (0)
+
+#define shrinker_stat_read(shrinker, field)				\
+({									\
+	typeof(shrinker->s_stats->field) res = 0;			\
+	int i;								\
+	for (i=0; i < NR_CPUS; i++) {					\
+		if (!cpu_possible(i))					\
+			continue;					\
+		res += per_cpu_ptr(shrinker->s_stats, i)->field;	\
+	}								\
+	res;								\
+})
+
 extern struct shrinker *set_shrinker(int, shrinker_t);
 extern void remove_shrinker(struct shrinker *shrinker);
 
diff -puN include/linux/slab.h~cache_shrink_stats include/linux/slab.h
--- linux-2.6.14-rc2-shrink/include/linux/slab.h~cache_shrink_stats	2005-09-28 13:52:53.000000000 +0530
+++ linux-2.6.14-rc2-shrink-bharata/include/linux/slab.h	2005-09-28 14:07:42.000000000 +0530
@@ -147,6 +147,9 @@ extern kmem_cache_t	*bio_cachep;
 
 extern atomic_t slab_reclaim_pages;
 
+struct shrinker;
+extern void kmem_set_shrinker(kmem_cache_t *cachep, struct shrinker *shrinker);
+
 #endif	/* __KERNEL__ */
 
 #endif	/* _LINUX_SLAB_H */
diff -puN fs/mbcache.c~cache_shrink_stats fs/mbcache.c
--- linux-2.6.14-rc2-shrink/fs/mbcache.c~cache_shrink_stats	2005-10-04 13:47:35.000000000 +0530
+++ linux-2.6.14-rc2-shrink-bharata/fs/mbcache.c	2005-10-04 13:48:34.000000000 +0530
@@ -292,6 +292,8 @@ mb_cache_create(const char *name, struct
 	if (!cache->c_entry_cache)
 		goto fail;
 
+	kmem_set_shrinker(cache->c_entry_cache, mb_shrinker);
+
 	spin_lock(&mb_cache_spinlock);
 	list_add(&cache->c_cache_list, &mb_cache_list);
 	spin_unlock(&mb_cache_spinlock);
_
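
A note on the counter idiom above: shrinker_stat_add() bumps a
per-cpu slot with preemption disabled, and shrinker_stat_read()
sums the slots of all possible cpus. A minimal userspace analogue
of the same pattern (hypothetical, single-threaded, with a fixed
NR_CPUS, so no preemption handling is needed):

#include <stdio.h>

#define NR_CPUS 8

struct shrinker_stats {
	unsigned long nr_req;
	unsigned long nr_freed;
};

static struct shrinker_stats stats[NR_CPUS];

/* writer side: cheap, touches only the caller's own slot */
static void stat_add(int cpu, unsigned long req, unsigned long freed)
{
	stats[cpu].nr_req += req;
	stats[cpu].nr_freed += freed;
}

/* reader side: sums every slot, like shrinker_stat_read() */
static unsigned long stat_read_req(void)
{
	unsigned long sum = 0;
	int i;

	for (i = 0; i < NR_CPUS; i++)
		sum += stats[i].nr_req;
	return sum;
}

int main(void)
{
	stat_add(0, 128, 60);
	stat_add(3, 128, 0);
	printf("requested: %lu\n", stat_read_req()); /* prints 256 */
	return 0;
}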

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: shrinkable cache statistics [was Re: VM balancing issues on 2.6.13: dentry cache not getting shrunk enough]
  2005-10-04 13:36                 ` shrinkable cache statistics [was Re: VM balancing issues on 2.6.13: dentry cache not getting shrunk enough] Bharata B Rao
@ 2005-10-05 21:25                   ` Marcelo Tosatti
  2005-10-07  8:12                     ` Bharata B Rao
  0 siblings, 1 reply; 32+ messages in thread
From: Marcelo Tosatti @ 2005-10-05 21:25 UTC (permalink / raw)
  To: Bharata B Rao; +Cc: Theodore Ts'o, Dipankar Sarma, linux-mm, linux-kernel

Hi Bharata,

On Tue, Oct 04, 2005 at 07:06:35PM +0530, Bharata B Rao wrote:
> Marcelo,
> 
> Here's my next attempt at breaking the "slabs_scanned" count from
> /proc/vmstat into meaningful per-cache statistics. Now I have the
> statistics counters as percpu. [One issue remains: there is more than
> one cache within mbcache, and they all share a common shrinker routine,
> so I am displaying the collective shrinker stats on each of them in
> /proc/slabinfo ==> some duplication.]

Looks good to me! IMO it should be a candidate for -mm/mainline.

Nothing useful to suggest on the mbcache issue... sorry.

> With this patch (and my earlier dcache stats patch) I observed some
> interesting results with the following test scenario on a 8cpu p3 box:
> 
> - Ran an application which consumes 40% of the total memory.
> - Ran dbench on tmpfs with 128 clients twice (serially).
> - Ran a find on a ext3 partition having ~9.5million entries (files and
>   directories included)
> 
> At the end of this run, I have the following results:
> 
> [root@llm09 bharata]# cat /proc/meminfo
> MemTotal:      3872528 kB
> MemFree:       1420940 kB
> Buffers:        714068 kB
> Cached:          21536 kB
> SwapCached:       2264 kB
> Active:        1672680 kB
> Inactive:       637460 kB
> HighTotal:     3014616 kB
> HighFree:      1411740 kB
> LowTotal:       857912 kB
> LowFree:          9200 kB
> SwapTotal:     2096472 kB
> SwapFree:      2051408 kB
> Dirty:             172 kB
> Writeback:           0 kB
> Mapped:        1583680 kB
> Slab:           119564 kB
> CommitLimit:   4032736 kB
> Committed_AS:  1647260 kB
> PageTables:       2248 kB
> VmallocTotal:   114680 kB
> VmallocUsed:      1264 kB
> VmallocChunk:   113384 kB
> nr_dentries/page        nr_pages        nr_inuse
>          0              0               0
>          1              5               2
>          2              12              4
>          3              26              9
>          4              46              18
>          5              76              40
>          6              82              47
>          7              91              59
>          8              122             93
>          9              114             102
>         10              142             136
>         11              138             185
>         12              118             164
>         13              128             206
>         14              126             208
>         15              120             219
>         16              136             261
>         17              159             315
>         18              145             311
>         19              179             379
>         20              192             407
>         21              256             631
>         22              286             741
>         23              316             816
>         24              342             934
>         25              381             1177
>         26              664             2813
>         27              0               0
>         28              0               0
>         29              0               0
> Total:                  4402            10277
> dcache lru: total 75369 inuse 3599
> 
> [Here,
> nr_dentries/page - Number of dentries per page
> nr_pages - Number of pages with given number of dentries
> nr_inuse - Number of inuse dentries in those pages.
> E.g., from the above data, there are 26 pages with 3 dentries each,
> and of the 78 dentries on those 26 pages, 9 are in use.]
> 
> [root@llm09 bharata]# grep shrinker /proc/slabinfo
> # name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail> : shrinker stat <nr requested> <nr freed>
> ext3_xattr             0      0     48   78    1 : tunables  120   60    8 : slabdata      0      0      0 : shrinker stat       0       0
> dquot                  0      0    160   24    1 : tunables  120   60    8 : slabdata      0      0      0 : shrinker stat       0       0
> inode_cache         1301   1390    400   10    1 : tunables   54   27    8 : slabdata    139    139      0 : shrinker stat  682752  681900
> dentry_cache       82110 114452    152   26    1 : tunables  120   60    8 : slabdata   4402   4402      0 : shrinker stat 1557760  760100
> 
> [root@llm09 bharata]# grep slabs_scanned /proc/vmstat
> slabs_scanned 2240512
> 
> [root@llm09 bharata]# cat /proc/sys/fs/dentry-state
> 82046   75369   45      0       3599    0
> [The fields of the dentry-state output are, in order:
> total dentries in the dentry hash list, total dentries in the lru list,
> age limit, want_pages, inuse dentries in the lru list, dummy]
> 
> So, we can see that under low memory pressure, even though the
> shrinker runs on the dcache repeatedly, relatively few dentries
> actually get freed, and the dcache lru list still holds a huge
> number of free dentries.

The success/attempt ratio is about 1/2, which seems alright? 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: shrinkable cache statistics [was Re: VM balancing issues on 2.6.13: dentry cache not getting shrunk enough]
  2005-10-05 21:25                   ` Marcelo Tosatti
@ 2005-10-07  8:12                     ` Bharata B Rao
  0 siblings, 0 replies; 32+ messages in thread
From: Bharata B Rao @ 2005-10-07  8:12 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Theodore Ts'o, Dipankar Sarma, linux-mm, linux-kernel

On Wed, Oct 05, 2005 at 06:25:51PM -0300, Marcelo Tosatti wrote:
> Hi Bharata,
> 
> On Tue, Oct 04, 2005 at 07:06:35PM +0530, Bharata B Rao wrote:
> > Marcelo,
> > 
> > Here's my next attempt at breaking the "slabs_scanned" count from
> > /proc/vmstat into meaningful per-cache statistics. Now I have the
> > statistics counters as percpu. [One issue remains: there is more than
> > one cache within mbcache, and they all share a common shrinker routine,
> > so I am displaying the collective shrinker stats on each of them in
> > /proc/slabinfo ==> some duplication.]
> 
> Looks good to me! IMO it should be a candidate for -mm/mainline.
> 
> Nothing useful to suggest on the mbcache issue... sorry.

Thanks, Marcelo, for reviewing.

<snip>

> > 
> > [root@llm09 bharata]# grep shrinker /proc/slabinfo
> > # name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail> : shrinker stat <nr requested> <nr freed>
> > ext3_xattr             0      0     48   78    1 : tunables  120   60    8 : slabdata      0      0      0 : shrinker stat       0       0
> > dquot                  0      0    160   24    1 : tunables  120   60    8 : slabdata      0      0      0 : shrinker stat       0       0
> > inode_cache         1301   1390    400   10    1 : tunables   54   27    8 : slabdata    139    139      0 : shrinker stat  682752  681900
> > dentry_cache       82110 114452    152   26    1 : tunables  120   60    8 : slabdata   4402   4402      0 : shrinker stat 1557760  760100
> > 
> > [root@llm09 bharata]# grep slabs_scanned /proc/vmstat
> > slabs_scanned 2240512
> > 
> > [root@llm09 bharata]# cat /proc/sys/fs/dentry-state
> > 82046   75369   45      0       3599    0
> > [The fields of the dentry-state output are, in order:
> > total dentries in the dentry hash list, total dentries in the lru list,
> > age limit, want_pages, inuse dentries in the lru list, dummy]
> > 
> > So, we can see that under low memory pressure, even though the
> > shrinker runs on the dcache repeatedly, relatively few dentries
> > actually get freed, and the dcache lru list still holds a huge
> > number of free dentries.
> 
> The success/attempt ratio is about 1/2, which seems alright? 
> 

Hmm... compared to inode_cache, I felt the dcache shrinker wasn't
doing a good job. Anyway, I will analyze further to see whether
things can be improved with the existing shrinker.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2005-10-07  8:13 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-09-11 10:57 VM balancing issues on 2.6.13: dentry cache not getting shrunk enough Theodore Ts'o
2005-09-11 12:00 ` Dipankar Sarma
2005-09-12  3:16   ` Theodore Ts'o
2005-09-12  6:16     ` Martin J. Bligh
2005-09-12 12:53       ` Bharata B Rao
2005-09-13  8:47     ` Bharata B Rao
2005-09-13 21:59       ` David Chinner
2005-09-14  9:01         ` Andi Kleen
2005-09-14  9:16           ` Manfred Spraul
2005-09-14  9:43             ` Andrew Morton
2005-09-14  9:52               ` Dipankar Sarma
2005-09-14 22:44               ` Theodore Ts'o
2005-09-14  9:35           ` Andrew Morton
2005-09-14 13:57           ` Martin J. Bligh
2005-09-14 15:37             ` Sonny Rao
2005-09-15  7:21             ` Helge Hafting
2005-09-14 22:48           ` David Chinner
2005-09-14 15:48         ` Sonny Rao
2005-09-14 22:02           ` David Chinner
2005-09-14 22:40             ` Sonny Rao
2005-09-15  1:14               ` David Chinner
2005-09-14 21:34       ` Marcelo Tosatti
2005-09-14 21:43         ` Dipankar Sarma
2005-09-15  4:28         ` Bharata B Rao
2005-09-14 23:08       ` Marcelo Tosatti
2005-09-15  9:39         ` Bharata B Rao
2005-09-15 13:29           ` Marcelo Tosatti
2005-10-02 16:32             ` Bharata B Rao
2005-10-02 20:06               ` Marcelo
2005-10-04 13:36                 ` shrinkable cache statistics [was Re: VM balancing issues on 2.6.13: dentry cache not getting shrunk enough] Bharata B Rao
2005-10-05 21:25                   ` Marcelo Tosatti
2005-10-07  8:12                     ` Bharata B Rao
