* VFS scalability git tree
From: Nick Piggin @ 2010-07-22 19:01 UTC (permalink / raw)
To: linux-fsdevel, linux-kernel, linux-mm; +Cc: Frank Mayhar, John Stultz

I'm pleased to announce I have a git tree up of my VFS scalability work.

git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git

Branch vfs-scale-working

The really interesting new item is the store-free path walk (43fe2b), which I've re-introduced. It has had a complete redesign: it has much better performance and scalability in more cases, and is actually sane code now.

What this does is allow parallel name lookups to walk down common elements without any cacheline bouncing between them. It can walk across many interesting cases such as mount points, back up '..', and negative dentries of most filesystems. It does so without requiring any atomic operations or any stores at all to shared data. This also makes it very fast in serial performance (path walking is nearly twice as fast on my Opteron). In cases where it cannot continue the RCU walk (e.g. the dentry does not exist), it can in most cases take a reference on the farthest element it has reached so far, and then continue on with a regular refcount-based path walk. My first attempt at this simply dropped everything and re-did the full refcount-based walk.

I've also been working on stress testing, bug fixing, cutting down 'XXX'es, and improving changelogs and comments. Most filesystems are untested (it's too large a job to do comprehensive stress tests on everything), but none have known issues (except nilfs2). Ext2/3, nfs, nfsd, and RAM-based filesystems seem to work well; ext4/btrfs/xfs/autofs4 have had light testing. I've never had filesystem corruption when testing these patches (only lockups or other bugs). But standard disclaimer: they may eat your data.
Summary of a few numbers I've run: Google's socket teardown workload runs 3-4x faster on my 2-socket Opteron. Single-threaded git diff runs 20% faster on the same machine. A 32-node Altix runs dbench on ramfs 150x faster (100MB/s up to 15GB/s).

At this point, I would be very interested in review, correctness testing on different configurations, and of course benchmarking.

Thanks,
Nick
* Re: VFS scalability git tree
From: Dave Chinner @ 2010-07-23 11:13 UTC (permalink / raw)
To: Nick Piggin
Cc: linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar, John Stultz

On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> I'm pleased to announce I have a git tree up of my vfs scalability work.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
>
> Branch vfs-scale-working

I've got a couple of patches needed to build XFS - the shrinker merge left some bad fragments - I'll post them in a minute. This email is for the longest lockdep warning I've ever seen, which occurred on boot.

Cheers,

Dave.

[ 6.368707] ====================================================== [ 6.369773] [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ] [ 6.370379] 2.6.35-rc5-dgc+ #58 [ 6.370882] ------------------------------------------------------ [ 6.371475] pmcd/2124 [HC0[0]:SC0[1]:HE1:SE0] is trying to acquire: [ 6.372062] (&sb->s_type->i_lock_key#6){+.+...}, at: [<ffffffff81736f8c>] socket_get_id+0x3c/0x60 [ 6.372268] [ 6.372268] and this task is already holding: [ 6.372268] (&(&hashinfo->ehash_locks[i])->rlock){+.-...}, at: [<ffffffff81791750>] established_get_first+0x60/0x120 [ 6.372268] which would create a new lock dependency: [ 6.372268] (&(&hashinfo->ehash_locks[i])->rlock){+.-...} -> (&sb->s_type->i_lock_key#6){+.+...} [ 6.372268] [ 6.372268] but this new dependency connects a SOFTIRQ-irq-safe lock: [ 6.372268] (&(&hashinfo->ehash_locks[i])->rlock){+.-...} [ 6.372268] ... 
which became SOFTIRQ-irq-safe at: [ 6.372268] [<ffffffff810b3b26>] __lock_acquire+0x576/0x1450 [ 6.372268] [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160 [ 6.372268] [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70 [ 6.372268] [<ffffffff8177a1ba>] __inet_hash_nolisten+0xfa/0x180 [ 6.372268] [<ffffffff8179392a>] tcp_v4_syn_recv_sock+0x1aa/0x2d0 [ 6.372268] [<ffffffff81795502>] tcp_check_req+0x202/0x440 [ 6.372268] [<ffffffff817948c4>] tcp_v4_do_rcv+0x304/0x4f0 [ 6.372268] [<ffffffff81795134>] tcp_v4_rcv+0x684/0x7e0 [ 6.372268] [<ffffffff81771512>] ip_local_deliver+0xe2/0x1c0 [ 6.372268] [<ffffffff81771af7>] ip_rcv+0x397/0x760 [ 6.372268] [<ffffffff8174d067>] __netif_receive_skb+0x277/0x330 [ 6.372268] [<ffffffff8174d1f4>] process_backlog+0xd4/0x1e0 [ 6.372268] [<ffffffff8174dc38>] net_rx_action+0x188/0x2b0 [ 6.372268] [<ffffffff81084cc2>] __do_softirq+0xd2/0x260 [ 6.372268] [<ffffffff81035edc>] call_softirq+0x1c/0x50 [ 6.372268] [<ffffffff8108551b>] local_bh_enable_ip+0xeb/0xf0 [ 6.372268] [<ffffffff8182c544>] _raw_spin_unlock_bh+0x34/0x40 [ 6.372268] [<ffffffff8173c59e>] release_sock+0x14e/0x1a0 [ 6.372268] [<ffffffff817a3975>] inet_stream_connect+0x75/0x320 [ 6.372268] [<ffffffff81737917>] sys_connect+0xa7/0xc0 [ 6.372268] [<ffffffff81034ff2>] system_call_fastpath+0x16/0x1b [ 6.372268] [ 6.372268] to a SOFTIRQ-irq-unsafe lock: [ 6.372268] (&sb->s_type->i_lock_key#6){+.+...} [ 6.372268] ... which became SOFTIRQ-irq-unsafe at: [ 6.372268] ... 
[<ffffffff810b3b73>] __lock_acquire+0x5c3/0x1450 [ 6.372268] [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160 [ 6.372268] [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70 [ 6.372268] [<ffffffff8116af72>] new_inode+0x52/0xd0 [ 6.372268] [<ffffffff81174a40>] get_sb_pseudo+0xb0/0x180 [ 6.372268] [<ffffffff81735a41>] sockfs_get_sb+0x21/0x30 [ 6.372268] [<ffffffff81152dba>] vfs_kern_mount+0x8a/0x1e0 [ 6.372268] [<ffffffff81152f29>] kern_mount_data+0x19/0x20 [ 6.372268] [<ffffffff81e1c075>] sock_init+0x4e/0x59 [ 6.372268] [<ffffffff810001dc>] do_one_initcall+0x3c/0x1a0 [ 6.372268] [<ffffffff81de5767>] kernel_init+0x17a/0x204 [ 6.372268] [<ffffffff81035de4>] kernel_thread_helper+0x4/0x10 [ 6.372268] [ 6.372268] other info that might help us debug this: [ 6.372268] [ 6.372268] 3 locks held by pmcd/2124: [ 6.372268] #0: (&p->lock){+.+.+.}, at: [<ffffffff81171dae>] seq_read+0x3e/0x430 [ 6.372268] #1: (&(&hashinfo->ehash_locks[i])->rlock){+.-...}, at: [<ffffffff81791750>] established_get_first+0x60/0x120 [ 6.372268] #2: (clock-AF_INET){++....}, at: [<ffffffff8173b6ae>] sock_i_ino+0x2e/0x70 [ 6.372268] [ 6.372268] the dependencies between SOFTIRQ-irq-safe lock and the holding lock: [ 6.372268] -> (&(&hashinfo->ehash_locks[i])->rlock){+.-...} ops: 3 { [ 6.372268] HARDIRQ-ON-W at: [ 6.372268] [<ffffffff810b3b47>] __lock_acquire+0x597/0x1450 [ 6.372268] [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160 [ 6.372268] [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70 [ 6.372268] [<ffffffff8177a1ba>] __inet_hash_nolisten+0xfa/0x180 [ 6.372268] [<ffffffff8177ab6a>] __inet_hash_connect+0x33a/0x3d0 [ 6.372268] [<ffffffff8177ac4f>] inet_hash_connect+0x4f/0x60 [ 6.372268] [<ffffffff81792522>] tcp_v4_connect+0x272/0x4f0 [ 6.372268] [<ffffffff817a3b8e>] inet_stream_connect+0x28e/0x320 [ 6.372268] [<ffffffff81737917>] sys_connect+0xa7/0xc0 [ 6.372268] [<ffffffff81034ff2>] system_call_fastpath+0x16/0x1b [ 6.372268] IN-SOFTIRQ-W at: [ 6.372268] [<ffffffff810b3b26>] __lock_acquire+0x576/0x1450 [ 6.372268] 
[<ffffffff810b4aa6>] lock_acquire+0xa6/0x160 [ 6.372268] [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70 [ 6.372268] [<ffffffff8177a1ba>] __inet_hash_nolisten+0xfa/0x180 [ 6.372268] [<ffffffff8179392a>] tcp_v4_syn_recv_sock+0x1aa/0x2d0 [ 6.372268] [<ffffffff81795502>] tcp_check_req+0x202/0x440 [ 6.372268] [<ffffffff817948c4>] tcp_v4_do_rcv+0x304/0x4f0 [ 6.372268] [<ffffffff81795134>] tcp_v4_rcv+0x684/0x7e0 [ 6.372268] [<ffffffff81771512>] ip_local_deliver+0xe2/0x1c0 [ 6.372268] [<ffffffff81771af7>] ip_rcv+0x397/0x760 [ 6.372268] [<ffffffff8174d067>] __netif_receive_skb+0x277/0x330 [ 6.372268] [<ffffffff8174d1f4>] process_backlog+0xd4/0x1e0 [ 6.372268] [<ffffffff8174dc38>] net_rx_action+0x188/0x2b0 [ 6.372268] [<ffffffff81084cc2>] __do_softirq+0xd2/0x260 [ 6.372268] [<ffffffff81035edc>] call_softirq+0x1c/0x50 [ 6.372268] [<ffffffff8108551b>] local_bh_enable_ip+0xeb/0xf0 [ 6.372268] [<ffffffff8182c544>] _raw_spin_unlock_bh+0x34/0x40 [ 6.372268] [<ffffffff8173c59e>] release_sock+0x14e/0x1a0 [ 6.372268] [<ffffffff817a3975>] inet_stream_connect+0x75/0x320 [ 6.372268] [<ffffffff81737917>] sys_connect+0xa7/0xc0 [ 6.372268] [<ffffffff81034ff2>] system_call_fastpath+0x16/0x1b [ 6.372268] INITIAL USE at: [ 6.372268] [<ffffffff810b37e2>] __lock_acquire+0x232/0x1450 [ 6.372268] [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160 [ 6.372268] [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70 [ 6.372268] [<ffffffff8177a1ba>] __inet_hash_nolisten+0xfa/0x180 [ 6.372268] [<ffffffff8177ab6a>] __inet_hash_connect+0x33a/0x3d0 [ 6.372268] [<ffffffff8177ac4f>] inet_hash_connect+0x4f/0x60 [ 6.372268] [<ffffffff81792522>] tcp_v4_connect+0x272/0x4f0 [ 6.372268] [<ffffffff817a3b8e>] inet_stream_connect+0x28e/0x320 [ 6.372268] [<ffffffff81737917>] sys_connect+0xa7/0xc0 [ 6.372268] [<ffffffff81034ff2>] system_call_fastpath+0x16/0x1b [ 6.372268] } [ 6.372268] ... key at: [<ffffffff8285ddf8>] __key.47027+0x0/0x8 [ 6.372268] ... 
acquired at: [ 6.372268] [<ffffffff810b2940>] check_irq_usage+0x60/0xf0 [ 6.372268] [<ffffffff810b41ff>] __lock_acquire+0xc4f/0x1450 [ 6.372268] [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160 [ 6.372268] [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70 [ 6.372268] [<ffffffff81736f8c>] socket_get_id+0x3c/0x60 [ 6.372268] [<ffffffff8173b6c3>] sock_i_ino+0x43/0x70 [ 6.372268] [<ffffffff81790fc9>] tcp4_seq_show+0x1a9/0x520 [ 6.372268] [<ffffffff81172005>] seq_read+0x295/0x430 [ 6.372268] [<ffffffff811ad9f4>] proc_reg_read+0x84/0xc0 [ 6.372268] [<ffffffff81150165>] vfs_read+0xb5/0x170 [ 6.372268] [<ffffffff81150274>] sys_read+0x54/0x90 [ 6.372268] [<ffffffff81034ff2>] system_call_fastpath+0x16/0x1b [ 6.372268] [ 6.372268] [ 6.372268] the dependencies between the lock to be acquired and SOFTIRQ-irq-unsafe lock: [ 6.372268] -> (&sb->s_type->i_lock_key#6){+.+...} ops: 1185 { [ 6.372268] HARDIRQ-ON-W at: [ 6.372268] [<ffffffff810b3b47>] __lock_acquire+0x597/0x1450 [ 6.372268] [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160 [ 6.372268] [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70 [ 6.372268] [<ffffffff8116af72>] new_inode+0x52/0xd0 [ 6.372268] [<ffffffff81174a40>] get_sb_pseudo+0xb0/0x180 [ 6.372268] [<ffffffff81735a41>] sockfs_get_sb+0x21/0x30 [ 6.372268] [<ffffffff81152dba>] vfs_kern_mount+0x8a/0x1e0 [ 6.372268] [<ffffffff81152f29>] kern_mount_data+0x19/0x20 [ 6.372268] [<ffffffff81e1c075>] sock_init+0x4e/0x59 [ 6.372268] [<ffffffff810001dc>] do_one_initcall+0x3c/0x1a0 [ 6.372268] [<ffffffff81de5767>] kernel_init+0x17a/0x204 [ 6.372268] [<ffffffff81035de4>] kernel_thread_helper+0x4/0x10 [ 6.372268] SOFTIRQ-ON-W at: [ 6.372268] [<ffffffff810b3b73>] __lock_acquire+0x5c3/0x1450 [ 6.372268] [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160 [ 6.372268] [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70 [ 6.372268] [<ffffffff8116af72>] new_inode+0x52/0xd0 [ 6.372268] [<ffffffff81174a40>] get_sb_pseudo+0xb0/0x180 [ 6.372268] [<ffffffff81735a41>] sockfs_get_sb+0x21/0x30 [ 6.372268] 
[<ffffffff81152dba>] vfs_kern_mount+0x8a/0x1e0 [ 6.372268] [<ffffffff81152f29>] kern_mount_data+0x19/0x20 [ 6.372268] [<ffffffff81e1c075>] sock_init+0x4e/0x59 [ 6.372268] [<ffffffff810001dc>] do_one_initcall+0x3c/0x1a0 [ 6.372268] [<ffffffff81de5767>] kernel_init+0x17a/0x204 [ 6.372268] [<ffffffff81035de4>] kernel_thread_helper+0x4/0x10 [ 6.372268] INITIAL USE at: [ 6.372268] [<ffffffff810b37e2>] __lock_acquire+0x232/0x1450 [ 6.372268] [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160 [ 6.372268] [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70 [ 6.372268] [<ffffffff8116af72>] new_inode+0x52/0xd0 [ 6.372268] [<ffffffff81174a40>] get_sb_pseudo+0xb0/0x180 [ 6.372268] [<f [<ffffffff81152dba>] vfs_kern_mount+0x8a/0x1e0 [ 6.372268] [<ffffffff81152f29>] kern_mount_data+0x19/0x20 [ 6.372268] [<ffffffff81e1c075>] sock_init+0x4e/0x59 [ 6.372268] [<ffffffff810001dc>] do_one_initcall+0x3c/0x1a0 [ 6.372268] [<ffffffff81de5767>] kernel_init+0x17a/0x204 [ 6.372268] [<ffffffff81035de4>] kernel_thread_helper+0x4/0x10 [ 6.372268] } [ 6.372268] ... key at: [<ffffffff81bd5bd8>] sock_fs_type+0x58/0x80 [ 6.372268] ... 
acquired at: [ 6.372268] [<ffffffff810b2940>] check_irq_usage+0x60/0xf0 [ 6.372268] [<ffffffff810b41ff>] __lock_acquire+0xc4f/0x1450 [ 6.372268] [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160 [ 6.372268] [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70 [ 6.372268] [<ffffffff81736f8c>] socket_get_id+0x3c/0x60 [ 6.372268] [<ffffffff8173b6c3>] sock_i_ino+0x43/0x70 [ 6.372268] [<ffffffff81790fc9>] tcp4_seq_show+0x1a9/0x520 [ 6.372268] [<ffffffff81172005>] seq_read+0x295/0x430 [ 6.372268] [<ffffffff811ad9f4>] proc_reg_read+0x84/0xc0 [ 6.372268] [<ffffffff81150165>] vfs_read+0xb5/0x170 [ 6.372268] [<ffffffff81150274>] sys_read+0x54/0x90 [ 6.372268] [<ffffffff81034ff2>] system_call_fastpath+0x16/0x1b [ 6.372268] [ 6.372268] [ 6.372268] stack backtrace: [ 6.372268] Pid: 2124, comm: pmcd Not tainted 2.6.35-rc5-dgc+ #58 [ 6.372268] Call Trace: [ 6.372268] [<ffffffff810b28d9>] check_usage+0x499/0x4a0 [ 6.372268] [<ffffffff810b24c6>] ? check_usage+0x86/0x4a0 [ 6.372268] [<ffffffff810af729>] ? __bfs+0x129/0x260 [ 6.372268] [<ffffffff810b2940>] check_irq_usage+0x60/0xf0 [ 6.372268] [<ffffffff810b41ff>] __lock_acquire+0xc4f/0x1450 [ 6.372268] [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160 [ 6.372268] [<ffffffff81736f8c>] ? socket_get_id+0x3c/0x60 [ 6.372268] [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70 [ 6.372268] [<ffffffff81736f8c>] ? socket_get_id+0x3c/0x60 [ 6.372268] [<ffffffff81736f8c>] socket_get_id+0x3c/0x60 [ 6.372268] [<ffffffff8173b6c3>] sock_i_ino+0x43/0x70 [ 6.372268] [<ffffffff81790fc9>] tcp4_seq_show+0x1a9/0x520 [ 6.372268] [<ffffffff81791750>] ? established_get_first+0x60/0x120 [ 6.372268] [<ffffffff8182beb7>] ? _raw_spin_lock_bh+0x67/0x70 [ 6.372268] [<ffffffff81172005>] seq_read+0x295/0x430 [ 6.372268] [<ffffffff81171d70>] ? 
seq_read+0x0/0x430 [ 6.372268] [<ffffffff811ad9f4>] proc_reg_read+0x84/0xc0 [ 6.372268] [<ffffffff81150165>] vfs_read+0xb5/0x170 [ 6.372268] [<ffffffff81150274>] sys_read+0x54/0x90 [ 6.372268] [<ffffffff81034ff2>] system_call_fastpath+0x16/0x1b -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 76+ messages in thread
* [PATCH 0/2] vfs scalability tree fixes
From: Dave Chinner @ 2010-07-23 14:04 UTC (permalink / raw)
To: npiggin; +Cc: linux-fsdevel, linux-kernel, linux-mm, fmayhar, johnstul

Nick,

Here are the fixes I applied to your tree to make the XFS inode cache shrinker build and scan sanely.

Cheers,

Dave.
* Re: [PATCH 0/2] vfs scalability tree fixes
From: Nick Piggin @ 2010-07-23 16:09 UTC (permalink / raw)
To: Dave Chinner
Cc: npiggin, linux-fsdevel, linux-kernel, linux-mm, fmayhar, johnstul

On Sat, Jul 24, 2010 at 12:04:00AM +1000, Dave Chinner wrote:
> Nick,
>
> Here's the fixes I applied to your tree to make the XFS inode cache
> shrinker build and scan sanely.

Thanks for these, Dave.
* [PATCH 1/2] xfs: fix shrinker build
  2010-07-23 11:13 ` Dave Chinner
@ 2010-07-23 14:04 ` Dave Chinner
  0 siblings, 0 replies; 76+ messages in thread
From: Dave Chinner @ 2010-07-23 14:04 UTC (permalink / raw)
  To: npiggin; +Cc: linux-fsdevel, linux-kernel, linux-mm, fmayhar, johnstul

From: Dave Chinner <dchinner@redhat.com>

Remove the stray mount list lock reference from the shrinker code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_sync.c |    5 +----
 1 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index 7a5a368..05426bf 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -916,10 +916,8 @@ xfs_reclaim_inode_shrink(
 
 done:
 	nr = shrinker_do_scan(&nr_to_scan, SHRINK_BATCH);
-	if (!nr) {
-		up_read(&xfs_mount_list_lock);
+	if (!nr)
 		return 0;
-	}
 	xfs_inode_ag_iterator(mp, xfs_reclaim_inode, 0,
 					XFS_ICI_RECLAIM_TAG, 1, &nr);
 	/* if we don't exhaust the scan, don't bother coming back */
@@ -935,7 +933,6 @@ xfs_inode_shrinker_register(
 	struct xfs_mount	*mp)
 {
 	mp->m_inode_shrink.shrink = xfs_reclaim_inode_shrink;
-	mp->m_inode_shrink.seeks = DEFAULT_SEEKS;
 	register_shrinker(&mp->m_inode_shrink);
 }
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 76+ messages in thread
* [PATCH 2/2] xfs: shrinker should use a per-filesystem scan count
  2010-07-23 11:13 ` Dave Chinner
@ 2010-07-23 14:04 ` Dave Chinner
  0 siblings, 0 replies; 76+ messages in thread
From: Dave Chinner @ 2010-07-23 14:04 UTC (permalink / raw)
  To: npiggin; +Cc: linux-fsdevel, linux-kernel, linux-mm, fmayhar, johnstul

From: Dave Chinner <dchinner@redhat.com>

The shrinker uses a global static to aggregate excess scan counts.
This should be per filesystem like all the other shrinker context to
operate correctly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_sync.c |    5 ++---
 fs/xfs/xfs_mount.h          |    1 +
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index 05426bf..b0e6296 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -893,7 +893,6 @@ xfs_reclaim_inode_shrink(
 	unsigned long	global,
 	gfp_t		gfp_mask)
 {
-	static unsigned long nr_to_scan;
 	int		nr;
 	struct xfs_mount *mp;
 	struct xfs_perag *pag;
@@ -908,14 +907,14 @@ xfs_reclaim_inode_shrink(
 		nr_reclaimable += pag->pag_ici_reclaimable;
 		xfs_perag_put(pag);
 	}
-	shrinker_add_scan(&nr_to_scan, scanned, global, nr_reclaimable,
+	shrinker_add_scan(&mp->m_shrink_scan_nr, scanned, global, nr_reclaimable,
 						DEFAULT_SEEKS);
 	if (!(gfp_mask & __GFP_FS)) {
 		return 0;
 	}
 
 done:
-	nr = shrinker_do_scan(&nr_to_scan, SHRINK_BATCH);
+	nr = shrinker_do_scan(&mp->m_shrink_scan_nr, SHRINK_BATCH);
 	if (!nr)
 		return 0;
 	xfs_inode_ag_iterator(mp, xfs_reclaim_inode, 0,
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 5761087..ed5531f 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -260,6 +260,7 @@ typedef struct xfs_mount {
 	__int64_t		m_update_flags;	/* sb flags we need to
 						 * update on the next
 						 * remount,rw */
 	struct shrinker		m_inode_shrink;	/* inode reclaim shrinker */
+	unsigned long		m_shrink_scan_nr; /* shrinker scan count */
 } xfs_mount_t;
 
 /*
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 76+ messages in thread
* Re: VFS scalability git tree
  2010-07-23 11:13 ` Dave Chinner
@ 2010-07-23 15:51 ` Nick Piggin
  0 siblings, 0 replies; 76+ messages in thread
From: Nick Piggin @ 2010-07-23 15:51 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Nick Piggin, linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar,
	John Stultz

On Fri, Jul 23, 2010 at 09:13:10PM +1000, Dave Chinner wrote:
> On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > I'm pleased to announce I have a git tree up of my vfs scalability work.
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> >
> > Branch vfs-scale-working
>
> I've got a couple of patches needed to build XFS - they shrinker
> merge left some bad fragments - I'll post them in a minute. This

OK cool.

> email is for the longest ever lockdep warning I've seen that
> occurred on boot.

Ah thanks. OK that was one of my attempts to keep sockets out of
hitting the vfs as much as possible (lazy inode number evaluation).
Not a big problem, but I'll drop the patch for now.

I have just got one for you too, btw :) (on vanilla kernel but it is
messing up my lockdep stress testing on xfs). Real or false?

[ INFO: possible circular locking dependency detected ]
2.6.35-rc5-00064-ga9f7f2e #334
-------------------------------------------------------
kswapd0/605 is trying to acquire lock:
 (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffff8125500c>] xfs_ilock+0x7c/0xa0

but task is already holding lock:
 (&xfs_mount_list_lock){++++.-}, at: [<ffffffff81281a76>] xfs_reclaim_inode_shrink+0xc6/0x140

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&xfs_mount_list_lock){++++.-}:
       [<ffffffff8106ef9a>] lock_acquire+0x5a/0x70
       [<ffffffff815aa646>] _raw_spin_lock+0x36/0x50
       [<ffffffff810fabf3>] try_to_free_buffers+0x43/0xb0
       [<ffffffff812763b2>] xfs_vm_releasepage+0x92/0xe0
       [<ffffffff810908ee>] try_to_release_page+0x2e/0x50
       [<ffffffff8109ef56>] shrink_page_list+0x486/0x5a0
       [<ffffffff8109f35d>] shrink_inactive_list+0x2ed/0x700
       [<ffffffff8109fda0>] shrink_zone+0x3b0/0x460
       [<ffffffff810a0f41>] try_to_free_pages+0x241/0x3a0
       [<ffffffff810999e2>] __alloc_pages_nodemask+0x4c2/0x6b0
       [<ffffffff810c52c6>] alloc_pages_current+0x76/0xf0
       [<ffffffff8109205b>] __page_cache_alloc+0xb/0x10
       [<ffffffff81092a2a>] find_or_create_page+0x4a/0xa0
       [<ffffffff812780cc>] _xfs_buf_lookup_pages+0x14c/0x360
       [<ffffffff81279122>] xfs_buf_get+0x72/0x160
       [<ffffffff8126eb68>] xfs_trans_get_buf+0xc8/0xf0
       [<ffffffff8124439f>] xfs_da_do_buf+0x3df/0x6d0
       [<ffffffff81244825>] xfs_da_get_buf+0x25/0x30
       [<ffffffff8124a076>] xfs_dir2_data_init+0x46/0xe0
       [<ffffffff81247f89>] xfs_dir2_sf_to_block+0xb9/0x5a0
       [<ffffffff812501c8>] xfs_dir2_sf_addname+0x418/0x5c0
       [<ffffffff81247d7c>] xfs_dir_createname+0x14c/0x1a0
       [<ffffffff81271d49>] xfs_create+0x449/0x5d0
       [<ffffffff8127d802>] xfs_vn_mknod+0xa2/0x1b0
       [<ffffffff8127d92b>] xfs_vn_create+0xb/0x10
       [<ffffffff810ddc81>] vfs_create+0x81/0xd0
       [<ffffffff810df1a5>] do_last+0x535/0x690
       [<ffffffff810e11fd>] do_filp_open+0x21d/0x660
       [<ffffffff810d16b4>] do_sys_open+0x64/0x140
       [<ffffffff810d17bb>] sys_open+0x1b/0x20
       [<ffffffff810023eb>] system_call_fastpath+0x16/0x1b

-> #0 (&(&ip->i_lock)->mr_lock){++++--}:
       [<ffffffff8106ef10>] __lock_acquire+0x1be0/0x1c10
       [<ffffffff8106ef9a>] lock_acquire+0x5a/0x70
       [<ffffffff8105dfba>] down_write_nested+0x4a/0x70
       [<ffffffff8125500c>] xfs_ilock+0x7c/0xa0
       [<ffffffff81280c98>] xfs_reclaim_inode+0x98/0x250
       [<ffffffff81281824>] xfs_inode_ag_walk+0x74/0x120
       [<ffffffff81281953>] xfs_inode_ag_iterator+0x83/0xe0
       [<ffffffff81281aa4>] xfs_reclaim_inode_shrink+0xf4/0x140
       [<ffffffff8109ff7d>] shrink_slab+0x12d/0x190
       [<ffffffff810a07ad>] balance_pgdat+0x43d/0x6f0
       [<ffffffff810a0b1e>] kswapd+0xbe/0x2a0
       [<ffffffff810592ae>] kthread+0x8e/0xa0
       [<ffffffff81003194>] kernel_thread_helper+0x4/0x10

other info that might help us debug this:

2 locks held by kswapd0/605:
 #0:  (shrinker_rwsem){++++..}, at: [<ffffffff8109fe88>] shrink_slab+0x38/0x190
 #1:  (&xfs_mount_list_lock){++++.-}, at: [<ffffffff81281a76>] xfs_reclaim_inode_shrink+0xc6/0x140

stack backtrace:
Pid: 605, comm: kswapd0 Not tainted 2.6.35-rc5-00064-ga9f7f2e #334
Call Trace:
 [<ffffffff8106c5d9>] print_circular_bug+0xe9/0xf0
 [<ffffffff8106ef10>] __lock_acquire+0x1be0/0x1c10
 [<ffffffff8106e3c2>] ? __lock_acquire+0x1092/0x1c10
 [<ffffffff8106ef9a>] lock_acquire+0x5a/0x70
 [<ffffffff8125500c>] ? xfs_ilock+0x7c/0xa0
 [<ffffffff8105dfba>] down_write_nested+0x4a/0x70
 [<ffffffff8125500c>] ? xfs_ilock+0x7c/0xa0
 [<ffffffff815ae795>] ? sub_preempt_count+0x95/0xd0
 [<ffffffff8125500c>] xfs_ilock+0x7c/0xa0
 [<ffffffff81280c98>] xfs_reclaim_inode+0x98/0x250
 [<ffffffff81281824>] xfs_inode_ag_walk+0x74/0x120
 [<ffffffff81280c00>] ? xfs_reclaim_inode+0x0/0x250
 [<ffffffff81281953>] xfs_inode_ag_iterator+0x83/0xe0
 [<ffffffff81280c00>] ? xfs_reclaim_inode+0x0/0x250
 [<ffffffff81281aa4>] xfs_reclaim_inode_shrink+0xf4/0x140
 [<ffffffff8109ff7d>] shrink_slab+0x12d/0x190
 [<ffffffff810a07ad>] balance_pgdat+0x43d/0x6f0
 [<ffffffff810a0b1e>] kswapd+0xbe/0x2a0
 [<ffffffff81059700>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff815aaf3d>] ? _raw_spin_unlock_irqrestore+0x3d/0x70
 [<ffffffff810a0a60>] ? kswapd+0x0/0x2a0
 [<ffffffff810592ae>] kthread+0x8e/0xa0
 [<ffffffff81003194>] kernel_thread_helper+0x4/0x10
 [<ffffffff815ab400>] ? restore_args+0x0/0x30
 [<ffffffff81059220>] ? kthread+0x0/0xa0

^ permalink raw reply	[flat|nested] 76+ messages in thread
* Re: VFS scalability git tree
  2010-07-23 15:51 ` Nick Piggin
@ 2010-07-24  0:21 ` Dave Chinner
  0 siblings, 0 replies; 76+ messages in thread
From: Dave Chinner @ 2010-07-24  0:21 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar, John Stultz

On Sat, Jul 24, 2010 at 01:51:18AM +1000, Nick Piggin wrote:
> On Fri, Jul 23, 2010 at 09:13:10PM +1000, Dave Chinner wrote:
> > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > > I'm pleased to announce I have a git tree up of my vfs scalability work.
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> > >
> > > Branch vfs-scale-working
> >
> > I've got a couple of patches needed to build XFS - they shrinker
> > merge left some bad fragments - I'll post them in a minute. This
>
> OK cool.
>
> > email is for the longest ever lockdep warning I've seen that
> > occurred on boot.
>
> Ah thanks. OK that was one of my attempts to keep sockets out of
> hitting the vfs as much as possible (lazy inode number evaluation).
> Not a big problem, but I'll drop the patch for now.
>
> I have just got one for you too, btw :) (on vanilla kernel but it is
> messing up my lockdep stress testing on xfs). Real or false?
>
> [ INFO: possible circular locking dependency detected ]
> 2.6.35-rc5-00064-ga9f7f2e #334
> -------------------------------------------------------
> kswapd0/605 is trying to acquire lock:
>  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffff8125500c>]
> xfs_ilock+0x7c/0xa0
>
> but task is already holding lock:
>  (&xfs_mount_list_lock){++++.-}, at: [<ffffffff81281a76>]
> xfs_reclaim_inode_shrink+0xc6/0x140

False positive, but the xfs_mount_list_lock is gone in 2.6.35-rc6 -
the shrinker context change has fixed that - so you can ignore it
anyway.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 76+ messages in thread
* Re: VFS scalability git tree
  2010-07-22 19:01 ` Nick Piggin
@ 2010-07-23 11:17 ` Christoph Hellwig
  0 siblings, 0 replies; 76+ messages in thread
From: Christoph Hellwig @ 2010-07-23 11:17 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar, John Stultz

I might sound like a broken record, but if you want to make forward
progress with this, split it into smaller series.

What would be useful for example would be one series each to split
the global inode_lock and dcache_lock, without introducing all the
fancy new locking primitives, per-bucket locks and lru schemes for
a start.

^ permalink raw reply	[flat|nested] 76+ messages in thread
* Re: VFS scalability git tree
  2010-07-23 11:17 ` Christoph Hellwig
@ 2010-07-23 15:42 ` Nick Piggin
  0 siblings, 0 replies; 76+ messages in thread
From: Nick Piggin @ 2010-07-23 15:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Nick Piggin, linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar,
	John Stultz

On Fri, Jul 23, 2010 at 07:17:46AM -0400, Christoph Hellwig wrote:
> I might sound like a broken record, but if you want to make forward
> progress with this split it into smaller series.

No, I appreciate the advice. I put this tree up for people to fetch
without posting patches all the time. I think it is important to test
and to see the big picture when reviewing the patches, but you are
right about how to actually submit patches on the ML.

> What would be useful for example would be one series each to split
> the global inode_lock and dcache_lock, without introducing all the
> fancy new locking primitives, per-bucket locks and lru schemes for
> a start.

I've kept the series fairly well structured like that. Basically it
is in these parts:

1. files lock
2. vfsmount lock
3. mnt refcount
4a. put several new global spinlocks around different parts of dcache
4b. remove dcache_lock after the above protect everything
4c. start doing fine grained locking of hash, inode alias, lru, etc etc
5a, 5b, 5c. same for inodes
6. some further optimisations and cleanups
7. store-free path walking

This kind of sequence. I will again try to submit a first couple of
things to Al soon.

^ permalink raw reply	[flat|nested] 76+ messages in thread
* Re: VFS scalability git tree
  2010-07-22 19:01 ` Nick Piggin
@ 2010-07-23 13:55 ` Dave Chinner
  0 siblings, 0 replies; 76+ messages in thread
From: Dave Chinner @ 2010-07-23 13:55 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar, John Stultz

On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> I'm pleased to announce I have a git tree up of my vfs scalability work.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
>
> Branch vfs-scale-working

Bugs I've noticed so far:

- Using XFS, the existing vfs inode count statistic does not decrease
  as inodes are freed.
- the existing vfs dentry count remains at zero
- the existing vfs free inode count remains at zero

$ pminfo -f vfs.inodes vfs.dentry

vfs.inodes.count
    value 7472612

vfs.inodes.free
    value 0

vfs.dentry.count
    value 0

vfs.dentry.free
    value 0

Performance Summary:

With lockdep and CONFIG_XFS_DEBUG enabled, a 16 thread parallel
sequential create/unlink workload on an 8p/4GB RAM VM with a virtio
block device sitting on a short-stroked 12x2TB SAS array w/ 512MB
BBWC in RAID0 via dm and using the noop elevator in the guest VM:

$ sudo mkfs.xfs -f -l size=128m -d agcount=16 /dev/vdb
meta-data=/dev/vdb               isize=256    agcount=16, agsize=1638400 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=26214400, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=32768, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
$ sudo mount -o delaylog,logbsize=262144,nobarrier /dev/vdb /mnt/scratch
$ sudo chmod 777 /mnt/scratch
$ cd ~/src/fs_mark-3.3/
$ ./fs_mark -S0 -n 500000 -s 0 -d /mnt/scratch/0 -d /mnt/scratch/1
	-d /mnt/scratch/3 -d /mnt/scratch/2 -d /mnt/scratch/4
	-d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7
	-d /mnt/scratch/8 -d /mnt/scratch/9 -d /mnt/scratch/10
	-d /mnt/scratch/11 -d /mnt/scratch/12 -d /mnt/scratch/13
	-d /mnt/scratch/14 -d /mnt/scratch/15

			files/s
 2.6.34-rc4		12550
 2.6.35-rc5+scale	12285

So the same within the error margins of the benchmark.

Screenshot of monitoring graphs - you can see the effect of the
broken stats:

http://userweb.kernel.org/~dgc/shrinker-2.6.36/fs_mark-2.6.35-rc4-16x500-xfs.png
http://userweb.kernel.org/~dgc/shrinker-2.6.36/fs_mark-2.6.35-rc5-npiggin-scale-lockdep-16x500-xfs.png

With a production build (i.e. no lockdep, no xfs debug), I'll run the
same fs_mark parallel create/unlink workload to show scalability as I
ran here:

http://oss.sgi.com/archives/xfs/2010-05/msg00329.html

The numbers can't be directly compared, but the test and the setup
is the same. The XFS numbers below are with delayed logging enabled.
ext4 is using default mkfs and mount parameters except for barrier=0.
All numbers are averages of three runs.

	fs_mark rate (thousands of files/second)
           2.6.35-rc5   2.6.35-rc5-scale
threads    xfs   ext4     xfs    ext4
  1         20    39       20     39
  2         35    55       35     57
  4         60    41       57     42
  8         79     9       75      9

ext4 is getting IO bound at more than 2 threads, so apart from
pointing out that XFS is 8-9x faster than ext4 at 8 threads, I'm
going to ignore ext4 for the purposes of testing scalability here.

For XFS w/ delayed logging, 2.6.35-rc5 is only getting to about
600% CPU and with Nick's patches it's about 650% (10% higher) for
slightly lower throughput. So at this class of machine for this
workload, the changes result in a slight reduction in scalability.

I looked at dbench on XFS as well, but didn't see any significant
change in the numbers at up to 200 load threads, so not much to
talk about there.

Sometime over the weekend I'll build a 16p VM and see what I get
from that...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 76+ messages in thread
* Re: VFS scalability git tree @ 2010-07-23 13:55 ` Dave Chinner 0 siblings, 0 replies; 76+ messages in thread From: Dave Chinner @ 2010-07-23 13:55 UTC (permalink / raw) To: Nick Piggin Cc: linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar, John Stultz On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote: > I'm pleased to announce I have a git tree up of my vfs scalability work. > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git > > Branch vfs-scale-working Bug's I've noticed so far: - Using XFS, the existing vfs inode count statistic does not decrease as inodes are free. - the existing vfs dentry count remains at zero - the existing vfs free inode count remains at zero $ pminfo -f vfs.inodes vfs.dentry vfs.inodes.count value 7472612 vfs.inodes.free value 0 vfs.dentry.count value 0 vfs.dentry.free value 0 Performance Summary: With lockdep and CONFIG_XFS_DEBUG enabled, a 16 thread parallel sequential create/unlink workload on an 8p/4GB RAM VM with a virtio block device sitting on a short-stroked 12x2TB SAS array w/ 512MB BBWC in RAID0 via dm and using the noop elevator in the guest VM: $ sudo mkfs.xfs -f -l size=128m -d agcount=16 /dev/vdb meta-data=/dev/vdb isize=256 agcount=16, agsize=1638400 blks = sectsz=512 attr=2 data = bsize=4096 blocks=26214400, imaxpct=25 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0 log =internal log bsize=4096 blocks=32768, version=2 = sectsz=512 sunit=0 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 $ sudo mount -o delaylog,logbsize=262144,nobarrier /dev/vdb /mnt/scratch $ sudo chmod 777 /mnt/scratch $ cd ~/src/fs_mark-3.3/ $ ./fs_mark -S0 -n 500000 -s 0 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/3 -d /mnt/scratch/2 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7 -d /mnt/scratch/8 -d /mnt/scratch/9 -d /mnt/scratch/10 -d /mnt/scratch/11 -d /mnt/scratch/12 -d 
/mnt/scratch/13 -d /mnt/scratch/14 -d /mnt/scratch/15

                  files/s
2.6.34-rc4         12550
2.6.35-rc5+scale   12285

So the same within the error margins of the benchmark. Screenshot of monitoring graphs - you can see the effect of the broken stats:

http://userweb.kernel.org/~dgc/shrinker-2.6.36/fs_mark-2.6.35-rc4-16x500-xfs.png
http://userweb.kernel.org/~dgc/shrinker-2.6.36/fs_mark-2.6.35-rc5-npiggin-scale-lockdep-16x500-xfs.png

With a production build (i.e. no lockdep, no xfs debug), I'll run the same fs_mark parallel create/unlink workload to show scalability as I ran here:

http://oss.sgi.com/archives/xfs/2010-05/msg00329.html

The numbers can't be directly compared, but the test and the setup are the same. The XFS numbers below are with delayed logging enabled. ext4 is using default mkfs and mount parameters except for barrier=0. All numbers are averages of three runs.

		fs_mark rate (thousands of files/second)
           2.6.35-rc5   2.6.35-rc5-scale
threads    xfs  ext4    xfs  ext4
  1         20   39      20   39
  2         35   55      35   57
  4         60   41      57   42
  8         79    9      75    9

ext4 is getting IO bound at more than 2 threads, so apart from pointing out that XFS is 8-9x faster than ext4 at 8 threads, I'm going to ignore ext4 for the purposes of testing scalability here.

For XFS w/ delayed logging, 2.6.35-rc5 is only getting to about 600% CPU and with Nick's patches it's about 650% (10% higher) for slightly lower throughput. So at this class of machine for this workload, the changes result in a slight reduction in scalability.

I looked at dbench on XFS as well, but didn't see any significant change in the numbers at up to 200 load threads, so not much to talk about there. Sometime over the weekend I'll build a 16p VM and see what I get from that...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: VFS scalability git tree 2010-07-23 13:55 ` Dave Chinner @ 2010-07-23 16:16 ` Nick Piggin -1 siblings, 0 replies; 76+ messages in thread From: Nick Piggin @ 2010-07-23 16:16 UTC (permalink / raw) To: Dave Chinner Cc: Nick Piggin, linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar, John Stultz

On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote:
> On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > I'm pleased to announce I have a git tree up of my vfs scalability work.
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> >
> > Branch vfs-scale-working
>
> Bugs I've noticed so far:
>
> - Using XFS, the existing vfs inode count statistic does not decrease
>   as inodes are freed.
> - the existing vfs dentry count remains at zero
> - the existing vfs free inode count remains at zero
>
> $ pminfo -f vfs.inodes vfs.dentry
>
> vfs.inodes.count
>     value 7472612
>
> vfs.inodes.free
>     value 0
>
> vfs.dentry.count
>     value 0
>
> vfs.dentry.free
>     value 0

Hm, I must have broken it along the way and not noticed. Thanks for pointing that out.

> With a production build (i.e. no lockdep, no xfs debug), I'll
> run the same fs_mark parallel create/unlink workload to show
> scalability as I ran here:
>
> http://oss.sgi.com/archives/xfs/2010-05/msg00329.html
>
> The numbers can't be directly compared, but the test and the setup
> are the same. The XFS numbers below are with delayed logging
> enabled. ext4 is using default mkfs and mount parameters except for
> barrier=0. All numbers are averages of three runs.
>
> 		fs_mark rate (thousands of files/second)
>            2.6.35-rc5   2.6.35-rc5-scale
> threads    xfs  ext4    xfs  ext4
>   1         20   39      20   39
>   2         35   55      35   57
>   4         60   41      57   42
>   8         79    9      75    9
>
> ext4 is getting IO bound at more than 2 threads, so apart from
> pointing out that XFS is 8-9x faster than ext4 at 8 threads, I'm
> going to ignore ext4 for the purposes of testing scalability here.
>
> For XFS w/ delayed logging, 2.6.35-rc5 is only getting to about 600%
> CPU and with Nick's patches it's about 650% (10% higher) for
> slightly lower throughput. So at this class of machine for this
> workload, the changes result in a slight reduction in scalability.

That's a good test case, thanks. I'll see if I can find where this is coming from. I suspect it's the RCU-freed inodes. Hm, may have to make them DESTROY_BY_RCU after all.

Thanks,
Nick

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: VFS scalability git tree 2010-07-23 13:55 ` Dave Chinner @ 2010-07-27 7:05 ` Nick Piggin -1 siblings, 0 replies; 76+ messages in thread From: Nick Piggin @ 2010-07-27 7:05 UTC (permalink / raw) To: Dave Chinner Cc: Nick Piggin, linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar, John Stultz

On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote:
> On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > I'm pleased to announce I have a git tree up of my vfs scalability work.
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> >
> > Branch vfs-scale-working
>
> With a production build (i.e. no lockdep, no xfs debug), I'll
> run the same fs_mark parallel create/unlink workload to show
> scalability as I ran here:
>
> http://oss.sgi.com/archives/xfs/2010-05/msg00329.html

I've made a similar setup on a 2s8c machine, but using a 2GB ramdisk instead of a real disk (I don't have easy access to a good disk setup ATM, but I guess we're more interested in code above the block layer anyway).

Made an XFS on /dev/ram0 with 16 ags, 64MB log, otherwise same config as yours.

I found that performance is a little unstable, so I sync and echo 3 > drop_caches between each run. When it starts reclaiming memory, things get a bit more erratic (and XFS seemed to be almost livelocking for tens of seconds in inode reclaim). So I started with 50 runs of fs_mark -n 20000 (which did not cause reclaim), rebuilding a new filesystem between every run. That gave the following files/sec numbers:

    N           Min           Max        Median           Avg        Stddev
x  50      100986.4        127622      125013.4     123248.82     5244.1988
+  50      100967.6      135918.6      130214.9     127926.94     6374.6975
Difference at 95.0% confidence
        4678.12 +/- 2316.07
        3.79567% +/- 1.87919%
        (Student's t, pooled s = 5836.88)

This is 3.8% in favour of vfs-scale-working.
I then did 10 runs of -n 20000 but with -L 4 (4 iterations), which did start to fill up memory and cause reclaim during the 2nd and subsequent iterations.

    N           Min           Max        Median           Avg        Stddev
x  10      116919.7      126785.7      123279.2     122245.17     3169.7993
+  10      110985.1      132440.7      130122.1     126573.41     7151.2947
No difference proven at 95.0% confidence

x  10       75820.9      105934.9       79521.7      84263.37     11210.173
+  10       75698.3      115091.7         82932      93022.75     16725.304
No difference proven at 95.0% confidence

x  10       66330.5       74950.4       69054.5         69102      2335.615
+  10       68348.5       74231.5       70728.2      70879.45     1838.8345
No difference proven at 95.0% confidence

x  10       59353.8       69813.1       67416.7      65164.96     4175.8209
+  10       59670.7       77719.1       74326.1      70966.02     6469.0398
Difference at 95.0% confidence
        5801.06 +/- 5115.66
        8.90212% +/- 7.85033%
        (Student's t, pooled s = 5444.54)

vfs-scale-working was ahead at every point, but the results were too erratic to read much into it (even the last point, I think, is questionable). I can provide raw numbers or more details on the setup if required.

> enabled. ext4 is using default mkfs and mount parameters except for
> barrier=0. All numbers are averages of three runs.
>
> 		fs_mark rate (thousands of files/second)
>            2.6.35-rc5   2.6.35-rc5-scale
> threads    xfs  ext4    xfs  ext4
>   1         20   39      20   39
>   2         35   55      35   57
>   4         60   41      57   42
>   8         79    9      75    9
>
> ext4 is getting IO bound at more than 2 threads, so apart from
> pointing out that XFS is 8-9x faster than ext4 at 8 threads, I'm
> going to ignore ext4 for the purposes of testing scalability here.
>
> For XFS w/ delayed logging, 2.6.35-rc5 is only getting to about 600%
> CPU and with Nick's patches it's about 650% (10% higher) for
> slightly lower throughput. So at this class of machine for this
> workload, the changes result in a slight reduction in scalability.

I wonder if these results are stable. It's possible that changes in reclaim behaviour are causing my patches to require more IO for a given unit of work?
I was seeing XFS 'livelock' in reclaim more with my patches; it could be due to more parallelism now being allowed from the vfs and reclaim. Based on my above numbers, I don't see that rcu-inodes is causing a problem, and in terms of SMP scalability, there is really no way that vanilla is more scalable, so I'm interested to see where this slowdown is coming from.

> I looked at dbench on XFS as well, but didn't see any significant
> change in the numbers at up to 200 load threads, so not much to
> talk about there.

On a smaller system, dbench doesn't bottleneck too much. It's more of a test to find shared cachelines and such on larger systems when you're talking about several GB/s bandwidths.

Thanks,
Nick

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: VFS scalability git tree 2010-07-27 7:05 ` Nick Piggin @ 2010-07-27 8:06 ` Nick Piggin -1 siblings, 0 replies; 76+ messages in thread From: Nick Piggin @ 2010-07-27 8:06 UTC (permalink / raw) To: xfs; +Cc: Dave Chinner, linux-fsdevel On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote: > On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote: > > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote: > > > I'm pleased to announce I have a git tree up of my vfs scalability work. > > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git > > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git > > > > > > Branch vfs-scale-working > > > > With a production build (i.e. no lockdep, no xfs debug), I'll > > run the same fs_mark parallel create/unlink workload to show > > scalability as I ran here: > > > > http://oss.sgi.com/archives/xfs/2010-05/msg00329.html > > I've made a similar setup, 2s8c machine, but using 2GB ramdisk instead > of a real disk (I don't have easy access to a good disk setup ATM, but > I guess we're more interested in code above the block layer anyway). > > Made an XFS on /dev/ram0 with 16 ags, 64MB log, otherwise same config as > yours. > > I found that performance is a little unstable, so I sync and echo 3 > > drop_caches between each run. When it starts reclaiming memory, things > get a bit more erratic (and XFS seemed to be almost livelocking for tens > of seconds in inode reclaim). So about this XFS livelock type thingy. 
It looks like this, and happens periodically while running the above fs_mark benchmark requiring reclaim of inodes:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi     bo    in    cs us sy id wa
15  0   6900  31032    192 471852    0    0    28 183296  8520 46672  5 91  4  0
19  0   7044  22928    192 466712   96  144  1056 115586  8622 41695  3 96  1  0
19  0   7136  59884    192 471200  160   92  6768  34564   995   542  1 99  0  0
19  0   7244  17008    192 467860    0  104  2068  32953  1044   630  1 99  0  0
18  0   7244  43436    192 467324    0    0    12      0   817   405  0 100 0  0
18  0   7244  43684    192 467324    0    0     0      0   806   425  0 100 0  0
18  0   7244  43932    192 467324    0    0     0      0   808   403  0 100 0  0
18  0   7244  44924    192 467324    0    0     0      0   808   398  0 100 0  0
18  0   7244  45456    192 467324    0    0     0      0   809   409  0 100 0  0
18  0   7244  45472    192 467324    0    0     0      0   805   412  0 100 0  0
18  0   7244  46392    192 467324    0    0     0      0   807   401  0 100 0  0
18  0   7244  47012    192 467324    0    0     0      0   810   414  0 100 0  0
18  0   7244  47260    192 467324    0    0     0      0   806   396  0 100 0  0
18  0   7244  47752    192 467324    0    0     0      0   806   403  0 100 0  0
18  0   7244  48204    192 467324    0    0     0      0   810   409  0 100 0  0
18  0   7244  48608    192 467324    0    0     0      0   807   412  0 100 0  0
18  0   7244  48876    192 467324    0    0     0      0   805   406  0 100 0  0
18  0   7244  49000    192 467324    0    0     0      0   809   402  0 100 0  0
18  0   7244  49408    192 467324    0    0     0      0   807   396  0 100 0  0
18  0   7244  49908    192 467324    0    0     0      0   809   406  0 100 0  0
18  0   7244  50032    192 467324    0    0     0      0   805   404  0 100 0  0
18  0   7244  50032    192 467324    0    0     0      0   805   406  0 100 0  0
19  0   7244  73436    192 467324    0    0     0   6340   808   384  0 100 0  0
20  0   7244 490220    192 467324    0    0     0   8411   830   389  0 100 0  0
18  0   7244 620092    192 467324    0    0     0      4   809   435  0 100 0  0
18  0   7244 620344    192 467324    0    0     0      0   806   430  0 100 0  0
16  0   7244 682620    192 467324    0    0    44     80   890   326  0 100 0  0
12  0   7244 604464    192 479308   76    0 11716  73555  2242 14318  2 94  4  0
12  0   7244 556700    192 483488    0    0  4276  77680  6576 92285  1 97  2  0
17  0   7244 502508    192 485456    0    0  2092  98368  6308 91919  1 96  4  0
11  0   7244 416500    192 487116    0    0  1760 114844  7414 63025  2 96  2  0

Nothing much happening except 100% system time for seconds at a time (length of time varies). This is on a ramdisk, so it isn't waiting for IO. During this time, lots of things are contending on the lock:

    60.37%  fs_mark  [kernel.kallsyms]  [k] __write_lock_failed
     4.30%  kswapd0  [kernel.kallsyms]  [k] __write_lock_failed
     3.70%  fs_mark  [kernel.kallsyms]  [k] try_wait_for_completion
     3.59%  fs_mark  [kernel.kallsyms]  [k] _raw_write_lock
     3.46%  kswapd1  [kernel.kallsyms]  [k] __write_lock_failed
                     |
                     --- __write_lock_failed
                        |
                        |--99.92%-- xfs_inode_ag_walk
                        |          xfs_inode_ag_iterator
                        |          xfs_reclaim_inode_shrink
                        |          shrink_slab
                        |          shrink_zone
                        |          balance_pgdat
                        |          kswapd
                        |          kthread
                        |          kernel_thread_helper
                         --0.08%-- [...]
     3.02%  fs_mark  [kernel.kallsyms]  [k] _raw_spin_lock
     1.82%  fs_mark  [kernel.kallsyms]  [k] _xfs_buf_find
     1.16%  fs_mark  [kernel.kallsyms]  [k] memcpy
     0.86%  fs_mark  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
     0.75%  fs_mark  [kernel.kallsyms]  [k] xfs_log_commit_cil
                     |
                     --- xfs_log_commit_cil
                         _xfs_trans_commit
                        |
                        |--60.00%-- xfs_remove
                        |          xfs_vn_unlink
                        |          vfs_unlink
                        |          do_unlinkat
                        |          sys_unlink

I'm not sure if there was a long-running read locker in there causing all the write lockers to fail, or if they are just running into one another. But anyway, I hacked the following patch which seemed to improve that behaviour. I haven't run any throughput numbers on it yet, but I could if you're interested (and it's not completely broken!)

Batch pag_ici_lock acquisition on the reclaim path, and also skip inodes that appear to be busy to improve locking efficiency.
Index: source/fs/xfs/linux-2.6/xfs_sync.c
===================================================================
--- source.orig/fs/xfs/linux-2.6/xfs_sync.c	2010-07-26 21:12:11.000000000 +1000
+++ source/fs/xfs/linux-2.6/xfs_sync.c	2010-07-26 21:58:59.000000000 +1000
@@ -87,6 +87,91 @@ xfs_inode_ag_lookup(
 	return ip;
 }
 
+#define RECLAIM_BATCH_SIZE	32
+STATIC int
+xfs_inode_ag_walk_reclaim(
+	struct xfs_mount	*mp,
+	struct xfs_perag	*pag,
+	int			(*execute)(struct xfs_inode *ip,
+					   struct xfs_perag *pag, int flags),
+	int			flags,
+	int			tag,
+	int			exclusive,
+	int			*nr_to_scan)
+{
+	uint32_t	first_index;
+	int		last_error = 0;
+	int		skipped;
+	xfs_inode_t	*batch[RECLAIM_BATCH_SIZE];
+	int		batchnr;
+	int		i;
+
+	BUG_ON(!exclusive);
+
+restart:
+	skipped = 0;
+	first_index = 0;
+next_batch:
+	batchnr = 0;
+	/* fill the batch */
+	write_lock(&pag->pag_ici_lock);
+	do {
+		xfs_inode_t *ip;
+
+		ip = xfs_inode_ag_lookup(mp, pag, &first_index, tag);
+		if (!ip)
+			break;
+		if (!(flags & SYNC_WAIT) &&
+		    (!xfs_iflock_free(ip) ||
+		     __xfs_iflags_test(ip, XFS_IRECLAIM)))
+			continue;
+
+		/*
+		 * The radix tree lock here protects a thread in xfs_iget from
+		 * racing with us starting reclaim on the inode. Once we have
+		 * the XFS_IRECLAIM flag set it will not touch us.
+		 */
+		spin_lock(&ip->i_flags_lock);
+		ASSERT_ALWAYS(__xfs_iflags_test(ip, XFS_IRECLAIMABLE));
+		if (__xfs_iflags_test(ip, XFS_IRECLAIM)) {
+			/* ignore as it is already under reclaim */
+			spin_unlock(&ip->i_flags_lock);
+			continue;
+		}
+		__xfs_iflags_set(ip, XFS_IRECLAIM);
+		spin_unlock(&ip->i_flags_lock);
+
+		batch[batchnr++] = ip;
+	} while ((*nr_to_scan)-- && batchnr < RECLAIM_BATCH_SIZE);
+	write_unlock(&pag->pag_ici_lock);
+
+	for (i = 0; i < batchnr; i++) {
+		int error = 0;
+		xfs_inode_t *ip = batch[i];
+
+		/* execute doesn't require pag->pag_ici_lock */
+		error = execute(ip, pag, flags);
+		if (error == EAGAIN) {
+			skipped++;
+			continue;
+		}
+		if (error)
+			last_error = error;
+
+		/* bail out if the filesystem is corrupted. */
+		if (error == EFSCORRUPTED)
+			break;
+	}
+	if (batchnr == RECLAIM_BATCH_SIZE)
+		goto next_batch;
+
+	if (0 && skipped) {
+		delay(1);
+		goto restart;
+	}
+	return last_error;
+}
+
 STATIC int
 xfs_inode_ag_walk(
 	struct xfs_mount	*mp,
@@ -113,6 +198,7 @@ restart:
 		write_lock(&pag->pag_ici_lock);
 	else
 		read_lock(&pag->pag_ici_lock);
+
 	ip = xfs_inode_ag_lookup(mp, pag, &first_index, tag);
 	if (!ip) {
 		if (exclusive)
@@ -198,8 +284,12 @@ xfs_inode_ag_iterator(
 	nr = nr_to_scan ? *nr_to_scan : INT_MAX;
 	ag = 0;
 	while ((pag = xfs_inode_ag_iter_next_pag(mp, &ag, tag))) {
-		error = xfs_inode_ag_walk(mp, pag, execute, flags, tag,
-						exclusive, &nr);
+		if (tag == XFS_ICI_RECLAIM_TAG)
+			error = xfs_inode_ag_walk_reclaim(mp, pag, execute,
+						flags, tag, exclusive, &nr);
+		else
+			error = xfs_inode_ag_walk(mp, pag, execute,
+						flags, tag, exclusive, &nr);
 		xfs_perag_put(pag);
 		if (error) {
 			last_error = error;
@@ -789,23 +879,6 @@ xfs_reclaim_inode(
 {
 	int	error = 0;
 
-	/*
-	 * The radix tree lock here protects a thread in xfs_iget from racing
-	 * with us starting reclaim on the inode. Once we have the
-	 * XFS_IRECLAIM flag set it will not touch us.
-	 */
-	spin_lock(&ip->i_flags_lock);
-	ASSERT_ALWAYS(__xfs_iflags_test(ip, XFS_IRECLAIMABLE));
-	if (__xfs_iflags_test(ip, XFS_IRECLAIM)) {
-		/* ignore as it is already under reclaim */
-		spin_unlock(&ip->i_flags_lock);
-		write_unlock(&pag->pag_ici_lock);
-		return 0;
-	}
-	__xfs_iflags_set(ip, XFS_IRECLAIM);
-	spin_unlock(&ip->i_flags_lock);
-	write_unlock(&pag->pag_ici_lock);
-
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	if (!xfs_iflock_nowait(ip)) {
 		if (!(sync_mode & SYNC_WAIT))
Index: source/fs/xfs/xfs_inode.h
===================================================================
--- source.orig/fs/xfs/xfs_inode.h	2010-07-26 21:10:33.000000000 +1000
+++ source/fs/xfs/xfs_inode.h	2010-07-26 21:11:59.000000000 +1000
@@ -349,6 +349,11 @@ static inline int xfs_iflock_nowait(xfs_
 	return try_wait_for_completion(&ip->i_flush);
 }
 
+static inline int xfs_iflock_free(xfs_inode_t *ip)
+{
+	return completion_done(&ip->i_flush);
+}
+
 static inline void xfs_ifunlock(xfs_inode_t *ip)
 {
 	complete(&ip->i_flush);

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: VFS scalability git tree @ 2010-07-27 8:06 ` Nick Piggin 0 siblings, 0 replies; 76+ messages in thread From: Nick Piggin @ 2010-07-27 8:06 UTC (permalink / raw) To: xfs; +Cc: linux-fsdevel On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote: > On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote: > > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote: > > > I'm pleased to announce I have a git tree up of my vfs scalability work. > > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git > > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git > > > > > > Branch vfs-scale-working > > > > With a production build (i.e. no lockdep, no xfs debug), I'll > > run the same fs_mark parallel create/unlink workload to show > > scalability as I ran here: > > > > http://oss.sgi.com/archives/xfs/2010-05/msg00329.html > > I've made a similar setup, 2s8c machine, but using 2GB ramdisk instead > of a real disk (I don't have easy access to a good disk setup ATM, but > I guess we're more interested in code above the block layer anyway). > > Made an XFS on /dev/ram0 with 16 ags, 64MB log, otherwise same config as > yours. > > I found that performance is a little unstable, so I sync and echo 3 > > drop_caches between each run. When it starts reclaiming memory, things > get a bit more erratic (and XFS seemed to be almost livelocking for tens > of seconds in inode reclaim). So about this XFS livelock type thingy. 
It looks like this, and happens periodically while running the above fs_mark benchmark requiring reclaim of inodes: procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 15 0 6900 31032 192 471852 0 0 28 183296 8520 46672 5 91 4 0 19 0 7044 22928 192 466712 96 144 1056 115586 8622 41695 3 96 1 0 19 0 7136 59884 192 471200 160 92 6768 34564 995 542 1 99 0 0 19 0 7244 17008 192 467860 0 104 2068 32953 1044 630 1 99 0 0 18 0 7244 43436 192 467324 0 0 12 0 817 405 0 100 0 0 18 0 7244 43684 192 467324 0 0 0 0 806 425 0 100 0 0 18 0 7244 43932 192 467324 0 0 0 0 808 403 0 100 0 0 18 0 7244 44924 192 467324 0 0 0 0 808 398 0 100 0 0 18 0 7244 45456 192 467324 0 0 0 0 809 409 0 100 0 0 18 0 7244 45472 192 467324 0 0 0 0 805 412 0 100 0 0 18 0 7244 46392 192 467324 0 0 0 0 807 401 0 100 0 0 18 0 7244 47012 192 467324 0 0 0 0 810 414 0 100 0 0 18 0 7244 47260 192 467324 0 0 0 0 806 396 0 100 0 0 18 0 7244 47752 192 467324 0 0 0 0 806 403 0 100 0 0 18 0 7244 48204 192 467324 0 0 0 0 810 409 0 100 0 0 18 0 7244 48608 192 467324 0 0 0 0 807 412 0 100 0 0 18 0 7244 48876 192 467324 0 0 0 0 805 406 0 100 0 0 18 0 7244 49000 192 467324 0 0 0 0 809 402 0 100 0 0 18 0 7244 49408 192 467324 0 0 0 0 807 396 0 100 0 0 18 0 7244 49908 192 467324 0 0 0 0 809 406 0 100 0 0 18 0 7244 50032 192 467324 0 0 0 0 805 404 0 100 0 0 18 0 7244 50032 192 467324 0 0 0 0 805 406 0 100 0 0 19 0 7244 73436 192 467324 0 0 0 6340 808 384 0 100 0 0 20 0 7244 490220 192 467324 0 0 0 8411 830 389 0 100 0 0 18 0 7244 620092 192 467324 0 0 0 4 809 435 0 100 0 0 18 0 7244 620344 192 467324 0 0 0 0 806 430 0 100 0 0 16 0 7244 682620 192 467324 0 0 44 80 890 326 0 100 0 0 12 0 7244 604464 192 479308 76 0 11716 73555 2242 14318 2 94 4 0 12 0 7244 556700 192 483488 0 0 4276 77680 6576 92285 1 97 2 0 17 0 7244 502508 192 485456 0 0 2092 98368 6308 91919 1 96 4 0 11 0 7244 416500 192 487116 0 0 1760 114844 7414 63025 2 96 2 0 Nothing much 
happening except 100% system time for seconds at a time (length of time varies). This is on a ramdisk, so it isn't waiting for IO. During this time, lots of things are contending on the lock: 60.37% fs_mark [kernel.kallsyms] [k] __write_lock_failed 4.30% kswapd0 [kernel.kallsyms] [k] __write_lock_failed 3.70% fs_mark [kernel.kallsyms] [k] try_wait_for_completion 3.59% fs_mark [kernel.kallsyms] [k] _raw_write_lock 3.46% kswapd1 [kernel.kallsyms] [k] __write_lock_failed | --- __write_lock_failed | |--99.92%-- xfs_inode_ag_walk | xfs_inode_ag_iterator | xfs_reclaim_inode_shrink | shrink_slab | shrink_zone | balance_pgdat | kswapd | kthread | kernel_thread_helper --0.08%-- [...] 3.02% fs_mark [kernel.kallsyms] [k] _raw_spin_lock 1.82% fs_mark [kernel.kallsyms] [k] _xfs_buf_find 1.16% fs_mark [kernel.kallsyms] [k] memcpy 0.86% fs_mark [kernel.kallsyms] [k] _raw_spin_lock_irqsave 0.75% fs_mark [kernel.kallsyms] [k] xfs_log_commit_cil | --- xfs_log_commit_cil _xfs_trans_commit | |--60.00%-- xfs_remove | xfs_vn_unlink | vfs_unlink | do_unlinkat | sys_unlink I'm not sure if there was a long-running read locker in there causing all the write lockers to fail, or if they are just running into one another. But anyway, I hacked the following patch which seemed to improve that behaviour. I haven't run any throughput numbers on it yet, but I could if you're interested (and it's not completely broken!) Batch pag_ici_lock acquisition on the reclaim path, and also skip inodes that appear to be busy to improve locking efficiency. 
Index: source/fs/xfs/linux-2.6/xfs_sync.c
===================================================================
--- source.orig/fs/xfs/linux-2.6/xfs_sync.c	2010-07-26 21:12:11.000000000 +1000
+++ source/fs/xfs/linux-2.6/xfs_sync.c	2010-07-26 21:58:59.000000000 +1000
@@ -87,6 +87,91 @@ xfs_inode_ag_lookup(
 	return ip;
 }
 
+#define RECLAIM_BATCH_SIZE	32
+STATIC int
+xfs_inode_ag_walk_reclaim(
+	struct xfs_mount	*mp,
+	struct xfs_perag	*pag,
+	int			(*execute)(struct xfs_inode *ip,
+					   struct xfs_perag *pag, int flags),
+	int			flags,
+	int			tag,
+	int			exclusive,
+	int			*nr_to_scan)
+{
+	uint32_t		first_index;
+	int			last_error = 0;
+	int			skipped;
+	xfs_inode_t		*batch[RECLAIM_BATCH_SIZE];
+	int			batchnr;
+	int			i;
+
+	BUG_ON(!exclusive);
+
+restart:
+	skipped = 0;
+	first_index = 0;
+next_batch:
+	batchnr = 0;
+	/* fill the batch */
+	write_lock(&pag->pag_ici_lock);
+	do {
+		xfs_inode_t *ip;
+
+		ip = xfs_inode_ag_lookup(mp, pag, &first_index, tag);
+		if (!ip)
+			break;
+		if (!(flags & SYNC_WAIT) &&
+		    (!xfs_iflock_free(ip) ||
+		     __xfs_iflags_test(ip, XFS_IRECLAIM)))
+			continue;
+
+		/*
+		 * The radix tree lock here protects a thread in xfs_iget from
+		 * racing with us starting reclaim on the inode. Once we have
+		 * the XFS_IRECLAIM flag set it will not touch us.
+		 */
+		spin_lock(&ip->i_flags_lock);
+		ASSERT_ALWAYS(__xfs_iflags_test(ip, XFS_IRECLAIMABLE));
+		if (__xfs_iflags_test(ip, XFS_IRECLAIM)) {
+			/* ignore as it is already under reclaim */
+			spin_unlock(&ip->i_flags_lock);
+			continue;
+		}
+		__xfs_iflags_set(ip, XFS_IRECLAIM);
+		spin_unlock(&ip->i_flags_lock);
+
+		batch[batchnr++] = ip;
+	} while ((*nr_to_scan)-- && batchnr < RECLAIM_BATCH_SIZE);
+	write_unlock(&pag->pag_ici_lock);
+
+	for (i = 0; i < batchnr; i++) {
+		int error = 0;
+		xfs_inode_t *ip = batch[i];
+
+		/* execute doesn't require pag->pag_ici_lock */
+		error = execute(ip, pag, flags);
+		if (error == EAGAIN) {
+			skipped++;
+			continue;
+		}
+		if (error)
+			last_error = error;
+
+		/* bail out if the filesystem is corrupted. */
+		if (error == EFSCORRUPTED)
+			break;
+	}
+	if (batchnr == RECLAIM_BATCH_SIZE)
+		goto next_batch;
+
+	if (0 && skipped) {
+		delay(1);
+		goto restart;
+	}
+	return last_error;
+}
+
 STATIC int
 xfs_inode_ag_walk(
 	struct xfs_mount	*mp,
@@ -113,6 +198,7 @@ restart:
 		write_lock(&pag->pag_ici_lock);
 	else
 		read_lock(&pag->pag_ici_lock);
+
 	ip = xfs_inode_ag_lookup(mp, pag, &first_index, tag);
 	if (!ip) {
 		if (exclusive)
@@ -198,8 +284,12 @@ xfs_inode_ag_iterator(
 	nr = nr_to_scan ? *nr_to_scan : INT_MAX;
 	ag = 0;
 	while ((pag = xfs_inode_ag_iter_next_pag(mp, &ag, tag))) {
-		error = xfs_inode_ag_walk(mp, pag, execute, flags, tag,
-						exclusive, &nr);
+		if (tag == XFS_ICI_RECLAIM_TAG)
+			error = xfs_inode_ag_walk_reclaim(mp, pag, execute,
+					flags, tag, exclusive, &nr);
+		else
+			error = xfs_inode_ag_walk(mp, pag, execute,
+					flags, tag, exclusive, &nr);
 		xfs_perag_put(pag);
 		if (error) {
 			last_error = error;
@@ -789,23 +879,6 @@ xfs_reclaim_inode(
 {
 	int	error = 0;
 
-	/*
-	 * The radix tree lock here protects a thread in xfs_iget from racing
-	 * with us starting reclaim on the inode. Once we have the
-	 * XFS_IRECLAIM flag set it will not touch us.
-	 */
-	spin_lock(&ip->i_flags_lock);
-	ASSERT_ALWAYS(__xfs_iflags_test(ip, XFS_IRECLAIMABLE));
-	if (__xfs_iflags_test(ip, XFS_IRECLAIM)) {
-		/* ignore as it is already under reclaim */
-		spin_unlock(&ip->i_flags_lock);
-		write_unlock(&pag->pag_ici_lock);
-		return 0;
-	}
-	__xfs_iflags_set(ip, XFS_IRECLAIM);
-	spin_unlock(&ip->i_flags_lock);
-	write_unlock(&pag->pag_ici_lock);
-
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	if (!xfs_iflock_nowait(ip)) {
 		if (!(sync_mode & SYNC_WAIT))
Index: source/fs/xfs/xfs_inode.h
===================================================================
--- source.orig/fs/xfs/xfs_inode.h	2010-07-26 21:10:33.000000000 +1000
+++ source/fs/xfs/xfs_inode.h	2010-07-26 21:11:59.000000000 +1000
@@ -349,6 +349,11 @@ static inline int xfs_iflock_nowait(xfs_
 	return try_wait_for_completion(&ip->i_flush);
 }
 
+static inline int xfs_iflock_free(xfs_inode_t *ip)
+{
+	return completion_done(&ip->i_flush);
+}
+
 static inline void xfs_ifunlock(xfs_inode_t *ip)
 {
 	complete(&ip->i_flush);

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply [flat|nested] 76+ messages in thread
* XFS hang in xlog_grant_log_space (was Re: VFS scalability git tree) 2010-07-27 8:06 ` Nick Piggin (?) @ 2010-07-27 11:36 ` Nick Piggin 2010-07-27 13:30 ` Dave Chinner -1 siblings, 1 reply; 76+ messages in thread From: Nick Piggin @ 2010-07-27 11:36 UTC (permalink / raw) To: xfs On Tue, Jul 27, 2010 at 06:06:32PM +1000, Nick Piggin wrote: > On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote: > > On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote: > > > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote: > > > > I'm pleased to announce I have a git tree up of my vfs scalability work. > > > > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git > > > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git > > > > > > > > Branch vfs-scale-working > > > > > > With a production build (i.e. no lockdep, no xfs debug), I'll > > > run the same fs_mark parallel create/unlink workload to show > > > scalability as I ran here: > > > > > > http://oss.sgi.com/archives/xfs/2010-05/msg00329.html > > > > I've made a similar setup, 2s8c machine, but using 2GB ramdisk instead > > of a real disk (I don't have easy access to a good disk setup ATM, but > > I guess we're more interested in code above the block layer anyway). > > > > Made an XFS on /dev/ram0 with 16 ags, 64MB log, otherwise same config as > > yours. > > > > I found that performance is a little unstable, so I sync and echo 3 > > > drop_caches between each run. When it starts reclaiming memory, things > > get a bit more erratic (and XFS seemed to be almost livelocking for tens > > of seconds in inode reclaim). On this same system, same setup (vanilla kernel with sha given below), I have now twice reproduced a complete hang in XFS. I can give more information, test patches or options etc if required. 
setup.sh looks like this:

#!/bin/bash
modprobe rd rd_size=$[2*1024*1024]
dd if=/dev/zero of=/dev/ram0 bs=4K
mkfs.xfs -f -l size=64m -d agcount=16 /dev/ram0
mount -o delaylog,logbsize=262144,nobarrier /dev/ram0 mnt

The 'dd' is required to ensure the rd driver does not allocate pages during IO (which can lead to out of memory deadlocks). Running just involves changing into the mnt directory and

while true
do
	sync
	echo 3 > /proc/sys/vm/drop_caches
	../dbench -c ../loadfiles/client.txt -t20 8
	rm -rf clients
done

And wait for it to hang (happened in < 5 minutes here). Sysrq of blocked tasks looks like this: Linux version 2.6.35-rc5-00176-gcd5b8f8 (npiggin@amd) (gcc version 4.4.4 (Debian 4.4.4-7) ) #348 SMP Mon Jul 26 22:20:32 EST 2010 brd: module loaded Enabling EXPERIMENTAL delayed logging feature - use at your own risk. XFS mounting filesystem ram0 Ending clean XFS mount for filesystem: ram0 SysRq : Show Blocked State task PC stack pid father flush-1:0 D 00000000fffff8fd 0 2799 2 0x00000000 ffff8800701ff690 0000000000000046 ffff880000000000 ffff8800701fffd8 ffff8800071531f0 00000000000122c0 0000000000004000 ffff8800701fffd8 0000000000004000 00000000000122c0 ffff88007f2d3750 ffff8800071531f0 Call Trace: [<ffffffff812361f8>] xlog_grant_log_space+0x158/0x3d0 [<ffffffff8124a0b5>] ? kmem_zone_zalloc+0x35/0x50 [<ffffffff81034bf0>] ? default_wake_function+0x0/0x10 [<ffffffff8124410c>] ? xfs_trans_ail_push+0x1c/0x80 [<ffffffff81236552>] xfs_log_reserve+0xe2/0xf0 [<ffffffff81243307>] xfs_trans_reserve+0x97/0x200 [<ffffffff8122b637>] ? xfs_iunlock+0x57/0xb0 [<ffffffff8123275b>] xfs_iomap_write_allocate+0x25b/0x3c0 [<ffffffff81232abe>] xfs_iomap+0x1fe/0x270 [<ffffffff8124aaf7>] xfs_map_blocks+0x37/0x40 [<ffffffff8107b8e1>] ? find_lock_page+0x21/0x70 [<ffffffff8124bdca>] xfs_page_state_convert+0x35a/0x690 [<ffffffff8107c38a>] ? find_or_create_page+0x3a/0xa0 [<ffffffff8124c3a6>] xfs_vm_writepage+0x76/0x110 [<ffffffff8108eb00>] ?
__dec_zone_page_state+0x30/0x40 [<ffffffff81082c32>] __writepage+0x12/0x40 [<ffffffff81083367>] write_cache_pages+0x1c7/0x3d0 [<ffffffff81082c20>] ? __writepage+0x0/0x40 [<ffffffff8108358f>] generic_writepages+0x1f/0x30 [<ffffffff8124c2fc>] xfs_vm_writepages+0x4c/0x60 [<ffffffff810835bc>] do_writepages+0x1c/0x40 [<ffffffff810d606e>] writeback_single_inode+0xce/0x3b0 [<ffffffff810d6774>] writeback_sb_inodes+0x174/0x260 [<ffffffff810d701f>] writeback_inodes_wb+0x8f/0x180 [<ffffffff810d733b>] wb_writeback+0x22b/0x290 [<ffffffff810d7536>] wb_do_writeback+0x196/0x1a0 [<ffffffff810d7583>] bdi_writeback_task+0x43/0x120 [<ffffffff81050f46>] ? bit_waitqueue+0x16/0xe0 [<ffffffff8108fd00>] ? bdi_start_fn+0x0/0xe0 [<ffffffff8108fd6c>] bdi_start_fn+0x6c/0xe0 [<ffffffff8108fd00>] ? bdi_start_fn+0x0/0xe0 [<ffffffff81050bee>] kthread+0x8e/0xa0 [<ffffffff81003014>] kernel_thread_helper+0x4/0x10 [<ffffffff81050b60>] ? kthread+0x0/0xa0 [<ffffffff81003010>] ? kernel_thread_helper+0x0/0x10 xfssyncd/ram0 D 00000000fffff045 0 2807 2 0x00000000 ffff880007635d00 0000000000000046 ffff880000000000 ffff880007635fd8 ffff880007abd370 00000000000122c0 0000000000004000 ffff880007635fd8 0000000000004000 00000000000122c0 ffff88007f2b7710 ffff880007abd370 Call Trace: [<ffffffff812361f8>] xlog_grant_log_space+0x158/0x3d0 [<ffffffff8124a0b5>] ? kmem_zone_zalloc+0x35/0x50 [<ffffffff81034bf0>] ? default_wake_function+0x0/0x10 [<ffffffff8124410c>] ? xfs_trans_ail_push+0x1c/0x80 [<ffffffff81236552>] xfs_log_reserve+0xe2/0xf0 [<ffffffff81243307>] xfs_trans_reserve+0x97/0x200 [<ffffffff812572b7>] ? xfs_inode_ag_iterator+0x57/0xd0 [<ffffffff81256c0a>] xfs_commit_dummy_trans+0x4a/0xe0 [<ffffffff81257454>] xfs_sync_worker+0x74/0x80 [<ffffffff81256b2a>] xfssyncd+0x13a/0x1d0 [<ffffffff812569f0>] ? xfssyncd+0x0/0x1d0 [<ffffffff81050bee>] kthread+0x8e/0xa0 [<ffffffff81003014>] kernel_thread_helper+0x4/0x10 [<ffffffff81050b60>] ? kthread+0x0/0xa0 [<ffffffff81003010>] ? 
kernel_thread_helper+0x0/0x10 dbench D 00000000ffffefc6 0 2975 2974 0x00000000 ffff88005ecd1ae8 0000000000000082 ffff880000000000 ffff88005ecd1fd8 ffff8800079ce250 00000000000122c0 0000000000004000 ffff88005ecd1fd8 0000000000004000 00000000000122c0 ffff88007f2d3750 ffff8800079ce250 Call Trace: [<ffffffff812361f8>] xlog_grant_log_space+0x158/0x3d0 [<ffffffff8124a0b5>] ? kmem_zone_zalloc+0x35/0x50 [<ffffffff81034bf0>] ? default_wake_function+0x0/0x10 [<ffffffff8124410c>] ? xfs_trans_ail_push+0x1c/0x80 [<ffffffff81236552>] xfs_log_reserve+0xe2/0xf0 [<ffffffff81243307>] xfs_trans_reserve+0x97/0x200 [<ffffffff81247920>] xfs_create+0x170/0x5d0 [<ffffffff810ca6e2>] ? __d_lookup+0xa2/0x140 [<ffffffff810d0724>] ? mntput_no_expire+0x24/0xe0 [<ffffffff81253202>] xfs_vn_mknod+0xa2/0x1b0 [<ffffffff8125332b>] xfs_vn_create+0xb/0x10 [<ffffffff810c1471>] vfs_create+0x81/0xd0 [<ffffffff810c2915>] do_last+0x515/0x670 [<ffffffff810c48cd>] do_filp_open+0x21d/0x650 [<ffffffff810c6871>] ? filldir+0x71/0xd0 [<ffffffff8103f012>] ? current_fs_time+0x22/0x30 [<ffffffff810ce96b>] ? alloc_fd+0x4b/0x130 [<ffffffff810b5d34>] do_sys_open+0x64/0x140 [<ffffffff810b5bbd>] ? filp_close+0x4d/0x80 [<ffffffff810b5e3b>] sys_open+0x1b/0x20 [<ffffffff810022eb>] system_call_fastpath+0x16/0x1b dbench D 00000000fffff045 0 2976 2974 0x00000000 ffff880060c11b18 0000000000000086 ffff880000000000 ffff880060c11fd8 ffff880007668630 00000000000122c0 0000000000004000 ffff880060c11fd8 0000000000004000 00000000000122c0 ffffffff81793020 ffff880007668630 Call Trace: [<ffffffff81236320>] xlog_grant_log_space+0x280/0x3d0 [<ffffffff81034bf0>] ? default_wake_function+0x0/0x10 [<ffffffff8124410c>] ? xfs_trans_ail_push+0x1c/0x80 [<ffffffff81236552>] xfs_log_reserve+0xe2/0xf0 [<ffffffff81243307>] xfs_trans_reserve+0x97/0x200 [<ffffffff812412c8>] xfs_rename+0x138/0x630 [<ffffffff810c036e>] ? 
exec_permission+0x3e/0x70 [<ffffffff81253111>] xfs_vn_rename+0x61/0x70 [<ffffffff810c1b4e>] vfs_rename+0x41e/0x480 [<ffffffff810c3bd6>] sys_renameat+0x236/0x270 [<ffffffff8122551d>] ? xfs_dir2_sf_getdents+0x21d/0x390 [<ffffffff810c6800>] ? filldir+0x0/0xd0 [<ffffffff8103f012>] ? current_fs_time+0x22/0x30 [<ffffffff810b8a4a>] ? fput+0x1aa/0x220 [<ffffffff810c3c26>] sys_rename+0x16/0x20 [<ffffffff810022eb>] system_call_fastpath+0x16/0x1b dbench D 00000000ffffeed4 0 2977 2974 0x00000000 ffff88000873fa88 0000000000000082 ffff88000873fac8 ffff88000873ffd8 ffff880007669710 00000000000122c0 0000000000004000 ffff88000873ffd8 0000000000004000 00000000000122c0 ffff88007f2e1790 ffff880007669710 Call Trace: [<ffffffff8155e58d>] schedule_timeout+0x1ad/0x210 [<ffffffff8123dbf6>] ? xfs_icsb_disable_counter+0x16/0xa0 [<ffffffff812445bb>] ? _xfs_trans_bjoin+0x4b/0x60 [<ffffffff8107b5c9>] ? find_get_page+0x19/0xa0 [<ffffffff8123dcb6>] ? xfs_icsb_balance_counter_locked+0x36/0xc0 [<ffffffff8155f4e8>] __down+0x68/0xb0 [<ffffffff81055b0b>] down+0x3b/0x50 [<ffffffff8124d59e>] xfs_buf_lock+0x4e/0x70 [<ffffffff8124ebb3>] _xfs_buf_find+0x133/0x220 [<ffffffff8124ecfb>] xfs_buf_get+0x5b/0x160 [<ffffffff8124ee13>] xfs_buf_read+0x13/0xa0 [<ffffffff81244780>] xfs_trans_read_buf+0x1b0/0x320 [<ffffffff8122922f>] xfs_read_agi+0x6f/0xf0 [<ffffffff8122fa86>] xfs_iunlink+0x46/0x160 [<ffffffff81253d21>] ? xfs_mark_inode_dirty_sync+0x21/0x30 [<ffffffff81253dcf>] ? xfs_ichgtime+0x9f/0xc0 [<ffffffff81245677>] xfs_droplink+0x57/0x70 [<ffffffff8124751a>] xfs_remove+0x28a/0x370 [<ffffffff81253443>] xfs_vn_unlink+0x43/0x90 [<ffffffff810c161b>] vfs_unlink+0x8b/0x110 [<ffffffff810c0e20>] ? lookup_hash+0x30/0x40 [<ffffffff810c3db3>] do_unlinkat+0x183/0x1c0 [<ffffffff810bb3f1>] ? 
sys_newstat+0x31/0x50 [<ffffffff810c3e01>] sys_unlink+0x11/0x20 [<ffffffff810022eb>] system_call_fastpath+0x16/0x1b dbench D 00000000ffffefa8 0 2978 2974 0x00000000 ffff880040da7c38 0000000000000082 ffff880000000000 ffff880040da7fd8 ffff880007668090 00000000000122c0 0000000000004000 ffff880040da7fd8 0000000000004000 00000000000122c0 ffff88012ff78b90 ffff880007668090 Call Trace: [<ffffffff812361f8>] xlog_grant_log_space+0x158/0x3d0 [<ffffffff8124a0b5>] ? kmem_zone_zalloc+0x35/0x50 [<ffffffff81034bf0>] ? default_wake_function+0x0/0x10 [<ffffffff8124410c>] ? xfs_trans_ail_push+0x1c/0x80 [<ffffffff81236552>] xfs_log_reserve+0xe2/0xf0 [<ffffffff81243307>] xfs_trans_reserve+0x97/0x200 [<ffffffff812495e9>] xfs_setattr+0x7e9/0xad0 [<ffffffff81253766>] xfs_vn_setattr+0x16/0x20 [<ffffffff810cdb94>] notify_change+0x104/0x2e0 [<ffffffff810db270>] utimes_common+0xd0/0x1a0 [<ffffffff810bb64e>] ? sys_newfstat+0x2e/0x40 [<ffffffff810db416>] do_utimes+0xd6/0xf0 [<ffffffff810db5ae>] sys_utime+0x1e/0x70 [<ffffffff810022eb>] system_call_fastpath+0x16/0x1b dbench D 00000000ffffefd3 0 2979 2974 0x00000000 ffff8800072c7ae8 0000000000000082 ffff8800072c7b78 ffff8800072c7fd8 ffff880007669170 00000000000122c0 0000000000004000 ffff8800072c7fd8 0000000000004000 00000000000122c0 ffff88012ff785f0 ffff880007669170 Call Trace: [<ffffffff812361f8>] xlog_grant_log_space+0x158/0x3d0 [<ffffffff8124a0b5>] ? kmem_zone_zalloc+0x35/0x50 [<ffffffff81034bf0>] ? default_wake_function+0x0/0x10 [<ffffffff8124410c>] ? xfs_trans_ail_push+0x1c/0x80 [<ffffffff81236552>] xfs_log_reserve+0xe2/0xf0 [<ffffffff81243307>] xfs_trans_reserve+0x97/0x200 [<ffffffff8122b637>] ? xfs_iunlock+0x57/0xb0 [<ffffffff81247920>] xfs_create+0x170/0x5d0 [<ffffffff810ca6e2>] ? __d_lookup+0xa2/0x140 [<ffffffff81253202>] xfs_vn_mknod+0xa2/0x1b0 [<ffffffff8125332b>] xfs_vn_create+0xb/0x10 [<ffffffff810c1471>] vfs_create+0x81/0xd0 [<ffffffff810c2915>] do_last+0x515/0x670 [<ffffffff810c48cd>] do_filp_open+0x21d/0x650 [<ffffffff810c6871>] ? 
filldir+0x71/0xd0 [<ffffffff8103f012>] ? current_fs_time+0x22/0x30 [<ffffffff810ce96b>] ? alloc_fd+0x4b/0x130 [<ffffffff810b5d34>] do_sys_open+0x64/0x140 [<ffffffff810b5bbd>] ? filp_close+0x4d/0x80 [<ffffffff810b5e3b>] sys_open+0x1b/0x20 [<ffffffff810022eb>] system_call_fastpath+0x16/0x1b dbench D 00000000ffffeed0 0 2980 2974 0x00000000 ffff88003dd91688 0000000000000082 0000000000000000 ffff88003dd91fd8 ffff880007abc290 00000000000122c0 0000000000004000 ffff88003dd91fd8 0000000000004000 00000000000122c0 ffff88007f2d3750 ffff880007abc290 Call Trace: [<ffffffff8155e58d>] schedule_timeout+0x1ad/0x210 [<ffffffff8155f4e8>] __down+0x68/0xb0 [<ffffffff81055b0b>] down+0x3b/0x50 [<ffffffff8124d59e>] xfs_buf_lock+0x4e/0x70 [<ffffffff8124ebb3>] _xfs_buf_find+0x133/0x220 [<ffffffff8124ecfb>] xfs_buf_get+0x5b/0x160 [<ffffffff8124ee13>] xfs_buf_read+0x13/0xa0 [<ffffffff81244780>] xfs_trans_read_buf+0x1b0/0x320 [<ffffffff8122922f>] xfs_read_agi+0x6f/0xf0 [<ffffffff812292d9>] xfs_ialloc_read_agi+0x29/0x90 [<ffffffff8122957b>] xfs_ialloc_ag_select+0x12b/0x260 [<ffffffff8122abc7>] xfs_dialloc+0x3d7/0x860 [<ffffffff8124acc8>] ? __xfs_get_blocks+0x1c8/0x210 [<ffffffff8107b5c9>] ? find_get_page+0x19/0xa0 [<ffffffff810ddb9e>] ? unmap_underlying_metadata+0xe/0x50 [<ffffffff8122ef4d>] xfs_ialloc+0x5d/0x690 [<ffffffff8124a031>] ? kmem_zone_alloc+0x91/0xe0 [<ffffffff8124570d>] xfs_dir_ialloc+0x7d/0x320 [<ffffffff81236552>] ? xfs_log_reserve+0xe2/0xf0 [<ffffffff81247b83>] xfs_create+0x3d3/0x5d0 [<ffffffff81253202>] xfs_vn_mknod+0xa2/0x1b0 [<ffffffff8125332b>] xfs_vn_create+0xb/0x10 [<ffffffff810c1471>] vfs_create+0x81/0xd0 [<ffffffff810c2915>] do_last+0x515/0x670 [<ffffffff810c48cd>] do_filp_open+0x21d/0x650 [<ffffffff810c6871>] ? filldir+0x71/0xd0 [<ffffffff8103f012>] ? current_fs_time+0x22/0x30 [<ffffffff810ce96b>] ? alloc_fd+0x4b/0x130 [<ffffffff810b5d34>] do_sys_open+0x64/0x140 [<ffffffff810b5bbd>] ? 
filp_close+0x4d/0x80 [<ffffffff810b5e3b>] sys_open+0x1b/0x20 [<ffffffff810022eb>] system_call_fastpath+0x16/0x1b dbench D 00000000ffffeed0 0 2981 2974 0x00000000 ffff88005b79f618 0000000000000086 ffff88005b79f598 ffff88005b79ffd8 ffff880007abcdd0 00000000000122c0 0000000000004000 ffff88005b79ffd8 0000000000004000 00000000000122c0 ffff88007f2b7710 ffff880007abcdd0 Call Trace: [<ffffffff8155e58d>] schedule_timeout+0x1ad/0x210 [<ffffffff8122cd44>] ? xfs_iext_bno_to_ext+0x84/0x160 [<ffffffff8155f4e8>] __down+0x68/0xb0 [<ffffffff81055b0b>] down+0x3b/0x50 [<ffffffff8124d59e>] xfs_buf_lock+0x4e/0x70 [<ffffffff8124ebb3>] _xfs_buf_find+0x133/0x220 [<ffffffff8124ecfb>] xfs_buf_get+0x5b/0x160 [<ffffffff81244a40>] xfs_trans_get_buf+0xc0/0xe0 [<ffffffff8121ac3f>] xfs_da_do_buf+0x3df/0x6d0 [<ffffffff8121b0c5>] xfs_da_get_buf+0x25/0x30 [<ffffffff81220926>] ? xfs_dir2_data_init+0x46/0xe0 [<ffffffff81220926>] xfs_dir2_data_init+0x46/0xe0 [<ffffffff8121e829>] xfs_dir2_sf_to_block+0xb9/0x5a0 [<ffffffff8105106a>] ? wake_up_bit+0x2a/0x40 [<ffffffff81226a78>] xfs_dir2_sf_addname+0x418/0x5c0 [<ffffffff8122f3fb>] ? xfs_ialloc+0x50b/0x690 [<ffffffff8121e61c>] xfs_dir_createname+0x14c/0x1a0 [<ffffffff81247bf9>] xfs_create+0x449/0x5d0 [<ffffffff81253202>] xfs_vn_mknod+0xa2/0x1b0 [<ffffffff8125332b>] xfs_vn_create+0xb/0x10 [<ffffffff810c1471>] vfs_create+0x81/0xd0 [<ffffffff810c2915>] do_last+0x515/0x670 [<ffffffff810c48cd>] do_filp_open+0x21d/0x650 [<ffffffff810c6871>] ? filldir+0x71/0xd0 [<ffffffff8103f012>] ? current_fs_time+0x22/0x30 [<ffffffff810ce96b>] ? alloc_fd+0x4b/0x130 [<ffffffff810b5d34>] do_sys_open+0x64/0x140 [<ffffffff810b5bbd>] ? 
filp_close+0x4d/0x80 [<ffffffff810b5e3b>] sys_open+0x1b/0x20 [<ffffffff810022eb>] system_call_fastpath+0x16/0x1b dbench D 00000000ffffefbf 0 2982 2974 0x00000000 ffff88005b7f9c38 0000000000000082 ffff880000000000 ffff88005b7f9fd8 ffff880007698150 00000000000122c0 0000000000004000 ffff88005b7f9fd8 0000000000004000 00000000000122c0 ffff88012ff79130 ffff880007698150 Call Trace: [<ffffffff812361f8>] xlog_grant_log_space+0x158/0x3d0 [<ffffffff8124a0b5>] ? kmem_zone_zalloc+0x35/0x50 [<ffffffff81034bf0>] ? default_wake_function+0x0/0x10 [<ffffffff8124410c>] ? xfs_trans_ail_push+0x1c/0x80 [<ffffffff81236552>] xfs_log_reserve+0xe2/0xf0 [<ffffffff81243307>] xfs_trans_reserve+0x97/0x200 [<ffffffff812495e9>] xfs_setattr+0x7e9/0xad0 [<ffffffff81253766>] xfs_vn_setattr+0x16/0x20 [<ffffffff810cdb94>] notify_change+0x104/0x2e0 [<ffffffff810db270>] utimes_common+0xd0/0x1a0 [<ffffffff810bb64e>] ? sys_newfstat+0x2e/0x40 [<ffffffff810db416>] do_utimes+0xd6/0xf0 [<ffffffff810db5ae>] sys_utime+0x1e/0x70 [<ffffffff810022eb>] system_call_fastpath+0x16/0x1b _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: XFS hang in xlog_grant_log_space (was Re: VFS scalability git tree) 2010-07-27 11:36 ` XFS hang in xlog_grant_log_space (was Re: VFS scalability git tree) Nick Piggin @ 2010-07-27 13:30 ` Dave Chinner 2010-07-27 14:58 ` XFS hang in xlog_grant_log_space Dave Chinner 0 siblings, 1 reply; 76+ messages in thread From: Dave Chinner @ 2010-07-27 13:30 UTC (permalink / raw) To: Nick Piggin; +Cc: xfs On Tue, Jul 27, 2010 at 09:36:26PM +1000, Nick Piggin wrote: > On Tue, Jul 27, 2010 at 06:06:32PM +1000, Nick Piggin wrote: > On this same system, same setup (vanilla kernel with sha given below), > I have now twice reproduced a complete hang in XFS. I can give more > information, test patches or options etc if required. > > setup.sh looks like this: > #!/bin/bash > modprobe rd rd_size=$[2*1024*1024] > dd if=/dev/zero of=/dev/ram0 bs=4K > mkfs.xfs -f -l size=64m -d agcount=16 /dev/ram0 > mount -o delaylog,logbsize=262144,nobarrier /dev/ram0 mnt > > The 'dd' is required to ensure rd driver does not allocate pages > during IO (which can lead to out of memory deadlocks). Running just > involves changing into mnt directory and > > while true > do > sync > echo 3 > /proc/sys/vm/drop_caches > ../dbench -c ../loadfiles/client.txt -t20 8 > rm -rf clients > done > > And wait for it to hang (happend in < 5 minutes here) .... > Call Trace: > [<ffffffff812361f8>] xlog_grant_log_space+0x158/0x3d0 It's waiting on log space to be freed up. Either there's an accounting problem (possible), or you've got an xfslogd/xfsaild spinning and not making progress completing log IOs or pushing the tail of the log. I'll see if I can reproduce it. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: XFS hang in xlog_grant_log_space 2010-07-27 13:30 ` Dave Chinner @ 2010-07-27 14:58 ` Dave Chinner 2010-07-28 13:17 ` Dave Chinner 0 siblings, 1 reply; 76+ messages in thread From: Dave Chinner @ 2010-07-27 14:58 UTC (permalink / raw) To: Nick Piggin; +Cc: xfs On Tue, Jul 27, 2010 at 11:30:38PM +1000, Dave Chinner wrote: > On Tue, Jul 27, 2010 at 09:36:26PM +1000, Nick Piggin wrote: > > On Tue, Jul 27, 2010 at 06:06:32PM +1000, Nick Piggin wrote: > > On this same system, same setup (vanilla kernel with sha given below), > > I have now twice reproduced a complete hang in XFS. I can give more > > information, test patches or options etc if required. > > > > setup.sh looks like this: > > #!/bin/bash > > modprobe rd rd_size=$[2*1024*1024] > > dd if=/dev/zero of=/dev/ram0 bs=4K > > mkfs.xfs -f -l size=64m -d agcount=16 /dev/ram0 > > mount -o delaylog,logbsize=262144,nobarrier /dev/ram0 mnt > > > > The 'dd' is required to ensure rd driver does not allocate pages > > during IO (which can lead to out of memory deadlocks). Running just > > involves changing into mnt directory and > > > > while true > > do > > sync > > echo 3 > /proc/sys/vm/drop_caches > > ../dbench -c ../loadfiles/client.txt -t20 8 > > rm -rf clients > > done > > > > And wait for it to hang (happend in < 5 minutes here) > .... > > Call Trace: > > [<ffffffff812361f8>] xlog_grant_log_space+0x158/0x3d0 > > It's waiting on log space to be freed up. Either there's an > accounting problem (possible), or you've got an xfslogd/xfsaild > spinning and not making progress competing log IOs or pushing the > tail of the log. I'll see if I can reproduce it. Ok, I've just reproduced it. 
From some tracing:

touch-3340 [004] 1844935.582716: xfs_log_reserve: dev 1:0 type CREATE t_ocnt 2 t_cnt 2 t_curr_res 167148 t_unit_res 167148 t_flags XLOG_TIC_INITED|XLOG_TIC_PERM_RESERV reserve_headq 0xffff88010f489c78 write_headq 0x(null) grant_reserve_cycle 314 grant_reserve_bytes 24250680 grant_write_cycle 314 grant_write_bytes 24250680 curr_cycle 314 curr_block 44137 tail_cycle 313 tail_block 48532

The key part here is this:

curr_cycle 314 curr_block 44137 tail_cycle 313 tail_block 48532

This says the tail of the log is roughly 62MB behind the head. i.e. the log is full and we are waiting for tail pushing to write the item holding the tail in place to disk so it can then be moved forward. That's better than an accounting problem, at least.

So what is holding the tail in place? The first item on the AIL appears to be:

xfsaild/ram0-2997 [000] 1844800.800764: xfs_buf_cond_lock: dev 1:0 bno 0x280120 len 0x2000 hold 3 pincount 0 lock 0 flags ASYNC|DONE|STALE|PAGE_CACHE caller xfs_buf_item_trylock

A stale buffer. Given that the next objects show this trace:

xfsaild/ram0-2997 [000] 1844800.800767: xfs_ilock_nowait: dev 1:0 ino 0x500241 flags ILOCK_SHARED caller xfs_inode_item_trylock
xfsaild/ram0-2997 [000] 1844800.800768: xfs_buf_rele: dev 1:0 bno 0x280120 len 0x2000 hold 4 pincount 0 lock 0 flags ASYNC|DONE|STALE|PAGE_CACHE caller _xfs_buf_find
xfsaild/ram0-2997 [000] 1844800.800769: xfs_iunlock: dev 1:0 ino 0x500241 flags ILOCK_SHARED caller xfs_inode_item_pushbuf

we see the next item on the AIL is an inode, but the trace is followed by a release on the original buffer, which tells me the inode is flush locked and it returned XFS_ITEM_PUSHBUF to push the inode buffer out. That results in xfs_inode_item_pushbuf() being called, and that tries to lock the inode buffer to flush it. xfs_buf_rele is called if the trylock on the buffer fails.

IOWs, this looks to be another problem with inode cluster freeing.

Ok, so we can't flush the buffer because it is locked.
Why is it locked? Well, that is unclear as yet. None of the blocked processes should be holding an inode buffer locked, and a stale buffer should be unlocked during transaction commit and not live longer than the log IO that writes the transaction to disk. That is, it should not get locked again before everything is freed up. That's as much as I can get from post-mortem analysis - I need to capture a trace that spans the lockup to catch what happens to the buffer that we are hung on. That will have to wait until the morning.... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: XFS hang in xlog_grant_log_space 2010-07-27 14:58 ` XFS hang in xlog_grant_log_space Dave Chinner @ 2010-07-28 13:17 ` Dave Chinner 2010-07-29 14:05 ` Nick Piggin 0 siblings, 1 reply; 76+ messages in thread From: Dave Chinner @ 2010-07-28 13:17 UTC (permalink / raw) To: Nick Piggin; +Cc: xfs On Wed, Jul 28, 2010 at 12:58:09AM +1000, Dave Chinner wrote: > On Tue, Jul 27, 2010 at 11:30:38PM +1000, Dave Chinner wrote: > > On Tue, Jul 27, 2010 at 09:36:26PM +1000, Nick Piggin wrote: > > > On Tue, Jul 27, 2010 at 06:06:32PM +1000, Nick Piggin wrote: > > > On this same system, same setup (vanilla kernel with sha given below), > > > I have now twice reproduced a complete hang in XFS. I can give more > > > information, test patches or options etc if required. > > > > > > setup.sh looks like this: > > > #!/bin/bash > > > modprobe rd rd_size=$[2*1024*1024] > > > dd if=/dev/zero of=/dev/ram0 bs=4K > > > mkfs.xfs -f -l size=64m -d agcount=16 /dev/ram0 > > > mount -o delaylog,logbsize=262144,nobarrier /dev/ram0 mnt > > > > > > The 'dd' is required to ensure rd driver does not allocate pages > > > during IO (which can lead to out of memory deadlocks). Running just > > > involves changing into mnt directory and > > > > > > while true > > > do > > > sync > > > echo 3 > /proc/sys/vm/drop_caches > > > ../dbench -c ../loadfiles/client.txt -t20 8 > > > rm -rf clients > > > done > > > > > > And wait for it to hang (happend in < 5 minutes here) > > .... > > > Call Trace: > > > [<ffffffff812361f8>] xlog_grant_log_space+0x158/0x3d0 > > > > It's waiting on log space to be freed up. Either there's an > > accounting problem (possible), or you've got an xfslogd/xfsaild > > spinning and not making progress competing log IOs or pushing the > > tail of the log. I'll see if I can reproduce it. > > Ok, I've just reproduced it. 
From some tracing: > > touch-3340 [004] 1844935.582716: xfs_log_reserve: dev 1:0 type CREATE t_ocnt 2 t_cnt 2 t_curr_res 167148 t_unit_res 167148 t_flags XLOG_TIC_INITED|XLOG_TIC_PERM_RESERV reserve_headq 0xffff88010f489c78 write_headq 0x(null) grant_reserve_cycle 314 grant_reserve_bytes 24250680 grant_write_cycle 314 grant_write_bytes 24250680 curr_cycle 314 curr_block 44137 tail_cycle 313 tail_block 48532 > > The key part here is this: > > curr_cycle 314 curr_block 44137 tail_cycle 313 tail_block 48532 > > This says the tail of the log is roughly 62MB behind the head. i.e > the log is full and we are waiting for tail pushing to write the > item holding the tail in place to disk so it can them be moved > forward. That's better than an accounting problem, at least. > > So what is holding the tail in place? The first item on the AIL > appears to be: > > xfsaild/ram0-2997 [000] 1844800.800764: xfs_buf_cond_lock: dev 1:0 bno 0x280120 len 0x2000 hold 3 pincount 0 lock 0 flags ASYNC|DONE|STALE|PAGE_CACHE caller xfs_buf_item_trylock > > A stale buffer. Given that the next objects show this trace: > > xfsaild/ram0-2997 [000] 1844800.800767: xfs_ilock_nowait: dev 1:0 ino 0x500241 flags ILOCK_SHARED caller xfs_inode_item_trylock > xfsaild/ram0-2997 [000] 1844800.800768: xfs_buf_rele: dev 1:0 bno 0x280120 len 0x2000 hold 4 pincount 0 lock 0 flags ASYNC|DONE|STALE|PAGE_CACHE caller _xfs_buf_find > xfsaild/ram0-2997 [000] 1844800.800769: xfs_iunlock: dev 1:0 ino 0x500241 flags ILOCK_SHARED caller xfs_inode_item_pushbuf > > we see the next item on the AIL is an inode but the trace is > followed by a release on the original buffer, than tells me the > inode is flush locked and it returned XFS_ITEM_PUSHBUF to push the > inode buffer out. That results in xfs_inode_item_pushbuf() being > called, and that tries to lock the inode buffer to flush it. > xfs_buf_rele is called if the trylock on the buffer fails. > > IOWs, this looks to be another problem with inode cluster freeing. 
> > Ok, so we can't flush the buffer because it is locked. Why is it > locked? Well, that is unclear as yet. None of the blocked processes > should be holding an inode buffer locked, and a stale buffer should > be unlocked during transaction commit and not live longer than > the log IO that writes the transaction to disk. That is, it should > not get locked again before everything is freed up. > > That's as much as I can get from post-mortem analysis - I need to > capture a trace that spans the lockup to catch what happens > to the buffer that we are hung on. That will have to wait until the > morning....

Ok, so I got a trace of all the inode and buffer locking and log item operations, and the reason the hang has occurred can be seen here:

dbench-3084 [007] 1877156.395784: xfs_buf_item_unlock_stale: dev 1:0 bno 0x80040 len 0x2000 hold 2 pincount 0 lock 0 flags |ASYNC|DONE|STALE|PAGE_CACHE recur 1 refcount 1 bliflags |STALE|INODE_ALLOC|STALE_INODE lidesc 0x(null) liflags IN_AIL

The key point in this trace is that when we are unlocking a stale buffer during transaction commit - we don't actually unlock it. What we do is:

	/*
	 * If the buf item is marked stale, then don't do anything. We'll
	 * unlock the buffer and free the buf item when the buffer is unpinned
	 * for the last time.
	 */
	if (bip->bli_flags & XFS_BLI_STALE) {
		trace_xfs_buf_item_unlock_stale(bip);
		ASSERT(bip->bli_format.blf_flags & XFS_BLF_CANCEL);
		if (!aborted) {
			atomic_dec(&bip->bli_refcount);
			return;
		}
	}

But from the above trace it can be seen that the buffer pincount is zero. Hence it will never get an unpin callback, and hence never get unlocked.
As a result, this process here is the one that is stuck on the buffer:

dbench D 0000000000000007 0 3084 1 0x00000004
 ffff880104e07608 0000000000000086 ffff880104e075b8 0000000000014000
 ffff880104e07fd8 0000000000014000 ffff880104e07fd8 ffff880104c1d770
 0000000000014000 0000000000014000 ffff880104e07fd8 0000000000014000
Call Trace:
 [<ffffffff817e59cd>] schedule_timeout+0x1ed/0x2c0
 [<ffffffff810de9de>] ? ring_buffer_lock_reserve+0x9e/0x160
 [<ffffffff817e690e>] __down+0x7e/0xc0
 [<ffffffff812f41d5>] ? _xfs_buf_find+0x145/0x290
 [<ffffffff810a05d0>] down+0x40/0x50
 [<ffffffff812f41d5>] ? _xfs_buf_find+0x145/0x290
 [<ffffffff812f314d>] xfs_buf_lock+0x4d/0x110
 [<ffffffff812f41d5>] _xfs_buf_find+0x145/0x290
 [<ffffffff812f4380>] xfs_buf_get+0x60/0x1c0
 [<ffffffff812ea8f0>] xfs_trans_get_buf+0xe0/0x180
 [<ffffffff812ccdab>] xfs_ialloc_inode_init+0xcb/0x1c0
 [<ffffffff812cdaf9>] xfs_ialloc_ag_alloc+0x179/0x4a0
 [<ffffffff812cdeff>] xfs_dialloc+0xdf/0x870
 [<ffffffff8105ec88>] ? pvclock_clocksource_read+0x58/0xd0
 [<ffffffff812d2105>] xfs_ialloc+0x65/0x6b0
 [<ffffffff812eb032>] xfs_dir_ialloc+0x82/0x2d0
 [<ffffffff8128ce83>] ? ftrace_raw_event_xfs_lock_class+0xd3/0xe0
 [<ffffffff812ecad7>] ? xfs_create+0x1a7/0x690
 [<ffffffff812ecd37>] xfs_create+0x407/0x690
 [<ffffffff812f9f97>] xfs_vn_mknod+0xa7/0x1c0
 [<ffffffff812fa0e0>] xfs_vn_create+0x10/0x20
 [<ffffffff8114ca6c>] vfs_create+0xac/0xd0
 [<ffffffff8114d6ec>] do_last+0x51c/0x620
 [<ffffffff8114f6d8>] do_filp_open+0x228/0x640
 [<ffffffff812c1f18>] ? xfs_dir2_block_getdents+0x218/0x220
 [<ffffffff8115a62a>] ? alloc_fd+0x10a/0x150
 [<ffffffff8113f919>] do_sys_open+0x69/0x140
 [<ffffffff8113fa30>] sys_open+0x20/0x30
 [<ffffffff81035032>] system_call_fastpath+0x16/0x1b

It is trying to allocate a new inode chunk on disk, which happens to be the one we just removed and staled. Now to find out why the pin count on the buffer is wrong.

.....

Now it makes no sense - the buf item pin trace is there directly before the above unlock stale trace.
The buf item pinning increments both the buffer pin count and the buf
item refcount, neither of which are reflected in the unlock stale
trace. From the trace analysis I did:

process		83	84	85	86	87
.....
	get trans_read stale format pin unlock committed unpin stale free
	trans_get init format pin item unlock
	get trans_read unlock
	get trans_read item unlock
**** pincount goes from 1 to 0 here without any unpin traces ****
	get trans_read unlock
	get trans_read item unlock
	(repeat 1x)
	get trans_read unlock
	get trans_read stale format pin
**** no committed/unlock/free trace for this transaction ****
**** hold count goes from 2 to 1 without any rele traces ****
**** is a new buffer allocated here without the staled buffer being committed? ****
	trans_get init format pin item unlock
	get trans_read unlock
	get trans_read item unlock
	(repeat 9x)
	get trans_read unlock
	get trans_read stale format pin
**** pin count does not increment! ****
	buf lock from xfs_buf_find => hangs
**** buffer that was found is locked, not pinned ****

Something very strange is happening, and to make matters worse I
cannot reproduce it with a debug kernel (ran for 3 hours without
failing). Hence it smells like a race condition somewhere.

I've reproduced it without delayed logging, so it is not directly
related to that functionality.

I've seen this warning:

Filesystem "ram0": inode 0x704680 background reclaim flush failed with 117

Which indicates we failed to mark an inode stale when freeing an inode
cluster, but I think I've fixed that and the problem still shows up.
It's possible the last version didn't fix it, but....

Now I've got the ag iterator rotor patch in place as well and possibly
a different version of the cluster free fix to what I previously
tested, and it's now been running for almost half an hour. I can't say
yet whether I've fixed the bug or just changed the timing enough to
avoid it. I'll leave this test running overnight and redo individual
patch testing tomorrow.

Cheers,

Dave. 
-- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: XFS hang in xlog_grant_log_space 2010-07-28 13:17 ` Dave Chinner @ 2010-07-29 14:05 ` Nick Piggin 2010-07-29 22:56 ` Dave Chinner 0 siblings, 1 reply; 76+ messages in thread From: Nick Piggin @ 2010-07-29 14:05 UTC (permalink / raw) To: Dave Chinner; +Cc: Nick Piggin, xfs On Wed, Jul 28, 2010 at 11:17:44PM +1000, Dave Chinner wrote: > Something very strange is happening, and to make matters worse I > cannot reproduce it with a debug kernel (ran for 3 hours without > failing). Hence it smells like a race condition somewhere. > > I've reproduced it without delayed logging, so it is not directly > related to that functionality. > > I've seen this warning: > > Filesystem "ram0": inode 0x704680 background reclaim flush failed with 117 > > Which indicates we failed to mark an inode stale when freeing an > inode cluster, but I think I've fixed that and the problem still > shows up. It's posible the last version didn't fix it, but.... I've seen that one a couple of times too. Keeps coming back each time you echo 3 > /proc/sys/vm/drop_caches :) > Now I've got the ag iterator rotor patch in place as well and > possibly a different version of the cluster free fix to what I > previously tested and it's now been running for almost half an hour. > I can't say yet whether I've fixed the bug of just changed the > timing enough to avoid it. I'll leave this test running over night > and redo individual patch testing tomorrow. I reproduced it with fs_stress now too. Any patches I could test for you just let me know. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: XFS hang in xlog_grant_log_space 2010-07-29 14:05 ` Nick Piggin @ 2010-07-29 22:56 ` Dave Chinner 2010-07-30 3:59 ` Nick Piggin 0 siblings, 1 reply; 76+ messages in thread From: Dave Chinner @ 2010-07-29 22:56 UTC (permalink / raw) To: Nick Piggin; +Cc: Nick Piggin, xfs On Fri, Jul 30, 2010 at 12:05:46AM +1000, Nick Piggin wrote: > On Wed, Jul 28, 2010 at 11:17:44PM +1000, Dave Chinner wrote: > > Something very strange is happening, and to make matters worse I > > cannot reproduce it with a debug kernel (ran for 3 hours without > > failing). Hence it smells like a race condition somewhere. > > > > I've reproduced it without delayed logging, so it is not directly > > related to that functionality. > > > > I've seen this warning: > > > > Filesystem "ram0": inode 0x704680 background reclaim flush failed with 117 > > > > Which indicates we failed to mark an inode stale when freeing an > > inode cluster, but I think I've fixed that and the problem still > > shows up. It's posible the last version didn't fix it, but.... > > I've seen that one a couple of times too. Keeps coming back each > time you echo 3 > /proc/sys/vm/drop_caches :) Yup - it's an unflushable inode that is pinning the tail of the log, hence causing the log space hangs. > > Now I've got the ag iterator rotor patch in place as well and > > possibly a different version of the cluster free fix to what I > > previously tested and it's now been running for almost half an hour. > > I can't say yet whether I've fixed the bug of just changed the > > timing enough to avoid it. I'll leave this test running over night > > and redo individual patch testing tomorrow. > > I reproduced it with fs_stress now too. Any patches I could test > for you just let me know. You should see them in a few minutes ;) Cheers, Dave. 
-- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: XFS hang in xlog_grant_log_space 2010-07-29 22:56 ` Dave Chinner @ 2010-07-30 3:59 ` Nick Piggin 0 siblings, 0 replies; 76+ messages in thread From: Nick Piggin @ 2010-07-30 3:59 UTC (permalink / raw) To: Dave Chinner; +Cc: Nick Piggin, Nick Piggin, xfs On Fri, Jul 30, 2010 at 08:56:58AM +1000, Dave Chinner wrote: > On Fri, Jul 30, 2010 at 12:05:46AM +1000, Nick Piggin wrote: > > On Wed, Jul 28, 2010 at 11:17:44PM +1000, Dave Chinner wrote: > > > Something very strange is happening, and to make matters worse I > > > cannot reproduce it with a debug kernel (ran for 3 hours without > > > failing). Hence it smells like a race condition somewhere. > > > > > > I've reproduced it without delayed logging, so it is not directly > > > related to that functionality. > > > > > > I've seen this warning: > > > > > > Filesystem "ram0": inode 0x704680 background reclaim flush failed with 117 > > > > > > Which indicates we failed to mark an inode stale when freeing an > > > inode cluster, but I think I've fixed that and the problem still > > > shows up. It's posible the last version didn't fix it, but.... > > > > I've seen that one a couple of times too. Keeps coming back each > > time you echo 3 > /proc/sys/vm/drop_caches :) > > Yup - it's an unflushable inode that is pinning the tail of the log, > hence causing the log space hangs. > > > > Now I've got the ag iterator rotor patch in place as well and > > > possibly a different version of the cluster free fix to what I > > > previously tested and it's now been running for almost half an hour. > > > I can't say yet whether I've fixed the bug of just changed the > > > timing enough to avoid it. I'll leave this test running over night > > > and redo individual patch testing tomorrow. > > > > I reproduced it with fs_stress now too. Any patches I could test > > for you just let me know. > > You should see them in a few minutes ;) It's certainly not locking up like it used to... Thanks! 
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: VFS scalability git tree 2010-07-27 8:06 ` Nick Piggin @ 2010-07-28 12:57 ` Dave Chinner -1 siblings, 0 replies; 76+ messages in thread From: Dave Chinner @ 2010-07-28 12:57 UTC (permalink / raw) To: Nick Piggin; +Cc: xfs, linux-fsdevel On Tue, Jul 27, 2010 at 06:06:32PM +1000, Nick Piggin wrote: > On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote: > > On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote: > > > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote: > > > > I'm pleased to announce I have a git tree up of my vfs scalability work. > > > > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git > > > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git > > > > > > > > Branch vfs-scale-working > > > > > > With a production build (i.e. no lockdep, no xfs debug), I'll > > > run the same fs_mark parallel create/unlink workload to show > > > scalability as I ran here: > > > > > > http://oss.sgi.com/archives/xfs/2010-05/msg00329.html > > > > I've made a similar setup, 2s8c machine, but using 2GB ramdisk instead > > of a real disk (I don't have easy access to a good disk setup ATM, but > > I guess we're more interested in code above the block layer anyway). > > > > Made an XFS on /dev/ram0 with 16 ags, 64MB log, otherwise same config as > > yours. > > > > I found that performance is a little unstable, so I sync and echo 3 > > > drop_caches between each run. When it starts reclaiming memory, things > > get a bit more erratic (and XFS seemed to be almost livelocking for tens > > of seconds in inode reclaim). > > So about this XFS livelock type thingy. It looks like this, and happens > periodically while running the above fs_mark benchmark requiring reclaim > of inodes: .... > Nothing much happening except 100% system time for seconds at a time > (length of time varies). This is on a ramdisk, so it isn't waiting > for IO. 
> > During this time, lots of things are contending on the lock: > > 60.37% fs_mark [kernel.kallsyms] [k] __write_lock_failed > 4.30% kswapd0 [kernel.kallsyms] [k] __write_lock_failed > 3.70% fs_mark [kernel.kallsyms] [k] try_wait_for_completion > 3.59% fs_mark [kernel.kallsyms] [k] _raw_write_lock > 3.46% kswapd1 [kernel.kallsyms] [k] __write_lock_failed > | > --- __write_lock_failed > | > |--99.92%-- xfs_inode_ag_walk > | xfs_inode_ag_iterator > | xfs_reclaim_inode_shrink > | shrink_slab > | shrink_zone > | balance_pgdat > | kswapd > | kthread > | kernel_thread_helper > --0.08%-- [...] > > 3.02% fs_mark [kernel.kallsyms] [k] _raw_spin_lock > 1.82% fs_mark [kernel.kallsyms] [k] _xfs_buf_find > 1.16% fs_mark [kernel.kallsyms] [k] memcpy > 0.86% fs_mark [kernel.kallsyms] [k] _raw_spin_lock_irqsave > 0.75% fs_mark [kernel.kallsyms] [k] xfs_log_commit_cil > | > --- xfs_log_commit_cil > _xfs_trans_commit > | > |--60.00%-- xfs_remove > | xfs_vn_unlink > | vfs_unlink > | do_unlinkat > | sys_unlink > > I'm not sure if there was a long-running read locker in there causing > all the write lockers to fail, or if they are just running into one > another. The longest hold is in the inode cluster writeback (xfs_iflush_cluster), but if there is no IO then I don't see how that would be a problem. I suspect that it might be caused by having several CPUs all trying to run the shrinker at the same time and them all starting at the same AG and therefore lockstepping and getting nothing done because they are all scanning the same inodes. Maybe a start AG rotor for xfs_inode_ag_iterator() is needed to avoid this lockstepping. I've attached a patch below to do this - can you give it a try? > But anyway, I hacked the following patch which seemed to > improve that behaviour. I haven't run any throughput numbers on it yet, > but I could if you're interested (and it's not completely broken!) 
Batching is certainly something that I have been considering, but
apart from the excessive scanning bug, the per-ag inode tree lookups
have not featured prominently in any profiling I've done, so it hasn't
been a high priority.

Your patch looks like it will work fine, but I think it can be made a
lot cleaner. I'll have a closer look at this once I get to the bottom
of the dbench hang you are seeing....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

xfs: add an AG iterator start rotor

From: Dave Chinner <dchinner@redhat.com>

To avoid multiple CPUs from executing inode cache shrinkers on the
same AG all at the same time, make every shrinker call start on a
different AG. This will mostly prevent concurrent shrinker calls from
competing for the serialising pag_ici_lock and lock-stepping reclaim.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_sync.c |   11 ++++++++++-
 fs/xfs/xfs_mount.h          |    1 +
 2 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index dfcbd98..5322105 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -181,11 +181,14 @@ xfs_inode_ag_iterator(
 	struct xfs_perag	*pag;
 	int			error = 0;
 	int			last_error = 0;
+	xfs_agnumber_t		start_ag;
 	xfs_agnumber_t		ag;
 	int			nr;
+	int			looped = 0;
 
 	nr = nr_to_scan ? *nr_to_scan : INT_MAX;
-	ag = 0;
+	start_ag = atomic_inc_return(&mp->m_agiter_rotor) % mp->m_sb.sb_agcount;
+	ag = start_ag;
 	while ((pag = xfs_inode_ag_iter_next_pag(mp, &ag, tag))) {
 		error = xfs_inode_ag_walk(mp, pag, execute, flags, tag,
 						exclusive, &nr);
@@ -197,6 +200,12 @@ xfs_inode_ag_iterator(
 		}
 		if (nr <= 0)
 			break;
+		if (ag >= mp->m_sb.sb_agcount) {
+			looped = 1;
+			ag = 0;
+		}
+		if (ag >= start_ag && looped)
+			break;
 	}
 	if (nr_to_scan)
 		*nr_to_scan = nr;
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 622da21..fae61bb 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -199,6 +199,7 @@ typedef struct xfs_mount {
 	__int64_t		m_update_flags;	/* sb flags we need to update
 						   on the next remount,rw */
 	struct shrinker		m_inode_shrink;	/* inode reclaim shrinker */
+	atomic_t		m_agiter_rotor;	/* ag iterator start rotor */
 } xfs_mount_t;
 
 /*

^ permalink raw reply related [flat|nested] 76+ messages in thread
* Re: VFS scalability git tree 2010-07-28 12:57 ` Dave Chinner @ 2010-07-29 14:03 ` Nick Piggin -1 siblings, 0 replies; 76+ messages in thread From: Nick Piggin @ 2010-07-29 14:03 UTC (permalink / raw) To: Dave Chinner; +Cc: Nick Piggin, xfs, linux-fsdevel On Wed, Jul 28, 2010 at 10:57:17PM +1000, Dave Chinner wrote: > On Tue, Jul 27, 2010 at 06:06:32PM +1000, Nick Piggin wrote: > > On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote: > > > On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote: > > > > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote: > > > > > I'm pleased to announce I have a git tree up of my vfs scalability work. > > > > > > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git > > > > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git > > > > > > > > > > Branch vfs-scale-working > > > > > > > > With a production build (i.e. no lockdep, no xfs debug), I'll > > > > run the same fs_mark parallel create/unlink workload to show > > > > scalability as I ran here: > > > > > > > > http://oss.sgi.com/archives/xfs/2010-05/msg00329.html > > > > > > I've made a similar setup, 2s8c machine, but using 2GB ramdisk instead > > > of a real disk (I don't have easy access to a good disk setup ATM, but > > > I guess we're more interested in code above the block layer anyway). > > > > > > Made an XFS on /dev/ram0 with 16 ags, 64MB log, otherwise same config as > > > yours. > > > > > > I found that performance is a little unstable, so I sync and echo 3 > > > > drop_caches between each run. When it starts reclaiming memory, things > > > get a bit more erratic (and XFS seemed to be almost livelocking for tens > > > of seconds in inode reclaim). > > > > So about this XFS livelock type thingy. It looks like this, and happens > > periodically while running the above fs_mark benchmark requiring reclaim > > of inodes: > .... 
> > > Nothing much happening except 100% system time for seconds at a time > > (length of time varies). This is on a ramdisk, so it isn't waiting > > for IO. > > > > During this time, lots of things are contending on the lock: > > > > 60.37% fs_mark [kernel.kallsyms] [k] __write_lock_failed > > 4.30% kswapd0 [kernel.kallsyms] [k] __write_lock_failed > > 3.70% fs_mark [kernel.kallsyms] [k] try_wait_for_completion > > 3.59% fs_mark [kernel.kallsyms] [k] _raw_write_lock > > 3.46% kswapd1 [kernel.kallsyms] [k] __write_lock_failed > > | > > --- __write_lock_failed > > | > > |--99.92%-- xfs_inode_ag_walk > > | xfs_inode_ag_iterator > > | xfs_reclaim_inode_shrink > > | shrink_slab > > | shrink_zone > > | balance_pgdat > > | kswapd > > | kthread > > | kernel_thread_helper > > --0.08%-- [...] > > > > 3.02% fs_mark [kernel.kallsyms] [k] _raw_spin_lock > > 1.82% fs_mark [kernel.kallsyms] [k] _xfs_buf_find > > 1.16% fs_mark [kernel.kallsyms] [k] memcpy > > 0.86% fs_mark [kernel.kallsyms] [k] _raw_spin_lock_irqsave > > 0.75% fs_mark [kernel.kallsyms] [k] xfs_log_commit_cil > > | > > --- xfs_log_commit_cil > > _xfs_trans_commit > > | > > |--60.00%-- xfs_remove > > | xfs_vn_unlink > > | vfs_unlink > > | do_unlinkat > > | sys_unlink > > > > I'm not sure if there was a long-running read locker in there causing > > all the write lockers to fail, or if they are just running into one > > another. > > The longest hold is in the inode cluster writeback > (xfs_iflush_cluster), but if there is no IO then I don't see how > that would be a problem. No I wasn't suggesting there was, just that there could have been one that I didn't notice in profiles (ie. because it had taken read lock rather than spinning on it). > I suspect that it might be caused by having several CPUs > all trying to run the shrinker at the same time and them all > starting at the same AG and therefore lockstepping and getting > nothing done because they are all scanning the same inodes. 
I think that is the most likely answer, yes. > Maybe a start AG rotor for xfs_inode_ag_iterator() is needed to > avoid this lockstepping. I've attached a patch below to do this > - can you give it a try? Cool yes I will. I could try it in combination with the batching patch too. Thanks. > > But anyway, I hacked the following patch which seemed to > > improve that behaviour. I haven't run any throughput numbers on it yet, > > but I could if you're interested (and it's not completely broken!) > > Batching is certainly something that I have been considering, but > apart from the excessive scanning bug, the per-ag inode tree lookups > hve not featured prominently in any profiling I've done, so it > hasn't been a high priority. > > You patch looks like it will work fine, but I think it can be made a > lot cleaner. I'll have a closer look at this once I get to the bottom of > the dbench hang you are seeing.... Well I'll see if I can measure any efficiency or lock contention improvements with it and report back. Might have to wait till the weekend. ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: VFS scalability git tree 2010-07-27 7:05 ` Nick Piggin @ 2010-07-27 11:09 ` Nick Piggin -1 siblings, 0 replies; 76+ messages in thread From: Nick Piggin @ 2010-07-27 11:09 UTC (permalink / raw) To: Nick Piggin Cc: Dave Chinner, linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar, John Stultz

On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote:
> On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote:
> > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > > I'm pleased to announce I have a git tree up of my vfs scalability work.
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> > >
> > > Branch vfs-scale-working
> >
> > With a production build (i.e. no lockdep, no xfs debug), I'll
> > run the same fs_mark parallel create/unlink workload to show
> > scalability as I ran here:
> >
> > http://oss.sgi.com/archives/xfs/2010-05/msg00329.html
>
> I've made a similar setup, 2s8c machine, but using 2GB ramdisk instead
> of a real disk (I don't have easy access to a good disk setup ATM, but
> I guess we're more interested in code above the block layer anyway).
>
> Made an XFS on /dev/ram0 with 16 ags, 64MB log, otherwise same config as
> yours.

I also tried dbench on this setup. 20 runs of dbench -t20 8 (that is a
20 second run, 8 clients). Numbers are throughput, higher is better:

            N        Min        Max     Median        Avg     Stddev
vanilla    20    2219.19    2249.43    2230.43  2230.9915  7.2528893
scale      20    2428.21     2490.8    2437.86   2444.111  16.668256
Difference at 95.0% confidence
	213.119 +/- 8.22695
	9.55268% +/- 0.368757%
	(Student's t, pooled s = 12.8537)

vfs-scale is 9.5% or 210MB/s faster than vanilla.

Like fs_mark, dbench has creat/unlink activity, so I hope rcu-inodes
should not be such a problem in practice. In my creat/unlink benchmark,
it is creating and destroying one inode repeatedly, which is the
absolute worst case for rcu-inodes. 
Whereas most real workloads would be creating and destroying many inodes, which is not such a disadvantage for rcu-inodes. Incidentally, XFS was by far the fastest "real" filesystem I tested on this workload. ext4 was around 1700MB/s (ext2 was around 3100MB/s and ramfs is 3350MB/s).
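The confidence interval quoted above is ministat-style output; as a sanity check, the pooled Student's t computation can be reproduced from just the summary statistics. An illustrative sketch (not the tool's actual code; the t critical value is taken from standard tables for 38 degrees of freedom):

```python
import math

# dbench throughput summary stats quoted above (MB/s, 20 runs each).
n = 20
avg_vanilla, sd_vanilla = 2230.9915, 7.2528893
avg_scale, sd_scale = 2444.1110, 16.668256

# Pooled standard deviation (equal sample sizes).
s_pooled = math.sqrt((sd_vanilla ** 2 + sd_scale ** 2) / 2)

# Two-sided 95% Student's t critical value for 2n - 2 = 38 df (from tables).
t_crit = 2.0244

diff = avg_scale - avg_vanilla
half_width = t_crit * s_pooled * math.sqrt(2 / n)

print(f"Difference at 95.0% confidence: {diff:.3f} +/- {half_width:.4f} "
      f"(pooled s = {s_pooled:.4f})")
```

This reproduces the quoted 213.119 +/- ~8.23 figure, i.e. the ~9.5% improvement is far outside the run-to-run noise of either kernel.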
* Re: VFS scalability git tree 2010-07-27 7:05 ` Nick Piggin @ 2010-07-27 13:18 ` Dave Chinner -1 siblings, 0 replies; 76+ messages in thread From: Dave Chinner @ 2010-07-27 13:18 UTC (permalink / raw) To: Nick Piggin Cc: linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar, John Stultz On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote: > On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote: > > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote: > > > I'm pleased to announce I have a git tree up of my vfs scalability work. > > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git > > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git > > > > > > Branch vfs-scale-working > > > > With a production build (i.e. no lockdep, no xfs debug), I'll > > run the same fs_mark parallel create/unlink workload to show > > scalability as I ran here: > > > > http://oss.sgi.com/archives/xfs/2010-05/msg00329.html > > I've made a similar setup, 2s8c machine, but using 2GB ramdisk instead > of a real disk (I don't have easy access to a good disk setup ATM, but > I guess we're more interested in code above the block layer anyway). > > Made an XFS on /dev/ram0 with 16 ags, 64MB log, otherwise same config as > yours. As a personal preference, I don't like testing filesystem performance on ramdisks because it hides problems caused by changes in IO latency. I'll come back to this later. > I found that performance is a little unstable, so I sync and echo 3 > > drop_caches between each run. Quite possibly because of the smaller log - that will cause more frequent pushing on the log tail and hence I/O patterns will vary a bit... Also, keep in mind that delayed logging is shiny and new - it has increased XFS metadata performance and parallelism by an order of magnitude and so we're really seeing a bunch of brand new issues that have never been seen before with this functionality.
As such, there's still some interactions I haven't got to the bottom of with delayed logging - it's stable enough to use and benchmark and won't corrupt anything but there are still some warts we need to solve. The difficulty (as always) is in reliably reproducing the bad behaviour. > When it starts reclaiming memory, things > get a bit more erratic (and XFS seemed to be almost livelocking for tens > of seconds in inode reclaim). I can't say that I've seen this - even when testing up to 10m inodes. Yes, kswapd is almost permanently active on these runs, but when creating 100,000 inodes/s we also need to be reclaiming 100,000 inodes/s so it's not surprising that when 7 CPUs are doing allocation we need at least one CPU to run reclaim.... > So I started with 50 runs of fs_mark > -n 20000 (which did not cause reclaim), rebuilding a new filesystem > between every run. > > That gave the following files/sec numbers:
>       N       Min       Max    Median        Avg     Stddev
> x    50  100986.4    127622  125013.4  123248.82  5244.1988
> +    50  100967.6  135918.6  130214.9  127926.94  6374.6975
> Difference at 95.0% confidence
>     4678.12 +/- 2316.07
>     3.79567% +/- 1.87919%
>     (Student's t, pooled s = 5836.88)
> > This is 3.8% in favour of vfs-scale-working. > > I then did 10 runs of -n 20000 but with -L 4 (4 iterations) which did > start to fill up memory and cause reclaim during the 2nd and subsequent > iterations. I haven't used this mode, so I can't really comment on the results you are seeing. > > enabled. ext4 is using default mkfs and mount parameters except for > > barrier=0. All numbers are averages of three runs.
> >
> >          fs_mark rate (thousands of files/second)
> >             2.6.35-rc5      2.6.35-rc5-scale
> >  threads    xfs   ext4       xfs   ext4
> >     1        20     39        20     39
> >     2        35     55        35     57
> >     4        60     41        57     42
> >     8        79      9        75      9
> >
> > ext4 is getting IO bound at more than 2 threads, so apart from > > pointing out that XFS is 8-9x faster than ext4 at 8 threads, I'm > > going to ignore ext4 for the purposes of testing scalability here.
> > > > For XFS w/ delayed logging, 2.6.35-rc5 is only getting to about 600% > > CPU and with Nick's patches it's about 650% (10% higher) for > > slightly lower throughput. So at this class of machine for this > > workload, the changes result in a slight reduction in scalability. > > I wonder if these results are stable. It's possible that changes in > reclaim behaviour are causing my patches to require more IO for a > given unit of work? More likely that's the result of using a smaller log size because it will require more frequent metadata pushes to make space for new transactions. > I was seeing XFS 'livelock' in reclaim more with my patches, it > could be due to more parallelism now being allowed from the vfs and > reclaim. > > Based on my above numbers, I don't see that rcu-inodes is causing a > problem, and in terms of SMP scalability, there is really no way that > vanilla is more scalable, so I'm interested to see where this slowdown > is coming from. As I said initially, ram disks hide IO latency changes resulting from increased numbers of IO or increases in seek distances. My initial guess is that the change in inode reclaim behaviour is causing different IO patterns and more seeks under reclaim because the zone based reclaim is no longer reclaiming inodes in the order they are created (i.e. we are not doing sequential inode reclaim any more). FWIW, I use PCP monitoring graphs to correlate behavioural changes across different subsystems because it is far easier to relate information visually than it is by looking at raw numbers or traces. I think this graph shows the effect of reclaim on performance most clearly: http://userweb.kernel.org/~dgc/shrinker-2.6.36/fs_mark-2.6.35-rc3-context-only-per-xfs-batch6-16x500-xfs.png It's pretty clear that when the inode/dentry cache shrinkers are running, sustained create/unlink performance goes right down.
From a different tab not in the screen shot (the other "test-4" tab), I could see CPU usage also goes down and the disk iops go way up whenever the create/unlink performance dropped. This same behaviour happens with the vfs-scale patchset, so it's not related to lock contention - just aggressive reclaim of still-dirty inodes. FYI, the patch under test there was the XFS shrinker ignoring 7 out of 8 shrinker calls and then on the 8th call doing the work of all previous calls, i.e. emulating SHRINK_BATCH = 1024. Interestingly enough, that one change reduced the runtime of the 8m inode create/unlink load by ~25% (from ~24min to ~18min). That is by far the largest improvement I've been able to obtain from modifying the shrinker code, and it is from those sorts of observations that I think that IO being issued from reclaim is currently the most significant performance limiting factor for XFS in this sort of workload.... Cheers, Dave. -- Dave Chinner david@fromorbit.com
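The "ignore 7 out of 8 calls" trick described above is easy to model outside the kernel. A toy sketch (hypothetical names, not the kernel shrinker API) that banks the scan counts from deferred calls and does all the accumulated work on every 8th call, emulating SHRINK_BATCH = 1024 given the stock batch of 128:

```python
class BatchingShrinker:
    """Toy model of deferred shrinker work: bank nr_to_scan from 7
    calls, then scan the whole accumulated count on the 8th call."""
    CALLS_PER_BATCH = 8

    def __init__(self, do_scan):
        self.do_scan = do_scan   # callback that reclaims `nr` objects
        self.banked = 0
        self.calls = 0

    def shrink(self, nr_to_scan):
        self.banked += nr_to_scan
        self.calls += 1
        if self.calls % self.CALLS_PER_BATCH:
            return 0             # deferred: no work this call
        nr, self.banked = self.banked, 0
        return self.do_scan(nr)
```

Eight calls of the default batch of 128 turn into one scan of 1024, so reclaim IO gets issued in larger chunks that the elevator can sort, which is presumably where the ~25% runtime win comes from.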
* Re: VFS scalability git tree 2010-07-27 13:18 ` Dave Chinner @ 2010-07-27 15:09 ` Nick Piggin -1 siblings, 0 replies; 76+ messages in thread From: Nick Piggin @ 2010-07-27 15:09 UTC (permalink / raw) To: Dave Chinner Cc: Nick Piggin, linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar, John Stultz On Tue, Jul 27, 2010 at 11:18:10PM +1000, Dave Chinner wrote: > On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote: > > On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote: > > > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote: > > > > I'm pleased to announce I have a git tree up of my vfs scalability work. > > > > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git > > > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git > > > > > > > > Branch vfs-scale-working > > > > > > With a production build (i.e. no lockdep, no xfs debug), I'll > > > run the same fs_mark parallel create/unlink workload to show > > > scalability as I ran here: > > > > > > http://oss.sgi.com/archives/xfs/2010-05/msg00329.html > > > > I've made a similar setup, 2s8c machine, but using 2GB ramdisk instead > > of a real disk (I don't have easy access to a good disk setup ATM, but > > I guess we're more interested in code above the block layer anyway). > > > > Made an XFS on /dev/ram0 with 16 ags, 64MB log, otherwise same config as > > yours. > > As a personal preference, I don't like testing filesystem performance > on ramdisks because it hides problems caused by changes in IO > latency. I'll come back to this later. Very true, although it's useful if you don't have fast disks, and it can be good for triggering different races than disks tend to. So I still want to get to the bottom of the slowdown you saw on vfs-scale. > > I found that performance is a little unstable, so I sync and echo 3 > > > drop_caches between each run.
> > Quite possibly because of the smaller log - that will cause more > frequent pushing on the log tail and hence I/O patterns will vary a > bit... Well... I think the test case (or how I'm running it) is simply a bit unstable. I mean, there are subtle interactions all the way from the CPU scheduler to the disk, so when I say unstable I'm not particularly blaming XFS :) > Also, keep in mind that delayed logging is shiny and new - it has > increased XFS metadata performance and parallelism by an order of > magnitude and so we're really seeing a bunch of brand new issues > that have never been seen before with this functionality. As such, > there's still some interactions I haven't got to the bottom of with > delayed logging - it's stable enough to use and benchmark and won't > corrupt anything but there are still some warts we need to > solve. The difficulty (as always) is in reliably reproducing the bad > behaviour. Sure, and I didn't see any corruptions, it seems pretty stable and scalability is better than other filesystems. I'll see if I can give a better recipe to reproduce the 'livelock'ish behaviour. > > I then did 10 runs of -n 20000 but with -L 4 (4 iterations) which did > > start to fill up memory and cause reclaim during the 2nd and subsequent > > iterations. > > I haven't used this mode, so I can't really comment on the results > you are seeing. It's a bit strange. Help says it should clear inodes between iterations (without the -k flag), but it does not seem to. > > > enabled. ext4 is using default mkfs and mount parameters except for > > > barrier=0. All numbers are averages of three runs.
> > >
> > >          fs_mark rate (thousands of files/second)
> > >             2.6.35-rc5      2.6.35-rc5-scale
> > >  threads    xfs   ext4       xfs   ext4
> > >     1        20     39        20     39
> > >     2        35     55        35     57
> > >     4        60     41        57     42
> > >     8        79      9        75      9
> > >
> > > ext4 is getting IO bound at more than 2 threads, so apart from > > > pointing out that XFS is 8-9x faster than ext4 at 8 threads, I'm > > > going to ignore ext4 for the purposes of testing scalability here. > > > > > > For XFS w/ delayed logging, 2.6.35-rc5 is only getting to about 600% > > > CPU and with Nick's patches it's about 650% (10% higher) for > > > slightly lower throughput. So at this class of machine for this > > > workload, the changes result in a slight reduction in scalability. > > > > I wonder if these results are stable. It's possible that changes in > > reclaim behaviour are causing my patches to require more IO for a > > given unit of work? > > More likely that's the result of using a smaller log size because it > will require more frequent metadata pushes to make space for new > transactions. I was just checking whether your numbers are stable (where you saw some slowdown with vfs-scale patches), and what could be the cause. I agree that running real disks could make big changes in behaviour. > > I was seeing XFS 'livelock' in reclaim more with my patches, it > > could be due to more parallelism now being allowed from the vfs and > > reclaim. > > > > Based on my above numbers, I don't see that rcu-inodes is causing a > > problem, and in terms of SMP scalability, there is really no way that > > vanilla is more scalable, so I'm interested to see where this slowdown > > is coming from. > > As I said initially, ram disks hide IO latency changes resulting > from increased numbers of IO or increases in seek distances. My > initial guess is the change in inode reclaim behaviour causing > different IO patterns and more seeks under reclaim because the zone > based reclaim is no longer reclaiming inodes in the order > they are created (i.e.
we are not doing sequential inode reclaim any > more). Sounds plausible. I'll do more investigations along those lines. > FWIW, I use PCP monitoring graphs to correlate behavioural changes > across different subsystems because it is far easier to relate > information visually than it is by looking at raw numbers or traces. > I think this graph shows the effect of reclaim on performance > most clearly: > > http://userweb.kernel.org/~dgc/shrinker-2.6.36/fs_mark-2.6.35-rc3-context-only-per-xfs-batch6-16x500-xfs.png I haven't actually used that, it looks interesting. > It's pretty clear that when the inode/dentry cache shrinkers are > running, sustained create/unlink performance goes right down. From a > different tab not in the screen shot (the other "test-4" tab), I > could see CPU usage also goes down and the disk iops go way up > whenever the create/unlink performance dropped. This same behaviour > happens with the vfs-scale patchset, so it's not related to lock > contention - just aggressive reclaim of still-dirty inodes. > > FYI, the patch under test there was the XFS shrinker ignoring 7 out > of 8 shrinker calls and then on the 8th call doing the work of all > previous calls, i.e. emulating SHRINK_BATCH = 1024. Interestingly > enough, that one change reduced the runtime of the 8m inode > create/unlink load by ~25% (from ~24min to ~18min). Hmm, interesting. Well that's naturally configurable with the shrinker API changes I'm hoping to have merged. I'll plan to push that ahead of the vfs-scale patches of course. > That is by far the largest improvement I've been able to obtain from > modifying the shrinker code, and it is from those sorts of > observations that I think that IO being issued from reclaim is > currently the most significant performance limiting factor for XFS > in this sort of workload.... How is the xfs inode reclaim tied to linux inode reclaim? Does the xfs inode not become reclaimable until some time after the linux inode is reclaimed? Or what?
Do all or most of the xfs inodes require IO before being reclaimed during this test? I wonder if you could throttle them a bit or sort them somehow so that they tend to be cleaned by writeout and reclaim just comes after and removes the clean ones, like pagecache reclaim is (supposed) to work?
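The pagecache-style ordering being suggested here can be sketched as a toy two-stage pass (a hypothetical model, not XFS code): writeback cleans dirty inodes first, in ascending inode-number (roughly disk) order, and reclaim then frees only already-clean inodes without ever issuing IO itself:

```python
def writeback_then_reclaim(inodes, write_out):
    """Toy model: stage 1 cleans dirty inodes in ascending inode-number
    order (so the IO is sequential-ish); stage 2 reclaims only clean
    inodes. `inodes` maps inode number -> {"dirty": bool}."""
    # Stage 1: writeback - the only place IO happens.
    for ino in sorted(n for n, st in inodes.items() if st["dirty"]):
        write_out(ino)
        inodes[ino]["dirty"] = False
    # Stage 2: reclaim clean inodes; no IO is issued here.
    reclaimed = [n for n, st in inodes.items() if not st["dirty"]]
    for n in reclaimed:
        del inodes[n]
    return reclaimed
```

The point of the split is that the seek-heavy work is done by a writer that can order its IO, while reclaim stays cheap, which is the property pagecache reclaim is (supposed) to have.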
* Re: VFS scalability git tree @ 2010-07-27 15:09 ` Nick Piggin 0 siblings, 0 replies; 76+ messages in thread From: Nick Piggin @ 2010-07-27 15:09 UTC (permalink / raw) To: Dave Chinner Cc: Nick Piggin, linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar, John Stultz On Tue, Jul 27, 2010 at 11:18:10PM +1000, Dave Chinner wrote: > On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote: > > On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote: > > > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote: > > > > I'm pleased to announce I have a git tree up of my vfs scalability work. > > > > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git > > > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git > > > > > > > > Branch vfs-scale-working > > > > > > With a production build (i.e. no lockdep, no xfs debug), I'll > > > run the same fs_mark parallel create/unlink workload to show > > > scalability as I ran here: > > > > > > http://oss.sgi.com/archives/xfs/2010-05/msg00329.html > > > > I've made a similar setup, 2s8c machine, but using 2GB ramdisk instead > > of a real disk (I don't have easy access to a good disk setup ATM, but > > I guess we're more interested in code above the block layer anyway). > > > > Made an XFS on /dev/ram0 with 16 ags, 64MB log, otherwise same config as > > yours. > > A s a personal prefernce, I don't like testing filesystem performance > on ramdisks because it hides problems caused by changes in IO > latency. I'll come back to this later. Very true, although it's good if you don't have some fast disks, and it can be good to trigger different races than disks tend to. So I still want to get to the bottom of the slowdown you saw on vfs-scale. > > I found that performance is a little unstable, so I sync and echo 3 > > > drop_caches between each run. 
> > Quite possibly because of the smaller log - that will cause more > frequent pushing on the log tail and hence I/O patterns will vary a > bit... Well... I think the test case (or how I'm running it) is simply a bit unstable. I mean, there are subtle interactions all the way from the CPU scheduler to the disk, so when I say unstable I'm not particularly blaming XFS :) > Also, keep in mind that delayed logging is shiny and new - it has > increased XFS metadata performance and parallelism by an order of > magnitude and so we're really seeing new a bunch of brand new issues > that have never been seen before with this functionality. As such, > there's still some interactions I haven't got to the bottom of with > delayed logging - it's stable enough to use and benchmark and won't > corrupt anything but there are still has some warts we need to > solve. The difficulty (as always) is in reliably reproducing the bad > behaviour. Sure, and I didn't see any corruptions, it seems pretty stable and scalability is better than other filesystems. I'll see if I can give a better recipe to reproduce the 'livelock'ish behaviour. > > I then did 10 runs of -n 20000 but with -L 4 (4 iterations) which did > > start to fill up memory and cause reclaim during the 2nd and subsequent > > iterations. > > I haven't used this mode, so I can't really comment on the results > you are seeing. It's a bit strange. Help says it should clear inodes between iterations (without the -k flag), but it does not seem to. > > > enabled. ext4 is using default mkfs and mount parameters except for > > > barrier=0. All numbers are averages of three runs. 
> > > > > > fs_mark rate (thousands of files/second) > > > 2.6.35-rc5 2.6.35-rc5-scale > > > threads xfs ext4 xfs ext4 > > > 1 20 39 20 39 > > > 2 35 55 35 57 > > > 4 60 41 57 42 > > > 8 79 9 75 9 > > > > > > ext4 is getting IO bound at more than 2 threads, so apart from > > > pointing out that XFS is 8-9x faster than ext4 at 8 thread, I'm > > > going to ignore ext4 for the purposes of testing scalability here. > > > > > > For XFS w/ delayed logging, 2.6.35-rc5 is only getting to about 600% > > > CPU and with Nick's patches it's about 650% (10% higher) for > > > slightly lower throughput. So at this class of machine for this > > > workload, the changes result in a slight reduction in scalability. > > > > I wonder if these results are stable. It's possible that changes in > > reclaim behaviour are causing my patches to require more IO for a > > given unit of work? > > More likely that's the result of using a smaller log size because it > will require more frequent metadata pushes to make space for new > transactions. I was just checking whether your numbers are stable (where you saw some slowdown with vfs-scale patches), and what could be the cause. I agree that running real disks could make big changes in behaviour. > > I was seeing XFS 'livelock' in reclaim more with my patches, it > > could be due to more parallelism now being allowed from the vfs and > > reclaim. > > > > Based on my above numbers, I don't see that rcu-inodes is causing a > > problem, and in terms of SMP scalability, there is really no way that > > vanilla is more scalable, so I'm interested to see where this slowdown > > is coming from. > > As I said initially, ram disks hide IO latency changes resulting > from increased numbers of IO or increases in seek distances. My > initial guess is the change in inode reclaim behaviour causing > different IO patterns and more seeks under reclaim because the zone > based reclaim is no longer reclaiming inodes in the order > they are created (i.e. 
we are not doing sequential inode reclaim any > more. Sounds plausible. I'll do more investigations along those lines. > FWIW, I use PCP monitoring graphs to correlate behavioural changes > across different subsystems because it is far easier to relate > information visually than it is by looking at raw numbers or traces. > I think this graph shows the effect of relcaim on performance > most clearly: > > http://userweb.kernel.org/~dgc/shrinker-2.6.36/fs_mark-2.6.35-rc3-context-only-per-xfs-batch6-16x500-xfs.png I haven't actually used that, it looks interesting. > It's pretty clear that when the inode/dentry cache shrinkers are > running, sustained create/unlink performance goes right down. From a > different tab not in the screen shot (the other "test-4" tab), I > could see CPU usage also goes down and the disk iops go way up > whenever the create/unlink performance dropped. This same behaviour > happens with the vfs-scale patchset, so it's not related to lock > contention - just aggressive reclaim of still-dirty inodes. > > FYI, The patch under test there was the XFS shrinker ignoring 7 out > of 8 shrinker calls and then on the 8th call doing the work of all > previous calls. i.e emulating SHRINK_BATCH = 1024. Interestingly > enough, that one change reduced the runtime of the 8m inode > create/unlink load by ~25% (from ~24min to ~18min). Hmm, interesting. Well that's naturally configurable with the shrinker API changes I'm hoping to have merged. I'll plan to push that ahead of the vfs-scale patches of course. > That is by far the largest improvement I've been able to obtain from > modifying the shrinker code, and it is from those sorts of > observations that I think that IO being issued from reclaim is > currently the most significant performance limiting factor for XFS > in this sort of workload.... How is the xfs inode reclaim tied to linux inode reclaim? Does the xfs inode not become reclaimable until some time after the linux inode is reclaimed? Or what? 
Do all or most of the xfs inodes require IO before being reclaimed
during this test? I wonder if you could throttle them a bit or sort
them somehow so that they tend to be cleaned by writeout and reclaim
just comes after and removes the clean ones, like pagecache reclaim is
(supposed) to work?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body
to majordomo@kvack.org. For more info on Linux MM, see:
http://www.linux-mm.org/ . Don't email: dont@kvack.org

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: VFS scalability git tree 2010-07-27 15:09 ` Nick Piggin (?) @ 2010-07-28 4:59 ` Dave Chinner -1 siblings, 0 replies; 76+ messages in thread From: Dave Chinner @ 2010-07-28 4:59 UTC (permalink / raw) To: Nick Piggin Cc: Nick Piggin, linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar, John Stultz

On Wed, Jul 28, 2010 at 01:09:08AM +1000, Nick Piggin wrote:
> On Tue, Jul 27, 2010 at 11:18:10PM +1000, Dave Chinner wrote:
> > On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote:
> > > On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote:
> > solve. The difficulty (as always) is in reliably reproducing the bad
> > behaviour.
>
> Sure, and I didn't see any corruptions, it seems pretty stable and
> scalability is better than other filesystems. I'll see if I can
> give a better recipe to reproduce the 'livelock'ish behaviour.

Well, stable is a good start :)

> > > > fs_mark rate (thousands of files/second)
> > > >            2.6.35-rc5    2.6.35-rc5-scale
> > > > threads    xfs   ext4    xfs   ext4
> > > >   1         20    39      20    39
> > > >   2         35    55      35    57
> > > >   4         60    41      57    42
> > > >   8         79     9      75     9
> > > >
> > > > ext4 is getting IO bound at more than 2 threads, so apart from
> > > > pointing out that XFS is 8-9x faster than ext4 at 8 threads, I'm
> > > > going to ignore ext4 for the purposes of testing scalability here.
> > > >
> > > > For XFS w/ delayed logging, 2.6.35-rc5 is only getting to about 600%
> > > > CPU and with Nick's patches it's about 650% (10% higher) for
> > > > slightly lower throughput. So at this class of machine for this
> > > > workload, the changes result in a slight reduction in scalability.
> > >
> > > I wonder if these results are stable. It's possible that changes in
> > > reclaim behaviour are causing my patches to require more IO for a
> > > given unit of work?
> >
> > More likely that's the result of using a smaller log size because it
> > will require more frequent metadata pushes to make space for new
> > transactions.
> I was just checking whether your numbers are stable (where you
> saw some slowdown with vfs-scale patches), and what could be the
> cause. I agree that running real disks could make big changes in
> behaviour.

Yeah, the numbers are repeatable within about +/-5%. I generally don't
bother with optimisations that result in gains/losses less than that
because IO benchmarks that reliably reproduce results with more precise
repeatability than that are few and far between.

> > FWIW, I use PCP monitoring graphs to correlate behavioural changes
> > across different subsystems because it is far easier to relate
> > information visually than it is by looking at raw numbers or traces.
> > I think this graph shows the effect of reclaim on performance
> > most clearly:
> >
> > http://userweb.kernel.org/~dgc/shrinker-2.6.36/fs_mark-2.6.35-rc3-context-only-per-xfs-batch6-16x500-xfs.png
>
> I haven't actually used that, it looks interesting.

The archiving side of PCP is the most useful, I find. i.e. being able
to record the metrics into a file and analyse them with pmchart or
other tools after the fact...

> > That is by far the largest improvement I've been able to obtain from
> > modifying the shrinker code, and it is from those sorts of
> > observations that I think that IO being issued from reclaim is
> > currently the most significant performance limiting factor for XFS
> > in this sort of workload....
>
> How is the xfs inode reclaim tied to linux inode reclaim? Does the
> xfs inode not become reclaimable until some time after the linux inode
> is reclaimed? Or what?

The struct xfs_inode embeds a struct inode like so:

struct xfs_inode {
	.....
	struct inode	i_inode;
}

so they are the same chunk of memory. XFS does not use the VFS inode
hashes for finding inodes - that's what the per-ag radix trees are used
for. The xfs_inode lives longer than the struct inode because we do
non-trivial work after the VFS "reclaims" the struct inode.
For example, when an inode is unlinked we do not truncate or free the
inode until after the VFS has finished with it - the inode remains on
the unlinked list (orphaned in ext3 terms) from the time it is unlinked
by the VFS to the time the last VFS reference goes away. When XFS gets
it, XFS then issues the inactive transaction that takes the inode off
the unlinked list and marks it free in the inode alloc btree. This
transaction is asynchronous and dirties the xfs inode. Finally XFS will
mark the inode as reclaimable via a radix tree tag. The final
processing of the inode is then done via a background reclaim walk from
xfssyncd (every 30s) where it will do non-blocking operations to
finalize reclaim.

It may take several passes to actually reclaim the inode. e.g. one pass
to force the log if the inode is pinned, another pass to flush the
inode to disk if it is dirty and not stale, and then another pass to
reclaim the inode once clean. There may be multiple passes in between
where the inode is skipped because those operations have not completed.

And to top it all off, if the inode is looked up again (cache hit)
while in the reclaimable state, it will be removed from the reclaim
state and reused immediately. In this case we don't need to continue
the reclaim processing - other things will ensure all the correct
information will go to disk.

> Do all or most of the xfs inodes require IO before being reclaimed
> during this test?

Yes, because all the inodes are being dirtied and they are being
reclaimed faster than background flushing expires them.

> I wonder if you could throttle them a bit or sort
> them somehow so that they tend to be cleaned by writeout and reclaim
> just comes after and removes the clean ones, like pagecache reclaim
> is (supposed) to work?
The whole point of using the radix trees is to get nicely sorted
reclaim IO - inodes are indexed by number, and the radix tree walk
gives us ascending inode number (and hence ascending block number)
reclaim - and the background reclaim allows optimal flushing to occur
by aggregating all the IO into delayed write metadata buffers so they
can be sorted and flushed to the elevator by the xfsbufd in the most
optimal manner possible. The shrinker does preempt this somewhat, which
is why delaying the XFS shrinker's work appears to improve things a
lot. If the shrinker is not running, the background reclaim does
exactly what you are suggesting.

However, I don't think the increase in iops is caused by the XFS inode
shrinker - I think that it is the VFS cache shrinkers. If you look at
the graphs in the link above, performance doesn't decrease when the XFS
inode cache is being shrunk (top chart, yellow trace) - it drops when
the vfs caches are being shrunk (middle chart). I haven't correlated
the behaviour any further than that because I haven't had time.

FWIW, all this background reclaim, radix tree reclaim tagging and
walking, embedded struct inodes, etc is all relatively new code. The
oldest bit of it was introduced in 2.6.31 (I think) and so a
significant part of what we are exploring here is uncharted territory.
The changes to reclaim, etc are partially responsible for the
scalability we are getting from delayed logging, but there is certainly
room for improvement....

Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: VFS scalability git tree 2010-07-22 19:01 ` Nick Piggin @ 2010-07-23 15:35 ` Nick Piggin -1 siblings, 0 replies; 76+ messages in thread From: Nick Piggin @ 2010-07-23 15:35 UTC (permalink / raw) To: Nick Piggin, Michael Neuling Cc: linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar, John Stultz

On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> I'm pleased to announce I have a git tree up of my vfs scalability work.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
>
> Summary of a few numbers I've run. google's socket teardown workload
> runs 3-4x faster on my 2 socket Opteron. Single thread git diff runs 20%
> faster on same machine. 32 node Altix runs dbench on ramfs 150x faster
> (100MB/s up to 15GB/s).

Following post just contains some preliminary benchmark numbers on a
POWER7. Boring if you're not interested in this stuff.

IBM and Mikey kindly allowed me to do some test runs on a big POWER7
system today. Very is the only word I'm authorized to describe how big
is big. We tested the vfs-scale-working and master branches from my git
tree as of today. I'll stick with relative numbers to be safe. All
tests were run on ramfs.

First and very important is single threaded performance of basic code.
POWER7 is obviously vastly different from a Barcelona or Nehalem, and
store-free path walk uses a lot of seqlocks, which are cheap on x86, a
little more expensive on others.

Test case         time difference, vanilla to vfs-scale
                  (negative is better)
stat()            -10.8% +/- 0.3%
close(open())       4.3% +/- 0.3%
unlink(creat())    36.8% +/- 0.3%

stat is significantly faster which is really good. open/close is a bit
slower which we didn't get time to analyse. There are one or two
seqlock checks which might be avoided, which could make up the
difference. It's not horrible, but I hope to get POWER7 open/close more
competitive (on x86 open/close is even a bit faster).
Note this is a worst case for rcu-path-walk: lookup of "./file",
because it has to take a refcount on the final element. With more
elements, rcu walk should gain the advantage.

creat/unlink is showing the big RCU penalty. However I have penciled
out a working design with Linus of how to do SLAB_DESTROY_BY_RCU.
However it makes the store-free path walking and some inode RCU list
walking a little bit trickier, so I prefer not to dump too much in at
once. There is something that can be done if regressions show up. I
don't anticipate many regressions outside microbenchmarks, and this is
about the absolute worst case.

On to parallel tests. Firstly, the google socket workload. Running with
"NR_THREADS" children, vfs-scale patches do this:

root@p7ih06:~/google# time ./google --files_per_cpu 10000 > /dev/null
real    0m4.976s
user    8m38.925s
sys     6m45.236s

root@p7ih06:~/google# time ./google --files_per_cpu 20000 > /dev/null
real    0m7.816s
user    11m21.034s
sys     14m38.258s

root@p7ih06:~/google# time ./google --files_per_cpu 40000 > /dev/null
real    0m11.358s
user    11m37.955s
sys     28m44.911s

Reducing to NR_THREADS/4 children allows vanilla to complete:

root@p7ih06:~/google# time ./google --files_per_cpu 10000
real    1m23.118s
user    3m31.820s
sys     81m10.405s

I was actually surprised it did that well.

Dbench was an interesting one. We didn't manage to stretch the box's
legs, unfortunately! dbench with 1 proc gave about 500MB/s, 64 procs
gave 21GB/s, and at 128 throughput dropped dramatically. Turns out that
weird things start happening with the rename seqlock versus d_lookup,
and d_move contention (dbench does a sprinkle of renaming). That can be
improved I think, but it's not worth bothering with for the time being.
It's not really worth testing vanilla at high dbench parallelism.

Parallel git diff workload looked OK. It seemed to be scaling fine in
the vfs, but it hit a bottleneck in powerpc's tlb invalidation, so
numbers may not be so interesting.
Lastly, some parallel syscall microbenchmarks:

procs                        vanilla     vfs-scale
open-close, separate-cwd
  1                        384557.70     355923.82  op/s/proc
  NR_CORES                     86.63     164054.64  op/s/proc
  NR_THREADS                   18.68 (ouch!)
open-close, same-cwd
  1                        381074.32     339161.25
  NR_CORES                    104.16     107653.05
creat-unlink, separate-cwd
  1                        145891.05     104301.06
  NR_CORES                     29.81      10061.66
creat-unlink, same-cwd
  1                        129681.27     104301.06
  NR_CORES                     12.68        181.24

So we can see the single thread performance regressions here, but the
vanilla case really chokes at high CPU counts.

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: VFS scalability git tree
  2010-07-22 19:01 ` Nick Piggin
@ 2010-07-24  8:43 ` KOSAKI Motohiro
  -1 siblings, 0 replies; 76+ messages in thread
From: KOSAKI Motohiro @ 2010-07-24  8:43 UTC (permalink / raw)
  To: Nick Piggin
  Cc: kosaki.motohiro, linux-fsdevel, linux-kernel, linux-mm,
	Frank Mayhar, John Stultz

> At this point, I would be very interested in reviewing, correctness
> testing on different configurations, and of course benchmarking.

I haven't reviewed this series for a long time, but I've found one
mysterious shrink_slab() usage. Can you please look at my patch? (I will
send it as another mail)

^ permalink raw reply	[flat|nested] 76+ messages in thread
* [PATCH 1/2] vmscan: shrink_all_slab() use reclaim_state instead of the return value of shrink_slab()
  2010-07-24  8:43 ` KOSAKI Motohiro
  (?)
@ 2010-07-24  8:44 ` KOSAKI Motohiro
  -1 siblings, 0 replies; 76+ messages in thread
From: KOSAKI Motohiro @ 2010-07-24  8:44 UTC (permalink / raw)
  To: Nick Piggin, linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar,
	John Stultz
  Cc: kosaki.motohiro

Now, shrink_slab() doesn't return the number of reclaimed objects. IOW,
the current shrink_all_slab() is broken. Thus we instead use reclaim_state
to detect when there are no reclaimable slab objects left.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/vmscan.c |   20 +++++++++-----------
 1 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d7256e0..bfa1975 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -300,18 +300,16 @@ static unsigned long shrink_slab(struct zone *zone, unsigned long scanned, unsig
 void shrink_all_slab(void)
 {
 	struct zone *zone;
-	unsigned long nr;
+	struct reclaim_state reclaim_state;
 
-again:
-	nr = 0;
-	for_each_zone(zone)
-		nr += shrink_slab(zone, 1, 1, 1, GFP_KERNEL);
-	/*
-	 * If we reclaimed less than 10 objects, might as well call
-	 * it a day. Nothing special about the number 10.
-	 */
-	if (nr >= 10)
-		goto again;
+	current->reclaim_state = &reclaim_state;
+	do {
+		reclaim_state.reclaimed_slab = 0;
+		for_each_zone(zone)
+			shrink_slab(zone, 1, 1, 1, GFP_KERNEL);
+	} while (reclaim_state.reclaimed_slab);
+
+	current->reclaim_state = NULL;
 }
 
 static inline int is_page_cache_freeable(struct page *page)
-- 
1.6.5.2

^ permalink raw reply related	[flat|nested] 76+ messages in thread
* Re: [PATCH 1/2] vmscan: shrink_all_slab() use reclaim_state instead of the return value of shrink_slab()
  2010-07-24  8:44 ` KOSAKI Motohiro
@ 2010-07-24 12:05 ` KOSAKI Motohiro
  -1 siblings, 0 replies; 76+ messages in thread
From: KOSAKI Motohiro @ 2010-07-24 12:05 UTC (permalink / raw)
  To: Nick Piggin, linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar,
	John Stultz
  Cc: kosaki.motohiro

2010/7/24 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>:
> Now, shrink_slab() doesn't return number of reclaimed objects. IOW,
> current shrink_all_slab() is broken. Thus instead we use reclaim_state
> to detect no reclaimable slab objects.
>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
>  mm/vmscan.c |   20 +++++++++-----------
>  1 files changed, 9 insertions(+), 11 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d7256e0..bfa1975 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -300,18 +300,16 @@ static unsigned long shrink_slab(struct zone *zone, unsigned long scanned, unsig
>  void shrink_all_slab(void)
>  {
>  	struct zone *zone;
> -	unsigned long nr;
> +	struct reclaim_state reclaim_state;
>
> -again:
> -	nr = 0;
> -	for_each_zone(zone)
> -		nr += shrink_slab(zone, 1, 1, 1, GFP_KERNEL);
> -	/*
> -	 * If we reclaimed less than 10 objects, might as well call
> -	 * it a day. Nothing special about the number 10.
> -	 */
> -	if (nr >= 10)
> -		goto again;
> +	current->reclaim_state = &reclaim_state;
> +	do {
> +		reclaim_state.reclaimed_slab = 0;
> +		for_each_zone(zone)

Oops, this should be for_each_populated_zone().

> +			shrink_slab(zone, 1, 1, 1, GFP_KERNEL);
> +	} while (reclaim_state.reclaimed_slab);
> +
> +	current->reclaim_state = NULL;
>  }
>
>  static inline int is_page_cache_freeable(struct page *page)
> --
> 1.6.5.2

^ permalink raw reply	[flat|nested] 76+ messages in thread
* [PATCH 2/2] vmscan: change shrink_slab() return type to void
  2010-07-24  8:43 ` KOSAKI Motohiro
  (?)
@ 2010-07-24  8:46 ` KOSAKI Motohiro
  -1 siblings, 0 replies; 76+ messages in thread
From: KOSAKI Motohiro @ 2010-07-24  8:46 UTC (permalink / raw)
  To: Nick Piggin, linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar,
	John Stultz
  Cc: kosaki.motohiro

Now, no caller uses the return value of shrink_slab(). Thus we can change
it to void.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/vmscan.c |    7 +++----
 1 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bfa1975..89b593e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -277,24 +277,23 @@ EXPORT_SYMBOL(shrinker_do_scan);
  *
  * Returns the number of slab objects which we shrunk.
  */
-static unsigned long shrink_slab(struct zone *zone, unsigned long scanned, unsigned long total,
+static void shrink_slab(struct zone *zone, unsigned long scanned, unsigned long total,
 			unsigned long global, gfp_t gfp_mask)
 {
 	struct shrinker *shrinker;
-	unsigned long ret = 0;
 
 	if (scanned == 0)
 		scanned = SWAP_CLUSTER_MAX;
 
 	if (!down_read_trylock(&shrinker_rwsem))
-		return 1;	/* Assume we'll be able to shrink next time */
+		return;
 
 	list_for_each_entry(shrinker, &shrinker_list, list) {
 		(*shrinker->shrink)(shrinker, zone, scanned, total,
 						global, gfp_mask);
 	}
 	up_read(&shrinker_rwsem);
-	return ret;
+	return;
 }
 
 void shrink_all_slab(void)
-- 
1.6.5.2

^ permalink raw reply related	[flat|nested] 76+ messages in thread
* Re: VFS scalability git tree
  2010-07-24  8:43 ` KOSAKI Motohiro
@ 2010-07-24 10:54 ` KOSAKI Motohiro
  -1 siblings, 0 replies; 76+ messages in thread
From: KOSAKI Motohiro @ 2010-07-24 10:54 UTC (permalink / raw)
  To: Nick Piggin
  Cc: kosaki.motohiro, linux-fsdevel, linux-kernel, linux-mm,
	Frank Mayhar, John Stultz

> > At this point, I would be very interested in reviewing, correctness
> > testing on different configurations, and of course benchmarking.
>
> I haven't review this series so long time. but I've found one misterious
> shrink_slab() usage. can you please see my patch? (I will send it as
> another mail)

Plus, I have one question. The upstream shrink_slab() calculation and your
calculation differ more than your patch description explains.

upstream:

  shrink_slab()
                               lru_scanned            max_pass
    basic_scan_objects = 4 x ------------- x -----------------------------
                               lru_pages     shrinker->seeks (default:2)

    scan_objects = min(basic_scan_objects, max_pass * 2)

  shrink_icache_memory()
                                       sysctl_vfs_cache_pressure
    max_pass = inodes_stat.nr_unused x --------------------------
                                                 100

That is, higher sysctl_vfs_cache_pressure means more slab reclaim.

On the other hand, your code:

  shrinker_add_scan()
                         scanned     objects
    scan_objects = 4 x ----------- x --------- x SHRINK_FACTOR x SHRINK_FACTOR
                         total       ratio

  shrink_icache_memory()
    ratio = DEFAULT_SEEKS * sysctl_vfs_cache_pressure / 100

That is, higher sysctl_vfs_cache_pressure means less slab reclaim.

So, I guess the following change correctly reflects your original
intention. The new calculation is:

  shrinker_add_scan()
                      scanned
    scan_objects = ------------- x objects x ratio
                      total

  shrink_icache_memory()
    ratio = DEFAULT_SEEKS * sysctl_vfs_cache_pressure / 100

This has the same behavior as upstream, because upstream's
4/shrinker->seeks = 2, and the above has DEFAULT_SEEKS = SHRINK_FACTOR*2.

===============
o move 'ratio' from denominator to numerator
o adapt kvm/mmu_shrink
o SHRINK_FACTOR / 2 (default seek) x 4 (unknown shrink slab modifier)
  -> (SHRINK_FACTOR*2) == DEFAULT_SEEKS

---
 arch/x86/kvm/mmu.c |    2 +-
 mm/vmscan.c        |   10 ++--------
 2 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ae5a038..cea1e92 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2942,7 +2942,7 @@ static int mmu_shrink(struct shrinker *shrink,
 	}
 
 	shrinker_add_scan(&nr_to_scan, scanned, global, cache_count,
-			  DEFAULT_SEEKS*10);
+			  DEFAULT_SEEKS/10);
 
 done:
 	cache_count = shrinker_do_scan(&nr_to_scan, SHRINK_BATCH);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 89b593e..2d8e9ab 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -208,14 +208,8 @@ void shrinker_add_scan(unsigned long *dst,
 {
 	unsigned long long delta;
 
-	/*
-	 * The constant 4 comes from old code. Who knows why.
-	 * This could all use a good tune up with some decent
-	 * benchmarks and numbers.
-	 */
-	delta = (unsigned long long)scanned * objects
-		* SHRINK_FACTOR * SHRINK_FACTOR * 4UL;
-	do_div(delta, (ratio * total + 1));
+	delta = (unsigned long long)scanned * objects * ratio;
+	do_div(delta, total+ 1);
 
 	/*
 	 * Avoid risking looping forever due to too large nr value:
-- 
1.6.5.2

^ permalink raw reply related	[flat|nested] 76+ messages in thread
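[Editor's note: KOSAKI's monotonicity point above can be checked
numerically. The constants and inputs below are illustrative assumptions,
not the kernel's real values — the point is only the direction of the
change: with 'ratio' in the denominator (the vfs-scale tree), raising
vfs_cache_pressure *reduces* slab scanning; with it in the numerator (this
patch), raising pressure increases scanning, as upstream intends.]

```c
#include <assert.h>

#define SHRINK_FACTOR 128ULL		/* assumed value, for illustration */
#define DEFAULT_SEEKS (SHRINK_FACTOR * 2)

/* ratio as computed by shrink_icache_memory() in the quoted mail */
static unsigned long long ratio_for(unsigned pressure)
{
	return DEFAULT_SEEKS * pressure / 100;
}

/* vfs-scale tree: ratio divides, so more pressure -> fewer objects */
static unsigned long long scan_old(unsigned long long scanned,
				   unsigned long long total,
				   unsigned long long objects,
				   unsigned long long ratio)
{
	return scanned * objects * SHRINK_FACTOR * SHRINK_FACTOR * 4ULL
			/ (ratio * total + 1);
}

/* patched formula: ratio multiplies, like upstream's pressure scaling */
static unsigned long long scan_new(unsigned long long scanned,
				   unsigned long long total,
				   unsigned long long objects,
				   unsigned long long ratio)
{
	return scanned * objects * ratio / (total + 1);
}
```

For any fixed scanned/total/objects, doubling the pressure halves
scan_old() but doubles scan_new() — which is the behavioral inversion the
reply describes.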
* Re: VFS scalability git tree
  2010-07-22 19:01 ` Nick Piggin
@ 2010-07-26  5:41 ` Nick Piggin
  -1 siblings, 0 replies; 76+ messages in thread
From: Nick Piggin @ 2010-07-26  5:41 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-kernel, linux-mm, Frank Mayhar, John Stultz, Dave Chinner,
	KOSAKI Motohiro, Michael Neuling

On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> I'm pleased to announce I have a git tree up of my vfs scalability work.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git

Pushed several fixes and improvements
o XFS bugs fixed by Dave
o dentry and inode stats bugs noticed by Dave
o vmscan shrinker bugs fixed by KOSAKI san
o compile bugs noticed by John
o a few attempts to improve powerpc performance (eg. reducing smp_rmb())
o scalability improvements for rename_lock

^ permalink raw reply	[flat|nested] 76+ messages in thread
* Re: VFS scalability git tree
  2010-07-26  5:41 ` Nick Piggin
@ 2010-07-28 10:24 ` Nick Piggin
  -1 siblings, 0 replies; 76+ messages in thread
From: Nick Piggin @ 2010-07-28 10:24 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar, John Stultz,
	Dave Chinner, KOSAKI Motohiro, Michael Neuling

On Mon, Jul 26, 2010 at 03:41:11PM +1000, Nick Piggin wrote:
> On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > I'm pleased to announce I have a git tree up of my vfs scalability work.
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
>
> Pushed several fixes and improvements
> o XFS bugs fixed by Dave
> o dentry and inode stats bugs noticed by Dave
> o vmscan shrinker bugs fixed by KOSAKI san
> o compile bugs noticed by John
> o a few attempts to improve powerpc performance (eg. reducing smp_rmb())
> o scalability improvments for rename_lock

Yet another result on my small 2s8c Opteron. This time, the re-aim
benchmark configured as described here:
http://ertos.nicta.com.au/publications/papers/Chubb_Williams_05.pdf

It is using ext2 on ramdisk and an IO intensive workload, with fsync
activity. I did 10 runs on each, and took the max jobs/sec of each run.

    N        Min        Max     Median        Avg     Stddev
x  10    2598750    2735122  2665384.6  2653353.8  46421.696
+  10  3337297.3  3484687.5  3410689.7  3397763.8  49994.631
Difference at 95.0% confidence
	744410 +/- 45327.3
	28.0554% +/- 1.7083%

Average is 2653K jobs/s for vanilla, versus 3398K jobs/s for vfs-scale,
a 28% speedup.

The profile is interesting. It is known to be inode_lock intensive, but
we also see here that it is do_lookup intensive, due to cacheline
bouncing in common elements of path lookups.

Vanilla:
# Overhead  Symbol
# ........  ......
#
     7.63%  [k] __d_lookup
            |
            |--88.59%-- do_lookup
            |--9.75%-- __lookup_hash
            |--0.89%-- d_lookup

     7.17%  [k] _raw_spin_lock
            |
            |--11.07%-- _atomic_dec_and_lock
            |          |
            |          |--53.73%-- dput
            |           --46.27%-- iput
            |
            |--9.85%-- __mark_inode_dirty
            |          |
            |          |--46.25%-- ext2_new_inode
            |          |--25.32%-- __set_page_dirty
            |          |--18.27%-- nobh_write_end
            |          |--6.91%-- ext2_new_blocks
            |          |--3.12%-- ext2_unlink
            |
            |--7.69%-- ext2_new_inode
            |--6.84%-- insert_inode_locked
            |          ext2_new_inode
            |--6.56%-- new_inode
            |          ext2_new_inode
            |--5.61%-- writeback_single_inode
            |          sync_inode
            |          generic_file_fsync
            |          ext2_fsync
            |--5.13%-- dput
            |--3.75%-- generic_delete_inode
            |--3.56%-- __d_lookup
            |--3.53%-- ext2_free_inode
            |--3.40%-- sync_inode
            |--2.71%-- d_instantiate
            |--2.36%-- d_delete
            |--2.25%-- inode_sub_bytes
            |--1.84%-- file_move
            |--1.52%-- file_kill
            |--1.36%-- ext2_new_blocks
            |--1.34%-- ext2_create
            |--1.34%-- d_alloc
            |--1.11%-- do_lookup
            |--1.07%-- iput
            |--1.05%-- __d_instantiate

     4.19%  [k] mutex_spin_on_owner
            |
            |--99.92%-- __mutex_lock_slowpath
            |          mutex_lock
            |          |
            |          |--56.45%-- do_unlinkat
            |          |          sys_unlink
            |          |
            |           --43.55%-- do_last
            |                     do_filp_open

     2.96%  [k] _atomic_dec_and_lock
            |
            |--58.18%-- dput
            |--31.02%-- mntput_no_expire
            |--3.30%-- path_put
            |--3.09%-- iput
            |--2.69%-- link_path_walk
            |--1.02%-- fput

     2.73%  [k] copy_user_generic_string
     2.67%  [k] __mark_inode_dirty
     2.65%  [k] link_path_walk
     2.63%  [k] mark_buffer_dirty
     1.72%  [k] __memcpy
     1.62%  [k] generic_getxattr
     1.50%  [k] acl_permission_check
     1.30%  [k] __find_get_block
     1.30%  [k] __memset
     1.17%  [k] ext2_find_entry
     1.09%  [k] ext2_new_inode
     1.06%  [k] system_call
     1.01%  [k] kmem_cache_free
     1.00%  [k] dput

In vfs-scale, most of the spinlock contention and path lookup cost is
gone. Contention for parent i_mutex (and d_lock) for creat/unlink
operations is now at the top of the profile. A lot of the spinlock
overhead seems to be not contention so much as the cost of the atomics.
Down at 3% it is much less a problem than it was, though.
We may run into a bit of contention on the per-bdi inode dirty/io list
lock, with just a single ramdisk device (dirty/fsync activity will hit
this lock), but it is really not worth worrying about at the moment.

# Overhead  Symbol
# ........  ......
#
     5.67%  [k] mutex_spin_on_owner
            |
            |--99.96%-- __mutex_lock_slowpath
            |          mutex_lock
            |          |
            |          |--58.63%-- do_unlinkat
            |          |          sys_unlink
            |          |
            |           --41.37%-- do_last
            |                     do_filp_open

     3.93%  [k] __mark_inode_dirty
     3.43%  [k] copy_user_generic_string
     3.31%  [k] link_path_walk
     3.15%  [k] mark_buffer_dirty
     3.11%  [k] _raw_spin_lock
            |
            |--11.03%-- __mark_inode_dirty
            |--10.54%-- ext2_new_inode
            |--7.60%-- ext2_free_inode
            |--6.33%-- inode_sub_bytes
            |--6.27%-- ext2_new_blocks
            |--5.80%-- generic_delete_inode
            |--4.09%-- ext2_create
            |--3.62%-- writeback_single_inode
            |--2.92%-- sync_inode
            |--2.81%-- generic_drop_inode
            |--2.46%-- iput
            |--1.86%-- dput
            |--1.80%-- __dquot_alloc_space
            |--1.61%-- __mutex_unlock_slowpath
            |--1.59%-- generic_file_fsync
            |--1.57%-- __d_instantiate
            |--1.55%-- __set_page_dirty_buffers
            |--1.36%-- d_alloc_and_lookup
            |--1.23%-- do_path_lookup
            |--1.10%-- ext2_free_blocks

     2.13%  [k] __memset
     2.12%  [k] __memcpy
     1.98%  [k] __d_lookup_rcu
     1.46%  [k] generic_getxattr
     1.44%  [k] ext2_find_entry
     1.41%  [k] __find_get_block
     1.27%  [k] kmem_cache_free
     1.25%  [k] ext2_new_inode
     1.23%  [k] system_call
     1.02%  [k] ext2_add_link
     1.01%  [k] strncpy_from_user
     0.96%  [k] kmem_cache_alloc
     0.95%  [k] find_get_page
     0.94%  [k] sysret_check
     0.88%  [k] __d_lookup
     0.75%  [k] ext2_delete_entry
     0.70%  [k] generic_file_aio_read
     0.67%  [k] generic_file_buffered_write
     0.63%  [k] ext2_new_blocks
     0.62%  [k] __percpu_counter_add
     0.59%  [k] __bread
     0.58%  [k] __wake_up_bit
     0.58%  [k] __mutex_lock_slowpath
     0.56%  [k] __ext2_write_inode
     0.55%  [k] ext2_get_blocks

^ permalink raw reply	[flat|nested] 76+ messages in thread
* Re: VFS scalability git tree @ 2010-07-28 10:24 ` Nick Piggin 0 siblings, 0 replies; 76+ messages in thread From: Nick Piggin @ 2010-07-28 10:24 UTC (permalink / raw) To: Nick Piggin Cc: linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar, John Stultz, Dave Chinner, KOSAKI Motohiro, Michael Neuling On Mon, Jul 26, 2010 at 03:41:11PM +1000, Nick Piggin wrote: > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote: > > I'm pleased to announce I have a git tree up of my vfs scalability work. > > > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git > > Pushed several fixes and improvements > o XFS bugs fixed by Dave > o dentry and inode stats bugs noticed by Dave > o vmscan shrinker bugs fixed by KOSAKI san > o compile bugs noticed by John > o a few attempts to improve powerpc performance (eg. reducing smp_rmb()) > o scalability improvments for rename_lock Yet another result on my small 2s8c Opteron. This time, the re-aim benchmark configured as described here: http://ertos.nicta.com.au/publications/papers/Chubb_Williams_05.pdf It is using ext2 on ramdisk and an IO intensive workload, with fsync activity. I did 10 runs on each, and took the max jobs/sec of each run. N Min Max Median Avg Stddev x 10 2598750 2735122 2665384.6 2653353.8 46421.696 + 10 3337297.3 3484687.5 3410689.7 3397763.8 49994.631 Difference at 95.0% confidence 744410 +/- 45327.3 28.0554% +/- 1.7083% Average is 2653K jobs/s for vanilla, versus 3398K jobs/s for vfs-scalem or 28% speedup. The profile is interesting. It is known to be inode_lock intensive, but we also see here that it is do_lookup intensive, due to cacheline bouncing in common elements of path lookups. Vanilla: # Overhead Symbol # ........ ...... 
#
     7.63%  [k] __d_lookup
            |
            |--88.59%-- do_lookup
            |--9.75%-- __lookup_hash
            |--0.89%-- d_lookup
     7.17%  [k] _raw_spin_lock
            |
            |--11.07%-- _atomic_dec_and_lock
            |          |
            |          |--53.73%-- dput
            |           --46.27%-- iput
            |
            |--9.85%-- __mark_inode_dirty
            |          |
            |          |--46.25%-- ext2_new_inode
            |          |--25.32%-- __set_page_dirty
            |          |--18.27%-- nobh_write_end
            |          |--6.91%-- ext2_new_blocks
            |          |--3.12%-- ext2_unlink
            |
            |--7.69%-- ext2_new_inode
            |--6.84%-- insert_inode_locked
            |          ext2_new_inode
            |--6.56%-- new_inode
            |          ext2_new_inode
            |--5.61%-- writeback_single_inode
            |          sync_inode
            |          generic_file_fsync
            |          ext2_fsync
            |--5.13%-- dput
            |--3.75%-- generic_delete_inode
            |--3.56%-- __d_lookup
            |--3.53%-- ext2_free_inode
            |--3.40%-- sync_inode
            |--2.71%-- d_instantiate
            |--2.36%-- d_delete
            |--2.25%-- inode_sub_bytes
            |--1.84%-- file_move
            |--1.52%-- file_kill
            |--1.36%-- ext2_new_blocks
            |--1.34%-- ext2_create
            |--1.34%-- d_alloc
            |--1.11%-- do_lookup
            |--1.07%-- iput
            |--1.05%-- __d_instantiate
     4.19%  [k] mutex_spin_on_owner
            |
            |--99.92%-- __mutex_lock_slowpath
            |          mutex_lock
            |          |
            |          |--56.45%-- do_unlinkat
            |          |          sys_unlink
            |          |
            |           --43.55%-- do_last
            |                     do_filp_open
     2.96%  [k] _atomic_dec_and_lock
            |
            |--58.18%-- dput
            |--31.02%-- mntput_no_expire
            |--3.30%-- path_put
            |--3.09%-- iput
            |--2.69%-- link_path_walk
            |--1.02%-- fput
     2.73%  [k] copy_user_generic_string
     2.67%  [k] __mark_inode_dirty
     2.65%  [k] link_path_walk
     2.63%  [k] mark_buffer_dirty
     1.72%  [k] __memcpy
     1.62%  [k] generic_getxattr
     1.50%  [k] acl_permission_check
     1.30%  [k] __find_get_block
     1.30%  [k] __memset
     1.17%  [k] ext2_find_entry
     1.09%  [k] ext2_new_inode
     1.06%  [k] system_call
     1.01%  [k] kmem_cache_free
     1.00%  [k] dput

In vfs-scale, most of the spinlock contention and path lookup cost is gone. Contention for parent i_mutex (and d_lock) for creat/unlink operations is now at the top of the profile. A lot of the spinlock overhead seems to be not contention so much as the cost of the atomics. Down at 3%, it is much less of a problem than it was, though.
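The dput()/iput() hotspots above both funnel through _atomic_dec_and_lock(): the lock is only taken when the refcount is about to hit zero, so much of what shows up as "spinlock" time is the atomic read-modify-write itself bouncing the cacheline. A self-contained userspace sketch of that pattern (the toy spinlock and names are illustrative, not the kernel's actual code):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Minimal stand-in for a kernel spinlock, so the sketch is self-contained. */
typedef struct { atomic_flag locked; } spinlock_t;
#define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

static void spin_lock(spinlock_t *l)   { while (atomic_flag_test_and_set(&l->locked)) ; }
static void spin_unlock(spinlock_t *l) { atomic_flag_clear(&l->locked); }

/*
 * Decrement *cnt, taking the lock only if the count is about to reach
 * zero.  Returns true with the lock held when the caller dropped the
 * last reference, false (lock not taken) otherwise -- the same contract
 * as the kernel's atomic_dec_and_lock().
 */
static bool dec_and_lock(atomic_int *cnt, spinlock_t *lock)
{
	int old = atomic_load(cnt);

	/* Fast path: not the last reference, decrement without the lock.
	 * Note the atomic RMW still bounces the cacheline between CPUs. */
	while (old > 1) {
		if (atomic_compare_exchange_weak(cnt, &old, old - 1))
			return false;
	}

	/* Slow path: we may be dropping the last reference. */
	spin_lock(lock);
	if (atomic_fetch_sub(cnt, 1) == 1)
		return true;	/* count hit zero; return with lock held */
	spin_unlock(lock);
	return false;
}
```

Even when the slow path is never taken, every drop is one atomic on a shared counter, which is the cost the profile is showing.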
We may run into a bit of contention on the per-bdi inode dirty/io list lock, with just a single ramdisk device (dirty/fsync activity will hit this lock), but it is really not worth worrying about at the moment.

# Overhead  Symbol
# ........  ......
#
     5.67%  [k] mutex_spin_on_owner
            |
            |--99.96%-- __mutex_lock_slowpath
            |          mutex_lock
            |          |
            |          |--58.63%-- do_unlinkat
            |          |          sys_unlink
            |          |
            |           --41.37%-- do_last
            |                     do_filp_open
     3.93%  [k] __mark_inode_dirty
     3.43%  [k] copy_user_generic_string
     3.31%  [k] link_path_walk
     3.15%  [k] mark_buffer_dirty
     3.11%  [k] _raw_spin_lock
            |
            |--11.03%-- __mark_inode_dirty
            |--10.54%-- ext2_new_inode
            |--7.60%-- ext2_free_inode
            |--6.33%-- inode_sub_bytes
            |--6.27%-- ext2_new_blocks
            |--5.80%-- generic_delete_inode
            |--4.09%-- ext2_create
            |--3.62%-- writeback_single_inode
            |--2.92%-- sync_inode
            |--2.81%-- generic_drop_inode
            |--2.46%-- iput
            |--1.86%-- dput
            |--1.80%-- __dquot_alloc_space
            |--1.61%-- __mutex_unlock_slowpath
            |--1.59%-- generic_file_fsync
            |--1.57%-- __d_instantiate
            |--1.55%-- __set_page_dirty_buffers
            |--1.36%-- d_alloc_and_lookup
            |--1.23%-- do_path_lookup
            |--1.10%-- ext2_free_blocks
     2.13%  [k] __memset
     2.12%  [k] __memcpy
     1.98%  [k] __d_lookup_rcu
     1.46%  [k] generic_getxattr
     1.44%  [k] ext2_find_entry
     1.41%  [k] __find_get_block
     1.27%  [k] kmem_cache_free
     1.25%  [k] ext2_new_inode
     1.23%  [k] system_call
     1.02%  [k] ext2_add_link
     1.01%  [k] strncpy_from_user
     0.96%  [k] kmem_cache_alloc
     0.95%  [k] find_get_page
     0.94%  [k] sysret_check
     0.88%  [k] __d_lookup
     0.75%  [k] ext2_delete_entry
     0.70%  [k] generic_file_aio_read
     0.67%  [k] generic_file_buffered_write
     0.63%  [k] ext2_new_blocks
     0.62%  [k] __percpu_counter_add
     0.59%  [k] __bread
     0.58%  [k] __wake_up_bit
     0.58%  [k] __mutex_lock_slowpath
     0.56%  [k] __ext2_write_inode
     0.55%  [k] ext2_get_blocks

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: VFS scalability git tree
  2010-07-22 19:01 ` Nick Piggin
@ 2010-07-30 9:12 ` Nick Piggin
  -1 siblings, 0 replies; 76+ messages in thread
From: Nick Piggin @ 2010-07-30 9:12 UTC (permalink / raw)
To: Nick Piggin
Cc: linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar, John Stultz

On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> I'm pleased to announce I have a git tree up of my vfs scalability work.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
>
> Branch vfs-scale-working
>
> The really interesting new item is the store-free path walk, (43fe2b)
> which I've re-introduced. It has had a complete redesign, it has much
> better performance and scalability in more cases, and is actually sane
> code now.

Things are progressing well here with fixes and improvements to the branch.

One thing that has been brought to my attention is that store-free path walking (rcu-walk) drops into the normal refcounted walking on any filesystem that has posix ACLs enabled. Having misread that IS_POSIXACL is based on a superblock flag, I had thought we only drop out of rcu-walk when encountering an inode that actually has ACLs. This is quite an important point for any performance testing work. ACLs can actually be rcu-checked quite easily in most cases, but it takes a bit of work on APIs.

Filesystems defining their own ->permission and ->d_revalidate will also not use rcu-walk. These could likewise be made to support rcu-walk more widely, but it will require knowledge of rcu-walk to be pushed into filesystems. It's not a big deal, basically: no blocking, no stores, no referencing non-rcu-protected data, and confirm with seqlock. That is usually the case in fastpaths. If those rules cannot be satisfied, then just return -ECHILD and you'll get called in the usual ref-walk mode next time.
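The -ECHILD convention can be sketched in a few lines. This is a userspace mock of the contract, not the real VFS API: the struct layouts are stand-ins, and LOOKUP_RCU is a hypothetical flag standing in for however the patchset signals "we are in rcu-walk mode" to the filesystem.

```c
#include <errno.h>
#include <stdbool.h>

#define LOOKUP_RCU 0x1000	/* hypothetical: "caller is in rcu-walk mode" */

/* Illustrative stand-ins for the real VFS types. */
struct nameidata { unsigned int flags; };
struct dentry    { bool still_valid; };

static int example_d_revalidate(struct dentry *dentry, struct nameidata *nd)
{
	if (nd->flags & LOOKUP_RCU) {
		/*
		 * rcu-walk mode: no blocking, no stores, no touching data
		 * that is not RCU-protected.  If the check cannot be done
		 * under those rules, bail out; the walk will retry this
		 * dentry in ordinary ref-walk mode.
		 */
		return -ECHILD;
	}
	/* ref-walk mode: blocking and refcounting are allowed here. */
	return dentry->still_valid ? 1 : 0;
}
```

The key property is that returning -ECHILD is always safe: it only costs a retry in the slower refcounted mode.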
But for now, keep this in mind if you plan to do any serious performance testing work: *do not mount filesystems with ACL support*.

Thanks,
Nick

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: VFS scalability git tree
  2010-07-30 9:12 ` Nick Piggin
  (?)
@ 2010-08-03 0:27 ` john stultz
  -1 siblings, 0 replies; 76+ messages in thread
From: john stultz @ 2010-08-03 0:27 UTC (permalink / raw)
To: Nick Piggin
Cc: Nick Piggin, linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar

On Fri, 2010-07-30 at 19:12 +1000, Nick Piggin wrote:
> On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > I'm pleased to announce I have a git tree up of my vfs scalability work.
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> >
> > Branch vfs-scale-working
> >
> > The really interesting new item is the store-free path walk, (43fe2b)
> > which I've re-introduced. It has had a complete redesign, it has much
> > better performance and scalability in more cases, and is actually sane
> > code now.
>
> Things are progressing well here with fixes and improvements to the
> branch.

Hey Nick,
Just another minor compile issue with today's vfs-scale-working branch:

fs/fuse/dir.c:231: error: ‘fuse_dentry_revalidate_rcu’ undeclared here (not in a function)

From looking at the vfat and ecryptfs changes in 582c56f032983e9a8e4b4bd6fac58d18811f7d41, it looks like you intended to add the following?

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index f0c2479..9ee4c10 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -154,7 +154,7 @@ u64 fuse_get_attr_version(struct fuse_conn *fc)
  * the lookup once more. If the lookup results in the same inode,
  * then refresh the attributes, timeouts and mark the dentry valid.
  */
-static int fuse_dentry_revalidate(struct dentry *entry, struct nameidata *nd)
+static int fuse_dentry_revalidate_rcu(struct dentry *entry, struct nameidata *nd)
 {
 	struct inode *inode = entry->d_inode;

^ permalink raw reply related [flat|nested] 76+ messages in thread
* Re: VFS scalability git tree
  2010-08-03 0:27 ` john stultz
  (?)
@ 2010-08-03 5:44 ` Nick Piggin
  -1 siblings, 0 replies; 76+ messages in thread
From: Nick Piggin @ 2010-08-03 5:44 UTC (permalink / raw)
To: john stultz
Cc: Nick Piggin, Nick Piggin, linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar

On Mon, Aug 02, 2010 at 05:27:59PM -0700, John Stultz wrote:
> On Fri, 2010-07-30 at 19:12 +1000, Nick Piggin wrote:
> > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > > I'm pleased to announce I have a git tree up of my vfs scalability work.
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> > >
> > > Branch vfs-scale-working
> > >
> > > The really interesting new item is the store-free path walk, (43fe2b)
> > > which I've re-introduced. It has had a complete redesign, it has much
> > > better performance and scalability in more cases, and is actually sane
> > > code now.
> >
> > Things are progressing well here with fixes and improvements to the
> > branch.
>
> Hey Nick,
> Just another minor compile issue with today's vfs-scale-working branch.
>
> fs/fuse/dir.c:231: error: ‘fuse_dentry_revalidate_rcu’ undeclared here
> (not in a function)
>
> From looking at the vfat and ecryptfs changes in
> 582c56f032983e9a8e4b4bd6fac58d18811f7d41 it looks like you intended to
> add the following?

Thanks John, you're right. I thought I actually linked and ran this, but I must not have had fuse compiled in.

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: VFS scalability git tree
  2010-08-03 5:44 ` Nick Piggin
  (?)
  (?)
@ 2010-09-14 22:26 ` Christoph Hellwig
  2010-09-14 23:02 ` Frank Mayhar
  -1 siblings, 1 reply; 76+ messages in thread
From: Christoph Hellwig @ 2010-09-14 22:26 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-fsdevel, linux-kernel

Nick,

what's the plan for going ahead with the VFS scalability work? We're pretty late in the 2.6.36 cycle now, and it would be good to get the next batch prepared and reviewed so that it can get some testing in -next.

As mentioned before, my preference would be the inode lock splitup and related patches - they are relatively simple, and we're already seeing workloads where inode_lock really hurts in the writeback code.

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: VFS scalability git tree
  2010-09-14 22:26 ` Christoph Hellwig
@ 2010-09-14 23:02 ` Frank Mayhar
  0 siblings, 0 replies; 76+ messages in thread
From: Frank Mayhar @ 2010-09-14 23:02 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Nick Piggin, linux-fsdevel, linux-kernel

On Tue, 2010-09-14 at 18:26 -0400, Christoph Hellwig wrote:
> Nick,
>
> what's the plan for going ahead with the VFS scalability work? We're
> pretty late in the 2.6.36 cycle now and it would be good to get the next
> batch prepared and reviewed so that it can get some testing in -next.
>
> As mentioned before my preference would be the inode lock splitup and
> related patches - they are relatively simple and we're already seeing
> workloads where inode_lock really hurts in the writeback code.

For the record, while I've been quiet here (really busy) I have run a bunch of pretty serious tests against the original set of patches (note: _not_ the latest bits in Nick's tree; I have those queued up but haven't gotten to them yet). So far I haven't seen any instability at all.

(I did see one case in which a test that does a _lot_ of network traffic with tons of sockets saw a 20+% performance hit on a system with a relatively moderate number of cores, but I haven't had the time to characterize it better and want to test against the newer bits in any event. Sorry to be so vague, I can't really be more specific at this point. Nailing this down is _also_ on my list.)

Performance notwithstanding, I'm impressed with the stability of those original patches. I've run VM stress tests against them, FS stress tests, lots of benchmarks and a bunch of other stuff, and they're solid: no crashes nor any anomalous behavior. That being the case, I would vote enthusiastically for bringing in the inode_lock splitup as soon as is feasible.
--
Frank Mayhar <fmayhar@google.com>
Google, Inc.

^ permalink raw reply [flat|nested] 76+ messages in thread