* VFS scalability git tree
@ 2010-07-22 19:01 ` Nick Piggin
  0 siblings, 0 replies; 76+ messages in thread
From: Nick Piggin @ 2010-07-22 19:01 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-mm; +Cc: Frank Mayhar, John Stultz

I'm pleased to announce I have a git tree up of my vfs scalability work.

git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git

Branch vfs-scale-working

The really interesting new item is the store-free path walk (43fe2b),
which I've re-introduced. It has had a complete redesign: it has much
better performance and scalability in more cases, and is actually sane
code now.

What this does is allow parallel name lookups to walk down common
elements without any cacheline bouncing between them. It can walk
across many interesting cases such as mount points, back up '..', and
negative dentries of most filesystems. It does so without requiring any
atomic operations or any stores at all to shared data. This also makes
it very fast in serial performance (path walking is nearly twice as fast
on my Opteron).

In cases where it cannot continue the RCU walk (e.g. the dentry does
not exist), it can in most cases take a reference on the farthest
element it has reached so far, and then continue on with a regular
refcount-based path walk. My first attempt at this simply dropped
everything and re-did the full refcount-based walk.
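
For anyone who wants the shape of this without digging through the
tree, here is a rough sketch of the fast path. It is illustrative only,
not the code in the branch: next_component(), lookup_child_rcu() and
legitimize_dentry() are made-up helpers, and the sketch assumes a
per-dentry seqcount (d_seq) that rename/unlink bumps.

static struct dentry *rcu_walk_sketch(struct dentry *root, const char *path)
{
	struct dentry *parent = root, *child;
	struct qstr name;
	unsigned seq;

	rcu_read_lock();
	while (next_component(&path, &name)) {		/* made-up helper */
		seq = read_seqcount_begin(&parent->d_seq);
		/* hash lookup with no stores to shared data */
		child = lookup_child_rcu(parent, &name);	/* made-up helper */
		if (!child || read_seqcount_retry(&parent->d_seq, seq))
			goto drop_to_refwalk;	/* raced with rename/unlink */
		parent = child;
	}
	/* take a real reference on the final element before leaving RCU */
	if (!legitimize_dentry(parent))			/* made-up helper */
		goto drop_to_refwalk;
	rcu_read_unlock();
	return parent;
drop_to_refwalk:
	rcu_read_unlock();
	return NULL;	/* caller falls back to the refcounted walk */
}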

I've also been working on stress testing, bug fixing, cutting down
'XXX'es, and improving changelogs and comments.

Most filesystems are untested (it's too large a job to do comprehensive
stress tests on everything), but none have known issues (except nilfs2).
Ext2/3, nfs, nfsd, and ram-based filesystems seem to work well;
ext4/btrfs/xfs/autofs4 have had light testing.

I've never had filesystem corruption when testing these patches (only
lockups or other bugs). But standard disclaimer: they may eat your data.

Summary of a few numbers I've run: Google's socket teardown workload
runs 3-4x faster on my 2-socket Opteron. Single-threaded git diff runs
20% faster on the same machine. A 32-node Altix runs dbench on ramfs
150x faster (100MB/s up to 15GB/s).

At this point, I would be very interested in review, correctness
testing on different configurations, and of course benchmarking.

Thanks,
Nick


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: VFS scalability git tree
  2010-07-22 19:01 ` Nick Piggin
@ 2010-07-23 11:13   ` Dave Chinner
  -1 siblings, 0 replies; 76+ messages in thread
From: Dave Chinner @ 2010-07-23 11:13 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar, John Stultz

On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> I'm pleased to announce I have a git tree up of my vfs scalability work.
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> 
> Branch vfs-scale-working

I've got a couple of patches needed to build XFS - the shrinker
merge left some bad fragments - I'll post them in a minute. This
email is for the longest lockdep warning I've ever seen, which
occurred on boot.
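
The shape of the complaint is the classic softirq lock inversion;
roughly (lock names abbreviated):

/*
 *   CPU0 (reading /proc/net/tcp)     CPU1 (process context)
 *   spin_lock(&ehash_lock);          spin_lock(&inode->i_lock);
 *   spin_lock(&inode->i_lock);       <softirq: TCP receive>
 *     ... spins, CPU1 holds it ...     spin_lock(&ehash_lock);
 *                                        ... spins, CPU0 holds it ...
 *
 * CPU1 can never release i_lock because the softirq is spinning on
 * top of it, so both CPUs are stuck: a SOFTIRQ-safe lock (the TCP
 * ehash lock, taken in softirq context) now depends on a
 * SOFTIRQ-unsafe one (the sockfs inode i_lock, taken with softirqs
 * enabled), which is what lockdep is flagging.
 */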

Cheers,

Dave.

[    6.368707] ======================================================
[    6.369773] [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
[    6.370379] 2.6.35-rc5-dgc+ #58
[    6.370882] ------------------------------------------------------
[    6.371475] pmcd/2124 [HC0[0]:SC0[1]:HE1:SE0] is trying to acquire:
[    6.372062]  (&sb->s_type->i_lock_key#6){+.+...}, at: [<ffffffff81736f8c>] socket_get_id+0x3c/0x60
[    6.372268] 
[    6.372268] and this task is already holding:
[    6.372268]  (&(&hashinfo->ehash_locks[i])->rlock){+.-...}, at: [<ffffffff81791750>] established_get_first+0x60/0x120
[    6.372268] which would create a new lock dependency:
[    6.372268]  (&(&hashinfo->ehash_locks[i])->rlock){+.-...} -> (&sb->s_type->i_lock_key#6){+.+...}
[    6.372268] 
[    6.372268] but this new dependency connects a SOFTIRQ-irq-safe lock:
[    6.372268]  (&(&hashinfo->ehash_locks[i])->rlock){+.-...}
[    6.372268] ... which became SOFTIRQ-irq-safe at:
[    6.372268]   [<ffffffff810b3b26>] __lock_acquire+0x576/0x1450
[    6.372268]   [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160
[    6.372268]   [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70
[    6.372268]   [<ffffffff8177a1ba>] __inet_hash_nolisten+0xfa/0x180
[    6.372268]   [<ffffffff8179392a>] tcp_v4_syn_recv_sock+0x1aa/0x2d0
[    6.372268]   [<ffffffff81795502>] tcp_check_req+0x202/0x440
[    6.372268]   [<ffffffff817948c4>] tcp_v4_do_rcv+0x304/0x4f0
[    6.372268]   [<ffffffff81795134>] tcp_v4_rcv+0x684/0x7e0
[    6.372268]   [<ffffffff81771512>] ip_local_deliver+0xe2/0x1c0
[    6.372268]   [<ffffffff81771af7>] ip_rcv+0x397/0x760
[    6.372268]   [<ffffffff8174d067>] __netif_receive_skb+0x277/0x330
[    6.372268]   [<ffffffff8174d1f4>] process_backlog+0xd4/0x1e0
[    6.372268]   [<ffffffff8174dc38>] net_rx_action+0x188/0x2b0
[    6.372268]   [<ffffffff81084cc2>] __do_softirq+0xd2/0x260
[    6.372268]   [<ffffffff81035edc>] call_softirq+0x1c/0x50
[    6.372268]   [<ffffffff8108551b>] local_bh_enable_ip+0xeb/0xf0
[    6.372268]   [<ffffffff8182c544>] _raw_spin_unlock_bh+0x34/0x40
[    6.372268]   [<ffffffff8173c59e>] release_sock+0x14e/0x1a0
[    6.372268]   [<ffffffff817a3975>] inet_stream_connect+0x75/0x320
[    6.372268]   [<ffffffff81737917>] sys_connect+0xa7/0xc0
[    6.372268]   [<ffffffff81034ff2>] system_call_fastpath+0x16/0x1b
[    6.372268] 
[    6.372268] to a SOFTIRQ-irq-unsafe lock:
[    6.372268]  (&sb->s_type->i_lock_key#6){+.+...}
[    6.372268] ... which became SOFTIRQ-irq-unsafe at:
[    6.372268] ...  [<ffffffff810b3b73>] __lock_acquire+0x5c3/0x1450
[    6.372268]   [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160
[    6.372268]   [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70
[    6.372268]   [<ffffffff8116af72>] new_inode+0x52/0xd0
[    6.372268]   [<ffffffff81174a40>] get_sb_pseudo+0xb0/0x180
[    6.372268]   [<ffffffff81735a41>] sockfs_get_sb+0x21/0x30
[    6.372268]   [<ffffffff81152dba>] vfs_kern_mount+0x8a/0x1e0
[    6.372268]   [<ffffffff81152f29>] kern_mount_data+0x19/0x20
[    6.372268]   [<ffffffff81e1c075>] sock_init+0x4e/0x59
[    6.372268]   [<ffffffff810001dc>] do_one_initcall+0x3c/0x1a0
[    6.372268]   [<ffffffff81de5767>] kernel_init+0x17a/0x204
[    6.372268]   [<ffffffff81035de4>] kernel_thread_helper+0x4/0x10
[    6.372268] 
[    6.372268] other info that might help us debug this:
[    6.372268] 
[    6.372268] 3 locks held by pmcd/2124:
[    6.372268]  #0:  (&p->lock){+.+.+.}, at: [<ffffffff81171dae>] seq_read+0x3e/0x430
[    6.372268]  #1:  (&(&hashinfo->ehash_locks[i])->rlock){+.-...}, at: [<ffffffff81791750>] established_get_first+0x60/0x120
[    6.372268]  #2:  (clock-AF_INET){++....}, at: [<ffffffff8173b6ae>] sock_i_ino+0x2e/0x70
[    6.372268] 
[    6.372268] the dependencies between SOFTIRQ-irq-safe lock and the holding lock:
[    6.372268] -> (&(&hashinfo->ehash_locks[i])->rlock){+.-...} ops: 3 {
[    6.372268]    HARDIRQ-ON-W at:
[    6.372268]                                        [<ffffffff810b3b47>] __lock_acquire+0x597/0x1450
[    6.372268]                                        [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160
[    6.372268]                                        [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70
[    6.372268]                                        [<ffffffff8177a1ba>] __inet_hash_nolisten+0xfa/0x180
[    6.372268]                                        [<ffffffff8177ab6a>] __inet_hash_connect+0x33a/0x3d0
[    6.372268]                                        [<ffffffff8177ac4f>] inet_hash_connect+0x4f/0x60
[    6.372268]                                        [<ffffffff81792522>] tcp_v4_connect+0x272/0x4f0
[    6.372268]                                        [<ffffffff817a3b8e>] inet_stream_connect+0x28e/0x320
[    6.372268]                                        [<ffffffff81737917>] sys_connect+0xa7/0xc0
[    6.372268]                                        [<ffffffff81034ff2>] system_call_fastpath+0x16/0x1b
[    6.372268]    IN-SOFTIRQ-W at:
[    6.372268]                                        [<ffffffff810b3b26>] __lock_acquire+0x576/0x1450
[    6.372268]                                        [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160
[    6.372268]                                        [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70
[    6.372268]                                        [<ffffffff8177a1ba>] __inet_hash_nolisten+0xfa/0x180
[    6.372268]                                        [<ffffffff8179392a>] tcp_v4_syn_recv_sock+0x1aa/0x2d0
[    6.372268]                                        [<ffffffff81795502>] tcp_check_req+0x202/0x440
[    6.372268]                                        [<ffffffff817948c4>] tcp_v4_do_rcv+0x304/0x4f0
[    6.372268]                                        [<ffffffff81795134>] tcp_v4_rcv+0x684/0x7e0
[    6.372268]                                        [<ffffffff81771512>] ip_local_deliver+0xe2/0x1c0
[    6.372268]                                        [<ffffffff81771af7>] ip_rcv+0x397/0x760
[    6.372268]                                        [<ffffffff8174d067>] __netif_receive_skb+0x277/0x330
[    6.372268]                                        [<ffffffff8174d1f4>] process_backlog+0xd4/0x1e0
[    6.372268]                                        [<ffffffff8174dc38>] net_rx_action+0x188/0x2b0
[    6.372268]                                        [<ffffffff81084cc2>] __do_softirq+0xd2/0x260
[    6.372268]                                        [<ffffffff81035edc>] call_softirq+0x1c/0x50
[    6.372268]                                        [<ffffffff8108551b>] local_bh_enable_ip+0xeb/0xf0
[    6.372268]                                        [<ffffffff8182c544>] _raw_spin_unlock_bh+0x34/0x40
[    6.372268]                                        [<ffffffff8173c59e>] release_sock+0x14e/0x1a0
[    6.372268]                                        [<ffffffff817a3975>] inet_stream_connect+0x75/0x320
[    6.372268]                                        [<ffffffff81737917>] sys_connect+0xa7/0xc0
[    6.372268]                                        [<ffffffff81034ff2>] system_call_fastpath+0x16/0x1b
[    6.372268]    INITIAL USE at:
[    6.372268]                                       [<ffffffff810b37e2>] __lock_acquire+0x232/0x1450
[    6.372268]                                       [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160
[    6.372268]                                       [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70
[    6.372268]                                       [<ffffffff8177a1ba>] __inet_hash_nolisten+0xfa/0x180
[    6.372268]                                       [<ffffffff8177ab6a>] __inet_hash_connect+0x33a/0x3d0
[    6.372268]                                       [<ffffffff8177ac4f>] inet_hash_connect+0x4f/0x60
[    6.372268]                                       [<ffffffff81792522>] tcp_v4_connect+0x272/0x4f0
[    6.372268]                                       [<ffffffff817a3b8e>] inet_stream_connect+0x28e/0x320
[    6.372268]                                       [<ffffffff81737917>] sys_connect+0xa7/0xc0
[    6.372268]                                       [<ffffffff81034ff2>] system_call_fastpath+0x16/0x1b
[    6.372268]  }
[    6.372268]  ... key      at: [<ffffffff8285ddf8>] __key.47027+0x0/0x8
[    6.372268]  ... acquired at:
[    6.372268]    [<ffffffff810b2940>] check_irq_usage+0x60/0xf0
[    6.372268]    [<ffffffff810b41ff>] __lock_acquire+0xc4f/0x1450
[    6.372268]    [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160
[    6.372268]    [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70
[    6.372268]    [<ffffffff81736f8c>] socket_get_id+0x3c/0x60
[    6.372268]    [<ffffffff8173b6c3>] sock_i_ino+0x43/0x70
[    6.372268]    [<ffffffff81790fc9>] tcp4_seq_show+0x1a9/0x520
[    6.372268]    [<ffffffff81172005>] seq_read+0x295/0x430
[    6.372268]    [<ffffffff811ad9f4>] proc_reg_read+0x84/0xc0
[    6.372268]    [<ffffffff81150165>] vfs_read+0xb5/0x170
[    6.372268]    [<ffffffff81150274>] sys_read+0x54/0x90
[    6.372268]    [<ffffffff81034ff2>] system_call_fastpath+0x16/0x1b
[    6.372268] 
[    6.372268] 
[    6.372268] the dependencies between the lock to be acquired and SOFTIRQ-irq-unsafe lock:
[    6.372268] -> (&sb->s_type->i_lock_key#6){+.+...} ops: 1185 {
[    6.372268]    HARDIRQ-ON-W at:
[    6.372268]                                        [<ffffffff810b3b47>] __lock_acquire+0x597/0x1450
[    6.372268]                                        [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160
[    6.372268]                                        [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70
[    6.372268]                                        [<ffffffff8116af72>] new_inode+0x52/0xd0
[    6.372268]                                        [<ffffffff81174a40>] get_sb_pseudo+0xb0/0x180
[    6.372268]                                        [<ffffffff81735a41>] sockfs_get_sb+0x21/0x30
[    6.372268]                                        [<ffffffff81152dba>] vfs_kern_mount+0x8a/0x1e0
[    6.372268]                                        [<ffffffff81152f29>] kern_mount_data+0x19/0x20
[    6.372268]                                        [<ffffffff81e1c075>] sock_init+0x4e/0x59
[    6.372268]                                        [<ffffffff810001dc>] do_one_initcall+0x3c/0x1a0
[    6.372268]                                        [<ffffffff81de5767>] kernel_init+0x17a/0x204
[    6.372268]                                        [<ffffffff81035de4>] kernel_thread_helper+0x4/0x10
[    6.372268]    SOFTIRQ-ON-W at:
[    6.372268]                                        [<ffffffff810b3b73>] __lock_acquire+0x5c3/0x1450
[    6.372268]                                        [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160
[    6.372268]                                        [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70
[    6.372268]                                        [<ffffffff8116af72>] new_inode+0x52/0xd0
[    6.372268]                                        [<ffffffff81174a40>] get_sb_pseudo+0xb0/0x180
[    6.372268]                                        [<ffffffff81735a41>] sockfs_get_sb+0x21/0x30
[    6.372268]                                        [<ffffffff81152dba>] vfs_kern_mount+0x8a/0x1e0
[    6.372268]                                        [<ffffffff81152f29>] kern_mount_data+0x19/0x20
[    6.372268]                                        [<ffffffff81e1c075>] sock_init+0x4e/0x59
[    6.372268]                                        [<ffffffff810001dc>] do_one_initcall+0x3c/0x1a0
[    6.372268]                                        [<ffffffff81de5767>] kernel_init+0x17a/0x204
[    6.372268]                                        [<ffffffff81035de4>] kernel_thread_helper+0x4/0x10
[    6.372268]    INITIAL USE at:
[    6.372268]                                       [<ffffffff810b37e2>] __lock_acquire+0x232/0x1450
[    6.372268]                                       [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160
[    6.372268]                                       [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70
[    6.372268]                                       [<ffffffff8116af72>] new_inode+0x52/0xd0
[    6.372268]                                       [<ffffffff81174a40>] get_sb_pseudo+0xb0/0x180
[    6.372268]                                       [<ffffffff81735a41>] sockfs_get_sb+0x21/0x30
[    6.372268]                                       [<ffffffff81152dba>] vfs_kern_mount+0x8a/0x1e0
[    6.372268]                                       [<ffffffff81152f29>] kern_mount_data+0x19/0x20
[    6.372268]                                       [<ffffffff81e1c075>] sock_init+0x4e/0x59
[    6.372268]                                       [<ffffffff810001dc>] do_one_initcall+0x3c/0x1a0
[    6.372268]                                       [<ffffffff81de5767>] kernel_init+0x17a/0x204
[    6.372268]                                       [<ffffffff81035de4>] kernel_thread_helper+0x4/0x10
[    6.372268]  }
[    6.372268]  ... key      at: [<ffffffff81bd5bd8>] sock_fs_type+0x58/0x80
[    6.372268]  ... acquired at:
[    6.372268]    [<ffffffff810b2940>] check_irq_usage+0x60/0xf0
[    6.372268]    [<ffffffff810b41ff>] __lock_acquire+0xc4f/0x1450
[    6.372268]    [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160
[    6.372268]    [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70
[    6.372268]    [<ffffffff81736f8c>] socket_get_id+0x3c/0x60
[    6.372268]    [<ffffffff8173b6c3>] sock_i_ino+0x43/0x70
[    6.372268]    [<ffffffff81790fc9>] tcp4_seq_show+0x1a9/0x520
[    6.372268]    [<ffffffff81172005>] seq_read+0x295/0x430
[    6.372268]    [<ffffffff811ad9f4>] proc_reg_read+0x84/0xc0
[    6.372268]    [<ffffffff81150165>] vfs_read+0xb5/0x170
[    6.372268]    [<ffffffff81150274>] sys_read+0x54/0x90
[    6.372268]    [<ffffffff81034ff2>] system_call_fastpath+0x16/0x1b
[    6.372268] 
[    6.372268] 
[    6.372268] stack backtrace:
[    6.372268] Pid: 2124, comm: pmcd Not tainted 2.6.35-rc5-dgc+ #58
[    6.372268] Call Trace:
[    6.372268]  [<ffffffff810b28d9>] check_usage+0x499/0x4a0
[    6.372268]  [<ffffffff810b24c6>] ? check_usage+0x86/0x4a0
[    6.372268]  [<ffffffff810af729>] ? __bfs+0x129/0x260
[    6.372268]  [<ffffffff810b2940>] check_irq_usage+0x60/0xf0
[    6.372268]  [<ffffffff810b41ff>] __lock_acquire+0xc4f/0x1450
[    6.372268]  [<ffffffff810b4aa6>] lock_acquire+0xa6/0x160
[    6.372268]  [<ffffffff81736f8c>] ? socket_get_id+0x3c/0x60
[    6.372268]  [<ffffffff8182bb26>] _raw_spin_lock+0x36/0x70
[    6.372268]  [<ffffffff81736f8c>] ? socket_get_id+0x3c/0x60
[    6.372268]  [<ffffffff81736f8c>] socket_get_id+0x3c/0x60
[    6.372268]  [<ffffffff8173b6c3>] sock_i_ino+0x43/0x70
[    6.372268]  [<ffffffff81790fc9>] tcp4_seq_show+0x1a9/0x520
[    6.372268]  [<ffffffff81791750>] ? established_get_first+0x60/0x120
[    6.372268]  [<ffffffff8182beb7>] ? _raw_spin_lock_bh+0x67/0x70
[    6.372268]  [<ffffffff81172005>] seq_read+0x295/0x430
[    6.372268]  [<ffffffff81171d70>] ? seq_read+0x0/0x430
[    6.372268]  [<ffffffff811ad9f4>] proc_reg_read+0x84/0xc0
[    6.372268]  [<ffffffff81150165>] vfs_read+0xb5/0x170
[    6.372268]  [<ffffffff81150274>] sys_read+0x54/0x90
[    6.372268]  [<ffffffff81034ff2>] system_call_fastpath+0x16/0x1b

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: VFS scalability git tree
  2010-07-22 19:01 ` Nick Piggin
@ 2010-07-23 11:17   ` Christoph Hellwig
  -1 siblings, 0 replies; 76+ messages in thread
From: Christoph Hellwig @ 2010-07-23 11:17 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar, John Stultz

I might sound like a broken record, but if you want to make forward
progress with this, split it into smaller series.

What would be useful, for example, would be one series each to split
the global inode_lock and dcache_lock, without introducing all the
fancy new locking primitives, per-bucket locks and LRU schemes for
a start.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: VFS scalability git tree
  2010-07-22 19:01 ` Nick Piggin
@ 2010-07-23 13:55   ` Dave Chinner
  -1 siblings, 0 replies; 76+ messages in thread
From: Dave Chinner @ 2010-07-23 13:55 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar, John Stultz

On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> I'm pleased to announce I have a git tree up of my vfs scalability work.
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> 
> Branch vfs-scale-working

Bugs I've noticed so far:

- Using XFS, the existing vfs inode count statistic does not decrease
  as inodes are freed.
- the existing vfs dentry count remains at zero
- the existing vfs free inode count remains at zero

$ pminfo -f vfs.inodes vfs.dentry

vfs.inodes.count
    value 7472612

vfs.inodes.free
    value 0

vfs.dentry.count
    value 0

vfs.dentry.free
    value 0


Performance Summary:

With lockdep and CONFIG_XFS_DEBUG enabled, a 16 thread parallel
sequential create/unlink workload on an 8p/4GB RAM VM with a virtio
block device sitting on a short-stroked 12x2TB SAS array w/ 512MB
BBWC in RAID0 via dm and using the noop elevator in the guest VM:

$ sudo mkfs.xfs -f -l size=128m -d agcount=16 /dev/vdb
meta-data=/dev/vdb               isize=256    agcount=16, agsize=1638400 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=26214400, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=32768, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
$ sudo mount -o delaylog,logbsize=262144,nobarrier /dev/vdb /mnt/scratch
$ sudo chmod 777 /mnt/scratch
$ cd ~/src/fs_mark-3.3/
$  ./fs_mark  -S0  -n  500000  -s  0  -d  /mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/3  -d  /mnt/scratch/2  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d  /mnt/scratch/6  -d  /mnt/scratch/7  -d  /mnt/scratch/8  -d  /mnt/scratch/9  -d  /mnt/scratch/10  -d  /mnt/scratch/11  -d  /mnt/scratch/12  -d  /mnt/scratch/13  -d  /mnt/scratch/14  -d  /mnt/scratch/15

			files/s
2.6.34-rc4		12550
2.6.35-rc5+scale	12285

So they're the same within the error margins of the benchmark.

Screenshot of monitoring graphs - you can see the effect of the
broken stats:

http://userweb.kernel.org/~dgc/shrinker-2.6.36/fs_mark-2.6.35-rc4-16x500-xfs.png
http://userweb.kernel.org/~dgc/shrinker-2.6.36/fs_mark-2.6.35-rc5-npiggin-scale-lockdep-16x500-xfs.png

With a production build (i.e. no lockdep, no xfs debug), I'll
run the same fs_mark parallel create/unlink workload to show
scalability as I ran here:

http://oss.sgi.com/archives/xfs/2010-05/msg00329.html

The numbers can't be directly compared, but the test and the setup
are the same.  The XFS numbers below are with delayed logging
enabled. ext4 is using default mkfs and mount parameters except for
barrier=0. All numbers are averages of three runs.

	fs_mark rate (thousands of files/second)
           2.6.35-rc5   2.6.35-rc5-scale
threads    xfs   ext4     xfs    ext4
  1         20    39       20     39
  2         35    55       35     57
  4         60    41       57     42
  8         79     9       75      9

ext4 is getting IO bound at more than 2 threads, so apart from
pointing out that XFS is 8-9x faster than ext4 at 8 threads, I'm
going to ignore ext4 for the purposes of testing scalability here.

For XFS w/ delayed logging, 2.6.35-rc5 is only getting to about 600%
CPU and with Nick's patches it's about 650% (10% higher) for
slightly lower throughput.  So on this class of machine, for this
workload, the changes result in a slight reduction in scalability.

I looked at dbench on XFS as well, but didn't see any significant
change in the numbers at up to 200 load threads, so not much to
talk about there.

Sometime over the weekend I'll build a 16p VM and see what I get
from that...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 0/2] vfs scalability tree fixes
  2010-07-23 11:13   ` Dave Chinner
@ 2010-07-23 14:04     ` Dave Chinner
  -1 siblings, 0 replies; 76+ messages in thread
From: Dave Chinner @ 2010-07-23 14:04 UTC (permalink / raw)
  To: npiggin; +Cc: linux-fsdevel, linux-kernel, linux-mm, fmayhar, johnstul

Nick,

Here are the fixes I applied to your tree to make the XFS inode cache
shrinker build and scan sanely.

Cheers,

Dave.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 1/2] xfs: fix shrinker build
  2010-07-23 11:13   ` Dave Chinner
@ 2010-07-23 14:04     ` Dave Chinner
  -1 siblings, 0 replies; 76+ messages in thread
From: Dave Chinner @ 2010-07-23 14:04 UTC (permalink / raw)
  To: npiggin; +Cc: linux-fsdevel, linux-kernel, linux-mm, fmayhar, johnstul

From: Dave Chinner <dchinner@redhat.com>

Remove the stray mount list lock reference from the shrinker code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_sync.c |    5 +----
 1 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index 7a5a368..05426bf 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -916,10 +916,8 @@ xfs_reclaim_inode_shrink(
 
 done:
 	nr = shrinker_do_scan(&nr_to_scan, SHRINK_BATCH);
-	if (!nr) {
-		up_read(&xfs_mount_list_lock);
+	if (!nr)
 		return 0;
-	}
 	xfs_inode_ag_iterator(mp, xfs_reclaim_inode, 0,
 				XFS_ICI_RECLAIM_TAG, 1, &nr);
 	/* if we don't exhaust the scan, don't bother coming back */
@@ -935,7 +933,6 @@ xfs_inode_shrinker_register(
 	struct xfs_mount	*mp)
 {
 	mp->m_inode_shrink.shrink = xfs_reclaim_inode_shrink;
-	mp->m_inode_shrink.seeks = DEFAULT_SEEKS;
 	register_shrinker(&mp->m_inode_shrink);
 }
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 2/2] xfs: shrinker should use a per-filesystem scan count
  2010-07-23 11:13   ` Dave Chinner
@ 2010-07-23 14:04     ` Dave Chinner
  -1 siblings, 0 replies; 76+ messages in thread
From: Dave Chinner @ 2010-07-23 14:04 UTC (permalink / raw)
  To: npiggin; +Cc: linux-fsdevel, linux-kernel, linux-mm, fmayhar, johnstul

From: Dave Chinner <dchinner@redhat.com>

The shrinker uses a global static to aggregate excess scan counts.
This should be per-filesystem, like all the other shrinker context,
for it to operate correctly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_sync.c |    5 ++---
 fs/xfs/xfs_mount.h          |    1 +
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index 05426bf..b0e6296 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -893,7 +893,6 @@ xfs_reclaim_inode_shrink(
 	unsigned long	global,
 	gfp_t		gfp_mask)
 {
-	static unsigned long nr_to_scan;
 	int		nr;
 	struct xfs_mount *mp;
 	struct xfs_perag *pag;
@@ -908,14 +907,14 @@ xfs_reclaim_inode_shrink(
 		nr_reclaimable += pag->pag_ici_reclaimable;
 		xfs_perag_put(pag);
 	}
-	shrinker_add_scan(&nr_to_scan, scanned, global, nr_reclaimable,
+	shrinker_add_scan(&mp->m_shrink_scan_nr, scanned, global, nr_reclaimable,
 				DEFAULT_SEEKS);
 	if (!(gfp_mask & __GFP_FS)) {
 		return 0;
 	}
 
 done:
-	nr = shrinker_do_scan(&nr_to_scan, SHRINK_BATCH);
+	nr = shrinker_do_scan(&mp->m_shrink_scan_nr, SHRINK_BATCH);
 	if (!nr)
 		return 0;
 	xfs_inode_ag_iterator(mp, xfs_reclaim_inode, 0,
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 5761087..ed5531f 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -260,6 +260,7 @@ typedef struct xfs_mount {
 	__int64_t		m_update_flags;	/* sb flags we need to update
 						   on the next remount,rw */
 	struct shrinker		m_inode_shrink;	/* inode reclaim shrinker */
+	unsigned long		m_shrink_scan_nr; /* shrinker scan count */
 } xfs_mount_t;
 
 /*
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: VFS scalability git tree
  2010-07-22 19:01 ` Nick Piggin
@ 2010-07-23 15:35   ` Nick Piggin
  -1 siblings, 0 replies; 76+ messages in thread
From: Nick Piggin @ 2010-07-23 15:35 UTC (permalink / raw)
  To: Nick Piggin, Michael Neuling
  Cc: linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar, John Stultz

On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> I'm pleased to announce I have a git tree up of my vfs scalability work.
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git

> Summary of a few numbers I've run: Google's socket teardown workload
> runs 3-4x faster on my 2-socket Opteron. Single-threaded git diff runs
> 20% faster on the same machine. A 32-node Altix runs dbench on ramfs
> 150x faster (100MB/s up to 15GB/s).

The following post just contains some preliminary benchmark numbers on a
POWER7. Boring if you're not interested in this stuff.

IBM and Mikey kindly allowed me to do some test runs on a big POWER7
system today.  "Very" is the only word I'm authorized to use to describe
how big it is. We tested the vfs-scale-working and master branches from
my git tree as of today.  I'll stick with relative numbers to be safe.
All tests were run on ramfs.


First and very important is single-threaded performance of basic code.
POWER7 is obviously vastly different from a Barcelona or Nehalem, and
the store-free path walk uses a lot of seqlocks, which are cheap on x86
and a little more expensive on other architectures.

Test case	time difference, vanilla to vfs-scale (negative is better)
stat()		-10.8% +/- 0.3%
close(open())	  4.3% +/- 0.3%
unlink(creat())	 36.8% +/- 0.3%

stat() is significantly faster, which is really good.

open/close is a bit slower, which we didn't get time to analyse. There
are one or two seqlock checks which might be avoided, which could make
up the difference. It's not horrible, but I hope to get POWER7
open/close more competitive (on x86, open/close is even a bit faster).

Note this is a worst case for rcu-path-walk: a lookup of "./file",
because it has to take a refcount on the final element. With more path
elements, rcu-walk should gain the advantage.
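
These single-threaded numbers are presumably from tight syscall loops
over a path like "./file". The actual harness isn't attached, but a
minimal userspace sketch of the stat() case (the file name and
iteration count here are only illustrative) looks something like:

/*
 * Minimal sketch of a "./file" stat() loop -- not the real harness;
 * the file name and iteration count are made up for illustration.
 * Build with: cc -O2 stat_bench.c -lrt
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <time.h>

int main(void)
{
	struct stat st;
	struct timespec t0, t1;
	long i, iters = 10 * 1000 * 1000;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < iters; i++) {
		/* single-element lookup: rcu-walk still has to take a
		 * reference on the final dentry */
		if (stat("./file", &st) < 0) {
			perror("stat");
			exit(1);
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("%.1f ns per stat()\n",
	       ((t1.tv_sec - t0.tv_sec) * 1e9 +
	        (t1.tv_nsec - t0.tv_nsec)) / iters);
	return 0;
}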

creat/unlink is showing the big RCU penalty. I have, however, penciled
out a working design with Linus for how to do SLAB_DESTROY_BY_RCU. It
makes the store-free path walking and some inode RCU list walking a
little bit trickier, so I prefer not to dump too much at once. There is
something that can be done if regressions show up. I don't anticipate
many regressions outside microbenchmarks, and this is about the
absolute worst case.
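
For anyone who hasn't looked at SLAB_DESTROY_BY_RCU before: the catch
is that an object's memory can be recycled for a new object of the same
type while an rcu-walker still holds a pointer to it (only the
underlying pages are RCU-protected, not the object's identity), so a
lookup has to pin the object and then re-validate it. The following is
only a schematic of that pattern -- it is not the actual inode code,
and the struct and field names are invented:

/*
 * Schematic of the SLAB_DESTROY_BY_RCU lookup pattern.  Not the real
 * inode code; "struct obj" and its fields are made up for illustration.
 */
#include <linux/slab.h>
#include <linux/rculist.h>

struct obj {
	unsigned long		key;
	atomic_t		count;
	struct hlist_node	hash;
};

/*
 * The cache would be created elsewhere with something like:
 *	kmem_cache_create("obj", sizeof(struct obj), 0,
 *			  SLAB_DESTROY_BY_RCU, NULL);
 */

static struct obj *obj_lookup(struct hlist_head *head, unsigned long key)
{
	struct obj *o;
	struct hlist_node *n;

	rcu_read_lock();
	hlist_for_each_entry_rcu(o, n, head, hash) {
		if (o->key != key)
			continue;
		if (!atomic_inc_not_zero(&o->count))
			continue;	/* object is being freed, skip it */
		/*
		 * The slab may have recycled this object for a different
		 * key between the check and the pin, so re-check after
		 * taking the reference (a real implementation would call
		 * its proper put function here, and possibly restart).
		 */
		if (o->key != key) {
			atomic_dec(&o->count);
			continue;
		}
		rcu_read_unlock();
		return o;
	}
	rcu_read_unlock();
	return NULL;
}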


On to parallel tests. Firstly, the google socket workload.
Running with "NR_THREADS" children, vfs-scale patches do this:

root@p7ih06:~/google# time ./google --files_per_cpu 10000 > /dev/null
real	0m4.976s
user	8m38.925s
sys	6m45.236s

root@p7ih06:~/google# time ./google --files_per_cpu 20000 > /dev/null
real	0m7.816s
user	11m21.034s
sys	14m38.258s

root@p7ih06:~/google# time ./google --files_per_cpu 40000 > /dev/null
real	0m11.358s
user	11m37.955s
sys	28m44.911s

Reducing to NR_THREADS/4 children allows vanilla to complete:

root@p7ih06:~/google# time ./google  --files_per_cpu 10000
real    1m23.118s
user    3m31.820s
sys     81m10.405s

I was actually surprised it did that well.


Dbench was an interesting one. We didn't manage to stretch the box's
legs, unfortunately!  dbench with 1 proc gave about 500MB/s, 64 procs
gave 21GB/s, and at 128 procs throughput dropped dramatically. It turns
out that weird things start happening with the rename seqlock versus
d_lookup, and with d_move contention (dbench does a sprinkle of
renaming). That can be improved I think, but it's not worth bothering
with for the time being.

It's not really worth testing vanilla at high dbench parallelism.


The parallel git diff workload looked OK. It seemed to be scaling fine
in the vfs, but it hit a bottleneck in powerpc's tlb invalidation, so
the numbers may not be so interesting.


Lastly, some parallel syscall microbenchmarks:

procs		vanilla		vfs-scale
open-close, separate-cwd
1		384557.70	355923.82	op/s/proc
NR_CORES	    86.63	164054.64	op/s/proc
NR_THREADS	    18.68 (ouch!)

open-close, same-cwd
1		381074.32	339161.25
NR_CORES	   104.16	107653.05

creat-unlink, separate-cwd
1		145891.05	104301.06
NR_CORES	    29.81	 10061.66

creat-unlink, same-cwd
1		129681.27	104301.06
NR_CORES	    12.68	   181.24

So we can see the single-thread performance regressions here, but
the vanilla case really chokes at high CPU counts.
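
For context, the "separate-cwd" cases give each child process its own
working directory, while "same-cwd" has them all share one. The real
harness isn't in this mail, but a rough sketch of the separate-cwd
open-close loop (directory/file names and the iteration count are just
illustrative, and you would time it externally, e.g. with time(1)) is:

/*
 * Rough sketch of an "open-close, separate-cwd" microbenchmark.
 * Not the actual harness; names and counts are made up.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROCS	4		/* would be NR_CORES or NR_THREADS */
#define ITERS	1000000L

static void worker(int id)
{
	char dir[64];
	long i;

	snprintf(dir, sizeof(dir), "cwd-%d", id);
	mkdir(dir, 0755);
	if (chdir(dir) < 0)	/* private cwd per child */
		exit(1);

	for (i = 0; i < ITERS; i++) {
		int fd = open("file", O_CREAT | O_RDWR, 0644);
		if (fd < 0)
			exit(1);
		close(fd);
	}
	exit(0);
}

int main(void)
{
	int i;

	for (i = 0; i < NPROCS; i++)
		if (fork() == 0)
			worker(i);
	for (i = 0; i < NPROCS; i++)
		wait(NULL);
	return 0;
}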


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: VFS scalability git tree
  2010-07-23 11:17   ` Christoph Hellwig
@ 2010-07-23 15:42     ` Nick Piggin
  -1 siblings, 0 replies; 76+ messages in thread
From: Nick Piggin @ 2010-07-23 15:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Nick Piggin, linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar,
	John Stultz

On Fri, Jul 23, 2010 at 07:17:46AM -0400, Christoph Hellwig wrote:
> I might sound like a broken record, but if you want to make forward
> progress with this split it into smaller series.

No, I appreciate the advice. I put this tree up for people to fetch
without posting patches all the time. I think it is important to test
and to see the big picture when reviewing the patches, but you are
right about how to actually submit the patches on the ML.


> What would be useful for example would be one series each to split
> the global inode_lock and dcache_lock, without introducing all the
> fancy new locking primitives, per-bucket locks and lru schemes for
> a start.

I've kept the series fairly well structured like that. Basically it
is in these parts:

1. files lock
2. vfsmount lock
3. mnt refcount
4a. put several new global spinlocks around different parts of dcache
4b. remove dcache_lock after the above protect everything
4c. start doing fine grained locking of hash, inode alias, lru, etc etc
5a, 5b, 5c. same for inodes
6. some further optimisations and cleanups
7. store-free path walking

This kind of sequence. I will again try to submit a first couple of
things to Al soon.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: VFS scalability git tree
  2010-07-23 11:13   ` Dave Chinner
@ 2010-07-23 15:51     ` Nick Piggin
  -1 siblings, 0 replies; 76+ messages in thread
From: Nick Piggin @ 2010-07-23 15:51 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Nick Piggin, linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar,
	John Stultz

On Fri, Jul 23, 2010 at 09:13:10PM +1000, Dave Chinner wrote:
> On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > I'm pleased to announce I have a git tree up of my vfs scalability work.
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> > 
> > Branch vfs-scale-working
> 
> I've got a couple of patches needed to build XFS - they shrinker
> merge left some bad fragments - I'll post them in a minute. This

OK cool.


> email is for the longest ever lockdep warning I've seen that
> occurred on boot.

Ah thanks. OK, that was one of my attempts to keep sockets from hitting
the vfs as much as possible (lazy inode number evaluation). Not a big
problem, but I'll drop the patch for now.

I have just got one for you too, btw :) (on a vanilla kernel, but it is
messing up my lockdep stress testing on xfs). Real or false?

[ INFO: possible circular locking dependency detected ]
2.6.35-rc5-00064-ga9f7f2e #334
-------------------------------------------------------
kswapd0/605 is trying to acquire lock:
 (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffff8125500c>]
xfs_ilock+0x7c/0xa0

but task is already holding lock:
 (&xfs_mount_list_lock){++++.-}, at: [<ffffffff81281a76>]
xfs_reclaim_inode_shrink+0xc6/0x140

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 (&xfs_mount_list_lock){++++.-}:
       [<ffffffff8106ef9a>] lock_acquire+0x5a/0x70
       [<ffffffff815aa646>] _raw_spin_lock+0x36/0x50
       [<ffffffff810fabf3>] try_to_free_buffers+0x43/0xb0
       [<ffffffff812763b2>] xfs_vm_releasepage+0x92/0xe0
       [<ffffffff810908ee>] try_to_release_page+0x2e/0x50
       [<ffffffff8109ef56>] shrink_page_list+0x486/0x5a0
       [<ffffffff8109f35d>] shrink_inactive_list+0x2ed/0x700
       [<ffffffff8109fda0>] shrink_zone+0x3b0/0x460
       [<ffffffff810a0f41>] try_to_free_pages+0x241/0x3a0
       [<ffffffff810999e2>] __alloc_pages_nodemask+0x4c2/0x6b0
       [<ffffffff810c52c6>] alloc_pages_current+0x76/0xf0
       [<ffffffff8109205b>] __page_cache_alloc+0xb/0x10
       [<ffffffff81092a2a>] find_or_create_page+0x4a/0xa0
       [<ffffffff812780cc>] _xfs_buf_lookup_pages+0x14c/0x360
       [<ffffffff81279122>] xfs_buf_get+0x72/0x160
       [<ffffffff8126eb68>] xfs_trans_get_buf+0xc8/0xf0
       [<ffffffff8124439f>] xfs_da_do_buf+0x3df/0x6d0
       [<ffffffff81244825>] xfs_da_get_buf+0x25/0x30
       [<ffffffff8124a076>] xfs_dir2_data_init+0x46/0xe0
       [<ffffffff81247f89>] xfs_dir2_sf_to_block+0xb9/0x5a0
       [<ffffffff812501c8>] xfs_dir2_sf_addname+0x418/0x5c0
       [<ffffffff81247d7c>] xfs_dir_createname+0x14c/0x1a0
       [<ffffffff81271d49>] xfs_create+0x449/0x5d0
       [<ffffffff8127d802>] xfs_vn_mknod+0xa2/0x1b0
       [<ffffffff8127d92b>] xfs_vn_create+0xb/0x10
       [<ffffffff810ddc81>] vfs_create+0x81/0xd0
       [<ffffffff810df1a5>] do_last+0x535/0x690
       [<ffffffff810e11fd>] do_filp_open+0x21d/0x660
       [<ffffffff810d16b4>] do_sys_open+0x64/0x140
       [<ffffffff810d17bb>] sys_open+0x1b/0x20
       [<ffffffff810023eb>] system_call_fastpath+0x16/0x1b

:-> #0 (&(&ip->i_lock)->mr_lock){++++--}:
       [<ffffffff8106ef10>] __lock_acquire+0x1be0/0x1c10
       [<ffffffff8106ef9a>] lock_acquire+0x5a/0x70
       [<ffffffff8105dfba>] down_write_nested+0x4a/0x70
       [<ffffffff8125500c>] xfs_ilock+0x7c/0xa0
       [<ffffffff81280c98>] xfs_reclaim_inode+0x98/0x250
       [<ffffffff81281824>] xfs_inode_ag_walk+0x74/0x120
       [<ffffffff81281953>] xfs_inode_ag_iterator+0x83/0xe0
       [<ffffffff81281aa4>] xfs_reclaim_inode_shrink+0xf4/0x140
       [<ffffffff8109ff7d>] shrink_slab+0x12d/0x190
       [<ffffffff810a07ad>] balance_pgdat+0x43d/0x6f0
       [<ffffffff810a0b1e>] kswapd+0xbe/0x2a0
       [<ffffffff810592ae>] kthread+0x8e/0xa0
       [<ffffffff81003194>] kernel_thread_helper+0x4/0x10

other info that might help us debug this:

2 locks held by kswapd0/605:
 #0:  (shrinker_rwsem){++++..}, at: [<ffffffff8109fe88>]
shrink_slab+0x38/0x190
 #1:  (&xfs_mount_list_lock){++++.-}, at: [<ffffffff81281a76>]
xfs_reclaim_inode_shrink+0xc6/0x140

stack backtrace:
Pid: 605, comm: kswapd0 Not tainted 2.6.35-rc5-00064-ga9f7f2e #334
Call Trace:
 [<ffffffff8106c5d9>] print_circular_bug+0xe9/0xf0
 [<ffffffff8106ef10>] __lock_acquire+0x1be0/0x1c10
 [<ffffffff8106e3c2>] ? __lock_acquire+0x1092/0x1c10
 [<ffffffff8106ef9a>] lock_acquire+0x5a/0x70
 [<ffffffff8125500c>] ? xfs_ilock+0x7c/0xa0
 [<ffffffff8105dfba>] down_write_nested+0x4a/0x70
 [<ffffffff8125500c>] ? xfs_ilock+0x7c/0xa0
 [<ffffffff815ae795>] ? sub_preempt_count+0x95/0xd0
 [<ffffffff8125500c>] xfs_ilock+0x7c/0xa0
 [<ffffffff81280c98>] xfs_reclaim_inode+0x98/0x250
 [<ffffffff81281824>] xfs_inode_ag_walk+0x74/0x120
 [<ffffffff81280c00>] ? xfs_reclaim_inode+0x0/0x250
 [<ffffffff81281953>] xfs_inode_ag_iterator+0x83/0xe0
 [<ffffffff81280c00>] ? xfs_reclaim_inode+0x0/0x250
 [<ffffffff81281aa4>] xfs_reclaim_inode_shrink+0xf4/0x140
 [<ffffffff8109ff7d>] shrink_slab+0x12d/0x190
 [<ffffffff810a07ad>] balance_pgdat+0x43d/0x6f0
 [<ffffffff810a0b1e>] kswapd+0xbe/0x2a0
 [<ffffffff81059700>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff815aaf3d>] ? _raw_spin_unlock_irqrestore+0x3d/0x70
 [<ffffffff810a0a60>] ? kswapd+0x0/0x2a0
 [<ffffffff810592ae>] kthread+0x8e/0xa0
 [<ffffffff81003194>] kernel_thread_helper+0x4/0x10
 [<ffffffff815ab400>] ? restore_args+0x0/0x30
 [<ffffffff81059220>] ? kthread+0x0/0xa0


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 0/2] vfs scalability tree fixes
  2010-07-23 14:04     ` Dave Chinner
@ 2010-07-23 16:09       ` Nick Piggin
  -1 siblings, 0 replies; 76+ messages in thread
From: Nick Piggin @ 2010-07-23 16:09 UTC (permalink / raw)
  To: Dave Chinner
  Cc: npiggin, linux-fsdevel, linux-kernel, linux-mm, fmayhar, johnstul

On Sat, Jul 24, 2010 at 12:04:00AM +1000, Dave Chinner wrote:
> Nick,
> 
> Here's the fixes I applied to your tree to make the XFS inode cache
> shrinker build and scan sanely.

Thanks for these, Dave.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: VFS scalability git tree
  2010-07-23 13:55   ` Dave Chinner
@ 2010-07-23 16:16     ` Nick Piggin
  -1 siblings, 0 replies; 76+ messages in thread
From: Nick Piggin @ 2010-07-23 16:16 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Nick Piggin, linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar,
	John Stultz

On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote:
> On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > I'm pleased to announce I have a git tree up of my vfs scalability work.
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> > 
> > Branch vfs-scale-working
> 
> Bugs I've noticed so far:
> 
> - Using XFS, the existing vfs inode count statistic does not decrease
>   as inodes are freed.
> - the existing vfs dentry count remains at zero
> - the existing vfs free inode count remains at zero
> 
> $ pminfo -f vfs.inodes vfs.dentry
> 
> vfs.inodes.count
>     value 7472612
> 
> vfs.inodes.free
> value 0
> 
> vfs.dentry.count
> value 0
> 
> vfs.dentry.free
> value 0

Hm, I must have broken it along the way and not noticed. Thanks
for pointing that out.

 
> With a production build (i.e. no lockdep, no xfs debug), I'll
> run the same fs_mark parallel create/unlink workload to show
> scalability as I ran here:
> 
> http://oss.sgi.com/archives/xfs/2010-05/msg00329.html
> 
> The numbers can't be directly compared, but the test and the setup
> is the same.  The XFS numbers below are with delayed logging
> enabled. ext4 is using default mkfs and mount parameters except for
> barrier=0. All numbers are averages of three runs.
> 
> 	fs_mark rate (thousands of files/second)
>            2.6.35-rc5   2.6.35-rc5-scale
> threads    xfs   ext4     xfs    ext4
>   1         20    39       20     39
>   2         35    55       35     57
>   4         60    41       57     42
>   8         79     9       75      9
> 
> ext4 is getting IO bound at more than 2 threads, so apart from
> pointing out that XFS is 8-9x faster than ext4 at 8 threads, I'm
> going to ignore ext4 for the purposes of testing scalability here.
> 
> For XFS w/ delayed logging, 2.6.35-rc5 is only getting to about 600%
> CPU and with Nick's patches it's about 650% (10% higher) for
> slightly lower throughput.  So at this class of machine for this
> workload, the changes result in a slight reduction in scalability.

That's a good test case, thanks. I'll see if I can find where this is
coming from. I suspect RCU inodes, I suppose. Hm, I may have to make
them DESTROY_BY_RCU after all.

Thanks,
Nick
 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: VFS scalability git tree
  2010-07-23 15:51     ` Nick Piggin
@ 2010-07-24  0:21       ` Dave Chinner
  -1 siblings, 0 replies; 76+ messages in thread
From: Dave Chinner @ 2010-07-24  0:21 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar, John Stultz

On Sat, Jul 24, 2010 at 01:51:18AM +1000, Nick Piggin wrote:
> On Fri, Jul 23, 2010 at 09:13:10PM +1000, Dave Chinner wrote:
> > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > > I'm pleased to announce I have a git tree up of my vfs scalability work.
> > > 
> > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> > > 
> > > Branch vfs-scale-working
> > 
> > I've got a couple of patches needed to build XFS - they shrinker
> > merge left some bad fragments - I'll post them in a minute. This
> 
> OK cool.
> 
> 
> > email is for the longest ever lockdep warning I've seen that
> > occurred on boot.
> 
> Ah thanks. OK that was one of my attempts to keep sockets out of
> hidding the vfs as much as possible (lazy inode number evaluation).
> Not a big problem, but I'll drop the patch for now.
> 
> I have just got one for you too, btw :) (on vanilla kernel but it is
> messing up my lockdep stress testing on xfs). Real or false?
> 
> [ INFO: possible circular locking dependency detected ]
> 2.6.35-rc5-00064-ga9f7f2e #334
> -------------------------------------------------------
> kswapd0/605 is trying to acquire lock:
>  (&(&ip->i_lock)->mr_lock){++++--}, at: [<ffffffff8125500c>]
> xfs_ilock+0x7c/0xa0
> 
> but task is already holding lock:
>  (&xfs_mount_list_lock){++++.-}, at: [<ffffffff81281a76>]
> xfs_reclaim_inode_shrink+0xc6/0x140

False positive, but the xfs_mount_list_lock is gone in 2.6.35-rc6 -
the shrinker context change has fixed that - so you can ignore it
anyway.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: VFS scalability git tree
  2010-07-22 19:01 ` Nick Piggin
@ 2010-07-24  8:43   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 76+ messages in thread
From: KOSAKI Motohiro @ 2010-07-24  8:43 UTC (permalink / raw)
  To: Nick Piggin
  Cc: kosaki.motohiro, linux-fsdevel, linux-kernel, linux-mm,
	Frank Mayhar, John Stultz

> At this point, I would be very interested in reviewing, correctness
> testing on different configurations, and of course benchmarking.

I haven't reviewed this series for a long time, but I've found one
mysterious shrink_slab() usage. Can you please take a look at my patch?
(I will send it as a separate mail.)



^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 1/2] vmscan: shrink_all_slab() use reclaim_state instead of the return value of shrink_slab()
  2010-07-24  8:43   ` KOSAKI Motohiro
  (?)
@ 2010-07-24  8:44     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 76+ messages in thread
From: KOSAKI Motohiro @ 2010-07-24  8:44 UTC (permalink / raw)
  To: Nick Piggin, linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar,
	John Stultz
  Cc: kosaki.motohiro

Now shrink_slab() doesn't return the number of reclaimed objects; IOW,
the current shrink_all_slab() is broken. Use reclaim_state instead to
detect when there are no reclaimable slab objects left.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/vmscan.c |   20 +++++++++-----------
 1 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d7256e0..bfa1975 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -300,18 +300,16 @@ static unsigned long shrink_slab(struct zone *zone, unsigned long scanned, unsig
 void shrink_all_slab(void)
 {
 	struct zone *zone;
-	unsigned long nr;
+	struct reclaim_state reclaim_state;
 
-again:
-	nr = 0;
-	for_each_zone(zone)
-		nr += shrink_slab(zone, 1, 1, 1, GFP_KERNEL);
-	/*
-	 * If we reclaimed less than 10 objects, might as well call
-	 * it a day. Nothing special about the number 10.
-	 */
-	if (nr >= 10)
-		goto again;
+	current->reclaim_state = &reclaim_state;
+	do {
+		reclaim_state.reclaimed_slab = 0;
+		for_each_zone(zone)
+			shrink_slab(zone, 1, 1, 1, GFP_KERNEL);
+	} while (reclaim_state.reclaimed_slab);
+
+	current->reclaim_state = NULL;
 }
 
 static inline int is_page_cache_freeable(struct page *page)
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 2/2] vmscan: change shrink_slab() return type to void
  2010-07-24  8:43   ` KOSAKI Motohiro
  (?)
@ 2010-07-24  8:46     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 76+ messages in thread
From: KOSAKI Motohiro @ 2010-07-24  8:46 UTC (permalink / raw)
  To: Nick Piggin, linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar,
	John Stultz
  Cc: kosaki.motohiro

Now no caller uses the return value of shrink_slab(), so we can change
it to return void.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/vmscan.c |    7 +++----
 1 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bfa1975..89b593e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -277,24 +277,23 @@ EXPORT_SYMBOL(shrinker_do_scan);
  *
  * Returns the number of slab objects which we shrunk.
  */
-static unsigned long shrink_slab(struct zone *zone, unsigned long scanned, unsigned long total,
+static void shrink_slab(struct zone *zone, unsigned long scanned, unsigned long total,
 			unsigned long global, gfp_t gfp_mask)
 {
 	struct shrinker *shrinker;
-	unsigned long ret = 0;
 
 	if (scanned == 0)
 		scanned = SWAP_CLUSTER_MAX;
 
 	if (!down_read_trylock(&shrinker_rwsem))
-		return 1;	/* Assume we'll be able to shrink next time */
+		return;
 
 	list_for_each_entry(shrinker, &shrinker_list, list) {
 		(*shrinker->shrink)(shrinker, zone, scanned,
 					total, global, gfp_mask);
 	}
 	up_read(&shrinker_rwsem);
-	return ret;
+	return;
 }
 
 void shrink_all_slab(void)
-- 
1.6.5.2




^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: VFS scalability git tree
  2010-07-24  8:43   ` KOSAKI Motohiro
@ 2010-07-24 10:54     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 76+ messages in thread
From: KOSAKI Motohiro @ 2010-07-24 10:54 UTC (permalink / raw)
  To: Nick Piggin
  Cc: kosaki.motohiro, linux-fsdevel, linux-kernel, linux-mm,
	Frank Mayhar, John Stultz

> > At this point, I would be very interested in reviewing, correctness
> > testing on different configurations, and of course benchmarking.
> 
> I haven't review this series so long time. but I've found one misterious
> shrink_slab() usage. can you please see my patch? (I will send it as
> another mail)

Plus, I have one question. The upstream shrink_slab() calculation and
your calculation differ by more than your patch description explains.

upstream:

  shrink_slab()

                                lru_scanned        max_pass
      basic_scan_objects = 4 x -------------  x -----------------------------
                                lru_pages        shrinker->seeks (default:2)

      scan_objects = min(basic_scan_objects, max_pass * 2)

  shrink_icache_memory()

                                          sysctl_vfs_cache_pressure
      max_pass = inodes_stat.nr_unused x --------------------------
                                                   100


That is, a higher sysctl_vfs_cache_pressure results in more slab reclaim.


On the other hand, your code:
  shrinker_add_scan()

                           scanned          objects
      scan_objects = 4 x -------------  x -----------  x SHRINK_FACTOR x SHRINK_FACTOR
                           total            ratio

  shrink_icache_memory()

     ratio = DEFAULT_SEEKS * sysctl_vfs_cache_pressure / 100

That is, a higher sysctl_vfs_cache_pressure results in less slab reclaim.


So, I guess the following change honestly reflects your original intention.

The new calculation is:

  shrinker_add_scan()

                       scanned          
      scan_objects = -------------  x objects x ratio
                        total            

  shrink_icache_memory()

     ratio = DEFAULT_SEEKS * sysctl_vfs_cache_pressure / 100

This has the same behavior as upstream, because upstream's
4/shrinker->seeks = 2, and the above has DEFAULT_SEEKS = SHRINK_FACTOR*2.



===============
o move 'ratio' from denominator to numerator
o adapt kvm/mmu_shrink
o SHRINK_FACTOR / 2 (default seek) x 4 (unknown shrink slab modifier)
    -> (SHRINK_FACTOR*2) == DEFAULT_SEEKS

---
 arch/x86/kvm/mmu.c |    2 +-
 mm/vmscan.c        |   10 ++--------
 2 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ae5a038..cea1e92 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2942,7 +2942,7 @@ static int mmu_shrink(struct shrinker *shrink,
 	}
 
 	shrinker_add_scan(&nr_to_scan, scanned, global, cache_count,
-			DEFAULT_SEEKS*10);
+			DEFAULT_SEEKS/10);
 
 done:
 	cache_count = shrinker_do_scan(&nr_to_scan, SHRINK_BATCH);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 89b593e..2d8e9ab 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -208,14 +208,8 @@ void shrinker_add_scan(unsigned long *dst,
 {
 	unsigned long long delta;
 
-	/*
-	 * The constant 4 comes from old code. Who knows why.
-	 * This could all use a good tune up with some decent
-	 * benchmarks and numbers.
-	 */
-	delta = (unsigned long long)scanned * objects
-			* SHRINK_FACTOR * SHRINK_FACTOR * 4UL;
-	do_div(delta, (ratio * total + 1));
+	delta = (unsigned long long)scanned * objects * ratio;
+	do_div(delta, total+ 1);
 
 	/*
 	 * Avoid risking looping forever due to too large nr value:
-- 
1.6.5.2





^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/2] vmscan: shrink_all_slab() use reclaim_state instead of the return value of shrink_slab()
  2010-07-24  8:44     ` KOSAKI Motohiro
@ 2010-07-24 12:05       ` KOSAKI Motohiro
  -1 siblings, 0 replies; 76+ messages in thread
From: KOSAKI Motohiro @ 2010-07-24 12:05 UTC (permalink / raw)
  To: Nick Piggin, linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar,
	John Stultz
  Cc: kosaki.motohiro

2010/7/24 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>:
> Now, shrink_slab() doesn't return number of reclaimed objects. IOW,
> current shrink_all_slab() is broken. Thus instead we use reclaim_state
> to detect no reclaimable slab objects.
>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
>  mm/vmscan.c |   20 +++++++++-----------
>  1 files changed, 9 insertions(+), 11 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d7256e0..bfa1975 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -300,18 +300,16 @@ static unsigned long shrink_slab(struct zone *zone, unsigned long scanned, unsig
>  void shrink_all_slab(void)
>  {
>        struct zone *zone;
> -       unsigned long nr;
> +       struct reclaim_state reclaim_state;
>
> -again:
> -       nr = 0;
> -       for_each_zone(zone)
> -               nr += shrink_slab(zone, 1, 1, 1, GFP_KERNEL);
> -       /*
> -        * If we reclaimed less than 10 objects, might as well call
> -        * it a day. Nothing special about the number 10.
> -        */
> -       if (nr >= 10)
> -               goto again;
> +       current->reclaim_state = &reclaim_state;
> +       do {
> +               reclaim_state.reclaimed_slab = 0;
> +               for_each_zone(zone)

Oops, this should be for_each_populated_zone().


> +                       shrink_slab(zone, 1, 1, 1, GFP_KERNEL);
> +       } while (reclaim_state.reclaimed_slab);
> +
> +       current->reclaim_state = NULL;
>  }
>
>  static inline int is_page_cache_freeable(struct page *page)
> --
> 1.6.5.2
>
>
>

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: VFS scalability git tree
  2010-07-22 19:01 ` Nick Piggin
@ 2010-07-26  5:41   ` Nick Piggin
  -1 siblings, 0 replies; 76+ messages in thread
From: Nick Piggin @ 2010-07-26  5:41 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-kernel, linux-mm, Frank Mayhar, John Stultz, Dave Chinner,
	KOSAKI Motohiro, Michael Neuling

On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> I'm pleased to announce I have a git tree up of my vfs scalability work.
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git

Pushed several fixes and improvements:
o XFS bugs fixed by Dave
o dentry and inode stats bugs noticed by Dave
o vmscan shrinker bugs fixed by KOSAKI san
o compile bugs noticed by John
o a few attempts to improve powerpc performance (eg. reducing smp_rmb())
o scalability improvements for rename_lock



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: VFS scalability git tree
  2010-07-23 13:55   ` Dave Chinner
@ 2010-07-27  7:05     ` Nick Piggin
  -1 siblings, 0 replies; 76+ messages in thread
From: Nick Piggin @ 2010-07-27  7:05 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Nick Piggin, linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar,
	John Stultz

On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote:
> On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > I'm pleased to announce I have a git tree up of my vfs scalability work.
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> > 
> > Branch vfs-scale-working
> 
> With a production build (i.e. no lockdep, no xfs debug), I'll
> run the same fs_mark parallel create/unlink workload to show
> scalability as I ran here:
> 
> http://oss.sgi.com/archives/xfs/2010-05/msg00329.html

I've made a similar setup, 2s8c machine, but using 2GB ramdisk instead
of a real disk (I don't have easy access to a good disk setup ATM, but
I guess we're more interested in code above the block layer anyway).

Made an XFS on /dev/ram0 with 16 ags, 64MB log, otherwise same config as
yours.

I found that performance is a little unstable, so I sync and echo 3 >
drop_caches between each run. When it starts reclaiming memory, things
get a bit more erratic (and XFS seemed to be almost livelocking for tens
of seconds in inode reclaim). So I started with 50 runs of fs_mark
-n 20000 (which did not cause reclaim), rebuilding a new filesystem
between every run.

That gave the following files/sec numbers:
    N           Min           Max        Median           Avg Stddev
x  50      100986.4        127622      125013.4     123248.82 5244.1988
+  50      100967.6      135918.6      130214.9     127926.94 6374.6975
Difference at 95.0% confidence
        4678.12 +/- 2316.07
        3.79567% +/- 1.87919%
        (Student's t, pooled s = 5836.88)

This is 3.8% in favour of vfs-scale-working.
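
For reference, the confidence interval quoted above can be reproduced
with a pooled-s Student's t calculation along these lines (a
stand-alone sketch, not the tool that produced the output above; the
critical t value is hard-coded for ~98 degrees of freedom at 95%
confidence):

#include <math.h>
#include <stdio.h>

int main(void)
{
	double n = 50.0;				/* samples per data set */
	double avg_x = 123248.82, sd_x = 5244.1988;	/* the 'x' sample above */
	double avg_y = 127926.94, sd_y = 6374.6975;	/* the '+' sample above */
	double t_crit = 1.984;			/* two-sided 95%, df = 2n - 2 */

	/* Pooled standard deviation of the two samples. */
	double s_pooled = sqrt(((n - 1.0) * sd_x * sd_x +
				(n - 1.0) * sd_y * sd_y) / (2.0 * n - 2.0));
	/* Half-width of the confidence interval for the difference of means. */
	double margin = t_crit * s_pooled * sqrt(2.0 / n);

	printf("Difference %.2f +/- %.2f (pooled s = %.2f)\n",
	       avg_y - avg_x, margin, s_pooled);
	return 0;
}

Built with -lm, this prints the same 4678.12 +/- 2316.07 (pooled s
5836.88) as the summary above.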

I then did 10 runs of -n 20000 but with -L 4 (4 iterations) which did
start to fill up memory and cause reclaim during the 2nd and subsequent
iterations.

    N           Min           Max        Median           Avg Stddev
x  10      116919.7      126785.7      123279.2     122245.17 3169.7993
+  10      110985.1      132440.7      130122.1     126573.41 7151.2947
No difference proven at 95.0% confidence

x  10       75820.9      105934.9       79521.7      84263.37 11210.173
+  10       75698.3      115091.7         82932      93022.75 16725.304
No difference proven at 95.0% confidence

x  10       66330.5       74950.4       69054.5         69102 2335.615
+  10       68348.5       74231.5       70728.2      70879.45 1838.8345
No difference proven at 95.0% confidence

x  10       59353.8       69813.1       67416.7      65164.96 4175.8209
+  10       59670.7       77719.1       74326.1      70966.02 6469.0398
Difference at 95.0% confidence
        5801.06 +/- 5115.66
        8.90212% +/- 7.85033%
        (Student's t, pooled s = 5444.54)

vfs-scale-working was ahead at every point, but the results were
too erratic to read much into them (even the last point is, I think,
questionable).

I can provide raw numbers or more details on the setup if required.


> enabled. ext4 is using default mkfs and mount parameters except for
> barrier=0. All numbers are averages of three runs.
> 
> 	fs_mark rate (thousands of files/second)
>            2.6.35-rc5   2.6.35-rc5-scale
> threads    xfs   ext4     xfs    ext4
>   1         20    39       20     39
>   2         35    55       35     57
>   4         60    41       57     42
>   8         79     9       75      9
> 
> ext4 is getting IO bound at more than 2 threads, so apart from
> pointing out that XFS is 8-9x faster than ext4 at 8 thread, I'm
> going to ignore ext4 for the purposes of testing scalability here.
> 
> For XFS w/ delayed logging, 2.6.35-rc5 is only getting to about 600%
> CPU and with Nick's patches it's about 650% (10% higher) for
> slightly lower throughput.  So at this class of machine for this
> workload, the changes result in a slight reduction in scalability.

I wonder if these results are stable. It's possible that changes in
reclaim behaviour are causing my patches to require more IO for a
given unit of work.

I was seeing XFS 'livelock' in reclaim more with my patches; it
could be due to more parallelism now being allowed from the vfs and
reclaim.

Based on my above numbers, I don't see that rcu-inodes is causing a
problem, and in terms of SMP scalability, there is really no way that
vanilla is more scalable, so I'm interested to see where this slowdown
is coming from.


> I looked at dbench on XFS as well, but didn't see any significant
> change in the numbers at up to 200 load threads, so not much to
> talk about there.

On a smaller system, dbench doesn't bottleneck too much. It's more of
a test to find shared cachelines and such on larger systems when you're
talking about several GB/s bandwidths.

Thanks,
Nick


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: VFS scalability git tree
  2010-07-27  7:05     ` Nick Piggin
@ 2010-07-27  8:06       ` Nick Piggin
  -1 siblings, 0 replies; 76+ messages in thread
From: Nick Piggin @ 2010-07-27  8:06 UTC (permalink / raw)
  To: xfs; +Cc: Dave Chinner, linux-fsdevel

On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote:
> On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote:
> > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > > I'm pleased to announce I have a git tree up of my vfs scalability work.
> > > 
> > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> > > 
> > > Branch vfs-scale-working
> > 
> > With a production build (i.e. no lockdep, no xfs debug), I'll
> > run the same fs_mark parallel create/unlink workload to show
> > scalability as I ran here:
> > 
> > http://oss.sgi.com/archives/xfs/2010-05/msg00329.html
> 
> I've made a similar setup, 2s8c machine, but using 2GB ramdisk instead
> of a real disk (I don't have easy access to a good disk setup ATM, but
> I guess we're more interested in code above the block layer anyway).
> 
> Made an XFS on /dev/ram0 with 16 ags, 64MB log, otherwise same config as
> yours.
> 
> I found that performance is a little unstable, so I sync and echo 3 >
> drop_caches between each run. When it starts reclaiming memory, things
> get a bit more erratic (and XFS seemed to be almost livelocking for tens
> of seconds in inode reclaim).

So about this XFS livelock type thingy. It looks like this, and happens
periodically while running the above fs_mark benchmark once it requires
inode reclaim:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
15  0   6900  31032    192 471852    0    0    28 183296 8520 46672  5 91  4  0
19  0   7044  22928    192 466712   96  144  1056 115586 8622 41695  3 96  1  0
19  0   7136  59884    192 471200  160   92  6768 34564  995  542  1 99 0  0
19  0   7244  17008    192 467860    0  104  2068 32953 1044  630  1 99 0  0
18  0   7244  43436    192 467324    0    0    12     0  817  405  0 100 0  0
18  0   7244  43684    192 467324    0    0     0     0  806  425  0 100 0  0
18  0   7244  43932    192 467324    0    0     0     0  808  403  0 100 0  0
18  0   7244  44924    192 467324    0    0     0     0  808  398  0 100 0  0
18  0   7244  45456    192 467324    0    0     0     0  809  409  0 100 0  0
18  0   7244  45472    192 467324    0    0     0     0  805  412  0 100 0  0
18  0   7244  46392    192 467324    0    0     0     0  807  401  0 100 0  0
18  0   7244  47012    192 467324    0    0     0     0  810  414  0 100 0  0
18  0   7244  47260    192 467324    0    0     0     0  806  396  0 100 0  0
18  0   7244  47752    192 467324    0    0     0     0  806  403  0 100 0  0
18  0   7244  48204    192 467324    0    0     0     0  810  409  0 100 0  0
18  0   7244  48608    192 467324    0    0     0     0  807  412  0 100 0  0
18  0   7244  48876    192 467324    0    0     0     0  805  406  0 100 0  0
18  0   7244  49000    192 467324    0    0     0     0  809  402  0 100 0  0
18  0   7244  49408    192 467324    0    0     0     0  807  396  0 100 0  0
18  0   7244  49908    192 467324    0    0     0     0  809  406  0 100 0  0
18  0   7244  50032    192 467324    0    0     0     0  805  404  0 100 0  0
18  0   7244  50032    192 467324    0    0     0     0  805  406  0 100 0  0
19  0   7244  73436    192 467324    0    0     0  6340  808  384  0 100 0  0
20  0   7244 490220    192 467324    0    0     0  8411  830  389  0 100 0  0
18  0   7244 620092    192 467324    0    0     0     4  809  435  0 100 0  0
18  0   7244 620344    192 467324    0    0     0     0  806  430  0 100 0  0
16  0   7244 682620    192 467324    0    0    44    80  890  326  0 100 0  0
12  0   7244 604464    192 479308   76    0 11716 73555 2242 14318  2 94 4  0
12  0   7244 556700    192 483488    0    0  4276 77680 6576 92285  1 97 2  0
17  0   7244 502508    192 485456    0    0  2092 98368 6308 91919  1 96 4  0
11  0   7244 416500    192 487116    0    0  1760 114844 7414 63025  2 96  2  0

Nothing much happening except 100% system time for seconds at a time
(length of time varies). This is on a ramdisk, so it isn't waiting
for IO.

During this time, lots of things are contending on the lock:

    60.37%         fs_mark  [kernel.kallsyms]   [k] __write_lock_failed
     4.30%         kswapd0  [kernel.kallsyms]   [k] __write_lock_failed
     3.70%         fs_mark  [kernel.kallsyms]   [k] try_wait_for_completion
     3.59%         fs_mark  [kernel.kallsyms]   [k] _raw_write_lock
     3.46%         kswapd1  [kernel.kallsyms]   [k] __write_lock_failed
                   |
                   --- __write_lock_failed
                      |
                      |--99.92%-- xfs_inode_ag_walk
                      |          xfs_inode_ag_iterator
                      |          xfs_reclaim_inode_shrink
                      |          shrink_slab
                      |          shrink_zone
                      |          balance_pgdat
                      |          kswapd
                      |          kthread
                      |          kernel_thread_helper
                       --0.08%-- [...]

     3.02%         fs_mark  [kernel.kallsyms]   [k] _raw_spin_lock
     1.82%         fs_mark  [kernel.kallsyms]   [k] _xfs_buf_find
     1.16%         fs_mark  [kernel.kallsyms]   [k] memcpy
     0.86%         fs_mark  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
     0.75%         fs_mark  [kernel.kallsyms]   [k] xfs_log_commit_cil
                   |
                   --- xfs_log_commit_cil
                       _xfs_trans_commit
                      |
                      |--60.00%-- xfs_remove
                      |          xfs_vn_unlink
                      |          vfs_unlink
                      |          do_unlinkat
                      |          sys_unlink

I'm not sure if there was a long-running read locker in there causing
all the write lockers to fail, or if they were just running into one
another. But anyway, I hacked up the following patch, which seemed to
improve that behaviour. I haven't run any throughput numbers on it yet,
but I could if you're interested (and it's not completely broken!).

Batch pag_ici_lock acquisition on the reclaim path, and also skip inodes
that appear to be busy to improve locking efficiency.

Index: source/fs/xfs/linux-2.6/xfs_sync.c
===================================================================
--- source.orig/fs/xfs/linux-2.6/xfs_sync.c	2010-07-26 21:12:11.000000000 +1000
+++ source/fs/xfs/linux-2.6/xfs_sync.c	2010-07-26 21:58:59.000000000 +1000
@@ -87,6 +87,91 @@ xfs_inode_ag_lookup(
 	return ip;
 }
 
+#define RECLAIM_BATCH_SIZE	32
+STATIC int
+xfs_inode_ag_walk_reclaim(
+	struct xfs_mount	*mp,
+	struct xfs_perag	*pag,
+	int			(*execute)(struct xfs_inode *ip,
+					   struct xfs_perag *pag, int flags),
+	int			flags,
+	int			tag,
+	int			exclusive,
+	int			*nr_to_scan)
+{
+	uint32_t		first_index;
+	int			last_error = 0;
+	int			skipped;
+	xfs_inode_t		*batch[RECLAIM_BATCH_SIZE];
+	int			batchnr;
+	int			i;
+
+	BUG_ON(!exclusive);
+
+restart:
+	skipped = 0;
+	first_index = 0;
+next_batch:
+	batchnr = 0;
+	/* fill the batch */
+	write_lock(&pag->pag_ici_lock);
+	do {
+		xfs_inode_t	*ip;
+
+		ip = xfs_inode_ag_lookup(mp, pag, &first_index, tag);
+		if (!ip)
+			break;	
+		if (!(flags & SYNC_WAIT) &&
+				(!xfs_iflock_free(ip) ||
+				__xfs_iflags_test(ip, XFS_IRECLAIM)))
+			continue;
+
+		/*
+		 * The radix tree lock here protects a thread in xfs_iget from
+		 * racing with us starting reclaim on the inode.  Once we have
+		 * the XFS_IRECLAIM flag set it will not touch us.
+		 */
+		spin_lock(&ip->i_flags_lock);
+		ASSERT_ALWAYS(__xfs_iflags_test(ip, XFS_IRECLAIMABLE));
+		if (__xfs_iflags_test(ip, XFS_IRECLAIM)) {
+			/* ignore as it is already under reclaim */
+			spin_unlock(&ip->i_flags_lock);
+			continue;
+		}
+		__xfs_iflags_set(ip, XFS_IRECLAIM);
+		spin_unlock(&ip->i_flags_lock);
+
+		batch[batchnr++] = ip;
+	} while ((*nr_to_scan)-- && batchnr < RECLAIM_BATCH_SIZE);
+	write_unlock(&pag->pag_ici_lock);
+
+	for (i = 0; i < batchnr; i++) {
+		int		error = 0;
+		xfs_inode_t	*ip = batch[i];
+
+		/* execute doesn't require pag->pag_ici_lock */
+		error = execute(ip, pag, flags);
+		if (error == EAGAIN) {
+			skipped++;
+			continue;
+		}
+		if (error)
+			last_error = error;
+
+		/* bail out if the filesystem is corrupted.  */
+		if (error == EFSCORRUPTED)
+			break;
+	}
+	if (batchnr == RECLAIM_BATCH_SIZE)
+		goto next_batch;
+
+	if (0 && skipped) {
+		delay(1);
+		goto restart;
+	}
+	return last_error;
+}
+
 STATIC int
 xfs_inode_ag_walk(
 	struct xfs_mount	*mp,
@@ -113,6 +198,7 @@ restart:
 			write_lock(&pag->pag_ici_lock);
 		else
 			read_lock(&pag->pag_ici_lock);
+
 		ip = xfs_inode_ag_lookup(mp, pag, &first_index, tag);
 		if (!ip) {
 			if (exclusive)
@@ -198,8 +284,12 @@ xfs_inode_ag_iterator(
 	nr = nr_to_scan ? *nr_to_scan : INT_MAX;
 	ag = 0;
 	while ((pag = xfs_inode_ag_iter_next_pag(mp, &ag, tag))) {
-		error = xfs_inode_ag_walk(mp, pag, execute, flags, tag,
-						exclusive, &nr);
+		if (tag == XFS_ICI_RECLAIM_TAG)
+			error = xfs_inode_ag_walk_reclaim(mp, pag, execute,
+						flags, tag, exclusive, &nr);
+		else
+			error = xfs_inode_ag_walk(mp, pag, execute,
+						flags, tag, exclusive, &nr);
 		xfs_perag_put(pag);
 		if (error) {
 			last_error = error;
@@ -789,23 +879,6 @@ xfs_reclaim_inode(
 {
 	int	error = 0;
 
-	/*
-	 * The radix tree lock here protects a thread in xfs_iget from racing
-	 * with us starting reclaim on the inode.  Once we have the
-	 * XFS_IRECLAIM flag set it will not touch us.
-	 */
-	spin_lock(&ip->i_flags_lock);
-	ASSERT_ALWAYS(__xfs_iflags_test(ip, XFS_IRECLAIMABLE));
-	if (__xfs_iflags_test(ip, XFS_IRECLAIM)) {
-		/* ignore as it is already under reclaim */
-		spin_unlock(&ip->i_flags_lock);
-		write_unlock(&pag->pag_ici_lock);
-		return 0;
-	}
-	__xfs_iflags_set(ip, XFS_IRECLAIM);
-	spin_unlock(&ip->i_flags_lock);
-	write_unlock(&pag->pag_ici_lock);
-
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	if (!xfs_iflock_nowait(ip)) {
 		if (!(sync_mode & SYNC_WAIT))
Index: source/fs/xfs/xfs_inode.h
===================================================================
--- source.orig/fs/xfs/xfs_inode.h	2010-07-26 21:10:33.000000000 +1000
+++ source/fs/xfs/xfs_inode.h	2010-07-26 21:11:59.000000000 +1000
@@ -349,6 +349,11 @@ static inline int xfs_iflock_nowait(xfs_
 	return try_wait_for_completion(&ip->i_flush);
 }
 
+static inline int xfs_iflock_free(xfs_inode_t *ip)
+{
+	return completion_done(&ip->i_flush);
+}
+
 static inline void xfs_ifunlock(xfs_inode_t *ip)
 {
 	complete(&ip->i_flush);

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: VFS scalability git tree
  2010-07-27  7:05     ` Nick Piggin
@ 2010-07-27 11:09       ` Nick Piggin
  -1 siblings, 0 replies; 76+ messages in thread
From: Nick Piggin @ 2010-07-27 11:09 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Dave Chinner, linux-fsdevel, linux-kernel, linux-mm,
	Frank Mayhar, John Stultz

On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote:
> On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote:
> > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > > I'm pleased to announce I have a git tree up of my vfs scalability work.
> > > 
> > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> > > 
> > > Branch vfs-scale-working
> > 
> > With a production build (i.e. no lockdep, no xfs debug), I'll
> > run the same fs_mark parallel create/unlink workload to show
> > scalability as I ran here:
> > 
> > http://oss.sgi.com/archives/xfs/2010-05/msg00329.html
> 
> I've made a similar setup, 2s8c machine, but using 2GB ramdisk instead
> of a real disk (I don't have easy access to a good disk setup ATM, but
> I guess we're more interested in code above the block layer anyway).
> 
> Made an XFS on /dev/ram0 with 16 ags, 64MB log, otherwise same config as
> yours.

I also tried dbench on this setup. 20 runs of dbench -t20 8
(that is a 20 second run, 8 clients).

Numbers are throughput, higher is better:

          N           Min           Max        Median           Avg Stddev
vanilla  20       2219.19       2249.43       2230.43     2230.9915 7.2528893
scale    20       2428.21        2490.8       2437.86      2444.111 16.668256
Difference at 95.0% confidence
        213.119 +/- 8.22695
        9.55268% +/- 0.368757%
        (Student's t, pooled s = 12.8537)

vfs-scale is 9.5% or 210MB/s faster than vanilla.

Like fs_mark, dbench has creat/unlink activity, so I hope rcu-inodes
will not be such a problem in practice. In my creat/unlink benchmark,
it is creating and destroying one inode repeatedly, which is the
absolute worst case for rcu-inodes. Whereas most real workloads would
be creating and destroying many inodes, which is not such a
disadvantage for rcu-inodes.

Incidentally, XFS was by far the fastest "real" filesystem I tested on
this workload. ext4 was around 1700MB/s (ext2 was around 3100MB/s and
ramfs is 3350MB/s).


^ permalink raw reply	[flat|nested] 76+ messages in thread

* XFS hang in xlog_grant_log_space (was Re: VFS scalability git tree)
  2010-07-27  8:06       ` Nick Piggin
  (?)
@ 2010-07-27 11:36       ` Nick Piggin
  2010-07-27 13:30         ` Dave Chinner
  -1 siblings, 1 reply; 76+ messages in thread
From: Nick Piggin @ 2010-07-27 11:36 UTC (permalink / raw)
  To: xfs

On Tue, Jul 27, 2010 at 06:06:32PM +1000, Nick Piggin wrote:
> On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote:
> > On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote:
> > > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > > > I'm pleased to announce I have a git tree up of my vfs scalability work.
> > > > 
> > > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> > > > 
> > > > Branch vfs-scale-working
> > > 
> > > With a production build (i.e. no lockdep, no xfs debug), I'll
> > > run the same fs_mark parallel create/unlink workload to show
> > > scalability as I ran here:
> > > 
> > > http://oss.sgi.com/archives/xfs/2010-05/msg00329.html
> > 
> > I've made a similar setup, 2s8c machine, but using 2GB ramdisk instead
> > of a real disk (I don't have easy access to a good disk setup ATM, but
> > I guess we're more interested in code above the block layer anyway).
> > 
> > Made an XFS on /dev/ram0 with 16 ags, 64MB log, otherwise same config as
> > yours.
> > 
> > I found that performance is a little unstable, so I sync and echo 3 >
> > drop_caches between each run. When it starts reclaiming memory, things
> > get a bit more erratic (and XFS seemed to be almost livelocking for tens
> > of seconds in inode reclaim).

On this same system, same setup (vanilla kernel with the sha given
below), I have now twice reproduced a complete hang in XFS. I can give
more information, test patches, or options, etc., if required.

setup.sh looks like this:
#!/bin/bash
modprobe rd rd_size=$[2*1024*1024]
dd if=/dev/zero of=/dev/ram0 bs=4K
mkfs.xfs -f -l size=64m -d agcount=16 /dev/ram0
mount -o delaylog,logbsize=262144,nobarrier /dev/ram0 mnt

The 'dd' is required to ensure the rd driver does not allocate pages
during IO (which can lead to out-of-memory deadlocks). Running just
involves changing into the mnt directory and running:

while true
do
  sync
  echo 3 > /proc/sys/vm/drop_caches
  ../dbench -c ../loadfiles/client.txt -t20 8
  rm -rf clients
done

And wait for it to hang (it happened in < 5 minutes here).


The sysrq output of blocked tasks looks like this:

Linux version 2.6.35-rc5-00176-gcd5b8f8 (npiggin@amd) (gcc version 4.4.4 (Debian 4.4.4-7) ) #348 SMP Mon Jul 26 22:20:32 EST 2010

brd: module loaded
Enabling EXPERIMENTAL delayed logging feature - use at your own risk.
XFS mounting filesystem ram0
Ending clean XFS mount for filesystem: ram0
SysRq : Show Blocked State
  task                        PC stack   pid father
flush-1:0     D 00000000fffff8fd     0  2799      2 0x00000000
 ffff8800701ff690 0000000000000046 ffff880000000000 ffff8800701fffd8
 ffff8800071531f0 00000000000122c0 0000000000004000 ffff8800701fffd8
 0000000000004000 00000000000122c0 ffff88007f2d3750 ffff8800071531f0
Call Trace:
 [<ffffffff812361f8>] xlog_grant_log_space+0x158/0x3d0
 [<ffffffff8124a0b5>] ? kmem_zone_zalloc+0x35/0x50
 [<ffffffff81034bf0>] ? default_wake_function+0x0/0x10
 [<ffffffff8124410c>] ? xfs_trans_ail_push+0x1c/0x80
 [<ffffffff81236552>] xfs_log_reserve+0xe2/0xf0
 [<ffffffff81243307>] xfs_trans_reserve+0x97/0x200
 [<ffffffff8122b637>] ? xfs_iunlock+0x57/0xb0
 [<ffffffff8123275b>] xfs_iomap_write_allocate+0x25b/0x3c0
 [<ffffffff81232abe>] xfs_iomap+0x1fe/0x270
 [<ffffffff8124aaf7>] xfs_map_blocks+0x37/0x40
 [<ffffffff8107b8e1>] ? find_lock_page+0x21/0x70
 [<ffffffff8124bdca>] xfs_page_state_convert+0x35a/0x690
 [<ffffffff8107c38a>] ? find_or_create_page+0x3a/0xa0
 [<ffffffff8124c3a6>] xfs_vm_writepage+0x76/0x110
 [<ffffffff8108eb00>] ? __dec_zone_page_state+0x30/0x40
 [<ffffffff81082c32>] __writepage+0x12/0x40
 [<ffffffff81083367>] write_cache_pages+0x1c7/0x3d0
 [<ffffffff81082c20>] ? __writepage+0x0/0x40
 [<ffffffff8108358f>] generic_writepages+0x1f/0x30
 [<ffffffff8124c2fc>] xfs_vm_writepages+0x4c/0x60
 [<ffffffff810835bc>] do_writepages+0x1c/0x40
 [<ffffffff810d606e>] writeback_single_inode+0xce/0x3b0
 [<ffffffff810d6774>] writeback_sb_inodes+0x174/0x260
 [<ffffffff810d701f>] writeback_inodes_wb+0x8f/0x180
 [<ffffffff810d733b>] wb_writeback+0x22b/0x290
 [<ffffffff810d7536>] wb_do_writeback+0x196/0x1a0
 [<ffffffff810d7583>] bdi_writeback_task+0x43/0x120
 [<ffffffff81050f46>] ? bit_waitqueue+0x16/0xe0
 [<ffffffff8108fd00>] ? bdi_start_fn+0x0/0xe0
 [<ffffffff8108fd6c>] bdi_start_fn+0x6c/0xe0
 [<ffffffff8108fd00>] ? bdi_start_fn+0x0/0xe0
 [<ffffffff81050bee>] kthread+0x8e/0xa0
 [<ffffffff81003014>] kernel_thread_helper+0x4/0x10
 [<ffffffff81050b60>] ? kthread+0x0/0xa0
 [<ffffffff81003010>] ? kernel_thread_helper+0x0/0x10
xfssyncd/ram0 D 00000000fffff045     0  2807      2 0x00000000
 ffff880007635d00 0000000000000046 ffff880000000000 ffff880007635fd8
 ffff880007abd370 00000000000122c0 0000000000004000 ffff880007635fd8
 0000000000004000 00000000000122c0 ffff88007f2b7710 ffff880007abd370
Call Trace:
 [<ffffffff812361f8>] xlog_grant_log_space+0x158/0x3d0
 [<ffffffff8124a0b5>] ? kmem_zone_zalloc+0x35/0x50
 [<ffffffff81034bf0>] ? default_wake_function+0x0/0x10
 [<ffffffff8124410c>] ? xfs_trans_ail_push+0x1c/0x80
 [<ffffffff81236552>] xfs_log_reserve+0xe2/0xf0
 [<ffffffff81243307>] xfs_trans_reserve+0x97/0x200
 [<ffffffff812572b7>] ? xfs_inode_ag_iterator+0x57/0xd0
 [<ffffffff81256c0a>] xfs_commit_dummy_trans+0x4a/0xe0
 [<ffffffff81257454>] xfs_sync_worker+0x74/0x80
 [<ffffffff81256b2a>] xfssyncd+0x13a/0x1d0
 [<ffffffff812569f0>] ? xfssyncd+0x0/0x1d0
 [<ffffffff81050bee>] kthread+0x8e/0xa0
 [<ffffffff81003014>] kernel_thread_helper+0x4/0x10
 [<ffffffff81050b60>] ? kthread+0x0/0xa0
 [<ffffffff81003010>] ? kernel_thread_helper+0x0/0x10
dbench        D 00000000ffffefc6     0  2975   2974 0x00000000
 ffff88005ecd1ae8 0000000000000082 ffff880000000000 ffff88005ecd1fd8
 ffff8800079ce250 00000000000122c0 0000000000004000 ffff88005ecd1fd8
 0000000000004000 00000000000122c0 ffff88007f2d3750 ffff8800079ce250
Call Trace:
 [<ffffffff812361f8>] xlog_grant_log_space+0x158/0x3d0
 [<ffffffff8124a0b5>] ? kmem_zone_zalloc+0x35/0x50
 [<ffffffff81034bf0>] ? default_wake_function+0x0/0x10
 [<ffffffff8124410c>] ? xfs_trans_ail_push+0x1c/0x80
 [<ffffffff81236552>] xfs_log_reserve+0xe2/0xf0
 [<ffffffff81243307>] xfs_trans_reserve+0x97/0x200
 [<ffffffff81247920>] xfs_create+0x170/0x5d0
 [<ffffffff810ca6e2>] ? __d_lookup+0xa2/0x140
 [<ffffffff810d0724>] ? mntput_no_expire+0x24/0xe0
 [<ffffffff81253202>] xfs_vn_mknod+0xa2/0x1b0
 [<ffffffff8125332b>] xfs_vn_create+0xb/0x10
 [<ffffffff810c1471>] vfs_create+0x81/0xd0
 [<ffffffff810c2915>] do_last+0x515/0x670
 [<ffffffff810c48cd>] do_filp_open+0x21d/0x650
 [<ffffffff810c6871>] ? filldir+0x71/0xd0
 [<ffffffff8103f012>] ? current_fs_time+0x22/0x30
 [<ffffffff810ce96b>] ? alloc_fd+0x4b/0x130
 [<ffffffff810b5d34>] do_sys_open+0x64/0x140
 [<ffffffff810b5bbd>] ? filp_close+0x4d/0x80
 [<ffffffff810b5e3b>] sys_open+0x1b/0x20
 [<ffffffff810022eb>] system_call_fastpath+0x16/0x1b
dbench        D 00000000fffff045     0  2976   2974 0x00000000
 ffff880060c11b18 0000000000000086 ffff880000000000 ffff880060c11fd8
 ffff880007668630 00000000000122c0 0000000000004000 ffff880060c11fd8
 0000000000004000 00000000000122c0 ffffffff81793020 ffff880007668630
Call Trace:
 [<ffffffff81236320>] xlog_grant_log_space+0x280/0x3d0
 [<ffffffff81034bf0>] ? default_wake_function+0x0/0x10
 [<ffffffff8124410c>] ? xfs_trans_ail_push+0x1c/0x80
 [<ffffffff81236552>] xfs_log_reserve+0xe2/0xf0
 [<ffffffff81243307>] xfs_trans_reserve+0x97/0x200
 [<ffffffff812412c8>] xfs_rename+0x138/0x630
 [<ffffffff810c036e>] ? exec_permission+0x3e/0x70
 [<ffffffff81253111>] xfs_vn_rename+0x61/0x70
 [<ffffffff810c1b4e>] vfs_rename+0x41e/0x480
 [<ffffffff810c3bd6>] sys_renameat+0x236/0x270
 [<ffffffff8122551d>] ? xfs_dir2_sf_getdents+0x21d/0x390
 [<ffffffff810c6800>] ? filldir+0x0/0xd0
 [<ffffffff8103f012>] ? current_fs_time+0x22/0x30
 [<ffffffff810b8a4a>] ? fput+0x1aa/0x220
 [<ffffffff810c3c26>] sys_rename+0x16/0x20
 [<ffffffff810022eb>] system_call_fastpath+0x16/0x1b
dbench        D 00000000ffffeed4     0  2977   2974 0x00000000
 ffff88000873fa88 0000000000000082 ffff88000873fac8 ffff88000873ffd8
 ffff880007669710 00000000000122c0 0000000000004000 ffff88000873ffd8
 0000000000004000 00000000000122c0 ffff88007f2e1790 ffff880007669710
Call Trace:
 [<ffffffff8155e58d>] schedule_timeout+0x1ad/0x210
 [<ffffffff8123dbf6>] ? xfs_icsb_disable_counter+0x16/0xa0
 [<ffffffff812445bb>] ? _xfs_trans_bjoin+0x4b/0x60
 [<ffffffff8107b5c9>] ? find_get_page+0x19/0xa0
 [<ffffffff8123dcb6>] ? xfs_icsb_balance_counter_locked+0x36/0xc0
 [<ffffffff8155f4e8>] __down+0x68/0xb0
 [<ffffffff81055b0b>] down+0x3b/0x50
 [<ffffffff8124d59e>] xfs_buf_lock+0x4e/0x70
 [<ffffffff8124ebb3>] _xfs_buf_find+0x133/0x220
 [<ffffffff8124ecfb>] xfs_buf_get+0x5b/0x160
 [<ffffffff8124ee13>] xfs_buf_read+0x13/0xa0
 [<ffffffff81244780>] xfs_trans_read_buf+0x1b0/0x320
 [<ffffffff8122922f>] xfs_read_agi+0x6f/0xf0
 [<ffffffff8122fa86>] xfs_iunlink+0x46/0x160
 [<ffffffff81253d21>] ? xfs_mark_inode_dirty_sync+0x21/0x30
 [<ffffffff81253dcf>] ? xfs_ichgtime+0x9f/0xc0
 [<ffffffff81245677>] xfs_droplink+0x57/0x70
 [<ffffffff8124751a>] xfs_remove+0x28a/0x370
 [<ffffffff81253443>] xfs_vn_unlink+0x43/0x90
 [<ffffffff810c161b>] vfs_unlink+0x8b/0x110
 [<ffffffff810c0e20>] ? lookup_hash+0x30/0x40
 [<ffffffff810c3db3>] do_unlinkat+0x183/0x1c0
 [<ffffffff810bb3f1>] ? sys_newstat+0x31/0x50
 [<ffffffff810c3e01>] sys_unlink+0x11/0x20
 [<ffffffff810022eb>] system_call_fastpath+0x16/0x1b
dbench        D 00000000ffffefa8     0  2978   2974 0x00000000
 ffff880040da7c38 0000000000000082 ffff880000000000 ffff880040da7fd8
 ffff880007668090 00000000000122c0 0000000000004000 ffff880040da7fd8
 0000000000004000 00000000000122c0 ffff88012ff78b90 ffff880007668090
Call Trace:
 [<ffffffff812361f8>] xlog_grant_log_space+0x158/0x3d0
 [<ffffffff8124a0b5>] ? kmem_zone_zalloc+0x35/0x50
 [<ffffffff81034bf0>] ? default_wake_function+0x0/0x10
 [<ffffffff8124410c>] ? xfs_trans_ail_push+0x1c/0x80
 [<ffffffff81236552>] xfs_log_reserve+0xe2/0xf0
 [<ffffffff81243307>] xfs_trans_reserve+0x97/0x200
 [<ffffffff812495e9>] xfs_setattr+0x7e9/0xad0
 [<ffffffff81253766>] xfs_vn_setattr+0x16/0x20
 [<ffffffff810cdb94>] notify_change+0x104/0x2e0
 [<ffffffff810db270>] utimes_common+0xd0/0x1a0
 [<ffffffff810bb64e>] ? sys_newfstat+0x2e/0x40
 [<ffffffff810db416>] do_utimes+0xd6/0xf0
 [<ffffffff810db5ae>] sys_utime+0x1e/0x70
 [<ffffffff810022eb>] system_call_fastpath+0x16/0x1b
dbench        D 00000000ffffefd3     0  2979   2974 0x00000000
 ffff8800072c7ae8 0000000000000082 ffff8800072c7b78 ffff8800072c7fd8
 ffff880007669170 00000000000122c0 0000000000004000 ffff8800072c7fd8
 0000000000004000 00000000000122c0 ffff88012ff785f0 ffff880007669170
Call Trace:
 [<ffffffff812361f8>] xlog_grant_log_space+0x158/0x3d0
 [<ffffffff8124a0b5>] ? kmem_zone_zalloc+0x35/0x50
 [<ffffffff81034bf0>] ? default_wake_function+0x0/0x10
 [<ffffffff8124410c>] ? xfs_trans_ail_push+0x1c/0x80
 [<ffffffff81236552>] xfs_log_reserve+0xe2/0xf0
 [<ffffffff81243307>] xfs_trans_reserve+0x97/0x200
 [<ffffffff8122b637>] ? xfs_iunlock+0x57/0xb0
 [<ffffffff81247920>] xfs_create+0x170/0x5d0
 [<ffffffff810ca6e2>] ? __d_lookup+0xa2/0x140
 [<ffffffff81253202>] xfs_vn_mknod+0xa2/0x1b0
 [<ffffffff8125332b>] xfs_vn_create+0xb/0x10
 [<ffffffff810c1471>] vfs_create+0x81/0xd0
 [<ffffffff810c2915>] do_last+0x515/0x670
 [<ffffffff810c48cd>] do_filp_open+0x21d/0x650
 [<ffffffff810c6871>] ? filldir+0x71/0xd0
 [<ffffffff8103f012>] ? current_fs_time+0x22/0x30
 [<ffffffff810ce96b>] ? alloc_fd+0x4b/0x130
 [<ffffffff810b5d34>] do_sys_open+0x64/0x140
 [<ffffffff810b5bbd>] ? filp_close+0x4d/0x80
 [<ffffffff810b5e3b>] sys_open+0x1b/0x20
 [<ffffffff810022eb>] system_call_fastpath+0x16/0x1b
dbench        D 00000000ffffeed0     0  2980   2974 0x00000000
 ffff88003dd91688 0000000000000082 0000000000000000 ffff88003dd91fd8
 ffff880007abc290 00000000000122c0 0000000000004000 ffff88003dd91fd8
 0000000000004000 00000000000122c0 ffff88007f2d3750 ffff880007abc290
Call Trace:
 [<ffffffff8155e58d>] schedule_timeout+0x1ad/0x210
 [<ffffffff8155f4e8>] __down+0x68/0xb0
 [<ffffffff81055b0b>] down+0x3b/0x50
 [<ffffffff8124d59e>] xfs_buf_lock+0x4e/0x70
 [<ffffffff8124ebb3>] _xfs_buf_find+0x133/0x220
 [<ffffffff8124ecfb>] xfs_buf_get+0x5b/0x160
 [<ffffffff8124ee13>] xfs_buf_read+0x13/0xa0
 [<ffffffff81244780>] xfs_trans_read_buf+0x1b0/0x320
 [<ffffffff8122922f>] xfs_read_agi+0x6f/0xf0
 [<ffffffff812292d9>] xfs_ialloc_read_agi+0x29/0x90
 [<ffffffff8122957b>] xfs_ialloc_ag_select+0x12b/0x260
 [<ffffffff8122abc7>] xfs_dialloc+0x3d7/0x860
 [<ffffffff8124acc8>] ? __xfs_get_blocks+0x1c8/0x210
 [<ffffffff8107b5c9>] ? find_get_page+0x19/0xa0
 [<ffffffff810ddb9e>] ? unmap_underlying_metadata+0xe/0x50
 [<ffffffff8122ef4d>] xfs_ialloc+0x5d/0x690
 [<ffffffff8124a031>] ? kmem_zone_alloc+0x91/0xe0
 [<ffffffff8124570d>] xfs_dir_ialloc+0x7d/0x320
 [<ffffffff81236552>] ? xfs_log_reserve+0xe2/0xf0
 [<ffffffff81247b83>] xfs_create+0x3d3/0x5d0
 [<ffffffff81253202>] xfs_vn_mknod+0xa2/0x1b0
 [<ffffffff8125332b>] xfs_vn_create+0xb/0x10
 [<ffffffff810c1471>] vfs_create+0x81/0xd0
 [<ffffffff810c2915>] do_last+0x515/0x670
 [<ffffffff810c48cd>] do_filp_open+0x21d/0x650
 [<ffffffff810c6871>] ? filldir+0x71/0xd0
 [<ffffffff8103f012>] ? current_fs_time+0x22/0x30
 [<ffffffff810ce96b>] ? alloc_fd+0x4b/0x130
 [<ffffffff810b5d34>] do_sys_open+0x64/0x140
 [<ffffffff810b5bbd>] ? filp_close+0x4d/0x80
 [<ffffffff810b5e3b>] sys_open+0x1b/0x20
 [<ffffffff810022eb>] system_call_fastpath+0x16/0x1b
dbench        D 00000000ffffeed0     0  2981   2974 0x00000000
 ffff88005b79f618 0000000000000086 ffff88005b79f598 ffff88005b79ffd8
 ffff880007abcdd0 00000000000122c0 0000000000004000 ffff88005b79ffd8
 0000000000004000 00000000000122c0 ffff88007f2b7710 ffff880007abcdd0
Call Trace:
 [<ffffffff8155e58d>] schedule_timeout+0x1ad/0x210
 [<ffffffff8122cd44>] ? xfs_iext_bno_to_ext+0x84/0x160
 [<ffffffff8155f4e8>] __down+0x68/0xb0
 [<ffffffff81055b0b>] down+0x3b/0x50
 [<ffffffff8124d59e>] xfs_buf_lock+0x4e/0x70
 [<ffffffff8124ebb3>] _xfs_buf_find+0x133/0x220
 [<ffffffff8124ecfb>] xfs_buf_get+0x5b/0x160
 [<ffffffff81244a40>] xfs_trans_get_buf+0xc0/0xe0
 [<ffffffff8121ac3f>] xfs_da_do_buf+0x3df/0x6d0
 [<ffffffff8121b0c5>] xfs_da_get_buf+0x25/0x30
 [<ffffffff81220926>] ? xfs_dir2_data_init+0x46/0xe0
 [<ffffffff81220926>] xfs_dir2_data_init+0x46/0xe0
 [<ffffffff8121e829>] xfs_dir2_sf_to_block+0xb9/0x5a0
 [<ffffffff8105106a>] ? wake_up_bit+0x2a/0x40
 [<ffffffff81226a78>] xfs_dir2_sf_addname+0x418/0x5c0
 [<ffffffff8122f3fb>] ? xfs_ialloc+0x50b/0x690
 [<ffffffff8121e61c>] xfs_dir_createname+0x14c/0x1a0
 [<ffffffff81247bf9>] xfs_create+0x449/0x5d0
 [<ffffffff81253202>] xfs_vn_mknod+0xa2/0x1b0
 [<ffffffff8125332b>] xfs_vn_create+0xb/0x10
 [<ffffffff810c1471>] vfs_create+0x81/0xd0
 [<ffffffff810c2915>] do_last+0x515/0x670
 [<ffffffff810c48cd>] do_filp_open+0x21d/0x650
 [<ffffffff810c6871>] ? filldir+0x71/0xd0
 [<ffffffff8103f012>] ? current_fs_time+0x22/0x30
 [<ffffffff810ce96b>] ? alloc_fd+0x4b/0x130
 [<ffffffff810b5d34>] do_sys_open+0x64/0x140
 [<ffffffff810b5bbd>] ? filp_close+0x4d/0x80
 [<ffffffff810b5e3b>] sys_open+0x1b/0x20
 [<ffffffff810022eb>] system_call_fastpath+0x16/0x1b
dbench        D 00000000ffffefbf     0  2982   2974 0x00000000
 ffff88005b7f9c38 0000000000000082 ffff880000000000 ffff88005b7f9fd8
 ffff880007698150 00000000000122c0 0000000000004000 ffff88005b7f9fd8
 0000000000004000 00000000000122c0 ffff88012ff79130 ffff880007698150
Call Trace:
 [<ffffffff812361f8>] xlog_grant_log_space+0x158/0x3d0
 [<ffffffff8124a0b5>] ? kmem_zone_zalloc+0x35/0x50
 [<ffffffff81034bf0>] ? default_wake_function+0x0/0x10
 [<ffffffff8124410c>] ? xfs_trans_ail_push+0x1c/0x80
 [<ffffffff81236552>] xfs_log_reserve+0xe2/0xf0
 [<ffffffff81243307>] xfs_trans_reserve+0x97/0x200
 [<ffffffff812495e9>] xfs_setattr+0x7e9/0xad0
 [<ffffffff81253766>] xfs_vn_setattr+0x16/0x20
 [<ffffffff810cdb94>] notify_change+0x104/0x2e0
 [<ffffffff810db270>] utimes_common+0xd0/0x1a0
 [<ffffffff810bb64e>] ? sys_newfstat+0x2e/0x40
 [<ffffffff810db416>] do_utimes+0xd6/0xf0
 [<ffffffff810db5ae>] sys_utime+0x1e/0x70
 [<ffffffff810022eb>] system_call_fastpath+0x16/0x1b

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: VFS scalability git tree
  2010-07-27  7:05     ` Nick Piggin
@ 2010-07-27 13:18       ` Dave Chinner
  -1 siblings, 0 replies; 76+ messages in thread
From: Dave Chinner @ 2010-07-27 13:18 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar, John Stultz

On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote:
> On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote:
> > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > > I'm pleased to announce I have a git tree up of my vfs scalability work.
> > > 
> > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> > > 
> > > Branch vfs-scale-working
> > 
> > With a production build (i.e. no lockdep, no xfs debug), I'll
> > run the same fs_mark parallel create/unlink workload to show
> > scalability as I ran here:
> > 
> > http://oss.sgi.com/archives/xfs/2010-05/msg00329.html
> 
> I've made a similar setup, 2s8c machine, but using 2GB ramdisk instead
> of a real disk (I don't have easy access to a good disk setup ATM, but
> I guess we're more interested in code above the block layer anyway).
> 
> Made an XFS on /dev/ram0 with 16 ags, 64MB log, otherwise same config as
> yours.

As a personal preference, I don't like testing filesystem performance
on ramdisks because it hides problems caused by changes in IO
latency. I'll come back to this later.

> I found that performance is a little unstable, so I sync and echo 3 >
> drop_caches between each run.

Quite possibly because of the smaller log - that will cause more
frequent pushing on the log tail and hence I/O patterns will vary a
bit...

Also, keep in mind that delayed logging is shiny and new - it has
increased XFS metadata performance and parallelism by an order of
magnitude and so we're really seeing a bunch of brand new issues
that have never been seen before with this functionality.  As such,
there's still some interactions I haven't got to the bottom of with
delayed logging - it's stable enough to use and benchmark and won't
corrupt anything, but there are still some warts we need to
solve. The difficulty (as always) is in reliably reproducing the bad
behaviour.

> When it starts reclaiming memory, things
> get a bit more erratic (and XFS seemed to be almost livelocking for tens
> of seconds in inode reclaim).

I can't say that I've seen this - even when testing up to 10m
inodes. Yes, kswapd is almost permanently active on these runs,
but when creating 100,000 inodes/s we also need to be reclaiming
100,000 inodes/s so it's not surprising that when 7 CPUs are doing
allocation we need at least one CPU to run reclaim....

> So I started with 50 runs of fs_mark
> -n 20000 (which did not cause reclaim), rebuilding a new filesystem
> between every run.
> 
> That gave the following files/sec numbers:
>     N           Min           Max        Median           Avg Stddev
> x  50      100986.4        127622      125013.4     123248.82 5244.1988
> +  50      100967.6      135918.6      130214.9     127926.94 6374.6975
> Difference at 95.0% confidence
>         4678.12 +/- 2316.07
>         3.79567% +/- 1.87919%
>         (Student's t, pooled s = 5836.88)
> 
> This is 3.8% in favour of vfs-scale-working.
> 
> I then did 10 runs of -n 20000 but with -L 4 (4 iterations) which did
> start to fill up memory and cause reclaim during the 2nd and subsequent
> iterations.

I haven't used this mode, so I can't really comment on the results
you are seeing.

> > enabled. ext4 is using default mkfs and mount parameters except for
> > barrier=0. All numbers are averages of three runs.
> > 
> > 	fs_mark rate (thousands of files/second)
> >            2.6.35-rc5   2.6.35-rc5-scale
> > threads    xfs   ext4     xfs    ext4
> >   1         20    39       20     39
> >   2         35    55       35     57
> >   4         60    41       57     42
> >   8         79     9       75      9
> > 
> > ext4 is getting IO bound at more than 2 threads, so apart from
> > pointing out that XFS is 8-9x faster than ext4 at 8 thread, I'm
> > going to ignore ext4 for the purposes of testing scalability here.
> > 
> > For XFS w/ delayed logging, 2.6.35-rc5 is only getting to about 600%
> > CPU and with Nick's patches it's about 650% (10% higher) for
> > slightly lower throughput.  So at this class of machine for this
> > workload, the changes result in a slight reduction in scalability.
> 
> I wonder if these results are stable. It's possible that changes in
> reclaim behaviour are causing my patches to require more IO for a
> given unit of work?

More likely that's the result of using a smaller log size because it
will require more frequent metadata pushes to make space for new
transactions.

> I was seeing XFS 'livelock' in reclaim more with my patches, it
> could be due to more parallelism now being allowed from the vfs and
> reclaim.
>
> Based on my above numbers, I don't see that rcu-inodes is causing a
> problem, and in terms of SMP scalability, there is really no way that
> vanilla is more scalable, so I'm interested to see where this slowdown
> is coming from.

As I said initially, ram disks hide IO latency changes resulting
from increased numbers of IOs or increases in seek distances.  My
initial guess is that the change in inode reclaim behaviour is causing
different IO patterns and more seeks under reclaim, because the zone
based reclaim is no longer reclaiming inodes in the order
they are created (i.e. we are not doing sequential inode reclaim any
more).

FWIW, I use PCP monitoring graphs to correlate behavioural changes
across different subsystems because it is far easier to relate
information visually than it is by looking at raw numbers or traces.
I think this graph shows the effect of reclaim on performance
most clearly:

http://userweb.kernel.org/~dgc/shrinker-2.6.36/fs_mark-2.6.35-rc3-context-only-per-xfs-batch6-16x500-xfs.png

It's pretty clear that when the inode/dentry cache shrinkers are
running, sustained create/unlink performance goes right down. From a
different tab not in the screen shot (the other "test-4" tab), I
could see CPU usage also goes down and the disk iops go way up
whenever the create/unlink performance dropped. This same behaviour
happens with the vfs-scale patchset, so it's not related to lock
contention - just aggressive reclaim of still-dirty inodes.

FYI, the patch under test there had the XFS shrinker ignoring 7 out
of 8 shrinker calls and then on the 8th call doing the work of all
previous calls, i.e. emulating SHRINK_BATCH = 1024. Interestingly
enough, that one change reduced the runtime of the 8m inode
create/unlink load by ~25% (from ~24min to ~18min).
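
Very roughly, the batching idea looks like this - pseudo-code only,
the function name is made up and the real shrinker callback
signature is different:

/* Sketch of "do 1 call in 8" shrinker batching; illustrative only. */
static long deferred_scan;

static long shrink_xfs_inodes_batched(long nr_to_scan)
{
	deferred_scan += nr_to_scan;
	if (deferred_scan < 1024)	/* emulate SHRINK_BATCH = 1024 */
		return 0;		/* ignore 7 out of 8 calls */

	nr_to_scan = deferred_scan;	/* 8th call: do all deferred work */
	deferred_scan = 0;
	/* ... reclaim up to nr_to_scan inodes here ... */
	return nr_to_scan;
}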

That is by far the largest improvement I've been able to obtain from
modifying the shrinker code, and it is from those sorts of
observations that I think that IO being issued from reclaim is
currently the most significant performance limiting factor for XFS
in this sort of workload....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: XFS hang in xlog_grant_log_space (was Re: VFS scalability git tree)
  2010-07-27 11:36       ` XFS hang in xlog_grant_log_space (was Re: VFS scalability git tree) Nick Piggin
@ 2010-07-27 13:30         ` Dave Chinner
  2010-07-27 14:58           ` XFS hang in xlog_grant_log_space Dave Chinner
  0 siblings, 1 reply; 76+ messages in thread
From: Dave Chinner @ 2010-07-27 13:30 UTC (permalink / raw)
  To: Nick Piggin; +Cc: xfs

On Tue, Jul 27, 2010 at 09:36:26PM +1000, Nick Piggin wrote:
> On Tue, Jul 27, 2010 at 06:06:32PM +1000, Nick Piggin wrote:
> On this same system, same setup (vanilla kernel with sha given below),
> I have now twice reproduced a complete hang in XFS. I can give more
> information, test patches or options etc if required.
> 
> setup.sh looks like this:
> #!/bin/bash
> modprobe rd rd_size=$[2*1024*1024]
> dd if=/dev/zero of=/dev/ram0 bs=4K
> mkfs.xfs -f -l size=64m -d agcount=16 /dev/ram0
> mount -o delaylog,logbsize=262144,nobarrier /dev/ram0 mnt
> 
> The 'dd' is required to ensure rd driver does not allocate pages
> during IO (which can lead to out of memory deadlocks). Running just
> involves changing into mnt directory and
> 
> while true
> do
>   sync
>   echo 3 > /proc/sys/vm/drop_caches
>   ../dbench -c ../loadfiles/client.txt -t20 8
>   rm -rf clients
> done
> 
> And wait for it to hang (happened in < 5 minutes here)
....
> Call Trace:
>  [<ffffffff812361f8>] xlog_grant_log_space+0x158/0x3d0

It's waiting on log space to be freed up. Either there's an
accounting problem (possible), or you've got an xfslogd/xfsaild
spinning and not making progress competing log IOs or pushing the
tail of the log. I'll see if I can reproduce it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: XFS hang in xlog_grant_log_space
  2010-07-27 13:30         ` Dave Chinner
@ 2010-07-27 14:58           ` Dave Chinner
  2010-07-28 13:17             ` Dave Chinner
  0 siblings, 1 reply; 76+ messages in thread
From: Dave Chinner @ 2010-07-27 14:58 UTC (permalink / raw)
  To: Nick Piggin; +Cc: xfs

On Tue, Jul 27, 2010 at 11:30:38PM +1000, Dave Chinner wrote:
> On Tue, Jul 27, 2010 at 09:36:26PM +1000, Nick Piggin wrote:
> > On Tue, Jul 27, 2010 at 06:06:32PM +1000, Nick Piggin wrote:
> > On this same system, same setup (vanilla kernel with sha given below),
> > I have now twice reproduced a complete hang in XFS. I can give more
> > information, test patches or options etc if required.
> > 
> > setup.sh looks like this:
> > #!/bin/bash
> > modprobe rd rd_size=$[2*1024*1024]
> > dd if=/dev/zero of=/dev/ram0 bs=4K
> > mkfs.xfs -f -l size=64m -d agcount=16 /dev/ram0
> > mount -o delaylog,logbsize=262144,nobarrier /dev/ram0 mnt
> > 
> > The 'dd' is required to ensure rd driver does not allocate pages
> > during IO (which can lead to out of memory deadlocks). Running just
> > involves changing into mnt directory and
> > 
> > while true
> > do
> >   sync
> >   echo 3 > /proc/sys/vm/drop_caches
> >   ../dbench -c ../loadfiles/client.txt -t20 8
> >   rm -rf clients
> > done
> > 
> > And wait for it to hang (happened in < 5 minutes here)
> ....
> > Call Trace:
> >  [<ffffffff812361f8>] xlog_grant_log_space+0x158/0x3d0
> 
> It's waiting on log space to be freed up. Either there's an
> accounting problem (possible), or you've got an xfslogd/xfsaild
> spinning and not making progress competing log IOs or pushing the
> tail of the log. I'll see if I can reproduce it.

Ok, I've just reproduced it. From some tracing:

touch-3340  [004] 1844935.582716: xfs_log_reserve: dev 1:0 type CREATE t_ocnt 2 t_cnt 2 t_curr_res 167148 t_unit_res 167148 t_flags XLOG_TIC_INITED|XLOG_TIC_PERM_RESERV reserve_headq 0xffff88010f489c78 write_headq 0x(null) grant_reserve_cycle 314 grant_reserve_bytes 24250680 grant_write_cycle 314 grant_write_bytes 24250680 curr_cycle 314 curr_block 44137 tail_cycle 313 tail_block 48532

The key part here is this:

curr_cycle 314 curr_block 44137 tail_cycle 313 tail_block 48532

This says the tail of the log is roughly 62MB behind the head, i.e.
the log is full and we are waiting for tail pushing to write the
item holding the tail in place to disk so it can then be moved
forward. That's better than an accounting problem, at least.
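
(Sanity check on that number, assuming the cycle/block values in the
trace are in 512-byte basic blocks and the log is the 64MB one from
the mkfs line in the reproducer:)

#include <stdio.h>

int main(void)
{
	long log_blocks = 64L * 1024 * 1024 / 512;	/* 64MB log */
	long dist = (314 - 313) * log_blocks + 44137 - 48532;

	/* prints: 126677 blocks = 61.9 MB */
	printf("%ld blocks = %.1f MB\n", dist,
			dist * 512 / (1024.0 * 1024.0));
	return 0;
}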

So what is holding the tail in place? The first item on the AIL
appears to be:

xfsaild/ram0-2997  [000] 1844800.800764: xfs_buf_cond_lock: dev 1:0 bno 0x280120 len 0x2000 hold 3 pincount 0 lock 0 flags ASYNC|DONE|STALE|PAGE_CACHE caller xfs_buf_item_trylock

A stale buffer. Given that the next objects show this trace:

xfsaild/ram0-2997  [000] 1844800.800767: xfs_ilock_nowait: dev 1:0 ino 0x500241 flags ILOCK_SHARED caller xfs_inode_item_trylock
xfsaild/ram0-2997  [000] 1844800.800768: xfs_buf_rele: dev 1:0 bno 0x280120 len 0x2000 hold 4 pincount 0 lock 0 flags ASYNC|DONE|STALE|PAGE_CACHE caller _xfs_buf_find
xfsaild/ram0-2997  [000] 1844800.800769: xfs_iunlock: dev 1:0 ino 0x500241 flags ILOCK_SHARED caller xfs_inode_item_pushbuf

we see the next item on the AIL is an inode, but the trace is
followed by a release on the original buffer, which tells me the
inode is flush locked and it returned XFS_ITEM_PUSHBUF to push the
inode buffer out. That results in xfs_inode_item_pushbuf() being
called, and that tries to lock the inode buffer to flush it.
xfs_buf_rele is called if the trylock on the buffer fails.

IOWs, this looks to be another problem with inode cluster freeing.

Ok, so we can't flush the buffer because it is locked. Why is it
locked? Well, that is unclear as yet. None of the blocked processes
should be holding an inode buffer locked, and a stale buffer should
be unlocked during transaction commit and not live longer than
the log IO that writes the transaction to disk. That is, it should
not get locked again before everything is freed up.

That's as much as I can get from post-mortem analysis - I need to
capture a trace that spans the lockup to catch what happens
to the buffer that we are hung on. That will have to wait until the
morning....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: VFS scalability git tree
  2010-07-27 13:18       ` Dave Chinner
@ 2010-07-27 15:09         ` Nick Piggin
  -1 siblings, 0 replies; 76+ messages in thread
From: Nick Piggin @ 2010-07-27 15:09 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Nick Piggin, linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar,
	John Stultz

On Tue, Jul 27, 2010 at 11:18:10PM +1000, Dave Chinner wrote:
> On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote:
> > On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote:
> > > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > > > I'm pleased to announce I have a git tree up of my vfs scalability work.
> > > > 
> > > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> > > > 
> > > > Branch vfs-scale-working
> > > 
> > > With a production build (i.e. no lockdep, no xfs debug), I'll
> > > run the same fs_mark parallel create/unlink workload to show
> > > scalability as I ran here:
> > > 
> > > http://oss.sgi.com/archives/xfs/2010-05/msg00329.html
> > 
> > I've made a similar setup, 2s8c machine, but using 2GB ramdisk instead
> > of a real disk (I don't have easy access to a good disk setup ATM, but
> > I guess we're more interested in code above the block layer anyway).
> > 
> > Made an XFS on /dev/ram0 with 16 ags, 64MB log, otherwise same config as
> > yours.
> 
> As a personal preference, I don't like testing filesystem performance
> on ramdisks because it hides problems caused by changes in IO
> latency. I'll come back to this later.

Very true, although it's useful if you don't have fast disks, and it
can be good for triggering different races than disks tend to.

So I still want to get to the bottom of the slowdown you saw on
vfs-scale.


> > I found that performance is a little unstable, so I sync and echo 3 >
> > drop_caches between each run.
> 
> Quite possibly because of the smaller log - that will cause more
> frequent pushing on the log tail and hence I/O patterns will vary a
> bit...

Well... I think the test case (or how I'm running it) is simply a
bit unstable. I mean, there are subtle interactions all the way from
the CPU scheduler to the disk, so when I say unstable I'm not
particularly blaming XFS :)

 
> Also, keep in mind that delayed logging is shiny and new - it has
> increased XFS metadata performance and parallelism by an order of
> magnitude and so we're really seeing a bunch of brand new issues
> that have never been seen before with this functionality.  As such,
> there's still some interactions I haven't got to the bottom of with
> delayed logging - it's stable enough to use and benchmark and won't
> corrupt anything, but there are still some warts we need to
> solve. The difficulty (as always) is in reliably reproducing the bad
> behaviour.

Sure, and I didn't see any corruptions, it seems pretty stable and
scalability is better than other filesystems. I'll see if I can
give a better recipe to reproduce the 'livelock'ish behaviour.

 
> > I then did 10 runs of -n 20000 but with -L 4 (4 iterations) which did
> > start to fill up memory and cause reclaim during the 2nd and subsequent
> > iterations.
> 
> I haven't used this mode, so I can't really comment on the results
> you are seeing.

It's a bit strange. Help says it should clear inodes between iterations
(without the -k flag), but it does not seem to.

 
> > > enabled. ext4 is using default mkfs and mount parameters except for
> > > barrier=0. All numbers are averages of three runs.
> > > 
> > > 	fs_mark rate (thousands of files/second)
> > >            2.6.35-rc5   2.6.35-rc5-scale
> > > threads    xfs   ext4     xfs    ext4
> > >   1         20    39       20     39
> > >   2         35    55       35     57
> > >   4         60    41       57     42
> > >   8         79     9       75      9
> > > 
> > > ext4 is getting IO bound at more than 2 threads, so apart from
> > > pointing out that XFS is 8-9x faster than ext4 at 8 thread, I'm
> > > going to ignore ext4 for the purposes of testing scalability here.
> > > 
> > > For XFS w/ delayed logging, 2.6.35-rc5 is only getting to about 600%
> > > CPU and with Nick's patches it's about 650% (10% higher) for
> > > slightly lower throughput.  So at this class of machine for this
> > > workload, the changes result in a slight reduction in scalability.
> > 
> > I wonder if these results are stable. It's possible that changes in
> > reclaim behaviour are causing my patches to require more IO for a
> > given unit of work?
> 
> More likely that's the result of using a smaller log size because it
> will require more frequent metadata pushes to make space for new
> transactions.

I was just checking whether your numbers are stable (where you
saw some slowdown with vfs-scale patches), and what could be the
cause. I agree that running real disks could make big changes in
behaviour.


> > I was seeing XFS 'livelock' in reclaim more with my patches, it
> > could be due to more parallelism now being allowed from the vfs and
> > reclaim.
> >
> > Based on my above numbers, I don't see that rcu-inodes is causing a
> > problem, and in terms of SMP scalability, there is really no way that
> > vanilla is more scalable, so I'm interested to see where this slowdown
> > is coming from.
> 
> As I said initially, ram disks hide IO latency changes resulting
> from increased numbers of IOs or increases in seek distances.  My
> initial guess is that the change in inode reclaim behaviour is causing
> different IO patterns and more seeks under reclaim, because the zone
> based reclaim is no longer reclaiming inodes in the order
> they are created (i.e. we are not doing sequential inode reclaim any
> more).

Sounds plausible. I'll do more investigations along those lines.

 
> FWIW, I use PCP monitoring graphs to correlate behavioural changes
> across different subsystems because it is far easier to relate
> information visually than it is by looking at raw numbers or traces.
> I think this graph shows the effect of reclaim on performance
> most clearly:
> 
> http://userweb.kernel.org/~dgc/shrinker-2.6.36/fs_mark-2.6.35-rc3-context-only-per-xfs-batch6-16x500-xfs.png

I haven't actually used that, it looks interesting.

 
> It's pretty clear that when the inode/dentry cache shrinkers are
> running, sustained create/unlink performance goes right down. From a
> different tab not in the screen shot (the other "test-4" tab), I
> could see CPU usage also goes down and the disk iops go way up
> whenever the create/unlink performance dropped. This same behaviour
> happens with the vfs-scale patchset, so it's not related to lock
> contention - just aggressive reclaim of still-dirty inodes.
> 
> FYI, the patch under test there had the XFS shrinker ignoring 7 out
> of 8 shrinker calls and then on the 8th call doing the work of all
> previous calls, i.e. emulating SHRINK_BATCH = 1024. Interestingly
> enough, that one change reduced the runtime of the 8m inode
> create/unlink load by ~25% (from ~24min to ~18min).

Hmm, interesting. Well that's naturally configurable with the
shrinker API changes I'm hoping to have merged. I'll plan to push
that ahead of the vfs-scale patches of course.


> That is by far the largest improvement I've been able to obtain from
> modifying the shrinker code, and it is from those sorts of
> observations that I think that IO being issued from reclaim is
> currently the most significant performance limiting factor for XFS
> in this sort of workload....

How is the xfs inode reclaim tied to linux inode reclaim? Does the
xfs inode not become reclaimable until some time after the linux inode
is reclaimed? Or what?

Do all or most of the xfs inodes require IO before being reclaimed
during this test? I wonder if you could throttle them a bit or sort
them somehow so that they tend to be cleaned by writeout and reclaim
just comes after and removes the clean ones, like pagecache reclaim
is (supposed) to work?


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: VFS scalability git tree
  2010-07-27 15:09         ` Nick Piggin
  (?)
@ 2010-07-28  4:59           ` Dave Chinner
  -1 siblings, 0 replies; 76+ messages in thread
From: Dave Chinner @ 2010-07-28  4:59 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Nick Piggin, linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar,
	John Stultz

On Wed, Jul 28, 2010 at 01:09:08AM +1000, Nick Piggin wrote:
> On Tue, Jul 27, 2010 at 11:18:10PM +1000, Dave Chinner wrote:
> > On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote:
> > > On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote:
> > solve. The difficulty (as always) is in reliably reproducing the bad
> > behaviour.
> 
> Sure, and I didn't see any corruptions, it seems pretty stable and
> scalability is better than other filesystems. I'll see if I can
> give a better recipe to reproduce the 'livelock'ish behaviour.

Well, stable is a good start :)

> > > > 	fs_mark rate (thousands of files/second)
> > > >            2.6.35-rc5   2.6.35-rc5-scale
> > > > threads    xfs   ext4     xfs    ext4
> > > >   1         20    39       20     39
> > > >   2         35    55       35     57
> > > >   4         60    41       57     42
> > > >   8         79     9       75      9
> > > > 
> > > > ext4 is getting IO bound at more than 2 threads, so apart from
> > > > pointing out that XFS is 8-9x faster than ext4 at 8 thread, I'm
> > > > going to ignore ext4 for the purposes of testing scalability here.
> > > > 
> > > > For XFS w/ delayed logging, 2.6.35-rc5 is only getting to about 600%
> > > > CPU and with Nick's patches it's about 650% (10% higher) for
> > > > slightly lower throughput.  So at this class of machine for this
> > > > workload, the changes result in a slight reduction in scalability.
> > > 
> > > I wonder if these results are stable. It's possible that changes in
> > > reclaim behaviour are causing my patches to require more IO for a
> > > given unit of work?
> > 
> > More likely that's the result of using a smaller log size because it
> > will require more frequent metadata pushes to make space for new
> > transactions.
> 
> I was just checking whether your numbers are stable (where you
> saw some slowdown with vfs-scale patches), and what could be the
> cause. I agree that running real disks could make big changes in
> behaviour.

Yeah, the numbers are repeatable to within about +/-5%. I generally
don't bother with optimisations that result in gains/losses less
than that because IO benchmarks that reliably reproduce results with
better repeatability than that are few and far between.

> > FWIW, I use PCP monitoring graphs to correlate behavioural changes
> > across different subsystems because it is far easier to relate
> > information visually than it is by looking at raw numbers or traces.
> > I think this graph shows the effect of reclaim on performance
> > most clearly:
> > 
> > http://userweb.kernel.org/~dgc/shrinker-2.6.36/fs_mark-2.6.35-rc3-context-only-per-xfs-batch6-16x500-xfs.png
> 
> I haven't actually used that, it looks interesting.

The archiving side of PCP is the most useful part, I find, i.e. being
able to record the metrics into a file and analyse them with pmchart
or other tools after the fact...

> > That is by far the largest improvement I've been able to obtain from
> > modifying the shrinker code, and it is from those sorts of
> > observations that I think that IO being issued from reclaim is
> > currently the most significant performance limiting factor for XFS
> > in this sort of workload....
> 
> How is the xfs inode reclaim tied to linux inode reclaim? Does the
> xfs inode not become reclaimable until some time after the linux inode
> is reclaimed? Or what?

The struct xfs_inode embeds a struct inode like so:

struct xfs_inode {
	.....
	struct inode	i_inode;
}

so they are the same chunk of memory. XFS does not use the VFS inode
hashes for finding inodes - that's what the per-ag radix trees are
used for. The xfs_inode lives longer than the struct inode because
we do non-trivial work after the VFS "reclaims" the struct inode.
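
(i.e. the usual container_of() embedding - getting back from the VFS
inode to the XFS inode is roughly this; see fs/xfs for the real
XFS_I() helper:)

static inline struct xfs_inode *XFS_I(struct inode *inode)
{
	return container_of(inode, struct xfs_inode, i_inode);
}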

For example, when an inode is unlinked we do not truncate or free
the inode until after the VFS has finished with it - the inode
remains on the unlinked list (orphaned in ext3 terms) from the time
it is unlinked by the VFS to the time the last VFS reference goes
away. When XFS gets it, XFS then issues the inactive transaction
that takes the inode off the unlinked list and marks it free in the
inode alloc btree. This transaction is asynchronous and dirties the
xfs inode. Finally XFS will mark the inode as reclaimable via a
radix tree tag. The final processing of the inode is then done via a
background reclaim walk from xfssyncd (every 30s), where it will do
non-blocking operations to finalize reclaim. It may take several
passes to actually reclaim the inode, e.g. one pass to force the log
if the inode is pinned, another pass to flush the inode to disk if
it is dirty and not stale, and then another pass to reclaim the
inode once clean. There may be multiple passes in between where the
inode is skipped because those operations have not completed.
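
Very roughly, each pass over a reclaimable inode does something like
this - pseudo-code only, the helper names are made up and the real
xfs_reclaim_inode() has locking, error handling, etc.:

/* One non-blocking pass over a reclaimable inode; call again later
 * if it returns -EAGAIN.  Illustrative sketch only. */
static int reclaim_pass(struct xfs_inode *ip)
{
	if (inode_is_pinned(ip)) {
		force_log_async(ip);		/* unpin via log I/O */
		return -EAGAIN;			/* skip, retry next walk */
	}
	if (inode_is_dirty(ip) && !inode_is_stale(ip)) {
		flush_inode_to_its_buffer(ip);	/* queue writeback */
		return -EAGAIN;
	}
	free_xfs_inode(ip);			/* clean: reclaim it now */
	return 0;
}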

And to top it all off, if the inode is looked up again (cache hit)
while in the reclaimable state, it will be removed from the reclaim
state and reused immediately. In this case we don't need to continue
the reclaim processing - other things will ensure all the correct
information will go to disk.

> Do all or most of the xfs inodes require IO before being reclaimed
> during this test?

Yes, because all the inodes are being dirtied and they are being
reclaimed faster than background flushing expires them.

> I wonder if you could throttle them a bit or sort
> them somehow so that they tend to be cleaned by writeout and reclaim
> just comes after and removes the clean ones, like pagecache reclaim
> is (supposed) to work.?

The whole point of using the radix trees is to get nicely sorted
reclaim IO - inodes are indexed by number, and the radix tree walk
gives us ascending inode number (and hence ascending block number)
reclaim - and the background reclaim allows optimal flushing to
occur by aggregating all the IO into delayed write metadata buffers
so they can be sorted and flushed to the elevator by the xfsbufd in
the most optimal manner possible.
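
Conceptually the walk is just a tagged radix tree gang lookup in
ascending inode-number order - something like the sketch below
(illustrative only; try_to_reclaim() is a made-up name, and the real
per-ag walk also deals with RCU, locking and batching of the tag
lookups):

void walk_reclaimable_inodes(struct xfs_perag *pag)
{
	struct xfs_inode *batch[32];
	unsigned long first_index = 0;
	int nr, i;

	do {
		/* returns reclaim-tagged inodes in ascending inode order */
		nr = radix_tree_gang_lookup_tag(&pag->pag_ici_root,
				(void **)batch, first_index,
				ARRAY_SIZE(batch), XFS_ICI_RECLAIM_TAG);
		for (i = 0; i < nr; i++) {
			first_index = batch[i]->i_ino + 1;
			try_to_reclaim(batch[i]);
		}
	} while (nr > 0);
}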

The shrinker does preempt this somewhat, which is why delaying the
XFS shrinker's work appears to improve things a lot. If the shrinker
is not running, the background reclaim does exactly what you are
suggesting.

However, I don't think the increase in iops is caused by the XFS
inode shrinker - I think that it is the VFS cache shrinkers. If you
look at the graphs in the link above, performance doesn't
decrease when the XFS inode cache is being shrunk (top chart, yellow
trace) - it drops when the vfs caches are being shrunk (middle
chart). I haven't correlated the behaviour any further than that
because I haven't had time.

FWIW, all this background reclaim, radix tree reclaim tagging and
walking, embedded struct inodes, etc. is all relatively new code.
The oldest bit of it was introduced in 2.6.31 (I think) and so a
significant part of what we are exploring here is uncharted
territory. The changes to reclaim, etc. are partially responsible for
the scalability we are getting from delayed logging, but there is
certainly room for improvement....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: VFS scalability git tree
@ 2010-07-28  4:59           ` Dave Chinner
  0 siblings, 0 replies; 76+ messages in thread
From: Dave Chinner @ 2010-07-28  4:59 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Nick Piggin, linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar,
	John Stultz

On Wed, Jul 28, 2010 at 01:09:08AM +1000, Nick Piggin wrote:
> On Tue, Jul 27, 2010 at 11:18:10PM +1000, Dave Chinner wrote:
> > On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote:
> > > On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote:
> > solve. The difficulty (as always) is in reliably reproducing the bad
> > behaviour.
> 
> Sure, and I didn't see any corruptions, it seems pretty stable and
> scalability is better than other filesystems. I'll see if I can
> give a better recipe to reproduce the 'livelock'ish behaviour.

Well, stable is a good start :)

> > > > 	fs_mark rate (thousands of files/second)
> > > >            2.6.35-rc5   2.6.35-rc5-scale
> > > > threads    xfs   ext4     xfs    ext4
> > > >   1         20    39       20     39
> > > >   2         35    55       35     57
> > > >   4         60    41       57     42
> > > >   8         79     9       75      9
> > > > 
> > > > ext4 is getting IO bound at more than 2 threads, so apart from
> > > > pointing out that XFS is 8-9x faster than ext4 at 8 thread, I'm
> > > > going to ignore ext4 for the purposes of testing scalability here.
> > > > 
> > > > For XFS w/ delayed logging, 2.6.35-rc5 is only getting to about 600%
> > > > CPU and with Nick's patches it's about 650% (10% higher) for
> > > > slightly lower throughput.  So at this class of machine for this
> > > > workload, the changes result in a slight reduction in scalability.
> > > 
> > > I wonder if these results are stable. It's possible that changes in
> > > reclaim behaviour are causing my patches to require more IO for a
> > > given unit of work?
> > 
> > More likely that's the result of using a smaller log size because it
> > will require more frequent metadata pushes to make space for new
> > transactions.
> 
> I was just checking whether your numbers are stable (where you
> saw some slowdown with vfs-scale patches), and what could be the
> cause. I agree that running real disks could make big changes in
> behaviour.

Yeah, the numbers are repeatable to within about +/-5%. I generally
don't bother with optimisations that result in gains/losses less
than that because IO benchmarks that reliably reproduce results with
better repeatability than that are few and far between.

> > FWIW, I use PCP monitoring graphs to correlate behavioural changes
> > across different subsystems because it is far easier to relate
> > information visually than it is by looking at raw numbers or traces.
> > I think this graph shows the effect of relcaim on performance
> > most clearly:
> > 
> > http://userweb.kernel.org/~dgc/shrinker-2.6.36/fs_mark-2.6.35-rc3-context-only-per-xfs-batch6-16x500-xfs.png
> 
> I haven't actually used that, it looks interesting.

The archiving side of PCP is the most useful, I find. i.e. being able
to record the metrics into a file and  analyse them with pmchart or
other tools after the fact...

> > That is by far the largest improvement I've been able to obtain from
> > modifying the shrinker code, and it is from those sorts of
> > observations that I think that IO being issued from reclaim is
> > currently the most significant performance limiting factor for XFS
> > in this sort of workload....
> 
> How is the xfs inode reclaim tied to linux inode reclaim? Does the
> xfs inode not become reclaimable until some time after the linux inode
> is reclaimed? Or what?

The struct xfs_inode embeds a struct inode like so:

struct xfs_inode {
	.....
	struct inode	i_inode;
}

so they are the same chunk of memory. XFS does not use the VFS inode
hashes for finding inodes - that's what the per-ag radix trees are
used for. The xfs_inode lives longer than the struct inode because
we do non-trivial work after the VFS "reclaims" the struct inode.

For example, when an inode is unlinked
do not truncate or free the inode until after the VFS has finished with
it - the inode remains on the unlinked list (orphaned in ext3 terms)
from the time is is unlinked by the VFS to the time the last VFs
reference goes away. When XFS gets it, XFS then issues the inactive
transaction that takes the inode off the unlinked list and marks it
free in the inode alloc btree. This transaction is asynchronous and
dirties the xfs inode. Finally XFS will mark the inode as
reclaimable via a radix tree tag. The final processing of the inode
is then done via eaither a background relcaim walk from xfssyncd
(every 30s) where it will do non-blocking operations to finalN?ze
reclaim. It may take several passes to actually reclaim the inode.
e.g. one pass to force the log if the inode is pinned, another pass
to flush the inode to disk if it is dirty and not stale, and then
another pass to reclaim the inode once clean. There may be multiple
passes inbetween where the inode is skipped because those operations
have not completed.

And to top it all off, if the inode is looked up again (cache hit)
while in the reclaimable state, it will be removed from the reclaim
state and reused immediately. in this case we don't need to continue
the reclaim processing other things will ensure all the correct
information will go to disk.

> Do all or most of the xfs inodes require IO before being reclaimed
> during this test?

Yes, because all the inode are being dirtied and they are being
reclaimed faster than background flushing expires them.

> I wonder if you could throttle them a bit or sort
> them somehow so that they tend to be cleaned by writeout and reclaim
> just comes after and removes the clean ones, like pagecache reclaim
> is (supposed) to work.?

The whole point of using the radix trees is to get nicely sorted
reclaim IO - inodes are indexed by number, and the radix tree walk
gives us ascending inode number (and hence ascending block number)
reclaim - and the background reclaim allows optimal flushing to
occur by aggregating all the IO into delayed write metadata buffers
so they can be sorted and flushed to the elevator by the xfsbufd in
the most optimal manner possible.

The shrinker does preempt this somewhat, which is why delaying the
XFS shrinker's work appears to improve things alot. If the shrinker
is not running, the the background reclaim does exactly what you are
suggesting.

However, I don't think the increase in iops is caused by the XFS
inode shrinker - I think that it is the VFS cache shrinkers. If you
look at the the graphs in the link above, preformance doesn't
decrease when the XFS inode cache is being shrunk (top chart, yellow
trace) - it drops when the vfs caches are being shrunk (middle
chart). I haven't correlated the behaviour any further than that
because I haven't had time.

FWIW, all this background reclaim, radix tree reclaim tagging and
walking, embedded struct inodes, etc is all relatively new code.
The oldest bit of it was introduced in 2.6.31 (I think) and so a
significant part of what we are exploring here is uncharted
territory. The changes to relcaim, etc are aprtially reponsible for
the scalabilty we are geting from delayed logging, but there is
certainly room for improvement....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: VFS scalability git tree
  2010-07-26  5:41   ` Nick Piggin
@ 2010-07-28 10:24     ` Nick Piggin
  -1 siblings, 0 replies; 76+ messages in thread
From: Nick Piggin @ 2010-07-28 10:24 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar, John Stultz,
	Dave Chinner, KOSAKI Motohiro, Michael Neuling

On Mon, Jul 26, 2010 at 03:41:11PM +1000, Nick Piggin wrote:
> On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > I'm pleased to announce I have a git tree up of my vfs scalability work.
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> 
> Pushed several fixes and improvements
> o XFS bugs fixed by Dave
> o dentry and inode stats bugs noticed by Dave
> o vmscan shrinker bugs fixed by KOSAKI san
> o compile bugs noticed by John
> o a few attempts to improve powerpc performance (eg. reducing smp_rmb())
> o scalability improvements for rename_lock

Yet another result on my small 2s8c Opteron. This time, the
re-aim benchmark configured as described here:

http://ertos.nicta.com.au/publications/papers/Chubb_Williams_05.pdf

It is using ext2 on ramdisk and an IO intensive workload, with fsync
activity.

I did 10 runs on each, and took the max jobs/sec of each run.

    N           Min           Max        Median           Avg        Stddev
x  10       2598750       2735122     2665384.6     2653353.8     46421.696
+  10     3337297.3     3484687.5     3410689.7     3397763.8     49994.631
Difference at 95.0% confidence
        744410 +/- 45327.3
        28.0554% +/- 1.7083%

Average is 2653K jobs/s for vanilla, versus 3398K jobs/s for vfs-scale,
or 28% speedup.
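
For reference, the confidence figures above follow from a pooled-variance
Student's t interval on the two samples. A minimal standalone sketch,
assuming that method and using 2.101 as the two-sided 95% t quantile for
18 degrees of freedom:

#include <math.h>
#include <stdio.h>

int main(void)
{
	double n1 = 10, mean1 = 2653353.8, sd1 = 46421.696;	/* vanilla */
	double n2 = 10, mean2 = 3397763.8, sd2 = 49994.631;	/* vfs-scale */
	double t = 2.101;	/* t quantile for n1 + n2 - 2 = 18 d.o.f. */

	/* pooled standard deviation of the two samples */
	double sp = sqrt(((n1 - 1) * sd1 * sd1 + (n2 - 1) * sd2 * sd2) /
			 (n1 + n2 - 2));
	/* standard error of the difference of the means */
	double se = sp * sqrt(1.0 / n1 + 1.0 / n2);
	double diff = mean2 - mean1;

	printf("difference %.0f +/- %.1f at 95%% confidence (%.2f%% +/- %.2f%%)\n",
	       diff, t * se, 100.0 * diff / mean1, 100.0 * t * se / mean1);
	return 0;
}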

The profile is interesting. It is known to be inode_lock intensive, but
we also see here that it is do_lookup intensive, due to cacheline bouncing
in common elements of path lookups.

Vanilla:
# Overhead  Symbol
# ........  ......
#
     7.63%  [k] __d_lookup
            |          
            |--88.59%-- do_lookup
            |--9.75%-- __lookup_hash
            |--0.89%-- d_lookup

     7.17%  [k] _raw_spin_lock
            |          
            |--11.07%-- _atomic_dec_and_lock
            |          |          
            |          |--53.73%-- dput
            |           --46.27%-- iput
            |          
            |--9.85%-- __mark_inode_dirty
            |          |          
            |          |--46.25%-- ext2_new_inode
            |          |--25.32%-- __set_page_dirty
            |          |--18.27%-- nobh_write_end
            |          |--6.91%-- ext2_new_blocks
            |          |--3.12%-- ext2_unlink
            |          
            |--7.69%-- ext2_new_inode
            |          
            |--6.84%-- insert_inode_locked
            |          ext2_new_inode
            |          
            |--6.56%-- new_inode
            |          ext2_new_inode
            |          
            |--5.61%-- writeback_single_inode
            |          sync_inode
            |          generic_file_fsync
            |          ext2_fsync
            |          
            |--5.13%-- dput
            |--3.75%-- generic_delete_inode
            |--3.56%-- __d_lookup
            |--3.53%-- ext2_free_inode
            |--3.40%-- sync_inode
            |--2.71%-- d_instantiate
            |--2.36%-- d_delete
            |--2.25%-- inode_sub_bytes
            |--1.84%-- file_move
            |--1.52%-- file_kill
            |--1.36%-- ext2_new_blocks
            |--1.34%-- ext2_create
            |--1.34%-- d_alloc
            |--1.11%-- do_lookup
            |--1.07%-- iput
            |--1.05%-- __d_instantiate

     4.19%  [k] mutex_spin_on_owner
            |          
            |--99.92%-- __mutex_lock_slowpath
            |          mutex_lock
            |          |          
            |          |--56.45%-- do_unlinkat
            |          |          sys_unlink
            |          |          
            |           --43.55%-- do_last
            |                     do_filp_open

     2.96%  [k] _atomic_dec_and_lock
            |          
            |--58.18%-- dput
            |--31.02%-- mntput_no_expire
            |--3.30%-- path_put
            |--3.09%-- iput
            |--2.69%-- link_path_walk
            |--1.02%-- fput

     2.73%  [k] copy_user_generic_string
     2.67%  [k] __mark_inode_dirty
     2.65%  [k] link_path_walk
     2.63%  [k] mark_buffer_dirty
     1.72%  [k] __memcpy
     1.62%  [k] generic_getxattr
     1.50%  [k] acl_permission_check
     1.30%  [k] __find_get_block
     1.30%  [k] __memset
     1.17%  [k] ext2_find_entry
     1.09%  [k] ext2_new_inode
     1.06%  [k] system_call
     1.01%  [k] kmem_cache_free
     1.00%  [k] dput


In vfs-scale, most of the spinlock contention and path lookup cost is
gone. Contention for parent i_mutex (and d_lock) for creat/unlink
operations is now at the top of the profile.

A lot of the spinlock overhead seems to be not so much contention as
the cost of the atomics. Down at 3% it is much less of a problem than
it was, though.

We may run into a bit of contention on the per-bdi inode dirty/io
list lock, with just a single ramdisk device (dirty/fsync activity
will hit this lock), but it is really not worth worrying about at
the moment.

# Overhead  Symbol
# ........  ......
#
     5.67%  [k] mutex_spin_on_owner
            |          
            |--99.96%-- __mutex_lock_slowpath
            |          mutex_lock
            |          |          
            |          |--58.63%-- do_unlinkat
            |          |          sys_unlink
            |          |          
            |           --41.37%-- do_last
            |                     do_filp_open

     3.93%  [k] __mark_inode_dirty
     3.43%  [k] copy_user_generic_string
     3.31%  [k] link_path_walk
     3.15%  [k] mark_buffer_dirty
     3.11%  [k] _raw_spin_lock
            |          
            |--11.03%-- __mark_inode_dirty
            |--10.54%-- ext2_new_inode
            |--7.60%-- ext2_free_inode
            |--6.33%-- inode_sub_bytes
            |--6.27%-- ext2_new_blocks
            |--5.80%-- generic_delete_inode
            |--4.09%-- ext2_create
            |--3.62%-- writeback_single_inode
            |--2.92%-- sync_inode
            |--2.81%-- generic_drop_inode
            |--2.46%-- iput
            |--1.86%-- dput
            |--1.80%-- __dquot_alloc_space
            |--1.61%-- __mutex_unlock_slowpath
            |--1.59%-- generic_file_fsync
            |--1.57%-- __d_instantiate
            |--1.55%-- __set_page_dirty_buffers
            |--1.36%-- d_alloc_and_lookup
            |--1.23%-- do_path_lookup
            |--1.10%-- ext2_free_blocks

     2.13%  [k] __memset
     2.12%  [k] __memcpy
     1.98%  [k] __d_lookup_rcu
     1.46%  [k] generic_getxattr
     1.44%  [k] ext2_find_entry
     1.41%  [k] __find_get_block
     1.27%  [k] kmem_cache_free
     1.25%  [k] ext2_new_inode
     1.23%  [k] system_call
     1.02%  [k] ext2_add_link
     1.01%  [k] strncpy_from_user
     0.96%  [k] kmem_cache_alloc
     0.95%  [k] find_get_page
     0.94%  [k] sysret_check
     0.88%  [k] __d_lookup
     0.75%  [k] ext2_delete_entry
     0.70%  [k] generic_file_aio_read
     0.67%  [k] generic_file_buffered_write
     0.63%  [k] ext2_new_blocks
     0.62%  [k] __percpu_counter_add
     0.59%  [k] __bread
     0.58%  [k] __wake_up_bit
     0.58%  [k] __mutex_lock_slowpath
     0.56%  [k] __ext2_write_inode
     0.55%  [k] ext2_get_blocks

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: VFS scalability git tree
  2010-07-27  8:06       ` Nick Piggin
@ 2010-07-28 12:57         ` Dave Chinner
  -1 siblings, 0 replies; 76+ messages in thread
From: Dave Chinner @ 2010-07-28 12:57 UTC (permalink / raw)
  To: Nick Piggin; +Cc: xfs, linux-fsdevel

On Tue, Jul 27, 2010 at 06:06:32PM +1000, Nick Piggin wrote:
> On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote:
> > On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote:
> > > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > > > I'm pleased to announce I have a git tree up of my vfs scalability work.
> > > > 
> > > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> > > > 
> > > > Branch vfs-scale-working
> > > 
> > > With a production build (i.e. no lockdep, no xfs debug), I'll
> > > run the same fs_mark parallel create/unlink workload to show
> > > scalability as I ran here:
> > > 
> > > http://oss.sgi.com/archives/xfs/2010-05/msg00329.html
> > 
> > I've made a similar setup, 2s8c machine, but using 2GB ramdisk instead
> > of a real disk (I don't have easy access to a good disk setup ATM, but
> > I guess we're more interested in code above the block layer anyway).
> > 
> > Made an XFS on /dev/ram0 with 16 ags, 64MB log, otherwise same config as
> > yours.
> > 
> > I found that performance is a little unstable, so I sync and echo 3 >
> > drop_caches between each run. When it starts reclaiming memory, things
> > get a bit more erratic (and XFS seemed to be almost livelocking for tens
> > of seconds in inode reclaim).
> 
> So about this XFS livelock type thingy. It looks like this, and happens
> periodically while running the above fs_mark benchmark requiring reclaim
> of inodes:
....

> Nothing much happening except 100% system time for seconds at a time
> (length of time varies). This is on a ramdisk, so it isn't waiting
> for IO.
> 
> During this time, lots of things are contending on the lock:
> 
>     60.37%         fs_mark  [kernel.kallsyms]   [k] __write_lock_failed
>      4.30%         kswapd0  [kernel.kallsyms]   [k] __write_lock_failed
>      3.70%         fs_mark  [kernel.kallsyms]   [k] try_wait_for_completion
>      3.59%         fs_mark  [kernel.kallsyms]   [k] _raw_write_lock
>      3.46%         kswapd1  [kernel.kallsyms]   [k] __write_lock_failed
>                    |
>                    --- __write_lock_failed
>                       |
>                       |--99.92%-- xfs_inode_ag_walk
>                       |          xfs_inode_ag_iterator
>                       |          xfs_reclaim_inode_shrink
>                       |          shrink_slab
>                       |          shrink_zone
>                       |          balance_pgdat
>                       |          kswapd
>                       |          kthread
>                       |          kernel_thread_helper
>                        --0.08%-- [...]
> 
>      3.02%         fs_mark  [kernel.kallsyms]   [k] _raw_spin_lock
>      1.82%         fs_mark  [kernel.kallsyms]   [k] _xfs_buf_find
>      1.16%         fs_mark  [kernel.kallsyms]   [k] memcpy
>      0.86%         fs_mark  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
>      0.75%         fs_mark  [kernel.kallsyms]   [k] xfs_log_commit_cil
>                    |
>                    --- xfs_log_commit_cil
>                        _xfs_trans_commit
>                       |
>                       |--60.00%-- xfs_remove
>                       |          xfs_vn_unlink
>                       |          vfs_unlink
>                       |          do_unlinkat
>                       |          sys_unlink
> 
> I'm not sure if there was a long-running read locker in there causing
> all the write lockers to fail, or if they are just running into one
> another.

The longest hold is in the inode cluster writeback
(xfs_iflush_cluster), but if there is no IO then I don't see how
that would be a problem.

I suspect that it might be caused by having several CPUs
all trying to run the shrinker at the same time and them all
starting at the same AG and therefore lockstepping and getting
nothing done because they are all scanning the same inodes.

Maybe a start AG rotor for xfs_inode_ag_iterator() is needed to
avoid this lockstepping.  I've attached a patch below to do this
- can you give it a try?

> But anyway, I hacked the following patch which seemed to
> improve that behaviour. I haven't run any throughput numbers on it yet,
> but I could if you're interested (and it's not completely broken!)

Batching is certainly something that I have been considering, but
apart from the excessive scanning bug, the per-ag inode tree lookups
have not featured prominently in any profiling I've done, so it
hasn't been a high priority.

Your patch looks like it will work fine, but I think it can be made a
lot cleaner. I'll have a closer look at this once I get to the bottom of
the dbench hang you are seeing....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: add an AG iterator start rotor

From: Dave Chinner <dchinner@redhat.com>

To prevent multiple CPUs from executing inode cache shrinkers on the same AG
at the same time, make every shrinker call start on a different AG. This will
mostly prevent concurrent shrinker calls from competing for the serialising
pag_ici_lock and lock-stepping reclaim.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_sync.c |   11 ++++++++++-
 fs/xfs/xfs_mount.h          |    1 +
 2 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index dfcbd98..5322105 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -181,11 +181,14 @@ xfs_inode_ag_iterator(
 	struct xfs_perag	*pag;
 	int			error = 0;
 	int			last_error = 0;
+	xfs_agnumber_t		start_ag;
 	xfs_agnumber_t		ag;
 	int			nr;
+	int			looped = 0;
 
 	nr = nr_to_scan ? *nr_to_scan : INT_MAX;
-	ag = 0;
+	start_ag = atomic_inc_return(&mp->m_agiter_rotor) % mp->m_sb.sb_agcount;
+	ag = start_ag;
 	while ((pag = xfs_inode_ag_iter_next_pag(mp, &ag, tag))) {
 		error = xfs_inode_ag_walk(mp, pag, execute, flags, tag,
 						exclusive, &nr);
@@ -197,6 +200,12 @@ xfs_inode_ag_iterator(
 		}
 		if (nr <= 0)
 			break;
+		if (ag >= mp->m_sb.sb_agcount) {
+			looped = 1;
+			ag = 0;
+		}
+		if (ag >= start_ag && looped)
+			break;
 	}
 	if (nr_to_scan)
 		*nr_to_scan = nr;
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 622da21..fae61bb 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -199,6 +199,7 @@ typedef struct xfs_mount {
 	__int64_t		m_update_flags;	/* sb flags we need to update
 						   on the next remount,rw */
 	struct shrinker		m_inode_shrink;	/* inode reclaim shrinker */
+	atomic_t		m_agiter_rotor; /* ag iterator start rotor */
 } xfs_mount_t;
 
 /*

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: XFS hang in xlog_grant_log_space
  2010-07-27 14:58           ` XFS hang in xlog_grant_log_space Dave Chinner
@ 2010-07-28 13:17             ` Dave Chinner
  2010-07-29 14:05               ` Nick Piggin
  0 siblings, 1 reply; 76+ messages in thread
From: Dave Chinner @ 2010-07-28 13:17 UTC (permalink / raw)
  To: Nick Piggin; +Cc: xfs

On Wed, Jul 28, 2010 at 12:58:09AM +1000, Dave Chinner wrote:
> On Tue, Jul 27, 2010 at 11:30:38PM +1000, Dave Chinner wrote:
> > On Tue, Jul 27, 2010 at 09:36:26PM +1000, Nick Piggin wrote:
> > > On Tue, Jul 27, 2010 at 06:06:32PM +1000, Nick Piggin wrote:
> > > On this same system, same setup (vanilla kernel with sha given below),
> > > I have now twice reproduced a complete hang in XFS. I can give more
> > > information, test patches or options etc if required.
> > > 
> > > setup.sh looks like this:
> > > #!/bin/bash
> > > modprobe rd rd_size=$[2*1024*1024]
> > > dd if=/dev/zero of=/dev/ram0 bs=4K
> > > mkfs.xfs -f -l size=64m -d agcount=16 /dev/ram0
> > > mount -o delaylog,logbsize=262144,nobarrier /dev/ram0 mnt
> > > 
> > > The 'dd' is required to ensure rd driver does not allocate pages
> > > during IO (which can lead to out of memory deadlocks). Running just
> > > involves changing into mnt directory and
> > > 
> > > while true
> > > do
> > >   sync
> > >   echo 3 > /proc/sys/vm/drop_caches
> > >   ../dbench -c ../loadfiles/client.txt -t20 8
> > >   rm -rf clients
> > > done
> > > 
> > > And wait for it to hang (happened in < 5 minutes here)
> > ....
> > > Call Trace:
> > >  [<ffffffff812361f8>] xlog_grant_log_space+0x158/0x3d0
> > 
> > It's waiting on log space to be freed up. Either there's an
> > accounting problem (possible), or you've got an xfslogd/xfsaild
> > spinning and not making progress competing log IOs or pushing the
> > tail of the log. I'll see if I can reproduce it.
> 
> Ok, I've just reproduced it. From some tracing:
> 
> touch-3340  [004] 1844935.582716: xfs_log_reserve: dev 1:0 type CREATE t_ocnt 2 t_cnt 2 t_curr_res 167148 t_unit_res 167148 t_flags XLOG_TIC_INITED|XLOG_TIC_PERM_RESERV reserve_headq 0xffff88010f489c78 write_headq 0x(null) grant_reserve_cycle 314 grant_reserve_bytes 24250680 grant_write_cycle 314 grant_write_bytes 24250680 curr_cycle 314 curr_block 44137 tail_cycle 313 tail_block 48532
> 
> The key part here is this:
> 
> curr_cycle 314 curr_block 44137 tail_cycle 313 tail_block 48532
> 
> This says the tail of the log is roughly 62MB behind the head, i.e.
> the log is full and we are waiting for tail pushing to write the
> item holding the tail in place to disk so it can then be moved
> forward. That's better than an accounting problem, at least.
> 
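
For reference, a rough standalone sketch of where the ~62MB figure comes
from, assuming XFS's 512-byte log basic blocks and the 64MB log used in
this setup (head and tail positions are the cycle/block pairs from the
trace above):

#include <stdio.h>

int main(void)
{
	const long log_blocks = 64 * 1024 * 1024 / 512;	/* 64MB log in 512B blocks */
	long head_cycle = 314, head_block = 44137;
	long tail_cycle = 313, tail_block = 48532;

	/* tail-to-head distance, accounting for the cycle wrap */
	long used = (head_cycle - tail_cycle) * log_blocks +
		    (head_block - tail_block);

	printf("log space used: %ld blocks, %.1f MB of 64 MB\n",
	       used, used * 512.0 / (1024 * 1024));	/* ~61.9 MB */
	return 0;
}
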
> So what is holding the tail in place? The first item on the AIL
> appears to be:
> 
> xfsaild/ram0-2997  [000] 1844800.800764: xfs_buf_cond_lock: dev 1:0 bno 0x280120 len 0x2000 hold 3 pincount 0 lock 0 flags ASYNC|DONE|STALE|PAGE_CACHE caller xfs_buf_item_trylock
> 
> A stale buffer. Given that the next objects show this trace:
> 
> xfsaild/ram0-2997  [000] 1844800.800767: xfs_ilock_nowait: dev 1:0 ino 0x500241 flags ILOCK_SHARED caller xfs_inode_item_trylock
> xfsaild/ram0-2997  [000] 1844800.800768: xfs_buf_rele: dev 1:0 bno 0x280120 len 0x2000 hold 4 pincount 0 lock 0 flags ASYNC|DONE|STALE|PAGE_CACHE caller _xfs_buf_find
> xfsaild/ram0-2997  [000] 1844800.800769: xfs_iunlock: dev 1:0 ino 0x500241 flags ILOCK_SHARED caller xfs_inode_item_pushbuf
> 
> we see the next item on the AIL is an inode but the trace is
> followed by a release on the original buffer, than tells me the
> inode is flush locked and it returned XFS_ITEM_PUSHBUF to push the
> inode buffer out. That results in xfs_inode_item_pushbuf() being
> called, and that tries to lock the inode buffer to flush it.
> xfs_buf_rele is called if the trylock on the buffer fails.
> 
> IOWs, this looks to be another problem with inode cluster freeing.
> 
> Ok, so we can't flush the buffer because it is locked. Why is it
> locked? Well, that is unclear as yet. None of the blocked processes
> should be holding an inode buffer locked, and a stale buffer should
> be unlocked during transaction commit and not live longer than
> the log IO that writes the transaction to disk. That is, it should
> not get locked again before everything is freed up.
> 
> That's as much as I can get from post-mortem analysis - I need to
> capture a trace that spans the lockup to catch what happens
> to the buffer that we are hung on. That will have to wait until the
> morning....

Ok, so I got a trace of all the inode and buffer locking and log
item operations, and the reason the hang has occurred can be seen
here:

dbench-3084  [007] 1877156.395784: xfs_buf_item_unlock_stale: dev 1:0 bno 0x80040 len 0x2000 hold 2 pincount 0 lock 0 flags |ASYNC|DONE|STALE|PAGE_CACHE recur 1 refcount 1 bliflags |STALE|INODE_ALLOC|STALE_INODE lidesc 0x(null) liflags IN_AIL

The key point in this trace is that when we are unlocking a stale
buffer during transaction commit, we don't actually unlock it. What
we do is:

        /*
         * If the buf item is marked stale, then don't do anything.  We'll
         * unlock the buffer and free the buf item when the buffer is unpinned
         * for the last time.
         */
        if (bip->bli_flags & XFS_BLI_STALE) {
                trace_xfs_buf_item_unlock_stale(bip);
                ASSERT(bip->bli_format.blf_flags & XFS_BLF_CANCEL);
                if (!aborted) {
                        atomic_dec(&bip->bli_refcount);
                        return;
                }
        }

But from the above trace it can be seen that the buffer pincount is zero.
Hence it will never get an unpin callback, and hence never get unlocked.
As a result, this process here is the one that is stuck on the buffer:

dbench        D 0000000000000007     0  3084      1 0x00000004
 ffff880104e07608 0000000000000086 ffff880104e075b8 0000000000014000
 ffff880104e07fd8 0000000000014000 ffff880104e07fd8 ffff880104c1d770
 0000000000014000 0000000000014000 ffff880104e07fd8 0000000000014000
Call Trace:
 [<ffffffff817e59cd>] schedule_timeout+0x1ed/0x2c0
 [<ffffffff810de9de>] ? ring_buffer_lock_reserve+0x9e/0x160
 [<ffffffff817e690e>] __down+0x7e/0xc0
 [<ffffffff812f41d5>] ? _xfs_buf_find+0x145/0x290
 [<ffffffff810a05d0>] down+0x40/0x50
 [<ffffffff812f41d5>] ? _xfs_buf_find+0x145/0x290
 [<ffffffff812f314d>] xfs_buf_lock+0x4d/0x110
 [<ffffffff812f41d5>] _xfs_buf_find+0x145/0x290
 [<ffffffff812f4380>] xfs_buf_get+0x60/0x1c0
 [<ffffffff812ea8f0>] xfs_trans_get_buf+0xe0/0x180
 [<ffffffff812ccdab>] xfs_ialloc_inode_init+0xcb/0x1c0
 [<ffffffff812cdaf9>] xfs_ialloc_ag_alloc+0x179/0x4a0
 [<ffffffff812cdeff>] xfs_dialloc+0xdf/0x870
 [<ffffffff8105ec88>] ? pvclock_clocksource_read+0x58/0xd0
 [<ffffffff812d2105>] xfs_ialloc+0x65/0x6b0
 [<ffffffff812eb032>] xfs_dir_ialloc+0x82/0x2d0
 [<ffffffff8128ce83>] ? ftrace_raw_event_xfs_lock_class+0xd3/0xe0
 [<ffffffff812ecad7>] ? xfs_create+0x1a7/0x690
 [<ffffffff812ecd37>] xfs_create+0x407/0x690
 [<ffffffff812f9f97>] xfs_vn_mknod+0xa7/0x1c0
 [<ffffffff812fa0e0>] xfs_vn_create+0x10/0x20
 [<ffffffff8114ca6c>] vfs_create+0xac/0xd0
 [<ffffffff8114d6ec>] do_last+0x51c/0x620
 [<ffffffff8114f6d8>] do_filp_open+0x228/0x640
 [<ffffffff812c1f18>] ? xfs_dir2_block_getdents+0x218/0x220
 [<ffffffff8115a62a>] ? alloc_fd+0x10a/0x150
 [<ffffffff8113f919>] do_sys_open+0x69/0x140
 [<ffffffff8113fa30>] sys_open+0x20/0x30
 [<ffffffff81035032>] system_call_fastpath+0x16/0x1b

It is trying to allocate a new inode chunk on disk, which happens to
be the one we just removed and staled. Now to find out why the pin
count on the buffer is wrong.

.....

Now it makes no sense - the buf item pin trace is there directly
before the above unlock stale trace. The buf item pinning
increments both the buffer pin count and the buf item refcount,
neither of which is reflected in the unlock stale trace. From the
trace analysis I did:


process
83		84		85		86		87
.....
								get
								trans_read
								stale
								format
								pin
								unlock

								committed
								unpin stale
								free

		trans_get
		init
		format
		pin
		item unlock

		get
		trans_read
		unlock

		get
		trans_read
		item unlock

**** pincount goes from 1 to 0 here without any unpin traces ****

		get
		trans_read
		unlock

		get
		trans_read
		item unlock

		(repeat 1x)

		get
		trans_read
		unlock

		get
		trans_read
		stale
		format
		pin

**** no committed/unlock/free trace for this transaction ****
**** hold count goes from 2 to 1 without any rele traces ****
**** is a new buffer allocated here without the staled buffer being committed? ****

		trans_get
		init
		format
		pin
		item unlock

		get
		trans_read
		unlock

		get
		trans_read
		item unlock

		(repeat 9x)

		get
		trans_read
		unlock

		get
		trans_read
		stale
		format
		pin

**** pin count does not increment! ****

		buf lock from xfs_buf_find => hangs

**** buffer that was found is locked, not pinned ****

Something very strange is happening, and to make matters worse I
cannot reproduce it with a debug kernel (ran for 3 hours without
failing). Hence it smells like a race condition somewhere.

I've reproduced it without delayed logging, so it is not directly
related to that functionality.

I've seen this warning:

Filesystem "ram0": inode 0x704680 background reclaim flush failed with 117

Which indicates we failed to mark an inode stale when freeing an
inode cluster, but I think I've fixed that and the problem still
shows up. It's possible the last version didn't fix it, but....

Now I've got the ag iterator rotor patch in place as well and
possibly a different version of the cluster free fix to what I
previously tested and it's now been running for almost half an hour.
I can't say yet whether I've fixed the bug or just changed the
timing enough to avoid it. I'll leave this test running overnight
and redo individual patch testing tomorrow.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: VFS scalability git tree
  2010-07-28 12:57         ` Dave Chinner
@ 2010-07-29 14:03           ` Nick Piggin
  -1 siblings, 0 replies; 76+ messages in thread
From: Nick Piggin @ 2010-07-29 14:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Nick Piggin, xfs, linux-fsdevel

On Wed, Jul 28, 2010 at 10:57:17PM +1000, Dave Chinner wrote:
> On Tue, Jul 27, 2010 at 06:06:32PM +1000, Nick Piggin wrote:
> > On Tue, Jul 27, 2010 at 05:05:39PM +1000, Nick Piggin wrote:
> > > On Fri, Jul 23, 2010 at 11:55:14PM +1000, Dave Chinner wrote:
> > > > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > > > > I'm pleased to announce I have a git tree up of my vfs scalability work.
> > > > > 
> > > > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > > > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> > > > > 
> > > > > Branch vfs-scale-working
> > > > 
> > > > With a production build (i.e. no lockdep, no xfs debug), I'll
> > > > run the same fs_mark parallel create/unlink workload to show
> > > > scalability as I ran here:
> > > > 
> > > > http://oss.sgi.com/archives/xfs/2010-05/msg00329.html
> > > 
> > > I've made a similar setup, 2s8c machine, but using 2GB ramdisk instead
> > > of a real disk (I don't have easy access to a good disk setup ATM, but
> > > I guess we're more interested in code above the block layer anyway).
> > > 
> > > Made an XFS on /dev/ram0 with 16 ags, 64MB log, otherwise same config as
> > > yours.
> > > 
> > > I found that performance is a little unstable, so I sync and echo 3 >
> > > drop_caches between each run. When it starts reclaiming memory, things
> > > get a bit more erratic (and XFS seemed to be almost livelocking for tens
> > > of seconds in inode reclaim).
> > 
> > So about this XFS livelock type thingy. It looks like this, and happens
> > periodically while running the above fs_mark benchmark requiring reclaim
> > of inodes:
> ....
> 
> > Nothing much happening except 100% system time for seconds at a time
> > (length of time varies). This is on a ramdisk, so it isn't waiting
> > for IO.
> > 
> > During this time, lots of things are contending on the lock:
> > 
> >     60.37%         fs_mark  [kernel.kallsyms]   [k] __write_lock_failed
> >      4.30%         kswapd0  [kernel.kallsyms]   [k] __write_lock_failed
> >      3.70%         fs_mark  [kernel.kallsyms]   [k] try_wait_for_completion
> >      3.59%         fs_mark  [kernel.kallsyms]   [k] _raw_write_lock
> >      3.46%         kswapd1  [kernel.kallsyms]   [k] __write_lock_failed
> >                    |
> >                    --- __write_lock_failed
> >                       |
> >                       |--99.92%-- xfs_inode_ag_walk
> >                       |          xfs_inode_ag_iterator
> >                       |          xfs_reclaim_inode_shrink
> >                       |          shrink_slab
> >                       |          shrink_zone
> >                       |          balance_pgdat
> >                       |          kswapd
> >                       |          kthread
> >                       |          kernel_thread_helper
> >                        --0.08%-- [...]
> > 
> >      3.02%         fs_mark  [kernel.kallsyms]   [k] _raw_spin_lock
> >      1.82%         fs_mark  [kernel.kallsyms]   [k] _xfs_buf_find
> >      1.16%         fs_mark  [kernel.kallsyms]   [k] memcpy
> >      0.86%         fs_mark  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
> >      0.75%         fs_mark  [kernel.kallsyms]   [k] xfs_log_commit_cil
> >                    |
> >                    --- xfs_log_commit_cil
> >                        _xfs_trans_commit
> >                       |
> >                       |--60.00%-- xfs_remove
> >                       |          xfs_vn_unlink
> >                       |          vfs_unlink
> >                       |          do_unlinkat
> >                       |          sys_unlink
> > 
> > I'm not sure if there was a long-running read locker in there causing
> > all the write lockers to fail, or if they are just running into one
> > another.
> 
> The longest hold is in the inode cluster writeback
> (xfs_iflush_cluster), but if there is no IO then I don't see how
> that would be a problem.

No, I wasn't suggesting there was, just that there could have
been one that I didn't notice in the profiles (i.e. because it
had taken the read lock rather than spinning on it).

 
> I suspect that it might be caused by having several CPUs
> all trying to run the shrinker at the same time and them all
> starting at the same AG and therefore lockstepping and getting
> nothing done because they are all scanning the same inodes.

I think that is the most likely answer, yes.


> Maybe a start AG rotor for xfs_inode_ag_iterator() is needed to
> avoid this lockstepping.  I've attached a patch below to do this
> - can you give it a try?

Cool yes I will. I could try it in combination with the batching
patch too. Thanks.

 
> > But anyway, I hacked the following patch which seemed to
> > improve that behaviour. I haven't run any throughput numbers on it yet,
> > but I could if you're interested (and it's not completely broken!)
> 
> Batching is certainly something that I have been considering, but
> apart from the excessive scanning bug, the per-ag inode tree lookups
> have not featured prominently in any profiling I've done, so it
> hasn't been a high priority.
> 
> Your patch looks like it will work fine, but I think it can be made a
> lot cleaner. I'll have a closer look at this once I get to the bottom of
> the dbench hang you are seeing....

Well I'll see if I can measure any efficiency or lock contention
improvements with it and report back. Might have to wait till the
weekend.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: XFS hang in xlog_grant_log_space
  2010-07-28 13:17             ` Dave Chinner
@ 2010-07-29 14:05               ` Nick Piggin
  2010-07-29 22:56                 ` Dave Chinner
  0 siblings, 1 reply; 76+ messages in thread
From: Nick Piggin @ 2010-07-29 14:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Nick Piggin, xfs

On Wed, Jul 28, 2010 at 11:17:44PM +1000, Dave Chinner wrote:
> Something very strange is happening, and to make matters worse I
> cannot reproduce it with a debug kernel (ran for 3 hours without
> failing). Hence it smells like a race condition somewhere.
> 
> I've reproduced it without delayed logging, so it is not directly
> related to that functionality.
> 
> I've seen this warning:
> 
> Filesystem "ram0": inode 0x704680 background reclaim flush failed with 117
> 
> Which indicates we failed to mark an inode stale when freeing an
> inode cluster, but I think I've fixed that and the problem still
> shows up. It's possible the last version didn't fix it, but....

I've seen that one a couple of times too. Keeps coming back each
time you echo 3 > /proc/sys/vm/drop_caches :)


> Now I've got the ag iterator rotor patch in place as well and
> possibly a different version of the cluster free fix to what I
> previously tested and it's now been running for almost half an hour.
> I can't say yet whether I've fixed the bug or just changed the
> timing enough to avoid it. I'll leave this test running over night
> and redo individual patch testing tomorrow.

I reproduced it with fs_stress now too. Any patches I could test
for you just let me know.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: XFS hang in xlog_grant_log_space
  2010-07-29 14:05               ` Nick Piggin
@ 2010-07-29 22:56                 ` Dave Chinner
  2010-07-30  3:59                   ` Nick Piggin
  0 siblings, 1 reply; 76+ messages in thread
From: Dave Chinner @ 2010-07-29 22:56 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Nick Piggin, xfs

On Fri, Jul 30, 2010 at 12:05:46AM +1000, Nick Piggin wrote:
> On Wed, Jul 28, 2010 at 11:17:44PM +1000, Dave Chinner wrote:
> > Something very strange is happening, and to make matters worse I
> > cannot reproduce it with a debug kernel (ran for 3 hours without
> > failing). Hence it smells like a race condition somewhere.
> > 
> > I've reproduced it without delayed logging, so it is not directly
> > related to that functionality.
> > 
> > I've seen this warning:
> > 
> > Filesystem "ram0": inode 0x704680 background reclaim flush failed with 117
> > 
> > Which indicates we failed to mark an inode stale when freeing an
> > inode cluster, but I think I've fixed that and the problem still
> > shows up. It's possible the last version didn't fix it, but....
> 
> I've seen that one a couple of times too. Keeps coming back each
> time you echo 3 > /proc/sys/vm/drop_caches :)

Yup - it's an unflushable inode that is pinning the tail of the log,
hence causing the log space hangs.

> > Now I've got the ag iterator rotor patch in place as well and
> > possibly a different version of the cluster free fix to what I
> > previously tested and it's now been running for almost half an hour.
> > I can't say yet whether I've fixed the bug or just changed the
> > timing enough to avoid it. I'll leave this test running over night
> > and redo individual patch testing tomorrow.
> 
> I reproduced it with fs_stress now too. Any patches I could test
> for you just let me know.

You should see them in a few minutes ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: XFS hang in xlog_grant_log_space
  2010-07-29 22:56                 ` Dave Chinner
@ 2010-07-30  3:59                   ` Nick Piggin
  0 siblings, 0 replies; 76+ messages in thread
From: Nick Piggin @ 2010-07-30  3:59 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Nick Piggin, Nick Piggin, xfs

On Fri, Jul 30, 2010 at 08:56:58AM +1000, Dave Chinner wrote:
> On Fri, Jul 30, 2010 at 12:05:46AM +1000, Nick Piggin wrote:
> > On Wed, Jul 28, 2010 at 11:17:44PM +1000, Dave Chinner wrote:
> > > Something very strange is happening, and to make matters worse I
> > > cannot reproduce it with a debug kernel (ran for 3 hours without
> > > failing). Hence it smells like a race condition somewhere.
> > > 
> > > I've reproduced it without delayed logging, so it is not directly
> > > related to that functionality.
> > > 
> > > I've seen this warning:
> > > 
> > > Filesystem "ram0": inode 0x704680 background reclaim flush failed with 117
> > > 
> > > Which indicates we failed to mark an inode stale when freeing an
> > > inode cluster, but I think I've fixed that and the problem still
> > > shows up. It's possible the last version didn't fix it, but....
> > 
> > I've seen that one a couple of times too. Keeps coming back each
> > time you echo 3 > /proc/sys/vm/drop_caches :)
> 
> Yup - it's an unflushable inode that is pinning the tail of the log,
> hence causing the log space hangs.
> 
> > > Now I've got the ag iterator rotor patch in place as well and
> > > possibly a different version of the cluster free fix to what I
> > > previously tested and it's now been running for almost half an hour.
> > > I can't say yet whether I've fixed the bug or just changed the
> > > timing enough to avoid it. I'll leave this test running over night
> > > and redo individual patch testing tomorrow.
> > 
> > I reproduced it with fs_stress now too. Any patches I could test
> > for you just let me know.
> 
> You should see them in a few minutes ;)

It's certainly not locking up like it used to... Thanks!


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: VFS scalability git tree
  2010-07-22 19:01 ` Nick Piggin
@ 2010-07-30  9:12   ` Nick Piggin
  -1 siblings, 0 replies; 76+ messages in thread
From: Nick Piggin @ 2010-07-30  9:12 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar, John Stultz

On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> I'm pleased to announce I have a git tree up of my vfs scalability work.
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> 
> Branch vfs-scale-working
> 
> The really interesting new item is the store-free path walk, (43fe2b)
> which I've re-introduced. It has had a complete redesign, it has much
> better performance and scalability in more cases, and is actually sane
> code now.

Things are progressing well here with fixes and improvements to the
branch.

One thing that has been brought to my attention is that store-free path
walking (rcu-walk) drops into the normal refcounted walking on any
filesystem that has posix ACLs enabled.

Having misread the code (IS_POSIXACL is based on a superblock flag), I
had thought we only dropped out of rcu-walk when encountering an inode
that actually has ACLs.
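
A small standalone sketch of why that matters: the check tests a mount
flag on the inode's superblock, so every inode on an ACL-enabled
filesystem trips it, whether or not the inode itself carries an ACL.
This is a simplified reimplementation for illustration, not the kernel
macro verbatim.

#include <stdbool.h>
#include <stdio.h>

#define MS_POSIXACL	(1 << 16)	/* superblock mount flag */

struct super_block { unsigned long s_flags; };
struct inode { struct super_block *i_sb; };

/* simplified stand-in for IS_POSIXACL(): per-superblock, not per-inode */
static bool is_posixacl(const struct inode *inode)
{
	return inode->i_sb->s_flags & MS_POSIXACL;
}

int main(void)
{
	struct super_block sb = { .s_flags = MS_POSIXACL };
	struct inode plain_file = { .i_sb = &sb };

	/* prints 1 even though this particular inode has no ACL */
	printf("%d\n", is_posixacl(&plain_file));
	return 0;
}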

This is quite an important point for any performance testing work.
ACLs can actually be rcu checked quite easily in most cases, but it
takes a bit of work on APIs.

Filesystems defining their own ->permission and ->d_revalidate will
also not use rcu-walk. These could likewise be made to support rcu-walk
more widely, but it will require knowledge of rcu-walk to be pushed
into filesystems.

It's not a big deal, basically: no blocking, no stores, no referencing
non-rcu-protected data, and confirm with seqlock. That is usually the
case in fastpaths. If it cannot be satisfied, then just return -ECHILD
and you'll get called in the usual ref-walk mode next time.
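
A rough sketch of that contract, using stub types in place of the
kernel ones: the separate _rcu hook and its (dentry, nameidata)
signature follow the fuse hunk quoted later in the thread, while the
filesystem name and the condition being tested are purely illustrative.

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

struct inode { bool needs_blocking_check; };
struct dentry { struct inode *d_inode; };
struct nameidata { int dummy; };

static int examplefs_dentry_revalidate_rcu(struct dentry *dentry,
					   struct nameidata *nd)
{
	struct inode *inode = dentry->d_inode;

	(void)nd;
	/* anything that would block, take a lock or store to shared data
	 * cannot be done here -- bail out and let the VFS retry the
	 * lookup in the usual ref-walk mode */
	if (!inode || inode->needs_blocking_check)
		return -ECHILD;

	return 1;	/* dentry can be trusted without blocking */
}

int main(void)
{
	struct inode ino = { .needs_blocking_check = true };
	struct dentry dentry = { .d_inode = &ino };
	struct nameidata nd = { 0 };

	/* prints -10 (-ECHILD) on Linux: caller falls back to ref-walk */
	printf("%d\n", examplefs_dentry_revalidate_rcu(&dentry, &nd));
	return 0;
}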

But for now, keep this in mind if you plan to do any serious performance
testing work, *do not mount filesystems with ACL support*.

Thanks,
Nick


^ permalink raw reply	[flat|nested] 76+ messages in thread


* Re: VFS scalability git tree
  2010-07-30  9:12   ` Nick Piggin
  (?)
@ 2010-08-03  0:27     ` john stultz
  -1 siblings, 0 replies; 76+ messages in thread
From: john stultz @ 2010-08-03  0:27 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Nick Piggin, linux-fsdevel, linux-kernel, linux-mm, Frank Mayhar

On Fri, 2010-07-30 at 19:12 +1000, Nick Piggin wrote:
> On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > I'm pleased to announce I have a git tree up of my vfs scalability work.
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> > 
> > Branch vfs-scale-working
> > 
> > The really interesting new item is the store-free path walk, (43fe2b)
> > which I've re-introduced. It has had a complete redesign, it has much
> > better performance and scalability in more cases, and is actually sane
> > code now.
> 
> Things are progressing well here with fixes and improvements to the
> branch.

Hey Nick,
	Just another minor compile issue with today's vfs-scale-working branch.

fs/fuse/dir.c:231: error: ‘fuse_dentry_revalidate_rcu’ undeclared here
(not in a function)

From looking at the vfat and ecryptfs changes in
582c56f032983e9a8e4b4bd6fac58d18811f7d41 it looks like you intended to
add the following? 


diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index f0c2479..9ee4c10 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -154,7 +154,7 @@ u64 fuse_get_attr_version(struct fuse_conn *fc)
  * the lookup once more.  If the lookup results in the same inode,
  * then refresh the attributes, timeouts and mark the dentry valid.
  */
-static int fuse_dentry_revalidate(struct dentry *entry, struct nameidata *nd)
+static int fuse_dentry_revalidate_rcu(struct dentry *entry, struct nameidata *nd)
 {
 	struct inode *inode = entry->d_inode;
 



^ permalink raw reply related	[flat|nested] 76+ messages in thread


* Re: VFS scalability git tree
  2010-08-03  0:27     ` john stultz
  (?)
@ 2010-08-03  5:44       ` Nick Piggin
  -1 siblings, 0 replies; 76+ messages in thread
From: Nick Piggin @ 2010-08-03  5:44 UTC (permalink / raw)
  To: john stultz
  Cc: Nick Piggin, Nick Piggin, linux-fsdevel, linux-kernel, linux-mm,
	Frank Mayhar

On Mon, Aug 02, 2010 at 05:27:59PM -0700, John Stultz wrote:
> On Fri, 2010-07-30 at 19:12 +1000, Nick Piggin wrote:
> > On Fri, Jul 23, 2010 at 05:01:00AM +1000, Nick Piggin wrote:
> > > I'm pleased to announce I have a git tree up of my vfs scalability work.
> > > 
> > > git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> > > http://git.kernel.org/?p=linux/kernel/git/npiggin/linux-npiggin.git
> > > 
> > > Branch vfs-scale-working
> > > 
> > > The really interesting new item is the store-free path walk, (43fe2b)
> > > which I've re-introduced. It has had a complete redesign, it has much
> > > better performance and scalability in more cases, and is actually sane
> > > code now.
> > 
> > Things are progressing well here with fixes and improvements to the
> > branch.
> 
> Hey Nick,
> 	Just another minor compile issue with today's vfs-scale-working branch.
> 
> fs/fuse/dir.c:231: error: ‘fuse_dentry_revalidate_rcu’ undeclared here
> (not in a function)
> 
> From looking at the vfat and ecryptfs changes in
> 582c56f032983e9a8e4b4bd6fac58d18811f7d41 it looks like you intended to
> add the following? 

Thanks John, you're right.

I thought I actually linked and ran this, but I must not have had fuse
compiled in.


^ permalink raw reply	[flat|nested] 76+ messages in thread


* Re: VFS scalability git tree
  2010-08-03  5:44       ` Nick Piggin
  (?)
  (?)
@ 2010-09-14 22:26       ` Christoph Hellwig
  2010-09-14 23:02         ` Frank Mayhar
  -1 siblings, 1 reply; 76+ messages in thread
From: Christoph Hellwig @ 2010-09-14 22:26 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, linux-kernel

Nick,

what's the plan for going ahead with the VFS scalability work?  We're
pretty late in the 2.6.36 cycle now and it would be good to get the next
batch prepared and reviewed so that it can get some testing in -next.

As mentioned before my preference would be the inode lock splitup and
related patches - they are relatively simple and we're already seeing
workloads where inode_lock really hurts in the writeback code.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: VFS scalability git tree
  2010-09-14 22:26       ` Christoph Hellwig
@ 2010-09-14 23:02         ` Frank Mayhar
  0 siblings, 0 replies; 76+ messages in thread
From: Frank Mayhar @ 2010-09-14 23:02 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Nick Piggin, linux-fsdevel, linux-kernel

On Tue, 2010-09-14 at 18:26 -0400, Christoph Hellwig wrote:
> Nick,
> 
> what's the plan for going ahead with the VFS scalability work?  We're
> pretty late in the 2.6.36 cycle now and it would be good to get the next
> batch prepared and reivew so that it can get some testing in -next.
> 
> As mentioned before my preference would be the inode lock splitup and
> related patches - they are relatively simple and we're already seeing
> workloads where inode_lock really hurts in the writeback code.

For the record, while I've been quiet here (really busy) I have run a
bunch of pretty serious tests against the original set of patches (note:
_not_ the latest bits in Nick's tree, I have those queued up but haven't
gotten to them yet).  So far I haven't seen any instability at all.

(I did see one case in which a test that does a _lot_ of network traffic
with tons of sockets saw a 20+% performance hit on a system with a
relatively moderate number of cores but I haven't had the time to
characterize it better and want to test against the newer bits in any
event.  Sorry to be so vague, I can't really be more specific at this
point.  Nailing this down is _also_ on my list.)

Performance notwithstanding, I'm impressed with the stability of those
original patches.  I've run VM stress tests against it, FS stress tests,
lots of benchmarks and a bunch of other stuff and it's solid, no crashes
nor any anomalous behavior.

That being the case, I would vote enthusiastically for bringing in the
inode_lock splitup as soon as is feasible.
-- 
Frank Mayhar <fmayhar@google.com>
Google, Inc.


^ permalink raw reply	[flat|nested] 76+ messages in thread

end of thread, other threads:[~2010-09-14 23:02 UTC | newest]

Thread overview: 76+ messages
2010-07-22 19:01 VFS scalability git tree Nick Piggin
2010-07-22 19:01 ` Nick Piggin
2010-07-23 11:13 ` Dave Chinner
2010-07-23 11:13   ` Dave Chinner
2010-07-23 14:04   ` [PATCH 0/2] vfs scalability tree fixes Dave Chinner
2010-07-23 14:04     ` Dave Chinner
2010-07-23 16:09     ` Nick Piggin
2010-07-23 16:09       ` Nick Piggin
2010-07-23 14:04   ` [PATCH 1/2] xfs: fix shrinker build Dave Chinner
2010-07-23 14:04     ` Dave Chinner
2010-07-23 14:04   ` [PATCH 2/2] xfs: shrinker should use a per-filesystem scan count Dave Chinner
2010-07-23 14:04     ` Dave Chinner
2010-07-23 15:51   ` VFS scalability git tree Nick Piggin
2010-07-23 15:51     ` Nick Piggin
2010-07-24  0:21     ` Dave Chinner
2010-07-24  0:21       ` Dave Chinner
2010-07-23 11:17 ` Christoph Hellwig
2010-07-23 11:17   ` Christoph Hellwig
2010-07-23 15:42   ` Nick Piggin
2010-07-23 15:42     ` Nick Piggin
2010-07-23 13:55 ` Dave Chinner
2010-07-23 13:55   ` Dave Chinner
2010-07-23 16:16   ` Nick Piggin
2010-07-23 16:16     ` Nick Piggin
2010-07-27  7:05   ` Nick Piggin
2010-07-27  7:05     ` Nick Piggin
2010-07-27  8:06     ` Nick Piggin
2010-07-27  8:06       ` Nick Piggin
2010-07-27 11:36       ` XFS hang in xlog_grant_log_space (was Re: VFS scalability git tree) Nick Piggin
2010-07-27 13:30         ` Dave Chinner
2010-07-27 14:58           ` XFS hang in xlog_grant_log_space Dave Chinner
2010-07-28 13:17             ` Dave Chinner
2010-07-29 14:05               ` Nick Piggin
2010-07-29 22:56                 ` Dave Chinner
2010-07-30  3:59                   ` Nick Piggin
2010-07-28 12:57       ` VFS scalability git tree Dave Chinner
2010-07-28 12:57         ` Dave Chinner
2010-07-29 14:03         ` Nick Piggin
2010-07-29 14:03           ` Nick Piggin
2010-07-27 11:09     ` Nick Piggin
2010-07-27 11:09       ` Nick Piggin
2010-07-27 13:18     ` Dave Chinner
2010-07-27 13:18       ` Dave Chinner
2010-07-27 15:09       ` Nick Piggin
2010-07-27 15:09         ` Nick Piggin
2010-07-28  4:59         ` Dave Chinner
2010-07-28  4:59           ` Dave Chinner
2010-07-28  4:59           ` Dave Chinner
2010-07-23 15:35 ` Nick Piggin
2010-07-23 15:35   ` Nick Piggin
2010-07-24  8:43 ` KOSAKI Motohiro
2010-07-24  8:43   ` KOSAKI Motohiro
2010-07-24  8:44   ` [PATCH 1/2] vmscan: shrink_all_slab() use reclaim_state instead the return value of shrink_slab() KOSAKI Motohiro
2010-07-24  8:44     ` KOSAKI Motohiro
2010-07-24  8:44     ` KOSAKI Motohiro
2010-07-24 12:05     ` KOSAKI Motohiro
2010-07-24 12:05       ` KOSAKI Motohiro
2010-07-24  8:46   ` [PATCH 2/2] vmscan: change shrink_slab() return tyep with void KOSAKI Motohiro
2010-07-24  8:46     ` KOSAKI Motohiro
2010-07-24  8:46     ` KOSAKI Motohiro
2010-07-24 10:54   ` VFS scalability git tree KOSAKI Motohiro
2010-07-24 10:54     ` KOSAKI Motohiro
2010-07-26  5:41 ` Nick Piggin
2010-07-26  5:41   ` Nick Piggin
2010-07-28 10:24   ` Nick Piggin
2010-07-28 10:24     ` Nick Piggin
2010-07-30  9:12 ` Nick Piggin
2010-07-30  9:12   ` Nick Piggin
2010-08-03  0:27   ` john stultz
2010-08-03  0:27     ` john stultz
2010-08-03  0:27     ` john stultz
2010-08-03  5:44     ` Nick Piggin
2010-08-03  5:44       ` Nick Piggin
2010-08-03  5:44       ` Nick Piggin
2010-09-14 22:26       ` Christoph Hellwig
2010-09-14 23:02         ` Frank Mayhar
