[patch 00/52] vfs scalability patches updated
From: npiggin @ 2010-06-24  3:02 UTC
  To: linux-fsdevel, linux-kernel; +Cc: John Stultz, Frank Mayhar

http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/fs-scale/

Update to vfs scalability patches:

- Lots of fixes, particularly RCU inode stuff
- Lots of cleanups and aesthetic improvements to the code, ifdef reduction, etc.
- Use bit locks for inode and dentry hashes (see the sketch after this list)
- Small improvements to single-threaded performance
- Split inode LRU and writeback list locking
- Per-bdi inode writeback list locking
- Per-zone mm shrinker
- Per-zone dentry and inode LRU lists
- Several fixes brought in from -rt tree testing
- No global locks remain in any fastpaths (with the arguable exception of rename)
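
A quick note on the bit locks mentioned above: the idea is to reuse the low
bit of each hash bucket's head pointer as a spinlock, so every bucket gets
its own lock at zero space cost (walkers just mask the bit off when
dereferencing). A minimal sketch of the idea, modelled on the hlist_bl API
from patch 01 (details may differ from the actual patches):

#include <linux/list_bl.h>
#include <linux/bit_spinlock.h>

/* Lock a hash bucket by spinning on bit 0 of its head pointer.
 * List entries are word-aligned, so bit 0 of the first-node
 * pointer is free to act as a per-bucket lock. */
static inline void bucket_lock(struct hlist_bl_head *b)
{
        bit_spin_lock(0, (unsigned long *)b);
}

static inline void bucket_unlock(struct hlist_bl_head *b)
{
        __bit_spin_unlock(0, (unsigned long *)b);
}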

I have not included the store-free path walk patches in this posting. They
require a bit more work, and they will need to be reworked after the
->d_revalidate/->follow_mount changes that Al wants to do. I prefer to
concentrate on these locking patches first.

Autofs4 is sadly missing: it's a bit tricky, and its patches have to be reworked.

Performance:
Last time I was testing on a 32-node Altix, which is arguably not a sweet
spot for Linux performance targets (i.e. improvements there may not justify
the complexity). So recently I've been testing with a tightly interconnected
4-socket Nehalem (4s/32c/64t). Linux needs to perform well on systems of
this size.

*** Single-thread microbenchmark (simple syscall loops, lower is better):
Test                    Difference at 95.0% confidence (50 runs)
open/close              -6.07% +/- 1.075%
creat/unlink           +27.83% +/- 0.522%
Open/close is a little faster, which should be due to one fewer atomic op in
the dput common case. Creat/unlink is significantly slower, due to RCU
freeing of inodes. We made a performance regression tradeoff of the same
magnitude when going to RCU-freed dentries and files as well. Inode RCU is
required to reduce inode hash lookup locking and improve lock ordering, and
also for store-free path walking.

*** Let's take a look at this creat/unlink regression more closely. If we
read rdtsc around the creat/unlink and run it just once (so as to avoid
most of the RCU-induced overhead):
vanilla: 5328 cycles
    vfs: 5960 cycles (+11.8%)
Not so bad when RCU is not being stressed.
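
(The measurement is plain rdtsc bracketing, along these lines; /tmp/x is
just a placeholder path on the filesystem under test:)

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

/* Read the x86 time-stamp counter. */
static inline uint64_t rdtsc(void)
{
        uint32_t lo, hi;

        asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
        return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
        uint64_t t0, t1;
        int fd;

        t0 = rdtsc();
        fd = creat("/tmp/x", 0600);
        close(fd);
        unlink("/tmp/x");
        t1 = rdtsc();
        printf("%llu cycles\n", (unsigned long long)(t1 - t0));
        return 0;
}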

*** 64 parallel git diff on 64 kernel trees fully cached (avg of 5 runs):
                vanilla         vfs
real            0m4.911s        0m0.183s
user            0m1.920s        0m1.610s
sys             4m58.670s       0m5.770s
After the vfs patches, a 26x increase in throughput, though parallelism is
limited by the test's spawn and exit phases; the sys time shows closer to a
50x improvement. Vanilla is bottlenecked on dcache_lock.

*** Google sockets (http://marc.info/?l=linux-kernel&m=123215942507568&w=2):
                vanilla         vfs
real             1m 7.774s      0m 3.245s
user             0m19.230s      0m36.750s
sys             71m41.310s      2m47.320s
do_exit path       24.755s         1.219s
After the vfs patches, about a 20x increase in throughput in both the total
duration and the do_exit (teardown) time.

*** file-ops test (people.redhat.com/mingo/file-ops-test/file-ops-test.c)
Parallel open/close or creat/unlink in the same or different cwds within the
same ramfs mount (a simplified sketch of the test loop follows the tables).
Relative throughput percentages are given at each parallelism point (higher
is better):

open/close           vanilla          vfs
same cwd
1                      100.0        119.1
2                       74.2        187.4
4                       38.4         40.9
8                       18.7         27.0
16                       9.0         24.9
32                       5.9         24.2
64                       6.0         27.7
different cwd
1                      100.0        119.1
2                      133.0        238.6
4                       21.2        488.6
8                       19.7        932.6
16                      18.8       1784.1
32                      18.7       3469.5
64                      19.0       2858.0

creat/unlink         vanilla          vfs
same cwd
1                      100.0         75.0
2                       44.1         41.8
4                       28.7         24.6
8                       16.5         14.2
16                       8.7          8.9
32                       5.5          7.8
64                       5.9          7.4
different cwd
1                      100.0         75.0
2                       89.8        137.2
4                       20.1        267.4
8                       17.2        513.0
16                      16.2        901.9
32                      15.8       1724.0
64                      17.3       1161.8

Note that at 64 we start using sibling (SMT) threads on the CPUs, which makes
results jump around a bit. The drop at 64 in the different-cwd cases seems to
be hitting an RCU or slab allocator issue (or maybe it is just the SMT).
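
For reference, the core of the file-ops test is essentially the loop below
(a simplified sketch, not the actual file-ops-test.c; in the different-cwd
case each child would chdir() into its own pre-created directory, and the
creat/unlink variant replaces the open/close pair accordingly):

#include <fcntl.h>
#include <unistd.h>
#include <sys/wait.h>

#define NCHILDREN 64
#define ITERS     1000000L

/* Each forked child hammers open/close on a pre-created file in its
 * cwd; throughput is total iterations per second over all children. */
static void worker(const char *dir)
{
        long i;

        if (chdir(dir))
                _exit(1);
        for (i = 0; i < ITERS; i++) {
                int fd = open("testfile", O_RDONLY);
                close(fd);
        }
}

int main(void)
{
        int i;

        for (i = 0; i < NCHILDREN; i++)
                if (fork() == 0) {
                        worker(".");    /* or a per-child directory */
                        _exit(0);
                }
        while (wait(NULL) > 0)
                ;
        return 0;
}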

The scalability regression I was seeing in the same-cwd tests is no longer
there (it has even improved). It may still be present in some workloads doing
common-element path lookups. That could be solved by making d_count atomic
again, at the cost of more atomic ops in some cases, but scalability would
still be limited; so I prefer store-free path walking, which is much more
scalable.

In the different-cwd open/close case, the cost of bouncing cachelines over
the interconnect puts an absolute upper limit of about 162K open/closes per
second over the entire machine on the vanilla kernel. After the vfs patches
it is around 30M. On larger and less well connected machines, that upper
limit will only get lower, while the vfs-patched case should continue to go
up (assuming the mm subsystem can keep up).
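
(Back of the envelope: 162K serialized operations per second means each
open/close pair effectively holds the bouncing cacheline for about
1/162,000 s, i.e. roughly 6 microseconds of interconnect round-trips.)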

*** Reclaim
I have not done much reclaim testing yet. It should be more scalable and lower
latency, due to the significant reduction in LRU locks interfering with other
critical sections in inode/dentry code, and because we have per-zone locks.
Per-zone LRUs mean that reclaim is targeted at the correct zone, and that
kswapd will operate on lists of node-local memory objects.
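
To make the per-zone idea concrete, a hypothetical per-zone shrinker might
look like the sketch below. All names and the signature are illustrative,
not the actual patches' API; the point is that lock, list and count are all
per-zone, so reclaim on one node never contends with another:

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/slab.h>
#include <linux/mmzone.h>

struct my_obj {
        struct list_head lru;
        /* payload */
};

struct my_zone_lru {
        spinlock_t lock;                /* one lock per zone */
        struct list_head list;
        long nr_items;
};

/* One LRU per (node, zone) pair; initialised at boot. */
static struct my_zone_lru my_lrus[MAX_NUMNODES * MAX_NR_ZONES];

static unsigned long my_shrink_zone(struct zone *zone,
                                    unsigned long nr_to_scan)
{
        struct my_zone_lru *lru =
                &my_lrus[zone_to_nid(zone) * MAX_NR_ZONES + zone_idx(zone)];
        unsigned long freed = 0;

        spin_lock(&lru->lock);
        while (nr_to_scan-- && !list_empty(&lru->list)) {
                struct my_obj *obj = list_first_entry(&lru->list,
                                                      struct my_obj, lru);
                list_del(&obj->lru);
                lru->nr_items--;
                kfree(obj);
                freed++;
        }
        spin_unlock(&lru->lock);
        return freed;
}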




Thread overview: 152+ messages
2010-06-24  3:02 [patch 00/52] vfs scalability patches updated npiggin
2010-06-24  3:02 ` [patch 01/52] kernel: add bl_list npiggin
2010-06-24  6:04   ` Eric Dumazet
2010-06-24 14:42     ` Nick Piggin
2010-06-24 16:01       ` Eric Dumazet
2010-06-28 21:37   ` Paul E. McKenney
2010-06-29  6:30     ` Nick Piggin
2010-06-24  3:02 ` [patch 02/52] fs: fix superblock iteration race npiggin
2010-06-29 13:02   ` Christoph Hellwig
2010-06-29 14:56     ` Nick Piggin
2010-06-29 17:35       ` Linus Torvalds
2010-06-29 17:41         ` Nick Piggin
2010-06-29 17:52           ` Linus Torvalds
2010-06-29 17:58             ` Linus Torvalds
2010-06-29 20:04               ` Chris Clayton
2010-06-29 20:14                 ` Nick Piggin
2010-06-29 20:38                   ` Chris Clayton
2010-06-30  7:13                     ` Chris Clayton
2010-06-30 12:51               ` Al Viro
2010-06-24  3:02 ` [patch 03/52] fs: fs_struct rwlock to spinlock npiggin
2010-06-24  3:02 ` [patch 04/52] fs: cleanup files_lock npiggin
2010-06-24  3:02 ` [patch 05/52] lglock: introduce special lglock and brlock spin locks npiggin
2010-06-24 18:15   ` Thomas Gleixner
2010-06-25  6:22     ` Nick Piggin
2010-06-25  9:50       ` Thomas Gleixner
2010-06-25 10:11         ` Nick Piggin
2010-06-24  3:02 ` [patch 06/52] fs: scale files_lock npiggin
2010-06-24  7:52   ` Peter Zijlstra
2010-06-24 15:00     ` Nick Piggin
2010-06-24  3:02 ` [patch 07/52] fs: brlock vfsmount_lock npiggin
2010-06-24  3:02 ` [patch 08/52] fs: scale mntget/mntput npiggin
2010-06-24  3:02 ` [patch 09/52] fs: dcache scale hash npiggin
2010-06-24  3:02 ` [patch 10/52] fs: dcache scale lru npiggin
2010-06-24  3:02 ` [patch 11/52] fs: dcache scale nr_dentry npiggin
2010-06-24  3:02 ` [patch 12/52] fs: dcache scale dentry refcount npiggin
2010-06-24  3:02 ` [patch 13/52] fs: dcache scale d_unhashed npiggin
2010-06-24  3:02 ` [patch 14/52] fs: dcache scale subdirs npiggin
2010-06-24  7:56   ` Peter Zijlstra
2010-06-24  9:50   ` Andi Kleen
2010-06-24 15:53     ` Nick Piggin
2010-06-24  3:02 ` [patch 15/52] fs: dcache scale inode alias list npiggin
2010-06-24  3:02 ` [patch 16/52] fs: dcache RCU for multi-step operations npiggin
2010-06-24  7:58   ` Peter Zijlstra
2010-06-24 15:03     ` Nick Piggin
2010-06-24 17:22       ` john stultz
2010-06-24 17:26   ` john stultz
2010-06-25  6:45     ` Nick Piggin
2010-06-24  3:02 ` [patch 17/52] fs: dcache remove dcache_lock npiggin
2010-06-24  3:02 ` [patch 18/52] fs: dcache reduce dput locking npiggin
2010-06-24  3:02 ` [patch 19/52] fs: dcache per-bucket dcache hash locking npiggin
2010-06-24  3:02 ` [patch 20/52] fs: dcache reduce dcache_inode_lock npiggin
2010-06-24  3:02 ` [patch 21/52] fs: dcache per-inode inode alias locking npiggin
2010-06-24  3:02 ` [patch 22/52] fs: dcache rationalise dget variants npiggin
2010-06-24  3:02 ` [patch 23/52] fs: dcache percpu nr_dentry npiggin
2010-06-24  3:02 ` [patch 24/52] fs: dcache reduce d_parent locking npiggin
2010-06-24  8:44   ` Peter Zijlstra
2010-06-24 15:07     ` Nick Piggin
2010-06-24 15:32       ` Paul E. McKenney
2010-06-24 16:05         ` Nick Piggin
2010-06-24 16:41           ` Paul E. McKenney
2010-06-28 21:50   ` Paul E. McKenney
2010-07-07 14:35     ` Nick Piggin
2010-06-24  3:02 ` [patch 25/52] fs: dcache DCACHE_REFERENCED improve npiggin
2010-06-24  3:02 ` [patch 26/52] fs: icache lock s_inodes list npiggin
2010-06-24  3:02 ` [patch 27/52] fs: icache lock inode hash npiggin
2010-06-24  3:02 ` [patch 28/52] fs: icache lock i_state npiggin
2010-06-24  3:02 ` [patch 29/52] fs: icache lock i_count npiggin
2010-06-30  7:27   ` Dave Chinner
2010-06-30 12:05     ` Nick Piggin
2010-07-01  2:36       ` Dave Chinner
2010-07-01  7:54         ` Nick Piggin
2010-07-01  9:36           ` Nick Piggin
2010-07-01 16:21           ` Frank Mayhar
2010-07-03  2:03       ` Andrew Morton
2010-07-03  3:41         ` Nick Piggin
2010-07-03  4:31           ` Andrew Morton
2010-07-03  5:06             ` Nick Piggin
2010-07-03  5:18               ` Nick Piggin
2010-07-05 22:41               ` Dave Chinner
2010-07-06  4:34                 ` Nick Piggin
2010-07-06 10:38                   ` Theodore Tso
2010-07-06 13:04                     ` Nick Piggin
2010-07-07 17:00                     ` Frank Mayhar
2010-06-24  3:02 ` [patch 30/52] fs: icache lock lru/writeback lists npiggin
2010-06-24  8:58   ` Peter Zijlstra
2010-06-24 15:09     ` Nick Piggin
2010-06-24 15:13       ` Peter Zijlstra
2010-06-24  3:02 ` [patch 31/52] fs: icache atomic inodes_stat npiggin
2010-06-24  3:02 ` [patch 32/52] fs: icache protect inode state npiggin
2010-06-24  3:02 ` [patch 33/52] fs: icache atomic last_ino, iunique lock npiggin
2010-06-24  3:02 ` [patch 34/52] fs: icache remove inode_lock npiggin
2010-06-24  3:02 ` [patch 35/52] fs: icache factor hash lock into functions npiggin
2010-06-24  3:02 ` [patch 36/52] fs: icache per-bucket inode hash locks npiggin
2010-06-24  3:02 ` [patch 37/52] fs: icache lazy lru npiggin
2010-06-24  9:52   ` Andi Kleen
2010-06-24 15:59     ` Nick Piggin
2010-06-30  8:38   ` Dave Chinner
2010-06-30 12:06     ` Nick Piggin
2010-07-01  2:46       ` Dave Chinner
2010-07-01  7:57         ` Nick Piggin
2010-06-24  3:02 ` [patch 38/52] fs: icache RCU free inodes npiggin
2010-06-30  8:57   ` Dave Chinner
2010-06-30 12:07     ` Nick Piggin
2010-06-24  3:02 ` [patch 39/52] fs: icache rcu walk for i_sb_list npiggin
2010-06-24  3:02 ` [patch 40/52] fs: dcache improve scalability of pseudo filesystems npiggin
2010-06-24  3:02 ` [patch 41/52] fs: icache reduce atomics npiggin
2010-06-24  3:02 ` [patch 42/52] fs: icache per-cpu last_ino allocator npiggin
2010-06-24  9:48   ` Andi Kleen
2010-06-24 15:52     ` Nick Piggin
2010-06-24 16:19       ` Andi Kleen
2010-06-24 16:38         ` Nick Piggin
2010-06-24  3:02 ` [patch 43/52] fs: icache per-cpu nr_inodes counter npiggin
2010-06-24  3:02 ` [patch 44/52] fs: icache per-CPU sb inode lists and locks npiggin
2010-06-30  9:26   ` Dave Chinner
2010-06-30 12:08     ` Nick Piggin
2010-07-01  3:12       ` Dave Chinner
2010-07-01  8:00         ` Nick Piggin
2010-06-24  3:02 ` [patch 45/52] fs: icache RCU hash lookups npiggin
2010-06-24  3:02 ` [patch 46/52] fs: icache reduce locking npiggin
2010-06-24  3:02 ` [patch 47/52] fs: keep inode with backing-dev npiggin
2010-06-24  3:03 ` [patch 48/52] fs: icache split IO and LRU lists npiggin
2010-06-24  3:03 ` [patch 49/52] fs: icache scale writeback list locking npiggin
2010-06-24  3:03 ` [patch 50/52] mm: implement per-zone shrinker npiggin
2010-06-24 10:06   ` Andi Kleen
2010-06-24 16:00     ` Nick Piggin
2010-06-24 16:27       ` Andi Kleen
2010-06-24 16:32         ` Andi Kleen
2010-06-24 16:37         ` Andi Kleen
2010-06-30  6:28   ` Dave Chinner
2010-06-30 12:03     ` Nick Piggin
2010-06-24  3:03 ` [patch 51/52] fs: per-zone dentry and inode LRU npiggin
2010-06-30 10:09   ` Dave Chinner
2010-06-30 12:13     ` Nick Piggin
2010-06-24  3:03 ` [patch 52/52] fs: icache less I_FREEING time npiggin
2010-06-30 10:13   ` Dave Chinner
2010-06-30 12:14     ` Nick Piggin
2010-07-01  3:33       ` Dave Chinner
2010-07-01  8:06         ` Nick Piggin
2010-06-25  7:12 ` [patch 00/52] vfs scalability patches updated Christoph Hellwig
2010-06-25  8:05   ` Nick Piggin
2010-06-30 11:30 ` Dave Chinner
2010-06-30 12:40   ` Nick Piggin
2010-07-01  3:56     ` Dave Chinner
2010-07-01  8:20       ` Nick Piggin
2010-07-01 17:36       ` Andi Kleen
2010-07-01 17:23     ` Nick Piggin
2010-07-01 17:28       ` Andi Kleen
2010-07-06 17:49       ` Nick Piggin
2010-07-01 17:35     ` Linus Torvalds
2010-07-01 17:52       ` Nick Piggin
2010-07-02  4:01       ` Paul E. McKenney
2010-06-30 17:08   ` Frank Mayhar
